
"CUDA error: No kernel image" still exists after reinstalling torch-points-kernels #100

@maosuli

Description

Hi,

I have to compile the "torch-points-kernels" library on my workstation and then run the code on a remote server using the same conda environment.

The "CUDA error" appeared after I submitted the job to the remote server, although the code ran fine on my workstation.

Following your solution, I uninstalled the library, cleared the cache, and reinstalled it on my workstation after setting TORCH_CUDA_ARCH_LIST.

But the same error still occurred.

I checked the two GPU cards: a Quadro RTX 6000 (Turing, SM 7.5) and a Tesla V100 (Volta, SM 7.0). Accordingly, I set 'export TORCH_CUDA_ARCH_LIST="7.0;7.5"' before reinstalling the library.
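Concretely, the reinstall sequence was roughly the following (a sketch, not my exact shell history; FORCE_CUDA is included on the assumption that it forces a CUDA build even when no GPU is visible at compile time, and the exact pip flags may differ):

```shell
# Rough sketch of the rebuild for both target architectures.
export TORCH_CUDA_ARCH_LIST="7.0;7.5"   # Volta (V100) + Turing (RTX 6000)
export FORCE_CUDA=1                     # assumption: force a CUDA build without a visible GPU

pip uninstall -y torch-points-kernels
pip cache purge                         # make sure no cached wheel is reused
pip install --no-cache-dir torch-points-kernels
```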

The error details are as follows:

```
Traceback (most recent call last):
  File "train_s_stransformer.py", line 613, in <module>
    main()
  File "train_s_stransformer.py", line 92, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "train_s_stransformer.py", line 327, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, criterion, optimizer, epoch, scaler, scheduler)
  File "train_s_stransformer.py", line 426, in train
    output = model(feat, coord, offset, batch, neighbor_idx)
  File "/home/xxx/.conda/envs/s_transformer10/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/.conda/envs/s_transformer10/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/xxx/.conda/envs/s_transformer10/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/3dSegmentation/stratified_transformer/Stratified-Transformer-main/model/stratified_transformer.py", line 453, in forward
    feats, xyz, offset, feats_down, xyz_down, offset_down = layer(feats, xyz, offset)
  File "/home/xxx/.conda/envs/s_transformer10/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/3dSegmentation/stratified_transformer/Stratified-Transformer-main/model/stratified_transformer.py", line 281, in forward
    v2p_map, p2v_map, counts = grid_sample(xyz, batch, window_size, start=None)
  File "/home/xxx/3dSegmentation/stratified_transformer/Stratified-Transformer-main/model/stratified_transformer.py", line 59, in grid_sample
    unique, cluster, counts = torch.unique(cluster, sorted=True, return_inverse=True, return_counts=True)
  File "/home/xxx/.conda/envs/s_transformer10/lib/python3.7/site-packages/torch/_jit_internal.py", line 421, in fn
    return if_true(*args, **kwargs)
  File "/home/xxx/.conda/envs/s_transformer10/lib/python3.7/site-packages/torch/_jit_internal.py", line 421, in fn
    return if_true(*args, **kwargs)
  File "/home/xxx/.conda/envs/s_transformer10/lib/python3.7/site-packages/torch/functional.py", line 769, in _unique_impl
    return_counts=return_counts,
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
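One thing I noticed in the trace: the failing call is torch.unique, which is implemented in core PyTorch rather than in torch-points-kernels, so perhaps the torch binary itself also needs kernels for both architectures. A small diagnostic I plan to run on the remote server (assuming torch.cuda.get_arch_list is available in this torch version):

```python
import torch

# Compare the compute capability of the GPU assigned to the job with the
# architectures this torch binary ships kernels for.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0 compute capability: sm_{major}{minor}")
    # e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75']
    print("torch built for:", torch.cuda.get_arch_list())
else:
    print("CUDA is not available in this environment")
```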

Please give me some advice on how to resolve this.

Best,

Eric.
