CUDA illegal memory access when running Segger on multi-GPU machine #12

@enric-bazz

Description

When running the Segger CLI to segment Xenium version 4.0.0 data on a machine with two separate GPUs, I encounter a cascade of CUDA illegal memory access errors whenever both GPUs are visible. This appears to be caused by automatic distributed process spawning. Limiting execution to a single GPU avoids the issue.


Steps to reproduce

  1. Run the Segger CLI on a 2× NVIDIA GeForce RTX 4090 machine with the default environment.
  2. Observe output:
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
...
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
  3. Segger crashes with multiple errors such as:
RuntimeError: parallel_for: failed to synchronize: cudaErrorIllegalAddress
During handling of the above exception, another exception occurred:
RuntimeError: CUDA error: an illegal memory access was encountered
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS
  4. Limiting the session to a single GPU resolves the issue:
export CUDA_VISIBLE_DEVICES=0

Output:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
...
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Execution completes without errors.
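
For anyone who launches Segger from a script rather than an interactive shell, the same workaround can be applied per process instead of exporting the variable for the whole session. The following is only a minimal sketch: `single_gpu_env` and `run_on_single_gpu` are hypothetical helper names, and the child command is a placeholder that echoes the variable, since the exact Segger invocation is not shown in this report.

```python
import os
import subprocess
import sys


def single_gpu_env(device_index=0, base_env=None):
    """Return a copy of the environment that exposes only one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable when it initializes, so it must be set
    # before the child process creates a CUDA context.
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


def run_on_single_gpu(command, device_index=0):
    """Run `command` (a list of arguments) with only one GPU visible."""
    return subprocess.run(command, env=single_gpu_env(device_index), check=True)


if __name__ == "__main__":
    # Placeholder child process that only prints the variable; replace
    # it with the actual Segger command line.
    run_on_single_gpu(
        [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    )
```

Because the variable is set only in the child's environment, other jobs started from the same shell can still see both GPUs.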


Environment

  • Python: 3.11.14
  • Segger: 0.2.0
  • PyTorch: 2.5.0+cu121
  • Lightning: 2.6.0
  • CUDA: 12.2.0
  • NVIDIA Drivers: 535.247.01
  • GPU: 2 × NVIDIA GeForce RTX 4090

Relevant packages and versions:

Package          Version
torch_scatter    2.1.2+pt25cu121
cuml-cu12        25.4.0
cugraph-cu12     25.4.1
cuspatial-cu12   25.4.0
cudf-cu12        25.4.0
cupy-cuda12x     13.6.0
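
To compare another environment against the versions listed above, a small script along these lines may help. It is a sketch only: the distribution names in `PACKAGES` are assumed to match what is installed (in particular, `segger` as a distribution name is an assumption), and `collect_versions` is a hypothetical helper, not part of Segger.

```python
from importlib import metadata

# Distribution names as they appear in this report; adjust if your
# installation uses different names ("segger" is an assumption).
PACKAGES = [
    "segger",
    "torch",
    "lightning",
    "torch_scatter",
    "cuml-cu12",
    "cugraph-cu12",
    "cuspatial-cu12",
    "cudf-cu12",
    "cupy-cuda12x",
]


def collect_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions(PACKAGES).items():
        print(f"{name} {version if version is not None else 'not installed'}")
```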
