UCX segfault during Ray shutdown on SLURM/Apptainer #2073

Description

@mgrbyte

Describe the bug
NeMo Curator pipelines running on SLURM with Apptainer containers complete all stages successfully, but crash with a segmentation fault (signal 11) during Ray cluster shutdown. This causes SLURM to mark the job as FAILED despite all data being correctly written, which breaks afterok dependency chains for post-pipeline jobs (report generation, HF sync).

Steps/Code to reproduce bug
The crash occurs intermittently (~50% of runs, sample size 3) after the pipeline completes and SlurmRayClient calls ray.shutdown(). All output data is valid on disk.

Expected behavior
A pipeline that completes all stages successfully should exit with status 0.
Instead, Ray shutdown triggers a SIGSEGV in UCX libucs.so, causing SLURM to mark the job as FAILED.

Additional context
This section covers the stack trace, related issues (openucx/ucx#7283, ray-project/ray#6025, ray-project/ray#59551), the impact on afterok chains, and the questions about UCX_TLS=tcp.

Example stack trace:

[<hostname-here>:2186963:0:2290267] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa0)
==== backtrace (tid:2290267) ====
 0  /opt/venv/lib/python3.12/site-packages/libucx/lib/libucs.so(ucs_handle_error+0x294)
 1  /opt/venv/lib/python3.12/site-packages/libucx/lib/libucs.so(+0x34cca)
 2  /opt/venv/lib/python3.12/site-packages/libucx/lib/libucs.so(+0x34f7e)
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x45330)
 4  /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0xcae8cd)
 5  /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0xcae997)
 6  /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x8f)
 7  /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x10f4ada)
 8  /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x111a6ff)
 9  /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x1122ec6)
10  /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x11197a5)
11  /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x17db5a0)
12  /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)
13  /lib/x86_64-linux-gnu/libc.so.6(+0x129c3c)
=================================
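
To compare traces across runs, the frames can be reduced to (index, library, symbol) tuples. This is a minimal sketch that assumes the frame format shown above; the regex and the function name are illustrative, and the sample holds only three of the frames. Only two frames in the trace carry a symbol name: ucs_handle_error, and _ZN3ray6RayLogD1Ev, which is the Itanium-ABI mangled form of ray::RayLog::~RayLog().

```python
import re
from collections import Counter

# Matches frame lines such as " 6  /path/lib.so(symbol+0x8f)"
# or " 1  /path/lib.so(+0x34cca)" (no symbol, offset only).
FRAME_RE = re.compile(r"^\s*(\d+)\s+(\S+?)\(([^+)]*)\+0x([0-9a-f]+)\)\s*$")


def parse_frames(backtrace: str) -> list[tuple[int, str, str]]:
    """Return (frame index, library file name, symbol or '') per frame line."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            index, path, symbol, _offset = match.groups()
            frames.append((int(index), path.rsplit("/", 1)[-1], symbol))
    return frames


# Three frames copied from the trace above.
SAMPLE = """\
 0  /opt/venv/lib/python3.12/site-packages/libucx/lib/libucs.so(ucs_handle_error+0x294)
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x45330)
 6  /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x8f)
"""

frames = parse_frames(SAMPLE)
per_library = Counter(lib for _, lib, _ in frames)  # frame count per shared object
named = [(i, sym) for i, _, sym in frames if sym]   # frames that carry a symbol
```

Running parse_frames on the full trace from several failed jobs would show whether the crash always passes through the same _raylet.so frames.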
