Describe the bug
NeMo Curator pipelines running on SLURM with Apptainer containers complete all stages successfully, but crash with a segmentation fault (signal 11) during Ray cluster shutdown. This causes SLURM to mark the job as FAILED despite all data being correctly written, which breaks afterok dependency chains for post-pipeline jobs (report generation, HF sync).
Steps/Code to reproduce bug
The crash occurs intermittently (~50% of runs, sample size 3) after the pipeline completes and SlurmRayClient calls ray.shutdown(). All output data is valid on disk.
Expected behavior
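A pipeline that completes all stages successfully should exit with code 0.
Instead, Ray shutdown triggers a SIGSEGV that is caught and reported by the UCX error handler in libucs.so (the faulting frames in the trace below are in Ray's _raylet.so), causing SLURM to mark the job as FAILED.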
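Until the root cause is fixed, one way to keep afterok chains working is to wrap the pipeline and decide the exit code from the outputs rather than from the shutdown crash. The following is a minimal sketch, not part of NeMo Curator: it assumes the pipeline writes a sentinel file (hypothetical, e.g. `DONE`) after the last stage finishes and before ray.shutdown() is called, and the function names are made up for illustration.

```python
import signal
import subprocess
from pathlib import Path


def effective_exit_code(returncode: int, sentinel: Path) -> int:
    """Return the exit code the SLURM step should report.

    `sentinel` is a hypothetical marker file that the pipeline is assumed
    to write once all stages have finished, before Ray is shut down.
    """
    # subprocess reports death by signal as a negative number (-11 for
    # SIGSEGV); a shell reports the same event as 128 + 11 = 139.
    segfaulted = returncode in (-signal.SIGSEGV, 128 + signal.SIGSEGV)
    if segfaulted and sentinel.exists():
        # All stages completed; treat the shutdown crash as success.
        return 0
    # Anything else is passed through, using the shell convention for signals.
    return returncode if returncode >= 0 else 128 + abs(returncode)


def run_pipeline(command: list[str], sentinel: Path) -> int:
    """Run the pipeline command and translate its exit status."""
    result = subprocess.run(command)
    return effective_exit_code(result.returncode, sentinel)
```

A job script would call `sys.exit(run_pipeline([...], Path("output/DONE")))` as its last step. This only masks the symptom: a segfault that happens before the sentinel is written is still reported as a failure.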
Additional context
Relevant context: the stack trace below, related issues (openucx/ucx#7283, ray-project/ray#6025, ray-project/ray#59551), the impact on afterok dependency chains described above, and open questions about setting UCX_TLS=tcp.
Example of stacktrace:
[<hostname-here>:2186963:0:2290267] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa0)
==== backtrace (tid:2290267) ====
0 /opt/venv/lib/python3.12/site-packages/libucx/lib/libucs.so(ucs_handle_error+0x294)
1 /opt/venv/lib/python3.12/site-packages/libucx/lib/libucs.so(+0x34cca)
2 /opt/venv/lib/python3.12/site-packages/libucx/lib/libucs.so(+0x34f7e)
3 /lib/x86_64-linux-gnu/libc.so.6(+0x45330)
4 /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0xcae8cd)
5 /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0xcae997)
6 /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x8f)
7 /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x10f4ada)
8 /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x111a6ff)
9 /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x1122ec6)
10 /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x11197a5)
11 /opt/venv/lib/python3.12/site-packages/ray/_raylet.so(+0x17db5a0)
12 /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)
13 /lib/x86_64-linux-gnu/libc.so.6(+0x129c3c)
=================================
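On the UCX_TLS=tcp question: UCX reads UCX_TLS from the environment, so the variable has to be set before the process that loads libucs.so starts. Whether restricting UCX to TCP avoids this crash is untested here. The sketch below only shows one way to try it from a Python launcher; the helper names are hypothetical.

```python
import os
import subprocess


def ucx_tcp_env(base=None, tls="tcp"):
    """Return a copy of the environment with UCX_TLS forced to `tls`."""
    # Copy so the caller's mapping (or os.environ) is not modified.
    env = dict(os.environ if base is None else base)
    env["UCX_TLS"] = tls
    return env


def run_with_ucx_tcp(command):
    """Run `command` with UCX_TLS=tcp set in the child's environment."""
    # The variable must be present before the child loads UCX, so it is
    # passed at process creation rather than set later inside the pipeline.
    return subprocess.run(command, env=ucx_tcp_env()).returncode
```

When running under Apptainer, the variable also needs to reach the process inside the container, which depends on how the container is launched.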