Skip to content

fix: make quickstart Ray startup Xenna-safe#2089

Closed
nightcityblade wants to merge 4 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1526
Closed

fix: make quickstart Ray startup Xenna-safe#2089
nightcityblade wants to merge 4 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1526

Conversation

@nightcityblade

Copy link
Copy Markdown
Contributor

Description

Fixes #1526.

This updates the quickstart to use SlurmRayClient, which preserves the existing single-node fallback while waiting for allocated Slurm workers before starting the Xenna pipeline. It also sets RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 before Ray cluster startup so Xenna can manage GPU assignment from cluster creation time.

Usage

from nemo_curator.core.client import SlurmRayClient

ray_client = SlurmRayClient()
ray_client.start()

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Local validation:

  • ruff check nemo_curator/core/utils.py tests/core/test_ray_cluster_utils.py tutorials/quickstart.py
  • python3 -m py_compile nemo_curator/core/utils.py tests/core/test_ray_cluster_utils.py tutorials/quickstart.py
  • python3 -m pytest tests/core/test_ray_cluster_utils.py could not run locally because this environment is missing ray (ModuleNotFoundError: No module named 'ray').

Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@nightcityblade nightcityblade requested a review from a team as a code owner June 18, 2026 03:20
@nightcityblade nightcityblade requested review from weijiac0619 and removed request for a team June 18, 2026 03:20
@copy-pr-bot

copy-pr-bot Bot commented Jun 18, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 (via os.environ.setdefault) in both init_cluster (head-node path) and SlurmRayClient._run_as_worker (worker-node path) so that Xenna can manage GPU assignment before Ray masks CUDA_VISIBLE_DEVICES. It also switches tutorials/quickstart.py from RayClient to SlurmRayClient, which falls back to single-node behaviour when SLURM_JOB_ID is absent, and adds four unit tests covering the env-var ordering and preservation logic.

  • nemo_curator/core/utils.py and client.py each get a single os.environ.setdefault(...) call placed just before the respective Popen/subprocess.run that launches the Ray process, with a comment explaining the Xenna requirement.
  • tutorials/quickstart.py replaces the bare RayClient() with SlurmRayClient(), transparently enabling multi-node Slurm start while keeping the existing single-node path.
  • tests/core/test_ray_cluster_utils.py adds four tests: two for init_cluster (env set / env preserved) and two for _run_as_worker (env set / env preserved), using monkeypatch to avoid needing a real Ray install.

Confidence Score: 5/5

Safe to merge — all three execution paths (single-node, Slurm head, Slurm worker) correctly set the env var before launching Ray, and existing user overrides are preserved via setdefault.

The changes are narrow: two os.environ.setdefault calls each placed immediately before their respective subprocess launch, a single import swap in the quickstart tutorial that is fully backward-compatible, and four tests that cover the ordering and preservation invariants. No regressions introduced.

No files require special attention.

Important Files Changed

Filename Overview
nemo_curator/core/utils.py Adds os.environ.setdefault('RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES', '1') before Popen in init_cluster; change is minimal, correctly ordered, and uses setdefault to honour user overrides.
nemo_curator/core/client.py Adds the same os.environ.setdefault call in _run_as_worker before subprocess.run, ensuring worker nodes inherit the flag; also includes a cosmetic nodes-or-nodelist simplification.
tests/core/test_ray_cluster_utils.py New test file covering env-var set/preserve for both init_cluster and _run_as_worker; worker test verifies ordering by asserting inside the fake_run callback.
tutorials/quickstart.py Switches from RayClient() to SlurmRayClient(), which is backward-compatible (falls back to single-node when SLURM_JOB_ID is absent).

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Script as quickstart.py
    participant SRC as SlurmRayClient
    participant RC as RayClient (super)
    participant IC as init_cluster
    participant ENV as os.environ
    participant Ray as ray start (subprocess)

    Script->>SRC: SlurmRayClient()
    Script->>SRC: start()

    alt No SLURM_JOB_ID (single-node fallback)
        SRC->>RC: super().start()
        RC->>IC: init_cluster(...)
        IC->>ENV: setdefault(RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES, 1)
        IC->>Ray: Popen(ray start --head --block)
    else SLURM head node
        SRC->>RC: super().start()
        RC->>IC: init_cluster(...)
        IC->>ENV: setdefault(RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES, 1)
        IC->>Ray: Popen(ray start --head --block)
        SRC->>SRC: _write_head_port()
        SRC->>SRC: _wait_for_workers()
    else SLURM worker node
        SRC->>SRC: _run_as_worker(head_ip)
        SRC->>ENV: setdefault(RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES, 1)
        SRC->>Ray: subprocess.run(ray start --address head:port --block)
        SRC->>Script: sys.exit(returncode)
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Script as quickstart.py
    participant SRC as SlurmRayClient
    participant RC as RayClient (super)
    participant IC as init_cluster
    participant ENV as os.environ
    participant Ray as ray start (subprocess)

    Script->>SRC: SlurmRayClient()
    Script->>SRC: start()

    alt No SLURM_JOB_ID (single-node fallback)
        SRC->>RC: super().start()
        RC->>IC: init_cluster(...)
        IC->>ENV: setdefault(RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES, 1)
        IC->>Ray: Popen(ray start --head --block)
    else SLURM head node
        SRC->>RC: super().start()
        RC->>IC: init_cluster(...)
        IC->>ENV: setdefault(RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES, 1)
        IC->>Ray: Popen(ray start --head --block)
        SRC->>SRC: _write_head_port()
        SRC->>SRC: _wait_for_workers()
    else SLURM worker node
        SRC->>SRC: _run_as_worker(head_ip)
        SRC->>ENV: setdefault(RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES, 1)
        SRC->>Ray: subprocess.run(ray start --address head:port --block)
        SRC->>Script: sys.exit(returncode)
    end
Loading

Reviews (4): Last reviewed commit: "fix: use SlurmRayClient in quickstart" | Re-trigger Greptile

Comment on lines +44 to +52

core_utils.init_cluster(
ray_port=6379,
ray_temp_dir=str(tmp_path),
ray_dashboard_port=8265,
ray_metrics_port=8080,
ray_client_server_port=10001,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing required ray_dashboard_host argument

init_cluster has ray_dashboard_host: str as a required positional parameter (no default value), but both test invocations omit it entirely. This causes an immediate TypeError: init_cluster() missing 1 required positional argument: 'ray_dashboard_host', meaning neither test can pass even when ray is installed. The same omission appears in the second test at line 62–69.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — addressed in 421e8c5. I added the missing ray_dashboard_host argument to both init_cluster test calls, and also covered/fixed the worker-node path by setting RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES before SlurmRayClient._run_as_worker launches ray start.

Validation:

  • uv run ruff check nemo_curator/core/client.py tests/core/test_ray_cluster_utils.py — passed
  • uv run pytest tests/core/test_ray_cluster_utils.py — blocked locally because NeMo Curator raises on non-Linux hosts (darwin); attempting to spoof Linux then fails on Ray/psutil Linux native extension import on macOS.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Jun 21, 2026
Comment thread tutorials/quickstart.py
"""Main function to run the pipeline."""
# Create pipeline
ray_client = RayClient()
ray_client = SlurmRayClient()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is what we want, the quickstart should be generic to a single node run.

Reading through the issue it looks like the main ask is either a way to verify that Ray has been initialized, or a clearer error message if the cluster is not ready.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in afaff1d.

I reverted the quickstart back to RayClient() so it stays generic for the single-node path. The PR is now scoped back to the Xenna-safe env-var setup before Ray subprocess startup, without changing the quickstart client behavior.

Local validation: python -m py_compile tutorials/quickstart.py.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels Jun 26, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label Jun 27, 2026
@nightcityblade

Copy link
Copy Markdown
Contributor Author

Updated the branch with the quickstart fix that was called out in review: tutorials/quickstart.py now instantiates SlurmRayClient, matching the import and avoiding the NameError. I could not run the Curator test slice locally on this macOS host because the package hard-fails on non-Linux platforms, but the branch update is pushed on fix/issue-1526.

@sarahyurick

Copy link
Copy Markdown
Contributor

Hi @nightcityblade I don't think this is the solution the issue is looking for. Closing the PR, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray cluster startup race condition in quickstart (Slurm) + Xenna GPU env error

4 participants