fix: make quickstart Ray startup Xenna-safe #2089
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
core_utils.init_cluster(
    ray_port=6379,
    ray_temp_dir=str(tmp_path),
    ray_dashboard_port=8265,
    ray_metrics_port=8080,
    ray_client_server_port=10001,
)
Missing required ray_dashboard_host argument
init_cluster has ray_dashboard_host: str as a required positional parameter (no default value), but both test invocations omit it entirely. This causes an immediate TypeError: init_cluster() missing 1 required positional argument: 'ray_dashboard_host', meaning neither test can pass even when ray is installed. The same omission appears in the second test at lines 62–69.
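To illustrate the failure mode described above, here is a minimal sketch using a stand-in function. The stub's signature is hypothetical: it only mirrors the parameter names mentioned in this thread and is not the real init_cluster. The host value 127.0.0.1 and the temp directory are assumed examples.

```python
# Hypothetical stand-in: parameter names mirror the test call and the review
# comment, but this is NOT the real init_cluster implementation.
def init_cluster_stub(
    ray_port: int,
    ray_temp_dir: str,
    ray_dashboard_port: int,
    ray_metrics_port: int,
    ray_client_server_port: int,
    ray_dashboard_host: str,  # required: no default value
) -> dict:
    return {"port": ray_port, "dashboard_host": ray_dashboard_host}


kwargs = dict(
    ray_port=6379,
    ray_temp_dir="/tmp/ray",  # assumed example path
    ray_dashboard_port=8265,
    ray_metrics_port=8080,
    ray_client_server_port=10001,
)

# Omitting the required argument fails before the function body runs.
try:
    init_cluster_stub(**kwargs)
except TypeError as exc:
    error_message = str(exc)

# Supplying it (the value here is an assumed example) lets the call proceed.
result = init_cluster_stub(**kwargs, ray_dashboard_host="127.0.0.1")
```

Because the error is raised at call time, the tests would fail regardless of whether Ray itself starts correctly.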
Thanks — addressed in 421e8c5. I added the missing ray_dashboard_host argument to both init_cluster test calls, and also covered/fixed the worker-node path by setting RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES before SlurmRayClient._run_as_worker launches ray start.
Validation:
uv run ruff check nemo_curator/core/client.py tests/core/test_ray_cluster_utils.py: passed
uv run pytest tests/core/test_ray_cluster_utils.py: blocked locally because NeMo Curator raises on non-Linux hosts (darwin); attempting to spoof Linux then fails on the Ray/psutil Linux native extension import on macOS.
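For readers unfamiliar with the pattern, the idea of setting RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES before a worker launches ray start can be sketched as follows. This is an illustration under assumptions, not the code from the commit: build_ray_env and launch are hypothetical helpers, and the child process is a stand-in that only prints the variable rather than running ray start.

```python
import os
import subprocess
import sys

ENV_VAR = "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"


def build_ray_env(base_env=None):
    """Return a copy of the environment with the variable set to "1"."""
    env = dict(os.environ if base_env is None else base_env)
    env[ENV_VAR] = "1"
    return env


def launch(cmd):
    """Run cmd with the variable set and return its stripped stdout."""
    completed = subprocess.run(
        cmd, env=build_ray_env(), capture_output=True, text=True, check=True
    )
    return completed.stdout.strip()


# Stand-in child process; a real worker would run `ray start ...` instead.
child = [sys.executable, "-c", f"import os; print(os.environ.get('{ENV_VAR}'))"]
child_value = launch(child)
```

Passing a copied environment to the subprocess keeps the parent process untouched; whether the actual change mutates os.environ or passes env explicitly is not shown in this thread.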
    """Main function to run the pipeline."""
    # Create pipeline
-   ray_client = RayClient()
+   ray_client = SlurmRayClient()
I don't think this is what we want; the quickstart should stay generic for a single-node run.
Reading through the issue, it looks like the main ask is either a way to verify that Ray has been initialized, or a clearer error message if the cluster is not ready.
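One possible shape for such a check, sketched under assumptions: wait_until_ready and ClusterNotReadyError are hypothetical names, and the probe is any zero-argument callable that reports readiness. Nothing here is taken from NeMo Curator.

```python
import time
from typing import Callable


class ClusterNotReadyError(RuntimeError):
    """Raised when the readiness probe never succeeds (hypothetical name)."""


def wait_until_ready(
    probe: Callable[[], bool], timeout_s: float = 30.0, poll_s: float = 0.5
) -> float:
    """Poll probe until it returns True; return elapsed seconds or raise."""
    start = time.monotonic()
    while True:
        if probe():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise ClusterNotReadyError(
                f"Ray cluster was not ready after {timeout_s:.1f}s; "
                "check that the head node started and workers have joined."
            )
        time.sleep(poll_s)
```

With Ray installed, a caller could pass ray.is_initialized as the probe, for example wait_until_ready(ray.is_initialized, timeout_s=60). Whether that probe is sufficient for the multi-node case in the issue is an open question.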
Updated in afaff1d.
I reverted the quickstart back to RayClient() so it stays generic for the single-node path. The PR is now scoped back to the Xenna-safe env-var setup before Ray subprocess startup, without changing the quickstart client behavior.
Local validation: python -m py_compile tutorials/quickstart.py.
Updated the branch with the quickstart fix that was called out in review:
Hi @nightcityblade, I don't think this is the solution the issue is looking for. Closing the PR, thanks.
Description
Fixes #1526.
This updates the quickstart to use SlurmRayClient, which preserves the existing single-node fallback while waiting for allocated Slurm workers before starting the Xenna pipeline. It also sets RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 before Ray cluster startup so Xenna can manage GPU assignment from cluster creation time.
Usage
Checklist
Local validation:
ruff check nemo_curator/core/utils.py tests/core/test_ray_cluster_utils.py tutorials/quickstart.py
python3 -m py_compile nemo_curator/core/utils.py tests/core/test_ray_cluster_utils.py tutorials/quickstart.py
python3 -m pytest tests/core/test_ray_cluster_utils.py could not run locally because this environment is missing ray (ModuleNotFoundError: No module named 'ray').
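A test module can report a missing ray install or a non-Linux host as a skip instead of an import error. The sketch below uses only the standard library; the class and test names are hypothetical and are not taken from tests/core/test_ray_cluster_utils.py.

```python
import importlib.util
import sys
import unittest

# Detect prerequisites without importing ray at module load time.
RAY_AVAILABLE = importlib.util.find_spec("ray") is not None
ON_LINUX = sys.platform.startswith("linux")


@unittest.skipUnless(ON_LINUX, "cluster tests require a Linux host")
@unittest.skipUnless(RAY_AVAILABLE, "ray is not installed")
class ClusterStartupTests(unittest.TestCase):  # hypothetical test class
    def test_ray_importable(self):
        import ray  # imported lazily so test collection never fails

        self.assertTrue(hasattr(ray, "init"))


def run_suite() -> unittest.TestResult:
    """Load and run the test case, returning the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClusterStartupTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

On a machine without ray, the run reports the test as skipped with the stated reason rather than failing with ModuleNotFoundError.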