[Feature] Add SkyPilot examples #422
/gemini review
Code Review
This pull request introduces first-class support for SkyPilot, enabling AReaL experiments to be run on cloud and Kubernetes infrastructure. The changes include a new SkyPilotLauncherConfig, the SkyPilotLauncher implementation, and extensive documentation and examples.
The overall implementation is solid and follows SkyPilot's best practices. The new launcher is well-structured, handling cluster provisioning, job submission, and state management correctly. The documentation is also comprehensive and will be very helpful for users.
I've found a few issues that should be addressed:
- There are hardcoded network ports in the launcher, which could cause conflicts.
- There's a bug in the calculation of trainer nodes, leading to incorrect resource allocation.
- The example ray_cluster.yaml and its corresponding documentation contain a shell script with syntax errors and a logic bug that would cause worker nodes to terminate prematurely.
Addressing these points will improve the robustness and correctness of the SkyPilot integration. Great work on adding this powerful feature!
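On the hardcoded-ports point, here is a minimal sketch of one common alternative: asking the OS for a free port instead of fixing a number. The helper name `find_free_port` is hypothetical and is not taken from the launcher in this PR.

```python
import socket


def find_free_port(host: str = "") -> int:
    """Ask the OS for a currently unused TCP port and return its number.

    Hypothetical helper for illustration; not part of the launcher under review.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the kernel pick any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(f"free port: {find_free_port()}")
```

The port is released when the socket closes, so another process could claim it before it is reused. A launcher would likely still need to handle a bind failure and retry.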
examples/skypilot/README.md
Outdated
future launches.
```bash
sky volumes apply storage-volume.yaml
```
You need to make it clear which directory the user should run the steps in this README from.
Here it assumes examples/skypilot, but later it assumes the root of the repo.
I think we could shorten the content about cloud buckets and volumes and refer to the SkyPilot cloud bucket and volume guides instead.
Also, I have checked the other places to ensure that users can run these commands from the root of the repo.
examples/skypilot/README.md
Outdated
  /storage: areal-shared-storage
setup: |
  pip3 install -e .
Maybe worth creating a virtual env instead of installing with pip as root
Using the AReaL repo root directory as the workdir, together with our image, means we do not need pip install -e . (or any other installation) before launching the experiment. Therefore the setup section here has been removed.
examples/skypilot/README.md
Outdated
```bash
export WANDB_API_KEY=<your-wandb-api-key>
sky launch -c areal --secret WANDB_API_KEY examples/skypilot/ray_cluster.yaml
```
This command fails for me with this:
(head, rank=0, pid=4232) Executing training script on head node...
(worker1, rank=1, pid=3359, ip=10.170.27.163) Node setup complete for rank 1.
(head, rank=0, pid=4232) 2025-10-11 02:34:06,774 WARNING services.py:394 -- Found multiple active Ray instances: {'10.156.61.243:6380', '10.156.61.243:6379'}. Connecting to latest cluster at 10.156.61.243:6379. You can override this by setting the `--address` flag or `RAY_ADDRESS` environment variable.
(head, rank=0, pid=4232) 2025-10-11 02:34:06,774 INFO worker.py:1771 -- Connecting to existing Ray cluster at address: 10.156.61.243:6379...
(head, rank=0, pid=4232) 2025-10-11 02:34:06,785 INFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
(head, rank=0, pid=4232) Traceback (most recent call last):
(head, rank=0, pid=4232) File "<frozen runpy>", line 198, in _run_module_as_main
(head, rank=0, pid=4232) File "<frozen runpy>", line 88, in _run_code
(head, rank=0, pid=4232) File "/root/sky_workdir/areal/launcher/ray.py", line 591, in <module>
(head, rank=0, pid=4232) main()
(head, rank=0, pid=4232) File "/root/sky_workdir/areal/launcher/ray.py", line 330, in main
(head, rank=0, pid=4232) config, _ = parse_cli_args(sys.argv[1:])
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/root/sky_workdir/areal/api/cli_args.py", line 1308, in parse_cli_args
(head, rank=0, pid=4232) cfg = hydra_compose(
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/usr/local/lib/python3.12/dist-packages/hydra/compose.py", line 38, in compose
(head, rank=0, pid=4232) cfg = gh.hydra.compose_config(
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/hydra.py", line 594, in compose_config
(head, rank=0, pid=4232) cfg = self.config_loader.load_configuration(
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py", line 142, in load_configuration
(head, rank=0, pid=4232) return self._load_configuration_impl(
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py", line 244, in _load_configuration_impl
(head, rank=0, pid=4232) parsed_overrides, caching_repo = self._parse_overrides_and_create_caching_repo(
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py", line 228, in _parse_overrides_and_create_caching_repo
(head, rank=0, pid=4232) parsed_overrides = parser.parse_overrides(overrides=overrides)
(head, rank=0, pid=4232) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232) File "/usr/local/lib/python3.12/dist-packages/hydra/core/override_parser/overrides_parser.py", line 99, in parse_overrides
(head, rank=0, pid=4232) raise OverrideParseException(
(head, rank=0, pid=4232) hydra.errors.OverrideParseException: mismatched input '=' expecting <EOF>
(head, rank=0, pid=4232) See https://hydra.cc/docs/1.2/advanced/override_grammar/basic for details
(head, rank=0, pid=4232) Node setup complete for rank 0.
It is caused by +trainer_env_vars="WANDB_API_KEY=$WANDB_API_KEY". This is a limitation of Hydra, which does not allow = to appear in command-line argument values. Currently, users can only set environment variables in the YAML config file. We are looking for workarounds that let users set environment variables on the command line.
For now, I think we should just remove WANDB_API_KEY from the examples to keep them clear and runnable.
examples/skypilot/README.md
Outdated
  --config examples/skypilot/gsm8k_grpo_ray.yaml \
  experiment_name=<your experiment name> \
  trial_name=<your trial name> \
  trainer_env_vars="WANDB_API_KEY=$WANDB_API_KEY"
This is wrong and needs to be replaced with '+launcher.trainer_env_vars="WANDB_API_KEY='$WANDB_API_KEY'"'
otherwise it fails
Same as above.
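For reference, a minimal sketch of the quoting approach suggested in the comment above: wrap the value in double quotes so the override parser sees a quoted string, then single-quote the whole argument for the shell. It assumes Hydra accepts = inside a double-quoted value, as the suggested command implies. The helper name `quoted_override` is hypothetical, and this has not been verified against the AReaL launcher.

```python
import shlex


def quoted_override(key: str, value: str) -> str:
    """Return key="value" with the value wrapped in double quotes.

    Hypothetical helper. Values containing a double quote or a backslash
    are rejected rather than escaped, to keep the sketch simple.
    """
    if '"' in value or "\\" in value:
        raise ValueError("value must not contain double quotes or backslashes")
    return f'{key}="{value}"'


if __name__ == "__main__":
    override = quoted_override("+launcher.trainer_env_vars", "WANDB_API_KEY=abc123")
    # shlex.quote protects the inner double quotes from the shell.
    print(shlex.quote(override))
```

Placing the secret on the command line also exposes it in process listings, which is one more reason the YAML config or SkyPilot secrets may be preferable.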
If `GCP: enabled` or `Kubernetes: enabled` are shown, you're ready to use SkyPilot with
AReaL. Check [here](../examples/skypilot.md) for a detailed example to run AReaL with
SkyPilot. For more options and details for SkyPilot, see the official
[SkyPilot installation guide](https://docs.skypilot.co/en/latest/getting-started/installation.html).
We need to link to this page on how to configure K8s to work with SkyPilot: https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html
Added reference in the Kubernetes setup section above.
examples/skypilot/README.md
Outdated
resolve and distributed checkpointing. The following guideline shows how to use SkyPilot
volumes to setup a high-performance shared storage.

1. **Define the volume.** Create a YAML file describing the volume you want SkyPilot to
While using volumes is fine, this is not required. And using cloud buckets could be simpler: https://docs.skypilot.co/en/latest/reference/storage.html
Added cloud bucket usage in the example.
docs/tutorial/installation.md
Outdated
If `GCP: enabled` or `Kubernetes: enabled` are shown, you're ready to use SkyPilot with
AReaL. Check [here](../examples/skypilot.md) for a detailed example to run AReaL with
This file doesn't exist
Here we changed this link to https://github.com/inclusionAI/AReaL/blob/main/examples/skypilot/README.md to ensure the link is available in our documentation pages after this PR is merged into main.
docs/tutorial/installation.md
Outdated
```bash
# Ensure your kubeconfig is at ~/.kube/config
mkdir -p ~/.kube
cp /path/to/kubeconfig ~/.kube/config
```
It's unclear where /path/to/kubeconfig comes from
Removed this and referred to the SkyPilot K8s setup guide instead.
examples/skypilot/README.md
Outdated
```yaml
resources:
  accelerators: H100:8
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.3
```
It looks like this image is configured to use a custom PyPI index https://pypi.antfin-inc.com/simple.
It doesn't work for me. Here's what I see:
(setup pid=4496) Looking in indexes: https://pypi.antfin-inc.com/simple
(setup pid=4496) Obtaining file:///root/sky_workdir
(setup pid=4496) Installing build dependencies: started
(setup pid=3465, ip=10.170.27.38) Looking in indexes: https://pypi.antfin-inc.com/simple
(setup pid=3465, ip=10.170.27.38) Obtaining file:///root/sky_workdir
(setup pid=3465, ip=10.170.27.38) Installing build dependencies: started
(setup pid=4496) Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38) Installing build dependencies: still running...
(setup pid=4496) Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38) Installing build dependencies: still running...
(setup pid=4496) Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38) Installing build dependencies: still running...
(setup pid=4496) Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38) Installing build dependencies: still running...
(setup pid=4496) Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38) Installing build dependencies: still running...
(setup pid=4496) Installing build dependencies: still running...
(setup pid=4496) Installing build dependencies: finished with status 'error'
(setup pid=4496) error: subprocess-exited-with-error
(setup pid=4496)
(setup pid=4496) × pip subprocess to install build dependencies did not run successfully.
(setup pid=4496) │ exit code: 1
(setup pid=4496) ╰─> [8 lines of output]
(setup pid=4496) Looking in indexes: https://pypi.antfin-inc.com/simple
(setup pid=4496) WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80fda300>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496) WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca2d0>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496) WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca480>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496) WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca5d0>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496) WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca780>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496) ERROR: Could not find a version that satisfies the requirement setuptools>=61.0 (from versions: none)
(setup pid=4496) ERROR: No matching distribution found for setuptools>=61.0
(setup pid=4496) [end of output]
(setup pid=4496)
(setup pid=4496) note: This error originates from a subprocess, and is likely not a problem with pip.
(setup pid=4496) error: subprocess-exited-with-error
(setup pid=4496)
(setup pid=4496) × pip subprocess to install build dependencies did not run successfully.
(setup pid=4496) │ exit code: 1
(setup pid=4496) ╰─> See above for output.
(setup pid=4496)
(setup pid=4496) note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Job 1's setup failed with return code list: [137, 1]
✓ Job finished (status: FAILED_SETUP).
command terminated with exit code 100
Running the experiment no longer requires any installation. However, our public image is still configured with the custom PyPI index. We have noted this and will fix it in our next image release.
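Until a fixed image is released, one possible workaround for anyone who still needs pip inside the container is to point pip at the public index through the `PIP_INDEX_URL` environment variable, which takes precedence over pip's config files. The sketch below only builds such an environment. The helper name `pip_env` is hypothetical, and this assumes the image sets its index through pip configuration or environment rather than explicit command-line flags.

```python
import os
from typing import Mapping, Optional

PUBLIC_INDEX = "https://pypi.org/simple"


def pip_env(
    base: Optional[Mapping[str, str]] = None,
    index_url: str = PUBLIC_INDEX,
) -> dict[str, str]:
    """Return a copy of the environment with PIP_INDEX_URL overridden.

    Hypothetical helper; the base mapping is copied, never modified.
    """
    env = dict(os.environ if base is None else base)
    env["PIP_INDEX_URL"] = index_url
    return env


if __name__ == "__main__":
    print(pip_env()["PIP_INDEX_URL"])
```

The returned mapping can be passed as `env=` to `subprocess.run` when invoking `python -m pip`.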
examples/skypilot/README.md
Outdated
echo "Starting Ray head node..."
ray start --head --port=6379
while [ $(ray nodes | grep NODE_ID | wc -l) -lt $num_nodes ]; do
ray nodes command doesn't exist:
(head, rank=0, pid=4484) Usage: ray [OPTIONS] COMMAND [ARGS]...
(head, rank=0, pid=4484) Try 'ray --help' for help.
(head, rank=0, pid=4484)
(head, rank=0, pid=4484) Error: No such command 'nodes'.
Fixed this by using ray status instead.
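The wait-for-workers step can also be written as a bounded polling loop rather than an open-ended shell `while`. This is a minimal sketch, not the script from this PR: `wait_for_nodes` and its parameters are hypothetical names, and the node counter is injected so the loop does not depend on any particular Ray command.

```python
import time
from typing import Callable


def wait_for_nodes(
    count_alive: Callable[[], int],
    expected: int,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll until at least `expected` nodes are alive or the timeout passes.

    Returns True when enough nodes joined and False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if count_alive() >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Fake counter standing in for a real cluster query (assumption: in a Ray
    # driver this could count the entries of ray.nodes() that are alive).
    counts = iter([1, 1, 2])
    print(wait_for_nodes(lambda: next(counts), expected=2, interval_s=0.0))
```

A timeout lets the head node fail with a clear message instead of hanging when a worker never joins.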
examples/skypilot/README.md
Outdated
echo "Executing training script on head node..."
python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
  --config examples/skypilot/gsm8k_grpo_ray.yaml \
  experiment_name=<your experiment name> \
Use envs and secrets to set these as env vars:
envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name
secrets:
  WANDB_API_KEY: null
and then:
experiment_name=$EXPERIMENT_NAME \
trial_name=$TRIAL_NAME \
Done.
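To illustrate the envs-based approach from the suggestion above, here is a small sketch that turns the two environment variables into launcher overrides and fails early if either is missing. `build_overrides` is a hypothetical helper, not code from this PR. Only the variable names EXPERIMENT_NAME and TRIAL_NAME and the override keys come from the suggestion.

```python
import os
from typing import Mapping, Optional


def build_overrides(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Build experiment_name/trial_name overrides from environment variables.

    Raises KeyError when a required variable is unset or empty.
    """
    env = os.environ if env is None else env
    required = ("EXPERIMENT_NAME", "TRIAL_NAME")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return [
        f"experiment_name={env['EXPERIMENT_NAME']}",
        f"trial_name={env['TRIAL_NAME']}",
    ]


if __name__ == "__main__":
    demo = {"EXPERIMENT_NAME": "my-areal-experiment", "TRIAL_NAME": "my-trial-name"}
    print(build_overrides(demo))
```

Failing early on a missing variable avoids launching a cluster with an empty experiment name.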
alex000kim
left a comment
I think this PR is a bit raw.
The training job doesn't run due to incorrect syntax in several places:
- non-existent commands
- incorrect parameters
- etc.
Thanks for your review! We have GCP access now and we will be able to test and debug this PR by ourselves. We will start fixing this PR right away.
…nto mzy/skypilot
Great suggestion! Changed |
Added a note on additional config when using a cluster with InfiniBand.
Thanks for your review! Please check again if recent changes have addressed your comments. @alex000kim @garrett4wade |
Michaelvll
left a comment
Thanks for adding this @nuzant! It looks great to me! Just curious, would it be possible to add a section in the quick start as well: https://inclusionai.github.io/AReaL/tutorial/quickstart.html
Sure! We added command lines to run skypilot examples and a portal to |
This pull request adds comprehensive support and documentation for running AReaL experiments with SkyPilot on cloud and Kubernetes infrastructures. It introduces example YAML configurations for both single-node and multi-node experiments, a detailed README for SkyPilot usage, and step-by-step installation instructions. These changes make it much easier to launch distributed AReaL experiments on GCP or Kubernetes using SkyPilot.
SkyPilot Integration and Documentation
- docs/tutorial/installation.md: step-by-step instructions for installing and verifying SkyPilot, including GCP and Kubernetes setup guidance.
- examples/skypilot/README.md: detailed usage examples, explanations, and command lines for running AReaL experiments with SkyPilot, covering both single-node and multi-node setups.
Example Configurations for SkyPilot
- examples/skypilot/local.yaml: a template for launching a single-node AReaL experiment with SkyPilot on GCP, specifying resources, storage, and launch commands.
- examples/skypilot/ray_cluster.yaml: launches a multi-node Ray cluster with SkyPilot, including setup for distributed training and shared storage.
- examples/skypilot/gsm8k_grpo_ray.yaml: a sample AReaL experiment configuration for Ray-based distributed training, detailing experiment parameters and resource allocation.
UPDATE: Separated examples and launcher into 2 PRs: #464