[Feature] Add SkyPilot examples #422

@@ -0,0 +1,158 @@
# Running AReaL with SkyPilot

This README includes examples and guidelines for running AReaL experiments with
SkyPilot. Make sure you have SkyPilot properly installed by following
[our installation guide](../../docs/tutorial/installation.md#optional-install-skypilot)
before running these examples. Note that all command lines shown in this file are
assumed to be executed from the root of the AReaL repository.
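
Before launching anything, it can help to confirm that SkyPilot can actually reach
your cloud. A minimal sketch, assuming GCP as the target cloud and a pip-based install
(the installation guide above may recommend a different method):

```bash
# Install SkyPilot with the GCP extra (adjust the extra to your cloud provider).
pip install -U "skypilot[gcp]"

# Verify that cloud credentials are set up and GCP is enabled for SkyPilot.
sky check gcp
```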

## Running a Single-Node Experiment

To run a single-node experiment, you only need to set up a node with SkyPilot and
launch the experiment with the AReaL local launcher. [The following file](local.yaml)
shows a SkyPilot YAML that launches a simple GSM8K GRPO experiment in a single command
line. This example runs on GCP, but can easily be migrated to another cloud or a
Kubernetes cluster by changing the `resources.infra` field in the SkyPilot YAML file.

```yaml
name: areal-test-skypilot

resources:
  infra: gcp
  accelerators: A100:2
  autostop:
    idle_minutes: 10
    down: true
  cpus: 8+
  memory: 32GB+
  disk_size: 256GB
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4

num_nodes: 1

workdir: .

envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

run: |
  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=$EXPERIMENT_NAME \
    trial_name=$TRIAL_NAME \
    cluster.n_gpus_per_node=2 \
    allocation_mode=sglang.d1+d1 \
    train_dataset.batch_size=4 \
    actor.mb_spec.max_tokens_per_mb=4096
```

To run the experiment, execute:

```bash
sky launch -c areal-test examples/skypilot/local.yaml
```
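
While the job runs, you can monitor it with standard SkyPilot CLI commands. Note that
this YAML sets `autostop` with `down: true`, so the cluster tears itself down after 10
idle minutes:

```bash
# Show cluster state, including the autostop countdown.
sky status

# Stream the logs of the launched training job.
sky logs areal-test
```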

## Running a Multi-Node Experiment

### Running AReaL with the Ray Launcher

The following example shows how to set up a Ray cluster with SkyPilot and then use
AReaL to run GRPO on the GSM8K dataset on 2 nodes, each with 1 A100 GPU. This example
runs on GCP, but can easily be migrated to another cloud or a Kubernetes cluster by
changing the `resources.infra` field in the SkyPilot YAML file.
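
Before editing `resources.infra`, you can check which clouds or Kubernetes contexts
SkyPilot is currently able to use in your environment:

```bash
# List the infrastructures with working credentials in this environment.
sky check
```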

Specify the resources and image used to run the experiment:

```yaml
resources:
  infra: gcp
  accelerators: A100:1
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4
  memory: 256+
  cpus: 32+

num_nodes: 2

workdir: .
```
Designate shared storage. You can either use an existing cloud bucket or volume:

```yaml
file_mounts:
  /storage: gs://areal-default
```

or create a new bucket or volume with SkyPilot:

```yaml
file_mounts:
  /storage:
    name: areal-test
    store: gcs
```

For more information about shared storage with SkyPilot, see
[SkyPilot Cloud Buckets](https://docs.skypilot.co/en/latest/reference/storage.html) and
[SkyPilot Volumes](https://docs.skypilot.co/en/latest/reference/volumes.html).
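
The mounted `/storage` path is where the experiment config below points its
`cluster.fileroot` (`/storage/experiments`), so checkpoints and logs written there
should end up in the bucket. As a rough sketch, assuming the example
`gs://areal-default` bucket and an installed Google Cloud SDK, you can browse the
outputs directly:

```bash
# List experiment outputs written through the /storage mount.
# Bucket name and subpath follow the example above; adjust to your setup.
gsutil ls gs://areal-default/experiments/
```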
Next, prepare the commands used to set up the Ray cluster and run the experiment:

```yaml
envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

run: |
  # Get the head node's IP and total number of nodes (environment variables injected by SkyPilot).
  head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  num_nodes=$(echo "$SKYPILOT_NODE_IPS" | wc -l)

  if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
    echo "Starting Ray head node..."
    ray start --head --port=6379
    while [ $(ray status | grep node_ | wc -l) -lt $num_nodes ]; do
      echo "Waiting for all nodes to join... Current nodes: $(ray status | grep node_ | wc -l) / $num_nodes"
      sleep 5
    done
    echo "Executing training script on head node..."
    python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
      --config examples/skypilot/gsm8k_grpo_ray.yaml \
      experiment_name=$EXPERIMENT_NAME \
      trial_name=$TRIAL_NAME
  else
    sleep 10
    echo "Starting Ray worker node..."
    ray start --address $head_ip:6379
    sleep 5
  fi
  echo "Node setup complete for rank $SKYPILOT_NODE_RANK."
```

### Launch the Ray Cluster and AReaL

You are then ready to launch AReaL from the command line:

```bash
sky launch -c areal-test examples/skypilot/ray_cluster.yaml
```
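
Once the cluster is up, a rough way to confirm that both nodes joined the Ray cluster
is to run `ray status` through SkyPilot. Unlike the single-node example, the resources
above do not set `autostop`, so remember to tear the cluster down when you are done:

```bash
# Check that both nodes are registered with Ray (queued as a small job on the cluster).
sky exec areal-test 'ray status'

# Terminate the cluster once the experiment is finished.
sky down areal-test
```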

You should see AReaL running and producing training logs in your terminal.

Successfully launched 2 nodes on GCP and deployed a Ray cluster:

<img align="center" alt="Launching Ray Cluster" src="ray_launch.png" width="100%">

Successfully ran a training step:

<img align="center" alt="Running a train step" src="train_step_success.png" width="100%">

### Running AReaL with the SkyPilot Launcher

AReaL plans to support a native SkyPilot launcher built on the
[SkyPilot Python SDK](https://docs.skypilot.co/en/latest/reference/api.html); this
launcher is currently under development.

@@ -0,0 +1,153 @@
experiment_name: gsm8k-grpo-on-ray
trial_name: trial0

seed: 1
total_train_epochs: 10
tokenizer_path: ${actor.path}
async_training: true

cluster:
  n_nodes: 2
  n_gpus_per_node: 1
  fileroot: /storage/experiments
  name_resolve:
    type: ray
    ray_actor_name: ray_kv_store
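
# sglang.d1+d1: 1 GPU serves rollouts with SGLang, 1 GPU runs the FSDP trainer
# (matches the 2 nodes x 1 GPU requested in the SkyPilot YAML).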
allocation_mode: sglang.d1+d1

rollout:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  max_concurrent_rollouts: 256
  queue_size: null
  consumer_batch_size: ${train_dataset.batch_size}
  max_head_offpolicyness: 2
  enable_rollout_tracing: false

gconfig:
  n_samples: 4
  min_new_tokens: 0
  max_new_tokens: 1024
  greedy: false
  temperature: 1.0

actor:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  path: Qwen/Qwen2.5-1.5B-Instruct
  init_from_scratch: false
  disable_dropout: true
  gradient_checkpointing: false
  dtype: bfloat16
  mb_spec:
    max_tokens_per_mb: 4096
  optimizer:
    type: adam
    lr: 1.70e-5
    weight_decay: 0.017
    beta1: 0.9
    beta2: 0.999
    eps: 1e-8
    lr_scheduler_type: constant
    gradient_clipping: 1.0
    warmup_steps_proportion: 0.001
  backend: fsdp
  group_size: ${gconfig.n_samples}
  eps_clip: 0.4
  temperature: ${gconfig.temperature}
  reward_scaling: 10.0
  reward_bias: -0.5
  kl_ctl: 0.0
  ppo_n_minibatches: 1
  recompute_logprob: true
  use_decoupled_loss: true
  behav_imp_weight_cap: 5.0
  dynamic_sampling: false
  reward_norm:
    mean_level: group
    std_level: group
    group_size: ${gconfig.n_samples}
  adv_norm:
    mean_level: batch
    std_level: batch
  max_new_tokens: ${gconfig.max_new_tokens}

ref:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  path: ${actor.path}
  init_from_scratch: false
  disable_dropout: true
  dtype: ${actor.dtype}
  mb_spec:
    max_tokens_per_mb: 10240
  optimizer: null
  backend: fsdp

# SGLang
sglang:
  model_path: ${actor.path}
  random_seed: ${seed}
  skip_tokenizer_init: true
  dtype: ${actor.dtype}
  max_running_requests: null
  context_length: 32768
  mem_fraction_static: 0.8

# datasets
train_dataset:
  batch_size: 4
  shuffle: true
  pin_memory: true
  num_workers: 4
  path: openai/gsm8k
  type: rl
  max_length: 1024

valid_dataset:
  batch_size: 4
  shuffle: true
  pin_memory: true
  num_workers: 4
  path: openai/gsm8k
  type: rl

# Utilities
saver:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: null

recover:
  mode: disabled
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: 3600

evaluator:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: null

stats_logger:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  wandb:
    mode: disabled

launcher:
  inference_server_cpus_per_gpu: 4
  inference_server_mem_per_gpu: 32768
  trainer_cpus_per_gpu: 4
  trainer_mem_per_gpu: 32768

@@ -0,0 +1,29 @@
name: areal-test-skypilot

resources:
  infra: gcp
  accelerators: A100:2
  autostop:
    idle_minutes: 10
    down: true
  cpus: 8+
  memory: 32GB+
  disk_size: 256GB
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4

num_nodes: 1

file_mounts:
  /storage: gs://areal-default

workdir: .

run: |
  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=gsm8k-grpo \
    trial_name=trial0 \
    cluster.n_gpus_per_node=2 \
    allocation_mode=sglang.d1+d1 \
    train_dataset.batch_size=4 \
    actor.mb_spec.max_tokens_per_mb=4096
I'd add a link to SkyPilot docs + mention that it supports 17+ clouds
Of course!