# Running AReaL with SkyPilot

This README includes examples and guidelines for running AReaL experiments with SkyPilot.
Make sure you have SkyPilot properly installed following
[our installation guide](../../docs/tutorial/installation.md#optional-install-skypilot)
before running these examples. Note that all commands shown in this file are assumed to
be executed from the root of the AReaL repository.
## Running a Single-Node Experiment

To run a single-node experiment, you only need to set up the node with SkyPilot and
launch the experiment with the AReaL local launcher.
[The following file](single_node.sky.yaml) shows a SkyPilot YAML that launches a
simple GSM8K GRPO experiment with a single command. This example has been tested on both
GCP and a K8s cluster.

```yaml
name: areal-test-skypilot

resources:
  accelerators: A100:2
  autostop:
    idle_minutes: 10
    down: true
  cpus: 8+
  memory: 32GB+
  disk_size: 256GB
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4

num_nodes: 1

file_mounts:
  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
    source: s3://my-bucket/ # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>, oci://<bucket_name>
    mode: MOUNT # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.

workdir: .

run: |
  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=gsm8k-grpo \
    trial_name=trial0 \
    cluster.n_nodes=1 \
    cluster.n_gpus_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    allocation_mode=sglang.d1+d1 \
    train_dataset.batch_size=4 \
    actor.mb_spec.max_tokens_per_mb=4096
```
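
In the config above, `allocation_mode=sglang.d1+d1` places one of the two GPUs on SGLang inference and one on training, matching `accelerators: A100:2`. As a rough illustration of that arithmetic (the variable names below are hypothetical, not AReaL or SkyPilot settings), an even split can be derived from the GPU count like this:

```shell
# Illustration only: derive an even inference/training split from the GPU count.
# NUM_GPUS, INFER, and TRAIN are made-up names for this sketch.
NUM_GPUS=2
INFER=$((NUM_GPUS / 2))
TRAIN=$((NUM_GPUS - INFER))
echo "allocation_mode=sglang.d${INFER}+d${TRAIN}"   # prints allocation_mode=sglang.d1+d1
```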

To run the experiment, execute:

```bash
sky launch -c areal-test examples/skypilot/single_node.sky.yaml
```

You can designate the cloud or infrastructure to run your experiment on by adding
`--infra xxx`. For example:

```bash
sky launch -c areal-test examples/skypilot/single_node.sky.yaml --infra gcp
sky launch -c areal-test examples/skypilot/single_node.sky.yaml --infra aws
sky launch -c areal-test examples/skypilot/single_node.sky.yaml --infra k8s
```

## Running a Multi-Node Experiment

### Running AReaL with the Ray Launcher

The following example shows how to set up a Ray cluster with SkyPilot and then use AReaL
to run GRPO on the GSM8K dataset with 2 nodes, each with 8 A100 GPUs. This example has
been tested on GCP and a K8s cluster.

First, specify the resources and image used to run the experiment.
```yaml
resources:
  accelerators: A100:8
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4
  memory: 256+
  cpus: 32+

num_nodes: 2

workdir: .
```

Next, designate shared storage. You can either use an existing cloud bucket or volume:

```yaml
file_mounts:
  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
    source: s3://my-bucket/ # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>, oci://<bucket_name>
    mode: MOUNT # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.
```

or create a new bucket or volume with SkyPilot:

```yaml
# Create an empty GCS bucket
file_mounts:
  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
    name: my-sky-bucket
    store: gcs # Optional: any of s3, gcs, azure, r2, ibm, oci
```

For more information about shared storage with SkyPilot, see
[SkyPilot Cloud Buckets](https://docs.skypilot.co/en/latest/reference/storage.html) and
[SkyPilot Volumes](https://docs.skypilot.co/en/latest/reference/volumes.html).

Next, prepare the commands used to set up the Ray cluster and run the experiment.

```yaml
envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

run: |
  # Get the head node's IP from the node list injected by SkyPilot.
  head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)

  if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
    echo "Starting Ray head node..."
    ray start --head --port=6379

    # Wait until every node has joined the Ray cluster.
    while [ $(ray status | grep node_ | wc -l) -lt $SKYPILOT_NUM_NODES ]; do
      echo "Waiting for all nodes to join... Current nodes: $(ray status | grep node_ | wc -l) / $SKYPILOT_NUM_NODES"
      sleep 5
    done

    echo "Executing training script on head node..."
    python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
      --config examples/skypilot/gsm8k_grpo_ray.yaml \
      experiment_name=$EXPERIMENT_NAME \
      trial_name=$TRIAL_NAME \
      cluster.n_nodes=$SKYPILOT_NUM_NODES \
      cluster.n_gpus_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
      allocation_mode=sglang.d8+d8
  else
    sleep 10
    echo "Starting Ray worker node..."
    ray start --address $head_ip:6379
    sleep 5
  fi

  echo "Node setup complete for rank $SKYPILOT_NODE_RANK."
```
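
The head-node discovery in the `run` block above relies on `SKYPILOT_NODE_IPS` containing one IP address per line, with the head node listed first. A minimal standalone sketch of that extraction, using made-up IPs:

```shell
# Simulated SKYPILOT_NODE_IPS value (one IP address per line, head node first).
# The addresses are made up for illustration.
SKYPILOT_NODE_IPS="10.0.0.1
10.0.0.2"

# Same extraction as in the run block: take the first line as the head node's IP.
head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "$head_ip"   # prints 10.0.0.1
```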

**Note**: If you are running on a cluster whose nodes are connected via InfiniBand, you
might need to add an additional config field to the example YAML file for the experiment
to run:

```yaml
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - securityContext:
              capabilities:
                add:
                  - IPC_LOCK
```

### Launch the Ray Cluster and Run the AReaL Experiment

You are then ready to run AReaL from the command line:

```bash
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml
```

As before, you can designate the cloud or infrastructure by adding `--infra xxx`. For
example:

```bash
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra gcp
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra aws
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra k8s
```

You should see AReaL running and producing training logs in your terminal.

Successfully launching 2 nodes on GCP and deploying a Ray cluster:
<img align="center" alt="Launching Ray Cluster" src="ray_launch.png" width="100%">

Successfully running a training step:
<img align="center" alt="Running a train step" src="train_step_success.png" width="100%">

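Once the run completes (or if you want to stop paying for the cluster), you can manage it with SkyPilot's standard CLI:

```shell
# Stream logs from the most recent job on the cluster.
sky logs areal-test

# Check the status of your SkyPilot clusters.
sky status

# Tear down the cluster when you are done.
sky down areal-test
```

Note that the single-node example enables `autostop`, so that cluster also shuts itself down after 10 idle minutes; the manual `sky down` covers the Ray-cluster case.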
### Running AReaL with the SkyPilot Launcher

AReaL plans to support a SkyPilot-native launcher built on the
[SkyPilot Python SDK](https://docs.skypilot.co/en/latest/reference/api.html), which is
currently under development.