Commit 087b32a

[Feature] Add SkyPilot examples (#422)
* add skypilot examples
* split PR, add log screenshot
1 parent 2dddc28 commit 087b32a

File tree

8 files changed: +500 -0 lines changed

docs/tutorial/installation.md

Lines changed: 52 additions & 0 deletions
@@ -90,6 +90,58 @@ python3 examples/env/validate_installation.py
After installation validation passed, you are good to go!

(install-skypilot)=

## (Optional) Install SkyPilot

SkyPilot helps you run AReaL easily on 17+ clouds or on your own Kubernetes
infrastructure. For more details about SkyPilot, check the
[SkyPilot Documentation](https://docs.skypilot.co/en/latest/overview.html). The steps
below are the minimal setup for SkyPilot on GCP or Kubernetes.

### Install SkyPilot with pip

```bash
# In your conda environment
# NOTE: SkyPilot requires 3.7 <= python <= 3.13
pip install -U "skypilot[gcp,kubernetes]"
```

### GCP setup

```bash
# Install Google Cloud SDK
conda install -y -c conda-forge google-cloud-sdk

# Initialize gcloud and select your account/project
gcloud init

# (Optional) choose a project explicitly
gcloud config set project <PROJECT_ID>

# Create Application Default Credentials
gcloud auth application-default login
```

### Kubernetes setup

Check
[here](https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html)
for a comprehensive guide on how to set up a Kubernetes cluster for SkyPilot.

### Verify

```bash
sky check
```

If `GCP: enabled` or `Kubernetes: enabled` is shown, you're ready to use SkyPilot with
AReaL. Check
[here](https://github.com/inclusionAI/AReaL/blob/main/examples/skypilot/README.md) for a
detailed example of running AReaL with SkyPilot. For more options and details, see the
official
[SkyPilot installation guide](https://docs.skypilot.co/en/latest/getting-started/installation.html).
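
If you only set up one of the backends, you can narrow the check to it (assuming your
SkyPilot version accepts infra names as arguments to `sky check`):

```bash
# Check only the infrastructures you configured
sky check gcp
sky check kubernetes
```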

## (Optional) Launch Ray Cluster for Distributed Training

On the first node, start the Ray Head:

docs/tutorial/quickstart.md

Lines changed: 23 additions & 0 deletions
@@ -111,6 +111,29 @@ Additional references:
> **Note**: Ray and Slurm launchers only work for distributed experiments with more than 1 node (`cluster.n_nodes > 1`). They allocate GPUs for training and generation at the granularity of **nodes**, which means the number of GPUs allocated for generation and training must be integer multiples of `cluster.n_gpus_per_node`.
-->

## Distributed Experiments on Cloud or K8s with SkyPilot

If you want to run an experiment directly on the cloud or on your own Kubernetes
infrastructure, we recommend using SkyPilot. After installing and setting up SkyPilot
(see [Install SkyPilot](installation.md#install-skypilot)), you can launch a distributed
experiment based on our SkyPilot example (two 8xA100 GPU nodes) with a single command:

```bash
# Launch on GCP
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra gcp
# Launch on AWS
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra aws
# Launch on your K8s cluster
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra k8s
```

Check
[Running AReaL with SkyPilot](https://github.com/inclusionAI/AReaL/blob/main/examples/skypilot/README.md)
for more details about the examples, and the
[SkyPilot Documentation](https://docs.skypilot.co/en/latest/docs/index.html) for more
information about SkyPilot.
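
After the job is submitted, the standard SkyPilot CLI can be used to monitor and clean
up the cluster (`areal-test` is the cluster name chosen above), for example:

```bash
# List your SkyPilot clusters and their status
sky status
# Stream the logs of the latest job on the cluster
sky logs areal-test
# Tear down the cluster when you are done
sky down areal-test
```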

(switching-from-legacy-areal-to-areal-lite)=

## Switching from legacy AReaL to AReaL-lite

examples/skypilot/README.md

Lines changed: 196 additions & 0 deletions
@@ -0,0 +1,196 @@
# Running AReaL with SkyPilot

This README includes examples and guidelines for running AReaL experiments with
SkyPilot. Make sure you have SkyPilot properly installed following
[our installation guide](../../docs/tutorial/installation.md#optional-install-skypilot)
before running this example. Note that all commands shown in this file are assumed to be
executed from the root of the AReaL repository.

## Running a Single Node Experiment

To run a single node experiment, you only need to set up the node with SkyPilot and
launch the experiment with the AReaL local launcher.
[The following file](single_node.sky.yaml) shows a SkyPilot YAML that launches a simple
GSM8K GRPO experiment with a single command. This example is tested on both GCP and a
K8s cluster.

```yaml
name: areal-test-skypilot

resources:
  accelerators: A100:2
  autostop:
    idle_minutes: 10
    down: true
  cpus: 8+
  memory: 32GB+
  disk_size: 256GB
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4

num_nodes: 1

file_mounts:
  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
    source: s3://my-bucket/ # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>, oci://<bucket_name>
    mode: MOUNT # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.

workdir: .

run: |
  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=gsm8k-grpo \
    trial_name=trial0 \
    cluster.n_nodes=1 \
    cluster.n_gpus_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    allocation_mode=sglang.d1+d1 \
    train_dataset.batch_size=4 \
    actor.mb_spec.max_tokens_per_mb=4096
```

To run the experiment, execute:

```bash
sky launch -c areal-test examples/skypilot/single_node.sky.yaml
```

To designate the cloud or infrastructure you wish to run your experiment on, add
`--infra xxx`. For example:

```bash
sky launch -c areal-test examples/skypilot/single_node.sky.yaml --infra gcp
sky launch -c areal-test examples/skypilot/single_node.sky.yaml --infra aws
sky launch -c areal-test examples/skypilot/single_node.sky.yaml --infra k8s
```
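
The `autostop` block in the YAML above tears the cluster down after 10 idle minutes. If
you want to clean up earlier, the usual SkyPilot commands should work:

```bash
# Stop the cluster VMs but keep their disks (cloud VMs only)
sky stop areal-test
# Tear the cluster down entirely
sky down areal-test
```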

## Running a Multi-Node Experiment

### Running AReaL with Ray Launcher

The following example shows how to set up a Ray cluster with SkyPilot and then use AReaL
to run GRPO on the GSM8K dataset on 2 nodes, each with 8 A100 GPUs. This example is
tested on GCP and a K8s cluster.

Specify the resources and image used to run the experiment:

```yaml
resources:
  accelerators: A100:8
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4
  memory: 256+
  cpus: 32+

num_nodes: 2

workdir: .
```

Designate shared storage. You can either use an existing cloud bucket or volume:

```yaml
file_mounts:
  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
    source: s3://my-bucket/ # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>, oci://<bucket_name>
    mode: MOUNT # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.
```

or create a new bucket or volume with SkyPilot:

```yaml
# Create an empty GCS bucket
file_mounts:
  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
    name: my-sky-bucket
    store: gcs # Optional: either of s3, gcs, azure, r2, ibm, oci
```

For more information about shared storage with SkyPilot, check
[SkyPilot Cloud Buckets](https://docs.skypilot.co/en/latest/reference/storage.html) and
[SkyPilot Volumes](https://docs.skypilot.co/en/latest/reference/volumes.html).
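
If you let SkyPilot create the bucket (the `name: my-sky-bucket` variant above), you can
also inspect and remove it from the CLI, assuming the `sky storage` subcommands available
in recent SkyPilot releases:

```bash
# List buckets managed by SkyPilot
sky storage ls
# Delete the example bucket when it is no longer needed
sky storage delete my-sky-bucket
```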

Next, prepare the commands used to set up the Ray cluster and run the experiment:

```yaml
envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

run: |
  # Get the head node's IP and the total number of nodes (environment variables injected by SkyPilot).
  head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)

  if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
    echo "Starting Ray head node..."
    ray start --head --port=6379

    while [ $(ray status | grep node_ | wc -l) -lt $SKYPILOT_NUM_NODES ]; do
      echo "Waiting for all nodes to join... Current nodes: $(ray status | grep node_ | wc -l) / $SKYPILOT_NUM_NODES"
      sleep 5
    done

    echo "Executing training script on head node..."
    python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
      --config examples/skypilot/gsm8k_grpo_ray.yaml \
      experiment_name=gsm8k-grpo \
      trial_name=trial0 \
      cluster.n_nodes=$SKYPILOT_NUM_NODES \
      cluster.n_gpus_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
      allocation_mode=sglang.d8+d8
  else
    sleep 10
    echo "Starting Ray worker node..."
    ray start --address $head_ip:6379
    sleep 5
  fi

  echo "Node setup complete for rank $SKYPILOT_NODE_RANK."
```
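
The `envs` block above is a convenient place to parameterize the experiment. Assuming you
reference `$EXPERIMENT_NAME` and `$TRIAL_NAME` inside the `run` script (the snippet above
hard-codes `gsm8k-grpo` and `trial0`), the values can be overridden at launch time with
SkyPilot's `--env` flag:

```bash
# Override the env values defined in the YAML at launch time
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml \
  --env EXPERIMENT_NAME=gsm8k-grpo \
  --env TRIAL_NAME=trial1
```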

**Note**: If you are running on a Kubernetes cluster whose nodes are connected via
InfiniBand, you might need to add the following config field to the example YAML file
for the experiment to run:

```yaml
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - securityContext:
              capabilities:
                add:
                  - IPC_LOCK
```

### Launch the Ray Cluster and Run AReaL Experiment

Then you are ready to run AReaL from the command line:

```bash
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml
```

To designate the cloud or infrastructure you wish to run your experiment on, add
`--infra xxx`. For example:

```bash
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra gcp
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra aws
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra k8s
```

You should be able to see your AReaL experiment running and producing training logs in
your terminal.
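
`sky launch` keeps streaming the logs; detaching (for example with `Ctrl-C`, which
normally only stops the log streaming, not the job) lets you check on the run later:

```bash
# Inspect the job queue on the cluster
sky queue areal-test
# Re-attach to the logs of the most recent job
sky logs areal-test
```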

Successfully launched 2 nodes on GCP and deployed a Ray cluster:
<img align="center" alt="Launching Ray Cluster" src="ray_launch.png" width="100%">

Successfully ran a training step:
<img align="center" alt="Running a train step" src="train_step_success.png" width="100%">

### Running AReaL with SkyPilot Launcher

AReaL plans to support a native SkyPilot launcher built on the
[SkyPilot Python SDK](https://docs.skypilot.co/en/latest/reference/api.html); this
launcher is currently under development.
