52 changes: 52 additions & 0 deletions docs/tutorial/installation.md
@@ -90,6 +90,58 @@ python3 examples/env/validate_installation.py

After installation validation passed, you are good to go!

(install-skypilot)=

## (Optional) Install SkyPilot

SkyPilot helps you run AReaL on 17+ clouds or on your own Kubernetes
infrastructure. For more details about SkyPilot, see the
[SkyPilot Documentation](https://docs.skypilot.co/en/latest/overview.html). The steps
below are the minimum needed to set up SkyPilot on GCP or Kubernetes.

### Install SkyPilot by pip

```bash
# In your conda environment
# NOTE: SkyPilot requires 3.7 <= python <= 3.13
pip install -U "skypilot[gcp,kubernetes]"
```
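
Before installing, it can help to confirm that the interpreter in your environment falls inside the version range noted above. The following is a minimal sketch, not part of AReaL or SkyPilot: `skypilot_python_ok` is a hypothetical helper name, and the bounds are simply the ones stated in the note (3.7 through 3.13).

```bash
# Hypothetical helper: succeed if a "major.minor[.patch]" version string lies
# in the range stated in the note above (3.7 <= python <= 3.13).
skypilot_python_ok() {
    version=$1
    major=${version%%.*}   # text before the first dot
    rest=${version#*.}     # text after the first dot
    minor=${rest%%.*}      # minor version, with any patch level dropped
    [ "$major" -eq 3 ] && [ "$minor" -ge 7 ] && [ "$minor" -le 13 ]
}

# Usage: check the interpreter on PATH, if there is one.
if command -v python3 >/dev/null 2>&1; then
    current=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    if skypilot_python_ok "$current"; then
        echo "Python $current is within the range noted above"
    else
        echo "Python $current is outside the range noted above" >&2
    fi
fi
```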

### GCP setup

```bash
# Install Google Cloud SDK
conda install -y -c conda-forge google-cloud-sdk

# Initialize gcloud and select your account/project
gcloud init

# (Optional) choose a project explicitly
gcloud config set project <PROJECT_ID>

# Create Application Default Credentials
gcloud auth application-default login
```

### Kubernetes setup

See the
[SkyPilot Kubernetes setup guide](https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html)
for comprehensive instructions on setting up a Kubernetes cluster for SkyPilot.

### Verify

```bash
sky check
```

If `GCP: enabled` or `Kubernetes: enabled` is shown, you're ready to use SkyPilot with
AReaL. See the
[AReaL SkyPilot example](https://github.com/inclusionAI/AReaL/blob/main/examples/skypilot/README.md)
for a detailed walkthrough of running AReaL with SkyPilot. For more SkyPilot options and
details, see the official
[SkyPilot installation guide](https://docs.skypilot.co/en/latest/getting-started/installation.html).
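
The verification step can also be scripted. The sketch below is an illustration only: `has_areal_infra` is a hypothetical helper, and it assumes the output of `sky check` contains the literal strings `GCP: enabled` or `Kubernetes: enabled` quoted in the paragraph above. The exact output layout may differ between SkyPilot versions.

```bash
# Hypothetical helper: read `sky check` output on stdin and succeed if GCP or
# Kubernetes is reported as enabled.
# ASSUMPTION: the output contains "GCP: enabled" / "Kubernetes: enabled" verbatim.
has_areal_infra() {
    grep -Eq '(GCP|Kubernetes): enabled'
}

# Usage (not executed here, because it needs SkyPilot installed):
#   sky check 2>&1 | has_areal_infra && echo "SkyPilot is ready for AReaL"
```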

## (Optional) Launch Ray Cluster for Distributed Training

On the first node, start the Ray Head:
23 changes: 23 additions & 0 deletions docs/tutorial/quickstart.md
@@ -111,6 +111,29 @@ Additional references:
> **Note**: Ray and Slurm launchers only work for distributed experiments with more than 1 node (`cluster.n_nodes > 1`). They allocate GPUs for training and generation at the granularity of **nodes**, which means the number of GPUs allocated for generation and training must be integer multiples of `cluster.n_gpus_per_node`.
-->

## Distributed Experiments on Cloud or K8s with SkyPilot

If you want to run an experiment directly on a cloud or on your own Kubernetes
infrastructure, we recommend using SkyPilot. After installing and setting up
SkyPilot (see [Install SkyPilot](installation.md#install-skypilot)), you can launch a
distributed experiment based on our SkyPilot example (two 8xA100 GPU nodes) with a
single command:

```bash
# Launch on GCP
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra gcp
# Launch on AWS
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra aws
# Launch on your K8s Cluster
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra k8s
```

See
[Running AReaL with SkyPilot](https://github.com/inclusionAI/AReaL/blob/main/examples/skypilot/README.md)
for more details about the examples, and the
[SkyPilot Documentation](https://docs.skypilot.co/en/latest/docs/index.html) for more
information about SkyPilot.

(switching-from-legacy-areal-to-areal-lite)=

## Switching from legacy AReaL to AReaL-lite
196 changes: 196 additions & 0 deletions examples/skypilot/README.md
@@ -0,0 +1,196 @@
# Running AReaL with SkyPilot

This README provides examples and guidelines for running AReaL experiments with SkyPilot.
Make sure SkyPilot is properly installed by following
[our installation guide](../../docs/tutorial/installation.md#optional-install-skypilot)
before running these examples. All commands shown in this file are assumed to be
executed from the root of the AReaL repository.

## Running a Single Node Experiment

To run a single-node experiment, you only need to set up the node with SkyPilot and
launch the experiment with the AReaL local launcher.
[The following file](single_node.sky.yaml) shows a SkyPilot YAML that launches a
simple GSM8K GRPO experiment with a single command. This example has been tested on
both GCP and a K8s cluster.

```yaml
name: areal-test-skypilot

resources:
  accelerators: A100:2
  autostop:
    idle_minutes: 10
    down: true
  cpus: 8+
  memory: 32GB+
  disk_size: 256GB
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4

num_nodes: 1

file_mounts:
  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
    source: s3://my-bucket/ # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>, oci://<bucket_name>
    mode: MOUNT # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.

workdir: .

run: |
  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=gsm8k-grpo \
    trial_name=trial0 \
    cluster.n_nodes=1 \
    cluster.n_gpus_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    allocation_mode=sglang.d1+d1 \
    train_dataset.batch_size=4 \
    actor.mb_spec.max_tokens_per_mb=4096
```

To run the experiment, execute:

```bash
sky launch -c areal-test examples/skypilot/single_node.sky.yaml
```

To designate the cloud or infrastructure to run your experiment on, add
`--infra xxx`. For example:

```bash
sky launch -c areal-test examples/skypilot/single_node.sky.yaml --infra gcp
sky launch -c areal-test examples/skypilot/single_node.sky.yaml --infra aws
sky launch -c areal-test examples/skypilot/single_node.sky.yaml --infra k8s
```
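
The `allocation_mode` value has to match the GPUs you request. The single-node file above asks for 2 GPUs and passes `sglang.d1+d1`, while the multi-node example later in this README asks for 8 GPUs on each of 2 nodes and passes `sglang.d8+d8`. The sketch below reproduces that arithmetic. It rests on an assumption inferred only from these two examples: that the `sglang.dN+dN` form gives N GPUs to generation and N to training, splitting the total evenly. `areal_even_allocation` is a hypothetical helper, not an AReaL command.

```bash
# Hypothetical helper: print an allocation_mode string that splits all GPUs
# evenly between the two sides of the "+".
# ASSUMPTION: "sglang.dN+dN" means N GPUs for generation and N for training,
# as inferred from the two examples in this README.
areal_even_allocation() {
    n_nodes=$1
    gpus_per_node=$2
    total=$(( ${n_nodes} * ${gpus_per_node} ))
    if [ "$total" -lt 2 ] || [ $(( total % 2 )) -ne 0 ]; then
        echo "need an even number of GPUs (at least 2), got $total" >&2
        return 1
    fi
    half=$(( total / 2 ))
    echo "sglang.d${half}+d${half}"
}

areal_even_allocation 1 2   # prints sglang.d1+d1 (single-node example)
areal_even_allocation 2 8   # prints sglang.d8+d8 (multi-node example)
```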

## Running a Multi-Node Experiment

### Running AReaL with Ray Launcher

The following example shows how to set up a Ray cluster with SkyPilot and then use AReaL
to run GRPO on the GSM8K dataset on 2 nodes, each with 8 A100 GPUs. This example has been
tested on GCP and a K8s cluster.

Specify the resources and image used to run the experiment.

```yaml
resources:
  accelerators: A100:8
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4
  memory: 256+
  cpus: 32+

num_nodes: 2

workdir: .
```

Designate shared storage. You can either use an existing cloud bucket or volume:

```yaml
file_mounts:
  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
    source: s3://my-bucket/ # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>, oci://<bucket_name>
    mode: MOUNT # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.
```

or create a new bucket or volume with SkyPilot:

```yaml
# Create an empty GCS bucket
file_mounts:
  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
    name: my-sky-bucket
    store: gcs # Optional: either of s3, gcs, azure, r2, ibm, oci
```

For more information about shared storage with SkyPilot, check
[SkyPilot Cloud Buckets](https://docs.skypilot.co/en/latest/reference/storage.html) and
[SkyPilot Volume](https://docs.skypilot.co/en/latest/reference/volumes.html).

Next, prepare the commands used to set up the Ray cluster and run the experiment.

```yaml
envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

run: |
  # Get the head node's IP from the node list that SkyPilot injects as an environment variable.
  head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)

  if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
    echo "Starting Ray head node..."
    ray start --head --port=6379

    # Block until every node has joined the Ray cluster.
    while [ $(ray status | grep node_ | wc -l) -lt $SKYPILOT_NUM_NODES ]; do
      echo "Waiting for all nodes to join... Current nodes: $(ray status | grep node_ | wc -l) / $SKYPILOT_NUM_NODES"
      sleep 5
    done

    echo "Executing training script on head node..."
    python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
      --config examples/skypilot/gsm8k_grpo_ray.yaml \
      experiment_name=gsm8k-grpo \
      trial_name=trial0 \
      cluster.n_nodes=$SKYPILOT_NUM_NODES \
      cluster.n_gpus_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
      allocation_mode=sglang.d8+d8
  else
    # Give the head node time to start before joining it.
    sleep 10
    echo "Starting Ray worker node..."
    ray start --address $head_ip:6379
    sleep 5
  fi

  echo "Node setup complete for rank $SKYPILOT_NODE_RANK."
```
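
The wait loop in the script above polls forever if a worker never joins. A bounded variant is sketched below under stated assumptions: `first_node_ip`, `wait_for_nodes`, and `count_ray_nodes` are hypothetical helper names, and counting nodes by matching `node_` in the output of `ray status` simply mirrors what the script above already does.

```bash
# Hypothetical helper: print the first entry of a newline-separated IP list,
# which the script above treats as the head node's IP.
first_node_ip() {
    printf '%s\n' "$1" | head -n1
}

# Hypothetical helper: poll until a counting command reports at least
# `expected` nodes, or fail once `timeout` seconds have been waited.
#   $1 expected node count   $2 timeout (s)   $3 poll interval (s)
#   $4 name of a command that prints the current node count
wait_for_nodes() {
    expected=$1
    timeout=$2
    interval=$3
    count_cmd=$4
    waited=0
    while [ "$($count_cmd)" -lt "$expected" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out after ${waited}s waiting for $expected nodes" >&2
            return 1
        fi
        sleep "$interval"
        waited=$(( waited + interval ))
    done
}

# Counts nodes the same way as the script above (lines containing "node_").
count_ray_nodes() {
    ray status 2>/dev/null | grep -c node_
}

# Usage inside the `run:` section, on the head node (not executed here):
#   head_ip=$(first_node_ip "$SKYPILOT_NODE_IPS")
#   wait_for_nodes "$SKYPILOT_NUM_NODES" 600 5 count_ray_nodes || exit 1
```

Passing the counting command by name keeps the loop independent of Ray, so you can swap in another readiness check without changing the timeout logic.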

**Note**: If you are running on a cluster whose nodes are connected via InfiniBand,
you might need to add an extra config field to the example YAML file for the experiment
to run:

```yaml
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - securityContext:
              capabilities:
                add:
                  - IPC_LOCK
```

### Launch the Ray Cluster and Run AReaL Experiment

You are then ready to run AReaL from the command line:

```bash
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml
```

To designate the cloud or infrastructure to run your experiment on, add
`--infra xxx`. For example:

```bash
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra gcp
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra aws
sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra k8s
```

You should see AReaL running and producing training logs in your terminal.

Successfully launched 2 nodes on GCP and deployed a ray cluster:
<img align="center" alt="Launching Ray Cluster" src="ray_launch.png" width="100%">

Successfully ran a training step:
<img align="center" alt="Running a train step" src="train_step_success.png" width="100%">
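
Once the run is up, you will usually want to inspect the cluster and tear it down afterwards. The single-node file in this README sets `autostop`, but the multi-node resources shown here do not, so nothing in that snippet stops the cluster for you. The sketch below wraps the standard SkyPilot commands `sky status`, `sky logs`, and `sky down`. The function names and the `SKY_BIN` and `CLUSTER` variables are hypothetical conveniences, with `SKY_BIN` existing only so the commands can be previewed with `echo`.

```bash
# Hypothetical wrappers around the SkyPilot CLI for the cluster used above.
# ASSUMPTION: SKY_BIN and CLUSTER are illustrative variables, not SkyPilot settings.
SKY_BIN=${SKY_BIN:-sky}
CLUSTER=${CLUSTER:-areal-test}

# List clusters and their current state.
areal_sky_status() { "$SKY_BIN" status; }

# Stream the logs of the most recent job on the cluster.
areal_sky_logs() { "$SKY_BIN" logs "$CLUSTER"; }

# Tear the cluster down; SkyPilot asks for confirmation before deleting.
areal_sky_down() { "$SKY_BIN" down "$CLUSTER"; }

# Dry run: set SKY_BIN=echo to print the arguments instead of running them.
#   SKY_BIN=echo; areal_sky_down
```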

### Running AReaL with SkyPilot Launcher

AReaL plans to support a SkyPilot-native launcher built on the
[SkyPilot Python SDK](https://docs.skypilot.co/en/latest/reference/api.html); it is
currently under development.