[Feature] Add SkyPilot examples #422

@@ -0,0 +1,158 @@
# Running AReaL with SkyPilot

This README includes examples and guidelines for running AReaL experiments with
SkyPilot. Make sure you have SkyPilot properly installed by following
[our installation guide](../../docs/tutorial/installation.md#optional-install-skypilot)
before running these examples. Note that all command lines shown in this file are
assumed to be executed from the root of the AReaL repository.
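
Before launching anything, it can help to confirm that SkyPilot can actually reach
your cloud. A minimal sketch, assuming GCP as the target cloud and a pip-based install
(the installation guide above may recommend a different method):

```bash
# Install SkyPilot with the GCP extra (adjust the extra to your cloud provider).
pip install -U "skypilot[gcp]"

# Verify that cloud credentials are set up and GCP is enabled for SkyPilot.
sky check gcp
```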

## Running a Single-Node Experiment

To run a single-node experiment, you only need to set up a node with SkyPilot and
launch the experiment with the AReaL local launcher. [The following file](local.yaml)
shows a SkyPilot YAML that launches a simple GSM8K GRPO experiment in a single command
line. This example runs on GCP, but can easily be migrated to another cloud or a
Kubernetes cluster by changing the `resources.infra` field in the SkyPilot YAML file.

```yaml
name: areal-test-skypilot

resources:
  infra: gcp
  accelerators: A100:2
  autostop:
    idle_minutes: 10
    down: true
  cpus: 8+
  memory: 32GB+
  disk_size: 256GB
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4

num_nodes: 1

workdir: .

envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

run: |
  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=$EXPERIMENT_NAME \
    trial_name=$TRIAL_NAME \
    cluster.n_gpus_per_node=2 \
    allocation_mode=sglang.d1+d1 \
    train_dataset.batch_size=4 \
    actor.mb_spec.max_tokens_per_mb=4096
```

To run the experiment, execute:

```bash
sky launch -c areal-test examples/skypilot/local.yaml
```
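
While the job runs, you can monitor it with standard SkyPilot CLI commands. Note that
this YAML sets `autostop` with `down: true`, so the cluster tears itself down after 10
idle minutes:

```bash
# Show cluster state, including the autostop countdown.
sky status

# Stream the logs of the launched training job.
sky logs areal-test
```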

## Running a Multi-Node Experiment

### Running AReaL with the Ray Launcher

The following example shows how to set up a Ray cluster with SkyPilot and then use
AReaL to run GRPO on the GSM8K dataset on 2 nodes, each with 1 A100 GPU. This example
runs on GCP, but can easily be migrated to another cloud or a Kubernetes cluster by
changing the `resources.infra` field in the SkyPilot YAML file.
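
Before editing `resources.infra`, you can check which clouds or Kubernetes contexts
SkyPilot is currently able to use in your environment:

```bash
# List the infrastructures with working credentials in this environment.
sky check
```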

Specify the resources and image used to run the experiment:

```yaml
resources:
  infra: gcp
  accelerators: A100:1
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4
  memory: 256+
  cpus: 32+

num_nodes: 2

workdir: .
```
Designate shared storage. You can either use an existing cloud bucket or volume:

```yaml
file_mounts:
  /storage: gs://areal-default
```

or create a new bucket or volume with SkyPilot:

```yaml
file_mounts:
  /storage:
    name: areal-test
    store: gcs
```

For more information about shared storage with SkyPilot, see
[SkyPilot Cloud Buckets](https://docs.skypilot.co/en/latest/reference/storage.html) and
[SkyPilot Volumes](https://docs.skypilot.co/en/latest/reference/volumes.html).
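
The mounted `/storage` path is where the experiment config below points its
`cluster.fileroot` (`/storage/experiments`), so checkpoints and logs written there
should end up in the bucket. As a rough sketch, assuming the example
`gs://areal-default` bucket and an installed Google Cloud SDK, you can browse the
outputs directly:

```bash
# List experiment outputs written through the /storage mount.
# Bucket name and subpath follow the example above; adjust to your setup.
gsutil ls gs://areal-default/experiments/
```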
Next, prepare the commands used to set up the Ray cluster and run the experiment:

```yaml
envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

run: |
  # Get the head node's IP and total number of nodes (environment variables injected by SkyPilot).
  head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  num_nodes=$(echo "$SKYPILOT_NODE_IPS" | wc -l)

  if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
    echo "Starting Ray head node..."
    ray start --head --port=6379
    while [ $(ray status | grep node_ | wc -l) -lt $num_nodes ]; do
      echo "Waiting for all nodes to join... Current nodes: $(ray status | grep node_ | wc -l) / $num_nodes"
      sleep 5
    done
    echo "Executing training script on head node..."
    python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
      --config examples/skypilot/gsm8k_grpo_ray.yaml \
      experiment_name=$EXPERIMENT_NAME \
      trial_name=$TRIAL_NAME
  else
    sleep 10
    echo "Starting Ray worker node..."
    ray start --address $head_ip:6379
    sleep 5
  fi
  echo "Node setup complete for rank $SKYPILOT_NODE_RANK."
```

### Launch the Ray Cluster and AReaL

You are then ready to launch AReaL from the command line:

```bash
sky launch -c areal-test examples/skypilot/ray_cluster.yaml
```
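
Once the cluster is up, a rough way to confirm that both nodes joined the Ray cluster
is to run `ray status` through SkyPilot. Unlike the single-node example, the resources
above do not set `autostop`, so remember to tear the cluster down when you are done:

```bash
# Check that both nodes are registered with Ray (queued as a small job on the cluster).
sky exec areal-test 'ray status'

# Terminate the cluster once the experiment is finished.
sky down areal-test
```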

You should see AReaL running and producing training logs in your terminal.

Successfully launched 2 nodes on GCP and deployed a Ray cluster:

<img align="center" alt="Launching Ray Cluster" src="ray_launch.png" width="100%">

Successfully ran a training step:

<img align="center" alt="Running a train step" src="train_step_success.png" width="100%">

### Running AReaL with the SkyPilot Launcher

AReaL plans to support a native SkyPilot launcher built on the
[SkyPilot Python SDK](https://docs.skypilot.co/en/latest/reference/api.html); this
launcher is currently under development.

@@ -0,0 +1,153 @@
experiment_name: gsm8k-grpo-on-ray
trial_name: trial0

seed: 1
total_train_epochs: 10
tokenizer_path: ${actor.path}
async_training: true

cluster:
  n_nodes: 2
  n_gpus_per_node: 1
  fileroot: /storage/experiments
  name_resolve:
    type: ray
    ray_actor_name: ray_kv_store
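
# sglang.d1+d1: 1 GPU serves rollouts with SGLang, 1 GPU runs the FSDP trainer
# (matches the 2 nodes x 1 GPU requested in the SkyPilot YAML).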
allocation_mode: sglang.d1+d1

rollout:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  max_concurrent_rollouts: 256
  queue_size: null
  consumer_batch_size: ${train_dataset.batch_size}
  max_head_offpolicyness: 2
  enable_rollout_tracing: false

gconfig:
  n_samples: 4
  min_new_tokens: 0
  max_new_tokens: 1024
  greedy: false
  temperature: 1.0

actor:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  path: Qwen/Qwen2.5-1.5B-Instruct
  init_from_scratch: false
  disable_dropout: true
  gradient_checkpointing: false
  dtype: bfloat16
  mb_spec:
    max_tokens_per_mb: 4096
  optimizer:
    type: adam
    lr: 1.70e-5
    weight_decay: 0.017
    beta1: 0.9
    beta2: 0.999
    eps: 1e-8
    lr_scheduler_type: constant
    gradient_clipping: 1.0
    warmup_steps_proportion: 0.001
  backend: fsdp
  group_size: ${gconfig.n_samples}
  eps_clip: 0.4
  temperature: ${gconfig.temperature}
  reward_scaling: 10.0
  reward_bias: -0.5
  kl_ctl: 0.0
  ppo_n_minibatches: 1
  recompute_logprob: true
  use_decoupled_loss: true
  behav_imp_weight_cap: 5.0
  dynamic_sampling: false
  reward_norm:
    mean_level: group
    std_level: group
    group_size: ${gconfig.n_samples}
  adv_norm:
    mean_level: batch
    std_level: batch
  max_new_tokens: ${gconfig.max_new_tokens}

ref:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  path: ${actor.path}
  init_from_scratch: false
  disable_dropout: true
  dtype: ${actor.dtype}
  mb_spec:
    max_tokens_per_mb: 10240
  optimizer: null
  backend: fsdp

# SGLang
sglang:
  model_path: ${actor.path}
  random_seed: ${seed}
  skip_tokenizer_init: true
  dtype: ${actor.dtype}
  max_running_requests: null
  context_length: 32768
  mem_fraction_static: 0.8

# datasets
train_dataset:
  batch_size: 4
  shuffle: true
  pin_memory: true
  num_workers: 4
  path: openai/gsm8k
  type: rl
  max_length: 1024

valid_dataset:
  batch_size: 4
  shuffle: true
  pin_memory: true
  num_workers: 4
  path: openai/gsm8k
  type: rl

# Utilities
saver:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: null

recover:
  mode: disabled
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: 3600

evaluator:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: null

stats_logger:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  wandb:
    mode: disabled

launcher:
  inference_server_cpus_per_gpu: 4
  inference_server_mem_per_gpu: 32768
  trainer_cpus_per_gpu: 4
  trainer_mem_per_gpu: 32768

@@ -0,0 +1,29 @@
name: areal-test-skypilot

resources:
  infra: gcp
  accelerators: A100:2
  autostop:
    idle_minutes: 10
    down: true
  cpus: 8+
  memory: 32GB+
  disk_size: 256GB
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4

num_nodes: 1

file_mounts:
  /storage: gs://areal-default

workdir: .

run: |
  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=gsm8k-grpo \
    trial_name=trial0 \
    cluster.n_gpus_per_node=2 \
    allocation_mode=sglang.d1+d1 \
    train_dataset.batch_size=4 \
    actor.mb_spec.max_tokens_per_mb=4096
I'd add a link to SkyPilot docs + mention that it supports 17+ clouds
Of course!