diff --git a/docs/tutorial/installation.md b/docs/tutorial/installation.md
index 356e8a2a8..63cdcc7c3 100644
--- a/docs/tutorial/installation.md
+++ b/docs/tutorial/installation.md
@@ -90,6 +90,58 @@ python3 examples/env/validate_installation.py
 
 After installation validation passed, you are good to go!
 
+(install-skypilot)=
+
+## (Optional) Install SkyPilot
+
+SkyPilot helps you run AReaL easily on 17+ clouds or your own Kubernetes
+infrastructure. For more details about SkyPilot, check the
+[SkyPilot Documentation](https://docs.skypilot.co/en/latest/overview.html). The
+following are the minimal steps to set up SkyPilot on GCP or Kubernetes.
+
+### Install SkyPilot via pip
+
+```bash
+# In your conda environment
+# NOTE: SkyPilot requires 3.7 <= python <= 3.13
+pip install -U "skypilot[gcp,kubernetes]"
+```
+
+### GCP setup
+
+```bash
+# Install Google Cloud SDK
+conda install -y -c conda-forge google-cloud-sdk
+
+# Initialize gcloud and select your account/project
+gcloud init
+
+# (Optional) choose a project explicitly (replace <project-id> with yours)
+gcloud config set project <project-id>
+
+# Create Application Default Credentials
+gcloud auth application-default login
+```
+
+### Kubernetes setup
+
+Check
+[here](https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html)
+for a comprehensive guide on how to set up a Kubernetes cluster for SkyPilot.
+
+### Verify
+
+```bash
+sky check
+```
+
+If `GCP: enabled` or `Kubernetes: enabled` is shown, you're ready to use SkyPilot with
+AReaL. Check
+[here](https://github.com/inclusionAI/AReaL/blob/main/examples/skypilot/README.md) for a
+detailed example of running AReaL with SkyPilot. For more options and details, see the
+official
+[SkyPilot installation guide](https://docs.skypilot.co/en/latest/getting-started/installation.html).
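Since the pip package only supports 3.7 <= Python <= 3.13 (per the note above), a quick pre-flight check of the interpreter in your environment can save a failed install. A minimal sketch in plain POSIX shell; nothing here is SkyPilot-specific:

```shell
# Query the active interpreter's major.minor version.
pyver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
major=${pyver%%.*}
minor=${pyver#*.}

# SkyPilot's supported range is 3.7 <= Python <= 3.13.
if [ "$major" -eq 3 ] && [ "$minor" -ge 7 ] && [ "$minor" -le 13 ]; then
  echo "Python $pyver: inside SkyPilot's supported range"
else
  echo "Python $pyver: outside SkyPilot's supported range"
fi
```

Run this in the same conda environment you plan to `pip install` into.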
+
 ## (Optional) Launch Ray Cluster for Distributed Training
 
 On the first node, start the Ray Head:
diff --git a/docs/tutorial/quickstart.md b/docs/tutorial/quickstart.md
index 7d963b738..143b92e18 100644
--- a/docs/tutorial/quickstart.md
+++ b/docs/tutorial/quickstart.md
@@ -111,6 +111,29 @@ Additional references:
 > **Note**: Ray and Slurm launchers only work for distributed experiments with more than 1 node (`cluster.n_nodes > 1`). They allocate GPUs for training and generation at the granularity of **nodes**, which means the number of GPUs allocated for generation and training must be integer multiples of `cluster.n_gpus_per_node`. -->
 
+## Distributed Experiments on Cloud or K8s with SkyPilot
+
+If you want to run an experiment directly on a cloud or your own Kubernetes
+infrastructure, we recommend using SkyPilot. After installing and setting up SkyPilot
+(see [Install SkyPilot](installation.md#install-skypilot)), you can launch a
+distributed experiment based on our SkyPilot example (two 8xA100 GPU nodes) with a
+single command:
+
+```bash
+# Launch on GCP
+sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra gcp
+# Launch on AWS
+sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra aws
+# Launch on your K8s Cluster
+sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra k8s
+```
+
+Check
+[Running AReaL with SkyPilot](https://github.com/inclusionAI/AReaL/blob/main/examples/skypilot/README.md)
+for more details about the examples. Check the
+[SkyPilot Documentation](https://docs.skypilot.co/en/latest/docs/index.html) for more
+information about SkyPilot.
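If you always target the same backend, you can also pin it inside the task YAML instead of passing `--infra` on every launch. A sketch, with the caveat that the `infra` field under `resources` is only accepted by recent SkyPilot releases — check the docs for the version you installed before relying on it:

```yaml
resources:
  infra: gcp            # same effect as passing --infra gcp on the command line
  accelerators: A100:8
```

With `infra` pinned, `sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml` needs no extra flags.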
+
 (switching-from-legacy-areal-to-areal-lite)=
 
 ## Switching from legacy AReaL to AReaL-lite
diff --git a/examples/skypilot/README.md b/examples/skypilot/README.md
new file mode 100644
index 000000000..ac5b21126
--- /dev/null
+++ b/examples/skypilot/README.md
@@ -0,0 +1,196 @@
+# Running AReaL with SkyPilot
+
+This README includes examples and guidelines for running AReaL experiments with
+SkyPilot. Make sure you have SkyPilot properly installed following
+[our installation guide](../../docs/tutorial/installation.md#optional-install-skypilot)
+before running these examples. Note that all commands shown in this file are assumed to
+be executed from the root of the AReaL repository.
+
+## Running a Single-Node Experiment
+
+To run a single-node experiment, you only need to set up the node with SkyPilot and
+launch the experiment with the AReaL local launcher.
+[The following file](single_node.sky.yaml) shows a SkyPilot YAML that launches a simple
+GSM8K GRPO experiment with a single command. This example is tested on both GCP and a
+K8s cluster.
+
+```yaml
+name: areal-test-skypilot
+
+resources:
+  accelerators: A100:2
+  autostop:
+    idle_minutes: 10
+    down: true
+  cpus: 8+
+  memory: 32GB+
+  disk_size: 256GB
+  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4
+
+num_nodes: 1
+
+file_mounts:
+  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
+    source: s3://my-bucket/ # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container_name>, r2://, cos://<region>/<bucket_name>, oci://
+    mode: MOUNT # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.
+
+workdir: .
+
+run: |
+  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
+    --config examples/math/gsm8k_grpo.yaml \
+    experiment_name=gsm8k-grpo \
+    trial_name=trial0 \
+    cluster.n_nodes=1 \
+    cluster.n_gpus_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+    allocation_mode=sglang.d1+d1 \
+    train_dataset.batch_size=4 \
+    actor.mb_spec.max_tokens_per_mb=4096
+```
+
+To run the experiment, execute:
+
+```bash
+sky launch -c areal-test examples/skypilot/single_node.sky.yaml
+```
+
+You can designate the cloud or infrastructure to run your experiment on by adding
+`--infra xxx`. For example:
+
+```bash
+sky launch -c areal-test examples/skypilot/single_node.sky.yaml --infra gcp
+sky launch -c areal-test examples/skypilot/single_node.sky.yaml --infra aws
+sky launch -c areal-test examples/skypilot/single_node.sky.yaml --infra k8s
+```
+
+## Running a Multi-Node Experiment
+
+### Running AReaL with the Ray Launcher
+
+The following example shows how to set up a Ray cluster with SkyPilot and then use
+AReaL to run GRPO on the GSM8K dataset on 2 nodes, each with 8 A100 GPUs. This example
+is tested on GCP and a K8s cluster.
+
+Specify the resources and image used to run the experiment:
+
+```yaml
+resources:
+  accelerators: A100:8
+  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4
+  memory: 256+
+  cpus: 32+
+
+num_nodes: 2
+
+workdir: .
+```
+
+Designate shared storage. You can either use an existing cloud bucket or volume:
+
+```yaml
+file_mounts:
+  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
+    source: s3://my-bucket/ # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container_name>, r2://, cos://<region>/<bucket_name>, oci://
+    mode: MOUNT # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.
+```
+
+or create a new bucket or volume with SkyPilot:
+
+```yaml
+# Create an empty gcs bucket
+file_mounts:
+  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
+    name: my-sky-bucket
+    store: gcs # Optional: one of s3, gcs, azure, r2, ibm, oci
+```
+
+For more information about shared storage with SkyPilot, check
+[SkyPilot Cloud Buckets](https://docs.skypilot.co/en/latest/reference/storage.html) and
+[SkyPilot Volumes](https://docs.skypilot.co/en/latest/reference/volumes.html).
+
+Next, prepare the commands used to set up the Ray cluster and run the experiment:
+
+```yaml
+envs:
+  EXPERIMENT_NAME: my-areal-experiment
+  TRIAL_NAME: my-trial-name
+
+run: |
+  # Get the head node's IP and total number of nodes (environment variables injected by SkyPilot).
+  head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+
+  if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
+    echo "Starting Ray head node..."
+    ray start --head --port=6379
+
+    while [ $(ray status | grep node_ | wc -l) -lt $SKYPILOT_NUM_NODES ]; do
+      echo "Waiting for all nodes to join... Current nodes: $(ray status | grep node_ | wc -l) / $SKYPILOT_NUM_NODES"
+      sleep 5
+    done
+
+    echo "Executing training script on head node..."
+    python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
+      --config examples/skypilot/gsm8k_grpo_ray.yaml \
+      experiment_name=gsm8k-grpo \
+      trial_name=trial0 \
+      cluster.n_nodes=$SKYPILOT_NUM_NODES \
+      cluster.n_gpus_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+      allocation_mode=sglang.d8+d8
+  else
+    sleep 10
+    echo "Starting Ray worker node..."
+    ray start --address $head_ip:6379
+    sleep 5
+  fi
+
+  echo "Node setup complete for rank $SKYPILOT_NODE_RANK."
+```
+
+**Note**: If you are running on a cluster in which nodes are connected via InfiniBand,
+you might need to add an additional config field to the example YAML file for the
+experiment to run:
+
+```yaml
+config:
+  kubernetes:
+    pod_config:
+      spec:
+        containers:
+          - securityContext:
+              capabilities:
+                add:
+                  - IPC_LOCK
+```
+
+### Launch the Ray Cluster and Run the AReaL Experiment
+
+Then you are ready to run AReaL with the following command:
+
+```bash
+sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml
+```
+
+You can designate the cloud or infrastructure to run your experiment on by adding
+`--infra xxx`. For example:
+
+```bash
+sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra gcp
+sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra aws
+sky launch -c areal-test examples/skypilot/ray_cluster.sky.yaml --infra k8s
+```
+
+You should see your AReaL experiment running and producing training logs in your
+terminal.
+
+Successfully launched 2 nodes on GCP and deployed a Ray cluster:
+
+![Launching Ray Cluster](ray_launch.png)
+
+Successfully ran a training step:
+
+![Running a train step](train_step_success.png)
+
+### Running AReaL with the SkyPilot Launcher
+
+AReaL plans to support a native SkyPilot launcher built on the
+[SkyPilot Python SDK](https://docs.skypilot.co/en/latest/reference/api.html), which is
+currently under development.
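The head/worker branching used in the `run` section of `ray_cluster.sky.yaml` can be sanity-checked locally by mocking the environment variables SkyPilot injects on each node. A minimal sketch — the variable names are SkyPilot's real ones, but the IPs and rank value here are made up:

```shell
# Mock the per-node variables SkyPilot injects (values are illustrative).
SKYPILOT_NODE_IPS="10.0.0.1
10.0.0.2"
SKYPILOT_NODE_RANK=1

# Same extraction as in the run section: the head node's IP is the first line.
head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)

if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
  echo "rank 0: would run 'ray start --head --port=6379'"
else
  echo "rank $SKYPILOT_NODE_RANK: would run 'ray start --address $head_ip:6379'"
fi
```

Because every node receives the same script and differs only in these variables, the identical `run` section works unmodified on the head and on every worker.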
diff --git a/examples/skypilot/gsm8k_grpo_ray.yaml b/examples/skypilot/gsm8k_grpo_ray.yaml
new file mode 100644
index 000000000..97cf55a15
--- /dev/null
+++ b/examples/skypilot/gsm8k_grpo_ray.yaml
@@ -0,0 +1,153 @@
+experiment_name: gsm8k-grpo-on-ray
+trial_name: trial0
+
+seed: 1
+total_train_epochs: 10
+tokenizer_path: ${actor.path}
+async_training: true
+
+cluster:
+  n_nodes: 2
+  n_gpus_per_node: 8
+  fileroot: /storage/experiments
+  name_resolve:
+    type: ray
+    ray_actor_name: ray_kv_store
+
+allocation_mode: sglang.d8+d8
+
+rollout:
+  experiment_name: ${experiment_name}
+  trial_name: ${trial_name}
+  max_concurrent_rollouts: 256
+  queue_size: null
+  consumer_batch_size: ${train_dataset.batch_size}
+  max_head_offpolicyness: 2
+  enable_rollout_tracing: false
+
+gconfig:
+  n_samples: 4
+  min_new_tokens: 0
+  max_new_tokens: 1024
+  greedy: false
+  temperature: 1.0
+
+actor:
+  experiment_name: ${experiment_name}
+  trial_name: ${trial_name}
+  path: Qwen/Qwen2.5-1.5B-Instruct
+  init_from_scratch: false
+  disable_dropout: true
+  gradient_checkpointing: false
+  dtype: bfloat16
+  mb_spec:
+    max_tokens_per_mb: 4096
+  optimizer:
+    type: adam
+    lr: 1.70e-5
+    weight_decay: 0.017
+    beta1: 0.9
+    beta2: 0.999
+    eps: 1e-8
+    lr_scheduler_type: constant
+    gradient_clipping: 1.0
+    warmup_steps_proportion: 0.001
+  backend: fsdp
+  group_size: ${gconfig.n_samples}
+  eps_clip: 0.4
+  temperature: ${gconfig.temperature}
+  reward_scaling: 10.0
+  reward_bias: -0.5
+  kl_ctl: 0.0
+  ppo_n_minibatches: 1
+  recompute_logprob: true
+  use_decoupled_loss: true
+  behav_imp_weight_cap: 5.0
+  dynamic_sampling: false
+  reward_norm:
+    mean_level: group
+    std_level: group
+    group_size: ${gconfig.n_samples}
+  adv_norm:
+    mean_level: batch
+    std_level: batch
+  max_new_tokens: ${gconfig.max_new_tokens}
+
+ref:
+  experiment_name: ${experiment_name}
+  trial_name: ${trial_name}
+  path: ${actor.path}
+  init_from_scratch: false
+  disable_dropout: true
+  dtype: ${actor.dtype}
+  mb_spec:
+    max_tokens_per_mb: 10240
+  optimizer: null
+  backend: fsdp
+
+# SGLang
+sglang:
+  model_path: ${actor.path}
+  random_seed: ${seed}
+  skip_tokenizer_init: true
+  dtype: ${actor.dtype}
+  max_running_requests: null
+  context_length: 32768
+  mem_fraction_static: 0.8
+
+# datasets
+train_dataset:
+  batch_size: 128
+  shuffle: true
+  pin_memory: true
+  num_workers: 4
+  path: openai/gsm8k
+  type: rl
+  max_length: 1024
+
+valid_dataset:
+  batch_size: 128
+  shuffle: true
+  pin_memory: true
+  num_workers: 4
+  path: openai/gsm8k
+  type: rl
+
+# Utilities
+saver:
+  experiment_name: ${experiment_name}
+  trial_name: ${trial_name}
+  fileroot: ${cluster.fileroot}
+  freq_epochs: 1
+  freq_steps: null
+  freq_secs: null
+
+recover:
+  mode: disabled
+  experiment_name: ${experiment_name}
+  trial_name: ${trial_name}
+  fileroot: ${cluster.fileroot}
+  freq_epochs: 1
+  freq_steps: null
+  freq_secs: 3600
+
+evaluator:
+  experiment_name: ${experiment_name}
+  trial_name: ${trial_name}
+  fileroot: ${cluster.fileroot}
+  freq_epochs: 1
+  freq_steps: null
+  freq_secs: null
+
+stats_logger:
+  experiment_name: ${experiment_name}
+  trial_name: ${trial_name}
+  fileroot: ${cluster.fileroot}
+  wandb:
+    mode: disabled
+
+launcher:
+  inference_server_cpus_per_gpu: 4
+  inference_server_mem_per_gpu: 32768
+  trainer_cpus_per_gpu: 4
+  trainer_mem_per_gpu: 32768
diff --git a/examples/skypilot/ray_cluster.sky.yaml b/examples/skypilot/ray_cluster.sky.yaml
new file mode 100644
index 000000000..f04112c37
--- /dev/null
+++ b/examples/skypilot/ray_cluster.sky.yaml
@@ -0,0 +1,45 @@
+
+resources:
+  accelerators: A100:8
+  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4
+  memory: 32+
+  cpus: 8+
+
+num_nodes: 2
+
+workdir: .
+
+file_mounts:
+  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
+    source: s3://my-bucket/ # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container_name>, r2://, cos://<region>/<bucket_name>, oci://
+    mode: MOUNT # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.
+
+run: |
+  # Get the head node's IP and total number of nodes (environment variables injected by SkyPilot).
+  head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+
+  if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
+    echo "Starting Ray head node..."
+    ray start --head --port=6379
+
+    while [ $(ray status | grep node_ | wc -l) -lt $SKYPILOT_NUM_NODES ]; do
+      echo "Waiting for all nodes to join... Current nodes: $(ray status | grep node_ | wc -l) / $SKYPILOT_NUM_NODES"
+      sleep 5
+    done
+
+    echo "Executing training script on head node..."
+    python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
+      --config examples/skypilot/gsm8k_grpo_ray.yaml \
+      experiment_name=gsm8k-grpo \
+      trial_name=trial0 \
+      cluster.n_nodes=$SKYPILOT_NUM_NODES \
+      cluster.n_gpus_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+      allocation_mode=sglang.d8+d8
+  else
+    sleep 10
+    echo "Starting Ray worker node..."
+    ray start --address $head_ip:6379
+    sleep 5
+  fi
+
+  echo "Node setup complete for rank $SKYPILOT_NODE_RANK."
diff --git a/examples/skypilot/ray_launch.png b/examples/skypilot/ray_launch.png
new file mode 100644
index 000000000..207251b28
Binary files /dev/null and b/examples/skypilot/ray_launch.png differ
diff --git a/examples/skypilot/single_node.sky.yaml b/examples/skypilot/single_node.sky.yaml
new file mode 100644
index 000000000..11f5015cd
--- /dev/null
+++ b/examples/skypilot/single_node.sky.yaml
@@ -0,0 +1,31 @@
+name: areal-test-skypilot
+
+resources:
+  accelerators: A100:2
+  autostop:
+    idle_minutes: 10
+    down: true
+  cpus: 8+
+  memory: 32GB+
+  disk_size: 256GB
+  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.4
+
+num_nodes: 1
+
+file_mounts:
+  /storage: # Should be consistent with the storage paths set in gsm8k_grpo_ray.yaml
+    source: s3://my-bucket/ # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container_name>, r2://, cos://<region>/<bucket_name>, oci://
+    mode: MOUNT # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.
+
+workdir: .
+
+run: |
+  python3 -m areal.launcher.local examples/math/gsm8k_grpo.py \
+    --config examples/math/gsm8k_grpo.yaml \
+    experiment_name=gsm8k-grpo \
+    trial_name=trial0 \
+    cluster.n_nodes=1 \
+    cluster.n_gpus_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+    allocation_mode=sglang.d1+d1 \
+    train_dataset.batch_size=4 \
+    actor.mb_spec.max_tokens_per_mb=4096
diff --git a/examples/skypilot/train_step_success.png b/examples/skypilot/train_step_success.png
new file mode 100644
index 000000000..4acb5f6ba
Binary files /dev/null and b/examples/skypilot/train_step_success.png differ