@@ -0,0 +1,20 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4_jobset_workload
description: a4_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
@@ -0,0 +1,153 @@
<!-- mdformat global-off -->
# Pretrain llama3-1-70b-seq8192-gbs2048-mbs1-gpus16 workloads on A4 GKE node pools with the NVIDIA NeMo Framework

This recipe outlines the steps for running a llama3-1-70b-seq8192-gbs2048-mbs1-gpus16 pretraining
workload on [A4 GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to
  configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the
  [NeMo pretraining workload](https://github.com/NVIDIA/nemo).

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster - follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4)
  to create your A4 GKE cluster.

## Training dataset

This recipe uses a mock pretraining dataset provided by the NeMo framework.

## Docker container images

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.07`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.0`

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`. Make sure to verify the name of the local queue in your cluster.

Set the default project:

```bash
gcloud config set project $PROJECT_ID
```
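
Optionally, confirm that the expected project is now active:

```bash
gcloud config get-value project
```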

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-70b-seq8192-gbs2048-mbs1-gpus16/nemo-pretraining-gke/2_nodes
cd $RECIPE_ROOT
```

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```
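
Optionally, verify that the cluster is reachable and check the name of the Kueue local queue. This assumes the Kueue and JobSet components are installed in the cluster, which this recipe already relies on:

```bash
kubectl get nodes
kubectl get localqueues
```

The `NAME` column of the `localqueues` output is the value to use for `KUEUE_NAME`.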

### Configure and submit a pretraining job

#### Using 2 nodes (16 GPUs) with BF16 precision

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4-llama3-1-70b
helm install $WORKLOAD_NAME . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
  --set queue=${KUEUE_NAME}
```
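
If you want to review the Kubernetes manifests the chart will generate before submitting anything to the cluster, you can optionally render them locally with `helm template`, using the same flags as the install command above:

```bash
cd $RECIPE_ROOT
helm template $WORKLOAD_NAME . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
  --set queue=${KUEUE_NAME}
```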

**Examples**

- To set the number of training steps to 100, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4-llama3-1-70b
helm install $WORKLOAD_NAME . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
  --set queue=${KUEUE_NAME} \
  --set workload.arguments[0]="trainer.max_steps=100"
```

### Monitor the job

To check the status of pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-a4-llama3-1-70b`.
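
You can also check the parent JobSet and whether Kueue has admitted the workload. The exact object names depend on how the chart names the JobSet, so adjust the filter if needed:

```bash
kubectl get jobsets | grep JOB_NAME_PREFIX
kubectl get workloads | grep JOB_NAME_PREFIX
```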

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of one of the pods returned by the previous command.

Information about the training job's progress, including crucial details such as
loss, step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with
`JOB_NAME_PREFIX-workload-0-0`.
For example: `$USER-a4-llama3-1-70b-workload-0-0-s9zrv`.
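
Because the key training metrics come from the rank 0 process, it can be convenient to stream that pod's logs directly. A minimal sketch, assuming the pod naming convention shown above:

```bash
kubectl logs -f $(kubectl get pods -o name | grep JOB_NAME_PREFIX-workload-0-0)
```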

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4-llama3-1-70b
```
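
To confirm that the release and its pods are gone, you can optionally run:

```bash
helm list
kubectl get pods | grep $USER-a4-llama3-1-70b
```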
@@ -0,0 +1,105 @@
usage()
{
cat << EOF
usage: bash ./launcher.sh [config-override [config-override ...]]
config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000.
EOF
}

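# Accept only single key=value NeMo overrides; anything else prints usage and exits.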
parse_args() {
  while [ "$1" != "" ]; do
    case $(grep -o "=" <<< "$1" | wc -l) in
      1 )
        config_overrides+=("$1")
        ;;
      * )
        echo "Invalid config override: $1"
        usage
        exit 1
    esac
    shift
  done
  config_overrides="${config_overrides[*]}"
}

config_overrides=()
parse_args "$@"

if [ -z "${config_overrides}" ]; then
  echo "No NeMo config overrides specified"
else
  echo "NeMo config overrides:"
  echo " ${config_overrides}"
fi

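# Make the NCCL plugin libraries provided via NCCL_PLUGIN_PATH visible to the dynamic linker.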
export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH"
ldconfig $LD_LIBRARY_PATH
echo "Added $LD_LIBRARY_PATH to ldconfig:"
ldconfig -p | grep libcuda | sed 's/^/ /'
echo ""

if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then
  explicit_log_dir=${EXPLICIT_LOG_DIR}
else
  explicit_log_dir=workload_logs
fi
echo "Logging to ${explicit_log_dir}"

if [[ -n "${TOKENIZER_PATH}" ]]; then
  echo "Getting tokenizer files"
  cp ${TOKENIZER_PATH}/* .
  echo ""
fi

echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes"


# Update nemo run so we can export the config.
pip install git+https://github.com/NVIDIA/NeMo-Run.git@6550ff68204e5095452098eed3765ed765de5d33
pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger


# Export the nemo2 config to yaml.
python ${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \
  trainer.num_nodes="$NNODES" \
  log.explicit_log_dir="${explicit_log_dir}" \
  trainer.max_steps=25 \
  trainer.num_nodes=2 \
  trainer.devices=8 \
  ${config_overrides} \
  --to-yaml exported_nemo_config.yaml

# Create the nsys directory.
mkdir -p ${explicit_log_dir}/nsys

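# Run torchrun under Nsight Systems; tracing starts/stops on the workload's cudaProfilerStart/Stop
# calls (--capture-range=cudaProfilerApi) and one report is written per node rank.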
OMP_NUM_THREADS=12 NSYS_CONFIG_DIRECTIVES="AgentLaunchTimeoutSec=240;AppLaunchTimeoutSec=240" TORCH_NCCL_ENABLE_MONITORING=0 \
  /usr/local/bin/nsys profile -s none -t nvtx,cuda --capture-range=cudaProfilerApi --capture-range-end=stop \
  -o ${explicit_log_dir}/nsys/noderank-${JOB_COMPLETION_INDEX} \
  --session-new "nemo-rank${JOB_COMPLETION_INDEX}"-$RANDOM \
  --wait all \
  torchrun \
    --nproc-per-node="${GPUS_PER_NODE}" \
    --nnodes="${NNODES}" \
    --node_rank="${JOB_COMPLETION_INDEX}" \
    --rdzv_id="${JOB_IDENTIFIER}" \
    --master_addr="${MASTER_ADDR}" \
    --master_port="${MASTER_PORT}" \
    ${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \
      trainer.num_nodes="$NNODES" \
      log.explicit_log_dir="${explicit_log_dir}" \
      trainer.max_steps=25 \
      trainer.num_nodes=2 \
      trainer.devices=8 \
      ${config_overrides}

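# Only node rank 0 copies logs, the exported NeMo configuration, and an environment snapshot to the artifact directory.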
if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
mkdir -p ${ARTIFACT_DIR}
cp -r ${explicit_log_dir}/* ${ARTIFACT_DIR}/
cp ${NEMO_LAUNCH_SCRIPT} ${ARTIFACT_DIR}/run-cli.py
cp dllogger.json ${ARTIFACT_DIR}/dllogger.json
cp exported_nemo_config.yaml ${ARTIFACT_DIR}/nemo-configuration.yaml
env > ${ARTIFACT_DIR}/environ.txt
ls ${ARTIFACT_DIR}
fi
echo "Training completed"
echo "Pod on $(hostname --fqdn) is exiting"