Multi node Multi card MPIJob #1702
Status: Open. sramakintel wants to merge 10 commits into huggingface:main from sramakintel:main.
Commits (10 total; changes shown from 8 commits), all by sramakintel:
- 5c5a114 validate on v1.19.0 stack
- e38c59e change only kubeversion
- 8754c17 add cluster
- a35e643 install pre-requisites
- bcb3196 Merge branch 'huggingface:main' into main
- 31dc74a update readme
- b7ee194 restore chart description
- 0297a75 update based on review comments
- ebbe21a comment security context
- eb8a15c Merge branch 'huggingface:main' into main
examples/kubernetes/ci/multi-node-multi-card-lora-clm-values.yaml (142 additions, 0 deletions)
```yaml
# Default values for examples.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

image:
  # -- Determines when the kubelet will pull the image to the worker nodes. Choose from: `IfNotPresent`, `Always`, or `Never`. If updates to the image have been made, use `Always` to ensure the newest image is used.
  pullPolicy: Always
  cleanPodPolicy: Running
  # -- Repository and name of the docker image
  repository:
  # -- Tag of the docker image
  tag:

imagePullSecrets: []

# -- Pod [annotations](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/) to attach metadata to the job
podAnnotations: {}

# -- Specify a pod security context to run as a non-root user
# podSecurityContext:
#   fsGroup: 1000

# securityContext:
#   # -- Run as privileged or unprivileged. Certain deployments may require running as privileged, check with your system admin.
privileged: false

# -- The default 64MB of shared memory for docker containers can be insufficient when using more than one HPU. Setting hostIPC: true allows reusing the host's shared memory space inside the container.
hostIPC: true

# -- Define a config map's data as container environment variables
envFrom: []

# -- Define environment variables to set in the container
env:
  - name: LOGLEVEL
    value: INFO

secret:
  # -- Hugging Face token encoded using base64.
  encodedToken:
  # -- If a token is provided, specify a mount path that will be used to set HF_TOKEN_PATH
  secretMountPath: /tmp/hf_token

storage:
  # -- Name of the storage class to use for the persistent volume claim. To list the available storage classes use: `kubectl get storageclass`.
  storageClassName: nfs-client
  # -- [Access modes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes) for the persistent volume.
  accessModes:
    - "ReadWriteMany"
  # -- Storage [resources](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#resources)
  resources:
    requests:
      storage: 30Gi
  # -- Location where the PVC will be mounted in the pods
  pvcMountPath: &pvcMountPath /tmp/pvc-mount
  # -- A data access pod will be deployed when set to true
  deployDataAccessPod: false

resources:
  limits:
    cpu: 16
    # -- Specify the number of Gaudi card(s)
    habana.ai/gaudi: 2
    # -- Specify [Memory limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory) requests for the job
    memory: 64Gi
    # -- Specify hugepages-2Mi requests for the job
    hugepages-2Mi: 4400Mi
  requests:
    cpu: 16
    # -- Specify the number of Gaudi card(s)
    habana.ai/gaudi: 2
    # -- Specify [Memory resource](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory) requests for the job
    memory: 64Gi
    # -- Specify hugepages-2Mi requests for the job
    hugepages-2Mi: 4400Mi

# -- Number of Gaudi nodes to be used
numNodes: 2
# -- Number of Gaudi cards to be used per one node
numCards: 1
# -- Number of slots per worker
slotsPerWorker: 1

# Define the command to run in the container
command:
  # python command to supply mpirun commands:
  - python
  - /optimum-habana/examples/language-modeling/run_lora_clm.py
  - --model_name_or_path
  - huggyllama/llama-7b
  - --dataset_name
  - tatsu-lab/alpaca
  - --bf16
  - --output_dir
  - *pvcMountPath
  - --num_train_epochs
  - "3"
  - --per_device_train_batch_size
  - "12"
  - --evaluation_strategy
  - "no"
  - --save_strategy
  - "no"
  - --learning_rate
  - "1e-4"
  - --warmup_ratio
  - "0.03"
  - --lr_scheduler_type
  - "constant"
  - --max_grad_norm
  - "0.3"
  - --logging_steps
  - "1"
  - --do_train
  - --do_eval
  - --use_habana
  - --use_lazy_mode
  - --throughput_warmup_steps
  - "3"
  - --lora_rank
  - "8"
  - --lora_alpha=16
  - --lora_dropout=0.05
  - --lora_target_modules
  - "q_proj"
  - "v_proj"
  - --dataset_concatenation
  - --max_seq_length=512
  - --low_cpu_mem_usage=True
  - --validation_split_percentage=4
  - --adam_epsilon=1e-08

# -- Optionally specify a [node selector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) with labels that determine which node your worker pod will land on
nodeSelector: {}

# -- Optionally specify [tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) to allow the worker pod to land on a node with a taint.
tolerations: []

# -- Optionally provide node [affinities](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity) to constrain which node your worker pod will be scheduled on
affinity: {}
```
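The `secret.encodedToken` value expects the Hugging Face token already base64-encoded. A minimal sketch of producing it ("hf_example_token" is a placeholder, not a real token):

```shell
# Encode a placeholder Hugging Face token for secret.encodedToken.
# Substitute your real token; printf '%s' avoids base64-encoding a trailing newline.
ENCODED=$(printf '%s' "hf_example_token" | base64)
echo "$ENCODED"
```

Pasting the output of `echo -n` piped to `base64` works too, as long as the trailing newline is suppressed; an encoded newline would end up inside the mounted token file.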
A second new file (91 additions) defines the MPIJob template that is rendered when more than one node is requested:

```yaml
{{- if and .Values.numNodes (gt (int .Values.numNodes) 1) }}
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: {{ .Release.Name }}-mpijob
spec:
  slotsPerWorker: {{ .Values.slotsPerWorker }}
  runPolicy:
    cleanPodPolicy: {{ .Values.image.cleanPodPolicy }}
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          hostIPC: {{ .Values.hostIPC }}
          containers:
            - name: {{ .Release.Name }}-mpijob-container
              image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
              imagePullPolicy: {{ .Values.image.pullPolicy }}
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                  MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p)";
                  echo $MASTER_ADDR;
                  NUM_NODES=$(wc -l < $HOSTSFILE);
                  CARDS_PER_NODE={{ .Values.numCards }};
                  N_CARDS=$((NUM_NODES*CARDS_PER_NODE));

                  SETUP_CMD="git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git; \
                  pip install -r optimum-habana/examples/language-modeling/requirements.txt";

                  eval $SETUP_CMD;

                  mpirun --npernode 1 \
                  --tag-output \
                  --allow-run-as-root \
                  --prefix $MPI_ROOT \
                  -mca routed direct \
                  git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git;

                  mpirun --npernode 1 \
                  --tag-output \
                  --allow-run-as-root \
                  --prefix $MPI_ROOT \
                  -mca routed direct \
                  pip install -r optimum-habana/examples/language-modeling/requirements.txt;

                  MODEL_PATH=/optimum-habana/examples/language-modeling;
                  cd $MODEL_PATH;
                  mpirun -np $N_CARDS --npernode $CARDS_PER_NODE \
                  --allow-run-as-root \
                  --bind-to core \
                  --map-by ppr:$CARDS_PER_NODE:node:PE=6 \
                  -rank-by core --report-bindings \
                  --tag-output \
                  --merge-stderr-to-stdout --prefix $MPI_ROOT \
                  -x MASTER_ADDR=$MASTER_ADDR \
                  -mca btl_tcp_if_include eth0 \
                  -mca oob_tcp_if_include eth0 \
                  -mca plm_rsh_no_tree_spawn 1 \
                  {{ .Values.command | join " " }};
              resources:
                limits:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
    Worker:
      replicas: {{ .Values.numNodes }}
      template:
        spec:
          hostIPC: {{ .Values.hostIPC }}
          containers:
            - name: {{ .Release.Name }}-mpijob-container
              image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
              imagePullPolicy: {{ .Values.image.pullPolicy }}
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  sleep 365d;
              resources:
                {{- toYaml .Values.resources | nindent 16 }}
{{- end }}
```
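The launcher's MASTER_ADDR / N_CARDS arithmetic can be checked in isolation. Below is a sketch using a fabricated two-node hostfile (the worker names and temp path are made up; in the real job the file comes from $OMPI_MCA_orte_default_hostfile, written by the MPI Operator):

```shell
# Simulate the hostfile the MPI Operator would hand to the launcher.
HOSTSFILE=$(mktemp)
printf 'worker-0 slots=1\nworker-1 slots=1\n' > "$HOSTSFILE"

# Same parsing as the template: the first hostname becomes MASTER_ADDR.
MASTER_ADDR="$(head -n 1 "$HOSTSFILE" | sed -n 's/[[:space:]]slots.*//p')"
NUM_NODES=$(wc -l < "$HOSTSFILE")
CARDS_PER_NODE=1                       # corresponds to .Values.numCards
N_CARDS=$((NUM_NODES*CARDS_PER_NODE))  # total ranks passed to mpirun -np
echo "$MASTER_ADDR $N_CARDS"           # -> worker-0 2
```

With numNodes: 2 and numCards: 1 (the defaults in the values file above), this yields 2 ranks, one per node, matching the configuration the author says was tested.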
Review discussion:

- Reviewer: Has this been tested and validated to run on < 8 cards on multiple nodes?
- sramakintel: @ltran5991 it has been tested on 2 nodes with one card each
- Reviewer: How about 2 nodes with 2 cards each?
- Reviewer: @sramakintel, could you test with 2 nodes / 2 cards and confirm the code works. Thanks.
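One way to exercise the 2-node / 2-cards-per-node case the reviewers ask about is a small values override layered over the example file (a hypothetical sketch; the override filename is made up, and the pod's `habana.ai/gaudi` limit of 2 in the values file already accommodates two cards per node):

```yaml
# override.yaml -- hypothetical override for the 2 nodes x 2 cards test.
# Apply over the example values, e.g.: helm install <release> <chart> -f <values> -f override.yaml
numNodes: 2
numCards: 2
```

With these values the template computes N_CARDS = 2 * 2 = 4 and launches `mpirun -np 4 --npernode 2`.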