feat: add imagePullPolicy to pytorchjob-generator #177

Merged
merged 1 commit into from
Apr 17, 2025
3 changes: 2 additions & 1 deletion tools/pytorchjob-generator/chart/README.md
@@ -21,7 +21,8 @@ customize the Jobs generated by the tool.
| priority | string | `"default-priority"` | Type of priority for the job (choose from: "default-priority", "low-priority" or "high-priority"). |
| customLabels | array | `nil` | Optional array of custom labels to add to all the resources created by the Job (the PyTorchJob, the PodGroup, and the AppWrapper). |
| containerImage | string | must be provided by the user | Image used for creating the Job's containers (needs to have all the applications your job may need) |
-| imagePullSecrets | array | `nil` | List of image-pull-secrets to be used for pulling containerImages |
+| imagePullSecrets | array | `nil` | List of image-pull-secrets to be used for pulling containerImages |
+| imagePullPolicy | string | `"IfNotPresent"` | Policy for pulling images (choose from: "IfNotPresent", "Always", or "Never") https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy |

### Resource Requirements

4 changes: 2 additions & 2 deletions tools/pytorchjob-generator/chart/templates/appwrapper.yaml
@@ -117,7 +117,7 @@ spec:
containers:
- name: pytorch
image: {{ required "Please specify a 'containerImage' in the user file" .Values.containerImage }}
-imagePullPolicy: IfNotPresent
+imagePullPolicy: {{ .Values.imagePullPolicy | default "IfNotPresent" }}
{{- include "mlbatch.securityContext" . | indent 44 }}
{{- include "mlbatch.env" . | indent 44 }}
{{- include "mlbatch.volumeMounts" . | indent 44 }}
@@ -140,7 +140,7 @@ spec:
containers:
- name: pytorch
image: {{ required "Please specify a 'containerImage' in the user file" .Values.containerImage }}
-imagePullPolicy: IfNotPresent
+imagePullPolicy: {{ .Values.imagePullPolicy | default "IfNotPresent" }}
{{- include "mlbatch.securityContext" . | indent 44 }}
{{- include "mlbatch.env" . | indent 44 }}
{{- include "mlbatch.volumeMounts" . | indent 44 }}
4 changes: 4 additions & 0 deletions tools/pytorchjob-generator/chart/values.schema.json
@@ -69,6 +69,10 @@
{ "type": "null" },
{ "type": "array" }
]},
+"imagePullPolicy": { "oneOf": [
+{ "type": "null" },
+{ "type": "string" }
+]},
"volumes": { "oneOf": [
{ "type": "null" },
{ "type": "array" }
27 changes: 14 additions & 13 deletions tools/pytorchjob-generator/chart/values.yaml
@@ -32,12 +32,15 @@ customLabels:
# @section -- Job Metadata
containerImage:

-# -- (array) List of image-pull-secrets to be used for pulling containerImages
+# -- (array) List of image-pull-secrets to be used for pulling containerImages
# @section -- Job Metadata
-imagePullSecrets: # <optional, default=[]>
+imagePullSecrets: # <optional, default=[]>
# - name: secret-one
# - name: secret-two

+# -- (string) Policy for pulling images (choose from: "IfNotPresent", "Always", or "Never") https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy
+# @section -- Job Metadata
+imagePullPolicy: IfNotPresent

##################################
# Resource Requirements
@@ -74,15 +77,13 @@ limitGpusPerPod: # <optional, default=numGpusPerPod> Limit of number of GPUs per
# @section -- Resource Requirements
limitMemoryPerPod: # <optional, default=totalMemoryPerPod> Limit of total memory per pod for elastic jobs


########################
# Workload Specification
########################


# -- (array) List of variables/values to be defined for all the ranks. Values can be literals or
# references to Kubernetes secrets or configmaps. See [values.yaml](values.yaml) for examples of supported syntaxes.
-#
+#
# NOTE: The following standard [PyTorch Distributed environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization)
# are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT.
# @section -- Workload Specification
@@ -100,8 +101,8 @@ environmentVariables:
# name: configmap-name
# key: configmap-key

-# Private GitHub clone support.
-#
+# Private GitHub clone support.
+#
# 0) Create a secret and configMap to enable Private GitHub cloning as documented for your organization.
# 1) Then fill in the names of the secret and configMap below in sshGitCloneConfig
# 2) Finally, add your (ssh) git clone command to setupCommands in the next section
@@ -123,7 +124,7 @@ sshGitCloneConfig: # <optional, default=""> Field with "(secretName, configMapNa
# -- (array) List of custom commands to be run at the beginning of the execution. Use `setupCommands` to clone code, download data, and change directories.
# @default -- no custom commands are executed
# @section -- Workload Specification
-setupCommands: # <optional, default=[]>
+setupCommands: # <optional, default=[]>
# - git clone https://github.com/dbarnett/python-helloworld
# - cd python-helloworld

@@ -136,7 +137,7 @@ setupCommands: # <optional, default=[]>
# -- (string) Name of the PyTorch program to be executed by `torchrun`. Please provide your program name here and NOT in "setupCommands" as this helm template provides the necessary "torchrun" arguments for the parallel execution. WARNING: this program is relative to the current path set by change-of-directory commands in "setupCommands".
# If no value is provided, then only `setupCommands` are executed and torchrun is elided.
# @section -- Workload Specification
-mainProgram: # <optional, default="">
+mainProgram: # <optional, default="">

# -- (array) List of "(name, claimName, mountPath)" of volumes, with persistentVolumeClaim, to be mounted to the infrastructure
# @default -- No volumes are mounted
@@ -158,7 +159,7 @@ volumes:
# -- (string) RoCE GDR resource name (can vary by cluster configuration)
# @default -- nvidia.com/roce_gdr
# @section -- Advanced Options
-roceGdrResName: # <optional, default="">
+roceGdrResName: # <optional, default="">

# -- (integer) number of nvidia.com/roce_gdr resources (0 means disabled; >0 means enable GDR over RoCE). Must be 0 unless numPods > 1.
# @section -- Advanced Options
@@ -188,11 +189,11 @@ disableSharedMemory: false
# The environment variable MOUNT_PATH_NVME provides the runtime mount path
# @section -- Advanced Options
mountNVMe:
-# storage: 800Gi
-# mountPath: "/workspace/scratch-nvme"
+# storage: 800Gi
+# mountPath: "/workspace/scratch-nvme"

# -- (array) List of "(name, image, command[])" specifying init containers to be run before the main job. The 'command' field is a list of commands to run in the container; see the Kubernetes entry on initContainers for reference.
-#
+#
# @section -- Advanced Options
initContainers:
# - name: init-container-1
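
As a usage sketch for the `imagePullPolicy` value this PR introduces, a user values file could override the chart default along the following lines. The registry path, image tag, and secret name are placeholders invented for illustration; only the keys `containerImage`, `imagePullSecrets`, and `imagePullPolicy` come from the chart.

```yaml
# Hypothetical user values file; the image and secret names below are placeholders.
containerImage: registry.example.com/team/trainer:latest

imagePullSecrets:
  - name: secret-one

# Overrides the chart default of IfNotPresent. With "Always", the kubelet
# checks the registry each time a container starts, which suits mutable tags.
imagePullPolicy: Always
```

If the key is omitted or left null, the `default "IfNotPresent"` pipe in `appwrapper.yaml` keeps the previous hard-coded behavior. Note that the schema entry added here only constrains the value to a string or null, so a misspelled policy would likely not be caught by schema validation.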