
Commit b7de53a

Author: Tyler Titsworth

Distributed k8s BKM (#81)

* add tf dist section
* add pyt implementation
* add EOFs
* grammar
* update k8s docs location for tlt
* adjust whitespace on configmap proxy trblshoot section
* address feedback to readmes

1 parent c7ec9d4, commit b7de53a

17 files changed: +491 -50 lines changed

Diff for: README.md

-26
@@ -2,32 +2,6 @@
 
 This repository contains Dockerfiles, scripts, yaml files, Helm charts, etc. used to scale out AI containers with versions of TensorFlow and PyTorch that have been optimized for Intel platforms. Scaling is done with python, Docker, kubernetes, kubeflow, cnvrg.io, Helm, and other container orchestration frameworks for use in the cloud and on-premise.
 
-## Project Structure
-
-```text
-├── CODE_OF_CONDUCT.md
-├── CONTRIBUTING.md
-├── LICENSE
-├── README.md
-├── SECURITY.md
-├── classical-ml
-│   ├── Dockerfile
-│   ├── README.md
-│   └── docker-compose.yaml
-├── pytorch
-│   ├── Dockerfile
-│   ├── README.md
-│   ├── docker-compose.yaml
-└── tensorflow
-    ├── Dockerfile
-    ├── README.md
-    ├── docker-compose-serving.yaml
-    ├── docker-compose.yaml
-    ├── jupyter
-    │   └── third_party_programs.txt
-    └── serving
-```
-
 ## Project Setup
 
 Define your project's registry each time you use the project:

Diff for: classical-ml/README.md

-6
@@ -1,11 +1,5 @@
 # Classical ML Ingredients
 
-```mermaid
-%%{init: {'theme': 'dark'}}%%
-flowchart TB
-mlbase[ml-base]
-```
-
 ## Classical ML
 
 ### Base
Diff for: pytorch/README.md

+78 -7
@@ -1,12 +1,5 @@
 # PyTorch Ingredients
 
-```mermaid
-%%{init: {'theme': 'dark'}}%%
-flowchart TB
-ipexbase[ipex-base]
-inc
-```
-
 ## PyTorch
 
 ### Base
@@ -31,3 +24,81 @@ Built from Base
 | --- | --- | --- |
 | INC_VERSION | `2.1.1` | Neural Compressor Version |
 | ONECCL_VERSION | `2.0.0+cpu` | TorchCCL Version |
+
+#### Distributed Training on k8s
+
+Use _N_ nodes for your training with PyTorchJobs and Kubeflow's Training Operator, together with an optimized production container.
+
+##### Distributed Production Container
+
+Create a Distributed Production Container using the Intel Optimized PyTorch MultiNode layers. For example:
+
+```dockerfile
+# Add some multinode image layers
+FROM intel/intel-optimized-pytorch:2.0.0-pip-multinode as prod-base
+# Use an existing container target
+FROM base as prod
+
+# Copy in the Intel Optimized PyTorch MultiNode Python environment; this will overwrite any packages with the same name
+COPY --from=prod-base /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages
+COPY --from=prod-base /usr/local/bin /usr/local/bin
+
+...
+```
+
+##### Build the Container with the New Stage
+
+```bash
+docker build ... --target prod -t my_container:prod .
+```
+
+##### Configure Kubernetes
+
+On an existing Kubernetes cluster of any flavor, install the standalone training operator from GitHub or use a pre-existing Kubeflow configuration.
+
+```bash
+kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
+```
+
+Ensure that the training operator deployment's readiness status is `1/1` before proceeding.
+
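One way to confirm that readiness, as a minimal sketch (assuming the standalone overlay places a deployment named `training-operator` in the `kubeflow` namespace):

```bash
# Wait for the training operator deployment to become Available
kubectl wait deployment/training-operator \
  --namespace kubeflow \
  --for=condition=Available \
  --timeout=300s

# Or check the READY column directly (expect 1/1)
kubectl get deployment training-operator --namespace kubeflow
```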
+##### Deploy Distributed Job
+
+Install [Helm](https://helm.sh/docs/intro/install/):
+
+```bash
+curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 && \
+chmod 700 get_helm.sh && \
+./get_helm.sh
+```
+
+Configure the Helm chart by editing the [pytorchjob](chart/templates/pytorchjob.yaml#L18-L46), [pvc](chart/templates/pvc.yaml), and [values](chart/values.yaml) files.
+
+Afterwards, deploy to the cluster with `helm install`. For all of the chart's options, see its [README](chart/README.md).
+
+```bash
+export NAMESPACE=kubeflow
+helm install --namespace ${NAMESPACE} \
+    --set metadata.name=<workflow-name> \
+    --set metadata.namespace=<namespace with training operator> \
+    --set imageName=<Docker Image repository/Name> \
+    --set imageTag=<Docker Image Tag> \
+    ... \
+    ipex-distributed \
+    ./chart
+```
+
+To see an existing configuration utilizing this method, check out the [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers/README.md#kubernetes) implementation.
+
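Once the release is installed, progress can be followed with ordinary `kubectl` commands; a rough sketch, assuming the default values (`metadata.name=ipex-distributed`, namespace `kubeflow`) and the usual PyTorchJob pod naming:

```bash
# Confirm the PyTorchJob was created and list its pods
kubectl get pytorchjobs --namespace kubeflow
kubectl get pods --namespace kubeflow | grep ipex-distributed

# Follow the master replica's logs (pods are typically named <job>-master-0, <job>-worker-N)
kubectl logs -f ipex-distributed-master-0 --namespace kubeflow
```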
+##### Troubleshooting
+
+- [TorchCCL Reference](https://github.com/intel/torch-ccl)
+- [PyTorchJob Reference](https://www.kubeflow.org/docs/components/training/pytorch/)
+- [Training Operator Reference](https://github.com/kubeflow/training-operator)
+- When applying proxies, specify all of them in a ConfigMap in the same namespace, and add the following to both your launcher and workers:
+
+```yaml
+envFrom:
+  - configMapRef:
+      name: my-proxy-configmap-name
+```
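The ConfigMap that `configMapRef` points to can be created with `kubectl create configmap`; in this sketch the name `my-proxy-configmap-name`, the namespace, and the proxy URLs are placeholders:

```bash
# Create a ConfigMap holding the proxy variables in the job's namespace
kubectl create configmap my-proxy-configmap-name \
  --namespace kubeflow \
  --from-literal=http_proxy=http://proxy.example.com:912 \
  --from-literal=https_proxy=http://proxy.example.com:912 \
  --from-literal=no_proxy=localhost,127.0.0.1
```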

Diff for: pytorch/chart/.helmignore

+23
@@ -0,0 +1,23 @@
+# Patterns to ignore when building packages.
+# This supports shell glob matching, relative path matching, and
+# negation (prefixed with !). Only one pattern per line.
+.DS_Store
+# Common VCS dirs
+.git/
+.gitignore
+.bzr/
+.bzrignore
+.hg/
+.hgignore
+.svn/
+# Common backup files
+*.swp
+*.bak
+*.tmp
+*.orig
+*~
+# Various IDEs
+.project
+.idea/
+*.tmproj
+.vscode/

Diff for: pytorch/chart/Chart.yaml

+24
@@ -0,0 +1,24 @@
+apiVersion: v2
+name: IPEX Distributed
+description: A Helm chart for Kubernetes
+
+# A chart can be either an 'application' or a 'library' chart.
+#
+# Application charts are a collection of templates that can be packaged into versioned archives
+# to be deployed.
+#
+# Library charts provide useful utilities or functions for the chart developer. They're included as
+# a dependency of application charts to inject those utilities and functions into the rendering
+# pipeline. Library charts do not define any templates and therefore cannot be deployed.
+type: application
+
+# This is the chart version. This version number should be incremented each time you make changes
+# to the chart and its templates, including the app version.
+# Versions are expected to follow Semantic Versioning (https://semver.org/)
+version: 0.1.0
+
+# This is the version number of the application being deployed. This version number should be
+# incremented each time you make changes to the application. Versions are not expected to
+# follow Semantic Versioning. They should reflect the version the application is using.
+# It is recommended to use it with quotes.
+appVersion: "1.16.0"

Diff for: pytorch/chart/README.md

+22
@@ -0,0 +1,22 @@
+# IPEX Distributed
+
+![Version: 0.1.0](https://img.shields.io/badge/Version-0.1.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 1.16.0](https://img.shields.io/badge/AppVersion-1.16.0-informational?style=flat-square)
+
+A Helm chart for Kubernetes
+
+## Values
+
+| Key | Type | Default | Description |
+|-----|------|---------|-------------|
+| imageName | string | `"intel/intel-optimized-pytorch"` | |
+| imageTag | string | `"2.0.0-pip-multinode"` | |
+| masterResources.cpu | int | `32` | Number of CPU cores for the Master |
+| masterResources.memory | string | `"16Gi"` | Amount of Memory for the Master |
+| metadata.name | string | `"ipex-distributed"` | |
+| metadata.namespace | string | `"kubeflow"` | |
+| pvcName | string | `"ipex"` | |
+| pvcResources | string | `"2Gi"` | Amount of shared storage for workers and launcher |
+| pvcScn | string | `"nil"` | PVC `StorageClassName` |
+| workerResources.cpu | int | `32` | Number of CPU cores per Worker |
+| workerResources.memory | string | `"16Gi"` | Amount of Memory per Worker |
+| workers | int | `4` | Number of Workers |
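These defaults can be overridden with a small values file at install time; a sketch with illustrative numbers (the release name, namespace, storage class, and resource sizes are all assumptions):

```bash
# Override a few defaults in a values file, then install the chart with it
cat > my-values.yaml <<'EOF'
workers: 8
workerResources:
  cpu: 16
  memory: 32Gi
pvcScn: standard  # assumes a StorageClass named 'standard' exists
EOF

helm install ipex-distributed ./chart --namespace kubeflow -f my-values.yaml
```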

Diff for: pytorch/chart/templates/pvc.yaml

+12
@@ -0,0 +1,12 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: {{ .Values.pvcName }}
+  namespace: {{ .Values.metadata.namespace }}
+spec:
+  storageClassName: {{ .Values.pvcScn }}
+  accessModes:
+    - "ReadWriteOnce"
+  resources:
+    requests:
+      storage: {{ .Values.pvcResources }}
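Since the chart ships `pvcScn` as the placeholder value `nil`, it is worth confirming that a real StorageClass is set and that the claim actually binds; a quick check, assuming the default `pvcName` of `ipex` and the `kubeflow` namespace:

```bash
# List available storage classes, then confirm the claim reaches Bound status
kubectl get storageclass
kubectl get pvc ipex --namespace kubeflow
```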

Diff for: pytorch/chart/templates/pytorchjob.yaml

+60
@@ -0,0 +1,60 @@
+apiVersion: "kubeflow.org/v1"
+kind: PyTorchJob
+metadata:
+  name: {{ .Values.metadata.name }}
+  namespace: {{ .Values.metadata.namespace }}
+spec:
+  pytorchReplicaSpecs:
+    Master:
+      replicas: 1
+      template:
+        spec:
+          containers:
+            - name: pytorch
+              image: "{{ .Values.imageName }}:{{ .Values.imageTag }}"
+              imagePullPolicy: Always
+              command:
+                - torchrun
+                - myscript.py
+              resources:
+                limits:
+                  cpu: {{ .Values.masterResources.cpu }}
+                  memory: {{ .Values.masterResources.memory }}
+              volumeMounts:
+                - name: dataset-dir
+                  mountPath: /tmp/output
+          volumes:
+            - name: dshm
+              emptyDir:
+                medium: Memory
+            - name: dataset-dir
+              persistentVolumeClaim:
+                claimName: {{ .Values.pvcName }}
+    Worker:
+      replicas: {{ .Values.workers }}
+      template:
+        spec:
+          containers:
+            - name: pytorch
+              image: "{{ .Values.imageName }}:{{ .Values.imageTag }}"
+              imagePullPolicy: Always
+              envFrom:
+                - configMapRef:
+                    name: intel-proxy-config
+              command:
+                - torchrun
+                - myscript.py
+              resources:
+                limits:
+                  cpu: {{ .Values.workerResources.cpu }}
+                  memory: {{ .Values.workerResources.memory }}
+              volumeMounts:
+                - name: dataset-dir
+                  mountPath: /tmp/output
+          volumes:
+            - name: dshm
+              emptyDir:
+                medium: Memory
+            - name: dataset-dir
+              persistentVolumeClaim:
+                claimName: {{ .Values.pvcName }}
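Because the image, command, and resource fields are templated, rendering the manifests locally before installing can catch mistakes early; a minimal sketch using `helm template` from the chart's parent directory (the image name and tag are placeholders):

```bash
# Render the PyTorchJob and PVC manifests locally without touching the cluster
helm template ipex-distributed ./chart \
  --namespace kubeflow \
  --set imageName=my-registry/my-image \
  --set imageTag=prod
```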

Diff for: pytorch/chart/values.yaml

+17
@@ -0,0 +1,17 @@
+metadata:
+  name: ipex-distributed
+  namespace: kubeflow
+
+imageName: intel/intel-optimized-pytorch
+imageTag: 2.0.0-pip-multinode
+masterResources:
+  cpu: 32
+  memory: 16Gi
+workerResources:
+  cpu: 32
+  memory: 16Gi
+workers: 4
+
+pvcName: ipex
+pvcScn: nil
+pvcResources: 2Gi

Diff for: tensorflow/Dockerfile

+2 -1
@@ -182,7 +182,8 @@ ARG HOROVOD_VERSION
 ARG HOROVOD_WITH_TENSORFLOW=1
 ARG HOROVOD_WITHOUT_MXNET=1
 ARG HOROVOD_WITHOUT_PYTORCH=1
-ARG ONECCL_VERSION
+ARG HOROVOD_WITHOUT_GLOO=1
+ARG HOROVOD_WITH_MPI=1
 
 RUN apt-get install -y --no-install-recommends --fix-missing \
     build-essential \
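With Gloo disabled and MPI enabled at build time, Horovod's `--check-build` summary is a quick way to confirm which frameworks and controllers the resulting image was compiled with; a sketch where the image tag is a placeholder:

```bash
# Print the frameworks and controllers Horovod was compiled with (expect MPI, not Gloo)
docker run --rm my-tf-image:latest horovodrun --check-build
```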

0 commit comments
