
Commit b7de53a

Author: Tyler Titsworth

Distributed k8s BKM (#81)

* add tf dist section
* add pyt implementation
* add EOFs
* grammar
* update k8s docs location for tlt
* adjust whitespace on configmap proxy trblshoot section
* address feedback to readmes

1 parent c7ec9d4, commit b7de53a

17 files changed: +491 -50 lines changed

Diff for: README.md

-26
@@ -2,32 +2,6 @@
 
 This repository contains Dockerfiles, scripts, yaml files, Helm charts, etc. used to scale out AI containers with versions of TensorFlow and PyTorch that have been optimized for Intel platforms. Scaling is done with python, Docker, kubernetes, kubeflow, cnvrg.io, Helm, and other container orchestration frameworks for use in the cloud and on-premise.
 
-## Project Structure
-
-```text
-├── CODE_OF_CONDUCT.md
-├── CONTRIBUTING.md
-├── LICENSE
-├── README.md
-├── SECURITY.md
-├── classical-ml
-│   ├── Dockerfile
-│   ├── README.md
-│   └── docker-compose.yaml
-├── pytorch
-│   ├── Dockerfile
-│   ├── README.md
-│   ├── docker-compose.yaml
-└── tensorflow
-    ├── Dockerfile
-    ├── README.md
-    ├── docker-compose-serving.yaml
-    ├── docker-compose.yaml
-    ├── jupyter
-    │   └── third_party_programs.txt
-    └── serving
-```
-
 ## Project Setup
 
 Define your project's registry each time you use the project:

Diff for: classical-ml/README.md

-6
@@ -1,11 +1,5 @@
 # Classical ML Ingredients
 
-```mermaid
-%%{init: {'theme': 'dark'}}%%
-flowchart TB
-mlbase[ml-base]
-```
-
 ## Classical ML
 
 ### Base
Diff for: pytorch/README.md

+78 -7
@@ -1,12 +1,5 @@
 # PyTorch Ingredients
 
-```mermaid
-%%{init: {'theme': 'dark'}}%%
-flowchart TB
-ipexbase[ipex-base]
-inc
-```
-
 ## PyTorch
 
 ### Base
@@ -31,3 +24,81 @@ Built from Base
 | --- | --- | --- |
 | INC_VERSION | `2.1.1` | Neural Compressor Version |
 | ONECCL_VERSION | `2.0.0+cpu` | TorchCCL Version |
+
+#### Distributed Training on k8s
+
+Use _N_ nodes for your training with PyTorchJobs and Kubeflow's Training Operator, together with an optimized production container.
+
+##### Distributed Production Container
+
+Create a Distributed Production Container using the Intel Optimized PyTorch MultiNode layers. For example:
+
+```dockerfile
+# Add some multinode image layers
+FROM intel/intel-optimized-pytorch:2.0.0-pip-multinode as prod-base
+# Use an existing container target
+FROM base as prod
+
+# Copy in the Intel Optimized PyTorch MultiNode Python environment; this will overwrite any packages with the same name
+COPY --from=prod-base /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages
+COPY --from=prod-base /usr/local/bin /usr/local/bin
+
+...
+```
+
+##### Build the Container with the New Stage
+
+```bash
+docker build ... --target prod -t my_container:prod .
+```
+
+##### Configure Kubernetes
+
+On an existing Kubernetes cluster of any flavor, install the standalone training operator from GitHub or use a pre-existing Kubeflow configuration.
+
+```bash
+kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
+```
+
+Ensure that the training operator deployment's readiness status is `1/1` before proceeding.
+
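One way to confirm that readiness, as a minimal sketch (assuming the standalone overlay places a deployment named `training-operator` in the `kubeflow` namespace):

```bash
# Wait for the training operator deployment to become Available
kubectl wait deployment/training-operator \
  --namespace kubeflow \
  --for=condition=Available \
  --timeout=300s

# Or check the READY column directly (expect 1/1)
kubectl get deployment training-operator --namespace kubeflow
```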
+##### Deploy Distributed Job
+
+Install [Helm](https://helm.sh/docs/intro/install/):
+
+```bash
+curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 && \
+chmod 700 get_helm.sh && \
+./get_helm.sh
+```
+
+Configure the Helm chart by editing the [pytorchjob](chart/templates/pytorchjob.yaml#L18-L46), [pvc](chart/templates/pvc.yaml), and [values](chart/values.yaml) files.
+
+Afterwards, deploy to the cluster with `helm install`. For all of the chart's options, see its [README](chart/README.md).
+
+```bash
+export NAMESPACE=kubeflow
+helm install --namespace ${NAMESPACE} \
+    --set metadata.name=<workflow-name> \
+    --set metadata.namespace=<namespace with training operator> \
+    --set imageName=<Docker Image repository/Name> \
+    --set imageTag=<Docker Image Tag> \
+    ... \
+    ipex-distributed \
+    ./chart
+```
+
+To see an existing configuration utilizing this method, check out the [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers/README.md#kubernetes) implementation.
+
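Once the release is installed, progress can be followed with ordinary `kubectl` commands; a rough sketch, assuming the default values (`metadata.name=ipex-distributed`, namespace `kubeflow`) and the usual PyTorchJob pod naming:

```bash
# Confirm the PyTorchJob was created and list its pods
kubectl get pytorchjobs --namespace kubeflow
kubectl get pods --namespace kubeflow | grep ipex-distributed

# Follow the master replica's logs (pods are typically named <job>-master-0, <job>-worker-N)
kubectl logs -f ipex-distributed-master-0 --namespace kubeflow
```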
+##### Troubleshooting
+
+- [TorchCCL Reference](https://github.com/intel/torch-ccl)
+- [PyTorchJob Reference](https://www.kubeflow.org/docs/components/training/pytorch/)
+- [Training Operator Reference](https://github.com/kubeflow/training-operator)
+- When applying proxies, specify all of them in a ConfigMap in the same namespace, and add the following to both your launcher and workers:
+
+```yaml
+envFrom:
+  - configMapRef:
+      name: my-proxy-configmap-name
+```
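The ConfigMap that `configMapRef` points to can be created with `kubectl create configmap`; in this sketch the name `my-proxy-configmap-name`, the namespace, and the proxy URLs are placeholders:

```bash
# Create a ConfigMap holding the proxy variables in the job's namespace
kubectl create configmap my-proxy-configmap-name \
  --namespace kubeflow \
  --from-literal=http_proxy=http://proxy.example.com:912 \
  --from-literal=https_proxy=http://proxy.example.com:912 \
  --from-literal=no_proxy=localhost,127.0.0.1
```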

Diff for: pytorch/chart/.helmignore

+23
@@ -0,0 +1,23 @@
+# Patterns to ignore when building packages.
+# This supports shell glob matching, relative path matching, and
+# negation (prefixed with !). Only one pattern per line.
+.DS_Store
+# Common VCS dirs
+.git/
+.gitignore
+.bzr/
+.bzrignore
+.hg/
+.hgignore
+.svn/
+# Common backup files
+*.swp
+*.bak
+*.tmp
+*.orig
+*~
+# Various IDEs
+.project
+.idea/
+*.tmproj
+.vscode/

Diff for: pytorch/chart/Chart.yaml

+24
@@ -0,0 +1,24 @@
+apiVersion: v2
+name: IPEX Distributed
+description: A Helm chart for Kubernetes
+
+# A chart can be either an 'application' or a 'library' chart.
+#
+# Application charts are a collection of templates that can be packaged into versioned archives
+# to be deployed.
+#
+# Library charts provide useful utilities or functions for the chart developer. They're included as
+# a dependency of application charts to inject those utilities and functions into the rendering
+# pipeline. Library charts do not define any templates and therefore cannot be deployed.
+type: application
+
+# This is the chart version. This version number should be incremented each time you make changes
+# to the chart and its templates, including the app version.
+# Versions are expected to follow Semantic Versioning (https://semver.org/)
+version: 0.1.0
+
+# This is the version number of the application being deployed. This version number should be
+# incremented each time you make changes to the application. Versions are not expected to
+# follow Semantic Versioning. They should reflect the version the application is using.
+# It is recommended to use it with quotes.
+appVersion: "1.16.0"

Diff for: pytorch/chart/README.md

+22
@@ -0,0 +1,22 @@
+# IPEX Distributed
+
+![Version: 0.1.0](https://img.shields.io/badge/Version-0.1.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 1.16.0](https://img.shields.io/badge/AppVersion-1.16.0-informational?style=flat-square)
+
+A Helm chart for Kubernetes
+
+## Values
+
+| Key | Type | Default | Description |
+|-----|------|---------|-------------|
+| imageName | string | `"intel/intel-optimized-pytorch"` | |
+| imageTag | string | `"2.0.0-pip-multinode"` | |
+| masterResources.cpu | int | `32` | Number of CPU cores for the Master |
+| masterResources.memory | string | `"16Gi"` | Amount of Memory for the Master |
+| metadata.name | string | `"ipex-distributed"` | |
+| metadata.namespace | string | `"kubeflow"` | |
+| pvcName | string | `"ipex"` | |
+| pvcResources | string | `"2Gi"` | Amount of shared storage for workers and launcher |
+| pvcScn | string | `"nil"` | PVC `StorageClassName` |
+| workerResources.cpu | int | `32` | Number of CPU cores per Worker |
+| workerResources.memory | string | `"16Gi"` | Amount of Memory per Worker |
+| workers | int | `4` | Number of Workers |
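These defaults can be overridden with a small values file at install time; a sketch with illustrative numbers (the release name, namespace, storage class, and resource sizes are all assumptions):

```bash
# Override a few defaults in a values file, then install the chart with it
cat > my-values.yaml <<'EOF'
workers: 8
workerResources:
  cpu: 16
  memory: 32Gi
pvcScn: standard  # assumes a StorageClass named 'standard' exists
EOF

helm install ipex-distributed ./chart --namespace kubeflow -f my-values.yaml
```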

Diff for: pytorch/chart/templates/pvc.yaml

+12
@@ -0,0 +1,12 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: {{ .Values.pvcName }}
+  namespace: {{ .Values.metadata.namespace }}
+spec:
+  storageClassName: {{ .Values.pvcScn }}
+  accessModes:
+    - "ReadWriteOnce"
+  resources:
+    requests:
+      storage: {{ .Values.pvcResources }}
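Since the chart ships `pvcScn` as the placeholder value `nil`, it is worth confirming that a real StorageClass is set and that the claim actually binds; a quick check, assuming the default `pvcName` of `ipex` and the `kubeflow` namespace:

```bash
# List available storage classes, then confirm the claim reaches Bound status
kubectl get storageclass
kubectl get pvc ipex --namespace kubeflow
```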

Diff for: pytorch/chart/templates/pytorchjob.yaml

+60
@@ -0,0 +1,60 @@
+apiVersion: "kubeflow.org/v1"
+kind: PyTorchJob
+metadata:
+  name: {{ .Values.metadata.name }}
+  namespace: {{ .Values.metadata.namespace }}
+spec:
+  pytorchReplicaSpecs:
+    Master:
+      replicas: 1
+      template:
+        spec:
+          containers:
+            - name: pytorch
+              image: "{{ .Values.imageName }}:{{ .Values.imageTag }}"
+              imagePullPolicy: Always
+              command:
+                - torchrun
+                - myscript.py
+              resources:
+                limits:
+                  cpu: {{ .Values.masterResources.cpu }}
+                  memory: {{ .Values.masterResources.memory }}
+              volumeMounts:
+                - name: dataset-dir
+                  mountPath: /tmp/output
+          volumes:
+            - name: dshm
+              emptyDir:
+                medium: Memory
+            - name: dataset-dir
+              persistentVolumeClaim:
+                claimName: {{ .Values.pvcName }}
+    Worker:
+      replicas: {{ .Values.workers }}
+      template:
+        spec:
+          containers:
+            - name: pytorch
+              image: "{{ .Values.imageName }}:{{ .Values.imageTag }}"
+              imagePullPolicy: Always
+              envFrom:
+                - configMapRef:
+                    name: intel-proxy-config
+              command:
+                - torchrun
+                - myscript.py
+              resources:
+                limits:
+                  cpu: {{ .Values.workerResources.cpu }}
+                  memory: {{ .Values.workerResources.memory }}
+              volumeMounts:
+                - name: dataset-dir
+                  mountPath: /tmp/output
+          volumes:
+            - name: dshm
+              emptyDir:
+                medium: Memory
+            - name: dataset-dir
+              persistentVolumeClaim:
+                claimName: {{ .Values.pvcName }}
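Because the image, command, and resource fields are templated, rendering the manifests locally before installing can catch mistakes early; a minimal sketch using `helm template` from the chart's parent directory (the image name and tag are placeholders):

```bash
# Render the PyTorchJob and PVC manifests locally without touching the cluster
helm template ipex-distributed ./chart \
  --namespace kubeflow \
  --set imageName=my-registry/my-image \
  --set imageTag=prod
```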

Diff for: pytorch/chart/values.yaml

+17
@@ -0,0 +1,17 @@
+metadata:
+  name: ipex-distributed
+  namespace: kubeflow
+
+imageName: intel/intel-optimized-pytorch
+imageTag: 2.0.0-pip-multinode
+masterResources:
+  cpu: 32
+  memory: 16Gi
+workerResources:
+  cpu: 32
+  memory: 16Gi
+workers: 4
+
+pvcName: ipex
+pvcScn: nil
+pvcResources: 2Gi

Diff for: tensorflow/Dockerfile

+2 -1
@@ -182,7 +182,8 @@ ARG HOROVOD_VERSION
 ARG HOROVOD_WITH_TENSORFLOW=1
 ARG HOROVOD_WITHOUT_MXNET=1
 ARG HOROVOD_WITHOUT_PYTORCH=1
-ARG ONECCL_VERSION
+ARG HOROVOD_WITHOUT_GLOO=1
+ARG HOROVOD_WITH_MPI=1
 
 RUN apt-get install -y --no-install-recommends --fix-missing \
     build-essential \
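With Gloo disabled and MPI enabled at build time, Horovod's `--check-build` summary is a quick way to confirm which frameworks and controllers the resulting image was compiled with; a sketch where the image tag is a placeholder:

```bash
# Print the frameworks and controllers Horovod was compiled with (expect MPI, not Gloo)
docker run --rm my-tf-image:latest horovodrun --check-build
```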

0 commit comments
