Commit ec0dc84

julienmancuso authored and Jason Zhou committed
feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (#2755)
Signed-off-by: Julien Mancuso <[email protected]> Signed-off-by: Jason Zhou <[email protected]>
1 parent 25ea1ca commit ec0dc84

17 files changed (+993, -185 lines changed)

deploy/cloud/helm/platform/Chart.yaml

Lines changed: 9 additions & 0 deletions
@@ -34,3 +34,12 @@ dependencies:
     version: 11.1.0
     repository: "https://charts.bitnami.com/bitnami"
     condition: etcd.enabled
+  - name: kai-scheduler
+    version: v0.8.1
+    repository: oci://ghcr.io/nvidia/kai-scheduler
+    condition: kai-scheduler.enabled
+  - name: grove-charts
+    alias: grove
+    version: v0.0.0-6e30275
+    repository: oci://ghcr.io/nvidia/grove
+    condition: grove.enabled
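
Both new dependencies are gated by their `condition` keys, so a values override along the following lines should pull them in. This is a minimal sketch rather than a file from the commit: the file name `values-override.yaml` is hypothetical, and both flags default to `false` according to the chart README added in this commit.

```yaml
# values-override.yaml (hypothetical file name, not part of this commit)
# Enables the two optional subcharts introduced by the Chart.yaml change.
kai-scheduler:
  enabled: true   # satisfies `condition: kai-scheduler.enabled`
grove:
  enabled: true   # satisfies `condition: grove.enabled`; `grove` is the alias of `grove-charts`
```

Because `grove-charts` is aliased to `grove`, its values live under the `grove` key rather than under `grove-charts`.
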
Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# dynamo-platform
+
+A Helm chart for NVIDIA Dynamo Platform.
+
+![Version: 0.5.0](https://img.shields.io/badge/Version-0.5.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square)
+
+## 🚀 Overview
+
+The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure on Kubernetes, including:
+
+- **Dynamo Operator**: Kubernetes operator for managing Dynamo deployments
+- **NATS**: High-performance messaging system for component communication
+- **etcd**: Distributed key-value store for operator state management
+- **Grove**: Multi-node inference orchestration (optional)
+- **Kai Scheduler**: Advanced workload scheduling (optional)
+
+## 📋 Prerequisites
+
+- Kubernetes cluster (v1.20+)
+- Helm 3.8+
+- Sufficient cluster resources for your deployment scale
+- Container registry access (if using private images)
+
+## 🔧 Configuration
+
+## Requirements
+
+| Repository | Name | Version |
+|------------|------|---------|
+| file://components/operator | dynamo-operator | 0.5.0 |
+| https://charts.bitnami.com/bitnami | etcd | 11.1.0 |
+| https://nats-io.github.io/k8s/helm/charts/ | nats | 1.3.2 |
+| oci://ghcr.io/nvidia/grove | grove(grove-charts) | v0.0.0-6e30275 |
+| oci://ghcr.io/nvidia/kai-scheduler | kai-scheduler | v0.8.1 |
+
+## Values
+
+| Key | Type | Default | Description |
+|-----|------|---------|-------------|
+| dynamo-operator.enabled | bool | `true` | Whether to enable the Dynamo Kubernetes operator deployment |
+| dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" |
+| dynamo-operator.etcdAddr | string | `""` | etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port" |
+| dynamo-operator.namespaceRestriction.enabled | bool | `true` | Whether to restrict operator to specific namespaces |
+| dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) |
+| dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods |
+| dynamo-operator.controllerManager.manager.image.repository | string | `"nvcr.io/nvidia/ai-dynamo/kubernetes-operator"` | Official NVIDIA Dynamo operator image repository |
+| dynamo-operator.controllerManager.manager.image.tag | string | `""` | Image tag (leave empty to use chart default) |
+| dynamo-operator.controllerManager.manager.image.pullPolicy | string | `"IfNotPresent"` | Image pull policy - when to pull the image |
+| dynamo-operator.controllerManager.manager.args[0] | string | `"--health-probe-bind-address=:8081"` | Health probe endpoint for Kubernetes health checks |
+| dynamo-operator.controllerManager.manager.args[1] | string | `"--metrics-bind-address=127.0.0.1:8080"` | Metrics endpoint for Prometheus scraping (localhost only for security) |
+| dynamo-operator.imagePullSecrets | list | `[]` | Secrets for pulling private container images |
+| dynamo-operator.dynamo.groveTerminationDelay | string | `"15m"` | How long to wait before forcefully terminating Grove instances |
+| dynamo-operator.dynamo.internalImages.debugger | string | `"python:3.12-slim"` | Debugger image for troubleshooting deployments |
+| dynamo-operator.dynamo.enableRestrictedSecurityContext | bool | `false` | Whether to enable restricted security contexts for enhanced security |
+| dynamo-operator.dynamo.dockerRegistry.useKubernetesSecret | bool | `false` | Whether to use Kubernetes secrets for registry authentication |
+| dynamo-operator.dynamo.dockerRegistry.server | string | `nil` | Docker registry server URL |
+| dynamo-operator.dynamo.dockerRegistry.username | string | `nil` | Registry username |
+| dynamo-operator.dynamo.dockerRegistry.password | string | `nil` | Registry password (consider using existingSecretName instead) |
+| dynamo-operator.dynamo.dockerRegistry.existingSecretName | string | `nil` | Name of existing Kubernetes secret containing registry credentials |
+| dynamo-operator.dynamo.dockerRegistry.secure | bool | `true` | Whether the registry uses HTTPS |
+| dynamo-operator.dynamo.ingress.enabled | bool | `false` | Whether to create ingress resources |
+| dynamo-operator.dynamo.ingress.className | string | `nil` | Ingress class name (e.g., "nginx", "traefik") |
+| dynamo-operator.dynamo.ingress.tlsSecretName | string | `"my-tls-secret"` | Secret name containing TLS certificates |
+| dynamo-operator.dynamo.istio.enabled | bool | `false` | Whether to enable Istio integration |
+| dynamo-operator.dynamo.istio.gateway | string | `nil` | Istio gateway name for routing |
+| dynamo-operator.dynamo.ingressHostSuffix | string | `""` | Host suffix for generated ingress hostnames |
+| dynamo-operator.dynamo.virtualServiceSupportsHTTPS | bool | `false` | Whether VirtualServices should support HTTPS routing |
+| grove.enabled | bool | `false` | Whether to enable Grove for multi-node inference coordination, if enabled, the Grove operator will be deployed cluster-wide |
+| kai-scheduler.enabled | bool | `false` | Whether to enable Kai Scheduler for intelligent resource allocation, if enabled, the Kai Scheduler operator will be deployed cluster-wide |
+| etcd.enabled | bool | `true` | Whether to enable etcd deployment, disable if you want to use an external etcd instance |
+| nats.enabled | bool | `true` | Whether to enable NATS deployment, disable if you want to use an external NATS instance |
+
+### NATS Configuration
+
+For detailed NATS configuration options beyond `nats.enabled`, please refer to the official NATS Helm chart documentation:
+**[NATS Helm Chart Documentation](https://github.com/nats-io/k8s/tree/main/helm/charts/nats)**
+
+### etcd Configuration
+
+For detailed etcd configuration options beyond `etcd.enabled`, please refer to the official Bitnami etcd Helm chart documentation:
+**[etcd Helm Chart Documentation](https://github.com/bitnami/charts/tree/main/bitnami/etcd)**
+
+## 📚 Additional Resources
+
+- [Dynamo Cloud Deployment Guide](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
+- [NATS Documentation](https://docs.nats.io/)
+- [etcd Documentation](https://etcd.io/docs/)
+- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
+
+----------------------------------------------
+Autogenerated from chart metadata using [helm-docs v1.14.2](https://github.com/norwoodj/helm-docs/releases/v1.14.2)
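
The values table in this README describes how to swap the bundled NATS and etcd charts for external instances. A values override for that case might look like the sketch below. The hostnames are hypothetical placeholders, and the address formats follow the `natsAddr` and `etcdAddr` descriptions in the table; 4222 and 2379 are the usual NATS and etcd client ports.

```yaml
# Hypothetical override: use external NATS and etcd instead of the bundled subcharts.
etcd:
  enabled: false          # do not deploy the bundled Bitnami etcd chart
nats:
  enabled: false          # do not deploy the bundled NATS chart
dynamo-operator:
  natsAddr: "nats://nats.example.internal:4222"   # placeholder host, format "nats://hostname:port"
  etcdAddr: "http://etcd.example.internal:2379"   # placeholder host, format "http://hostname:port"
```
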
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+{{ template "chart.header" . }}
+
+{{ template "chart.description" . }}
+
+{{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }}
+
+## 🚀 Overview
+
+The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure on Kubernetes, including:
+
+- **Dynamo Operator**: Kubernetes operator for managing Dynamo deployments
+- **NATS**: High-performance messaging system for component communication
+- **etcd**: Distributed key-value store for operator state management
+- **Grove**: Multi-node inference orchestration (optional)
+- **Kai Scheduler**: Advanced workload scheduling (optional)
+
+## 📋 Prerequisites
+
+- Kubernetes cluster (v1.20+)
+- Helm 3.8+
+- Sufficient cluster resources for your deployment scale
+- Container registry access (if using private images)
+
+## 🔧 Configuration
+
+{{ template "chart.requirementsSection" . }}
+
+{{ template "chart.valuesSection" . }}
+
+### NATS Configuration
+
+For detailed NATS configuration options beyond `nats.enabled`, please refer to the official NATS Helm chart documentation:
+**[NATS Helm Chart Documentation](https://github.com/nats-io/k8s/tree/main/helm/charts/nats)**
+
+### etcd Configuration
+
+For detailed etcd configuration options beyond `etcd.enabled`, please refer to the official Bitnami etcd Helm chart documentation:
+**[etcd Helm Chart Documentation](https://github.com/bitnami/charts/tree/main/bitnami/etcd)**
+
+
+## 📚 Additional Resources
+
+- [Dynamo Cloud Deployment Guide](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
+- [NATS Documentation](https://docs.nats.io/)
+- [etcd Documentation](https://etcd.io/docs/)
+- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
+
+{{ template "helm-docs.versionFooter" . }}

deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml

Lines changed: 3 additions & 3 deletions
@@ -491,7 +491,7 @@ subjects:
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole
 metadata:
-  name: {{ include "dynamo-operator.fullname" . }}-queue-reader
+  name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-queue-reader
   labels:
     app.kubernetes.io/component: rbac
     app.kubernetes.io/created-by: dynamo-operator
@@ -510,7 +510,7 @@ rules:
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
-  name: {{ include "dynamo-operator.fullname" . }}-queue-reader-binding
+  name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-queue-reader-binding
   labels:
     app.kubernetes.io/component: rbac
     app.kubernetes.io/created-by: dynamo-operator
@@ -519,7 +519,7 @@ metadata:
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: ClusterRole
-  name: {{ include "dynamo-operator.fullname" . }}-queue-reader
+  name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-queue-reader
 subjects:
 - kind: ServiceAccount
   name: '{{ include "dynamo-operator.fullname" . }}-controller-manager'
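
ClusterRole and ClusterRoleBinding objects are cluster-scoped, so one plausible reading of this rename is that it keeps two installs of the chart in different namespaces from rendering the same object name. The sketch below shows what the rendered names might look like. It assumes, purely for illustration, that `dynamo-operator.fullname` renders to `dynamo-platform-dynamo-operator` and that the releases live in namespaces `team-a` and `team-b`; rules and labels are omitted.

```yaml
# Rendered for a release installed in namespace team-a (hypothetical values)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dynamo-platform-dynamo-operator-team-a-queue-reader
---
# Rendered for the same release name installed in namespace team-b (hypothetical values)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dynamo-platform-dynamo-operator-team-b-queue-reader
```

Before the change, both installs would have rendered `dynamo-platform-dynamo-operator-queue-reader` under the same assumptions.
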
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+---
+{{- if .Capabilities.APIVersions.Has "scheduling.run.ai/v2" }}
+
+{{- /* Create parent queue first */ -}}
+{{- $defaultQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo-default" }}
+{{- if not $defaultQueue }}
+---
+apiVersion: scheduling.run.ai/v2
+kind: Queue
+metadata:
+  name: dynamo-default
+  annotations:
+    "helm.sh/hook": post-install,post-upgrade
+    "helm.sh/hook-weight": "100"
+    "helm.sh/hook-delete-policy": before-hook-creation
+spec:
+  resources:
+    cpu:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+    gpu:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+    memory:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+{{- end }}
+
+{{- /* Create child queue second */ -}}
+{{- $dynamoQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo" }}
+{{- if not $dynamoQueue }}
+---
+apiVersion: scheduling.run.ai/v2
+kind: Queue
+metadata:
+  name: dynamo
+  annotations:
+    "helm.sh/hook": post-install,post-upgrade
+    "helm.sh/hook-weight": "110"
+    "helm.sh/hook-delete-policy": before-hook-creation
+spec:
+  parentQueue: dynamo-default
+  resources:
+    cpu:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+    gpu:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+    memory:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+{{- end }}
+
+{{- end }}
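
The template above creates a parent queue `dynamo-default` and a child queue `dynamo`, ordered by hook weight (100 before 110). A workload would then presumably be submitted against the child queue. The pod below is a sketch under stated assumptions, not something taken from this commit: the label key `kai.scheduler/queue` and the scheduler name `kai-scheduler` follow the upstream Kai Scheduler quickstart as best recalled and may differ by version, and the pod name and image are hypothetical.

```yaml
# Hypothetical pod targeting the `dynamo` queue created by the template above.
apiVersion: v1
kind: Pod
metadata:
  name: queue-smoke-test              # hypothetical name
  labels:
    kai.scheduler/queue: dynamo       # assumed label key; check the Kai Scheduler docs for your version
spec:
  schedulerName: kai-scheduler        # assumed scheduler name from the upstream quickstart
  containers:
    - name: main
      image: ubuntu                   # placeholder image
      args: ["sleep", "infinity"]
```
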
