Gpu resource driver chart improvements #76
Changes from all commits: 3230139, e2ab5d8, 4722f57, 9574294, 9665cb4, 7bbd7e5, 63df78c
```diff
@@ -1,4 +1,6 @@
 apiVersion: v1
 kind: Namespace
 metadata:
-  name: intel-gpu-resource-driver
+  name: {{ include "intel-gpu-resource-driver.namespace" . }}
+  labels:
+    pod-security.kubernetes.io/enforce: privileged
```
Comment: Privileged namespace security is required to mount host path files, so if the chart manages the namespace, it should set the required permissions by default. Let me know if this should be templated instead; I personally don't think so, since it's a requirement for the pod to start.

Reply: Could you expand a bit on why, and where that applies?

Reply: See https://kubernetes.io/docs/concepts/security/pod-security-standards/

Pod Security Standards is the replacement for Pod Security Policy. It defines three standard security levels (privileged, baseline and restricted). When Pod Security Standards is enforced on a namespace, host path volume mounts are only allowed in the privileged profile. All volumes mounted by the Intel DRA driver are host path volumes:

```yaml
volumes:
- name: plugins-registry
  hostPath:
    path: /var/lib/kubelet/plugins_registry
- name: plugins
  hostPath:
    path: /var/lib/kubelet/plugins
- name: cdi
  hostPath:
    path: {{ .Values.cdi.staticPath }}
- name: varruncdi
  hostPath:
    path: {{ .Values.cdi.dynamicPath }}
- name: sysfs
  hostPath:
    path: /sys
```

So any cluster using Pod Security Standards (a default component of Kubernetes) will require this label, otherwise admission control will deny the daemonset.

Reply: Thanks, that's good to know! The label seems to be documented e.g. here: https://kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-namespace-labels/

EDIT: please resolve.
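For reference, an administrator pre-creating the namespace outside the chart would need to apply the same label themselves; a minimal manifest sketch (the namespace name here is assumed to match the chart default):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: intel-gpu-resource-driver
  labels:
    # Required so the pod-security admission controller allows
    # the driver's hostPath volume mounts.
    pod-security.kubernetes.io/enforce: privileged
```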
```diff
@@ -1,6 +1,6 @@
 # Default values for intel-gpu-resource-driver.
 nameOverride: ""
-namespaceOverride: "intel-gpu-resource-driver"
+namespaceOverride: ""
 fullnameOverride: ""
 selectorLabelsOverride: {}
```

Comment: Why are you removing the current default?

Reply: You have an existing helper function (`intel-gpu-resource-driver.namespace`). If you set a namespace override by default, this function is useless. The namespace override should only be set when you do not want the default; the default function is now used correctly.

Reply: At least you need to add the necessary Helm options to the README, so that users do not install it to an unintended namespace.
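The helper's body is not shown in this diff; a typical implementation of such a namespace helper in `templates/_helpers.tpl` looks like the following sketch (the actual body may differ):

```
{{/* Sketch: resolve the namespace, preferring an explicit override. */}}
{{- define "intel-gpu-resource-driver.namespace" -}}
{{- default .Release.Namespace .Values.namespaceOverride -}}
{{- end -}}
```

With `namespaceOverride: ""`, the chart then follows `helm install --namespace <ns>` instead of forcing a fixed namespace.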
```diff
@@ -14,14 +14,12 @@ image:
 serviceAccount:
   create: true
   annotations: {}
-  name: intel-gpu-resource-driver-service-account
+  name: ""
   automount: true

 kubeletPlugin:
   podAnnotations: {}
-  nodeSelector: {}
-  # label used when nfd.enabled is true
-  #intel.feature.node.kubernetes.io/gpu: "true"
+  nodeSelector: {} # ignored when .Values.nodeFeatureRules.enabled or .Values.nfd.enabled
   tolerations:
   - key: node-role.kubernetes.io/master
     operator: Exists
```

Comment: Why are you removing the current default?

Reply: Because you have a default helper function that is not being fully utilized. If you want, I can set the default in the helper function to the previous value. It doesn't really matter; the service account name is now fully generated by the template helpers in all usages.

Reply: Yes, updating the helper default would make this change clearer in the commit diff.
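For context, a service-account helper of the kind discussed above usually has the following shape (the helper and fullname template names here are hypothetical, following common Helm chart conventions):

```
{{/* Sketch: derive the service account name from the release
     unless an explicit name is set in values. */}}
{{- define "intel-gpu-resource-driver.serviceAccountName" -}}
{{- if .Values.serviceAccount.create -}}
{{- default (include "intel-gpu-resource-driver.fullname" .) .Values.serviceAccount.name -}}
{{- else -}}
{{- default "default" .Values.serviceAccount.name -}}
{{- end -}}
{{- end -}}
```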
```diff
@@ -37,6 +35,13 @@ kubeletPlugin:
     effect: "NoSchedule"
   affinity: {}

+cdi:
+  staticPath: /etc/cdi
+  dynamicPath: /var/run/cdi
```

Comment (on lines +38 to +40): My environment requires alternative paths for CDI, so I need to be able to configure the host paths.
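With these values exposed, relocating the CDI host paths becomes a plain values override; a sketch (the override paths below are examples, not defaults):

```yaml
# values-override.yaml: relocate the CDI hostPath mounts
cdi:
  staticPath: /opt/cdi
  dynamicPath: /tmp/cdi
```

Applied with e.g. `helm install ... -f values-override.yaml`, or the equivalent `--set cdi.staticPath=... --set cdi.dynamicPath=...`.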
```diff
+
+nodeFeatureRules:
+  enabled: false
+
 nfd:
   enabled: false # change to true to install NFD to the cluster
   nameOverride: intel-gpu-nfd
```

Comment (on lines +42 to 46): NFD settings should be under the same heading [1], so that everything NFD-related lives in one variable hierarchy.

[1] If Helm variable names are changed, they should be changed consistently for all 3 Intel DRA helm charts (gaudi, gpu, qat), not just the GPU one.

Reply: All values under the `nfd` key are passed through to the NFD helm chart, so I did not want to potentially break something by passing nonexistent values through to NFD. I added a new key for backwards compatibility with the previous behavior. Let me know if this is fine or if you want a breaking change. If you want to keep all three helm charts consistent, have you considered making them one chart?

Reply: Good point. Having 2 separate variable hierarchies for NFD is rather ugly though => @oxxenix, do you see problems in adding/changing variables under the `nfd` hierarchy?

Reply: Not really my decision, as I'm not involved with this project currently, except for doing a bit of review for the PRs. @byako, any comments on this?

Reply: I'm fine with this proposal.
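One possible nested layout for the proposal above (hypothetical; in this PR, `nodeFeatureRules` remains a separate top-level key, and nesting it would require filtering chart-local keys out of the values passed to the NFD subchart):

```yaml
nfd:
  enabled: false        # whether to install the NFD subchart
  nodeFeatureRules:
    enabled: true       # whether to deploy the chart's NodeFeatureRules
```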
Reply: So node feature rules can be deployed when you already have an existing NFD running in the cluster.
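Following the comment above, a cluster that already runs NFD would use values like this sketch (key names taken from the diff):

```yaml
# Cluster already runs NFD: skip the NFD subchart,
# but still deploy the GPU NodeFeatureRules.
nfd:
  enabled: false
nodeFeatureRules:
  enabled: true
```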