Gpu resource driver chart improvements #76
Conversation
    -  name: intel-gpu-resource-driver
    +  name: {{ include "intel-gpu-resource-driver.namespace" . }}
       labels:
         pod-security.kubernetes.io/enforce: privileged
Privileged namespace security is required to mount host path files, so if the chart manages the namespace it should set the required permissions by default. Let me know if this should be templated instead; I personally don't think so, since it's a requirement for the pod to start.
> Privileged namespace security is required to mount host path files

Could you expand a bit on why, and where that applies?
See https://kubernetes.io/docs/concepts/security/pod-security-standards/

Pod Security Standards is the replacement for Pod Security Policy. It defines three standard security levels (privileged, baseline, and restricted). When Pod Security Standards is enforced on a namespace, host path volume mounts are only allowed under the privileged profile.

All volumes mounted by the Intel DRA driver are host path volumes:

    volumes:
      - name: plugins-registry
        hostPath:
          path: /var/lib/kubelet/plugins_registry
      - name: plugins
        hostPath:
          path: /var/lib/kubelet/plugins
      - name: cdi
        hostPath:
          path: {{ .Values.cdi.staticPath }}
      - name: varruncdi
        hostPath:
          path: {{ .Values.cdi.dynamicPath }}
      - name: sysfs
        hostPath:
          path: /sys

So any cluster using Pod Security Standards (a default component of Kubernetes) will require this label; otherwise admission control will deny the daemonset.
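For anyone applying the same requirement to a pre-existing namespace, the equivalent imperative command would be something like this (the namespace name here is just the chart's default, adjust as needed):

```sh
kubectl label namespace intel-gpu-resource-driver \
  pod-security.kubernetes.io/enforce=privileged
```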
Thanks, that's good to know!

The label seems to be documented e.g. here: https://kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-namespace-labels/

EDIT: please resolve.
    @@ -1,4 +1,4 @@
    -{{- if .Values.nfd.enabled }}
    +{{- if or .Values.nodeFeatureRules.enabled .Values.nfd.enabled }}
This way the node feature rules can be deployed even when an existing NFD installation is already running in the cluster.
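A values sketch of that scenario, using the two keys introduced in this PR:

```yaml
# NFD is already installed cluster-wide, so deploy only the rules
nodeFeatureRules:
  enabled: true
nfd:
  enabled: false
```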
    cdi:
      staticPath: /etc/cdi
      dynamicPath: /var/run/cdi
My environment requires alternative paths for CDI, so I need the host paths to be configurable.
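For example, an override like this would now be possible (the paths below are purely illustrative):

```yaml
cdi:
  staticPath: /opt/cdi
  dynamicPath: /run/custom/cdi
```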
    @@ -1,6 +1,6 @@
     # Default values for intel-gpu-resource-driver.
     nameOverride: ""
    -namespaceOverride: "intel-gpu-resource-driver"
    +namespaceOverride: ""
Why are you removing the current default?
You have an existing helper function:

    {{- define "intel-gpu-resource-driver.namespace" -}}
    {{- default .Release.Namespace .Values.namespaceOverride }}
    {{- end }}

If a namespaceOverride is set by default, this function is useless; the override should only be used when you do not want the default. The default function is now used correctly.
At least the necessary Helm options need to be added to the README, so that users do not install into the default namespace by accident, see: https://github.com/intel/intel-resource-drivers-for-kubernetes/blob/main/doc/gpu/USAGE.md#helm-charts
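For illustration, the kind of install command the README would document; the chart reference here is a placeholder, the linked USAGE.md has the real one:

```sh
helm install intel-gpu-resource-driver <chart-reference> \
  --namespace intel-gpu-resource-driver \
  --create-namespace
```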
    @@ -14,14 +14,12 @@ image:
     serviceAccount:
       create: true
       annotations: {}
    -  name: intel-gpu-resource-driver-service-account
    +  name: ""
Why are you removing the current default?
Because you have a default helper function that you're not fully utilizing:

    {{- define "intel-gpu-resource-driver.serviceAccountName" -}}
    {{- if .Values.serviceAccount.create -}}
    {{- default "intel-gpu-sa" .Values.serviceAccount.name -}}
    {{- end -}}
    {{- end }}

If you want, I can set the default in the helper function to the previous value. It doesn't really matter; the service account name is now fully generated by the template helpers everywhere it is used.
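For context, a sketch of how such a helper is typically consumed in a pod spec; the exact template line in the chart may differ:

```yaml
spec:
  serviceAccountName: {{ include "intel-gpu-resource-driver.serviceAccountName" . }}
```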
Yes, updating the helper default would make this change clearer in the commit diff.
    nodeFeatureRules:
      enabled: false

    nfd:
      enabled: false # change to true to install NFD to the cluster
NFD settings should be under the same heading [1], e.g.:

    nfdRules:
      use: true
      install: true

Then you can use just:

    {{- if .Values.nfdRules.use }}

instead of:

    {{- if or .Values.nodeFeatureRules.enabled .Values.nfd.enabled }}

[1] If Helm variable names are changed, they should be changed consistently for all 3 Intel DRA Helm charts (gaudi, gpu, qat), not just the GPU one.
All values under the nfd key are passed through to the NFD Helm chart, so I did not want to potentially break something by passing nonexistent values through to NFD. I added a new key instead, for backwards compatibility with the previous behavior. Let me know if this is fine or if you want a breaking change.

If you want to keep all three helm charts consistent, have you considered making it one chart?
Good point. Having two separate variable hierarchies for NFD is rather ugly though => @oxxenix do you see problems in adding/changing variables under nfd?

> If you want to keep all three helm charts consistent, have you considered making it one chart?

Not really my decision, as I'm not involved with this project currently except for doing a bit of review on the PRs. @byako, any comments on this?
I'm fine with this proposal.
Thanks for the contribution!

FYI - we've migrated to the GitHub OCI registry for Helm charts: https://github.com/orgs/intel/packages?repo_name=intel-resource-drivers-for-kubernetes , and the chart source is moving (back to where it was originally) to the DRA drivers repo, so the charts in this repository are being deprecated. I can apply this PR to the Helm chart in the DRA drivers repo later this week when we do the release.
@byako just to confirm: is anything still needed from me at this point, or do you take care of any remaining changes when you move the chart? Happy to help, just unsure what's left.
@rouke-broersma nothing is needed, I can update the Helm doc section when pulling this into the repo.
The Helm chart contains a lot of hard-coded values, and in some places hard-coded values that actually have configurable counterparts elsewhere. I need these values to be configurable in my environment. Let me know if any changes are unwelcome.
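A schematic sketch of the pattern the PR applies throughout; the exact before/after lines in the chart may differ:

```yaml
# before: a fixed name baked into the chart
serviceAccountName: intel-gpu-resource-driver-service-account

# after: resolved through the existing helper, with a default
serviceAccountName: {{ include "intel-gpu-resource-driver.serviceAccountName" . }}
```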