Gpu resource driver chart improvements #76


Open · wants to merge 7 commits into main

Conversation

rouke-broersma

The Helm chart contains a lot of hard-coded values, and in some places hard-coded values that actually have configurable counterparts elsewhere. I need these values to be configurable in my environment. Let me know if any changes are unwelcome.

name: intel-gpu-resource-driver
name: {{ include "intel-gpu-resource-driver.namespace" . }}
labels:
pod-security.kubernetes.io/enforce: privileged
Author


The privileged namespace security level is required to mount host path volumes, so if the chart manages the namespace it should set the required label by default. Let me know if this should be templated instead; I personally don't think so, since it's a requirement for the pod to start.



Privileged namespace security is required to mount host path files

Could you expand a bit on why, where that applies?

Author


See https://kubernetes.io/docs/concepts/security/pod-security-standards/

Pod Security Standards is the replacement for Pod Security Policy. It defines three standard security levels (privileged, baseline and restricted). When Pod Security Standards are enforced on a namespace, hostPath volume mounts are only allowed under the privileged profile.

All volumes mounted by the Intel DRA driver are hostPath volumes:

      volumes:
      - name: plugins-registry
        hostPath:
          path: /var/lib/kubelet/plugins_registry
      - name: plugins
        hostPath:
          path: /var/lib/kubelet/plugins
      - name: cdi
        hostPath:
          path: {{ .Values.cdi.staticPath }}
      - name: varruncdi
        hostPath:
          path: {{ .Values.cdi.dynamicPath }}
      - name: sysfs
        hostPath:
          path: /sys

So any cluster using Pod Security Standards (a default component of Kubernetes) will require this label, otherwise admission control will reject the DaemonSet.
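For reference, this is roughly the namespace manifest the chart would render (a sketch; the literal name is taken from the diff above, while the actual chart templates the name via the namespace helper):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: intel-gpu-resource-driver
  labels:
    # Required so that Pod Security admission allows the hostPath volumes above.
    pod-security.kubernetes.io/enforce: privileged
```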


@eero-t eero-t Mar 31, 2025


Thanks, that's good to know!

The label seems to be documented e.g. here: https://kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-namespace-labels/

EDIT: please resolve.

@@ -1,4 +1,4 @@
{{- if .Values.nfd.enabled }}
{{- if or .Values.nodeFeatureRules.enabled .Values.nfd.enabled }}
Author


This way the node feature rules can be deployed even when you already have an existing NFD instance running in the cluster.
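A sketch of the values this enables (key names as introduced in this PR's diff), deploying only the rules against a pre-existing NFD installation:

```yaml
nodeFeatureRules:
  enabled: true   # deploy the GPU NodeFeatureRules
nfd:
  enabled: false  # do not install NFD itself; use the existing instance
```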

Comment on lines +38 to +40
cdi:
staticPath: /etc/cdi
dynamicPath: /var/run/cdi
Author


My environment requires alternative paths for CDI, so I need to be able to configure the host paths.
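For illustration, a values override using the new keys (the alternative paths below are hypothetical examples, not the actual paths from my environment):

```yaml
cdi:
  staticPath: /opt/cdi      # hypothetical non-default static CDI path
  dynamicPath: /run/my-cdi  # hypothetical non-default dynamic CDI path
```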

@mythi mythi requested a review from oxxenix March 31, 2025 05:04
oxxenix commented Mar 31, 2025

@byako, @eero-t PTAL

@oxxenix oxxenix removed the request for review from softwarerecipes March 31, 2025 09:01
@@ -1,6 +1,6 @@
# Default values for intel-gpu-resource-driver.
nameOverride: ""
namespaceOverride: "intel-gpu-resource-driver"
namespaceOverride: ""


Why are you removing the current default?

Author


You have an existing function:

{{- define "intel-gpu-resource-driver.namespace" -}}
{{- default .Release.Namespace .Values.namespaceOverride }}
{{- end }}

If you set namespaceOverride by default, this function is useless; a namespace override should only be set when you do not want the default.

With the override empty, the fallback to .Release.Namespace is now correctly used.
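To illustrate the two cases (a sketch, given the helper shown above):

```yaml
# Default: namespaceOverride is empty, so the helper falls back to
# .Release.Namespace, i.e. whatever namespace was passed to `helm install -n`.
namespaceOverride: ""

# Explicit override: manifests render into this namespace regardless of -n.
# namespaceOverride: "intel-gpu-resource-driver"
```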



At minimum you need to add the necessary Helm options to the README, so that users do not end up installing into the default namespace, see: https://github.com/intel/intel-resource-drivers-for-kubernetes/blob/main/doc/gpu/USAGE.md#helm-charts

@@ -14,14 +14,12 @@ image:
serviceAccount:
create: true
annotations: {}
name: intel-gpu-resource-driver-service-account
name: ""


Why are you removing the current default?

Author


Because you have a default function that you're not fully utilizing:

{{- define "intel-gpu-resource-driver.serviceAccountName" -}}
{{- if .Values.serviceAccount.create -}}
{{- default "intel-gpu-sa" .Values.serviceAccount.name -}}
{{- end -}}
{{- end }}

If you want, I can set the default in the helper function to the previous value. It doesn't really matter either way: the service account name is now generated through the template helper in all usages.



Yes, updating the helper default would make this change clearer in the commit diff.
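For example, the helper could carry the previous values.yaml default as its fallback (a sketch; the current fallback is "intel-gpu-sa", and the previous default name was "intel-gpu-resource-driver-service-account"):

```yaml
{{- define "intel-gpu-resource-driver.serviceAccountName" -}}
{{- if .Values.serviceAccount.create -}}
{{- default "intel-gpu-resource-driver-service-account" .Values.serviceAccount.name -}}
{{- end -}}
{{- end }}
```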

Comment on lines +42 to 46
nodeFeatureRules:
enabled: false

nfd:
enabled: false # change to true to install NFD to the cluster


NFD settings should be under the same heading[1], e.g.:

nfdRules:
  use: true
  install: true

Then you can use just:
{{- if .Values.nfdRules.use }}
instead of:
{{- if or .Values.nodeFeatureRules.enabled .Values.nfd.enabled }}


[1] If Helm variable names are changed, they should be changed consistently for all 3 Intel DRA helm charts (gaudi, gpu, qat), not just GPU one.

Author


You pass all values under the nfd key through to the NFD Helm chart, so I did not want to potentially break something by passing nonexistent values through to NFD. I added a new key for backwards compatibility with the previous behavior. Let me know if this is fine or if you want a breaking change.

If you want to keep all three helm charts consistent, have you considered making it one chart?



Good point. Having two separate variable hierarchies for NFD is rather ugly though => @oxxenix, do you see problems with adding/changing variables under nfd?

If you want to keep all three helm charts consistent, have you considered making it one chart?

Not really my decision as I'm not involved with this project currently, except for doing a bit of review for the PRs. @byako, any comments on this?



I'm fine with this proposal.


@byako byako left a comment


Thanks for the contribution!
FYI - we've migrated to the GitHub OCI registry for Helm charts: https://github.com/orgs/intel/packages?repo_name=intel-resource-drivers-for-kubernetes , and the chart source is moving (back to where it was originally) to the DRA drivers repo, so the charts in this repository are being deprecated. I can apply this PR to the Helm chart in the DRA drivers repo later this week when we do the release.


@rouke-broersma
Author

@byako just to confirm: is anything still needed from me at this point, or do you take care of any remaining changes when you move the chart? Happy to help, just unsure what's left.


byako commented Apr 1, 2025

@rouke-broersma nothing is needed, I can update the Helm doc section when pulling this into the repo.
