GPU resource driver chart improvements #76


Merged
@@ -1,4 +1,4 @@
-{{- if .Values.nfd.enabled }}
+{{- if or .Values.nodeFeatureRules.enabled .Values.nfd.enabled }}
 apiVersion: nfd.k8s-sigs.io/v1alpha1
 kind: NodeFeatureRule
 metadata:

Contributor Author:

So that node feature rules can be deployed when you already have an existing NFD instance running in the cluster.
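
For illustration, a sketch of the values combination this enables (both keys appear in this PR's values.yaml): deploy the chart's NodeFeatureRule objects while relying on an NFD instance that is already installed:

nodeFeatureRules:
  enabled: true    # deploy the chart's NodeFeatureRule objects
nfd:
  enabled: false   # do not install NFD; an existing instance is already running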
@@ -1,4 +1,6 @@
 apiVersion: v1
 kind: Namespace
 metadata:
-  name: intel-gpu-resource-driver
+  name: {{ include "intel-gpu-resource-driver.namespace" . }}
+  labels:
+    pod-security.kubernetes.io/enforce: privileged
Contributor Author:

The privileged Pod Security level is required to mount hostPath files, so if the chart manages the namespace it should set the required permissions by default. Let me know if this should be templated instead; I personally don't think so, since it's a requirement for the pod to start.

Comment:

> Privileged namespace security is required to mount host path files

Could you expand a bit on why, and where that applies?

Contributor Author:

See https://kubernetes.io/docs/concepts/security/pod-security-standards/

Pod Security Standards is the replacement for Pod Security Policy. It defines three standard security levels (privileged, baseline, and restricted). When Pod Security Standards enforcement is enabled on a namespace, hostPath volume mounts are only allowed under the privileged profile.

All volumes mounted by the Intel DRA driver are hostPath volumes:

      volumes:
      - name: plugins-registry
        hostPath:
          path: /var/lib/kubelet/plugins_registry
      - name: plugins
        hostPath:
          path: /var/lib/kubelet/plugins
      - name: cdi
        hostPath:
          path: {{ .Values.cdi.staticPath }}
      - name: varruncdi
        hostPath:
          path: {{ .Values.cdi.dynamicPath}}
      - name: sysfs
        hostPath:
          path: /sys

So any cluster using Pod Security Standards (a default component of Kubernetes) will require this label; otherwise admission control will deny the DaemonSet.
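
As an aside (not part of this PR), the same enforcement label can be applied by hand to a pre-existing namespace when the chart does not manage it; the namespace name below assumes the chart's default:

kubectl label namespace intel-gpu-resource-driver \
  pod-security.kubernetes.io/enforce=privileged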

@eero-t (Mar 31, 2025):
Thanks, that's good to know!

The label seems to be documented e.g. here: https://kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-namespace-labels/

EDIT: please resolve.

@@ -14,7 +14,6 @@ spec:
       labels:
         app: intel-gpu-resource-driver
     spec:
-      serviceAccount: intel-gpu-resource-driver-service-account
       serviceAccountName: {{ include "intel-gpu-resource-driver.serviceAccountName" . }}
       containers:
       - name: kubelet-plugin

@@ -62,18 +61,18 @@ spec:
           path: /var/lib/kubelet/plugins
       - name: cdi
         hostPath:
-          path: /etc/cdi
+          path: {{ .Values.cdi.staticPath }}
       - name: varruncdi
         hostPath:
-          path: /var/run/cdi
+          path: {{ .Values.cdi.dynamicPath}}
       - name: sysfs
         hostPath:
           path: /sys
       {{- with .Values.kubeletPlugin.tolerations }}
       tolerations:
       {{- toYaml . | nindent 8 }}
       {{- end }}
-      {{- if .Values.nfd.enabled }}
+      {{- if or .Values.nodeFeatureRules.enabled .Values.nfd.enabled }}
       nodeSelector:
         intel.feature.node.kubernetes.io/gpu: "true"
       {{- else }}
@@ -13,7 +13,7 @@ spec:
   matchConditions:
     - name: isRestrictedUser
       expression: >-
-        request.userInfo.username == "system:serviceaccount:intel-gpu-resource-driver:intel-gpu-resource-driver-service-account"
+        request.userInfo.username == "system:serviceaccount:{{ include "intel-gpu-resource-driver.namespace" . }}:{{ include "intel-gpu-resource-driver.serviceAccountName" . }}"
   variables:
     - name: userNodeName
       expression: >-
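
A quick way to sanity-check the rendered expression (the release name, namespace, and service account below are made-up examples):

helm template my-release ./charts/intel-gpu-resource-driver \
  --namespace my-ns --set serviceAccount.name=my-sa \
  | grep 'system:serviceaccount'
# expected output contains: "system:serviceaccount:my-ns:my-sa"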
15 changes: 10 additions & 5 deletions charts/intel-gpu-resource-driver/values.yaml
@@ -1,6 +1,6 @@
 # Default values for intel-gpu-resource-driver.
 nameOverride: ""
-namespaceOverride: "intel-gpu-resource-driver"
+namespaceOverride: ""
Comment:

Why are you removing the current default?

Contributor Author:

You have an existing helper function:

{{- define "intel-gpu-resource-driver.namespace" -}}
{{- default .Release.Namespace .Values.namespaceOverride }}
{{- end }}

If you set namespaceOverride by default, this function is useless; the override should only be set when you do not want the release namespace. With this change, the helper's default is actually used.
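
To make the resolution concrete, a sketch (release and namespace names are made up):

# namespaceOverride empty: .Release.Namespace is used
helm install my-release ./charts/intel-gpu-resource-driver -n my-ns --create-namespace
# -> namespace renders as "my-ns"

# namespaceOverride set: it takes precedence
helm install my-release ./charts/intel-gpu-resource-driver -n my-ns --set namespaceOverride=other-ns
# -> namespace renders as "other-ns"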

Comment:

At a minimum, you need to add the necessary Helm options to the README so that users do not install into the default namespace by default; see: https://github.com/intel/intel-resource-drivers-for-kubernetes/blob/main/doc/gpu/USAGE.md#helm-charts

 fullnameOverride: ""
 selectorLabelsOverride: {}

@@ -14,14 +14,12 @@ image:
 serviceAccount:
   create: true
   annotations: {}
-  name: intel-gpu-resource-driver-service-account
+  name: ""
Comment:

Why are you removing the current default?

Contributor Author:

Because you have a default helper function that you're not fully utilizing:

{{- define "intel-gpu-resource-driver.serviceAccountName" -}}
{{- if .Values.serviceAccount.create -}}
{{- default "intel-gpu-sa" .Values.serviceAccount.name -}}
{{- end -}}
{{- end }}

If you want, I can set the default in the helper function to the previous value. It doesn't really matter; the service account name is now fully generated through the template helpers everywhere it is used.
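
For anyone who wants the previous name back, a values override sketch (key names as in this chart):

serviceAccount:
  create: true
  # restores the old default; when left empty, the helper falls back to "intel-gpu-sa"
  name: intel-gpu-resource-driver-service-account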

Comment:

Yes, updating the helper default would make this change clearer in the commit diff.

   automount: true

 kubeletPlugin:
   podAnnotations: {}
-  nodeSelector: {}
-  # label used when nfd.enabled is true
-  #intel.feature.node.kubernetes.io/gpu: "true"
+  nodeSelector: {} # ignored when .Values.nodeFeatureRules.enabled or .Values.nfd.enabled
   tolerations:
   - key: node-role.kubernetes.io/master
     operator: Exists

@@ -37,6 +35,13 @@ kubeletPlugin:
     effect: "NoSchedule"
   affinity: {}

+cdi:
+  staticPath: /etc/cdi
+  dynamicPath: /var/run/cdi
Contributor Author (on lines +38 to +40):

My environment requires alternative paths for CDI, so I need to be able to configure the host paths.
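
For example, a values override sketch (the alternative paths below are hypothetical):

cdi:
  staticPath: /opt/custom/cdi        # hypothetical static CDI spec directory
  dynamicPath: /opt/custom/run/cdi   # hypothetical runtime CDI spec directory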


+nodeFeatureRules:
+  enabled: false
+
 nfd:
   enabled: false # change to true to install NFD to the cluster
Comment (on lines +42 to 46):

NFD settings should be under the same heading [1], e.g.:

nfdRules:
  use: true
  install: true

Then you can use just:

{{- if .Values.nfdRules.use }}

instead of:

{{- if or .Values.nodeFeatureRules.enabled .Values.nfd.enabled }}

[1] If Helm variable names are changed, they should be changed consistently for all 3 Intel DRA Helm charts (Gaudi, GPU, QAT), not just the GPU one.

Contributor Author:

You pass all values under the nfd key straight through to the NFD Helm chart, so I did not want to risk breaking something by passing nonexistent values through to NFD. I added a new key to stay backwards compatible with the previous behavior. Let me know if this is fine or if you would prefer a breaking change.

If you want to keep all three Helm charts consistent, have you considered making it one chart?

Comment:

Good point. Having two separate variable hierarchies for NFD is rather ugly though => @oxxenix, do you see problems with adding/changing variables under nfd?

> If you want to keep all three Helm charts consistent, have you considered making it one chart?

Not really my decision, as I'm not currently involved with this project beyond doing a bit of review on the PRs. @byako, any comments on this?

Comment:

I'm fine with this proposal.

   nameOverride: intel-gpu-nfd