Add DRA support for GPU pod eviction during driver upgrades #129

karthikvetrivel wants to merge 1 commit into main
Conversation
Don't we cordon the node before starting the upgrade? If the node is cordoned, then there won't be new allocations to that node.
I think you're right here. Good point, thanks for bringing it up.
cdesiniotis left a comment
I didn't review in great detail, but this looks reasonable to me. A couple of things to consider:
- Do we want to merge this change (and get it included in a k8s-driver-manager / gpu-operator release) before the DRA driver is integrated with the gpu-operator? I believe the answer is yes, since in many cases users will install the DRA driver alongside the GPU Operator (until they are integrated). @shivamerla do you have any contradicting opinions on this?
- We will need to make a similar change in the gpu-operator itself. By default, the driver-upgrade state machine (and therefore the GPU pod evictions) is handled by our driver upgrade controller that runs in the gpu-operator. We will need to update this line https://github.com/NVIDIA/gpu-operator/blob/51dd7a28cd86fedde8c4daad65c2643582fa4615/cmd/gpu-operator/main.go#L176 to pass in a modified GPU pod filter (one that accounts for pods requesting GPUs via DRA) when constructing the driver upgrade controller. A rough sketch of such a filter follows.
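For illustration only, a minimal sketch of what a DRA-aware pod filter could look like. The package, function name, and filter signature are assumptions for this sketch, not the gpu-operator's actual API:

```go
package upgrade

import (
	corev1 "k8s.io/api/core/v1"
)

// gpuPodFilter is an illustrative stand-in for the filter passed to the driver
// upgrade controller; the real signature and wiring in the gpu-operator may differ.
func gpuPodFilter(pod corev1.Pod) bool {
	// Existing behavior: match pods requesting the nvidia.com/gpu extended resource.
	for _, container := range pod.Spec.Containers {
		if _, ok := container.Resources.Limits["nvidia.com/gpu"]; ok {
			return true
		}
	}
	// DRA-aware addition: also match pods that reference ResourceClaims.
	// Whether a given claim is actually backed by gpu.nvidia.com would be
	// resolved against a ResourceClaim cache like the one introduced in this PR.
	return len(pod.Spec.ResourceClaims) > 0
}
```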
9c7ed23 to
fc6bd1f
Compare
internal/kubernetes/client.go (Outdated)

```go
var claim *resourcev1.ResourceClaim
var lastError error
_ = wait.PollUntilContextTimeout(c.ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
```
I don't understand why we are not consuming the error returned by wait.PollUntilContextTimeout here?
You're right, we should. The error handling here became a bit convoluted through the refactors. I will update this once we decide how to search and clean up the claims (claims --> pods vs. pods --> claims).
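As a rough illustration of consuming that error, here is a sketch under assumptions: a typed clientset exposing `ResourceV1()`, and a hypothetical `waitForClaim` helper. It is not the actual code in this PR:

```go
package kubernetes

import (
	"context"
	"fmt"
	"time"

	resourcev1 "k8s.io/api/resource/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	clientset "k8s.io/client-go/kubernetes"
)

// waitForClaim is a sketch only: the clientset accessor and surrounding lookup
// are assumptions, not the exact code in this PR.
func waitForClaim(ctx context.Context, cs clientset.Interface, namespace, name string, timeout time.Duration) (*resourcev1.ResourceClaim, error) {
	var claim *resourcev1.ResourceClaim
	var lastError error
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		got, getErr := cs.ResourceV1().ResourceClaims(namespace).Get(ctx, name, metav1.GetOptions{})
		if getErr != nil {
			lastError = getErr
			return false, nil // keep polling on transient errors
		}
		claim = got
		return true, nil
	})
	if err != nil {
		// Consume the poll error and surface the last underlying failure with it.
		return nil, fmt.Errorf("waiting for ResourceClaim %s/%s: %w (last error: %v)", namespace, name, err, lastError)
	}
	return claim, nil
}
```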
There isn't much detail on best practices for cleaning up claims managed by a DRA driver beyond this two-liner. But what Kevin was saying in the meeting makes sense in terms of iterating over all of them; I was curious to look at it from the extended resources perspective.
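To illustrate the "claims --> pods" direction discussed in this thread, a sketch that lists ResourceClaims and collects the pod UIDs holding allocated NVIDIA claims via `status.reservedFor`. The function name and the `ResourceV1()` clientset accessor are assumptions, not code from this PR:

```go
package kubernetes

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	clientset "k8s.io/client-go/kubernetes"
)

// podsWithNvidiaClaims sketches the "claims --> pods" direction: list all
// ResourceClaims, keep those allocated by gpu.nvidia.com, and collect the pod
// UIDs that have reserved them.
func podsWithNvidiaClaims(ctx context.Context, cs clientset.Interface) (map[types.UID]struct{}, error) {
	podUIDs := map[types.UID]struct{}{}
	claims, err := cs.ResourceV1().ResourceClaims(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for _, claim := range claims.Items {
		if claim.Status.Allocation == nil {
			continue // claim not allocated to any device yet
		}
		allocatedByNvidia := false
		for _, result := range claim.Status.Allocation.Devices.Results {
			if result.Driver == "gpu.nvidia.com" {
				allocatedByNvidia = true
				break
			}
		}
		if !allocatedByNvidia {
			continue
		}
		// status.reservedFor lists the consumers (pods) currently holding this claim.
		for _, ref := range claim.Status.ReservedFor {
			if ref.Resource == "pods" {
				podUIDs[ref.UID] = struct{}{}
			}
		}
	}
	return podUIDs, nil
}
```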
Signed-off-by: Karthik Vetrivel <[email protected]>
@karthikvetrivel, are we considering this for the next release?
For now, let's keep this PR around. It may be a part of our initiative to incorporate DRA into the operator. Thanks for taking a look.
Description
Extends the driver-upgrade controller to detect and evict GPU workloads using Dynamic Resource Allocation (DRA) in addition to traditional `nvidia.com/gpu` resources. This ensures GPU driver upgrades work correctly as Kubernetes transitions from device plugins to the DRA model (GA in K8s 1.34+).

Changes
- `internal/kubernetes/claim_cache.go` (new): Implements `ResourceClaimCache`, which watches `ResourceClaim` objects and maintains a map of pod UIDs with allocated NVIDIA GPU claims. Uses informers, giving O(1) pod UID lookups.
- `internal/kubernetes/client.go`: Adds `claimCache` to the `Client` struct and updates `podUsesGPU()` to check both traditional resources and DRA ResourceClaims. A rough sketch of the combined check is shown below.
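A rough sketch of the combined check described above, with hypothetical names (`gpuClaimLookup` and `PodHasGPUClaim` are stand-ins for the cache added in this PR, not its actual API):

```go
package kubernetes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// gpuClaimLookup stands in for the ResourceClaimCache added in this PR; the
// method name is hypothetical.
type gpuClaimLookup interface {
	PodHasGPUClaim(uid types.UID) bool
}

// Client is reduced here to the single field needed for this sketch.
type Client struct {
	claimCache gpuClaimLookup
}

// podUsesGPU sketches the combined check: traditional nvidia.com/gpu requests
// plus pods with allocated NVIDIA GPU ResourceClaims (DRA).
func (c *Client) podUsesGPU(pod *corev1.Pod) bool {
	for _, container := range pod.Spec.Containers {
		if _, ok := container.Resources.Limits["nvidia.com/gpu"]; ok {
			return true
		}
	}
	return c.claimCache.PodHasGPUClaim(pod.UID)
}
```

Keeping the DRA check behind a cache lookup avoids listing ResourceClaims from the API server on every eviction pass.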
Testing

Tested in a kubeadm cluster (K8s 1.34) with the NVIDIA DRA driver installed:
Created test workloads:
- A pod with a GPU `ResourceClaim` (driver: `gpu.nvidia.com`)

Verified ResourceClaim allocation:

```
$ kubectl get resourceclaim -n default dra-gpu-claim -o yaml
status:
  allocation:
    devices:
      results:
      - driver: gpu.nvidia.com
        device: gpu-0
        pool: ipp1-0744
  reservedFor:
  - name: dra-allocated-pod
    resource: pods
```

Verified ResourceClaim cache synced:
Triggered driver upgrade eviction:
Verified DRA pod evicted successfully: