fix(pod-group-controller): skip DRA ResourceClaim lookup for terminal pods #1554
Conversation
… pods `GetPodMetadata` fetched per-pod ResourceClaims unconditionally, even for pods in Succeeded/Failed phases. The DRA driver removes those claims as soon as the pod reaches a terminal phase, so the lookup always failed with NotFound and produced spurious ERROR logs on every reconcile until the pod was finally deleted.

Add an early return for terminal pods, mirroring the scheduler-side fix in kai-scheduler#1456.

Fixes kai-scheduler#1529

Signed-off-by: David Gang <[email protected]>
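A minimal sketch of the early return described above. Stand-in types are used in place of the real `k8s.io/api/core/v1` and kai-scheduler types so the sketch is self-contained; `getPodMetadata` and the metadata fields are simplified assumptions, and only the `isTerminalPod` helper name is taken from this PR.

```go
package main

import "fmt"

// Stand-ins for k8s.io/api/core/v1 types, so the sketch compiles on its own.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

type PodStatus struct{ Phase PodPhase }
type Pod struct{ Status PodStatus }

// PodMetadata is a simplified stand-in for the controller's metadata struct.
type PodMetadata struct {
	RequestedResources map[string]int64
	AllocatedResources map[string]int64
}

// isTerminalPod reports whether the pod has reached a terminal phase. The
// DRA driver deletes per-pod ResourceClaims at that point, so any further
// lookup would fail with NotFound.
func isTerminalPod(pod *Pod) bool {
	return pod.Status.Phase == PodSucceeded || pod.Status.Phase == PodFailed
}

// getPodMetadata sketches the phase guard: terminal pods skip the claim
// lookup entirely and return empty metadata, matching what the downstream
// isActivePod / isAllocatedPod guards would have produced anyway.
func getPodMetadata(pod *Pod) (PodMetadata, error) {
	if isTerminalPod(pod) {
		return PodMetadata{}, nil // early return: no ResourceClaim fetch
	}
	// ... for active pods, the real code fetches per-pod ResourceClaims here ...
	return PodMetadata{}, nil
}

func main() {
	done := &Pod{Status: PodStatus{Phase: PodSucceeded}}
	md, err := getPodMetadata(done)
	fmt.Println(isTerminalPod(done), err == nil, len(md.RequestedResources))
}
```

The guard sits ahead of any claim lookup, so terminal pods never touch the API server for claims that the DRA driver has already deleted.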
The failing E2E Upgrade Tests job is unrelated to this PR.
Merging this branch will increase overall coverage.

Note: the "Total", "Covered", and "Missed" counts refer to code statements rather than lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Description
`GetPodMetadata` in the `podgroupcontroller` called `FetchPodResourceClaims` unconditionally on every reconcile, including for pods in Succeeded/Failed phases. The DRA driver removes per-pod `ResourceClaim` objects as soon as a pod reaches a terminal phase (independent of pod deletion), so the lookup always failed with NotFound and produced spurious ERROR logs on every reconcile cycle until the pod object itself was eventually cleaned up, which can take an indefinite amount of time when `Job.spec.ttlSecondsAfterFinished` is unset.

The fix adds a phase guard at the top of `GetPodMetadata`. Terminal pods skip the ResourceClaim lookup and return empty `RequestedResources`/`AllocatedResources`, which is what the existing `isActivePod`/`isAllocatedPod` guards downstream would have produced anyway.

This is the same root cause as #1455 (fixed scheduler-side in #1456), but the `podgroupcontroller` metadata path was missed at the time.

Related Issues
Fixes #1529
Checklist
Breaking Changes
None.
Additional Notes
Tests
- `TestIsTerminalPod`: table-driven check across the four phases, matching the style of the existing `TestIsActivePod`/`TestIsPodAllocated`.
- `TestGetPodMetadata_TerminalPodSkipsResourceClaimLookup`: integration-style test with a `fake.ClientBuilder` and a Succeeded/Failed pod referencing a missing `ResourceClaim`. Verified to fail (nil-deref panic from `FetchPodResourceClaims`) without the fix and pass with it.

Validation
- `go test ./pkg/podgroupcontroller/...` ✅
- `go vet ./pkg/podgroupcontroller/...` ✅
- `gofmt -l pkg/podgroupcontroller/controllers/metadata`: clean