
fix(pod-group-controller): skip DRA ResourceClaim lookup for terminal pods #1554

Open
david-gang wants to merge 1 commit into kai-scheduler:main from david-gang:dgang/issue-1529-skip-dra-claims-for-terminal-pods

Conversation

@david-gang
Contributor

Description

GetPodMetadata in the podgroupcontroller called FetchPodResourceClaims unconditionally on every reconcile, including for pods in Succeeded/Failed phases. The DRA driver removes per-pod ResourceClaim objects as soon as a pod reaches a terminal phase (independent of pod deletion), so the lookup always failed with NotFound and produced spurious ERROR logs on every reconcile cycle until the pod object itself was eventually cleaned up — which can take an indefinite amount of time when Job.spec.ttlSecondsAfterFinished is unset.

ERROR  Failed to calculate metadata for pod <ns>/<pod>
{"error": "failed to get resource claim <ns>/<claim> for pod <ns>/<pod>: ResourceClaim.resource.k8s.io \"<claim>\" not found"}

The fix adds a phase guard at the top of GetPodMetadata. Terminal pods skip the ResourceClaim lookup and return empty RequestedResources/AllocatedResources — which is what the existing isActivePod/isAllocatedPod guards downstream would have produced anyway.
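A minimal sketch of the guard, assuming an isTerminalPod helper in the style of the existing isActivePod/isAllocatedPod helpers (the PodMetadata shape and the GetPodMetadata signature below are illustrative, not the controller's exact ones):

```go
package metadata

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// PodMetadata is a stand-in for the controller's real metadata struct;
// the field names follow the PR description.
type PodMetadata struct {
	RequestedResources map[v1.ResourceName]int64
	AllocatedResources map[v1.ResourceName]int64
}

// isTerminalPod reports whether the pod has reached a terminal phase.
// The DRA driver removes the per-pod ResourceClaims at that point, so
// any later lookup fails with NotFound.
func isTerminalPod(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed
}

// GetPodMetadata (illustrative signature): the guard short-circuits
// before FetchPodResourceClaims is ever called for a terminal pod.
func GetPodMetadata(ctx context.Context, pod *v1.Pod) (PodMetadata, error) {
	if isTerminalPod(pod) {
		// Empty resources: what the downstream isActivePod/isAllocatedPod
		// guards would have produced anyway.
		return PodMetadata{}, nil
	}
	// ... unchanged path: FetchPodResourceClaims, then resource accounting ...
	return PodMetadata{}, nil
}
```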

This is the same root cause as #1455 (fixed scheduler-side in #1456), but the podgroupcontroller metadata path was missed at the time.

Related Issues

Fixes #1529

Checklist

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

None.

Additional Notes

Tests

  • TestIsTerminalPod — table-driven check across the four phases, matching the style of the existing TestIsActivePod/TestIsPodAllocated (sketched after this list).
  • TestGetPodMetadata_TerminalPodSkipsResourceClaimLookup — integration-style test with a fake.ClientBuilder and a Succeeded/Failed pod referencing a missing ResourceClaim. Verified to fail (nil-deref panic from FetchPodResourceClaims) without the fix and pass with it.
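A hypothetical sketch of the first test (the fake-client integration test is omitted here); the real pod_test.go follows the existing TestIsActivePod/TestIsPodAllocated style and may differ in details:

```go
package metadata

import (
	"testing"

	v1 "k8s.io/api/core/v1"
)

// Table-driven check of isTerminalPod across the four pod phases.
func TestIsTerminalPod(t *testing.T) {
	cases := []struct {
		phase v1.PodPhase
		want  bool
	}{
		{v1.PodPending, false},
		{v1.PodRunning, false},
		{v1.PodSucceeded, true},
		{v1.PodFailed, true},
	}
	for _, tc := range cases {
		pod := &v1.Pod{Status: v1.PodStatus{Phase: tc.phase}}
		if got := isTerminalPod(pod); got != tc.want {
			t.Errorf("isTerminalPod(%s) = %v, want %v", tc.phase, got, tc.want)
		}
	}
}
```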

Validation

  • go test ./pkg/podgroupcontroller/...
  • go vet ./pkg/podgroupcontroller/...
  • gofmt -l pkg/podgroupcontroller/controllers/metadata/ (clean: no files listed)

… pods

GetPodMetadata fetched per-pod ResourceClaims unconditionally, even for
pods in Succeeded/Failed phases. The DRA driver removes those claims as
soon as the pod reaches a terminal phase, so the lookup always failed
with NotFound and produced spurious ERROR logs on every reconcile until
the pod was finally deleted.

Add an early return for terminal pods, mirroring the scheduler-side fix
in kai-scheduler#1456.

Fixes kai-scheduler#1529

Signed-off-by: David Gang <[email protected]>
@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@david-gang
Contributor Author

The failing E2E Upgrade Tests job is unrelated to this PR.

  • Helm upgrade itself succeeds (REVISION 2, status: deployed). The timeout happens afterward, waiting for KAIConfig.Status.Available=true at test/e2e/modules/wait/kai_config.go:41.
  • This PR only touches pkg/podgroupcontroller/controllers/metadata/pod.go (a phase guard in GetPodMetadata) — no path to KAIConfig reconciliation or operator startup.
  • An in-flight chart fix ("fix(chart): stop recreating kai-config CR on helm upgrade") is currently iterating on this exact area on main, indicating the upgrade flow is broken independently of this change.
  • All other jobs passed: unit tests, validate, build, regular E2E, FOSSA, coverage.

@github-actions

github-actions Bot commented May 7, 2026

Merging this branch will increase overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/kai-scheduler/KAI-scheduler/pkg/podgroupcontroller/controllers/metadata | 23.53% (+6.86%) | 👍 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/kai-scheduler/KAI-scheduler/pkg/podgroupcontroller/controllers/metadata/pod.go | 25.00% (+7.22%) | 48 (+3) | 12 (+4) | 36 (-1) | 👍 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/kai-scheduler/KAI-scheduler/pkg/podgroupcontroller/controllers/metadata/pod_test.go
