
fix(pod-group-controller): skip DRA ResourceClaim lookup for terminal pods #1554

Open
david-gang wants to merge 1 commit into kai-scheduler:main from david-gang:dgang/issue-1529-skip-dra-claims-for-terminal-pods

Conversation

@david-gang
Contributor

Description

GetPodMetadata in the podgroupcontroller called FetchPodResourceClaims unconditionally on every reconcile, including for pods in Succeeded/Failed phases. The DRA driver removes per-pod ResourceClaim objects as soon as a pod reaches a terminal phase (independent of pod deletion), so the lookup always failed with NotFound and produced spurious ERROR logs on every reconcile cycle until the pod object itself was eventually cleaned up — which can take an indefinite amount of time when Job.spec.ttlSecondsAfterFinished is unset.

ERROR  Failed to calculate metadata for pod <ns>/<pod>
{"error": "failed to get resource claim <ns>/<claim> for pod <ns>/<pod>: ResourceClaim.resource.k8s.io \"<claim>\" not found"}

The fix adds a phase guard at the top of GetPodMetadata. Terminal pods skip the ResourceClaim lookup and return empty RequestedResources/AllocatedResources — which is what the existing isActivePod/isAllocatedPod guards downstream would have produced anyway.
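A minimal sketch of the guard, assuming an isTerminalPod helper in the style of the existing isActivePod/isAllocatedPod helpers (the PodMetadata shape and the GetPodMetadata signature below are illustrative, not the controller's exact ones):

```go
package metadata

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// PodMetadata is a stand-in for the controller's real metadata struct;
// the field names follow the PR description.
type PodMetadata struct {
	RequestedResources map[v1.ResourceName]int64
	AllocatedResources map[v1.ResourceName]int64
}

// isTerminalPod reports whether the pod has reached a terminal phase.
// The DRA driver removes the per-pod ResourceClaims at that point, so
// any later lookup fails with NotFound.
func isTerminalPod(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed
}

// GetPodMetadata (illustrative signature): the guard short-circuits
// before FetchPodResourceClaims is ever called for a terminal pod.
func GetPodMetadata(ctx context.Context, pod *v1.Pod) (PodMetadata, error) {
	if isTerminalPod(pod) {
		// Empty resources: what the downstream isActivePod/isAllocatedPod
		// guards would have produced anyway.
		return PodMetadata{}, nil
	}
	// ... unchanged path: FetchPodResourceClaims, then resource accounting ...
	return PodMetadata{}, nil
}
```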

This is the same root cause as #1455 (fixed scheduler-side in #1456), but the podgroupcontroller metadata path was missed at the time.

Related Issues

Fixes #1529

Checklist

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

None.

Additional Notes

Tests

  • TestIsTerminalPod — table-driven check across the four phases, matching the style of the existing TestIsActivePod/TestIsPodAllocated (sketched after this list).
  • TestGetPodMetadata_TerminalPodSkipsResourceClaimLookup — integration-style test with a fake.ClientBuilder and a Succeeded/Failed pod referencing a missing ResourceClaim. Verified to fail (nil-deref panic from FetchPodResourceClaims) without the fix and pass with it.
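A hypothetical sketch of the first test (the fake-client integration test is omitted here); the real pod_test.go follows the existing TestIsActivePod/TestIsPodAllocated style and may differ in details:

```go
package metadata

import (
	"testing"

	v1 "k8s.io/api/core/v1"
)

// Table-driven check of isTerminalPod across the four pod phases.
func TestIsTerminalPod(t *testing.T) {
	cases := []struct {
		phase v1.PodPhase
		want  bool
	}{
		{v1.PodPending, false},
		{v1.PodRunning, false},
		{v1.PodSucceeded, true},
		{v1.PodFailed, true},
	}
	for _, tc := range cases {
		pod := &v1.Pod{Status: v1.PodStatus{Phase: tc.phase}}
		if got := isTerminalPod(pod); got != tc.want {
			t.Errorf("isTerminalPod(%s) = %v, want %v", tc.phase, got, tc.want)
		}
	}
}
```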

Validation

  • go test ./pkg/podgroupcontroller/...
  • go vet ./pkg/podgroupcontroller/...
  • gofmt -l pkg/podgroupcontroller/controllers/metadata/ (clean: no files listed)

… pods

GetPodMetadata fetched per-pod ResourceClaims unconditionally, even for
pods in Succeeded/Failed phases. The DRA driver removes those claims as
soon as the pod reaches a terminal phase, so the lookup always failed
with NotFound and produced spurious ERROR logs on every reconcile until
the pod was finally deleted.

Add an early return for terminal pods, mirroring the scheduler-side fix
in kai-scheduler#1456.

Fixes kai-scheduler#1529

Signed-off-by: David Gang <[email protected]>
@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@david-gang
Contributor Author

The failing E2E Upgrade Tests job is unrelated to this PR.

  • Helm upgrade itself succeeds (REVISION 2, status: deployed). The timeout happens afterward, waiting for KAIConfig.Status.Available=true at test/e2e/modules/wait/kai_config.go:41.
  • This PR only touches pkg/podgroupcontroller/controllers/metadata/pod.go (a phase guard in GetPodMetadata) — no path to KAIConfig reconciliation or operator startup.
  • An in-flight chart fix ("fix(chart): stop recreating kai-config CR on helm upgrade") is currently iterating on this exact area on main, indicating the upgrade flow is broken independently of this change.
  • All other jobs passed: unit tests, validate, build, regular E2E, FOSSA, coverage.

@github-actions

github-actions Bot commented May 7, 2026

Merging this branch will increase overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/kai-scheduler/KAI-scheduler/pkg/podgroupcontroller/controllers/metadata | 23.53% (+6.86%) | 👍 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/kai-scheduler/KAI-scheduler/pkg/podgroupcontroller/controllers/metadata/pod.go | 25.00% (+7.22%) | 48 (+3) | 12 (+4) | 36 (-1) | 👍 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/kai-scheduler/KAI-scheduler/pkg/podgroupcontroller/controllers/metadata/pod_test.go
