fix(scheduler): account for native sidecar requests in pod resources #1556
joeltg wants to merge 3 commits into
Conversation
Looks good, I have a few minor comments.

Thanks for the fix @joeltg !
`getPodResourceRequest` previously treated every initContainer as sequential (max'd against the regular-container sum), which under-counts pods that use native sidecars (initContainers with `restartPolicy: Always`, KEP-753). Native sidecars run concurrently with regular containers, so their requests must be added to the running sum — this is how kubelet computes admission via `k8s.io/component-helpers/resource.PodRequests`.

Symptom in production: pods with a 250m gcs-fuse native sidecar were under-counted by 250m each. Once the cumulative undercount on a node exceeded the available headroom, KAI bound a pod that kubelet rejected with `OutOfCpu`. The rejected pod transitioned to `phase=Failed`, `IsActiveUsedStatus` excluded it from `nodeInfo.Requested` on the next cycle, and KAI re-bound another pod into the same phantom slot — a livelock that produced 200+ Failed pods on a single node.

Signed-off-by: joeltg <joel@reflection.ai>
Two correctness gaps in the prior commit, both flagged in PR review:

1. `result.Add(&sidecarReq.BaseResource)` invoked the embedded `BaseResource.Add` via Go method promotion, which only sums CPU/memory/scalar resources. Native sidecars that request `nvidia.com/gpu`, `amd.com/gpu`, or MIG resources had their GPU contribution silently dropped — strictly worse than pre-fix, where `SetMaxResource` on the full `*ResourceRequirements` at least max'd the GPU half. Fix: introduce `GpuResourceRequirement.Add` and a `ResourceRequirements.Add` wrapper that sums both halves; use the wrapper in `getPodResourceRequest`.

2. The regular-init loop did `max(result, init.Requests)`, but a regular initContainer's peak demand is `init.Requests + sum(sidecars declared before it)` — those sidecars are already running concurrently per KEP-753. Diverged from upstream `AggregateContainerRequests` and under-counted pods whose regular init dominates main+sidecars (e.g., 8-CPU model-download init beside a 250m gcsfuse sidecar and a 4-CPU main). Fix: single pass over InitContainers tracking a `sidecarPrefix` accumulator, with regular inits max'd as `init.Requests + sidecarPrefix` into a separate `initPhasePeak`. Final result is max(steady-state, init-phase peak).

Adds two test cases covering the GPU-on-sidecar path and the regular-init-after-sidecar prefix path.

Signed-off-by: joeltg <joel@reflection.ai>
- Extract init-container resource math into `initContainerEffects` helper with a short docstring linking to KEP-753.
- Drop wrapper-justification framing on `ResourceRequirements.Add` doc.
- Add CHANGELOG entry under Unreleased / Fixed.

Signed-off-by: joeltg <joel@reflection.ai>
Force-pushed from 1e41f7d to 312e1b5
Thanks for the quick review!
Summary
`getPodResourceRequest` (in `pkg/scheduler/api/pod_info/pod_info.go`) treated every initContainer as sequential — `max(running_sum, init.Requests)` — which under-counts pods that use native sidecars (initContainers with `restartPolicy: Always`, KEP-753).

Per kubelet's admission accounting in `k8s.io/component-helpers/resource.PodRequests` (`AggregateContainerRequests`), native sidecars run concurrently with regular containers, so:

- native sidecar requests are added to the running sum of regular containers (the steady state)
- a regular initContainer's effective demand is `init.Requests + sum(native sidecars declared before it in spec order)` — those sidecars are already started and running concurrently with the init
- the pod's effective request is `max(steady-state, init-phase peak)` plus pod overhead

Diverging from this lets the scheduler bind pods that kubelet then rejects with `OutOfCpu`/`OutOfGpu`/etc. at admission, leaving them in `phase=Failed`.
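For concreteness, here is a small worked example using the numbers from the commit messages (250m gcsfuse sidecar, 4-CPU main container, 8-CPU model-download init declared after the sidecar). The snippet is illustrative only, not code from this PR:

```go
package main

import "fmt"

func main() {
	// CPU requests in millicores (illustrative values from the commit messages).
	const (
		sidecar = 250  // initContainer with restartPolicy: Always (native sidecar)
		mainC   = 4000 // regular container
		initC   = 8000 // regular initContainer, declared after the sidecar
	)

	// Old accounting: every initContainer max'd against the regular-container sum.
	old := max(mainC, sidecar, initC) // 8000m

	// KEP-753 / kubelet accounting: the sidecar runs concurrently in both phases.
	steady := mainC + sidecar          // 4250m steady state
	initPeak := initC + sidecar        // 8250m while the regular init runs
	effective := max(steady, initPeak) // 8250m

	fmt.Printf("old=%dm effective=%dm under-count=%dm\n", old, effective, effective-old)
}
```

The 250m gap per pod is exactly the phantom headroom described in the production symptom below.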
Production symptom
We hit this on a GKE cluster running pods with a 250m-CPU `gke-gcsfuse-sidecar` (`restartPolicy: Always`) alongside a 4-CPU main container. KAI under-counted each pod by 250m. Across many pods on the same n4-standard-80 node, the cumulative phantom headroom exceeded the node's slack and KAI bound pods that kubelet then rejected with `OutOfCpu`. Once a pod transitioned to `Failed`, `IsActiveUsedStatus` excluded it from `nodeInfo.Requested` on the next cycle, freeing the same phantom slot, and KAI bound another pod into it — a livelock that left 200+ Failed pods on a single node in our worst case.

After deploying this fix to our fork, Failed-pod accumulation on a comparable node dropped to single digits, and admission rejections went from constant to occasional.
Implementation
Single pass over `Spec.InitContainers` in spec order (see the sketch below):

- native sidecar (`restartPolicy: Always`) → accumulate into a `sidecarPrefix` running sum
- regular init → `initPhasePeak = max(initPhasePeak, init.Requests + sidecarPrefix)`

After the loop:

- `result += sidecarPrefix` (steady state includes all sidecars)
- `result = max(result, initPhasePeak)` (peak across init phases)
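A CPU-only sketch of that pass, written against plain `k8s.io/api` types and `resource.Quantity` rather than KAI's `ResourceRequirements` (pod overhead omitted; `effectiveCPURequest` and `isNativeSidecar` are illustrative names, not the helpers added by this PR):

```go
package podresources

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func isNativeSidecar(c v1.Container) bool {
	return c.RestartPolicy != nil && *c.RestartPolicy == v1.ContainerRestartPolicyAlways
}

// effectiveCPURequest mirrors the accounting described above for CPU only:
// steady state = regular containers + all native sidecars, init-phase peak =
// each regular init plus the sidecars declared before it, result = max of the two.
func effectiveCPURequest(pod *v1.Pod) resource.Quantity {
	var result resource.Quantity // running sum of regular containers
	for _, c := range pod.Spec.Containers {
		result.Add(*c.Resources.Requests.Cpu())
	}

	var sidecarPrefix resource.Quantity // sidecars declared so far, still running
	var initPhasePeak resource.Quantity // largest demand seen during any init phase
	for _, ic := range pod.Spec.InitContainers {
		if isNativeSidecar(ic) {
			sidecarPrefix.Add(*ic.Resources.Requests.Cpu())
			continue
		}
		phase := sidecarPrefix.DeepCopy()
		phase.Add(*ic.Resources.Requests.Cpu())
		if phase.Cmp(initPhasePeak) > 0 {
			initPhasePeak = phase
		}
	}

	result.Add(sidecarPrefix) // steady state includes all sidecars
	if initPhasePeak.Cmp(result) > 0 {
		return initPhasePeak // init-phase peak dominates
	}
	return result
}
```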
This required new `Add` methods:

- `GpuResourceRequirement.Add` — sums `count`, `portion` (when both non-zero portions match, else error, mirroring `SetMaxResource`), `draGpuCounts`, `migResources`
- `ResourceRequirements.Add` — wrapper that sums both `BaseResource` and `GpuResourceRequirement`. Without this, `BaseResource.Add` was reachable via Go method promotion but silently dropped GPU/MIG resources from sidecars.
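To illustrate the method-promotion pitfall, a stripped-down sketch of the shape involved. The struct fields and method bodies here are approximations for illustration, not KAI's actual definitions:

```go
package resources

type BaseResource struct {
	MilliCPU float64
	Memory   float64
}

// Add sums only CPU/memory/scalar resources.
func (b *BaseResource) Add(o *BaseResource) {
	b.MilliCPU += o.MilliCPU
	b.Memory += o.Memory
}

type GpuResourceRequirement struct {
	count int64
}

// Add sums the GPU half. The real method also reconciles fractional GPU
// portions (erroring when two non-zero portions differ, mirroring
// SetMaxResource) and sums draGpuCounts and migResources.
func (g *GpuResourceRequirement) Add(o *GpuResourceRequirement) error {
	g.count += o.count
	return nil
}

type ResourceRequirements struct {
	BaseResource
	GpuResourceRequirement
}

// Add sums both halves. Before this wrapper existed, calling
// result.Add(&sidecarReq.BaseResource) resolved to the promoted
// BaseResource.Add, so a sidecar's GPU/MIG contribution was silently dropped.
func (r *ResourceRequirements) Add(o *ResourceRequirements) error {
	r.BaseResource.Add(&o.BaseResource)
	return r.GpuResourceRequirement.Add(&o.GpuResourceRequirement)
}
```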
Test plan

New cases in `TestGetPodResourceRequest` for:

- a native sidecar requesting a GPU (covers `ResourceRequirements.Add` summing both halves)
- a regular init declared after a sidecar (covers the `init + sidecarPrefix` math)

All `pkg/scheduler/...` tests pass.
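As a rough illustration of the pod shapes those two cases need, a fixture built from plain `k8s.io/api` types. The assertion against the package-internal `getPodResourceRequest` is elided, and the helper names (`requests`, `sidecarAndInitPod`) are made up for this sketch:

```go
package pod_info_test

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// requests builds a ResourceRequirements with a CPU request plus optional
// extended resources (e.g. nvidia.com/gpu).
func requests(cpu string, extra map[v1.ResourceName]string) v1.ResourceRequirements {
	rl := v1.ResourceList{v1.ResourceCPU: resource.MustParse(cpu)}
	for name, qty := range extra {
		rl[name] = resource.MustParse(qty)
	}
	return v1.ResourceRequirements{Requests: rl}
}

func sidecarAndInitPod() *v1.Pod {
	always := v1.ContainerRestartPolicyAlways
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "sidecar-and-init"},
		Spec: v1.PodSpec{
			InitContainers: []v1.Container{
				{
					// Native sidecar requesting a GPU: exercises ResourceRequirements.Add.
					Name:          "gpu-sidecar",
					RestartPolicy: &always,
					Resources:     requests("250m", map[v1.ResourceName]string{"nvidia.com/gpu": "1"}),
				},
				{
					// Regular init declared after the sidecar: exercises init + sidecarPrefix.
					Name:      "model-download",
					Resources: requests("8", nil),
				},
			},
			Containers: []v1.Container{
				{Name: "main", Resources: requests("4", nil)},
			},
		},
	}
}
```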