# fix(scheduler): prevent cross-queue consolidating reclaim from bypassing fair-share validation #1565
## Conversation
Commit message:

> fix(scheduler): prevent cross-queue consolidating reclaim from bypassing fair-share validation
>
> When `--allow-consolidating-reclaim=true` (the default), the proportion plugin's reclaim validator skips victim tasks in active allocated statuses from its resource accounting via `getResources()`. If all of a victim queue's tasks are re-pipelined during the reclaim simulation, the validator sees an empty victim-resource map and trivially approves, even when the reclaimer queue is over its deserved share and the victim queue is under its deserved share.
>
> The fix: when all victims are re-pipelined and the resource map is empty, require the reclaimer queue to stay within its deserved quota (not just FairShare) for managed resources. This preserves legitimate defragmentation (a starved queue moving work to make room) while blocking unfair cross-queue disruption (an over-fed queue disrupting an under-allocated queue).
>
> The deserved-quota check only considers resources the queue actually manages. Resources with `deserved=0` (common for CPU/Memory in GPU-only queue configs) are skipped, preventing false rejections.
>
> The `IsActiveAllocatedStatus` predicate in `getResources()` is intentionally kept unchanged. It serves two purposes: skipping re-placed victims (`Pipelined`) and skipping not-yet-simulated victims (`Running`) during incremental solving. These concerns would need to be separated for full accuracy in the partial re-pipelining case, but that is a larger refactor.
>
> Adds a `reclaim_consolidation_empty_victim_map_total` metric with `reclaimer_queue` and `outcome` labels for observability. Adds a regression test mirroring the incident scenario that fails without the fix and passes with it.
>
> Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Diff excerpt under review:

```go
func isUnderDeservedForManagedResources(allocated, deserved rs.ResourceQuantities) bool {
	for _, resource := range rs.AllResources {
		d := deserved[resource]
		if d <= 0 || d == commonconstants.UnlimitedResourceQuantity {
```
Does that mean that if I lower a queue's resource quota to 0, it might suddenly be able to schedule more jobs?
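For context on the helper this comment refers to, here is a self-contained sketch of its semantics with stand-in types (`ResourceName`, `unlimitedQuantity`, and `allResources` are assumptions for illustration; the real code uses `rs.ResourceQuantities` and `commonconstants.UnlimitedResourceQuantity`):

```go
package main

import "fmt"

// Stand-in types; the PR's code uses rs.ResourceQuantities and
// commonconstants.UnlimitedResourceQuantity (assumptions for this sketch).
type ResourceName string

const unlimitedQuantity = float64(-2) // hypothetical "unlimited" sentinel

var allResources = []ResourceName{"cpu", "memory", "gpu"}

// Simplified stand-in for the PR's isUnderDeservedForManagedResources:
// resources whose deserved quota is zero or unlimited are treated as
// unmanaged and skipped, so a GPU-only queue with cpu/memory quota 0 is
// judged on GPU alone.
func isUnderDeservedForManagedResources(allocated, deserved map[ResourceName]float64) bool {
	for _, r := range allResources {
		d := deserved[r]
		if d <= 0 || d == unlimitedQuantity {
			continue // quota 0 or unlimited: not managed by this queue
		}
		if allocated[r] >= d {
			return false // at or over deserved on a managed resource
		}
	}
	return true
}

func main() {
	// Incident-shaped example: GPU-only queue, cpu/memory quotas are 0.
	deserved := map[ResourceName]float64{"cpu": 0, "memory": 0, "gpu": 5120}
	over := map[ResourceName]float64{"cpu": 900, "memory": 4096, "gpu": 6215}
	under := map[ResourceName]float64{"cpu": 900, "memory": 4096, "gpu": 4000}
	fmt.Println(isUnderDeservedForManagedResources(over, deserved))  // false
	fmt.Println(isUnderDeservedForManagedResources(under, deserved)) // true
}
```

Under this reading, lowering a quota to 0 removes that resource from this particular guard rather than granting scheduling capacity, since the guard only gates the consolidating-reclaim path; whether that interaction holds in the full validator is exactly what the reviewer's question should confirm.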
## Description
When `--allow-consolidating-reclaim=true` (the default), the proportion plugin's reclaim validator skips victim tasks in active allocated statuses from its resource accounting via `getResources()`. If all of a victim queue's tasks are re-pipelined during the reclaim simulation, the validator sees an empty victim-resource map and trivially approves, regardless of whether the reclaimer queue is over its deserved share.

We observed this cause an incident where 48 GPUs of pre-training requests caused 376 GPUs of running mid-training work to be evicted and restarted (`evictedToRequestedRatio` of 3.75x and 16x). Pre-training was already ~1100 GPUs over its deserved share; mid-training was ~1700 GPUs under. The evicted pods were successfully re-scheduled on new nodes in the next cycle, so from a resource-accounting perspective the consolidation worked as intended. But an over-fed queue should not be able to trigger disproportionate disruption to an under-fed queue.
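The accounting gap can be sketched with a toy model (the task statuses, single-resource map, and `Victim`-free shape here are simplifications; `getResources` is a stand-in for the PR's function of the same name):

```go
package main

import "fmt"

// Simplified, hypothetical model of the pre-fix validator accounting.
type Status int

const (
	Running   Status = iota
	Pipelined        // victim was speculatively re-placed during simulation
)

type Task struct {
	Status Status
	GPU    float64
}

// getResources mirrors the accounting described above: tasks in active
// allocated statuses are counted, while re-pipelined victims are skipped.
func getResources(victims []Task) map[string]float64 {
	total := map[string]float64{}
	for _, t := range victims {
		if t.Status == Pipelined {
			continue // re-placed victim: its queue keeps those resources
		}
		total["gpu"] += t.GPU
	}
	return total
}

func main() {
	// Every victim re-pipelined in the simulation -> empty resource map.
	victims := []Task{{Pipelined, 8}, {Pipelined, 8}}
	total := getResources(victims)
	// Pre-fix, an empty map meant "nothing to validate" and trivial
	// approval, even for a reclaimer already over its deserved share.
	fmt.Println(len(total) == 0) // true
}
```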
## The fix
When all victims are re-pipelined and the resource map is empty, require the reclaimer queue to stay within its deserved quota for managed resources. This gates consolidation on whether the reclaimer is actually starved — preventing an over-fed queue from causing disproportionate disruption to get resources it isn't entitled to, while preserving consolidation for queues that genuinely need resources they're owed.
**Why deserved and not FairShare:** FairShare dynamically expands to absorb unused capacity. A queue can be over-deserved but under-FairShare, which is exactly what happened in the incident (pre-training `deserved=5120`, `FairShare=6294`, `allocated=6215`). Using deserved as the boundary prevents a queue from disrupting another queue precisely because that other queue is under-utilizing.
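To make the boundary concrete, a minimal sketch using the incident's GPU numbers (the predicate name is invented for illustration):

```go
package main

import "fmt"

// isOverDeservedButUnderFairShare captures the incident condition: because
// FairShare expands to absorb idle capacity, a FairShare gate can pass even
// though the queue already holds more than its deserved quota.
func isOverDeservedButUnderFairShare(allocated, deserved, fairShare float64) bool {
	return allocated > deserved && allocated < fairShare
}

func main() {
	// Pre-training queue from the incident: deserved=5120, FairShare=6294,
	// allocated=6215. A FairShare check would approve; a deserved check rejects.
	fmt.Println(isOverDeservedButUnderFairShare(6215, 5120, 6294)) // true
}
```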
Why only managed resources: The deserved-quota check uses
isUnderDeservedForManagedResources, which skips resources wheredeserved=0ordeserved=unlimited. GPU-only queue configs (common in production) setcpu.quota: 0andmemory.quota: 0. Without this scoping, running pods' CPU/Memory allocation would always exceed the zero deserved value, silently rejecting all consolidating reclaim.Why
IsActiveAllocatedStatusis kept unchanged ingetResources()We initially narrowed
getResources()to only skipPipelinedtasks (matching the flag's help text). This broke existing tests because the predicate serves two purposes:Pipelined) — correct for resource accounting, the victim queue keeps those GPUsRunning,Allocated, etc.) — necessary because during incremental solving, the scenario can include victims whose tasks have not been speculatively evicted yet in the current solver pass (previous iteration's statement was discarded, reverting tasks toRunning)Properly separating these concerns (always pass full victim resources to the strategies, use a separate signal for "this victim was re-placed") would fix both the empty-map case and the partial re-pipelining case, but is a larger refactor. This PR addresses the most dangerous scenario (all victims invisible, no evaluation at all) with the deserved-quota guard.
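A rough, hypothetical sketch of what that separation could look like (all names here are invented for illustration, not the PR's code):

```go
package main

import "fmt"

// Hypothetical shape of the larger refactor: victim resources are always
// counted in full, and "this victim was re-placed" becomes an explicit
// signal rather than being inferred from task status.
type Victim struct {
	Name     string
	GPU      float64
	RePlaced bool // set by the simulation when the victim is re-pipelined
}

// fullVictimResources never hides re-placed victims from accounting.
func fullVictimResources(victims []Victim) float64 {
	var gpu float64
	for _, v := range victims {
		gpu += v.GPU
	}
	return gpu
}

// rePlacedOnly reports whether every victim was re-placed, i.e. the
// consolidation case that previously surfaced as an empty resource map.
func rePlacedOnly(victims []Victim) bool {
	for _, v := range victims {
		if !v.RePlaced {
			return false
		}
	}
	return len(victims) > 0
}

func main() {
	victims := []Victim{{"a", 8, true}, {"b", 8, true}}
	// Strategies see the full 16 GPUs, and consolidation is a separate flag.
	fmt.Println(fullVictimResources(victims), rePlacedOnly(victims)) // 16 true
}
```

With the resources and the consolidation signal decoupled, the strategies could evaluate the partial re-pipelining case on full victim resources instead of a partially emptied map.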
## Changes
- `handleAllVictimsConsolidated` in `proportion.go`: when victims exist but `totalVictimsResources` is empty, check the reclaimer against its deserved quota for managed resources
- `isUnderDeservedForManagedResources` helper: compares allocated vs. deserved only for resources with non-zero, non-unlimited quotas
- `reclaim_consolidation_empty_victim_map_total` metric with `reclaimer_queue` and `outcome` labels

## Checklist
## Breaking Changes
None. The flag's default behavior is preserved for legitimate consolidation (reclaimer under deserved quota). Only the unfair case — reclaimer over deserved quota with all victims re-pipelined — is now rejected.
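Putting the pieces together, the guard described under Changes could look roughly like this hedged sketch (`emptyVictimMapGuard` and its signature are hypothetical; the outcome strings stand in for the metric's `outcome` label values, which the PR does not spell out):

```go
package main

import "fmt"

// Hypothetical sketch of the empty-victim-map guard; names and signatures
// are illustrative, not the PR's exact code.
func emptyVictimMapGuard(victimCount int, totalVictimsResources,
	reclaimerAllocated, reclaimerDeserved map[string]float64) (approved bool, outcome string) {

	if victimCount == 0 || len(totalVictimsResources) > 0 {
		return true, "not_applicable" // normal validation path handles this
	}
	// All victims re-pipelined: only a reclaimer still under its deserved
	// quota (for managed resources) may consolidate.
	for r, d := range reclaimerDeserved {
		if d <= 0 {
			continue // quota 0 / unmanaged resource: skip (GPU-only configs)
		}
		if reclaimerAllocated[r] >= d {
			return false, "rejected_over_deserved"
		}
	}
	return true, "approved_under_deserved"
}

func main() {
	// Incident shape: reclaimer over deserved, all victims re-pipelined.
	ok, outcome := emptyVictimMapGuard(2,
		map[string]float64{},            // empty totalVictimsResources
		map[string]float64{"gpu": 6215}, // reclaimer allocated
		map[string]float64{"gpu": 5120}) // reclaimer deserved
	fmt.Println(ok, outcome) // false rejected_over_deserved
}
```

The outcome string would feed the `reclaim_consolidation_empty_victim_map_total` counter alongside the `reclaimer_queue` label, making approvals and rejections on this path observable.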