
fix(scheduler): prevent cross-queue consolidating reclaim from bypassing fair-share validation #1565

Open
piyushsprasad wants to merge 1 commit into kai-scheduler:main from piyushsprasad:fix-consolidating-reclaim-fairshare

Conversation

@piyushsprasad

Description

When --allow-consolidating-reclaim=true (the default), the proportion plugin's reclaim validator skips victim tasks in active allocated statuses from its resource accounting via getResources(). If all of a victim queue's tasks are re-pipelined during the reclaim simulation, the validator sees an empty victim-resource map and trivially approves — regardless of whether the reclaimer queue is over its deserved share.
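To make the failure mode concrete, here is a minimal sketch, not the actual kai-scheduler code. The predicate name IsActiveAllocatedStatus and the role of getResources() come from this description; every type, value, and helper below is a stand-in invented for illustration.

```go
package main

import "fmt"

type TaskStatus string

const (
	Running   TaskStatus = "Running"
	Pipelined TaskStatus = "Pipelined" // victim was re-placed during the simulation
)

type Task struct {
	Queue  string
	Status TaskStatus
	GPUs   float64
}

// Stand-in for the real predicate: Running, Pipelined, etc. all count as
// "active allocated" and are skipped from victim accounting.
func isActiveAllocatedStatus(s TaskStatus) bool {
	return s == Running || s == Pipelined
}

// Stand-in for getResources(): sums victim resources per queue, skipping
// tasks in active allocated statuses.
func getResources(victims []Task) map[string]float64 {
	total := map[string]float64{}
	for _, t := range victims {
		if isActiveAllocatedStatus(t.Status) {
			continue
		}
		total[t.Queue] += t.GPUs
	}
	return total
}

func main() {
	// All of the victim queue's tasks were re-pipelined in the simulation...
	victims := []Task{
		{Queue: "mid-training", Status: Pipelined, GPUs: 8},
		{Queue: "mid-training", Status: Pipelined, GPUs: 8},
	}
	// ...so the validator sees an empty map and trivially approves.
	fmt.Println(len(getResources(victims)) == 0) // prints: true
}
```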

We observed this cause an incident in which 48 GPUs of pre-training requests led to 376 GPUs of running mid-training work being evicted and restarted (evictedToRequestedRatio of 3.75x and 16x). Pre-training was already ~1100 GPUs over its deserved share; mid-training was ~1700 GPUs under. The evicted pods were successfully rescheduled onto new nodes in the next cycle, so from a resource-accounting perspective the consolidation worked as intended. But an over-fed queue should not be able to inflict disproportionate disruption on an under-fed queue.

The fix

When all victims are re-pipelined and the resource map is empty, require the reclaimer queue to stay within its deserved quota for managed resources. This gates consolidation on whether the reclaimer is actually starved — preventing an over-fed queue from causing disproportionate disruption to get resources it isn't entitled to, while preserving consolidation for queues that genuinely need resources they're owed.
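The shape of the guard, as a hedged sketch: handleAllVictimsConsolidated and isUnderDeservedForManagedResources are the names this PR introduces, but the signatures here are assumptions (plain float64 maps stand in for the project's resource types), and the helper itself is sketched after the next paragraph.

```go
// Sketch only; called when victims exist but totalVictimsResources came back
// empty. Signatures are assumptions, not the actual patch.
func handleAllVictimsConsolidated(reclaimerAllocated, reclaimerDeserved map[string]float64) bool {
	// With nothing to compare on the victim side, the reclaimer must justify
	// the disruption on its own: approve only while it is still under its
	// deserved quota for managed resources.
	return isUnderDeservedForManagedResources(reclaimerAllocated, reclaimerDeserved)
}
```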

Why deserved and not FairShare: FairShare dynamically expands to absorb unused capacity. A queue can be over-deserved but under-FairShare — which is exactly what happened in the incident (pre-training deserved=5120, FairShare=6294, allocated=6215). Using deserved as the boundary prevents a queue from disrupting another queue precisely because that other queue is under-utilizing.

Why only managed resources: The deserved-quota check uses isUnderDeservedForManagedResources, which skips resources where deserved=0 or deserved=unlimited. GPU-only queue configs (common in production) set cpu.quota: 0 and memory.quota: 0. Without this scoping, running pods' CPU/Memory allocation would always exceed the zero deserved value, silently rejecting all consolidating reclaim.
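A possible shape for the helper, extrapolated from the excerpt quoted in the review thread further down. The stand-in types, the stand-in constant value, and the strictness of the comparison are assumptions.

```go
const unlimitedResourceQuantity = -1.0 // stand-in for commonconstants.UnlimitedResourceQuantity

var allResources = []string{"cpu", "memory", "gpu"} // stand-in for rs.AllResources

func isUnderDeservedForManagedResources(allocated, deserved map[string]float64) bool {
	for _, resource := range allResources {
		d := deserved[resource]
		// Skip resources the queue doesn't manage: deserved=0 (cpu/memory in
		// GPU-only queue configs) or deserved=unlimited.
		if d <= 0 || d == unlimitedResourceQuantity {
			continue
		}
		if allocated[resource] >= d {
			return false // at or over deserved on a managed resource
		}
	}
	return true
}
```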

Why IsActiveAllocatedStatus is kept unchanged in getResources()

We initially narrowed getResources() to only skip Pipelined tasks (matching the flag's help text). This broke existing tests because the predicate serves two purposes:

  1. Skip re-placed victims (Pipelined) — correct for resource accounting, the victim queue keeps those GPUs
  2. Skip not-yet-simulated victims (Running, Allocated, etc.) — necessary because during incremental solving, the scenario can include victims whose tasks have not been speculatively evicted yet in the current solver pass (previous iteration's statement was discarded, reverting tasks to Running)

Properly separating these concerns (always pass full victim resources to the strategies, use a separate signal for "this victim was re-placed") would fix both the empty-map case and the partial re-pipelining case, but is a larger refactor. This PR addresses the most dangerous scenario (all victims invisible, no evaluation at all) with the deserved-quota guard.
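For reference, the separation might look something like the following. This is purely hypothetical; nothing like it exists in this PR.

```go
// Hypothetical: victim accounting always carries full resources, and
// re-placement becomes an explicit signal instead of being inferred from
// task status. This would let strategies evaluate partial re-pipelining too.
type victimAccounting struct {
	Resources map[string]float64 // full victim resources, always populated
	RePlaced  bool               // set only when the victim was re-pipelined
}
```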

Changes

  • Add handleAllVictimsConsolidated in proportion.go: when victims exist but totalVictimsResources is empty, check reclaimer against deserved quota for managed resources
  • Add isUnderDeservedForManagedResources helper: compares allocated vs deserved only for resources with non-zero, non-unlimited quotas
  • Add reclaim_consolidation_empty_victim_map_total metric with reclaimer_queue and outcome labels (a sketch of a possible definition follows this list)
  • Add regression test mirroring the incident scenario (fails without the fix, passes with it)
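
A hedged sketch of what the metric definition could look like with prometheus/client_golang. The metric name and label names come from the changes above; the registration style, variable name, and help text are assumptions.

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Counts consolidating-reclaim validations that hit an empty victim-resource
// map, partitioned by reclaimer queue and by the validation outcome.
var reclaimConsolidationEmptyVictimMap = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "reclaim_consolidation_empty_victim_map_total",
		Help: "Consolidating-reclaim validations that saw an empty victim-resource map, by outcome.",
	},
	[]string{"reclaimer_queue", "outcome"},
)

// Usage at the decision point, e.g.:
//   reclaimConsolidationEmptyVictimMap.WithLabelValues(queueName, "rejected").Inc()
```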

Checklist

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

None. The flag's default behavior is preserved for legitimate consolidation (reclaimer under deserved quota). Only the unfair case — reclaimer over deserved quota with all victims re-pipelined — is now rejected.

fix(scheduler): prevent cross-queue consolidating reclaim from bypassing fair-share validation

When --allow-consolidating-reclaim=true (the default), the proportion plugin's
reclaim validator skips victim tasks in active allocated statuses from its
resource accounting via getResources(). If all of a victim queue's tasks are
re-pipelined during the reclaim simulation, the validator sees an empty
victim-resource map and trivially approves — even when the reclaimer queue is
over its deserved share and the victim queue is under its deserved share.

The fix: when all victims are re-pipelined and the resource map is empty,
require the reclaimer queue to stay within its deserved quota (not just
FairShare) for managed resources. This preserves legitimate defragmentation
(starved queue moving work to make room) while blocking unfair cross-queue
disruption (over-fed queue disrupting an under-allocated queue).

The deserved-quota check only considers resources the queue actually manages.
Resources with deserved=0 (common for CPU/Memory in GPU-only queue configs)
are skipped, preventing false rejections.

The IsActiveAllocatedStatus predicate in getResources() is intentionally kept
unchanged. It serves two purposes: skipping re-placed victims (Pipelined) and
skipping not-yet-simulated victims (Running) during incremental solving. These
concerns would need to be separated for full accuracy in the partial
re-pipelining case, but that is a larger refactor.

Adds a reclaim_consolidation_empty_victim_map_total metric with reclaimer_queue
and outcome labels for observability.

Adds a regression test mirroring the incident scenario that fails without the
fix and passes with it.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


```go
func isUnderDeservedForManagedResources(allocated, deserved rs.ResourceQuantities) bool {
	for _, resource := range rs.AllResources {
		d := deserved[resource]
		if d <= 0 || d == commonconstants.UnlimitedResourceQuantity {
```
Collaborator


Does that mean that if I lower a queue's resource quota to 0, it might suddenly be able to schedule more jobs?
