
design: Solver refactor design #1574

Open

itsomri wants to merge 4 commits into main from omric/solver-refactor-design

Conversation

@itsomri itsomri (Collaborator) commented May 10, 2026

Description

This PR proposes a refactor for the job solver stack.

Related Issues

Fixes #

Checklist

Note: Ensure your PR title follows the Conventional Commits format (e.g., feat(scheduler): add new feature)

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

Additional Notes

itsomri added 3 commits May 7, 2026 18:01
Signed-off-by: itsomri <[email protected]>
Signed-off-by: itsomri <[email protected]>
Signed-off-by: itsomri <[email protected]>
@coderabbitai coderabbitai bot (Contributor) commented May 10, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c294d114-2e82-444a-b03a-65cc3f20e234

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Signed-off-by: itsomri <[email protected]>

Typical podgroups are:
- Single-pod workloads, which are the easiest case
- Distributed training, inference, or data processing jobs — a leader plus several (typically one, sometimes a few, generally fewer than 10) worker templates, ranging from very few to thousands of pods that are *replicas* of those few templates: identical resource requests, predicates, and affinity rules. The solver treats every pod independently today, but template-level equivalence classes can shrink the effective candidate space substantially.
Collaborator
Can we actually assume that? We also see movement toward workloads with several different worker types, and separate autoscaling for the different types and replicas, which leads to different resource requests.

Collaborator Author
Several different worker types is still far fewer than the number of replicas. We need to optimize for pod templates, not pods. The intention here is that, for example, when we implement bin-packing approximations (e.g., in the GPU scenario pre-filter), we can bucket by pod template and get better results in both correctness and performance.
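A minimal sketch of what bucketing by pod template could look like; the `PodInfo` and `EquivalenceClass` types and the template-key scheme here are assumptions for illustration, not the scheduler's actual data structures:

```go
package solver

// Illustrative sketch of template-level equivalence classes: collapse a
// podgroup's N pods into at most T buckets, one per distinct template,
// so feasibility checks and bin-packing heuristics run per template
// rather than per pod.

type PodInfo struct {
	Name        string
	TemplateKey string // e.g. a hash of resource requests, predicates, and affinity rules
}

type EquivalenceClass struct {
	TemplateKey string
	Pods        []*PodInfo
}

// BucketByTemplate groups pods by their template key.
func BucketByTemplate(pods []*PodInfo) []EquivalenceClass {
	byKey := map[string][]*PodInfo{}
	for _, p := range pods {
		byKey[p.TemplateKey] = append(byKey[p.TemplateKey], p)
	}
	classes := make([]EquivalenceClass, 0, len(byKey))
	for key, members := range byKey {
		classes = append(classes, EquivalenceClass{TemplateKey: key, Pods: members})
	}
	return classes
}
```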

- **Jobs with topology requirements** are a common use case that requires its own optimizations
- Busy, multi-tenant, highly utilized clusters that serve dozens of teams

### Best solutions for reclaim, by multiple criteria
Collaborator
Rather than "multiple criteria", we should talk about queue, job, and subgroup order in general, as the criteria and their scores can change.

Collaborator Author
The intention here is to explain the complexity of what can be considered a "good" solution and to show how naive today's assumptions are. For example:

  • Who says that 1,000 victims from priority 49 are preferable to one victim from priority 50? (which is the state today)
  • Maybe it's worth evicting 1,001 eligible victims vs 1,000, if it gives the preemptor a better topology placement?
  • What is better: an unfair but 100% allocated cluster (bin-packing optimized), or an 80% allocated, completely fair one?

The refactor will not itself provide knobs to address these issues, but we will be able to start having this conversation if we implement some scenario scoring mechanism.
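As a strawman for that conversation, a scoring mechanism could reduce these competing dimensions to one comparable number; the `Scenario` fields, `Weights`, and the linear combination below are purely illustrative assumptions, not a proposed default:

```go
package solver

// Illustrative scenario scoring sketch: weigh victim count, victim
// priority, topology quality, and resulting utilization against each
// other so different reclaim solutions become comparable.

type Scenario struct {
	VictimCount       int
	MaxVictimPriority int
	TopologyScore     float64 // higher means a better placement for the preemptor
	Utilization       float64 // resulting cluster allocation ratio, 0..1
}

type Weights struct {
	VictimCount    float64
	VictimPriority float64
	Topology       float64
	Utilization    float64
}

// Score returns a comparable value; higher is a better scenario. With
// tuned weights this can express trade-offs such as accepting one extra
// victim in exchange for a better topology placement.
func Score(s Scenario, w Weights) float64 {
	return w.Topology*s.TopologyScore +
		w.Utilization*s.Utilization -
		w.VictimCount*float64(s.VictimCount) -
		w.VictimPriority*float64(s.MaxVictimPriority)
}
```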

While out of scope for the initial refactor, it's worth considering that different job classes warrant different scenario-generation strategies. The refactor should take into account that scenario generation could be **adaptive** to job type and cluster state.

- **Strict topology gangs** — enumerate viable placement domains first, then derive the minimum victim set per domain. The default victim-set-first generator could be suboptimal here, both in performance and in finding the optimal solution.
- **Single-task reclaimers** — one pod can only land on one node, so a single-pod reclaimer requires single-node sub-scenario evaluation. This can be generalized further: each reclaimer set of pods has a theoretical minimum and maximum number of nodes that need to be evaluated.
Collaborator
Assuming a max number of nodes might be problematic when consolidating reclaim is enabled.

Collaborator Author
That's correct, I wanted to add a note on that
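A rough sketch of how the adaptive generation idea above could be wired: dispatch to a generator implementation based on the job's shape. The `ScenarioGenerator` interface, type names, and selection heuristics are assumptions, not existing code:

```go
package solver

// Illustrative sketch of adaptive scenario generation: choose a
// generation strategy per job; cluster-state signals such as
// utilization or fragmentation could feed into this choice as well.

type Job struct {
	TaskCount        int
	RequiresTopology bool
}

// ScenarioGenerator is a placeholder interface; a real generator would
// yield candidate scenarios for the simulator to evaluate.
type ScenarioGenerator interface {
	Name() string
}

type victimSetFirst struct{}

func (victimSetFirst) Name() string { return "victim-set-first" }

type topologyDomainFirst struct{}

func (topologyDomainFirst) Name() string { return "topology-domain-first" }

type singleNode struct{}

func (singleNode) Name() string { return "single-node" }

// SelectGenerator picks a strategy based on the job's shape.
func SelectGenerator(j Job) ScenarioGenerator {
	switch {
	case j.RequiresTopology:
		return topologyDomainFirst{}
	case j.TaskCount == 1:
		return singleNode{}
	default:
		return victimSetFirst{}
	}
}
```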


The legacy gang loop probed `k = 1, 2, 4, ..., N` and could retain the largest feasible `k` as a partial allocation. The new simulator is all-or-nothing.

- **Strict gang semantics is correct** for min-member=N jobs — partial allocations don't help a gang that requires N tasks running together.
Collaborator
I believe we did use this data as part of our unschedulable explanation.

Collaborator Author

@itsomri itsomri May 12, 2026

We can solve that without adding this complexity to the scheduler loop; for example, a CLI tool that simulates this on demand.
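For reference, the difference between the legacy probing loop and the new all-or-nothing check could be sketched as below; `tryAllocate` and both functions are placeholders for discussion, not the actual solver API:

```go
package solver

// Contrast sketch with placeholder names: the legacy gang loop probed
// k = 1, 2, 4, ..., N and kept the largest feasible k as a partial
// allocation, while the new simulator only accepts the full gang.

// tryAllocate stands in for a feasibility check of k tasks.
type tryAllocate func(k int) bool

// legacyProbe returns the largest probed k that fits (0 if none).
func legacyProbe(n int, try tryAllocate) int {
	best := 0
	for k := 1; k < n; k *= 2 {
		if try(k) {
			best = k
		}
	}
	if try(n) {
		best = n
	}
	return best
}

// allOrNothing reports success only when the full gang of n tasks fits.
func allOrNothing(n int, try tryAllocate) bool {
	return try(n)
}
```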
