design: Solver refactor design #1574
Conversation
Signed-off-by: itsomri <[email protected]>
> Typical podgroups are usually:
> - Single pod workloads, which are the easiest
> - Distributed training, inference, or data processing jobs — a leader plus several (typically one, sometimes a few, generally less than 10) worker templates, ranging from very few to thousands of pods that are *replicas* of those few templates: identical resource requests, predicates, and affinity rules. The solver treats every pod independently today, but template-level equivalence classes can shrink the effective candidate space substantially.
Can we actually assume that? We also see movement toward workloads with several different worker types, with separate autoscaling for the different types and replica counts, which leads to different resource requests.
Several different worker types is still far fewer than the number of replicas. We need to optimize for pod templates, not pods. The intention here is that, for example, when we implement bin-packing approximations (such as in the GPU scenario pre-filter), we can bucket by pod template and get better results in both correctness and performance.
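To make the bucketing idea concrete, here is a minimal Go sketch. The `podInfo` type and `TemplateKey` field are illustrative stand-ins, not the scheduler's actual types; the point is that a pre-filter or bin-packing pass can run once per template and be reused by all of its replicas:

```go
package main

import "fmt"

// podInfo is a hypothetical stand-in for the scheduler's internal pod representation.
type podInfo struct {
	Name        string
	TemplateKey string // e.g. a hash of resource requests, predicates, and affinity rules
	GPUs        int
}

// bucketByTemplate groups pods that are replicas of the same template into
// equivalence classes, so feasibility checks run per template rather than per pod.
func bucketByTemplate(pods []podInfo) map[string][]podInfo {
	buckets := make(map[string][]podInfo)
	for _, p := range pods {
		buckets[p.TemplateKey] = append(buckets[p.TemplateKey], p)
	}
	return buckets
}

func main() {
	pods := []podInfo{
		{Name: "worker-0", TemplateKey: "worker", GPUs: 8},
		{Name: "worker-1", TemplateKey: "worker", GPUs: 8},
		{Name: "leader-0", TemplateKey: "leader", GPUs: 1},
	}
	for key, group := range bucketByTemplate(pods) {
		// One pre-filter / bin-packing pass per template, reused by all replicas.
		fmt.Printf("template %q: %d replicas\n", key, len(group))
	}
}
```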
> - **Jobs with topology requirements** is a common use case that requires its own optimizations
> - Busy, multi-tenant, highly utilized clusters that serve dozens of teams
> ### Best solutions for reclaim, by multiple criteria
Rather than "multiple criteria", we should talk about queue, job, and subgroup order in general, as the criteria and their scores can be changed.
The intention here is to explain the complexity of what can be considered a "good" solution, and to show how naive today's assumptions are. For example:
- Who says that 1,000 victims from priority 49 are preferable to one victim from priority 50? (That is the state today.)
- Maybe it's worth evicting 1,001 eligible victims instead of 1,000 if it gives the preemptor a better topology placement?
- What is better: an unfair but 100% allocated cluster (bin-packing optimized), or an 80% allocated, completely fair one?

The refactor will not provide knobs to address these issues, but we will be able to start having this conversation if we implement some scenario scoring mechanism.
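As a purely illustrative sketch (the `scenario` struct, its fields, and the weights below are assumptions made for the example, not a proposed API), a scenario scoring mechanism could collapse several criteria into one comparable value, which is what would let us debate trade-offs like the ones above explicitly:

```go
package main

import "fmt"

// scenario summarizes one candidate reclaim solution (hypothetical fields).
type scenario struct {
	VictimCount       int
	MaxVictimPriority int     // highest priority among the evicted pods
	TopologyScore     float64 // 0..1, higher means a tighter placement for the preemptor
	FairnessScore     float64 // 0..1, higher means closer to queue fair share
}

// score combines the criteria into a single value; the weights are arbitrary
// placeholders. The point is that "fewest victims from the lowest priority"
// stops being the only axis.
func score(s scenario) float64 {
	return -0.5*float64(s.VictimCount) -
		2.0*float64(s.MaxVictimPriority) +
		10.0*s.TopologyScore +
		10.0*s.FairnessScore
}

func main() {
	a := scenario{VictimCount: 1000, MaxVictimPriority: 49, TopologyScore: 0.9, FairnessScore: 0.7}
	b := scenario{VictimCount: 1, MaxVictimPriority: 50, TopologyScore: 0.4, FairnessScore: 0.7}
	fmt.Printf("scenario a: %.1f, scenario b: %.1f\n", score(a), score(b))
}
```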
> While out of scope for the initial refactor, it's worth considering that different job classes warrant different scenario-generation strategies. The refactor should take into account that scenario generation could be **adaptive** to job type and cluster state.
> - **Strict topology gangs** — enumerate viable placement domains first, then derive the minimum victim set per domain. The default victim-set-first generator could be suboptimal here, both for performance and for finding the optimal solution.
> - **Single-task reclaimers** — one pod can only land on one node, so a single-pod reclaimer requires single-node sub-scenario evaluation. This can be generalized further: each reclaimer's set of pods has a theoretical minimum and maximum number of nodes that need to be evaluated.
Assuming a max number of nodes might be problematic when consolidating reclaim is enabled.
That's correct; I wanted to add a note on that.
> The legacy gang loop probed `k = 1, 2, 4, ..., N` and could retain the largest feasible `k` as a partial allocation. The new simulator is all-or-nothing.
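For context, a simplified reconstruction of that legacy probing loop might look like the following; `canPlace` is a hypothetical stand-in for a per-`k` feasibility simulation, not an existing function:

```go
package main

import "fmt"

// canPlace stands in for simulating whether k gang members fit on the cluster.
func canPlace(k, capacity int) bool { return k <= capacity }

// largestFeasible mimics the legacy loop: probe k = 1, 2, 4, ... and finally N,
// keeping the largest k that fit as a partial allocation.
func largestFeasible(n, capacity int) int {
	best := 0
	for k := 1; k < n; k *= 2 {
		if canPlace(k, capacity) {
			best = k
		}
	}
	if canPlace(n, capacity) {
		best = n
	}
	return best
}

func main() {
	// A 12-member gang with capacity for 5: the legacy loop retains k=4 as a
	// partial allocation, while the all-or-nothing simulator rejects the gang.
	fmt.Println(largestFeasible(12, 5)) // prints 4
}
```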
> - **Strict gang semantics is correct** for min-member=N jobs — partial allocations don't help a gang that requires N tasks running together.
I believe we did use this data as part of our unschedulable explanation.
We can solve that without adding this complexity to the scheduler loop; for example, a CLI tool that simulates this on demand.
Description
This PR proposes a refactor for the job solver stack.
Related Issues
Fixes #
Checklist
Breaking Changes
Additional Notes