feat(binder): GPU_MEMORY_LIMIT binder plugin for shared GPU workloads#1504

Open
FouoF wants to merge 4 commits into
kai-scheduler:mainfrom
FouoF:migrate/hami-core-on-pr1480

Conversation

@FouoF

@FouoF FouoF commented Apr 30, 2026

Description

As described in the linked issue.

Related Issues

Fixes #1367

Checklist

Note: Ensure your PR title follows the Conventional Commits format (e.g., feat(scheduler): add new feature)

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

No

Additional Notes

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@enoodle
Collaborator

enoodle commented Apr 30, 2026

@FouoF Can you please rebase this on top of origin/main, removing the commits that are already merged?

@FouoF FouoF force-pushed the migrate/hami-core-on-pr1480 branch 2 times, most recently from 9a689bd to bd03290 Compare May 1, 2026 04:08
@enoodle enoodle requested a review from davidLif May 4, 2026 15:20
@enoodle
Collaborator

enoodle commented May 4, 2026

@FouoF Can you please implement it as a new binder plugin that is "opt-in", i.e., not enabled by default?
Currently it is added as part of the GPU sharing plugin, but I would like to keep all the GPU virtualization separate to avoid a mess in the future when more are implemented.

@github-actions

github-actions Bot commented May 4, 2026

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/kai-scheduler/KAI-scheduler/pkg/apis/kai/v1/binder | 32.73% (ø) | |
| github.com/kai-scheduler/KAI-scheduler/pkg/binder/binding | 72.92% (ø) | |
| github.com/kai-scheduler/KAI-scheduler/pkg/binder/common | 13.27% (-1.68%) | 👎 |
| github.com/kai-scheduler/KAI-scheduler/pkg/binder/plugins | 64.89% (-0.40%) | 👎 |
| github.com/kai-scheduler/KAI-scheduler/pkg/binder/plugins/gpusharing | 30.23% (-13.10%) | 💀 |
| github.com/kai-scheduler/KAI-scheduler/pkg/operator/operands/binder | 74.03% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/kai-scheduler/KAI-scheduler/pkg/apis/kai/v1/binder/binder.go | 94.74% (ø) | 57 | 54 | 3 | |
| github.com/kai-scheduler/KAI-scheduler/pkg/binder/common/constants.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/kai-scheduler/KAI-scheduler/pkg/binder/common/gpu_access.go | 0.00% (ø) | 76 (+11) | 0 | 76 (+11) | |
| github.com/kai-scheduler/KAI-scheduler/pkg/binder/plugins/config.go | 98.04% (ø) | 51 | 50 | 1 | |
| github.com/kai-scheduler/KAI-scheduler/pkg/binder/plugins/factory.go | 53.23% (+1.30%) | 62 (+10) | 33 (+6) | 29 (+4) | 👍 |
| github.com/kai-scheduler/KAI-scheduler/pkg/binder/plugins/gpusharing/gpu_sharing.go | 30.23% (-13.10%) | 86 (+26) | 26 | 60 (+26) | 💀 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/kai-scheduler/KAI-scheduler/pkg/apis/kai/v1/binder/binder_test.go
  • github.com/kai-scheduler/KAI-scheduler/pkg/binder/binding/default_binder_test.go
  • github.com/kai-scheduler/KAI-scheduler/pkg/binder/binding/fraction_binder_test.go
  • github.com/kai-scheduler/KAI-scheduler/pkg/binder/plugins/config_test.go
  • github.com/kai-scheduler/KAI-scheduler/pkg/binder/plugins/gpusharing/gpu_sharing_test.go
  • github.com/kai-scheduler/KAI-scheduler/pkg/operator/operands/binder/binder_test.go

@FouoF FouoF force-pushed the migrate/hami-core-on-pr1480 branch from b88c57a to a5ee0a6 Compare May 6, 2026 01:59
@FouoF
Author

FouoF commented May 6, 2026

> @FouoF Can you please implement it as a new plugin to the binder that will be "opt-in" - ie not enabled by default? Currently it is added as part of the gpu sharing plugin, but I would like to keep all the GPU virtualization separate to avoid a mess in the future when more are implemented.

Thanks, I agree with the goal of keeping GPU virtualization separate from the GPU sharing plugin.

After looking deeper, I don't think making this a binder-only standalone plugin is actually independent in the current architecture. The virtualization logic still relies on several pieces owned by GPU sharing: the admission mutation that injects the env/configmap references, the runai/shared-gpu-configmap annotation, the GPU sharing ConfigMap lifecycle, and the binder plugin ordering where GPU sharing must create the ConfigMap before virtualization can update it.

So an opt-in binder plugin would currently be a hidden extension of gpusharing, not a truly separate virtualization plugin. That could be misleading and still leave us with tight coupling.

I think there are two reasonable options:

  1. Keep this behavior explicitly under GPU sharing for now.
  2. Do a larger follow-up refactor that separates the admission mutation, ConfigMap/env management, and binder logic for GPU virtualization as its own opt-in path.

Given that, I’d prefer not to present this as an independent binder plugin in this PR unless we also split the admission/configmap pieces.
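
The ordering dependency described in this comment (GPU sharing must create the ConfigMap before the virtualization step can update it) can be illustrated with a small sketch. Everything here is hypothetical: the `plugin` interface, the `configMaps` in-memory store, the plugin names, and the idea of storing `GPU_MEMORY_LIMIT` as a ConfigMap key are assumptions made for illustration, not the KAI-scheduler binder API or the code in this PR.

```go
package main

import "fmt"

// configMaps is an in-memory stand-in for the cluster's ConfigMaps: name -> data.
type configMaps map[string]map[string]string

// plugin is a hypothetical binder hook, not the real KAI-scheduler interface.
type plugin interface {
	Name() string
	PreBind(store configMaps, cmName string) error
}

// sharingPlugin plays the role of GPU sharing: it owns ConfigMap creation.
type sharingPlugin struct{}

func (sharingPlugin) Name() string { return "gpusharing" }

func (sharingPlugin) PreBind(store configMaps, cmName string) error {
	if _, ok := store[cmName]; !ok {
		store[cmName] = map[string]string{}
	}
	return nil
}

// memoryLimitPlugin plays the role of the virtualization step: it only
// updates a ConfigMap that already exists and never creates one.
type memoryLimitPlugin struct{ limit string }

func (memoryLimitPlugin) Name() string { return "gpumemorylimit" }

func (p memoryLimitPlugin) PreBind(store configMaps, cmName string) error {
	cm, ok := store[cmName]
	if !ok {
		return fmt.Errorf("%s: ConfigMap %q not found; gpusharing must run first", p.Name(), cmName)
	}
	cm["GPU_MEMORY_LIMIT"] = p.limit // placeholder key and value format
	return nil
}

// runPreBind calls each plugin in slice order and stops at the first error.
func runPreBind(plugins []plugin, store configMaps, cmName string) error {
	for _, p := range plugins {
		if err := p.PreBind(store, cmName); err != nil {
			return err
		}
	}
	return nil
}
```

With the plugins listed as `sharingPlugin` followed by `memoryLimitPlugin`, `runPreBind` succeeds and the limit is written; with the order reversed, the second plugin finds no ConfigMap and returns an error. That is the sense in which a standalone plugin would still be coupled to GPU sharing.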

@enoodle
Collaborator

enoodle commented May 7, 2026

@FouoF Let's present it as a dependent binder plugin, then: it will only function well if the gpusharing plugin is also enabled.

@FouoF
Author

FouoF commented May 8, 2026

@enoodle I have made it an independent plugin and it will check if the gpusharing plugin is enabled.
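
A check of this kind could look roughly like the sketch below, which validates a set of enabled plugins against a dependency table at configuration time. The plugin names, the `requires` table, and `validateEnabled` are illustrative assumptions; the actual check in this PR may be structured differently.

```go
package main

import (
	"fmt"
	"sort"
)

// requires maps a plugin name to the plugins it depends on.
// Both names are illustrative, not confirmed KAI-scheduler identifiers.
var requires = map[string][]string{
	"gpumemorylimit": {"gpusharing"},
}

// validateEnabled returns an error if any enabled plugin depends on a
// plugin that is missing from the map or explicitly disabled.
func validateEnabled(enabled map[string]bool) error {
	names := make([]string, 0, len(enabled))
	for name, on := range enabled {
		if on {
			names = append(names, name)
		}
	}
	sort.Strings(names) // deterministic order for error reporting

	for _, name := range names {
		for _, dep := range requires[name] {
			if !enabled[dep] {
				return fmt.Errorf("plugin %q requires plugin %q to be enabled", name, dep)
			}
		}
	}
	return nil
}
```

Failing at startup with an explicit message keeps the dependency visible to operators, rather than letting the opt-in plugin silently do nothing when GPU sharing is off.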

Comment thread pkg/binder/plugins/gpusharing/gpu_sharing.go Outdated
FouoF and others added 4 commits May 11, 2026 10:31
Signed-off-by: Jifei Wang <poff2001@outlook.com>
Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>
Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>
Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>
@FouoF FouoF force-pushed the migrate/hami-core-on-pr1480 branch from 089af4b to ef6037c Compare May 11, 2026 02:32