
Conversation

@kyuyeunk (Collaborator)

Description

Fix how KV cache padding is calculated in the hybrid KV cache use case.

Tests

In the scenario below, we use tp=8, where the KV cache pages are padded by 4x. Therefore, the usable number of KV cache blocks should be 1/4 of the non-padded count.
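
To illustrate the arithmetic, here is a minimal sketch (not the actual tpu_worker.py code; the function and parameter names are illustrative assumptions) of how the block count is rescaled when the attention kernel pads each KV cache page:

```python
def recalculate_num_blocks(num_blocks: int,
                           vllm_page_size_bytes: int,
                           kernel_page_size_bytes: int) -> int:
    """Scale the block count so total KV cache memory stays constant
    when the attention kernel pads each page to a larger size."""
    if kernel_page_size_bytes <= vllm_page_size_bytes:
        return num_blocks  # no padding, nothing to adjust
    padding_factor = kernel_page_size_bytes // vllm_page_size_bytes
    return num_blocks // padding_factor

# Page sizes taken from the logs below: the kernel page is 4x larger,
# so only a quarter of the original blocks fit in the same memory.
print(recalculate_num_blocks(76268, 524288, 2097152))  # -> 19067
```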

Before: Overrode num_gpu_blocks: 76268 -> 9533

(EngineCore_DP0 pid=3675426) INFO 12-11 07:05:18 [tpu_worker.py:420] KV cache page size calculated by vLLM (524288 Bytes) does not match with actual page size used by Attention kernel (2097152 Bytes). Recalculating number of KV blocks using actual page size.
(EngineCore_DP0 pid=3675426) INFO 12-11 07:05:18 [tpu_worker.py:294] Memory statistics | total_hbm_limit_gb=757.97GiB | total_hbm_limit_cap_gb=735.23GiB | total_hbm_used_gb=64.9GiB | total_hbm_avail_gb=670.33GiB
(EngineCore_DP0 pid=3675426) INFO 12-11 07:05:18 [tpu_worker.py:294] Memory statistics | total_hbm_limit_gb=757.97GiB | total_hbm_limit_cap_gb=735.23GiB | total_hbm_used_gb=64.9GiB | total_hbm_avail_gb=670.33GiB
(EngineCore_DP0 pid=3675426) INFO 12-11 07:05:18 [kv_cache_utils.py:805] Overriding num_gpu_blocks=76268 with num_gpu_blocks_override=9533
(EngineCore_DP0 pid=3675426) INFO 12-11 07:05:18 [kv_cache_utils.py:1291] GPU KV cache size: 2,440,192 tokens

After: Overrode num_gpu_blocks: 76268 -> 19067

(EngineCore_DP0 pid=3647999) INFO 12-11 06:56:23 [tpu_runner.py:526] Init model | hbm=[(8.11, 94.75), (8.11, 94.75), (8.11, 94.75), (8.11, 94.75), (8.11, 94.75), (8.11, 94.75), (8.11, 94.75), (8.11, 94.75)]GiB
(EngineCore_DP0 pid=3647999) WARNING 12-11 06:56:23 [kv_cache_manager.py:116] Compilation num_layers = 36
(EngineCore_DP0 pid=3647999) INFO 12-11 06:56:23 [tpu_worker.py:421] KV cache page size calculated by vLLM (524288 Bytes) does not match with actual page size used by the kernel (2097152 Bytes). Recalculating number of KV blocks using actual page size.
(EngineCore_DP0 pid=3647999) INFO 12-11 06:56:23 [tpu_worker.py:295] Memory statistics | total_hbm_limit_gb=757.97GiB | total_hbm_limit_cap_gb=735.23GiB | total_hbm_used_gb=64.9GiB | total_hbm_avail_gb=670.33GiB
(EngineCore_DP0 pid=3647999) INFO 12-11 06:56:23 [tpu_worker.py:295] Memory statistics | total_hbm_limit_gb=757.97GiB | total_hbm_limit_cap_gb=735.23GiB | total_hbm_used_gb=64.9GiB | total_hbm_avail_gb=670.33GiB
(EngineCore_DP0 pid=3647999) INFO 12-11 06:56:23 [kv_cache_utils.py:805] Overriding num_gpu_blocks=76268 with num_gpu_blocks_override=19067

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have added necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

@kyuyeunk added the ready label (ONLY add when PR is ready to merge / full CI is needed) on Dec 11, 2025
@kyuyeunk force-pushed the support_hybrid_kv_for_padding branch from b854bd4 to 5ee16a6 on December 11, 2025 22:41
@kyuyeunk force-pushed the support_hybrid_kv_for_padding branch from 5ee16a6 to b690326 on December 12, 2025 03:21
@juncgu-google (Collaborator) left a comment:

LGTM

@kyuyeunk merged commit 2302b08 into main on Dec 12, 2025
41 checks passed
