
Conversation

@kyuyeunk (Collaborator)

Description

Fix how KV cache padding is calculated in the hybrid KV cache use case.

Tests

In the scenario below, we use tp=8, where the KV cache pages are padded by 4x. Therefore, the usable number of KV cache blocks should be 1/4 of the non-padded count.
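
To illustrate the arithmetic, here is a minimal sketch (not the actual tpu_worker.py code; the function and parameter names are illustrative assumptions) of how the block count is rescaled when the attention kernel pads each KV cache page:

```python
def recalculate_num_blocks(num_blocks: int,
                           vllm_page_size_bytes: int,
                           kernel_page_size_bytes: int) -> int:
    """Scale the block count so total KV cache memory stays constant
    when the attention kernel pads each page to a larger size."""
    if kernel_page_size_bytes <= vllm_page_size_bytes:
        return num_blocks  # no padding, nothing to adjust
    padding_factor = kernel_page_size_bytes // vllm_page_size_bytes
    return num_blocks // padding_factor

# Page sizes taken from the logs below: the kernel page is 4x larger,
# so only a quarter of the original blocks fit in the same memory.
print(recalculate_num_blocks(76268, 524288, 2097152))  # -> 19067
```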

Before: Overrode num_gpu_blocks: 76268 -> 9533

(EngineCore_DP0 pid=3675426) INFO 12-11 07:05:18 [tpu_worker.py:420] KV cache page size calculated by vLLM (524288 Bytes) does not match with actual page size used by Attention kernel (2097152 Bytes). Recalculating number of KV blocks using actual page size.
(EngineCore_DP0 pid=3675426) INFO 12-11 07:05:18 [tpu_worker.py:294] Memory statistics | total_hbm_limit_gb=757.97GiB | total_hbm_limit_cap_gb=735.23GiB | total_hbm_used_gb=64.9GiB | total_hbm_avail_gb=670.33GiB
(EngineCore_DP0 pid=3675426) INFO 12-11 07:05:18 [tpu_worker.py:294] Memory statistics | total_hbm_limit_gb=757.97GiB | total_hbm_limit_cap_gb=735.23GiB | total_hbm_used_gb=64.9GiB | total_hbm_avail_gb=670.33GiB
(EngineCore_DP0 pid=3675426) INFO 12-11 07:05:18 [kv_cache_utils.py:805] Overriding num_gpu_blocks=76268 with num_gpu_blocks_override=9533
(EngineCore_DP0 pid=3675426) INFO 12-11 07:05:18 [kv_cache_utils.py:1291] GPU KV cache size: 2,440,192 tokens

After: Overrode num_gpu_blocks: 76268 -> 19067

(EngineCore_DP0 pid=3647999) INFO 12-11 06:56:23 [tpu_runner.py:526] Init model | hbm=[(8.11, 94.75), (8.11, 94.75), (8.11, 94.75), (8.11, 94.75), (8.11, 94.75), (8.11, 94.75), (8.11, 94.75), (8.11, 94.75)]GiB
(EngineCore_DP0 pid=3647999) WARNING 12-11 06:56:23 [kv_cache_manager.py:116] Compilation num_layers = 36
(EngineCore_DP0 pid=3647999) INFO 12-11 06:56:23 [tpu_worker.py:421] KV cache page size calculated by vLLM (524288 Bytes) does not match with actual page size used by the kernel (2097152 Bytes). Recalculating number of KV blocks using actual page size.
(EngineCore_DP0 pid=3647999) INFO 12-11 06:56:23 [tpu_worker.py:295] Memory statistics | total_hbm_limit_gb=757.97GiB | total_hbm_limit_cap_gb=735.23GiB | total_hbm_used_gb=64.9GiB | total_hbm_avail_gb=670.33GiB
(EngineCore_DP0 pid=3647999) INFO 12-11 06:56:23 [tpu_worker.py:295] Memory statistics | total_hbm_limit_gb=757.97GiB | total_hbm_limit_cap_gb=735.23GiB | total_hbm_used_gb=64.9GiB | total_hbm_avail_gb=670.33GiB
(EngineCore_DP0 pid=3647999) INFO 12-11 06:56:23 [kv_cache_utils.py:805] Overriding num_gpu_blocks=76268 with num_gpu_blocks_override=19067

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have added necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

@kyuyeunk added the ready label (ONLY add when PR is ready to merge / full CI is needed) on Dec 11, 2025
@kyuyeunk force-pushed the support_hybrid_kv_for_padding branch from b854bd4 to 5ee16a6 on December 11, 2025 22:41
@kyuyeunk force-pushed the support_hybrid_kv_for_padding branch from 5ee16a6 to b690326 on December 12, 2025 03:21
@juncgu-google (Collaborator) left a comment:

LGTM

@kyuyeunk merged commit 2302b08 into main on Dec 12, 2025
41 checks passed
