fix(model_runner): all_reduce num_kvcache_blocks to MIN across TP ranks by Anai-Guo · Pull Request #215 · GeeeekExplorer/nano-vllm

Anai-Guo · 2026-04-24T16:21:10Z

Problem

Fixes #187.

Under tensor parallelism, each ModelRunner instance independently estimates config.num_kvcache_blocks from its own GPU memory snapshot (free memory, peak allocations, etc.). Because different ranks can have slightly different memory states at the time of estimation, they can arrive at different block counts:

rank 0: num_kvcache_blocks = 512
rank 1: num_kvcache_blocks = 510

The BlockManager on rank 0 hands out block IDs 0–511, but rank 1's KV-cache only has indices 0–509. When a sequence is assigned block 510 or 511, rank 1's cache lookup goes out of range without raising an error.
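To make the mismatch concrete, here is a small pure-Python illustration using the numbers from the example. It does not touch nano-vllm or the GPU; `invalid_block_ids` and the `local_estimates` dict are hypothetical names introduced only for this sketch.

```python
# Hypothetical per-rank estimates, taken from the example above.
local_estimates = {0: 512, 1: 510}


def invalid_block_ids(scheduler_rank, estimates):
    """For each rank, list the block IDs the scheduler can issue that the rank cannot hold."""
    # The scheduling rank issues IDs 0 .. N-1, where N is its own local estimate.
    issued = range(estimates[scheduler_rank])
    return {
        rank: [block_id for block_id in issued if block_id >= capacity]
        for rank, capacity in estimates.items()
    }


out_of_range = invalid_block_ids(scheduler_rank=0, estimates=local_estimates)
print(out_of_range)  # {0: [], 1: [510, 511]}
```

Any sequence that lands on one of the listed IDs is the case described above: rank 0 is fine, rank 1 has no such block.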

Fix

After the local estimate, synchronize num_kvcache_blocks across all TP ranks with a MIN all-reduce, so every rank allocates exactly the same number of blocks:

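if self.world_size > 1:
    # Agree on the smallest per-rank estimate so every rank sizes its KV-cache identically.
    t = torch.tensor(config.num_kvcache_blocks, dtype=torch.int64, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.MIN)  # in-place: t now holds the minimum across ranks
    config.num_kvcache_blocks = int(t.item())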

MIN ensures we never over-allocate relative to the most-constrained rank, and since the all-reduce completes before kv_cache is allocated, all ranks allocate the same tensor size.
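As a rough illustration of why MIN is the safe reduction, the sketch below compares MIN and MAX on the example estimates. It is plain Python, not `torch.distributed`: `reduce_block_counts` is a hypothetical stand-in that only mimics the "every rank ends up with the same reduced value" behaviour of an all-reduce.

```python
def reduce_block_counts(local_estimates, op):
    """Hypothetical stand-in for an all-reduce: every rank receives op(all local values)."""
    agreed = op(local_estimates)
    return [agreed] * len(local_estimates)


local_estimates = [512, 510]  # rank 0, rank 1 (numbers from the example above)

for name, op in (("MIN", min), ("MAX", max)):
    agreed = reduce_block_counts(local_estimates, op)
    # A rank over-allocates if the agreed count exceeds what it estimated it can fit.
    over = [
        rank
        for rank, (count, local) in enumerate(zip(agreed, local_estimates))
        if count > local
    ]
    print(name, agreed, "over-allocating ranks:", over)
```

With MIN, both ranks settle on 510 and no rank exceeds its own estimate; with MAX, rank 1 would be asked to allocate 512 blocks it estimated it cannot fit. The cost of MIN in this example is that rank 0 leaves two blocks unused.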

🤖 Generated with Claude Code

Under tensor parallelism each rank independently estimates the number of KV-cache blocks from its local GPU memory snapshot. Different ranks can arrive at different values (due to different driver/activation overhead), so block IDs are no longer consistent across ranks — the BlockManager on rank 0 may allocate block 42 while rank 1 has no block 42 in its cache, silently corrupting KV-cache lookups. Fix: after the local estimate, synchronize across all TP ranks via `dist.all_reduce(..., op=ReduceOp.MIN)` so every rank allocates exactly the same number of blocks — the minimum among all ranks. Fixes GeeeekExplorer#187

Anai-Guo · 2026-05-28T13:15:03Z

Friendly ping @GeeeekExplorer — small TP-rank kvcache sync fix awaiting first review (34 days). Happy to adjust if anything's off.

Anai-Guo wants to merge 1 commit into GeeeekExplorer:main from Anai-Guo:fix-tp-kvcache-allreduce
