fix(model_runner): all_reduce num_kvcache_blocks to MIN across TP ranks #215

Open
Anai-Guo wants to merge 1 commit into GeeeekExplorer:main from Anai-Guo:fix-tp-kvcache-allreduce

@Anai-Guo

Problem

Fixes #187.

Under tensor parallelism, each ModelRunner instance independently estimates config.num_kvcache_blocks from its own GPU memory snapshot (free memory, peak allocations, etc.). Because different ranks can have slightly different memory states at the time of estimation, they can arrive at different block counts:

rank 0: num_kvcache_blocks = 512
rank 1: num_kvcache_blocks = 510

The BlockManager on rank 0 will allocate block IDs up to 511, but rank 1's KV cache only has indices 0–509. When a sequence is assigned block 510 or 511, rank 1 indexes past the end of its cache, and nothing reports the error.
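
To make the mismatch concrete, here is a minimal pure-Python sketch using the two counts above. The helper name `missing_block_ids` is hypothetical and not part of the repository, and it assumes block IDs are 0-based and contiguous, as the example implies. It only lists the IDs a scheduler sized for 512 blocks could hand out that a 510-block cache cannot hold.

```python
def missing_block_ids(scheduler_blocks: int, rank_blocks: int) -> list[int]:
    """Return block IDs the scheduler may issue that this rank's cache lacks.

    Hypothetical helper for illustration only. Assumes 0-based, contiguous
    block IDs, so a cache with N blocks holds IDs 0..N-1.
    """
    return list(range(rank_blocks, scheduler_blocks))


# Counts taken from the example: rank 0 estimated 512, rank 1 estimated 510.
print(missing_block_ids(512, 510))  # prints [510, 511]
```

When both ranks agree on the count, the list is empty, which is the state the fix aims for.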

Fix

After the local estimate, synchronize num_kvcache_blocks across all TP ranks with a MIN all-reduce, so every rank allocates exactly the same number of blocks:

if self.world_size > 1:
    t = torch.tensor(config.num_kvcache_blocks, dtype=torch.int64, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.MIN)
    config.num_kvcache_blocks = int(t.item())

MIN ensures we never over-allocate relative to the most-constrained rank, and since the all-reduce completes before kv_cache is allocated, all ranks allocate the same tensor size.
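
The effect of the MIN reduction can be sketched without GPUs or a process group. The function below is a single-process simulation, not the `torch.distributed` call from the patch: `simulated_min_all_reduce` is a hypothetical name, and the list stands in for the per-rank local estimates.

```python
def simulated_min_all_reduce(local_estimates: list[int]) -> list[int]:
    """Simulate what each rank holds after an all-reduce with op=MIN.

    Hypothetical stand-in for dist.all_reduce(t, op=dist.ReduceOp.MIN):
    every rank ends up with the smallest value contributed by any rank.
    """
    agreed = min(local_estimates)
    return [agreed for _ in local_estimates]


# Local estimates from the example: rank 0 -> 512, rank 1 -> 510.
print(simulated_min_all_reduce([512, 510]))  # prints [510, 510]
```

Choosing MIN rather than MAX or an average is what keeps every issued block ID valid on the most memory-constrained rank. The cost is that the other ranks leave a few blocks' worth of memory unused.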

🤖 Generated with Claude Code

Under tensor parallelism each rank independently estimates the number of
KV-cache blocks from its local GPU memory snapshot. Different ranks can
arrive at different values (due to different driver/activation overhead),
so block IDs are no longer consistent across ranks — the BlockManager on
rank 0 may allocate block 42 while rank 1 has no block 42 in its cache,
silently corrupting KV-cache lookups.

Fix: after the local estimate, synchronize across all TP ranks via
`dist.all_reduce(..., op=ReduceOp.MIN)` so every rank allocates exactly
the same number of blocks — the minimum among all ranks.

Fixes GeeeekExplorer#187
@Anai-Guo

Friendly ping @GeeeekExplorer — small TP-rank kvcache sync fix awaiting first review (34 days). Happy to adjust if anything's off.

Development

Successfully merging this pull request may close these issues.

Potential TP correctness issue: BlockManager capacity may rely on per-rank local KV-cache estimation