fix(model_runner): all_reduce num_kvcache_blocks to MIN across TP ranks#215
Open
Anai-Guo wants to merge 1 commit into
Open
fix(model_runner): all_reduce num_kvcache_blocks to MIN across TP ranks#215Anai-Guo wants to merge 1 commit into
Anai-Guo wants to merge 1 commit into
Conversation
Under tensor parallelism each rank independently estimates the number of KV-cache blocks from its local GPU memory snapshot. Different ranks can arrive at different values (due to different driver/activation overhead), so block IDs are no longer consistent across ranks — the BlockManager on rank 0 may allocate block 42 while rank 1 has no block 42 in its cache, silently corrupting KV-cache lookups. Fix: after the local estimate, synchronize across all TP ranks via `dist.all_reduce(..., op=ReduceOp.MIN)` so every rank allocates exactly the same number of blocks — the minimum among all ranks. Fixes GeeeekExplorer#187
Contributor
Author
|
Friendly ping @GeeeekExplorer — small TP-rank kvcache sync fix awaiting first review (34 days). Happy to adjust if anything's off. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Fixes #187.
Under tensor parallelism, each
ModelRunnerinstance independently estimatesconfig.num_kvcache_blocksfrom its own GPU memory snapshot (free memory, peak allocations, etc.). Because different ranks can have slightly different memory states at the time of estimation, they can arrive at different block counts:The
BlockManageron rank 0 will allocate block IDs up to 511, but rank 1's KV-cache only has indices 0–509. When a sequence is assigned block 510 or 511, rank 1's cache lookup silently goes out of range.Fix
After the local estimate, synchronize
num_kvcache_blocksacross all TP ranks with a MIN all-reduce, so every rank allocates exactly the same number of blocks:MIN ensures we never over-allocate relative to the most-constrained rank, and since the all-reduce completes before
kv_cacheis allocated, all ranks allocate the same tensor size.🤖 Generated with Claude Code