Description
BlockManager.may_append allocates a new physical block when len(seq) % block_size == 1, but the correct condition should be == 0. This causes the new block to be allocated after the first token that needs it has already been scheduled, meaning that token's
KV cache entry will be written to an unallocated block.
Root Cause
In block_manager.py:
def can_append(self, seq: Sequence) -> bool:
return len(self.free_block_ids) >= (len(seq) % self.block_size == 1)
def may_append(self, seq: Sequence):
if len(seq) % self.block_size == 1:
seq.block_table.append(self._allocate_block())
may_append is called during the schedule phase, before the new token is appended to the sequence. At this point, len(seq) reflects the
number of tokens already stored in KV cache blocks.
For a concrete example with block_size=256:
┌────────────────────────────┬──────────┬───────┬───────────────────────┬───────────────────────────────────┐
│ Step │ len(seq) │ % 256 │ Current behavior │ Correct behavior │
├────────────────────────────┼──────────┼───────┼───────────────────────┼───────────────────────────────────┤
│ Prefill done (256 tokens) │ 256 │ 0 │ ❌ No block allocated │ Should allocate — block 0 is full │
├────────────────────────────┼──────────┼───────┼───────────────────────┼───────────────────────────────────┤
│ 1st decode token generated │ 257 │ 1 │ Allocates (too late!) │ Already has block allocated │
├────────────────────────────┼──────────┼───────┼───────────────────────┼───────────────────────────────────┤
│ ... │ ... │ ... │ ... │ ... │
├────────────────────────────┼──────────┼───────┼───────────────────────┼───────────────────────────────────┤
│ Block 1 full (512 tokens) │ 512 │ 0 │ ❌ No block allocated │ Should allocate — block 1 is full │
├────────────────────────────┼──────────┼───────┼───────────────────────┼───────────────────────────────────┤
│ Next decode token │ 513 │ 1 │ Allocates (too late!) │ Already has block allocated │
└────────────────────────────┴──────────┴───────┴───────────────────────┴───────────────────────────────────┘
When len(seq) == 256, block 0 is full. The upcoming decode token (the 257th token) needs to be written to block 1, but may_append does
not allocate it because 256 % 256 == 0 ≠ 1. The allocation only happens on the next schedule cycle, after the token has already been
scheduled and presumably written to KV cache — at which point block 1 is not yet in the block table.
Impact
KV cache writes for the first token of every new block will target an unallocated physical block, which can cause:
- Silent data corruption in KV cache
- Memory corruption / out-of-bounds access
- Incorrect generation results
Fix
Change both conditions from == 1 to == 0:
def can_append(self, seq: Sequence) -> bool:
return len(self.free_block_ids) >= (len(seq) % self.block_size == 0)
def may_append(self, seq: Sequence):
if len(seq) % self.block_size == 0:
seq.block_table.append(self._allocate_block())
can_append must also be fixed because it acts as a guard for may_append — if they disagree on whether a new block is needed, the
scheduler could skip preemption and crash with an IndexError when may_append tries to pop from an empty free_block_ids.
Description
BlockManager.may_appendallocates a new physical block whenlen(seq) % block_size == 1, but the correct condition should be== 0. This causes the new block to be allocated after the first token that needs it has already been scheduled, meaning that token'sKV cache entry will be written to an unallocated block.
Root Cause
In
block_manager.py: