Skip to content

[Bug] may_append allocates new block one token too late, causing KV cache write to unallocated block #240

Description

@MrPaoBrother

Description

BlockManager.may_append allocates a new physical block when len(seq) % block_size == 1, but the correct condition should be == 0. This causes the new block to be allocated after the first token that needs it has already been scheduled, meaning that token's
KV cache entry will be written to an unallocated block.

Root Cause

In block_manager.py:

def can_append(self, seq: Sequence) -> bool:
    return len(self.free_block_ids) >= (len(seq) % self.block_size == 1)

def may_append(self, seq: Sequence):
    if len(seq) % self.block_size == 1:
        seq.block_table.append(self._allocate_block())

may_append is called during the schedule phase, before the new token is appended to the sequence. At this point, len(seq) reflects the
number of tokens already stored in KV cache blocks.

For a concrete example with block_size=256:

┌────────────────────────────┬──────────┬───────┬───────────────────────┬───────────────────────────────────┐
│            Steplen(seq) │ % 256Current behaviorCorrect behavior          │
├────────────────────────────┼──────────┼───────┼───────────────────────┼───────────────────────────────────┤
│ Prefill done (256 tokens)  │ 2560     │ ❌ No block allocatedShould allocateblock 0 is full │
├────────────────────────────┼──────────┼───────┼───────────────────────┼───────────────────────────────────┤
│ 1st decode token generated2571Allocates (too late!) │ Already has block allocated       │
├────────────────────────────┼──────────┼───────┼───────────────────────┼───────────────────────────────────┤
│ ...                        │ ...      │ ...   │ ...                   │ ...                               │
├────────────────────────────┼──────────┼───────┼───────────────────────┼───────────────────────────────────┤
│ Block 1 full (512 tokens)  │ 5120     │ ❌ No block allocatedShould allocateblock 1 is full │
├────────────────────────────┼──────────┼───────┼───────────────────────┼───────────────────────────────────┤
│ Next decode token5131Allocates (too late!) │ Already has block allocated       │
└────────────────────────────┴──────────┴───────┴───────────────────────┴───────────────────────────────────┘

When len(seq) == 256, block 0 is full. The upcoming decode token (the 257th token) needs to be written to block 1, but may_append does
not allocate it because 256 % 256 == 01. The allocation only happens on the next schedule cycle, after the token has already been
scheduled and presumably written to KV cacheat which point block 1 is not yet in the block table.

Impact

KV cache writes for the first token of every new block will target an unallocated physical block, which can cause:

- Silent data corruption in KV cache
- Memory corruption / out-of-bounds access
- Incorrect generation results

Fix

Change both conditions from == 1 to == 0:

def can_append(self, seq: Sequence) -> bool:
    return len(self.free_block_ids) >= (len(seq) % self.block_size == 0)

def may_append(self, seq: Sequence):
    if len(seq) % self.block_size == 0:
        seq.block_table.append(self._allocate_block())

can_append must also be fixed because it acts as a guard for may_appendif they disagree on whether a new block is needed, the
scheduler could skip preemption and crash with an IndexError when may_append tries to pop from an empty free_block_ids.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions