fix(scheduler): resolve out-of-bounds error during prefix cache hit after sequence preemption by wangyuzhuo116 · Pull Request #211 · GeeeekExplorer/nano-vllm

wangyuzhuo116 · 2026-04-22T06:07:55Z

What does this PR do?

This PR fixes an IndexError: list index out of range raised in model_runner.py under high-concurrency workloads (e.g., running bench.py with 256 sequences).

Detailed Root Cause Analysis

The bug is caused by a state synchronization issue between the Scheduler and the Block Manager when a preempted sequence hits the prefix cache during reallocation.

Step-by-Step Bug Trigger:

1. Preemption: Under high concurrency, a running sequence may be preempted when no KV cache blocks are available. The scheduler calls self.block_manager.deallocate(seq), which clears seq.block_table and resets seq.num_cached_tokens = 0. Crucially, the released physical blocks return to the free pool but retain their token hash data (becoming "orphan blocks").
2. Rescheduling: When the preempted sequence is pulled from the waiting queue to be resumed, the original scheduler logic first computes the tokens needed for prefill from the stale cache state: num_tokens = max(seq.num_tokens - seq.num_cached_tokens, 1). Since num_cached_tokens is 0, num_tokens evaluates to the full sequence length.
3. Prefix Cache Hit: Immediately afterwards, the scheduler calls self.block_manager.allocate(seq). Because the released "orphan blocks" still hold the hashed token data, a prefix cache hit occurs. The allocate method internally increments seq.num_cached_tokens (e.g., from 0 to 256 or 512).
4. State Desynchronization: The scheduler is unaware of this internal update and assigns the previously computed (now oversized) num_tokens to seq.num_scheduled_tokens.
5. The Crash: When this sequence is dispatched to model_runner.py for prefill, the prepare_prefill method computes the token boundary as end = seq.num_cached_tokens + seq.num_scheduled_tokens. Because num_cached_tokens has grown while num_scheduled_tokens still covers the full sequence, end overshoots the sequence length, and the derived end_block index exceeds the length of seq.block_table, raising the IndexError.
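To make the arithmetic in the steps above concrete, here is a minimal sketch, not the actual nano-vllm code. It assumes a block size of 256 and that end_block is obtained by ceiling division of end by the block size; the function name and the example lengths (a 600-token sequence with a 512-token cache hit) are hypothetical.

```python
BLOCK_SIZE = 256  # assumed; matches the 256/512 figures in the description


def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def buggy_prefill_bounds(seq_len: int, cache_hit_tokens: int):
    """Return (end_block, block_table_len) under the pre-fix ordering (toy model)."""
    num_cached_tokens = 0                              # reset by deallocate() at preemption
    num_tokens = max(seq_len - num_cached_tokens, 1)   # computed too early: full length
    num_cached_tokens += cache_hit_tokens              # allocate() hits the prefix cache
    num_scheduled_tokens = num_tokens                  # stale value is assigned
    end = num_cached_tokens + num_scheduled_tokens     # overshoots seq_len
    return ceil_div(end, BLOCK_SIZE), ceil_div(seq_len, BLOCK_SIZE)


end_block, table_len = buggy_prefill_bounds(seq_len=600, cache_hit_tokens=512)
print(end_block, table_len)  # 5 3
```

With these numbers, end is 512 + 600 = 1112, so the prefill loop would try to touch 5 blocks while the block table only holds 3.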

The Solution

The fix is to reorder the logic in nanovllm/engine/scheduler.py so that block allocation happens before the remaining num_tokens is calculated.

By executing self.block_manager.allocate(seq) first, any prefix cache hit updates seq.num_cached_tokens before the scheduler computes num_tokens and assigns seq.num_scheduled_tokens. This closes the synchronization gap: the scheduler's view of the sequence matches the state held by the Block Manager.
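A rough sketch of the reordered logic, using toy stand-ins rather than the real classes. ToySeq, ToyBlockManager and schedule_fixed are hypothetical names, the block size of 256 is assumed, and the toy allocate() simply pretends that every full block before the last token is a cache hit.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 256  # assumed block size


@dataclass
class ToySeq:
    num_tokens: int
    num_cached_tokens: int = 0
    num_scheduled_tokens: int = 0
    block_table: list = field(default_factory=list)


class ToyBlockManager:
    """Hypothetical stand-in: treats every full block before the last token as a hit."""

    def allocate(self, seq: ToySeq) -> None:
        num_blocks = (seq.num_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE
        seq.block_table = list(range(num_blocks))
        seq.num_cached_tokens = ((seq.num_tokens - 1) // BLOCK_SIZE) * BLOCK_SIZE


def schedule_fixed(seq: ToySeq, block_manager: ToyBlockManager) -> ToySeq:
    block_manager.allocate(seq)                                  # may raise num_cached_tokens
    num_tokens = max(seq.num_tokens - seq.num_cached_tokens, 1)  # uses the fresh value
    seq.num_scheduled_tokens = num_tokens
    return seq


seq = schedule_fixed(ToySeq(num_tokens=600), ToyBlockManager())
end = seq.num_cached_tokens + seq.num_scheduled_tokens
end_block = (end + BLOCK_SIZE - 1) // BLOCK_SIZE
print(seq.num_scheduled_tokens, end_block, len(seq.block_table))  # 88 3 3
```

In this toy run only the 88 uncached tokens are scheduled, so end equals the sequence length and end_block stays within the block table.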

Test Status

Tested with bench.py under high concurrency (num_seqs = 256, max length = 1024).
The preemption cycle now handles prefix cache hits without raising the IndexError.


wangyuzhuo116 · 2026-04-22T06:37:35Z

Supplementary Fix: `AssertionError` in `BlockManager.may_append`

During further stress testing, I encountered a downstream issue caused by the same preemption mechanism: an AssertionError: last_block.hash != -1 in block_manager.py.

Root Cause:
may_append() assumes a strict token-by-token generation state machine. When len(seq) % block_size == 1, it assumes it needs to allocate a new block and asserts that the previous block is fully hashed.
However, if a sequence is preempted and re-allocated, allocate() bulk-restores the block_table. For a sequence of length 257, allocate creates 2 blocks: one full block and one holding a single token. When may_append is subsequently called during decode, it checks 257 % 256 == 1 and tests the assertion against the second block (which is not full and has hash == -1), causing the crash.
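The failure can be reproduced in miniature. The sketch below is a toy model, not the real block_manager.py: ToyBlock, bulk_allocate and may_append_original are hypothetical names, the block size of 256 is assumed, and the hash values are placeholders.

```python
BLOCK_SIZE = 256  # assumed, matching the 257 % 256 check above


class ToyBlock:
    """Hypothetical stand-in for a KV cache block; hash == -1 means 'not full yet'."""

    def __init__(self, block_hash: int = -1):
        self.hash = block_hash


def bulk_allocate(seq_len: int) -> list:
    """Mimic allocate(): one block per BLOCK_SIZE tokens, hashing only full blocks."""
    num_blocks = (seq_len + BLOCK_SIZE - 1) // BLOCK_SIZE
    blocks = []
    for i in range(num_blocks):
        is_full = (i + 1) * BLOCK_SIZE <= seq_len
        blocks.append(ToyBlock(block_hash=i if is_full else -1))  # placeholder hash
    return blocks


def may_append_original(blocks: list, seq_len: int) -> None:
    """Mimic the pre-fix branch that assumes token-by-token growth."""
    last_block = blocks[-1]
    if seq_len % BLOCK_SIZE == 1:
        assert last_block.hash != -1  # fails when the new block already exists
        blocks.append(ToyBlock())


blocks = bulk_allocate(257)
print(len(blocks), blocks[-1].hash)  # 2 -1
```

Calling may_append_original(blocks, 257) on this state raises AssertionError, because the "previous" block the assertion inspects is actually the partially filled second block.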

Solution:
Refactored may_append() to check if len(block_table) < seq.num_blocks: before appending, and to check if last_block.hash == -1: before computing the hash. This safely integrates the bulk-allocated state of re-prefilled sequences with the token-by-token expectations of the decode phase.
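A minimal sketch of the two guards, under simplifying assumptions: the block table is modeled as a plain list of hashes (-1 meaning "not hashed yet"), the block size is 256, and may_append_guarded and placeholder_hash are hypothetical names rather than the code in this PR.

```python
BLOCK_SIZE = 256  # assumed block size


def placeholder_hash(block_index: int) -> int:
    """Stand-in for the real token hash; any value other than -1 works here."""
    return 1000 + block_index


def may_append_guarded(block_hashes: list, seq_len: int) -> list:
    """block_hashes[i] is the hash of block i, or -1 if the block is not full yet."""
    num_blocks = (seq_len + BLOCK_SIZE - 1) // BLOCK_SIZE
    if seq_len % BLOCK_SIZE == 1:
        # Append only if the block for the newest token does not exist yet.
        if len(block_hashes) < num_blocks:
            block_hashes.append(-1)
    elif seq_len % BLOCK_SIZE == 0:
        # Hash the last block only if it has not been hashed already.
        if block_hashes[-1] == -1:
            block_hashes[-1] = placeholder_hash(len(block_hashes) - 1)
    return block_hashes


# Token-by-token path: one full block, length just reached 257.
print(may_append_guarded([placeholder_hash(0)], 257))      # [1000, -1]
# Bulk-restored path: allocate() already created the second block.
print(may_append_guarded([placeholder_hash(0), -1], 257))  # [1000, -1]
```

Both paths end in the same state, which is the point of the guards: the decode step no longer cares whether the block table was grown one token at a time or restored in bulk.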

Final Benchmark Result 🚀

With both the prefill prefix-cache state mismatch and the decode assertion bug fixed, the engine now runs stably under high concurrency.

Here is my local benchmark result using bench.py (256 seqs):
Total: 133966tok, Time: 23.19s, Throughput: 5777.29tok/s

wangyuzhuo116 added 2 commits April 22, 2026 13:55

fix(scheduler): fix out-of-bounds error during prefix cache hit after…

0f605a5

… preemption

fix(block_manager): resolve AssertionError in may_append caused by pr…

e17eb38

…eemption state mismatch

wangyuzhuo116 wants to merge 2 commits into GeeeekExplorer:main from wangyuzhuo116:fix-scheduler-prefix-cache-bug
