Skip to content

fix(scheduler): resolve out-of-bounds error during prefix cache hit after sequence preemption#211

Open
wangyuzhuo116 wants to merge 2 commits into
GeeeekExplorer:mainfrom
wangyuzhuo116:fix-scheduler-prefix-cache-bug
Open

fix(scheduler): resolve out-of-bounds error during prefix cache hit after sequence preemption#211
wangyuzhuo116 wants to merge 2 commits into
GeeeekExplorer:mainfrom
wangyuzhuo116:fix-scheduler-prefix-cache-bug

Conversation

@wangyuzhuo116

@wangyuzhuo116 wangyuzhuo116 commented Apr 22, 2026

Copy link
Copy Markdown

What does this PR do?

This PR fixes a critical IndexError: list index out of range in model_runner.py that gets triggered during high-concurrency workloads (e.g., running bench.py with 256 sequences).

Detailed Root Cause Analysis

The bug is caused by a state synchronization issue between the Scheduler and the Block Manager when a preempted sequence hits the prefix cache during reallocation.

Step-by-Step Bug Trigger:

  1. Preemption: During high concurrency, a running sequence might be preempted due to a lack of available KV cache blocks. The scheduler calls self.block_manager.deallocate(seq), which clears seq.block_table and resets seq.num_cached_tokens = 0. Crucially, the released physical blocks are returned to the free pool but retain their token hash data (becoming "orphan blocks").
  2. Rescheduling: When the preempted sequence is pulled from the waiting queue to be resumed, the original scheduler logic first calculates the tokens needed for prefill using the outdated cache state: num_tokens = max(seq.num_tokens - seq.num_cached_tokens, 1). Since num_cached_tokens is 0, num_tokens is incorrectly evaluated as the full sequence length.
  3. Prefix Cache Hit: Immediately after, the scheduler calls self.block_manager.allocate(seq). Because the previously released "orphan blocks" still hold the hashed token data, a Prefix Cache Hit occurs. The allocate method internally increments seq.num_cached_tokens (e.g., changing it from 0 to a large value like 256 or 512).
  4. State Desynchronization: The scheduler remains completely unaware of this internal update and blindly assigns the previously calculated (and now oversized) num_tokens to seq.num_scheduled_tokens.
  5. The Crash: When this sequence is dispatched to model_runner.py for prefill, the prepare_prefill method calculates the token boundary as end = seq.num_cached_tokens + seq.num_scheduled_tokens. Because both values are now inappropriately large, the computed end_block index significantly exceeds the actual capacity of seq.block_table, resulting in a fatal IndexError.

The Solution

The fix is straightforward but vital: reorder the logic in nanovllm/engine/scheduler.py so that block allocation happens before calculating the remaining num_tokens.

By executing self.block_manager.allocate(seq) first, we ensure that if a prefix cache hit occurs, the seq.num_cached_tokens property is accurately updated before the scheduler computes num_tokens and assigns seq.num_scheduled_tokens. This mathematically closes the synchronization gap and perfectly aligns the scheduler's sequence state with the physical Block Manager.

Test Status

  • Tested with bench.py under extreme concurrency (num_seqs = 256, max length = 1024).
  • The preemption cycle now safely handles prefix cache hits, completely eliminating the IndexError.

@wangyuzhuo116

Copy link
Copy Markdown
Author

Supplementary Fix: AssertionError in BlockManager.may_append

During further stress testing, I encountered a downstream issue caused by the same preemption mechanism: an AssertionError: last_block.hash != -1 in block_manager.py.

Root Cause:
may_append() assumes a strict token-by-token generation state machine. When len(seq) % block_size == 1, it assumes it needs to allocate a new block and asserts that the previous block is fully hashed.
However, if a sequence is preempted and re-allocated, allocate() bulk-restores the block_table. For a sequence of length 257, allocate creates 2 blocks. When may_append is subsequently called during decode, it checks 257 % 256 == 1 and blindly tests the assertion against the second block (which is not full and has hash == -1), causing a crash.

Solution:
Refactored may_append() to check if len(block_table) < seq.num_blocks: before appending, and to check if last_block.hash == -1: before computing the hash. This safely integrates the bulk-allocated state of re-prefilled sequences with the token-by-token expectations of the decode phase.


Final Benchmark Result 🚀

With both the prefill prefix-cache state mismatch and the decode assertion bug fixed, the engine now runs extremely stable under extreme concurrency.

Here is my local benchmark result using bench.py (256 seqs):
Total: 133966tok, Time: 23.19s, Throughput: 5777.29tok/s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant