You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This PR fixes a critical IndexError: list index out of range in model_runner.py that gets triggered during high-concurrency workloads (e.g., running bench.py with 256 sequences).
Detailed Root Cause Analysis
The bug is caused by a state synchronization issue between the Scheduler and the Block Manager when a preempted sequence hits the prefix cache during reallocation.
Step-by-Step Bug Trigger:
Preemption: During high concurrency, a running sequence might be preempted due to a lack of available KV cache blocks. The scheduler calls self.block_manager.deallocate(seq), which clears seq.block_table and resets seq.num_cached_tokens = 0. Crucially, the released physical blocks are returned to the free pool but retain their token hash data (becoming "orphan blocks").
Rescheduling: When the preempted sequence is pulled from the waiting queue to be resumed, the original scheduler logic first calculates the tokens needed for prefill using the outdated cache state: num_tokens = max(seq.num_tokens - seq.num_cached_tokens, 1). Since num_cached_tokens is 0, num_tokens is incorrectly evaluated as the full sequence length.
Prefix Cache Hit: Immediately after, the scheduler calls self.block_manager.allocate(seq). Because the previously released "orphan blocks" still hold the hashed token data, a Prefix Cache Hit occurs. The allocate method internally increments seq.num_cached_tokens (e.g., changing it from 0 to a large value like 256 or 512).
State Desynchronization: The scheduler remains completely unaware of this internal update and blindly assigns the previously calculated (and now oversized) num_tokens to seq.num_scheduled_tokens.
The Crash: When this sequence is dispatched to model_runner.py for prefill, the prepare_prefill method calculates the token boundary as end = seq.num_cached_tokens + seq.num_scheduled_tokens. Because both values are now inappropriately large, the computed end_block index significantly exceeds the actual capacity of seq.block_table, resulting in a fatal IndexError.
The Solution
The fix is straightforward but vital: reorder the logic in nanovllm/engine/scheduler.py so that block allocation happens before calculating the remaining num_tokens.
By executing self.block_manager.allocate(seq) first, we ensure that if a prefix cache hit occurs, the seq.num_cached_tokens property is accurately updated before the scheduler computes num_tokens and assigns seq.num_scheduled_tokens. This mathematically closes the synchronization gap and perfectly aligns the scheduler's sequence state with the physical Block Manager.
Test Status
Tested with bench.py under extreme concurrency (num_seqs = 256, max length = 1024).
The preemption cycle now safely handles prefix cache hits, completely eliminating the IndexError.
Supplementary Fix: AssertionError in BlockManager.may_append
During further stress testing, I encountered a downstream issue caused by the same preemption mechanism: an AssertionError: last_block.hash != -1 in block_manager.py.
Root Cause: may_append() assumes a strict token-by-token generation state machine. When len(seq) % block_size == 1, it assumes it needs to allocate a new block and asserts that the previous block is fully hashed.
However, if a sequence is preempted and re-allocated, allocate() bulk-restores the block_table. For a sequence of length 257, allocate creates 2 blocks. When may_append is subsequently called during decode, it checks 257 % 256 == 1 and blindly tests the assertion against the second block (which is not full and has hash == -1), causing a crash.
Solution:
Refactored may_append() to check if len(block_table) < seq.num_blocks: before appending, and to check if last_block.hash == -1: before computing the hash. This safely integrates the bulk-allocated state of re-prefilled sequences with the token-by-token expectations of the decode phase.
Final Benchmark Result 🚀
With both the prefill prefix-cache state mismatch and the decode assertion bug fixed, the engine now runs extremely stable under extreme concurrency.
Here is my local benchmark result using bench.py (256 seqs): Total: 133966tok, Time: 23.19s, Throughput: 5777.29tok/s
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
This PR fixes a critical
IndexError: list index out of rangeinmodel_runner.pythat gets triggered during high-concurrency workloads (e.g., runningbench.pywith 256 sequences).Detailed Root Cause Analysis
The bug is caused by a state synchronization issue between the Scheduler and the Block Manager when a preempted sequence hits the prefix cache during reallocation.
Step-by-Step Bug Trigger:
self.block_manager.deallocate(seq), which clearsseq.block_tableand resetsseq.num_cached_tokens = 0. Crucially, the released physical blocks are returned to the free pool but retain their token hash data (becoming "orphan blocks").waitingqueue to be resumed, the original scheduler logic first calculates the tokens needed for prefill using the outdated cache state:num_tokens = max(seq.num_tokens - seq.num_cached_tokens, 1). Sincenum_cached_tokensis0,num_tokensis incorrectly evaluated as the full sequence length.self.block_manager.allocate(seq). Because the previously released "orphan blocks" still hold the hashed token data, a Prefix Cache Hit occurs. Theallocatemethod internally incrementsseq.num_cached_tokens(e.g., changing it from0to a large value like256or512).num_tokenstoseq.num_scheduled_tokens.model_runner.pyfor prefill, theprepare_prefillmethod calculates the token boundary asend = seq.num_cached_tokens + seq.num_scheduled_tokens. Because both values are now inappropriately large, the computedend_blockindex significantly exceeds the actual capacity ofseq.block_table, resulting in a fatalIndexError.The Solution
The fix is straightforward but vital: reorder the logic in
nanovllm/engine/scheduler.pyso that block allocation happens before calculating the remainingnum_tokens.By executing
self.block_manager.allocate(seq)first, we ensure that if a prefix cache hit occurs, theseq.num_cached_tokensproperty is accurately updated before the scheduler computesnum_tokensand assignsseq.num_scheduled_tokens. This mathematically closes the synchronization gap and perfectly aligns the scheduler's sequence state with the physical Block Manager.Test Status
bench.pyunder extreme concurrency (num_seqs = 256, max length = 1024).IndexError.