Fix prefix cache OOB in prepare_prefill when num_cached_tokens grows during allocate #233

Open
yuanmouya-prog wants to merge 1 commit into GeeeekExplorer:main from yuanmouya-prog:fix/prefix-cache-oob-prepare-prefill

@yuanmouya-prog

Bug

When the prefix cache hits during block_manager.allocate(), num_cached_tokens can jump from 0 to N*block_size. However, the scheduler has already computed num_scheduled_tokens from the old value, num_cached_tokens=0, before calling allocate(). As a result, prepare_prefill computes end = start + seqlen_q, which exceeds num_tokens, and the block_table lookup goes out of range.
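The ordering problem can be illustrated with a toy model. This is only a sketch: ToySeq, schedule, allocate, and prefill_range are hypothetical stand-ins rather than the project's real classes, the block size of 256 is an assumed value, and the scheduler is assumed to schedule all uncached tokens at once.

```python
from dataclasses import dataclass

BLOCK_SIZE = 256  # assumed value, for illustration only


@dataclass
class ToySeq:
    """Hypothetical stand-in for a sequence; not the project's real class."""
    num_tokens: int
    num_cached_tokens: int = 0
    num_scheduled_tokens: int = 0


def schedule(seq: ToySeq) -> None:
    # The scheduler sizes the prefill from the cache state it sees right now.
    seq.num_scheduled_tokens = seq.num_tokens - seq.num_cached_tokens


def allocate(seq: ToySeq, num_hit_blocks: int) -> None:
    # A prefix cache hit during allocation raises num_cached_tokens afterwards.
    seq.num_cached_tokens = num_hit_blocks * BLOCK_SIZE


def prefill_range(seq: ToySeq) -> tuple[int, int]:
    # Unclamped range: start at the cached prefix, extend by the stale count.
    start = seq.num_cached_tokens
    end = start + seq.num_scheduled_tokens
    return start, end


seq = ToySeq(num_tokens=1000)
schedule(seq)                    # num_scheduled_tokens = 1000 (cache looks empty)
allocate(seq, num_hit_blocks=2)  # num_cached_tokens jumps to 512
start, end = prefill_range(seq)
overrun = end - seq.num_tokens   # tokens requested past the end of the sequence
```

In this toy run the range starts at 512 and ends at 1512 for a 1000-token sequence, so any per-token block lookup past index 999 would fall outside the allocated blocks.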

Reproduce

Under high concurrency (512 requests) with a full KV cache, a prefix cache hit followed by preempt → re-prefill crashes every time with IndexError: list index out of range at model_runner.py:155.

Fix

A one-line change: clamp seqlen_q so that it never exceeds the number of remaining uncached tokens.

-            seqlen_q = seq.num_scheduled_tokens
+            seqlen_q = min(seq.num_scheduled_tokens, seqlen - start)
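The effect of the clamp can be checked in isolation. The helper below is a hypothetical wrapper around the expression in the diff, assuming seqlen is the total token count of the sequence and start is the number of cached tokens; it is not a function from the repository.

```python
def clamp_seqlen_q(num_scheduled_tokens: int, seqlen: int, start: int) -> int:
    """Hypothetical helper mirroring the expression in the diff.

    seqlen is assumed to be the sequence's total token count and start the
    number of tokens already covered by the prefix cache.
    """
    return min(num_scheduled_tokens, seqlen - start)


# Stale schedule: 1000 tokens scheduled, but 512 turned out to be cached.
stale = clamp_seqlen_q(num_scheduled_tokens=1000, seqlen=1000, start=512)

# No cache hit: the clamp leaves the scheduled count unchanged.
unchanged = clamp_seqlen_q(num_scheduled_tokens=1000, seqlen=1000, start=0)

# With the clamp, end = start + seqlen_q stays within the sequence.
end = 512 + stale
```

When the schedule is consistent with the cache state, min() returns num_scheduled_tokens unchanged, so the clamp only takes effect in the stale case.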

…during allocate

When prefix caching hits during block_manager.allocate(), num_cached_tokens
can jump from 0 to N*block_size. But the scheduler already computed
num_scheduled_tokens based on the old num_cached_tokens=0 before calling
allocate(). This causes prepare_prefill to compute end = start + seqlen_q
exceeding num_tokens, leading to block_table index out of range.

The fix clamps seqlen_q to not exceed the remaining uncached tokens.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>