Fix CUDA graph block_tables shape mismatch by ilrewrite · Pull Request #191 · GeeeekExplorer/nano-vllm

ilrewrite · 2026-03-24T15:27:03Z

Closes #190

## What

Add an explicit decode CUDA graph eligibility check and fall back to eager execution when the runtime decode batch exceeds the captured
graph coverage.

## Why

Decode CUDA graph replay currently assumes that runtime decode state always fits the graph buffers captured up front from
max_model_len.

Under higher-concurrency / longer-context workloads, this assumption can break. One concrete failure mode is:

RuntimeError: The expanded size of the tensor (...) must match the existing size (...)

caused by context.block_tables being wider than the captured graph buffer.

The broader issue is that replay is attempted even when runtime decode state is outside the graph coverage window.
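To make the width mismatch concrete, here is a minimal sketch of the arithmetic, assuming a paged KV cache in which each sequence's block table holds one entry per cache block. The helper names, the block size of 256, and the capture length of 4096 are illustrative assumptions and are not taken from the patch.

```python
def blocks_needed(seq_len: int, block_size: int) -> int:
    """Number of KV-cache blocks (block-table columns) used by seq_len tokens."""
    return (seq_len + block_size - 1) // block_size  # ceiling division


# Hypothetical values, for illustration only.
BLOCK_SIZE = 256
CAPTURED_SEQ_LEN = 4096  # context length the graph buffers were sized for

# Width of the block_tables buffer fixed at capture time.
captured_width = blocks_needed(CAPTURED_SEQ_LEN, BLOCK_SIZE)


def fits_captured_buffer(runtime_seq_len: int) -> bool:
    """True if a runtime block-table row is no wider than the captured buffer."""
    return blocks_needed(runtime_seq_len, BLOCK_SIZE) <= captured_width
```

Under these assumptions, a single token past the capture length needs one more column than the buffer has, which is the situation in which copying the runtime block_tables into the captured buffer fails.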

## How

This patch makes CUDA graph replay conditional on runtime eligibility:

1. Add max_seq_len_to_capture to config, defaulting to max_model_len.
2. Size captured decode graph buffers from max_seq_len_to_capture.
3. Before replay, check that:
    - the step is decode-only
    - eager is not forced
    - batch size is within captured graph support
    - max decode context length is within max_seq_len_to_capture
    - runtime block_tables width fits the captured graph buffer
4. If any check fails, fall back to eager execution instead of replaying the graph.
5. Clear the reused block_tables graph buffer with -1 before copying the current step's values.
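The checks in step 3 and the buffer reset in step 5 can be sketched roughly as follows. This is a simplified, framework-free illustration that uses plain Python lists in place of GPU tensors; `GraphCapture`, `DecodeStep`, `can_replay_graph`, and `fill_block_tables` are hypothetical names and are not the identifiers used in the patch.

```python
from dataclasses import dataclass


@dataclass
class GraphCapture:
    """Limits fixed at capture time (hypothetical container)."""
    max_batch_size: int           # largest batch size with a captured graph
    max_seq_len_to_capture: int   # context length the buffers were sized for
    block_tables_width: int       # columns in the captured block_tables buffer


@dataclass
class DecodeStep:
    """Runtime state of one model step (hypothetical container)."""
    is_prefill: bool
    context_lens: list[int]        # one entry per sequence in the batch
    block_tables: list[list[int]]  # one row of block ids per sequence


def can_replay_graph(step: DecodeStep, cap: GraphCapture, enforce_eager: bool) -> bool:
    """Return True only if every eligibility condition holds."""
    if step.is_prefill or enforce_eager:
        return False
    # An empty batch is treated as ineligible (an extra guard for this sketch).
    if not step.context_lens or len(step.context_lens) > cap.max_batch_size:
        return False
    if max(step.context_lens) > cap.max_seq_len_to_capture:
        return False
    widest = max((len(row) for row in step.block_tables), default=0)
    return widest <= cap.block_tables_width


def fill_block_tables(buffer: list[list[int]], step: DecodeStep) -> None:
    """Reset the reused buffer to -1, then copy in the current step's rows.

    Call only after can_replay_graph() returned True, so every row fits.
    """
    for row in buffer:
        row[:] = [-1] * len(row)  # drop stale block ids from earlier steps
    for dst, src in zip(buffer, step.block_tables):
        dst[: len(src)] = src
```

A caller would run the eager path whenever `can_replay_graph` returns False, and only otherwise fill the buffers and replay the graph. Resetting to -1 first means a row that is shorter than in the previous step does not keep leftover block ids in its unused columns.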

## Validation

- python -m py_compile nanovllm/config.py nanovllm/engine/model_runner.py passes.
- The change is local to decode CUDA graph selection / replay.
- The same design has been validated locally in a derived codebase: longer-context serving benchmarks that previously hit graph
  boundary conditions now run without crashing.

## Notes

This keeps CUDA graph as an optimization path rather than a correctness assumption.

It is intentionally conservative: runtime decode batches outside the capture window are handled by eager execution.

Fix CUDA graph block_tables shape mismatch

41bb6d0

ilrewrite closed this Mar 25, 2026

ilrewrite deleted the fix-cudagraph-block-tables branch March 25, 2026 03:01

Guard decode CUDA graph replay with runtime checks

b386ae2

ilrewrite reopened this Mar 25, 2026

ilrewrite wants to merge 2 commits into GeeeekExplorer:main from ilrewrite:fix-cudagraph-block-tables
