Fix CUDA graph block_tables shape mismatch #191

Open
ilrewrite wants to merge 2 commits into GeeeekExplorer:main from ilrewrite:fix-cudagraph-block-tables

ilrewrite commented Mar 24, 2026

Closes #190

## What

Add an explicit decode CUDA graph eligibility check and fall back to eager execution when the runtime decode batch exceeds the captured
graph coverage.

## Why

Decode CUDA graph replay currently assumes that runtime decode state always fits the graph buffers captured up front from
max_model_len.

Under higher-concurrency / longer-context workloads, this assumption can break. One concrete failure mode is:

RuntimeError: The expanded size of the tensor (...) must match the existing size (...)

caused by context.block_tables being wider than the captured graph buffer.

The broader issue is that replay is attempted even when runtime decode state is outside the graph coverage window.

## How

This patch makes CUDA graph replay conditional on runtime eligibility:

1. Add max_seq_len_to_capture to config, defaulting to max_model_len.
2. Size captured decode graph buffers from max_seq_len_to_capture.
3. Before replay, check that:
    - the step is decode-only
    - eager is not forced
    - batch size is within captured graph support
    - max decode context length is within max_seq_len_to_capture
    - runtime block_tables width fits the captured graph buffer
4. If any check fails, fall back to eager execution instead of replaying the graph.
5. Clear the reused block_tables graph buffer with -1 before copying the current step's values.
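
The checks in step 3 and the buffer reset in step 5 can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code in this patch: `CaptureLimits`, `can_replay`, and `stage_block_tables` are hypothetical names, nested lists stand in for the real tensors, and the buffer width is assumed to be `max_seq_len_to_capture` divided by the block size, rounded up.

```python
from dataclasses import dataclass


@dataclass
class CaptureLimits:
    """Hypothetical summary of what the decode graphs were captured for."""
    max_batch_size: int          # largest batch size covered by a captured graph
    max_seq_len_to_capture: int  # longest context the graph buffers cover
    block_size: int              # tokens per KV-cache block (assumed)

    @property
    def block_table_width(self) -> int:
        # Columns in the captured block_tables buffer (ceiling division).
        return -(-self.max_seq_len_to_capture // self.block_size)


def can_replay(limits, *, is_prefill, enforce_eager, context_lens, block_tables):
    """Return True only if every eligibility check from step 3 passes."""
    if is_prefill or enforce_eager:
        return False
    if not context_lens or len(context_lens) > limits.max_batch_size:
        return False
    if max(context_lens) > limits.max_seq_len_to_capture:
        return False
    widest_row = max((len(row) for row in block_tables), default=0)
    return widest_row <= limits.block_table_width


def stage_block_tables(buffer, block_tables):
    """Step 5: reset the reused buffer to -1, then copy this step's rows."""
    for row in buffer:
        row[:] = [-1] * len(row)
    for dst, src in zip(buffer, block_tables):
        dst[:len(src)] = src


limits = CaptureLimits(max_batch_size=4, max_seq_len_to_capture=1024, block_size=256)
fits = can_replay(limits, is_prefill=False, enforce_eager=False,
                  context_lens=[300, 900], block_tables=[[0, 1], [2, 3, 4, 5]])
too_wide = can_replay(limits, is_prefill=False, enforce_eager=False,
                      context_lens=[300, 900], block_tables=[[0, 1], [2, 3, 4, 5, 6]])
```

With these assumed limits the captured buffer is four blocks wide, so `fits` is true and `too_wide` is false; in the second case the caller would run the step eagerly (step 4) instead of replaying the graph. Resetting the buffer before each copy keeps entries from a previous, wider step from leaking into the current one.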

## Validation

- python -m py_compile nanovllm/config.py nanovllm/engine/model_runner.py passes.
- The change is local to decode CUDA graph selection / replay.
- The same design has been validated locally in a derived codebase: longer-context serving benchmarks that previously hit
  graph boundary conditions now run without crashing.

## Notes

This keeps CUDA graph as an optimization path rather than a correctness assumption.

It is intentionally conservative: runtime decode batches outside the capture window are handled by eager execution.

ilrewrite closed this Mar 25, 2026
ilrewrite deleted the fix-cudagraph-block-tables branch March 25, 2026 03:01
ilrewrite reopened this Mar 25, 2026
Development

Successfully merging this pull request may close these issues.

CUDA graph replay can fail with block_tables shape mismatch