Add SDPA attention fallback and design docs #231

Open
Lebhoryi wants to merge 3 commits into GeeeekExplorer:main from Lebhoryi:ccy_dev
Conversation

@Lebhoryi

Summary

  • Fix model initialization for configs that do not expose dtype, falling back to torch_dtype or the current PyTorch default dtype.
  • Add a selectable attention backend with PyTorch SDPA fallback when flash-attn is unavailable, controlled by NANOVLLM_ATTENTION_BACKEND.
  • Add overview and detailed design docs covering scheduling, KV cache, prefix cache, attention context, tensor parallelism, and execution flow.
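
The dtype fallback in the first bullet can be pictured with a minimal sketch. It is not the PR's actual code: `resolve_dtype` is a hypothetical helper name, and the strings stand in for real `torch.dtype` values so the example runs without PyTorch installed. In real use the last argument would be `torch.get_default_dtype()`.

```python
from types import SimpleNamespace


def resolve_dtype(config, default):
    """Return config.dtype, else config.torch_dtype, else `default`.

    Hypothetical helper: attributes that are missing or set to None
    are both treated as "not provided".
    """
    for name in ("dtype", "torch_dtype"):
        value = getattr(config, name, None)
        if value is not None:
            return value
    return default


if __name__ == "__main__":
    # Stand-in config objects; a real caller would pass the model's
    # Hugging Face config and torch.get_default_dtype().
    only_torch_dtype = SimpleNamespace(torch_dtype="bfloat16")
    neither = SimpleNamespace()
    print(resolve_dtype(only_torch_dtype, "float32"))
    print(resolve_dtype(neither, "float32"))
```

Resolving the dtype once and reusing the result for both weight initialization and KV cache sizing keeps the two in agreement, which is the concern the summary raises.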

Test plan

  • python -m py_compile nanovllm/engine/model_runner.py nanovllm/layers/attention.py
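
The same check can be scripted with the standard-library `py_compile` module, as in the sketch below. `find_syntax_errors` is a hypothetical helper name. Note that `py_compile` only verifies that the files parse and compile to bytecode; it does not import them, so it does not exercise the dtype fallback or either attention backend at runtime.

```python
import py_compile


def find_syntax_errors(paths):
    """Byte-compile each file and return {path: error message} for failures."""
    failures = {}
    for path in paths:
        try:
            # doraise=True raises PyCompileError instead of printing to stderr.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures[str(path)] = exc.msg
    return failures


if __name__ == "__main__":
    # File names taken from the test plan; run from the repository root.
    targets = [
        "nanovllm/engine/model_runner.py",
        "nanovllm/layers/attention.py",
    ]
    for path, message in find_syntax_errors(targets).items():
        print(f"{path}: {message}")
```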

Lebhoryi added 3 commits May 12, 2026 15:22
  • Qwen3Config may not expose dtype, so fall back to torch_dtype or the current default dtype before initializing model weights and sizing the KV cache.
  • Allow running without flash-attn and document backend selection via NANOVLLM_ATTENTION_BACKEND.
  • Document the engine architecture, scheduling flow, KV cache lifecycle, prefix cache behavior, attention context, and execution pipeline.
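
Backend selection with a fallback could look roughly like the sketch below. Only the environment variable name `NANOVLLM_ATTENTION_BACKEND` and the package name `flash_attn` come from the PR text. The accepted values (`auto`, `flash`, `sdpa`), the default, the error behavior, and the function name `select_attention_backend` are assumptions made for illustration and may differ from the actual change.

```python
import importlib.util
import os

# Assumed set of accepted values; the PR text does not list them.
_KNOWN_BACKENDS = ("auto", "flash", "sdpa")


def select_attention_backend(env=None, flash_available=None):
    """Pick "flash" or "sdpa" from NANOVLLM_ATTENTION_BACKEND.

    `env` and `flash_available` are injectable so the logic can be
    tested without touching os.environ or installing flash-attn.
    """
    env = os.environ if env is None else env
    if flash_available is None:
        # find_spec returns None when the package is not installed.
        flash_available = importlib.util.find_spec("flash_attn") is not None

    requested = env.get("NANOVLLM_ATTENTION_BACKEND", "auto").strip().lower()
    if requested not in _KNOWN_BACKENDS:
        raise ValueError(f"unknown attention backend: {requested!r}")
    if requested == "flash" and not flash_available:
        raise RuntimeError("flash backend requested but flash-attn is not installed")
    if requested == "auto":
        # Prefer flash-attn when present, otherwise fall back to SDPA.
        return "flash" if flash_available else "sdpa"
    return requested


if __name__ == "__main__":
    print(select_attention_backend())
```

In a layout like this, the `sdpa` branch would typically dispatch to `torch.nn.functional.scaled_dot_product_attention`, which ships with PyTorch 2.0 and later, so no extra dependency is needed when flash-attn is missing.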