
Add Qwen3.5 (text + multimodal) support #232

Open
86MaxCao wants to merge 2 commits into GeeeekExplorer:main from 86MaxCao:feat/qwen3-5


@86MaxCao


Summary

  • Add Qwen3.5 (text + multimodal) support and loader wiring so nano-vllm can run vision–language workloads.
  • Extend engine components for multimodal prefill (placeholder expansion), vision-token caching, KV / block handling for special vision tokens, and decode-time recurrent state where needed.
  • Speed up GatedDeltaNet (GDN) decode with a persistent slot pool and more vectorized index-style ops, reducing heavy per-step Python gather/scatter on the hot decode path.
  • Implement relevant hot paths with Triton kernels.
  • Add example_qwen3_5.py and bench_qwen3_5.py for quick testing and benchmarking.
  • Update README with model download notes and how to run the multimodal example.
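For readers new to the two engine ideas named above, placeholder expansion and the persistent slot pool, here is a minimal plain-Python sketch. It is illustrative only and is not the code in this diff: `expand_placeholders`, `SlotPool`, and `IMAGE_PAD_ID` are hypothetical names, and the placeholder id is a stand-in rather than a real Qwen3.5 token id.

```python
from typing import Dict, List

# Hypothetical placeholder id for illustration; not the real Qwen3.5 token id.
IMAGE_PAD_ID = -1


def expand_placeholders(token_ids: List[int], vision_lens: List[int],
                        pad_id: int = IMAGE_PAD_ID) -> List[int]:
    """Replace the i-th placeholder with vision_lens[i] copies of itself,
    so each image occupies as many positions as it has vision tokens."""
    out: List[int] = []
    image_idx = 0
    for tok in token_ids:
        if tok == pad_id:
            if image_idx >= len(vision_lens):
                raise ValueError("more placeholders than images")
            out.extend([pad_id] * vision_lens[image_idx])
            image_idx += 1
        else:
            out.append(tok)
    if image_idx != len(vision_lens):
        raise ValueError("more images than placeholders")
    return out


class SlotPool:
    """Fixed pool of recurrent-state slots. A sequence keeps the same slot
    for its whole lifetime, so decode steps only need a list of indices."""

    def __init__(self, num_slots: int) -> None:
        # Reversed so that pop() hands out slot 0 first.
        self.free: List[int] = list(range(num_slots - 1, -1, -1))
        self.slot_of: Dict[int, int] = {}

    def acquire(self, seq_id: int) -> int:
        """Return the sequence's slot, allocating one on first use."""
        if seq_id in self.slot_of:
            return self.slot_of[seq_id]
        if not self.free:
            raise RuntimeError("no free slots")
        slot = self.free.pop()
        self.slot_of[seq_id] = slot
        return slot

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's slot to the pool."""
        self.free.append(self.slot_of.pop(seq_id))

    def batch_slots(self, seq_ids: List[int]) -> List[int]:
        """Slot indices for a decode batch, in batch order."""
        return [self.slot_of[s] for s in seq_ids]
```

In a tensor-based runner, the list returned by `batch_slots` would presumably become an index tensor used for one batched read and write of the state buffer, in place of a per-sequence Python loop; that mapping is an assumption about the design, not a description of the actual kernels.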

Performance (reference)

  • Text (256 seqs): ~5,338 tok/s
  • Multimodal (10 images): ~550 tok/s

Testing

python3 example_qwen3_5.py --model "/YOUR/MODEL/PATH"

Notes

  • This is a large diff across model loading, runner, scheduler, and caching; happy to walk through details or split the change if maintainers prefer.
  • If keeping multimodal in core is not desired yet, I’m open to a slimmer PR or an extension-style layout.

86MaxCao and others added 2 commits May 13, 2026 15:59
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
