Add Qwen3.5 (text + multimodal) support by 86MaxCao · Pull Request #232 · GeeeekExplorer/nano-vllm

86MaxCao · 2026-05-13T13:50:35Z

Summary

Add Qwen3.5 (text + multimodal) support and loader wiring so nano-vllm can run vision–language workloads.
Extend engine components for multimodal prefill (placeholder expansion), vision-token caching, KV-cache and block handling for special vision tokens, and decode-time recurrent state where needed.
Speed up GatedDeltaNet (GDN) decode with a persistent slot pool and vectorized index operations, cutting the per-step Python gather/scatter overhead on the hot decode path.
Implement relevant hot paths with Triton kernels.
Add example_qwen3_5.py and bench_qwen3_5.py for quick testing and benchmarking.
Update README with model download notes and how to run the multimodal example.
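
For reviewers unfamiliar with the "placeholder expansion" step mentioned above, here is a minimal sketch of the general idea, not the code in this diff: each image placeholder in the prompt is repeated once per vision token so that prefill sees the right sequence length. The function name, the placeholder ID `99`, and the per-image token counts are hypothetical and chosen only for illustration.

```python
_EXHAUSTED = object()  # sentinel so a legitimate count of 0 is not mistaken for "no more images"


def expand_placeholders(token_ids, placeholder_id, vision_token_counts):
    """Repeat the i-th placeholder token vision_token_counts[i] times.

    Hypothetical helper: the real engine may expand placeholders differently.
    """
    counts = iter(vision_token_counts)
    expanded = []
    for tok in token_ids:
        if tok == placeholder_id:
            n = next(counts, _EXHAUSTED)
            if n is _EXHAUSTED:
                raise ValueError("more placeholders than images")
            # One prompt placeholder becomes n positions for vision embeddings.
            expanded.extend([placeholder_id] * n)
        else:
            expanded.append(tok)
    if next(counts, _EXHAUSTED) is not _EXHAUSTED:
        raise ValueError("more images than placeholders")
    return expanded


# Two images that (by assumption) produce 3 and 2 vision tokens respectively.
prompt = [1, 99, 2, 99]
expanded = expand_placeholders(prompt, placeholder_id=99, vision_token_counts=[3, 2])
```

After expansion the sequence length matches the number of positions the runner must fill, which is also what block allocation for the KV cache would need to account for.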

python3 example_qwen3_5.py --model "/YOUR/MODEL/PATH"
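
The persistent slot pool mentioned in the summary can be pictured with the bookkeeping sketch below. It is an illustration under assumptions, not the implementation in this PR: `SlotPool` and its methods are hypothetical names. The idea is that each running sequence keeps a fixed slot in a preallocated recurrent-state buffer for its whole lifetime, so a decode step only needs the list of slot indices for the batch, which a runner could turn into a single index tensor for one batched gather/scatter (for example `state[slots]` in PyTorch) instead of copying state per sequence in Python.

```python
class SlotPool:
    """Hypothetical free-list allocator mapping sequence IDs to state slots."""

    def __init__(self, num_slots):
        # Reversed so that pop() hands out slot 0 first.
        self._free = list(range(num_slots - 1, -1, -1))
        self._slot_of = {}

    def allocate(self, seq_id):
        """Return the sequence's slot, assigning a free one on first use."""
        if seq_id in self._slot_of:
            return self._slot_of[seq_id]
        if not self._free:
            raise RuntimeError("slot pool exhausted")
        slot = self._free.pop()
        self._slot_of[seq_id] = slot
        return slot

    def release(self, seq_id):
        """Return a finished sequence's slot to the free list."""
        self._free.append(self._slot_of.pop(seq_id))

    def slots_for(self, seq_ids):
        """Slot indices for a decode batch, in batch order."""
        return [self._slot_of[s] for s in seq_ids]


pool = SlotPool(num_slots=3)
first = pool.allocate("a")
second = pool.allocate("b")
pool.release("a")            # "a" finished; its slot becomes reusable
reused = pool.allocate("c")  # takes over the slot "a" just freed
batch_slots = pool.slots_for(["b", "c"])
```

Because slots persist across steps, the per-step cost is building one small index list rather than moving state in and out of per-sequence containers.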

This is a large diff across model loading, runner, scheduler, and caching; happy to walk through details or split the change if maintainers prefer.
If keeping multimodal in core is not desired yet, I’m open to a slimmer PR or an extension-style layout.

Co-authored-by: Cursor <cursoragent@cursor.com>

86MaxCao and others added 2 commits May 13, 2026 15:59

Add Qwen3.5 support (multimodal ops, examples, docs)

06005bd

Co-authored-by: Cursor <cursoragent@cursor.com>

Sync Qwen3.5 updates from AIInfra (model_runner, qwen3_5, README)

b48f59a

Co-authored-by: Cursor <cursoragent@cursor.com>

86MaxCao mentioned this pull request May 13, 2026

Add Qwen3-VL multimodal support #132 (Open)