Motivation
Recent frontier MoE models — notably DeepSeek V3.2 and GLM 5.1 — use HiSparse sparse attention (LMSYS, "HiSparse: Turbocharging Sparse Attention with Hierarchical Memory") in place of dense attention at long context. HiSparse has already been integrated into SGLang; there is currently no equivalent kernel available in FlashInfer, so serving stacks that build on FlashInfer (vLLM in particular) cannot run these architectures efficiently without falling back to dense attention, which is prohibitively expensive at the context lengths these models are designed for.
FlashInfer already tracks some adjacent pieces of the DeepSeek V3.2 stack (e.g. `model: dsv3.2`-labelled issues such as the segmented top-K work in #3096) — this issue is to track the HiSparse attention kernel itself.
Proposal
Implement HiSparse attention as a FlashInfer attention kernel on Blackwell:
- Support the hierarchical-memory sparse-attention pattern as described in the HiSparse reference (LMSYS) and as used in DeepSeek V3.2 / GLM 5.1.
- Prefill and decode paths.
- Integrate with FlashInfer's existing attention launchers / planner APIs so downstream serving stacks can select HiSparse the same way they select other attention variants today.
- Blackwell / SM100 as the primary target; other SMs as a follow-up.
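To make the requested kernel's behavior concrete, here is a minimal NumPy sketch of a two-level block-sparse attention pattern: a coarse pass scores KV blocks per query, keeps the top-k blocks, and a fine pass runs dense attention only over the kept tokens. This illustrates the general shape of hierarchical sparse attention; the function name, block scoring via block means, and all parameters are illustrative assumptions, not the actual HiSparse algorithm or FlashInfer's API.

```python
# Illustrative two-level sparse attention (NOT the HiSparse kernel itself):
# coarse stage selects top-k KV blocks per query, fine stage attends densely
# within the selected blocks only.
import numpy as np

def block_sparse_attention(q, k, v, block_size=4, topk_blocks=2):
    """q: (d,), k/v: (n, d). Attention output over the selected blocks only."""
    n, d = k.shape
    n_blocks = n // block_size
    # Coarse stage: score each block by the query's dot product with the block mean
    # (a stand-in for whatever block summary the real kernel would keep in fast memory).
    block_means = k[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    coarse_scores = block_means @ q
    keep = np.argsort(coarse_scores)[-topk_blocks:]  # indices of the top-k blocks
    # Fine stage: standard softmax attention restricted to the kept blocks' tokens.
    idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size) for b in keep])
    scores = (k[idx] @ q) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v[idx]

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=(16, 8))
v = rng.normal(size=(16, 8))
out = block_sparse_attention(q, k, v)
print(out.shape)  # (8,)
```

With 16 keys, 4-token blocks, and top-2 selection, the fine stage touches only 8 of 16 tokens; at long context this is where the savings over dense attention come from.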
Success criteria
- Functional parity with the reference HiSparse implementation on representative shapes.
- Usable end-to-end on DeepSeek V3.2 and GLM 5.1 model architectures (validated against expected logits / generations).
- Performance advantage over dense attention at the context lengths these models are deployed at, and reasonable parity with the SGLang HiSparse integration as a public baseline.
Notes