[Feat] HiSparse sparse attention (DeepSeek V3.2 / GLM 5.1) on Blackwell #3111

@mferrato

Motivation

Recent frontier MoE models, notably DeepSeek V3.2 and GLM 5.1, use HiSparse sparse attention (LMSYS, "HiSparse: Turbocharging Sparse Attention with Hierarchical Memory") in place of dense attention at long context. HiSparse has already been integrated into SGLang, but FlashInfer currently has no equivalent kernel. Serving stacks that build on FlashInfer (vLLM in particular) therefore cannot run these architectures efficiently: their only option is to fall back to dense attention, which is prohibitively expensive at the context lengths these models are designed for.

FlashInfer already tracks some adjacent pieces of the DeepSeek V3.2 stack (e.g. issues labelled model: dsv3.2, such as the segmented top-K work in #3096). This issue tracks the HiSparse attention kernel itself.

Proposal

Implement HiSparse attention as a FlashInfer attention kernel on Blackwell:

  • Support the hierarchical-memory sparse-attention pattern as described in the HiSparse reference (LMSYS) and as used in DeepSeek V3.2 / GLM 5.1.
  • Cover both the prefill and decode paths.
  • Integrate with FlashInfer's existing attention launchers / planner APIs so downstream serving stacks can select HiSparse the same way they select other attention variants today.
  • Blackwell / SM100 as the primary target; other SMs as a follow-up.

Success criteria

  • Functional parity with the reference HiSparse implementation on representative shapes.
  • Usable end-to-end on DeepSeek V3.2 and GLM 5.1 model architectures (validated against expected logits / generations).
  • Performance advantage over dense attention at the context lengths these models are deployed at; reasonable parity with the SGLang HiSparse integration as a public baseline.
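
One way the functional-parity criterion could be checked is sketched below. This is a CPU-only illustration under stated assumptions, not the HiSparse algorithm and not a FlashInfer or SGLang API: every function name is hypothetical, and `select_top_k` is a stand-in selection rule (top-k keys by dot product with the query) used only so that both implementations attend to the same positions. A real test would call the new kernel and the reference implementation on GPU tensors instead.

```python
import math
import random


def dense_attention(q, keys, values):
    """Single-query softmax attention over every key (pure-Python reference)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


def sparse_attention(q, keys, values, selected):
    """Attention restricted to the KV positions listed in `selected`."""
    return dense_attention(
        q, [keys[i] for i in selected], [values[i] for i in selected]
    )


def select_top_k(q, keys, top_k):
    """Hypothetical selection rule (an assumption): top_k keys by dot product."""
    ranked = sorted(
        range(len(keys)),
        key=lambda i: -sum(a * b for a, b in zip(q, keys[i])),
    )
    return sorted(ranked[:top_k])


def make_case(seq_len, head_dim, seed):
    """Deterministic random query, keys, and values for one test shape."""
    rng = random.Random(seed)

    def vec():
        return [rng.uniform(-1.0, 1.0) for _ in range(head_dim)]

    return vec(), [vec() for _ in range(seq_len)], [vec() for _ in range(seq_len)]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def check_parity(candidate, reference, shapes, atol=1e-9):
    """Return the (seq_len, head_dim, top_k) shapes where outputs differ by more than atol."""
    failures = []
    for seq_len, head_dim, top_k in shapes:
        q, keys, values = make_case(seq_len, head_dim, seed=seq_len * 31 + head_dim)
        selected = select_top_k(q, keys, top_k)
        got = candidate(q, keys, values, selected)
        want = reference(q, keys, values, selected)
        if max_abs_diff(got, want) > atol:
            failures.append((seq_len, head_dim, top_k))
    return failures


def reversed_order(q, keys, values, selected):
    """Stand-in candidate: same math, but visits the selected positions in reverse."""
    return sparse_attention(q, keys, values, selected[::-1])


if __name__ == "__main__":
    shapes = [(64, 16, 8), (256, 32, 32)]
    print(check_parity(reversed_order, sparse_attention, shapes))
```

The harness reports failing shapes rather than stopping at the first mismatch, which makes it easier to see whether a discrepancy is tied to a particular sequence length, head dimension, or selection size. The tolerance here suits double-precision Python floats; a GPU kernel running in reduced precision would need a looser one.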
