[Feat] Modular BF16 MoE expert kernel — parity with FP8 / NVFP4 modular path #3110

@mferrato

Description

@mferrato

Related: existing umbrella #3107 (multi-tenant LoRA for MoE under EP) — this issue is a prerequisite for the BF16 training/LoRA path of that work, but is independently useful for any BF16 MoE deployment that uses the FlashInfer NVLink all-to-all backend.

Motivation

The modular MoE path in vLLM (dispatch / compute split) works correctly with the FlashInfer NVLink all-to-all backend for the TRT-LLM FP8 and NVFP4 expert kernels — a modular version exists for those dtypes.

The BF16 TRT-LLM expert kernel is still monolithic: dispatch and compute are entangled in the same call, so it cannot be plugged into the modular MoE flow. When a user tries to combine the BF16 expert kernel with FlashInfer NVLink all-to-all dispatch today, outputs are silently incorrect (garbage tensors, no error). Users are forced to fall back to the Triton BF16 expert kernel, which is significantly slower and blocks BF16 MoE training paths (e.g. BF16 LoRA) on Expert Parallelism.

Proposal

Provide a modular BF16 expert kernel with the same split already used for FP8 and NVFP4:

  • Expose expert compute as a standalone callable that takes already-dispatched tokens (post FlashInfer NVLink all-to-all) and returns per-expert outputs ready for combine.
  • Keep API / tensor layouts / metadata shapes consistent with the FP8 and NVFP4 modular paths, so downstream callers (vLLM, custom stacks) can swap dtype without restructuring the pipeline.
  • Numerics must match the monolithic BF16 kernel within acceptable tolerance on dense (non-dispatched) workloads.
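To make the proposed split concrete, here is a minimal sketch of what the modular interface could look like. All names (`DispatchedTokens`, `ModularExperts`, `BF16Experts`) are hypothetical illustrations of the dispatch/compute separation, not vLLM's actual API:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class DispatchedTokens:
    """Tokens already routed by the all-to-all dispatch step, grouped per expert.
    Hypothetical container; real metadata shapes would match the FP8/NVFP4 paths."""
    tokens_per_expert: list[list[float]]
    dtype: str  # "bf16", "fp8", or "nvfp4"

class ModularExperts(Protocol):
    """Dtype-agnostic expert-compute interface: callers can swap dtype
    without restructuring the dispatch -> compute -> combine pipeline."""
    def apply(self, dispatched: DispatchedTokens) -> list[list[float]]: ...

class BF16Experts:
    """Standalone BF16 expert compute: consumes already-dispatched tokens
    and returns per-expert outputs ready for the combine step."""
    def __init__(self, scale: float = 2.0):
        self.scale = scale  # stand-in for the real expert MLP weights

    def apply(self, dispatched: DispatchedTokens) -> list[list[float]]:
        assert dispatched.dtype == "bf16", "kernel only accepts BF16 inputs"
        return [[self.scale * t for t in toks]
                for toks in dispatched.tokens_per_expert]
```

The key point is that `apply` never performs dispatch itself, so any dispatch backend (FlashInfer NVLink all-to-all included) can feed it.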

Success criteria

  • A BF16 expert kernel usable in the same modular MoE flow as FP8 / NVFP4.
  • Correct outputs end-to-end when combined with FlashInfer NVLink all-to-all dispatch on Blackwell (GB200).
  • Parity with FP8 / NVFP4 modular path on the API surface and EP semantics.

Scope

  • HW: Blackwell / GB200 priority.
  • Models: any BF16 MoE in vLLM; representative target is a large MoE in the GLM / DeepSeek family.
  • Out of scope for v1: FP16, non-NVLink dispatch backends (nice-to-have but not required).
