[Feat] Modular BF16 MoE expert kernel — parity with FP8 / NVFP4 modular path #3110

@mferrato

Description

@mferrato

Related: existing umbrella #3107 (multi-tenant LoRA for MoE under EP) — this issue is a prerequisite for the BF16 training/LoRA path of that work, but is independently useful for any BF16 MoE deployment that uses the FlashInfer NVLink all-to-all backend.

Motivation

The modular MoE path in vLLM (dispatch / compute split) works correctly with the FlashInfer NVLink all-to-all backend for the TRT-LLM FP8 and NVFP4 expert kernels — a modular version exists for those dtypes.

The BF16 TRT-LLM expert kernel is still monolithic: dispatch and compute are entangled in the same call, so it cannot be plugged into the modular MoE flow. When a user tries to combine the BF16 expert kernel with FlashInfer NVLink all-to-all dispatch today, outputs are silently incorrect (garbage tensors, no error). Users are forced to fall back to the Triton BF16 expert kernel, which is significantly slower and blocks BF16 MoE training paths (e.g. BF16 LoRA) on Expert Parallelism.

Proposal

Provide a modular BF16 expert kernel with the same split already used for FP8 and NVFP4:

  • Expose expert compute as a standalone callable that takes already-dispatched tokens (post FlashInfer NVLink all-to-all) and returns per-expert outputs ready for combine.
  • Keep API / tensor layouts / metadata shapes consistent with the FP8 and NVFP4 modular paths, so downstream callers (vLLM, custom stacks) can swap dtype without restructuring the pipeline.
  • Numerics must match the monolithic BF16 kernel within acceptable tolerance on dense (non-dispatched) workloads.
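To make the proposed split concrete, here is a minimal sketch of what the modular interface could look like. All names (`DispatchedTokens`, `ModularExperts`, `BF16Experts`) are hypothetical illustrations of the dispatch/compute separation, not vLLM's actual API:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class DispatchedTokens:
    """Tokens already routed by the all-to-all dispatch step, grouped per expert.
    Hypothetical container; real metadata shapes would match the FP8/NVFP4 paths."""
    tokens_per_expert: list[list[float]]
    dtype: str  # "bf16", "fp8", or "nvfp4"

class ModularExperts(Protocol):
    """Dtype-agnostic expert-compute interface: callers can swap dtype
    without restructuring the dispatch -> compute -> combine pipeline."""
    def apply(self, dispatched: DispatchedTokens) -> list[list[float]]: ...

class BF16Experts:
    """Standalone BF16 expert compute: consumes already-dispatched tokens
    and returns per-expert outputs ready for the combine step."""
    def __init__(self, scale: float = 2.0):
        self.scale = scale  # stand-in for the real expert MLP weights

    def apply(self, dispatched: DispatchedTokens) -> list[list[float]]:
        assert dispatched.dtype == "bf16", "kernel only accepts BF16 inputs"
        return [[self.scale * t for t in toks]
                for toks in dispatched.tokens_per_expert]
```

The key point is that `apply` never performs dispatch itself, so any dispatch backend (FlashInfer NVLink all-to-all included) can feed it.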

Success criteria

  • A BF16 expert kernel usable in the same modular MoE flow as FP8 / NVFP4.
  • Correct outputs end-to-end when combined with FlashInfer NVLink all-to-all dispatch on Blackwell (GB200).
  • Parity with FP8 / NVFP4 modular path on the API surface and EP semantics.

Scope

  • HW: Blackwell / GB200 priority.
  • Models: any BF16 MoE in vLLM; representative target is a large MoE in the GLM / DeepSeek family.
  • Out of scope for v1: FP16, non-NVLink dispatch backends (nice-to-have but not required).
