[https://nvbugs/5518713][fix] Trtllm-gen moe backend for blockwise fp8 ckpt (Qwen3-235B-A22B-FP8) #7856
Conversation
Force-pushed d62b3c5 → 76f901b (Compare)
📝 Walkthrough

C++ FP8 MoE changes: accept routing logits as float or bfloat16, enforce 0 < top_k ≤ 8 for Renormalize/RenormalizeNaive, and change the routing_logits pointer to a generic void*. Python custom ops: make n_group, topk_group, and routed_scaling_factor Optional in the FP4/FP8 runner constructors and public entrypoints. Tests: add an FP8-block-scales integration test for Qwen3-235B.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Py as Python op (fp8_block_scale_moe_runner)
    participant Runner as FP8BlockScaleMoERunner
    participant Cpp as C++ fp8BlockScaleMoe
    participant Kernel as MoE Kernel
    Py->>Runner: __init__(..., n_group?, topk_group?, routed_scaling_factor?, routing_method_type)
    Note right of Runner: Optional params accepted (may be None)
    Py->>Cpp: invoke with tensors (routing_logits, bias, weights, scales, ...)
    Cpp->>Cpp: Validate routing_logits dtype ∈ {float, bfloat16}
    Cpp->>Cpp: If routing_method ∈ {Renormalize, RenormalizeNaive}<br/>require 0 < top_k ≤ 8
    Cpp->>Kernel: run(args with generic routing_logits pointer (void*))
    Kernel-->>Cpp: outputs
    Cpp-->>Py: outputs
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Signed-off-by: Jhao-Ting Chen <[email protected]>
…TLLM Signed-off-by: Jhao-Ting Chen <[email protected]>
Force-pushed 76f901b → 5acb38a (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
589-608: Silence fake-path unused parameters (ruff ARG001) per static analysis

Same change as above; underscores are harmless and keep the schema intact.
🧹 Nitpick comments (5)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
2862-2893: Add SM120/121 skip guard for TRTLLM MOE to match NVFP4 tests

Parity with the existing NVFP4 tests avoids CI flakes on unsupported SMs.
Apply this diff inside test_fp8_block_scales immediately after the function signature:
```diff
 def test_fp8_block_scales(self, tp_size, pp_size, ep_size, attention_dp,
                           cuda_graph, overlap_scheduler, moe_backend):
+    if moe_backend == "TRTLLM" and (get_sm_version() == 120
+                                    or get_sm_version() == 121):
+        pytest.skip(
+            "MOE TRTLLM backend does not support SM version 120 or 121")
     pytorch_config = dict(
```

cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (2)
41-45: Fix typo in arch check error text

Use “SM100 family” to match isSM100Family().

```diff
- TORCH_CHECK(tensorrt_llm::common::isSM100Family(), "Only SM100f is supported by FP8 block scale MOE");
+ TORCH_CHECK(tensorrt_llm::common::isSM100Family(), "Only SM100 family is supported by FP8 block scale MOE");
```
74-76: Top-k constraint now enforced for Renormalize/Naive

Good to gate unsupported configs; please mirror this constraint in Python-side validation or docs to surface it earlier to users.
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (2)
545-552: Expose Optionals in the FP8 op schema: good; align fake path to silence ARG001

Update the fake impl's parameter names to underscore unused args to satisfy ruff without changing the schema.
```diff
 @fp8_block_scale_moe_runner.register_fake
 def _(
-    routing_logits: torch.Tensor,
-    routing_bias: torch.Tensor,
+    _routing_logits: torch.Tensor,
+    _routing_bias: torch.Tensor,
     hidden_states: torch.Tensor,
-    hidden_states_scale: torch.Tensor,
-    gemm1_weights: torch.Tensor,
-    gemm1_weights_scale: torch.Tensor,
-    gemm2_weights: torch.Tensor,
-    gemm2_weights_scale: torch.Tensor,
-    num_experts: int,
-    top_k: int,
-    n_group: Optional[int],
-    topk_group: Optional[int],
-    intermediate_size: int,
-    local_expert_offset: int,
-    local_num_experts: int,
-    routed_scaling_factor: Optional[float],
-    routing_method_type: int,
+    _hidden_states_scale: torch.Tensor,
+    _gemm1_weights: torch.Tensor,
+    _gemm1_weights_scale: torch.Tensor,
+    _gemm2_weights: torch.Tensor,
+    _gemm2_weights_scale: torch.Tensor,
+    _num_experts: int,
+    _top_k: int,
+    _n_group: Optional[int],
+    _topk_group: Optional[int],
+    _intermediate_size: int,
+    _local_expert_offset: int,
+    _local_num_experts: int,
+    _routed_scaling_factor: Optional[float],
+    _routing_method_type: int,
 ) -> torch.Tensor:
```
533-553: Consider making routing_bias Optional in FP8 inputs to match the C++ signature

C++ accepts an optional bias; the Python FP8 dataclass/op currently requires a Tensor. Making it Optional keeps the APIs consistent and avoids surprises if the bias is absent.

```diff
-    routing_bias: torch.Tensor
+    routing_bias: Optional[torch.Tensor]
```

And in the op schema:

```diff
-    routing_bias: torch.Tensor,
+    routing_bias: Optional[torch.Tensor],
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)

- cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (4 hunks)
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (3 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunk)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
📚 Learning: 2025-08-08T22:03:40.707Z — same learning as above (sklevtsov-nvidia, PR #3294, moe_kernels.cu).
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
🧬 Code graph analysis (3)
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (3)

- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1): routing_logits (300-309)
- cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (2): routing_logits (393-402), routing_logits (450-462)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1): top_k (221-221)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)

- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (5): n_group (222-222), topk_group (224-224), intermediate_size (226-226), local_expert_offset (227-227), local_num_experts (228-228)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)

- tensorrt_llm/llmapi/llm_args.py (3): CudaGraphConfig (108-165), MoeConfig (168-196), KvCacheConfig (972-1106)
- tensorrt_llm/llmapi/llm.py (1): LLM (1022-1038)
- tests/integration/defs/conftest.py (1): llm_models_root (77-82)
- tests/integration/defs/accuracy/accuracy_core.py (4): MMLU (276-290), evaluate (147-206), evaluate (712-722), GSM8K (293-308)
🪛 Ruff (0.13.0)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
601: Unused function argument: n_group (ARG001)
602: Unused function argument: topk_group (ARG001)
603: Unused function argument: intermediate_size (ARG001)
604: Unused function argument: local_expert_offset (ARG001)
605: Unused function argument: local_num_experts (ARG001)
606: Unused function argument: routed_scaling_factor (ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
379-386: LGTM: Optional group and scaling args added to FP8 runner

API matches the fp4/mx paths and unblocks None defaults.
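For illustration, a sketch of how the None-capable arguments could now be supplied; only the arguments this PR makes Optional are shown, and the tensor arguments (weights, scales, hidden states) are elided, not optional:

```python
# Sketch only: these are the arguments the PR makes Optional; passing None
# disables the corresponding routing feature.
optional_kwargs = dict(
    n_group=None,                # no expert grouping
    topk_group=None,             # no per-group top-k
    routed_scaling_factor=None,  # no post-routing scaling
)
```

With grouped routing enabled (e.g. DeepSeek-style experts), all three would instead carry concrete values.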
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (1)
95: routing_logits pointer type matches the API
MoERunnerArgs.routing_logits is declared as void* in runner.h (line 193) and args.routing_logits = routing_logits.data_ptr() correctly supplies a void*; no header changes needed. Downstream kernels should continue casting this void* to the appropriate element type.
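From the Python side, the practical effect of the void* change (shapes hypothetical) is simply that either logits dtype is now a valid input:

```python
import torch

num_tokens, num_experts = 16, 128  # hypothetical shapes
logits_fp32 = torch.randn(num_tokens, num_experts, dtype=torch.float32)
logits_bf16 = logits_fp32.to(torch.bfloat16)
# Either tensor may be passed as routing_logits; the C++ op checks that the
# dtype is float32 or bfloat16 and interprets the void* accordingly.
```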
/bot run --disable-fail-fast

PR_Github #19382 [ run ] triggered by Bot

PR_Github #19382 [ run ] completed with state
Summary by CodeRabbit
New Features
Improvements
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.

Run

/bot [-h|--help]

to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
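For example, a typical invocation that restricts the run to one stage and disables fail-fast (stage name illustrative; both flags are documented below):

```
/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast
```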
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages, and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.