[feat] Expand mxfp8 quantization to support fine-grained layer precision config #614
zianglih wants to merge 15 commits into radixark:main
Conversation
Summary of Changes

Hello @zianglih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the model quantization pipeline by introducing partial MXFP8 quantization. Specifically, it allows users to designate a certain number of the last decoder layers to retain their original BF16 precision, while the rest of the model is converted to MXFP8. This feature provides finer-grained control over model precision, potentially balancing performance and accuracy by preserving higher precision in critical layers.

Highlights
Changelog
Activity
Code Review
This pull request introduces a feature to skip quantization for the last N layers of a model, keeping them in BF16 format. The changes span across model conversion scripts and training execution scripts to support this new functionality, primarily for MXFP8 quantization. My review identifies a logical contradiction in the feature's activation condition, a magic number that hurts maintainability, and some redundant and non-robust file handling. Addressing these points will improve the code's correctness and quality.
scripts/run_qwen3_30b_a3b.py
Outdated
```python
if (args.train_fp8 or args.train_mxfp8) and args.num_layers_at_end_in_bf16 > 0:
    misc_args += (
        "--first-last-layers-bf16 "
        "--num-layers-at-start-in-bf16 0 "
        f"--num-layers-at-end-in-bf16 {args.num_layers_at_end_in_bf16} "
    )
```
There's a logical contradiction regarding when this feature is enabled. The __post_init__ check on line 40 asserts that num_layers_at_end_in_bf16 is only supported when rollout_mxfp8 is enabled. However, this block enables the feature for both train_fp8 and train_mxfp8. If train_fp8 is used (which usually implies rollout_fp8 and not rollout_mxfp8), the assertion on line 40 will fail, making the feature unusable with train_fp8.
Given that other changes in this PR are specific to mxfp8, it seems this feature is intended only for mxfp8. If so, the condition should be narrowed to resolve the contradiction.
Suggested change:

```diff
-if (args.train_fp8 or args.train_mxfp8) and args.num_layers_at_end_in_bf16 > 0:
+if args.train_mxfp8 and args.num_layers_at_end_in_bf16 > 0:
     misc_args += (
         "--first-last-layers-bf16 "
         "--num-layers-at-start-in-bf16 0 "
         f"--num-layers-at-end-in-bf16 {args.num_layers_at_end_in_bf16} "
     )
```
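As a self-contained illustration of the narrowed condition together with the `__post_init__` assertion the reviewer describes, here is a minimal sketch. The `Args` dataclass and `build_misc_args` helper are hypothetical stand-ins for the script's actual argument handling, not its real API:

```python
# Hypothetical sketch: the BF16-tail feature is asserted to require
# rollout_mxfp8, so the launcher must only emit the flags for mxfp8 runs.
from dataclasses import dataclass


@dataclass
class Args:
    train_fp8: bool = False
    train_mxfp8: bool = False
    rollout_mxfp8: bool = False
    num_layers_at_end_in_bf16: int = 0

    def __post_init__(self):
        # Mirrors the check the review refers to: the feature is only
        # supported together with rollout_mxfp8.
        if self.num_layers_at_end_in_bf16 > 0:
            assert self.rollout_mxfp8, (
                "--num-layers-at-end-in-bf16 is only supported with rollout_mxfp8"
            )


def build_misc_args(args: Args) -> str:
    misc_args = ""
    # Narrowed condition from the suggestion above: mxfp8 only, so a
    # train_fp8 run can never reach flags its assertion would reject.
    if args.train_mxfp8 and args.num_layers_at_end_in_bf16 > 0:
        misc_args += (
            "--first-last-layers-bf16 "
            "--num-layers-at-start-in-bf16 0 "
            f"--num-layers-at-end-in-bf16 {args.num_layers_at_end_in_bf16} "
        )
    return misc_args
```

With the narrowed condition, constructing a `train_fp8`-only config with a nonzero BF16 tail fails fast in `__post_init__` instead of producing flags that contradict the assertion later.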
tools/convert_hf_to_mxfp8.py
Outdated
```python
num_maybe_mtp_layers = 1
dynamic_skip_layer_prefixes: set[str] = {
    f"model.layers.{i}." for i in range(tail_start_idx, num_hidden_layers + num_maybe_mtp_layers)
}
```
The use of the magic number 1 for num_maybe_mtp_layers makes the code less readable and harder to maintain. It's not immediately clear why this value is 1 and if it's model-specific.
To improve clarity and maintainability, please define this as a named constant with a comment explaining its purpose. For example:
```python
# Number of MTP (Multi-Token Prediction) layers to account for, which might
# not be included in `num_hidden_layers`. This can be model-specific.
NUM_MAYBE_MTP_LAYERS = 1
# ...
# ... range(tail_start_idx, num_hidden_layers + NUM_MAYBE_MTP_LAYERS)
```

Force-pushed d5236a8 to
56515e8
Force-pushed ae984d3 to f5790bb
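Tying the conversion snippet discussed above into a runnable form, here is a hedged end-to-end sketch of how skip prefixes could be built and consulted during quantization. `build_skip_prefixes` and `should_keep_bf16` are illustrative names, not the PR's actual API:

```python
# Number of potential MTP (Multi-Token Prediction) layers appended after the
# regular decoder stack; model-specific, as noted in the review.
NUM_MAYBE_MTP_LAYERS = 1


def build_skip_prefixes(num_hidden_layers: int, num_layers_at_end_in_bf16: int) -> set[str]:
    """Collect name prefixes for the tail layers that should stay in BF16."""
    tail_start_idx = num_hidden_layers - num_layers_at_end_in_bf16
    return {
        f"model.layers.{i}."
        for i in range(tail_start_idx, num_hidden_layers + NUM_MAYBE_MTP_LAYERS)
    }


def should_keep_bf16(param_name: str, skip_prefixes: set[str]) -> bool:
    # A weight is skipped (kept in BF16) if its fully qualified name falls
    # under any of the tail-layer prefixes; everything else gets MXFP8.
    return any(param_name.startswith(p) for p in skip_prefixes)
```

For a 4-layer model with a 2-layer BF16 tail, the prefixes cover layers 2 and 3 plus one possible MTP layer at index 4, so `model.layers.3.mlp.down_proj.weight` would be skipped while `model.layers.0.self_attn.q_proj.weight` would be quantized.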
This reverts commit f3965d8.
@HumansAnd
Expand mxfp8 quantization utils to support fine-grained layer precision config:
- `--num-layers-at-start-in-bf16`, `--num-layers-at-end-in-bf16`
- `--extra-high-precision-layers-hf`, `--extra-high-precision-layers-megatron`, used by weight conversion
- `--te-precision-config-file` in `tools/convert_hf_to_mxfp8.py`
- Use FlashInfer mxfp8 quantizer for faster weight sync, falling back to Triton if unavailable

This cannot be merged until the SGLang v0.6.10 bump with the following PRs:

- flashinfer_trtllm_routedmoe backend sgl-project/sglang#20214
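The FlashInfer-with-Triton-fallback behavior mentioned in the description can be sketched as a simple import-guarded backend selection. The function name is a hypothetical placeholder; the real weight-sync code presumably dispatches to the respective quantization kernels directly:

```python
# Hedged sketch of the "prefer FlashInfer, fall back to Triton" pattern:
# probe for the optional dependency once and choose the backend accordingly.
def get_mxfp8_quantizer_backend() -> str:
    try:
        import flashinfer  # noqa: F401  # fast path if the package is installed
        return "flashinfer"
    except ImportError:
        # Fallback: a Triton-based quantizer, slower but always available
        # in the training environment.
        return "triton"
```

Probing at selection time (rather than at each quantization call) keeps the hot weight-sync path free of repeated import attempts.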