[rollout] feat: trainer-side FP8 weight quantization for colocated and disaggregated modes #5976
yxs wants to merge 1 commit into verl-project:main from
Conversation
Code Review
This pull request implements trainer-side FP8 quantization for model weights in engine_workers.py and fsdp_workers.py before they are transmitted to the rollout worker. The sglang_rollout logic is updated to recognize pre-quantized weights and skip redundant quantization steps. Feedback highlights issues with hardcoded quantization parameters and data types, as well as opportunities to optimize the code by reusing quantizer instances and avoiding redundant imports.
Support for trainer-side FP8 quantization in disaggregated mode is a higher priority than colocated mode. cc @sophiayyya #5972
A few comments:
```python
        # If quantization fails, use original weights
        yield (k, v)
```

```python
    def quant_weights_by_name_sync(self, weights, dtype=torch.bfloat16):
```
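The fallback pattern in the quoted hunk can be read as a standalone generator; `quantize_fn` below is a hypothetical stand-in for the real blockwise FP8 quantizer, used only to illustrate the yield-original-on-failure behavior:

```python
def quant_weights_by_name_sync(weights, quantize_fn):
    """Yield (name, tensor) pairs, quantizing where possible.

    `weights` is an iterable of (name, tensor) pairs; `quantize_fn` is a
    hypothetical stand-in for the real blockwise FP8 quantizer.
    """
    for k, v in weights:
        try:
            yield (k, quantize_fn(k, v))
        except Exception:
            # If quantization fails, fall back to the original weights
            yield (k, v)


# Illustrative quantizer: refuses router gate weights, tags everything else.
def quantize_fn(name, value):
    if name.endswith("gate.weight"):
        raise ValueError("router weights are not quantized")
    return ("fp8", value)


out = dict(quant_weights_by_name_sync(
    [("mlp.up.weight", 1.0), ("mlp.gate.weight", 2.0)], quantize_fn))
# "mlp.up.weight" comes back tagged; "mlp.gate.weight" passes through unchanged.
```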
We can use `ensure_async_iterator` in the checkpoint engines to iterate weights with `async for`, and eliminate this sync version.
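The suggestion above can be sketched as follows. This `ensure_async_iterator` is a minimal hypothetical implementation (verl's actual helper may differ, e.g. by offloading sync iteration to a thread); it lets callers uniformly `async for` over either kind of weight iterator:

```python
import asyncio
import inspect


async def ensure_async_iterator(it):
    """Hypothetical sketch: pass async iterators through, wrap sync ones.

    A production version would likely run the sync branch in a thread so a
    slow weight iterator does not block the event loop.
    """
    if inspect.isasyncgen(it) or hasattr(it, "__aiter__"):
        async for item in it:
            yield item
    else:
        for item in it:
            yield item


async def main():
    out = []
    # A plain sync iterator of (name, tensor) pairs is consumed via async for.
    async for k, v in ensure_async_iterator(iter([("a", 1), ("b", 2)])):
        out.append((k, v))
    return out


result = asyncio.run(main())
```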
What does this PR do?
Move FP8 blockwise weight quantization from the rollout GPU to the trainer GPU in the weight-sync path, controlled by a new config flag `trainer_quantize_fp8`. This halves transfer bandwidth in disaggregated mode (1 byte vs 2 bytes per param) and reduces peak memory during bucketed transfer in colocated mode.

Supports both colocated (async quantization via `quant_weights_by_name`) and disaggregated (sync quantization via `quant_weights_by_name_sync` for NCCL/NIXL/HCCL checkpoint engines) modes.

Related: #5836 (Q2 roadmap → Weight refit optimization → "fp8 rollout: quantize weights on trainer side")
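For context, blockwise FP8 quantization of this kind can be simulated in numpy (a sketch only: the real path casts to a torch float8 e4m3 dtype, and the 128-element block size and per-block `amax / 448` scaling here are assumptions based on common blockwise-FP8 recipes, not verl's exact implementation). The bandwidth claim follows from the payload: 1 byte per param plus one 4-byte fp32 scale per 128-element block (~1.03 bytes/param), versus 2 bytes/param for bf16.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3
BLOCK = 128           # assumed block size for per-block scaling


def quantize_blockwise(w: np.ndarray):
    """Simulated blockwise FP8 quantization (numpy stand-in).

    Real code would cast `blocks / scales` to a float8 dtype; here we only
    compute the per-block scales and scaled values, which is the part that
    determines transfer size: 1 byte/param plus one scale per block.
    """
    flat = w.reshape(-1)
    pad = (-flat.size) % BLOCK
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)  # avoid div-by-zero blocks
    q = np.clip(blocks / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scales


w = np.linspace(-1.0, 1.0, 256, dtype=np.float32)
q, s = quantize_blockwise(w)
# Dequantize to sanity-check the scaling round trip (exact here, since we
# never actually narrow to 8 bits in this simulation).
deq = (q * s).reshape(-1)[: w.size]
```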
Checklist Before Starting

- Title format: `[{modules}] {type}: {description}` (this will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`, like `[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API, prepend `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
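The title convention above can be checked mechanically. The validator below is a hypothetical illustration of the rule, not the actual CI check:

```python
import re

# Module and type vocabularies copied from the PR template above.
MODULES = {"fsdp", "megatron", "veomni", "sglang", "vllm", "rollout", "trainer",
           "ci", "training_utils", "recipe", "hardware", "deployment", "ray",
           "worker", "single_controller", "misc", "perf", "model", "algo",
           "env", "tool", "ckpt", "doc", "data", "cfg", "reward",
           "fully_async", "one_step_off"}
TYPES = {"feat", "fix", "refactor", "chore", "test"}

# Optional [BREAKING] prefix, then [modules], then "type: description".
PATTERN = re.compile(r"^(\[BREAKING\])?\[([^\]]+)\] (\w+): .+")


def title_ok(title: str) -> bool:
    m = PATTERN.match(title)
    if not m:
        return False
    mods = [x.strip() for x in m.group(2).split(",")]
    return all(x in MODULES for x in mods) and m.group(3) in TYPES


ok = title_ok("[rollout] feat: trainer-side FP8 weight quantization")
breaking = title_ok("[BREAKING][fsdp, megatron] feat: dynamic batching")
bad = title_ok("feat: missing module tag")
```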
Environment: 8×H100, Qwen2.5-7B, real GSM8K (7473 samples), SGLang rollout, FSDP trainer
1. Disaggregated mode (NCCL checkpoint engine, 4 trainer + 4 rollout GPUs)

Trainer-side FP8 quantization + NCCL weight transfer verified via logs:

- `FP8 trainer-side quantization enabled (disaggregated)`
- `Skipping FP8 quantization, weights pre-quantized on trainer side`
- `Rank 0 send weights done, time cost: 0.42s`

2. Convergence comparison (127 global steps, disaggregated mode)
Both converge to ~0.73-0.75. No convergence regression.
3. MoE model verification (Qwen3-30B-A3B, 128 experts)

- Weight transfer completes (`send weights done, time cost: 6.42s`)
- Router weights (`mlp.gate.weight`) correctly skipped by quantization (unit test: 25/25 PASS)

4. Selective quantization verification
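The selective-quantization behavior verified above (MoE router weights such as `mlp.gate.weight` kept in high precision) amounts to a name-based filter on the weight stream. This is an illustrative sketch; verl's actual skip rules may cover more cases:

```python
# Assumed skip list for illustration: router gates stay in high precision
# because quantizing them degrades expert selection.
SKIP_SUFFIXES = ("mlp.gate.weight",)


def should_quantize(name: str) -> bool:
    """Quantize only weight tensors whose name is not on the skip list."""
    return name.endswith(".weight") and not name.endswith(SKIP_SUFFIXES)


expert_w = should_quantize("model.layers.0.mlp.experts.3.up_proj.weight")
router_w = should_quantize("model.layers.0.mlp.gate.weight")
# Expert projection weights are quantized; the router gate is passed through.
```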
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
`pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`