
[rollout] feat: trainer-side FP8 weight quantization for colocated and disaggregated modes#5976

Open
yxs wants to merge 1 commit into verl-project:main from yxs:feat/fp8-trainer-side-quantize

Conversation


@yxs yxs commented Apr 12, 2026

What does this PR do?

Move FP8 blockwise weight quantization from the rollout GPU to the trainer GPU in the weight-sync path, controlled by a new config flag `trainer_quantize_fp8`. This halves transfer bandwidth in disaggregated mode (1 byte vs. 2 bytes per parameter) and reduces peak memory during bucketed transfer in colocated mode.

Supports both colocated (async quantization via quant_weights_by_name) and disaggregated (sync quantization via quant_weights_by_name_sync for NCCL/NIXL/HCCL checkpoint engines) modes.

Related: #5836 (Q2 roadmap → Weight refit optimization → "fp8 rollout: quantize weights on trainer side")

Checklist Before Starting

  • Search for similar PRs. https://github.com/verl-project/verl/pulls?q=fp8+trainer+quantize
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Environment: 8×H100, Qwen2.5-7B, real GSM8K (7473 samples), SGLang rollout, FSDP trainer

1. Disaggregated mode (NCCL checkpoint engine, 4 trainer + 4 rollout GPUs)

Trainer-side FP8 quantization + NCCL weight transfer verified:

  • Log: FP8 trainer-side quantization enabled (disaggregated)
  • Log: Skipping FP8 quantization, weights pre-quantized on trainer side
  • Log: Rank 0 send weights done, time cost: 0.42s
  • Exit code: 0
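For reference, the path under test is gated by the new flag. A hypothetical config fragment follows; only the flag name `trainer_quantize_fp8` comes from this PR, and the surrounding keys are illustrative, not verl's actual config layout:

```yaml
# Hypothetical placement -- surrounding keys are illustrative only.
actor_rollout_ref:
  rollout:
    trainer_quantize_fp8: true  # quantize on trainer GPU before weight sync
```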

2. Convergence comparison (127 global steps, disaggregated mode)

| Step | Trainer-side FP8 score | Rollout-side FP8 score |
|------|------------------------|------------------------|
| 3    | 0.15                   | 0.20                   |
| 15   | 0.26                   | 0.29                   |
| 31   | 0.35                   | 0.40                   |
| 47   | 0.46                   | 0.31                   |
| 63   | 0.49                   | 0.45                   |
| 79   | 0.59                   | 0.54                   |
| 95   | 0.65                   | 0.63                   |
| 111  | 0.73                   | 0.72                   |
| 127  | 0.73                   | 0.75                   |

Both runs converge to ~0.73-0.75; no convergence regression.

3. MoE model verification (Qwen3-30B-A3B, 128 experts)

  • FP8 weight sync succeeded (send weights done, time cost: 6.42s)
  • MoE router (mlp.gate.weight) correctly skipped by quantization (unit test: 25/25 PASS)
  • Full MoE training requires more GPUs (30B model OOMs on 4 trainer GPUs), but weight sync path fully verified

4. Selective quantization verification

| Parameter type                          | Action       |
|-----------------------------------------|--------------|
| q/k/v/o_proj                            | FP8 quantize |
| gate/up/down_proj (dense + MoE expert)  | FP8 quantize |
| MoE router (mlp.gate.weight)            | Skip         |
| embed_tokens                            | Skip         |
| layernorm / norm                        | Skip         |
| lm_head                                 | Skip         |
| bias                                    | Skip         |
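The table above can be mirrored by a simple name-based predicate. This is a hypothetical reconstruction for illustration, not the PR's actual filter; note the distinction between the MoE router (`mlp.gate.weight`, skipped) and the MLP `gate_proj` projections (quantized):

```python
# Hypothetical name-based filter mirroring the selective-quantization
# table; the real predicate in the PR may differ.
SKIP_SUBSTRINGS = ("embed_tokens", "norm", "lm_head", "bias")

def should_quantize_fp8(name: str) -> bool:
    """Return True if a parameter should be FP8-quantized for transfer."""
    # MoE router: `.mlp.gate.weight` is skipped, while `gate_proj` /
    # `up_proj` / `down_proj` (dense and expert MLPs) are quantized.
    if ".mlp.gate.weight" in name:
        return False
    return not any(s in name for s in SKIP_SUBSTRINGS)
```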

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@yxs yxs requested a review from chenhaiq as a code owner April 12, 2026 06:15

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements trainer-side FP8 quantization for model weights in engine_workers.py and fsdp_workers.py before they are transmitted to the rollout worker. The sglang_rollout logic is updated to recognize pre-quantized weights and skip redundant quantization steps. Feedback highlights issues with hardcoded quantization parameters and data types, as well as opportunities to optimize the code by reusing quantizer instances and avoiding redundant imports.

@yxs yxs force-pushed the feat/fp8-trainer-side-quantize branch 2 times, most recently from 2daee1f to b9e2ecb Compare April 13, 2026 00:36

wuxibin89 commented Apr 13, 2026

Supporting trainer-side FP8 quantization in disaggregated mode is higher priority than colocated mode. cc @sophiayyya #5972

@yxs yxs closed this Apr 14, 2026
@yxs yxs force-pushed the feat/fp8-trainer-side-quantize branch from b9e2ecb to 516657f Compare April 14, 2026 00:01
@yxs yxs reopened this Apr 14, 2026
@yxs yxs changed the title [rollout] feat: move FP8 weight quantization from rollout to trainer side [rollout] feat: trainer-side FP8 weight quantization for colocated and disaggregated modes Apr 14, 2026
@wuxibin89

A few comments:

# If quantization fails, use original weights
yield (k, v)

def quant_weights_by_name_sync(self, weights, dtype=torch.bfloat16):

We can use `ensure_async_iterator` in the checkpoint engines to `async for` over the weights, and eliminate this sync version.
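A sketch of what such a helper could look like, assuming it simply adapts sync iterators to the async-iteration protocol so checkpoint engines can `async for` over either kind of weight generator; verl's actual `ensure_async_iterator` may differ in signature and behavior:

```python
import asyncio
import inspect

async def ensure_async_iterator(it):
    """Yield items from either a sync or an async iterable.

    Hypothetical sketch: async iterables are forwarded as-is, while
    sync iterables are wrapped, yielding control to the event loop
    between items so a long weight stream doesn't starve other tasks.
    """
    if inspect.isasyncgen(it) or hasattr(it, "__aiter__"):
        async for item in it:
            yield item
    else:
        for item in it:
            yield item
            await asyncio.sleep(0)  # cooperative yield to the event loop
```

With this in place, a checkpoint engine could consume `quant_weights_by_name` directly (`async for name, w in ensure_async_iterator(weights)`) without a separate sync code path.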
