[5/n][trainer] feat: flowgrpo trainer #5951
zhtmike wants to merge 15 commits into verl-project:main from
Conversation
Code Review
This pull request introduces the FlowGRPO trainer for diffusion models, enabling reinforcement learning for image generation tasks such as OCR. Key additions include the RayFlowGRPOTrainer, diffusion-specific advantage estimation logic, and updated dataset handling for multimodal inputs and negative prompts. The PR also refines the vLLM-Omni integration and provides comprehensive end-to-end testing scripts. A logic error was identified in the advantage calculation for single-sample groups, where the advantage should be zeroed out to prevent incorrect gradient updates.
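The single-sample-group issue flagged above can be illustrated with a minimal sketch of group-normalized GRPO advantages. This is not the verl implementation; the function name and signature are hypothetical, chosen only to show why a singleton group must yield a zero advantage (it has no in-group baseline, so any nonzero value would be an uninformative gradient signal):

```python
from collections import defaultdict
from statistics import mean, stdev

def grpo_advantages(rewards, group_ids, eps=1e-6):
    """Group-normalized advantages (illustrative sketch, not the verl API).

    Samples sharing a group_id are normalized against their group's
    mean/std. A single-sample group is left at advantage 0: with no
    siblings there is no baseline to normalize against.
    """
    groups = defaultdict(list)
    for i, g in enumerate(group_ids):
        groups[g].append(i)
    adv = [0.0] * len(rewards)
    for idxs in groups.values():
        if len(idxs) < 2:
            continue  # zero out single-sample groups
        rs = [rewards[i] for i in idxs]
        mu, sd = mean(rs), stdev(rs)
        for i in idxs:
            adv[i] = (rewards[i] - mu) / (sd + eps)
    return adv
```

For example, with rewards `[1.0, 2.0, 3.0]` in one group and `[5.0]` alone in another, the first group normalizes to roughly `[-1, 0, 1]` while the singleton stays at exactly `0.0`.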
btw the current diffusion YAML is too clumsy and contains too many unnecessary LLM configs; will refactor this after this one.
# extra configs for algorithm specific features.
# Model-specific diffusion sampling params (e.g. true_cfg_scale, guidance_scale,
# max_sequence_length, noise_level)
extra_configs: {}
should we set noise_level: 0.0 here for easier modification?
Forward-based RL algorithms like DiffusionNFT do not have this parameter, so it is not a universal parameter for the rollout config.
# Model-specific diffusion sampling params (e.g. true_cfg_scale, guidance_scale,
# max_sequence_length, noise_level)
extra_configs: {}
btw, it's better to document the whole list of arguments that can be configured via extra_configs, either in comments or in the official docs.
seed: 42

# extra configs for algorithm specific features during validation.
extra_configs: {}
Same here. Please state which arguments can be set in the inference engine's extra_configs.
Agree. extra_configs is a temporary design and will be dropped.
I prefer not to put algorithm-specific configs (e.g., noise level, SDE type) or model-specific configs (e.g., maximum model length, true CFG scale) directly into the general rollout config, which would bloat it. So extra_configs is a temporary place for these args; they are fed directly into the rollout pipeline.
I am working on the config refactoring to make these things clearer, in the coming PR.
Co-authored-by: Samit <[email protected]>
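To make the pass-through design discussed above concrete, here is a hypothetical YAML sketch of how extra_configs might be populated. The key names are the examples already mentioned in the config comments (true_cfg_scale, guidance_scale, max_sequence_length, noise_level); the values are illustrative only and not recommendations:

```yaml
# Hypothetical sketch; exact keys depend on the model pipeline and algorithm.
rollout:
  extra_configs:
    true_cfg_scale: 4.0        # model-specific CFG setting
    guidance_scale: 1.0        # model-specific guidance setting
    max_sequence_length: 512   # model-specific length cap
    noise_level: 0.7           # algorithm-specific (e.g. SDE-based rollouts)
```

Everything under extra_configs is forwarded verbatim into the rollout pipeline, which is what keeps these keys out of the general rollout schema.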
/gemini review
Code Review
This pull request introduces the FlowGRPO trainer for diffusion models, targeting image generation with models such as Qwen-Image. Key additions include a new Ray-based trainer (RayFlowGRPOTrainer), specialized diffusion advantage estimators, and metric utilities for image generation. The PR also refactors the RL dataset to support negative prompts and multi-modal data handling, updates the FSDP engine for diffusers, and adds comprehensive e2e tests and OCR preprocessing examples. Review feedback highlighted a critical issue with method forwarding in the FSDP wrapper's disable_adapter context manager, and a logic error in the GRPO advantage calculation regarding standard deviation computation over expanded timesteps.
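The timestep issue mentioned in the review summary is worth making concrete. In flow-based RL, one reward per generated image is typically broadcast across all denoising timesteps; a minimal sketch (hypothetical helper, not the verl code) of why the group std must be computed before that expansion:

```python
from statistics import stdev

def broadcast_advantages(adv_per_sample, num_timesteps):
    """Repeat each per-sample advantage across denoising timesteps (sketch)."""
    return [[a] * num_timesteps for a in adv_per_sample]

# Why normalize BEFORE expanding: repeating each reward T times changes the
# sample standard deviation, so dividing by a post-expansion std would
# mis-scale the advantages.
rewards = [0.0, 1.0]
T = 4
expanded = [r for r in rewards for _ in range(T)]
# stdev(rewards)  ≈ 0.707  (the correct per-group denominator)
# stdev(expanded) ≈ 0.535  (shrunk by the T-fold repetition)
```

The safe order is therefore: normalize rewards per group, then broadcast the resulting per-sample advantage over the timestep dimension.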
What does this PR do?
The last piece of the puzzle to make the FlowGRPO trainer runnable :) Following #5297, co-worked with @AndyZhou952
Added:
Documentation will be provided in the next PR.
Scripts for full-model RL / RL with a non-colocated reward model will be provided in the next PR.
Checklist Before Starting
- Title follows [{modules}] {type}: {description} (this will be checked by the CI)
- {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off, like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- Prepend [BREAKING] to the beginning of the title for breaking changes, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching

Test
Formal installation guide will be in the next PR.
For now, you need to:
- install vllm==0.18 and vllm-omni==0.18
- run examples/flowgrpo_trainer/data_process/qwenimage_ocr.py
- run examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh

to have a try.

This is the wandb result of running examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh with trainer.val_before_train=True

Val Score:

Critic:

Actor:

Performance

Visualization

API and Usage Example
# Add code snippet or script demonstrating how to use this

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- Request CI via the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If your PR changes the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.