[fully_async] feat: standalone log prob server (Model Engine Server) support#5990
sl-1314 wants to merge 41 commits into verl-project:main
Conversation
This reverts commit 5b8d9b7.
Code Review
This pull request introduces a standalone ModelEngineServer to compute "old" log probabilities for fully asynchronous training. Key changes include the implementation of ModelEngineReplica and ModelEngineWorker, updates to the agent loop to handle `response_oldlogprobs`, and enhancements to Megatron utilities for weight synchronization via async generators. A critical bug was identified in `tool_agent_loop.py`, where old log probabilities were appended to the wrong data list, which would lead to incorrect training data.
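To make the flagged bug class concrete, here is a minimal hedged sketch: per-token old log probs must be appended to the list that stays index-aligned with the response token ids; appending them to any other list silently misaligns the training batch. All names here (`response_ids`, `response_oldlogprobs`, `append_turn`) are illustrative, not the PR's actual code.

```python
# Illustrative sketch only — not verl's actual tool_agent_loop.py code.
response_ids: list[int] = []
response_oldlogprobs: list[float] = []

def append_turn(token_ids, logprobs, *, ids_out, logprobs_out):
    """Append one generation turn, keeping ids and log probs index-aligned."""
    assert len(token_ids) == len(logprobs), "one log prob per sampled token"
    ids_out.extend(token_ids)
    # The fix amounts to sending each turn's log probs to the list that
    # matches ids_out, so position i in both lists refers to the same token.
    logprobs_out.extend(logprobs)

append_turn([11, 12], [-0.5, -1.2],
            ids_out=response_ids, logprobs_out=response_oldlogprobs)
assert len(response_ids) == len(response_oldlogprobs)
```

If the two lists drift out of alignment, downstream loss terms pair each token with the wrong log prob, which corrupts the policy-gradient signal without raising any error.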
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…meituan-search/verl into standalone_old_log_prob_support
ArronHZG
left a comment
I feel it also needs to support multiple replicas and have a server manager for data distribution.
Furthermore, modifications to the main branch should be reduced, with the changes converging into the fully_async code path.
In the experiments, even without rejection-sampling correction, the rollout mismatch metric should be enabled so the results can be observed. Additionally, the current implementation can report the entropy metric.
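For readers unfamiliar with the two diagnostics mentioned, here is a hedged sketch of what they typically measure: a rollout-mismatch metric quantifies the gap between the rollout engine's and the training engine's log probs for the same sampled tokens, and entropy summarizes how peaked the policy's next-token distribution is. The formulas below are illustrative, not verl's actual metric definitions.

```python
# Illustrative metric sketches — not verl's actual implementations.
import math

def rollout_mismatch(rollout_logprobs, train_logprobs):
    """Mean absolute log-prob gap; 0.0 means the two engines agree exactly."""
    gaps = [abs(a - b) for a, b in zip(rollout_logprobs, train_logprobs)]
    return sum(gaps) / len(gaps)

def entropy(probs):
    """Shannon entropy of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

m = rollout_mismatch([-0.5, -1.0], [-0.5, -1.0])  # identical engines
```

A persistently large mismatch indicates the rollout weights have drifted from the training weights, which is exactly what a standalone log-prob server must guard against via weight synchronization.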
What does this PR do?
This PR introduces a standalone Model Engine Server for the fully_async training pipeline to compute `log_probs`. In the existing design, `old_log_probs` are recomputed by the actor training engine, which requires saving/restoring actor weights. This PR decouples that computation into a dedicated inference server (allocating additional GPUs for `log_probs` computation) that runs concurrently with rollout generation. Currently, only the Megatron backend is supported.
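The decoupling idea above can be sketched with a toy asyncio pipeline: rollout generation and old-log-prob computation run as separate async stages, so the log-prob forward passes overlap with other work instead of blocking the actor training engine. Every function and field name here is an assumption for illustration, not the PR's API.

```python
# Toy sketch of the decoupled pipeline — names are illustrative only.
import asyncio

async def rollout(prompt):
    await asyncio.sleep(0.01)  # stand-in for token generation
    return {"prompt": prompt, "response_ids": [1, 2, 3]}

async def compute_old_log_probs(sample):
    await asyncio.sleep(0.01)  # stand-in for a forward pass on the server
    sample["old_log_probs"] = [-0.1] * len(sample["response_ids"])
    return sample

async def pipeline(prompts):
    # Generate rollouts concurrently, then fan samples out to the
    # (conceptual) log-prob server without touching the training engine.
    samples = await asyncio.gather(*(rollout(p) for p in prompts))
    return await asyncio.gather(*(compute_old_log_probs(s) for s in samples))

batch = asyncio.run(pipeline(["a", "b"]))
```

Because the log-prob stage runs on its own GPUs, the trainer no longer needs to save and restore actor weights just to recompute `old_log_probs`.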
The implementation follows the existing `RolloutReplica` / `BaseRollout` / `CheckpointEngineWorker` architecture:

- `ModelEngineReplica`: `RolloutReplica` subclass; handles resource allocation, lifecycle, and weight sync
- `ModelEngineWorker`: `CheckpointEngineWorker` subclass; receives weights
- `ModelEngineServerAdapter`: `BaseRollout` adapter; wraps `TrainingWorker` for forward-only inference
- `ModelEngineServer`: serving entry point with the `pause_serving`/`resume_serving` protocol

To use the Model Engine Server, you need to use mbridge and apply this PR: ISEEKYAN/mbridge#117
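The component relationships above can be summarized with hypothetical class skeletons. Base-class names follow the PR's description, but all method bodies and attributes below are illustrative placeholders, not the actual implementation.

```python
# Hypothetical skeletons mirroring the listed components — illustrative only.
class ModelEngineWorker:
    """Stands in for the CheckpointEngineWorker subclass that receives weights."""
    def __init__(self):
        self.weights_version = 0

    def update_weights(self, version):
        # In the real worker, weights arrive from the trainer via weight sync.
        self.weights_version = version

class ModelEngineServer:
    """Serves forward-only log-prob requests; pausable for weight sync."""
    def __init__(self, worker):
        self.worker = worker
        self.serving = True

    def pause_serving(self):
        # Drain in-flight requests before weights change underneath them.
        self.serving = False

    def resume_serving(self):
        self.serving = True

    def compute_log_probs(self, response_ids):
        assert self.serving, "server paused for weight sync"
        return [0.0] * len(response_ids)  # placeholder for a real forward pass

server = ModelEngineServer(ModelEngineWorker())
server.pause_serving()
server.worker.update_weights(1)
server.resume_serving()
```

The pause/resume protocol is the key design point: log-prob requests must never interleave with a weight update, or different tokens in one batch could be scored by different policy versions.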
Checklist Before Starting

- Ensure the PR title follows `[{modules}] {type}: {description}` (this will be checked by the CI).
- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`, like `[megatron, fsdp, doc]`.
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
- If the PR is breaking, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
Using the Model Engine Server slightly increases timing_s/gen, but it effectively eliminates timing_s/old_log_prob and reduces end-to-end time.
Compared to the original old_log_prob calculation method (16 GPUs for training and 16 GPUs for rollout), the Model Engine Server (8 additional GPUs) yields an end-to-end speedup of approximately 1.64x. Accounting for the increased resource consumption, the effective speedup is about 1.09x.
API and Usage Example
Enable with:
example script:
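The concrete enable flags and example script are not included in the text above. As a purely hypothetical sketch of what enabling such a server could look like, every option name below is an assumption for illustration and not a key taken from this PR:

```shell
# Hypothetical launch fragment — all option names are illustrative
# placeholders, not this PR's actual configuration keys.
python -m verl.trainer.main_ppo \
    trainer.model_engine_server.enable=True \
    trainer.model_engine_server.n_gpus=8
```

Consult the PR's actual example script for the real configuration keys and resource-pool settings.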
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If your PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.