
[fully_async] feat: standalone log prob server (Model Engine Server) support#5990

Open
sl-1314 wants to merge 41 commits into verl-project:main from meituan-search:standalone_old_log_prob_support

Conversation


@sl-1314 sl-1314 commented Apr 13, 2026

What does this PR do?

This PR introduces a standalone Model Engine Server for the fully_async training pipeline to compute log_probs. In the existing design, old_log_probs are recomputed by the actor training engine, which requires saving/restoring actor weights. This PR decouples that computation into a dedicated inference server (running on additionally allocated GPUs for log_probs computation) that runs concurrently with rollout generation.

Currently, only the Megatron backend is supported.
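Conceptually, the standalone server runs a forward-only pass over the already-sampled rollout tokens and gathers their log-probabilities. Below is a pure-Python sketch of that gather step (the helper name and list-based inputs are hypothetical; the real computation is a batched Megatron forward on GPU):

```python
import math

def token_log_probs(logits, token_ids):
    """Per-token log-probabilities of the sampled tokens.

    logits: one logit vector per generated token position
    token_ids: the token id actually sampled at each position
    Hypothetical helper for illustration; the Model Engine Server
    runs this as a batched forward pass inside the training engine.
    """
    out = []
    for step_logits, tok in zip(logits, token_ids):
        m = max(step_logits)  # subtract max for a stable softmax
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        out.append(step_logits[tok] - log_z)
    return out

# With uniform logits over 4 tokens, any sampled id gets log(1/4).
lp = token_log_probs([[0.0, 0.0, 0.0, 0.0]], [2])
```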

The implementation follows the existing RolloutReplica / BaseRollout / CheckpointEngineWorker architecture:

| Class | Role |
| --- | --- |
| `ModelEngineReplica` | `RolloutReplica` subclass: resource allocation, lifecycle, weight sync |
| `ModelEngineWorker` | Ray actor per GPU; `CheckpointEngineWorker` subclass that receives weights |
| `ModelEngineServerAdapter` | `BaseRollout` adapter wrapping `TrainingWorker` for forward-only inference |
| `ModelEngineServer` | Ray actor: async batch scheduler with a `pause_serving`/`resume_serving` protocol |
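The `pause_serving`/`resume_serving` protocol can be sketched with an asyncio event (class name and method bodies here are hypothetical simplifications; the real `ModelEngineServer` is a Ray actor that also batches requests and pauses while weights are synced):

```python
import asyncio

class ModelEngineServerSketch:
    """Minimal sketch of the pause/resume serving protocol."""

    def __init__(self):
        self._serving = asyncio.Event()
        self._serving.set()  # serving by default

    def pause_serving(self):
        # e.g. called before new weights arrive from the trainer
        self._serving.clear()

    def resume_serving(self):
        self._serving.set()

    async def compute_log_probs(self, batch):
        await self._serving.wait()  # requests block while paused
        return [0.0 for _ in batch]  # placeholder for the forward pass

async def demo():
    srv = ModelEngineServerSketch()
    srv.pause_serving()
    task = asyncio.create_task(srv.compute_log_probs([1, 2, 3]))
    await asyncio.sleep(0)  # task starts and blocks on the event
    srv.resume_serving()    # weights "updated": unblock requests
    return await task

result = asyncio.run(demo())
```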

To use the Model Engine Server, you need mbridge with this PR applied: ISEEKYAN/mbridge#117

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching
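The title convention above can be expressed as a validation check. This is a hedged approximation of the CI rule (the module list is abbreviated and the actual CI regex may differ):

```python
import re

# Abbreviated module list from the checklist; the CI uses the full set.
MODULES = {"fsdp", "megatron", "rollout", "trainer", "fully_async", "doc"}
TYPES = {"feat", "fix", "refactor", "chore", "test"}

# [BREAKING] prefix optional, then [modules], then "type: description".
TITLE_RE = re.compile(r"^(\[BREAKING\])?\[([a-z_, ]+)\] (\w+): .+$")

def title_ok(title):
    m = TITLE_RE.match(title)
    if not m:
        return False
    mods = {s.strip() for s in m.group(2).split(",")}
    return mods <= MODULES and m.group(3) in TYPES

ok = title_ok("[fully_async] feat: standalone log prob server support")
```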

Test

Using the Model Engine Server slightly increases timing_s/gen, but it effectively eliminates timing_s/old_log_prob and reduces end-to-end time.


Compared to the original old_log_prob computation method (16 GPUs for training and 16 GPUs for rollout), adding a Model Engine Server on 8 additional GPUs yields an end-to-end time speedup of approximately 1.64x. Accounting for the increased resource consumption, the effective speedup ratio is about 1.09x.

API and Usage Example

Enable with:

model_engine_server.enable_standalone=True
model_engine_server.nnodes=nnodes
model_engine_server.n_gpus_per_node=n_gpus_per_node
# independent model-parallel config for the server
model_engine_server.megatron.pipeline_model_parallel_size=pp_size
model_engine_server.megatron.tensor_model_parallel_size=tp_size
...
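As an illustration of how these settings interact, here is a hypothetical sanity check (the keys mirror the fragment above, but this is not actual verl code): the server's world size must be divisible by its TP*PP product, and the quotient gives the number of data-parallel engine replicas.

```python
# Hypothetical config mirroring the model_engine_server fragment above.
cfg = {
    "enable_standalone": True,
    "nnodes": 1,
    "n_gpus_per_node": 8,
    "megatron": {
        "tensor_model_parallel_size": 2,
        "pipeline_model_parallel_size": 2,
    },
}

def server_world_size(cfg):
    return cfg["nnodes"] * cfg["n_gpus_per_node"]

def check_parallel_config(cfg):
    world = server_world_size(cfg)
    mp = (cfg["megatron"]["tensor_model_parallel_size"]
          * cfg["megatron"]["pipeline_model_parallel_size"])
    assert world % mp == 0, f"{world} GPUs not divisible by TP*PP={mp}"
    return world // mp  # data-parallel replicas of the server engine

dp = check_parallel_config(cfg)
```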

Example script:

verl/experimental/fully_async_policy/shell/dapo_7b_math_megatron_16_16_8.sh

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a standalone ModelEngineServer to compute 'old' log probabilities for fully asynchronous training. Key changes include the implementation of ModelEngineReplica and ModelEngineWorker, updates to the agent loop to handle response_oldlogprobs, and enhancements to Megatron utilities for weight synchronization via async generators. A critical bug was identified in the tool_agent_loop.py where old log probabilities were being appended to the wrong data list, which would lead to incorrect training data.

sl-1314 and others added 2 commits April 13, 2026 17:58
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

@ArronHZG ArronHZG left a comment


I feel it also needs to support multiple replicas and a server manager for data distribution.

Furthermore, modifications to the main branch should be reduced, converging into the fully_async module.

@ArronHZG

In the experiments, even without rejection-sampling correction, the rollout-mismatch metric needs to be enabled so the results can be observed.

Additionally, the current implementation can display the Entropy metric.

