fix(grpo): separate returns and advantages in GRPO estimator #5974
cyyueyang wants to merge 1 commit into verl-project:main
Conversation
Code Review
This pull request refactors the `compute_grpo_outcome_advantage` function to distinguish between advantages and returns, renaming the initial scores to `raw_scores` and returning the two values separately. Feedback suggests vectorizing the advantage calculation to avoid an inefficient Python loop over GPU tensors, and moving the raw score calculation inside the `torch.no_grad()` block. Reviewers also noted that the other GRPO-related estimators should be updated for consistency, so that raw reward metrics stay accurate in monitoring regardless of which estimator is configured.
```diff
+ advantages = raw_scores.clone()
  for i in range(bsz):
      if norm_adv_by_std_in_grpo:
-         scores[i] = (scores[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
+         advantages[i] = (advantages[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
      else:
-         scores[i] = scores[i] - id2mean[index[i]]
+         advantages[i] = advantages[i] - id2mean[index[i]]
- scores = scores.unsqueeze(-1) * response_mask
```
The current implementation uses a Python loop to perform element-wise updates on the `advantages` tensor. In PyTorch, especially when tensors are on a GPU, this is highly inefficient due to the overhead of many small operations and potential synchronization points. Since this function sits in the training inner loop, it should be vectorized. Additionally, the `raw_scores` calculation at line 304 should ideally be moved inside the `torch.no_grad()` block, so that the returns (which serve as targets) do not inadvertently carry gradients.
Suggested change:

```diff
- advantages = raw_scores.clone()
- for i in range(bsz):
-     if norm_adv_by_std_in_grpo:
-         advantages[i] = (advantages[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
-     else:
-         advantages[i] = advantages[i] - id2mean[index[i]]
+ means = torch.stack([id2mean[idx] for idx in index]).to(raw_scores.device)
+ if norm_adv_by_std_in_grpo:
+     stds = torch.stack([id2std[idx] for idx in index]).to(raw_scores.device)
+     advantages = (raw_scores - means) / (stds + epsilon)
+ else:
+     advantages = raw_scores - means
```
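The loop and the suggested vectorized form compute the same normalization. A minimal NumPy sketch (a stand-in for the torch code; the batch data and group ids here are invented, while `index`, `id2mean`, `id2std`, `epsilon`, and `norm_adv_by_std_in_grpo` mirror the names in the PR) checks that the two agree:

```python
import numpy as np

epsilon = 1e-6
norm_adv_by_std_in_grpo = True

# Hypothetical batch: each sample's group id and raw outcome score.
index = ["g0", "g0", "g1", "g1"]
raw_scores = np.array([1.0, 3.0, 2.0, 6.0])

# Per-group statistics, analogous to the id2mean/id2std dicts in the estimator.
groups = {g: [i for i, x in enumerate(index) if x == g] for g in set(index)}
id2mean = {g: raw_scores[ix].mean() for g, ix in groups.items()}
id2std = {g: raw_scores[ix].std() for g, ix in groups.items()}

# Loop version, as in the PR.
adv_loop = raw_scores.copy()
for i in range(len(index)):
    if norm_adv_by_std_in_grpo:
        adv_loop[i] = (adv_loop[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
    else:
        adv_loop[i] = adv_loop[i] - id2mean[index[i]]

# Vectorized version, as suggested: gather per-sample stats, then one array op.
means = np.array([id2mean[g] for g in index])
stds = np.array([id2std[g] for g in index])
if norm_adv_by_std_in_grpo:
    adv_vec = (raw_scores - means) / (stds + epsilon)
else:
    adv_vec = raw_scores - means

assert np.allclose(adv_loop, adv_vec)
```

The same gather-then-broadcast pattern is what `torch.stack` plus elementwise ops achieve on GPU, replacing `bsz` tiny kernel launches with one.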
```diff
  advantages = advantages.unsqueeze(-1) * response_mask
- return scores, scores
+ return advantages, returns
```
While this change correctly separates returns and advantages for `compute_grpo_outcome_advantage`, the PR is incomplete: it leaves the other GRPO-related estimators inconsistent. Specifically, `compute_grpo_vectorized_outcome_advantage` (line 360), `compute_gdpo_outcome_advantage` (line 470), and `compute_grpo_passk_outcome_advantage` (line 532) still return `advantages, advantages`. This inconsistency will lead to incorrect or missing raw reward metrics in monitoring whenever one of these alternative estimators is selected in the configuration.
Previously both `returns` and `advantages` were the normalized scores, which made monitoring raw reward metrics impossible. Now `returns` preserves the raw outcome rewards, while `advantages` contains the group-normalized values used for the policy gradient.
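The separation can be illustrated with a small runnable sketch (NumPy stand-in for the torch code; the scores and masks are invented, but the broadcasting over `response_mask` follows the estimator): `returns` carries the raw sequence-level reward per token, `advantages` the group-normalized signal.

```python
import numpy as np

epsilon = 1e-6

# Two responses in one group: raw outcome scores and token-level response masks.
raw_scores = np.array([1.0, 3.0])                # shape (bsz,)
response_mask = np.array([[1.0, 1.0, 0.0],       # shape (bsz, seq_len)
                          [1.0, 1.0, 1.0]])

# Group normalization over the (single) group.
mean, std = raw_scores.mean(), raw_scores.std()
normalized = (raw_scores - mean) / (std + epsilon)

# returns: raw rewards broadcast over the mask -> usable for raw reward metrics.
returns = raw_scores[:, None] * response_mask
# advantages: normalized values broadcast over the mask -> policy-gradient signal.
advantages = normalized[:, None] * response_mask
```

With the old `return scores, scores` behavior, both outputs would have held the normalized values and the raw reward information would have been lost to monitoring.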