
fix(grpo): separate returns and advantages in GRPO estimator #5974

Open

cyyueyang wants to merge 1 commit into verl-project:main from cyyueyang:fix_grpo_advantage

Conversation

@cyyueyang

Previously, both `returns` and `advantages` held the group-normalized scores, which made it impossible to monitor raw reward metrics. Now `returns` preserves the raw outcome rewards, while `advantages` contains the group-normalized values used for the policy gradient.
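A minimal sketch of the intended separation on toy data (the helper name `grpo_advantage_sketch` and its signature are illustrative, not the actual verl API; group statistics are computed inline for clarity):

```python
import torch

def grpo_advantage_sketch(raw_scores, index, response_mask, epsilon=1e-6,
                          norm_adv_by_std=True):
    """Return (advantages, returns): advantages are group-normalized,
    returns keep the raw outcome rewards. Illustrative sketch only."""
    with torch.no_grad():
        # Per-group mean/std over samples that share the same prompt index.
        means = torch.empty_like(raw_scores)
        stds = torch.ones_like(raw_scores)
        for g in set(index.tolist()):
            mask = index == g
            means[mask] = raw_scores[mask].mean()
            if norm_adv_by_std:
                stds[mask] = raw_scores[mask].std()
        if norm_adv_by_std:
            advantages = (raw_scores - means) / (stds + epsilon)
        else:
            advantages = raw_scores - means
        # Broadcast per-sample values over the response tokens.
        advantages = advantages.unsqueeze(-1) * response_mask
        returns = raw_scores.unsqueeze(-1) * response_mask
    return advantages, returns
```

With this split, `returns` can be logged directly as the raw reward signal while `advantages` feeds the policy-gradient loss.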

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request refactors the compute_grpo_outcome_advantage function to distinguish between advantages and returns by renaming the initial scores and returning both values separately. Feedback suggests vectorizing the advantage calculation to avoid inefficient Python loops on GPU tensors and moving the raw score calculation into the no_grad block. Additionally, it was noted that other GRPO-related estimators should be updated for consistency to ensure accurate reward metrics in monitoring.

Comment on lines +324 to +329
```diff
+advantages = raw_scores.clone()
 for i in range(bsz):
     if norm_adv_by_std_in_grpo:
-        scores[i] = (scores[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
+        advantages[i] = (advantages[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
     else:
-        scores[i] = scores[i] - id2mean[index[i]]
-scores = scores.unsqueeze(-1) * response_mask
+        advantages[i] = advantages[i] - id2mean[index[i]]
```
Contributor


Severity: high

The current implementation uses a Python loop to perform element-wise updates on the advantages tensor. In PyTorch, especially when tensors are on a GPU, this is highly inefficient due to the overhead of multiple small operations and potential synchronization points. Since this function is part of the training inner loop, it should be vectorized to improve performance. Additionally, the raw_scores calculation at line 304 should ideally be moved inside the torch.no_grad() block to ensure that the returns (which serve as targets) do not inadvertently carry gradients.

Suggested change

```diff
-advantages = raw_scores.clone()
-for i in range(bsz):
-    if norm_adv_by_std_in_grpo:
-        advantages[i] = (advantages[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
-    else:
-        advantages[i] = advantages[i] - id2mean[index[i]]
+means = torch.stack([id2mean[idx] for idx in index]).to(raw_scores.device)
+if norm_adv_by_std_in_grpo:
+    stds = torch.stack([id2std[idx] for idx in index]).to(raw_scores.device)
+    advantages = (raw_scores - means) / (stds + epsilon)
+else:
+    advantages = raw_scores - means

 advantages = advantages.unsqueeze(-1) * response_mask

-return scores, scores
+return advantages, returns
```
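The equivalence of the loop and the vectorized form can be checked on toy data. This is a standalone sketch: `id2mean`/`id2std` and the index layout mirror the snippet above but are constructed here purely for illustration:

```python
import torch

torch.manual_seed(0)
bsz, epsilon = 6, 1e-6
raw_scores = torch.randn(bsz)
index = [0, 0, 0, 1, 1, 1]  # three responses per prompt group

# Group statistics keyed by prompt index, as in the estimator.
groups = {g: [i for i, x in enumerate(index) if x == g] for g in set(index)}
id2mean = {g: raw_scores[rows].mean() for g, rows in groups.items()}
id2std = {g: raw_scores[rows].std() for g, rows in groups.items()}

# Loop version (the PR's code).
loop_adv = raw_scores.clone()
for i in range(bsz):
    loop_adv[i] = (loop_adv[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)

# Vectorized version (the suggestion): one stack + one elementwise op.
means = torch.stack([id2mean[idx] for idx in index])
stds = torch.stack([id2std[idx] for idx in index])
vec_adv = (raw_scores - means) / (stds + epsilon)

assert torch.allclose(loop_adv, vec_adv)
```

The vectorized form replaces `bsz` small kernel launches (and possible host-device synchronizations) with a single batched operation.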
Contributor


Severity: high

While this change correctly separates returns and advantages for the compute_grpo_outcome_advantage function, the PR is incomplete as it leaves other GRPO-related estimators inconsistent. Specifically, compute_grpo_vectorized_outcome_advantage (line 360), compute_gdpo_outcome_advantage (line 470), and compute_grpo_passk_outcome_advantage (line 532) still return advantages, advantages. This inconsistency will lead to incorrect or missing raw reward metrics in monitoring when these alternative estimators are selected in the configuration.
