fix(grpo): separate returns and advantages in GRPO estimator #5974
cyyueyang wants to merge 1 commit into verl-project:main
Conversation
Code Review
This pull request refactors the `compute_grpo_outcome_advantage` function to distinguish between advantages and returns, renaming the initial scores to `raw_scores` and returning the two values separately. Feedback suggests vectorizing the advantage calculation to avoid an inefficient Python loop over GPU tensors, and moving the raw score calculation inside the `torch.no_grad()` block. Reviewers also noted that the other GRPO-related estimators should be updated for consistency, so that raw reward metrics stay accurate in monitoring regardless of which estimator is configured.
```diff
+ advantages = raw_scores.clone()
  for i in range(bsz):
      if norm_adv_by_std_in_grpo:
-         scores[i] = (scores[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
+         advantages[i] = (advantages[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
      else:
-         scores[i] = scores[i] - id2mean[index[i]]
+         advantages[i] = advantages[i] - id2mean[index[i]]
- scores = scores.unsqueeze(-1) * response_mask
```
The current implementation uses a Python loop to perform element-wise updates on the `advantages` tensor. In PyTorch, especially when tensors are on a GPU, this is highly inefficient due to the overhead of many small operations and potential synchronization points. Since this function sits in the training inner loop, it should be vectorized. Additionally, the `raw_scores` calculation at line 304 should ideally be moved inside the `torch.no_grad()` block, so that the returns (which serve as targets) do not inadvertently carry gradients.
Suggested change:

```diff
- advantages = raw_scores.clone()
- for i in range(bsz):
-     if norm_adv_by_std_in_grpo:
-         advantages[i] = (advantages[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
-     else:
-         advantages[i] = advantages[i] - id2mean[index[i]]
+ means = torch.stack([id2mean[idx] for idx in index]).to(raw_scores.device)
+ if norm_adv_by_std_in_grpo:
+     stds = torch.stack([id2std[idx] for idx in index]).to(raw_scores.device)
+     advantages = (raw_scores - means) / (stds + epsilon)
+ else:
+     advantages = raw_scores - means
```
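The loop and the suggested vectorized form compute the same normalization. A minimal NumPy sketch (a stand-in for the torch code; the batch data and group ids here are invented, while `index`, `id2mean`, `id2std`, `epsilon`, and `norm_adv_by_std_in_grpo` mirror the names in the PR) checks that the two agree:

```python
import numpy as np

epsilon = 1e-6
norm_adv_by_std_in_grpo = True

# Hypothetical batch: each sample's group id and raw outcome score.
index = ["g0", "g0", "g1", "g1"]
raw_scores = np.array([1.0, 3.0, 2.0, 6.0])

# Per-group statistics, analogous to the id2mean/id2std dicts in the estimator.
groups = {g: [i for i, x in enumerate(index) if x == g] for g in set(index)}
id2mean = {g: raw_scores[ix].mean() for g, ix in groups.items()}
id2std = {g: raw_scores[ix].std() for g, ix in groups.items()}

# Loop version, as in the PR.
adv_loop = raw_scores.copy()
for i in range(len(index)):
    if norm_adv_by_std_in_grpo:
        adv_loop[i] = (adv_loop[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
    else:
        adv_loop[i] = adv_loop[i] - id2mean[index[i]]

# Vectorized version, as suggested: gather per-sample stats, then one array op.
means = np.array([id2mean[g] for g in index])
stds = np.array([id2std[g] for g in index])
if norm_adv_by_std_in_grpo:
    adv_vec = (raw_scores - means) / (stds + epsilon)
else:
    adv_vec = raw_scores - means

assert np.allclose(adv_loop, adv_vec)
```

The same gather-then-broadcast pattern is what `torch.stack` plus elementwise ops achieve on GPU, replacing `bsz` tiny kernel launches with one.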
```diff
  advantages = advantages.unsqueeze(-1) * response_mask
- return scores, scores
+ return advantages, returns
```
While this change correctly separates returns and advantages for `compute_grpo_outcome_advantage`, the PR is incomplete: it leaves the other GRPO-related estimators inconsistent. Specifically, `compute_grpo_vectorized_outcome_advantage` (line 360), `compute_gdpo_outcome_advantage` (line 470), and `compute_grpo_passk_outcome_advantage` (line 532) still return `advantages, advantages`. This inconsistency will lead to incorrect or missing raw reward metrics in monitoring whenever one of these alternative estimators is selected in the configuration.
Previously both `returns` and `advantages` were the normalized scores, which made monitoring raw reward metrics impossible. Now `returns` preserves the raw outcome rewards, while `advantages` contains the group-normalized values used for the policy gradient.
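The separation can be illustrated with a small runnable sketch (NumPy stand-in for the torch code; the scores and masks are invented, but the broadcasting over `response_mask` follows the estimator): `returns` carries the raw sequence-level reward per token, `advantages` the group-normalized signal.

```python
import numpy as np

epsilon = 1e-6

# Two responses in one group: raw outcome scores and token-level response masks.
raw_scores = np.array([1.0, 3.0])                # shape (bsz,)
response_mask = np.array([[1.0, 1.0, 0.0],       # shape (bsz, seq_len)
                          [1.0, 1.0, 1.0]])

# Group normalization over the (single) group.
mean, std = raw_scores.mean(), raw_scores.std()
normalized = (raw_scores - mean) / (std + epsilon)

# returns: raw rewards broadcast over the mask -> usable for raw reward metrics.
returns = raw_scores[:, None] * response_mask
# advantages: normalized values broadcast over the mask -> policy-gradient signal.
advantages = normalized[:, None] * response_mask
```

With the old `return scores, scores` behavior, both outputs would have held the normalized values and the raw reward information would have been lost to monitoring.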