GRPO
GRPO (Group Relative Policy Optimization) was introduced by DeepSeek in the DeepSeekMath paper and popularized by the DeepSeek-R1 release.
It is an RL method designed specifically for Large Language Models (LLMs). It improves on PPO by removing the need for a Value Function (Critic) network.
In standard PPO, we need to calculate the Advantage, which requires training a separate Value Function (the Critic):
- In robotics, the Critic is a tiny MLP.
- In LLMs, the Critic must understand the text as well as the Policy. This means the Critic is usually a copy of the Policy model (e.g., a 7B or 70B parameter transformer).
The Cost:
- Memory: You need to hold the Policy + the Critic + Gradients for both + Optimizer states for both. This effectively doubles the VRAM requirements.
- Compute: You have to run forward/backward passes for both models.
GRPO eliminates the Critic entirely. Instead of learning a value function, it estimates the baseline from a group of sampled completions of the same prompt.
For each prompt (state) $q$:
- Sample: Generate $G$ different completions for the question $q$.
- Score: Calculate the reward $r_i$ for each completion using the reward model (or a rule-based checker).
- Advantage: Calculate the advantage for each output by normalizing the rewards within the group (as sketched in the code after this list):

$$ A_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\}) + \epsilon} $$
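A minimal sketch of the group-relative advantage for a single prompt, assuming a 0/1 rule-based checker; the function name `group_relative_advantages` and the $\epsilon$ value are illustrative, not taken from the paper:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: shape (G,), one scalar reward per sampled completion of the same prompt.
    # Normalize within the group: above-average completions get positive advantages,
    # below-average completions get negative advantages.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 completions of one math question, scored 1.0 if the final answer
# is correct and 0.0 otherwise (rule-based checker).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # correct answers positive, wrong ones negative
```

In practice this is computed per prompt over a batch of prompts, and each completion's advantage is broadcast across the tokens of that completion when the policy-gradient loss is formed.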
The objective function is then similar to PPO (using the clipped ratio), but with this group-based advantage:

$$ J_{GRPO}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^G \min \left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i,\ \text{clip}\!\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\ 1-\epsilon,\ 1+\epsilon \right) A_i \right) - \beta\, D_{KL}(\pi_\theta \,\|\, \pi_{ref}) \right] $$
- Baseline: The mean reward of the group acts as the baseline $V(s)$. If an output $o_i$ is better than the group average, it has a positive advantage; if it's worse, it has a negative advantage.
- Efficiency: No Critic network is needed. This saves ~50% of the memory, allowing larger models to be trained or larger batch sizes to be used (see the loss sketch below).
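Putting the objective above into code, here is a minimal sketch assuming sequence-level log-probabilities for one group (real implementations work per token and average over tokens); `grpo_loss`, `clip_eps=0.2`, and `beta` are illustrative choices rather than prescribed values. The KL term uses the unbiased estimator $\frac{\pi_{ref}}{\pi_\theta} - \log\frac{\pi_{ref}}{\pi_\theta} - 1$ described in the DeepSeekMath paper:

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # log pi_theta(o_i|q), current policy, shape (G,)
              logp_old: torch.Tensor,   # log pi_theta_old(o_i|q), sampling policy, detached
              logp_ref: torch.Tensor,   # log pi_ref(o_i|q), frozen reference model, detached
              advantages: torch.Tensor, # group-normalized A_i, shape (G,)
              clip_eps: float = 0.2,
              beta: float = 0.04) -> torch.Tensor:
    # PPO-style clipped surrogate, using the group-relative advantage.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()

    # Unbiased estimator of D_KL(pi_theta || pi_ref): pi_ref/pi_theta - log(pi_ref/pi_theta) - 1.
    log_ratio_ref = logp_ref - logp_new
    kl = (torch.exp(log_ratio_ref) - log_ratio_ref - 1.0).mean()

    # Negate because we maximize J_GRPO but optimizers minimize.
    return -(surrogate - beta * kl)
```

A training step would then sample $G$ completions per prompt, score them, compute the group-normalized advantages, and backpropagate through `grpo_loss` with the usual optimizer update.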
| Feature | TRPO | PPO | GRPO |
|---|---|---|---|
| Constraint | Hard KL Constraint | Clipped Objective | Clipped Objective + KL Penalty |
| Optimization | Second-order (Conjugate Gradient) | First-order (Adam) | First-order (Adam) |
| Critic Required? | Yes | Yes | No |
| Best For | Theoretical guarantees | General Purpose RL | LLM Reasoning / Fine-tuning |
GRPO is ideal when:
- The Environment is Resettable: You can ask the model the same question multiple times (trivial for LLMs).
- The Critic is Expensive: The model is so large that duplicating it for a Critic is prohibitive.
- Sparse Rewards: By sampling a group, you are more likely to find at least one successful trajectory to learn from (exploration).