Actor-Critic Methods
Actor-Critic methods are a family of reinforcement learning algorithms that combine two components:
- Actor: A policy network that decides which actions to take
- Critic: A value function that evaluates how good states (or state-action pairs) are
This architecture addresses key limitations of pure policy gradient methods (like REINFORCE) and pure value-based methods (like Q-learning).
The actor is a policy
The actor's job is to act - to decide what to do in each state.
The critic is a value function
The critic's job is to critique - to evaluate how good the actor's decisions are by estimating advantages: $$ A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t) $$
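As a minimal sketch of these two pieces (assuming PyTorch, a discrete action space, and placeholder dimensions chosen purely for illustration):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 4, 2  # placeholder sizes, e.g. a CartPole-like task

# Actor: maps a state to a distribution over actions (the policy pi_theta)
actor = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, n_actions),  # action logits
)

# Critic: maps a state to a scalar value estimate V(s)
critic = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 1),
)

state = torch.randn(obs_dim)             # placeholder state
dist = Categorical(logits=actor(state))  # pi_theta(a | s_t)
action = dist.sample()                   # the actor acts
value = critic(state)                    # the critic evaluates the state
```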
Actor-critic methods address problems with earlier approaches. Pure policy gradient methods such as REINFORCE suffer from:
- High variance: Gradient estimates vary wildly between trajectories
- Sample inefficiency: Requires many episodes to get stable gradients
- Slow learning: High variance means slow convergence
Actor-critic solution: Use the critic's value estimates to compute advantages, which have much lower variance than raw trajectory returns.
Pure value-based methods such as Q-learning have their own limitations:
- Can't handle continuous actions: Must discretize the action space
- Can't learn stochastic policies: Only learns a deterministic argmax policy
- Exploration challenges: Requires epsilon-greedy or other exploration heuristics
Actor-critic solution: The actor directly parameterizes a stochastic policy that can naturally handle continuous actions and exploration.
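For continuous actions, one common parameterization (shown here as an assumed example, not something prescribed above) is a Gaussian policy whose mean comes from the network; sampling from it gives stochasticity and exploration for free:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

obs_dim, act_dim = 8, 2  # placeholder dimensions

mean_net = nn.Linear(obs_dim, act_dim)        # outputs the mean action
log_std = nn.Parameter(torch.zeros(act_dim))  # learned, state-independent std

state = torch.randn(obs_dim)
dist = Normal(mean_net(state), log_std.exp())  # pi_theta(a | s) over R^act_dim
action = dist.sample()                         # continuous, stochastic action
log_prob = dist.log_prob(action).sum()         # needed later for the actor update
```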
The actor-critic loop at each timestep (a code sketch of one such update step follows this list):
- Actor takes action: sample $a_t \sim \pi_\theta(a|s_t)$
- Environment responds: observe reward $r_t$ and next state $s_{t+1}$
- Critic evaluates: compute an advantage estimate (e.g., the TD error)
  $$A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$$
- Both learn:
  - Critic update: minimize the value function error
    $$\mathcal{L}_{\text{critic}} = (V(s_t) - \text{target})^2$$
  - Actor update: maximize expected return using the advantage
    $$\nabla_\theta J(\pi_\theta) \propto \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A(s_t, a_t)$$
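A minimal sketch of one such update step on a single transition, assuming PyTorch, a discrete action space, and placeholder tensors standing in for real environment interaction:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, gamma = 4, 2, 0.99  # placeholder sizes and discount
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Placeholder transition (s_t, r_t, s_{t+1}); in practice it comes from the environment
s, s_next = torch.randn(obs_dim), torch.randn(obs_dim)
r, done = torch.tensor([1.0]), False

# Actor takes action
dist = Categorical(logits=actor(s))
a = dist.sample()

# Critic evaluates: TD target and TD-error advantage
v_s = critic(s)                            # V(s_t)
with torch.no_grad():
    v_next = torch.zeros(1) if done else critic(s_next)
    target = r + gamma * v_next            # r_t + gamma * V(s_{t+1})
    advantage = target - v_s               # A(s_t, a_t), treated as a constant

# Critic update: minimize (V(s_t) - target)^2
critic_loss = (v_s - target).pow(2).mean()
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# Actor update: log pi(a_t | s_t) weighted by the advantage (negated, since we minimize)
actor_loss = -(dist.log_prob(a) * advantage).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```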
Actor-critic methods fall into two broad families, depending on what the critic estimates:

| Critic type | Estimates | Examples |
| --- | --- | --- |
| State-value critic | $V(s)$, from which advantages are computed | A2C, A3C, PPO |
| Q-function critic | $Q(s, a)$ | DDPG, TD3, SAC |
A2C / A3C (Advantage Actor-Critic)
- Uses a state-value critic $V(s)$
- Computes advantages using the TD error or GAE (a GAE sketch follows this list)
- A3C is the asynchronous version, with multiple parallel workers
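Since GAE comes up here and again in PPO below, a short sketch of the standard backward recursion, assuming NumPy arrays of rewards, value estimates, and done flags from a single rollout:

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: A_t = sum_l (gamma * lam)^l * delta_{t+l}."""
    T = len(rewards)
    values = np.append(values, last_value)  # V(s_0), ..., V(s_T)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), zeroing the bootstrap at episode ends
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages

# Toy rollout of length 5 with made-up numbers
adv = gae_advantages(rewards=np.ones(5), values=np.zeros(5), dones=np.zeros(5), last_value=0.0)
```

Setting lam=0 recovers the one-step TD error above, while lam=1 recovers full Monte Carlo returns minus the value baseline, so lam trades bias against variance.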
PPO (Proximal Policy Optimization)
- Actor-critic with a clipped objective to prevent large policy updates (sketched after this list)
- Uses GAE for advantage estimation
- One of the most popular modern RL algorithms
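A sketch of the clipped surrogate loss, assuming advantages and log-probabilities under the old and new policies have already been computed:

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped surrogate objective: ignores ratio changes outside [1 - eps, 1 + eps]."""
    ratio = torch.exp(new_logp - old_logp)        # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negated because we minimize

# Toy batch of 8 samples
new_logp = torch.randn(8, requires_grad=True)
old_logp = new_logp.detach() + 0.1 * torch.randn(8)
advantages = torch.randn(8)
loss = ppo_clip_loss(new_logp, old_logp, advantages)
loss.backward()
```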
TRPO (Trust Region Policy Optimization)
- Constrains policy updates to a "trust region" using a KL-divergence constraint
- Theoretically principled but computationally expensive
- PPO is a simpler approximation of TRPO
DDPG (Deep Deterministic Policy Gradient)
- For continuous action spaces
- Uses a Q-function critic
- Deterministic policy with added noise for exploration
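A minimal sketch of DDPG's action selection: a deterministic network output plus exploration noise (Gaussian noise here for simplicity; the original DDPG paper used Ornstein-Uhlenbeck noise):

```python
import torch
import torch.nn as nn

obs_dim, act_dim, noise_std = 8, 2, 0.1  # placeholder sizes and noise scale

# Deterministic actor mu(s), squashed to [-1, 1]
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())

state = torch.randn(obs_dim)
action = actor(state) + noise_std * torch.randn(act_dim)  # add exploration noise
action = action.clamp(-1.0, 1.0)                          # keep within action bounds
```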
SAC (Soft Actor-Critic)
- Entropy-regularized actor-critic
- Encourages exploration through a maximum entropy objective (sketched below)
- State-of-the-art for many continuous control tasks
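A simplified sketch of SAC's entropy-regularized actor loss (real SAC also uses twin Q critics and the reparameterization trick, both omitted here):

```python
import torch

def sac_actor_loss(q_values, log_probs, alpha=0.2):
    """Maximize Q + alpha * entropy, i.e. minimize alpha * log pi(a|s) - Q(s, a)."""
    return (alpha * log_probs - q_values).mean()

# Toy batch: Q(s, a) for sampled actions and their log-probabilities under pi
q_values = torch.randn(8)
log_probs = torch.randn(8, requires_grad=True)
loss = sac_actor_loss(q_values, log_probs)
loss.backward()
```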
Advantages:
- Lower variance than pure policy gradient methods
- Handles continuous actions, unlike pure value-based methods
- More sample efficient than REINFORCE
- Can learn stochastic policies naturally
Disadvantages:
- More complex: Two networks to train instead of one
- Can be unstable: Critic errors can destabilize actor training
- Hyperparameter sensitive: Learning rates, advantage estimation method, etc.
- Training challenges: Need to balance actor and critic learning
When implementing actor-critic methods, you must decide:
- Advantage estimation: TD error, n-step, or GAE?
- Network architecture: Shared or separate networks for actor and critic?
- Update frequency: Update every step, or batch updates?
- On-policy vs off-policy: A2C/PPO (on-policy) or DDPG/SAC (off-policy)?
- Entropy regularization: Add entropy bonus to encourage exploration?
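For the last point, entropy regularization typically amounts to one extra term in the actor loss; a sketch assuming a Categorical policy, precomputed advantages, and a hypothetical coefficient ent_coef:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, ent_coef = 4, 2, 0.01  # ent_coef is an illustrative choice
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

states = torch.randn(8, obs_dim)        # toy batch of states
actions = torch.randint(n_actions, (8,))
advantages = torch.randn(8)             # assumed precomputed (TD error, n-step, or GAE)

dist = Categorical(logits=actor(states))
pg_loss = -(dist.log_prob(actions) * advantages).mean()
entropy_bonus = dist.entropy().mean()   # higher entropy -> more exploratory policy
actor_loss = pg_loss - ent_coef * entropy_bonus
actor_loss.backward()
```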