
Actor-Critic Methods


Actor-Critic methods are a family of reinforcement learning algorithms that combine two components:

  • Actor: A policy network that decides which actions to take
  • Critic: A value function that evaluates how good states (or state-action pairs) are

This architecture addresses key limitations of pure policy gradient methods (like Machine-Learning-&-Data-Science-Reinforcement-Learning-Policy-Optimization-REINFORCE) and pure value-based methods (like Q-learning).

Architecture

The Actor

The actor is a policy $\pi_\theta(a|s)$ parameterized by $\theta$ that maps states to actions (or action probabilities). It is updated using policy gradients: $$ \nabla_\theta J(\pi_\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \, A(s_{i,t}, a_{i,t}) $$

The actor's job is to act - to decide what to do in each state.
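
To make the actor concrete, here is a minimal PyTorch sketch of a discrete-action policy network and the corresponding policy-gradient loss term. PyTorch, the layer sizes, and the example numbers are assumptions for illustration, not part of this page.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Actor(nn.Module):
    """Minimal policy network pi_theta(a|s) for a discrete action space."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),   # logits over actions
        )

    def forward(self, state):
        return Categorical(logits=self.net(state))

# Policy-gradient term for one transition: maximizing log pi(a|s) * A
# is implemented by minimizing its negative.
actor = Actor(state_dim=4, n_actions=2)
state = torch.randn(4)                   # dummy state for illustration
dist = actor(state)
action = dist.sample()
advantage = torch.tensor(0.7)            # supplied by the critic (next section)
actor_loss = -dist.log_prob(action) * advantage
actor_loss.backward()                    # gradients flow only into the actor
```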

The Critic

The critic is a value function $V^{\pi}(s)$ (or sometimes an action-value function $Q^{\pi}(s,a)$) that estimates expected returns. It is trained using temporal-difference learning or Monte Carlo methods.

The critic's job is to critique - to evaluate how good the actor's decisions are by estimating advantages: $$ A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t) $$
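
A matching sketch of a state-value critic and its TD-error advantage estimate, again assuming PyTorch; the network size and the dummy transition are illustrative.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Minimal state-value network V(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)

critic = Critic(state_dim=4)
gamma = 0.99
state, next_state = torch.randn(4), torch.randn(4)   # dummy transition
reward, done = 1.0, False

# TD-error advantage: A(s,a) ~ r + gamma * V(s') - V(s)
with torch.no_grad():
    td_target = reward + gamma * critic(next_state) * (1.0 - float(done))
value = critic(state)
advantage = (td_target - value).detach()   # passed to the actor, no critic gradient
critic_loss = (value - td_target).pow(2)   # squared TD error trains the critic
critic_loss.backward()
```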

Why Actor-Critic?

Actor-critic methods solve problems with earlier approaches:

Problems with Pure Policy Gradients (REINFORCE)

  • High variance: Gradient estimates vary wildly between trajectories
  • Sample inefficiency: Requires many episodes to get stable gradients
  • Slow learning: High variance means slow convergence

Actor-critic solution: Use the critic's value estimates to compute advantages, which have much lower variance than raw trajectory returns.

Problems with Pure Value-Based Methods (DQN, Q-learning)

  • Can't handle continuous actions: Must discretize the action space
  • Can't learn stochastic policies: Only learns a deterministic argmax policy
  • Exploration challenges: Requires epsilon-greedy or other exploration hacks

Actor-critic solution: The actor directly parameterizes a stochastic policy that can naturally handle continuous actions and exploration.

How They Work Together

  1. Actor takes action: Sample $a_t \sim \pi_\theta(a|s_t)$
  2. Environment responds: Observe reward $r_t$ and next state $s_{t+1}$
  3. Critic evaluates: Compute an advantage estimate (e.g., the TD error) $$A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$$
  4. Both learn:
    • Critic update: Minimize the value-function error, where the target is $r_t + \gamma V(s_{t+1})$: $$\mathcal{L}_{\text{critic}} = \left(V(s_t) - \left(r_t + \gamma V(s_{t+1})\right)\right)^2$$
    • Actor update: Maximize expected return using advantage $$\nabla_\theta J(\pi_\theta) \propto \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A(s_t, a_t)$$
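
Putting the four steps together, a single actor-critic update can be sketched as below. This is a minimal illustration, not a reference implementation: it assumes the Actor and Critic classes from the sketches above, an environment whose step(action) returns (next_state, reward, done) with states as float tensors, and arbitrary learning rates.

```python
import torch

actor, critic = Actor(state_dim=4, n_actions=2), Critic(state_dim=4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def actor_critic_step(state, env):
    # 1. Actor takes action: a_t ~ pi_theta(.|s_t)
    dist = actor(state)
    action = dist.sample()

    # 2. Environment responds with r_t and s_{t+1}
    next_state, reward, done = env.step(action.item())   # assumed interface

    # 3. Critic evaluates: A = r_t + gamma * V(s_{t+1}) - V(s_t)
    with torch.no_grad():
        td_target = reward + gamma * critic(next_state) * (1.0 - float(done))
    value = critic(state)
    advantage = (td_target - value).detach()

    # 4. Both learn: squared TD error for the critic,
    #    advantage-weighted log-probability for the actor.
    critic_loss = (value - td_target).pow(2)
    actor_loss = -dist.log_prob(action) * advantage
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return next_state, done
```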

Types of Critics

State-Value Critic

Estimates $V^{\pi}(s)$, used to compute advantages via TD errors: $$A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Examples: A2C, A3C, PPO

Action-Value Critic

Estimates the Q-function $Q^{\pi}(s,a)$, which can directly provide action values: $$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

Examples: DDPG, TD3, SAC
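
For a discrete action space, the baseline $V(s)$ can be recovered from the Q-critic itself, since $V(s) = \sum_a \pi(a|s)\, Q(s,a)$. A brief sketch with illustrative numbers:

```python
import torch

q_values = torch.tensor([1.2, 0.4, -0.3])   # Q(s, a) for each action
probs = torch.tensor([0.6, 0.3, 0.1])       # pi(a|s) from the actor

v = (probs * q_values).sum()   # V(s) = E_{a~pi}[Q(s,a)]
advantages = q_values - v      # A(s,a) = Q(s,a) - V(s)
```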

Common Actor-Critic Algorithms

A2C/A3C (Advantage Actor-Critic)

  • Uses state-value critic $V(s)$
  • Computes advantages using TD error or GAE
  • A3C is asynchronous version with multiple parallel workers

PPO (Proximal Policy Optimization)

  • Actor-critic with clipped objective to prevent large policy updates
  • Uses GAE for advantage estimation
  • One of the most popular modern RL algorithms
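
The clipped surrogate objective fits in a few lines. The sketch below assumes the log-probabilities and advantages have already been collected under the old policy; the clip range of 0.2 is a common but arbitrary choice.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss: -E[min(r * A, clip(r, 1-eps, 1+eps) * A)],
    where r = pi_new(a|s) / pi_old(a|s) is the probability ratio."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```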

TRPO (Trust Region Policy Optimization)

  • Constrains policy updates to a "trust region" using KL divergence
  • Theoretically principled but computationally expensive
  • PPO is a simpler approximation of TRPO

DDPG/TD3 (Deep Deterministic Policy Gradient)

  • For continuous action spaces
  • Uses Q-function critic
  • Deterministic policy with noise for exploration
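
With a deterministic actor $\mu_\phi(s)$ and a Q-critic, the actor is updated by ascending $Q(s, \mu_\phi(s))$. A hedged sketch; the actor and q_critic callables are assumptions standing in for the corresponding networks:

```python
import torch

def deterministic_actor_loss(actor, q_critic, states):
    """Deterministic policy gradient: maximize Q(s, mu(s)) by minimizing
    its negative. Exploration noise is added only when acting, not here."""
    actions = actor(states)                    # a = mu(s), no sampling
    return -q_critic(states, actions).mean()
```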

SAC (Soft Actor-Critic)

  • Entropy-regularized actor-critic
  • Encourages exploration through maximum entropy objective
  • State-of-the-art for many continuous control tasks
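
As an illustration of the maximum-entropy objective, a SAC-style actor loss penalizes the log-probability of the sampled action (rewarding entropy) alongside the Q-value. The sketch assumes a hypothetical policy.rsample_with_log_prob helper returning reparameterized actions with their log-probabilities, and a q_critic callable (in practice the minimum of twin critics):

```python
import torch

def soft_actor_loss(policy, q_critic, states, alpha=0.2):
    """Entropy-regularized actor objective:
    minimize E[ alpha * log pi(a|s) - Q(s, a) ] with a ~ pi(.|s),
    sampled via the reparameterization trick so gradients flow through a."""
    actions, log_probs = policy.rsample_with_log_prob(states)  # hypothetical helper
    q_values = q_critic(states, actions)
    return (alpha * log_probs - q_values).mean()
```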

Advantages and Disadvantages

Advantages:

  • Lower variance than pure policy gradient methods
  • Handles continuous actions unlike pure value-based methods
  • More sample efficient than REINFORCE
  • Can learn stochastic policies naturally

Disadvantages:

  • More complex: Two networks to train instead of one
  • Can be unstable: Critic errors can destabilize actor training
  • Hyperparameter sensitive: Learning rates, advantage estimation method, etc.
  • Training challenges: Need to balance actor and critic learning

Key Design Choices

When implementing actor-critic methods, you must decide:

  1. Advantage estimation: TD error, n-step, or GAE?
  2. Network architecture: Shared or separate networks for actor and critic?
  3. Update frequency: Update every step, or batch updates?
  4. On-policy vs off-policy: A2C/PPO (on-policy) or DDPG/SAC (off-policy)?
  5. Entropy regularization: Add entropy bonus to encourage exploration?
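
For design choice 1, Generalized Advantage Estimation (GAE) blends multi-step TD errors with exponentially decaying weights controlled by $\lambda$. A minimal sketch; the defaults $\gamma = 0.99$ and $\lambda = 0.95$ are conventional but illustrative, and the value estimates are assumed to be detached (no gradient):

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """GAE over one trajectory. `values` holds detached value estimates
    with one extra entry at the end for the bootstrap value V(s_T)."""
    advantages = torch.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```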
