# Important Tricks
These are three fundamental mathematical tricks that appear repeatedly in reinforcement learning algorithms. Understanding these tricks is crucial for grasping advanced RL methods.
## 1. The Log-Likelihood Trick (Log-Derivative Trick)

This trick transforms the gradient of a probability into an expectation, enabling Monte Carlo estimation:
Before the trick, the gradient of the expected return is an integral over the gradient of a probability, which cannot be estimated by sampling:

$$\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \int_{\tau} \nabla_\theta P(\tau | \pi_\theta) R(\tau) \, d\tau$$

After the trick, applying the identity $\nabla_\theta P(\tau | \pi_\theta) = P(\tau | \pi_\theta) \nabla_\theta \log P(\tau | \pi_\theta)$:
$$\int_{\tau} P(\tau | \pi_\theta) \nabla_\theta \log P(\tau | \pi_\theta) R(\tau) \, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}[\nabla_\theta \log P(\tau | \pi_\theta) R(\tau)]$$
This is an expectation of the form $\mathbb{E}_{x \sim p}[f(x)]$, where $p = P(\tau | \pi_\theta)$ is the probability distribution you sample from and $f(\tau) = \nabla_\theta \log P(\tau | \pi_\theta) R(\tau)$ is the quantity you average over sampled trajectories; a small numerical check follows the list below. This pattern appears in:
- REINFORCE: Core of the policy gradient derivation
- Variational inference: ELBO optimization in VAEs
- Any gradient estimation involving probabilities
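As a sanity check, here is a minimal NumPy sketch under illustrative assumptions (a unit-variance Gaussian with mean $\theta$ and $f(x) = x^2$, neither taken from the original text). The score-function estimate of $\nabla_\theta \mathbb{E}_{x \sim \mathcal{N}(\theta, 1)}[x^2]$ should match the analytic answer $2\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.5          # mean of the sampling distribution N(theta, 1)
n_samples = 200_000  # Monte Carlo sample size (illustrative)

# Sample x ~ N(theta, 1)
x = rng.normal(loc=theta, scale=1.0, size=n_samples)

# f(x) = x^2, so E[f(x)] = theta^2 + 1 and the true gradient is 2*theta
f = x ** 2

# Score function of a unit-variance Gaussian: d/dtheta log p(x; theta) = (x - theta)
score = x - theta

# Log-derivative trick: grad E[f(x)] is approximately the mean of f(x) * score(x)
grad_estimate = np.mean(f * score)

print(f"Monte Carlo estimate: {grad_estimate:.3f}")
print(f"Analytic gradient:    {2 * theta:.3f}")
```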
## 2. Importance Sampling

Estimate an expectation under a distribution $p$ using samples drawn from a different distribution $q$:

$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]$$

The ratio $\frac{p(x)}{q(x)}$ is the importance weight; it corrects for sampling from $q$ instead of $p$. This is useful for:
- Off-policy learning: Use data from old policy to update new policy
- Rare event simulation: Sample from easier distribution, reweight appropriately
- Data efficiency: Reuse collected data instead of collecting new samples
Applications in RL:

- Off-policy policy gradients: Estimate the gradient for $\pi_\theta$ using data from $\pi_{\text{old}}$
- PPO: Uses importance sampling with clipping
- Experience replay: Reuse old transitions for learning
For example, you want to estimate returns under a new policy $\pi_\theta$ using trajectories collected from an old policy $\pi_{\text{old}}$:

$$\mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_{\text{old}}}\left[\frac{P(\tau | \pi_\theta)}{P(\tau | \pi_{\text{old}})} R(\tau)\right]$$
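A minimal sketch of the reweighting, with an illustrative target $p = \mathcal{N}(1, 1)$, proposal $q = \mathcal{N}(0, 2)$, and $f(x) = x^2$ (all assumptions, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mean, std):
    # Density of N(mean, std^2)
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Target p = N(1, 1); proposal q = N(0, 2) that we actually sample from
p_mean, p_std = 1.0, 1.0
q_mean, q_std = 0.0, 2.0

def f(x):
    # Quantity of interest: E_p[x^2] = p_mean^2 + p_std^2 = 2
    return x ** 2

# Draw samples from the proposal q only
x = rng.normal(loc=q_mean, scale=q_std, size=200_000)

# Importance weights p(x) / q(x) correct for sampling from the "wrong" distribution
weights = gaussian_pdf(x, p_mean, p_std) / gaussian_pdf(x, q_mean, q_std)

# Importance-sampling estimate of E_p[f(x)]
estimate = np.mean(weights * f(x))
print(f"Importance-sampling estimate: {estimate:.3f}")
print(f"True value:                   {p_mean**2 + p_std**2:.3f}")
```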
## 3. The Score Function Has Zero Expectation

For any probability distribution $p_\theta(x)$, the expected value of the score function $\nabla_\theta \log p_\theta(x)$ is zero:

$$\mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x)] = 0$$

Using the log-derivative trick:

$$\mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x)] = \int p_\theta(x) \, \nabla_\theta \log p_\theta(x) \, dx = \int \nabla_\theta p_\theta(x) \, dx = \nabla_\theta \int p_\theta(x) \, dx = \nabla_\theta 1 = 0$$

(Since probabilities integrate to 1, the final gradient is the gradient of a constant.)
This property is crucial for:
- Variance reduction: Shows that adding the score function to any estimator doesn't change its expectation (used in control variates)
- Natural gradients: Foundation for natural policy gradient methods
- Baseline methods: Justifies subtracting baselines in REINFORCE without introducing bias
Control Variates in REINFORCE:
The vanilla REINFORCE gradient has high variance. You can subtract any function $b(s)$ of the state (a baseline) from the return without biasing the gradient, because the extra term is a constant times the score function and therefore has zero expectation:

$$\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \, b(s)\right] = b(s) \, \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\right] = 0$$

The baseline leaves the expected gradient unchanged but can substantially reduce its variance; a common choice is an estimate of the state value $V(s)$.
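A minimal sketch of both facts for a one-parameter Gaussian policy (the distribution, reward function, and baseline choice below are illustrative assumptions): the score has mean zero, and subtracting a baseline leaves the REINFORCE-style gradient estimate unchanged in expectation while lowering its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 0.5             # policy parameter: actions a ~ N(mu, 1)
n_samples = 200_000

a = rng.normal(loc=mu, scale=1.0, size=n_samples)

# Score function of a unit-variance Gaussian policy: d/dmu log pi(a) = (a - mu)
score = a - mu

# Property: the score function has zero expectation
print(f"Mean of score:             {score.mean():+.4f}   (should be ~0)")

# REINFORCE-style gradient of E[R(a)] with a toy reward R(a) = -(a - 2)^2
reward = -(a - 2.0) ** 2
baseline = reward.mean()   # simple constant baseline estimated from the batch (illustrative)

grad_no_baseline = reward * score
grad_with_baseline = (reward - baseline) * score

print(f"Gradient, no baseline:     {grad_no_baseline.mean():+.4f}")
print(f"Gradient, with baseline:   {grad_with_baseline.mean():+.4f}   (same expectation)")
print(f"Variance without baseline: {grad_no_baseline.var():.1f}")
print(f"Variance with baseline:    {grad_with_baseline.var():.1f}   (lower)")
```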
## Combining the Tricks

Many RL algorithms use combinations of these tricks (a PPO-style sketch follows the list below):
- PPO: Uses log-likelihood trick for policy gradients + importance sampling for off-policy updates
- Actor-Critic methods: Use score function property for baseline/critic without bias
- Natural Policy Gradients: Combine all three tricks for more stable updates
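To see how the pieces fit together, here is a hedged sketch of a PPO-style clipped surrogate objective (the function name, clipping value, and toy inputs are assumptions for illustration, not the original document's code): the importance ratio comes from trick 2, the log-probabilities from trick 1, and the advantage (return minus a baseline) relies on trick 3.

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized), PPO-style.

    log_prob_new: log pi_theta(a|s) under the current policy      (trick 1)
    log_prob_old: log pi_old(a|s) under the data-collecting policy
    advantages:   returns minus a baseline, e.g. R - V(s)         (trick 3)
    """
    # Importance weight pi_theta / pi_old, computed in log space  (trick 2)
    ratio = np.exp(log_prob_new - log_prob_old)

    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Pessimistic (elementwise minimum) bound, averaged over the batch
    return np.mean(np.minimum(unclipped, clipped))

# Toy usage with made-up numbers
rng = np.random.default_rng(0)
lp_old = rng.normal(-1.0, 0.3, size=5)
lp_new = lp_old + rng.normal(0.0, 0.1, size=5)
adv = rng.normal(0.0, 1.0, size=5)
print(ppo_clip_objective(lp_new, lp_old, adv))
```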