# Temporal Difference Error
The Temporal Difference (TD) error is a one-step bootstrapped estimate of the advantage function. It provides a low-variance alternative to reward-to-go methods by using the value function to estimate future returns instead of sampling complete trajectories.
The TD error at time step $t$ is defined as:

$$ \delta_t = r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) $$
Where:

- $r_t$ is the immediate reward at time $t$
- $\gamma$ is the discount factor
- $V^{\pi}(s_{t+1})$ is the value of the next state
- $V^{\pi}(s_t)$ is the value of the current state
The TD error answers the question: "Did I do better or worse than expected?"
Breaking it down:
- $r_t + \gamma V^{\pi}(s_{t+1})$: The return we actually got (immediate reward plus estimated future value)
- $V^{\pi}(s_t)$: What we expected to get from state $s_t$
- $\delta_t$: The difference between actual and expected
If $\delta_t > 0$, the action led to a better-than-expected outcome; if $\delta_t < 0$, it led to a worse-than-expected one.
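As a quick illustration, here is a minimal sketch that computes $\delta_t$ for a single transition; the reward, discount, and value numbers are made up for the example:

```python
def td_error(reward, value_s, value_s_next, gamma=0.99, done=False):
    """One-step TD error: delta = r + gamma * V(s') - V(s)."""
    # On a terminal transition there is no future value to bootstrap from.
    bootstrap = 0.0 if done else gamma * value_s_next
    return reward + bootstrap - value_s

# Hypothetical numbers: we expected 4.0 from s, received reward 1.0,
# and the next state is worth an estimated 5.0.
delta = td_error(reward=1.0, value_s=4.0, value_s_next=5.0, gamma=0.99)
print(delta)  # 1.95 > 0: the outcome was better than expected
```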
When $V^{\pi}$ is the true value function, the TD error is an unbiased (single-sample) estimate of the advantage function: $$ A^{\pi}(s_t, a_t) \approx \delta_t $$
To see why, recall that: $$ A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) $$
And the Bellman equation tells us: $$ Q^{\pi}(s_t, a_t) = \mathbb{E}[r_t + \gamma V^{\pi}(s_{t+1})] $$
Therefore: $$ A^{\pi}(s_t, a_t) \approx r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) = \delta_t $$
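As a sanity check on this derivation, the sketch below uses a toy transition with made-up probabilities and values and verifies numerically that the average one-step TD error matches the advantage obtained from the Bellman equation:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

# Hypothetical one-step setting: taking action a in state s yields reward 1.0,
# then lands in s' = "sA" with probability 0.3 or s' = "sB" with probability 0.7.
V = {"sA": 5.0, "sB": 2.0}   # assumed values of the successor states under pi
V_s = 3.5                    # assumed value of the current state s

# Advantage from the Bellman equation: A = Q - V
Q_sa = 1.0 + gamma * (0.3 * V["sA"] + 0.7 * V["sB"])
A_sa = Q_sa - V_s

# Average of many single-sample TD errors
next_states = rng.choice(["sA", "sB"], size=100_000, p=[0.3, 0.7])
deltas = 1.0 + gamma * np.array([V[s] for s in next_states]) - V_s

print(f"advantage from Bellman equation: {A_sa:.4f}")
print(f"mean one-step TD error         : {deltas.mean():.4f}")
```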
Advantages of TD error:
- Low variance: Only depends on one transition, not an entire trajectory
- No need for complete episodes: Can be computed immediately after each step
- Faster learning: Lower variance means more stable gradient estimates
Disadvantages of TD error:
- Can be biased: If $V^{\pi}$ is not accurately estimated (which is common early in training), the advantage estimate will be biased
- Bootstrap error compounds: Errors in $V^{\pi}$ propagate directly into every advantage estimate computed from it
| Aspect | Reward-to-Go | TD Error |
|---|---|---|
| Variance | High | Low |
| Bias | Unbiased | Biased if $V^{\pi}$ is inaccurate |
| Episode requirement | Complete trajectory | Single step |
| Sensitive to | Noisy trajectories | Errors in value function |
We can extend the TD error to use $n$-step returns:

$$ \delta_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V^{\pi}(s_{t+n}) - V^{\pi}(s_t) $$

- $n=1$: Standard TD error (low variance, higher bias)
- $n=\infty$: Equivalent to reward-to-go (high variance, no bias)
- Intermediate $n$: Balanced tradeoff between the two
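A minimal Python sketch of this $n$-step computation, assuming a recorded trajectory stored in hypothetical `rewards` and `values` lists (with `values` holding one extra entry for the bootstrap state), might look like this:

```python
def n_step_td_error(rewards, values, t, n, gamma=0.99):
    """n-step TD error at time t:
    delta_t^(n) = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V(s_{t+n}) - V(s_t).

    `rewards[i]` is r_i and `values[i]` is V(s_i); `values` is assumed to have
    one more entry than `rewards` so that V(s_{t+n}) exists after truncation.
    """
    n = min(n, len(rewards) - t)          # truncate at the end of the trajectory
    ret = sum(gamma**k * rewards[t + k] for k in range(n))
    return ret + gamma**n * values[t + n] - values[t]

# Hypothetical trajectory: 3 transitions, 4 value estimates.
rewards = [1.0, 0.0, 2.0]
values = [4.0, 5.0, 4.5, 3.0]
print(n_step_td_error(rewards, values, t=0, n=1))  # one-step TD error
print(n_step_td_error(rewards, values, t=0, n=3))  # uses the full recorded return
```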
This idea is generalized in GAE, which takes an exponentially-weighted average of all multi-step TD errors.
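As a rough sketch of that weighting (the standard backward recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$, using the same assumed `rewards`/`values` layout as above):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one episode.

    `values` must contain V(s_0) ... V(s_T), i.e. one more entry than `rewards`.
    Each advantage is an exponentially weighted sum of the TD errors after it,
    accumulated backwards: A_t = delta_t + gamma * lam * A_{t+1}.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# With lam=0 every advantage collapses to its single-step TD error.
print(gae_advantages([1.0, 0.0, 2.0], [4.0, 5.0, 4.5, 3.0], lam=0.0))
```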
TD error is used in many reinforcement learning algorithms:
- TD Learning: Updates value functions using $V(s_t) \leftarrow V(s_t) + \alpha \delta_t$
- Actor-Critic: Uses TD error as the advantage estimate for policy updates
- A3C/A2C: Asynchronous/synchronous advantage actor-critic methods
- PPO with GAE: When $\lambda = 0$, GAE reduces to the single-step TD error
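For the TD Learning item, a minimal tabular TD(0) sketch might look like the following; the environment interface (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`, a discrete `env.actions` list) and the uniformly random policy are assumptions made for this example:

```python
import random

def td0_value_estimation(env, num_episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0): V(s) <- V(s) + alpha * delta, with delta the one-step TD error."""
    V = {}  # table of state values, looked up lazily with a default of 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = random.choice(env.actions)          # uniformly random policy
            next_state, reward, done = env.step(action)
            v_next = 0.0 if done else V.get(next_state, 0.0)
            delta = reward + gamma * v_next - V.get(state, 0.0)  # TD error
            V[state] = V.get(state, 0.0) + alpha * delta         # TD(0) update
            state = next_state
    return V
```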