| Properties | |
|---|---|
| authors | Richard S. Sutton, Andrew G. Barto |
| year | 2018 |
# 6 Temporal-Difference Learning
## 6.1 TD Prediction
Equation 6.2: TD(0) update
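Writing it out (my transcription, in the book's notation):

\[
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]
\]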
Equations 6.3 and 6.4: Relationship between TD(0), MC and DP
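As I remember them:

\[
v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] \qquad (6.3)
\]

\[
v_\pi(s) = \mathbb{E}_\pi\bigl[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \bigr] \qquad (6.4)
\]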
Why is (6.3) called the Monte Carlo estimate?
Because the expected return in (6.3) is not known, so a sampled return is used in its place.
Why is (6.4) called the Dynamic Programming estimate?
Although the expectation is known (a model of the environment provides it), the true value function \(v_\pi\) is not, so the current estimate \(V(S_{t+1})\) is used in its place.
By looking at the previous two answers, what does TD(0) estimate and how does that differ from the previous methods?
TD(0) combines both: it samples the expectation in (6.4) (using the observed \(R_{t+1}\) and \(S_{t+1}\)) and it uses the current estimate \(V(S_{t+1})\) in place of the true \(v_\pi(S_{t+1})\).
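A minimal tabular TD(0) prediction sketch in Python to make the update concrete. The `env.reset()` / `env.step(action)` interface and the `policy(state)` callable are assumptions of mine, not anything from the book:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): estimate v_pi for a fixed policy.

    Assumes a hypothetical env with reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    V = defaultdict(float)  # value estimates, default 0 (also covers terminals)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update (6.2); the value of a terminal state is 0
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```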
Equation 6.5: TD error
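That is:

\[
\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t),
\]

so the TD(0) update can be written compactly as \(V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t\).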
## 6.4 Sarsa: On-policy TD Control
Equation 6.7
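My transcription of the Sarsa update:

\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \bigr]
\]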
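A sketch of the full Sarsa control loop, under the same hypothetical `env` interface as the TD(0) sketch above; the ε-greedy helper and the dict-of-`(state, action)` layout for `Q` are my own choices:

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Sarsa (on-policy TD control)."""
    Q = defaultdict(float)  # Q[(state, action)]

    def eps_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = eps_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = eps_greedy(next_state)
            # Sarsa update (6.7): bootstrap on the action actually taken next
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```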
## 6.5 Q-learning: Off-policy TD Control
Equation 6.8
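My transcription of the Q-learning update:

\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \bigr]
\]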
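Relative to the Sarsa sketch, only the target changes. A sketch of a single update step, assuming `Q` is a `defaultdict(float)` keyed by `(state, action)` and `actions` is the finite action list:

```python
def q_learning_update(Q, state, action, reward, next_state, done,
                      actions, alpha=0.5, gamma=1.0):
    """One Q-learning update: the target bootstraps on max_a Q(next_state, a),
    regardless of which action the behavior policy will actually take next."""
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next          # target of (6.8)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```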
## 6.6 Expected Sarsa
Equation 6.9
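My transcription of the Expected Sarsa update:

\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Bigr]
\]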
It is more computationally demanding per step, but it eliminates the variance due to the random selection of \(A_{t+1}\), so it is more stable and generally performs better than Q-learning and Sarsa.
It can also be used as-is in the off-policy case; in particular, with a greedy target policy it reduces exactly to Q-learning.
Why doesn't Expected SARSA off-policy need importance sampling?
I wasn't convinced by the slides' explanation, so I'll have to check a proper explanation later. My current intuition: the target in (6.9) computes the expectation over the target policy's action distribution explicitly instead of sampling \(A_{t+1}\) from the behavior policy, so there is no sampled next action that needs re-weighting.
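A sketch of a single Expected Sarsa update that makes that intuition concrete; `target_policy_probs` is a hypothetical helper returning \(\pi(a \mid s)\) for the target policy:

```python
def expected_sarsa_update(Q, state, action, reward, next_state, done,
                          target_policy_probs, alpha=0.5, gamma=1.0):
    """One Expected Sarsa update. target_policy_probs(next_state) is assumed
    to return a dict {action: pi(action | next_state)} for the target policy.

    The next action actually executed (by whatever behavior policy) never
    appears in the target: the expectation over the target policy is computed
    explicitly, so no importance-sampling ratio shows up.
    """
    if done:
        expected_q = 0.0
    else:
        probs = target_policy_probs(next_state)
        expected_q = sum(p * Q[(next_state, a)] for a, p in probs.items())
    target = reward + gamma * expected_q         # target of (6.9)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```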
## 6.7 Maximization Bias and Double Learning
"All the control algorithms that we have discussed so far involve maximization in the construction of their target policies"
this causes maximization bias:
- think of estimating the value of an option whose rewards are drawn from \(N(-0.1, 1)\)
- at some point the sample-mean estimate might be \(+0.1\), while the other option is correctly valued at \(0\)
- the optimal choice is the option worth \(0\), but because the target takes the max over noisy estimates, it is positively biased (see the small numerical sketch after this list)
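A quick numerical sketch of the bias (the \(N(-0.1, 1)\) reward distribution is from the notes above; the number of candidate actions, runs, and samples are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_actions, n_samples = 10_000, 10, 10

# Every candidate action has true value -0.1 (rewards ~ N(-0.1, 1));
# the alternative is worth exactly 0.
samples = rng.normal(-0.1, 1.0, size=(n_runs, n_actions, n_samples))
estimates = samples.mean(axis=2)       # sample-mean estimate of each action
max_estimate = estimates.max(axis=1)   # what a max-based target would use

# The average of the max is clearly positive, even though every true value
# is negative: taking the max of noisy estimates is positively biased.
print(max_estimate.mean())
```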
The general way to solve it is to learn two independent estimates: one (\(Q_1\)) is used to pick the maximizing action, \(A^* = \operatorname{arg\,max}_a Q_1(a)\), and the other (\(Q_2\)) is used to evaluate it, \(Q_2(A^*)\).
Because \(Q_2\) is independent of the action selection, this debiases the estimate: \(\mathbb{E}[Q_2(A^*)] = q(A^*)\).
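A sketch of a single double Q-learning update along these lines, assuming tabular `Q1`/`Q2` stored as `defaultdict(float)` keyed by `(state, action)`:

```python
import random

def double_q_update(Q1, Q2, state, action, reward, next_state, done,
                    actions, alpha=0.5, gamma=1.0):
    """One double Q-learning update. One estimate selects the argmax action,
    the independent one evaluates it, removing the maximization bias."""
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1  # swap roles half the time, so both get updated
    if done:
        bootstrap = 0.0
    else:
        a_star = max(actions, key=lambda a: Q1[(next_state, a)])  # Q1 selects
        bootstrap = Q2[(next_state, a_star)]                      # Q2 evaluates
    target = reward + gamma * bootstrap
    Q1[(state, action)] += alpha * (target - Q1[(state, action)])
```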