Properties | |
---|---|
authors | Richard S. Sutton, Andrew G. Barto |
year | 2018 |
11 Off-policy Methods with Approximation
The off-policy case is a bit tricky. Previously (Reinforcement Learning - An Introduction - Chapter 5) we adjusted our target \(G_t\) with the importance sampling ratio so that its expectation was \(v_{\pi}\) rather than \(v_b\).
However, for semi-gradient methods with function approximation a second issue appears: by following the behavior policy, states are visited (and updated) according to \(b\)'s distribution rather than the on-policy distribution!
There are 2 solutions:
1. Importance sampling - Corrects the distribution of updates
2. True gradient methods - To avoid relying on the distribution for stability
11.1 Semi-gradient Methods
Equation 11.1: Per-step importance sampling ratio
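That is, the ratio of target and behavior policy probabilities for the action actually taken:

\[
\rho_t \doteq \rho_{t:t} = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}
\]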
What does a high/low importance sampling ratio imply about the action taken?
- High: the target policy takes the action more often than the behavior policy does, so the sample is up-weighted
- Low: the behavior policy takes the action more often than the target policy does, so the sample is down-weighted
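For example, if \(\pi(a|s) = 0.9\) and \(b(a|s) = 0.3\), then \(\rho_t = 3\) and the sample is up-weighted; if \(\pi(a|s) = 0.1\) and \(b(a|s) = 0.5\), then \(\rho_t = 0.2\) and it is down-weighted.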
Equation 11.2: Update rule for semi-gradient off-policy TD(0)
Equation 11.3: Definition of \(\delta_t\) for semi-gradient off-policy TD(0)
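Written out (state-value, episodic/discounted case):

\[
\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)
\]

\[
\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)
\]

(In the continuing, average-reward setting, \(\delta_t\) is instead \(R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\).)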
Equation 11.5: Update rule (and definition of \(\delta_t\)) for semi-gradient Expected Sarsa, the one-step off-policy action-value algorithm
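Written out (episodic case):

\[
\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)
\]

\[
\delta_t \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) \, \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)
\]

Note that this one-step action-value algorithm uses no importance sampling ratio: \(A_t\) is given, and the expectation over the next action is taken explicitly under \(\pi\).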
Also:
- eq 11.6: n-step version of semi-gradient Sarsa (form sketched below)
- eq 11.7: semi-gradient version of n-step tree-backup algorithm
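If I remember the indexing right, the n-step Sarsa update scales the whole n-step error by a product of importance sampling ratios,

\[
\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \, \rho_{t+1} \cdots \rho_{t+n-1} \left[ G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}),
\]

while the tree-backup version has the same form without the \(\rho\) product, since its return \(G_{t:t+n}\) already weights future actions by \(\pi\).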
11.2 Examples of Off-policy Divergence
Baird's counterexample: shows that even the simplest bootstrapping methods (DP, TD(0)) combined with function approximation can diverge if updates are not done according to the on-policy distribution.
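Below is a minimal sketch of this in code, assuming the standard setup (7 states, 8 weights; dashed action goes to one of the 6 upper states uniformly, solid action to the lower state; all rewards zero; \(\gamma = 0.99\); behavior policy takes dashed with probability 6/7, target policy always takes solid). The step size, number of steps, and seed are my own choices:

```python
import numpy as np

GAMMA, ALPHA, STEPS = 0.99, 0.01, 1000

# Linear features: v_hat(s) = X[s] @ w
X = np.zeros((7, 8))
for s in range(6):
    X[s, s] = 2          # upper states: 2*w_s + w_7 (0-indexed shared weight)
    X[s, 7] = 1
X[6, 6] = 1              # lower state: w_6 + 2*w_7
X[6, 7] = 2

w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])   # initial weights as in the book (if I recall)
rng = np.random.default_rng(0)
state = rng.integers(7)

for _ in range(STEPS):
    if rng.random() < 6 / 7:                    # behavior policy: dashed with prob 6/7
        rho, next_state = 0.0, rng.integers(6)  # target policy never takes dashed -> rho = 0
    else:                                       # solid with prob 1/7
        rho, next_state = 7.0, 6                # rho = pi/b = 1 / (1/7) = 7
    delta = 0.0 + GAMMA * X[next_state] @ w - X[state] @ w
    w += ALPHA * rho * delta * X[state]         # semi-gradient off-policy TD(0), eq 11.2
    state = next_state

print(w)   # the components of w keep growing -> divergence
```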
Another way to try to prevent instability is to use special methods for function approximation. In particular, stability is guaranteed for function approximation methods that do not extrapolate from the observed targets. These methods, called averagers, include nearest neighbor methods and locally weighted regression, but not popular methods such as tile coding and artificial neural networks (ANNs).
11.3 The Deadly Triad
Instability arises when we combine the following 3 elements:
1. Function approximation
2. Bootstrapping
3. Off-policy training
11.4 Linear Value-function Geometry
TLDR: Using the geometry of value functions, we find that \(\overline{BE}\) measures how far \(v_{\mathbf{w}}\) is from satisfying the Bellman equation (whose unique solution is \(v_\pi\)).
Equation 11.11: \(\mu\)-norm
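That is, squared values weighted by the state distribution \(\mu\):

\[
\lVert v \rVert_\mu^2 \doteq \sum_{s \in \mathcal{S}} \mu(s) \, v(s)^2
\]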
Equations 11.17 and 11.18: Bellman error
Equation 11.19: Mean-square Bellman error
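The Bellman error at a state is the expected TD error there, i.e. how much \(v_{\mathbf{w}}\) violates the Bellman equation:

\[
\bar{\delta}_{\mathbf{w}}(s) \doteq \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma v_{\mathbf{w}}(S_{t+1}) - v_{\mathbf{w}}(S_t) \mid S_t = s \right]
\]

and the mean-square Bellman error is the \(\mu\)-norm of the Bellman error vector:

\[
\overline{BE}(\mathbf{w}) = \lVert \bar{\delta}_{\mathbf{w}} \rVert_\mu^2
\]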
With linear function approximation there always exists an approximate value function (within the subspace) with zero \(\overline{PBE}\); this is the TD fixed point, \(\mathbf{w}_{TD}\).
Equation 11.13: Projection matrix for linear function approximation
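For linear features this works out to:

\[
\Pi \doteq \mathbf{X} \left( \mathbf{X}^\top \mathbf{D} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{D}
\]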
Where:
- \(\mathbf{X} \in \mathbb{R}^{|\mathcal{S}| \times d}\) is the matrix of feature vectors
- \(\mathbf{D} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}\) is a diagonal matrix with \(\mu(s)\) on the diagonal
Equation 11.22: Mean square Projected Bellman error
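That is, the Bellman error is first projected back into the representable subspace, then measured in the \(\mu\)-norm:

\[
\overline{PBE}(\mathbf{w}) = \lVert \Pi \bar{\delta}_{\mathbf{w}} \rVert_\mu^2
\]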
Where:
- \(\Pi\) is the projection matrix
- \(\bar{\delta}_{\mathbf{w}}\) is the Bellman error
11.5 Gradient Descent in the Bellman Error
TLDR: Semi-gradient SGD might diverge, but true SGD doesn't! Sadly, both \(\overline{TDE}\) and \(\overline{BE}\) yield bad minima.
Let's first take an easier case, minimizing \(\overline{TDE}\).
Mean-squared temporal difference error
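If states are visited according to the behavior distribution \(\mu\), this objective can be written as an expectation that is easy to sample:

\[
\overline{TDE}(\mathbf{w}) \doteq \sum_{s} \mu(s) \, \mathbb{E}\!\left[ \delta_t^2 \mid S_t = s, A_t \sim b \right] = \mathbb{E}_b\!\left[ \rho_t \delta_t^2 \right]
\]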
Applying SGD to this objective yields the following.
Equation 11.23: Weight update of naive residual-gradient algorithm
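Unlike the semi-gradient update, the gradient here also flows through \(\hat{v}(S_{t+1}, \mathbf{w}_t)\):

\[
\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \delta_t \left( \nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t) \right)
\]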
Conclusion:
Minimizing the TDE is naive; by penalizing all TD errors it achieves something more like temporal smoothing than accurate prediction.
Although the naive residual-gradient algorithm converges robustly, it does not necessarily converge to a desirable place.
Doing the same for \(\overline{BE}\) yields the residual-gradient algorithm. But it's not a good choice either.
11.6 The Bellman Error is Not Learnable
TLDR: \(\overline{BE}\) is not learnable but \(\overline{TDE}\) and \(\overline{PBE}\) are. Since minimizing \(\overline{TDE}\) is naive, next section: minimize \(\overline{PBE}\).
Here we use the term in a more basic way, to mean learnable at all, with any amount of experience. It turns out many quantities of apparent interest in reinforcement learning cannot be learned even from an infinite amount of experiential data. These quantities are well defined and can be computed given knowledge of the internal structure of the environment, but cannot be computed or estimated from the observed sequence of feature vectors, actions, and rewards.
11.7 Gradient-TD Methods
TLDR: To minimize \(\overline{PBE}\) using SGD efficiently we use two separate estimates for dependent expectations. This yields two algorithms: GTD2 and TDC.
DISCLAIMER: These methods only work with linear function approximation.
Equation 11.27: Gradient of \(\overline{PBE}\)
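In expectation form, with linear features \(\mathbf{x}_t\):

\[
\nabla \overline{PBE}(\mathbf{w}) = 2 \, \mathbb{E}\!\left[ \rho_t (\gamma \mathbf{x}_{t+1} - \mathbf{x}_t) \mathbf{x}_t^\top \right] \mathbb{E}\!\left[ \mathbf{x}_t \mathbf{x}_t^\top \right]^{-1} \mathbb{E}\!\left[ \rho_t \delta_t \mathbf{x}_t \right]
\]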
The first and last factors are not independent: both depend on the next feature vector \(\mathbf{x}_{t+1}\), so we cannot sample both expectations from the same transition and multiply them; one of them needs its own separate estimate.
Equation 11.28: Definition of \(\mathbf{v}\)
Grouping the last two terms of the gradient of \(\overline{PBE}\), we get a new vector \(\mathbf{v}\in \mathbb{R}^d\) which we can estimate and store efficiently:
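That is, \(\mathbf{v}\) approximates the product of the last two factors, which is the least-squares solution for predicting \(\rho_t \delta_t\) from the features:

\[
\mathbf{v} \approx \mathbb{E}\!\left[ \mathbf{x}_t \mathbf{x}_t^\top \right]^{-1} \mathbb{E}\!\left[ \rho_t \delta_t \mathbf{x}_t \right]
\]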
Update rule for \(\mathbf{v}\)
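This is the standard LMS-style rule for that regression problem:

\[
\mathbf{v}_{t+1} \doteq \mathbf{v}_t + \beta \rho_t \left( \delta_t - \mathbf{v}_t^\top \mathbf{x}_t \right) \mathbf{x}_t
\]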
Where:
- \(\beta\) is a learning rate
Using \(\mathbf{v}\), we can now update \(\mathbf{w}\), which yields the GTD2 algorithm:
Equation 11.29: Update rule for GTD2
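Substituting \(\mathbf{v}_t\) for the last two factors of the gradient and sampling the first one gives:

\[
\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \left( \mathbf{x}_t - \gamma \mathbf{x}_{t+1} \right) \mathbf{x}_t^\top \mathbf{v}_t
\]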
What is the complexity of GTD2?
\(O(d)\) per step, provided \(\mathbf{x}_t^\top \mathbf{v}_t\) is computed first (a scalar), so that no \(d \times d\) outer product is ever formed
With a bit more algebra we can get an even better algorithm: TDC (TD(0) with gradient correction), also known as GTD(0).
Equation: Update rule for TDC
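If I have the algebra right, the update is:

\[
\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \left( \delta_t \mathbf{x}_t - \gamma \mathbf{x}_{t+1} \mathbf{x}_t^\top \mathbf{v}_t \right)
\]

Put together, one TDC step with linear features looks roughly like the sketch below; the function and variable names are mine, and the features, \(\rho\), and step sizes are assumed given:

```python
import numpy as np

def tdc_step(w, v, x, x_next, reward, rho, gamma, alpha, beta):
    """One TDC / GTD(0) update with linear features (sketch)."""
    delta = reward + gamma * x_next @ w - x @ w                    # TD error
    w = w + alpha * rho * (delta * x - gamma * (x @ v) * x_next)   # main weights
    v = v + beta * rho * (delta - x @ v) * x                       # secondary weights (LMS)
    return w, v

# Tiny usage example with d = 4 random features
rng = np.random.default_rng(0)
w, v = np.zeros(4), np.zeros(4)
w, v = tdc_step(w, v, x=rng.random(4), x_next=rng.random(4),
                reward=1.0, rho=1.0, gamma=0.9, alpha=0.1, beta=0.05)
```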
Summary from the lectures