| Properties | |
|---|---|
| authors | Richard S. Sutton, Andrew G. Barto |
| year | 2018 |
16 Applications and Case Studies
16.5 Human-level Video Game Play
DQN's network architecture: Conv2d + ReLU blocks for feature extraction, followed by fully connected layers with one linear output unit per action.
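A minimal sketch of this kind of architecture in PyTorch; the layer sizes follow the Nature DQN setup (four stacked 84×84 grayscale frames), but the class name and exact sizes are illustrative assumptions, not something specified in this section.

```python
import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Conv2d + ReLU feature extractor followed by a linear output head."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # one action value per output unit
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84) stack of preprocessed frames
        return self.head(self.features(x))
```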
Equation 16.3: DQN Semi-Gradient update rule
\[
\mathbf{w}_{t+1} = \mathbf{w}_{t} + \alpha \left[ R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a; \mathbf{w}_{t}) - \hat{q}(S_t, A_t; \mathbf{w}_{t}) \right] \nabla \hat{q}(S_t, A_t; \mathbf{w}_{t})
\]
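A minimal sketch of Equation 16.3 as a single PyTorch gradient step, assuming the `DQNNetwork` above and an ordinary optimizer (both my own additions). The semi-gradient character corresponds to the `torch.no_grad()` around the target, so no gradient flows through \(\max_{a} \hat{q}(S_{t+1}, a; \mathbf{w}_{t})\); the squared-error loss reproduces the update up to a constant factor absorbed into the step size \(\alpha\).

```python
import torch
import torch.nn.functional as F

def semi_gradient_q_update(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One semi-gradient Q-learning step (Eq. 16.3) on a batch of transitions."""
    with torch.no_grad():
        # Target R_{t+1} + gamma * max_a q̂(S_{t+1}, a; w_t), treated as a constant,
        # which is what makes this a semi-gradient rather than a full gradient update.
        # (For a terminal S_{t+1} the target would be just R_{t+1}.)
        target = r + gamma * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # q̂(S_t, A_t; w_t)
    loss = F.mse_loss(q_sa, target)  # gradient ∝ -(target - q̂) ∇q̂, as in Eq. 16.3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```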
What are the three modifications to Q-learning that turn it into DQN?
- Experience replay: store transitions in a replay memory and update on randomly sampled minibatches, which makes better use of the data and removes the dependence of successive updates on the current weights.
- Target network ("double Q-learning"-like): keep a duplicate of the network whose weights are copied from the learned network only every C updates, and use it to compute the targets; this avoids the divergence and oscillations that arise when the targets move with every weight update.
- Clip the error term \(R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a; \mathbf{w}_{t}) - \hat{q}(S_t, A_t; \mathbf{w}_{t})\) to \([-1, 1]\) to improve stability (all three are sketched together after this list).
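A sketch of how the three modifications fit together, reusing the `DQNNetwork` from above; the buffer capacity, batch size, and copy period `C` are illustrative assumptions rather than values given in this section.

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Experience replay: store transitions, sample decorrelated minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        s, a, r, s_next = zip(*random.sample(self.buffer, batch_size))
        return (torch.stack(s), torch.tensor(a),
                torch.tensor(r, dtype=torch.float32), torch.stack(s_next))

def dqn_update(q_net, target_net, optimizer, buffer, gamma=0.99):
    s, a, r, s_next = buffer.sample()
    with torch.no_grad():
        # Targets come from the frozen duplicate network, not the network being trained.
        target = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Clip the TD error to [-1, 1] before it enters the (semi-)gradient.
    error = (target - q_sa).clamp(-1.0, 1.0)
    # Surrogate loss whose gradient is -clipped_error * ∇q̂(S_t, A_t; w).
    loss = -(error.detach() * q_sa).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every C updates, refresh the duplicate network:
#   if step % C == 0:
#       target_net.load_state_dict(q_net.state_dict())
```

Clipping the error in this way has the same effect on the gradient as replacing the squared error with a Huber-style loss, which is how it is commonly implemented in practice.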