textbook
rl1
Reinforcement Learning: An Introduction, Chapter 3
Properties
authors
Richard S. Sutton, Andrew G. Barto
year
2018
3.1 The Agent-Environment Interface
Equation 3.1: Trajectory
\[
S_0,A_0,R_1,S_1,A_1,R_2,S_2,A_2,R_3, \dots \tag{3.1}
\]
Equation 3.2: MDP dynamics
\[
p(s', r \mid s, a) \doteq \Pr \{ S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a \} \tag{3.2}
\]
From the four-argument dynamics function you can obtain the state-transition probabilities by summing out the reward (law of total probability), and the expected reward for a state-action pair by taking the expectation over rewards, as written out below.
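These are Equations 3.4 and 3.5 in the book.
Equation 3.4: State-transition probabilities
\[
p(s' \mid s, a) \doteq \Pr \{ S_t = s' \mid S_{t-1} = s, A_{t-1} = a \} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a) \tag{3.4}
\]
Equation 3.5: Expected reward for a state-action pair
\[
r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \tag{3.5}
\]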
3.2 Goals and Rewards
What is the reward hypothesis?
The reward hypothesis is the idea that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
The reward signal is your way of communicating to the agent what you want it to achieve, not how you want it to achieve it.
3.3 Returns and Episodes
Equation 3.7: Undiscounted return
\[
G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T \tag{3.7}
\]
Equation 3.8: Discounted return
\[
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \tag{3.8}
\]
where \(\gamma\) is the discount rate, \(0 \leq \gamma \leq 1\).
Equation 3.9: Recursive definition of return
Returns at successive time steps are related recursively; rewriting Eq. 3.8 gives:
\[
G_t \doteq R_{t+1} + \gamma G_{t+1} \tag{3.9}
\]
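A minimal sketch (Python; the reward sequence is made up for illustration) checking that the direct sum of Eq. 3.8 and the recursion of Eq. 3.9 agree on a finite episode:

```python
# Sketch: compute the return G_0 for a finite reward sequence R_1, ..., R_T
# two ways and check that they agree. The rewards below are illustrative.

def return_direct(rewards, gamma):
    """G_0 = sum_k gamma^k * R_{k+1}  (Eq. 3.8, truncated at T)."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def return_recursive(rewards, gamma):
    """G_t = R_{t+1} + gamma * G_{t+1}  (Eq. 3.9), evaluated backwards from G_T = 0."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0, -1.0]   # R_1, R_2, R_3, R_4
gamma = 0.9
assert abs(return_direct(rewards, gamma) - return_recursive(rewards, gamma)) < 1e-12
```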
3.4 Unified Notation for Episodic and Continuing Tasks
3.5 Policies and Value Functions
A policy \(\pi\) is a mapping from states to probabilities of selecting each possible action; \(\pi(a \mid s)\) is the probability of taking action \(a\) in state \(s\).
Equation 3.12: State-value function
\[
v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] \;\; \forall s \in \mathcal{S} \tag{3.12}
\]
Equation 3.13: Action-value function
\[
q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] \;\; \forall s \in \mathcal{S}, a \in \mathcal{A} \tag{3.13}
\]
Writing \(v_{\pi}\) in terms of \(q_{\pi}\)
\[
v_{\pi}(s) = \sum_{a} \pi(a \mid s) q_{\pi}(s, a)
\]
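Conversely, \(q_{\pi}\) can be written in terms of \(v_{\pi}\) and the dynamics (this companion relation is not given an equation number in the chapter):
\[
q_{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_{\pi}(s')\right]
\]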
Equation 3.14: Bellman equation for \(v_{\pi}\)
\[
\begin{align}
v_\pi(s) &\doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] \\
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \tag{by (3.9)} \\
&= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma \mathbb{E}_{\pi}\left[G_{t+1} \mid S_{t+1} = s'\right]\right] \\
&= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_\pi(s')] \tag{3.14}
\end{align}
\]
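Equation 3.14 can be turned into an update rule: sweep over the states, repeatedly replacing \(v(s)\) by its right-hand side until the values stop changing (this is iterative policy evaluation, covered in Chapter 4). A minimal sketch in Python; the two-state MDP and the equiprobable policy are made up for illustration:

```python
# Sketch: iterative policy evaluation using the Bellman equation (3.14).
# dynamics[s][a] is a list of (prob, next_state, reward) triples, i.e. p(s', r | s, a).
# The tiny MDP and the equiprobable policy below are illustrative, not from the book.

dynamics = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 1.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 2.0)]},
}
policy = {s: {"left": 0.5, "right": 0.5} for s in dynamics}   # pi(a | s)
gamma = 0.9

v = {s: 0.0 for s in dynamics}                                # initial guess for v_pi
while True:
    delta = 0.0
    for s in dynamics:
        new_v = sum(
            policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in transitions)
            for a, transitions in dynamics[s].items()
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-8:                                          # stop when values settle
        break
print(v)   # approximate v_pi for each state
```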
3.6 Optimal Policies and Optimal Value Functions
A policy \(\pi_*\) is an optimal policy if its state values are at least as great as those of every other policy in every state:
\[
v_{\pi_*}(s) \geq v_{\pi}(s) \quad \forall s \in \mathcal{S} \text{ and for all policies } \pi
\]
Equation 3.15: Optimal state-value function
\[
v_*(s) \doteq \max_{\pi} v_{\pi}(s) \tag{3.15}
\]
Equation 3.16: Optimal action-value function
\[
q_*(s, a) \doteq \max_{\pi} q_{\pi}(s, a) \tag{3.16}
\]
Equation 3.17: \(q_*\) in terms of \(v_*\)
\[
q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \tag{3.17}
\]
Equation 3.18 and 3.19: Bellman optimality equations for \(v_*\)
\[
\begin{align}
v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s, a) \\
&= \max_{a} \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a] \\
&= \max_{a} \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \tag{by (3.9)} \\
&= \max_{a} \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \tag{3.18} \\
&= \max_{a} \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_*(s')] \tag{3.19}
\end{align}
\]
Equation 3.20: Bellman optimality equation for \(q_*\)
\[
\begin{align}
q_*(s, a) &= \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a] \\
&= \sum_{s', r} p(s', r \mid s, a) [r + \gamma \max_{a'} q_*(s', a')] \tag{3.20}
\end{align}
\]
Any policy that is greedy with respect to the optimal evaluation function \(v_*\) is an optimal policy.
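A minimal sketch tying Eq. 3.19 and this statement together: value iteration (Chapter 4) repeatedly applies the Bellman optimality backup, and the greedy policy read off from the converged values is optimal. The two-state MDP is the same illustrative one used above, not an example from the book.

```python
# Sketch: value iteration via the Bellman optimality equation (3.19),
# then extraction of a greedy (hence optimal) policy. Illustrative MDP as above.

dynamics = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 1.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 2.0)]},
}
gamma = 0.9

def backup(s, a, v):
    """One-step expected return: sum over (s', r) of p(s', r | s, a) [r + gamma v(s')]."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[s][a])

v = {s: 0.0 for s in dynamics}
while True:
    delta = 0.0
    for s in dynamics:
        new_v = max(backup(s, a, v) for a in dynamics[s])     # Eq. 3.19
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-8:
        break

# Greedy policy with respect to v_*: in each state pick the action maximizing the backup.
pi_star = {s: max(dynamics[s], key=lambda a: backup(s, a, v)) for s in dynamics}
print(v, pi_star)
```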