Properties
authors Richard S. Sutton, Andrew G. Barto
year 2018

2 Multi-armed Bandits

2.2 Action-value Methods

Equation 2.1: Sample-average Method

\[ Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}} \tag{2.1} \]

Equation 2.2: Greedy Action Selection

\[ A_t \doteq \underset{a}{\arg\max} Q_t(a) \tag{2.2} \]
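As a rough illustration (not from the book; names like `sums`, `counts`, and `q` are my own), Equations 2.1 and 2.2 together can be sketched like this:

```python
import numpy as np

k = 10                      # number of arms (assumed)
sums = np.zeros(k)          # sum of rewards received for each action
counts = np.zeros(k)        # number of times each action was taken

def q(a):
    """Sample-average estimate Q_t(a) (Eq. 2.1); defaults to 0 if a was never taken."""
    return sums[a] / counts[a] if counts[a] > 0 else 0.0

def greedy_action():
    """A_t = argmax_a Q_t(a) (Eq. 2.2); np.argmax breaks ties by lowest index."""
    return int(np.argmax([q(a) for a in range(k)]))

def update(a, reward):
    """Record the reward observed after taking action a."""
    sums[a] += reward
    counts[a] += 1
```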

2.4 Incremental Implementation

Equation 2.4: Incremental Sample-average Method

\[ Q_{n+1} = Q_n + \frac{1}{n}[R_n - Q_n] \tag{2.4} \]

Where:

  • \(Q_1\) is usually initialized to zero.
  • \(R_n\) is the reward received after the \(n\)-th selection of the action.
  • \(Q_n\) denotes the estimate of the action's value after it has been selected \(n-1\) times.
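A minimal sketch of this update for a single action, which avoids storing the full reward history (class and variable names are mine):

```python
class ActionValue:
    def __init__(self, q1=0.0):
        self.q = q1   # Q_1, usually initialized to zero
        self.n = 0    # number of times the action has been selected

    def update(self, reward):
        """Q_{n+1} = Q_n + (1/n)[R_n - Q_n]  (Eq. 2.4)."""
        self.n += 1
        self.q += (reward - self.q) / self.n
```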

2.5 Tracking a Nonstationary Problem

Equation 2.7: Step-size conditions required to assure convergence with probability 1

\[ \sum_{n=1}^{\infty} \alpha_n(a) = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty \tag{2.7} \]
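A quick check (my working, though the conclusion is the book's): the sample-average step size \(\alpha_n = 1/n\) satisfies both conditions, while a constant step size \(\alpha\) violates the second, so constant-\(\alpha\) estimates never fully converge and instead keep tracking the most recent rewards, which is exactly what we want in a nonstationary problem.

\[ \alpha_n = \frac{1}{n}: \quad \sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty \]

\[ \alpha_n = \alpha \ \text{(constant)}: \quad \sum_{n=1}^{\infty} \alpha = \infty, \qquad \sum_{n=1}^{\infty} \alpha^2 = \infty \]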

2.6 Optimistic Initial Values

TLDR: Initializing \(Q_1(a)\) to a value well above any plausible reward (optimistically) encourages exploration, even under purely greedy selection: every arm initially "disappoints," so the learner keeps switching until the estimates settle.
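A minimal sketch of the book's optimistic-initial-values experiment on the 10-armed testbed (\(Q_1(a) = +5\) with constant \(\alpha = 0.1\) are the book's settings; the function and variable names are mine):

```python
import numpy as np

k = 10
alpha = 0.1
Q = np.full(k, 5.0)         # optimistic Q_1(a) = +5 for every action

def step(true_means, rng):
    """One interaction: greedy pick, testbed reward, constant-step-size update."""
    a = int(np.argmax(Q))                   # purely greedy selection
    r = rng.normal(true_means[a], 1.0)      # testbed reward: N(q*(a), 1)
    Q[a] += alpha * (r - Q[a])              # constant-alpha update
    return a, r
```

Run in a loop with, e.g., `rng = np.random.default_rng(0)` and `true_means = rng.normal(0.0, 1.0, k)`, most arms get tried early despite the selection being purely greedy.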

2.7 Upper-Confidence-Bound Action Selection

Equation 2.10: UCB action selection

\[ A_t \doteq \underset{a}{\arg\max} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right] \tag{2.10} \]

Where:

  • \(c > 0\) controls the degree of exploration.
  • \(N_t(a)\) is the number of times action \(a\) has been selected prior to time \(t\); if \(N_t(a) = 0\), then \(a\) is considered a maximizing action.
  • \(\ln t\) is the natural logarithm of \(t\).
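A hedged sketch of the selection rule (the function and argument names are mine; the untried-arm handling and \(c = 2\) default follow the book):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ] (Eq. 2.10), t >= 1."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])      # an untried arm counts as maximizing
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```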