Casey Chu

- Reinforcement Learning
*I took these notes in spring 2017 for CS 234, taught by Professor Emma Brunskill.*

- A **Markov decision process** is a triplet of stochastic processes $(S, A, R)$ that is:
    - governed by the dynamics $\Pr[S_{t+1}=s', R_{t+1}=r \,|\, S_t=s, A_t = a] = p(s', r \,|\, s, a),$ the probability of transitioning from state $s$ to state $s'$ via action $a$ while obtaining reward $r$, and
    - parameterized by the **policy** $\pi$: $\Pr[A_t = a \,|\, S_t = s] = \pi(a \,|\, s),$ the probability of choosing an action $a$ from the current state $s$.

- Note the expectations $\begin{aligned} \E[f(S_{t+1}, R_{t+1}) \,|\, S_t=s, A_t=a] &= \sum_{s',r} f(s',r)\,p(s',r\,|\,s,a), \\ \E^\pi[f(S_{t+1}, R_{t+1}, A_t) \,|\, S_t=s] &= \sum_{s',r,a} f(s',r,a)\,p(s',r\,|\,s,a)\,\pi(a\,|\,s). \end{aligned}$ ($\E^\pi$ simply emphasizes the dependence of the expression on $\pi$.)
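In the tabular case, the dynamics $p(s', r \,|\, s, a)$ and these expectations can be written down directly. A minimal sketch in Python, where the two-state chain (states `"A"`, `"B"`, actions `"stay"`, `"go"`) is a hypothetical MDP invented purely for illustration:

```python
# Hypothetical two-state MDP.  p[(s, a)] lists the outcomes of the dynamics
# p(s', r | s, a) as (s_next, reward, probability) triples.
p = {
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("A", "go"):   [("B", 1.0, 0.9), ("A", 0.0, 0.1)],
    ("B", "stay"): [("B", 0.0, 1.0)],
    ("B", "go"):   [("A", 1.0, 0.9), ("B", 0.0, 0.1)],
}

def expect(f, s, a):
    """E[f(S_{t+1}, R_{t+1}) | S_t = s, A_t = a] = sum_{s',r} f(s', r) p(s', r | s, a)."""
    return sum(prob * f(s_next, r) for s_next, r, prob in p[(s, a)])

# Expected immediate reward from state "A" under action "go":
print(expect(lambda s_next, r: r, "A", "go"))  # 0.9
```

The $\E^\pi$ form is the same sum with an extra weighting by $\pi(a\,|\,s)$ over actions.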
- Policy evaluation
- To evaluate how good a policy $\pi$ is, we consider expected future reward.
- The **state value function** is the expected future reward from a state $s$, following that policy: $v_\pi(s) = \E^\pi\left[ \sum_{k=0}^\infty \gamma^{k} R_{t+1+k} \,\Big|\, S_t = s \right],$ where $\gamma$ is a discount factor.
- The **action value function** is the expected future reward from a state $s$, taking the action $a$ and then following the policy thereafter: $q_\pi(s, a) = \E^\pi\left[ \sum_{k=0}^\infty \gamma^{k} R_{t+1+k} \,\Big|\, S_t = s, A_t = a \right].$
- If we know $q_\pi$ and $\pi$, then we can obtain $v_\pi$ from the law of total expectation: $v_\pi(s) = \sum_a q_\pi(s,a) \,\pi(a\,|\,s).$ If we know $v_\pi$ and $p$, then we can obtain $q_\pi$ as $\begin{aligned} q_\pi(s,a) &= \E[R_{t+1} + \gamma v_\pi(S_{t+1}) \,|\, S_t = s, A_t = a] \\ &= \sum_{r,s'} (r + \gamma v_\pi(s'))\,p(s',r\,|\,s,a). \end{aligned}$
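The second relation is a one-line computation once the dynamics are tabulated. A sketch, reusing a hypothetical two-state MDP and made-up value estimates (none of these numbers come from the notes):

```python
# Recovering q_pi from v_pi and known dynamics:
# q(s, a) = sum_{s', r} (r + gamma * v(s')) p(s', r | s, a).
gamma = 0.9
p = {  # hypothetical dynamics: p[(s, a)] = [(s_next, reward, probability), ...]
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("A", "go"):   [("B", 1.0, 0.9), ("A", 0.0, 0.1)],
}
v = {"A": 1.0, "B": 2.0}  # some hypothetical state-value estimates

def q_from_v(s, a):
    return sum(prob * (r + gamma * v[s_next]) for s_next, r, prob in p[(s, a)])

print(q_from_v("A", "go"))
```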
- Bellman equations
- By either recursively expanding the definition of $v_\pi(s)$ or eliminating $q_\pi(s,a)$ from the above two relations, we see that the state value $v_\pi(s)$ must satisfy the **Bellman equations** $v_\pi(s) = \E^\pi\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \,|\, S_t = s\big],$ a system of linear equations in the $v_\pi(s)$ which we may solve, at least in principle, if the dynamics are known.
- Iterative evaluation
    - Again assuming the dynamics are known, a more tractable method is to iterate the assignment $V(s) \leftarrow \E^\pi[ R_{t+1} + \gamma V(S_{t+1}) \,|\, S_t = s].$ If we denote this assignment $V \leftarrow T_\pi V$, then $\| T_\pi V - T_\pi V' \|_\infty \le \gamma \| V - V' \|_\infty$, so if $\gamma < 1$, by the Banach fixed-point theorem, this process converges to the unique fixed point $v_\pi$, the solution to the Bellman equations.
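A sketch of this iteration on a hypothetical two-state chain (the MDP and the policy are invented for illustration; by symmetry the fixed point here is $v_\pi = 9$ in both states):

```python
# Iterative policy evaluation V <- T_pi V; since gamma < 1 this is a
# contraction and converges to v_pi by the Banach fixed-point theorem.
gamma = 0.9
p = {  # hypothetical dynamics: p[(s, a)] = [(s_next, reward, probability), ...]
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("A", "go"):   [("B", 1.0, 0.9), ("A", 0.0, 0.1)],
    ("B", "stay"): [("B", 0.0, 1.0)],
    ("B", "go"):   [("A", 1.0, 0.9), ("B", 0.0, 0.1)],
}
pi = {"A": {"go": 1.0}, "B": {"go": 1.0}}  # pi[s] maps actions to pi(a|s)
states = ["A", "B"]

def backup(V, s):
    """(T_pi V)(s) = sum_a pi(a|s) sum_{s',r} (r + gamma V(s')) p(s',r|s,a)."""
    return sum(
        prob_a * prob * (r + gamma * V[s_next])
        for a, prob_a in pi[s].items()
        for s_next, r, prob in p[(s, a)]
    )

V = {s: 0.0 for s in states}
while True:
    V_new = {s: backup(V, s) for s in states}
    delta = max(abs(V_new[s] - V[s]) for s in states)
    V = V_new
    if delta < 1e-12:
        break
print(V)  # approximately v_pi; here v_pi("A") = v_pi("B") = 9
```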
- Sample-based evaluation
- If the dynamics are unknown, we can sample from the MDP under the policy $\pi$. Let $\leftarrow^\alpha$ denote either an **exponential moving average update** $X \leftarrow^\alpha X' \quad \Longleftrightarrow \quad X \leftarrow (1-\alpha)X + \alpha X'$ or a **function approximation gradient descent update** $X \leftarrow^\alpha X' \quad \Longleftrightarrow \quad \theta \leftarrow \theta - \alpha (X - X')\nabla_\theta X.$
    - The latter is stochastic gradient descent on the objective $J(\theta) = \tfrac{1}{2}\E[(X_\theta - X')^2]$, where $X_\theta$ is the current estimate.
    - **Experience replay**: collect many episodes and then resample minibatches to reduce correlations between samples. If $X'$ depends on $X$, it's a good idea to use a stable copy of $X$ to evaluate $X'$, so that SGD has a stable target even though $\theta$ is changing (either periodically update the stable copy with the latest $\theta$, or keep two versions of $X$ that alternate roles).
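The two forms of the $\leftarrow^\alpha$ update can be sketched concretely; all numbers and features below are hypothetical, and the function approximator is taken to be linear so that $\nabla_\theta X$ is just the feature vector:

```python
alpha = 0.1

# Exponential moving average update: X <- (1 - alpha) X + alpha X'.
X = 0.0
for target in [1.0, 1.0, 1.0]:  # hypothetical targets X'
    X = (1 - alpha) * X + alpha * target
print(X)  # drifts toward the target 1.0

# Function-approximation update for a linear approximator
# X_theta(s) = theta . phi(s), so grad_theta X = phi(s) and
# theta <- theta - alpha (X_theta - X') phi(s).
theta = [0.0, 0.0]
phi = [1.0, 2.0]   # hypothetical feature vector of some state
target = 1.0       # hypothetical target X'
X_est = sum(t * f for t, f in zip(theta, phi))
theta = [t - alpha * (X_est - target) * f for t, f in zip(theta, phi)]
print(theta)
```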

- **Temporal difference learning**, **SARSA**, and **expected SARSA** are similar methods that update their estimates at every time step: $\begin{aligned} V(s_t) &\leftarrow^\alpha r_{t+1} + \gamma V(s_{t+1}), \\ Q(s_t, a_t) &\leftarrow^\alpha r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}), \\ Q(s_t, a_t) &\leftarrow^\alpha r_{t+1} + \gamma \sum_{a'} Q(s_{t+1}, a')\,\pi(a' \,|\, s_{t+1}). \end{aligned}$ Expected SARSA can be used even if the samples are from some other policy $\mu$, making it an *off-policy* method.
    - We may also consider $n$ time steps at a time. For example: $V(s_t) \leftarrow^\alpha r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}).$
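The one-step TD(0) update above can be sketched in tabular form with the exponential-moving-average version of $\leftarrow^\alpha$; the sampled transitions are hypothetical. SARSA and expected SARSA follow the same pattern with $Q$ and the corresponding targets.

```python
# Tabular TD(0) evaluation: V(s_t) <-^alpha r_{t+1} + gamma V(s_{t+1}).
from collections import defaultdict

alpha, gamma = 0.1, 0.9
V = defaultdict(float)  # unseen states default to 0

# Hypothetical (s_t, r_{t+1}, s_{t+1}) transitions sampled under some policy:
transitions = [("A", 1.0, "B"), ("B", 0.0, "A"), ("A", 1.0, "B")]
for s, r, s_next in transitions:
    target = r + gamma * V[s_next]            # the TD target
    V[s] = (1 - alpha) * V[s] + alpha * target
print(dict(V))
```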
- As $n \to \infty$, this becomes **Monte Carlo** sampling on episodes (if exponential averaging is replaced with the mean): $V(s) \leftarrow \frac{1}{n}\sum_{k=1}^n g_k,$ where $g_k$ is the return of an episode starting at $s_t = s$: $g_k = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots.$
- Monte Carlo estimation can even use samples from another policy $\mu$ through
**importance sampling**: $V(s) \leftarrow \frac{1}{n}\sum_{k=1}^n \rho_k g_k,$ where $\rho_k$ is the likelihood ratio of episode $k$'s trajectory under $\pi$ versus $\mu$; the unknown transition probabilities cancel, leaving $\rho_k = \prod_{\tau = t}^\infty \frac{\pi(a_\tau \,|\, s_\tau)}{\mu(a_\tau \,|\, s_\tau)}.$ **Weighted importance sampling** is a more stable alternative: $V(s) \leftarrow \frac{\sum_{k=1}^n \rho_k g_k}{\sum_{k=1}^n \rho_k}.$
- Policy learning
- An **optimal policy** $\pi_*$ is one whose value $v_*(s) = v_{\pi_*}(s) \ge v_\pi(s)$ for every policy $\pi$ and state $s$.
- The **greedy policy** $\pi_G$ with respect to the value function $Q(s,a)$ has $\pi_G(a_*\,|\,s) > 0 \qquad \text{iff} \qquad a_* = \argmax_a Q(s,a).$ (If $Q$ is unavailable, recall we can obtain it from $V$ and $p$.)
- The **Bellman operators** are useful in analysis: $\begin{aligned} (T_\pi V)(s) &= \E^\pi[R_{t+1} + \gamma V(S_{t+1}) \,|\, S_t = s], \\ (T V)(s) &= \max_a \E[R_{t+1} + \gamma V(S_{t+1}) \,|\, S_t = s, A_t = a]. \end{aligned}$ Note that both are monotonic contraction operators (under the sup norm) with constant $\gamma$, and that $TV \ge T_{\pi} V$, with equality $TV = T_{\pi_G} V$ for the greedy policy w.r.t. $V$. The unique fixed points are $T_\pi v_\pi = v_\pi$ (the Bellman equation) and $Tv_* = T_{\pi_G} v_* = v_*$, respectively.
- Policy iteration
- One way to find an optimal policy is through **policy iteration**. Initialize with an arbitrary policy $\pi$, and then alternate the following steps until $\pi$ converges:
    - *Policy evaluation*: calculate $v_\pi(s)$ or $q_\pi(s,a)$ with the above methods.
    - *Policy improvement*: set $\pi$ to be the greedy policy w.r.t. $v_\pi(s)$ or $q_\pi(s,a)$.

- This produces a sequence of policies with monotonically increasing value, since $v_\pi = T_\pi v_\pi \le T v_\pi = T_{\pi'} v_\pi$, so by monotonicity, $v_\pi \le T^n_{\pi'} v_\pi \to v_{\pi'}$. This must converge to a fixed policy $\pi = \pi'$, which is an optimal policy because $v_\pi = v_{\pi'} = T_{\pi'} v_{\pi'} = T_{\pi'} v_\pi = Tv_{\pi}$, so $v_\pi = v_*$ is the unique fixed point of $T$.
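The two alternating steps can be sketched on a hypothetical two-state chain (dynamics invented for illustration). Evaluation here uses the iterative method run to convergence, and improvement restricts attention to deterministic greedy policies:

```python
# Policy iteration: exact policy evaluation alternating with greedy improvement.
gamma = 0.9
p = {  # hypothetical dynamics: p[(s, a)] = [(s_next, reward, probability), ...]
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("A", "go"):   [("B", 1.0, 0.9), ("A", 0.0, 0.1)],
    ("B", "stay"): [("B", 0.0, 1.0)],
    ("B", "go"):   [("A", 1.0, 0.9), ("B", 0.0, 0.1)],
}
states, actions = ["A", "B"], ["stay", "go"]

def q(V, s, a):
    """q(s,a) = sum_{s',r} (r + gamma V(s')) p(s',r|s,a)."""
    return sum(prob * (r + gamma * V[s_next]) for s_next, r, prob in p[(s, a)])

def evaluate(pi, tol=1e-12):
    """Iterate V <- T_pi V until convergence, for a deterministic policy pi."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: q(V, s, pi[s]) for s in states}
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < tol:
            return V

pi = {s: "stay" for s in states}  # arbitrary initial deterministic policy
while True:
    V = evaluate(pi)              # policy evaluation
    pi_new = {s: max(actions, key=lambda a: q(V, s, a)) for s in states}
    if pi_new == pi:              # pi is greedy w.r.t. its own value: optimal
        break
    pi = pi_new                   # policy improvement
print(pi)  # {'A': 'go', 'B': 'go'}
```

Since there are finitely many deterministic policies and each iteration weakly improves the value, the loop must terminate, matching the argument above.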
- If we use one of the iterative methods above for the policy evaluation step, we can often get away with performing only one step of the iteration. This is **generalized policy iteration**.
    - With iterative evaluation, generalized policy iteration is **value iteration**: $V(s) \leftarrow \max_a \E[ R_{t+1} + \gamma V(S_{t+1}) \,|\, S_t=s, A_t=a].$
    - With expected SARSA, generalized policy iteration is **Q-learning**: $Q(s_t, a_t) \leftarrow^\alpha r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a').$ Like expected SARSA, Q-learning is off-policy, converging to $q_*$ even if the policy being followed is not the greedy one!
- Model-based learning
    - **Model-based** approaches estimate the dynamics from observed transitions and then solve the estimated MDP using e.g. value iteration. They may use the learned policy for further exploration.
    - **R-max** estimates the reward of a transition $(s,a)$ as $R_{\text{max}}$ until the transition has been seen enough times. This *optimism under uncertainty* encourages exploration. R-max is **probably approximately correct**, i.e. with probability $1-\delta$, the number of non-$\epsilon$-optimal episodes is bounded by a polynomial in the problem parameters.
    - **PSRL**/**UCRL2** maintain a posterior distribution/confidence set over possible MDPs and solve, respectively, a sample from the posterior/the MDP in the confidence set that gives the maximum reward. They enjoy **regret bounds** sublinear in the number of episodes, i.e. bounds on the expected difference between the optimal policy's cumulative reward and the learned policy's cumulative reward.
- Policy gradient
- References
    - http://chercheurs.lille.inria.fr/~ghavamza/RL-EC-Lille/Lecture3.pdf
    - http://chercheurs.lille.inria.fr/~ghavamza/RL-EC-Lille/Lecture4-b.pdf
    - Sutton and Barto, *Reinforcement Learning: An Introduction*