Casey Chu
• Reinforcement Learning
• I took these notes in spring 2017 for CS 234, taught by Professor Emma Brunskill.
• A Markov decision process is a triplet of stochastic processes $(S, A, R)$ that is:
• governed by the dynamics: $\Pr[S_{t+1}=s', R_{t+1}=r \,|\, S_t=s, A_t = a] = p(s', r \,|\, s, a),$ the probability of transitioning from state $s$ to $s'$ via action $a$, obtaining reward $r$, and
• parameterized by the policy $\pi$: $\Pr[A_t = a \,|\, S_t = s] = \pi(a \,|\, s),$ the probability of choosing an action $a$ from the current state $s$.
• Note the expectations \begin{aligned} \E[f(S_{t+1}, R_{t+1}) \,|\, S_t=s, A_t=a] = \sum_{s',r} f(s',r)\,p(s',r\,|\,s,a), \\ \E^\pi[f(S_{t+1}, R_{t+1}, A_t) \,|\, S_t=s] = \sum_{s',r,a} f(s',r)\,p(s',r\,|\,s,a)\,\pi(a\,|\,s). \end{aligned} ($\E^\pi$ simply emphasizes the dependence of the expression on $\pi$.)
• Policy evaluation
• To evaluate how good a policy $\pi$ is, we consider expected future reward.
• The state value function is the expected future reward from a state $s$ following that policy: $v_\pi(s) = \E^\pi\left[ \sum_{k=0}^\infty \gamma^{k} R_{t+1+k} \,\Big|\, S_t = s \right].$
• The action value function is the expected future reward from a state $s$ taking the action $a$ and then following the policy thereafter: $q_\pi(s, a) = \E^\pi\left[ \sum_{k=0}^\infty \gamma^{k} R_{t+1+k} \,\Big|\, S_t = s, A_t = a \right].$
• If we know $q_\pi$ and $\pi$, then we can obtain $v_\pi$ from the law of total expectation: $v_\pi(s) = \sum_a q_\pi(s,a) \,\pi(a\,|\,s).$ If we know $v_\pi$ and $p$, then we can obtain $q_\pi$ as \begin{aligned} q_\pi(s,a) &= \E[R_{t+1} + \gamma v_\pi(S_{t+1}) \,|\, S_t = s, A_t = a] \\ &= \sum_{r,s'} (r + \gamma v_\pi(s'))\,p(s',r\,|\,s,a). \end{aligned}
• Bellman equations
• By either recursively expanding the definition of $v_\pi(s)$ or eliminating $q_\pi(s,a)$ from the above two relations, we see that the state value $v_\pi(s)$ must satisfy the Bellman equations: $v_\pi(s) = \E^\pi\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \,|\, S_t = s\big],$ a system of linear equations in the $v_\pi(s)$ which we may solve, at least in principle, if the dynamics are known.
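• For a small MDP with known dynamics, this linear system can be solved directly. A minimal numpy sketch on a made-up two-state example, where $P_\pi$ is the state-transition matrix and $r_\pi$ the expected one-step reward under $\pi$:

```python
# Exact policy evaluation: v_pi = r_pi + gamma * P_pi v_pi, i.e.
# (I - gamma P_pi) v_pi = r_pi. The two-state MDP below is made up.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5],      # state-to-state transition probs under pi
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])       # expected immediate reward under pi

v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
```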
• Iterative evaluation
• Again assuming the dynamics are known, a more tractable method is to iterate the assignment: $V(s) \leftarrow \E^\pi[ R_{t+1} + \gamma V(S_{t+1}) \,|\, S_t = s].$ If we denote this assignment $V \leftarrow T_\pi V$, then $|| T_\pi V - T_\pi V' ||_\infty \le \gamma || V - V' ||_\infty$, so if $\gamma < 1$, by the Banach fixed point theorem, this process converges to a unique fixed point $v_\pi$, the solution to the Bellman equations.
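• A sketch of this fixed-point iteration on a made-up two-state example (the $P_\pi$, $r_\pi$ arrays are illustrative, not from the notes):

```python
# Iterative policy evaluation: repeat V <- T_pi V until convergence.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5],      # made-up transition matrix under pi
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])       # made-up expected one-step reward

V = np.zeros(2)
for _ in range(500):              # sup-norm error shrinks by gamma each sweep
    V = r_pi + gamma * P_pi @ V   # V <- T_pi V
```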
• Sample-based evaluation
• If the dynamics are unknown, we can sample from the MDP under the policy $\pi$. Let $\leftarrow^\alpha$ denote either an exponential moving average update $X \leftarrow^\alpha X' \quad \Longleftrightarrow \quad X \leftarrow (1-\alpha)X + \alpha X'$ or a function approximation gradient descent update $X \leftarrow^\alpha X' \quad \Longleftrightarrow \quad \theta \leftarrow \theta - \alpha (X - X')\nabla_\theta X.$
• This is stochastic gradient descent for the objective $J(\theta) = \tfrac{1}{2}\E[(X_\theta - X')^2]$, treating the target $X'$ as constant (a "semi-gradient" when $X'$ itself depends on $\theta$).
• Experience replay: collect many episodes and then resample minibatches to reduce correlations between samples. If $X'$ depends on $X$, it’s a good idea to use a stable copy of $X$ to evaluate $X'$, so that SGD has a stable target even though $\theta$ is changing (either periodically update the stable copy with the latest $\theta$, or keep two versions of $X$ that alternate roles).
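• A minimal replay buffer might look like the following sketch (the class name and interface are illustrative, not from the notes):

```python
# Experience replay: store transitions, then sample uncorrelated
# minibatches for updates instead of learning from consecutive steps.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # uniform sampling without replacement breaks temporal correlations
        return random.sample(list(self.buffer), batch_size)
```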
• Temporal difference learning, SARSA, and expected SARSA are similar methods that update their estimates for every time step: \begin{aligned} V(s_t) &\leftarrow^\alpha r_{t+1} + \gamma V(s_{t+1}), \\ Q(s_t, a_t) &\leftarrow^\alpha r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}), \\ Q(s_t, a_t) &\leftarrow^\alpha r_{t+1} + \gamma \sum_{a'} Q(s_{t+1}, a')\,\pi(a' \,|\, s_{t+1}). \end{aligned} Expected SARSA can be used even if the samples are from some other policy $\mu$, making it an off-policy method.
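• As a concrete sketch, tabular TD(0) with the exponential-moving-average update on a made-up deterministic two-state loop: state 0 pays reward 1 and moves to state 1; state 1 pays 0 and moves back.

```python
# TD(0) on a deterministic 2-state loop (toy example).
# True values solve v0 = 1 + gamma*v1, v1 = gamma*v0.
import numpy as np

gamma, alpha = 0.9, 0.1
V = np.zeros(2)
s = 0
for _ in range(20000):
    s_next = 1 - s
    r = 1.0 if s == 0 else 0.0
    # V(s) <-^alpha r + gamma V(s') (exponential moving average)
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])
    s = s_next
```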
• We may also consider $n$ time steps at a time. For example: $V(s_t) \leftarrow^\alpha r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}).$
• As $n \to \infty$, this becomes Monte Carlo sampling on episodes (if exponential averaging is replaced with the mean): $V(s) \leftarrow \frac{1}{n}\sum_{k=1}^n g_k,$ where $g_k$ is the return of episode $k$ starting at $s_t = s$: $g_k = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots.$
• Monte Carlo estimation can even use samples from another policy $\mu$ through importance sampling: $V(s) = \frac{1}{n}\sum_{k=1}^n \rho_k g_k,$ where $\rho_k$ is the ratio of the probabilities of episode $k$'s trajectory under $\pi$ and $\mu$; the transition probabilities cancel, leaving $\rho_k = \prod_{\tau = t}^\infty \frac{\pi(a_\tau \,|\, s_\tau)}{\mu(a_\tau \,|\, s_\tau)}.$
• Weighted importance sampling is a more stable alternative: $V(s) = \frac{\sum_{k=1}^n \rho_k g_k}{\sum_{k=1}^n \rho_k}.$
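• To compare the two estimators, here is a sketch on a made-up one-step problem: the behavior policy $\mu$ is uniform over two actions, the target $\pi$ always takes action 0, and the return is 1 for action 0 and 0 otherwise (so the true value under $\pi$ is 1).

```python
# Ordinary vs weighted importance sampling (one-step toy problem).
import random

random.seed(0)
mu = [0.5, 0.5]                      # behavior policy
pi = [1.0, 0.0]                      # target policy: always action 0
n = 1000

num = den = 0.0
for _ in range(n):
    a = random.randrange(2)          # sample an action from mu
    rho = pi[a] / mu[a]              # importance ratio pi/mu
    g = 1.0 if a == 0 else 0.0       # episode return
    num += rho * g
    den += rho

ordinary = num / n                   # unbiased but higher variance
weighted = num / den                 # biased but more stable
```

On this toy problem the weighted estimate is exactly 1, while the ordinary estimate fluctuates around 1.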
• Policy learning
• An optimal policy $\pi_*$ is one whose value $v_*(s) = v_{\pi_*}(s) \ge v_\pi(s)$ for every policy $\pi$ and state $s$.
• The greedy policy $\pi_G$ with respect to the value function $Q(s,a)$ has $\pi_G(a_*\,|\,s) > 0 \qquad \text{iff} \qquad a_* \in \argmax_a Q(s,a).$ (If $Q$ is unavailable, recall we can obtain it from $V$ and $p$.)
• The Bellman operators are useful in analysis: \begin{aligned} (T_\pi V)(s) &= \E^\pi[R_{t+1} + \gamma V(S_{t+1}) \,|\, S_t = s], \\ (T V)(s) &= \max_a \E[R_{t+1} + \gamma V(S_{t+1}) \,|\, S_t = s, A_t = a]. \end{aligned} Both are monotonic contraction operators (under the sup norm) with constant $\gamma$, and $TV \ge T_{\pi} V$, with equality $TV = T_{\pi_G} V$ for the greedy policy w.r.t. $V$. The unique fixed points are $T_\pi v_\pi = v_\pi$ (the Bellman equation) and $Tv_* = T_{\pi_G} v_* = v_*$ respectively.
• Policy iteration
• One way to find an optimal policy is through policy iteration. Initialize with an arbitrary policy $\pi$, and then alternate the following steps until convergence of $\pi$:
• Policy evaluation: Calculate $v_\pi(s)$ or $q_\pi(s,a)$ with the above methods.
• Policy improvement: Set $\pi$ to be the greedy policy w.r.t. $v_\pi(s)$ or $q_\pi(s,a)$.
• This produces a sequence of policies with monotonically increasing value, since $v_\pi = T_\pi v_\pi \le T v_\pi = T_{\pi'} v_\pi$, so by monotonicity, $v_\pi \le T^n_{\pi'} v_\pi \to v_{\pi'}$. This must converge to a fixed policy $\pi = \pi'$, which is an optimal policy because $v_\pi = v_{\pi'} = T_{\pi'} v_{\pi'} = T_{\pi'} v_\pi = Tv_{\pi}$, so $v_\pi = v_*$ is the unique fixed point of $T$.
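• A sketch of policy iteration on a made-up two-state, two-action MDP, with exact evaluation via the linear solve (the array layout $P[a][s,s']$, $R[a][s]$ is an illustrative choice):

```python
# Policy iteration: alternate exact evaluation and greedy improvement.
import numpy as np

gamma = 0.9
# Made-up MDP: P[a][s, s'] transition probs, R[a][s] expected rewards.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.5, 0.5]])

pi = np.zeros(2, dtype=int)        # deterministic policy: one action per state
while True:
    # Evaluation: solve (I - gamma P_pi) v = r_pi for the current policy.
    P_pi = P[pi, np.arange(2)]     # row s: transition dist of action pi[s]
    r_pi = R[pi, np.arange(2)]
    v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    # Improvement: act greedily with respect to q_pi.
    q = R + gamma * np.einsum('ast,t->as', P, v)
    pi_new = np.argmax(q, axis=0)
    if np.array_equal(pi_new, pi): # policy stable => optimal
        break
    pi = pi_new
```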
• If we use one of the iterative methods above for the policy evaluation step, we can often get away with performing only one step of the iteration. This is generalized policy iteration.
• With iterative evaluation, generalized policy iteration is value iteration:$V(s) \leftarrow \max_a \E[ R_{t+1} + \gamma V(S_{t+1}) \,|\, S_t=s, A_t=a].$
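• A value iteration sketch on a made-up two-state, two-action MDP (same illustrative array layout: $P[a][s,s']$, $R[a][s]$):

```python
# Value iteration: V <- T V, a contraction with rate gamma.
import numpy as np

gamma = 0.9
# Made-up MDP: P[a][s, s'] transition probs, R[a][s] expected rewards.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.5, 0.5]])

V = np.zeros(2)
for _ in range(500):
    Q = R + gamma * np.einsum('ast,t->as', P, V)   # backup: Q(s,a)
    V = Q.max(axis=0)                              # V <- max_a Q(s,a)
```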
• With expected SARSA, generalized policy iteration is Q-learning: $Q(s_t, a_t) \leftarrow^\alpha r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a').$ Like expected SARSA, Q-learning is off-policy, converging to $q_*$ even if the policy being followed is not the greedy one!
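• A tabular Q-learning sketch on a made-up deterministic two-state, two-action MDP: taking action $a$ moves to state $a$ and pays reward 1 iff $a = 0$. The behavior policy is uniformly random, yet $Q$ still converges to $q_*$, illustrating the off-policy property.

```python
# Off-policy tabular Q-learning. Here q*(s,0) = 1/(1-gamma) = 10 and
# q*(s,1) = gamma/(1-gamma) = 9 for both states.
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.1
Q = np.zeros((2, 2))
s = 0
for _ in range(60000):
    a = rng.integers(2)            # behavior policy: uniform random
    r = 1.0 if a == 0 else 0.0
    s_next = a                     # deterministic transition
    # Q(s,a) <-^alpha r + gamma max_a' Q(s',a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```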
• Model-based learning
• Model-based approaches estimate the dynamics from observed transitions and then solve the learned model using e.g. value iteration. They may use the learned policy for further exploration.
• R-max estimates the reward of a transition $(s,a)$ as $R_{\text{max}}$ until the transition has been seen enough times. This optimism under uncertainty encourages exploration. R-max is probably approximately correct (PAC): with probability $1-\delta$, the number of episodes on which it is not $\epsilon$-optimal is bounded by a polynomial in the problem parameters.
• PSRL and UCRL2 maintain, respectively, a posterior distribution and a confidence set over possible MDPs, and solve either a sample from the posterior or the MDP in the confidence set with maximum reward. They enjoy regret bounds sublinear in the number of episodes, i.e. bounds on the expected difference between the cumulative reward of the optimal policy and that of the learned policy.