I took these notes in spring 2017 for CS 234, taught by Professor Emma Brunskill.
A Markov decision process is a triplet of stochastic processes $(S, A, R)$ that is:
governed by the dynamics $\Pr[S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a] = p(s', r \mid s, a)$, the probability of transitioning from state $s$ to $s'$ via action $a$, obtaining reward $r$, and
parameterized by the policy $\pi$: $\Pr[A_t = a \mid S_t = s] = \pi(a \mid s)$, the probability of choosing an action $a$ from the current state $s$.
Note the expectations $$\begin{aligned} \mathbb{E}[f(S_{t+1}, R_{t+1}) \mid S_t = s, A_t = a] &= \sum_{s', r} f(s', r)\, p(s', r \mid s, a), \\ \mathbb{E}_\pi[f(S_{t+1}, R_{t+1}, A_t) \mid S_t = s] &= \sum_{s', r, a} f(s', r, a)\, p(s', r \mid s, a)\, \pi(a \mid s). \end{aligned}$$ ($\mathbb{E}_\pi$ simply emphasizes the dependence of the expression on $\pi$.)
Policy evaluation
To evaluate how good a policy $\pi$ is, we consider expected future reward.
The state value function is the expected discounted future reward from a state $s$ following that policy: $$v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^\infty \gamma^k R_{t+1+k} \,\middle|\, S_t = s\right],$$ where $\gamma$ is the discount factor.
The action value function is the expected future reward from a state $s$, taking the action $a$ and then following the policy thereafter: $$q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^\infty \gamma^k R_{t+1+k} \,\middle|\, S_t = s, A_t = a\right].$$
If we know $q_\pi$ and $\pi$, then we can obtain $v_\pi$ from the law of total expectation: $$v_\pi(s) = \sum_a q_\pi(s, a)\, \pi(a \mid s).$$ If we know $v_\pi$ and $p$, then we can obtain $q_\pi$ as $$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \sum_{s', r} (r + \gamma v_\pi(s'))\, p(s', r \mid s, a).$$
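For a finite MDP these two conversions are one-liners. A minimal sketch (illustrative, not from the notes), assuming the reward has been summarized as an expected-reward array `R[s, a]` and the transitions as `P[s, a, s2]`:

```python
import numpy as np

def v_from_q(Q, pi):
    """v_pi(s) = sum_a pi(a|s) * q_pi(s, a)."""
    return np.einsum("sa,sa->s", pi, Q)

def q_from_v(P, R, V, gamma):
    """q_pi(s, a) = R[s, a] + gamma * sum_s2 P[s, a, s2] * v_pi(s2)."""
    return R + gamma * P @ V
```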
Bellman equations
By either recursively expanding the definition of $v_\pi(s)$ or eliminating $q_\pi(s, a)$ from the above two relations, we see that the state value $v_\pi(s)$ must satisfy the Bellman equations $$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s],$$ a system of linear equations in the $v_\pi(s)$ which we may solve, at least in principle, if the dynamics are known.
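For a finite MDP the system can be solved directly with a linear solver. A minimal sketch (illustrative; the array layout is my own, not the course's):

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, gamma):
    """Solve the Bellman equations v = r_pi + gamma * P_pi v directly.

    P[s, a, s2] : transition probabilities, R[s, a] : expected reward,
    pi[s, a]    : policy probabilities, gamma : discount factor < 1.
    """
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, R)     # expected one-step reward under pi
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Tiny 2-state, 2-action example (made-up numbers, purely for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])
print(evaluate_policy_exact(P, R, pi, gamma=0.9))
```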
Iterative evaluation
Again assuming the dynamics are known, a more tractable method is to iterate the assignment $$V(s) \leftarrow \mathbb{E}_\pi[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s].$$ If we denote this assignment $V \leftarrow T_\pi V$, then $\|T_\pi V - T_\pi V'\|_\infty \le \gamma \|V - V'\|_\infty$, so if $\gamma < 1$, by the Banach fixed point theorem, this process converges to a unique fixed point $v_\pi$, the solution to the Bellman equations.
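The same setup as the exact solve above, but iterating the backup instead of solving the linear system (illustrative sketch):

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Repeatedly apply the Bellman backup V <- T_pi V until it stops changing.

    P[s, a, s2], R[s, a], pi[s, a] as before; converges because T_pi is a
    gamma-contraction in the sup norm when gamma < 1.
    """
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, R)
    V = np.zeros(P.shape[0])
    while True:
        V_new = r_pi + gamma * P_pi @ V   # (T_pi V)(s) for every state at once
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```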
Sample-based evaluation
If the dynamics are unknown, we can sample from the MDP under the policy $\pi$. Let $\xleftarrow{\alpha}$ denote either an exponential moving average update $X \xleftarrow{\alpha} X' \iff X \leftarrow (1 - \alpha) X + \alpha X'$, or a function approximation gradient descent update $X \xleftarrow{\alpha} X' \iff \theta \leftarrow \theta - \alpha (X - X') \nabla_\theta X$.
The latter is stochastic gradient descent for the objective $J(\theta) = \tfrac{1}{2}\,\mathbb{E}[(X_\theta - X')^2]$, treating the target $X'$ as fixed.
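A minimal sketch of the two update rules (illustrative, not from the notes), using a linear approximator $X_\theta = \theta \cdot \phi$ so that $\nabla_\theta X = \phi$:

```python
import numpy as np

def ema_update(x, target, alpha):
    """Exponential moving average: x <- (1 - alpha) * x + alpha * target."""
    return (1 - alpha) * x + alpha * target

def sgd_update(theta, features, target, alpha):
    """Gradient step for a linear approximator X_theta = theta . features,
    whose gradient is the feature vector:
    theta <- theta - alpha * (X_theta - target) * features."""
    prediction = theta @ features
    return theta - alpha * (prediction - target) * features
```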
Experience replay: collect many episodes and then resample minibatches to reduce correlations between samples. If $X'$ depends on $X$, it's a good idea to use a stable copy of $X$ to evaluate $X'$, so that SGD has a stable target even though $\theta$ is changing (either periodically update the stable copy with the latest $\theta$, or keep two versions of $X$ that alternate roles).
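A replay buffer can be as simple as a bounded queue of transitions sampled uniformly; this sketch is illustrative (the names are mine, not the course's):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions and sample decorrelated minibatches."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.buffer), batch_size)
```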
Temporal difference learning, SARSA, and expected SARSA are similar methods that update their estimates at every time step: $$\begin{aligned} V(s_t) &\xleftarrow{\alpha} r_{t+1} + \gamma V(s_{t+1}), \\ Q(s_t, a_t) &\xleftarrow{\alpha} r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}), \\ Q(s_t, a_t) &\xleftarrow{\alpha} r_{t+1} + \gamma \sum_{a'} Q(s_{t+1}, a')\, \pi(a' \mid s_{t+1}). \end{aligned}$$ Expected SARSA can be used even if the samples are from some other policy $\mu$, making it an off-policy method.
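In tabular form each of the three updates is one line. A minimal sketch, assuming arrays `V[s]`, `Q[s, a]`, `pi[s, a]` and an observed transition `(s, a, r, s2, a2)` (illustrative code, not from the notes):

```python
import numpy as np

def td0_update(V, s, r, s2, alpha, gamma):
    """Tabular TD(0): move V[s] toward the bootstrapped target r + gamma * V[s2]."""
    V[s] += alpha * (r + gamma * V[s2] - V[s])

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """SARSA: bootstrap with the action a2 actually taken at s2."""
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def expected_sarsa_update(Q, pi, s, a, r, s2, alpha, gamma):
    """Expected SARSA: bootstrap with the expectation of Q[s2, .] under pi."""
    target = r + gamma * np.dot(pi[s2], Q[s2])
    Q[s, a] += alpha * (target - Q[s, a])
```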
We may also consider $n$ time steps at a time. For example: $$V(s_t) \xleftarrow{\alpha} r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}).$$
As $n \to \infty$, this becomes Monte Carlo sampling on episodes (if exponential averaging is replaced with the mean): $$V(s) \leftarrow \frac{1}{n} \sum_{k=1}^n g_k,$$ where the average is over $n$ sampled episodes and $g_k$ is the return of the $k$th episode starting at $s_t = s$: $g_k = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$.
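A small helper for computing the per-step returns of one episode (illustrative; here `rewards[t]` plays the role of $r_{t+1}$ in the notation above):

```python
def episode_returns(rewards, gamma):
    """Discounted return g_t = r_{t+1} + gamma r_{t+2} + ... for every step
    of one finished episode, computed backwards in a single pass."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```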
Monte Carlo estimation can even use samples from another policy $\mu$ through importance sampling: $$V(s) = \frac{1}{n} \sum_{k=1}^n \rho_k g_k, \qquad \rho_k = \frac{\Pr_\pi[\forall \tau > t : R_\tau = r_\tau \mid S_t = s]}{\Pr_\mu[\forall \tau > t : R_\tau = r_\tau \mid S_t = s]} = \prod_{\tau = t}^{\infty} \frac{\pi(a_\tau \mid s_\tau)}{\mu(a_\tau \mid s_\tau)},$$ the likelihood ratio of the $k$th episode under the two policies (the unknown dynamics terms cancel, leaving only the policy ratios).
Weighted importance sampling is a more stable alternative: $$V(s) = \frac{\sum_{k=1}^n \rho_k g_k}{\sum_{k=1}^n \rho_k}.$$
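Both estimators are easy to compute once the returns and ratios have been collected; a minimal sketch, assuming each episode's $\rho_k$ has already been accumulated:

```python
import numpy as np

def importance_sampling_estimates(returns, ratios):
    """Ordinary and weighted importance sampling estimates of V(s).

    returns[k] : return g_k of the k-th episode starting from s (collected under mu),
    ratios[k]  : importance ratio rho_k, the product of pi/mu over that episode.
    """
    returns = np.asarray(returns, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    ordinary = np.mean(ratios * returns)
    weighted = np.sum(ratios * returns) / np.sum(ratios)
    return ordinary, weighted
```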
Policy learning
An optimal policy $\pi_*$ is one whose value $v_*(s) = v_{\pi_*}(s) \ge v_\pi(s)$ for every policy $\pi$ and state $s$.
The greedy policy $\pi_G$ with respect to the value function $Q(s, a)$ has $\pi_G(a_* \mid s) > 0$ iff $a_* = \arg\max_a Q(s, a)$. (If $Q$ is unavailable, recall we can obtain it from $V$ and $p$.)
The Bellman operators are useful in analysis: $$\begin{aligned} (T_\pi V)(s) &= \mathbb{E}_\pi[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s], \\ (T V)(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s, A_t = a]. \end{aligned}$$ Both are monotonic contraction operators (under the sup norm) with constant $\gamma$, and $T V \ge T_\pi V$, with equality $T V = T_{\pi_G} V$ for the greedy policy w.r.t. $V$. The unique fixed points are $T_\pi v_\pi = v_\pi$ (the Bellman equation) and $T v_* = T_{\pi_G} v_* = v_*$ respectively.
Policy iteration
One way to find an optimal policy is through policy iteration. Initialize with an arbitrary policy $\pi$, and then alternate the following steps until convergence of $\pi$:
Policy evaluation: Calculate $v_\pi(s)$ or $q_\pi(s, a)$ with the above methods.
Policy improvement: Set $\pi$ to be the greedy policy w.r.t. $v_\pi(s)$ or $q_\pi(s, a)$.
This produces a sequence of policies with monotonically increasing value: if $\pi'$ is the improved (greedy) policy, then $v_\pi = T_\pi v_\pi \le T v_\pi = T_{\pi'} v_\pi$, so by monotonicity, $v_\pi \le T_{\pi'}^n v_\pi \to v_{\pi'}$. This must converge to a fixed policy $\pi = \pi'$, which is an optimal policy because $v_\pi = v_{\pi'} = T_{\pi'} v_{\pi'} = T_{\pi'} v_\pi = T v_\pi$, so $v_\pi$ is the unique fixed point $v_*$ of $T$.
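A minimal sketch of policy iteration on a finite MDP with exact evaluation, assuming the same illustrative `P[s, a, s2]` and `R[s, a]` layout as above:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact evaluation and greedy improvement until the policy is stable.

    P[s, a, s2] : transition probabilities, R[s, a] : expected reward.
    Returns a deterministic optimal policy (as action indices) and its value.
    """
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        # Policy evaluation: solve the linear Bellman equations for v_pi.
        P_pi = P[np.arange(n_states), policy]       # P_pi[s, s2]
        r_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to q_pi.
        Q = R + gamma * P @ V                       # Q[s, a]
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```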
If we use one of the iterative methods above for the policy evaluation step, we can often get away with performing only one step of the iteration. This is generalized policy iteration.
With iterative evaluation, generalized policy iteration is value iteration: $$V(s) \leftarrow \max_a \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s, A_t = a].$$
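A minimal value iteration sketch under the same illustrative array layout:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the optimal Bellman backup V <- T V until convergence.

    P[s, a, s2] : transition probabilities, R[s, a] : expected reward.
    """
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * P @ V               # Q[s, a] = E[r + gamma V(s') | s, a]
        V_new = Q.max(axis=1)               # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # value and a greedy policy
        V = V_new
```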
With expected SARSA, generalized policy iteration is Q-learning: $$Q(s_t, a_t) \xleftarrow{\alpha} r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a').$$ Like expected SARSA, Q-learning is off-policy, converging to $q_*$ even if the policy being followed is not the greedy one!
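A tabular Q-learning sketch (illustrative): the behavior policy is $\epsilon$-greedy, but the bootstrapped target always uses the max over next actions.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Behavior policy: explore with probability epsilon, otherwise act greedily.
    rng is a numpy Generator, e.g. rng = np.random.default_rng(0)."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s2, alpha, gamma):
    """Off-policy target: max over next actions, not the action actually taken."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
```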
Model-based learning
Model-based approaches estimate the dynamics from observed transitions and then solve the estimated MDP using, e.g., value iteration. They may use the learned policy for further exploration.
R-max estimates the reward of a transition $(s, a)$ as $R_{\max}$ until the transition has been seen enough times. This optimism under uncertainty encourages exploration. It is probably approximately correct, i.e. with probability $1 - \delta$, the number of non-$\epsilon$-optimal episodes is polynomially bounded.
PSRL/UCRL2 keep a distribution over / a confidence set of possible MDPs and solve, respectively, a sample from the distribution or the most optimistic MDP in the confidence set (the one giving the maximum reward). They give regret bounds sublinear in the number of episodes, i.e. bounds on the expected difference between the optimal policy's cumulative reward and the learned policy's cumulative reward.