I took these notes in spring 2017 for CS 234, taught by Professor Emma Brunskill.
A Markov decision process is a triplet of stochastic processes $(S_t, A_t, R_t)$ governed by the dynamics $P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$, the probability of transitioning from state $s$ to $s'$ via action $a$, obtaining reward $R(s, a) = \mathbb{E}[R_t \mid S_t = s, A_t = a]$, and parameterized by the policy $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$, the probability of choosing an action $a$ from the current state $s$. Note the expectation $\mathbb{E}_\pi$ simply emphasizes the dependence of the expression on $\pi$.
To evaluate how good a policy is, we consider expected future reward. The state value function is the expected discounted future reward from a state following that policy:
$$V^\pi(s) = \mathbb{E}_\pi\Bigl[\sum_{t \ge 0} \gamma^t R_t \Bigm| S_0 = s\Bigr].$$
The action value function is the expected discounted future reward from a state $s$ taking the action $a$ and then following the policy thereafter:
$$Q^\pi(s, a) = \mathbb{E}_\pi\Bigl[\sum_{t \ge 0} \gamma^t R_t \Bigm| S_0 = s, A_0 = a\Bigr].$$
If we know $Q^\pi$ and $\pi$, then we can obtain $V^\pi$ from the law of total expectation:
$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a).$$
If we know $V^\pi$ and the dynamics, then we can obtain $Q^\pi$ as
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s').$$
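The law-of-total-expectation relation is just a weighted average; a minimal numeric sketch (with hypothetical values for a single state with two actions):

```python
# Recovering V(s) from pi and Q by the law of total expectation.
# The numbers here are hypothetical, for a single state with two actions.

pi = [0.25, 0.75]    # pi(a | s) for actions 0 and 1
q = [2.0, 4.0]       # Q(s, a) for actions 0 and 1

# V(s) = sum_a pi(a | s) * Q(s, a)
v = sum(p * qa for p, qa in zip(pi, q))   # 0.25*2 + 0.75*4 = 3.5
```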
By either recursively expanding the definition of $V^\pi$ or eliminating $Q^\pi$ from the above two relations, we see that the state value must satisfy the Bellman equations
$$V^\pi(s) = \sum_a \pi(a \mid s)\Bigl[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s')\Bigr],$$
a system of linear equations in the $V^\pi(s)$, which we may solve, at least in principle, if the dynamics are known.
Again assuming the dynamics are known, a more tractable method is to iterate the assignment
$$V(s) \leftarrow \sum_a \pi(a \mid s)\Bigl[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')\Bigr].$$
If we denote this assignment $V \leftarrow B^\pi V$, then $\lVert B^\pi V - B^\pi V' \rVert_\infty \le \gamma \lVert V - V' \rVert_\infty$, so if $\gamma < 1$, by the Banach fixed point theorem, this process converges to a unique fixed point $V^\pi$, the solution to the Bellman equations.
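Iterative policy evaluation can be sketched in a few lines. This is a toy example on a hypothetical 2-state MDP where the policy has already been folded into the dynamics (so `P_pi[s][t]` and `R_pi[s]` are the transition probabilities and expected rewards under a fixed $\pi$):

```python
# Iterative policy evaluation on a hypothetical 2-state MDP, with the
# policy pi already folded into the dynamics P_pi and rewards R_pi.

GAMMA = 0.9
P_pi = [[0.5, 0.5],   # transition probabilities out of state 0 under pi
        [0.0, 1.0]]   # state 1 is absorbing
R_pi = [1.0, 0.0]     # expected one-step reward under pi

def evaluate_policy(P, R, gamma, tol=1e-10):
    V = [0.0] * len(R)
    while True:
        # One application of the Bellman backup B_pi.
        V_new = [R[s] + gamma * sum(P[s][t] * V[t] for t in range(len(V)))
                 for s in range(len(V))]
        # Stop once successive iterates agree in the sup norm.
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new

V = evaluate_policy(P_pi, R_pi, GAMMA)
```

Here the fixed point can be checked by hand: $V(1) = 0$ and $V(0) = 1 + 0.45\,V(0)$, i.e. $V(0) = 1/0.55 \approx 1.818$.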
If the dynamics are unknown, we can sample transitions $(s, a, r, s')$ from the MDP under the policy $\pi$. Let $x \xleftarrow{\alpha} y$ denote either an exponential moving average update
$$x \leftarrow (1 - \alpha) x + \alpha y$$
or, for a function approximation $x_w$, a gradient descent update
$$w \leftarrow w - \alpha \nabla_w \tfrac{1}{2}(x_w - y)^2.$$
This is stochastic gradient descent for the objective $\mathbb{E}\bigl[\tfrac{1}{2}(x_w - y)^2\bigr]$.

Experience replay: collect many episodes and then resample minibatches to reduce correlations between samples. If the target $y$ depends on $w$, it's a good idea to use a stable copy of $w$ to evaluate $y$, so that SGD has a stable target even though $w$ is changing (either periodically update the stable copy with the latest $w$, or keep two versions of $w$ that alternate roles).

Temporal difference learning, SARSA, and expected SARSA are similar methods that update their estimates every time step:
$$V(s) \xleftarrow{\alpha} r + \gamma V(s'),$$
$$Q(s, a) \xleftarrow{\alpha} r + \gamma Q(s', a'),$$
$$Q(s, a) \xleftarrow{\alpha} r + \gamma \sum_{a'} \pi(a' \mid s')\, Q(s', a').$$
Expected SARSA can be used even if the samples are from some other policy $\mu$, making it an off-policy method.

We may also consider $n$ time steps at a time. For example:
$$V(s_t) \xleftarrow{\alpha} r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}).$$
As $n \to \infty$, this becomes Monte Carlo sampling on episodes (if exponential averaging is replaced with the mean):
$$V(s_t) \xleftarrow{\alpha} G_t,$$
where $G_t$ is the episode reward starting at $s_t$:
$$G_t = \sum_{k \ge t} \gamma^{k-t} r_k.$$
Monte Carlo estimation can even use samples from another policy $\mu$ through importance sampling:
$$V(s_t) \xleftarrow{\alpha} \rho_t G_t, \quad \text{where} \quad \rho_t = \prod_{k \ge t} \frac{\pi(a_k \mid s_k)}{\mu(a_k \mid s_k)}.$$
Weighted importance sampling is a more stable alternative:
$$V(s) = \frac{\sum_i \rho^{(i)} G^{(i)}}{\sum_i \rho^{(i)}},$$
averaging over the episodes $i$ that visit $s$.
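The one-step TD update above can be sketched on a toy problem. This is a hypothetical deterministic chain $0 \to 1 \to \text{terminal}$ with reward 0 on the first step and 1 on the second, so the true values under $\gamma = 0.5$ are $V(1) = 1$ and $V(0) = 0.5$:

```python
# TD(0) with an exponential-moving-average update on a hypothetical
# deterministic 3-state chain: 0 -> 1 -> terminal, rewards 0 then 1.

GAMMA = 0.5
ALPHA = 0.1          # step size: V(s) <- (1 - a) V(s) + a * target
TERMINAL = 2

V = {0: 0.0, 1: 0.0, TERMINAL: 0.0}
transitions = {0: (1, 0.0), 1: (TERMINAL, 1.0)}   # s -> (s', r)

for _ in range(500):                 # run 500 episodes from state 0
    s = 0
    while s != TERMINAL:
        s_next, r = transitions[s]
        target = r + GAMMA * V[s_next]            # one-step TD target
        V[s] = (1 - ALPHA) * V[s] + ALPHA * target
        s = s_next
```

Since the environment is deterministic, the estimates converge geometrically to the true values $V(1) = 1$ and $V(0) = \gamma V(1) = 0.5$.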
An optimal policy $\pi^*$ is one whose value $V^{\pi^*}(s) \ge V^\pi(s)$ for every policy $\pi$ and state $s$. The greedy policy with respect to the value function $Q$ has
$$\pi(s) = \operatorname{arg\,max}_a Q(s, a).$$
(If $Q$ is unavailable, recall we can obtain it from $V$ and the dynamics.) The Bellman operators are useful in analysis:
$$B^\pi V(s) = \sum_a \pi(a \mid s)\Bigl[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')\Bigr],$$
$$B V(s) = \max_a \Bigl[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')\Bigr].$$
Note that both are monotonic contraction operators (under the sup norm) with constant $\gamma$. Note that $B^\pi V \le B V$, with equality for the greedy policy w.r.t. $V$. The unique fixed points are $V^\pi$ (the Bellman equation) and $V^*$ respectively.
One way to find an optimal policy is through policy iteration. Initialize with an arbitrary policy $\pi_0$, and then alternate the following steps until convergence of $\pi_k$. Policy evaluation: calculate $V^{\pi_k}$ or $Q^{\pi_k}$ with the above methods. Policy improvement: set $\pi_{k+1}$ to be the greedy policy w.r.t. $V^{\pi_k}$ or $Q^{\pi_k}$. This produces a sequence of policies with monotonically increasing value, since $B^{\pi_{k+1}} V^{\pi_k} = B V^{\pi_k} \ge B^{\pi_k} V^{\pi_k} = V^{\pi_k}$, so by monotonicity, $V^{\pi_{k+1}} = \lim_{n \to \infty} (B^{\pi_{k+1}})^n V^{\pi_k} \ge V^{\pi_k}$. This must converge to a fixed policy $\pi$, which is an optimal policy because $V^\pi = B^\pi V^\pi = B V^\pi$, so $V^\pi$ is the unique fixed point $V^*$ of $B$.
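The two alternating steps can be sketched on a tiny hypothetical MDP: 2 states, 2 actions, where state 1 pays reward 1 and action 0 stays put while action 1 switches states. The optimal policy is to move to state 1 and stay there:

```python
# Policy iteration on a hypothetical 2-state, 2-action MDP.
# P[s][a][t] = transition probability, R[s][a] = expected reward.

GAMMA = 0.9
N_S, N_A = 2, 2
P = [[[1.0, 0.0], [0.0, 1.0]],    # from state 0: action 0 stays, action 1 moves
     [[0.0, 1.0], [1.0, 0.0]]]    # from state 1: action 0 stays, action 1 moves
R = [[0.0, 0.0],                  # no reward for acting in state 0
     [1.0, 1.0]]                  # reward 1 for acting in state 1

def q_from_v(V, s, a):
    # Q(s, a) = R(s, a) + gamma * sum_t P(t | s, a) V(t)
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(N_S))

def policy_iteration():
    pi = [0] * N_S                                   # arbitrary initial policy
    while True:
        # Policy evaluation: iterate the Bellman backup under pi.
        V = [0.0] * N_S
        for _ in range(1000):
            V = [q_from_v(V, s, pi[s]) for s in range(N_S)]
        # Policy improvement: act greedily with respect to Q.
        pi_new = [max(range(N_A), key=lambda a: q_from_v(V, s, a))
                  for s in range(N_S)]
        if pi_new == pi:                             # converged
            return pi, V
        pi = pi_new

pi_star, V_star = policy_iteration()
```

The fixed point is easy to verify by hand: $V^*(1) = 1/(1 - \gamma) = 10$ and $V^*(0) = \gamma V^*(1) = 9$.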
If we use one of the iterative methods above for the policy evaluation step, we can often get away with performing only one step of the iteration. This is generalized policy iteration.
With iterative evaluation, generalized policy iteration is value iteration:
$$V(s) \leftarrow \max_a \Bigl[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')\Bigr].$$
With expected SARSA, generalized policy iteration is Q-learning:
$$Q(s, a) \xleftarrow{\alpha} r + \gamma \max_{a'} Q(s', a').$$
Like expected SARSA, Q-learning is off-policy, converging to $Q^*$ even if the policy being followed is not the greedy one!
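A tabular Q-learning sketch, on a hypothetical deterministic chain (states $0 \to 1 \to$ terminal; action 1 moves right, earning reward 1 on reaching the terminal state, action 0 moves back to state 0). The behavior policy is $\epsilon$-greedy, but the bootstrapped target uses the max, so the estimates approach $Q^*$ regardless:

```python
# Tabular Q-learning with an epsilon-greedy behavior policy on a
# hypothetical deterministic chain: 0 -> 1 -> terminal.

import random
random.seed(0)

GAMMA, ALPHA, EPS = 0.9, 0.2, 0.3
TERMINAL = 2
step = {(0, 0): (0, 0.0), (0, 1): (1, 0.0),
        (1, 0): (0, 0.0), (1, 1): (TERMINAL, 1.0)}   # (s, a) -> (s', r)

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

def greedy(s):
    return max((0, 1), key=lambda a: Q[(s, a)])

for _ in range(2000):                     # 2000 episodes
    s = 0
    for _ in range(100):                  # cap episode length
        a = random.choice((0, 1)) if random.random() < EPS else greedy(s)
        s_next, r = step[(s, a)]
        best_next = 0.0 if s_next == TERMINAL else max(Q[(s_next, 0)], Q[(s_next, 1)])
        # Off-policy target: bootstrap with the max, not the action taken.
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        if s_next == TERMINAL:
            break
        s = s_next
```

Here $Q^*(1, 1) = 1$ and $Q^*(0, 1) = \gamma = 0.9$, and the greedy policy learned is to move right in both states, even though the behavior policy keeps exploring.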
Model-based approaches estimate the dynamics using observed transitions and then solve the estimated MDP using e.g. value iteration. They may use the learned policy for further exploration.
R-max estimates the reward of a transition as $R_{\max}$ until the transition has been seen enough times. This optimism under uncertainty encourages exploration. It is probably approximately correct, i.e. with probability at least $1 - \delta$, the number of non-$\epsilon$-optimal episodes is bounded by a polynomial in the problem parameters, $1/\epsilon$, and $1/\delta$.
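The optimistic reward estimate at the heart of R-max can be sketched as follows; the visit threshold `M` and bound `R_MAX` are hypothetical parameters, and this shows only the reward-estimation rule, not the full algorithm:

```python
# A minimal sketch of R-max-style optimistic reward estimation.
# M (visit threshold) and R_MAX (optimistic bound) are hypothetical.

from collections import defaultdict

R_MAX, M = 1.0, 5
counts = defaultdict(int)          # visit counts per (s, a)
reward_sums = defaultdict(float)   # accumulated rewards per (s, a)

def record(s, a, r):
    counts[(s, a)] += 1
    reward_sums[(s, a)] += r

def estimated_reward(s, a):
    # Optimism under uncertainty: assume the best until (s, a) is known.
    if counts[(s, a)] < M:
        return R_MAX
    return reward_sums[(s, a)] / counts[(s, a)]
```

Unvisited transitions look maximally rewarding, so planning against this model drives the agent to try them until the empirical average takes over.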
PSRL and UCRL2 keep, respectively, a posterior distribution over and a confidence set of possible MDPs, and solve a sample from the distribution or the corner of the confidence set that gives the maximum reward. They have regret bounds sublinear in the number of episodes, i.e. bounds on the expected difference between the optimal policy's cumulative reward and the learned policy's cumulative reward.
Reference: Sutton and Barto, Reinforcement Learning: An Introduction.