Casey Chu
- A Bayesian trains a classifier
- Suppose we are trying to learn the parameters $\theta$ of a classifier $p_\theta(y \mid x)$ using a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$. The typical machine learning way to do this is to minimize the cross-entropy loss $$L(\theta) = -\sum_{i=1}^N \log p_\theta(y_i \mid x_i) + R(\theta)$$ using (stochastic) gradient descent, where $R(\theta)$ is an optional regularization term. The value of $\theta$ at the end of this training procedure, which we will denote $\theta^*$, is interpreted as the parameters of the “best” classifier.
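- As a concrete sketch of this recipe, here is a minimal NumPy example for a hypothetical logistic-regression classifier $p_\theta(y \mid x)$ with an L2 regularizer; the data and the function names are illustrative choices, not part of the setup above:
  ```python
  import numpy as np

  def cross_entropy_loss(theta, X, y, lam=0.0):
      """L(theta) = -sum_i log p_theta(y_i | x_i) + R(theta), with R(theta) = lam * ||theta||^2."""
      logits = X @ theta
      # log p_theta(y_i | x_i) for a logistic model: -log(1 + exp(-z)), z = logit if y=1 else -logit
      log_p = -np.logaddexp(0.0, -np.where(y == 1, logits, -logits))
      return -np.sum(log_p) + lam * np.sum(theta ** 2)

  def grad_loss(theta, X, y, lam=0.0):
      """Gradient of the cross-entropy loss plus the L2 regularizer."""
      p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # p_theta(y=1 | x_i)
      return X.T @ (p - y) + 2.0 * lam * theta

  # toy data: N examples, d features
  rng = np.random.default_rng(0)
  N, d = 200, 3
  X = rng.normal(size=(N, d))
  y = (X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=N) > 0).astype(float)

  # plain gradient descent on L(theta)
  theta = np.zeros(d)
  eta = 1e-2
  for _ in range(2000):
      theta -= eta * grad_loss(theta, X, y, lam=0.1)
  theta_star = theta   # the "best" classifier's parameters
  ```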
- There’s a second, perhaps more principled way to solve this problem: Bayesian inference. The Bayesian way to learn a classifier is to specify a prior distribution $p(\theta)$ on the parameters and then construct the posterior $p(\theta \mid \mathcal{D})$. Rather than a single, final classifier, this procedure gives a distribution over $\theta$, and hence a distribution over classifiers, weighted by how likely that classifier is given the data.
- So how are the machine learning approach and the Bayesian approach related?
- The first answer is that if the cross-entropy loss is minimized, then the resulting parameters $\theta^*$ are the parameters with the highest probability under the posterior distribution $p(\theta \mid \mathcal{D})$, and hence $\theta^*$ is a maximum a posteriori estimate: $$\theta^* = \arg\max_\theta p(\theta \mid \mathcal{D}).$$ The second, more subtle answer is that stochastic gradient descent on the cross-entropy loss mimics Markov chain Monte Carlo sampling from the posterior, and hence $\theta^*$ is (approximately) a sample from the posterior: $$\theta^* \sim p(\theta \mid \mathcal{D}).$$
- Cross-entropy minimization as MAP estimation
- Finding the parameter $\theta^*$ that maximizes the posterior distribution $p(\theta \mid \mathcal{D})$ is known as maximum a posteriori estimation, or MAP estimation. An expression for the posterior distribution can be found using Bayes’ rule: $$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})},$$ where the likelihood of the data under our model is $$p(\mathcal{D} \mid \theta) = \prod_{i=1}^N p_\theta(y_i \mid x_i)\, p(x_i).$$ Because we are only interested in the classifier $p_\theta(y \mid x)$ and aren’t trying to model $x$, we will say that $p(x_i)$ is actually a constant with respect to $\theta$. Therefore, the log of the posterior is $$\log p(\theta \mid \mathcal{D}) = \sum_{i=1}^N \log p_\theta(y_i \mid x_i) + \log p(\theta) + \text{const}.$$
- From this, we see that maximizing the posterior distribution $p(\theta \mid \mathcal{D})$ with respect to $\theta$ is equivalent to minimizing the cross-entropy loss $L(\theta)$ with respect to $\theta$, where the regularization term $R(\theta)$ corresponds to the prior term $-\log p(\theta)$. Therefore, if we assume that gradient descent on the loss finds the minimizer of the loss, the estimate $\theta^*$ is a MAP estimate.
- As a short demonstration of the regularization term, suppose we impose a Gaussian prior centered around $0$ on the parameters, that is, $p(\theta) = \mathcal{N}(\theta; 0, \sigma^2 I)$. Then the prior term becomes $$-\log p(\theta) = \frac{1}{2\sigma^2}\lVert\theta\rVert^2 + \text{const},$$ and we recover L2 regularization on the parameters, $R(\theta) = \frac{1}{2\sigma^2}\lVert\theta\rVert^2$.
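- A quick numerical sanity check of this correspondence, using SciPy’s Gaussian log-density (the particular values of $\sigma$ and $\theta$ are arbitrary): the negative log-prior and the L2 term differ only by a $\theta$-independent constant.
  ```python
  import numpy as np
  from scipy.stats import multivariate_normal

  sigma = 2.0
  theta = np.array([0.7, -1.3, 0.2])

  # -log N(theta; 0, sigma^2 I)
  neg_log_prior = -multivariate_normal(mean=np.zeros(3), cov=sigma**2 * np.eye(3)).logpdf(theta)
  l2_term = np.sum(theta**2) / (2 * sigma**2)

  # the theta-independent constant is (d/2) * log(2*pi*sigma^2), here with d = 3
  const = 1.5 * np.log(2 * np.pi * sigma**2)
  print(np.isclose(neg_log_prior, l2_term + const))   # True
  ```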
- Gradient descent for posterior sampling
- It’s nice that optimization of the cross-entropy loss gives the MAP estimate if gradient descent converges to a minimum. However, it’s also slightly unsatisfactory, because in Bayesian inference, knowledge of the entire posterior is preferred to a single point estimate of the posterior. It turns out that by slightly modifying gradient descent, we can sample from the posterior $p(\theta \mid \mathcal{D})$, effectively allowing the entire posterior to be characterized after sufficiently many samples.
- In particular, gradient descent can be viewed as an instance of Markov chain Monte Carlo. Markov chain Monte Carlo is a technique to sample from a distribution: we first construct a Markov chain whose stationary distribution is the distribution of interest; then we can sample from the distribution by simulating the Markov chain for enough steps. The procedure can in fact be generalized to continuous-time analogues of Markov chains, known as Markov processes, a setting where powerful tools are available for analysis.
- Just as a Markov chain can be specified by its transition probabilities, a Markov process can be specified by a stochastic differential equation (SDE). For example, consider the following SDE $$d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{2}\, dW_t, \tag{8}$$
- where $W_t$ is the Wiener process, which describes Brownian motion. It is perhaps most intuitively understood as the limit of the discrete process (Markov chain) described by $$\theta_{k+1} = \theta_k - \eta\, \nabla L(\theta_k) + \sqrt{2\eta}\, \xi_k, \qquad \xi_k \sim \mathcal{N}(0, I), \tag{9}$$ as $\eta \to 0$, where the noise $\xi_k$ is sampled independently for every step. This discretization is known as the Euler–Maruyama scheme applied to (8).
- The stationary distribution of a Markov process like the one given can be computed by solving a partial differential equation known as the Fokker–Planck equation. It turns out that the stationary distribution of $\theta_t$ under the evolution of this SDE is $$\pi(\theta) = \frac{1}{Z} e^{-L(\theta)},$$ where $Z = \int e^{-L(\theta)}\, d\theta$ is a normalizing constant.
- Now, notice that the discretization (9) can be interpreted as gradient descent with added Gaussian noise at every iteration. With our choice of the cross-entropy loss, $e^{-L(\theta)} = e^{\sum_i \log p_\theta(y_i \mid x_i) + \log p(\theta)} \propto p(\theta \mid \mathcal{D})$, so the stationary distribution of this Markov process is exactly the posterior $p(\theta \mid \mathcal{D})$. Therefore, if noise of the appropriate variance is injected into the gradient descent update, each iteration of gradient descent may be seen as approximately giving the next iterate in a Markov process, and if enough steps are executed, the iterates are approximately sampled from the posterior distribution. Under this procedure, then, when we stop training, $\theta^*$ can be thought of as an approximate sample from the posterior distribution.
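- To make this concrete, here is a minimal, self-contained sketch of update (9) on a toy one-dimensional loss $L(\theta) = \theta^2/2$, for which $e^{-L}$ is the standard normal density, so the long-run iterates should look like samples from $\mathcal{N}(0, 1)$ (the step size and iteration counts are arbitrary choices):
  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  def grad_L(theta):
      # toy loss L(theta) = theta^2 / 2, whose Gibbs distribution e^{-L} is N(0, 1)
      return theta

  eta = 1e-2                        # step size
  theta = 0.0
  burn_in, n_steps = 10_000, 200_000
  samples = []

  for k in range(n_steps):
      # update (9): a gradient descent step plus Gaussian noise of variance 2 * eta
      theta = theta - eta * grad_L(theta) + np.sqrt(2 * eta) * rng.normal()
      if k >= burn_in:
          samples.append(theta)

  samples = np.array(samples)
  print(samples.mean(), samples.std())   # approximately 0 and 1, matching e^{-L}
  ```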
- For more reading on this perspective, see Pavliotis (2014), especially chapter 4.8.
- Stochastic gradient descent for posterior sampling
- The above analysis applies to full-batch gradient descent, but this can be costly if the dataset is large. Can we instead use stochastic gradient descent for posterior sampling? Intuitively, it seems to be possible, as stochastic gradient descent can be thought of as gradient descent with a noisy gradient, which is exactly the dynamics described by (9).
- Stochastic gradient descent operates by taking gradient steps $$\theta_{k+1} = \theta_k - \eta\, \hat{g}(\theta_k),$$ where $\hat{g}(\theta)$ is an estimate of $\nabla L(\theta)$ computed using minibatches of size $m$: $$\hat{g}(\theta) = \frac{N}{m} \sum_{j=1}^m \nabla \ell_{i_j}(\theta),$$ where $\ell_i(\theta) = -\log p_\theta(y_i \mid x_i) + \frac{1}{N} R(\theta)$ is the per-example loss, so that $L(\theta) = \sum_{i=1}^N \ell_i(\theta)$, and each index $i_j$ is sampled uniformly from $\{1, \dots, N\}$.
- Let’s try to write $\hat{g}(\theta)$ in terms of $\nabla L(\theta)$ and some noise. Define $\epsilon(\theta) = \hat{g}(\theta) - \nabla L(\theta)$, so that $\hat{g}(\theta) = \nabla L(\theta) + \epsilon(\theta)$, and in particular, $\mathbb{E}[\epsilon(\theta)] = 0$, since $\hat{g}(\theta)$ is an unbiased estimator of $\nabla L(\theta)$. By the central limit theorem, if $m$ is large enough, $\epsilon(\theta)$ is approximately distributed as a Gaussian with mean $0$. Denote the covariance of this Gaussian $\Sigma(\theta)$.
- After making the approximation $\epsilon(\theta_k) \sim \mathcal{N}(0, \Sigma(\theta_k))$, we can write the SGD update as $$\theta_{k+1} = \theta_k - \eta\, \nabla L(\theta_k) - \eta\, \epsilon(\theta_k).$$ Unfortunately, this differs from the dynamics in (9), as the variance of the injected noise in (9) is proportional to $\eta$, but here the variance of the noise term $\eta\, \epsilon(\theta_k)$ is proportional to $\eta^2$. Therefore, the iterates under normal SGD do not approach a stationary distribution given by the posterior. Chaudhari and Soatto (2017) give the SDE that SGD actually corresponds to and characterize its actual stationary distribution; see also Li et al. (2017) and Mandt et al. (2016) for further discussion.
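- The following sketch illustrates the gradient noise $\epsilon(\theta)$ empirically on a toy logistic-regression model (hypothetical data, sizes, and a fixed $\theta$, with the regularizer omitted for simplicity): drawing many minibatch estimates $\hat{g}(\theta)$ shows that they are centered on $\nabla L(\theta)$ with a covariance $\Sigma(\theta)$ that shrinks roughly like $1/m$.
  ```python
  import numpy as np

  rng = np.random.default_rng(1)
  N, d, m = 500, 2, 32
  X = rng.normal(size=(N, d))
  y = (X[:, 0] > 0).astype(float)
  theta = np.array([0.5, -0.5])

  def per_example_grads(theta, X, y):
      # gradient of ell_i(theta) = -log p_theta(y_i | x_i) for logistic regression (no regularizer)
      p = 1.0 / (1.0 + np.exp(-(X @ theta)))
      return X * (p - y)[:, None]              # shape (N, d)

  full_grad = per_example_grads(theta, X, y).sum(axis=0)   # nabla L(theta)

  # draw many minibatch estimates g_hat(theta) = (N/m) * sum of m per-example gradients
  g_hats = []
  for _ in range(5000):
      idx = rng.integers(0, N, size=m)                     # each i_j uniform on {1, ..., N}
      g_hats.append((N / m) * per_example_grads(theta, X[idx], y[idx]).sum(axis=0))
  g_hats = np.array(g_hats)

  print(g_hats.mean(axis=0), full_grad)   # means agree: g_hat is an unbiased estimator
  print(np.cov(g_hats.T))                 # empirical Sigma(theta); shrinks roughly like 1/m
  ```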
- To correct this difference in dynamics, Welling and Teh (2011) suggest an alternative to the SGD update, which they call stochastic gradient Langevin dynamics (SGLD): $$\theta_{k+1} = \theta_k - \eta_k\, \hat{g}(\theta_k) + \sqrt{2\eta_k}\, \xi_k, \qquad \xi_k \sim \mathcal{N}(0, I).$$ With this modification, the dynamics become $$\theta_{k+1} = \theta_k - \eta_k\, \nabla L(\theta_k) - \eta_k\, \epsilon(\theta_k) + \sqrt{2\eta_k}\, \xi_k.$$ This is still not of the form in (9); however, if the step size $\eta_k$ is taken to $0$ as training progresses, then the stochastic gradient noise $\eta_k\, \epsilon(\theta_k)$, whose standard deviation scales as $\eta_k$, will eventually be dominated by the injected noise $\sqrt{2\eta_k}\, \xi_k$, whose standard deviation scales as $\sqrt{\eta_k}$. This scheme thereby converges as training progresses to the dynamics (9) and hence to the SDE (8). Thus, the iterates $\theta_k$ may eventually be viewed as samples from the posterior $p(\theta \mid \mathcal{D})$.
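- Here is a minimal sketch of SGLD for a toy Bayesian logistic regression with a standard normal prior, written with the $\sqrt{2\eta_k}$ noise convention of (9) (Welling and Teh parameterize the step size slightly differently); the data, step-size schedule, and iteration counts are illustrative choices:
  ```python
  import numpy as np

  rng = np.random.default_rng(2)

  # toy dataset for Bayesian logistic regression with prior p(theta) = N(0, I)
  N, d, m = 1000, 3, 50
  true_w = np.array([1.0, -1.0, 0.5])
  X = rng.normal(size=(N, d))
  y = (X @ true_w + 0.5 * rng.normal(size=N) > 0).astype(float)

  def grad_log_prior(theta):
      return -theta                                    # gradient of log N(theta; 0, I)

  def minibatch_grad_log_lik(theta, idx):
      p = 1.0 / (1.0 + np.exp(-(X[idx] @ theta)))      # p_theta(y=1 | x)
      return (N / len(idx)) * (X[idx].T @ (y[idx] - p))  # rescaled minibatch estimate

  theta = np.zeros(d)
  samples = []
  for k in range(20_000):
      eta = 0.05 * (k + 10) ** -0.55                   # step size decaying toward 0
      idx = rng.integers(0, N, size=m)
      grad_log_post = grad_log_prior(theta) + minibatch_grad_log_lik(theta, idx)
      # SGLD: stochastic gradient step plus injected noise of variance 2 * eta
      theta = theta + eta * grad_log_post + np.sqrt(2 * eta) * rng.normal(size=d)
      if k >= 10_000:
          samples.append(theta.copy())

  samples = np.array(samples)
  print(samples.mean(axis=0))    # approximate posterior mean
  print(samples.std(axis=0))     # approximate posterior marginal standard deviations
  ```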
- SGLD can be thought of as the Bayesian version of SGD, one that allows sampling from the posterior rather than simply finding a point estimate. Chen et al. (2014) extend this idea even further by proposing stochastic gradient Hamiltonian Monte Carlo (SGHMC), which can be thought of as a Bayesian version of SGD with momentum. These techniques can be used when a fully Bayesian treatment is desired.