Casey Chu

- Perspectives on the variational autoencoder
- There are many ways of looking at the variational autoencoder, or VAE, of Kingma and Welling (2014), and the evidence lower bound, or ELBO, used to train it. The goal of this post is to concisely catalog these perspectives for quick reference.
- In the VAE, there are two probability distributions:
- $q(x, z) = q(z|x) q(x), \qquad p(x, z) = p(x |z) p(z).$
- Conceptually, $q(x)$ is the data distribution, making $q(z|x)$ an encoder, and $p(z)$ is a latent prior, making $p(x|z)$ a decoder. These terms are assumed to be tractable, whereas reversed terms like $q(x|z)$ and $p(z|x)$ and marginalized terms like $q(z)$ and $p(x)$ are intractable.
- The variational autoencoder is trained by maximizing the ELBO:
- $\text{ELBO} = \E_{x \sim q(x)} \E_{z \sim q(z|x)} \log \frac{p(z) p(x | z)}{q(z|x)} = \E_{x,z \sim q(x,z)} \log \frac{p(x,z)}{q(z|x)}.$
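- As a numerical sanity check of this objective, here is a minimal sketch (assuming NumPy; the toy model is my own choice): take $p(z) = \mathcal{N}(0, 1)$ and $p(x|z) = \mathcal{N}(z, 1)$, so that $p(x) = \mathcal{N}(0, 2)$ and the exact posterior is $p(z|x) = \mathcal{N}(x/2, 1/2)$. Using the exact posterior as $q(z|x)$, the Monte Carlo estimate of the ELBO recovers $\log p(x)$, since then $p(x,z)/q(z|x) = p(x)$ for every sample.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  def log_normal(v, mean, var):
      return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

  # Toy model: p(z) = N(0,1), p(x|z) = N(z,1), so p(x) = N(0,2).
  # We use the exact posterior N(x/2, 1/2) as q(z|x), making the bound tight.
  def elbo_estimate(x, n_samples=1000):
      z = rng.normal(x / 2, np.sqrt(0.5), size=n_samples)          # z ~ q(z|x)
      log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(x,z)
      log_q = log_normal(z, x / 2, 0.5)                            # log q(z|x)
      return float(np.mean(log_joint - log_q))

  x = 1.3
  print(elbo_estimate(x), log_normal(x, 0.0, 2.0))  # the two agree up to float error
  ```

  With a mismatched $q(z|x)$, the same estimator would return a strictly smaller value, with the gap equal to $\E_{x} D(q(z|x) || p(z|x))$.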
- **Maximum likelihood.** We can think of the VAE as training a generative model $p(x)$ using maximum likelihood, attempting to maximize $\E_{x \sim q(x)} \log p(x)$. Since $p(x)$ is intractable, we instead maximize the lower bound $\E_{x \sim q(x)} \log p(x) \ge \text{ELBO} = \E_{x \sim q(x)} \log p(x) - \E_{x \sim q(x)} D(q(z|x) || p(z|x)),$ which is tight when $q(z|x) = p(z|x)$.
- **Variational Bayes.** From the perspective of Bayesian inference, $p(z)$ is a prior and $p(x|z)$ is a likelihood. This makes $p(z|x)$ the posterior; unfortunately, it is intractable, so we approximate it with a variational posterior $q(z|x)$. Ideally, we would minimize $\E_{x \sim q(x)} D(q(z|x) || p(z|x))$, but this too is intractable, so we instead maximize $\text{ELBO} = \E_{x \sim q(x)} [\log p(x) - D(q(z|x) || p(z|x))].$ Note that from this perspective, $p(x,z)$ has no optimizable parameters, so $\log p(x)$ is a constant. Inference is amortized, being performed for every $x \sim q(x)$; if we only care about one observed data point $x_0$, we can set $q(x) = \delta(x - x_0)$.
- **Autoencoder.** We can view the VAE as an autoencoder by writing the ELBO as $\text{ELBO} = \E_{x \sim q(x)} \E_{z \sim q(z|x)}\log p(x|z) - \E_{x \sim q(x)} D(q(z|x) || p(z)).$ Suppose $p(x|z) = \mathcal{N}(\mu(z), \sigma^2)$, and $q(z|x)$ is concentrated at a deterministic function $\varphi(x)$. The first term is a *reconstruction error*, proportional to $-|| x - \mu(\varphi(x))||^2$. The second is a *KL term* that matches the variational posterior to the prior. In practice, this is the objective trained with SGD, with the KL term either estimated via Monte Carlo or integrated analytically.
- **Importance sampling.** Motivated by the observation that $\log p(x) = \log \E_{z \sim p(z|x)} p(x) = \log \E_{z \sim q(z|x)} \Big[\frac{p(z|x)}{q(z|x)} \cdot p(x) \Big] = \log \E_{z \sim q(z|x)} \frac{p(x,z)}{q(z|x)},$ Burda et al.
(2016) proposed the importance-weighted autoencoder, which maximizes $\text{IWAE}_k = \E_{x \sim q(x)}\E_{z_1, \ldots, z_k \sim q(z|x)} \log \frac{1}{k}\sum_{i=1}^k \frac{p(x,z_i)}{q(z_i|x)}.$ Jensen’s inequality shows that $\E_{x \sim q(x)}\log p(x) \ge \cdots \ge \text{IWAE}_{k+1} \ge \text{IWAE}_k \ge \text{IWAE}_1 = \text{ELBO}.$
- Thus we may achieve a tighter bound by replacing the ELBO with the IWAE bound. Here, $q(z|x)$ loses its interpretation as a variational posterior, but Bachman and Precup (2015) and Cremer et al. (2017) reinterpret the IWAE bound as the usual ELBO with $q(z|x)$ replaced by an implicitly defined distribution $q^{(k)}(z|x)$, one that converges to the true posterior $p(z|x)$ as $k \to \infty$. This approximate posterior $q^{(k)}(z|x)$ can be sampled from by first sampling $z_1, \ldots, z_k \sim q(z|x)$ and then returning $z_j$ with probability proportional to $\frac{p(x,z_j)}{q(z_j|x)}$.
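- To see the bounds tighten numerically, here is a small sketch (assumptions mine: NumPy, the toy model $p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z,1)$, and a deliberately crude proposal $q(z|x) = \mathcal{N}(0,1)$, chosen so the ELBO is loose): the Monte Carlo IWAE$_k$ estimate climbs from the ELBO ($k=1$) toward $\log p(x) = \log \mathcal{N}(x; 0, 2)$ as $k$ grows.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  def log_normal(v, mean, var):
      return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

  # Toy model: p(z) = N(0,1), p(x|z) = N(z,1), so p(x) = N(0,2).
  # Crude proposal q(z|x) = N(0,1), so the k=1 bound (the ELBO) is loose.
  def iwae_estimate(x, k, n_batches=20000):
      z = rng.normal(0.0, 1.0, size=(n_batches, k))  # z_1, ..., z_k ~ q(z|x)
      log_w = (log_normal(z, 0.0, 1.0)               #   log p(z)
               + log_normal(x, z, 1.0)               # + log p(x|z)
               - log_normal(z, 0.0, 1.0))            # - log q(z|x)
      m = log_w.max(axis=1, keepdims=True)           # stabilized log-mean-exp
      log_mean_w = m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1))
      return float(np.mean(log_mean_w))

  x = 1.3
  elbo, iwae100 = iwae_estimate(x, 1), iwae_estimate(x, 100)
  log_px = log_normal(x, 0.0, 2.0)
  # elbo < iwae100 <= log_px, up to Monte Carlo noise
  ```

  For this linear-Gaussian model the $k=1$ value can be checked in closed form: $\text{ELBO} = -\frac{1}{2}\log 2\pi - \frac{1}{2}(x^2 + 1)$.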
- **Expectation-maximization.** Expectation-maximization is an iterative algorithm for computing the maximum likelihood estimator. In the language of VAEs, it maximizes the ELBO by coordinate ascent, alternating between $q(z|x)$ (the “E-step”) and $p(z)$ (the “M-step”). In the E-step, the ELBO is maximized by setting $q(z|x)$ to $p(z|x)$. (This is called the E-step because $p(z|x) = \E_{h\sim p(h|x)} p(z|x,h)$, where $h$ represents unobserved data.) In the M-step, the ELBO is maximized by setting $p(z)$ to $\delta(z - z^*)$, where $z^*$ maximizes the likelihood $p(x|z)$.
- **Representation learning.** The encoder $q(z|x)$ in a VAE can be seen as a way to learn a compact representation of the data. However, the ELBO on its own does not seem to promote this objective particularly well; see, for example, Higgins et al. (2017) and Alemi et al. (2018).
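- The coordinate-ascent view of EM can be made concrete with the classical two-component Gaussian mixture (a minimal sketch; the unit variances, equal weights, and initialization are my own choices): the E-step computes posterior responsibilities over the unobserved component labels (the $h$ above), the M-step re-estimates the means as a point estimate of the parameters, and the data log-likelihood never decreases.

  ```python
  import numpy as np

  rng = np.random.default_rng(1)
  # Synthetic data from an equal-weight mixture of N(-2, 1) and N(2, 1).
  x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])

  def log_normal(v, mean, var):
      return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

  def log_likelihood(x, mu):
      # log p(x) under the mixture, with fixed equal weights and unit variances
      lp = np.logaddexp(log_normal(x, mu[0], 1.0), log_normal(x, mu[1], 1.0)) - np.log(2)
      return float(lp.sum())

  mu = np.array([-0.5, 0.5])  # crude initialization
  lls = [log_likelihood(x, mu)]
  for _ in range(30):
      # E-step: responsibilities r = p(h | x, mu) over labels h in {0, 1}
      l0, l1 = log_normal(x, mu[0], 1.0), log_normal(x, mu[1], 1.0)
      r1 = 1.0 / (1.0 + np.exp(l0 - l1))
      r0 = 1.0 - r1
      # M-step: maximize the expected complete-data log-likelihood over mu
      mu = np.array([(r0 * x).sum() / r0.sum(), (r1 * x).sum() / r1.sum()])
      lls.append(log_likelihood(x, mu))
  ```

  The monotone increase of `lls` is exactly the coordinate-ascent guarantee: each step can only raise the ELBO, and after every E-step the bound touches the log-likelihood.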