Flavors of Wasserstein GAN

Flavors of Wasserstein GAN
Arkjovsky et al. (2017) introduced the Wasserstein GAN (WGAN), a formulation of GANs (generative adversarial networks) that attempts to minimize the Wasserstein distance $W$ between the generator’s distribution and the data distribution. It relies on one particularly useful expression for this distance, the Kantorovich-Rubinstein formula, which expresses the Wasserstein distance between two distributions $\mathbb{P}$ and $\mathbb{Q}$ as
$W(\mathbb{P}, \mathbb{Q}) = \sup_{f \in \mathrm{Lip}_1} \E_{x \sim \mathbb{P}}[f(x)] - \E_{y \sim \mathbb{Q}}[f(y)],$ where $\mathrm{Lip}_1$ is the set of all $1$ -Lipschitz functions. We’ll talk about what it means for $f$ to be $1$ -Lipschitz later. Intuitively, all it means is that $f(x)$ doesn’t change too quickly as $x$ changes.
The WGAN algorithm
The WGAN algorithm works as follows. Let $\mathbb{P}$ be the data distribution that we’re trying to mimic, and let $\mathbb{Q}_\theta$ be the generated distribution, parameterized by $\theta$ . Usually, $\mathbb{Q}_\theta$ is the distribution of $G_\theta(z)$ , where $G_\theta$ is some neural network that transforms a noise sample $z \sim \mathcal{N}(0, I)$ into a sample. To train $\mathbb{Q}_\theta$ to approximate $\mathbb{P}$ , we approximately minimize the distance $W(\mathbb{P}, \mathbb{Q}_\theta)$ . In particular, we use the approximation $\hat{W}(\mathbb{P}, \mathbb{Q}_\theta) = \hat\E_{x \sim \mathbb{P}}[f_\phi(x)] - \hat\E_{y \sim \mathbb{Q}_\theta}[f_\phi(y)],$ where $f_\phi$ is a critic, and $\hat\E$ denotes an approximation to the expectation by a finite sample. Notice that if $f_\phi$ is indeed $1$ -Lipschitz, then by the Kantorovich-Rubinstein formula above, $W(\mathbb{P}, \mathbb{Q}_\theta) \ge \E[\hat{W}(\mathbb{P}, \mathbb{Q}_\theta)].$ In other words, $\hat{W}(\mathbb{P}, \mathbb{Q}_\theta)$ is a lower bound to the true Wasserstein distance in expectation.
The idea, then, is to maximize $\hat{W}(\mathbb{P}, \mathbb{Q}_\theta)$ with respect to $f_\phi$ to get the best lower bound for the true Wasserstein distance, and then perform a gradient step on $\mathbb{Q}_\theta$ to decrease our approximation $\hat{W}(\mathbb{P}, \mathbb{Q}_\theta)$ , hopefully decreasing the true distance as a result.
Concretely, we perform the following steps:
- 1. Optimize critic. Perform gradient steps on $\phi$ to maximize $\hat\E_{x \sim \mathbb{P}}[f_\phi(x)] - \hat\E_{y \sim \mathbb{Q}_\theta}[f_\phi(y)],$ somehow constraining $f_\phi$ to $1$ -Lipschitz functions, thus improving our approximation to the Wasserstein distance.
- 2. Optimize generator. Perform gradient steps on $\theta$ to minimize the approximate distance $\hat{W}(\mathbb{P}, \mathbb{Q}_\theta)$ .
- 3. Repeat. Go to step 1.
Once $\theta$ is trained, the hope is that $\mathbb{Q}_\theta$ will accurately model the data distribution $\mathbb{P}$ .
The Lipschitz condition
One interesting degree of freedom in this algorithm is how to constrain $f_\phi$ to a family of $1$ -Lipschitz functions. Recall that $f$ is called $k$ -Lipschitz if $|f(x) - f(y)| \le k ||x - y||$ for all $x, y$ .
Notice that we can actually relax the Lipschitz condition a little: $f$ is $1$ -Lipschitz if and only if $kf$ is $k$ -Lipschitz, so optimizing $f_\phi$ over $k$ -Lipschitz functions instead of $1$ -Lipschitz functions simply scales the optimal $f_\phi$ by $k$ . Thus so long as $k$ is fixed, nothing changes, except for the scale of the optimal $f_\phi$ . (This does affect the effective learning rate of the generator step.)
Let’s denote the Lipschitz constant of a function $f$ as $||f||_{\mathrm{Lip}}$ , the smallest constant $k$ for which $f$ is $k$ -Lipschitz. (If we define $||f||_{\mathrm{Lip}} = k + |f(0)|$ , then $||f||_{\mathrm{Lip}}$ actually becomes a true norm. In our application, we can safely assume $f_\phi(0) = 0$ , since we can always subtract an arbitrary constant; nothing changes because we only access $f_\phi$ through differences or derivatives.)
The derivative
One way to control the Lipschitz constant of $f_\phi$ is through its derivative. Specifically, if $f$ is continuous and differentiable, then $f$ is $1$ -Lipschitz if and only if $||\nabla f|| \le 1$ everywhere.
To see this, first suppose $f$ is not $1$ -Lipschitz. Then there exists an $x$ and $y$ such that $||f(x) - f(y)|| > ||x - y||$ . If we define $g(t) = f(tx + (1-t)y)$ , then the mean value theorem says that there exists a $c$ such that $g'(c) = f(x) - f(y)$ . But $g'(c) = (x - y) \cdot \nabla f(z)$ , where $z = cx + (1-c)y$ . Hence by Cauchy-Schwarz, $|| f(x) - f(y)|| = |(x - y) \cdot \nabla f(z)| \le ||x - y||\, ||\nabla f(z)||,$ or $||\nabla f(z) || \ge \frac{||f(x) - f(y) ||}{||x - y||} > 1,$ and it is not the case that $\nabla f \le 1$ everywhere.
Conversely, suppose $||\nabla f(z)|| > 1$ for some $z$ . Then there exists a unit vector $u$ such that $|u \cdot \nabla f(z)| > 1$ , or $\lim_{t \to 0} \frac{||f(x + tu) - f(x) ||}{||tu||} > 1.$ In particular there exists some $\epsilon > 0$ such that $\frac{||f(x + \epsilon u) - f(x) ||}{ ||\epsilon u||} > 1$ , and hence $f$ is not Lipschitz (set $y = x + \epsilon u$ ).
Therefore, to enforce that $f$ is $1$ -Lipschitz, we may prescribe that $||\nabla f|| \le 1$ everywhere.
The spectral norm
Another way to control the Lipschitz constant of $f_\phi$ is to take advantage of the particular structure of neural networks. If we assume that $f_\phi$ is a neural network, so that $f_\phi$ alternates affine transformations and nonlinearities $f_\phi(x) = g_n W_n \cdots g_2 W_2 g_1 W_1 x,$ then since compositions of functions behave nicely with the Lipschitz norm, we may conclude that $||f_\phi||_{\mathrm{Lip}} \le ||g_n||_{\mathrm{Lip}} ||W_n||_{\mathrm{Lip}} \cdots ||g_2||_{\mathrm{Lip}} ||W_2||_{\mathrm{Lip}} ||g_1||_{\mathrm{Lip}} ||W_1||_{\mathrm{Lip}}.$ In other words, $f$ will be Lipschitz if we constrain every layer to be Lipschitz.
For the linear layers, remarkably we have that $||W||_{\mathrm{Lip}} = \sup_{x \ne y} \frac{||Wx - Wy||}{||x- y||} = \sup_{z \ne 0} \frac{||Wz||}{||z||} = ||W||_{\text{op}},$ where $||W||_{\text{op}}$ is the operator norm. Furthermore, when $||\cdot||$ is the Euclidean norm, the operator norm is called the spectral norm. The spectral norm of $||W||$ is the largest singular value of the matrix $W$ , usually denoted $\sigma_1(W)$ . Therefore, to control the Lipschitz constant of a layer $W$ , all we need to do is bound the largest singular value of $W$ .
As for the nonlinearities, most in use are Lipschitz. The only ones I can think of that we need to be careful of is parametric ReLU; since the slope is trainable, it can have unbounded Lipschitz constant.
Algorithmic approachs
So far, there have been three main ways that have been proposed in the literature to enforce the Lipschitz constraint: weight clipping, gradient penalty, and spectral normalization.
Weight clipping
Arkjovsky et al. (2017) originally proposed weight clipping as a crude way to enforce the Lipschitz condition (a “clearly terrible way,” in their own words). After each gradient step on $\phi$ , they clip the weights $\phi$ to only take on values in $[-c, c]$ for some constant $c$ .
This is an example of the second approach, control with the spectral norm of each layer. The claim is that the family of neural networks with weights between $[-c, c]$ is $k$ -Lipschitz for some $k$ . One neat way to see this is to notice that $\sigma_1(W)$ is the square root of the largest eigenvalue of $W^T W$ , and the eigenvalues are bounded by $m^2c^2$ by the Gershgorin circle theorem, where $W$ has $m$ rows. Hence $\sigma_1(W)$ is bounded by $mc$ , so the family of networks we’re optimizing over is $k$ -Lipschitz for some $k$ .
So why is it a clearly terrible way? The key is capacity underuse. Notice that the optimal function $f_\phi$ will have a Lipschitz constant of exactly $k$ , since otherwise we can scale our function to have Lipschitz constant $k$ and achieve a larger value for the objective $\hat\E_{x \sim \mathbb{P}}[f_\phi(x)] - \hat\E_{y \sim \mathbb{Q}_\theta}[f_\phi(y)].$ This means that effectively, the space of functions that we are optimizing over is limited to the space of functions with Lipschitz constant exactly $k$ .
If we perform weight clipping, the space of functions with maximum Lipschitz constant is where each weight is $\pm c$ , achieving the bound mentioned above of $mc$ . Gulrajani et al. (2017) found that training WGANs with weight clipping indeed leads to weight matrices populated by $\pm c$ . Such matrices are not very expressive and cannot discriminate well between real and fake samples.
Gradient penalty
As a gentler alternative, Gulrajani et al. (2017) proposed directly penalizing the gradient to have norm $1$ by adding a regularization term $\lambda \hat\E_{\hat{x} \sim \hat{\mathbb{P}}}[ (|| \nabla f_\phi(\hat{x}) ||_2 - 1)^2]$ to the loss, where $\hat\mathbb{P}$ is the following distribution: let $\hat{x}$ is be a random point between $x \sim \mathbb{P}$ and $y \sim \mathbb{Q}_\theta$ , two random points from the data distribution and the generator distribution respectively. Concretely, $\hat{x} = t x + (1-t) y$ , where $t \sim \mathrm{Uniform}(0,1)$ . For convenience, denote $\hat{x} \sim \hat\mathbb{P}$ .
As mentioned above, constraining $||\nabla f_\phi ||$ to be at most $1$ everywhere would constrain $f_\phi$ to be Lipschitz. With this approach, however, we penalize the function to satisfy the Lipschitz condition only on straight lines connecting the support of $\mathbb{P}$ with the support of $\mathbb{Q}_\theta$ . I believe that enforcing this weaker constraint is sufficient from the perspective of the Kantorovich-Rubinstein formula and optimal transport theory — if you have a proof, please let me know! A good reference is Villani’s Optimal Transport, Old and New.
I find Gulrajani et al.’s justification for why they penalize the gradient norm to be exactly $1$ (rather than merely at most $1$ ) to be less convincing, though. They cite a theorem that says that for the optimal $f^*$ , $|| \nabla f^*(\hat{x}) || = 1$ almost surely if $(x, y) \sim \pi$ , where $\pi$ is the optimal coupling between $\mathbb{P}$ and $\mathbb{Q}_\theta$ . However, in their penalty, they sample from $(x,y) \sim \mathbb{P} \times \mathbb{Q}_\theta$ , whose support is much less sparse than $\pi$ ; it is not true generally that $|| \nabla f^*(\hat{x}) || = 1$ $\hat\mathbb{P}$ -almost everywhere. Thus the two-sided penalty seems a bit too heavy-handed.
Nonetheless, the gradient penalty has proven empirically successful at training GANs.
Spectral normalization
Miyato et al. (2018) propose a third alternative that takes advantage of the special structure of neural networks by bounding the spectral norm of each layer, as mentioned above.
Concretely, they take a normal network, but replace each layer $W$ with $\frac{W}{||W||_2}$ , normalizing each layer to automatically have spectral norm $1$ . To do this, they note that $||W||_2$ is the largest singular value of $W$ , which may be estimated using the power method. The authors calculate both the dominant left and right singular vectors $u_1$ and $v_1$ of $W$ and then estimate $\sigma_1 \approx u^T W v$ , but I don’t see any problem with performing the normal power method (for eigenvalues) on $W^T W$ .
This approach to be both mathematically elegant and computationally inexpensive. The only problem I see is that bounding the spectral norm of each layer does not tightly bound the Lipschitz constant of the entire network. But perhaps in practice the bound is tight enough.
The images generated using spectral normalization seem really impressive!
Originally written March 31, 2018.