Casey Chu
• Flavors of Wasserstein GAN
• Arjovsky et al. (2017) introduced the Wasserstein GAN (WGAN), a formulation of GANs (generative adversarial networks) that attempts to minimize the Wasserstein distance $W$ between the generator’s distribution and the data distribution. It relies on one particularly useful expression for this distance, the Kantorovich-Rubinstein formula, which expresses the Wasserstein distance between two distributions $\mathbb{P}$ and $\mathbb{Q}$ as
• $W(\mathbb{P}, \mathbb{Q}) = \sup_{f \in \mathrm{Lip}_1} \E_{x \sim \mathbb{P}}[f(x)] - \E_{y \sim \mathbb{Q}}[f(y)],$ where $\mathrm{Lip}_1$ is the set of all $1$-Lipschitz functions. We’ll talk about what it means for $f$ to be $1$-Lipschitz later. Intuitively, all it means is that $f(x)$ doesn’t change too quickly as $x$ changes.
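As a sanity check on the formula, here is a small numerical illustration of my own (not from the paper). In one dimension, the Wasserstein-1 distance can also be computed by matching quantiles of the two distributions, and when $\mathbb{Q}$ is a pure shift of $\mathbb{P}$, the identity map $f(x) = x$ is an optimal $1$-Lipschitz critic, so the two computations should agree:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, 100_000)  # samples from P = N(2, 1)
y = rng.normal(0.0, 1.0, 100_000)  # samples from Q = N(0, 1)

# Primal: in 1D, the Wasserstein-1 distance is the average distance
# between quantile-matched (i.e. sorted) samples.
primal = np.abs(np.sort(x) - np.sort(y)).mean()

# Dual: f(x) = x is 1-Lipschitz, and it is an optimal critic when Q
# is a pure shift of P, so the dual objective should match.
dual = x.mean() - y.mean()
```

Both quantities come out near the true distance of $2$, and the dual value never exceeds the primal one, as the formula's supremum demands.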
• The WGAN algorithm
• The WGAN algorithm works as follows. Let $\mathbb{P}$ be the data distribution that we’re trying to mimic, and let $\mathbb{Q}_\theta$ be the generated distribution, parameterized by $\theta$. Usually, $\mathbb{Q}_\theta$ is the distribution of $G_\theta(z)$, where $G_\theta$ is some neural network that transforms a noise sample $z \sim \mathcal{N}(0, I)$ into a sample. To train $\mathbb{Q}_\theta$ to approximate $\mathbb{P}$, we approximately minimize the distance $W(\mathbb{P}, \mathbb{Q}_\theta)$. In particular, we use the approximation $\hat{W}(\mathbb{P}, \mathbb{Q}_\theta) = \hat\E_{x \sim \mathbb{P}}[f_\phi(x)] - \hat\E_{y \sim \mathbb{Q}_\theta}[f_\phi(y)],$ where $f_\phi$ is a critic, and $\hat\E$ denotes an approximation to the expectation by a finite sample. Notice that if $f_\phi$ is indeed $1$-Lipschitz, then by the Kantorovich-Rubinstein formula above, $W(\mathbb{P}, \mathbb{Q}_\theta) \ge \E[\hat{W}(\mathbb{P}, \mathbb{Q}_\theta)].$ In other words, $\hat{W}(\mathbb{P}, \mathbb{Q}_\theta)$ is a lower bound to the true Wasserstein distance in expectation.
• The idea, then, is to maximize $\hat{W}(\mathbb{P}, \mathbb{Q}_\theta)$ with respect to $f_\phi$ to get the best lower bound for the true Wasserstein distance, and then perform a gradient step on $\mathbb{Q}_\theta$ to decrease our approximation $\hat{W}(\mathbb{P}, \mathbb{Q}_\theta)$, hopefully decreasing the true distance as a result.
• Concretely, we perform the following steps:
• 1. Optimize critic. Perform gradient steps on $\phi$ to maximize $\hat\E_{x \sim \mathbb{P}}[f_\phi(x)] - \hat\E_{y \sim \mathbb{Q}_\theta}[f_\phi(y)],$ somehow constraining $f_\phi$ to $1$-Lipschitz functions, thus improving our approximation to the Wasserstein distance.
• 2. Optimize generator. Perform gradient steps on $\theta$ to minimize the approximate distance $\hat{W}(\mathbb{P}, \mathbb{Q}_\theta)$.
• 3. Repeat. Go to step 1.
• Once $\theta$ is trained, the hope is that $\mathbb{Q}_\theta$ will accurately model the data distribution $\mathbb{P}$.
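To make the loop concrete, here is a deliberately tiny numpy sketch (my own toy construction, not the paper's setup): the data distribution is a 1D Gaussian, the generator simply shifts noise by $\theta$, and the critic is linear, $f_\phi(x) = \phi x$, which is $1$-Lipschitz exactly when $|\phi| \le 1$, so the Lipschitz constraint can be enforced by clipping $\phi$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0       # data distribution P = N(mu, 1)
theta = 0.0    # generator: G_theta(z) = z + theta, with z ~ N(0, 1)
phi = 0.0      # linear critic: f_phi(x) = phi * x, 1-Lipschitz iff |phi| <= 1

lr_critic, lr_gen, n_critic, batch = 0.1, 0.02, 5, 64
for step in range(1000):
    # Step 1: optimize the critic (gradient ascent on W_hat)
    for _ in range(n_critic):
        x = rng.normal(mu, 1.0, batch)           # samples from P
        y = rng.normal(0.0, 1.0, batch) + theta  # samples from Q_theta
        # dW_hat/dphi = mean(x) - mean(y)
        phi += lr_critic * (x.mean() - y.mean())
        phi = float(np.clip(phi, -1.0, 1.0))     # enforce the Lipschitz constraint
    # Step 2: optimize the generator (gradient descent on W_hat)
    # dW_hat/dtheta = d/dtheta [-phi * mean(z + theta)] = -phi
    theta -= lr_gen * (-phi)
```

After training, $\theta$ hovers near $\mu = 3$: the generated distribution has moved onto the data distribution.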
• The Lipschitz condition
• One interesting degree of freedom in this algorithm is how to constrain $f_\phi$ to a family of $1$-Lipschitz functions. Recall that $f$ is called $k$-Lipschitz if $|f(x) - f(y)| \le k ||x - y||$ for all $x, y$.
• Notice that we can actually relax the Lipschitz condition a little: $f$ is $1$-Lipschitz if and only if $kf$ is $k$-Lipschitz, so optimizing $f_\phi$ over $k$-Lipschitz functions instead of $1$-Lipschitz functions simply scales the optimal $f_\phi$ by $k$. Thus so long as $k$ is fixed, nothing changes, except for the scale of the optimal $f_\phi$. (This does affect the effective learning rate of the generator step.)
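This scaling behavior is easy to check numerically. The helper below is a rough estimator of my own: it lower-bounds $||f||_{\mathrm{Lip}}$ by the largest difference quotient over many random pairs of points, and scaling $f$ by $3$ scales the estimate by exactly $3$:

```python
import numpy as np

def lip_estimate(f, n=100_000, seed=0):
    """Lower-bound the Lipschitz constant of f by the largest
    difference quotient over n random pairs of points in [-5, 5]."""
    rng = np.random.default_rng(seed)
    x, y = rng.uniform(-5.0, 5.0, (2, n))
    return np.max(np.abs(f(x) - f(y)) / np.abs(x - y))
```

For example, `lip_estimate(np.sin)` comes out just under the true Lipschitz constant of $1$, and `lip_estimate(lambda t: 3.0 * np.sin(t))` is three times that.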
• Let’s denote the Lipschitz constant of a function $f$ as $||f||_{\mathrm{Lip}}$, the smallest constant $k$ for which $f$ is $k$-Lipschitz. (If we define $||f||_{\mathrm{Lip}} = k + |f(0)|$, then $||f||_{\mathrm{Lip}}$ actually becomes a true norm. In our application, we can safely assume $f_\phi(0) = 0$, since we can always subtract an arbitrary constant; nothing changes because we only access $f_\phi$ through differences or derivatives.)
• The derivative
• One way to control the Lipschitz constant of $f_\phi$ is through its derivative. Specifically, if $f$ is differentiable, then $f$ is $1$-Lipschitz if and only if $||\nabla f|| \le 1$ everywhere.
• To see this, first suppose $f$ is not $1$-Lipschitz. Then there exist $x$ and $y$ such that $|f(x) - f(y)| > ||x - y||$. If we define $g(t) = f(tx + (1-t)y)$, then the mean value theorem says that there exists a $c \in (0, 1)$ such that $g'(c) = g(1) - g(0) = f(x) - f(y)$. But $g'(c) = (x - y) \cdot \nabla f(z)$, where $z = cx + (1-c)y$. Hence by Cauchy-Schwarz, $|f(x) - f(y)| = |(x - y) \cdot \nabla f(z)| \le ||x - y||\, ||\nabla f(z)||,$ or $||\nabla f(z)|| \ge \frac{|f(x) - f(y)|}{||x - y||} > 1,$ and it is not the case that $||\nabla f|| \le 1$ everywhere.
• Conversely, suppose $||\nabla f(z)|| > 1$ for some $z$. Then there exists a unit vector $u$ such that $|u \cdot \nabla f(z)| > 1$, or $\lim_{t \to 0} \frac{|f(z + tu) - f(z)|}{||tu||} > 1.$ In particular, there exists some $\epsilon > 0$ such that $\frac{|f(z + \epsilon u) - f(z)|}{||\epsilon u||} > 1$, and hence $f$ is not $1$-Lipschitz (set $x = z$ and $y = z + \epsilon u$).
• Therefore, to enforce that $f$ is $1$-Lipschitz, we may prescribe that $||\nabla f|| \le 1$ everywhere.
• The spectral norm
• Another way to control the Lipschitz constant of $f_\phi$ is to take advantage of the particular structure of neural networks. If we assume that $f_\phi$ is a neural network, so that $f_\phi$ alternates affine transformations and nonlinearities $f_\phi(x) = g_n W_n \cdots g_2 W_2 g_1 W_1 x,$ then since compositions of functions behave nicely with the Lipschitz norm, we may conclude that $||f_\phi||_{\mathrm{Lip}} \le ||g_n||_{\mathrm{Lip}} ||W_n||_{\mathrm{Lip}} \cdots ||g_2||_{\mathrm{Lip}} ||W_2||_{\mathrm{Lip}} ||g_1||_{\mathrm{Lip}} ||W_1||_{\mathrm{Lip}}.$ In other words, $f$ will be Lipschitz if we constrain every layer to be Lipschitz.
• For the linear layers, remarkably we have that $||W||_{\mathrm{Lip}} = \sup_{x \ne y} \frac{||Wx - Wy||}{||x- y||} = \sup_{z \ne 0} \frac{||Wz||}{||z||} = ||W||_{\text{op}},$ where $||W||_{\text{op}}$ is the operator norm. Furthermore, when $||\cdot||$ is the Euclidean norm, the operator norm is called the spectral norm. The spectral norm of $W$ is the largest singular value of the matrix $W$, usually denoted $\sigma_1(W)$. Therefore, to control the Lipschitz constant of a layer $W$, all we need to do is bound the largest singular value of $W$.
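A quick numerical illustration of my own (with arbitrary shapes): the largest singular value reported by an SVD really does bound the stretch factor $||Wz||/||z||$ for every $z$, and random probing directions come close to attaining it:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))

# The largest singular value (spectral norm) of W...
sigma1 = np.linalg.svd(W, compute_uv=False)[0]

# ...bounds the stretch ||Wz|| / ||z|| for every direction z.
z = rng.normal(size=(6, 1000))
stretch = np.linalg.norm(W @ z, axis=0) / np.linalg.norm(z, axis=0)
```

No probed direction stretches more than $\sigma_1(W)$, and the best one comes close to it.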
• As for the nonlinearities, most in common use are Lipschitz. The only one I can think of that we need to be careful with is the parametric ReLU: since its slope is trainable, it can have an unbounded Lipschitz constant.
• Algorithmic approaches
• So far, there have been three main ways that have been proposed in the literature to enforce the Lipschitz constraint: weight clipping, gradient penalty, and spectral normalization.
• Weight clipping
• Arjovsky et al. (2017) originally proposed weight clipping as a crude way to enforce the Lipschitz condition (a “clearly terrible way,” in their own words). After each gradient step on $\phi$, they clip the weights $\phi$ to only take on values in $[-c, c]$ for some constant $c$.
• This is an example of the second approach, control via the spectral norm of each layer. The claim is that the family of neural networks with weights in $[-c, c]$ is $k$-Lipschitz for some $k$. One neat way to see this is to notice that $\sigma_1(W)$ is the square root of the largest eigenvalue of $W^T W$. If $W$ is $m \times n$, each entry of $W^T W$ is bounded by $mc^2$, so the eigenvalues are bounded by $mnc^2$ by the Gershgorin circle theorem. Hence $\sigma_1(W)$ is bounded by $\sqrt{mn}\,c$, so the family of networks we’re optimizing over is $k$-Lipschitz for some $k$.
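Here is a quick check of a bound of this type (my own illustration; this particular bound also follows directly from the Frobenius norm, since $\sigma_1(W) \le ||W||_F \le \sqrt{mn}\,c$ for an $m \times n$ matrix with entries in $[-c, c]$):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 0.01  # clipping threshold

ratios = []
for _ in range(100):
    m, n = rng.integers(2, 20, size=2)
    # random m x n weight matrix with entries clipped to [-c, c]
    W = np.clip(rng.normal(scale=0.02, size=(m, n)), -c, c)
    sigma1 = np.linalg.svd(W, compute_uv=False)[0]
    ratios.append(sigma1 / (np.sqrt(m * n) * c))
```

Every ratio stays at or below $1$, confirming that clipped matrices have uniformly bounded spectral norm.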
• So why is it a clearly terrible way? The key is capacity underuse. Notice that the optimal function $f_\phi$ will have a Lipschitz constant of exactly $k$, since otherwise we can scale our function to have Lipschitz constant $k$ and achieve a larger value for the objective $\hat\E_{x \sim \mathbb{P}}[f_\phi(x)] - \hat\E_{y \sim \mathbb{Q}_\theta}[f_\phi(y)].$ This means that effectively, the space of functions that we are optimizing over is limited to the space of functions with Lipschitz constant exactly $k$.
• If we perform weight clipping, the functions with the largest Lipschitz constant are those where each weight is $\pm c$, which can attain the spectral-norm bound above. Gulrajani et al. (2017) found that training WGANs with weight clipping indeed leads to weight matrices populated by $\pm c$. Such matrices are not very expressive and cannot discriminate well between real and fake samples.
• Gradient penalty
• As a gentler alternative, Gulrajani et al. (2017) proposed directly penalizing the gradient to have norm $1$ by adding a regularization term $\lambda \hat\E_{\hat{x} \sim \hat{\mathbb{P}}}[ (|| \nabla f_\phi(\hat{x}) ||_2 - 1)^2]$ to the loss, where $\hat{\mathbb{P}}$ is the following distribution: let $\hat{x}$ be a random point on the line segment between $x \sim \mathbb{P}$ and $y \sim \mathbb{Q}_\theta$, two random points from the data distribution and the generator distribution respectively. Concretely, $\hat{x} = t x + (1-t) y$, where $t \sim \mathrm{Uniform}(0,1)$. For convenience, we write $\hat{x} \sim \hat{\mathbb{P}}$.
• As mentioned above, constraining $||\nabla f_\phi ||$ to be at most $1$ everywhere would constrain $f_\phi$ to be Lipschitz. With this approach, however, we penalize the function to satisfy the Lipschitz condition only on straight lines connecting the support of $\mathbb{P}$ with the support of $\mathbb{Q}_\theta$. I believe that enforcing this weaker constraint is sufficient from the perspective of the Kantorovich-Rubinstein formula and optimal transport theory; if you have a proof, please let me know! A good reference is Villani’s Optimal Transport: Old and New.
• I find Gulrajani et al.’s justification for why they penalize the gradient norm to be exactly $1$ (rather than merely at most $1$) to be less convincing, though. They cite a theorem that says that for the optimal $f^*$, $|| \nabla f^*(\hat{x}) || = 1$ almost surely if $(x, y) \sim \pi$, where $\pi$ is the optimal coupling between $\mathbb{P}$ and $\mathbb{Q}_\theta$. However, in their penalty, they sample from $(x,y) \sim \mathbb{P} \times \mathbb{Q}_\theta$, whose support is much less sparse than $\pi$; it is not true generally that $|| \nabla f^*(\hat{x}) || = 1$ $\hat{\mathbb{P}}$-almost everywhere. Thus the two-sided penalty seems a bit too heavy-handed.
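To make the penalty term concrete, here is a numpy sketch. This is entirely my own toy setup: the critic $f(x) = a\tanh(bx)$ is one-dimensional with an analytic derivative, so no automatic differentiation is needed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D critic f(x) = a * tanh(b * x); its derivative is
# f'(x) = a * b * sech(b * x)^2, available in closed form.
a, b, lam = 2.0, 0.5, 10.0
f_grad = lambda x: a * b / np.cosh(b * x) ** 2

x = rng.normal(3.0, 1.0, 256)  # "real" samples
y = rng.normal(0.0, 1.0, 256)  # "generated" samples
t = rng.uniform(0.0, 1.0, 256)
x_hat = t * x + (1.0 - t) * y  # random points on lines between pairs

# Gulrajani et al.'s two-sided penalty: push ||grad f|| toward 1
penalty = lam * np.mean((np.abs(f_grad(x_hat)) - 1.0) ** 2)
```

In a real WGAN-GP implementation this term is computed with automatic differentiation and added to the critic's loss.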
• Spectral normalization
• Finally, Miyato et al. (2018) proposed spectral normalization. Concretely, they take a normal network but replace each layer $W$ with $\frac{W}{||W||_2}$, normalizing each layer to automatically have spectral norm $1$. To do this, they note that $||W||_2$ is the largest singular value of $W$, which may be estimated using the power method. The authors compute approximations to the dominant left and right singular vectors $u_1$ and $v_1$ of $W$ and then estimate $\sigma_1 \approx u_1^T W v_1$, but I don’t see any problem with performing the normal power method (for eigenvalues) on $W^T W$.
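The variant I have in mind, iterating on $W^T W$, looks like this in numpy. This is my own sketch, not the authors' implementation, which instead runs a single iteration per training step and reuses the running singular-vector estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))

# Power iteration on W^T W: v converges to the dominant right singular
# vector of W, and ||W v|| to the largest singular value sigma_1.
v = rng.normal(size=5)
for _ in range(200):
    v = W.T @ (W @ v)
    v /= np.linalg.norm(v)
sigma1_est = np.linalg.norm(W @ v)

W_sn = W / sigma1_est  # spectrally normalized layer: sigma_1(W_sn) = 1
```

The estimate agrees with the largest singular value from a full SVD, and the normalized layer has spectral norm $1$, hence Lipschitz constant $1$.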