Flavors of Wasserstein GAN
Arjovsky et al. (2017) introduced the Wasserstein GAN (WGAN), a formulation of GANs (generative adversarial networks) that attempts to minimize the Wasserstein distance between the generator's distribution and the data distribution. It relies on one particularly useful expression for this distance, the Kantorovich-Rubinstein formula, which expresses the Wasserstein distance between two distributions $\mathbb{P}$ and $\mathbb{Q}$ as

$$W(\mathbb{P}, \mathbb{Q}) = \sup_{f \in \mathcal{F}} \mathbb{E}_{x \sim \mathbb{P}}[f(x)] - \mathbb{E}_{x \sim \mathbb{Q}}[f(x)],$$

where $\mathcal{F}$ is the set of all $1$-Lipschitz functions. We'll talk about what it means for $f$ to be $1$-Lipschitz later. Intuitively, all it means is that $f(x)$ doesn't change too quickly as $x$ changes.
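As a quick sanity check of the formula, here is a numpy sketch on a toy one-dimensional example of my own (not from the paper): for two equal-size point clouds on the line, the Wasserstein-1 distance is the mean absolute difference of the sorted samples, and when one cloud is a pure shift of the other, the $1$-Lipschitz test function $f(x) = x$ already attains the supremum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from P, and a copy shifted by 3: the true W1 distance is exactly 3.
p = rng.normal(0.0, 1.0, size=10000)
q = p + 3.0

# Primal: 1-D optimal transport matches sorted samples to sorted samples.
w1_primal = np.mean(np.abs(np.sort(q) - np.sort(p)))

# Dual: plug the 1-Lipschitz function f(x) = x into the KR formula.
w1_dual = np.mean(q) - np.mean(p)

print(np.isclose(w1_primal, 3.0), np.isclose(w1_dual, 3.0))  # True True
```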
The WGAN algorithm
The WGAN algorithm works as follows. Let $\mathbb{P}_r$ be the data distribution that we're trying to mimic, and let $\mathbb{P}_\theta$ be the generated distribution, parameterized by $\theta$. Usually, $\mathbb{P}_\theta$ is the distribution of $g_\theta(z)$, where $g_\theta$ is some neural network that transforms a noise sample $z \sim p(z)$ into a sample. To train $\mathbb{P}_\theta$ to approximate $\mathbb{P}_r$, we approximately minimize the distance $W(\mathbb{P}_r, \mathbb{P}_\theta)$. In particular, we use the approximation

$$\hat{W}(\mathbb{P}_r, \mathbb{P}_\theta) = \hat{\mathbb{E}}_{x \sim \mathbb{P}_r}[f_w(x)] - \hat{\mathbb{E}}_{z \sim p(z)}[f_w(g_\theta(z))],$$

where $f_w$ is a critic, and $\hat{\mathbb{E}}$ denotes an approximation to the expectation by a finite sample. Notice that if $f_w$ is indeed $1$-Lipschitz, then by the Kantorovich-Rubinstein formula above,

$$\mathbb{E}[\hat{W}(\mathbb{P}_r, \mathbb{P}_\theta)] \le W(\mathbb{P}_r, \mathbb{P}_\theta).$$

In other words, $\hat{W}$ is a lower bound to the true Wasserstein distance in expectation. The idea, then, is to maximize $\hat{W}$ with respect to $w$ to get the best lower bound for the true Wasserstein distance, and then perform a gradient step on $\theta$ to decrease our approximation $\hat{W}$, hopefully decreasing the true distance as a result.
Concretely, we perform the following steps:
1. Optimize critic. Perform gradient steps on $w$ to maximize $\hat{W}(\mathbb{P}_r, \mathbb{P}_\theta)$, somehow constraining $f_w$ to $1$-Lipschitz functions, thus improving our approximation to the Wasserstein distance.
2. Optimize generator. Perform gradient steps on $\theta$ to minimize the approximate distance $\hat{W}(\mathbb{P}_r, \mathbb{P}_\theta)$.
3. Repeat. Go to step 1.
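These steps can be sketched in numpy on a one-dimensional toy problem (the linear critic and shift generator here are my own illustrative choices, not from the paper): with data $\sim N(3, 1)$, generator $g_\theta(z) = z + \theta$, and critic $f_w(x) = wx$, clipping $w$ to $[-1, 1]$ makes the critic exactly $1$-Lipschitz, and every gradient can be written by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D WGAN: data ~ N(3, 1), generator g_theta(z) = z + theta, z ~ N(0, 1),
# critic f_w(x) = w * x.  Clipping w to [-1, 1] enforces the Lipschitz constraint.
theta, w = 0.0, 0.0
lr_critic, lr_gen, clip = 0.1, 0.05, 1.0
history = []

for step in range(500):
    x = rng.normal(3.0, 1.0, size=256)           # real samples
    g = rng.normal(0.0, 1.0, size=256) + theta   # generated samples
    # 1. Critic steps: ascend W_hat(w) = mean f_w(x) - mean f_w(g), then clip.
    for _ in range(5):
        w = np.clip(w + lr_critic * (x.mean() - g.mean()), -clip, clip)
    # 2. Generator step: descend W_hat in theta; d W_hat / d theta = -w.
    theta -= lr_gen * (-w)
    history.append(theta)

estimate = np.mean(history[-100:])   # average out the oscillation around the optimum
print(abs(estimate - 3.0) < 0.5)     # theta hovers around the data mean, 3.0
```

The critic saturates at $w = \pm 1$ (the best $1$-Lipschitz linear function), and the generator's mean is pushed toward the data mean, exactly the dynamics described above.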
Once $g_\theta$ is trained, the hope is that $\mathbb{P}_\theta$ will accurately model the data distribution $\mathbb{P}_r$.
The Lipschitz condition
One interesting degree of freedom in this algorithm is how to constrain $f_w$ to a family of $1$-Lipschitz functions. Recall that $f$ is called $K$-Lipschitz if $|f(x) - f(y)| \le K \|x - y\|$ for all $x, y$. Notice that we can actually relax the Lipschitz condition a little: $f$ is $K$-Lipschitz if and only if $f/K$ is $1$-Lipschitz, so optimizing over $K$-Lipschitz functions instead of $1$-Lipschitz functions simply scales the optimal $\hat{W}$ by $K$. Thus so long as $K$ is fixed, nothing changes, except for the scale of the optimal $\hat{W}$. (This does affect the effective learning rate of the generator step.) Let's denote the Lipschitz constant of a function $f$ as $\|f\|_L$, the smallest constant $K$ for which $f$ is $K$-Lipschitz. (If we restrict to functions with $f(0) = 0$, then $\|\cdot\|_L$ actually becomes a true norm. In our application, we can safely assume $f(0) = 0$, since we can always subtract an arbitrary constant; nothing changes because we only access $f$ through differences or derivatives.)
One way to control the Lipschitz constant of $f$ is through its derivative. Specifically, if $f$ is continuous and differentiable, then $f$ is $K$-Lipschitz if and only if $\|\nabla f(x)\| \le K$ everywhere. To see this, first suppose $f$ is not $K$-Lipschitz. Then there exist $x$ and $y$ such that $|f(y) - f(x)| > K\|y - x\|$. If we define $h(t) = f(x + t(y - x))$, then the mean value theorem says that there exists a $t_0 \in (0, 1)$ such that $|h'(t_0)| = |h(1) - h(0)| = |f(y) - f(x)| > K\|y - x\|$. But $h'(t_0) = \nabla f(z) \cdot (y - x)$, where $z = x + t_0(y - x)$. Hence by Cauchy-Schwarz, $K\|y - x\| < |\nabla f(z) \cdot (y - x)| \le \|\nabla f(z)\| \|y - x\|$, or $\|\nabla f(z)\| > K$, and it is not the case that $\|\nabla f\| \le K$ everywhere. Conversely, suppose $\|\nabla f(x)\| > K$ for some $x$. Then there exists a unit vector $v$ such that $\nabla f(x) \cdot v > K$, or

$$\lim_{h \to 0} \frac{f(x + hv) - f(x)}{h} > K.$$

In particular there exists some $h$ such that $|f(x + hv) - f(x)| > K\|hv\|$, and hence $f$ is not $K$-Lipschitz (set $y = x + hv$). Therefore, to enforce that $f$ is $K$-Lipschitz, we may prescribe that $\|\nabla f(x)\| \le K$ everywhere.
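A quick numerical illustration of this equivalence (my own toy check, not from the papers): since $|\frac{d}{dx}\sin x| = |\cos x| \le 1$ everywhere, $\sin$ is $1$-Lipschitz, so no difference quotient can exceed $1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# sin has |derivative| <= 1 everywhere, so it is 1-Lipschitz: every difference
# quotient |sin(x) - sin(y)| / |x - y| must be at most 1.
x = rng.uniform(-10, 10, size=1000)
y = rng.uniform(-10, 10, size=1000)
ratios = np.abs(np.sin(x) - np.sin(y)) / np.abs(x - y)

print(bool(ratios.max() <= 1.0))  # True
```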
The spectral norm
Another way to control the Lipschitz constant of $f$ is to take advantage of the particular structure of neural networks. If we assume that $f$ is a neural network, so that $f$ alternates affine transformations $A_i$ and nonlinearities $\phi_i$,

$$f = \phi_n \circ A_n \circ \cdots \circ \phi_1 \circ A_1,$$

then since compositions of functions behave nicely with the Lipschitz norm, $\|g \circ h\|_L \le \|g\|_L \|h\|_L$, we may conclude that

$$\|f\|_L \le \prod_{i=1}^n \|\phi_i\|_L \|A_i\|_L.$$

In other words, $f$ will be Lipschitz if we constrain every layer to be Lipschitz. For the linear layers $A(x) = Wx + b$, remarkably we have that $\|A\|_L = \|W\|_{\mathrm{op}}$, where

$$\|W\|_{\mathrm{op}} = \sup_{x \ne 0} \frac{\|Wx\|}{\|x\|}$$

is the operator norm. Furthermore, when $\|\cdot\|$ is the Euclidean norm, the operator norm is called the spectral norm. The spectral norm of $W$ is the largest singular value of the matrix $W$, usually denoted $\sigma(W)$. Therefore, to control the Lipschitz constant of a layer, all we need to do is bound the largest singular value of its weight matrix $W$.
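The composition bound is easy to see numerically. In this numpy sketch (random weights and dimensions are my own arbitrary choices), an empirical Lipschitz estimate for a tiny ReLU network from random pairs of points never exceeds the product of the layers' spectral norms, since ReLU itself is $1$-Lipschitz.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(1, 8))

def f(x):
    # Tiny two-layer ReLU network (biases omitted for brevity).
    return (W2 @ np.maximum(W1 @ x, 0.0)).item()

# Lipschitz bound: product of the layers' spectral norms (largest singular values).
bound = np.linalg.svd(W1, compute_uv=False)[0] * np.linalg.svd(W2, compute_uv=False)[0]

# Empirical estimate from random pairs of inputs.
xs, ys = rng.normal(size=(1000, 4)), rng.normal(size=(1000, 4))
est = max(abs(f(x) - f(y)) / np.linalg.norm(x - y) for x, y in zip(xs, ys))

print(est <= bound)  # True
```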
As for the nonlinearities, most in use are Lipschitz. The only one I can think of that we need to be careful with is the parametric ReLU; since its slope is trainable, it can have an unbounded Lipschitz constant.
So far, three main ways to enforce the Lipschitz constraint have been proposed in the literature: weight clipping, gradient penalty, and spectral normalization.
Arjovsky et al. (2017) originally proposed weight clipping as a crude way to enforce the Lipschitz condition (a “clearly terrible way,” in their own words). After each gradient step on $w$, they clip the weights to only take on values in $[-c, c]$ for some constant $c$. This is an example of the second approach, control with the spectral norm of each layer. The claim is that the family of neural networks with weights between $-c$ and $c$ is $K$-Lipschitz for some $K$. One neat way to see this is to notice that $\sigma(W)$ is the square root of the largest eigenvalue of $W^T W$, and the eigenvalues of $W^T W$ are bounded by $mnc^2$ by the Gershgorin circle theorem, where $W$ has $m$ rows and $n$ columns (each entry of $W^T W$ is a sum of $m$ products of clipped weights, hence at most $mc^2$ in absolute value). Hence $\sigma(W)$ is bounded by $c\sqrt{mn}$, so the family of networks we're optimizing over is $K$-Lipschitz for some $K$.

So why is it a clearly terrible way? The key is capacity underuse. Notice that the optimal function $f$ will have a Lipschitz constant of exactly $K$, since otherwise we can scale our function to have Lipschitz constant $K$ and achieve a larger value for the objective

$$\hat{\mathbb{E}}_{x \sim \mathbb{P}_r}[f(x)] - \hat{\mathbb{E}}_{z \sim p(z)}[f(g_\theta(z))].$$

This means that, effectively, the space of functions that we are optimizing over is limited to the space of functions with Lipschitz constant exactly $K$. If we perform weight clipping, the functions with the maximum Lipschitz constant are those where each weight is $\pm c$, achieving the bound mentioned above of $c\sqrt{mn}$ per layer. Gulrajani et al. (2017) found that training WGANs with weight clipping indeed leads to weight matrices populated by $\pm c$. Such matrices are not very expressive and cannot discriminate well between real and fake samples.
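The $c\sqrt{mn}$ bound is easy to check numerically (a quick sketch; the clipping constant and dimensions are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
c, m, n = 0.01, 64, 32

# Clip every weight to [-c, c], as in the WGAN critic update.
W = np.clip(rng.normal(size=(m, n)), -c, c)

# Every entry of W^T W is at most m * c^2 in absolute value, so by the
# Gershgorin circle theorem its eigenvalues are at most m * n * c^2,
# giving sigma(W) <= c * sqrt(m * n).
print(np.linalg.svd(W, compute_uv=False)[0] <= c * np.sqrt(m * n))  # True
```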
As a gentler alternative, Gulrajani et al. (2017) proposed directly penalizing the gradient to have norm $1$ by adding a regularization term to the loss,

$$\lambda \, \hat{\mathbb{E}}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\left[ \left( \|\nabla_{\hat{x}} f(\hat{x})\| - 1 \right)^2 \right],$$

where $\mathbb{P}_{\hat{x}}$ is the following distribution: let $\hat{x}$ be a random point between $x$ and $y$, two random points from the data distribution and the generator distribution respectively. Concretely, $\hat{x} = \epsilon x + (1 - \epsilon) y$, where $\epsilon \sim U[0, 1]$, $x \sim \mathbb{P}_r$, and $y \sim \mathbb{P}_\theta$. As mentioned above, constraining $\|\nabla f\|$ to be at most $1$ everywhere would constrain $f$ to be $1$-Lipschitz. With this approach, however, we penalize the function to satisfy the Lipschitz condition only on straight lines connecting the support of $\mathbb{P}_r$ with the support of $\mathbb{P}_\theta$. I believe that enforcing this weaker constraint is sufficient from the perspective of the Kantorovich-Rubinstein formula and optimal transport theory; if you have a proof, please let me know! A good reference is Villani's Optimal Transport, Old and New. I find Gulrajani et al.'s justification for why they penalize the gradient norm to be exactly $1$ (rather than merely at most $1$) to be less convincing, though. They cite a theorem that says that for the optimal $f$, $\|\nabla f(\hat{x})\| = 1$ holds $\pi$-almost surely on lines between pairs $(x, y) \sim \pi$, where $\pi$ is the optimal coupling between $\mathbb{P}_r$ and $\mathbb{P}_\theta$. However, in their penalty, they sample from $\mathbb{P}_r \times \mathbb{P}_\theta$, whose support is much less sparse than $\pi$'s; it is not true generally that $\|\nabla f(\hat{x})\| = 1$ holds $\mathbb{P}_{\hat{x}}$-almost everywhere. Thus the two-sided penalty seems a bit too heavy-handed.
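For concreteness, here is the penalty term computed in numpy for a hypothetical linear critic $f(x) = a \cdot x$ (my own toy choice, so the gradient is the constant vector $a$ and no automatic differentiation is needed). With $\|a\| = 1$, the critic is exactly $1$-Lipschitz and the penalty vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)

a = np.array([1.0, 0.0, 0.0])   # linear critic f(x) = a . x, with ||a|| = 1
lam = 10.0                      # penalty coefficient lambda

x_real = rng.normal(size=(128, 3))          # stand-ins for data samples
x_fake = rng.normal(size=(128, 3))          # stand-ins for generator samples
eps = rng.uniform(size=(128, 1))
x_hat = eps * x_real + (1 - eps) * x_fake   # points on the connecting lines

# For a linear critic, grad f(x_hat) = a at every point.
grad_norms = np.full(len(x_hat), np.linalg.norm(a))
penalty = lam * np.mean((grad_norms - 1.0) ** 2)

print(penalty)  # 0.0, since ||a|| = 1 exactly
```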
Nonetheless, the gradient penalty has proven empirically successful at training GANs.
Miyato et al. (2018) propose a third alternative that takes advantage of the special structure of neural networks by bounding the spectral norm of each layer, as mentioned above. Concretely, they take a normal network but replace each layer's weight matrix $W$ with $W / \sigma(W)$, normalizing each layer to automatically have spectral norm $1$. To do this, they note that $\sigma(W)$ is the largest singular value of $W$, which may be estimated using the power method. The authors calculate both the dominant left and right singular vectors $u$ and $v$ of $W$ and then estimate $\sigma(W) \approx u^T W v$, but I don't see any problem with performing the normal power method (for eigenvalues) on $W^T W$.
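Here is a numpy sketch of the idea (the random matrix and iteration count are my own choices; in practice one power step per training step is reused across iterations): alternately updating estimates of the dominant left and right singular vectors recovers $\sigma(W)$, and dividing by it yields a matrix with spectral norm $1$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 10))

# Power iteration in the style of Miyato et al.: alternate updates of the
# dominant left (u) and right (v) singular vector estimates.
u = rng.normal(size=20)
for _ in range(500):
    v = W.T @ u
    v /= np.linalg.norm(v)
    u = W @ v
    u /= np.linalg.norm(u)

sigma = u @ W @ v    # estimate of the largest singular value
W_sn = W / sigma     # spectrally normalized weight matrix

print(np.isclose(np.linalg.svd(W_sn, compute_uv=False)[0], 1.0))  # True
```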
This approach seems to be both mathematically elegant and computationally inexpensive. The only problem I see is that bounding the spectral norm of each layer does not tightly bound the Lipschitz constant of the entire network. But perhaps in practice the bound is tight enough.
Originally written March 31, 2018.