Casey Chu
• Maximum entropy, KL divergence, and Bayesian inference
• The principle of maximum likelihood is a commonly applied technique, but I think it’s mysterious. Where does it come from, and why does it make sense mathematically?
• First proposed by Jaynes (1956), the principle of maximum entropy is a method of choosing, out of a set of probability distributions, one particular distribution that purportedly best represents our state of knowledge. It goes like this. Suppose we want to assign a probability distribution to a set of outcomes to describe our knowledge of the outcomes, but we aren’t sure which distribution to assign. Let’s say that we do know that certain distributions are completely ruled out, and the distributions that are allowed are given by a set $P$. Then the principle says to choose the probability distribution that maximizes the Shannon entropy, constraining ourselves to distributions in $P$.
• At this point, recall that the Shannon entropy of a probability distribution $\mathbf{p} = (p_1, \ldots, p_n)$ on $n$ outcomes is defined to be $H(\mathbf{p}) = -\sum_{i=1}^n p_i \log p_i,$ and intuitively, it measures the amount of “uncertainty” present in the distribution. For example, as we become more and more certain of a particular outcome (that is, $p_i \to 1$ for some $i$), the entropy approaches $0$. In contrast, if we don’t know anything about the distribution (that is, $p_1 = \cdots = p_n = \frac{1}{n}$), then it can be shown that the entropy is attains its maximum value of $\log n$.
• Vaguely, therefore, this principle can be thought of as choosing the distribution in $P$ that is “least certain,” which intuitively makes some sense. In fact, many named distributions that we’ve heard of can be exhibited as instances of the principle of maximum entropy. For example, the uniform distribution, exponential distribution, geometric distribution, and normal distribution can all be viewed as maximum entropy distributions for some set $P$. I think this points to the fact that this mysterious principle of maximum entropy is fundamental in some way.
• The Wallis derivation
• In what he calls the Wallis derivation, Jaynes (2003), citing Graham Wallis, gave what seems to me to be a good justification for why the entropy is a reasonable quantity to maximize. It imagines the following random process for generating a probability distribution $\mathbf{p} = (p_1, \ldots, p_n)$. First divide the total probability mass of $1$ into $M$ “chunks of probability” each with probability $\frac{1}{M}$. Then, uniformly scatter each chunk of probability among the $n$ outcomes. The result of this process is a probability distribution given by $\mathbf{p} = (\frac{M_1}{M}, \ldots, \frac{M_n}{M})$, where $M_i$ is the number of chunks that landed in outcome $i$. Eventually, the idea is to make the chunks smaller and smaller, taking $M \to \infty$.
• The probability $p(\mathbf{p})$ of obtaining a particular distribution $\mathbf{p} = (\frac{M_1}{M}, \ldots, \frac{M_n}{M})$ from this procedure is given by the multinomial distribution $p( \mathbf{p}) = \frac{M!}{M_1 ! \cdots M_n!} n^{-M} = n^{-M}(2\pi M)^{-\frac{1}{2}(n-1)} e^{M H(\mathbf{p})} (1 + O(\tfrac{1}{M})),$ using Stirling’s approximation. To emphasize, for large $M$, the probability becomes $p(\mathbf{p}) \approx \frac{1}{Z} e^{M H(\mathbf{p})},$ where $Z$ is some normalization constant, and $H(\mathbf{p})$ is again the entropy. This means that the most probable distribution generated by the process is the one with the maximum entropy.
• Therefore, if you believe that the process described is a fair way to generate distributions “uninformatively,” then it would make sense to use the maximum entropy distribution as the least informative. It even makes sense, in the case where we have some prior knowledge that allows us to restrict possible distributions to a set $P$, to maximize the entropy over distributions $\mathbf{p} \in P$, since this corresponds to taking the most likely distribution under this process, but rejecting distributions that are generated but do not fall in $P$.
• There are several subtleties to this process that I can think of. First, why are we simply taking the most likely distribution $\mathbf{p}^*$, instead of considering the full distribution $p(\mathbf{p})$ of possible distributions? That is, why can we take a point estimate rather than the more Bayesian approach of integrating over a distribution of distributions? The answer is that the distribution of distributions is extremely concentrated at the maximum entropy distribution. To see this, let $\mathbf{p}$ be the maximum entropy distribution, and let $\mathbf{p}'$ have entropy $H(\mathbf{p}') = H(\mathbf{p}) - \epsilon$. Then its probability is $p(\mathbf{p'}) \approx \frac{1}{Z} e^{M H(\mathbf{p'})} = \frac{1}{Z} e^{M H(\mathbf{p}) - M \epsilon} \approx p(\mathbf{p})\, e^{-M\epsilon}.$ Therefore, as $M \to \infty$, any non-maximum entropy distribution $\mathbf{p}'$ becomes exponentially less likely than the maximum entropy distribution $\mathbf{p}$, no matter how small the entropy gap $\epsilon$ is. So it makes sense to consider just the maximum entropy distribution, as any other distributions have negligible probabilities anyway.
• Second, depending on the set $P$ we take (remember that $P$ represents the set of possible distributions given our prior information), there might be multiple maximum entropy distributions. In the most common cases, this doesn’t happen, but interestingly, our calculations above already prescribe what to do if it does happen. Since $p(\mathbf{p}) = p(\mathbf{p'})$ for two maximum entropy distributions, the resulting distribution over distributions is uniform over the maxima, and $0$ everywhere else.
• The KL divergence
• The Wallis derivation leads naturally to an interesting generalization. Suppose that we don’t scatter the probability chunks uniformly among the $n$ outcomes, but instead according to some other probability distribution $\mathbf{q} = (q_1, \ldots, q_n)$. Then the probability of generating a distribution $\mathbf{p}$ becomes $p(\mathbf{p}) = \frac{M!}{M_1 ! \cdots M_n!} q_1^{M_1} \cdots q_n^{M_n} = (2\pi M)^{-\frac{1}{2}(n-1)} e^{-M \mathrm{KL}(\mathbf{p} \,||\, \mathbf{q})} (1 + O(\tfrac{1}{M})),$ or $p(\mathbf{p}) \approx \frac{1}{Z} e^{-M \mathrm{KL}(\mathbf{p} \,||\, \mathbf{q})},$ where $\mathrm{KL}(\mathbf{p} \,||\, \mathbf{q}) = \sum_{i=1}^n p_i \log \frac{p_i}{q_i}.$ This last quantity is the KL divergence (Kullback-Liebler divergence) from $\mathbf{q}$ to $\mathbf{p}$, also known as the relative entropy. It’s intuitive to reason about because it behaves almost like a distance between the distributions $\mathbf{p}$ and $\mathbf{q}$, since $\mathrm{KL}(\mathbf{p} \,||\, \mathbf{q}) \ge 0$ and $\mathrm{KL}(\mathbf{p} \,||\, \mathbf{q}) = 0$ if and only if $\mathbf{p} = \mathbf{q}$.
• At this point, we arrive at a very intuitive reformulation of the principle of maximum entropy: the maximum entropy distribution is simply the distribution that minimizes its KL divergence from the uniform distribution.
• From this perspective, it’s obvious that the uniform distribution is the maximum entropy distribution when no constraints are present, since the minimum divergence distribution from the uniform distribution is obviously the uniform distribution. If we do constrain the distribution to some set $P$, we are finding the “closest” distribution from the uniform distribution that lies in the set $P$.
• This raises the question, then: is there anything really fundamental as choosing the uniform distribution as the reference distribution when computing the KL divergence? I think there isn’t. Instead, the principle of maximum entropy works with any “prior” distribution $\mathbf{q}$, and it happens that the uniform distribution is a good prior for many discrete problems.
• The uniform distribution becomes less useful for continuous problems, since it’s impossible to have a uniform distribution on all of $\R$. Indeed, the naive generalization of entropy, the differential entropy $-\int p(x) \log p(x)\,dx$ loses some of the nice properties that Shannon entropy has. Instead, it seems that the proper generalization is $-\int p(x) \log \frac{p(x)}{q(x)}\,dx$ for some reference distribution $q(x)$, which is the negative of the continuous KL divergence. In the continuous setting, the reference distribution $q(x)$ must be specified for the entropy to be well-defined, suggesting that it must be specified for the discrete case as well, and it is just that we usually take the uniform measure out of convenience.
• Bayesian inference
• The principle of maximum entropy and its formulation in terms of KL divergence prescribes a rule for going from a (often uniform) prior distribution to another distribution. Concretely, if we have a prior distribution $q$, but we learn that the set of possible distributions is actually constrained to $P$, then the rule prescribes that the prior distribution $q$ should be replaced with the distribution $p$ that minimizes $\mathrm{KL}(p\,||\,q)$ subject to $p \in P$.
• Of course, this procedure is reminiscent of Bayesian inference, which also prescribes a rule for going from a prior distribution to a new distribution, in light of new, observed data. Concretely, a prior $q(\theta)$ should be replaced with the posterior $q(\theta|x)$ after observing new data $x$. How are these two prescriptions related?
• For me, Giffin and Caticha (2007) provide one insightful answer: they show that Bayesian inference can be viewed precisely as an instance of the principle of maximum entropy. It imagines a prior joint distribution $q(x,\theta)$, which encodes both the prior $q(\theta)$ and the likelihood $q(x|\theta)$. When data $x'$ is observed, the constraint set $P$ is restricted to only distributions $p(x,\theta)$ that match the observed data, i.e. those for which $p(x) = \delta(x-x')$. As per the principle of maximum entropy (or, more precisely, minimum KL divergence), we then replace the prior joint distribution $q(x,\theta)$ with the distribution $p(x,\theta)$ that minimizes $\mathrm{KL}(p(x,\theta)\,||\,q(x,\theta))$ subject to $p \in P$. We will show that we can recover the traditional posterior $q(\theta|x)$ from this maximum entropy distribution $p(x,\theta)$.
• The key observation is that $\mathrm{KL}(p(x,\theta)\,||\,q(x,\theta)) = \int \mathrm{KL}(p(\theta|x)\,||\,q(\theta|x)) \,p(x)\,dx + \mathrm{KL}(p(x)\,||\,q(x)).$ Under the constraint that $p(x) = \delta(x - x')$, the first term on the right-hand side becomes simply $\mathrm{KL}(p(\theta|x')\,||\,q(\theta|x'))$, while the second term on the right-hand side becomes a constant. Minimizing this quantity is therefore achieved by ensuring that $p(\theta|x') = q(\theta| x')$, which can be done while satisfying the constraint by setting $p(x,\theta) = p(x)p(\theta|x) = \delta(x-x') q(\theta|x)$. Therefore, the distribution prescribed by the principle of maximum entropy after observing data $x'$ is completely captured by the Bayesian posterior $q(\theta | x')$ (and the observed data $x'$).
• This gives an equivalence between the principle of maximum entropy and the conditioning procedure prescribed by Bayesian inference.
• Originally written July 6, 2018.