KL Divergence of Gaussians

Preliminary: KL Divergence

Kullback–Leibler (KL) Divergence, also known as the relative entropy or I-divergence, quantifies how one probability distribution $P$ differs from a reference distribution $Q$. For discrete distributions, we denote the KL Divergence of $P$ from $Q$ by:

$$
D_\mathrm{KL} \left( P \Vert Q \right) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
$$

For distributions of a continuous random variable, with densities $p$ and $q$:

$$
D_\mathrm{KL} \left( P \Vert Q \right) = \int p(x) \log \frac{p(x)}{q(x)} \, \mathrm{d}x
$$

Clearly, KL Divergence is not symmetric: in general $D_\mathrm{KL}(P \Vert Q) \neq D_\mathrm{KL}(Q \Vert P)$, so it is not a true distance metric.
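As a quick numerical illustration of the asymmetry, here is a minimal sketch in PyTorch (the two distributions below are arbitrary examples chosen for this post, not from any model):

import torch

# Two arbitrary discrete distributions over three outcomes
P = torch.tensor([0.6, 0.3, 0.1])
Q = torch.tensor([0.3, 0.3, 0.4])

def kl_discrete(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)), in nats
    return torch.sum(p * torch.log(p / q))

print(kl_discrete(P, Q))  # ~0.277
print(kl_discrete(Q, P))  # ~0.347, a different value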

KL Divergence of Gaussians

Gaussian Distributions

Recall the density of a $k$-dimensional Gaussian:

$$
\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2 \pi)^k \vert \Sigma \vert}} \exp \left\{ - \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}
$$

And here’s also a trick for handling the exponent: since $(x - \mu)^T \Sigma^{-1} (x - \mu) \in \mathbb{R}$ is a scalar, it equals its own trace, and by the cyclic property of the trace we have:

$$
(x - \mu)^T \Sigma^{-1} (x - \mu) = \mathrm{tr} \left( (x - \mu)^T \Sigma^{-1} (x - \mu) \right) = \mathrm{tr} \left( (x - \mu) (x - \mu)^T \Sigma^{-1} \right)
$$
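This identity is easy to sanity-check numerically; below is a small sketch with arbitrary random inputs (the dimension $k = 4$ is just an example):

import torch

k = 4
x, mu = torch.randn(k), torch.randn(k)
# A random symmetric positive-definite covariance matrix
A = torch.randn(k, k)
Sigma_inv = torch.linalg.inv(A @ A.T + k * torch.eye(k))

d = (x - mu).unsqueeze(1)               # column vector of shape (k, 1)
quad = (d.T @ Sigma_inv @ d).squeeze()  # the scalar quadratic form
tr = torch.trace(d @ d.T @ Sigma_inv)   # the trace form after cyclic permutation
print(torch.allclose(quad, tr))         # True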

KL Divergence of two Gaussians

Consider two Gaussian distributions:

$$
p(x) = \mathcal{N}(x; \mu_1, \Sigma_1), \quad q(x) = \mathcal{N}(x; \mu_2, \Sigma_2)
$$

We have

$$
\begin{aligned}
D_\mathrm{KL} (p \Vert q)
&= \mathbb{E}_p \left[ \log p - \log q \right] \\
&= \frac{1}{2} \mathbb{E}_p \left[ \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \right] \\
&= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathbb{E}_p \left[ (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) \right]
\end{aligned}
$$

Using the trace trick mentioned above, we have

$$
\begin{aligned}
D_\mathrm{KL} (p \Vert q)
&= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathbb{E}_p \left[ \mathrm{tr} \left( (x - \mu_2) (x - \mu_2)^T \Sigma_2^{-1} \right) \right] - \frac{1}{2} \mathbb{E}_p \left[ \mathrm{tr} \left( (x - \mu_1) (x - \mu_1)^T \Sigma_1^{-1} \right) \right] \\
&= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \Sigma_2^{-1} \right] \right) - \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_1) (x - \mu_1)^T \Sigma_1^{-1} \right] \right) \\
&= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \right] \Sigma_2^{-1} \right) - \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_1) (x - \mu_1)^T \right] \Sigma_1^{-1} \right)
\end{aligned}
$$

Interestingly, $\mathbb{E}_p \left[ (x - \mu_1) (x - \mu_1)^T \right] = \Sigma_1$ by the definition of covariance, so the trace in the last term is equal to:

$$
\mathrm{tr} \left( \Sigma_1 \Sigma_1^{-1} \right) = \mathrm{tr} \left( I_k \right) = k
$$

But the second term involves two different distributions. Writing $x - \mu_2 = (x - \mu_1) + (\mu_1 - \mu_2)$, the cross terms vanish under $\mathbb{E}_p$, giving $\mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \right] = \Sigma_1 + (\mu_1 - \mu_2) (\mu_1 - \mu_2)^T$ (this is the identity from the Matrix Cookbook). Therefore:

$$
\begin{aligned}
\mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \right] \Sigma_2^{-1} \right)
&= (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) + \mathrm{tr} \left( \Sigma_2^{-1} \Sigma_1 \right)
\end{aligned}
$$
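This second-moment identity can itself be verified by Monte Carlo sampling; the sketch below uses arbitrary dimensions and a sample count chosen only to keep the noise small:

import torch

k = 3
mu1, mu2 = torch.randn(k), torch.randn(k)
A = torch.randn(k, k)
Sigma1 = A @ A.T + k * torch.eye(k)

# Draw samples x ~ p = N(mu1, Sigma1) via a Cholesky factor
L = torch.linalg.cholesky(Sigma1)
x = mu1 + torch.randn(200_000, k) @ L.T

# Empirical E_p[(x - mu2)(x - mu2)^T]
d = x - mu2
empirical = d.T @ d / x.shape[0]

# Closed form: Sigma1 + (mu1 - mu2)(mu1 - mu2)^T
m = (mu1 - mu2).unsqueeze(1)
print(torch.allclose(empirical, Sigma1 + m @ m.T, atol=0.1))  # True up to sampling noise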

And finally:

$$
\begin{aligned}
D_\mathrm{KL} (p \Vert q)
&= \frac{1}{2} \left[ (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) - k + \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \mathrm{tr} \left( \Sigma_2^{-1} \Sigma_1 \right) \right]
\end{aligned}
$$
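This closed form can be cross-checked against torch.distributions (a minimal sketch; the parameters are random, and adding $k I$ is just one convenient way to build positive-definite covariances):

import torch
from torch.distributions import MultivariateNormal, kl_divergence

k = 3
mu1, mu2 = torch.randn(k), torch.randn(k)
A, B = torch.randn(k, k), torch.randn(k, k)
Sigma1 = A @ A.T + k * torch.eye(k)
Sigma2 = B @ B.T + k * torch.eye(k)

# The closed form derived above
Sigma2_inv = torch.linalg.inv(Sigma2)
m = mu1 - mu2
closed = 0.5 * (m @ Sigma2_inv @ m - k
                + torch.logdet(Sigma2) - torch.logdet(Sigma1)
                + torch.trace(Sigma2_inv @ Sigma1))

# PyTorch's reference implementation
ref = kl_divergence(MultivariateNormal(mu1, Sigma1), MultivariateNormal(mu2, Sigma2))
print(torch.allclose(closed, ref))  # True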

Further, in programming we usually work with diagonal covariances $\Sigma_j = \mathrm{diag} \{ \sigma_{j, 1}, \dots, \sigma_{j, k} \}$ (where the $\sigma_{j, i}$ are variances) and average the loss over a batch $\mathcal{X}$, so we consider

$$
\begin{aligned}
D_\mathrm{KL} (p \Vert q)
&= \frac{1}{2 \vert \mathcal{X} \vert} \sum_{x \in \mathcal{X}} \sum_{i=1}^k \left[ \frac{(\mu_{1, i} - \mu_{2, i})^2}{\sigma_{2, i}} - 1 + \log \sigma_{2, i} - \log \sigma_{1, i} + \frac{\sigma_{1, i}}{\sigma_{2, i}} \right]
\end{aligned}
$$

mu1, logvar1, mu2, logvar2 = output
# 0.5 * [ (mu1 - mu2)^2 / var2 - 1 + logvar2 - logvar1 + var1 / var2 ], summed over dims, batch-averaged
kl_loss = 0.5 * torch.sum((mu1 - mu2).pow(2) / logvar2.exp() - 1 + logvar2 - logvar1 + (logvar1 - logvar2).exp(), dim=-1).mean()
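As a sanity check, the same value can be obtained from torch.distributions.Normal, which treats a diagonal Gaussian element-wise (note that Normal takes standard deviations, hence the exponent of half the log-variance):

import torch
from torch.distributions import Normal, kl_divergence

mu1, logvar1 = torch.randn(8, 4), torch.randn(8, 4)
mu2, logvar2 = torch.randn(8, 4), torch.randn(8, 4)

ref = kl_divergence(Normal(mu1, (0.5 * logvar1).exp()),
                    Normal(mu2, (0.5 * logvar2).exp())).sum(dim=-1).mean()
kl_loss = 0.5 * torch.sum((mu1 - mu2).pow(2) / logvar2.exp() - 1 + logvar2 - logvar1 + (logvar1 - logvar2).exp(), dim=-1).mean()
print(torch.allclose(kl_loss, ref))  # True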

KL Divergence from $\mathcal{N}(0, I)$

Recall that in models like VAE and cVAE, the variational lower bound (VLB, also known as the evidence lower bound, ELBO) is

$$
\mathrm{VLB} = \mathbb{E}_{q_\phi(z \vert x)} \left[ \log p_\theta (x \vert z) \right] - D_\mathrm{KL} \left( q_\phi(z \vert x) \Vert p(z) \right)
$$

We want the optimal parameters that maximize it:

$$
(\phi^*, \theta^*) = \arg\max_{\phi, \theta} \mathrm{VLB}
$$

Note that, to align with most of the literature, the KL Divergence here is $D_\mathrm{KL}(q_\phi \Vert p)$, i.e. taken of $q_\phi$ from the prior $p$.

And we assume that $p(z) = \mathcal{N} (z; 0, I)$ and $q_\phi(z \vert x) = \mathcal{N} (z; \mu, \Sigma)$, where $\Sigma = \mathrm{diag} \{ \sigma_1, \dots, \sigma_k \}$ with $\sigma_i$ the variances. Substituting $\mu_1 = \mu$, $\Sigma_1 = \Sigma$, $\mu_2 = 0$, $\Sigma_2 = I$ into the result above gives:

$$
\begin{aligned}
D_\mathrm{KL} \left( q_\phi(z \vert x) \Vert p(z) \right)
&= \frac{1}{2} \left[ \Vert \mu \Vert^2_2 - k - \log \vert \Sigma \vert + \mathrm{tr} (\Sigma) \right] \\
&= - \frac{1}{2} \sum_{i=1}^k \left[ 1 + \log \sigma_{i} - \mu_{i}^2 - \sigma_{i} \right]
\end{aligned}
$$

In coding, we get mu and logvar (the element-wise logarithm of the variances, $\log \sigma_i$); averaging over the batch, the KL Divergence loss term can be computed with:

mu, logvar = output
# -0.5 * sum_i (1 + log var_i - mu_i^2 - var_i), summed over latent dims, batch-averaged
kld_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
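For context, here is a hedged sketch of how this term typically enters a VAE training objective; the reparameterization helper, the decoder call, and the weight beta are illustrative assumptions, not from the original papers:

import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(logvar / 2)
    return mu + (0.5 * logvar).exp() * torch.randn_like(mu)

def vae_loss(decoder, x, mu, logvar, beta=1.0):
    z = reparameterize(mu, logvar)
    # Reconstruction term, a stand-in for E_q[log p_theta(x|z)]
    recon = F.mse_loss(decoder(z), x)
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
    # Maximizing the VLB corresponds to minimizing recon + beta * kld
    return recon + beta * kld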

KL Divergence by Sampling

Pending…

References

  • Mr.Esay’s Blog: KL Divergence between 2 Gaussian Distributions
  • Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Sohn, K., Lee, H., & Yan, X. (2015). Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28.
  • Petersen, K. B., & Pedersen, M. S. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.
  • Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The annals of mathematical statistics, 22(1), 79-86.
  • Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. The annals of probability, 146-158.