# KL Divergence of Gaussians

## Preliminary: KL Divergence

Kullback–Leibler (KL) Divergence, also known as relative entropy or I-divergence, is a measure of how one probability distribution differs from another. For discrete distributions, we denote the KL Divergence of $P$ from $Q$ by:

$D_\mathrm{KL} \left( P \Vert Q \right) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$

For distributions of a continuous random variable, with densities $p$ and $q$:

$D_\mathrm{KL} \left( P \Vert Q \right) = \int p(x) \log \frac{p(x)}{q(x)} \mathrm{d}x$

Note that KL Divergence is not symmetric: in general $D_\mathrm{KL}(P \Vert Q) \neq D_\mathrm{KL}(Q \Vert P)$, so despite quantifying a kind of "distance" between distributions it is not a distance metric.
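As a quick illustration (a minimal sketch with made-up distributions; the helper `kl_divergence` is just for this example), the discrete definition can be evaluated directly, and swapping the arguments gives a different value:

```python
import numpy as np

# Two hypothetical distributions over a binary outcome.
P = np.array([0.7, 0.3])
Q = np.array([0.5, 0.5])

def kl_divergence(p, q):
    """Discrete KL Divergence D_KL(p || q), in nats."""
    return float(np.sum(p * np.log(p / q)))

print(kl_divergence(P, Q))  # ~0.0823
print(kl_divergence(Q, P))  # ~0.0872 -- a different value, so KL is not symmetric
```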

## KL Divergence of Gaussians

### Gaussian Distributions

Recall that for $x \in \mathbb{R}^k$, the density of a Gaussian with mean $\mu$ and covariance $\Sigma$ is

$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2 \pi)^k \vert \Sigma \vert}} \exp \left\{ - \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$

There is also a useful trick for handling the exponent: since $(x - \mu)^T \Sigma^{-1} (x - \mu) \in \mathbb{R}$, it equals its own trace, and by the cyclic property of the trace we have:

$(x - \mu)^T \Sigma^{-1} (x - \mu) = \mathrm{tr} \left( (x - \mu)^T \Sigma^{-1} (x - \mu) \right) = \mathrm{tr} \left( (x - \mu) (x - \mu)^T \Sigma^{-1} \right)$
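This identity is easy to sanity-check numerically; here is a small NumPy sketch with arbitrary random inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
x = rng.normal(size=(k, 1))        # column vector
mu = rng.normal(size=(k, 1))
A = rng.normal(size=(k, k))
Sigma = A @ A.T + k * np.eye(k)    # a positive-definite covariance
Sigma_inv = np.linalg.inv(Sigma)

d = x - mu
quadratic_form = float(d.T @ Sigma_inv @ d)     # (x - mu)^T Sigma^{-1} (x - mu)
trace_form = np.trace(d @ d.T @ Sigma_inv)      # tr((x - mu)(x - mu)^T Sigma^{-1})
print(np.isclose(quadratic_form, trace_form))   # True
```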

### KL Divergence of two Gaussians

Consider two Gaussian distributions:

$p(x) = \mathcal{N}(x; \mu_1, \Sigma_1), \quad q(x) = \mathcal{N}(x; \mu_2, \Sigma_2)$

We have

\begin{aligned} D_\mathrm{KL} (p \Vert q) &= \mathbb{E}_p \left[ \log p - \log q \right] \\ &= \frac{1}{2} \mathbb{E}_p \left[ \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \right] \\ &= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathbb{E}_p \left[ - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \right] \end{aligned}

Using the trick mentioned above we have

\begin{aligned} D_\mathrm{KL} (p \Vert q) &= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathbb{E}_p \left[ \mathrm{tr} \left( (x - \mu_2) (x - \mu_2)^T \Sigma_2^{-1} \right) \right] - \frac{1}{2} \mathbb{E}_p \left[ \mathrm{tr} \left( (x - \mu_1) (x - \mu_1)^T \Sigma_1^{-1} \right) \right] \\ &= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \Sigma_2^{-1} \right] \right) - \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_1) (x - \mu_1)^T \Sigma_1^{-1} \right] \right) \\ &= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \right] \Sigma_2^{-1} \right) - \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_1) (x - \mu_1)^T \right] \Sigma_1^{-1} \right) \end{aligned}

Interestingly, $\mathbb{E}_p \left[ (x - \mu_1) (x - \mu_1)^T \right] = \Sigma_1$, so the trace in the last term reduces to:

$\mathrm{tr} \left( \Sigma_1 \Sigma_1^{-1} \right) = \mathrm{tr} \left( I_k \right) = k$

The middle term, however, mixes the two distributions: the expectation is taken under $p$, while the mean and covariance belong to $q$. Expanding $x - \mu_2 = (x - \mu_1) + (\mu_1 - \mu_2)$ and noting that $\mathbb{E}_p[x - \mu_1] = 0$ (see also The Matrix Cookbook), we get $\mathbb{E}_p \left[ (x - \mu_2)(x - \mu_2)^T \right] = \Sigma_1 + (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$, so:

\begin{aligned} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \right] \Sigma_2^{-1} \right) &= \mathrm{tr} \left( \left( \Sigma_1 + (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \right) \Sigma_2^{-1} \right) \\ &= (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) + \mathrm{tr} \left( \Sigma_2^{-1} \Sigma_1 \right) \end{aligned}

And finally:

\begin{aligned} D_\mathrm{KL} (p \Vert q) &= \frac{1}{2} \left[ (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) - k + \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \mathrm{tr} \left( \Sigma_2^{-1} \Sigma_1 \right) \right] \end{aligned}
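Here is a minimal NumPy sketch of this closed form (the function `gaussian_kl` and the test values are illustrative, not from the original post); $D_\mathrm{KL}(p \Vert p) = 0$ serves as a quick sanity check:

```python
import numpy as np

def gaussian_kl(mu1, Sigma1, mu2, Sigma2):
    """Closed-form D_KL( N(mu1, Sigma1) || N(mu2, Sigma2) ), in nats."""
    k = mu1.shape[0]
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu1 - mu2
    _, logdet1 = np.linalg.slogdet(Sigma1)   # log |Sigma1|
    _, logdet2 = np.linalg.slogdet(Sigma2)   # log |Sigma2|
    return 0.5 * (diff @ Sigma2_inv @ diff - k
                  + logdet2 - logdet1
                  + np.trace(Sigma2_inv @ Sigma1))

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(np.isclose(gaussian_kl(mu, Sigma, mu, Sigma), 0.0))  # True: KL of p from itself is 0
```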

In practice, e.g. when implementing this in code, the covariances are usually diagonal, $\Sigma_j = \mathrm{diag} \{ \sigma_{j,1}, \dots, \sigma_{j,k} \}$ with variances $\sigma_{j,i}$, and the KL term is averaged over a minibatch $\mathcal{X}$, which gives

\begin{aligned} D_\mathrm{KL} (p \Vert q) &= \frac{1}{2 \vert \mathcal{X} \vert} \sum_{x \in \mathcal{X}} \sum_{i=1}^k \left[ \frac{(\mu_{1, i} - \mu_{2, i})^2}{\sigma_{2, i}} - 1 + \log \sigma_{2, i} - \log \sigma_{1, i} + \frac{\sigma_{1, i}}{\sigma_{2, i}} \right] \end{aligned}
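A small sketch of the per-sample diagonal case (hypothetical values, with `var1`, `var2` holding the diagonal variances $\sigma_{1,i}$, $\sigma_{2,i}$), cross-checked against the general matrix formula:

```python
import numpy as np

def diagonal_gaussian_kl(mu1, var1, mu2, var2):
    """Per-sample D_KL for Gaussians with diagonal covariances (var = diagonal entries)."""
    return 0.5 * np.sum((mu1 - mu2) ** 2 / var2 - 1.0
                        + np.log(var2) - np.log(var1)
                        + var1 / var2)

mu1, var1 = np.array([0.5, -1.0]), np.array([1.5, 0.8])
mu2, var2 = np.array([0.0, 2.0]), np.array([1.0, 0.5])

# Cross-check against the general formula with Sigma_j = diag(var_j).
S1, S2 = np.diag(var1), np.diag(var2)
diff = mu1 - mu2
general = 0.5 * (diff @ np.linalg.inv(S2) @ diff - len(mu1)
                 + np.log(np.linalg.det(S2) / np.linalg.det(S1))
                 + np.trace(np.linalg.inv(S2) @ S1))
print(np.isclose(diagonal_gaussian_kl(mu1, var1, mu2, var2), general))  # True
```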

### KL Divergence from $\mathcal{N}(0, I)$

Recall that in models like VAE and cVAE, the variational (evidence) lower bound (VLB) is

$\mathrm{VLB} = \mathbb{E}_{q_\phi(z \vert x)} \left[ \log p_\theta (x \vert z) \right] - D_\mathrm{KL} \left( q_\phi(z \vert x) \Vert p(z) \right)$

We seek the parameters that maximize it:

$(\phi^*, \theta^*) = \mathop{\mathrm{arg\,max}}_{\phi, \theta} \mathrm{VLB}$

Note that, following most of the literature, the KL Divergence here takes the approximate posterior $q_\phi(z \vert x)$ as its first argument and the prior $p(z)$ as its second, i.e. the opposite order from the $p$, $q$ naming used in the derivation above.

Assume that $p(z) = \mathcal{N} (z; 0, I)$ and $q_\phi(z \vert x) = \mathcal{N} (z; \mu, \Sigma)$, where $\Sigma = \mathrm{diag} \{ \sigma_1, \dots, \sigma_k\}$. Substituting $\mu_1 = \mu$, $\Sigma_1 = \Sigma$, $\mu_2 = 0$, $\Sigma_2 = I$ into the closed-form expression gives:

\begin{aligned} D_\mathrm{KL} \left( q_\phi(z \vert x) \Vert p(z) \right) &= \frac{1}{2} \left[ \Vert \mu \Vert^2_2 - k - \log \vert \Sigma \vert + \mathrm{tr} \left( \Sigma \right) \right] \\ &= - \frac{1}{2} \sum_{i=1}^k \left[ 1 + \log \sigma_{i} - \mu_{i}^2 - \sigma_{i} \right] \end{aligned}

Averaged over a minibatch $\mathcal{X}$ (with $\mu$ and $\Sigma$ being the encoder outputs for each $x$), this becomes $- \frac{1}{2 \vert \mathcal{X} \vert} \sum_{x \in \mathcal{X}} \sum_{i=1}^k \left[ 1 + \log \sigma_{i} - \mu_{i}^2 - \sigma_{i} \right]$.

In code, the encoder typically outputs `mu` and `logvar`, where `logvar` holds the element-wise $\log \sigma_i$ (the log of the diagonal variances), and the KL Divergence loss term can be computed directly from them.
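A minimal PyTorch-style sketch, assuming `mu` and `logvar` are encoder outputs of shape `(batch_size, k)` (the function name `kl_to_standard_normal` is illustrative); taking the batch mean corresponds to the $\frac{1}{\vert \mathcal{X} \vert}$ averaging above:

```python
import torch

def kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch.

    mu, logvar: tensors of shape (batch_size, k) produced by the VAE encoder.
    """
    # Per-sample KL: -0.5 * sum_i (1 + logvar_i - mu_i^2 - exp(logvar_i))
    kl_per_sample = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return kl_per_sample.mean()

# Example usage: this term is added to the reconstruction loss when maximizing the VLB.
mu = torch.randn(32, 16)
logvar = torch.zeros(32, 16)  # unit variance, so each per-sample KL is 0.5 * ||mu||^2
print(kl_to_standard_normal(mu, logvar))
```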

## References

• Mr.Esay’s Blog: KL Divergence between 2 Gaussian Distributions
• Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
• Sohn, K., Lee, H., & Yan, X. (2015). Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28.
• Petersen, K. B., & Pedersen, M. S. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.
• Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The annals of mathematical statistics, 22(1), 79-86.
• Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. The annals of probability, 146-158.