
KL Divergence of Gaussians

Preliminary: KL Divergence

Kullback–Leibler (KL) Divergence, also known as the relative entropy or I-divergence, quantifies the difference between two probability distributions; note that it is not a true distance metric, as we will see below. We denote the KL Divergence of $P$ from $Q$ as:

$$D_\mathrm{KL} \left( P \Vert Q \right) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

For distributions of a continuous random variable:

$$D_\mathrm{KL} \left( P \Vert Q \right) = \int p(x) \log \frac{p(x)}{q(x)} \, \mathrm{d}x$$

Clearly, KL Divergence is not symmetric: $D_\mathrm{KL}(P \Vert Q) \neq D_\mathrm{KL}(Q \Vert P)$ in general.
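As a quick numerical illustration of this asymmetry, here is a minimal PyTorch sketch (the two discrete distributions below are arbitrary examples, not from the text):

```python
import torch

# Two arbitrary discrete distributions over three outcomes (illustrative values).
P = torch.tensor([0.7, 0.2, 0.1])
Q = torch.tensor([0.4, 0.4, 0.2])

def kl(p, q):
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))
    return torch.sum(p * torch.log(p / q))

print(kl(P, Q))  # ~0.18
print(kl(Q, P))  # ~0.19 -- a different value, so KL is not symmetric
```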

KL Divergence of Gaussians

Gaussian Distributions

Recall that

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2 \pi)^k \vert \Sigma \vert}} \exp \left\{ - \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$

where $x, \mu \in \mathbb{R}^k$ and $\Sigma \in \mathbb{R}^{k \times k}$ is the covariance matrix.

There is also a useful trick for handling the quadratic term in the exponent: since $(x - \mu)^T \Sigma^{-1} (x - \mu) \in \mathbb{R}$, the trace trick (cyclic invariance of the trace) gives:

$$(x - \mu)^T \Sigma^{-1} (x - \mu) = \mathrm{tr} \left( (x - \mu)^T \Sigma^{-1} (x - \mu) \right) = \mathrm{tr} \left( (x - \mu) (x - \mu)^T \Sigma^{-1} \right)$$
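Below is a small numerical check of this identity in PyTorch; the random vectors and the positive-definite $\Sigma$ are made up for illustration only:

```python
import torch

torch.manual_seed(0)
k = 4
x, mu = torch.randn(k), torch.randn(k)
A = torch.randn(k, k)
Sigma = A @ A.T + torch.eye(k)            # a random symmetric positive-definite covariance
Sigma_inv = torch.linalg.inv(Sigma)

d = (x - mu).unsqueeze(1)                 # column vector, shape (k, 1)
quad = (d.T @ Sigma_inv @ d).squeeze()    # (x - mu)^T Sigma^{-1} (x - mu), a scalar
trace_form = torch.trace(d @ d.T @ Sigma_inv)  # tr((x - mu)(x - mu)^T Sigma^{-1})
print(torch.allclose(quad, trace_form))   # True
```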

KL Divergence of two Gaussians

Consider two Gaussian distributions:

$$p(x) = \mathcal{N}(x; \mu_1, \Sigma_1), \quad q(x) = \mathcal{N}(x; \mu_2, \Sigma_2)$$

We have

$$\begin{aligned} D_\mathrm{KL} (p \Vert q) &= \mathbb{E}_p \left[ \log p - \log q \right] \\ &= \frac{1}{2} \mathbb{E}_p \left[ \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \right] \\ &= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathbb{E}_p \left[ - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \right] \end{aligned}$$

Using the trace trick mentioned above, we have

$$\begin{aligned} D_\mathrm{KL} (p \Vert q) &= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathbb{E}_p \left[ \mathrm{tr} \left( (x - \mu_2) (x - \mu_2)^T \Sigma_2^{-1} \right) \right] - \frac{1}{2} \mathbb{E}_p \left[ \mathrm{tr} \left( (x - \mu_1) (x - \mu_1)^T \Sigma_1^{-1} \right) \right] \\ &= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \Sigma_2^{-1} \right] \right) - \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_1) (x - \mu_1)^T \Sigma_1^{-1} \right] \right) \\ &= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \right] \Sigma_2^{-1} \right) - \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_1) (x - \mu_1)^T \right] \Sigma_1^{-1} \right) \end{aligned}$$

Note that $\mathbb{E}_p \left[ (x - \mu_1)(x - \mu_1)^T \right] = \Sigma_1$ by the definition of the covariance of $p$, so the trace in the last term above reduces to:

$$\mathrm{tr} \left( \Sigma_1 \Sigma_1^{-1} \right) = \mathrm{tr}(I_k) = k$$

The second term, however, involves two different distributions: the expectation is taken under $p$ while the quadratic form uses $\mu_2$ and $\Sigma_2$. Writing $x - \mu_2 = (x - \mu_1) + (\mu_1 - \mu_2)$ and using $\mathbb{E}_p[x] = \mu_1$ (see also The Matrix Cookbook), we get:

$$\mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \right] \Sigma_2^{-1} \right) = (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) + \mathrm{tr} \left( \Sigma_2^{-1} \Sigma_1 \right)$$

And finally:

$$D_\mathrm{KL} (p \Vert q) = \frac{1}{2} \left[ (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) - k + \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \mathrm{tr} \left( \Sigma_2^{-1} \Sigma_1 \right) \right]$$
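As a sketch only (the helper gaussian_kl and the random test inputs are mine, not from the post), this closed form can be checked against torch.distributions:

```python
import torch
from torch.distributions import MultivariateNormal, kl_divergence

def gaussian_kl(mu1, Sigma1, mu2, Sigma2):
    """D_KL( N(mu1, Sigma1) || N(mu2, Sigma2) ) via the closed form above."""
    k = mu1.shape[-1]
    Sigma2_inv = torch.linalg.inv(Sigma2)
    diff = mu1 - mu2
    quad = diff @ Sigma2_inv @ diff                       # (mu1 - mu2)^T Sigma2^{-1} (mu1 - mu2)
    logdet = torch.logdet(Sigma2) - torch.logdet(Sigma1)  # log(|Sigma2| / |Sigma1|)
    trace = torch.trace(Sigma2_inv @ Sigma1)              # tr(Sigma2^{-1} Sigma1)
    return 0.5 * (quad - k + logdet + trace)

torch.manual_seed(0)
k = 3
mu1, mu2 = torch.randn(k), torch.randn(k)
A, B = torch.randn(k, k), torch.randn(k, k)
Sigma1, Sigma2 = A @ A.T + torch.eye(k), B @ B.T + torch.eye(k)

ours = gaussian_kl(mu1, Sigma1, mu2, Sigma2)
ref = kl_divergence(MultivariateNormal(mu1, covariance_matrix=Sigma1),
                    MultivariateNormal(mu2, covariance_matrix=Sigma2))
print(torch.allclose(ours, ref, atol=1e-5))  # True
```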

In practice, both covariances are often diagonal, $\Sigma_j = \mathrm{diag} \{ \sigma_{j, 1}, \dots, \sigma_{j, k} \}$ (where $\sigma_{j, i}$ denotes a variance), and the loss is averaged over a batch $\mathcal{X}$, so we compute

$$D_\mathrm{KL} (p \Vert q) = \frac{1}{2 \vert \mathcal{X} \vert} \sum_{x \in \mathcal{X}} \sum_{i=1}^k \left[ -1 + \frac{(\mu_{1, i} - \mu_{2, i})^2}{\sigma_{2, i}} + \log \sigma_{2, i} - \log \sigma_{1, i} + \frac{\sigma_{1, i}}{\sigma_{2, i}} \right]$$

```python
mu1, logvar1, mu2, logvar2 = output
# 0.5 * [ (mu1 - mu2)^2 / sigma2 + log sigma2 - log sigma1 + sigma1 / sigma2 - 1 ], summed over dims, batch-averaged
kl_loss = 0.5 * torch.sum((mu1 - mu2).pow(2) / logvar2.exp() + logvar2 - logvar1 + (logvar1 - logvar2).exp() - 1, dim=-1).mean()
```
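For the diagonal case, the same quantity can be cross-checked with torch.distributions.Normal, which treats each coordinate independently. The tensors below are random stand-ins for a model's outputs (shape (batch, k), log-variances in logvar1/logvar2), used only to illustrate the check:

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mu1, logvar1 = torch.randn(8, 4), torch.randn(8, 4)   # stand-in model outputs
mu2, logvar2 = torch.randn(8, 4), torch.randn(8, 4)

kl_loss = 0.5 * torch.sum((mu1 - mu2).pow(2) / logvar2.exp() + logvar2 - logvar1
                          + (logvar1 - logvar2).exp() - 1, dim=-1).mean()

# Normal is parameterized by the standard deviation, i.e. exp(0.5 * logvar).
ref = kl_divergence(Normal(mu1, (0.5 * logvar1).exp()),
                    Normal(mu2, (0.5 * logvar2).exp())).sum(dim=-1).mean()
print(torch.allclose(kl_loss, ref))  # True
```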

KL Divergence from $\mathcal{N}(0, I)$

Recall that in models like the VAE and cVAE, the variational lower bound (VLB), also known as the evidence lower bound (ELBO), is

$$\mathrm{VLB} = \mathbb{E}_{q_\phi(z \vert x)} \left[ \log p_\theta (x \vert z) \right] - D_\mathrm{KL} \left( q_\phi(z \vert x) \Vert p(z) \right)$$

We seek the parameters that maximize it:

$$(\phi^*, \theta^*) = \argmax_{\phi, \theta} \mathrm{VLB}$$

Note that, to align with most of the literature, the KL term here is $D_\mathrm{KL}(q_\phi(z \vert x) \Vert p(z))$: the approximate posterior $q_\phi$ takes the role of $p$ from the previous section, and the prior $p(z)$ takes the role of $q$.

We assume that $p(z) = \mathcal{N} (z; 0, I)$ and $q_\phi(z \vert x) = \mathcal{N} (z; \mu, \Sigma)$, where $\Sigma = \mathrm{diag} \{ \sigma_1, \dots, \sigma_k \}$ (the $\sigma_i$ being variances), for which:

$$\begin{aligned} D_\mathrm{KL} \left( q_\phi(z \vert x) \Vert p(z) \right) &= \frac{1}{2} \left[ \Vert \mu \Vert^2_2 - k - \log \vert \Sigma \vert + \mathrm{tr} \left( \Sigma \right) \right] \\ &= - \frac{1}{2} \sum_{i=1}^k \left[ 1 + \log \sigma_{i} - \mu_{i}^2 - \sigma_{i} \right] \end{aligned}$$

In code, we get mu and logvar (the latter being $\log \Sigma$, i.e. the element-wise log-variances), and the KL Divergence loss term, averaged over the batch, can be computed with:

```python
mu, logvar = output
# -0.5 * sum_i (1 + log sigma_i - mu_i^2 - sigma_i), averaged over the batch
kld_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
```
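This is simply the two-Gaussian expression above with $\mu_2 = 0$ and $\log \Sigma_2 = 0$ (i.e. $\Sigma_2 = I$); a quick consistency check with random stand-in tensors:

```python
import torch

torch.manual_seed(0)
mu, logvar = torch.randn(8, 4), torch.randn(8, 4)     # stand-ins for an encoder's outputs

kld_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()

# General diagonal-Gaussian KL with mu2 = 0 and logvar2 = 0 (Sigma2 = I).
mu2, logvar2 = torch.zeros_like(mu), torch.zeros_like(logvar)
general = 0.5 * torch.sum((mu - mu2).pow(2) / logvar2.exp() + logvar2 - logvar
                          + (logvar - logvar2).exp() - 1, dim=-1).mean()
print(torch.allclose(kld_loss, general))  # True
```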

KL Divergence by Sampling

Pending…

References

  • Mr.Esay’s Blog: KL Divergence between 2 Gaussian Distributions
  • Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Sohn, K., Lee, H., & Yan, X. (2015). Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28.
  • Petersen, K. B., & Pedersen, M. S. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.
  • Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The annals of mathematical statistics, 22(1), 79-86.
  • Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. The annals of probability, 146-158.
