
Autoencoders

This page provides basic information about some kinds of autoencoders.

AE

An autoencoder (AE) is a DNN trained to replicate an input vector $x \in \R^F$ at its output. It usually has a diabolo shape: an encoder produces a low-dimensional latent representation $z \in \R^L$ from $x$ (with $L \ll F$), and a decoder tries to reconstruct $x$ from $z$, outputting $\hat{x}$. A minimal PyTorch sketch is given after the list below.

Autoencoder

  • Encoder: $p(z \vert x)$
  • Decoder: $p(x \vert z)$
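
As a concrete illustration, here is a minimal PyTorch sketch of the diabolo-shaped AE described above. The layer sizes and activations are arbitrary choices for illustration, not prescribed by the text.

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    """Minimal diabolo-shaped autoencoder: R^F -> R^L -> R^F (illustrative sizes)."""
    def __init__(self, F_dim=513, L_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(F_dim, 128), nn.Tanh(), nn.Linear(128, L_dim))
        self.decoder = nn.Sequential(
            nn.Linear(L_dim, 128), nn.Tanh(), nn.Linear(128, F_dim))

    def forward(self, x):
        z = self.encoder(x)       # low-dimensional latent representation z
        x_hat = self.decoder(z)   # reconstruction of the input
        return x_hat, z
```

Training typically minimizes a reconstruction loss such as `nn.MSELoss()(x_hat, x)`.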

VAE

A VAE can be seen as a probabilistic version of an AE, where the decoder outputs the parameters of a probability distribution over $x$. Thus, a VAE is capable of

  • generating new $\hat{x}$ from unseen values of $z$
  • transforming existing data $x_1, \dots, x_n$ by modifying their encoded latents
  • providing a prior distribution of $x$ in more complex Bayesian models

Decoder

The VAE generative model (its decoder side) is defined as

$$
p_\theta (x, z) = p_{\theta_x} (x \vert z) \, p_{\theta_z} (z) \tag{vae:dec}
$$

  • Prior distribution for $z$: $p_{\theta_z}(z)$
    • usually modelled as $\mathcal{N}(0, I)$
    • sometimes modelled with Gamma distributions, which better fit the natural statistics of speech/audio power spectra
  • Conditional distribution $p_{\theta_x} (x \vert z)$, whose parameters are a non-linear function of $z$

Generation with VAE

Consider the marginal distribution of $x$, obtained by integrating the joint distribution over $z$:

$$
p_\theta(x) = \int p_{\theta_x} (x \vert z) \, p_{\theta_z} (z) \, \mathrm{d}z \tag{vae:x-marginal}
$$

Because each conditional distribution $p_{\theta_x} (x \vert z)$ can contribute a mode, $p_\theta (x)$ can be highly multi-modal.

💡 This is only an analysis of $p_\theta(x)$. Later we will see that the posterior $p_\theta (z \vert x)$ is intractable, which also means we cannot actually compute this integral.
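
As an illustration of this generative use, here is a minimal sketch of sampling new data, assuming a trained decoder network (the name `decoder` and the latent size are hypothetical) that maps $z$ to the mean of $p_{\theta_x}(x \vert z)$:

```python
import torch

def generate(decoder, n_samples=8, L_dim=16):
    """Draw new samples by decoding latents sampled from the prior N(0, I).

    `decoder` is a hypothetical trained network mapping z of shape (batch, L)
    to the mean of p(x | z).
    """
    z = torch.randn(n_samples, L_dim)   # unseen latent values z ~ N(0, I)
    with torch.no_grad():
        x_hat = decoder(z)              # decode into new data samples
    return x_hat
```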

Gaussian Prior

When the prior is modelled as $\mathcal{N}(0, I)$ and the conditional likelihood as a factorized (diagonal-covariance) Gaussian, we have

$$
\begin{aligned}
p_{\theta_x}(x \vert z) &= \mathcal{N} \left(x; \mu_{\theta_x}(z), \mathrm{diag}\{ \sigma_{\theta_x}^2 (z) \}\right) \\
&= \prod_{f=1}^F p_{\theta_x} (x_f \vert z) \\
&= \prod_{f=1}^F \mathcal{N} \left(x_f; \mu_{\theta_x, f}(z), \sigma_{\theta_x, f}^2 (z)\right)
\end{aligned}
$$
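
As a worked instance of this factorization, a small sketch of the corresponding log-likelihood, assuming the decoder returns `mu` and `logvar` tensors of shape `(batch, F)` (names are illustrative):

```python
import math
import torch

def gaussian_log_likelihood(x, mu, logvar):
    """log p_{theta_x}(x | z) for a diagonal Gaussian with mean `mu` and log-variance `logvar`."""
    per_dim = -0.5 * (math.log(2 * math.pi) + logvar + (x - mu) ** 2 / logvar.exp())
    return per_dim.sum(dim=-1)   # sum over the F independent dimensions x_f
```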

Encoder

To train the generative model in $\mathrm{(vae:dec)}$, we must learn $\theta$ from the training dataset $X = \{ x_n \in \R^F \}_{n=1}^N$, where it is assumed that $x_n \overset{\mathrm{iid}}{\sim} p(x)$. We estimate $p(x)$ with $p_\theta(x)$, and by training on $x \in X$ we find

$$
\theta^* = \argmax_\theta \log p_\theta(x)
$$

EM: An Intractable Way

Using the classical Expectation-Maximization (EM) algorithm is not possible here: its E-step requires the posterior $p_\theta(z \vert x)$, which is intractable because the non-linear decoder makes the marginal-likelihood integral above impossible to compute in closed form.

Variational Inference

Variational Inference is used to train the VAE. It is based on two principles:

  • $p_\theta (z \vert x)$ is intractable, but an approximate posterior $q_\phi (z \vert x) \approx p_\theta(z \vert x)$ is acceptable
  • The encoder and the decoder are jointly trained

A commonly chosen approximation is also Gaussian:

$$
q_\phi (z \vert x) = \mathcal{N} \left(z; \mu_\phi (x), \mathrm{diag} \{ \sigma_\phi^2 (x) \} \right)
$$

where two encoder networks (also referred to as recognition networks) non-linearly map $x$ to the distribution parameters, as sketched after the list:

  • $\mu_\phi: \R^F \to \R^L$
  • $\sigma_\phi: \R^F \to \R^L_+$
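
A minimal sketch of such recognition networks in PyTorch, predicting the log-variance for numerical convenience (layer sizes and names are arbitrary):

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Recognition networks: x in R^F -> (mu_phi(x), log sigma_phi^2(x)), both in R^L."""
    def __init__(self, F_dim=513, L_dim=16, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(F_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, L_dim)       # mu_phi(x)
        self.logvar = nn.Linear(hidden, L_dim)   # log sigma_phi^2(x); exp() yields a positive variance

    def forward(self, x):
        h = self.shared(x)
        return self.mu(h), self.logvar(h)
```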

Here, we only obtain the parameters of the distribution of $z$; we still need to sample from it to get $z$. However, sampling is not differentiable, so backpropagation cannot pass through it. In code, this is tackled by the reparameterization trick:

  1. sample $\hat{z}$ from the standard normal distribution $\mathcal{N}(0, I)$
  2. transform $\hat{z}$ to $z$ by $z = \mu_\phi(x) + \sigma_\phi(x) \odot \hat{z}$
```python
def reparam(self, mu, logvar):
    # z = mu + sigma * eps, with eps ~ N(0, I) and sigma = exp(logvar / 2)
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
```

💡 Here, `logvar` is actually $\log\sigma_\phi^2(x)$, whose elements are unconstrained in $\R$ (unlike the variance itself, which must be positive).

Training

The parameters $\theta, \phi$ are jointly estimated from $X$. Kingma and Welling (2014) showed that even though $\log p_\theta (X)$ is intractable, we can compute an Evidence Lower Bound (ELBO) that depends on $\theta, \phi$:

The ELBO of a VAE is given by:

$$
\begin{aligned}
\mathcal{L}(\theta, \phi; x) &= \mathbb{E}_{q_\phi (z \vert x)} \log p_\theta (x, z) - \mathbb{E}_{q_\phi (z \vert x)} \log q_\phi (z \vert x) \\
&= \mathbb{E}_{q_\phi (z \vert x)} \log p_\theta (x \vert z) - D_\mathrm{KL} \left( q_\phi (z \vert x) \Vert p_{\theta_z} (z) \right)
\end{aligned}
$$
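
A hedged sketch of this objective for the Gaussian case above, negated so it can be minimized; `mu_x`, `logvar_x` are assumed to come from the decoder and `mu_z`, `logvar_z` from the encoder, and additive constants are dropped:

```python
import torch

def negative_elbo(x, mu_x, logvar_x, mu_z, logvar_z):
    """-ELBO with a Gaussian decoder/encoder and an N(0, I) prior (constants dropped)."""
    # Reconstruction term: -E_q[log p(x | z)], one-sample Monte Carlo estimate
    recon = 0.5 * (logvar_x + (x - mu_x) ** 2 / logvar_x.exp()).sum(dim=-1)
    # Regularization term: D_KL(q_phi(z | x) || N(0, I)), available in closed form
    kl = -0.5 * (1 + logvar_z - mu_z ** 2 - logvar_z.exp()).sum(dim=-1)
    return (recon + kl).mean()
```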

Improvements (Pending…)

β\beta-VAE

VAEs with Normalizing Flows

T-VAE (Tianming et al., 2019)

cVAE (Pending…)

Dynamical VAE

Dynamical VAEs (DVAEs) consider a sequence of observed random vectors $x_{1:T} = \{ x_t \in \R^F \}_{t=1}^T$ and a sequence of latent random vectors $z_{1:T} = \{ z_t \in \R^L \}_{t=1}^T$ that are assumed to be temporally correlated. Designing a DVAE consists in specifying the joint distribution $p_\theta (x_{1:T}, z_{1:T})$ that determines their dependencies.

In a commonly used setting called the driven mode, the model observes $u_{1:T} = \{ u_t \in \R^U \}_{t=1}^T$ and generates $x_{1:T}$ as output. In principle this requires specifying $p_\theta(x_{1:T}, z_{1:T}, u_{1:T})$, but $u_{1:T}$ is usually assumed to be deterministic while $x_{1:T}, z_{1:T}$ are stochastic, which turns the problem into specifying $p_\theta (x_{1:T}, z_{1:T} \vert u_{1:T})$.

Ordering Dependencies

A natural way to order the dependencies is via causality:

  • Causal: Generate the variable from the past - most popular for tasks like motion generation
  • Non-causal: Generate the variable from both the past and the future
  • Anti-causal: Generate the variable from the future

For example, here is a very simple causal model (the first line of the figure) and a non-causal model (the second line):

Causal and Non-Causal DVAE

Causal DVAEs Modelling

For a causal DVAE, the joint distribution can be factorized with the chain rule:

$$
\begin{aligned}
p_\theta(x_{1:T}, z_{1:T} \vert u_{1:T}) &= \prod_{t=1}^T p(x_t, z_t \vert x_{1:t-1}, z_{1:t-1}, u_{1:t}) \\
&= \prod_{t=1}^T p(x_t \vert x_{1:t-1}, z_{1:t}, u_{1:t}) \, p(z_t \vert x_{1:t-1}, z_{1:t-1}, u_{1:t})
\end{aligned}
$$

💡 Of note, $x_{1:0} = z_{1:0} = \emptyset$, so the first factor is $p(x_1 \vert z_1, u_1) \, p(z_1 \vert u_1)$: the initialization requires specifying $p(z_1 \vert u_1)$ (or simply $p(z_1)$ when there is no input $u$).

Still, this dependency structure is complex: it involves all past and present $u$'s, as well as all past $x$'s and $z$'s.

Simplifications

Several simplifications are possible. For example, the state-space model (SSM) family simplifies the dependencies as:

$$
\begin{aligned}
p(x_t \vert x_{1:t-1}, z_{1:t}, u_{1:t}) &= p(x_t \vert z_t) \\
p(z_t \vert x_{1:t-1}, z_{1:t-1}, u_{1:t}) &= p(z_t \vert z_{t-1}, u_t)
\end{aligned}
$$
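
Before moving to a more general factorization, here is a rough sketch of ancestral sampling under this SSM simplification; `transition` and `emission` are hypothetical networks returning Gaussian parameters, not part of the original text:

```python
import torch

def sample_ssm_dvae(transition, emission, u, L_dim=16):
    """Ancestral sampling under the SSM simplification above.

    transition(z_prev, u_t) -> (mu_z, logvar_z)  parameters of p(z_t | z_{t-1}, u_t)
    emission(z_t)           -> (mu_x, logvar_x)  parameters of p(x_t | z_t)
    u: input sequence of shape (T, U). Both networks are hypothetical placeholders.
    """
    T = u.shape[0]
    z_prev = torch.zeros(L_dim)          # initial state used for p(z_1 | u_1)
    xs = []
    for t in range(T):
        mu_z, logvar_z = transition(z_prev, u[t])
        z_t = mu_z + (0.5 * logvar_z).exp() * torch.randn_like(mu_z)   # sample z_t
        mu_x, logvar_x = emission(z_t)
        x_t = mu_x + (0.5 * logvar_x).exp() * torch.randn_like(mu_x)   # sample x_t
        xs.append(x_t)
        z_prev = z_t
    return torch.stack(xs)               # generated sequence x_{1:T}
```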

A more generally used factorization, however, is:

$$
\begin{aligned}
p(x_t \vert x_{1:t-1}, z_{1:t}, u_{1:t}) &= p(x_t \vert x_{1:t-1}, u_t) \\
p(z_t \vert x_{1:t-1}, z_{1:t-1}, u_{1:t}) &= p(z_t \vert x_{1:t}, u_t)
\end{aligned}
$$

The corresponding probabilistic graph looks like this:

A possible probabilistic graph for causal DVAE.

To parameterize these terms in practice, RNNs with hidden states are introduced, the hidden state summarizing the dependence on the past. Two possible implementations either share the same set of hidden states or use separate sets:

2 possible ways of factorization for causal DVAE.

Inference Model

The posterior distribution of $z_{1:T}$, $p_\theta (z_{1:T} \vert x_{1:T}, u_{1:T})$, is also intractable, so we again need an inference model $q_\phi(z_{1:T} \vert x_{1:T}, u_{1:T})$ to approximate it. Exploiting D-separation, the inference model can be factorized as:

$$
\begin{aligned}
p_\theta(z_{1:T} \vert x_{1:T}, u_{1:T}) &= \prod_{t=1}^T p_{\theta_z} (z_t \vert z_{1:t-1}, x_{1:T}, u_{1:T}) \\
q_\phi(z_{1:T} \vert x_{1:T}, u_{1:T}) &= \prod_{t=1}^T q_\phi (z_t \vert z_{1:t-1}, x_{1:T}, u_{1:T})
\end{aligned}
$$

Training DVAE

The ELBO is extended to data sequences as:

$$
\begin{aligned}
\mathcal{L}(\theta, \phi; x_{1:T}, u_{1:T}) &= \mathbb{E}_{q_\phi(z_{1:T} \vert x_{1:T}, u_{1:T})} \ln p_\theta (x_{1:T}, z_{1:T} \vert u_{1:T}) - \mathbb{E}_{q_\phi(z_{1:T} \vert x_{1:T}, u_{1:T})} \ln q_\phi (z_{1:T} \vert x_{1:T}, u_{1:T}) \\
&= \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{1:t} \vert x_{1:T}, u_{1:T})} \ln p_{\theta_x} (x_t \vert x_{1:t-1}, z_{1:t}, u_{1:t}) \\
&\quad - \sum_{t=1}^T \mathbb{E}_{q_\phi(z_{1:t-1} \vert x_{1:T}, u_{1:T})} D_\mathrm{KL} \left(q_\phi(z_t \vert z_{1:t-1}, x_{1:T}, u_{1:T}) \Vert p_{\theta_z} (z_t \vert x_{1:t-1}, z_{1:t-1}, u_{1:t}) \right)
\end{aligned}
$$

💡 In the standard VAE, the regularization term has an analytical form for usual distributions such as the Gaussian. For a DVAE, however, both the reconstruction and regularization terms require Monte Carlo estimates, with samples drawn from $q_\phi (z_{1:\tau} \vert x_{1:T}, u_{1:T})$, where $\tau \in \{ 1, \dots, T \}$.
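
A rough sketch of such a one-sample Monte Carlo estimate, written for the SSM-style simplification from earlier; `inference`, `transition`, and `emission` are hypothetical networks returning Gaussian parameters, and the inference model's dependence on past latents is reduced to $z_{t-1}$ for brevity:

```python
import torch

def dvae_negative_elbo(inference, transition, emission, x, u, L_dim=16):
    """One-sample Monte Carlo estimate of -ELBO for an SSM-style causal DVAE (schematic).

    inference(z_prev, x, u, t) -> (mu, logvar) of q_phi(z_t | z_{1:t-1}, x_{1:T}, u_{1:T})
                                  (dependence on past latents reduced to z_{t-1} here)
    transition(z_prev, u_t)    -> (mu, logvar) of p_{theta_z}(z_t | z_{t-1}, u_t)
    emission(z_t)              -> (mu, logvar) of p_{theta_x}(x_t | z_t)
    All three networks are hypothetical placeholders; x: (T, F), u: (T, U).
    """
    T = x.shape[0]
    z_prev = torch.zeros(L_dim)
    recon = kl = 0.0
    for t in range(T):
        mu_q, logvar_q = inference(z_prev, x, u, t)
        z_t = mu_q + (0.5 * logvar_q).exp() * torch.randn_like(mu_q)   # z_t ~ q_phi (reparameterized)
        # Analytic KL between the two diagonal Gaussians, given the sampled z_{1:t-1}
        mu_p, logvar_p = transition(z_prev, u[t])
        kl = kl + 0.5 * (logvar_p - logvar_q
                         + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1).sum()
        # Reconstruction term for x_t (Gaussian log-likelihood, constants dropped)
        mu_x, logvar_x = emission(z_t)
        recon = recon + 0.5 * (logvar_x + (x[t] - mu_x) ** 2 / logvar_x.exp()).sum()
        z_prev = z_t
    return recon + kl
```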

References

now publishers - Dynamical Variational Autoencoders: A Comprehensive Review

