Variational autoencoder


Template:Short description Template:Use dmy dates

The basic scheme of a variational autoencoder. The model receives x as input. The encoder compresses it into the latent space. The decoder receives as input the information sampled from the latent space and produces x′ as similar as possible to x.

Template:Machine learning bar

In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling.[1] It is part of the families of probabilistic graphical models and variational Bayesian methods.[2]

In addition to being seen as an autoencoder neural network architecture, variational autoencoders can also be studied within the mathematical formulation of variational Bayesian methods, connecting a neural encoder network to its decoder through a probabilistic latent space (for example, as a multivariate Gaussian distribution) that corresponds to the parameters of a variational distribution.

Thus, the encoder maps each point (such as an image) from a large complex dataset into a distribution within the latent space, rather than to a single point in that space. The decoder has the opposite function, which is to map from the latent space to the input space, again according to a distribution (although in practice, noise is rarely added during the decoding stage). By mapping a point to a distribution instead of a single point, the network can avoid overfitting the training data. Both networks are typically trained together using the reparameterization trick, although the variance of the noise model can be learned separately.Template:Cn

Although this type of model was initially designed for unsupervised learning,[3][4] its effectiveness has been proven for semi-supervised learning[5][6] and supervised learning.[7]

Overview of architecture and operation

A variational autoencoder is a generative model built from a prior distribution over the latent variables and a noise distribution over the data. Usually such models are trained using the expectation-maximization meta-algorithm (e.g. probabilistic PCA, (spike & slab) sparse coding). Such a scheme optimizes a lower bound of the data likelihood, which is usually computationally intractable, and in doing so requires the discovery of q-distributions, or variational posteriors. These q-distributions are normally parameterized for each individual data point in a separate optimization process. However, variational autoencoders use a neural network as an amortized approach to jointly optimize across data points. In that way, the same parameters are reused for multiple data points, which can result in massive memory savings. The first neural network takes as input the data points themselves, and outputs parameters for the variational distribution. As it maps from a known input space to the low-dimensional latent space, it is called the encoder.

The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. outputting the means of the noise distribution. It is possible to use another neural network that maps to the variance; however, this can be omitted for simplicity. In such a case, the variance can be optimized with gradient descent.
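For concreteness, a minimal sketch of such an encoder and decoder as fully connected networks is given below (in PyTorch). The layer sizes, class names, and the choice of a diagonal-Gaussian encoder outputting a mean and a log-variance are illustrative assumptions, not part of the original formulation.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input x to the parameters (mean, log-variance) of the variational distribution q_phi(z|x)."""
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.mean = nn.Linear(hidden_dim, latent_dim)
        self.log_var = nn.Linear(hidden_dim, latent_dim)  # diagonal covariance

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mean(h), self.log_var(h)

class Decoder(nn.Module):
    """Maps a latent code z to the mean of the noise distribution p_theta(x|z); the variance is kept fixed for simplicity."""
    def __init__(self, latent_dim=20, hidden_dim=400, output_dim=784):
        super().__init__()
        self.hidden = nn.Linear(latent_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        return self.out(h)
</syntaxhighlight>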

To optimize this model, one needs to know two terms: the "reconstruction error", and the Kullback–Leibler divergence (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data, here referred to as the p-distribution. For example, a standard VAE task such as ImageNet is typically assumed to have Gaussian-distributed noise, whereas tasks such as binarized MNIST require Bernoulli noise. The KL-D from the free energy expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value.[8]
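To illustrate how the assumed noise distribution changes the reconstruction term, the following sketch contrasts a Gaussian noise model (squared error) with a Bernoulli noise model (binary cross-entropy); the function names are hypothetical.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def gaussian_reconstruction(x, x_mean):
    # Negative log-likelihood of x under N(x_mean, I), up to an additive constant
    return 0.5 * ((x - x_mean) ** 2).sum(dim=-1)

def bernoulli_reconstruction(x, x_logits):
    # Negative log-likelihood for binarized data (e.g. binarized MNIST)
    return F.binary_cross_entropy_with_logits(x_logits, x, reduction="none").sum(dim=-1)
</syntaxhighlight>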

More recent approaches replace the Kullback–Leibler divergence (KL-D) with various statistical distances; see the section "Statistical distance VAE variants" below.

Formulation

From the point of view of probabilistic modeling, one wants to maximize the likelihood of the data x under a chosen parameterized probability distribution pθ(x)=p(x|θ). This distribution is usually chosen to be a Gaussian N(x|μ,σ) which is parameterized by μ and σ respectively, and as a member of the exponential family it is easy to work with as a noise distribution. Simple distributions are easy enough to maximize; however, distributions where a prior is assumed over the latents z result in intractable integrals. Let us find pθ(x) via marginalizing over z.

p_\theta(x) = \int_z p_\theta(x, z)\, dz,

where pθ(x,z) represents the joint distribution under pθ of the observable data x and its latent representation or encoding z. According to the chain rule, the equation can be rewritten as

p_\theta(x) = \int_z p_\theta(x|z)\, p_\theta(z)\, dz

In the vanilla variational autoencoder, z is usually taken to be a finite-dimensional vector of real numbers, and pθ(x|z) to be a Gaussian distribution. Then pθ(x) is a mixture of Gaussian distributions.
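For example, writing the decoder mean as Dθ(z) (introduced below) and assuming a standard normal prior with a fixed observation variance σ² (an illustrative assumption), the marginal likelihood can be written as a continuous mixture of Gaussians indexed by z:

p_\theta(x) = \int \mathcal{N}\!\left(x \mid D_\theta(z), \sigma^2 I\right) \mathcal{N}\!\left(z \mid 0, I\right) dz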

It is now possible to define the set of the relationships between the input data and its latent representation as

  • Prior pθ(z)
  • Likelihood pθ(x|z)
  • Posterior pθ(z|x)

Unfortunately, the computation of pθ(z|x) is expensive and in most cases intractable. To speed up the calculation and make it feasible, it is necessary to introduce a further function to approximate the posterior distribution as

q_\phi(z|x) \approx p_\theta(z|x)

with ϕ defined as the set of real values that parametrize q. This is sometimes called amortized inference, since by "investing" in finding a good qϕ, one can later infer z from x quickly without doing any integrals.

In this way, the problem is to find a good probabilistic autoencoder, in which the conditional likelihood distribution pθ(x|z) is computed by the probabilistic decoder, and the approximated posterior distribution qϕ(z|x) is computed by the probabilistic encoder.

Parametrize the encoder as Eϕ, and the decoder as Dθ.

Evidence lower bound (ELBO)

Template:Main

Like many deep learning approaches that use gradient-based optimization, VAEs require a differentiable loss function to update the network weights through backpropagation.

For variational autoencoders, the idea is to jointly optimize the generative model parameters θ to reduce the reconstruction error between the input and the output, and ϕ to make qϕ(z|x) as close as possible to pθ(z|x). As reconstruction loss, mean squared error and cross entropy are often used.

As distance loss between the two distributions the Kullback–Leibler divergence DKL(qϕ(z|x) ∥ pθ(z|x)) is a good choice to squeeze qϕ(z|x) under pθ(z|x).[8][9]

The distance loss just defined is expanded as

D_{KL}(q_\phi(z|x) \parallel p_\theta(z|x)) = \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{q_\phi(z|x)}{p_\theta(z|x)}\right] = \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{q_\phi(z|x)\, p_\theta(x)}{p_\theta(x,z)}\right] = \ln p_\theta(x) + \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{q_\phi(z|x)}{p_\theta(x,z)}\right]

Now define the evidence lower bound (ELBO):

L_{\theta,\phi}(x) := \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\right] = \ln p_\theta(x) - D_{KL}(q_\phi(\cdot|x) \parallel p_\theta(\cdot|x))

Maximizing the ELBO

\theta^*, \phi^* = \underset{\theta,\phi}{\operatorname{arg\,max}}\, L_{\theta,\phi}(x)

is equivalent to simultaneously maximizing ln pθ(x) and minimizing DKL(qϕ(z|x) ∥ pθ(z|x)). That is, maximizing the log-likelihood of the observed data, and minimizing the divergence of the approximate posterior qϕ(⋅|x) from the exact posterior pθ(⋅|x).

The form given is not very convenient for maximization, but the following, equivalent form, is:

L_{\theta,\phi}(x) = \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln p_\theta(x|z)\right] - D_{KL}(q_\phi(\cdot|x) \parallel p_\theta(\cdot))

where ln pθ(x|z) is implemented as −½‖x − Dθ(z)‖₂², since that is, up to an additive constant, what x ∼ 𝒩(Dθ(z), I) yields. That is, we model the distribution of x conditional on z to be a Gaussian distribution centered on Dθ(z). The distributions of qϕ(z|x) and pθ(z) are often also chosen to be Gaussians as z|x ∼ 𝒩(Eϕ(x), σϕ(x)²I) and z ∼ 𝒩(0, I), with which we obtain, by the formula for the KL divergence of Gaussians:

L_{\theta,\phi}(x) = -\frac{1}{2}\mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\|x - D_\theta(z)\|_2^2\right] - \frac{1}{2}\left(N\sigma_\phi(x)^2 + \|E_\phi(x)\|_2^2 - 2N\ln\sigma_\phi(x)\right) + \mathrm{Const}

Here N is the dimension of z. For a more detailed derivation and more interpretations of ELBO and its maximization, see its main page.
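As a sketch of how this closed-form objective is typically implemented in practice, the following loss negates the ELBO so it can be minimized, assuming the diagonal-Gaussian encoder and fixed-variance Gaussian decoder described above; all names are illustrative.

<syntaxhighlight lang="python">
import torch

def vae_loss(x, x_mean, z_mean, z_log_var):
    """Negative ELBO for decoder p(x|z) = N(x_mean, I) and
    encoder q(z|x) = N(z_mean, diag(exp(z_log_var))), with prior p(z) = N(0, I)."""
    # Reconstruction term: 1/2 * ||x - D_theta(z)||^2, up to an additive constant
    reconstruction = 0.5 * ((x - x_mean) ** 2).sum(dim=-1)
    # KL(q(z|x) || p(z)) in closed form for diagonal Gaussians
    kl = 0.5 * (z_log_var.exp() + z_mean ** 2 - 1.0 - z_log_var).sum(dim=-1)
    return (reconstruction + kl).mean()
</syntaxhighlight>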

Reparameterization

The scheme of the reparameterization trick. The random variable ε is injected into the latent space z as external input. In this way, it is possible to backpropagate the gradient without involving a stochastic variable during the update.

To efficiently search for

\theta^*, \phi^* = \underset{\theta,\phi}{\operatorname{arg\,max}}\, L_{\theta,\phi}(x)

the typical method is gradient ascent.

It is straightforward to find

\nabla_\theta \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\right] = \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\nabla_\theta \ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\right]

However,

\nabla_\phi \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\right]

does not allow one to put the ∇ϕ inside the expectation, since ϕ appears in the probability distribution itself. The reparameterization trick (also known as stochastic backpropagation[10]) bypasses this difficulty.[8][11][12]

The most important example is when z ∼ qϕ(⋅|x) is normally distributed, as 𝒩(μϕ(x), Σϕ(x)).

The scheme of a variational autoencoder after the reparameterization trick

This can be reparametrized by letting ϵ ∼ 𝒩(0, I) be a "standard random number generator", and constructing z as z = μϕ(x) + Lϕ(x)ϵ. Here, Lϕ(x) is obtained by the Cholesky decomposition:

\Sigma_\phi(x) = L_\phi(x) L_\phi(x)^T

Then we have

\nabla_\phi \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x,z)}{q_\phi(z|x)}\right] = \mathbb{E}_{\epsilon}\left[\nabla_\phi \ln\frac{p_\theta(x, \mu_\phi(x) + L_\phi(x)\epsilon)}{q_\phi(\mu_\phi(x) + L_\phi(x)\epsilon \mid x)}\right]

and so we obtain an unbiased estimator of the gradient, allowing stochastic gradient descent.

Since we reparametrized z, we need to find qϕ(z|x). Let q₀ be the probability density function for ϵ, then Template:Clarify

\ln q_\phi(z|x) = \ln q_0(\epsilon) - \ln\left|\det\left(\frac{\partial z}{\partial \epsilon}\right)\right|

where ∂z/∂ϵ is the Jacobian matrix of z with respect to ϵ. Since z = μϕ(x) + Lϕ(x)ϵ, this is

\ln q_\phi(z|x) = -\frac{1}{2}\|\epsilon\|^2 - \ln|\det L_\phi(x)| - \frac{n}{2}\ln(2\pi)
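A minimal sketch of the reparameterization trick in the common diagonal-Gaussian case, where the Cholesky factor Lϕ(x) reduces to an elementwise standard deviation (names and parameterization are illustrative):

<syntaxhighlight lang="python">
import torch

def reparameterize(z_mean, z_log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).
    The randomness lives in eps, so gradients flow to z_mean and z_log_var."""
    std = torch.exp(0.5 * z_log_var)   # sigma = exp(log_var / 2)
    eps = torch.randn_like(std)        # external standard normal noise
    return z_mean + std * eps
</syntaxhighlight>

For a full covariance matrix, the same construction applies with the Cholesky factor, z = μϕ(x) + Lϕ(x)ϵ.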

Variations

Many variational autoencoder applications and extensions have been used to adapt the architecture to other domains and improve its performance.

β-VAE is an implementation with a weighted Kullback–Leibler divergence term to automatically discover and interpret factorised latent representations. With this implementation, it is possible to force manifold disentanglement for β values greater than one. This architecture can discover disentangled latent factors without supervision.[13][14]
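A sketch of how the β weighting modifies the standard objective, reusing the hypothetical loss terms from the earlier sketch (the value of β is a tunable hyperparameter):

<syntaxhighlight lang="python">
def beta_vae_loss(x, x_mean, z_mean, z_log_var, beta=4.0):
    # Same terms as the standard VAE loss, with the KL term weighted by beta (> 1)
    reconstruction = 0.5 * ((x - x_mean) ** 2).sum(dim=-1)
    kl = 0.5 * (z_log_var.exp() + z_mean ** 2 - 1.0 - z_log_var).sum(dim=-1)
    return (reconstruction + beta * kl).mean()
</syntaxhighlight>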

The conditional VAE (CVAE) inserts label information in the latent space to force a deterministic constrained representation of the learned data.[15]

Some structures directly deal with the quality of the generated samples[16][17] or implement more than one latent space to further improve the representation learning.

Some architectures mix VAE and generative adversarial networks to obtain hybrid models.[18][19][20]

It is not necessary to use gradients to update the encoder. In fact, the encoder is not necessary for the generative model.[21]

Statistical distance VAE variants

After the initial work of Diederik P. Kingma and Max Welling,[22] several procedures were proposed to formulate the operation of the VAE in a more abstract way. In these approaches the loss function is composed of two parts:

  • the usual reconstruction error part, which seeks to ensure that the encoder-then-decoder mapping x ↦ Dθ(Eϕ(x)) is as close to the identity map as possible; the sampling is done at run time from the empirical distribution μ_real of the objects available (e.g., for MNIST or ImageNet this will be the empirical probability law of all images in the dataset). This gives the term 𝔼_{x∼μ_real}[‖x − Dθ(Eϕ(x))‖₂²].
  • a variational part that ensures that, when the empirical distribution μ_real is passed through the encoder Eϕ, we recover the target distribution, denoted here μ(dz), which is usually taken to be a multivariate normal distribution. We will denote by Eϕ♯μ_real this pushforward measure, which in practice is just the empirical distribution obtained by passing all dataset objects through the encoder Eϕ. In order to make sure that Eϕ♯μ_real is close to the target μ(dz), a statistical distance d is invoked, and the term d(μ(dz), Eϕ♯μ_real)² is added to the loss.

We obtain the final formula for the loss:

L_{\theta,\phi} = \mathbb{E}_{x\sim \mu_{real}}\left[\|x - D_\theta(E_\phi(x))\|_2^2\right] + d\left(\mu(dz),\, E_\phi \sharp \mu_{real}\right)^2

The statistical distance d requires special properties: for instance, it must possess a formula as an expectation, because the loss function will need to be optimized by stochastic optimization algorithms. Several distances can be chosen, and this gave rise to several flavors of VAEs.
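As one illustration, a statistical distance that admits such a sample-based expectation formula is the kernel maximum mean discrepancy (MMD); the estimator below is a generic sketch under that choice, with an RBF kernel and hypothetical names, not a prescription from the cited works.

<syntaxhighlight lang="python">
import torch

def mmd_squared(z_encoded, z_prior, bandwidth=1.0):
    """Sample-based (biased) estimate of the squared MMD between encoded latents
    and samples drawn from the target distribution mu(dz), using an RBF kernel."""
    def rbf(a, b):
        d2 = torch.cdist(a, b) ** 2          # pairwise squared Euclidean distances
        return torch.exp(-d2 / (2.0 * bandwidth ** 2))
    return (rbf(z_encoded, z_encoded).mean()
            + rbf(z_prior, z_prior).mean()
            - 2.0 * rbf(z_encoded, z_prior).mean())
</syntaxhighlight>

The squared distance term is then added to the reconstruction error, with z_prior drawn from μ(dz), for example z_prior = torch.randn_like(z_encoded) for a standard normal target.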

See also

Template:Div col

Template:Div col end

References

Template:Reflist

Further reading

Template:Artificial intelligence navbox