Diffusion model

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure.[1] The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly to the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data.[2] A trained diffusion model can be sampled in many ways, with varying efficiency and quality.

There are various equivalent formalisms, including Markov chains, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.[3] They are typically trained using variational inference.[4] The model responsible for denoising is typically called the "backbone". The backbone may be of any kind, but it is typically a U-Net or a transformer.

Diffusion models are mainly used for computer vision tasks, including image denoising, inpainting, super-resolution, image generation, and video generation. These typically involve training a neural network to sequentially denoise images blurred with Gaussian noise.[2][5] The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise and applying the network iteratively to denoise the image.

Diffusion-based image generators have seen widespread commercial interest, such as Stable Diffusion and DALL-E. These models typically combine diffusion models with other models, such as text encoders and cross-attention modules, to allow text-conditioned generation.[6]

Other than computer vision, diffusion models have also found applications in natural language processing[7][8] such as text generation[9][10] and summarization,[11] sound generation,[12] and reinforcement learning.[13][14]

Denoising diffusion model

Non-equilibrium thermodynamics

Diffusion models were introduced in 2015 as a method to learn a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion.[15]

Consider, for example, how one might model the distribution of all naturally-occurring photos. Each image is a point in the space of all images, and the distribution of naturally-occurring photos is a "cloud" in space, which, by repeatedly adding noise to the images, diffuses out to the rest of the image space, until the cloud becomes all but indistinguishable from a Gaussian distribution 𝒩(0,I). A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution is not in equilibrium, unlike the final distribution.

The equilibrium distribution is the Gaussian distribution $\mathcal N(0,I)$, with pdf $\rho(x)\propto e^{-\frac{1}{2}\|x\|^2}$. This is just the Maxwell–Boltzmann distribution of particles in a potential well $V(x)=\frac{1}{2}\|x\|^2$ at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a Brownian walker) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, they would all fall to the origin, collapsing the distribution.

Denoising Diffusion Probabilistic Model (DDPM)

The 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by variational inference.[4][16]

Forward diffusion

To present the model, we need some notation.

  • $\beta_1,\dots,\beta_T \in (0,1)$ are fixed constants.
  • $\alpha_t := 1-\beta_t$
  • $\bar\alpha_t := \alpha_1\cdots\alpha_t$
  • $\sigma_t := \sqrt{1-\bar\alpha_t}$
  • $\tilde\sigma_t := \frac{\sigma_{t-1}}{\sigma_t}\sqrt{\beta_t}$
  • $\tilde\mu_t(x_t,x_0) := \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\,x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\,x_0}{\sigma_t^2}$
  • $\mathcal N(\mu,\Sigma)$ is the normal distribution with mean $\mu$ and covariance $\Sigma$, and $\mathcal N(x\mid\mu,\Sigma)$ is the probability density at $x$.
  • A vertical bar denotes conditioning.

A forward diffusion process starts at some starting point $x_0 \sim q$, where $q$ is the probability distribution to be learned, then repeatedly adds noise to it by
$$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,z_t$$
where $z_1,\dots,z_T$ are IID samples from $\mathcal N(0,I)$. This is designed so that for any starting distribution of $x_0$, the distribution of $x_t \mid x_0$ converges to $\mathcal N(0,I)$ as $t\to\infty$.

The entire diffusion process then satisfies
$$q(x_{0:T}) = q(x_0)\,q(x_1\mid x_0)\cdots q(x_T\mid x_{T-1}) = q(x_0)\,\mathcal N(x_1\mid\sqrt{\alpha_1}\,x_0,\beta_1 I)\cdots\mathcal N(x_T\mid\sqrt{\alpha_T}\,x_{T-1},\beta_T I)$$
or
$$\ln q(x_{0:T}) = \ln q(x_0) - \sum_{t=1}^T \frac{1}{2\beta_t}\left\|x_t - \sqrt{1-\beta_t}\,x_{t-1}\right\|^2 + C$$
where $C$ is a normalization constant and is often omitted. In particular, we note that $x_{1:T}\mid x_0$ is a Gaussian process, which affords us considerable freedom in reparameterization. For example, by standard manipulation with Gaussian processes,
$$x_t\mid x_0 \sim \mathcal N\!\left(\sqrt{\bar\alpha_t}\,x_0,\ \sigma_t^2 I\right)$$
$$x_{t-1}\mid x_t,x_0 \sim \mathcal N\!\left(\tilde\mu_t(x_t,x_0),\ \tilde\sigma_t^2 I\right)$$
In particular, notice that for large $t$, the variable $x_t\mid x_0 \sim \mathcal N(\sqrt{\bar\alpha_t}\,x_0,\sigma_t^2 I)$ converges to $\mathcal N(0,I)$. That is, after a long enough diffusion process, we end up with some $x_T$ that is very close to $\mathcal N(0,I)$, with all traces of the original $x_0 \sim q$ gone.

For example, since
$$x_t\mid x_0 \sim \mathcal N\!\left(\sqrt{\bar\alpha_t}\,x_0,\ \sigma_t^2 I\right)$$
we can sample $x_t\mid x_0$ directly "in one step", instead of going through all the intermediate steps $x_1,x_2,\dots,x_{t-1}$.
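
As a concrete illustration, the one-step property can be written in a few lines of code. The following is a minimal sketch (not from the cited papers), assuming `alpha_bar` is a precomputed array of the cumulative products $\bar\alpha_1,\dots,\bar\alpha_T$:

```python
import numpy as np

def sample_xt(x0, t, alpha_bar):
    """Sample x_t | x_0 ~ N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I) in one step."""
    rng = np.random.default_rng()
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * noise
```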

Backward diffusion

The key idea of DDPM is to use a neural network parametrized by $\theta$. The network takes in two arguments $x_t, t$, and outputs a vector $\mu_\theta(x_t,t)$ and a matrix $\Sigma_\theta(x_t,t)$, such that each step in the forward diffusion process can be approximately undone by $x_{t-1}\sim\mathcal N(\mu_\theta(x_t,t),\Sigma_\theta(x_t,t))$. This then gives us a backward diffusion process $p_\theta$ defined by
$$p_\theta(x_T) = \mathcal N(x_T\mid 0,I)$$
$$p_\theta(x_{t-1}\mid x_t) = \mathcal N(x_{t-1}\mid\mu_\theta(x_t,t),\Sigma_\theta(x_t,t))$$
The goal now is to learn the parameters such that $p_\theta(x_0)$ is as close to $q(x_0)$ as possible. To do that, we use maximum likelihood estimation with variational inference.

Variational inference

The ELBO inequality states that
$$\ln p_\theta(x_0) \ge E_{x_{1:T}\sim q(\cdot\mid x_0)}\!\left[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T}\mid x_0)\right]$$
and taking one more expectation, we get
$$E_{x_0\sim q}[\ln p_\theta(x_0)] \ge E_{x_{0:T}\sim q}\!\left[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T}\mid x_0)\right]$$
We see that maximizing the quantity on the right would give us a lower bound on the likelihood of observed data. This allows us to perform variational inference.

Define the loss function
$$L(\theta) := -E_{x_{0:T}\sim q}\!\left[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T}\mid x_0)\right]$$
and now the goal is to minimize the loss by stochastic gradient descent. The expression may be simplified to[17]
$$L(\theta) = \sum_{t=1}^T E_{x_{t-1},x_t\sim q}\!\left[-\ln p_\theta(x_{t-1}\mid x_t)\right] + E_{x_0\sim q}\!\left[D_{KL}\!\left(q(x_T\mid x_0)\,\|\,p_\theta(x_T)\right)\right] + C$$
where $C$ does not depend on the parameter, and thus can be ignored. Since $p_\theta(x_T) = \mathcal N(x_T\mid 0,I)$ also does not depend on the parameter, the term $E_{x_0\sim q}\!\left[D_{KL}\!\left(q(x_T\mid x_0)\,\|\,p_\theta(x_T)\right)\right]$ can also be ignored. This leaves just $L(\theta) = \sum_{t=1}^T L_t$ with $L_t = E_{x_{t-1},x_t\sim q}\!\left[-\ln p_\theta(x_{t-1}\mid x_t)\right]$ to be minimized.

Noise prediction network

Since $x_{t-1}\mid x_t,x_0 \sim \mathcal N(\tilde\mu_t(x_t,x_0),\tilde\sigma_t^2 I)$, this suggests that we should use $\mu_\theta(x_t,t) = \tilde\mu_t(x_t,x_0)$; however, the network does not have access to $x_0$, and so it has to estimate it instead. Now, since $x_t\mid x_0 \sim \mathcal N(\sqrt{\bar\alpha_t}\,x_0,\sigma_t^2 I)$, we may write $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sigma_t z$, where $z$ is some unknown Gaussian noise. Now we see that estimating $x_0$ is equivalent to estimating $z$.

Therefore, let the network output a noise vector $\epsilon_\theta(x_t,t)$, and let it predict
$$\mu_\theta(x_t,t) = \tilde\mu_t\!\left(x_t,\ \frac{x_t - \sigma_t\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}\right) = \frac{x_t - \epsilon_\theta(x_t,t)\,\beta_t/\sigma_t}{\sqrt{\alpha_t}}$$
It remains to design $\Sigma_\theta(x_t,t)$. The DDPM paper suggested not learning it (since it resulted in "unstable training and poorer sample quality"), but fixing it at some value $\Sigma_\theta(x_t,t) = \zeta_t^2 I$, where either $\zeta_t^2 = \beta_t$ or $\tilde\sigma_t^2$ yielded similar performance.

With this, the loss simplifies to
$$L_t = \frac{\beta_t^2}{2\alpha_t\sigma_t^2\zeta_t^2}\,E_{x_0\sim q;\,z\sim\mathcal N(0,I)}\!\left[\left\|\epsilon_\theta(x_t,t) - z\right\|^2\right] + C$$
which may be minimized by stochastic gradient descent. The paper noted empirically that an even simpler loss function
$$L_{\mathrm{simple},t} = E_{x_0\sim q;\,z\sim\mathcal N(0,I)}\!\left[\left\|\epsilon_\theta(x_t,t) - z\right\|^2\right]$$
resulted in better models.
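
A minimal sketch of how the simplified loss might be estimated on a minibatch is given below; `eps_model` is a hypothetical noise-prediction network, `alpha_bar` a precomputed schedule, and an actual training loop would differentiate this quantity with an autodiff framework:

```python
import numpy as np

def ddpm_simple_loss(eps_model, x0_batch, alpha_bar):
    """Monte-Carlo estimate of the simplified DDPM loss L_simple on one minibatch."""
    rng = np.random.default_rng()
    T = len(alpha_bar)
    t = rng.integers(1, T + 1, size=len(x0_batch))           # uniform timestep per sample
    z = rng.standard_normal(x0_batch.shape)                   # target noise
    a = alpha_bar[t - 1].reshape(-1, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * z        # one-step forward diffusion
    pred = eps_model(x_t, t)                                   # hypothetical noise-prediction network
    return np.mean(np.sum((pred - z) ** 2, axis=tuple(range(1, x0_batch.ndim))))
```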

Backward diffusion process

After a noise prediction network is trained, it can be used for generating data points in the original distribution in a loop as follows:

  1. Compute the noise estimate $\epsilon \leftarrow \epsilon_\theta(x_t,t)$
  2. Compute the original data estimate $\tilde x_0 \leftarrow (x_t - \sigma_t\epsilon)/\sqrt{\bar\alpha_t}$
  3. Sample the previous data $x_{t-1} \sim \mathcal N(\tilde\mu_t(x_t,\tilde x_0),\ \tilde\sigma_t^2 I)$
  4. Change time $t \leftarrow t-1$
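
The loop above can be sketched in code as follows; this is a minimal illustration assuming precomputed arrays `alpha`, `alpha_bar`, `beta` and a trained noise predictor `eps_model`, not a reference implementation:

```python
import numpy as np

def ddpm_sample(eps_model, shape, alpha, alpha_bar, beta):
    """Generate a sample by running the learned backward diffusion process (ancestral sampling)."""
    rng = np.random.default_rng()
    T = len(beta)
    sigma2 = 1.0 - alpha_bar                                   # sigma_t^2 = 1 - alpha_bar_t
    x = rng.standard_normal(shape)                             # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        eps = eps_model(x, t)                                  # step 1: noise estimate
        x0_hat = (x - np.sqrt(sigma2[t - 1]) * eps) / np.sqrt(alpha_bar[t - 1])  # step 2: data estimate
        if t > 1:
            # step 3: sample from N(tilde_mu_t(x_t, x0_hat), tilde_sigma_t^2 I)
            mean = (np.sqrt(alpha[t - 1]) * (1 - alpha_bar[t - 2]) * x
                    + np.sqrt(alpha_bar[t - 2]) * beta[t - 1] * x0_hat) / sigma2[t - 1]
            var = (1 - alpha_bar[t - 2]) / (1 - alpha_bar[t - 1]) * beta[t - 1]  # tilde_sigma_t^2
            x = mean + np.sqrt(var) * rng.standard_normal(shape)                 # step 4: t <- t - 1
        else:
            x = x0_hat                                         # final step: return the clean estimate
    return x
```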

Score-based generative model

The score-based generative model is another formulation of diffusion modelling. Such models are also called noise conditional score networks (NCSN) or score-matching with Langevin dynamics (SMLD).[18][19][20][21]

Score matching

The idea of score functions

Consider the problem of image generation. Let x represent an image, and let q(x) be the probability distribution over all possible images. If we have q(x) itself, then we can say for certain how likely a certain image is. However, this is intractable in general.

Most often, we are uninterested in knowing the absolute probability of a certain image. Instead, we are usually only interested in knowing how likely a certain image is compared to its immediate neighbors: e.g. how much more likely is an image of a cat compared to some small variants of it? Is it more likely if the image contains two whiskers, or three, or with some Gaussian noise added?

Consequently, we are actually quite uninterested in $q(x)$ itself, but rather in $\nabla_x\ln q(x)$. This has two major effects:

  • One, we no longer need to normalize $q(x)$, but can use any $\tilde q(x) = Cq(x)$, where $C = \int\tilde q(x)\,dx > 0$ is any unknown constant that is of no concern to us.
  • Two, we are comparing $q(x)$ with its neighbors $q(x+dx)$, by $\frac{q(x)}{q(x+dx)} = e^{-\langle\nabla_x\ln q,\ dx\rangle}$

Let the score function be $s(x) := \nabla_x\ln q(x)$; then consider what we can do with $s(x)$.

As it turns out, $s(x)$ allows us to sample from $q(x)$ using thermodynamics. Specifically, if we have a potential energy function $U(x) = -\ln q(x)$, and a lot of particles in the potential well, then the distribution at thermodynamic equilibrium is the Boltzmann distribution $q_U(x) \propto e^{-U(x)/k_BT} = q(x)^{1/k_BT}$. At temperature $k_BT = 1$, the Boltzmann distribution is exactly $q(x)$.

Therefore, to model $q(x)$, we may start with a particle sampled at any convenient distribution (such as the standard Gaussian distribution), then simulate the motion of the particle forwards according to the Langevin equation
$$dx_t = -\nabla_{x_t}U(x_t)\,dt + dW_t$$
and the Boltzmann distribution is, by the Fokker–Planck equation, the unique thermodynamic equilibrium. So no matter what distribution $x_0$ has, the distribution of $x_t$ converges in distribution to $q$ as $t\to\infty$.

Learning the score function

Given a density $q$, we wish to learn a score function approximation $f_\theta \approx \nabla\ln q$. This is score matching.[22] Typically, score matching is formalized as minimizing the Fisher divergence $E_q[\|f_\theta(x) - \nabla\ln q(x)\|^2]$. By expanding the integral and performing an integration by parts,
$$E_q\!\left[\|f_\theta(x) - \nabla\ln q(x)\|^2\right] = E_q\!\left[\|f_\theta\|^2 + 2\nabla\cdot f_\theta\right] + C$$
giving us a loss function, also known as the Hyvärinen scoring rule, that can be minimized by stochastic gradient descent.

Annealing the score function

Suppose we need to model the distribution of images, and we want $x_0 \sim \mathcal N(0,I)$, a white-noise image. Now, most white-noise images do not look like real images, so $q(x_0) \approx 0$ for large swaths of $x_0 \sim \mathcal N(0,I)$. This presents a problem for learning the score function, because if there are no samples around a certain point, then we can't learn the score function at that point. If we do not know the score function $\nabla_{x_t}\ln q(x_t)$ at that point, then we cannot impose the time-evolution equation on a particle:
$$dx_t = \nabla_{x_t}\ln q(x_t)\,dt + dW_t$$
To deal with this problem, we perform annealing. If $q$ is too different from a white-noise distribution, then progressively add noise until it is indistinguishable from one. That is, we perform a forward diffusion, then learn the score function, then use the score function to perform a backward diffusion.

Continuous diffusion processes

Forward diffusion process

Consider again the forward diffusion process, but this time in continuous time:
$$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,z_t$$
By taking the limit $\beta_t \to \beta(t)\,dt$, $\sqrt{dt}\,z_t \to dW_t$, we obtain a continuous diffusion process, in the form of a stochastic differential equation:
$$dx_t = -\tfrac{1}{2}\beta(t)\,x_t\,dt + \sqrt{\beta(t)}\,dW_t$$
where $W_t$ is a Wiener process (multidimensional Brownian motion).

Now, the equation is exactly a special case of the overdamped Langevin equation
$$dx_t = -\frac{D}{k_BT}\,(\nabla_x U)\,dt + \sqrt{2D}\,dW_t$$
where $D$ is the diffusion tensor, $T$ is the temperature, and $U$ is the potential energy field. If we substitute in $D = \tfrac12\beta(t)I$, $k_BT = 1$, $U = \tfrac12\|x\|^2$, we recover the above equation. This explains why the phrase "Langevin dynamics" is sometimes used in diffusion models.

Now the above equation is for the stochastic motion of a single particle. Suppose we have a cloud of particles distributed according to $q$ at time $t=0$; then after a long time, the cloud of particles would settle into the stable distribution of $\mathcal N(0,I)$. Let $\rho_t$ be the density of the cloud of particles at time $t$; then we have
$$\rho_0 = q; \qquad \rho_T \approx \mathcal N(0,I)$$
and the goal is to somehow reverse the process, so that we can start at the end and diffuse back to the beginning.

By the Fokker–Planck equation, the density of the cloud evolves according to
$$\partial_t\ln\rho_t = \tfrac12\beta(t)\left(n + \left(x + \nabla\ln\rho_t\right)\cdot\nabla\ln\rho_t + \Delta\ln\rho_t\right)$$
where $n$ is the dimension of space, and $\Delta$ is the Laplace operator. Equivalently,
$$\partial_t\rho_t = \tfrac12\beta(t)\left(\nabla\cdot(x\rho_t) + \Delta\rho_t\right)$$

Backward diffusion process

If we have solved $\rho_t$ for time $t\in[0,T]$, then we can exactly reverse the evolution of the cloud. Suppose we start with another cloud of particles with density $\nu_0 = \rho_T$, and let the particles in the cloud evolve according to
$$dy_t = \tfrac12\beta(T-t)\,y_t\,dt + \beta(T-t)\underbrace{\nabla_{y_t}\ln\rho_{T-t}(y_t)}_{\text{score function}}\,dt + \sqrt{\beta(T-t)}\,dW_t$$
then by plugging into the Fokker–Planck equation, we find that $\partial_t\rho_{T-t} = \partial_t\nu_t$. Thus this cloud of points is the original cloud, evolving backwards.[23]

Noise conditional score network (NCSN)

At the continuous limit,
$$\bar\alpha_t = (1-\beta_1)\cdots(1-\beta_t) = e^{\sum_i\ln(1-\beta_i)} \to e^{-\int_0^t\beta(s)\,ds}$$
and so
$$x_t\mid x_0 \sim \mathcal N\!\left(e^{-\frac12\int_0^t\beta(s)\,ds}\,x_0,\ \left(1 - e^{-\int_0^t\beta(s)\,ds}\right)I\right)$$
In particular, we see that we can directly sample from any point in the continuous diffusion process without going through the intermediate steps, by first sampling $x_0\sim q$, $z\sim\mathcal N(0,I)$, then computing
$$x_t = e^{-\frac12\int_0^t\beta(s)\,ds}\,x_0 + \sqrt{1 - e^{-\int_0^t\beta(s)\,ds}}\,z$$
That is, we can quickly sample $x_t\sim\rho_t$ for any $t\ge 0$.

Now, define a certain probability distribution $\gamma$ over $[0,\infty)$; then the score-matching loss function is defined as the expected Fisher divergence:
$$L(\theta) = E_{t\sim\gamma,\,x_t\sim\rho_t}\!\left[\|f_\theta(x_t,t)\|^2 + 2\nabla\cdot f_\theta(x_t,t)\right]$$
After training, $f_\theta(x_t,t) \approx \nabla\ln\rho_t$, so we can perform the backwards diffusion process by first sampling $x_T\sim\mathcal N(0,I)$, then integrating the SDE from $t=T$ to $t=0$:
$$x_{t-dt} = x_t + \tfrac12\beta(t)\,x_t\,dt + \beta(t)\,f_\theta(x_t,t)\,dt + \sqrt{\beta(t)}\,dW_t$$
This may be done by any SDE integration method, such as the Euler–Maruyama method.
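
A minimal sketch of this backward integration, assuming a trained score network `score_model(x, t)` approximating $\nabla\ln\rho_t$ and a callable noise rate `beta(t)` (both hypothetical names):

```python
import numpy as np

def reverse_sde_sample(score_model, beta, shape, T=1.0, n_steps=1000):
    """Integrate the reverse-time SDE with the Euler-Maruyama method, from t = T down to t = 0."""
    rng = np.random.default_rng()
    dt = T / n_steps
    x = rng.standard_normal(shape)                             # x_T ~ N(0, I)
    t = T
    for _ in range(n_steps):
        drift = 0.5 * beta(t) * x + beta(t) * score_model(x, t)            # (1/2) beta x + beta * score
        noise = np.sqrt(beta(t)) * np.sqrt(dt) * rng.standard_normal(shape)
        x = x + drift * dt + noise                             # one Euler-Maruyama step backwards in time
        t -= dt
    return x
```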

The name "noise conditional score network" is explained thus:

  • "network", because fθ is implemented as a neural network.
  • "score", because the output of the network is interpreted as approximating the score function lnρt.
  • "noise conditional", because ρt is equal to ρ0 blurred by an added gaussian noise that increases with time, and so the score function depends on the amount of noise added.

Their equivalence

DDPM and score-based generative models are equivalent.[19][2][24] This means that a network trained using DDPM can be used as a NCSN, and vice versa.

We know that $x_t\mid x_0 \sim \mathcal N(\sqrt{\bar\alpha_t}\,x_0,\ \sigma_t^2 I)$, so by Tweedie's formula, we have
$$\nabla_{x_t}\ln q(x_t) = \frac{1}{\sigma_t^2}\left(-x_t + \sqrt{\bar\alpha_t}\,E_q[x_0\mid x_t]\right)$$
As described previously, the DDPM loss function is $\sum_t L_{\mathrm{simple},t}$ with
$$L_{\mathrm{simple},t} = E_{x_0\sim q;\,z\sim\mathcal N(0,I)}\!\left[\left\|\epsilon_\theta(x_t,t) - z\right\|^2\right]$$
where $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sigma_t z$. By a change of variables,
$$L_{\mathrm{simple},t} = E_{x_0,x_t\sim q}\!\left[\left\|\epsilon_\theta(x_t,t) - \frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{\sigma_t}\right\|^2\right] = E_{x_t\sim q,\,x_0\sim q(\cdot\mid x_t)}\!\left[\left\|\epsilon_\theta(x_t,t) - \frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{\sigma_t}\right\|^2\right]$$
and the term inside becomes a least squares regression, so if the network actually reaches the global minimum of loss, then we have
$$\epsilon_\theta(x_t,t) = \frac{x_t - \sqrt{\bar\alpha_t}\,E_q[x_0\mid x_t]}{\sigma_t} = -\sigma_t\nabla_{x_t}\ln q(x_t)$$

Thus, a score-based network predicts noise, and can be used for denoising.

Conversely, the continuous limit $x_{t-1} = x_{t-dt}$, $\beta_t = \beta(t)\,dt$, $z_t\sqrt{dt} = dW_t$ of the backward equation
$$x_{t-1} = \frac{x_t}{\sqrt{\alpha_t}} - \frac{\beta_t}{\sigma_t\sqrt{\alpha_t}}\,\epsilon_\theta(x_t,t) + \sqrt{\beta_t}\,z_t; \qquad z_t\sim\mathcal N(0,I)$$
gives us precisely the same equation as score-based diffusion:
$$x_{t-dt} = x_t\left(1 + \beta(t)\,dt/2\right) + \beta(t)\,\nabla_{x_t}\ln q(x_t)\,dt + \sqrt{\beta(t)}\,dW_t$$
Thus, at infinitesimal steps of DDPM, a denoising network performs score-based diffusion.

Main variants

Noise schedule

Illustration of a linear diffusion noise schedule, with settings $\beta_1 = 10^{-4}$, $\beta_{1000} = 0.02$.

In DDPM, the sequence of numbers $0 = \sigma_0 < \sigma_1 < \dots < \sigma_T < 1$ is called a (discrete time) noise schedule. In general, consider a strictly increasing monotonic function $\sigma$ of type $\mathbb R\to(0,1)$, such as the sigmoid function. In that case, a noise schedule is a sequence of real numbers $\lambda_1 < \lambda_2 < \dots < \lambda_T$. It then defines a sequence of noises $\sigma_t := \sigma(\lambda_t)$, which then derives the other quantities $\beta_t = 1 - \frac{1-\sigma_t^2}{1-\sigma_{t-1}^2}$.

In order to use arbitrary noise schedules, instead of training a noise prediction model $\epsilon_\theta(x_t,t)$, one trains $\epsilon_\theta(x_t,\sigma_t)$.

Similarly, for the noise conditional score network, instead of training $f_\theta(x_t,t)$, one trains $f_\theta(x_t,\sigma_t)$.

Denoising Diffusion Implicit Model (DDIM)

The original DDPM method for generating images is slow, since the forward diffusion process usually takes $T\sim 1000$ steps to make the distribution of $x_T$ appear close to Gaussian. This means the backward diffusion process also takes about 1000 steps. Unlike the forward diffusion process, which can skip steps because $x_t\mid x_0$ is Gaussian for all $t\ge 1$, the backward diffusion process does not allow skipping steps. For example, sampling $x_{t-2}\mid x_{t-1}\sim\mathcal N(\mu_\theta(x_{t-1},t-1),\Sigma_\theta(x_{t-1},t-1))$ requires the model to first sample $x_{t-1}$. Attempting to directly sample $x_{t-2}\mid x_t$ would require us to marginalize out $x_{t-1}$, which is generally intractable.

DDIM[25] is a method to take any model trained on the DDPM loss and use it to sample with some steps skipped, sacrificing an adjustable amount of quality. If we generalize the Markovian case in DDPM to the non-Markovian case, DDIM corresponds to the case where the reverse process has variance equal to zero. In other words, the reverse process (and also the forward process) is deterministic. When using fewer sampling steps, DDIM outperforms DDPM.

In detail, the DDIM sampling method is as follows. Start with the forward diffusion process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sigma_t\epsilon$. Then, during the backward denoising process, given $x_t$ and $\epsilon_\theta(x_t,t)$, the original data is estimated as
$$x_0 = \frac{x_t - \sigma_t\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}$$
then the backward diffusion process can jump to any step $0\le s < t$, and the next denoised sample is
$$x_s = \sqrt{\bar\alpha_s}\,x_0 + \sqrt{\sigma_s^2 - (\sigma_s')^2}\,\epsilon_\theta(x_t,t) + \sigma_s'\,\epsilon$$
where $\sigma_s'$ is an arbitrary real number within the range $[0,\sigma_s]$, and $\epsilon\sim\mathcal N(0,I)$ is newly sampled Gaussian noise.[17] If all $\sigma_s'=0$, then the backward process becomes deterministic, and this special case of DDIM is also called "DDIM". The original paper noted that when the process is deterministic, samples generated with only 20 steps are already very similar at a high level to ones generated with 1000 steps.

The original paper recommended defining a single "eta value" $\eta\in[0,1]$, such that $\sigma_s' = \eta\tilde\sigma_s$. When $\eta=1$, this is the original DDPM. When $\eta=0$, this is fully deterministic DDIM. For intermediate values, the process interpolates between them.
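
A single DDIM update can be sketched as follows; this minimal illustration assumes a trained noise predictor `eps_model` and a precomputed `alpha_bar` array, with `sigma_prime` playing the role of $\sigma_s'$ (0 gives the deterministic DDIM):

```python
import numpy as np

def ddim_step(eps_model, x_t, t, s, alpha_bar, sigma_prime=0.0):
    """One DDIM update from step t to an earlier step s < t; sigma_prime = 0 is deterministic DDIM."""
    rng = np.random.default_rng()
    sigma_t = np.sqrt(1.0 - alpha_bar[t - 1])
    sigma_s = np.sqrt(1.0 - alpha_bar[s - 1])
    eps = eps_model(x_t, t)
    x0_est = (x_t - sigma_t * eps) / np.sqrt(alpha_bar[t - 1])       # estimate of the original data
    dir_xt = np.sqrt(sigma_s**2 - sigma_prime**2) * eps              # "direction pointing towards x_t"
    noise = sigma_prime * rng.standard_normal(x_t.shape)             # fresh noise, only if sigma_prime > 0
    return np.sqrt(alpha_bar[s - 1]) * x0_est + dir_xt + noise
```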

By the equivalence, the DDIM algorithm also applies for score-based diffusion models.

Latent diffusion model (LDM)

Since the diffusion model is a general method for modelling probability distributions, if one wants to model a distribution over images, one can first encode the images into a lower-dimensional space by an encoder, then use a diffusion model to model the distribution over encoded images. Then to generate an image, one can sample from the diffusion model, then use a decoder to decode it into an image.[26]

The encoder-decoder pair is most often a variational autoencoder (VAE).

Architectural improvements

Various architectural improvements were proposed.[27] For example, one proposal is log-space interpolation during backward sampling: instead of sampling from $x_{t-1}\sim\mathcal N(\tilde\mu_t(x_t,\tilde x_0),\tilde\sigma_t^2 I)$, one samples from $\mathcal N\!\left(\tilde\mu_t(x_t,\tilde x_0),\ \left(\sigma_t^{v}\,\tilde\sigma_t^{1-v}\right)^2 I\right)$ for a learned parameter $v$.

In the v-prediction formalism, the noising formula $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon_t$ is reparameterised by an angle $\phi_t$ such that $\cos\phi_t = \sqrt{\bar\alpha_t}$ and a "velocity" defined by $v_t := \cos\phi_t\,\epsilon_t - \sin\phi_t\,x_0$. The network is trained to predict the velocity $\hat v_\theta$, and denoising is by $x_{\phi_t-\delta} = \cos(\delta)\,x_{\phi_t} - \sin(\delta)\,\hat v_\theta(x_{\phi_t})$.[28] This parameterization was found to improve performance, as the model can be trained to reach total noise (i.e. $\phi_t = 90^\circ$) and then reverse it, whereas the standard parameterization never reaches total noise since $\sqrt{\bar\alpha_t} > 0$ is always true.[29]
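
The trigonometric identities above give a simple conversion from a v-prediction back to noise and data estimates. A minimal sketch, assuming `alpha_bar_t` is $\bar\alpha_t$ and `v_pred` is the network output:

```python
import numpy as np

def v_to_eps_and_x0(x_t, v_pred, alpha_bar_t):
    """Recover noise and clean-data estimates from a v-prediction, using cos(phi) = sqrt(alpha_bar_t)."""
    cos_phi = np.sqrt(alpha_bar_t)
    sin_phi = np.sqrt(1.0 - alpha_bar_t)
    x0_est = cos_phi * x_t - sin_phi * v_pred      # x0  = cos(phi) x_t - sin(phi) v
    eps_est = sin_phi * x_t + cos_phi * v_pred     # eps = sin(phi) x_t + cos(phi) v
    return eps_est, x0_est
```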

Classifier guidance

Classifier guidance was proposed in 2021 to improve class-conditional generation by using a classifier. The original publication used CLIP text encoders to improve text-conditional image generation.[30]

Suppose we wish to sample not from the entire distribution of images, but conditional on the image description. We don't want to sample a generic image, but an image that fits the description "black cat with red eyes". Generally, we want to sample from the distribution p(x|y), where x ranges over images, and y ranges over classes of images (a description "black cat with red eyes" is just a very detailed class, and a class "cat" is just a very vague description).

Taking the perspective of the noisy channel model, we can understand the process as follows: To generate an image x conditional on description y, we imagine that the requester really had in mind an image x, but the image is passed through a noisy channel and came out garbled, as y. Image generation is then nothing but inferring which x the requester had in mind.

In other words, conditional image generation is simply "translating from a textual language into a pictorial language". Then, as in the noisy-channel model, we use Bayes' theorem to get
$$p(x|y) \propto p(y|x)\,p(x)$$
in other words, if we have a good model of the space of all images and a good image-to-class translator, we get a class-to-image translator "for free". In the equation for backward diffusion, the score $\nabla\ln p(x)$ can be replaced by
$$\nabla_x\ln p(x|y) = \underbrace{\nabla_x\ln p(x)}_{\text{score}} + \underbrace{\nabla_x\ln p(y|x)}_{\text{classifier guidance}}$$
where $\nabla_x\ln p(x)$ is the score function, trained as previously described, and $\nabla_x\ln p(y|x)$ is found by using a differentiable image classifier.

During the diffusion process, we need to condition on the time, giving
$$\nabla_{x_t}\ln p(x_t|y,t) = \nabla_{x_t}\ln p(y|x_t,t) + \nabla_{x_t}\ln p(x_t|t)$$
Usually the classifier model does not depend on time, in which case $p(y|x_t,t) = p(y|x_t)$.

Classifier guidance is defined for the gradient of the score function, and thus for score-based diffusion networks; but as previously noted, score-based diffusion models are equivalent to denoising models via $\epsilon_\theta(x_t,t) = -\sigma_t\nabla_{x_t}\ln p(x_t|t)$, and similarly, $\epsilon_\theta(x_t,y,t) = -\sigma_t\nabla_{x_t}\ln p(x_t|y,t)$. Therefore, classifier guidance works for denoising diffusion as well, using the modified noise prediction:[30]
$$\epsilon_\theta(x_t,y,t) = \epsilon_\theta(x_t,t) - \sigma_t\underbrace{\nabla_{x_t}\ln p(y|x_t,t)}_{\text{classifier guidance}}$$

With temperature

The classifier-guided diffusion model samples from $p(x|y)$, which is concentrated around the maximum a posteriori estimate $\arg\max_x p(x|y)$. If we want to force the model to move towards the maximum likelihood estimate $\arg\max_x p(y|x)$, we can use
$$p_\gamma(x|y) \propto p(y|x)^\gamma\,p(x)$$
where $\gamma>0$ is interpretable as an inverse temperature. In the context of diffusion models, it is usually called the guidance scale. A high $\gamma$ would force the model to sample from a distribution concentrated around $\arg\max_x p(y|x)$. This sometimes improves the quality of generated images.[30]

This gives a modification to the previous equation:
$$\nabla_x\ln p_\gamma(x|y) = \nabla_x\ln p(x) + \gamma\,\nabla_x\ln p(y|x)$$
For denoising models, it corresponds to[31]
$$\epsilon_\theta(x_t,y,t) = \epsilon_\theta(x_t,t) - \gamma\,\sigma_t\,\nabla_{x_t}\ln p(y|x_t,t)$$
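
A minimal sketch of scaled classifier guidance, assuming a hypothetical noise predictor `eps_model` and a callable `classifier_grad` that returns $\nabla_{x_t}\ln p(y|x_t,t)$ (e.g. computed by automatic differentiation through a classifier):

```python
def guided_eps(eps_model, classifier_grad, x_t, y, t, sigma_t, gamma=1.0):
    """Classifier-guided noise prediction: eps(x_t,t) - gamma * sigma_t * grad_x log p(y | x_t, t)."""
    eps = eps_model(x_t, t)                  # unconditional noise prediction
    grad = classifier_grad(x_t, y, t)        # gradient of log p(y | x_t, t) with respect to x_t
    return eps - gamma * sigma_t * grad
```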

Classifier-free guidance (CFG)

If we do not have a classifier $p(y|x)$, we can still extract one out of the image model itself:[31]
$$\nabla_x\ln p_\gamma(x|y) = (1-\gamma)\,\nabla_x\ln p(x) + \gamma\,\nabla_x\ln p(x|y)$$
Such a model is usually trained by presenting it with both $(x,y)$ and $(x,\mathrm{None})$, allowing it to model both $\nabla_x\ln p(x|y)$ and $\nabla_x\ln p(x)$.

Note that for CFG, the diffusion model cannot be merely a generative model of the entire data distribution $\nabla_x\ln p(x)$. It must be a conditional generative model $\nabla_x\ln p(x|y)$. For example, in Stable Diffusion, the diffusion backbone takes as input a noisy image $x_t$, a time $t$, and a conditioning vector $y$ (such as a vector encoding a text prompt), and produces a noise prediction $\epsilon_\theta(x_t,y,t)$.

For denoising models, it corresponds to
$$\epsilon_\theta(x_t,y,t,\gamma) = \epsilon_\theta(x_t,t) + \gamma\left(\epsilon_\theta(x_t,y,t) - \epsilon_\theta(x_t,t)\right)$$
As sampled by DDIM, the algorithm can be written as[32]
$$\epsilon_{\text{uncond}} \leftarrow \epsilon_\theta(x_t,t)$$
$$\epsilon_{\text{cond}} \leftarrow \epsilon_\theta(x_t,t,c)$$
$$\epsilon_{\text{CFG}} \leftarrow \epsilon_{\text{uncond}} + \gamma(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$$
$$x_0 \leftarrow (x_t - \sigma_t\,\epsilon_{\text{CFG}})/\sqrt{1-\sigma_t^2}$$
$$x_s \leftarrow \sqrt{1-\sigma_s^2}\,x_0 + \sqrt{\sigma_s^2 - (\sigma_s')^2}\,\epsilon_{\text{uncond}} + \sigma_s'\,\epsilon$$
A similar technique applies to language model sampling. Also, if the unconditional generation $\epsilon_{\text{uncond}}\leftarrow\epsilon_\theta(x_t,t)$ is replaced by $\epsilon_{\text{neg cond}}\leftarrow\epsilon_\theta(x_t,t,c')$ for a negative condition $c'$, the result is negative prompting, which pushes the generation away from $c'$.[33][34]
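
A minimal sketch of the CFG combination, assuming a hypothetical conditional noise predictor `eps_model(x_t, t, cond)` that accepts `None` for the unconditional case:

```python
def cfg_eps(eps_model, x_t, t, cond, gamma):
    """Classifier-free guided noise prediction, mixing conditional and unconditional predictions."""
    eps_uncond = eps_model(x_t, t, None)     # model called without conditioning
    eps_cond = eps_model(x_t, t, cond)       # model called with the conditioning vector
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```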

Samplers

Given a diffusion model, one may regard it either as a continuous process and sample from it by integrating an SDE, or regard it as a discrete process and sample from it by iterating the discrete steps. The choice of the "noise schedule" $\beta_t$ can also affect the quality of samples. A noise schedule is a function that sends a natural number to a noise level:
$$t\mapsto\beta_t,\qquad t\in\{1,2,\dots\},\qquad \beta_t\in(0,1)$$
A noise schedule is more often specified by a map $t\mapsto\sigma_t$. The two definitions are equivalent, since $\beta_t = 1 - \frac{1-\sigma_t^2}{1-\sigma_{t-1}^2}$.
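
The conversion between the two specifications is a one-liner. A minimal sketch, assuming `sigmas` is the sequence $\sigma_1,\dots,\sigma_T$:

```python
import numpy as np

def betas_from_sigmas(sigmas):
    """Convert a noise schedule given as sigma_1..sigma_T into the equivalent beta_1..beta_T."""
    s = np.concatenate([[0.0], np.asarray(sigmas)])          # sigma_0 = 0 by convention
    return 1.0 - (1.0 - s[1:] ** 2) / (1.0 - s[:-1] ** 2)    # beta_t = 1 - (1 - sigma_t^2)/(1 - sigma_{t-1}^2)
```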

In the DDPM perspective, one can use the DDPM itself (with noise), or DDIM (with adjustable amount of noise). The case where one adds noise is sometimes called ancestral sampling.[35] One can interpolate between noise and no noise. The amount of noise is denoted η ("eta value") in the DDIM paper, with η=0 denoting no noise (as in deterministic DDIM), and η=1 denoting full noise (as in DDPM).

In the perspective of SDE, one can use any of the numerical integration methods, such as Euler–Maruyama method, Heun's method, linear multistep methods, etc. Just as in the discrete case, one can add an adjustable amount of noise during the integration.[36]

For a survey and comparison of samplers in the context of image generation, see.[37]

Other examples

Notable variants include[38] Poisson flow generative model,[39] consistency model,[40] critically-damped Langevin diffusion,[41] GenPhys,[42] cold diffusion,[43] discrete diffusion,[44][45] etc.

Flow-based diffusion model

Abstractly speaking, the idea of a diffusion model is to take an unknown probability distribution (the distribution of natural-looking images) and progressively convert it to a known probability distribution (the standard Gaussian distribution), by building an absolutely continuous probability path connecting them. The probability path is in fact defined implicitly by the score function $\nabla\ln p_t$.

In denoising diffusion models, the forward process adds noise, and the backward process removes noise. Both the forward and backward processes are SDEs, though the forward process is integrable in closed form, so it can be done at no computational cost. The backward process is not integrable in closed form, so it must be integrated step-by-step by standard SDE solvers, which can be very expensive. The probability path in diffusion models is defined through an Itô process, and one can retrieve the deterministic process by using the probability flow ODE formulation.[2]

In flow-based diffusion models, the forward process is a deterministic flow along a time-dependent vector field, and the backward process is also a deterministic flow along the same vector field, but going backwards. Both processes are solutions to ODEs. If the vector field is well-behaved, the ODE will also be well-behaved.

Given two distributions $\pi_0$ and $\pi_1$, a flow-based model is a time-dependent velocity field $v_t(x)$ on $[0,1]\times\mathbb R^d$, such that if we start by sampling a point $x\sim\pi_0$ and let it move according to the velocity field:
$$\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)),\qquad t\in[0,1],\qquad \text{starting from } \phi_0(x) = x$$
we end up with a point $x_1\sim\pi_1$. The solution $\phi_t$ of the above ODE defines a probability path $p_t = [\phi_t]_\#\pi_0$ by the pushforward measure operator. In particular, $[\phi_1]_\#\pi_0 = \pi_1$.

The probability path and the velocity field also satisfy the continuity equation, in the sense of probability distributions:
$$\partial_t p_t + \nabla\cdot(v_t\,p_t) = 0$$
To construct a probability path, we start by constructing a conditional probability path $p_t(x|z)$ and the corresponding conditional velocity field $v_t(x|z)$ for some conditioning distribution $q(z)$. A natural choice is the Gaussian conditional probability path:
$$p_t(x|z) = \mathcal N(m_t(z),\ \zeta_t^2 I)$$
The conditional velocity field corresponding to the geodesic path between the conditional Gaussian distributions is
$$v_t(x|z) = \frac{\zeta_t'}{\zeta_t}\left(x - m_t(z)\right) + m_t'(z)$$
The probability path and velocity field are then computed by marginalizing

$$p_t(x) = \int p_t(x|z)\,q(z)\,dz \qquad\text{and}\qquad v_t(x) = \mathbb E_{q(z)}\!\left[\frac{v_t(x|z)\,p_t(x|z)}{p_t(x)}\right]$$

Optimal transport flow

The idea of optimal transport flow[46] is to construct a probability path minimizing the Wasserstein metric. The distribution on which we condition is an approximation of the optimal transport plan between $\pi_0$ and $\pi_1$: $z = (x_0,x_1)$ and $q(z) = \Gamma(\pi_0,\pi_1)$, where $\Gamma$ is the optimal transport plan, which can be approximated by mini-batch optimal transport. If the batch size is not large, the transport it computes can be very far from the true optimal transport.

Rectified flow

The idea of rectified flow[47][48] is to learn a flow model such that the velocity is nearly constant along each flow path. This is beneficial, because we can integrate along such a vector field with very few steps. For example, if an ODE $\dot\phi_t(x) = v_t(\phi_t(x))$ follows perfectly straight paths, it simplifies to $\phi_t(x) = x_0 + t\,v_0(x_0)$, allowing for exact solutions in one step. In practice, we cannot reach such perfection, but when the flow field is nearly so, we can take a few large steps instead of many little steps.

Figure: linear interpolation, rectified flow, and straightened rectified flow.

The general idea is to start with two distributions $\pi_0$ and $\pi_1$, then construct a flow field $\phi^0 = \{\phi_t : t\in[0,1]\}$ from it, then repeatedly apply a "reflow" operation to obtain successive flow fields $\phi^1,\phi^2,\dots$, each straighter than the previous one. When the flow field is straight enough for the application, we stop.

Generally, for any time-differentiable process $\phi_t$, $v_t$ can be estimated by solving:
$$\min_\theta \int_0^1 \mathbb E_{x\sim p_t}\!\left[\left\|v_t(x,\theta) - v_t(x)\right\|^2\right]dt$$

In rectified flow, by injecting strong priors that intermediate trajectories are straight, it can achieve both theoretical relevance for optimal transport and computational efficiency, as ODEs with straight paths can be simulated precisely without time discretization.

Transport by rectified flow[47]

Specifically, rectified flow seeks to match an ODE with the marginal distributions of the linear interpolation between points from distributions $\pi_0$ and $\pi_1$. Given observations $x_0\sim\pi_0$ and $x_1\sim\pi_1$, the canonical linear interpolation $x_t = t\,x_1 + (1-t)\,x_0$, $t\in[0,1]$, yields the trivial case $\dot x_t = x_1 - x_0$, which cannot be causally simulated without $x_1$. To address this, $x_t$ is "projected" into a space of causally simulatable ODEs, by minimizing the least squares loss with respect to the direction $x_1 - x_0$:
$$\min_\theta \int_0^1 \mathbb E_{\pi_0,\pi_1,p_t}\!\left[\left\|(x_1 - x_0) - v_t(x_t)\right\|^2\right]dt$$

The data pair $(x_0,x_1)$ can be any coupling of $\pi_0$ and $\pi_1$, typically independent (i.e., $(x_0,x_1)\sim\pi_0\times\pi_1$), obtained by randomly combining observations from $\pi_0$ and $\pi_1$. This process ensures that the trajectories closely mirror the density map of the $x_t$ trajectories, but reroute at intersections to ensure causality. This rectifying process is also known as Flow Matching,[49] Stochastic Interpolation,[50] and alpha-(de)blending.[51]
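
A minimal sketch of the rectified-flow (flow-matching) training objective on one batch of independently paired samples, with `v_model` a hypothetical velocity network:

```python
import numpy as np

def rectified_flow_loss(v_model, x0_batch, x1_batch):
    """Monte-Carlo estimate of the rectified-flow loss on one batch of (x0, x1) pairs."""
    rng = np.random.default_rng()
    t = rng.uniform(size=(len(x0_batch),) + (1,) * (x0_batch.ndim - 1))   # t ~ Uniform[0, 1]
    x_t = t * x1_batch + (1.0 - t) * x0_batch                             # linear interpolation
    target = x1_batch - x0_batch                                          # straight-line velocity
    pred = v_model(x_t, t)                                                # hypothetical velocity network
    return np.mean(np.sum((target - pred) ** 2, axis=tuple(range(1, x0_batch.ndim))))
```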

The reflow process[47]

A distinctive aspect of rectified flow is its capability for "reflow", which straightens the trajectory of ODE paths. Denote the rectified flow $\phi^0 = \{\phi_t : t\in[0,1]\}$ induced from $(x_0,x_1)$ as $\phi^0 = \mathsf{Rectflow}((x_0,x_1))$. Recursively applying this $\mathsf{Rectflow}(\cdot)$ operator generates a series of rectified flows $\phi^{k+1} = \mathsf{Rectflow}((\phi_0^k(x_0),\phi_1^k(x_1)))$. This "reflow" process not only reduces transport costs but also straightens the paths of rectified flows, making $\phi^k$ paths straighter with increasing $k$.

Rectified flow includes a nonlinear extension where the linear interpolation $x_t$ is replaced with any time-differentiable curve that connects $x_0$ and $x_1$, given by $x_t = \alpha_t x_1 + \beta_t x_0$. This framework encompasses DDIM and the probability flow ODE as special cases, with particular choices of $\alpha_t$ and $\beta_t$. However, when the path of $x_t$ is not straight, the reflow process no longer ensures a reduction in convex transport costs, and it no longer straightens the paths of $\phi_t$.[47]

See [52] for a tutorial on flow matching, with animations.

Choice of architecture

Architecture of Stable Diffusion
The denoising process used by Stable Diffusion

Diffusion model

For generating images by DDPM, we need a neural network that takes a time $t$ and a noisy image $x_t$, and predicts the noise $\epsilon_\theta(x_t,t)$ in it. Since predicting the noise is equivalent to predicting the denoised image (one is obtained from the other by subtraction from $x_t$), denoising architectures tend to work well. For example, the U-Net, which was found to be good for denoising images, is often used for denoising diffusion models that generate images.[53]

For DDPM, the underlying architecture ("backbone") does not have to be a U-Net. It just has to predict the noise somehow. For example, the diffusion transformer (DiT) uses a Transformer to predict the mean and diagonal covariance of the noise, given the textual conditioning and the partially denoised image. It is the same as a standard U-Net-based denoising diffusion model, with a Transformer replacing the U-Net.[54] A mixture-of-experts Transformer can also be applied.[55]

DDPM can be used to model general data distributions, not just natural-looking images. For example, Human Motion Diffusion[56] models human motion trajectory by DDPM. Each human motion trajectory is a sequence of poses, represented by either joint rotations or positions. It uses a Transformer network to generate a less noisy trajectory out of a noisy one.

Conditioning

The base diffusion model can only generate unconditionally from the whole distribution. For example, a diffusion model learned on ImageNet would generate images that look like a random image from ImageNet. To generate images from just one category, one would need to impose the condition, and then sample from the conditional distribution. Whatever condition one wants to impose, one needs to first convert the conditioning into a vector of floating point numbers, then feed it into the underlying diffusion model neural network. However, one has freedom in choosing how to convert the conditioning into a vector.

Stable Diffusion, for example, imposes conditioning in the form of cross-attention mechanism, where the query is an intermediate representation of the image in the U-Net, and both key and value are the conditioning vectors. The conditioning can be selectively applied to only parts of an image, and new kinds of conditionings can be finetuned upon the base model, as used in ControlNet.[57]

As a particularly simple example, consider image inpainting. The conditions are $\tilde x$, the reference image, and $m$, the inpainting mask. The conditioning is imposed at each step of the backward diffusion process, by first sampling $\tilde x_t \sim \mathcal N(\sqrt{\bar\alpha_t}\,\tilde x,\ \sigma_t^2 I)$, a noisy version of $\tilde x$, then replacing $x_t$ with $(1-m)\odot x_t + m\odot\tilde x_t$, where $\odot$ means elementwise multiplication.[58] Another application of the cross-attention mechanism is prompt-to-prompt image editing.[59]
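
A minimal sketch of this replacement step, assuming `alpha_bar` is the precomputed schedule, `x_ref` the reference image $\tilde x$, and `mask` the inpainting mask $m$ (regions with $m=1$ are filled with the noised reference):

```python
import numpy as np

def inpaint_replace(x_t, x_ref, mask, t, alpha_bar):
    """Impose inpainting conditioning at one backward step: keep x_t where mask == 0, use a noised reference where mask == 1."""
    rng = np.random.default_rng()
    noise = rng.standard_normal(x_ref.shape)
    x_ref_t = np.sqrt(alpha_bar[t - 1]) * x_ref + np.sqrt(1.0 - alpha_bar[t - 1]) * noise  # noisy reference at level t
    return (1.0 - mask) * x_t + mask * x_ref_t
```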

Conditioning is not limited to just generating images from a specific category, or according to a specific caption (as in text-to-image). For example,[56] demonstrated generating human motion, conditioned on an audio clip of human walking (allowing syncing motion to a soundtrack), or video of human running, or a text description of human motion, etc. For how conditional diffusion models are mathematically formulated, see a methodological summary in.[60]

Upscaling

As generating an image takes a long time, one can try to generate a small image by a base diffusion model, then upscale it by other models. Upscaling can be done by GAN,[61] Transformer,[62] or signal processing methods like Lanczos resampling.

Diffusion models themselves can be used to perform upscaling. A cascading diffusion model stacks multiple diffusion models one after another, in the style of Progressive GAN. The lowest level is a standard diffusion model that generates a 32×32 image; then the image is upscaled by a diffusion model specifically trained for upscaling, and the process repeats.[53]

In more detail, the diffusion upscaler is trained as follows:[53]

  • Sample $(x_0,z_0,c)$, where $x_0$ is the high-resolution image, $z_0$ is the same image scaled down to a low resolution, and $c$ is the conditioning, which can be the caption of the image, the class of the image, etc.
  • Sample two white noises $\epsilon_x,\epsilon_z$ and two time-steps $t_x,t_z$. Compute the noisy versions of the high-resolution and low-resolution images: $x_{t_x} = \sqrt{\bar\alpha_{t_x}}\,x_0 + \sigma_{t_x}\epsilon_x$ and $z_{t_z} = \sqrt{\bar\alpha_{t_z}}\,z_0 + \sigma_{t_z}\epsilon_z$.
  • Train the denoising network to predict $\epsilon_x$ given $x_{t_x},z_{t_z},t_x,t_z,c$. That is, apply gradient descent on $\theta$ for the L2 loss $\left\|\epsilon_\theta(x_{t_x},z_{t_z},t_x,t_z,c) - \epsilon_x\right\|_2^2$.
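
The three steps above can be sketched as a single loss evaluation; this is a minimal illustration with a hypothetical denoiser `eps_model(x_tx, z_tz, tx, tz, cond)` and a shared schedule `alpha_bar` for both resolutions:

```python
import numpy as np

def upscaler_loss(eps_model, x0, z0, cond, alpha_bar):
    """One-sample sketch of the cascaded-upscaler loss: noise both resolutions independently, predict the high-res noise."""
    rng = np.random.default_rng()
    T = len(alpha_bar)
    tx, tz = rng.integers(1, T + 1), rng.integers(1, T + 1)       # independent timesteps
    eps_x = rng.standard_normal(x0.shape)
    eps_z = rng.standard_normal(z0.shape)
    x_tx = np.sqrt(alpha_bar[tx - 1]) * x0 + np.sqrt(1 - alpha_bar[tx - 1]) * eps_x
    z_tz = np.sqrt(alpha_bar[tz - 1]) * z0 + np.sqrt(1 - alpha_bar[tz - 1]) * eps_z
    pred = eps_model(x_tx, z_tz, tx, tz, cond)                     # hypothetical denoising network
    return np.sum((pred - eps_x) ** 2)                             # L2 loss on the high-resolution noise
```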

Examples

This section collects some notable diffusion models, and briefly describes their architecture.

OpenAI

The DALL-E series by OpenAI are text-conditional diffusion models of images.

The first version of DALL-E (2021) is not actually a diffusion model. Instead, it uses a Transformer architecture that autoregressively generates a sequence of tokens, which is then converted to an image by the decoder of a discrete VAE. Released with DALL-E was the CLIP classifier, which was used by DALL-E to rank generated images according to how close the image fits the text.

GLIDE (2022-03)[63] is a 3.5-billion-parameter diffusion model, and a small version was released publicly.[6] Soon after, DALL-E 2 was released (2022-04).[64] DALL-E 2 is a 3.5-billion-parameter cascaded diffusion model that generates images from text by "inverting the CLIP image encoder", a technique termed "unCLIP".

The unCLIP method contains 4 models: a CLIP image encoder, a CLIP text encoder, an image decoder, and a "prior" model (which can be a diffusion model, or an autoregressive model). During training, the prior model is trained to convert CLIP image encodings to CLIP text encodings. The image decoder is trained to convert CLIP image encodings back to images. During inference, a text is converted by the CLIP text encoder to a vector, then it is converted by the prior model to an image encoding, then it is converted by the image decoder to an image.

Sora (2024-02) is a diffusion Transformer model (DiT).

Stability AI

Stable Diffusion (2022-08), released by Stability AI, consists of a denoising latent diffusion model (860 million parameters), a VAE, and a text encoder. The denoising network is a U-Net, with cross-attention blocks to allow for conditional image generation.[65][26]

Stable Diffusion 3 (2024-03)[66] changed the latent diffusion model from the UNet to a Transformer model, and so it is a DiT. It uses rectified flow.

Stable Video 4D (2024-07)[67] is a latent diffusion model for videos of 3D objects.

Google

Imagen (2022)[68][69] uses a T5-XXL language model to encode the input text into an embedding vector. It is a cascaded diffusion model with three sub-models. The first step denoises white noise to a 64×64 image, conditional on the embedding vector of the text; this model has 2B parameters. The second step upscales the image from 64×64 to 256×256, conditional on the embedding; this model has 650M parameters. The third step is similar, upscaling from 256×256 to 1024×1024; this model has 400M parameters. The three denoising networks are all U-Nets.

Muse (2023-01)[70] is not a diffusion model, but an encoder-only Transformer that is trained to predict masked image tokens from unmasked image tokens.

Imagen 2 (2023-12) is also diffusion-based. It can generate images based on a prompt that mixes images and text. No further information is available.[71] Imagen 3 (2024-05) is likewise diffusion-based, with no further information available.[72]

Veo (2024) generates videos by latent diffusion. The diffusion is conditioned on a vector that encodes both a text prompt and an image prompt.[73]

Meta

Make-A-Video (2022) is a text-to-video diffusion model.[74][75]

CM3leon (2023) is not a diffusion model, but an autoregressive causally masked Transformer, with mostly the same architecture as LLaMa-2.[76][77]

Transfusion architectural diagram

Transfusion (2024) is a Transformer that combines autoregressive text generation and denoising diffusion. Specifically, it generates text autoregressively (with causal masking), and generates images by denoising multiple times over image tokens (with all-to-all attention).[78]

Movie Gen (2024) is a series of Diffusion Transformers operating on latent space and by flow matching.[79]

References

  1. ↑ Template:Cite arXiv
  2. ↑ 2.0 2.1 2.2 2.3 Template:Cite arXiv
  3. ↑ Template:Cite journal
  4. ↑ 4.0 4.1 Template:Cite journal
  5. ↑ Template:Cite arXiv
  6. ↑ 6.0 6.1 Template:Citation
  7. ↑ Template:Cite arXiv
  8. ↑ Template:Cite book
  9. ↑ Template:Cite journal
  10. ↑ Template:Cite journal
  11. ↑ Template:Cite journal
  12. ↑ Template:Cite journal
  13. ↑ Template:Cite arXiv
  14. ↑ Template:Cite arXiv
  15. ↑ Template:Cite journal
  16. ↑ Template:Citation
  17. ↑ 17.0 17.1 Template:Cite web
  18. ↑ Template:Cite web
  19. ↑ 19.0 19.1 Template:Cite journal
  20. ↑ Template:Cite arXiv
  21. ↑ Template:Citation
  22. ↑ Template:Cite web
  23. ↑ Template:Cite journal
  24. ↑ Template:Cite arXiv
  25. ↑ Template:Cite arXiv
  26. ↑ 26.0 26.1 Template:Cite arXiv
  27. ↑ Template:Cite journal
  28. ↑ Template:Cite conference
  29. ↑ Template:Cite conference
  30. ↑ 30.0 30.1 30.2 Template:Cite arXiv
  31. ↑ 31.0 31.1 Template:Cite arXiv
  32. ↑ Template:Cite arXiv
  33. ↑ Template:Cite arXiv
  34. ↑ Template:Cite arXiv
  35. ↑ Template:Cite arXiv
  36. ↑ Template:Cite arXiv
  37. ↑ Template:Cite arXiv
  38. ↑ Template:Cite journal
  39. ↑ Template:Cite journal
  40. ↑ Template:Cite journal
  41. ↑ Template:Cite arXiv
  42. ↑ Template:Cite arXiv
  43. ↑ Template:Cite journal
  44. ↑ Template:Cite journal
  45. ↑ Template:Cite arXiv
  46. ↑ Template:Cite journal
  47. ↑ 47.0 47.1 47.2 47.3 Template:Cite arXiv
  48. ↑ Template:Cite arXiv
  49. ↑ Template:Cite arXiv
  50. ↑ Template:Cite arXiv
  51. ↑ Template:Cite arXiv
  52. ↑ Template:Cite web
  53. ↑ 53.0 53.1 53.2 Template:Cite journal
  54. ↑ Template:Cite arXiv
  55. ↑ Template:Cite arXiv
  56. ↑ 56.0 56.1 Template:Cite arXiv
  57. ↑ Template:Cite arXiv
  58. ↑ Template:Cite arXiv
  59. ↑ Template:Cite arXiv
  60. ↑ Template:Cite arXiv
  61. ↑ Template:Cite conference
  62. ↑ Template:Cite conference
  63. ↑ Template:Cite arXiv
  64. ↑ Template:Cite arXiv
  65. ↑ Template:Cite web
  66. ↑ Template:Cite arXiv
  67. ↑ Template:Cite arXiv
  68. ↑ Template:Cite web
  69. ↑ Template:Cite journal
  70. ↑ Template:Cite arXiv
  71. ↑ Template:Cite web
  72. ↑ Template:Citation
  73. ↑ Template:Cite web
  74. ↑ Template:Cite web
  75. ↑ Template:Cite arXiv
  76. ↑ Template:Cite web
  77. ↑ Template:Cite arXiv
  78. ↑ Template:Cite arXiv
  79. ↑ Movie Gen: A Cast of Media Foundation Models, The Movie Gen team @ Meta, October 4, 2024.