Diffusion Deep Dive Part 1: From an Impossible Integral to a Two-Line Loss (and Back Out to Samples)

Suchinthaka W. · 22 min read

Tags: diffusion, generative-models, deep-learning, ddpm, machine-learning

This is the first post in a short series on diffusion models. The goal is narrow but important: walk through every line of math between "maximize the log-likelihood of the data" and the two-line PyTorch loss that actually trains a diffusion model. The second post will take this objective and turn it into runnable code.

Diffusion models now dominate image, video, audio, and molecular generation. Every one of them, from DDPM to Stable Diffusion to the latest video models, is trained with essentially the same objective: sample a clean datapoint, add a known amount of Gaussian noise, and ask a neural network to predict the noise back. A simple MSE loss. The reason this simple loss works, and the reason it corresponds to maximum likelihood on a model of incredible expressive power, is the subject of this post.

Reference: Ho, Jain, Abbeel, "Denoising Diffusion Probabilistic Models" (NeurIPS 2020).

Notation Cheat Sheet

| Symbol | Meaning |
| --- | --- |
| $\mathbf{x}_0$ | Clean data sample (e.g. an image) |
| $\mathbf{x}_1, \ldots, \mathbf{x}_T$ | Progressively noised latents ($T$ is typically $1000$) |
| $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ | Fixed forward (noising) kernel |
| $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ | Learned reverse (denoising) kernel |
| $\beta_t$ | Noise variance added at step $t$ |
| $\alpha_t = 1 - \beta_t$, $\;\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$ | Cumulative signal-retention factor |
| $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ | Standard Gaussian noise |
| $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ | The neural network (predicts the noise) |

Equations below are numbered in order of appearance. Any time you see $(n)$ in prose, it refers back to the display equation tagged $(n)$.

1. The Goal: Maximum Likelihood

A diffusion model is a pair of Markov chains. The forward process $q$ takes a clean image $\mathbf{x}_0$ and gradually destroys it by adding Gaussian noise over $T$ steps, until $\mathbf{x}_T$ is pure noise. The reverse process $p_\theta$ is a neural network that learns to undo this, starting from random noise and progressively denoising to produce a sample. Training means: given the fixed forward process, learn the reverse process to match it.

Figure: forward and reverse Markov chains of a diffusion model.

The forward chain (blue, top) is a fixed Gaussian corruption schedule. The reverse chain (red, bottom) is a neural network that tries to invert it step by step.

We have data $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ and a model $p_\theta(\mathbf{x}_0)$, and we want to maximize the log-likelihood

$$\log p_\theta(\mathbf{x}_0). \tag{1}$$

In a diffusion model, $p_\theta(\mathbf{x}_0)$ is defined as the marginal of a joint distribution over a sequence of latent variables $\mathbf{x}_1, \ldots, \mathbf{x}_T$:

$$p_\theta(\mathbf{x}_0) = \int p_\theta(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T}, \tag{2}$$

where the reverse process is

$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \qquad p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I}). \tag{3}$$

Why This Is Intractable

The marginal in $(2)$ requires integrating over all possible trajectories $\mathbf{x}_{1:T}$, which lives in an extremely high-dimensional space. There is no closed form, and Monte Carlo estimates of the integrand have astronomical variance. So we cannot directly optimize $\log p_\theta(\mathbf{x}_0)$.

Intuition. To compute $p_\theta(\mathbf{x}_0)$ directly, we would need to sum the probabilities of every possible sequence of intermediate noisy images $\mathbf{x}_1, \ldots, \mathbf{x}_T$ that could have led to $\mathbf{x}_0$. For a $256 \times 256$ RGB image, each $\mathbf{x}_t$ lives in $\mathbb{R}^{196{,}608}$, and there are $T = 1000$ such steps. The number of plausible trajectories is effectively infinite, and random sampling misses almost all the probability mass. This is why we need a smarter trick, the ELBO.

2. The ELBO Fix

We introduce the forward process $q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)$, a fixed (non-learned) Gaussian Markov chain that progressively adds noise:

$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \tag{4}$$

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}\right). \tag{5}$$
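To make $(5)$ concrete, here is a minimal NumPy sketch of a single forward step. The linear $\beta$-schedule values are the defaults from the DDPM paper; the 8-dimensional "image" is just a toy stand-in:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule from Ho et al. (2020)

def forward_step(x_prev, t, rng):
    """One draw from q(x_t | x_{t-1}) in eq. (5); t is 1-indexed as in the text."""
    beta_t = betas[t - 1]
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)          # toy 8-dimensional "image"
x1 = forward_step(x0, 1, rng)
```

Each step shrinks the signal by $\sqrt{1-\beta_t}$ and adds variance $\beta_t$ of fresh noise, which is exactly why the chain drifts toward a standard Gaussian.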

Using Jensen's inequality (or equivalently the non-negativity of KL divergence):

$$\begin{aligned} \log p_\theta(\mathbf{x}_0) &= \log \int p_\theta(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T} \\ &= \log \mathbb{E}_{q(\mathbf{x}_{1:T}\mid \mathbf{x}_0)}\!\left[\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid \mathbf{x}_0)}\right] \\ &\geq \mathbb{E}_{q(\mathbf{x}_{1:T}\mid \mathbf{x}_0)}\!\left[\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid \mathbf{x}_0)}\right] \;=:\; \mathcal{L}_{\mathrm{ELBO}}. \end{aligned} \tag{6}$$

So instead of maximizing the intractable $\log p_\theta(\mathbf{x}_0)$, we maximize its lower bound $\mathcal{L}_{\mathrm{ELBO}}$.

Intuition. Why is a lower bound good enough? The bound sits below the true log-likelihood, so pushing the bound up pushes the log-likelihood up with it, like raising a ceiling by pushing up the floor beneath it. The floor is far easier to work with: $\mathcal{L}_{\mathrm{ELBO}}$ turns out to be a clean sum of KL terms we can compute and optimize, while the ceiling $\log p_\theta(\mathbf{x}_0)$ would require the impossible trajectory integral. The gap between them is always non-negative, so maximizing the ELBO pushes up the true log-likelihood too.

3. Rewriting the ELBO as a Sum of KL Terms

Step 1: Expand Using the Markov Factorizations

Plug the factorizations from $(3)$ and $(4)$ into the (negated) ELBO:

$$\begin{aligned} -\mathcal{L}_{\mathrm{ELBO}} &= \mathbb{E}_q\!\left[ \log \frac{q(\mathbf{x}_{1:T}\mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \right] \\ &= \mathbb{E}_q\!\left[ \log \frac{\prod_{t=1}^{T} q(\mathbf{x}_t\mid \mathbf{x}_{t-1})}{p(\mathbf{x}_T)\prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)} \right] \\ &= \mathbb{E}_q\!\left[ -\log p(\mathbf{x}_T) + \sum_{t=1}^{T} \log \frac{q(\mathbf{x}_t\mid \mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)} \right]. \end{aligned} \tag{7}$$

Step 2: Flip the Forward Transition with Bayes' Rule

We cannot directly compare $q(\mathbf{x}_t\mid \mathbf{x}_{t-1})$ (a forward step) with $p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)$ (a reverse step): they go in opposite directions. The trick of Ho et al. is to flip the forward transition into a reverse one using Bayes' rule, conditioned on $\mathbf{x}_0$.

Intuition. We want the model (reverse) to match the forward process, but they describe things in opposite directions. Comparing them is like comparing "what is the probability of adding noise from step $t{-}1$ to $t$?" with "what is the probability of removing noise from step $t$ to $t{-}1$?". Both reference the same pair of states, but the arrow of time is flipped. Bayes' rule is the tool for flipping conditional probabilities: given the forward $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$, it gives us the "ground-truth reverse" $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$. Then we can directly compare our learned reverse $p_\theta$ against that ground-truth reverse, apples to apples. The extra conditioning on $\mathbf{x}_0$ is what makes the posterior have a closed form; without it, the posterior would itself be intractable.

Why we can insert $\mathbf{x}_0$ into the conditioning. The defining property of a Markov chain is that the next state depends only on the current state, not on the full history. Formally, for any $s < t-1$:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_{t-2}, \ldots, \mathbf{x}_s) = q(\mathbf{x}_t \mid \mathbf{x}_{t-1}). \tag{8}$$

Once you know $\mathbf{x}_{t-1}$, knowing any earlier state (including $\mathbf{x}_0$) gives no additional information about $\mathbf{x}_t$. The state $\mathbf{x}_{t-1}$ already "screens off" the past. Setting $s = 0$ in $(8)$ gives

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0) = q(\mathbf{x}_t \mid \mathbf{x}_{t-1}). \tag{9}$$

The two forms are literally equal. Why write the longer form at all? Because Bayes' rule needs $\mathbf{x}_0$ to appear in the conditioning set on both sides. The Markov equality $(9)$ lets us smuggle $\mathbf{x}_0$ in for free, which then unlocks the Bayes manipulation.

Applying Bayes' rule to the right-hand side:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0) = \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\, q(\mathbf{x}_t \mid \mathbf{x}_0)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}. \tag{10}$$

We only do this for $t \geq 2$; the $t = 1$ term is handled separately. Substituting $(10)$ into the sum:

$$\begin{aligned} \sum_{t=2}^{T} \log \frac{q(\mathbf{x}_t\mid \mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)} &= \sum_{t=2}^{T} \log \frac{q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)\, q(\mathbf{x}_t\mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)\, q(\mathbf{x}_{t-1}\mid \mathbf{x}_0)} \\ &= \underbrace{\sum_{t=2}^{T} \log \frac{q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)}}_{\text{will become KL terms}} + \underbrace{\sum_{t=2}^{T} \log \frac{q(\mathbf{x}_t\mid \mathbf{x}_0)}{q(\mathbf{x}_{t-1}\mid \mathbf{x}_0)}}_{\text{telescopes}}. \end{aligned} \tag{11}$$

Step 3: The Telescoping Sum

The second sum in $(11)$ telescopes because each denominator cancels the previous numerator:

$$\sum_{t=2}^{T} \log \frac{q(\mathbf{x}_t\mid \mathbf{x}_0)}{q(\mathbf{x}_{t-1}\mid \mathbf{x}_0)} = \log \frac{q(\mathbf{x}_T\mid \mathbf{x}_0)}{q(\mathbf{x}_1\mid \mathbf{x}_0)}. \tag{12}$$
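The telescoping is pure algebra, so it can be sanity-checked with arbitrary positive numbers standing in for the marginals $q(\mathbf{x}_t \mid \mathbf{x}_0)$. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10
q = rng.uniform(0.1, 1.0, size=T + 1)        # q[t] stands in for q(x_t | x_0)

# Left-hand side of (12): the full sum of log-ratios for t = 2..T.
lhs = sum(np.log(q[t] / q[t - 1]) for t in range(2, T + 1))
# Right-hand side: only the endpoints survive the cancellation.
rhs = np.log(q[T] / q[1])
assert np.isclose(lhs, rhs)
```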

Step 4: Put It All Together

We now substitute $(11)$ and $(12)$ back into $(7)$.

Substitution 1: split $(7)$ into $t=1$ and $t \geq 2$. Step 2's Bayes trick was applied only for $t \geq 2$, so separate that term out:

$$\sum_{t=1}^{T} \log \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)} = \log\frac{q(\mathbf{x}_1 \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)} + \sum_{t=2}^{T} \log\frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)}. \tag{13}$$

Substitution 2: plug in $(11)$ for the $t \geq 2$ sum, then $(12)$ for the telescope. Putting both into $(13)$, then $(13)$ into $(7)$:

$$-\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_q\!\Bigg[ -\log p(\mathbf{x}_T) + \log\frac{q(\mathbf{x}_T\mid \mathbf{x}_0)}{q(\mathbf{x}_1\mid \mathbf{x}_0)} + \sum_{t=2}^{T} \log\frac{q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1\mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)} \Bigg]. \tag{14}$$

We now regroup the four pieces inside the expectation in $(14)$. This is not a new derivation, only bookkeeping.

Bookkeeping 1: merge $-\log p(\mathbf{x}_T)$ with the telescoped term.

$$-\log p(\mathbf{x}_T) + \log\frac{q(\mathbf{x}_T\mid \mathbf{x}_0)}{q(\mathbf{x}_1\mid \mathbf{x}_0)} = \log\frac{q(\mathbf{x}_T\mid \mathbf{x}_0)}{p(\mathbf{x}_T)} \;-\; \log q(\mathbf{x}_1\mid \mathbf{x}_0). \tag{15}$$

Bookkeeping 2: split the last term of $(14)$.

$$\log\frac{q(\mathbf{x}_1\mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)} = \log q(\mathbf{x}_1\mid \mathbf{x}_0) \;-\; \log p_\theta(\mathbf{x}_0\mid \mathbf{x}_1). \tag{16}$$

Bookkeeping 3: the $\log q(\mathbf{x}_1\mid\mathbf{x}_0)$ terms cancel. The $-\log q(\mathbf{x}_1\mid\mathbf{x}_0)$ from $(15)$ cancels the $+\log q(\mathbf{x}_1\mid\mathbf{x}_0)$ from $(16)$. This cancellation is the whole point of introducing the $\mathbf{x}_0$-conditioned posterior: it makes the endpoint terms line up.

Bookkeeping 4: distribute $\mathbb{E}_q$ linearly. Using $\mathbb{E}_q[A+B+\cdots] = \mathbb{E}_q[A] + \mathbb{E}_q[B] + \cdots$:

$$-\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_q\!\left[ \log\frac{q(\mathbf{x}_T\mid \mathbf{x}_0)}{p(\mathbf{x}_T)} \right] + \sum_{t=2}^{T} \mathbb{E}_q\!\left[ \log\frac{q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)} \right] - \mathbb{E}_q[\log p_\theta(\mathbf{x}_0\mid \mathbf{x}_1)]. \tag{17}$$

Each expectation of the form $\mathbb{E}_q[\log(q/p)]$ is, by definition, a KL divergence. Rewriting $(17)$ accordingly gives the final three-part form:

$$-\mathcal{L}_{\mathrm{ELBO}} = \underbrace{D_{\mathrm{KL}}\!\big(q(\mathbf{x}_T\mid \mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{\mathbb{E}_q\!\left[D_{\mathrm{KL}}\!\big(q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)\big)\right]}_{L_{t-1}} \;-\; \underbrace{\mathbb{E}_q[\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)]}_{L_0}. \tag{18}$$

Step 5: How the KL Identity Turns Log-Ratios into Gaussian KLs

The only identity used in this step is the definition of KL as an expectation of a log-ratio:

$$D_{\mathrm{KL}}(q \,\|\, p) \;=\; \mathbb{E}_{x \sim q}\!\left[\log \frac{q(x)}{p(x)}\right]. \tag{19}$$

We apply $(19)$ to each of the three pieces of $(17)$.

First term. The integrand $\log\frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{p(\mathbf{x}_T)}$ depends only on $\mathbf{x}_T$ (given $\mathbf{x}_0$), so the full trajectory expectation collapses to one over $\mathbf{x}_T$:

$$\mathbb{E}_q\!\left[\log\frac{q(\mathbf{x}_T\mid \mathbf{x}_0)}{p(\mathbf{x}_T)}\right] = \mathbb{E}_{q(\mathbf{x}_T\mid\mathbf{x}_0)}\!\left[\log\frac{q(\mathbf{x}_T\mid \mathbf{x}_0)}{p(\mathbf{x}_T)}\right] = D_{\mathrm{KL}}\!\big(q(\mathbf{x}_T\mid \mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big). \tag{20}$$

No outer expectation is needed; $\mathbf{x}_T$ is fully consumed inside the KL. Strictly speaking, there is an implicit $\mathbb{E}_{q(\mathbf{x}_0)}$ outside, since $\mathbf{x}_0$ is random too; we absorb it into the overall data-level average.

Middle term (one summand). The integrand depends on $\mathbf{x}_{t-1}, \mathbf{x}_t, \mathbf{x}_0$, so the trajectory expectation reduces to one over the joint $q(\mathbf{x}_{t-1}, \mathbf{x}_t \mid \mathbf{x}_0)$. Split it via the chain rule,

$$q(\mathbf{x}_{t-1}, \mathbf{x}_t \mid \mathbf{x}_0) = q(\mathbf{x}_t \mid \mathbf{x}_0)\, q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0), \tag{21}$$

which lets us write the single expectation as two nested ones:

$$\mathbb{E}_q\!\left[\log\frac{q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)}\right] = \mathbb{E}_{q(\mathbf{x}_t\mid\mathbf{x}_0)}\!\Bigg[\; \mathbb{E}_{q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)}\!\left[\log\frac{q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)}\right]\;\Bigg]. \tag{22}$$

The inner expectation in $(22)$ is exactly the KL identity $(19)$, so it equals $D_{\mathrm{KL}}\!\big(q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)\big)$, leaving

$$\mathbb{E}_{q(\mathbf{x}_t\mid\mathbf{x}_0)}\!\Big[\,D_{\mathrm{KL}}\!\big(q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)\big)\,\Big]. \tag{23}$$

The outer expectation remains because the KL value depends on the random $\mathbf{x}_t$: different $\mathbf{x}_t$ samples give different KL values, so we must average over them.

Last term. $-\mathbb{E}_q[\log p_\theta(\mathbf{x}_0\mid \mathbf{x}_1)]$ is not a log-ratio, just the log of a single distribution. The KL identity does not apply. It stays as an expectation, a simple reconstruction likelihood, and is labelled $L_0$.

What we gained. Nothing mathematical changed: we merely renamed each $\mathbb{E}_q[\log(q/p)]$ as a KL (plus possibly an outer expectation over remaining random variables). The point is that KL divergences between Gaussians have closed-form formulas, whereas raw log-ratios inside expectations do not. This renaming is what makes the next section's computation possible.

Why This Form Is Nice

  • $L_T$ has no learnable parameters (the forward process is fixed), so it is a constant.
  • $L_0$ is a simple reconstruction term.
  • Each $L_{t-1}$ is a KL between two Gaussians, which has a closed form.

Intuition. The three types of terms correspond to three distinct jobs.

  • $L_T$ (end of chain): make sure the forward process actually ends in pure noise. With enough steps and a reasonable schedule, this is already ensured, so we can forget about it.
  • $L_{t-1}$ (denoising steps): the heart of the loss. At each timestep, the network must match the ideal Bayesian denoising distribution. Summed over all $t$, this is the signal that teaches the model to denoise.
  • $L_0$ (final reconstruction): handles the last jump from $\mathbf{x}_1$ back to clean data $\mathbf{x}_0$, which needs a discrete-pixel likelihood rather than a Gaussian KL.

4. Closed-Form Posterior $q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)$

A key property of the forward process is that $\mathbf{x}_t$ given $\mathbf{x}_0$ is Gaussian in closed form. With $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$,

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{\bar\alpha_t}\,\mathbf{x}_0,\, (1-\bar\alpha_t)\mathbf{I}\right), \tag{24}$$

and the posterior is also Gaussian:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\, \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\, \tilde\beta_t \mathbf{I}\right), \tag{25}$$

with

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,\mathbf{x}_t, \tag{26}$$

$$\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t. \tag{27}$$
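One way to convince yourself of the closed-form marginal $(24)$ is numerically: iterating the one-step kernel $(5)$ many times should give the same mean and variance as $(24)$ predicts. A small Monte Carlo sketch (the linear schedule, scalar data, and chain length are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

n = 200_000                                  # number of independent chains
x0 = 2.0                                     # a fixed scalar "clean sample"
x = np.full(n, x0)
for t in range(T):                           # iterate q(x_t | x_{t-1}) from (5)
    x = np.sqrt(alphas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(n)

# Closed-form marginal (24): mean sqrt(abar_T) * x0, variance 1 - abar_T.
assert abs(x.mean() - np.sqrt(alpha_bars[-1]) * x0) < 0.01
assert abs(x.var() - (1.0 - alpha_bars[-1])) < 0.01
```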

Derivation of $\tilde{\boldsymbol{\mu}}_t$ and $\tilde\beta_t$

We derive the posterior by applying Bayes' rule and matching the resulting quadratic to a Gaussian.

Step 1: Bayes' rule with three known Gaussians.

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, q(\mathbf{x}_{t-1}\mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)}, \tag{28}$$

where (using the Markov property for the first) all three right-hand side distributions are known:

$$\begin{aligned} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) &= \mathcal{N}\!\big(\mathbf{x}_t;\, \sqrt{\alpha_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}\big), \\ q(\mathbf{x}_{t-1} \mid \mathbf{x}_0) &= \mathcal{N}\!\big(\mathbf{x}_{t-1};\, \sqrt{\bar\alpha_{t-1}}\,\mathbf{x}_0,\, (1-\bar\alpha_{t-1})\mathbf{I}\big), \\ q(\mathbf{x}_t \mid \mathbf{x}_0) &= \mathcal{N}\!\big(\mathbf{x}_t;\, \sqrt{\bar\alpha_t}\,\mathbf{x}_0,\, (1-\bar\alpha_t)\mathbf{I}\big). \end{aligned} \tag{29}$$

Step 2: Work in log-space, keep only $\mathbf{x}_{t-1}$ terms. Treat $\mathbf{x}_t$ and $\mathbf{x}_0$ as fixed. The third density in $(29)$ (the denominator of $(28)$) has no $\mathbf{x}_{t-1}$ dependence, so it contributes only a constant. Dropping constants:

$$\log q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = -\frac{(\mathbf{x}_t - \sqrt{\alpha_t}\,\mathbf{x}_{t-1})^2}{2\beta_t} - \frac{(\mathbf{x}_{t-1} - \sqrt{\bar\alpha_{t-1}}\,\mathbf{x}_0)^2}{2(1-\bar\alpha_{t-1})} + C. \tag{30}$$

Step 3: Expand the squares.

$$\begin{aligned} (\mathbf{x}_t - \sqrt{\alpha_t}\,\mathbf{x}_{t-1})^2 &= \mathbf{x}_t^2 - 2\sqrt{\alpha_t}\,\mathbf{x}_t\,\mathbf{x}_{t-1} + \alpha_t\,\mathbf{x}_{t-1}^2, \\ (\mathbf{x}_{t-1} - \sqrt{\bar\alpha_{t-1}}\,\mathbf{x}_0)^2 &= \mathbf{x}_{t-1}^2 - 2\sqrt{\bar\alpha_{t-1}}\,\mathbf{x}_0\,\mathbf{x}_{t-1} + \bar\alpha_{t-1}\,\mathbf{x}_0^2. \end{aligned} \tag{31}$$

Dropping terms without $\mathbf{x}_{t-1}$ (they become part of $C$) and substituting into $(30)$:

$$\log q(\mathbf{x}_{t-1}\mid\mathbf{x}_t, \mathbf{x}_0) = -\frac{1}{2}\Bigg[\underbrace{\left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar\alpha_{t-1}}\right)}_{A}\,\mathbf{x}_{t-1}^2 \;-\; 2\underbrace{\left(\frac{\sqrt{\alpha_t}\,\mathbf{x}_t}{\beta_t} + \frac{\sqrt{\bar\alpha_{t-1}}\,\mathbf{x}_0}{1-\bar\alpha_{t-1}}\right)}_{B}\,\mathbf{x}_{t-1}\Bigg] + C. \tag{32}$$

Step 4: Match to Gaussian shape. A Gaussian $\mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t, \tilde\beta_t)$ has log-density $-\frac{1}{2}\!\left[\frac{1}{\tilde\beta_t}\mathbf{x}_{t-1}^2 - 2\frac{\tilde{\boldsymbol{\mu}}_t}{\tilde\beta_t}\mathbf{x}_{t-1}\right] + C$. Matching coefficients in $(32)$:

$$\tilde\beta_t = \frac{1}{A}, \qquad \tilde{\boldsymbol{\mu}}_t = \frac{B}{A} = B \cdot \tilde\beta_t. \tag{33}$$

Step 5: Simplify $\tilde\beta_t = 1/A$. Combine the fractions in $A$ over the common denominator $\beta_t(1-\bar\alpha_{t-1})$:

$$A = \frac{\alpha_t(1-\bar\alpha_{t-1}) + \beta_t}{\beta_t(1-\bar\alpha_{t-1})}. \tag{34}$$

Simplify the numerator using $\alpha_t + \beta_t = 1$ and $\alpha_t \bar\alpha_{t-1} = \bar\alpha_t$:

$$\alpha_t(1-\bar\alpha_{t-1}) + \beta_t = \alpha_t - \alpha_t\bar\alpha_{t-1} + \beta_t = (\alpha_t + \beta_t) - \bar\alpha_t = 1 - \bar\alpha_t. \tag{35}$$

Hence

$$A = \frac{1-\bar\alpha_t}{\beta_t(1-\bar\alpha_{t-1})} \quad\Longrightarrow\quad \tilde\beta_t = \frac{\beta_t(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t, \tag{36}$$

which matches $(27)$.

Step 6: Simplify $\tilde{\boldsymbol{\mu}}_t = B \cdot \tilde\beta_t$. Multiply each term of $B$ by $\tilde\beta_t = \frac{\beta_t(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}$ and cancel:

  • $\mathbf{x}_0$ coefficient: $\dfrac{\sqrt{\bar\alpha_{t-1}}\,\mathbf{x}_0}{1-\bar\alpha_{t-1}} \cdot \dfrac{\beta_t(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}$. The $(1-\bar\alpha_{t-1})$ cancels, leaving $\dfrac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}$.
  • $\mathbf{x}_t$ coefficient: $\dfrac{\sqrt{\alpha_t}\,\mathbf{x}_t}{\beta_t} \cdot \dfrac{\beta_t(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}$. The $\beta_t$ cancels, leaving $\dfrac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}$.

Summing:

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,\mathbf{x}_t, \tag{37}$$

which matches $(26)$.
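Because this kind of algebra is easy to fumble, it is worth checking $(26)$ and $(27)$ numerically: in one dimension, $B/A$ and $1/A$ from the completed square $(33)$ must equal the closed forms. A NumPy sketch with arbitrary (assumed) schedule values:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 50)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

t = 10                                        # 1-indexed timestep, arbitrary
x0, xt = 0.7, -0.3                            # arbitrary scalar states
a_t, b_t = alphas[t - 1], betas[t - 1]
abar_t, abar_tm1 = abar[t - 1], abar[t - 2]   # abar_t and abar_{t-1}

# Completing the square, eqs (32)-(33): quadratic coefficient A, linear coefficient B.
A = a_t / b_t + 1.0 / (1.0 - abar_tm1)
B = np.sqrt(a_t) * xt / b_t + np.sqrt(abar_tm1) * x0 / (1.0 - abar_tm1)

# Closed forms (26) and (27).
mu = (np.sqrt(abar_tm1) * b_t / (1 - abar_t)) * x0 \
   + (np.sqrt(a_t) * (1 - abar_tm1) / (1 - abar_t)) * xt
var = (1 - abar_tm1) / (1 - abar_t) * b_t

assert np.isclose(B / A, mu) and np.isclose(1.0 / A, var)
```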

Intuition. $\tilde{\boldsymbol{\mu}}_t$ is a weighted average of $\mathbf{x}_0$ (the clean signal) and $\mathbf{x}_t$ (the current noisy state). At small $t$ (little noise), almost all the weight falls on $\mathbf{x}_0$: at $t=1$ the coefficients in $(26)$ are exactly $1$ and $0$, because $\mathbf{x}_{t-1}$ is the clean image itself. At large $t$ (lots of noise), the weight shifts toward $\mathbf{x}_t$: $\mathbf{x}_{t-1}$ is almost as noisy as $\mathbf{x}_t$, so the best guess for the previous state barely moves from where we are. Geometrically, $\tilde{\boldsymbol{\mu}}_t$ sits on the line segment between $\mathbf{x}_t$ and $\mathbf{x}_0$, sliding from one endpoint to the other as the chain runs. This is Bayesian denoising: "I know you are trying to get back to $\mathbf{x}_0$, I know you are currently at $\mathbf{x}_t$, so the best next step is this weighted mixture."

5. The Simplification to a Noise-Prediction Loss

Step 1: KL Between Two Gaussians with the Same Covariance

We parameterize the reverse process as

$$p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \sigma_t^2 \mathbf{I}\right), \tag{38}$$

with $\sigma_t^2$ fixed (commonly $\sigma_t^2 = \tilde\beta_t$ or $\beta_t$). Both $q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)$ in $(25)$ and $p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)$ in $(38)$ are Gaussians with the same (isotropic) covariance, so the KL has a clean closed form:

$$D_{\mathrm{KL}}\!\left(\mathcal{N}(\boldsymbol{\mu}_1, \sigma^2\mathbf{I}) \,\|\, \mathcal{N}(\boldsymbol{\mu}_2, \sigma^2\mathbf{I})\right) = \frac{1}{2\sigma^2} \|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|^2. \tag{39}$$

Applying $(39)$ to $L_{t-1}$:

$$L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\,\big\| \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) \big\|^2\right] + C, \tag{40}$$

where $C$ absorbs any $\theta$-independent constants.
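The same-covariance KL formula $(39)$ is easy to verify by Monte Carlo in one dimension: averaging the log-ratio under $q$ should reproduce $(\mu_1-\mu_2)^2/2\sigma^2$. A quick sketch (all the specific numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, sigma = 0.5, -0.2, 0.8

# Monte Carlo estimate of E_q[log q(x) - log p(x)], q = N(mu1, s^2), p = N(mu2, s^2).
# The normalizing constants cancel since both variances are equal.
x = rng.normal(mu1, sigma, size=1_000_000)
log_ratio = (-(x - mu1) ** 2 + (x - mu2) ** 2) / (2 * sigma ** 2)
kl_mc = log_ratio.mean()

kl_closed = (mu1 - mu2) ** 2 / (2 * sigma ** 2)   # eq. (39) in one dimension
assert abs(kl_mc - kl_closed) < 1e-2
```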

Step 2: Reparameterize $\mathbf{x}_t$ in Terms of Noise $\boldsymbol{\epsilon}$

From the closed-form marginal $(24)$, we can sample $\mathbf{x}_t$ via

$$\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \tag{41}$$

which means, equivalently,

$$\mathbf{x}_0 = \frac{1}{\sqrt{\bar\alpha_t}}\!\left(\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}\right). \tag{42}$$

Step 3: Substitute Into $\tilde{\boldsymbol{\mu}}_t$

Take the posterior mean from $(26)$,

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,\mathbf{x}_t,$$

substitute $(42)$ for $\mathbf{x}_0$, and simplify using $\bar\alpha_t = \alpha_t \bar\alpha_{t-1}$:

$$\begin{aligned} \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) &= \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\cdot \frac{\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}}{\sqrt{\bar\alpha_t}} + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,\mathbf{x}_t \\ &= \frac{\beta_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\,\mathbf{x}_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}\,\boldsymbol{\epsilon} + \frac{\alpha_t(1-\bar\alpha_{t-1})}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\,\mathbf{x}_t. \end{aligned} \tag{43}$$

The two $\mathbf{x}_t$ coefficients combine: $\beta_t + \alpha_t(1-\bar\alpha_{t-1}) = 1 - \bar\alpha_t$ (using $\alpha_t = 1-\beta_t$), so

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol{\epsilon}\right). \tag{44}$$
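The substitution is mechanical but error-prone, so here is a numeric check that $(44)$ agrees with the original posterior mean $(26)$ once $\mathbf{x}_t$ is generated from $(41)$. Schedule values and dimensions are arbitrary choices for the demo:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 50)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

t = 20
a_t, b_t = alphas[t - 1], betas[t - 1]
abar_t, abar_tm1 = abar[t - 1], abar[t - 2]

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
xt = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps       # eq. (41)

# Posterior mean via (26), in terms of (x0, xt)...
mu_26 = (np.sqrt(abar_tm1) * b_t / (1 - abar_t)) * x0 \
      + (np.sqrt(a_t) * (1 - abar_tm1) / (1 - abar_t)) * xt
# ...and via (44), in terms of (xt, eps).
mu_44 = (xt - b_t / np.sqrt(1 - abar_t) * eps) / np.sqrt(a_t)

assert np.allclose(mu_26, mu_44)
```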

Step 4: Parameterize the Model by Predicting Noise

Expression $(44)$ suggests mirroring the same form for $\boldsymbol{\mu}_\theta$, but with a learned noise estimate $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ in place of the true $\boldsymbol{\epsilon}$:

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right). \tag{45}$$

Intuition. Why train the network to predict the noise instead of the denoised image? Because of the reparameterization $(41)$, knowing $\mathbf{x}_t$ and $\boldsymbol{\epsilon}$ is equivalent to knowing $\mathbf{x}_t$ and $\mathbf{x}_0$: they carry the same information. So the network could equally well predict (a) the clean image $\mathbf{x}_0$ directly, (b) the noise $\boldsymbol{\epsilon}$ that was added, or (c) the posterior mean $\tilde{\boldsymbol{\mu}}_t$ itself. All three are mathematically equivalent. In practice, predicting $\boldsymbol{\epsilon}$ tends to be easier for the network: the noise is unit-variance Gaussian (standardized), while $\mathbf{x}_0$ has arbitrary scale and structure. A network outputting values of roughly fixed scale is easier to train, and gradients behave better. This is a practical choice, not a mathematical necessity, but it is the choice that made DDPM work.

Now the difference $\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta$ simplifies dramatically; subtracting $(45)$ from $(44)$, the $\mathbf{x}_t$ terms cancel:

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}\,\big(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \boldsymbol{\epsilon}\big). \tag{46}$$

Plugging $(46)$ back into $(40)$:

$$L_{t-1} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\!\left[ \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}\, \big\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \big\|^2 \right]. \tag{47}$$

6. The Final, Tractable Objective

Ho et al. observed that simply dropping the time-dependent weighting in front of $(47)$ improves sample quality. The final training objective is:

$$\boxed{\; \mathcal{L}_{\mathrm{simple}}(\theta) = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\!\left[ \big\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\big(\sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon},\; t\big) \big\|^2 \right]\;} \tag{48}$$

where $t \sim \mathrm{Uniform}\{1,\ldots,T\}$, $\mathbf{x}_0 \sim q(\mathbf{x}_0)$, and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$.
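As a preview of the next post, here is what $(48)$ looks like as code. This is a NumPy sketch with a dummy stand-in for the network (`eps_model` is a placeholder, not a real model); the real version swaps in a trained U-Net and autodiff:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    """Placeholder for the trained network eps_theta(x_t, t)."""
    return np.zeros_like(x_t)

def simple_loss(x0):
    """One Monte Carlo sample of L_simple, eq. (48)."""
    t = rng.integers(1, T + 1)                        # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)               # eps ~ N(0, I)
    x_t = np.sqrt(abar[t - 1]) * x0 + np.sqrt(1 - abar[t - 1]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)    # ||eps - eps_theta||^2

loss = simple_loss(rng.standard_normal(16))
```

Note that the noising in the third line of `simple_loss` is the reparameterization $(41)$: no loop over timesteps is ever needed to produce $\mathbf{x}_t$.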

What We Actually Do

Algorithm 1DDPM training (Ho et al. 2020)

That is the whole training loop. No loop over timesteps, no Monte Carlo over trajectories, no adversarial game. Just a supervised regression problem on $(\mathbf{x}_t, t) \mapsto \boldsymbol{\epsilon}$.

Intuition. What started as "maximize the likelihood of a high-dimensional integral that can never be computed" has become "sample a random timestep, add some noise, and ask the network to guess the noise back". The miracle is that every step of this simplification is either exact, or a tiny empirically-justified relaxation. The final objective is basically a denoising autoencoder loss applied at random noise levels, simple enough that a few lines of PyTorch can implement it.

7. Inference: How We Actually Sample

Training gives us $\boldsymbol{\epsilon}_\theta$. But we can never evaluate the marginal $p_\theta(\mathbf{x}_0)$ directly; it is exactly the intractable integral (2) we started with. Instead, inference samples from it implicitly by walking the reverse Markov chain one step at a time, which is exactly what chain (3) was built for.

Starting from pure noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, we iterate for $t = T, T-1, \ldots, 1$:

$$\mathbf{x}_{t-1} \sim p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \sigma_t^2 \mathbf{I}\right). \tag{49}$$

Because this is a Gaussian, we sample via the standard reparameterization,

$$\mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \tag{50}$$

Step 1: Plug in the Noise-Prediction Parameterization

We already picked $\boldsymbol{\mu}_\theta$ in training, equation (45). Substituting it into (50) gives the closed-form reverse step:

$$\boxed{\; \mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}.\;} \tag{51}$$

Read this in three pieces.

  • $\dfrac{1}{\sqrt{\alpha_t}}\,\mathbf{x}_t$ un-scales the current state. Recall the forward step shrinks $\mathbf{x}$ by $\sqrt{\alpha_t}$; we dilate it back.
  • $-\dfrac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ subtracts the estimated noise, scaled by how much noise is in $\mathbf{x}_t$ relative to how much one step adds.
  • $\sigma_t \mathbf{z}$ re-injects stochasticity, the same way the forward chain did. Without it, the chain would collapse onto a single deterministic trajectory.
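A single reverse step (51) is a one-liner in numpy. The sanity check at the end uses the identity $1 - \bar\alpha_1 = \beta_1$: at $t = 1$, plugging the true noise into (51) with $\mathbf{z} = \mathbf{0}$ makes the noise terms cancel and recovers $\mathbf{x}_0$ exactly (the linear schedule is again just illustrative):

```python
import numpy as np

# Illustrative linear schedule (beta_1 = 1e-4 to beta_T = 0.02).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def reverse_step(xt, t, eps_pred, z):
    """One application of (51); t is 1-based, z ~ N(0, I), or 0 at t = 1."""
    a, ab, b = alphas[t - 1], alpha_bars[t - 1], betas[t - 1]
    mean = (xt - b / np.sqrt(1.0 - ab) * eps_pred) / np.sqrt(a)
    return mean + np.sqrt(b) * z      # sigma_t^2 = beta_t choice from (53)

# Sanity check: at t = 1, 1 - abar_1 = beta_1, so with the *true* noise
# and z = 0 the step recovers x0 exactly, which is precisely (52).
rng = np.random.default_rng(0)
x0 = rng.normal(size=4)
eps = rng.normal(size=4)
x1 = np.sqrt(alpha_bars[0]) * x0 + np.sqrt(1.0 - alpha_bars[0]) * eps
assert np.allclose(reverse_step(x1, 1, eps, np.zeros(4)), x0)
```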

Step 2: The Last Step Is Special

At $t = 1$ we want to land on clean data $\mathbf{x}_0$, not draw from a Gaussian around it. DDPM handles this by setting $\mathbf{z} = \mathbf{0}$ at the final step:

$$\mathbf{x}_0 = \frac{1}{\sqrt{\alpha_1}}\!\left(\mathbf{x}_1 - \frac{\beta_1}{\sqrt{1-\bar\alpha_1}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_1, 1)\right). \tag{52}$$

Equivalently, we take the mean and drop the noise term. Any remaining stochasticity would just blur the output.

Step 3: Choosing $\sigma_t$

The reverse variance $\sigma_t^2$ was not learned; we fixed it in (38). Two standard choices, both introduced by Ho et al. and both producing nearly identical sample quality:

$$\sigma_t^2 = \beta_t \qquad \text{or} \qquad \sigma_t^2 = \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t. \tag{53}$$
  • $\beta_t$ is the variance of the forward step. It is the optimal reverse variance when $\mathbf{x}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and an upper bound on the true posterior variance in general.
  • $\tilde\beta_t$ from (27) is the exact posterior variance when $\mathbf{x}_0$ is known, i.e. deterministic. It is the lower bound on the optimal $\sigma_t^2$.

The optimal variance sits between these two extremes, but in practice the choice barely moves sample quality. Improved-DDPM (Nichol and Dhariwal, 2021) learns $\sigma_t$ as an interpolation between the two, which gives a small quality bump.
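A quick numerical comparison of the two choices, again assuming the standard linear schedule (with the convention $\bar\alpha_0 = 1$, so $\tilde\beta_1 = 0$):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # illustrative linear schedule
alpha_bars = np.cumprod(1.0 - betas)

# beta_tilde_t = (1 - abar_{t-1}) / (1 - abar_t) * beta_t, with abar_0 = 1.
abar_prev = np.concatenate(([1.0], alpha_bars[:-1]))
beta_tilde = (1.0 - abar_prev) / (1.0 - alpha_bars) * betas

# beta_tilde_t <= beta_t everywhere; the two converge as t grows, because
# once abar_t is tiny the ratio (1 - abar_{t-1}) / (1 - abar_t) approaches 1.
print(beta_tilde[1] / betas[1], beta_tilde[-1] / betas[-1])
```

The gap only matters at small $t$, which is consistent with the observation that either choice samples about equally well.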

Intuition. Sampling is just the forward chain run backwards, but with a learned best-guess at each step. At every $t$, the network says "here is what I think the noise component of $\mathbf{x}_t$ is", and we subtract a scaled version of that from $\mathbf{x}_t$ to land at $\mathbf{x}_{t-1}$. Then we add fresh Gaussian noise (except on the last step) so that the trajectory stays random, matching the statistical variety of the forward chain we trained against.

The Full Sampling Algorithm

Putting the pieces together:

Algorithm 2 — DDPM sampling (Ho et al. 2020). Sample $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$; for $t = T, \ldots, 1$ apply the reverse step (51), with $\mathbf{z} = \mathbf{0}$ at $t = 1$; return $\mathbf{x}_0$.

It calls the network exactly $T$ times (typically $T = 1000$), which is why vanilla DDPM sampling is so slow.
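Algorithm 2, sketched in numpy. The `eps_model` stand-in returns zeros so the loop runs end-to-end without a trained checkpoint; in real use it would be the trained network, and the schedule would match training:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # illustrative linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(xt, t):
    # Placeholder for the trained eps_theta; zeros keep the sketch runnable.
    return np.zeros_like(xt)

def sample(shape):
    """Walk the reverse chain (51) from pure noise down to t = 1."""
    x = rng.normal(size=shape)               # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        a, ab, b = alphas[t - 1], alpha_bars[t - 1], betas[t - 1]
        mean = (x - b / np.sqrt(1.0 - ab) * eps_model(x, t)) / np.sqrt(a)
        z = rng.normal(size=shape) if t > 1 else 0.0   # z = 0 on the last step
        x = mean + np.sqrt(b) * z            # sigma_t^2 = beta_t choice
    return x

x0 = sample((8, 8))
```

Note the single Python loop of $T$ network calls: this is the cost that DDIM and DPM-Solver attack.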

Why Sampling Is Slow, and How to Fix It

The chain is Markov and the transitions are entangled: $\mathbf{x}_{t-1}$ genuinely depends on $\mathbf{x}_t$, so there is no algebraic shortcut that collapses $T$ steps into one, the way (41) collapses training.

Three families of work attack this bottleneck, all reusing the same trained $\boldsymbol{\epsilon}_\theta$ (the loss in (48) is already everything you need):

  • DDIM (Song, Meng, Ermon, 2021) derives a deterministic reverse process whose marginals at each $t$ still match the forward chain. Samples can be produced in 25 to 50 steps instead of 1000 with only a small quality loss.
  • DPM-Solver (Lu et al., 2022) treats sampling as numerically solving the probability-flow ODE and uses dedicated higher-order integrators to reach good samples in roughly 10 to 20 steps.
  • Classifier-free guidance (Ho and Salimans, 2021) does not reduce the step count; instead it extrapolates between conditional and unconditional predictions of $\boldsymbol{\epsilon}_\theta$ to sharpen samples. This is the trick that made text-to-image diffusion actually work.

All three sit on top of the $\boldsymbol{\epsilon}_\theta$ you trained with (48); none of them require re-training.

8. Summary of the Simplification Path

| Step | Form |
|------|------|
| Intractable (1), (2) | $\max_\theta \log p_\theta(\mathbf{x}_0) = \log\!\int p_\theta(\mathbf{x}_{0:T})\,d\mathbf{x}_{1:T}$ |
| $\downarrow$ Jensen's inequality with $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$ | |
| Lower bound (6) | $\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_q\!\left[\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\right]$ |
| $\downarrow$ Bayes' rule on $q$, telescoping | |
| Sum of KLs (18) | $L_T + \sum_{t\geq 2} L_{t-1} + L_0$, all Gaussian–Gaussian |
| $\downarrow$ reparameterize $\mathbf{x}_t$ in terms of $\boldsymbol{\epsilon}$ | |
| Weighted MSE (47) | $\sum_t w_t \, \mathbb{E}\!\left[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\|^2\right]$ |
| $\downarrow$ drop weights $w_t \to 1$ | |
| Tractable objective (48) | $\mathcal{L}_{\mathrm{simple}} = \mathbb{E}\!\left[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\|^2\right]$ |

The key insight is that each transformation either preserves the bound exactly (the ELBO, Bayes flip, telescope, and KL rewrite) or trades a tighter bound for a cleaner objective that empirically trains better (the weight drop). The end result is an MSE loss between true noise and predicted noise, something we can compute from a single forward pass of a neural network.

What's Next

We now have the objective function and the sampler. Two things are missing:

  1. The $T \approx 1000$ steps of the DDPM sampler are painfully slow. Can we reuse the same trained $\boldsymbol{\epsilon}_\theta$ with a smarter sampler and get samples in $\sim 25$ steps instead?
  2. We still need actual code: a schedule, a UNet, a training loop.

Part 2 answers the first question. It derives DDIM (Denoising Diffusion Implicit Models), which generalizes the reverse process into a family parameterized by $\eta$. Setting $\eta = 0$ gives a deterministic sampler that matches DDPM quality in 25 to 50 steps. No retraining: the loss (48) already contains everything DDIM needs.

Part 3 answers the second. We build DDPM end-to-end in PyTorch: noise schedule, small UNet with time embeddings, training loop from (48), and the iterative sampler (51). Every equation in this post maps to a few lines of code.


Written by Suchinthaka Wanninayaka

AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.
