This is the first post in a short series on diffusion models. The goal is narrow but important: walk through every line of math between "maximize the log-likelihood of the data" and the two-line PyTorch loss that actually trains a diffusion model. The second post will take this objective and turn it into runnable code.
Diffusion models now dominate image, video, audio, and molecular generation. Every one of them, from DDPM to Stable Diffusion to the latest video models, is trained with essentially the same objective: sample a clean datapoint, add a known amount of Gaussian noise, and ask a neural network to predict the noise back. A simple MSE loss. The reason this simple loss works, and the reason it corresponds to maximum likelihood on a model of incredible expressive power, is the subject of this post.
| Symbol | Meaning |
| --- | --- |
| $x_1, \dots, x_T$ | Progressively noised latents ($T$ is typically 1000) |
| $q(x_t \mid x_{t-1})$ | Fixed forward (noising) kernel |
| $p_\theta(x_{t-1} \mid x_t)$ | Learned reverse (denoising) kernel |
| $\beta_t$ | Noise variance added at step $t$ |
| $\alpha_t = 1 - \beta_t$, $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ | Cumulative signal-retention factor |
| $\epsilon$ | Standard Gaussian noise $\sim \mathcal{N}(0, I)$ |
| $\epsilon_\theta(x_t, t)$ | The neural network (predicts the noise) |
Equations below are numbered in order of appearance. Any time you see (n) in prose, it refers back to the display equation tagged (n).
1. The Goal: Maximum Likelihood
A diffusion model is a pair of Markov chains. The forward process q takes a clean image x0 and gradually destroys it by adding Gaussian noise over T steps, until xT is pure noise. The reverse process pθ is a neural network that learns to undo this, starting from random noise and progressively denoising to produce a sample. Training means: given the fixed forward process, learn the reverse process to match it.
The forward chain (blue, top) is a fixed Gaussian corruption schedule. The reverse chain (red, bottom) is a neural network that tries to invert it step by step.
We have data x0∼q(x0) and a model pθ(x0), and we want to maximize the log-likelihood

$$\log p_\theta(x_0). \tag{1}$$
In a diffusion model, pθ(x0) is defined as the marginal of a joint distribution over a sequence of latent variables x1,…,xT:

$$p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}, \tag{2}$$

where the joint factorizes as a Markov chain run in reverse:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t). \tag{3}$$
The marginal in (2) requires integrating over all possible trajectories x1:T, which lives in an extremely high-dimensional space. There is no closed form, and Monte Carlo estimates of the integrand have astronomical variance. So we cannot directly optimize logpθ(x0).
Intuition. To compute pθ(x0) directly, we would need to sum the probabilities of every possible sequence of intermediate noisy images x1,…,xT that could have led to x0. For a 256×256 image, each xt lives in R196,608, and there are T=1000 such steps. The number of plausible trajectories is effectively infinite, and random sampling misses almost all the probability mass. This is why we need a smarter trick, the ELBO.
2. The ELBO Fix
We introduce the forward process q(x1:T∣x0), a fixed (non-learned) Gaussian Markov chain that progressively adds noise:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right). \tag{4}$$
Multiplying and dividing by q inside the integral and applying Jensen's inequality gives a tractable lower bound:

$$\log p_\theta(x_0) \;\ge\; \mathbb{E}_q\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] =: \mathcal{L}_{\text{ELBO}}. \tag{6}$$

So instead of maximizing the intractable log pθ(x0), we maximize its lower bound $\mathcal{L}_{\text{ELBO}}$.
Intuition. Why is a lower bound good enough? If we push the bound up, the true log-likelihood (which sits above it) gets pushed up too. It is like lifting a ceiling by pushing the floor: as long as the gap does not matter much, the floor is easier to work with. The "floor" LELBO turns out to be a clean sum of KL terms we can compute and optimize, while the "ceiling" logpθ(x0) would require the impossible trajectory integral. The gap between them is always non-negative, so maximizing the ELBO pushes up the true log-likelihood too.
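The floor-and-ceiling picture can be made concrete with a toy latent-variable model where both sides have closed forms. The following numpy sketch uses a model of my own choosing (z∼N(0,1), x∣z∼N(z,1), so the marginal is exactly N(0,2)) and a Gaussian q(z)=N(m,s²); the ELBO never exceeds the true log-likelihood and touches it only at the true posterior:

```python
import numpy as np

def log_px(x):
    # exact marginal: z ~ N(0,1), x|z ~ N(z,1)  =>  x ~ N(0,2)
    return -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0

def elbo(x, m, s2):
    # E_q[log p(x|z) + log p(z) - log q(z)] for q(z) = N(m, s2), in closed form
    e_log_lik   = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m)**2 + s2)
    e_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s2)
    entropy_q   =  0.5 * np.log(2 * np.pi * np.e * s2)
    return e_log_lik + e_log_prior + entropy_q

x = 1.3
# any q gives a floor below the ceiling log p(x)
for m, s2 in [(0.0, 1.0), (0.5, 0.2), (x / 2, 0.5)]:
    assert elbo(x, m, s2) <= log_px(x) + 1e-12

# the bound is tight exactly at the true posterior q(z) = N(x/2, 1/2)
assert abs(elbo(x, x / 2, 0.5) - log_px(x)) < 1e-12
```

In the diffusion setting the gap plays the same role, except q is the fixed forward chain rather than something we optimize.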
3. Rewriting the ELBO as a Sum of KL Terms
Step 1: Expand Using the Markov Factorizations
Plug the factorizations from (3) and (4) into the (negated) ELBO:

$$-\mathcal{L}_{\text{ELBO}} = \mathbb{E}_q\!\left[-\log p(x_T) + \sum_{t=1}^{T} \log \frac{q(x_t \mid x_{t-1})}{p_\theta(x_{t-1} \mid x_t)}\right]. \tag{7}$$
Step 2: Flip the Forward Transition with Bayes' Rule
We cannot directly compare q(xt∣xt−1) (a forward step) with pθ(xt−1∣xt) (a reverse step). They go in opposite directions. The trick of Ho et al. is to flip the forward transition into a reverse one using Bayes' rule, conditioned on x0.
Intuition. We want the model (reverse) to match the forward process, but they describe things in opposite directions. Comparing them is like comparing "what is the probability of adding noise from step t−1 to t?" with "what is the probability of removing noise from step t to t−1?". Both reference the same pair of states, but the arrow of time is flipped. Bayes' rule is the tool for flipping conditional probabilities: given the forward q(xt∣xt−1), it gives us the "ground-truth reverse" q(xt−1∣xt,x0). Then we can directly compare our learned reverse pθ against that ground-truth reverse, apples to apples. The extra conditioning on x0 is what makes the posterior have a closed form; without it, the posterior would itself be intractable.
Why we can insert x0 into the conditioning. The defining property of a Markov chain is that the next state depends only on the current state, not on the full history. Formally, for any s<t−1:
$$q(x_t \mid x_{t-1}, x_{t-2}, \dots, x_s) = q(x_t \mid x_{t-1}). \tag{8}$$
Once you know xt−1, knowing any earlier state (including x0) gives no additional information about xt. The state xt−1 already "screens off" the past. Setting s=0 in (8) gives
$$q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1}). \tag{9}$$
The two forms are literally equal. Why write the longer form at all? Because Bayes' rule needs x0 to appear in the conditioning set on both sides. The Markov equality (9) lets us smuggle x0 in for free, which then unlocks the Bayes manipulation.
Applying Bayes' rule to the x0-conditioned transition gives the flipped form

$$q(x_t \mid x_{t-1}, x_0) = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}. \tag{10}$$

We only do this for t≥2; the t=1 term is handled separately. Substituting (10) into the sum:
$$\sum_{t=2}^{T} \log \frac{q(x_t \mid x_{t-1})}{p_\theta(x_{t-1} \mid x_t)} = \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{p_\theta(x_{t-1} \mid x_t)\, q(x_{t-1} \mid x_0)} = \underbrace{\sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)}}_{\text{will become KL terms}} + \underbrace{\sum_{t=2}^{T} \log \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}}_{\text{telescopes}}. \tag{11}$$
Step 3: The Telescoping Sum
The second sum in (11) telescopes because each denominator cancels the previous numerator:

$$\sum_{t=2}^{T} \log \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} = \log \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)}.$$
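The telescoping cancellation is easy to sanity-check numerically. A small numpy check, with an arbitrary positive sequence standing in for the values q(xt∣x0):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.1, 1.0, size=21)        # a[1..20] stand in for q(x_t | x_0)

# sum of log-ratios of consecutive terms ...
lhs = sum(np.log(a[t] / a[t - 1]) for t in range(2, 21))
# ... collapses to the ratio of the endpoints
rhs = np.log(a[20] / a[1])
assert np.isclose(lhs, rhs)
```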
Bookkeeping 3: the logq(x1∣x0) terms cancel. The −logq(x1∣x0) from (15) cancels the +logq(x1∣x0) from (16). This cancellation is the whole point of introducing the x0-conditioned posterior: it makes the endpoint terms line up.
Bookkeeping 4: distribute Eq linearly. Using Eq[A+B+⋯]=Eq[A]+Eq[B]+⋯:
First term. No outer expectation is needed; xT is fully consumed inside the KL $D_{\mathrm{KL}}\!\left(q(x_T \mid x_0)\,\|\,p(x_T)\right) = L_T$. Strictly speaking, there is an implicit Eq(x0) outside since x0 is random too; we absorb it into the overall data-level average.
Middle term (one summand). The integrand depends on xt−1,xt,x0, so the trajectory expectation reduces to one over the joint q(xt−1,xt∣x0). Split it via the chain rule,
$$q(x_{t-1}, x_t \mid x_0) = q(x_t \mid x_0)\, q(x_{t-1} \mid x_t, x_0), \tag{21}$$
which lets us write the single expectation as two nested ones:

$$\mathbb{E}_{q(x_{t-1}, x_t \mid x_0)}\!\left[\log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)}\right] = \mathbb{E}_{q(x_t \mid x_0)}\!\left[D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right)\right].$$
The outer Eq remains because the KL value depends on the random xt: different xt samples give different KL values, so we must average over them.
Last term.−Eq[logpθ(x0∣x1)] is not a log-ratio, just the log of a single distribution. The KL identity does not apply. It stays as an expectation, a simple reconstruction likelihood, and is labelled L0.
What we gained. Nothing mathematical changed: we merely renamed each Eq[log(q/p)] as a KL (plus possibly an outer expectation over remaining random variables). The point is that KL divergences between Gaussians have closed-form formulas, whereas raw log-ratios inside expectations do not. This renaming is what makes the next section's computation possible.
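To see that the closed form really does replace the expectation, here is a numpy check that the one-dimensional Gaussian KL formula matches a brute-force evaluation of $\mathbb{E}_p[\log p - \log q]$ on a fine grid (the specific means and standard deviations are arbitrary):

```python
import numpy as np

def kl_closed(m1, s1, m2, s2):
    # closed-form KL( N(m1, s1^2) || N(m2, s2^2) )
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def kl_grid(m1, s1, m2, s2):
    # brute-force E_p[log p - log q] via Riemann sum on a fine grid
    x = np.linspace(m1 - 12 * s1, m1 + 12 * s1, 400_001)
    dx = x[1] - x[0]
    logp = -0.5 * np.log(2 * np.pi * s1**2) - (x - m1)**2 / (2 * s1**2)
    logq = -0.5 * np.log(2 * np.pi * s2**2) - (x - m2)**2 / (2 * s2**2)
    return np.sum(np.exp(logp) * (logp - logq)) * dx

assert abs(kl_closed(0.3, 1.2, -0.5, 0.8) - kl_grid(0.3, 1.2, -0.5, 0.8)) < 1e-6
```

In one dimension the grid works; in 196,608 dimensions only the closed form does, which is the whole point of the renaming.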
Why This Form Is Nice
LT has no learnable parameters (the forward process is fixed), so it is a constant.
L0 is a simple reconstruction term.
Each Lt−1 is a KL between two Gaussians, which has a closed form.
Intuition. The three types of terms correspond to three distinct jobs.
LT (end of chain): make sure the forward process actually ends in pure noise. With enough steps and a reasonable schedule, this is already ensured, so we can forget about it.
Lt−1 (denoising steps): the heart of the loss. At each timestep, the network must match the ideal Bayesian denoising distribution. Summed over all t, this is the signal that teaches the model to denoise.
L0 (final reconstruction): handles the last jump from x1 back to clean data x0, which needs discrete-pixel likelihood rather than Gaussian KL.
4. Closed-Form Posterior q(xt−1∣xt,x0)
A key property of the forward process is that xt given x0 is Gaussian in closed form. With $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$,

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1 - \bar\alpha_t)\, I\right), \qquad \text{equivalently} \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon.$$
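This closed form can be verified empirically: iterate the one-step forward kernel many times and compare the resulting mean and variance against the direct formula. A numpy sketch with an assumed linear βt schedule (the schedule values here are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.2, T)            # assumed schedule for the demo
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = 1.7                                     # a scalar "image"
n = 200_000                                  # number of independent chains

# iterate the forward kernel: x_t = sqrt(1-beta_t) x_{t-1} + sqrt(beta_t) eps
x = np.full(n, x0)
for t in range(T):
    x = np.sqrt(1 - betas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(n)

# closed form: q(x_T | x_0) = N( sqrt(alpha_bar_T) x0, 1 - alpha_bar_T )
assert abs(x.mean() - np.sqrt(alpha_bar[-1]) * x0) < 0.02
assert abs(x.var() - (1 - alpha_bar[-1])) < 0.02
```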
Step 2: Work in log-space, keep only xt−1 terms. Treat xt and x0 as fixed. The third density in (29) (the denominator of (28)) has no xt−1, so it contributes only a constant. Dropping constants:
Step 4: Match to Gaussian shape. A Gaussian $\mathcal{N}(x_{t-1}; \tilde\mu_t, \tilde\beta_t)$ has log-density $-\frac{1}{2}\!\left[\frac{1}{\tilde\beta_t} x_{t-1}^2 - \frac{2\tilde\mu_t}{\tilde\beta_t} x_{t-1}\right] + C$. Matching coefficients in (32):
$$\tilde\beta_t = \frac{1}{A}, \qquad \tilde\mu_t = \frac{B}{A} = B \cdot \tilde\beta_t. \tag{33}$$
Step 5: Simplify $\tilde\beta_t = 1/A$. Combine the fractions in $A$ over the common denominator $\beta_t(1 - \bar\alpha_{t-1})$:
$$A = \frac{\alpha_t(1 - \bar\alpha_{t-1}) + \beta_t}{\beta_t(1 - \bar\alpha_{t-1})}. \tag{34}$$
Simplify the numerator using $\alpha_t + \beta_t = 1$ and $\alpha_t \bar\alpha_{t-1} = \bar\alpha_t$: it collapses to $1 - \bar\alpha_t$, giving

$$\tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t, \qquad \tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\,x_t.$$
Intuition. μ̃t is a weighted average of x0 (the clean signal) and xt (the current noisy state). At small t (little noise), almost all the weight falls on x0: at t=1 the x0 coefficient is exactly 1. At large t (lots of noise), the weight shifts to xt: a single reverse step barely changes a heavily-noised state, so the posterior mean stays close to xt and the pull toward x0 enters only gradually, one step at a time. Geometrically, μ̃t sits on the line segment between xt and x0, sliding toward x0 as the chain approaches clean data. This is Bayesian denoising: "I know you are trying to get back to x0, I know you are currently at xt, so the best next step is this weighted mixture."
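The weight shift can be read straight off the posterior-mean coefficients. A numpy sketch using DDPM's standard linear schedule (β from 1e-4 to 0.02; the `weights` helper is mine):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)           # DDPM's linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def weights(t):
    # coefficients of x0 and x_t in the posterior mean, t is 1-indexed
    ab_prev = alpha_bar[t - 2] if t > 1 else 1.0
    w_x0 = np.sqrt(ab_prev) * betas[t - 1] / (1 - alpha_bar[t - 1])
    w_xt = np.sqrt(alphas[t - 1]) * (1 - ab_prev) / (1 - alpha_bar[t - 1])
    return w_x0, w_xt

# at t = 1 the posterior mean is exactly x0 ...
assert np.isclose(weights(1)[0], 1.0) and np.isclose(weights(1)[1], 0.0)
# ... while deep in the chain it hugs x_t
w_x0_late, w_xt_late = weights(1000)
assert w_x0_late < 0.01 and w_xt_late > 0.98
```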
5. The Simplification to a Noise-Prediction Loss
Step 1: KL Between Two Gaussians with the Same Covariance
We parameterize the reverse process as
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right), \tag{38}$$
with σt² fixed (commonly σt² = β̃t or βt). Both q(xt−1∣xt,x0) in (25) and pθ(xt−1∣xt) in (38) are Gaussians with the same (isotropic) covariance, so the KL has a clean closed form:

$$D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right) = \frac{1}{2\sigma_t^2}\,\big\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2 + C,$$

where $C$ is independent of θ (and zero when σt² = β̃t).
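For equal isotropic covariances the Gaussian KL really does collapse to a scaled squared distance between means. A Monte Carlo check in numpy (dimension, means, and variance are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
mu1 = rng.normal(size=d)
mu2 = mu1 + 0.1                               # second mean, same covariance
sigma2 = 0.3

# Monte Carlo estimate of KL = E_{x ~ N(mu1, sigma2 I)}[log p1(x) - log p2(x)]
x = mu1 + np.sqrt(sigma2) * rng.standard_normal((500_000, d))
log_ratio = ((x - mu2)**2 - (x - mu1)**2).sum(axis=1) / (2 * sigma2)
kl_mc = log_ratio.mean()

# the claimed closed form for equal isotropic covariances
kl_closed = np.sum((mu1 - mu2)**2) / (2 * sigma2)

assert abs(kl_mc - kl_closed) < 8e-3
```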
The two xt coefficients combine: βt+αt(1−αˉt−1)=1−αˉt (using αt=1−βt), so
$$\tilde\mu_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon\right). \tag{44}$$
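The ε-form (44) and the weighted-average form of μ̃t from Section 4 are the same quantity; a quick numpy consistency check (assumed linear schedule, arbitrary timestep and data):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

t = 500                                        # any 2 <= t <= T (1-indexed)
b, a, ab, ab_prev = betas[t-1], alphas[t-1], alpha_bar[t-1], alpha_bar[t-2]

x0 = rng.normal(size=16)
eps = rng.standard_normal(16)
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps  # reparameterized forward sample

# posterior mean as a weighted average of x0 and x_t
mu_weighted = (np.sqrt(ab_prev) * b / (1 - ab)) * x0 \
            + (np.sqrt(a) * (1 - ab_prev) / (1 - ab)) * xt

# posterior mean in the epsilon form (44)
mu_eps = (xt - b / np.sqrt(1 - ab) * eps) / np.sqrt(a)

assert np.allclose(mu_weighted, mu_eps)
```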
Step 4: Parameterize the Model by Predicting Noise
Expression (44) suggests mirroring the same form for μθ, but with a learned noise estimate ϵθ(xt,t) in place of the true ϵ:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right). \tag{45}$$
Intuition. Why train the network to predict the noise instead of the denoised image? Because of the reparameterization (41), knowing xt and ϵ is equivalent to knowing xt and x0. They carry the same information. So the network could equally well predict (a) the clean image x0 directly, (b) the noise ϵ that was added, or (c) the posterior mean μ~t itself. All three are mathematically equivalent. In practice, predicting ϵ tends to be easier for the network: the noise is unit-variance Gaussian (standardized), while x0 has arbitrary scale and structure. A network outputting values of roughly fixed scale is easier to train, and gradients behave better. This is a practical choice, not a mathematical necessity, but it is the choice that made DDPM work.
Now the difference μ̃t−μθ simplifies dramatically; subtracting (45) from (44), the xt terms cancel:

$$\tilde\mu_t - \mu_\theta = \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar\alpha_t}}\left(\epsilon_\theta(x_t, t) - \epsilon\right),$$

so each KL term is a weighted MSE between the true and predicted noise. Ho et al. found that dropping the per-timestep weights (setting them all to 1) trains better in practice, which leaves the simplified objective

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]. \tag{48}$$
That is the whole training loop. No loop over timesteps, no Monte Carlo over trajectories, no adversarial game. Just a supervised regression problem on (xt,t)↦ϵ.
Intuition. What started as "maximize the likelihood of a high-dimensional integral that can never be computed" has become "sample a random timestep, add some noise, and ask the network to guess the noise back". The miracle is that every step of this simplification is either exact, or a tiny empirically-justified relaxation. The final objective is basically a denoising autoencoder loss applied at random noise levels, simple enough that a few lines of PyTorch can implement it.
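The data flow of one training step can be sketched in a few lines. This is a numpy stand-in, not the post's PyTorch code: `eps_model` is a dummy zero-function in place of the real UNet εθ, and the schedule is the assumed linear one.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    # stand-in for the trained network eps_theta(x_t, t); a UNet in practice
    return np.zeros_like(xt)

def diffusion_loss(x0):
    # one step of L_simple: sample t, sample noise, regress on the noise
    t = rng.integers(1, T + 1, size=x0.shape[0])       # random timesteps
    eps = rng.standard_normal(x0.shape)                # target noise
    ab = alpha_bar[t - 1][:, None]
    xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps      # noised input (41)
    return np.mean((eps - eps_model(xt, t)) ** 2)      # MSE on the noise

x0 = rng.normal(size=(64, 32))                         # a toy batch
loss = diffusion_loss(x0)
assert loss > 0
```

With the zero stand-in the loss is just the mean squared norm of the noise (≈1); training drives εθ toward ε and the loss below that baseline.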
7. Inference: How We Actually Sample
Training gives us ϵθ. But we never directly sample from pθ(x0); the marginal is exactly the intractable integral (2) we started with. Instead, inference walks the reverse Markov chain one step at a time, which is exactly what chain (3) was built for.
Starting from pure noise xT∼N(0,I), we iterate for t = T, T−1, …, 1:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I). \tag{51}$$
The term $\frac{1}{\sqrt{\alpha_t}} x_t$ un-scales the current state. Recall the forward step shrinks x by $\sqrt{\alpha_t}$; we dilate it back.
The term $-\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)$ subtracts the estimated noise, scaled by how much noise is in xt relative to how much one step adds.
The term $\sigma_t z$ re-injects stochasticity, the same way the forward chain did. Without it, the chain would collapse onto a single deterministic trajectory.
Step 2: The Last Step Is Special
At t=1 we want to land on clean data x0, not draw from a Gaussian around it. DDPM handles this by setting z=0 at the final step:
$$x_0 = \frac{1}{\sqrt{\alpha_1}}\left(x_1 - \frac{\beta_1}{\sqrt{1 - \bar\alpha_1}}\,\epsilon_\theta(x_1, 1)\right). \tag{52}$$
Equivalently, we take the mean and drop the noise term. Any remaining stochasticity would just blur the output.
Step 3: Choosing σt
The reverse variance σt2 was not learned; we fixed it in (38). Two standard choices, both introduced by Ho et al. and both producing nearly identical sample quality:
$$\sigma_t^2 = \beta_t \qquad \text{or} \qquad \sigma_t^2 = \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t. \tag{53}$$
βt is the variance of the forward step. It matches the upper bound on the true posterior variance (which holds when x0∼N(0,I)).
β~t from (27) is the exact posterior variance when x0 is known. It's the lower bound on the optimal σt2.
The truth sits between them, but in practice nothing swings much. Improved-DDPM (Nichol and Dhariwal, 2021) learns σt as an interpolation between the two, which gives a small quality bump.
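The ordering of the two choices is easy to confirm numerically: with the standard linear schedule, β̃t always sits below βt, and the two nearly coincide late in the chain once $\bar\alpha_t$ is tiny (numpy sketch, schedule values assumed):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t, for t >= 2
beta_tilde = (1 - alpha_bar[:-1]) / (1 - alpha_bar[1:]) * betas[1:]

# the exact posterior variance never exceeds the forward-step variance ...
assert np.all(beta_tilde < betas[1:])
# ... and the gap vanishes late in the chain, where alpha_bar is tiny
assert abs(beta_tilde[-1] - betas[-1]) < 1e-4
```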
Intuition. Sampling is just the forward chain run backwards, but with a learned best-guess at each step. At every t, the network says "here is what I think the noise component of xt is", and we subtract a scaled version of that from xt to land at xt−1. Then we add fresh Gaussian noise (except on the last step) so that the trajectory stays random, matching the statistical variety of the forward chain we trained against.
The Full Sampling Algorithm
Putting the pieces together:
Algorithm 2: DDPM sampling (Ho et al., 2020)
It calls the network exactly T times (typically T=1000), which is why vanilla DDPM sampling is so slow.
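The loop structure can be sketched as follows. This is a numpy stand-in for illustration, not the post's implementation: `eps_model` is a dummy zero-function in place of the trained εθ, the schedule is the assumed linear one, and σt² = βt is used.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(xt, t):
    # stand-in for the trained network eps_theta(x_t, t)
    return np.zeros_like(xt)

def sample(shape):
    x = rng.standard_normal(shape)                       # x_T ~ N(0, I)
    for t in range(T, 0, -1):                            # t = T, ..., 1
        b, a, ab = betas[t-1], alphas[t-1], alpha_bar[t-1]
        mean = (x - b / np.sqrt(1 - ab) * eps_model(x, t)) / np.sqrt(a)
        z = rng.standard_normal(shape) if t > 1 else 0.0 # no noise at t = 1
        x = mean + np.sqrt(b) * z                        # sigma_t^2 = beta_t
    return x

x0 = sample((8, 4))
assert x0.shape == (8, 4) and np.all(np.isfinite(x0))
```

One network call per loop iteration, T iterations total: exactly the bottleneck discussed next.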
Why Sampling Is Slow, and How to Fix It
The chain is Markov and the transitions are entangled: xt−1 genuinely depends on xt, so there is no algebraic shortcut that collapses T steps into one, the way (41) collapses training.
Three families of work attack this bottleneck, all reusing the same trained ϵθ (the loss in (48) is already everything you need):
DDIM (Song, Meng, Ermon, 2021) derives a deterministic reverse process whose marginals at each t still match the forward chain. Samples can be produced in 25 to 50 steps instead of 1000 with only a small quality loss.
DPM-Solver (Lu et al., 2022) treats the reverse chain as a solver for a probability-flow ODE and uses higher-order numerical integration to reach good samples in ∼10 to 20 steps.
Classifier-free guidance (Ho and Salimans, 2021) does not reduce step count but scales ϵθ between conditional and unconditional predictions to sharpen samples. This is the trick that made text-to-image diffusion actually work.
All three sit on top of the ϵθ you trained with (48); none of them require re-training.
8. Summary of the Simplification Path
| Step | Form |
| --- | --- |
| Intractable (1), (2) | $\max_\theta \log p_\theta(x_0) = \log \int p_\theta(x_{0:T})\, dx_{1:T}$ |
| ↓ Jensen's inequality with $q(x_{1:T} \mid x_0)$ | |
| Lower bound (6) | $\mathcal{L}_{\text{ELBO}} = \mathbb{E}_q\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$ |
| ↓ Bayes' rule on $q$, telescoping | |
| Sum of KLs (18) | $L_T + \sum_{t \ge 2} L_{t-1} + L_0$, all Gaussian-Gaussian |
| ↓ reparameterize $x_t$ in terms of $\epsilon$ | |
| Weighted MSE (47) | $\sum_t w_t\, \mathbb{E}\!\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$ |
| ↓ drop weights $w_t \to 1$ | |
| Tractable objective (48) | $\mathcal{L}_{\text{simple}} = \mathbb{E}\!\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$ |
The key insight is that each transformation either preserves the bound exactly (the ELBO, Bayes flip, telescope, and KL rewrite) or trades a tighter bound for a cleaner objective that empirically trains better (the weight drop). The end result is an MSE loss between true noise and predicted noise, something we can compute from a single forward pass of a neural network.
What's Next
We now have the objective function and the sampler. Two things are missing:
The T≈1000 steps of the DDPM sampler are painfully slow. Can we reuse the same trained ϵθ with a smarter sampler and get samples in ∼25 steps instead?
We still need actual code: a schedule, a UNet, a training loop.
Part 2 answers the first question. It derives DDIM (Denoising Diffusion Implicit Models), which generalizes the reverse process into a family parameterized by η. Setting η=0 gives a deterministic sampler that produces quality equivalent to DDPM in 25 to 50 steps. No retraining: the loss (48) already contains everything DDIM needs.
Part 3 answers the second. We build DDPM end-to-end in PyTorch: noise schedule, small UNet with time embeddings, training loop from (48), and the iterative sampler (51). Every equation in this post maps to a few lines of code.
Written by Suchinthaka Wanninayaka
AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.