Diffusion Deep Dive Part 2: DDIM — From 1000 Steps to 25 Without Retraining
Part 1 ended on a slightly unsatisfying note: we trained a network with a clean two-line loss, but sampling still required running it $T = 1000$ times in sequence. For a single image on a single GPU that is a matter of seconds; for a batch of conditional text-to-image samples with a large UNet it is minutes.
DDIM (Song, Meng, Ermon, ICLR 2021) fixes this. The headline result:
- Same trained network. DDIM does not change the training objective of Part 1. You take any network $\epsilon_\theta$ already trained as a DDPM.
- A family of samplers parameterized by $\eta \in [0, 1]$. $\eta = 1$ recovers DDPM; $\eta = 0$ is deterministic. Everything in between is a continuous interpolation.
- Sub-sampling timesteps. Because the construction is non-Markov, you can run the reverse process on any subsequence of $\{1, \dots, T\}$. With $S = 50$ or $S = 100$ steps you get samples nearly indistinguishable from the DDPM ones.
This post derives all of that. If you have Part 1 in your head, there are no new tricks; it is one clean Bayes-rule construction plus one choice of variance.
Reference: Song, Meng, Ermon, "Denoising Diffusion Implicit Models" (ICLR 2021).
Notation Recap
Everything is the same as Part 1:
| Symbol | Meaning |
|---|---|
| $x_0$ | Clean data sample |
| $\alpha_t$, $\bar\alpha_t = \prod_{s \le t} \alpha_s$ | Noise schedule (shared with DDPM) |
| $\epsilon_\theta(x_t, t)$ | The network trained with the DDPM loss |
| $q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big)$ | Forward marginal, from Part 1 |
DDIM introduces two new pieces:
| Symbol | Meaning |
|---|---|
| $\sigma_t$ | Free parameter: noise level injected at the reverse step $t \to t-1$ |
| $\eta$ | Interpolation knob: $\eta = 1$ DDPM (stochastic), $\eta = 0$ deterministic |
| $\tau = (\tau_1, \dots, \tau_S)$ | Sub-sampled timestep schedule of length $S \ll T$ |
1. The Key Insight: Training Uses Marginals, Not the Chain
Re-read the DDPM training loss from Part 1:

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon \sim \mathcal{N}(0, I)}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\big)\big\|^2\Big]$$

Notice what does not appear: the forward chain $q(x_t \mid x_{t-1})$. The loss only references

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,$$

which is just the marginal $q(x_t \mid x_0)$. The full chain of transitions is a construction we used to derive the ELBO, but once the network is trained, only $q(x_t \mid x_0)$ matters.
Intuition. DDPM's forward chain is one particular way to produce a sample with the right marginal $q(x_t \mid x_0)$. Nothing forces us to invert that exact chain. Any process whose reverse transitions also give samples with marginal $q(x_t \mid x_0)$ is fair game, and can reuse the same $\epsilon_\theta$. DDIM exploits this freedom.
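To make the point concrete, here is a minimal NumPy sketch (the linear $\beta$ schedule and the helper name `sample_xt` are our illustrative assumptions): a training pair $(x_t, \epsilon)$ is drawn from the closed-form marginal in one shot, with no chain simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear beta schedule (an assumption; any DDPM schedule works).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def sample_xt(x0, t):
    """Draw x_t ~ q(x_t | x_0) directly from the marginal -- no chain needed."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # the DDPM loss compares eps_theta(xt, t) against eps

x0 = rng.standard_normal(64)   # stand-in for a flattened image
xt, eps = sample_xt(x0, t=500)
```

Nothing in this sampling routine (or in the loss it feeds) ever touches two adjacent timesteps.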
2. A Non-Markov Forward Process with the Same Marginals
Song et al. define a family of joint distributions $q_\sigma(x_{1:T} \mid x_0)$, parameterized by a sequence $\sigma = (\sigma_1, \dots, \sigma_T)$ of non-negative standard deviations, such that:
- The marginals still agree with DDPM: $q_\sigma(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big)$.
- The $x_0$-conditioned reverse $q_\sigma(x_{t-1} \mid x_t, x_0)$ has a specific Gaussian form we will construct.
The construction is neat: fix the marginals first, then pick $q_\sigma(x_{t-1} \mid x_t, x_0)$, and let the rest of the joint follow.
The defining posterior. For $t > 1$,

$$q_\sigma(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\cdot\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{\sqrt{1-\bar\alpha_t}},\ \ \sigma_t^2 I\right)$$

The constraint $\sigma_t^2 \le 1 - \bar\alpha_{t-1}$ must hold for the square root to be real.
Why this mean? The term $\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{\sqrt{1-\bar\alpha_t}}$ is just the standardized noise that took $x_0$ to $x_t$; call it $\epsilon$. Then the mean rewrites as

$$\mu_t = \sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\epsilon,$$

with $\epsilon \sim \mathcal{N}(0, I)$ under the forward process. Check the total variance of $x_{t-1}$ given $x_0$: $\big(1-\bar\alpha_{t-1}-\sigma_t^2\big) + \sigma_t^2 = 1-\bar\alpha_{t-1}$, which exactly recovers the marginal variance we need for $q_\sigma(x_{t-1} \mid x_0)$. The construction is designed to preserve the marginals by inspection.
Intuition. The mean in the posterior is "land on $x_0$, then push partway along the exact direction of the noise we observed". That direction is unit-variance (after standardization), so we can split the remaining variance $1-\bar\alpha_{t-1}$ freely between deterministic drift along the noise and fresh Gaussian noise. $\sigma_t$ controls the split.
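A quick numerical sanity check of this variance bookkeeping (schedule values are illustrative):

```python
import numpy as np

# Illustrative schedule; only the alpha_bar values matter here.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

t = 500
sigma = 0.1  # any choice with sigma^2 <= 1 - alpha_bar[t-1]

# Coefficient on the (unit-variance) standardized noise direction.
drift = np.sqrt(1.0 - alpha_bar[t - 1] - sigma**2)

# Drift variance plus injected noise variance recovers the marginal variance.
assert np.isclose(drift**2 + sigma**2, 1.0 - alpha_bar[t - 1])
```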
3. The Reverse Step in Terms of $\epsilon_\theta$
At inference we do not know $x_0$. The DDPM trick from Part 1 lets us predict it from $x_t$ and the network:

$$\hat x_0(x_t) = \frac{x_t - \sqrt{1-\bar\alpha_t}\;\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}$$

Substitute $\hat x_0$ for $x_0$ in the posterior mean. The standardized noise $\frac{x_t - \sqrt{\bar\alpha_t}\,\hat x_0}{\sqrt{1-\bar\alpha_t}}$ becomes exactly $\epsilon_\theta(x_t, t)$ (you can verify this by plugging in $\hat x_0$). We land on the DDIM reverse update:

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat x_0(x_t) \;+\; \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t, t) \;+\; \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$
Three pieces, each physically meaningful:
- : "predict the clean image, then renoise it to the signal level required at ".
- : "push partway along the network's estimated noise direction".
- : "add whatever Gaussian noise is needed to top up the variance to ".
This update is the whole sampler, except we still need to pick $\sigma_t$.
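As a sketch, one reverse step in NumPy (the function name and signature are ours; `eps_pred` stands in for $\epsilon_\theta(x_t, t)$):

```python
import numpy as np

def ddim_step(xt, eps_pred, ab_t, ab_prev, sigma, rng):
    """One DDIM reverse step x_t -> x_{t-1}.

    ab_t and ab_prev are alpha_bar at the current and previous timesteps."""
    x0_hat = (xt - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)  # predict x0
    drift = np.sqrt(1.0 - ab_prev - sigma**2) * eps_pred            # push along noise
    noise = sigma * rng.standard_normal(xt.shape)                   # top up variance
    return np.sqrt(ab_prev) * x0_hat + drift + noise
```

With `sigma = 0` and an exact noise estimate, the step lands exactly on $\sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon$, which makes a handy unit test.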
4. The $\eta$ Parameter: Stochastic vs Deterministic
Song et al. parameterize $\sigma_t$ as

$$\sigma_t(\eta) = \eta\,\sqrt{\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}}\,\sqrt{1-\frac{\bar\alpha_t}{\bar\alpha_{t-1}}}$$
Two cases matter.
4.1. $\eta = 1$: Recovers DDPM
Plugging $\eta = 1$ into the formula for $\sigma_t$ and simplifying using $\bar\alpha_t = \alpha_t \bar\alpha_{t-1}$ and $\beta_t = 1 - \alpha_t$:

$$\sigma_t^2(1) = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t = \tilde\beta_t$$

This is exactly the DDPM posterior variance from Part 1. And the mean of the reverse update at $\sigma_t = \sigma_t(1)$ matches the DDPM reverse mean from Part 1 (a couple of lines of algebra). So DDIM with $\eta = 1$ is DDPM. Nothing has been lost.
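You can check the algebra numerically (illustrative linear schedule):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # illustrative DDPM schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

t = 500
# DDIM's sigma_t^2 at eta = 1 ...
sigma2 = ((1 - alpha_bar[t - 1]) / (1 - alpha_bar[t])
          * (1 - alpha_bar[t] / alpha_bar[t - 1]))
# ... equals the DDPM posterior variance beta_tilde from Part 1.
beta_tilde = (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t]) * betas[t]
assert np.isclose(sigma2, beta_tilde)
```

The key identity is $1 - \bar\alpha_t / \bar\alpha_{t-1} = 1 - \alpha_t = \beta_t$.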
4.2. $\eta = 0$: Deterministic Sampling
Plugging $\eta = 0$ in gives $\sigma_t = 0$, and the reverse update collapses to

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat x_0(x_t) + \sqrt{1-\bar\alpha_{t-1}}\;\epsilon_\theta(x_t, t)$$

No noise injection. Given a fixed $x_T$, the trajectory is a deterministic function of $x_T$ and $\theta$. This is what the community calls "DDIM sampling" in the narrow sense.
Intuition. At $\eta = 1$ each reverse step is a noisy nudge toward cleaner data. At $\eta = 0$ each step is a crisp projection: "if the current noise direction is $\epsilon_\theta(x_t, t)$, walk deterministically along it to the next signal level". The $0 < \eta < 1$ regime is a smooth family connecting the two.
5. Sub-Sampling Timesteps: The Actual Speed-Up
Here is the payoff. The construction of $q_\sigma$ in Section 2 is non-Markov in $x_t$: the posterior was defined to preserve the marginals $q(x_t \mid x_0)$, without reference to the chain order. Consequently, the reverse update does not require consecutive timesteps.
Pick any strictly increasing subsequence

$$\tau = (\tau_1, \tau_2, \dots, \tau_S), \qquad \tau_i \in \{1, \dots, T\}, \quad S \ll T.$$

Then run the reverse update on this sparse grid:

$$x_{\tau_{i-1}} = \sqrt{\bar\alpha_{\tau_{i-1}}}\;\hat x_0(x_{\tau_i}) + \sqrt{1-\bar\alpha_{\tau_{i-1}}-\sigma_{\tau_i}^2}\;\epsilon_\theta(x_{\tau_i}, \tau_i) + \sigma_{\tau_i} z$$

We call the network $S$ times instead of $T$ times. Typical choices:
- $\tau_i = \lfloor i T / S \rfloor$ (linear striding), or
- $\tau_i \propto i^2$ (quadratic striding, which concentrates steps near $t = 0$).
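Both schedules are a few lines (the helper name `make_tau` is ours):

```python
import numpy as np

def make_tau(T, S, kind="linear"):
    """Strictly increasing sub-sampled timestep grid of (at most) S steps."""
    if kind == "linear":
        tau = np.linspace(0, T - 1, S)
    else:  # quadratic: concentrates steps near t = 0
        tau = np.linspace(0, np.sqrt(T - 1), S) ** 2
    return np.unique(tau.astype(int))   # dedupe after rounding down
```

With `T = 1000` and `S = 50`, the quadratic grid spends roughly half its budget below $t \approx 250$, where the fine details get resolved.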
With $S = 50$ and $\eta = 0$ on CIFAR-10, Song et al. report FID competitive with DDPM. On ImageNet and CelebA the story is similar.
Intuition. DDPM's reverse chain is Markov, which tied us to consecutive timesteps, which meant $T$ network calls. DDIM's reverse update is non-Markov in the latents (it always conditions on the predicted $\hat x_0$), which decouples the chain from the grid and lets us choose a much coarser grid without re-deriving anything.
6. Connection to the Probability-Flow ODE
When $\eta = 0$, the DDIM update is a discretization of a deterministic differential equation. Define a continuous time $t \in [0, 1]$ via some smooth schedule, and let $\bar\alpha(t)$ interpolate the discrete $\bar\alpha_t$. The DDIM update can be rewritten as an Euler step for

$$\frac{dx}{dt} = f(t)\,x - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x),$$

where $f(t)$ and $g(t)$ are the schedule's drift and diffusion coefficients and $\nabla_x \log p_t(x)$ is the score of the noisy marginal at time $t$. This ODE is the probability-flow ODE of Song et al. (2021, "Score-Based Generative Modeling Through Stochastic Differential Equations"). The score and the noise prediction are related by

$$\nabla_x \log p_t(x) = -\frac{\epsilon_\theta(x, t)}{\sqrt{1-\bar\alpha_t}},$$

so $\epsilon_\theta$ is, up to a scalar, a score network. This reframing has two consequences:
- Any ODE solver applies. Deterministic DDIM is Euler; higher-order solvers (Heun, DPM-Solver, PLMS) give the same or better quality with even fewer steps (often $10$ to $20$).
- The latent is informative. Because the map $x_T \mapsto x_0$ is deterministic and smooth, latents are meaningfully interpolable.
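For a one-point "dataset" the noisy marginal is Gaussian in closed form, so the score-to-noise relation can be checked exactly (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
ab = 0.3                        # alpha_bar at some timestep, illustrative
x0 = rng.standard_normal(8)     # one-point dataset: p_t = N(sqrt(ab) x0, (1 - ab) I)
eps = rng.standard_normal(8)
x = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps

score = -(x - np.sqrt(ab) * x0) / (1 - ab)       # exact Gaussian score at x
assert np.allclose(score, -eps / np.sqrt(1 - ab))
```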
7. The DDIM Sampling Algorithm
The sampler is just the reverse update applied along the grid $\tau$. Four things are worth noting:
- $\epsilon_\theta$ is called exactly $S$ times, not $T$.
- At the final step the $\sigma z$ term vanishes, so you can skip sampling $z$.
- At $\eta = 1$ and $\tau = (1, \dots, T)$ (no sub-sampling) this algorithm is bit-identical to DDPM's Algorithm 2 from Part 1.
- It is useful to clamp $\hat x_0$ to the data range (e.g. $[-1, 1]$) at each step for stability; this is a common trick with no theoretical cost.
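Putting the pieces together, a minimal sampling loop in NumPy (`eps_model(x, t)` stands in for the trained $\epsilon_\theta$, and `alpha_bar` for the discrete schedule; both names are ours):

```python
import numpy as np

def ddim_sample(eps_model, shape, alpha_bar, tau, eta=0.0, seed=0):
    """Run the DDIM reverse update along the ascending timestep grid tau."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                      # start from pure noise
    for i in range(len(tau) - 1, 0, -1):
        t, t_prev = tau[i], tau[i - 1]
        ab_t, ab_prev = alpha_bar[t], alpha_bar[t_prev]
        eps = eps_model(x, t)
        sigma = eta * np.sqrt((1 - ab_prev) / (1 - ab_t)) * np.sqrt(1 - ab_t / ab_prev)
        x0_hat = (x - np.sqrt(1 - ab_t) * eps) / np.sqrt(ab_t)   # predict x0
        x0_hat = np.clip(x0_hat, -1.0, 1.0)             # optional stability clamp
        noise = sigma * rng.standard_normal(shape) if eta > 0 else 0.0
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1 - ab_prev - sigma**2) * eps + noise
    return x
```

At `eta=0` the loop is the deterministic update applied $S - 1$ times; at `eta=1` with a dense grid it matches DDPM ancestral sampling.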
8. What Deterministic DDIM Enables
Because sampling is a deterministic invertible map $x_T \mapsto x_0$, three non-obvious things become possible.
(a) Image encoding. Given a real image $x_0$, the reverse-direction Euler step of the $\eta = 0$ update encodes it back to a latent $x_T$ such that running DDIM forward on $x_T$ reconstructs (to numerical precision) the original image. This is how DDIM Inversion works.
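A sketch of the inversion loop (the helper name is ours; it uses the standard approximation that $\epsilon_\theta$ varies slowly between adjacent grid points):

```python
import numpy as np

def ddim_invert(eps_model, x0, alpha_bar, tau):
    """Encode a real image x0 into a latent by running the eta=0 update backwards."""
    x = x0
    for i in range(len(tau) - 1):
        t, t_next = tau[i], tau[i + 1]
        ab_t, ab_next = alpha_bar[t], alpha_bar[t_next]
        eps = eps_model(x, t)                            # local noise estimate
        x0_hat = (x - np.sqrt(1 - ab_t) * eps) / np.sqrt(ab_t)
        x = np.sqrt(ab_next) * x0_hat + np.sqrt(1 - ab_next) * eps  # step up in noise
    return x
```

Running the forward sampler on the returned latent with the same grid approximately reconstructs `x0`.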
(b) Semantic interpolation. Interpolating two real images in pixel space is often disappointing: mixing pixels gives ghosty results. Instead, DDIM-invert both to latents $x_T^{(1)}, x_T^{(2)}$ and slerp (spherical linear interpolation) between them, then DDIM-sample. The intermediate outputs cross through plausible, in-distribution samples.
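Slerp itself is short (a standard formula, not specific to DDIM):

```python
import numpy as np

def slerp(z1, z2, lam):
    """Spherical interpolation: lam = 0 returns z1, lam = 1 returns z2."""
    cos = np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    if np.isclose(theta, 0.0):                # nearly parallel: lerp is fine
        return (1 - lam) * z1 + lam * z2
    return (np.sin((1 - lam) * theta) * z1 + np.sin(lam * theta) * z2) / np.sin(theta)
```

Slerp is preferred over linear interpolation because Gaussian latents concentrate on a sphere; the linear midpoint of two such latents has atypically small norm.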
(c) Deterministic seeds. Fixing $x_T$ fixes the output entirely. This is the basis for A/B testing prompts under classifier-free guidance (same seed, different text conditioning, directly comparable outputs) in text-to-image systems.
The stochastic DDPM sampler cannot do any of these, because the path from $x_T$ to $x_0$ integrates fresh Gaussian noise at every step.
9. Summary
| Quantity | DDPM (Part 1) | DDIM ($\eta = 0$) |
|---|---|---|
| Training loss | $L_{\text{simple}}$ | same |
| Network | $\epsilon_\theta$ required | same |
| Reverse step | stochastic, Markov | deterministic, non-Markov in latents |
| Steps needed | $T = 1000$ | $20$ to $100$ (or $10$ to $20$ with better ODE solvers) |
| $x_T \mapsto x_0$ map | many-to-many (noise injected at each step) | invertible bijection |
| Interpolation / inversion | no | yes |
The path from DDPM to DDIM in one sentence: redefine the forward process as non-Markov while keeping the marginals; a family of samplers parameterized by $\sigma_t$ (equivalently $\eta$) falls out, with $\eta = 0$ being a fast deterministic solver for the same trained network.
What's Next
In Part 3 we build a DDPM end-to-end in PyTorch: the $\beta$ schedule, a UNet with time embeddings, the training loop from Part 1, and the reverse sampler. Swapping in the DDIM sampler from this post is a ten-line change: replace the sampling loop with the algorithm above, keep $\epsilon_\theta$ untouched, and set $\eta = 0$, $S = 50$. Same network, twenty times faster.
Written by Suchinthaka Wanninayaka
AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.