Diffusion Deep Dive Part 2: DDIM — From 1000 Steps to 25 Without Retraining

Suchinthaka W. · 11 min read

Tags: diffusion, generative-models, deep-learning, ddim, ddpm, ode-samplers, machine-learning

Part 1 ended on a slightly unsatisfying note: we trained a network with a clean two-line loss, but sampling still required running it $T \approx 1000$ times in sequence. For a single image on a single GPU that is seconds; for a batch of conditional text-to-image samples with a large UNet it is minutes.

DDIM (Song, Meng, Ermon, ICLR 2021) fixes this. The headline result:

  • Same trained network. DDIM does not change the training objective in (48) of Part 1. You take any network already trained as a DDPM.
  • A family of samplers parameterized by $\eta$. $\eta = 1$ recovers DDPM. $\eta = 0$ is deterministic. Everything in between is a continuous interpolation.
  • Sub-sampling timesteps. Because the construction is non-Markov, you can run the reverse process on any subsequence $\tau_1 < \tau_2 < \cdots < \tau_S$ of $\{1, \ldots, T\}$. With $S = 25$ or $50$ you get samples indistinguishable from the $T = 1000$ DDPM ones.

This post derives all of that. If you have Part 1 in your head, there are no new tricks; it is one clean Bayes-rule construction plus one choice of variance.

Reference: Song, Meng, Ermon, "Denoising Diffusion Implicit Models" (ICLR 2021).

Notation Recap

Everything is the same as Part 1:

| Symbol | Meaning |
|---|---|
| $\mathbf{x}_0$ | Clean data sample |
| $\alpha_t = 1 - \beta_t$, $\;\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$ | Noise schedule (shared with DDPM) |
| $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ | The network trained with the DDPM loss (48) |
| $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,\mathbf{x}_0, (1-\bar\alpha_t)\mathbf{I})$ | Forward marginal, from (24) of Part 1 |

DDIM introduces two new pieces:

| Symbol | Meaning |
|---|---|
| $\sigma_t$ | Free parameter: noise level injected at the reverse step |
| $\eta \in [0, 1]$ | Interpolation knob: $\eta{=}1$ DDPM (stochastic), $\eta{=}0$ deterministic |
| $\tau = (\tau_1, \ldots, \tau_S)$ | Sub-sampled timestep schedule of length $S \ll T$ |

All equations are numbered (1)-(N) in order of appearance.

1. The Key Insight: Training Uses Marginals, Not the Chain

Re-read the DDPM training loss from Part 1, equation (48):

$$\mathcal{L}_{\mathrm{simple}}(\theta) = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\!\left[ \big\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\big(\sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon},\; t\big) \big\|^2 \right]. \tag{1}$$

Notice what does not appear: the forward chain $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$. The loss only references

$$\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}, \tag{2}$$

which is just the marginal $q(\mathbf{x}_t \mid \mathbf{x}_0)$. The full chain of transitions is a construction we used to derive the ELBO, but once the network is trained, only (2) matters.

Intuition. DDPM's forward chain is one particular way to produce a sample with the right marginal $q(\mathbf{x}_t \mid \mathbf{x}_0)$. Nothing forces us to invert that exact chain. Any process whose reverse transitions also give samples with marginal $q(\mathbf{x}_t \mid \mathbf{x}_0)$ is fair game, and can reuse the same $\boldsymbol{\epsilon}_\theta$. DDIM exploits this freedom.
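Concretely, (2) is a one-liner. A minimal NumPy sketch (the linear beta schedule is my assumption, the standard DDPM choice, not something fixed by this post):

```python
import numpy as np

# Standard linear DDPM schedule (assumed values; any schedule works).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # \bar\alpha_t for t = 0..T-1

def q_sample(x0, t, eps):
    """Draw x_t from the marginal q(x_t | x_0), eq. (2) -- no chain needed."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)   # stand-in for a clean sample
eps = rng.standard_normal(8)
xt = q_sample(x0, 500, eps)
```

This is exactly what the training loop calls; no transition $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ ever appears.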

2. A Non-Markov Forward Process with the Same Marginals

Song et al. define a family of joint distributions $q_\sigma(\mathbf{x}_{1:T} \mid \mathbf{x}_0)$, parameterized by a sequence $\sigma = (\sigma_1, \ldots, \sigma_T)$ of non-negative noise levels, such that:

  1. The marginals still agree with DDPM: $q_\sigma(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar\alpha_t}\,\mathbf{x}_0, (1-\bar\alpha_t)\mathbf{I})$.
  2. The $\mathbf{x}_0$-conditioned reverse $q_\sigma(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ has a specific Gaussian form we will construct.

The construction is neat: fix the marginals first, then pick $q_\sigma(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, and let the rest of the joint follow.

The defining posterior. For $t > 1$,

$$q_\sigma(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\Bigg(\mathbf{x}_{t-1};\; \sqrt{\bar\alpha_{t-1}}\,\mathbf{x}_0 \;+\; \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\cdot\,\frac{\mathbf{x}_t - \sqrt{\bar\alpha_t}\,\mathbf{x}_0}{\sqrt{1-\bar\alpha_t}},\; \sigma_t^2 \mathbf{I} \Bigg). \tag{3}$$

The constraint $\sigma_t^2 \leq 1 - \bar\alpha_{t-1}$ must hold for the square root to be real.

Why this mean? The term $(\mathbf{x}_t - \sqrt{\bar\alpha_t}\,\mathbf{x}_0)/\sqrt{1-\bar\alpha_t}$ is just the standardized noise that took $\mathbf{x}_0$ to $\mathbf{x}_t$; call it $\boldsymbol{\epsilon}_t$. Then (3) rewrites as

$$\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\mathbf{x}_0 \;+\; \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\boldsymbol{\epsilon}_t \;+\; \sigma_t \mathbf{z}, \tag{4}$$

with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Check the total variance: $(1-\bar\alpha_{t-1}-\sigma_t^2) + \sigma_t^2 = 1-\bar\alpha_{t-1}$, which exactly recovers the marginal variance we need for $q_\sigma(\mathbf{x}_{t-1} \mid \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar\alpha_{t-1}}\,\mathbf{x}_0, (1-\bar\alpha_{t-1})\mathbf{I})$. The construction is designed to preserve this by inspection.

Intuition. The mean in (3) is "land on $\sqrt{\bar\alpha_{t-1}}\,\mathbf{x}_0$, then push partway along the exact direction of the noise we observed". That direction is unit-length (after normalization), so we can split the remaining variance freely between deterministic drift along the noise and fresh Gaussian noise. $\sigma_t$ controls the split.
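The variance bookkeeping in (4) is easy to confirm numerically. A quick Monte Carlo sketch (the schedule, the scalar stand-in for an image, and the particular $\sigma_t$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

t = 500
ab_prev = alpha_bar[t - 1]
sigma_t = 0.1                            # any value with sigma_t**2 <= 1 - ab_prev

# Sample x_{t-1} via eq. (4): x0 term + drift along the noise + fresh noise.
N = 200_000
x0 = 1.7                                 # arbitrary scalar "image"
eps_t = rng.standard_normal(N)           # the standardized noise behind x_t
z = rng.standard_normal(N)               # fresh noise at the reverse step
x_prev = (np.sqrt(ab_prev) * x0
          + np.sqrt(1.0 - ab_prev - sigma_t**2) * eps_t
          + sigma_t * z)

# The marginal q(x_{t-1} | x_0) is recovered regardless of how sigma_t
# splits the variance between the two noise terms.
mean_err = abs(x_prev.mean() - np.sqrt(ab_prev) * x0)
var_err = abs(x_prev.var() - (1.0 - ab_prev))
```

Rerunning with any other admissible $\sigma_t$ leaves both errors at Monte Carlo noise level, which is the whole point of the construction.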

3. The Reverse Step in Terms of $\boldsymbol{\epsilon}_\theta$

At inference we do not know $\mathbf{x}_0$. The DDPM trick (Part 1, equation (42)) lets us predict it from $\mathbf{x}_t$ and the network:

$$\hat{\mathbf{x}}_0(\mathbf{x}_t, t) = \frac{\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar\alpha_t}}. \tag{5}$$

Substitute (5) for $\mathbf{x}_0$ in (4). The standardized noise $\boldsymbol{\epsilon}_t = (\mathbf{x}_t - \sqrt{\bar\alpha_t}\,\hat{\mathbf{x}}_0)/\sqrt{1-\bar\alpha_t}$ becomes exactly $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ (you can verify this by plugging in (5)). We land on the DDIM reverse update:

$$\boxed{\; \mathbf{x}_{t-1} \;=\; \sqrt{\bar\alpha_{t-1}}\,\hat{\mathbf{x}}_0(\mathbf{x}_t, t) \;+\; \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \;+\; \sigma_t\,\mathbf{z}. \;} \tag{6}$$

Three pieces, each physically meaningful:

  • $\sqrt{\bar\alpha_{t-1}}\,\hat{\mathbf{x}}_0$: "predict the clean image, then renoise it to the signal level required at $t-1$".
  • $\sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\boldsymbol{\epsilon}_\theta$: "push partway along the network's estimated noise direction".
  • $\sigma_t\,\mathbf{z}$: "add whatever Gaussian noise is needed to top up the variance to $1-\bar\alpha_{t-1}$".

Equation (6) is the whole sampler, except we still need to pick $\sigma_t$.
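Equations (5) and (6) together are only a few lines of code. A sketch (the function name and scalar inputs are mine, not from the post):

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev, sigma_t, z):
    """One reverse update, eq. (6), with eps_pred = eps_theta(x_t, t)."""
    # eq. (5): predict the clean sample from x_t and the noise estimate
    x0_hat = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    # eq. (6): renoise to level t-1, drift along eps_pred, add fresh noise
    return (np.sqrt(ab_prev) * x0_hat
            + np.sqrt(1.0 - ab_prev - sigma_t**2) * eps_pred
            + sigma_t * z)
```

A useful sanity check: if `eps_pred` happens to be the exact noise that generated `x_t` from `x0`, then `x0_hat` recovers `x0` exactly and the step reproduces eq. (4).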

4. The $\eta$ Parameter: Stochastic vs Deterministic

Song et al. parameterize $\sigma_t$ as

$$\sigma_t(\eta) \;=\; \eta \cdot \sqrt{\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}}\,\sqrt{1 - \frac{\bar\alpha_t}{\bar\alpha_{t-1}}}, \qquad \eta \in [0, 1]. \tag{7}$$

Two cases matter.

4.1. $\eta = 1$: Recovers DDPM

Plugging $\eta = 1$ into (7) and simplifying using $\bar\alpha_t / \bar\alpha_{t-1} = \alpha_t$ and $1 - \alpha_t = \beta_t$:

$$\sigma_t^2(1) \;=\; \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,(1 - \alpha_t) \;=\; \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t \;=\; \tilde\beta_t. \tag{8}$$

This is exactly the DDPM posterior variance $\tilde\beta_t$ from Part 1 equation (27). And the mean in (6) at $\eta = 1$ matches the DDPM reverse mean from Part 1 equation (44) (a couple of lines of algebra). So DDIM with $\eta = 1$ is DDPM. Nothing has been lost.
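The identity (8) can be checked numerically in a couple of lines (the linear schedule is an assumed example):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

t = 700
ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]

# eq. (7) squared, at eta = 1
sigma2_eta1 = (1 - ab_prev) / (1 - ab_t) * (1 - ab_t / ab_prev)

# DDPM posterior variance beta-tilde (Part 1, eq. (27))
beta_tilde = (1 - ab_prev) / (1 - ab_t) * betas[t]
```

The two agree to floating-point precision for every $t$, since $\bar\alpha_t/\bar\alpha_{t-1} = 1 - \beta_t$ holds exactly by construction.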

4.2. $\eta = 0$: Deterministic Sampling

Plugging $\eta = 0$ into (7) gives $\sigma_t = 0$, and (6) collapses to

$$\mathbf{x}_{t-1} \;=\; \sqrt{\bar\alpha_{t-1}}\,\hat{\mathbf{x}}_0(\mathbf{x}_t, t) \;+\; \sqrt{1-\bar\alpha_{t-1}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t). \tag{9}$$

No noise injection. Given a fixed $\mathbf{x}_T$, the trajectory $\mathbf{x}_T \to \mathbf{x}_{T-1} \to \cdots \to \mathbf{x}_0$ is a deterministic function of $\mathbf{x}_T$ and $\theta$. This is what the community calls "DDIM sampling" in the narrow sense.

Intuition. At $\eta = 1$ each reverse step is a noisy nudge toward cleaner data. At $\eta = 0$ each step is a crisp projection: "if the current noise direction is $\boldsymbol{\epsilon}_\theta$, walk deterministically along it to the next signal level". The $\eta \in (0, 1)$ regime is a smooth family connecting the two.

5. Sub-Sampling Timesteps: The Actual Speed-Up

Here is the payoff. The construction of $q_\sigma$ in Section 2 is non-Markov in $\mathbf{x}_{1:T}$: the transition $q_\sigma(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ was defined to preserve the marginals, without reference to the chain order. Consequently, the update (6) does not require consecutive timesteps.

Pick any strictly increasing subsequence

$$\tau = (\tau_0, \tau_1, \ldots, \tau_S), \qquad \tau_0 = 0,\; \tau_S = T,\; S \ll T. \tag{10}$$

Then run the reverse update on this sparse grid:

$$\mathbf{x}_{\tau_{i-1}} \;=\; \sqrt{\bar\alpha_{\tau_{i-1}}}\,\hat{\mathbf{x}}_0(\mathbf{x}_{\tau_i}, \tau_i) \;+\; \sqrt{1-\bar\alpha_{\tau_{i-1}}-\sigma_{\tau_i}^2}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_{\tau_i}, \tau_i) \;+\; \sigma_{\tau_i}\,\mathbf{z}. \tag{11}$$

We call the network $S$ times instead of $T$ times. Typical choices:

  • $\tau_i = \lfloor i \cdot T / S \rfloor$ (linear striding), or
  • $\tau_i = \lfloor i^2 \cdot T / S^2 \rfloor$ (quadratic striding, whose gaps widen toward $T$, so it spends more of the step budget at small $t$, near the data).
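Both stride rules are one-liners (a sketch; note that floor rounding can make the quadratic grid repeat small indices, which a real implementation would deduplicate):

```python
def linear_stride(T, S):
    """tau_i = floor(i * T / S) for i = 0..S: evenly spaced timesteps."""
    return [i * T // S for i in range(S + 1)]

def quadratic_stride(T, S):
    """tau_i = floor(i^2 * T / S^2): gaps widen as i grows toward S."""
    return [i * i * T // (S * S) for i in range(S + 1)]
```

With `T = 1000, S = 50`, the quadratic grid puts its first 26 of 51 points at or below $t = 250$, while the linear grid spaces them evenly.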

With $S = 50$ and $\eta = 0$ on CIFAR-10, Song et al. report FID competitive with $T = 1000$ DDPM. On CelebA and LSUN the story is similar.

Intuition. DDPM's reverse chain is Markov, which tied us to consecutive timesteps, which meant $T$ network calls. DDIM's reverse chain is non-Markov in the latents (it always conditions on the predicted $\hat{\mathbf{x}}_0$), which decouples the chain from the grid and lets us choose a much coarser grid without re-deriving anything.

6. Connection to the Probability-Flow ODE

When $\eta = 0$, equation (9) is a discretization of a deterministic differential equation. Define the continuous time $t \in [0, 1]$ via $\bar\alpha_t =$ some smooth schedule, and let $\mathrm{d}t \to 0$. The DDIM update can be rewritten as an Euler step for

$$\mathrm{d}\mathbf{x} \;=\; -\tfrac{1}{2}\,g(t)^2\left[\mathbf{x} \;+\; \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\mathrm{d}t, \tag{12}$$

where $g(t)^2$ is the schedule's diffusion coefficient and $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is the score of the noisy marginal at time $t$. This ODE is the probability-flow ODE of Song et al. (2021, "Score-Based Generative Modeling Through Stochastic Differential Equations"). The score and the noise prediction are related by

$$\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \;=\; -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}, t)}{\sqrt{1-\bar\alpha_t}}, \tag{13}$$

so $\boldsymbol{\epsilon}_\theta$ is, up to a scalar, a score network. This reframing has two consequences:

  • Any ODE solver applies. Deterministic DDIM is Euler; higher-order solvers (Heun, DPM-Solver, PLMS) give the same or better quality with even fewer steps (often 10 to 20).
  • The latent is informative. Because the map $\mathbf{x}_T \to \mathbf{x}_0$ is deterministic and smooth, latents are meaningfully interpolable.
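Relation (13) can be verified against a case where the score is known in closed form: for the Gaussian marginal $q(\mathbf{x}_t \mid \mathbf{x}_0)$ itself, the exact score is $-(\mathbf{x}_t - \sqrt{\bar\alpha_t}\,\mathbf{x}_0)/(1-\bar\alpha_t)$, and (13) reproduces it when the "network" outputs the true standardized noise. A sketch (the helper name is mine):

```python
import numpy as np

def score_from_eps(eps_pred, ab_t):
    """eq. (13): convert a noise prediction into a score estimate."""
    return -eps_pred / np.sqrt(1.0 - ab_t)

# Compare against the exact Gaussian score of q(x_t | x_0) at a point.
ab_t, x0, x_t = 0.4, 0.7, 1.3
true_eps = (x_t - np.sqrt(ab_t) * x0) / np.sqrt(1.0 - ab_t)
exact_score = -(x_t - np.sqrt(ab_t) * x0) / (1.0 - ab_t)
```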

7. The DDIM Sampling Algorithm

Algorithm 3: DDIM sampling (Song, Meng, Ermon 2021)

Four things are worth noting:

  1. $\boldsymbol{\epsilon}_\theta$ is called exactly $S$ times, not $T$.
  2. At $\eta = 0$ the $\sigma \cdot \mathbf{z}$ term vanishes, so you can skip sampling $\mathbf{z}$.
  3. At $\eta = 1$ and $\tau_i = i$ (no sub-sampling) this algorithm is bit-identical to DDPM's Algorithm 2 from Part 1.
  4. Clamping $\hat{\mathbf{x}}_0$ to the data range (e.g. $[-1, 1]$) at each step is useful for stability; this is a common trick with no theoretical cost.
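The loop itself is short. A minimal NumPy sketch of the sampling algorithm assembled from equations (5), (6), (7), and (11) (the function names, the dummy zero-noise "network", and the schedule are my assumptions; a real run would plug in the trained $\boldsymbol{\epsilon}_\theta$):

```python
import numpy as np

def ddim_sample(eps_theta, alpha_bar, taus, eta=0.0, shape=(4,), seed=0):
    """Run the reverse update (11) backwards over the sub-sampled grid `taus`."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                         # x_T ~ N(0, I)
    for i in range(len(taus) - 1, 0, -1):
        t, t_prev = taus[i], taus[i - 1]
        ab_t, ab_prev = alpha_bar[t], alpha_bar[t_prev]
        eps = eps_theta(x, t)                              # one network call
        x0_hat = (x - np.sqrt(1 - ab_t) * eps) / np.sqrt(ab_t)   # eq. (5)
        x0_hat = np.clip(x0_hat, -1.0, 1.0)                # stability clamp (note 4)
        sigma = eta * np.sqrt((1 - ab_prev) / (1 - ab_t)) \
                    * np.sqrt(1 - ab_t / ab_prev)          # eq. (7)
        z = rng.standard_normal(shape) if eta > 0 else 0.0
        x = (np.sqrt(ab_prev) * x0_hat
             + np.sqrt(1 - ab_prev - sigma**2) * eps
             + sigma * z)                                  # eq. (6)
    return x

# Smoke run with a dummy "network" that predicts zero noise.
T, S = 1000, 50
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
taus = [i * (T - 1) // S for i in range(S + 1)]            # grid over 0 .. T-1
sample = ddim_sample(lambda x, t: np.zeros_like(x), alpha_bar, taus, eta=0.0)
```

At `eta=0.0` the run is fully deterministic given the seed of $\mathbf{x}_T$, matching Section 4.2.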

8. What Deterministic DDIM Enables

Because $\eta = 0$ sampling is a deterministic invertible map $\mathbf{x}_T \to \mathbf{x}_0$, three non-obvious things become possible.

(a) Image encoding. Given a real image $\mathbf{x}_0^\star$, the reverse-direction Euler step of (9) encodes it back to a latent $\mathbf{x}_T^\star$ such that running DDIM forward on $\mathbf{x}_T^\star$ reconstructs (to numerical precision) the original image. This is how DDIM Inversion works.

(b) Semantic interpolation. Interpolating two real images is often disappointing: mixing pixels gives ghostly results. Instead, DDIM-invert both to latents $\mathbf{x}_T^{(a)}, \mathbf{x}_T^{(b)}$ and slerp (spherical linear interpolation) between them, then DDIM-sample. The intermediate outputs cross through plausible, in-distribution samples.
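Slerp itself is short. A sketch operating on flattened latent vectors (a hypothetical helper, not code from the paper):

```python
import numpy as np

def slerp(a, b, lam):
    """Spherical interpolation between latents a and b, lam in [0, 1]."""
    a_unit = a / np.linalg.norm(a)
    b_unit = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_unit, b_unit), -1.0, 1.0))
    if np.isclose(omega, 0.0):           # nearly parallel: fall back to lerp
        return (1 - lam) * a + lam * b
    return (np.sin((1 - lam) * omega) * a + np.sin(lam * omega) * b) / np.sin(omega)
```

The reason slerp beats straight lerp here: high-dimensional Gaussian latents concentrate near a sphere of radius $\sqrt{d}$, and slerp keeps intermediate points near that shell.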

(c) Deterministic seeds. Fixing $\mathbf{x}_T$ fixes the output entirely. This is the basis for A/B testing prompts under classifier-free guidance (same seed, different text conditioning, directly comparable outputs) in text-to-image systems.

The stochastic DDPM sampler cannot do any of these, because the path from $\mathbf{x}_T$ to $\mathbf{x}_0$ integrates fresh Gaussian noise at every step.

9. Summary

| Quantity | DDPM ((51) in Part 1) | DDIM $\eta{=}0$ |
|---|---|---|
| Training loss | $\mathcal{L}_{\mathrm{simple}}$ (48) | same |
| Network $\boldsymbol{\epsilon}_\theta$ | required | same |
| Reverse step | stochastic, Markov | deterministic, non-Markov in latents |
| Steps needed | $\sim 1000$ | 25 to 50 (or 10 to 20 with better ODE solvers) |
| $\mathbf{x}_T \to \mathbf{x}_0$ map | many-to-many (noise injected at each step) | invertible bijection |
| Interpolation / inversion | no | yes |

The path from DDPM to DDIM in one sentence: redefine the forward process as non-Markov while keeping the marginals, and a family of samplers parameterized by $\eta$ falls out, with $\eta = 0$ being a fast deterministic solver for the same trained network.

What's Next

In Part 3 we build a DDPM end-to-end in PyTorch: schedule, UNet with time embeddings, training loop from (48) of Part 1, and the reverse sampler (51). Swapping in the DDIM sampler from this post is a ten-line change: replace the sampling loop with the algorithm above, keep $\boldsymbol{\epsilon}_\theta$ untouched, and set $S = 50$, $\eta = 0$. Same network, twenty times faster.


Written by Suchinthaka Wanninayaka

AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.
