This is the first post in a short series on diffusion models. The goal is narrow but important: walk through every line of math between "maximize the log-likelihood of the data" and the two-line PyTorch loss that actually trains a diffusion model. The second post will take this objective and turn it into runnable code.
Diffusion models now dominate image, video, audio, and molecular generation. Every one of them, from DDPM to Stable Diffusion to the latest video models, is trained with essentially the same objective: sample a clean datapoint, add a known amount of Gaussian noise, and ask a neural network to predict the noise back. A simple MSE loss. The reason this simple loss works, and the reason it corresponds to maximum likelihood on a model of incredible expressive power, is the subject of this post.
| Symbol | Meaning |
| --- | --- |
| $x_1, \dots, x_T$ | Progressively noised latents ($T$ is typically 1000) |
| $q(x_t \mid x_{t-1})$ | Fixed forward (noising) kernel |
| $p_\theta(x_{t-1} \mid x_t)$ | Learned reverse (denoising) kernel |
| $\beta_t$ | Noise variance added at step $t$ |
| $\alpha_t = 1 - \beta_t$, $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ | Cumulative signal-retention factor |
| $\epsilon$ | Standard Gaussian noise $\sim \mathcal{N}(0, I)$ |
| $\epsilon_\theta(x_t, t)$ | The neural network (predicts the noise) |
Equations below are numbered in order of appearance. Any time you see (n) in prose, it refers back to the display equation tagged (n).
1. The Goal: Maximum Likelihood
A diffusion model is a pair of Markov chains. The forward process q takes a clean image x0 and gradually destroys it by adding Gaussian noise over T steps, until xT is pure noise. The reverse process pθ is a neural network that learns to undo this, starting from random noise and progressively denoising to produce a sample. Training means: given the fixed forward process, learn the reverse process to match it.
The forward chain (blue, top) is a fixed Gaussian corruption schedule. The reverse chain (red, bottom) is a neural network that tries to invert it step by step.
We have data x0∼q(x0) and a model pθ(x0), and we want to maximize the log-likelihood

$$\log p_\theta(x_0). \tag{1}$$
In a diffusion model, pθ(x0) is defined as the marginal of a joint distribution over a sequence of latent variables x1,…,xT:

$$p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}, \tag{2}$$

where the joint factorizes as a Markov chain run in reverse:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t). \tag{3}$$
The marginal in (2) requires integrating over all possible trajectories x1:T, which lives in an extremely high-dimensional space. There is no closed form, and Monte Carlo estimates of the integrand have astronomical variance. So we cannot directly optimize logpθ(x0).
Intuition. To compute pθ(x0) directly, we would need to sum the probabilities of every possible sequence of intermediate noisy images x1,…,xT that could have led to x0. For a 256×256 image, each xt lives in R196,608, and there are T=1000 such steps. The number of plausible trajectories is effectively infinite, and random sampling misses almost all the probability mass. This is why we need a smarter trick, the ELBO.
2. The ELBO Fix
We introduce the forward process q(x1:T∣x0), a fixed (non-learned) Gaussian Markov chain that progressively adds noise:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right). \tag{4}$$
Multiplying and dividing by q inside the integral and applying Jensen's inequality gives a tractable lower bound:

$$\log p_\theta(x_0) \;\ge\; \mathbb{E}_q\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] =: \mathcal{L}_{\text{ELBO}}. \tag{6}$$

So instead of maximizing the intractable log pθ(x0), we maximize its lower bound $\mathcal{L}_{\text{ELBO}}$.
Intuition. Why is a lower bound good enough? If we push the bound up, the true log-likelihood (which sits above it) gets pushed up too. It is like lifting a ceiling by pushing the floor: as long as the gap does not matter much, the floor is easier to work with. The "floor" LELBO turns out to be a clean sum of KL terms we can compute and optimize, while the "ceiling" logpθ(x0) would require the impossible trajectory integral. The gap between them is always non-negative, so maximizing the ELBO pushes up the true log-likelihood too.
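The floor-and-ceiling picture can be made concrete with a toy latent-variable model where both sides have closed forms. The following numpy sketch uses a model of my own choosing (z∼N(0,1), x∣z∼N(z,1), so the marginal is exactly N(0,2)) and a Gaussian q(z)=N(m,s²); the ELBO never exceeds the true log-likelihood and touches it only at the true posterior:

```python
import numpy as np

def log_px(x):
    # exact marginal: z ~ N(0,1), x|z ~ N(z,1)  =>  x ~ N(0,2)
    return -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0

def elbo(x, m, s2):
    # E_q[log p(x|z) + log p(z) - log q(z)] for q(z) = N(m, s2), in closed form
    e_log_lik   = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m)**2 + s2)
    e_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s2)
    entropy_q   =  0.5 * np.log(2 * np.pi * np.e * s2)
    return e_log_lik + e_log_prior + entropy_q

x = 1.3
# any q gives a floor below the ceiling log p(x)
for m, s2 in [(0.0, 1.0), (0.5, 0.2), (x / 2, 0.5)]:
    assert elbo(x, m, s2) <= log_px(x) + 1e-12

# the bound is tight exactly at the true posterior q(z) = N(x/2, 1/2)
assert abs(elbo(x, x / 2, 0.5) - log_px(x)) < 1e-12
```

In the diffusion setting the gap plays the same role, except q is the fixed forward chain rather than something we optimize.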
3. Rewriting the ELBO as a Sum of KL Terms
Step 1: Expand Using the Markov Factorizations
Plug the factorizations from (3) and (4) into the (negated) ELBO:

$$-\mathcal{L}_{\text{ELBO}} = \mathbb{E}_q\!\left[-\log p(x_T) + \sum_{t=1}^{T} \log \frac{q(x_t \mid x_{t-1})}{p_\theta(x_{t-1} \mid x_t)}\right]. \tag{7}$$
Step 2: Flip the Forward Transition with Bayes' Rule
We cannot directly compare q(xt∣xt−1) (a forward step) with pθ(xt−1∣xt) (a reverse step). They go in opposite directions. The trick of Ho et al. is to flip the forward transition into a reverse one using Bayes' rule, conditioned on x0.
Intuition. We want the model (reverse) to match the forward process, but they describe things in opposite directions. Comparing them is like comparing "what is the probability of adding noise from step t−1 to t?" with "what is the probability of removing noise from step t to t−1?". Both reference the same pair of states, but the arrow of time is flipped. Bayes' rule is the tool for flipping conditional probabilities: given the forward q(xt∣xt−1), it gives us the "ground-truth reverse" q(xt−1∣xt,x0). Then we can directly compare our learned reverse pθ against that ground-truth reverse, apples to apples. The extra conditioning on x0 is what makes the posterior have a closed form; without it, the posterior would itself be intractable.
Why we can insert x0 into the conditioning. The defining property of a Markov chain is that the next state depends only on the current state, not on the full history. Formally, for any s<t−1:
$$q(x_t \mid x_{t-1}, x_{t-2}, \dots, x_s) = q(x_t \mid x_{t-1}). \tag{8}$$
Once you know xt−1, knowing any earlier state (including x0) gives no additional information about xt. The state xt−1 already "screens off" the past. Setting s=0 in (8) gives
$$q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1}). \tag{9}$$
The two forms are literally equal. Why write the longer form at all? Because Bayes' rule needs x0 to appear in the conditioning set on both sides. The Markov equality (9) lets us smuggle x0 in for free, which then unlocks the Bayes manipulation.
Applying Bayes' rule to the x0-conditioned transition gives the flipped form

$$q(x_t \mid x_{t-1}, x_0) = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}. \tag{10}$$

We only do this for t≥2; the t=1 term is handled separately. Substituting (10) into the sum:
$$\sum_{t=2}^{T} \log \frac{q(x_t \mid x_{t-1})}{p_\theta(x_{t-1} \mid x_t)} = \sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{p_\theta(x_{t-1} \mid x_t)\, q(x_{t-1} \mid x_0)} = \underbrace{\sum_{t=2}^{T} \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)}}_{\text{will become KL terms}} + \underbrace{\sum_{t=2}^{T} \log \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}}_{\text{telescopes}}. \tag{11}$$
Step 3: The Telescoping Sum
The second sum in (11) telescopes because each denominator cancels the previous numerator:

$$\sum_{t=2}^{T} \log \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} = \log \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)}.$$
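The telescoping cancellation is easy to sanity-check numerically. A small numpy check, with an arbitrary positive sequence standing in for the values q(xt∣x0):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.1, 1.0, size=21)        # a[1..20] stand in for q(x_t | x_0)

# sum of log-ratios of consecutive terms ...
lhs = sum(np.log(a[t] / a[t - 1]) for t in range(2, 21))
# ... collapses to the ratio of the endpoints
rhs = np.log(a[20] / a[1])
assert np.isclose(lhs, rhs)
```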
Bookkeeping 3: the logq(x1∣x0) terms cancel. The −logq(x1∣x0) from (15) cancels the +logq(x1∣x0) from (16). This cancellation is the whole point of introducing the x0-conditioned posterior: it makes the endpoint terms line up.
Bookkeeping 4: distribute Eq linearly. Using Eq[A+B+⋯]=Eq[A]+Eq[B]+⋯:
First term. No outer expectation is needed; xT is fully consumed inside the KL $D_{\mathrm{KL}}\!\left(q(x_T \mid x_0)\,\|\,p(x_T)\right) = L_T$. Strictly speaking, there is an implicit Eq(x0) outside since x0 is random too; we absorb it into the overall data-level average.
Middle term (one summand). The integrand depends on xt−1,xt,x0, so the trajectory expectation reduces to one over the joint q(xt−1,xt∣x0). Split it via the chain rule,
$$q(x_{t-1}, x_t \mid x_0) = q(x_t \mid x_0)\, q(x_{t-1} \mid x_t, x_0), \tag{21}$$
which lets us write the single expectation as two nested ones:

$$\mathbb{E}_{q(x_{t-1}, x_t \mid x_0)}\!\left[\log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)}\right] = \mathbb{E}_{q(x_t \mid x_0)}\!\left[D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right)\right].$$
The outer Eq remains because the KL value depends on the random xt: different xt samples give different KL values, so we must average over them.
Last term.−Eq[logpθ(x0∣x1)] is not a log-ratio, just the log of a single distribution. The KL identity does not apply. It stays as an expectation, a simple reconstruction likelihood, and is labelled L0.
What we gained. Nothing mathematical changed: we merely renamed each Eq[log(q/p)] as a KL (plus possibly an outer expectation over remaining random variables). The point is that KL divergences between Gaussians have closed-form formulas, whereas raw log-ratios inside expectations do not. This renaming is what makes the next section's computation possible.
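To see that the closed form really does replace the expectation, here is a numpy check that the one-dimensional Gaussian KL formula matches a brute-force evaluation of $\mathbb{E}_p[\log p - \log q]$ on a fine grid (the specific means and standard deviations are arbitrary):

```python
import numpy as np

def kl_closed(m1, s1, m2, s2):
    # closed-form KL( N(m1, s1^2) || N(m2, s2^2) )
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def kl_grid(m1, s1, m2, s2):
    # brute-force E_p[log p - log q] via Riemann sum on a fine grid
    x = np.linspace(m1 - 12 * s1, m1 + 12 * s1, 400_001)
    dx = x[1] - x[0]
    logp = -0.5 * np.log(2 * np.pi * s1**2) - (x - m1)**2 / (2 * s1**2)
    logq = -0.5 * np.log(2 * np.pi * s2**2) - (x - m2)**2 / (2 * s2**2)
    return np.sum(np.exp(logp) * (logp - logq)) * dx

assert abs(kl_closed(0.3, 1.2, -0.5, 0.8) - kl_grid(0.3, 1.2, -0.5, 0.8)) < 1e-6
```

In one dimension the grid works; in 196,608 dimensions only the closed form does, which is the whole point of the renaming.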
Why This Form Is Nice
LT has no learnable parameters (the forward process is fixed), so it is a constant.
L0 is a simple reconstruction term.
Each Lt−1 is a KL between two Gaussians, which has a closed form.
Intuition. The three types of terms correspond to three distinct jobs.
LT (end of chain): make sure the forward process actually ends in pure noise. With enough steps and a reasonable schedule, this is already ensured, so we can forget about it.
Lt−1 (denoising steps): the heart of the loss. At each timestep, the network must match the ideal Bayesian denoising distribution. Summed over all t, this is the signal that teaches the model to denoise.
L0 (final reconstruction): handles the last jump from x1 back to clean data x0, which needs discrete-pixel likelihood rather than Gaussian KL.
4. Closed-Form Posterior q(xt−1∣xt,x0)
A key property of the forward process is that xt given x0 is Gaussian in closed form. With $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$,

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1 - \bar\alpha_t)\, I\right), \qquad \text{equivalently} \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon.$$
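This closed form can be verified empirically: iterate the one-step forward kernel many times and compare the resulting mean and variance against the direct formula. A numpy sketch with an assumed linear βt schedule (the schedule values here are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.2, T)            # assumed schedule for the demo
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = 1.7                                     # a scalar "image"
n = 200_000                                  # number of independent chains

# iterate the forward kernel: x_t = sqrt(1-beta_t) x_{t-1} + sqrt(beta_t) eps
x = np.full(n, x0)
for t in range(T):
    x = np.sqrt(1 - betas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(n)

# closed form: q(x_T | x_0) = N( sqrt(alpha_bar_T) x0, 1 - alpha_bar_T )
assert abs(x.mean() - np.sqrt(alpha_bar[-1]) * x0) < 0.02
assert abs(x.var() - (1 - alpha_bar[-1])) < 0.02
```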
Step 2: Work in log-space, keep only xt−1 terms. Treat xt and x0 as fixed. The third density in (29) (the denominator of (28)) has no xt−1, so it contributes only a constant. Dropping constants:
Step 4: Match to Gaussian shape. A Gaussian $\mathcal{N}(x_{t-1}; \tilde\mu_t, \tilde\beta_t)$ has log-density $-\frac{1}{2}\!\left[\frac{1}{\tilde\beta_t} x_{t-1}^2 - \frac{2\tilde\mu_t}{\tilde\beta_t} x_{t-1}\right] + C$. Matching coefficients in (32):
$$\tilde\beta_t = \frac{1}{A}, \qquad \tilde\mu_t = \frac{B}{A} = B \cdot \tilde\beta_t. \tag{33}$$
Step 5: Simplify $\tilde\beta_t = 1/A$. Combine the fractions in $A$ over the common denominator $\beta_t(1 - \bar\alpha_{t-1})$:
$$A = \frac{\alpha_t(1 - \bar\alpha_{t-1}) + \beta_t}{\beta_t(1 - \bar\alpha_{t-1})}. \tag{34}$$
Simplify the numerator using $\alpha_t + \beta_t = 1$ and $\alpha_t \bar\alpha_{t-1} = \bar\alpha_t$: it collapses to $1 - \bar\alpha_t$, giving

$$\tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t, \qquad \tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\,x_t.$$
Intuition. μ̃t is a weighted average of x0 (the clean signal) and xt (the current noisy state). At small t (little noise), almost all the weight falls on x0: at t=1 the x0 coefficient is exactly 1. At large t (lots of noise), the weight shifts to xt: a single reverse step barely changes a heavily-noised state, so the posterior mean stays close to xt and the pull toward x0 enters only gradually, one step at a time. Geometrically, μ̃t sits on the line segment between xt and x0, sliding toward x0 as the chain approaches clean data. This is Bayesian denoising: "I know you are trying to get back to x0, I know you are currently at xt, so the best next step is this weighted mixture."
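The weight shift can be read straight off the posterior-mean coefficients. A numpy sketch using DDPM's standard linear schedule (β from 1e-4 to 0.02; the `weights` helper is mine):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)           # DDPM's linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def weights(t):
    # coefficients of x0 and x_t in the posterior mean, t is 1-indexed
    ab_prev = alpha_bar[t - 2] if t > 1 else 1.0
    w_x0 = np.sqrt(ab_prev) * betas[t - 1] / (1 - alpha_bar[t - 1])
    w_xt = np.sqrt(alphas[t - 1]) * (1 - ab_prev) / (1 - alpha_bar[t - 1])
    return w_x0, w_xt

# at t = 1 the posterior mean is exactly x0 ...
assert np.isclose(weights(1)[0], 1.0) and np.isclose(weights(1)[1], 0.0)
# ... while deep in the chain it hugs x_t
w_x0_late, w_xt_late = weights(1000)
assert w_x0_late < 0.01 and w_xt_late > 0.98
```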
5. The Simplification to a Noise-Prediction Loss
Step 1: KL Between Two Gaussians with the Same Covariance
We parameterize the reverse process as
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right), \tag{38}$$
with σt² fixed (commonly σt² = β̃t or βt). Both q(xt−1∣xt,x0) in (25) and pθ(xt−1∣xt) in (38) are Gaussians with the same (isotropic) covariance, so the KL has a clean closed form:

$$D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right) = \frac{1}{2\sigma_t^2}\,\big\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2 + C,$$

where $C$ is independent of θ (and zero when σt² = β̃t).
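For equal isotropic covariances the Gaussian KL really does collapse to a scaled squared distance between means. A Monte Carlo check in numpy (dimension, means, and variance are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
mu1 = rng.normal(size=d)
mu2 = mu1 + 0.1                               # second mean, same covariance
sigma2 = 0.3

# Monte Carlo estimate of KL = E_{x ~ N(mu1, sigma2 I)}[log p1(x) - log p2(x)]
x = mu1 + np.sqrt(sigma2) * rng.standard_normal((500_000, d))
log_ratio = ((x - mu2)**2 - (x - mu1)**2).sum(axis=1) / (2 * sigma2)
kl_mc = log_ratio.mean()

# the claimed closed form for equal isotropic covariances
kl_closed = np.sum((mu1 - mu2)**2) / (2 * sigma2)

assert abs(kl_mc - kl_closed) < 8e-3
```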
The two xt coefficients combine: βt+αt(1−αˉt−1)=1−αˉt (using αt=1−βt), so
$$\tilde\mu_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon\right). \tag{44}$$
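The ε-form (44) and the weighted-average form of μ̃t from Section 4 are the same quantity; a quick numpy consistency check (assumed linear schedule, arbitrary timestep and data):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

t = 500                                        # any 2 <= t <= T (1-indexed)
b, a, ab, ab_prev = betas[t-1], alphas[t-1], alpha_bar[t-1], alpha_bar[t-2]

x0 = rng.normal(size=16)
eps = rng.standard_normal(16)
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps  # reparameterized forward sample

# posterior mean as a weighted average of x0 and x_t
mu_weighted = (np.sqrt(ab_prev) * b / (1 - ab)) * x0 \
            + (np.sqrt(a) * (1 - ab_prev) / (1 - ab)) * xt

# posterior mean in the epsilon form (44)
mu_eps = (xt - b / np.sqrt(1 - ab) * eps) / np.sqrt(a)

assert np.allclose(mu_weighted, mu_eps)
```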
Step 4: Parameterize the Model by Predicting Noise
Expression (44) suggests mirroring the same form for μθ, but with a learned noise estimate ϵθ(xt,t) in place of the true ϵ:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right). \tag{45}$$
Intuition. Why train the network to predict the noise instead of the denoised image? Because of the reparameterization (41), knowing xt and ϵ is equivalent to knowing xt and x0. They carry the same information. So the network could equally well predict (a) the clean image x0 directly, (b) the noise ϵ that was added, or (c) the posterior mean μ~t itself. All three are mathematically equivalent. In practice, predicting ϵ tends to be easier for the network: the noise is unit-variance Gaussian (standardized), while x0 has arbitrary scale and structure. A network outputting values of roughly fixed scale is easier to train, and gradients behave better. This is a practical choice, not a mathematical necessity, but it is the choice that made DDPM work.
Now the difference μ̃t−μθ simplifies dramatically; subtracting (45) from (44), the xt terms cancel:

$$\tilde\mu_t - \mu_\theta = \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar\alpha_t}}\left(\epsilon_\theta(x_t, t) - \epsilon\right),$$

so each KL term is a weighted MSE between the true and predicted noise. Ho et al. found that dropping the per-timestep weights (setting them all to 1) trains better in practice, which leaves the simplified objective

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]. \tag{48}$$
That is the whole training loop. No loop over timesteps, no Monte Carlo over trajectories, no adversarial game. Just a supervised regression problem on (xt,t)↦ϵ.
Intuition. What started as "maximize the likelihood of a high-dimensional integral that can never be computed" has become "sample a random timestep, add some noise, and ask the network to guess the noise back". The miracle is that every step of this simplification is either exact, or a tiny empirically-justified relaxation. The final objective is basically a denoising autoencoder loss applied at random noise levels, simple enough that a few lines of PyTorch can implement it.
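The data flow of one training step can be sketched in a few lines. This is a numpy stand-in, not the post's PyTorch code: `eps_model` is a dummy zero-function in place of the real UNet εθ, and the schedule is the assumed linear one.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    # stand-in for the trained network eps_theta(x_t, t); a UNet in practice
    return np.zeros_like(xt)

def diffusion_loss(x0):
    # one step of L_simple: sample t, sample noise, regress on the noise
    t = rng.integers(1, T + 1, size=x0.shape[0])       # random timesteps
    eps = rng.standard_normal(x0.shape)                # target noise
    ab = alpha_bar[t - 1][:, None]
    xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps      # noised input (41)
    return np.mean((eps - eps_model(xt, t)) ** 2)      # MSE on the noise

x0 = rng.normal(size=(64, 32))                         # a toy batch
loss = diffusion_loss(x0)
assert loss > 0
```

With the zero stand-in the loss is just the mean squared norm of the noise (≈1); training drives εθ toward ε and the loss below that baseline.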
7. Inference: How We Actually Sample
Training gives us ϵθ. But we never directly sample from pθ(x0); the marginal is exactly the intractable integral (2) we started with. Instead, inference walks the reverse Markov chain one step at a time, which is exactly what chain (3) was built for.
Starting from pure noise xT∼N(0,I), we iterate for t = T, T−1, …, 1:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I). \tag{51}$$
The term $\frac{1}{\sqrt{\alpha_t}} x_t$ un-scales the current state. Recall the forward step shrinks x by $\sqrt{\alpha_t}$; we dilate it back.
The term $-\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)$ subtracts the estimated noise, scaled by how much noise is in xt relative to how much one step adds.
The term $\sigma_t z$ re-injects stochasticity, the same way the forward chain did. Without it, the chain would collapse onto a single deterministic trajectory.
Step 2: The Last Step Is Special
At t=1 we want to land on clean data x0, not draw from a Gaussian around it. DDPM handles this by setting z=0 at the final step:
$$x_0 = \frac{1}{\sqrt{\alpha_1}}\left(x_1 - \frac{\beta_1}{\sqrt{1 - \bar\alpha_1}}\,\epsilon_\theta(x_1, 1)\right). \tag{52}$$
Equivalently, we take the mean and drop the noise term. Any remaining stochasticity would just blur the output.
Step 3: Choosing σt
The reverse variance σt2 was not learned; we fixed it in (38). Two standard choices, both introduced by Ho et al. and both producing nearly identical sample quality:
$$\sigma_t^2 = \beta_t \qquad \text{or} \qquad \sigma_t^2 = \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t. \tag{53}$$
βt is the variance of the forward step. It matches the upper bound on the true posterior variance (which holds when x0∼N(0,I)).
β~t from (27) is the exact posterior variance when x0 is known. It's the lower bound on the optimal σt2.
The truth sits between them, but in practice nothing swings much. Improved-DDPM (Nichol and Dhariwal, 2021) learns σt as an interpolation between the two, which gives a small quality bump.
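The ordering of the two choices is easy to confirm numerically: with the standard linear schedule, β̃t always sits below βt, and the two nearly coincide late in the chain once $\bar\alpha_t$ is tiny (numpy sketch, schedule values assumed):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t, for t >= 2
beta_tilde = (1 - alpha_bar[:-1]) / (1 - alpha_bar[1:]) * betas[1:]

# the exact posterior variance never exceeds the forward-step variance ...
assert np.all(beta_tilde < betas[1:])
# ... and the gap vanishes late in the chain, where alpha_bar is tiny
assert abs(beta_tilde[-1] - betas[-1]) < 1e-4
```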
Intuition. Sampling is just the forward chain run backwards, but with a learned best-guess at each step. At every t, the network says "here is what I think the noise component of xt is", and we subtract a scaled version of that from xt to land at xt−1. Then we add fresh Gaussian noise (except on the last step) so that the trajectory stays random, matching the statistical variety of the forward chain we trained against.
The Full Sampling Algorithm
Putting the pieces together:
Algorithm 2: DDPM sampling (Ho et al., 2020)
It calls the network exactly T times (typically T=1000), which is why vanilla DDPM sampling is so slow.
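The loop structure can be sketched as follows. This is a numpy stand-in for illustration, not the post's implementation: `eps_model` is a dummy zero-function in place of the trained εθ, the schedule is the assumed linear one, and σt² = βt is used.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(xt, t):
    # stand-in for the trained network eps_theta(x_t, t)
    return np.zeros_like(xt)

def sample(shape):
    x = rng.standard_normal(shape)                       # x_T ~ N(0, I)
    for t in range(T, 0, -1):                            # t = T, ..., 1
        b, a, ab = betas[t-1], alphas[t-1], alpha_bar[t-1]
        mean = (x - b / np.sqrt(1 - ab) * eps_model(x, t)) / np.sqrt(a)
        z = rng.standard_normal(shape) if t > 1 else 0.0 # no noise at t = 1
        x = mean + np.sqrt(b) * z                        # sigma_t^2 = beta_t
    return x

x0 = sample((8, 4))
assert x0.shape == (8, 4) and np.all(np.isfinite(x0))
```

One network call per loop iteration, T iterations total: exactly the bottleneck discussed next.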
Why Sampling Is Slow, and How to Fix It
The chain is Markov and the transitions are entangled: xt−1 genuinely depends on xt, so there is no algebraic shortcut that collapses T steps into one, the way (41) collapses training.
Three families of work attack this bottleneck, all reusing the same trained ϵθ (the loss in (48) is already everything you need):
DDIM (Song, Meng, Ermon, 2021) derives a deterministic reverse process whose marginals at each t still match the forward chain. Samples can be produced in 25 to 50 steps instead of 1000 with only a small quality loss.
DPM-Solver (Lu et al., 2022) treats the reverse chain as a solver for a probability-flow ODE and uses higher-order numerical integration to reach good samples in ∼10 to 20 steps.
Classifier-free guidance (Ho and Salimans, 2021) does not reduce step count but scales ϵθ between conditional and unconditional predictions to sharpen samples. This is the trick that made text-to-image diffusion actually work.
All three sit on top of the ϵθ you trained with (48); none of them require re-training.
8. Summary of the Simplification Path
| Step | Form |
| --- | --- |
| Intractable (1), (2) | $\max_\theta \log p_\theta(x_0) = \log \int p_\theta(x_{0:T})\, dx_{1:T}$ |
| ↓ Jensen's inequality with $q(x_{1:T} \mid x_0)$ | |
| Lower bound (6) | $\mathcal{L}_{\text{ELBO}} = \mathbb{E}_q\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$ |
| ↓ Bayes' rule on $q$, telescoping | |
| Sum of KLs (18) | $L_T + \sum_{t \ge 2} L_{t-1} + L_0$, all Gaussian-Gaussian |
| ↓ reparameterize $x_t$ in terms of $\epsilon$ | |
| Weighted MSE (47) | $\sum_t w_t\, \mathbb{E}\!\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$ |
| ↓ drop weights $w_t \to 1$ | |
| Tractable objective (48) | $\mathcal{L}_{\text{simple}} = \mathbb{E}\!\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$ |
The key insight is that each transformation either preserves the bound exactly (the ELBO, Bayes flip, telescope, and KL rewrite) or trades a tighter bound for a cleaner objective that empirically trains better (the weight drop). The end result is an MSE loss between true noise and predicted noise, something we can compute from a single forward pass of a neural network.
What's Next
We now have the objective function and the sampler. Two things are missing:
The T≈1000 steps of the DDPM sampler are painfully slow. Can we reuse the same trained ϵθ with a smarter sampler and get samples in ∼25 steps instead?
We still need actual code: a schedule, a UNet, a training loop.
Part 2 answers the first question. It derives DDIM (Denoising Diffusion Implicit Models), which generalizes the reverse process into a family parameterized by η. Setting η=0 gives a deterministic sampler that produces quality equivalent to DDPM in 25 to 50 steps. No retraining: the loss (48) already contains everything DDIM needs.
Part 3 answers the second. We build DDPM end-to-end in PyTorch: noise schedule, small UNet with time embeddings, training loop from (48), and the iterative sampler (51). Every equation in this post maps to a few lines of code.
Written by Suchinthaka Wanninayaka
AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.