Diffusion Models from First Principles

How adding noise teaches machines to create. From random static to stunning images.

Contents

  1. Introduction
  2. The Core Idea — Destruction as Creation
  3. The Forward Process
  4. A Beautiful Shortcut
  5. The Reverse Process
  6. What Does the Network Learn?
  7. The Training Loop
  8. Sampling — Birth of an Image
  9. The Score Function Perspective
  10. Variance Schedules & Noise Levels
  11. Speeding Things Up
  12. From Pixels to Latent Space
  13. Further Resources

Every image you've seen from Midjourney, DALL-E, or Stable Diffusion was born from pure noise. Literal static — the kind you'd see on an old untuned TV. A neural network looked at that static and, step by painstaking step, sculpted it into a photograph, a painting, a dream.

The standard internet explanation of how this works is "diffusion models." And that's not wrong, but it's also not very useful. Simply knowing the name of something is very different from understanding it.

So what does constitute understanding? My answer: having a model that allows you to make predictions. If you can reliably predict how and why each step of the process works, then you probably understand it.

In this article, we'll build up diffusion models from scratch — starting from pure intuition, adding math only when it earns its keep, and building interactive demos along the way so you can see and feel every concept. By the time you're done, you won't just know what diffusion models are. You'll be able to derive them on a napkin.

Here's the core insight, in one sentence: if you learn to reverse each tiny step of a destruction process, you can create from scratch.

Let's unpack that.

II

The Core Idea — Destruction as Creation

Imagine you film yourself scrambling an egg. You crack it into a pan, poke the yolk, and stir until it's a uniform yellow mush. Easy. Anyone can do it. The process is irreversible — you can never un-scramble the egg.

But what if you had the film? What if you could study that film, frame by frame, and learn exactly what changed between each pair of consecutive frames?

Each individual change is tiny — a few molecules shifting here, a bit of yolk mixing there. And tiny changes are learnable. If you could train a model to predict "given frame 57, what did frame 56 look like?", and you could do that for every pair of frames... you could run the film backward. You could un-scramble the egg.

That's diffusion models in a nutshell.

Replace "egg" with "image" and "scrambling" with "adding Gaussian noise," and you have the entire framework:

Forward process (the scrambling): Start with a clean image. Gradually add random noise, step by step, until it's pure static. This is trivial — no learning required.

Reverse process (the un-scrambling): Train a neural network to undo each tiny noise step. Then, starting from pure static, apply the learned reverse steps one by one to conjure an image from nothing.

The trick is that each step only removes a tiny bit of noise. The network doesn't need to imagine an entire image in one shot — it just needs to make a small, local improvement. That's a much easier problem.

Let's see this in action. Below is a simple 8×8 pixel image. Drag the slider to add noise, step by step, and watch it dissolve into static.

The Destruction Film

Drag the slider to watch a smiley face dissolve into noise. Each step adds a tiny bit of Gaussian noise. After enough steps, the original image is completely unrecoverable.

Notice something important: at the beginning, you can clearly see the smiley face even with some noise. In the middle, you can kind of tell something is there. By the end, it's indistinguishable from pure random static. The information has been destroyed.

But here's the key: between any two adjacent timesteps, the change is small. And small changes are predictable. That's the opening we need.

III

The Forward Process

Let's get precise about what "adding noise step by step" means.

We start with a clean image x0. At each timestep t, we add a small amount of Gaussian noise to get a slightly noisier version xt. The amount of noise at each step is controlled by a parameter βt (beta), called the noise schedule.

The math for a single step is:

q(xt | xt-1) = N(xt; √(1 - βt) · xt-1, βt · I)

In plain English: to get xt from xt-1, you slightly shrink the image (multiply by √(1 - βt), which is just under 1) and add a little Gaussian noise (with variance βt).

Why shrink the image? If we just added noise without shrinking, the overall magnitude of the image would grow without bound. The shrinking factor ensures the variance stays controlled. Together, the shrink + noise ensure that after enough steps, the result is a standard Gaussian: N(0, I). This is important — it means the endpoint of the forward process is always the same, regardless of what image you started with.

βt is typically small — something like 0.0001 to 0.02 — and increases gradually over the course of the process. Early steps barely touch the image. Later steps add more noise. A typical diffusion model uses T = 1000 total steps.
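
To make this concrete, here is one forward step as a minimal NumPy sketch. The 8×8 array stands in for an image, and the linear β range from 10⁻⁴ to 0.02 is just the example quoted above:

  import numpy as np

  def forward_step(x_prev, beta_t, rng=np.random.default_rng()):
      # One step of q(x_t | x_{t-1}): shrink the signal, add a little Gaussian noise
      noise = rng.standard_normal(x_prev.shape)
      return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

  x = np.ones((8, 8))                            # stand-in for a clean image
  for beta_t in np.linspace(1e-4, 0.02, 1000):
      x = forward_step(x, beta_t)
  # After 1000 steps, x is statistically indistinguishable from N(0, I)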

Here's a more detailed view. Pick a shape and watch how the pixel distribution changes as noise is added:

Noise Kitchen

Pick a shape, then drag the slider. The histogram on the right shows how the pixel values spread out toward a Gaussian bell curve as noise increases.

The histogram is the real story here. At t = 0, the pixel values cluster around a few specific values (the colors of the shape). As t increases, they spread out. By t = 1000, they form a near-perfect bell curve — a Gaussian distribution centered at zero. The original image has been completely forgotten.

This convergence to a Gaussian is not a coincidence. The noise added at every step is Gaussian, so the accumulated noise is itself exactly Gaussian, and the repeated shrinking drives the signal term toward zero while the total noise variance approaches one. It doesn't matter what image you started with — the end state is always the same.

IV

A Beautiful Shortcut

Running 1000 noise steps one by one would be painfully slow during training. Fortunately, there's a beautiful mathematical shortcut.

Define αt = 1 - βt and ᾱt = α1 · α2 · ... · αt (the cumulative product). Then you can jump directly from the clean image x0 to any timestep t in a single shot:

xt = √ᾱt · x0 + √(1 - ᾱt) · ε     where ε ~ N(0, I)

This closed-form jump is possible thanks to the reparameterization trick, and it's crucial. Instead of noising step by step, you can sample one noise vector ε and mix it with the clean image. The mixing ratio is controlled by ᾱt:

Signal-to-Noise Ratio (SNR): The quantity ᾱt / (1 - ᾱt) is the signal-to-noise ratio at timestep t. It starts high (clean image) and decays to near-zero (pure noise). A well-designed noise schedule ensures a smooth decay of SNR across timesteps — not too fast, not too slow. This is why the cosine schedule often works better than a linear one: it provides a smoother SNR curve.

Why does this matter? Because during training, we need to generate noisy versions of images at random timesteps, thousands of times. The shortcut lets us do this in one operation instead of hundreds.
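
Here's the shortcut as a NumPy sketch, using the same linear schedule as before (the 8×8 array and T = 1000 are placeholders):

  import numpy as np

  T = 1000
  betas = np.linspace(1e-4, 0.02, T)
  alpha_bar = np.cumprod(1.0 - betas)            # ᾱ_t for t = 1..T

  def q_sample(x0, t, rng=np.random.default_rng()):
      # Jump straight from the clean image x0 to timestep t (1-indexed)
      eps = rng.standard_normal(x0.shape)
      xt = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
      return xt, eps                             # keep eps: it becomes the training target

  x500, eps = q_sample(np.ones((8, 8)), 500)     # one operation instead of 500 steps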

The Shortcut — Step-by-Step vs Direct Jump

Both paths produce the same result. The shortcut formula lets us skip all intermediate steps during training.

Try it above: both the step-by-step path and the direct jump arrive at the same noisy image. This is not an approximation — the shortcut samples from exactly the same distribution as the step-by-step process, and with the same underlying noise the results are identical.

V

The Reverse Process

Now for the million-dollar question: can we run the film backward?

Mathematically, we want p(xt-1 | xt) — given a noisy image at step t, what did it look like one step earlier? If we had this, we could start from pure noise xT and walk backward to a clean image x0.

The catch: computing p(xt-1 | xt) exactly requires knowing p(xt) — the probability distribution over all possible images at noise level t. That's intractable. It would mean knowing every possible image that could exist.

The solution: approximate it with a neural network.

We train a network pθ(xt-1 | xt) that takes in a noisy image and predicts what a slightly less noisy version looks like. The key insight is that this reverse step is also Gaussian (when the forward steps are small enough), so the network only needs to predict two things: a mean and a variance.

Why is the reverse also Gaussian? When βt is small, each forward step makes a tiny perturbation. For small perturbations, the reverse of a Gaussian process is also approximately Gaussian. This is a result from stochastic differential equations. In practice, the variance is usually fixed rather than learned, either to βt or to β̃t = ((1 - ᾱt-1) / (1 - ᾱt)) · βt, so the network only needs to predict the mean.

And here's where it gets elegant. Remember the forward shortcut formula? We can rewrite the reverse mean in terms of a noise prediction. Instead of asking the network "what did xt-1 look like?", we ask: "what noise ε was added to create xt?"

If the network can predict the noise εθ(xt, t), we can compute the reverse step as:

xt-1 = (1/√αt)(xt - (βt / √(1 - ᾱt)) · εθ(xt, t)) + σt · z

where αt = 1 - βt, and z ~ N(0, I) is fresh random noise (except at the final step). The σt term controls the stochasticity of sampling — more on that later.
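
As a sketch, a single reverse step looks like this, with predict_noise standing in for the trained network εθ and σt set to √βt (one common choice):

  import numpy as np

  def reverse_step(xt, t, predict_noise, betas, alpha_bar, rng=np.random.default_rng()):
      # One learned reverse step x_t -> x_{t-1} (t is 1-indexed)
      beta_t = betas[t - 1]
      eps_hat = predict_noise(xt, t)                     # network's noise estimate
      mean = (xt - beta_t / np.sqrt(1.0 - alpha_bar[t - 1]) * eps_hat) / np.sqrt(1.0 - beta_t)
      if t == 1:
          return mean                                    # final step: no fresh noise
      return mean + np.sqrt(beta_t) * rng.standard_normal(xt.shape)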

Let's see this in action. Below, step backward from noise and watch an image emerge:

Reverse the Film

Drag the slider from right to left to step backward from noise to image. Or click "Auto-reverse" to watch it happen automatically. This simulates what a trained diffusion model does at generation time.

What you're seeing is a simulation of the reverse process. In a real diffusion model, a neural network would be computing each denoising step. Here, we're using the known original image to compute the "ideal" reverse — but the principle is the same.

VI

What Does the Network Learn?

We said the network predicts the "noise." But there are actually three equivalent ways to frame what the network learns, and they're all mathematically interchangeable:

  • Noise prediction (ε): the network outputs the noise that was added. Intuition: "What's the garbage? I'll subtract it."
  • Data prediction (x0): the network outputs the original clean image. Intuition: "What's hiding under all that noise?"
  • Score prediction (∇ log p(xt)): the network outputs the direction toward higher probability. Intuition: "Which way should I nudge to improve?"

These are equivalent because of a simple relationship. Given xt = √ᾱt · x0 + √(1 - ᾱt) · ε, knowing any one target determines the other two: the clean image is x0 = (xt - √(1 - ᾱt) · ε) / √ᾱt, and the score is just the rescaled, negated noise, -ε / √(1 - ᾱt).
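
In code, the conversions are one-liners. A small NumPy sketch, where alpha_bar_t is the ᾱt value for the timestep in question:

  import numpy as np

  def eps_to_x0(xt, eps_hat, alpha_bar_t):
      # The clean image implied by a noise prediction
      return (xt - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

  def eps_to_score(eps_hat, alpha_bar_t):
      # The score (∇ log p) implied by a noise prediction
      return -eps_hat / np.sqrt(1.0 - alpha_bar_t)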

In practice, noise prediction (ε-prediction) works best and is the most common choice. The network sees a noisy image, and its job is simply: "tell me what the noise looks like, and I'll subtract it."

The architecture of choice is a U-Net — a convolutional neural network with skip connections that preserves spatial detail. The key modification for diffusion models: the timestep t is also fed as input (typically via sinusoidal embeddings), so the network knows how noisy the image is.

Simplified U-Net architecture for diffusion models. The noisy image and timestep go in; predicted noise comes out. Skip connections (dashed yellow) let fine details flow from encoder to decoder.

The U-Net architecture is brilliantly suited for this task. The encoder compresses the spatial information, the bottleneck captures global context, and the decoder reconstructs the output — with skip connections ensuring that fine spatial details aren't lost.

Why not just a plain CNN? A plain convolutional network would lose spatial information as it gets deeper. The skip connections in U-Net let the decoder directly access features from the encoder, preserving both high-level semantics and pixel-level detail. This is critical for denoising, where you need to reconstruct fine structures.
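
To make the shape of the architecture concrete, here is a toy PyTorch module with the same skeleton: one downsampling stage, a bottleneck, one upsampling stage, a skip connection, and the timestep fed in as an extra input. It is a deliberately miniature sketch (a learned timestep embedding instead of sinusoidal ones, single-channel images, no attention), not a real diffusion U-Net:

  import torch
  import torch.nn as nn

  class TinyUNet(nn.Module):
      # A toy U-Net-shaped denoiser: downsample, bottleneck, upsample, skip connection
      def __init__(self, ch=32):
          super().__init__()
          self.t_embed = nn.Sequential(nn.Linear(1, ch), nn.SiLU(), nn.Linear(ch, ch))
          self.down = nn.Conv2d(1, ch, 3, stride=2, padding=1)
          self.mid = nn.Conv2d(ch, ch, 3, padding=1)
          self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
          self.out = nn.Conv2d(ch + 1, 1, 3, padding=1)   # +1 channel from the skip connection

      def forward(self, x, t):
          temb = self.t_embed(t.float().view(-1, 1))       # crude learned timestep embedding
          h = torch.relu(self.down(x))
          h = torch.relu(self.mid(h) + temb.view(-1, h.shape[1], 1, 1))
          h = torch.relu(self.up(h))
          return self.out(torch.cat([h, x], dim=1))        # skip: concatenate the input back in
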
VII

The Training Loop

Here is the entire training algorithm for a diffusion model. It's shockingly simple:

Algorithm: Training a Diffusion Model
  1. Pick a clean image x0 from your dataset
  2. Sample a random timestep t ~ Uniform(1, T)
  3. Sample random noise ε ~ N(0, I)
  4. Compute the noisy image: xt = √ᾱt · x0 + √(1 - ᾱt) · ε
  5. Feed xt and t to the network, get prediction εθ(xt, t)
  6. Compute loss: L = || ε - εθ(xt, t) ||2
  7. Backprop and update weights. Repeat.

That's it. The beauty is in the simplicity. There are no adversarial networks fighting each other (as in GANs), no complex reconstruction losses, no mode collapse. Just: add noise, predict it, minimize the difference.

Why MSE on noise? This simple loss is actually a weighted variational lower bound (ELBO) on the data log-likelihood. Ho et al. (2020) showed that this simplified loss, which drops certain weighting terms from the ELBO, empirically produces better samples. The mathematical justification runs deep — it's connected to variational inference, score matching, and denoising autoencoders — but the training procedure itself remains beautifully simple.
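
Translated into PyTorch, the algorithm above fits in about a dozen lines. This is a sketch rather than reference code: model is assumed to be any network that takes (xt, t) and returns a tensor of the same shape, and dataloader is assumed to yield batches of clean images of shape (B, C, H, W):

  import torch

  T = 1000
  betas = torch.linspace(1e-4, 0.02, T)
  alpha_bar = torch.cumprod(1.0 - betas, dim=0)

  def train(model, dataloader, optimizer, epochs=10):
      for _ in range(epochs):
          for x0 in dataloader:                                # clean images (B, C, H, W)
              t = torch.randint(1, T + 1, (x0.shape[0],))      # random timestep per image
              eps = torch.randn_like(x0)                       # the noise it must find
              ab = alpha_bar[t - 1].view(-1, 1, 1, 1)
              xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps      # the shortcut formula
              loss = ((eps - model(xt, t)) ** 2).mean()        # simple MSE on the noise
              optimizer.zero_grad()
              loss.backward()
              optimizer.step()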

Below, watch a visualization of one training iteration. The animation loops, showing how a network gradually improves its noise predictions:

Training Playground

Watch the network learn to predict noise. The loss chart shows MSE decreasing as training progresses. This is a simplified visualization — real training uses millions of images and hundreds of thousands of gradient steps.

A few practical notes about training:

Diffusion vs GANs: GANs are notoriously hard to train — the generator and discriminator can destabilize each other, leading to mode collapse (generating only a few types of images) or training divergence. Diffusion models train with a simple MSE loss on a single network. This makes them far more stable, though they're slower at generation time (many steps vs one forward pass for GANs).
VIII

Sampling — Birth of an Image

Once trained, generating an image is the reverse process in action:

Algorithm: Sampling (DDPM)
  1. Start with pure noise: xT ~ N(0, I)
  2. For t = T, T-1, ..., 1:
    • Predict the noise: ε̂ = εθ(xt, t)
    • Sample z ~ N(0, I) if t > 1, else z = 0
    • Compute: xt-1 = (1/√αt)(xt - (βt/√(1-ᾱt)) · ε̂) + σt · z
  3. Return x0

Each step takes the current noisy image, asks the network "what noise do you see?", subtracts most of it, and adds back a small amount of fresh noise (the σt · z term). That last bit of re-randomization is important — it's what makes the process stochastic and allows the model to generate diverse outputs.

The z term is why different random seeds give different images from the same model. And at the very last step (t = 1), we skip the noise addition to get a clean final result.
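
Here is the sampling algorithm as a PyTorch sketch, with model(xt, t) standing in for the trained noise predictor and σt again set to √βt:

  import torch

  @torch.no_grad()
  def sample(model, shape, betas, alpha_bar):
      T = len(betas)
      xt = torch.randn(shape)                                      # step 1: pure noise
      for t in range(T, 0, -1):                                    # t = T, T-1, ..., 1
          beta_t = betas[t - 1]
          eps_hat = model(xt, torch.full((shape[0],), t))          # predicted noise
          mean = (xt - beta_t / (1 - alpha_bar[t - 1]).sqrt() * eps_hat) / (1 - beta_t).sqrt()
          z = torch.randn(shape) if t > 1 else torch.zeros(shape)  # no fresh noise at the end
          xt = mean + beta_t.sqrt() * z                            # σ_t = √β_t
      return xt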

Try it below. Click "Generate" to watch noise gradually resolve into a recognizable shape:

The Generator

Click "Generate" to start from noise and walk backward to an image. The filmstrip shows snapshots along the way. Try "New Seed" for different results from the same model.

Notice how the image starts as pure static, then vague structure emerges, then finer details fill in. This coarse-to-fine progression is characteristic of diffusion models — global structure (is it a face? a landscape?) is determined in the early steps, while fine details (texture, edges) are refined in the later steps.

IX

The Score Function Perspective

There's an alternative — and beautiful — way to understand what diffusion models are doing. It comes from a field called score-based generative modeling, and it provides deep insight into why diffusion models work at all.

The score function of a distribution p(x) is the gradient of the log-probability:

s(x) = ∇x log p(x)

Think of it as an arrow at every point in space, pointing in the direction of "more probable data." In a region of high probability (near real images), the arrows are small. In low-probability regions (noise), the arrows point strongly toward the data.

If you had the score function everywhere, you could generate samples using Langevin dynamics: start at a random point and follow the score (plus a bit of noise) until you arrive at a high-probability region:

xi+1 = xi + (η/2) · ∇x log p(xi) + √η · zi

This is remarkably similar to the diffusion sampling process — and that's not a coincidence.
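
Here's Langevin dynamics on a toy example where the score is known exactly: a single 2D Gaussian blob, for which ∇x log p(x) = -(x - μ). This illustrates the update rule itself, not a diffusion model:

  import numpy as np

  mu = np.array([2.0, -1.0])                       # center of the "data" blob

  def score(x):
      # Score of N(mu, I): ∇_x log p(x) = -(x - mu)
      return -(x - mu)

  def langevin_sample(steps=500, eta=0.05, rng=np.random.default_rng(0)):
      x = rng.standard_normal(2) * 5.0             # start far from the data
      for _ in range(steps):
          x = x + 0.5 * eta * score(x) + np.sqrt(eta) * rng.standard_normal(2)
      return x                                     # ends up near mu, plus Langevin jitter

  print(langevin_sample())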

Score Vector Field

A 2D illustration. The purple blobs are "data" (high probability). The arrows show the score — they point toward data from everywhere. Click "Sample a particle" to watch a random point follow the score function into a data cluster.

The connection to noise prediction is direct. Recall that the score at noise level t is:

∇x log p(xt) = -ε / √(1 - ᾱt)

So a noise-predicting network is a score estimator (up to scaling). When the network predicts "the noise is pointing this way," it's equivalently saying "real data is that way." The entire diffusion sampling process is, in effect, noisy gradient ascent on the data log-likelihood, performed at progressively finer noise scales.

Why multiple noise scales? Estimating the score in low-density regions (far from any data) is hard — there aren't many training examples there. But at high noise levels, the data distribution is spread out, making the score well-defined everywhere. By starting at high noise (blurry score) and progressively reducing noise (sharper score), diffusion models solve the problem that bedeviled earlier score-based methods.
X

Variance Schedules & Noise Levels

The noise schedule β1, β2, ..., βT is one of the most important design choices in a diffusion model. It determines how quickly information is destroyed — and how smoothly the model can learn to reverse the process.

Linear schedule (original DDPM): βt increases linearly from β1 = 10⁻⁴ to βT = 0.02. Simple but flawed — it destroys information too quickly in the middle timesteps.

Cosine schedule (Improved DDPM): Defines ᾱt to follow a cosine curve, ensuring a smoother decay of signal-to-noise ratio. Information is preserved longer, giving the model more to work with at intermediate timesteps.
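
Both schedules in NumPy. The cosine schedule follows the Improved DDPM recipe: define ᾱt from a squared cosine with a small offset s, then recover per-step βt values from consecutive ratios (the 0.999 clipping is a common practical safeguard):

  import numpy as np

  def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
      betas = np.linspace(beta_start, beta_end, T)
      return np.cumprod(1.0 - betas)

  def cosine_alpha_bar(T=1000, s=0.008):
      t = np.arange(T + 1) / T
      f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
      return f[1:] / f[0]                                  # ᾱ_t for t = 1..T

  ab = cosine_alpha_bar()
  ab_prev = np.concatenate(([1.0], ab[:-1]))               # ᾱ_{t-1}, with ᾱ_0 = 1
  betas = np.clip(1.0 - ab / ab_prev, 0.0, 0.999)          # recover β_t if needed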

Schedule Comparison
Curves shown: ᾱt (signal), 1 - ᾱt (noise), and SNR (log scale), from t = 0 to t = T.

Compare how different schedules control the signal-to-noise ratio. The cosine schedule provides a smoother, more gradual transition than linear.

The cosine schedule has become the default for most modern diffusion models because it avoids the "information cliff" of the linear schedule — a region where the SNR drops precipitously, making learning harder.

XI

Speeding Things Up

DDPM's main weakness: sampling requires 1000 forward passes through the U-Net. That's slow. A single image might need 10–30 seconds on a GPU. Can we do better?

DDIM (Denoising Diffusion Implicit Models) was the first breakthrough. The key insight: make the reverse process deterministic by removing the random noise term σt · z. Without randomness, you can skip steps — jumping from t = 1000 to t = 950 directly, instead of stepping through every integer.

This reduces sampling to 20–50 steps with minimal quality loss. A side effect: sampling becomes deterministic, so the same seed always gives the same output (often a feature rather than a drawback).
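
A single deterministic DDIM update, sketched in NumPy. Here eps_hat is the network's noise prediction at the current timestep; because the update only references ᾱ values, the current and target timesteps don't have to be adjacent, which is what makes step-skipping possible:

  import numpy as np

  def ddim_step(xt, eps_hat, alpha_bar_t, alpha_bar_prev):
      # Deterministic (η = 0) DDIM update from timestep t to an earlier timestep
      x0_pred = (xt - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
      return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_hat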

Speed vs Quality

Fewer steps means faster generation but more artifacts. Real models produce stunning results at 20–50 steps.

The race to reduce steps has continued:

Today's best models can produce high-quality images in 1–4 steps. The gap between "slow but beautiful" and "fast but ugly" has almost completely closed.

XII

From Pixels to Latent Space

Everything we've discussed so far operates directly on pixel values. But a 512×512 color image has 786,432 dimensions. Running a U-Net on that is expensive.

The key innovation of Latent Diffusion (the architecture behind Stable Diffusion): don't do diffusion in pixel space. Instead:

  1. Compress the image to a much smaller latent representation using a pre-trained VAE (Variational Autoencoder)
  2. Do diffusion in this compact latent space
  3. Decode back to pixel space using the VAE decoder

A 512×512 image might compress to a 64×64×4 latent — that's a 48x reduction in dimensionality. The diffusion U-Net is dramatically cheaper to run.
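
The data flow, reduced to a shape-level Python sketch. Here vae, unet, text_encoder, and scheduler_step are placeholders for the pre-trained components and the sampler of your choice, not Stable Diffusion's actual API:

  import torch

  @torch.no_grad()
  def generate(prompt, vae, unet, text_encoder, scheduler_step, steps=50):
      cond = text_encoder(prompt)                  # prompt -> sequence of embeddings
      z = torch.randn(1, 4, 64, 64)                # start from noise in latent space
      for t in reversed(range(steps)):
          eps_hat = unet(z, t, cond)               # denoise in the 64×64×4 latent
          z = scheduler_step(z, eps_hat, t)        # one reverse step (e.g. DDIM)
      return vae.decode(z)                         # back to 512×512 pixels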

The Stable Diffusion pipeline. A text prompt is encoded by CLIP and injected via cross-attention. Diffusion happens in a compact latent space (64×64), then the VAE decoder reconstructs the full-resolution image.

Text Conditioning & Classifier-Free Guidance

How does a text prompt steer image generation? The text is encoded by a language model (CLIP), producing a sequence of embedding vectors. These embeddings are fed into the U-Net via cross-attention layers — at each spatial location, the network can "attend to" relevant parts of the prompt.

But conditioning alone isn't enough. Classifier-free guidance (CFG) is the trick that makes text-to-image models actually follow prompts:

ε̂guided = ε̂uncond + w · (ε̂cond - ε̂uncond)

During training, the text condition is randomly dropped (replaced with null) some percentage of the time. At inference, the model makes two predictions: one with the prompt, one without. The difference is amplified by a guidance scale w (typically 7–15). Higher w means stronger prompt adherence but less diversity.
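
In code, classifier-free guidance is just two calls to the same network and a weighted combination. A sketch, with unet, cond, and null_cond as placeholders:

  def guided_noise(unet, xt, t, cond, null_cond, w=7.5):
      # Amplify the direction the prompt pulls the prediction in
      eps_cond = unet(xt, t, cond)                 # prediction with the text prompt
      eps_uncond = unet(xt, t, null_cond)          # prediction with the prompt dropped
      return eps_uncond + w * (eps_cond - eps_uncond)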

The guidance scale tradeoff: At w = 1, you get the raw conditional model — diverse but often ignoring the prompt. At w = 15, images closely match the text but look more "generic." Most users settle around w = 7.5. This tradeoff between fidelity and diversity is a fundamental property of guided generation.
XIII

Further Resources

If you've made it this far, you understand diffusion models better than most people who use them every day. Here are some resources for going deeper:

This explainer covers the fundamental principles. Real production models add many engineering details — EMA, attention, adaptive normalization, progressive training — but the core ideas are exactly what you've just learned. If you can explain the forward process, the training loop, and the sampling algorithm from memory, you've got it.