Diffusion Models from First Principles

How adding noise teaches machines to create. From random static to stunning images.

Contents

  1. Introduction
  2. The Core Idea — Destruction as Creation
  3. The Forward Process
  4. A Beautiful Shortcut
  5. The Reverse Process
  6. What Does the Network Learn?
  7. The Training Loop
  8. Sampling — Birth of an Image
  9. The Score Function Perspective
  10. Variance Schedules & Noise Levels
  11. Speeding Things Up
  12. From Pixels to Latent Space
  13. Further Resources

Every image you've seen from Midjourney, DALL-E, or Stable Diffusion was born from pure noise. Literal static — the kind you'd see on an old untuned TV. A neural network looked at that static and, step by painstaking step, sculpted it into a photograph, a painting, a dream.

The standard internet explanation of how this works is "diffusion models." And that's not wrong, but it's also not very useful. Simply knowing the name of something is very different from understanding it.

So what does constitute understanding? My answer: having a model that allows you to make predictions. If you can reliably predict how and why each step of the process works, then you probably understand it.

In this article, we'll build up diffusion models from scratch — starting from pure intuition, adding math only when it earns its keep, and building interactive demos along the way so you can see and feel every concept. By the time you're done, you won't just know what diffusion models are. You'll be able to derive them on a napkin.

Here's the core insight, in one sentence: if you learn to reverse each tiny step of a destruction process, you can create from scratch.

Let's unpack that.

II

The Core Idea — Destruction as Creation

Imagine you film yourself scrambling an egg. You crack it into a pan, poke the yolk, and stir until it's a uniform yellow mush. Easy. Anyone can do it. The process is irreversible — you can never un-scramble the egg.

But what if you had the film? What if you could study that film, frame by frame, and learn exactly what changed between each pair of consecutive frames?

Each individual change is tiny — a few molecules shifting here, a bit of yolk mixing there. And tiny changes are learnable. If you could train a model to predict "given frame 57, what did frame 56 look like?", and you could do that for every pair of frames... you could run the film backward. You could un-scramble the egg.

That's diffusion models in a nutshell.

Replace "egg" with "image" and "scrambling" with "adding Gaussian noise," and you have the entire framework:

Forward process (the scrambling): Start with a clean image. Gradually add random noise, step by step, until it's pure static. This is trivial — no learning required.

Reverse process (the un-scrambling): Train a neural network to undo each tiny noise step. Then, starting from pure static, apply the learned reverse steps one by one to conjure an image from nothing.

The trick is that each step only removes a tiny bit of noise. The network doesn't need to imagine an entire image in one shot — it just needs to make a small, local improvement. That's a much easier problem.

Let's see this in action. Below is a simple 8×8 pixel image. Drag the slider to add noise, step by step, and watch it dissolve into static.

The Destruction Film

Drag the slider to watch a smiley face dissolve into noise. Each step adds a tiny bit of Gaussian noise. After enough steps, the original image is completely unrecoverable.

Notice something important: at the beginning, you can clearly see the smiley face even with some noise. In the middle, you can kind of tell something is there. By the end, it's indistinguishable from pure random static. The information has been destroyed.

But here's the key: between any two adjacent timesteps, the change is small. And small changes are predictable. That's the opening we need.

III

The Forward Process

Let's get precise about what "adding noise step by step" means.

We start with a clean image x0. At each timestep t, we add a small amount of Gaussian noise to get a slightly noisier version xt. The amount of noise at each step is controlled by a parameter βt (beta), called the noise schedule.

The math for a single step is:

q(xt | xt-1) = N(xt; √(1 - βt) · xt-1, βt · I)

In plain English: to get xt from xt-1, you slightly shrink the image (multiply by √(1 - βt), which is just under 1) and add a little Gaussian noise (with variance βt).

Why shrink the image? If we just added noise without shrinking, the overall magnitude of the image would grow without bound. The shrinking factor ensures the variance stays controlled. Together, the shrink + noise ensure that after enough steps, the result is a standard Gaussian: N(0, I). This is important — it means the endpoint of the forward process is always the same, regardless of what image you started with.

βt is typically small — something like 0.0001 to 0.02 — and increases gradually over the course of the process. Early steps barely touch the image. Later steps add more noise. A typical diffusion model uses T = 1000 total steps.
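
To make this concrete, here is one forward step as a minimal NumPy sketch. The 8×8 array stands in for an image, and the linear β range from 10⁻⁴ to 0.02 is just the example quoted above:

  import numpy as np

  def forward_step(x_prev, beta_t, rng=np.random.default_rng()):
      # One step of q(x_t | x_{t-1}): shrink the signal, add a little Gaussian noise
      noise = rng.standard_normal(x_prev.shape)
      return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

  x = np.ones((8, 8))                            # stand-in for a clean image
  for beta_t in np.linspace(1e-4, 0.02, 1000):
      x = forward_step(x, beta_t)
  # After 1000 steps, x is statistically indistinguishable from N(0, I)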

Here's a more detailed view. Pick a shape and watch how the pixel distribution changes as noise is added:

Noise Kitchen

Pick a shape, then drag the slider. The histogram on the right shows how the pixel values spread out toward a Gaussian bell curve as noise increases.

The histogram is the real story here. At t = 0, the pixel values cluster around a few specific values (the colors of the shape). As t increases, they spread out. By t = 1000, they form a near-perfect bell curve — a Gaussian distribution centered at zero. The original image has been completely forgotten.

This convergence to a Gaussian is not a coincidence. The noise added at every step is Gaussian, so the accumulated noise is itself exactly Gaussian, and the repeated shrinking drives the signal term toward zero while the total noise variance approaches one. It doesn't matter what image you started with — the end state is always the same.

IV

A Beautiful Shortcut

Running 1000 noise steps one by one would be painfully slow during training. Fortunately, there's a beautiful mathematical shortcut.

Define αt = 1 - βt and ᾱt = α1 · α2 · ... · αt (the cumulative product). Then you can jump directly from the clean image x0 to any timestep t in a single shot:

xt = √ᾱt · x0 + √(1 - ᾱt) · ε     where ε ~ N(0, I)

This closed-form jump is possible thanks to the reparameterization trick, and it's crucial. Instead of noising step by step, you can sample one noise vector ε and mix it with the clean image. The mixing ratio is controlled by ᾱt:

Signal-to-Noise Ratio (SNR): The quantity ᾱt / (1 - ᾱt) is the signal-to-noise ratio at timestep t. It starts high (clean image) and decays to near-zero (pure noise). A well-designed noise schedule ensures a smooth decay of SNR across timesteps — not too fast, not too slow. This is why the cosine schedule often works better than a linear one: it provides a smoother SNR curve.

Why does this matter? Because during training, we need to generate noisy versions of images at random timesteps, thousands of times. The shortcut lets us do this in one operation instead of hundreds.
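
Here's the shortcut as a NumPy sketch, using the same linear schedule as before (the 8×8 array and T = 1000 are placeholders):

  import numpy as np

  T = 1000
  betas = np.linspace(1e-4, 0.02, T)
  alpha_bar = np.cumprod(1.0 - betas)            # ᾱ_t for t = 1..T

  def q_sample(x0, t, rng=np.random.default_rng()):
      # Jump straight from the clean image x0 to timestep t (1-indexed)
      eps = rng.standard_normal(x0.shape)
      xt = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
      return xt, eps                             # keep eps: it becomes the training target

  x500, eps = q_sample(np.ones((8, 8)), 500)     # one operation instead of 500 steps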

The Shortcut — Step-by-Step vs Direct Jump

Both paths produce the same result. The shortcut formula lets us skip all intermediate steps during training.

Try it above: both the step-by-step path and the direct jump arrive at the same noisy image. This is not an approximation — the shortcut samples from exactly the same distribution as the step-by-step process, and with the same underlying noise the results are identical.

V

The Reverse Process

Now for the million-dollar question: can we run the film backward?

Mathematically, we want p(xt-1 | xt) — given a noisy image at step t, what did it look like one step earlier? If we had this, we could start from pure noise xT and walk backward to a clean image x0.

The catch: computing p(xt-1 | xt) exactly requires knowing p(xt) — the probability distribution over all possible images at noise level t. That's intractable. It would mean knowing every possible image that could exist.

The solution: approximate it with a neural network.

We train a network pθ(xt-1 | xt) that takes in a noisy image and predicts what a slightly less noisy version looks like. The key insight is that this reverse step is also Gaussian (when the forward steps are small enough), so the network only needs to predict two things: a mean and a variance.

Why is the reverse also Gaussian? When βt is small, each forward step makes a tiny perturbation. For small perturbations, the reverse of a Gaussian process is also approximately Gaussian. This is a result from stochastic differential equations. In practice, the variance is usually fixed rather than learned, either to βt or to β̃t = ((1 - ᾱt-1) / (1 - ᾱt)) · βt, so the network only needs to predict the mean.

And here's where it gets elegant. Remember the forward shortcut formula? We can rewrite the reverse mean in terms of a noise prediction. Instead of asking the network "what did xt-1 look like?", we ask: "what noise ε was added to create xt?"

If the network can predict the noise εθ(xt, t), we can compute the reverse step as:

xt-1 = (1/√αt)(xt - (βt / √(1 - ᾱt)) · εθ(xt, t)) + σt · z

where αt = 1 - βt, and z ~ N(0, I) is fresh random noise (except at the final step). The σt term controls the stochasticity of sampling — more on that later.
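
As a sketch, a single reverse step looks like this, with predict_noise standing in for the trained network εθ and σt set to √βt (one common choice):

  import numpy as np

  def reverse_step(xt, t, predict_noise, betas, alpha_bar, rng=np.random.default_rng()):
      # One learned reverse step x_t -> x_{t-1} (t is 1-indexed)
      beta_t = betas[t - 1]
      eps_hat = predict_noise(xt, t)                     # network's noise estimate
      mean = (xt - beta_t / np.sqrt(1.0 - alpha_bar[t - 1]) * eps_hat) / np.sqrt(1.0 - beta_t)
      if t == 1:
          return mean                                    # final step: no fresh noise
      return mean + np.sqrt(beta_t) * rng.standard_normal(xt.shape)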

Let's see this in action. Below, step backward from noise and watch an image emerge:

Reverse the Film

Drag the slider from right to left to step backward from noise to image. Or click "Auto-reverse" to watch it happen automatically. This simulates what a trained diffusion model does at generation time.

What you're seeing is a simulation of the reverse process. In a real diffusion model, a neural network would be computing each denoising step. Here, we're using the known original image to compute the "ideal" reverse — but the principle is the same.

VI

What Does the Network Learn?

We said the network predicts the "noise." But there are actually three equivalent ways to frame what the network learns, and they're all mathematically interchangeable:

  • Noise prediction (ε): the network outputs the noise that was added. Intuition: "What's the garbage? I'll subtract it."
  • Data prediction (x0): the network outputs the original clean image. Intuition: "What's hiding under all that noise?"
  • Score prediction (∇ log p(xt)): the network outputs the direction toward higher probability. Intuition: "Which way should I nudge to improve?"

These are equivalent because of a simple relationship. Given xt = √ᾱt · x0 + √(1 - ᾱt) · ε, knowing any one target determines the other two: the clean image is x0 = (xt - √(1 - ᾱt) · ε) / √ᾱt, and the score is just the rescaled, negated noise, -ε / √(1 - ᾱt).
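
In code, the conversions are one-liners. A small NumPy sketch, where alpha_bar_t is the ᾱt value for the timestep in question:

  import numpy as np

  def eps_to_x0(xt, eps_hat, alpha_bar_t):
      # The clean image implied by a noise prediction
      return (xt - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

  def eps_to_score(eps_hat, alpha_bar_t):
      # The score (∇ log p) implied by a noise prediction
      return -eps_hat / np.sqrt(1.0 - alpha_bar_t)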

In practice, noise prediction (ε-prediction) works best and is the most common choice. The network sees a noisy image, and its job is simply: "tell me what the noise looks like, and I'll subtract it."

The architecture of choice is a U-Net — a convolutional neural network with skip connections that preserves spatial detail. The key modification for diffusion models: the timestep t is also fed as input (typically via sinusoidal embeddings), so the network knows how noisy the image is.

Simplified U-Net architecture for diffusion models. The noisy image and timestep go in; predicted noise comes out. Skip connections (dashed yellow) let fine details flow from encoder to decoder.

The U-Net architecture is brilliantly suited for this task. The encoder compresses the spatial information, the bottleneck captures global context, and the decoder reconstructs the output — with skip connections ensuring that fine spatial details aren't lost.

Why not just a plain CNN? A plain convolutional network would lose spatial information as it gets deeper. The skip connections in U-Net let the decoder directly access features from the encoder, preserving both high-level semantics and pixel-level detail. This is critical for denoising, where you need to reconstruct fine structures.
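
To make the shape of the architecture concrete, here is a toy PyTorch module with the same skeleton: one downsampling stage, a bottleneck, one upsampling stage, a skip connection, and the timestep fed in as an extra input. It is a deliberately miniature sketch (a learned timestep embedding instead of sinusoidal ones, single-channel images, no attention), not a real diffusion U-Net:

  import torch
  import torch.nn as nn

  class TinyUNet(nn.Module):
      # A toy U-Net-shaped denoiser: downsample, bottleneck, upsample, skip connection
      def __init__(self, ch=32):
          super().__init__()
          self.t_embed = nn.Sequential(nn.Linear(1, ch), nn.SiLU(), nn.Linear(ch, ch))
          self.down = nn.Conv2d(1, ch, 3, stride=2, padding=1)
          self.mid = nn.Conv2d(ch, ch, 3, padding=1)
          self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
          self.out = nn.Conv2d(ch + 1, 1, 3, padding=1)   # +1 channel from the skip connection

      def forward(self, x, t):
          temb = self.t_embed(t.float().view(-1, 1))       # crude learned timestep embedding
          h = torch.relu(self.down(x))
          h = torch.relu(self.mid(h) + temb.view(-1, h.shape[1], 1, 1))
          h = torch.relu(self.up(h))
          return self.out(torch.cat([h, x], dim=1))        # skip: concatenate the input back in
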
VII

The Training Loop

Here is the entire training algorithm for a diffusion model. It's shockingly simple:

Algorithm: Training a Diffusion Model
  1. Pick a clean image x0 from your dataset
  2. Sample a random timestep t ~ Uniform(1, T)
  3. Sample random noise ε ~ N(0, I)
  4. Compute the noisy image: xt = √ᾱt · x0 + √(1 - ᾱt) · ε
  5. Feed xt and t to the network, get prediction εθ(xt, t)
  6. Compute loss: L = || ε - εθ(xt, t) ||2
  7. Backprop and update weights. Repeat.

That's it. The beauty is in the simplicity. There are no adversarial networks fighting each other (as in GANs), no complex reconstruction losses, no mode collapse. Just: add noise, predict it, minimize the difference.

Why MSE on noise? This simple loss is actually a weighted variational lower bound (ELBO) on the data log-likelihood. Ho et al. (2020) showed that this simplified loss, which drops certain weighting terms from the ELBO, empirically produces better samples. The mathematical justification runs deep — it's connected to variational inference, score matching, and denoising autoencoders — but the training procedure itself remains beautifully simple.
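
Translated into PyTorch, the algorithm above fits in about a dozen lines. This is a sketch rather than reference code: model is assumed to be any network that takes (xt, t) and returns a tensor of the same shape, and dataloader is assumed to yield batches of clean images of shape (B, C, H, W):

  import torch

  T = 1000
  betas = torch.linspace(1e-4, 0.02, T)
  alpha_bar = torch.cumprod(1.0 - betas, dim=0)

  def train(model, dataloader, optimizer, epochs=10):
      for _ in range(epochs):
          for x0 in dataloader:                                # clean images (B, C, H, W)
              t = torch.randint(1, T + 1, (x0.shape[0],))      # random timestep per image
              eps = torch.randn_like(x0)                       # the noise it must find
              ab = alpha_bar[t - 1].view(-1, 1, 1, 1)
              xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps      # the shortcut formula
              loss = ((eps - model(xt, t)) ** 2).mean()        # simple MSE on the noise
              optimizer.zero_grad()
              loss.backward()
              optimizer.step()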

Below, watch a visualization of one training iteration. The animation loops, showing how a network gradually improves its noise predictions:

Training Playground

Watch the network learn to predict noise. The loss chart shows MSE decreasing as training progresses. This is a simplified visualization — real training uses millions of images and hundreds of thousands of gradient steps.

A few practical notes about training:

Diffusion vs GANs: GANs are notoriously hard to train — the generator and discriminator can destabilize each other, leading to mode collapse (generating only a few types of images) or training divergence. Diffusion models train with a simple MSE loss on a single network. This makes them far more stable, though they're slower at generation time (many steps vs one forward pass for GANs).
VIII

Sampling — Birth of an Image

Once trained, generating an image is the reverse process in action:

Algorithm: Sampling (DDPM)
  1. Start with pure noise: xT ~ N(0, I)
  2. For t = T, T-1, ..., 1:
    • Predict the noise: ε̂ = εθ(xt, t)
    • Sample z ~ N(0, I) if t > 1, else z = 0
    • Compute: xt-1 = (1/√αt)(xt - (βt/√(1-ᾱt)) · ε̂) + σt · z
  3. Return x0

Each step takes the current noisy image, asks the network "what noise do you see?", subtracts most of it, and adds back a small amount of fresh noise (the σt · z term). That last bit of re-randomization is important — it's what makes the process stochastic and allows the model to generate diverse outputs.

The z term is why different random seeds give different images from the same model. And at the very last step (t = 1), we skip the noise addition to get a clean final result.
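
Here is the sampling algorithm as a PyTorch sketch, with model(xt, t) standing in for the trained noise predictor and σt again set to √βt:

  import torch

  @torch.no_grad()
  def sample(model, shape, betas, alpha_bar):
      T = len(betas)
      xt = torch.randn(shape)                                      # step 1: pure noise
      for t in range(T, 0, -1):                                    # t = T, T-1, ..., 1
          beta_t = betas[t - 1]
          eps_hat = model(xt, torch.full((shape[0],), t))          # predicted noise
          mean = (xt - beta_t / (1 - alpha_bar[t - 1]).sqrt() * eps_hat) / (1 - beta_t).sqrt()
          z = torch.randn(shape) if t > 1 else torch.zeros(shape)  # no fresh noise at the end
          xt = mean + beta_t.sqrt() * z                            # σ_t = √β_t
      return xt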

Try it below. Click "Generate" to watch noise gradually resolve into a recognizable shape:

The Generator

Click "Generate" to start from noise and walk backward to an image. The filmstrip shows snapshots along the way. Try "New Seed" for different results from the same model.

Notice how the image starts as pure static, then vague structure emerges, then finer details fill in. This coarse-to-fine progression is characteristic of diffusion models — global structure (is it a face? a landscape?) is determined in the early steps, while fine details (texture, edges) are refined in the later steps.

IX

The Score Function Perspective

There's an alternative — and beautiful — way to understand what diffusion models are doing. It comes from a field called score-based generative modeling, and it provides deep insight into why diffusion models work at all.

The score function of a distribution p(x) is the gradient of the log-probability:

s(x) = ∇x log p(x)

Think of it as an arrow at every point in space, pointing in the direction of "more probable data." In a region of high probability (near real images), the arrows are small. In low-probability regions (noise), the arrows point strongly toward the data.

If you had the score function everywhere, you could generate samples using Langevin dynamics: start at a random point and follow the score (plus a bit of noise) until you arrive at a high-probability region:

xi+1 = xi + (η/2) · ∇x log p(xi) + √η · zi

This is remarkably similar to the diffusion sampling process — and that's not a coincidence.
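
Here's Langevin dynamics on a toy example where the score is known exactly: a single 2D Gaussian blob, for which ∇x log p(x) = -(x - μ). This illustrates the update rule itself, not a diffusion model:

  import numpy as np

  mu = np.array([2.0, -1.0])                       # center of the "data" blob

  def score(x):
      # Score of N(mu, I): ∇_x log p(x) = -(x - mu)
      return -(x - mu)

  def langevin_sample(steps=500, eta=0.05, rng=np.random.default_rng(0)):
      x = rng.standard_normal(2) * 5.0             # start far from the data
      for _ in range(steps):
          x = x + 0.5 * eta * score(x) + np.sqrt(eta) * rng.standard_normal(2)
      return x                                     # ends up near mu, plus Langevin jitter

  print(langevin_sample())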

Score Vector Field

A 2D illustration. The purple blobs are "data" (high probability). The arrows show the score — they point toward data from everywhere. Click "Sample a particle" to watch a random point follow the score function into a data cluster.

The connection to noise prediction is direct. Recall that the score at noise level t is:

∇x log p(xt) = -ε / √(1 - ᾱt)

So a noise-predicting network is a score estimator (up to scaling). When the network predicts "the noise is pointing this way," it's equivalently saying "real data is that way." The entire diffusion sampling process is, in effect, noisy gradient ascent on the data log-likelihood, performed at progressively finer noise scales.

Why multiple noise scales? Estimating the score in low-density regions (far from any data) is hard — there aren't many training examples there. But at high noise levels, the data distribution is spread out, making the score well-defined everywhere. By starting at high noise (blurry score) and progressively reducing noise (sharper score), diffusion models solve the problem that bedeviled earlier score-based methods.
X

Variance Schedules & Noise Levels

The noise schedule β1, β2, ..., βT is one of the most important design choices in a diffusion model. It determines how quickly information is destroyed — and how smoothly the model can learn to reverse the process.

Linear schedule (original DDPM): βt increases linearly from β1 = 10⁻⁴ to βT = 0.02. Simple but flawed — it destroys information too quickly in the middle timesteps.

Cosine schedule (Improved DDPM): Defines ᾱt to follow a cosine curve, ensuring a smoother decay of signal-to-noise ratio. Information is preserved longer, giving the model more to work with at intermediate timesteps.
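
Both schedules in NumPy. The cosine schedule follows the Improved DDPM recipe: define ᾱt from a squared cosine with a small offset s, then recover per-step βt values from consecutive ratios (the 0.999 clipping is a common practical safeguard):

  import numpy as np

  def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
      betas = np.linspace(beta_start, beta_end, T)
      return np.cumprod(1.0 - betas)

  def cosine_alpha_bar(T=1000, s=0.008):
      t = np.arange(T + 1) / T
      f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
      return f[1:] / f[0]                                  # ᾱ_t for t = 1..T

  ab = cosine_alpha_bar()
  ab_prev = np.concatenate(([1.0], ab[:-1]))               # ᾱ_{t-1}, with ᾱ_0 = 1
  betas = np.clip(1.0 - ab / ab_prev, 0.0, 0.999)          # recover β_t if needed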

Schedule Comparison
Curves shown: ᾱt (signal), 1 - ᾱt (noise), and SNR (log scale), from t = 0 to t = T.

Compare how different schedules control the signal-to-noise ratio. The cosine schedule provides a smoother, more gradual transition than linear.

The cosine schedule has become the default for most modern diffusion models because it avoids the "information cliff" of the linear schedule — a region where the SNR drops precipitously, making learning harder.

XI

Speeding Things Up

DDPM's main weakness: sampling requires 1000 forward passes through the U-Net. That's slow. A single image might need 10–30 seconds on a GPU. Can we do better?

DDIM (Denoising Diffusion Implicit Models) was the first breakthrough. The key insight: make the reverse process deterministic by removing the random noise term σt · z. Without randomness, you can skip steps — jumping from t = 1000 to t = 950 directly, instead of stepping through every integer.

This reduces sampling to 20–50 steps with minimal quality loss. A side effect: sampling becomes deterministic, so the same seed always gives the same output (often a feature rather than a drawback).
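
A single deterministic DDIM update, sketched in NumPy. Here eps_hat is the network's noise prediction at the current timestep; because the update only references ᾱ values, the current and target timesteps don't have to be adjacent, which is what makes step-skipping possible:

  import numpy as np

  def ddim_step(xt, eps_hat, alpha_bar_t, alpha_bar_prev):
      # Deterministic (η = 0) DDIM update from timestep t to an earlier timestep
      x0_pred = (xt - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
      return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_hat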

Speed vs Quality

Fewer steps means faster generation but more artifacts. Real models produce stunning results at 20–50 steps.

The race to reduce steps has continued:

Today's best models can produce high-quality images in 1–4 steps. The gap between "slow but beautiful" and "fast but ugly" has almost completely closed.

XII

From Pixels to Latent Space

Everything we've discussed so far operates directly on pixel values. But a 512×512 color image has 786,432 dimensions. Running a U-Net on that is expensive.

The key innovation of Latent Diffusion (the architecture behind Stable Diffusion): don't do diffusion in pixel space. Instead:

  1. Compress the image to a much smaller latent representation using a pre-trained VAE (Variational Autoencoder)
  2. Do diffusion in this compact latent space
  3. Decode back to pixel space using the VAE decoder

A 512×512 image might compress to a 64×64×4 latent — that's a 48x reduction in dimensionality. The diffusion U-Net is dramatically cheaper to run.
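
The data flow, reduced to a shape-level Python sketch. Here vae, unet, text_encoder, and scheduler_step are placeholders for the pre-trained components and the sampler of your choice, not Stable Diffusion's actual API:

  import torch

  @torch.no_grad()
  def generate(prompt, vae, unet, text_encoder, scheduler_step, steps=50):
      cond = text_encoder(prompt)                  # prompt -> sequence of embeddings
      z = torch.randn(1, 4, 64, 64)                # start from noise in latent space
      for t in reversed(range(steps)):
          eps_hat = unet(z, t, cond)               # denoise in the 64×64×4 latent
          z = scheduler_step(z, eps_hat, t)        # one reverse step (e.g. DDIM)
      return vae.decode(z)                         # back to 512×512 pixels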

The Stable Diffusion pipeline. A text prompt is encoded by CLIP and injected via cross-attention. Diffusion happens in a compact latent space (64×64), then the VAE decoder reconstructs the full-resolution image.

Text Conditioning & Classifier-Free Guidance

How does a text prompt steer image generation? The text is encoded by a language model (CLIP), producing a sequence of embedding vectors. These embeddings are fed into the U-Net via cross-attention layers — at each spatial location, the network can "attend to" relevant parts of the prompt.

But conditioning alone isn't enough. Classifier-free guidance (CFG) is the trick that makes text-to-image models actually follow prompts:

ε̂guided = ε̂uncond + w · (ε̂cond - ε̂uncond)

During training, the text condition is randomly dropped (replaced with null) some percentage of the time. At inference, the model makes two predictions: one with the prompt, one without. The difference is amplified by a guidance scale w (typically 7–15). Higher w means stronger prompt adherence but less diversity.
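
In code, classifier-free guidance is just two calls to the same network and a weighted combination. A sketch, with unet, cond, and null_cond as placeholders:

  def guided_noise(unet, xt, t, cond, null_cond, w=7.5):
      # Amplify the direction the prompt pulls the prediction in
      eps_cond = unet(xt, t, cond)                 # prediction with the text prompt
      eps_uncond = unet(xt, t, null_cond)          # prediction with the prompt dropped
      return eps_uncond + w * (eps_cond - eps_uncond)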

The guidance scale tradeoff: At w = 1, you get the raw conditional model — diverse but often ignoring the prompt. At w = 15, images closely match the text but look more "generic." Most users settle around w = 7.5. This tradeoff between fidelity and diversity is a fundamental property of guided generation.
XIII

Further Resources

If you've made it this far, you understand diffusion models better than most people who use them every day. Here are some resources for going deeper:

This explainer covers the fundamental principles. Real production models add many engineering details — EMA, attention, adaptive normalization, progressive training — but the core ideas are exactly what you've just learned. If you can explain the forward process, the training loop, and the sampling algorithm from memory, you've got it.