Every image you've seen from Midjourney, DALL-E, or Stable Diffusion was born from pure noise. Literal static — the kind you'd see on an old untuned TV. A neural network looked at that static and, step by painstaking step, sculpted it into a photograph, a painting, a dream.
The standard internet explanation of how this works is "diffusion models." And that's not wrong, but it's also not very useful. Simply knowing the name of something is very different from understanding it.
So what does count as understanding? My answer: having a model that lets you make predictions. If you can reliably predict how and why each step of the process works, then you probably understand it.
In this article, we'll build up diffusion models from scratch — starting from pure intuition, adding math only when it earns its keep, and building interactive demos along the way so you can see and feel every concept. By the time you're done, you won't just know what diffusion models are. You'll be able to derive them on a napkin.
Here's the core insight, in one sentence: if you learn to reverse each tiny step of a destruction process, you can create from scratch.
Let's unpack that.
The Core Idea — Destruction as Creation
Imagine you film yourself scrambling an egg. You crack it into a pan, poke the yolk, and stir until it's a uniform yellow mush. Easy. Anyone can do it. The process is irreversible — you can never un-scramble the egg.
But what if you had the film? What if you could study that film, frame by frame, and learn exactly what changed between each pair of consecutive frames?
Each individual change is tiny — a few molecules shifting here, a bit of yolk mixing there. And tiny changes are learnable. If you could train a model to predict "given frame 57, what did frame 56 look like?", and you could do that for every pair of frames... you could run the film backward. You could un-scramble the egg.
That's diffusion models in a nutshell.
Replace "egg" with "image" and "scrambling" with "adding Gaussian noise," and you have the entire framework:
Forward process (the scrambling): Start with a clean image. Gradually add random noise, step by step, until it's pure static. This is trivial — no learning required.
Reverse process (the un-scrambling): Train a neural network to undo each tiny noise step. Then, starting from pure static, apply the learned reverse steps one by one to conjure an image from nothing.
The trick is that each step only removes a tiny bit of noise. The network doesn't need to imagine an entire image in one shot — it just needs to make a small, local improvement. That's a much easier problem.
Let's see this in action. Below is a simple 8×8 pixel image. Drag the slider to add noise, step by step, and watch it dissolve into static.
Notice something important: at the beginning, you can clearly see the smiley face even with some noise. In the middle, you can kind of tell something is there. By the end, it's indistinguishable from pure random static. The information has been destroyed.
But here's the key: between any two adjacent timesteps, the change is small. And small changes are predictable. That's the opening we need.
The Forward Process
Let's get precise about what "adding noise step by step" means.
We start with a clean image x0. At each timestep t, we add a small amount of Gaussian noise to get a slightly noisier version xt. The amount of noise at each step is controlled by a parameter βt (beta), called the noise schedule.
The math for a single step is:
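xt = √(1 - βt) · xt-1 + √βt · ε,   where ε ~ N(0, I)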
In plain English: to get xt from xt-1, you slightly shrink the image (multiply by √(1 - βt), which is just under 1) and add a little Gaussian noise (with variance βt).
βt is typically small — something like 0.0001 to 0.02 — and increases gradually over the course of the process. Early steps barely touch the image. Later steps add more noise. A typical diffusion model uses T = 1000 total steps.
Here's a more detailed view. Pick a shape and watch how the pixel distribution changes as noise is added:
The histogram is the real story here. At t = 0, the pixel values cluster around a few specific values (the colors of the shape). As t increases, they spread out. By t = 1000, they form a near-perfect bell curve — a Gaussian distribution centered at zero. The original image has been completely forgotten.
This convergence to a Gaussian is not a coincidence. Each step shrinks what remains of the original image and adds fresh Gaussian noise, so after enough steps the image's contribution has been scaled away to almost nothing, while the accumulated noise (a sum of independent Gaussians) is itself exactly Gaussian. The Central Limit Theorem says much the same would happen even if the per-step perturbations weren't Gaussian. It doesn't matter what image you started with: the end state is always the same.
A Beautiful Shortcut
Running 1000 noise steps one by one would be painfully slow during training. Fortunately, there's a beautiful mathematical shortcut.
Define αt = 1 - βt and ᾱt = α1 · α2 · ... · αt (the cumulative product). Then you can jump directly from the clean image x0 to any timestep t in a single shot:
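xt = √ᾱt · x0 + √(1 - ᾱt) · ε,   where ε ~ N(0, I)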
This is the reparameterization trick, and it's crucial. Instead of noising step-by-step, you can sample one noise vector ε and mix it with the clean image. The mixing ratio is controlled by ᾱt:
- When t is small, ᾱt ≈ 1 — mostly signal, little noise
- When t is large, ᾱt ≈ 0 — mostly noise, little signal
Why does this matter? Because during training, we need to generate noisy versions of images at random timesteps, thousands of times. The shortcut lets us do this in one operation instead of hundreds.
Try it above: both the step-by-step path and the direct jump arrive at the same noisy image. This is not an approximation. The many small noises of the chain collapse into a single effective Gaussian, so the two routes produce exactly the same distribution of noisy images (and, given the same effective noise vector, exactly the same image).
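Here is what the shortcut looks like in code, as a minimal NumPy sketch (the schedule values and the toy 8×8 image are illustrative):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # the noise schedule β_1 ... β_T
alpha_bar = np.cumprod(1.0 - betas)     # ᾱ_t = product of (1 - β_i) up to t

def q_sample(x0, t, eps):
    """Jump from the clean image x0 straight to noise level t (1-indexed)."""
    return np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))                   # a toy 8×8 "image"
x500 = q_sample(x0, 500, rng.standard_normal(x0.shape))    # one operation, not 500 steps
```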
The Reverse Process
Now for the million-dollar question: can we run the film backward?
Mathematically, we want p(xt-1 | xt) — given a noisy image at step t, what did it look like one step earlier? If we had this, we could start from pure noise xT and walk backward to a clean image x0.
The catch: computing p(xt-1 | xt) exactly requires knowing p(xt) — the probability distribution over all possible images at noise level t. That's intractable. It would mean knowing every possible image that could exist.
The solution: approximate it with a neural network.
We train a network pθ(xt-1 | xt) that takes in a noisy image and predicts what a slightly less noisy version looks like. The key insight is that this reverse step is also Gaussian (when the forward steps are small enough), so the network only needs to predict two things: a mean and a variance.
And here's where it gets elegant. Remember the forward shortcut formula? We can rewrite the reverse mean in terms of a noise prediction. Instead of asking the network "what did xt-1 look like?", we ask: "what noise ε was added to create xt?"
If the network can predict the noise εθ(xt, t), we can compute the reverse step as:
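xt-1 = (1 / √αt) · (xt - (βt / √(1 - ᾱt)) · εθ(xt, t)) + σt · z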
where αt = 1 - βt, and z ~ N(0, I) is fresh random noise (except at the final step). The σt term controls the stochasticity of sampling — more on that later.
Let's see this in action. Below, step backward from noise and watch an image emerge:
What you're seeing is a simulation of the reverse process. In a real diffusion model, a neural network would be computing each denoising step. Here, we're using the known original image to compute the "ideal" reverse — but the principle is the same.
What Does the Network Learn?
We said the network predicts the "noise." But there are actually three equivalent ways to frame what the network learns, and they're all mathematically interchangeable:
| Prediction target | What the network outputs | Intuition |
|---|---|---|
| Noise prediction (ε) | The noise that was added | "What's the garbage? I'll subtract it." |
| Data prediction (x0) | The original clean image | "What's hiding under all that noise?" |
| Score prediction (∇ log p(xt)) | Direction toward higher probability | "Which way should I nudge to improve?" |
These are equivalent because of a simple relationship. Given xt = √ᾱt · x0 + √(1 - ᾱt) · ε:
- If you know ε, you can solve for x0: x0 = (xt - √(1 - ᾱt) · ε) / √ᾱt
- If you know x0, you can solve for ε
- The score is just: ∇x log p(xt) = -ε / √(1 - ᾱt)
In practice, noise prediction (ε-prediction) works best and is the most common choice. The network sees a noisy image, and its job is simply: "tell me what the noise looks like, and I'll subtract it."
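Each conversion is a one-liner. A small sketch, where `alpha_bar_t` stands for the ᾱt value at the timestep in question (it works the same on NumPy arrays or torch tensors):

```python
def x0_from_eps(x_t, eps, alpha_bar_t):
    """Clean-image estimate implied by a noise prediction."""
    return (x_t - (1 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5

def eps_from_x0(x_t, x0, alpha_bar_t):
    """Noise implied by a clean-image prediction."""
    return (x_t - alpha_bar_t ** 0.5 * x0) / (1 - alpha_bar_t) ** 0.5

def score_from_eps(eps, alpha_bar_t):
    """The score is the noise prediction, negated and rescaled."""
    return -eps / (1 - alpha_bar_t) ** 0.5
```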
The architecture of choice is a U-Net — a convolutional neural network with skip connections that preserves spatial detail. The key modification for diffusion models: the timestep t is also fed as input (typically via sinusoidal embeddings), so the network knows how noisy the image is.
The U-Net architecture is brilliantly suited for this task. The encoder compresses the spatial information, the bottleneck captures global context, and the decoder reconstructs the output — with skip connections ensuring that fine spatial details aren't lost.
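For concreteness, here is one common way to build the sinusoidal timestep embedding. The dimension and frequency base below are illustrative, not the values of any particular model:

```python
import math
import torch

def timestep_embedding(t, dim=128):
    """Sinusoidal embedding of the timestep t (a 1-D tensor of integers).
    The U-Net is conditioned on this vector so it knows how noisy its input is."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```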
The Training Loop
Here is the entire training algorithm for a diffusion model. It's shockingly simple:
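In PyTorch-style code, one training step looks roughly like this (a sketch: `model` is the ε-prediction network, `x0` a batch of clean images, `alpha_bar` the cumulative schedule; data loading, logging, and EMA are left out):

```python
import torch

def training_step(model, x0, alpha_bar, optimizer):
    """One DDPM training step (Algorithm 1 in Ho et al., 2020).
    x0: batch of clean images (B, C, H, W); model: ε-prediction network."""
    T = len(alpha_bar)
    t = torch.randint(1, T + 1, (x0.shape[0],))           # random timestep per image
    eps = torch.randn_like(x0)                             # the noise we will hide
    a = alpha_bar[t - 1].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1 - a) * eps     # forward shortcut, in one shot
    loss = ((model(x_t, t) - eps) ** 2).mean()             # simple MSE on the noise
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```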
That's it. The beauty is in the simplicity. There are no adversarial networks fighting each other (as in GANs), no complex reconstruction losses, no mode collapse. Just: add noise, predict it, minimize the difference.
Below, watch a visualization of one training iteration. The animation loops, showing how a network gradually improves its noise predictions:
A few practical notes about training:
- Random timesteps: Each training step samples a random t. This means the network learns to denoise at all noise levels simultaneously.
- Timestep conditioning: The network receives t as input (via sinusoidal embeddings, similar to transformer position encodings). This tells it how much noise to expect.
- Scale: Stable Diffusion was trained on billions of image-text pairs. But the algorithm is the same simple loop above.
Sampling — Birth of an Image
Once trained, generating an image is the reverse process in action:
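In the same sketch style (again assuming a trained ε-prediction `model` and the `betas` schedule; σt = √βt is used here as one common choice for the sampling variance):

```python
import torch

@torch.no_grad()
def sample(model, betas, shape):
    """DDPM sampling (Algorithm 2 in Ho et al., 2020)."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                  # start from pure static
    for t in range(len(betas), 0, -1):
        eps = model(x, torch.full((shape[0],), t))          # "what noise do you see?"
        mean = (x - betas[t - 1] / torch.sqrt(1 - alpha_bar[t - 1]) * eps) / torch.sqrt(alphas[t - 1])
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t - 1]) * z             # fresh noise, skipped at the last step
    return x
```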
Each step takes the current noisy image, asks the network "what noise do you see?", subtracts most of it, and adds back a small amount of fresh noise (the σt · z term). That last bit of re-randomization is important — it's what makes the process stochastic and allows the model to generate diverse outputs.
The z term is why different random seeds give different images from the same model. And at the very last step (t = 1), we skip the noise addition to get a clean final result.
Try it below. Click "Generate" to watch noise gradually resolve into a recognizable shape:
Notice how the image starts as pure static, then vague structure emerges, then finer details fill in. This coarse-to-fine progression is characteristic of diffusion models — global structure (is it a face? a landscape?) is determined in the early steps, while fine details (texture, edges) are refined in the later steps.
The Score Function Perspective
There's an alternative — and beautiful — way to understand what diffusion models are doing. It comes from a field called score-based generative modeling, and it provides deep insight into why diffusion models work at all.
The score function of a distribution p(x) is the gradient of the log-probability:
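s(x) = ∇x log p(x)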
Think of it as an arrow at every point in space, pointing in the direction of "more probable data." In a region of high probability (near real images), the arrows are small. In low-probability regions (noise), the arrows point strongly toward the data.
If you had the score function everywhere, you could generate samples using Langevin dynamics: start at a random point and follow the score (plus a bit of noise) until you arrive at a high-probability region:
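xk+1 = xk + (η/2) · ∇x log p(xk) + √η · zk,   where η is a small step size and zk ~ N(0, I)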
This is remarkably similar to the diffusion sampling process — and that's not a coincidence.
The connection to noise prediction is direct. Recall that the score at noise level t is:
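∇xt log p(xt) = -εθ(xt, t) / √(1 - ᾱt)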
So a noise-predicting network is a score estimator (up to scaling). When the network predicts "the noise is pointing this way," it's equivalently saying "real data is that way." The entire diffusion sampling process is gradient ascent on the data log-likelihood, performed at progressively finer noise scales.
Variance Schedules & Noise Levels
The noise schedule β1, β2, ..., βT is one of the most important design choices in a diffusion model. It determines how quickly information is destroyed — and how smoothly the model can learn to reverse the process.
Linear schedule (original DDPM): βt increases linearly from β1 = 10⁻⁴ to βT = 0.02. Simple but flawed — it destroys information too quickly in the middle timesteps.
Cosine schedule (Improved DDPM): Defines ᾱt to follow a cosine curve, ensuring a smoother decay of signal-to-noise ratio. Information is preserved longer, giving the model more to work with at intermediate timesteps.
The cosine schedule has become the default for most modern diffusion models because it avoids the "information cliff" of the linear schedule — a region where the SNR drops precipitously, making learning harder.
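To see the difference concretely, here is a small NumPy sketch of both schedules (the constants follow commonly used defaults, with s = 0.008 for the cosine version):

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level ᾱ_t under the original DDPM linear schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T=1000, s=0.008):
    """Cumulative signal level ᾱ_t under the Improved-DDPM cosine schedule."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f[1:] / f[0]                      # ᾱ_t = f(t) / f(0)

# Halfway through the process the cosine schedule has preserved far more
# signal than the linear one, which is the "information cliff" in action.
print(linear_alpha_bar()[499], cosine_alpha_bar()[499])
```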
Speeding Things Up
DDPM's main weakness: sampling requires 1000 forward passes through the U-Net. That's slow. A single image might need 10–30 seconds on a GPU. Can we do better?
DDIM (Denoising Diffusion Implicit Models) was the first breakthrough. The key insight: make the reverse process deterministic by removing the random noise term σt · z. Without randomness, you can skip steps — jumping from t = 1000 to t = 950 directly, instead of stepping through every integer.
This reduces sampling to 20–50 steps with minimal quality loss. A side effect: deterministic sampling means the same seed always gives the same output, which in practice is often a feature rather than a drawback.
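One common way to write the deterministic DDIM update (the σt = 0 case): first form the model's current estimate of the clean image, then re-noise it to the previous level using the predicted noise instead of fresh randomness:

x̂0 = (xt - √(1 - ᾱt) · εθ(xt, t)) / √ᾱt
xt-1 = √ᾱt-1 · x̂0 + √(1 - ᾱt-1) · εθ(xt, t)

Nothing in this update requires t - 1 to be the very next timestep, which is exactly why DDIM can skip ahead.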
The race to reduce steps has continued:
- DPM-Solver: Uses higher-order ODE solvers for faster convergence (10–20 steps)
- Consistency Models: Map any noise level directly to the final image in a single step
- Rectified Flows: Learn straight-line paths from noise to data, enabling few-step generation
- Distillation: Train a student model to mimic a multi-step teacher in fewer steps
Today's best models can produce high-quality images in 1–4 steps. The gap between "slow but beautiful" and "fast but ugly" has almost completely closed.
From Pixels to Latent Space
Everything we've discussed so far operates directly on pixel values. But a 512×512 color image has 786,432 dimensions. Running a U-Net on that is expensive.
The key innovation of Latent Diffusion (the architecture behind Stable Diffusion): don't do diffusion in pixel space. Instead:
- Compress the image to a much smaller latent representation using a pre-trained VAE (Variational Autoencoder)
- Do diffusion in this compact latent space
- Decode back to pixel space using the VAE decoder
A 512×512 image might compress to a 64×64×4 latent — that's a 48x reduction in dimensionality. The diffusion U-Net is dramatically cheaper to run.
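Put together, the generation pipeline looks roughly like this. Every name below is an illustrative placeholder, not a specific library's API:

```python
# A latent-diffusion pipeline at a glance: diffusion happens in the VAE's latent space.
def generate(prompt, text_encoder, unet, vae, sampler):
    cond = text_encoder(prompt)                          # prompt -> sequence of embeddings
    latent = sampler(unet, cond, shape=(1, 4, 64, 64))   # diffusion in the 64×64×4 latent space
    return vae.decode(latent)                            # VAE decoder: latents -> 512×512 pixels
```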
Text Conditioning & Classifier-Free Guidance
How does a text prompt steer image generation? The prompt is encoded by a pretrained text encoder (CLIP's, in Stable Diffusion's case), producing a sequence of embedding vectors. These embeddings are fed into the U-Net via cross-attention layers: at each spatial location, the network can "attend to" relevant parts of the prompt.
But conditioning alone isn't enough. Classifier-free guidance (CFG) is the trick that makes text-to-image models actually follow prompts:
During training, the text condition is randomly dropped (replaced with null) some percentage of the time. At inference, the model makes two predictions: one with the prompt, one without. The difference is amplified by a guidance scale w (typically 7–15). Higher w means stronger prompt adherence but less diversity.
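In equation form, with c the prompt embedding and ∅ the null (dropped) condition, the guided noise prediction is:

ε̂ = εθ(xt, t, ∅) + w · (εθ(xt, t, c) - εθ(xt, t, ∅))

At w = 1 this reduces to the plain conditional prediction; larger w pushes the sample further in the direction the prompt suggests.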
Further Resources
If you've made it this far, you understand diffusion models better than most people who use them every day. Here are some resources for going deeper:
- The DDPM Paper (Ho et al., 2020). "Denoising Diffusion Probabilistic Models" — the paper that started the modern diffusion era. Clean writing, clear math, and the training algorithm we derived above.
- Score-Based Generative Modeling (Song & Ermon, 2019). "Generative Modeling by Estimating Gradients of the Data Distribution" — the score-based perspective that gave us deep insights into why diffusion works.
- Lilian Weng's blog post: "What are Diffusion Models?" — one of the best technical explanations with thorough math derivations.
- The Latent Diffusion Paper (Rombach et al., 2022). "High-Resolution Image Synthesis with Latent Diffusion Models" — the architecture behind Stable Diffusion.
- Calvin Luo's tutorial: "Understanding Diffusion Models: A Unified Perspective" — a comprehensive mathematical treatment connecting all the different perspectives.
This explainer covers the fundamental principles. Real production models add many engineering details — EMA, attention, adaptive normalization, progressive training — but the core ideas are exactly what you've just learned. If you can explain the forward process, the training loop, and the sampling algorithm from memory, you've got it.