AI & ML March 21, 2026

How Do Diffusion Models Work?

A 7-minute read

Diffusion models generate images, audio, and video by learning to reverse a destruction process. Here's the surprisingly elegant idea behind them.

In 2020, a team at UC Berkeley published a paper showing that a simple idea could generate remarkably high-quality images: add noise to a picture until it becomes pure static, then train a neural network to run that process in reverse. The paper, titled Denoising Diffusion Probabilistic Models by Jonathan Ho and colleagues, became the foundation for Stable Diffusion, DALL-E 2, Midjourney, and most of the AI image generation tools that followed. The idea sounds almost too simple. But the math it rests on is what makes it work.

The short answer

Diffusion models work in two phases. First, during training, they watch as real images get destroyed by progressively adding random noise, step by step, until nothing of the original remains. Second, they learn to reverse this process: given a slightly noisy image, predict what a slightly less noisy version should look like. At generation time, the model starts with pure random noise and applies this learned denoising process hundreds of times, coaxing a coherent image into existence from static. Text prompts steer the process by encoding your words into a mathematical representation that guides each denoising step toward the kind of image you described.

The full picture

The core idea: destruction and reversal

The name “diffusion” comes from physics. In thermodynamics, diffusion describes how a drop of ink spreads through water until it’s evenly distributed. A diffusion model applies the same idea to data: a structured image gradually diffuses into uniform noise.

The key insight is that if you understand how something falls apart, you can learn to rebuild it.

Take any photograph. At step one, add a tiny amount of random noise. At step two, add a little more. Do this for 1,000 steps, and the image becomes indistinguishable from static. This is the forward diffusion process, and it’s mathematically well-defined. Each step follows a specific formula: keep most of the current pixels, add a controlled amount of Gaussian noise. The ratio is set so that by the final step, the image is pure noise with no trace of the original.
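A useful property of this formula is that you don’t have to add noise one step at a time: there is a closed-form expression for jumping directly to any step t. Here is a minimal NumPy sketch of that forward process, using the linear noise schedule from the DDPM paper; the 32x32 random array standing in for a real image is just for illustration.

```python
import numpy as np

def forward_diffusion(x0, t, betas):
    """Jump straight to noise level t using the closed-form forward process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]           # product of (1 - beta) up to step t
    noise = np.random.randn(*x0.shape)          # fresh Gaussian noise
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise

# Linear schedule of 1,000 small noise amounts, as in the DDPM paper.
betas = np.linspace(1e-4, 0.02, 1000)
x0 = np.random.rand(32, 32)                     # stand-in for a real image
x_mid, _ = forward_diffusion(x0, 500, betas)    # halfway: partly noised
x_end, _ = forward_diffusion(x0, 999, betas)    # final step: nearly pure noise
```

By the last step, the coefficient on the original image has shrunk to nearly zero, which is the precise sense in which “no trace of the original” remains.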

The clever part: this forward process doesn’t require any learning. It’s a fixed mathematical procedure. What does require learning is the reverse process: given a slightly noisy image, predict what a slightly cleaner version looks like. That’s what the neural network does.

What the neural network actually learns

During training, the model sees millions of partially noised images alongside the amount of noise that was added. Its job is to predict the noise in the image, so it can be subtracted out.

This is called a noise prediction network, and it’s the heart of the DDPM (Denoising Diffusion Probabilistic Model) architecture. For each noisy image and each noise level, the model learns to output a prediction of the noise that was added. Subtract that predicted noise from the image, and you get a slightly cleaner version.

Training minimizes a surprisingly simple loss function: the difference between the noise the model predicts and the noise that was actually added. That’s it. No adversarial training. No complex loss terms. Just: predict the noise, minimize the error.
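One training step can be sketched in a few lines. The tiny linear “model” below is a hypothetical stand-in for the real U-Net, just to keep the sketch self-contained; the loss computation itself mirrors the DDPM objective: noise an image to a random timestep, predict the noise, take the mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the noise prediction network: one linear layer.
# A real DDPM uses a U-Net here; the loss is the same either way.
W = rng.normal(scale=0.01, size=(64, 64))

def predict_noise(xt, t):
    return xt @ W                               # hypothetical tiny "model"

def ddpm_loss(x0, betas):
    t = rng.integers(0, len(betas))             # sample a random timestep
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.normal(size=x0.shape)           # the noise we add...
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    pred = predict_noise(xt, t)                 # ...and the noise the model predicts
    return np.mean((pred - noise) ** 2)         # simple MSE between the two

betas = np.linspace(1e-4, 0.02, 1000)
x0 = rng.normal(size=(64, 64))
loss = ddpm_loss(x0, betas)
```

Training is just gradient descent on this scalar, averaged over many images and many random timesteps.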

The architecture that does this prediction is typically a U-Net, a neural network originally developed for medical image segmentation. It processes images at multiple scales simultaneously, which helps it understand both fine textures (individual pixels) and global structure (the overall composition of the image).

Generating from noise

At generation time, the process runs in reverse. Start with pure Gaussian noise. Run the trained model on it, asking: what noise is in this image? Subtract the predicted noise. Now you have a slightly less noisy image. Run the model again. Subtract more noise. Repeat 50 to 1,000 times, depending on the sampler you’re using.
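The loop above can be written down compactly. This sketch follows the DDPM sampling update; the lambda that always predicts zero noise is a dummy placeholder where a trained U-Net would go, so the demo runs but produces nothing meaningful.

```python
import numpy as np

def ddpm_sample(predict_noise, shape, betas, rng):
    """Run the learned denoising chain backwards, starting from pure noise."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=shape)                       # pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)                    # model's guess at the noise in x
        # DDPM update: subtract the scaled noise estimate, then rescale.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                    # add fresh noise except on the last step
            x += np.sqrt(betas[t]) * rng.normal(size=shape)
    return x

# Demo with a dummy "model" that predicts zero noise; a trained U-Net goes here.
rng = np.random.default_rng(0)
sample = ddpm_sample(lambda x, t: np.zeros_like(x), (8, 8), np.linspace(1e-4, 0.02, 100), rng)
```

Note the small amount of fresh noise injected at every step except the last: that stochasticity is part of the DDPM sampler and is one source of the output diversity discussed below.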

Each step nudges the image further from noise and closer to something that looks like a real image. The model doesn’t know what specific image it’s building toward; it’s just applying its learned understanding of what less-noisy images should look like, one step at a time. Structure emerges.

This is why diffusion models produce diverse outputs. Two runs with identical settings but different starting noise will produce entirely different images, since the model is exploring a vast space of possible images that match the learned distribution.

How text prompts guide the process

Text-conditioned diffusion models add a crucial ingredient: they make each denoising step aware of a text prompt.

When you type “a photograph of a red barn at sunset,” the prompt is encoded into a numerical vector using a text encoder, typically a model like CLIP (Contrastive Language-Image Pretraining) developed by OpenAI researchers in 2021. This vector captures the semantic meaning of your words in a mathematical form.

During each denoising step, the noise prediction network uses cross-attention: it compares the current noisy image against the text embedding and adjusts its noise prediction to favor image directions that align with your prompt. Think of it as a constant check at every step: “is this becoming something that matches what the user described?”
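Stripped of learned projection matrices (which a real U-Net would have), cross-attention reduces to a softmax-weighted blend: each image position scores its affinity with every text token, then pulls in a weighted mix of the text features. The shapes and random inputs below are illustrative assumptions.

```python
import numpy as np

def cross_attention(image_tokens, text_tokens):
    """Each image position attends over the text tokens and pulls in a weighted blend.
    Real models apply learned query/key/value projections first; omitted here."""
    d = image_tokens.shape[-1]
    scores = image_tokens @ text_tokens.T / np.sqrt(d)       # (positions, words) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over text tokens
    return weights @ text_tokens                             # text-informed features per position

rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(16, 8))   # 16 spatial positions, 8-dim features
text_tokens = rng.normal(size=(4, 8))     # 4 prompt tokens from the text encoder
out = cross_attention(image_tokens, text_tokens)
```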

This is why prompt wording matters. The model is trying to satisfy a mathematical relationship between your words and the emerging pixels. More specific prompts narrow the space of possible outputs; more abstract prompts leave more room for variation.

Latent diffusion and why it makes things practical

Early diffusion models worked directly on full-resolution pixels, which was extremely slow. Each denoising step required processing a full 512x512 image through a large neural network, hundreds of times.

Latent diffusion models, introduced in a 2022 paper by Robin Rombach and colleagues at LMU Munich, solved this. Instead of denoising in pixel space, they work in a compressed “latent space.”

A separate encoder network first compresses the image into a much smaller representation that preserves the essential structure. The diffusion process happens entirely in this compressed space, which has 64x fewer spatial positions than the pixel image (in Stable Diffusion, a 512x512 image becomes a 64x64 latent). At the end, a decoder reconstructs the full-resolution image from the denoised latent.
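The arithmetic makes the savings concrete. Taking Stable Diffusion’s usual shapes as an assumption (512x512 RGB pixels compressed to a 64x64 latent with 4 channels):

```python
# Stable Diffusion's usual shapes, stated here as an assumption for illustration.
pixel_values = 512 * 512 * 3          # RGB image: 786,432 numbers
latent_values = 64 * 64 * 4           # compressed latent: 16,384 numbers
ratio = pixel_values / latent_values  # ~48x fewer values for the network to denoise
spatial_ratio = (512 // 64) ** 2      # 64x fewer spatial positions
```

Every denoising step now touches tens of thousands of values instead of hundreds of thousands, and that saving is multiplied across every step of the sampling loop.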

This compression doesn’t significantly hurt quality, because raw pixels are highly redundant: the fine perceptual detail can be reconstructed by the decoder, while the semantic content survives in the latent representation. The result: Stable Diffusion can generate images in seconds rather than minutes, on consumer hardware.

What makes diffusion models better than what came before

Before diffusion models, the dominant approach was GANs (Generative Adversarial Networks), developed by Ian Goodfellow and colleagues in 2014. GANs pit two networks against each other: a generator that creates images and a discriminator that tries to spot fakes. The competition drives quality upward.

GANs were impressive, but they had consistent problems: training instability, mode collapse (the generator producing a narrow range of outputs), and difficulty scaling. Getting a GAN to train reliably required careful tuning.

Diffusion models sidestep most of these problems. Training is stable because the loss function is simple and well-behaved. Mode collapse is rare because the stochastic noise at each step naturally encourages diversity. And because each denoising step is a relatively simple prediction task, the models scale well with more data and more compute.

The tradeoff is speed: generating an image requires hundreds of forward passes through the network, compared to a GAN’s single pass. Researchers have substantially narrowed this gap with faster samplers like DDIM and DPM-Solver, and with distillation techniques that compress many denoising steps into just a few.
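The reason samplers like DDIM can use far fewer steps is that their update is deterministic: each step first infers the clean image implied by the noise prediction, then re-noises it to the next (possibly much lower) noise level, so timesteps can be skipped. A minimal sketch of one such step, using the same alpha-bar schedule as before (the zero-noise input in the test is a placeholder for a trained model’s prediction):

```python
import numpy as np

def ddim_step(x, eps, ab_t, ab_prev):
    """One deterministic DDIM update from noise level ab_t down to ab_prev."""
    x0_pred = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)      # implied clean image
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps

# Because each step is deterministic, a 1,000-step schedule can be
# subsampled: here, 50 timesteps stand in for the full thousand.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)
timesteps = np.linspace(999, 0, 50).astype(int)
```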

Beyond images

The same architecture applies wherever you can define a forward destruction process and learn to reverse it.

For audio, diffusion models add noise to waveforms and learn to denoise them, producing speech and music generation systems like those used in modern text-to-speech pipelines. For video, the models extend to temporal sequences, denoising across frames simultaneously. For protein design, diffusion models operate on 3D coordinate spaces: RFdiffusion and related tools use diffusion processes to generate protein backbones with specific structural properties.

The underlying principle is always the same: define a tractable way to destroy structure, then learn to reverse it.