How Do Diffusion Models Work?
A 7-minute read
Diffusion models generate images, audio, and video by learning to reverse a destruction process. Here's the surprisingly elegant idea behind them.
In 2020, a team at UC Berkeley published a paper showing that a simple idea could generate remarkably high-quality images: add noise to a picture until it becomes pure static, then train a neural network to run that process in reverse. The paper, titled Denoising Diffusion Probabilistic Models by Jonathan Ho and colleagues, became the foundation for Stable Diffusion, DALL-E 2, Midjourney, and most of the AI image generation tools that followed. The idea sounds almost too simple. But the math it rests on is what makes it work.
The short answer
Diffusion models work in two phases. First, during training, they watch as real images get destroyed by progressively adding random noise, step by step, until nothing of the original remains. Second, they learn to reverse this process: given a slightly noisy image, predict what a slightly less noisy version should look like. At generation time, the model starts with pure random noise and applies this learned denoising process hundreds of times, coaxing a coherent image into existence from static. Text prompts steer the process by encoding your words into a mathematical representation that guides each denoising step toward the kind of image you described.
The full picture
The core idea: destruction and reversal
The name “diffusion” comes from physics. In thermodynamics, diffusion describes how a drop of ink spreads through water until it’s evenly distributed. The forward process of a diffusion model does the same thing to an image: its structure gradually diffuses into uniform noise.
The key insight is that if you understand how something falls apart, you can learn to rebuild it.
Take any photograph. At step one, add a tiny amount of random noise. At step two, add a little more. Do this for 1,000 steps, and the image becomes indistinguishable from static. This is the forward diffusion process, and it’s mathematically well-defined. Each step follows a specific formula: keep most of the current pixels, add a controlled amount of Gaussian noise. The ratio is set so that by the final step, the image is pure noise with no trace of the original.
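The forward process can be sketched in a few lines of NumPy. The schedule here (1,000 steps, a linear beta range from 1e-4 to 0.02) follows the DDPM paper's defaults, and the closed-form jump to any step t is standard; the tiny 8x8 "image" is just a stand-in for a real normalized photo.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # per-step noise variance schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # cumulative fraction of signal kept

def forward_diffuse(x0, t, rng):
    """Jump straight to step t: x_t = sqrt(ab_t)*x0 + sqrt(1-ab_t)*noise."""
    noise = rng.standard_normal(x0.shape)
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * noise

rng = np.random.default_rng(0)
image = rng.uniform(-1, 1, size=(8, 8))  # stand-in for a normalized image

slightly_noisy = forward_diffuse(image, t=10, rng=rng)
pure_static = forward_diffuse(image, t=T - 1, rng=rng)

# By the final step almost no signal remains: alpha_bar_T is near zero.
print(float(alpha_bars[-1]))
```

A useful property shown here: because each step's noise is Gaussian, the whole chain collapses into a single formula, so training never has to simulate 1,000 steps one by one.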
The clever part: this forward process doesn’t require any learning. It’s a fixed mathematical procedure. What does require learning is the reverse process: given a slightly noisy image, predict what a slightly cleaner version looks like. That’s what the neural network does.
What the neural network actually learns
During training, the model sees millions of partially noised images, each paired with the noise level (timestep) at which it was corrupted. Its job is to predict the noise in the image so it can be subtracted out.
This is called a noise prediction network, and it’s the heart of the DDPM (Denoising Diffusion Probabilistic Model) architecture. For each noisy image and each noise level, the model learns to output a prediction of the noise that was added. Subtract that predicted noise from the image, and you get a slightly cleaner version.
Training minimizes a surprisingly simple loss function: the difference between the noise the model predicts and the noise that was actually added. That’s it. No adversarial training. No complex loss terms. Just: predict the noise, minimize the error.
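That loss can be written down directly. The sketch below is a minimal illustration: the "model" is a hypothetical stand-in that always predicts zero noise, just to keep the example self-contained; in practice the predictor is a U-Net taking the noisy image and timestep as input.

```python
import numpy as np

def ddpm_loss(predict_noise, x0, alpha_bars, rng):
    """One training example's loss: MSE between true and predicted noise."""
    t = rng.integers(len(alpha_bars))        # sample a random timestep
    noise = rng.standard_normal(x0.shape)    # the noise we actually add
    ab = alpha_bars[t]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * noise   # noised input
    predicted = predict_noise(x_t, t)        # the network's guess at the noise
    return np.mean((predicted - noise) ** 2) # simple mean-squared error

# Hypothetical "model" that always predicts zero noise, for illustration only.
zero_model = lambda x_t, t: np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((8, 8))
loss = ddpm_loss(zero_model, x0, alpha_bars, rng)
print(loss)  # nonzero: a zero predictor is penalized for the noise it misses
```

Everything a real training loop adds (batching, an optimizer, a real network) wraps around this one computation.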
The architecture that does this prediction is typically a U-Net, a neural network originally developed for medical image segmentation. It processes images at multiple scales simultaneously, which helps it understand both fine textures (individual pixels) and global structure (the overall composition of the image).
Generating from noise
At generation time, the process runs in reverse. Start with pure Gaussian noise. Run the trained model on it, asking: what noise is in this image? Subtract the predicted noise. Now you have a slightly less noisy image. Run the model again. Subtract more noise. Repeat 50 to 1,000 times, depending on the sampler you’re using.
Each step nudges the image further from noise and closer to something that looks like a real image. The model doesn’t know what specific image it’s building toward; it’s just applying its learned understanding of what less-noisy images should look like, one step at a time. Structure emerges.
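The generation loop above can be sketched as ancestral sampling from the DDPM paper. The `predict_noise` argument is a placeholder for the trained network; here a dummy that returns zeros is plugged in purely to show the control flow, so the output is noise, not a real image.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample(predict_noise, shape, rng):
    x = rng.standard_normal(shape)           # start from pure Gaussian noise
    for t in range(T - 1, -1, -1):           # walk the steps in reverse
        eps = predict_noise(x, t)            # network's estimate of the noise
        # Remove the predicted noise, rescaled for this step's schedule.
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                            # inject fresh noise except at the end
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

rng = np.random.default_rng(0)
dummy = lambda x, t: np.zeros_like(x)        # stand-in for the trained U-Net
img = sample(dummy, (8, 8), rng)
print(img.shape)
```

Note the small injection of fresh noise at every step but the last: that stochasticity is what lets identical settings with different seeds land on entirely different images.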
This is why diffusion models produce diverse outputs. Two runs with identical settings but different starting noise will produce entirely different images, since the model is exploring a vast space of possible images that match the learned distribution.
How text prompts guide the process
Text-conditioned diffusion models add a crucial ingredient: they make each denoising step aware of a text prompt.
When you type “a photograph of a red barn at sunset,” the prompt is encoded into a numerical vector using a text encoder, typically a model like CLIP (Contrastive Language-Image Pretraining) developed by OpenAI researchers in 2021. This vector captures the semantic meaning of your words in a mathematical form.
During each denoising step, the noise prediction network uses cross-attention: it compares the current noisy image against the text embedding and adjusts its noise prediction to favor image directions that align with your prompt. Think of it as a constant check at every step: “is this becoming something that matches what the user described?”
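The core of that comparison is a small amount of linear algebra. This toy version uses made-up dimensions and skips the learned query/key/value projections and multiple heads that real models use, keeping only the essential computation: image positions attend over text tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                       # shared embedding size (assumed)
image_tokens = rng.standard_normal((64, d))  # 8x8 latent flattened: 64 queries
text_tokens = rng.standard_normal((7, d))    # one embedding per prompt token

def cross_attention(queries, keys_values):
    # Similarity of every image position to every text token.
    scores = queries @ keys_values.T / np.sqrt(d)
    # Softmax over the text tokens, so each image position gets a
    # probability distribution over the words in the prompt.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keys_values             # text-informed image features

out = cross_attention(image_tokens, text_tokens)
print(out.shape)  # one text-conditioned feature vector per image position
```

Each image position ends up carrying a weighted mix of the prompt's token embeddings, which is how "red barn at sunset" can pull different regions of the image in different directions.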
This is why prompt wording matters. The model is trying to satisfy a mathematical relationship between your words and the emerging pixels. More specific prompts narrow the space of possible outputs; more abstract prompts leave more room for variation.
Latent diffusion and why it makes things practical
Early diffusion models worked directly on full-resolution pixels, which was extremely slow. Each denoising step required processing a full 512x512 image through a large neural network, hundreds of times.
Latent diffusion models, introduced in a 2022 paper by Robin Rombach and colleagues at LMU Munich, solved this. Instead of denoising in pixel space, they work in a compressed “latent space.”
A separate encoder network first compresses the image into a much smaller representation that preserves the essential structure. The diffusion process happens entirely in this compressed space, which in Stable Diffusion’s case is 8x smaller along each spatial dimension, leaving roughly 48x fewer values to process per step. At the end, a decoder reconstructs the full-resolution image from the denoised latent.
This compression doesn’t significantly hurt quality, because most of the information in raw pixels is perceptually redundant fine detail. The semantic content survives in the latent representation. The result: Stable Diffusion can generate images in seconds rather than minutes, on consumer hardware.
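The savings are easy to quantify. The 8x spatial downsampling factor and 4-channel latent below match Stable Diffusion's published setup; the arithmetic is just counting values.

```python
import numpy as np

pixel_shape = (512, 512, 3)   # RGB image: 786,432 values to denoise per step
latent_shape = (64, 64, 4)    # 8x smaller per spatial axis, 4 latent channels

pixel_elems = int(np.prod(pixel_shape))
latent_elems = int(np.prod(latent_shape))

print(pixel_elems // latent_elems)  # prints 48: ~48x fewer values per step
```

Since that factor applies to every one of the dozens to hundreds of denoising steps, the end-to-end speedup is what moved diffusion from datacenter GPUs to consumer hardware.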
What makes diffusion models better than what came before
Before diffusion models, the dominant approach was GANs (Generative Adversarial Networks), developed by Ian Goodfellow and colleagues in 2014. GANs pit two networks against each other: a generator that creates images and a discriminator that tries to spot fakes. The competition drives quality upward.
GANs were impressive, but they had consistent problems: training instability, mode collapse (the generator producing a narrow range of outputs), and difficulty scaling. Getting a GAN to train reliably required careful tuning.
Diffusion models sidestep most of these problems. Training is stable because the loss function is simple and well-behaved. Mode collapse is rare because the stochastic noise at each step naturally encourages diversity. And because each denoising step is a relatively simple prediction task, the models scale well with more data and more compute.
The tradeoff is speed: generating an image requires hundreds of forward passes through the network, compared to a GAN’s single pass. Researchers have substantially narrowed this gap with faster samplers like DDIM and DPM-Solver, and with distillation techniques that compress many denoising steps into just a few.
Beyond images
The same architecture applies wherever you can define a forward destruction process and learn to reverse it.
For audio, diffusion models add noise to waveforms and learn to denoise them, producing speech and music generation systems like those used in modern text-to-speech pipelines. For video, the models extend to temporal sequences, denoising across frames simultaneously. For protein structure prediction and molecular design, diffusion models operate on 3D coordinate spaces. RFdiffusion and related tools use diffusion processes to design protein backbones with specific structural properties.
The underlying principle is always the same: define a tractable way to destroy structure, then learn to reverse it.
Why it matters
Diffusion models matter because they represent a fundamental breakthrough in generative AI. Before their arrival, generating high-quality images required adversarial training (GANs), which was notoriously unstable and prone to mode collapse. Diffusion models offer stable, predictable training with a simple loss function, and they produce more diverse, higher-quality outputs. This stability is why the technology rapidly moved from research papers to consumer tools like Stable Diffusion and Midjourney, enabling millions of people to generate images from text.
The architecture’s versatility matters too. While images were the first major success, the same principle, learn to reverse a destruction process, applies wherever structured data exists. Audio generation, video synthesis, molecule design, and even robotic motion planning have all seen diffusion-based approaches outperform prior methods. This makes diffusion models a general-purpose tool for creative and scientific tasks, not just a niche image generator.
Understanding how diffusion models work also helps set realistic expectations. The hundreds of denoising steps aren’t a quirk; they’re necessary because each step only removes a small amount of noise, making the prediction task tractable. The diversity of outputs isn’t a bug; it’s a feature of starting from random noise and exploring the space of possible images. And the dependence on text prompts isn’t magic; it’s a mathematical constraint where the model tries to satisfy a semantic relationship between your words and the emerging pixels.
For anyone building with AI, knowing the difference between latent diffusion (used in Stable Diffusion) and pixel-space diffusion (used in earlier models) helps choose the right tool for the task. Latent diffusion is faster and more practical for most applications, while pixel-space diffusion offers maximum quality at the cost of speed.
Common misconceptions
“Diffusion models store and retrieve images from training data.” Diffusion models learn statistical patterns about what images look like, not copies of specific images. The model stores billions of numerical weights that encode those patterns. When generating, it constructs new images from learned patterns, not from retrieving stored data. This is why the same prompt can produce infinitely many different images.
“More steps always means better quality.” Not necessarily. While more steps generally improve quality up to a point, diminishing returns set in. Modern fast samplers can achieve comparable quality in fewer steps (sometimes 10-20 instead of 50-100), and beyond a certain threshold, additional steps may not meaningfully improve the output while costing significantly more compute.
“The model knows what it’s drawing.” The model doesn’t have a mental image of what it’s building toward. At each step, it’s just predicting what noise to remove to make the image slightly less noisy. Structure emerges incrementally through hundreds of these small improvements. The model is not “seeing” a picture in its mind and rendering it; it’s following a gradient toward images that match the statistical patterns it learned during training.
Key terms
Forward diffusion The process of progressively adding noise to an image until it becomes pure static. This is a fixed mathematical procedure that requires no learning.
Reverse diffusion The learned process of removing noise from an image, running the forward process in reverse. The neural network learns this during training.
Noise prediction network The neural network that predicts what noise was added to an image, so it can be subtracted out. This is the core component of diffusion models like DDPM.
Latent space A compressed representation of an image used in latent diffusion models. Instead of denoising full-resolution pixels, the model works in this smaller space and reconstructs the final image at the end.
U-Net The typical architecture used for the noise prediction network. Originally developed for medical imaging, it processes images at multiple scales to capture both fine details and global structure.
Cross-attention The mechanism that allows text prompts to guide the diffusion process. The model compares the current noisy image against the text embedding and adjusts its predictions to favor outputs matching the prompt.
Sampler The algorithm that determines how to step through the diffusion process during generation. Different samplers (DDPM, DDIM, DPM-Solver) offer different tradeoffs between speed and quality.