Architecture

Diffusion Model

Generative model architecture that produces images, audio, or video by iteratively denoising from random noise.

Diffusion models are a class of generative models that learn to reverse a noising process — starting from pure noise and gradually denoising it toward a sample from the target distribution. They underpin most modern image generation systems (Stable Diffusion, FLUX, DALL-E 3), video generation systems (Sora, Mochi, CogVideoX), and increasingly audio generation.
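The reverse process described above can be sketched on a one-dimensional toy problem. This is only an illustration under stated assumptions: the "data" is a Gaussian with mean MU and standard deviation SIGMA, so the ideal noise prediction has a closed form and stands in for a trained network. The linear alpha_bar schedule and the deterministic DDIM-style update are choices made for this sketch, not something the entry above prescribes, and every name in it is hypothetical.

```python
import math
import random

# Assumption: toy "data distribution" N(MU, SIGMA^2) so the ideal denoiser is known.
MU, SIGMA = 3.0, 0.5


def alpha_bar(tau):
    """Share of signal variance kept at noise level tau in [0, 1] (toy linear schedule)."""
    return 1.0 - (1.0 - 1e-4) * tau


def ideal_noise_prediction(x, tau):
    """Closed-form E[noise | x] for Gaussian data; a trained network would go here."""
    ab = alpha_bar(tau)
    var = ab * SIGMA ** 2 + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (x - math.sqrt(ab) * MU) / var


def sample(num_steps, rng):
    """Start from pure noise and denoise in num_steps deterministic DDIM-style updates."""
    x = rng.gauss(0.0, 1.0)
    for i in range(num_steps):
        tau = 1.0 - i / num_steps
        tau_next = 1.0 - (i + 1) / num_steps
        ab, ab_next = alpha_bar(tau), alpha_bar(tau_next)
        eps = ideal_noise_prediction(x, tau)  # one model forward pass per step
        x0_hat = (x - math.sqrt(1.0 - ab) * eps) / math.sqrt(ab)  # current clean estimate
        x = math.sqrt(ab_next) * x0_hat + math.sqrt(1.0 - ab_next) * eps  # re-noise to next level
    return x


rng = random.Random(0)
samples = [sample(num_steps=30, rng=rng) for _ in range(2000)]
print(sum(samples) / len(samples))  # should land close to MU
```

Each loop iteration calls the denoiser once, which is why the step count shows up directly in inference cost.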

The compute profile is unusual: generating a single output requires running the model forward 20-50 or more times (one pass per denoising step), so inference cost grows roughly linearly with the step count and is far higher than a single forward pass of a model with a similar parameter count. Modern accelerations include flow matching, distillation to fewer steps, and consistency models, which aim for comparable quality in 1-4 steps.
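The effect of step count on cost is simple arithmetic, sketched below. The function names and the per-pass time are hypothetical placeholders. Real latency also depends on resolution, batch size, and hardware, so treat this as a rough model only.

```python
def estimated_latency_s(num_steps, seconds_per_pass):
    """Rough latency model: one forward pass per denoising step."""
    return num_steps * seconds_per_pass


def speedup(baseline_steps, reduced_steps):
    """Ideal speedup from cutting step count, assuming per-pass cost is unchanged."""
    return baseline_steps / reduced_steps


# Assumption: 0.1 s per forward pass is a made-up figure for illustration.
print(estimated_latency_s(num_steps=50, seconds_per_pass=0.1))
print(speedup(baseline_steps=50, reduced_steps=4))  # 12.5
```

Under this model, a distilled or consistency-style sampler that runs 4 steps instead of 50 does 12.5 times fewer forward passes per output.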

VRAM scales with output resolution and, for video models, with the number of frames generated jointly. A 5-second 1080p clip at 24 fps means processing 120 frames in one denoising loop, which multiplies memory use relative to single-image generation.
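The frame arithmetic can be checked with a short calculation. The sketch below counts only the raw fp16 pixel tensor for the clip, which is an assumption made for illustration: many models denoise in a compressed latent space, and activations and weights usually dominate actual VRAM, so the absolute numbers are not a VRAM estimate. The point is the linear scaling with frame count. The helper names are hypothetical.

```python
def frame_count(seconds, fps):
    """Number of frames denoised jointly for a clip."""
    return seconds * fps


def tensor_bytes(frames, height, width, channels=3, bytes_per_value=2):
    """Size of one frames x height x width x channels tensor (fp16 by default)."""
    return frames * height * width * channels * bytes_per_value


frames = frame_count(seconds=5, fps=24)
print(frames)  # 120

single_image = tensor_bytes(1, 1080, 1920)
video_clip = tensor_bytes(frames, 1080, 1920)
print(video_clip // single_image)  # 120
print(video_clip / 2 ** 30)        # size in GiB for the raw pixel tensor only
```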

Related Terms

Concepts directly relevant to Diffusion Model.

LLM Inference

Serving trained large language models to user requests, often memory-bandwidth-bound.

Memory Bandwidth

How fast a GPU can read and write its VRAM, measured in gigabytes per second.

This definition is part of AIMC's reference glossary — 36 concepts across 10 categories.
