Diffusion models are a class of generative models that learn to reverse a noising process — starting from pure noise and gradually denoising it toward a sample from the target distribution. They underpin most modern image generation systems (Stable Diffusion, FLUX, DALL-E 3), video generation systems (Sora, Mochi, CogVideoX), and increasingly audio generation.
The compute profile is unusual: generating a single output requires running the model forward 20-50+ times (one per denoising step), making inference substantially slower than for autoregressive models of similar parameter count. Modern accelerations include flow matching, distillation to fewer steps, and consistency models that produce comparable quality in 1-4 steps.
VRAM scales with output resolution and, for video models, with the number of frames generated jointly. A 1080p 5-second video clip at 24 fps requires processing 120 frames in one denoising loop, multiplying VRAM versus single-image generation.