FP8 is an 8-bit floating-point format that emerged around 2022 with NVIDIA's H100 (Hopper) architecture and AMD's MI300 series. Two common variants exist: E4M3 (4 exponent bits, 3 mantissa bits, narrower range but more precision) and E5M2 (5 exponent bits, 2 mantissa bits, wider range less precision).
FP8 enables roughly 2× the throughput of FP16 on supported hardware, with the H100 reaching approximately 3,958 TFLOPS at FP8 with sparsity. This makes it particularly valuable for inference workloads at scale, where every doubling of throughput translates directly to cost reduction.
Adoption is still maturing — not all model architectures train stably in FP8, and most production deployments use FP8 only for inference of FP16/BF16-trained weights via post-training quantization.