Numerical Precision

FP8

8-bit floating-point format introduced for high-throughput inference and training.

FP8 is an 8-bit floating-point format that emerged around 2022 with NVIDIA's H100 (Hopper) architecture and AMD's MI300 series. Two common variants exist: E4M3 (4 exponent bits, 3 mantissa bits, narrower range but more precision) and E5M2 (5 exponent bits, 2 mantissa bits, wider range less precision).

FP8 enables roughly 2× the throughput of FP16 on supported hardware, with the H100 reaching approximately 3,958 TFLOPS at FP8 with sparsity. This makes it particularly valuable for inference workloads at scale, where every doubling of throughput translates directly to cost reduction.

Adoption is still maturing — not all model architectures train stably in FP8, and most production deployments use FP8 only for inference of FP16/BF16-trained weights via post-training quantization.

Related Terms

Concepts directly relevant to FP8.

FP16

16-bit half-precision floating-point format used heavily in deep learning.

BF16 (bfloat16)

Alternative 16-bit floating-point format with wider exponent range than FP16.

Quantization

Reducing the numerical precision of model weights to lower memory and compute requirements.

Workloads Where FP8 Matters

GPU fit analysis for the workloads this concept directly influences.

Llm Inference

Ranked GPUs →

This definition is part of AIMC's reference glossary — 36 concepts across 10 categories.

Browse full glossary