Memory bandwidth measures the maximum rate at which a GPU can transfer data between its VRAM and processing cores. Reported in gigabytes per second (GB/s), it depends on memory type (HBM vs GDDR), bus width, and clock speed.
For many workloads, especially LLM inference, memory bandwidth is the binding constraint rather than raw compute. Generating each token requires reading the entire model weights through the chip; a model that fits in VRAM but bandwidth-starves the cores will run slowly regardless of TFLOPS rating.
Rough hierarchy in 2026: H200 SXM ~4.8 TB/s, B200 SXM ~8.0 TB/s, H100 SXM ~3.35 TB/s, A100 SXM 80GB ~2.0 TB/s, RTX 5090 ~1.8 TB/s, RTX 4090 ~1.0 TB/s. This is one reason datacenter parts dominate inference at scale despite consumer cards reaching high FP16 TFLOPS.