Memory Bandwidth

How fast a GPU can read and write its VRAM, measured in gigabytes per second.

Memory bandwidth measures the maximum rate at which a GPU can transfer data between its VRAM and processing cores. Reported in gigabytes per second (GB/s), it depends on memory type (HBM vs GDDR), bus width, and clock speed.

For many workloads, especially LLM inference, memory bandwidth is the binding constraint rather than raw compute. Generating each token requires reading the entire model weights through the chip; a model that fits in VRAM but bandwidth-starves the cores will run slowly regardless of TFLOPS rating.

Rough hierarchy in 2026: H200 SXM ~4.8 TB/s, B200 SXM ~8.0 TB/s, H100 SXM ~3.35 TB/s, A100 SXM 80GB ~2.0 TB/s, RTX 5090 ~1.8 TB/s, RTX 4090 ~1.0 TB/s. This is one reason datacenter parts dominate inference at scale despite consumer cards reaching high FP16 TFLOPS.

Related Terms

Concepts directly relevant to Memory Bandwidth.

HBM (High Bandwidth Memory)

Stacked DRAM technology used for high-throughput GPU memory in datacenter parts.

TFLOPS

Trillion floating-point operations per second — the standard GPU compute throughput metric.

Workloads Where Memory Bandwidth Matters

GPU fit analysis for the workloads this concept directly influences.

Llm Inference

Ranked GPUs →

This definition is part of AIMC's reference glossary — 36 concepts across 10 categories.

Browse full glossary