VRAM, or Video Random Access Memory, is the high-speed memory physically attached to a GPU. It holds the model weights, activations, optimizer states, and any data the GPU is actively computing on. VRAM capacity is the most common hard limit on what workloads a given GPU can run — a model whose weights exceed available VRAM either won't load at all or requires partitioning across multiple GPUs.
Datacenter GPUs ship with HBM-based VRAM ranging from 40 GB (A100 40GB) to 192 GB (B200). Workstation cards typically carry 24-96 GB of GDDR6 or GDDR7. Consumer cards range from 8-32 GB. The gap matters because LLM training and inference scale almost linearly with model parameter count — a 70B-parameter model in BF16 needs roughly 140 GB of VRAM for inference (weights alone), and significantly more for training when gradients, optimizer states, and activations are added.
AIMC's fit-score algorithm uses VRAM capacity as one of three primary inputs alongside FP16 TFLOPS and GPU class. The per-workload VRAM minimums published across AIMC's fit analyses reflect full-precision (FP16/BF16) operation. Quantization (INT8, FP8, INT4) reduces VRAM needs proportionally with some accuracy tradeoff, allowing larger models to fit in smaller cards.