LLM training is the most VRAM-intensive workload in modern AI. A 7B-parameter model in FP16 requires roughly 14 GB just for weights, plus optimizer states (Adam roughly doubles or triples that), activations, and gradients. Production fine-tuning runs typically need 40 GB or more per GPU, and full pretraining of 70B+ models is multi-node territory using FSDP, DeepSpeed ZeRO-3, or Megatron-LM.
Memory bandwidth matters more than raw FLOPs for large transformer training because attention operations are memory-bound. NVLink and SXM form factors enable efficient tensor-parallel and pipeline-parallel sharding across cards. FP8 support on Hopper (H100, H200) and Blackwell (B200) GPUs roughly halves memory pressure compared to FP16 with minimal accuracy loss.
The dominant frameworks are PyTorch with FSDP/DDP, Hugging Face Accelerate, DeepSpeed, and JAX. For LoRA and QLoRA fine-tuning, the memory threshold drops significantly: a QLoRA fine-tune of Llama 3 8B fits comfortably in 24 GB.