LLM inference has different requirements from training: VRAM still determines which models you can serve, but throughput, latency, and cost per token dominate the economics. Production inference servers such as vLLM, TGI (Text Generation Inference), and SGLang rely on techniques like paged KV-cache management (PagedAttention in vLLM), continuous batching, and speculative decoding to maximize tokens per second per dollar.
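To make the benefit of continuous batching concrete, here is a toy sketch in Python. It is not how vLLM, TGI, or SGLang schedule requests internally. It assumes every request produces one token per decode step, ignores prefill cost, and uses hypothetical function names. It only counts how many decode steps are needed to finish a queue of requests when slots are refilled per batch versus per step.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        steps += max(batch)  # short requests wait for the longest one
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    queue = deque(output_lengths)  # tokens still to generate, per request
    active = []
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step yields one token per active request
        # Drop requests that just produced their last token.
        active = [left - 1 for left in active if left - 1 > 0]
    return steps


# Illustrative workload: output lengths (in tokens) are made up.
lengths = [2, 8, 2, 8]
print("static:    ", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In this toy model the mixed-length workload finishes in fewer steps with continuous batching because a short request's slot is reused immediately instead of idling until the longest request in its batch completes.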
Quantization is the main lever for fitting a model into less memory: the weights of a Llama 3 70B model take roughly 140 GB in FP16 (2 bytes per parameter) but about 35 GB in INT4 (0.5 bytes per parameter), with nearly the same quality for most workloads. The KV cache and activations need additional memory on top of the weights. AWQ, GPTQ, and bitsandbytes are the common quantization toolchains. Smaller models (Llama 3 8B, Mistral 7B, Phi-3) routinely run on a single 24 GB GPU at 50-150 tokens/second.
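The memory figures above follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below reproduces them. The function names and the 20% overhead allowance for KV cache and runtime buffers are assumptions for illustration, not measured values; real overhead depends on context length, batch size, and the serving stack.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_on_gpu(params_billions, bits_per_weight, vram_gb, overhead_fraction=0.2):
    """Rough fit check. overhead_fraction is an assumed allowance for the
    KV cache and runtime buffers, not a measured number."""
    needed = weight_memory_gb(params_billions, bits_per_weight) * (1 + overhead_fraction)
    return needed <= vram_gb


print(weight_memory_gb(70, 16))  # 70B parameters in FP16
print(weight_memory_gb(70, 4))   # 70B parameters in INT4
print(fits_on_gpu(8, 16, vram_gb=24))  # 8B model in FP16 on a 24 GB GPU
print(fits_on_gpu(70, 4, vram_gb=24))  # 70B model in INT4 on a 24 GB GPU
```

Real quantized checkpoints are usually somewhat larger than this estimate because formats store per-group scales and often keep some layers at higher precision.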
For consumer inference (single-user, low concurrency), llama.cpp and Ollama run quantized models on hardware with as little as 8 GB of memory. For multi-tenant serving with high concurrency, datacenter GPUs with high-bandwidth memory (HBM) offer the lowest cost per token.
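One way to see why memory bandwidth matters so much: during decoding, generating each token requires reading roughly all of the model weights from memory, so bandwidth divided by weight size gives an optimistic ceiling on single-stream tokens per second. The sketch below applies that rule of thumb. The bandwidth and model-size numbers are illustrative placeholders rather than specifications of any particular device, and the batched estimate ignores KV-cache reads and compute limits, so real throughput will be lower.

```python
def max_decode_tokens_per_second(weights_gb, bandwidth_gb_per_s):
    """Optimistic single-stream ceiling: one full pass over the weights
    per generated token, limited only by memory bandwidth."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_per_s / weights_gb


def max_batched_tokens_per_second(weights_gb, bandwidth_gb_per_s, batch_size):
    """Optimistic aggregate ceiling: one weight pass serves every request
    in the batch. Ignores KV-cache traffic and compute limits (assumption)."""
    return batch_size * max_decode_tokens_per_second(weights_gb, bandwidth_gb_per_s)


# Illustrative numbers only: a 16 GB model on devices with 100, 1000,
# and 3000 GB/s of memory bandwidth.
for bandwidth in (100, 1000, 3000):
    single = max_decode_tokens_per_second(16, bandwidth)
    batched = max_batched_tokens_per_second(16, bandwidth, batch_size=32)
    print(f"{bandwidth} GB/s: single <= {single:.1f} tok/s, batch of 32 <= {batched:.0f} tok/s")
```

Under these assumptions, batching amortizes each pass over the weights across many requests, which is why high-concurrency serving benefits most from hardware that pairs large memory with high bandwidth.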