Quantization compresses a trained model by representing its weights, and in some schemes its activations, in lower-precision formats (INT8, FP8, INT4) than the original training precision (typically FP16 or BF16). Weight memory shrinks roughly in proportion to the bit width and, on hardware with faster low-precision arithmetic, inference speed increases.
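To make "representing weights in lower precision" concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration only, not the method used by any particular library; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor; guard against an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.42, -1.27, 0.003, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as one byte plus a single shared scale, and the small value 0.003 rounds to zero. That rounding is where the accuracy loss comes from. Practical schemes typically use finer-grained scales (per channel or per group) to limit it.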
The weights of a 70B-parameter model in BF16 occupy roughly 140 GB of VRAM (2 bytes per parameter); quantized to INT4 (0.5 bytes per parameter), they fit in approximately 35 GB, which makes the model runnable on a single A100 80GB or H100. The tradeoff is some accuracy loss, especially in sensitive layers such as attention output projections.
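The arithmetic behind these figures can be sketched as follows. This is a weight-only estimate that assumes decimal gigabytes; it ignores the KV cache, activations, and the per-group scales that quantized formats store alongside the weights. The helper name `weight_memory_gb` is hypothetical.

```python
# Bits per weight for each format named in the text.
BITS_PER_WEIGHT = {"BF16": 16, "FP16": 16, "INT8": 8, "FP8": 8, "INT4": 4}


def weight_memory_gb(num_params, fmt):
    """Weight-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_weight = BITS_PER_WEIGHT[fmt] / 8
    return num_params * bytes_per_weight / 1e9


params_70b = 70e9
for fmt in ("BF16", "INT8", "INT4"):
    print(f"{fmt}: {weight_memory_gb(params_70b, fmt):.0f} GB")
# prints:
# BF16: 140 GB
# INT8: 70 GB
# INT4: 35 GB
```

Real deployments need headroom beyond these numbers, so the estimate is best read as a floor for the weights alone rather than a total VRAM requirement.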
Common quantization methods and formats for LLM inference include GPTQ, AWQ, GGUF, and bitsandbytes. AIMC's fit-score algorithm does not assume quantization; the VRAM minimums it reports for each workload reflect unquantized (FP16/BF16) operation. Users who plan to run quantized models should treat AIMC's VRAM requirements as upper bounds.