Workload Concept

Quantization

Reducing the numerical precision of model weights to lower memory and compute requirements.

Quantization compresses a trained model by representing its weights, and in some schemes its activations, in a lower-precision format (INT8, FP8, INT4) than the original training precision (typically FP16 or BF16). Weight memory shrinks roughly in proportion to the bit width, and on hardware with higher low-precision throughput, inference speed increases as well.
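As a rough illustration of the idea, the sketch below applies symmetric per-tensor INT8 quantization to a handful of values and then restores them. It is a minimal example, not how GPTQ, AWQ, or any particular library works; the function names and sample weights are hypothetical, and real schemes typically use per-channel or per-group scales.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 if every value is zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric range [-127, 127].
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.82, -1.27, 0.031, 0.5, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored in one byte instead of two, at the cost of a rounding error of at most half the scale. Small values such as -0.004 can collapse to zero, which is one source of the accuracy loss associated with quantization.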

A 70B-parameter model in BF16 needs roughly 140 GB of VRAM for its weights alone (2 bytes per parameter); quantized to INT4 (0.5 bytes per parameter), the weights fit in approximately 35 GB, which makes the model runnable on a single A100 80GB or H100. The tradeoff is some accuracy loss, especially in sensitive layers such as attention output projections.
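The arithmetic behind these figures can be sketched as follows. This is a weights-only estimate using decimal gigabytes; it ignores the KV cache, activations, and the scale metadata that quantized formats store, so actual usage will be somewhat higher. The table and function name are hypothetical helpers, not part of AIMC.

```python
# Bits used to store one weight in each format.
BITS_PER_WEIGHT = {"BF16": 16, "FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params, fmt):
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_params * BITS_PER_WEIGHT[fmt] / 8
    return bytes_total / 1e9


for fmt in ("BF16", "INT8", "INT4"):
    print(f"{fmt}: {weight_memory_gb(70e9, fmt):.0f} GB")
```

For a 70B-parameter model this prints 140 GB, 70 GB, and 35 GB respectively.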

Common quantization methods and formats for LLM inference include GPTQ, AWQ, GGUF, and bitsandbytes. AIMC's fit-score algorithm does not assume quantization; the VRAM minimums it reports for each workload reflect unquantized (FP16/BF16) operation. Users who plan to quantize should treat AIMC's VRAM requirements as upper bounds.

Related Terms

Concepts directly relevant to Quantization.

FP8

8-bit floating-point format introduced for high-throughput inference and training.

FP16

16-bit half-precision floating-point format used heavily in deep learning.

LLM Inference

Serving trained large language models to user requests, often memory-bandwidth-bound.

Workloads Where Quantization Matters

GPU fit analysis for the workloads this concept directly influences.

LLM Inference


This definition is part of AIMC's reference glossary — 36 concepts across 10 categories.
