Workload Concept

LLM Inference

Serving trained large language models in response to user requests; often memory-bandwidth-bound.

LLM inference means running a trained model to answer user requests. Unlike training, it needs no backward pass, so there are no gradients or optimizer state to hold. The work is forward computation, and VRAM is consumed by model weights, the KV cache, and intermediate activations.
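Of the three VRAM consumers named above, the KV cache is the one that grows with context length and batch size. A minimal sketch of its size, assuming a standard transformer that stores one key and one value vector per layer, per KV head, per token. The function name and the model shape in the example are hypothetical, not taken from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Approximate KV cache size in bytes for a standard transformer decoder."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical shape: 32 layers, 32 KV heads of dimension 128,
# a 4096-token context, one request, FP16 (2 bytes per element).
size = kv_cache_bytes(32, 32, 128, 4096, 1, 2)
print(size / 2**30)  # 2.0 (GiB)
```

Models that use grouped-query attention have fewer KV heads than query heads, which shrinks this figure proportionally.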

The performance profile is unusual: token generation is often memory-bandwidth-bound rather than compute-bound. At small batch sizes, each generated token requires streaming essentially all of a dense model's weights from VRAM to the compute units, so higher memory bandwidth translates almost directly into more tokens per second. This is why the memory type (HBM3, HBM2e, or GDDR) and its bandwidth rating matter substantially for inference.
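The bandwidth argument can be turned into a rough ceiling on decode speed. The sketch below assumes a dense model, a single request, and that weight reads dominate; it ignores KV cache traffic and kernel overhead, so real throughput will be lower. The function name and the example figures are illustrative, not measurements of any particular GPU.

```python
def max_decode_tokens_per_second(param_count, bytes_per_param, bandwidth_gb_s):
    """Upper bound on single-stream tokens/s when every weight is read once per token."""
    weight_bytes = param_count * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_gb_s * 1e9  # treating 1 GB as 10^9 bytes
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical 7-billion-parameter model at FP16 (2 bytes per parameter)
# on a device with 1000 GB/s of memory bandwidth.
print(round(max_decode_tokens_per_second(7e9, 2, 1000), 1))  # 71.4
```

Doubling the bandwidth, or halving the bytes per parameter through quantization, doubles this ceiling. Batching several requests raises aggregate throughput because one pass over the weights then serves multiple tokens.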

Inference commonly runs at FP16 or BF16, or is quantized to FP8, INT8, or INT4 to reduce memory footprint and increase throughput. AIMC's /for/llm-inference hub ranks GPUs with a 12 GB VRAM minimum: smaller but still useful models can be served on workstation or upper-tier consumer cards.
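To see how precision interacts with a 12 GB card, here is a rough footprint check. It counts weights only, at nominal bytes per parameter; real quantized formats add a small overhead for scales and zero points. The helper names and the 2 GB headroom reserved for KV cache and activations are assumptions made for illustration.

```python
# Nominal storage cost per parameter for each precision named above.
BYTES_PER_PARAM = {"FP16": 2.0, "BF16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}


def weight_footprint_gb(param_count, precision):
    """Weight memory in GB (10^9 bytes), ignoring quantization metadata."""
    return param_count * BYTES_PER_PARAM[precision] / 1e9


def fits_in_vram(param_count, precision, vram_gb, headroom_gb=2.0):
    """True if weights plus an assumed headroom fit within the given VRAM."""
    return weight_footprint_gb(param_count, precision) + headroom_gb <= vram_gb


# Hypothetical 7-billion-parameter model checked against a 12 GB card.
for precision in ("FP16", "INT8", "INT4"):
    gb = weight_footprint_gb(7e9, precision)
    print(precision, gb, fits_in_vram(7e9, precision, vram_gb=12))
```

Under these assumptions the FP16 weights alone (14 GB) exceed 12 GB, while the INT8 and INT4 versions fit with room to spare.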