Throughput measures the rate at which a system processes work — typically tokens-per-second for LLM inference, images-per-second for diffusion models, or queries-per-second for embedding workloads.
In LLM inference, throughput is usually expressed as tokens-per-second per user (T/s/user) or aggregate across concurrent users. Higher batch sizes increase aggregate throughput but typically also increase TTFT and per-user latency. Memory bandwidth, compute throughput at the inference precision, and KV-cache management all influence the observed value.
Throughput is the primary economic metric for production inference: cost per million tokens equals price per GPU-hour divided by throughput. AIMC reports price per GPU-hour; throughput depends on the model, software stack, and batch configuration.