Performance Metric

Throughput

How many operations (tokens, images, queries) a system processes per second.

Throughput measures the rate at which a system processes work — typically tokens-per-second for LLM inference, images-per-second for diffusion models, or queries-per-second for embedding workloads.

In LLM inference, throughput is usually expressed either per user, as tokens per second per user (T/s/user), or in aggregate across all concurrent users. Larger batch sizes increase aggregate throughput but typically also increase TTFT and per-user latency. Memory bandwidth, compute throughput at the inference precision, and KV-cache management all influence the observed value.
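
The distinction between aggregate and per-user throughput can be sketched in a few lines of Python. This is a minimal illustration, not a benchmark harness: the `BatchRun` type, its field names, and the numbers in the example are all hypothetical, and it assumes every concurrent user receives an equal share of the generated tokens.

```python
from dataclasses import dataclass


@dataclass
class BatchRun:
    """One measured run (hypothetical structure for illustration)."""
    concurrent_users: int   # requests decoded together in the batch
    tokens_generated: int   # total output tokens across all users
    wall_seconds: float     # elapsed wall-clock time for the run


def aggregate_throughput(run: BatchRun) -> float:
    """Total tokens per second across all concurrent users."""
    return run.tokens_generated / run.wall_seconds


def per_user_throughput(run: BatchRun) -> float:
    """Tokens per second seen by one user, assuming an equal split."""
    return aggregate_throughput(run) / run.concurrent_users


# Made-up numbers: batching raises the aggregate rate
# while each individual user sees a lower rate.
single = BatchRun(concurrent_users=1, tokens_generated=500, wall_seconds=10.0)
batched = BatchRun(concurrent_users=8, tokens_generated=2400, wall_seconds=10.0)

for name, run in (("single", single), ("batched", batched)):
    print(name, aggregate_throughput(run), per_user_throughput(run))
```

In this invented example the batched run has the higher aggregate throughput and the lower per-user throughput, which is the trade-off described above.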

Throughput is the primary economic metric for production inference: cost per million tokens equals price per GPU-hour divided by the number of tokens, in millions, processed per hour, where tokens per hour is throughput in tokens per second multiplied by 3,600. AIMC reports price per GPU-hour; throughput depends on the model, software stack, and batch configuration.
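
Because price is quoted per hour and throughput per second, the cost calculation needs a unit conversion. The sketch below shows one way to do it; the function name, the `num_gpus` parameter, and the example price and throughput are illustrative assumptions, not AIMC figures.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(
    price_per_gpu_hour: float,
    tokens_per_second: float,
    num_gpus: int = 1,
) -> float:
    """Cost of one million tokens, in the same currency as the GPU price.

    tokens_per_second is the aggregate throughput of the whole deployment,
    and num_gpus is how many GPUs that deployment rents (assumed parameter).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    hourly_cost = price_per_gpu_hour * num_gpus
    return hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical inputs: one GPU at 2.00 per hour sustaining 1,000 tokens/s.
print(round(cost_per_million_tokens(2.00, 1000.0), 4))  # 0.5556
```

Doubling throughput at a fixed hourly price halves the cost per million tokens, which is why the same GPU can be cheap or expensive per token depending on the model, software stack, and batch configuration.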

Related Terms

Concepts directly relevant to Throughput.

Latency

End-to-end time between a request and a complete response.

TTFT (Time to First Token)

Latency from receiving an LLM prompt to producing the first output token.

Memory Bandwidth

How fast a GPU can read and write its VRAM, measured in gigabytes per second.

TFLOPS

Trillions of floating-point operations per second; the standard measure of GPU compute throughput.

Workloads Where Throughput Matters

GPU fit analysis for the workloads this concept directly influences.

LLM Inference


This definition is part of AIMC's reference glossary — 36 concepts across 10 categories.
