Performance Metric

Latency

End-to-end time between a request and a complete response.

Latency measures the time between a request entering a system and the corresponding response being complete. For LLM inference, this is typically expressed as request latency (total time to generate a complete response) or as per-token latency once generation begins.

Latency components include network round-trip, queueing delay, prompt-encoding (TTFT), and generation time (proportional to output length divided by per-token throughput). For real-time applications like chat, low TTFT plus reasonable throughput matters more than peak batch-mode throughput.

Latency-sensitive deployments often prefer GPUs with high memory bandwidth even if they have lower aggregate TFLOPS than alternatives. AIMC's fit-score algorithm doesn't directly weight latency; users with strict latency targets should select hardware with strong memory bandwidth ratings.

Related Terms

Concepts directly relevant to Latency.

Throughput

How many operations (tokens, images, queries) a system processes per second.

TTFT (Time to First Token)

Latency from receiving an LLM prompt to producing the first output token.

Memory Bandwidth

How fast a GPU can read and write its VRAM, measured in gigabytes per second.

Workloads Where Latency Matters

GPU fit analysis for the workloads this concept directly influences.

Llm Inference

Ranked GPUs →

This definition is part of AIMC's reference glossary — 36 concepts across 10 categories.

Browse full glossary