Latency is the time between a request entering a system and its response completing. For LLM inference, it is typically reported either as request latency (the total time to produce a complete response) or as per-token latency (the time between successive output tokens once generation begins).
Latency components include network round-trip time, queueing delay, prompt processing (prefill, which largely determines time to first token, or TTFT), and generation time (roughly the output length in tokens divided by decode throughput in tokens per second). For real-time applications such as chat, low TTFT combined with adequate per-request decode throughput matters more than peak batch-mode throughput.
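The breakdown above can be sketched as simple arithmetic. This is a rough illustration rather than a measurement method: the function name, parameter names, and sample numbers are all hypothetical, and it assumes TTFT covers everything that happens before the first output token appears (network, queueing, and prompt processing).

```python
def estimate_request_latency(
    network_rtt_s: float,
    queue_delay_s: float,
    prefill_s: float,
    output_tokens: int,
    decode_tokens_per_s: float,
) -> dict[str, float]:
    """Return a rough latency breakdown in seconds for one request."""
    if decode_tokens_per_s <= 0:
        raise ValueError("decode_tokens_per_s must be positive")
    # Assumption: TTFT is everything before the first output token appears.
    ttft_s = network_rtt_s + queue_delay_s + prefill_s
    # Generation time: output length divided by decode throughput.
    generation_s = output_tokens / decode_tokens_per_s
    return {
        "ttft_s": ttft_s,
        "generation_s": generation_s,
        "total_s": ttft_s + generation_s,
    }


# Illustrative numbers only; they do not describe any real deployment.
breakdown = estimate_request_latency(
    network_rtt_s=0.05,
    queue_delay_s=0.10,
    prefill_s=0.35,
    output_tokens=200,
    decode_tokens_per_s=50.0,
)
print(breakdown)
```

With these sample inputs, generation time (about 4 s) dominates TTFT (about 0.5 s), which is why long responses feel slow even when the first token arrives quickly.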
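Latency-sensitive deployments often prefer GPUs with high memory bandwidth, even if they offer lower aggregate TFLOPS than alternatives, because token-by-token decoding at small batch sizes is usually limited by how quickly model weights can be read from memory rather than by compute. AIMC's fit-score algorithm does not directly weight latency, so users with strict latency targets should select hardware with strong memory bandwidth ratings.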
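One common back-of-envelope argument for the memory bandwidth preference is that, when decoding a single request, each new token requires reading roughly all model weights from GPU memory once. Under that assumption, weight size divided by memory bandwidth gives a lower bound on per-token latency. The sketch below uses hypothetical names and numbers, ignores KV-cache traffic, batching, and kernel overheads, and is unrelated to how AIMC computes its fit score.

```python
def min_decode_time_per_token_s(
    param_count: float,
    bytes_per_param: float,
    mem_bandwidth_gb_s: float,
) -> float:
    """Lower bound on per-token decode time if weight reads are the bottleneck."""
    if mem_bandwidth_gb_s <= 0:
        raise ValueError("mem_bandwidth_gb_s must be positive")
    # Total bytes of weights that must be streamed for each generated token.
    model_bytes = param_count * bytes_per_param
    # Convert GB/s to bytes/s (1 GB = 1e9 bytes here).
    bandwidth_bytes_s = mem_bandwidth_gb_s * 1e9
    return model_bytes / bandwidth_bytes_s


# Hypothetical example: 7 billion parameters at 2 bytes each, 1000 GB/s device.
floor_s = min_decode_time_per_token_s(7e9, 2.0, 1000.0)
print(f"per-token floor: {floor_s * 1000:.1f} ms")
```

In this example the floor is about 14 ms per token, and doubling the memory bandwidth halves it, whereas additional TFLOPS would not change this bound at all.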