TTFT (Time to First Token)

Latency from receiving an LLM prompt to producing the first output token.

TTFT, or Time to First Token, measures the latency between an inference request arriving at the server and the first output token being produced. It's a critical metric for interactive applications like chatbots, where any delay before output begins feels unresponsive to users.

TTFT is dominated by the prompt-encoding phase — computing attention across the entire input context — before any generation can begin. Long contexts increase TTFT roughly linearly; a 128K-token prompt may take several seconds to encode even on H100-class hardware.

Production LLM serving stacks (vLLM, TGI, TensorRT-LLM) heavily optimize TTFT through batching, prefill scheduling, and KV-cache management. AIMC tracks GPU rental pricing; throughput characteristics like TTFT depend on software stack choices in addition to hardware.

Related Terms

Concepts directly relevant to TTFT (Time to First Token).

LLM Inference

Serving trained large language models to user requests, often memory-bandwidth-bound.

Throughput

How many operations (tokens, images, queries) a system processes per second.

Latency

End-to-end time between a request and a complete response.

Workloads Where TTFT (Time to First Token) Matters

GPU fit analysis for the workloads this concept directly influences.

Llm Inference

Ranked GPUs →

This definition is part of AIMC's reference glossary — 36 concepts across 10 categories.

Browse full glossary