TTFT, or Time to First Token, measures the latency between an inference request arriving at the server and the first output token being produced. It's a critical metric for interactive applications like chatbots, where any delay before output begins feels unresponsive to users.
TTFT is dominated by the prompt-encoding phase — computing attention across the entire input context — before any generation can begin. Long contexts increase TTFT roughly linearly; a 128K-token prompt may take several seconds to encode even on H100-class hardware.
Production LLM serving stacks (vLLM, TGI, TensorRT-LLM) heavily optimize TTFT through batching, prefill scheduling, and KV-cache management. AIMC tracks GPU rental pricing; throughput characteristics like TTFT depend on software stack choices in addition to hardware.