The KV (key-value) cache is a fundamental optimization in transformer inference. During autoregressive generation, each new token attends to all previous tokens. Without caching, the model would recompute the Key and Value projections for the entire context at every step.
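Instead, the KV cache stores each token's Key and Value projections the first time they are computed: the prompt's projections are written in the initial forward pass, and every generated token appends its own. Each decoding step then only projects the newest token and attends over the cached entries, which reduces the per-token compute from O(n²) to O(n) in the context length n and makes practical inference possible. However, the cache itself consumes VRAM in proportion to context length, batch size, model depth, and the width of the K/V projections (number of K/V heads times head dimension).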
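The caching step can be illustrated with a minimal single-head sketch in plain Python. It is a toy under stated assumptions, not how any particular framework implements it: the function names (`decode_with_cache`, `decode_without_cache`), the random weights, and the tiny dimensions are all made up for illustration, and real models add multiple heads, layers, and batched tensors. The sketch checks that both paths produce the same outputs while counting how many tokens had to be projected to K/V.

```python
import math
import random

def matvec(x, W):
    """Project vector x (length d_in) with W (d_in rows, d_out columns)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def decode_without_cache(xs, Wq, Wk, Wv):
    """Recompute K and V for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(1, len(xs) + 1):
        keys = [matvec(x, Wk) for x in xs[:t]]
        values = [matvec(x, Wv) for x in xs[:t]]
        projections += t  # t tokens projected again at this step
        outputs.append(attend(matvec(xs[t - 1], Wq), keys, values))
    return outputs, projections

def decode_with_cache(xs, Wq, Wk, Wv):
    """Project only the newest token and append it to the cache."""
    k_cache, v_cache = [], []
    outputs, projections = [], 0
    for x in xs:
        k_cache.append(matvec(x, Wk))
        v_cache.append(matvec(x, Wv))
        projections += 1  # one token projected per step
        outputs.append(attend(matvec(x, Wq), k_cache, v_cache))
    return outputs, projections

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, n_tokens = 4, 6  # illustrative sizes only
xs = random_matrix(n_tokens, d_model)
Wq, Wk, Wv = (random_matrix(d_model, d_model) for _ in range(3))

naive_out, naive_proj = decode_without_cache(xs, Wq, Wk, Wv)
cached_out, cached_proj = decode_with_cache(xs, Wq, Wk, Wv)

same = all(abs(a - b) < 1e-9
           for row_a, row_b in zip(naive_out, cached_out)
           for a, b in zip(row_a, row_b))
print(same, naive_proj, cached_proj)  # True 21 6
```

For six tokens the uncached loop projects 1 + 2 + ... + 6 = 21 tokens, while the cached loop projects 6. The gap grows quadratically with sequence length, which is the saving described above.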
For a 70B-parameter model at 128K context, the KV cache can exceed 30 GB per request. This is why serving frameworks such as vLLM use PagedAttention, which allocates cache memory in fixed-size blocks much as an operating system pages virtual memory, and why grouped-query attention (sharing each K/V head across a group of query heads) is now standard in production LLMs.
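A back-of-the-envelope calculation makes the scale concrete. The sketch below assumes a hypothetical 70B-class configuration (80 layers, 64 query heads, 8 K/V heads under grouped-query attention, head dimension 128, 16-bit cache entries). These numbers are illustrative assumptions, not taken from the text above, and `kv_cache_bytes` is a helper written for this example rather than part of any library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing both a Key and a Value
    vector per token, per layer, per K/V head.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

GIB = 2 ** 30
SEQ_LEN = 128 * 1024  # "128K" context, taken here as 131,072 tokens

# Assumed configuration: 80 layers, head_dim 128, 16-bit entries.
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=SEQ_LEN)

print(f"grouped-query (8 K/V heads): {gqa / GIB:.1f} GiB")   # 40.0 GiB
print(f"full multi-head (64 K/V heads): {mha / GIB:.1f} GiB")  # 320.0 GiB
```

Under these assumptions a single 128K-token request needs about 40 GiB of cache even with grouped-query attention, and eight times that if every query head kept its own K/V head. Halving `bytes_per_elem` (for example, an 8-bit cache) would halve both figures.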