Architecture

KV Cache

Memory caching technique that stores transformer attention keys and values to avoid recomputing them at every token generation step.

The KV (Key-Value) cache is a fundamental optimization in transformer inference. During autoregressive generation, each new token attends to all previous tokens. Without caching, the model would recompute the Key and Value projections for the entire context at every step.

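Instead, the KV cache stores each token's Key and Value projections the first time they are computed (for the prompt, during the initial prefill pass) and reuses them for all subsequent tokens, so each decoding step only has to project the newest token. This reduces the per-token compute from O(n²) to O(n) in the context length n, which makes long-context inference practical. However, the cache itself consumes VRAM in proportion to context length, batch size, number of layers, and the per-layer K/V width (number of KV heads × head dimension).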
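The caching step described above can be sketched in a few lines of plain Python. This is a minimal single-head, single-layer illustration, not how any particular framework implements it: the names `KVCache`, `decode_step_cached`, and `decode_step_uncached` are hypothetical, the weights are random, and the "tokens" are random vectors standing in for hidden states. It shows that projecting only the newest token and reusing stored keys and values gives the same attention output as recomputing everything.

```python
import math
import random


def matvec(W, x):
    # Multiply matrix W (a list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    # Scaled dot-product attention of one query over all keys/values.
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


class KVCache:
    # Hypothetical container: one list of key vectors, one of value vectors.
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step_cached(x_new, Wq, Wk, Wv, cache):
    # Project only the new token; earlier K/V come from the cache.
    cache.append(matvec(Wk, x_new), matvec(Wv, x_new))
    return attend(matvec(Wq, x_new), cache.keys, cache.values)


def decode_step_uncached(xs, Wq, Wk, Wv):
    # Recompute K and V for the whole context at every step.
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


rng = random.Random(0)
d = 4  # toy head dimension (assumption for the example)


def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]


Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
for t in range(len(tokens)):
    cached = decode_step_cached(tokens[t], Wq, Wk, Wv, cache)
    full = decode_step_uncached(tokens[:t + 1], Wq, Wk, Wv)
    assert all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
               for a, b in zip(cached, full))

print("cached and uncached outputs match for", len(tokens), "steps")
```

In the cached path each step performs one K and one V projection, while the uncached path performs one per context token. The price is the stored `keys` and `values` lists, which grow by one entry per generated token.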

For a 70B-parameter model at 128K context, the KV cache can exceed 30 GB per request. This is why VRAM-efficient serving frameworks such as vLLM use PagedAttention, which manages cache memory in fixed-size blocks much like operating-system virtual memory pages, and why grouped-query attention (sharing each K/V head across a group of query heads) is now standard in production LLMs.
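The size of the cache can be estimated with simple arithmetic. The sketch below assumes a hypothetical 70B-class configuration (80 layers, head dimension 128, 64 query heads, 8 K/V heads under grouped-query attention) and 16-bit cache entries. These numbers are illustrative assumptions, not taken from this entry, and the helper `kv_cache_bytes` is a made-up name.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, batch_size=1, bytes_per_elem=2):
    # The leading 2 counts one tensor for keys plus one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)


GIB = 1024 ** 3
ctx = 128 * 1024  # 128K tokens

# Assumed configuration: 80 layers, head_dim 128, fp16/bf16 cache entries.
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, context_len=ctx)
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, context_len=ctx)

print(f"GQA (8 KV heads):  {gqa / GIB:.1f} GiB per request")   # 40.0 GiB
print(f"MHA (64 KV heads): {mha / GIB:.1f} GiB per request")   # 320.0 GiB
```

Under these assumptions, sharing K/V heads cuts the cache by a factor of 8, yet a single 128K-token request still needs tens of gigabytes, which is why block-based cache management matters for serving.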

Related Terms

Concepts directly relevant to KV Cache.

Transformer

Neural network architecture that underpins modern large language models and many vision models.

Attention (Self-Attention)

The mechanism that lets transformer models weigh which parts of the input matter most for each output.

LLM Inference

Serving trained large language models to user requests, often memory-bandwidth-bound.

Memory Bandwidth

How fast a GPU can read and write its VRAM, measured in gigabytes per second.

Workloads Where KV Cache Matters

GPU fit analysis for the workloads this concept directly influences.

LLM Inference


This definition is part of AIMC's reference glossary — 36 concepts across 10 categories.
