Attention is the core mechanism in transformer models that allows each position in the sequence to dynamically focus on other positions. Self-attention computes three matrices — Query, Key, and Value — and uses dot-product similarity to determine how much each token should influence the others.
The standard attention operation is computationally expensive: it scales quadratically with sequence length, which is why long-context inference can be substantially slower than short-context. Production deployments rely heavily on optimized attention kernels (FlashAttention, FlashAttention-2, PagedAttention) to reduce memory bandwidth pressure and achieve practical throughput.
Multi-head attention runs several attention operations in parallel with different learned projections, capturing different types of relationships in the input. Grouped-query attention (used in Llama-3, Mistral, and others) shares Key/Value projections across multiple Query heads to reduce KV-cache memory pressure during inference.