Attention (Self-Attention)

The mechanism that lets transformer models weigh which parts of the input matter most for each output.

Attention is the core mechanism in transformer models that lets each position in a sequence dynamically focus on other positions. Self-attention projects the input into three matrices (Query, Key, and Value), scores every Query against every Key with a scaled dot product, and normalizes the scores with a softmax. The resulting weights determine how much each token's Value contributes to the output at every other position.
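The computation described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production kernel: the function names, the random weights, and the toy dimensions (4 tokens, width 8) are all assumptions made for the example, and it includes the standard scaling of the scores by the square root of the key dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""
    q = x @ w_q  # Query: what each token is looking for
    k = x @ w_k  # Key: what each token offers for matching
    v = x @ w_v  # Value: the content that gets mixed into the output
    # Dot-product similarity between every Query and every Key.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Each row becomes a probability distribution over positions.
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Toy input: hypothetical sizes chosen only for the example.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)      # one output vector per token
print(weights.shape)  # one weight per (query token, key token) pair
```

Note that `weights` has one entry for every pair of tokens, so its size grows with the square of the sequence length.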

The standard attention operation is computationally expensive: its compute and memory cost scale quadratically with sequence length, which is why long-context inference can be substantially slower than short-context inference. Production deployments rely heavily on optimized implementations to achieve practical throughput. FlashAttention and FlashAttention-2 avoid writing the full attention matrix to GPU memory, which reduces memory bandwidth pressure, while PagedAttention stores the KV cache in fixed-size blocks to reduce memory fragmentation.
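To make the quadratic scaling concrete, the following sketch estimates the size of the attention score matrix when it is fully materialized, as in a naive implementation. The helper name, the head count of 32, and the 2-byte (FP16/BF16) element size are assumptions chosen for illustration, not figures for any particular model.

```python
def attention_scores_bytes(seq_len, num_heads, bytes_per_element=2):
    """Bytes needed to hold one layer's full attention score matrices.

    Each head stores one score per (query, key) pair, so the total is
    num_heads * seq_len * seq_len elements.
    """
    return num_heads * seq_len * seq_len * bytes_per_element

# Hypothetical configuration: 32 heads, 2-byte elements.
for n in (1024, 2048, 4096, 8192):
    mib = attention_scores_bytes(n, num_heads=32) / 2**20
    print(f"seq_len={n:5d}  scores per layer: {mib:8.0f} MiB")
```

Doubling the sequence length quadruples the figure, which is the overhead that kernels avoiding a materialized score matrix are designed to remove.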

Multi-head attention runs several attention operations in parallel, each with its own learned projections, so that different heads can capture different types of relationships in the input. Grouped-query attention (used in Llama 3, Mistral, and others) shares each Key/Value head across a group of Query heads, which shrinks the KV cache and reduces memory pressure during inference.
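The effect of sharing Key/Value heads can be estimated with simple arithmetic. The sketch below compares KV-cache size for full multi-head attention against a grouped-query layout. The configuration (32 layers, 32 Query heads, 8 Key/Value heads, head dimension 128, 8192 tokens, 2-byte elements) is a hypothetical example, and the helper name is made up for this illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=2):
    """KV-cache size for one sequence.

    The leading factor of 2 accounts for storing both a Key and a Value
    tensor in every layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_element)

# Hypothetical model: 32 layers, head_dim 128, 8192-token context.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

print(f"multi-head (32 KV heads):   {mha / 2**30:.1f} GiB")
print(f"grouped-query (8 KV heads): {gqa / 2**30:.1f} GiB")
```

With these assumed numbers, cutting the Key/Value heads from 32 to 8 shrinks the per-sequence cache by the same factor of four, since the number of Query heads does not enter the calculation.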

Related Terms

Concepts directly relevant to Attention (Self-Attention).

Transformer

Neural network architecture that underpins modern large language models and many vision models.

KV Cache

Memory caching technique that stores transformer attention keys and values to avoid recomputing them at every token generation step.

LLM Inference

Serving trained large language models to user requests, often memory-bandwidth-bound.

Workloads Where Attention (Self-Attention) Matters

GPU fit analysis for the workloads this concept directly influences.

LLM Inference

This definition is part of AIMC's reference glossary — 36 concepts across 10 categories.