The transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need." It replaced recurrent architectures (LSTMs, RNNs) for most sequence modeling tasks and now underpins virtually all modern large language models, from GPT and Claude to Llama, Mistral, and Gemini.
The architecture's core innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when producing each output. This enables parallel processing of entire sequences (unlike RNNs, which process tokens sequentially) and captures long-range dependencies effectively.
Transformer training and inference are both compute-intensive and memory-intensive. The compute scales quadratically with sequence length for the attention layers, which is why context-length improvements (FlashAttention, Ring Attention, KV-cache optimization) are an active area of research. AIMC's LLM workload fit-scoring assumes transformer-based architectures as the default.