RAG (Retrieval-Augmented Generation)

Architecture pattern that grounds LLM responses in retrieved documents from a knowledge base.

Retrieval-Augmented Generation is an architectural pattern that combines a vector retrieval step with LLM generation. Given a user query, the system first embeds the query and retrieves the most relevant documents from a vector database, ranked by semantic similarity between the query embedding and the stored document embeddings. It then places those documents in the LLM's context window before generating a response.
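The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory `DOCS` list stands in for a vector database, and the names `retrieve` and `build_prompt` are hypothetical. The final LLM call is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Place the retrieved passages in the context ahead of the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical knowledge base (assumption: three short documents).
DOCS = [
    "Qdrant is a vector database used for similarity search.",
    "GPUs accelerate matrix multiplication in neural networks.",
    "Embedding models map text to vectors for semantic retrieval.",
]

QUERY = "which vector database handles similarity search"
top = retrieve(QUERY, DOCS, k=1)
prompt = build_prompt(QUERY, top)
```

In a real system, `embed` would call an embedding model and `retrieve` would query the vector database's nearest-neighbor index. The overall shape (embed, rank, assemble context, generate) stays the same.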

RAG mitigates two LLM limitations: knowledge cutoffs (the model cannot know about events after its training data ends) and hallucinations (the model invents plausible-sounding facts when it lacks the information). Retrieved documents ground the response in verifiable source material, although the model can still misread or ignore what was retrieved.
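One common way to make grounding checkable is to ask the model to cite source ids and then confirm that the cited ids were actually retrieved. The sketch below assumes a bracketed `[id]` citation convention. The function names and the prompt wording are hypothetical, and the check confirms only that citations point at known sources, not that the cited text supports the claim.

```python
import re

def grounded_prompt(question, sources):
    """Build a prompt that restricts the answer to the given (id, text) sources."""
    lines = [f"[{sid}] {text}" for sid, text in sources]
    return (
        "Answer using only the sources below and cite them as [id]. "
        "If the sources do not contain the answer, say you do not know.\n\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )

def cited_ids(answer):
    """Extract every bracketed source id mentioned in an answer."""
    return set(re.findall(r"\[([\w-]+)\]", answer))

def cites_only_known_sources(answer, sources):
    """True if the answer cites at least one source and no unknown ones."""
    known = {sid for sid, _ in sources}
    cited = cited_ids(answer)
    return bool(cited) and cited <= known

# Hypothetical retrieved sources for illustration.
SOURCES = [
    ("doc-1", "The 2.0 release shipped in March."),
    ("doc-2", "The 2.0 release added streaming support."),
]
```

An answer such as "It shipped in March [doc-1]." passes the check, while an answer with no citation or with an id that was never retrieved fails and can be flagged for review.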

The compute profile is hybrid: embedding generation is throughput-bound and runs well on modest GPUs, while LLM inference is the heavier component. Production RAG systems often serve embeddings on consumer or workstation cards (for cost) and the LLM on datacenter cards (for latency and context length). Vector databases such as Pinecone, Weaviate, and Qdrant handle the retrieval layer.

Related Terms

Concepts directly relevant to RAG (Retrieval-Augmented Generation).

LLM Inference

Serving trained large language models to user requests, often memory-bandwidth-bound.

Workloads Where RAG (Retrieval-Augmented Generation) Matters

GPU fit analysis for the workloads this concept directly influences.

This definition is part of AIMC's reference glossary — 36 concepts across 10 categories.
