Retrieval-Augmented Generation (RAG) is an architectural pattern that combines a vector retrieval step with LLM generation. Given a user query, the system first retrieves relevant documents from a vector database, ranking them by the semantic similarity between embeddings of the query and of the documents, then places those documents in the LLM's context window before generating a response.
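The retrieve-then-generate flow can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory `DOCS` list stands in for a vector database, and all function names are hypothetical. The final LLM call is left out; the sketch stops at the prompt that would be sent to the model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Place the retrieved documents ahead of the question in the prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical corpus standing in for the contents of a vector database.
DOCS = [
    "Qdrant is a vector database used for similarity search.",
    "Embedding models map text to dense vectors.",
    "The office cafeteria closes at three in the afternoon.",
]

if __name__ == "__main__":
    question = "What is a vector database?"
    print(build_prompt(question, retrieve(question, DOCS, k=1)))
```

A real system would replace `embed` with a learned embedding model and `retrieve` with a query against the vector database, but the order of operations stays the same: embed the query, rank documents, then assemble the context.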
RAG addresses two LLM limitations: knowledge cutoffs (the model can't know about events after training) and hallucinations (the model invents facts when uncertain). Retrieved documents ground the response in verifiable source material.
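One way to make the grounding checkable is to number the retrieved sources in the prompt and ask the model to cite them. The sketch below assumes that convention; the prompt wording, the `[n]` citation format, and both function names are illustrative choices, not a standard. The second function only checks that cited numbers refer to sources that were actually supplied; it does not verify that the cited source supports the claim.

```python
import re


def grounded_prompt(question: str, sources: list[str]) -> str:
    """Build a prompt that numbers each source and asks for citations."""
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sources, start=1))
    return (
        "Answer using only the numbered sources below and cite them like [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )


def invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited source numbers that do not match any supplied source."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)
```

For example, with two sources supplied, an answer that cites `[1]` and `[3]` would be flagged for the out-of-range `[3]`.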
The compute profile is mixed: embedding generation is throughput-bound and runs well on modest GPUs, while LLM inference is the heavier component. Production RAG systems often serve embeddings on consumer or workstation cards to keep costs down and run the LLM on datacenter cards to meet latency targets and support long contexts. Vector databases such as Pinecone, Weaviate, and Qdrant handle the retrieval layer.
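Because embedding is throughput-bound, a common way to serve it is to group texts into batches so that each model call processes many inputs at once. The sketch below shows that batching loop under stated assumptions: `embed_batch` is a hypothetical callable wrapping whatever embedding model is in use, `fake_embed_batch` is a placeholder that returns made-up two-number vectors so the example runs without a model, and the default batch size of 64 is arbitrary.

```python
from typing import Callable, Iterator


def batches(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_corpus(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Embed texts batch by batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batches(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed_batch(batch: list[str]) -> list[list[float]]:
    """Placeholder for a real model call: [character count, word count]."""
    return [[float(len(t)), float(len(t.split()))] for t in batch]
```

The right batch size depends on the model and on the memory of the card serving it, so in practice it is tuned by measurement rather than fixed up front.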