RAG — retrieval-augmented generation — is the pattern where you give an LLM access to a searchable knowledge base, retrieve the most relevant chunks at query time, stuff them into the prompt as context, and let the model generate an answer grounded in those documents. It is the dominant architecture for AI products that need to answer questions over private or up-to-date data.
The basic pipeline:
- Index time — chunk every document into 200–800 token pieces, embed each chunk with an embedding model, store the vectors in a vector database (pgvector, Pinecone, Qdrant, Weaviate, Chroma).
- Query time — embed the user's question with the same model, find the top-k most similar chunks, optionally rerank them with a cross-encoder for precision.
- Generation — assemble a prompt that includes the retrieved chunks plus the question, send it to the LLM, return the answer (often with citations linking back to the source chunks).
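The three stages above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production recipe: `embed` is a toy hashed bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, words stand in for tokens, and every function name here is hypothetical. The final LLM call is left out; the sketch stops at the assembled prompt.

```python
import hashlib
import math
import re
from collections import Counter

DIM = 1024  # size of the toy embedding space (assumption, not a recommendation)

def chunk(text, max_words=40):
    """Split text into pieces of at most max_words words (words stand in for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy embedding: hashed bag of words, L2-normalised. A real system calls an embedding model."""
    vec = [0.0] * DIM
    for word, count in Counter(re.findall(r"[a-z0-9]+", text.lower())).items():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(docs):
    """Index time: chunk each document and store (doc_id, chunk_text, vector)."""
    return [(doc_id, piece, embed(piece))
            for doc_id, text in docs.items()
            for piece in chunk(text)]

def retrieve(index, question, k=2):
    """Query time: embed the question the same way, return the top-k chunks by cosine similarity."""
    q = embed(question)
    scored = [(sum(a * b for a, b in zip(q, vec)), doc_id, piece)
              for doc_id, piece, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

def build_prompt(question, hits):
    """Generation: number the retrieved chunks and place them ahead of the question."""
    context = "\n".join(f"[{n}] ({doc_id}) {piece}"
                        for n, (_, doc_id, piece) in enumerate(hits, start=1))
    return ("Answer using only the context below and cite sources as [n].\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

docs = {
    "refunds": "Refunds are issued within 14 days of purchase.",
    "shipping": "Orders ship from the warehouse in two business days.",
}
index = build_index(docs)
question = "When are refunds issued?"
print(build_prompt(question, retrieve(index, question, k=1)))
```

Swapping `embed` for a real embedding API and the list for a vector database changes the plumbing but not the shape of the pipeline.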
RAG solves three problems plain LLMs cannot:
- Knowledge cutoff — the model can answer about events after its training date as long as the knowledge base is fresh.
- Private data — the model never sees your data during training; it only sees the chunks retrieved at inference time, and the retrieval layer can scope those per tenant.
- Citations — answers can link back to the source chunks they were generated from, which matters enormously for compliance, trust and debugging.
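To make the last two points concrete, here is a hedged sketch of per-tenant scoping and citation tracking. It assumes the retriever has already attached a similarity score to each chunk; the `Chunk` type and the helper names are hypothetical and do not come from any particular library.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    source: str   # document title or URL the chunk came from
    text: str
    score: float  # similarity score, assumed to be precomputed by the retriever

def scoped_top_k(chunks, tenant_id, k=3):
    """Filter by tenant before ranking so other tenants' chunks never reach the prompt."""
    visible = [c for c in chunks if c.tenant_id == tenant_id]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]

def cited_context(hits):
    """Number each chunk so the model can cite [n]; return the context and an n -> source map."""
    context = "\n".join(f"[{n}] {c.text}" for n, c in enumerate(hits, start=1))
    sources = {n: c.source for n, c in enumerate(hits, start=1)}
    return context, sources

def resolve_citations(answer, sources):
    """Map [n] markers in the answer back to sources, ignoring markers that match no chunk."""
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    return [sources[n] for n in cited if n in sources]

chunks = [
    Chunk("acme", "handbook.pdf", "PTO is 25 days.", 0.90),
    Chunk("globex", "policy.pdf", "PTO is 10 days.", 0.95),
    Chunk("acme", "faq.md", "Carry-over is 5 days.", 0.40),
]
hits = scoped_top_k(chunks, "acme", k=2)
context, sources = cited_context(hits)
print(resolve_citations("You get 25 days [1].", sources))
```

Filtering before ranking is the design choice that matters here: a higher-scoring chunk from another tenant is excluded outright rather than merely outranked.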
In 2026 the RAG landscape has matured beyond the naive recipe:
- Hybrid search — combine vector similarity with keyword (BM25) search; usually more accurate than vectors alone, especially for exact terms such as product names, error codes and IDs.
- Reranking — a small cross-encoder model rescores the top 50 candidates and keeps the top 5; cheap and high impact.
- Query rewriting — let the LLM reformulate the user's question into something better suited for retrieval.
- Multi-hop RAG — when a question needs information from multiple documents, retrieve, reason, retrieve again.
- Graph RAG — use a knowledge graph over the documents for structured retrieval; Microsoft's GraphRAG popularised the approach.
- Agentic RAG — let an agent decide when and how to retrieve, sometimes searching multiple sources or calling tools.
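Two of these techniques, hybrid search and reranking, are easy to sketch. The example below fuses a vector ranking and a keyword ranking with reciprocal rank fusion (RRF), which is one common way to combine them and not the only one, then rescores candidate texts and keeps the best few. The constant `k=60` is a conventional default, `overlap_score` is a toy stand-in for a cross-encoder, and all function names are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

def rerank(question, candidates, score_fn, keep=5):
    """Rescore candidate texts with score_fn(question, text) and keep the best `keep`."""
    ranked = sorted(candidates, key=lambda text: score_fn(question, text), reverse=True)
    return ranked[:keep]

def overlap_score(question, text):
    """Toy relevance score: fraction of question words found in the text.
    A real reranker would score each (question, text) pair with a cross-encoder model."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / (len(q_words) or 1)

# Hybrid search: ids ranked by a vector index and by BM25 (both lists are made up).
vector_ranking = ["a", "b", "c"]
keyword_ranking = ["c", "a", "d"]
print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))

# Reranking: rescore a small candidate pool and keep the top 2.
pool = ["billing cycle info", "how to reset your password", "password policy"]
print(rerank("reset password", pool, overlap_score, keep=2))
```

RRF works on ranks, not raw scores, so cosine similarities and BM25 scores never have to be normalised onto a common scale.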
For a US engineering team in 2026, the build-vs-buy decision usually lands on building the RAG pipeline yourself for control, with off-the-shelf vector DBs and embedding APIs underneath. A working v1 takes a couple of days. A production-grade system with reranking, evaluation and per-tenant isolation takes weeks.