How Does Retrieval-Augmented Generation (RAG) Work?
A 7-minute read
RAG lets AI answer questions using your documents, not just its training data. Instead of memorizing facts, it looks them up at the moment you ask. Here's how the lookup actually works.
Your company’s AI assistant is trained on data from 2023. You need it to answer questions about your internal sales playbook from 2025. The solution isn’t retraining the model from scratch. It’s retrieval-augmented generation — giving the AI a search engine for your documents so it can look things up instead of guessing.
The short answer
Retrieval-Augmented Generation (RAG) is a technique that connects a language model to an external knowledge store. When you ask a question, the system first retrieves the most relevant documents or passages from that store, then passes them to the language model along with your question. The model generates an answer based on the retrieved content, not solely from what it memorized during training. The result is an AI that can answer questions about documents it was never trained on, updated in real time as those documents change.
The full picture
The problem RAG solves
Language models are trained on a fixed snapshot of text. A model trained through early 2024 has no knowledge of what happened in late 2024. A model trained on the public internet has no knowledge of your company’s internal documents. And models that do “know” something often state it confidently even when they are wrong — a phenomenon called hallucination.
The traditional solution was fine-tuning: taking a base model and training it further on your specific data. Fine-tuning is expensive, time-consuming, and static. Once done, the model’s knowledge is frozen again. If your documents change next week, you need to fine-tune again.
RAG takes a different approach. Instead of baking knowledge into the model, it gives the model a way to look up knowledge on demand. The model itself doesn’t need to know your sales playbook. It just needs to know how to use retrieved text when answering a question about it.
The two phases: indexing and retrieval
RAG works in two distinct phases. The first happens before any user asks anything. The second happens at query time.
Phase 1: Indexing
Your documents — PDFs, web pages, internal wikis, spreadsheets — are processed and converted into a searchable format. This involves three steps.
First, documents are split into chunks. A 50-page PDF becomes hundreds of smaller passages, each typically a few hundred tokens long (roughly a paragraph or two). This is necessary because language models have a limited context window: they can only process so much text at once. Chunking ensures that only the relevant slices of a document are retrieved, rather than forcing the model to read the whole thing.
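Chunking can be as simple as a sliding window over words. Here is a minimal sketch; the 400-word size and 50-word overlap are illustrative defaults, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks (sizes are illustrative)."""
    words = text.split()
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the document
    return chunks

doc = " ".join(f"word{i}" for i in range(1000))  # stand-in for a 1,000-word document
print(len(chunk_text(doc)))  # 3 overlapping chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which softens the chunking-artifact problem discussed later.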
Second, each chunk is converted into a vector embedding. A vector embedding is a list of numbers (often several hundred to a few thousand of them) that represents the semantic meaning of the text. Two passages that mean similar things will have similar embeddings, even if they use completely different words. A passage about “revenue growth” and one about “increasing sales” will be close together in vector space.
A 2020 paper from Facebook AI on dense passage retrieval (DPR) established the foundation for this approach: embedding-based retrieval substantially outperforms keyword search for questions that don’t match the exact phrasing of the documents.
Third, the embeddings are stored in a vector database — a specialized database optimized for finding the closest vectors to a query vector. Common options include Pinecone, Weaviate, Chroma, and pgvector (built into PostgreSQL).
Phase 2: Retrieval and generation
When you ask a question, the system converts your question into a vector embedding using the same embedding model used for the documents. It then searches the vector database for the chunks whose embeddings are closest to your question’s embedding — typically the top 3 to 10 most relevant passages.
Those retrieved chunks are assembled into a context block and passed to the language model alongside your original question. The model’s prompt looks something like this: “Based on the following information: [retrieved passages]. Answer this question: [your question].”
The model then generates an answer using the retrieved passages as its primary source material. Because the relevant text is right there in its context, it doesn’t need to rely on memorized patterns. If the answer isn’t in the retrieved passages, a well-prompted system will say so rather than make something up.
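The whole query phase can be sketched end to end. In this toy version, a bag-of-words “embedding” stands in for a real embedding model, and the vocabulary, chunks, and top-k value are all illustrative:

```python
import numpy as np

# End-to-end sketch of the query phase. A toy bag-of-words "embedding" stands
# in for a real embedding model; vocabulary, chunks, and top_k are illustrative.
VOCAB = ["revenue", "growth", "sales", "quarter", "pricing", "contract"]

def embed(text: str) -> np.ndarray:
    words = text.lower().split()
    v = np.array([words.count(w) for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

chunks = [
    "Revenue growth for the quarter beat the plan.",
    "The contract includes new pricing terms.",
    "Sales headcount will expand next year.",
]
index = np.stack([embed(c) for c in chunks])  # the "vector database"

def retrieve(question: str, top_k: int = 2) -> list[str]:
    scores = index @ embed(question)         # cosine similarity (rows are unit-norm)
    best = np.argsort(scores)[::-1][:top_k]  # indices of the top-k closest chunks
    return [chunks[i] for i in best]

question = "How did quarterly revenue perform?"
context = "\n".join(retrieve(question))
prompt = f"Based on the following information:\n{context}\n\nAnswer this question: {question}"
```

In a production system, the only structural change is that `embed` calls a trained embedding model and `index` lives in a vector database; the retrieve-then-assemble shape stays the same.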
Why vector search works better than keyword search
Traditional search matches keywords. If you search for “Q3 revenue performance” and the document says “third quarter sales results,” keyword search may not find it. Vector search finds it because the embeddings for those two phrases are mathematically close — the system understands they mean the same thing.
This matters enormously in practice. Users ask questions in their own words. Documents are written in someone else’s words. The semantic gap between question phrasing and document phrasing is where keyword search fails and vector search succeeds.
Retrieval benchmarks bear this out: vector search outperforms BM25 keyword search on open-domain question answering when question and document phrasing diverge, though BM25 remains a strong baseline when queries reuse the documents’ exact terms.
The retrieval-generation tension
The tricky part of RAG is the handoff from retrieval to generation. If the wrong chunks are retrieved, the model generates a confident-sounding answer based on irrelevant text. If too few chunks are retrieved, the model lacks necessary context. If too many are retrieved, the model’s context window fills up and it may fail to focus on the relevant parts.
This is why RAG systems are tuned. Key parameters include:
- Chunk size: Smaller chunks are more precise; larger chunks provide more context. Most systems use 256–512 tokens.
- Top-k: How many chunks to retrieve. More isn’t always better; irrelevant chunks confuse the model.
- Reranking: A second model scores the retrieved chunks for relevance before passing them to the main model, filtering out anything that seemed semantically close but isn’t actually useful.
- Hybrid search: Combining vector search with keyword search (BM25) catches cases where an exact term like a product code or person’s name is important. Pure vector search sometimes misses exact matches.
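Hybrid scoring can be sketched in a few lines. Here a crude set-overlap score stands in for BM25, and the 0.7/0.3 weighting is purely illustrative; production systems tune the weights or use rank fusion such as reciprocal rank fusion (RRF):

```python
# Hybrid search in miniature: blend a semantic score with an exact keyword-
# overlap score. The overlap score is a crude stand-in for BM25, and the
# alpha weight is illustrative.
def keyword_score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def hybrid_score(semantic: float, keyword: float, alpha: float = 0.7) -> float:
    return alpha * semantic + (1 - alpha) * keyword

# An exact product code earns little semantic credit but full keyword credit,
# so it still ranks well under the blended score.
print(hybrid_score(semantic=0.1, keyword=keyword_score("sku 12345", "order sku 12345 shipped")))
```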
Advanced RAG: beyond the basics
Simple RAG — retrieve, paste, generate — works well for many use cases. More sophisticated systems add layers.
Query rewriting improves retrieval by reformulating the user’s question before searching. A question like “what did we decide about pricing?” is vague. The system might rewrite it as “pricing strategy decisions Q4 2025” before searching, which returns more targeted results.
HyDE (Hypothetical Document Embeddings), developed by researchers at Carnegie Mellon, generates a hypothetical answer to the question first, embeds that hypothetical answer, and uses the embedding to search. The intuition is that a hypothetical answer’s embedding will be closer to the real answer’s embedding than the question’s embedding is. This sounds circular but works surprisingly well in practice.
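Structurally, HyDE is a one-line change to the query path: embed a drafted answer instead of the question. In this sketch, `llm` and `embed` are placeholders for real model calls:

```python
# HyDE in outline. Both functions below are placeholders: in a real system,
# `llm` calls a language model and `embed` calls an embedding model.
def llm(prompt: str) -> str:
    # Placeholder for a model drafting a plausible (possibly wrong) answer.
    return "Our Q4 pricing decision was to move to usage-based tiers."

def embed(text: str) -> list[float]:
    # Placeholder: real embeddings are dense vectors from a trained model.
    return [float(len(w)) for w in text.split()][:4]

def hyde_query_vector(question: str) -> list[float]:
    hypothetical = llm(f"Write a short passage answering: {question}")
    return embed(hypothetical)  # search with the answer's embedding, not the question's
```

Note that the hypothetical answer is discarded after embedding; only the retrieved real passages reach the final generation step.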
Agentic RAG combines retrieval with agent-style planning. Instead of one retrieval step, the system reasons about what to search for, retrieves, evaluates whether it found enough, and searches again if needed. This approach handles complex multi-part questions where a single retrieval pass misses relevant context.
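The loop itself is simple; the intelligence lives in the callbacks. A sketch, where `retrieve`, `rewrite`, and `is_sufficient` are placeholders (in practice the latter two involve model calls):

```python
# Agentic retrieval loop: search, judge whether the accumulated context is
# sufficient, refine the query, and search again, up to a round limit.
def agentic_retrieve(question, retrieve, rewrite, is_sufficient, max_rounds=3):
    query, context = question, []
    for _ in range(max_rounds):
        context += retrieve(query)
        if is_sufficient(question, context):
            break
        query = rewrite(question, context)  # reformulate for the next pass
    return context
```

The `max_rounds` cap matters: without it, a question the corpus cannot answer would loop forever.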
Where RAG runs into limits
RAG isn’t magic. It has failure modes worth understanding.
Knowledge that isn’t in the documents. RAG can only retrieve what’s been indexed. If no indexed document contains the answer, the model may hallucinate anyway, especially if asked to infer something not explicitly stated.
Numerical and tabular data. Embedding-based retrieval treats text as bags of meaning. It struggles with precise numerical comparisons across documents — “which of these five contracts has the highest liability cap?” often requires a different approach, like converting tables to structured databases and querying them directly.
Long-range dependencies. A question whose answer requires synthesizing across many different documents — rather than finding one passage — is harder for RAG. The top-k chunks may not include every relevant piece.
Chunking artifacts. If an important passage is split across two chunks, and only one chunk is retrieved, the answer may be incomplete. Chunking strategy matters more than it seems.
Analyses of RAG failure modes consistently find that the most common cause of wrong answers is retrieving plausible-sounding but incorrect chunks: the retrieval was close, but not close enough.
Why it matters
RAG is the architecture behind most enterprise AI deployments today. When a company says their AI “knows” their product documentation, their legal contracts, or their customer history, what they usually mean is that they’ve built a RAG pipeline.
The reason RAG dominates enterprise AI isn’t just accuracy. It’s control. With RAG, you know exactly what information the model has access to. You can audit which passages it retrieved. You can update your document store and the AI’s knowledge updates immediately. You can restrict retrieval to specific document categories for different users.
For anyone building AI tools on top of their own data — internal assistants, customer support bots, research tools — RAG is the practical starting point. It is faster to build than fine-tuning, more controllable than prompting alone, and more accurate than either when implemented well.
Common misconceptions
“RAG eliminates hallucination.” RAG reduces hallucination by grounding the model in real documents, but it doesn’t eliminate it. A model can still misinterpret retrieved text, over-generalize from it, or fail to say “I don’t know” when the retrieved passages don’t actually contain the answer.
“You need a vector database to do RAG.” You need semantic search. A vector database is the most common tool, but for small document sets, you can embed documents and do similarity search in memory with a few lines of NumPy. The vector database matters at scale.
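For a small corpus, that in-memory search really is only a few lines. Here random unit vectors stand in for real document embeddings so the sketch is self-contained:

```python
import numpy as np

# In-memory semantic search. `doc_embeddings` would come from an embedding
# model; random unit vectors stand in here for illustration.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(100, 384))
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def top_k(query_embedding: np.ndarray, k: int = 5) -> np.ndarray:
    scores = doc_embeddings @ query_embedding  # cosine similarity for unit vectors
    return np.argsort(scores)[::-1][:k]        # indices of the k closest documents

query = doc_embeddings[42]  # pretend this is an embedded question
print(top_k(query))         # document 42 itself will be the top hit
```

A brute-force matrix multiply like this is fine for thousands of documents; approximate nearest-neighbor indexes and vector databases earn their keep at millions.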
“Fine-tuning is better than RAG for company-specific knowledge.” Not necessarily. Fine-tuning teaches a model style, tone, and task formatting. RAG teaches it specific facts. They solve different problems. For factual knowledge that changes over time, RAG is almost always the right choice.
“RAG is only for question-answering.” RAG is a retrieval mechanism. It’s useful anywhere you want a model’s output to be grounded in specific documents: summarization, drafting, analysis, code generation from internal documentation. The retrieval part generalizes beyond Q&A.