Language models are trained on a snapshot of the world. Once training ends, their knowledge freezes. Ask GPT-4 about something that happened last month, or ask it about a private internal document it's never seen, and it'll either hallucinate or admit it doesn't know. Retrieval-Augmented Generation — RAG — is the most practical solution to this problem. Instead of baking all knowledge into model weights, RAG lets a model look things up before answering.
The Core Idea
RAG consists of two stages that happen at inference time: retrieve, then generate. When a user asks a question, the system first searches a knowledge base for relevant context, then passes that context alongside the original question to the LLM. The model isn't guessing anymore; it's reading before it responds.
This might sound simple, and conceptually it is. The complexity lives in the retrieval step.
From Documents to Vectors
To search a knowledge base semantically — meaning you find relevant content, not just keyword matches — you need to convert text into vectors. A vector is just an array of numbers that represents the meaning of a piece of text in high-dimensional space. Similar content ends up with similar vectors.
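Closeness between vectors is usually measured with cosine similarity: the cosine of the angle between them. A small sketch of the computation (the 2-dimensional vectors here are toys; real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 means they point the
    # same direction (very similar meaning), 0.0 means orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # → 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0 (orthogonal)
```

Because embedding models place related texts in nearby directions, a high cosine score between a query vector and a chunk vector is the signal that the chunk is relevant.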
The process works like this: you take your source documents, split them into chunks, run each chunk through an embedding model, and store the resulting vectors in a vector database. At query time, you embed the user's question using the same model, then search the database for the chunks whose vectors are closest to the query vector.
The embedding model is doing a lot of heavy lifting here. "What does RAG stand for?" and "RAG stands for Retrieval-Augmented Generation" aren't an exact keyword match, but their vector representations will be close enough that cosine similarity surfaces the right document.
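The whole indexing-and-retrieval loop can be sketched with a toy embedding. The hashed bag-of-words vector below is a stand-in for a real embedding model (a real system would use a trained model such as a sentence-transformer, which captures meaning rather than word overlap), but the data flow is the same: chunk, embed, store, embed the query, rank by similarity.

```python
import hashlib
import math
import re

def toy_embed(text: str, dim: int = 1024) -> list[float]:
    # Stand-in for a real embedding model: hash each word into a fixed-size
    # bag-of-words vector. Unlike a trained model it only captures word
    # overlap, not meaning, but the pipeline around it is identical.
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Indexing: in production the (chunk, vector) pairs would live in a vector
# database; a plain list works for a sketch.
chunks = [
    "RAG stands for Retrieval-Augmented Generation.",
    "Embedding models map text to points in high-dimensional space.",
    "Bananas are rich in potassium.",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

# Query time: embed the question with the same model, rank chunks by
# similarity to the query vector, keep the best matches.
query_vec = toy_embed("What does RAG stand for?")
ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
top_chunk = ranked[0][0]
```

Swapping `toy_embed` for a real embedding model is what turns this from keyword overlap into semantic search; the storage and ranking logic doesn't change.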
Putting It Together: The Full Pipeline
Once you have the retrieved chunks, you inject them into the LLM prompt as context, together with an instruction telling the model to answer from that context alone.
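A minimal sketch of this final step. The `retrieve` and `call_llm` functions here are placeholders, not real APIs: `retrieve` would wrap the vector search described above, and `call_llm` would wrap whatever model client you actually use.

```python
def retrieve(query: str, k: int = 3) -> list[str]:
    # Placeholder for the vector search described above: embed the query,
    # return the nearest chunks. Hardcoded so the sketch runs standalone.
    knowledge_base = [
        "RAG stands for Retrieval-Augmented Generation.",
        "RAG systems retrieve context before generating an answer.",
    ]
    return knowledge_base[:k]

def call_llm(prompt: str) -> str:
    # Placeholder for a real model API call (OpenAI, Anthropic, a local
    # model); swap in whatever client you use.
    return "RAG stands for Retrieval-Augmented Generation."

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(chunks)
    # The final instruction constrains the model to the retrieved context
    # instead of letting it fall back on its training data.
    return (
        "Answer the question using only the context below. "
        "If the answer isn't in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def answer(question: str) -> str:
    return call_llm(build_prompt(question, retrieve(question)))

print(answer("What does RAG stand for?"))
```

Everything model-specific lives behind `call_llm`, so the same pipeline works with any LLM that accepts a text prompt.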
The instruction "if the answer isn't in the context, say you don't know" is doing important work. Without it, the model may fall back on its training data and hallucinate a confident answer. Grounding the model in retrieved context is only useful if you also constrain it to stay there.
