
How RAG Works: Giving Your AI a Memory

By Prayag

Language models are trained on a snapshot of the world. Once training ends, their knowledge freezes. Ask GPT-4 about something that happened last month, or ask it about a private internal document it's never seen, and it'll either hallucinate or admit it doesn't know. Retrieval-Augmented Generation — RAG — is the most practical solution to this problem. Instead of baking all knowledge into model weights, RAG lets a model look things up before answering.

The Core Idea

RAG consists of two stages that happen at inference time: retrieve, then generate. When a user asks a question, the system first searches a knowledge base for relevant context, then passes that context alongside the original question into the LLM. The model isn't guessing anymore — it's reading before it responds.

This might sound simple, and conceptually it is. The complexity lives in the retrieval step.

From Documents to Vectors

To search a knowledge base semantically — meaning you find relevant content, not just keyword matches — you need to convert text into vectors. A vector is just an array of numbers that represents the meaning of a piece of text in high-dimensional space. Similar content ends up with similar vectors.

The process works like this: you take your source documents, split them into chunks, run each chunk through an embedding model, and store the resulting vectors in a vector database. At query time, you embed the user's question using the same model, then search the database for the chunks whose vectors are closest to the query vector.

Python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Embed a small document store
docs = [
    "RAG stands for Retrieval-Augmented Generation.",
    "Vector databases store embeddings for fast similarity search.",
    "LLMs generate text based on a prompt."
]
doc_embeddings = [embed(doc) for doc in docs]

# Query
query = "What does RAG stand for?"
query_embedding = embed(query)

# Find most relevant doc
scores = [cosine_similarity(query_embedding, d) for d in doc_embeddings]
best_match = docs[np.argmax(scores)]
print(f"Most relevant: {best_match}")

The embedding model is doing a lot of heavy lifting here. "What does RAG stand for?" and "RAG stands for Retrieval-Augmented Generation" aren't an exact keyword match, but their vector representations will be close enough that cosine similarity surfaces the right document.

Putting It Together: The Full Pipeline

Once you have the retrieved chunks, you inject them into the LLM prompt as context. A minimal implementation looks like this:

Python
def answer_with_rag(question: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)
    prompt = f"""Use the following context to answer the question.
If the answer isn't in the context, say you don't know.

Context:
{context}

Question: {question}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

The instruction "if the answer isn't in the context, say you don't know" is doing important work. Without it, the model may fall back on its training data and hallucinate a confident answer. Grounding the model in retrieved context is only useful if you also constrain it to stay there.
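To close the loop, the retrieval half can be sketched as a small top-k function whose output is what you'd pass as `context_chunks` to `answer_with_rag`. This is a sketch: `retrieve_top_k` is a hypothetical helper, and the toy 2-dimensional vectors stand in for real embeddings.

```python
import numpy as np

def retrieve_top_k(query_embedding: list[float],
                   doc_embeddings: list[list[float]],
                   docs: list[str], k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query vector."""
    q = np.array(query_embedding)
    scores = [
        float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        for d in map(np.array, doc_embeddings)
    ]
    top = np.argsort(scores)[::-1][:k]  # indices of highest-scoring chunks first
    return [docs[i] for i in top]

# Toy 2-d vectors stand in for real embedding-model output
docs = ["about RAG", "about vector DBs", "about LLMs"]
vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top_chunks = retrieve_top_k([0.9, 0.1], vecs, docs, k=2)
print(top_chunks)
```

In a real pipeline the only change is that the vectors come from an embedding model and the nearest-neighbor search runs inside a vector database rather than a Python list comprehension.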


Q&A

Question 1
Why not just use a longer context window instead of RAG?
Modern models support 128k or even 1M token context windows, which raises a fair question. The short answer is cost and latency: embedding your entire knowledge base into every request is prohibitively expensive and slow at scale. RAG lets you retrieve the 3–5 most relevant chunks rather than passing 10,000 pages every time. There's also a quality argument — models tend to perform better when given focused, relevant context rather than a massive haystack to search through.
Question 2
What's chunking, and why does it matter?
Chunking is how you split source documents before embedding them. Chunk too large and your vectors become blurry, averaging across too many ideas. Chunk too small and you lose the context that makes a passage meaningful. Most teams land between 256 and 512 tokens per chunk, with some overlap between chunks to avoid cutting a sentence across a boundary. Getting chunking right is often where RAG performance lives or dies.
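A minimal sketch of chunking with overlap, splitting on words for simplicity (a real pipeline would count tokens with the model's tokenizer instead):

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks, each sharing `overlap` words
    with its predecessor so no sentence is orphaned at a boundary."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# 120 words -> chunks of 50 that advance 40 words at a time
text = " ".join(f"word{i}" for i in range(120))
chunks = chunk_text(text, chunk_size=50, overlap=10)
print(len(chunks))
```

The overlap means the last few words of each chunk reappear at the start of the next one, which is exactly the redundancy that keeps a split sentence retrievable from either side of the boundary.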
Question 3
Does RAG work for structured data like databases or spreadsheets?
Not as naturally. RAG was designed for unstructured text. For structured data, the better pattern is usually Text-to-SQL — using an LLM to translate natural language into a database query, then returning the result. Hybrid systems that do both exist, but they add significant complexity.
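The Text-to-SQL pattern in miniature: the query string below is hardcoded where a real system would get it from the LLM, so only the execution half of the pattern is shown, runnable against an in-memory SQLite database.

```python
import sqlite3

question = "How many orders are over $100?"
# Stand-in for LLM output; a real system would prompt the model with the
# table schema and the question, then take the SQL it generates.
generated_sql = "SELECT COUNT(*) FROM orders WHERE total > 100"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)",
                 [(50.0,), (120.0,), (300.0,)])

count = conn.execute(generated_sql).fetchone()[0]
print(f"{question} -> {count}")
```

The appeal over RAG here is precision: a SQL query returns an exact answer computed by the database, rather than whatever the model can infer from retrieved text snippets. The added complexity is validating and sandboxing the generated SQL before running it.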