25  RAG: Retrieval-Augmented Generation

A language model trained through mid-2024 cannot answer questions about a document you wrote last month, a customer conversation from yesterday, or a product specification that exists only in your internal wiki. Its knowledge is frozen at training time.

Retrieval-Augmented Generation (RAG) addresses this by inserting relevant context into the prompt at inference time. We store documents in a searchable index, retrieve the most relevant passages when a query arrives, and provide those passages to the model alongside the question. The model answers based on the retrieved context rather than relying on parametric memory alone.

This chapter covers the complete RAG pipeline in depth: chunking strategies, embedding models, vector store construction, retrieval, reranking, prompt construction, and evaluation. The related chapters on prompt engineering (13) and LLM safety (14) apply directly to RAG systems.

25.1 The RAG Architecture

A RAG system has two phases.

Ingestion (offline, runs once or on a schedule):

1. Load source documents — PDFs, web pages, database records, markdown files
2. Split documents into chunks of manageable size
3. Encode each chunk as a dense vector embedding
4. Store (chunk text, embedding, metadata) in a vector database

Retrieval and generation (online, runs per query):

1. Encode the query using the same embedding model
2. Search the vector database for the most similar chunks
3. Optionally rerank the retrieved chunks with a more expensive model
4. Construct a prompt: system instructions + retrieved context + query
5. Call the LLM and return the response

The quality of the final answer depends on every step in both phases. A well-designed generator cannot compensate for poor retrieval; good retrieval is wasted if the prompt assembles the context poorly. We will look at each step in turn.
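The two phases can be condensed into a minimal in-memory sketch. The `embed` function below is a deliberately toy stand-in (unit-normalized character-trigram counts) for a real embedding model, and `VectorStore` is an illustrative class, not a real vector database:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: unit-normalized character-trigram counts.
    A stand-in for a real model such as all-MiniLM-L6-v2."""
    grams = Counter(text.lower()[i:i + 3] for i in range(len(text) - 2))
    norm = math.sqrt(sum(v * v for v in grams.values()))
    return {g: v / norm for g, v in grams.items()}

def similarity(a, b):
    """Dot product of sparse unit vectors, i.e. cosine similarity."""
    return sum(v * b.get(g, 0.0) for g, v in a.items())

class VectorStore:
    """Ingestion stores (chunk, embedding, metadata); search is nearest-neighbor."""
    def __init__(self):
        self.records = []

    def ingest(self, chunks, metadata=None):
        for chunk in chunks:
            self.records.append((chunk, embed(chunk), metadata or {}))

    def search(self, query, top_k=3):
        q = embed(query)   # must use the same embedding function as ingestion
        ranked = sorted(self.records, key=lambda r: similarity(q, r[1]), reverse=True)
        return [chunk for chunk, _, _ in ranked[:top_k]]

store = VectorStore()
store.ingest([
    "RFM scores customers on recency, frequency, and spend.",
    "Demographic segmentation groups customers by age and income.",
])
print(store.search("How does RFM scoring work?", top_k=1))
```

Every real pipeline in this chapter replaces `embed` with a trained model and `VectorStore` with a proper index, but the control flow is exactly this.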

RAG is not a complete substitute for fine-tuning. RAG excels at injecting factual knowledge from a large corpus; fine-tuning is better for teaching the model a specific reasoning style, domain-specific terminology, or a task format. The two are often combined.

25.2 Document Chunking

The chunk is the atomic unit of RAG. Every retrieval decision is made at the chunk level, so chunking strategy directly determines what context the model sees.

Fixed-size chunking splits on character or token count with an overlap window. Simple and predictable, but cuts mid-sentence and mid-paragraph, producing semantically incomplete fragments. Good as a baseline.

Recursive character splitting tries sentence and paragraph boundaries first, falling back to character count only when necessary. This is the most common strategy in LangChain’s RecursiveCharacterTextSplitter and produces better semantic coherence than pure fixed-size splitting.

Semantic chunking clusters sentences by embedding similarity, creating chunks that group semantically related content regardless of position. More expensive but produces the most coherent chunks for long, discursive documents.

Document-aware chunking uses the document structure — headings, tables, code blocks — to create chunks that align with logical sections. Requires parsing the document format (HTML, Markdown, PDF structure) but preserves context that flat character splitting destroys.

Two hyperparameters matter most: chunk size (100–1000 tokens, depending on document type and context window) and overlap (typically 10–20% of chunk size, to prevent truncating content that spans a boundary).

Code
# Chunking strategies — implemented from scratch to show the logic

sample_document = """
Introduction to Customer Segmentation

Customer segmentation is the process of dividing customers into groups based on shared
characteristics. The goal is to tailor marketing, product development, and service delivery
to the distinct needs of each segment.

Common Segmentation Approaches

Demographic segmentation groups customers by age, income, education, or occupation.
It is the oldest and most widely used form. The limitation is that people with identical
demographics often behave very differently.

Behavioral segmentation groups customers by their actions: purchase frequency, recency,
average order value, or product category affinity. This is more predictive than demographic
segmentation but requires transactional data.

RFM Analysis

RFM (Recency, Frequency, Monetary) scoring is a behavioral segmentation technique that
assigns each customer three scores based on when they last purchased, how often they
purchase, and how much they spend. Customers are then ranked and grouped into segments.
"""

def fixed_size_chunk(text, chunk_size=200, overlap=40):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end].strip())
        start += chunk_size - overlap
    return [c for c in chunks if c]

def paragraph_chunk(text, max_chars=400, overlap_chars=60):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if len(current) + len(para) < max_chars:
            current = (current + "\n\n" + para).strip()
        else:
            if current:
                chunks.append(current)
            # Carry the tail of the previous chunk forward so content that
            # spans a chunk boundary appears in both chunks
            tail = current[-overlap_chars:] if current else ""
            current = (tail + "\n\n" + para).strip() if tail else para
    if current:
        chunks.append(current)
    return chunks

fixed = fixed_size_chunk(sample_document)
para  = paragraph_chunk(sample_document)

print(f"Fixed-size chunks: {len(fixed)}")
for i, c in enumerate(fixed[:2]):
    print(f"  [{i}] {c[:80]}...")
print()
print(f"Paragraph-aware chunks: {len(para)}")
for i, c in enumerate(para[:2]):
    print(f"  [{i}] {c[:80]}...")

25.3 Embedding Models

An embedding model encodes a text chunk as a dense vector of floating-point numbers, typically 384–1536 dimensions. Chunks that are semantically similar end up close in this vector space; dissimilar chunks are far apart. Retrieval is then a nearest-neighbor search.
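Because retrieval reduces to nearest-neighbor search, the similarity function matters. The code cells later in this chapter pass normalize_embeddings=True, which makes cosine similarity a plain dot product — a quick numpy check with made-up vectors:

```python
import numpy as np

a = np.array([0.3, 0.8, 0.1])   # made-up 3-d "embeddings"
b = np.array([0.2, 0.9, 0.4])

# Cosine similarity from the definition
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The same value as a bare dot product of pre-normalized vectors
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

print(float(cosine), float(a_hat @ b_hat))
```

Normalizing once at encoding time is why the later cells can rank chunks with a single matrix-vector product.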

sentence-transformers provides a large catalog of pre-trained embedding models that run locally. all-MiniLM-L6-v2 is fast and compact (22M parameters, 384 dimensions). BAAI/bge-large-en-v1.5 is heavier but more accurate.

API-based embeddings (OpenAI text-embedding-3-small, Cohere embed-english-v3.0, Voyage AI) avoid local compute but add latency and cost at query time. They are particularly strong for cross-lingual and long-document tasks.

The embedding model used at ingestion and at query time must be identical — if they differ, the similarity computations are meaningless. This is the most common source of silent bugs when upgrading a RAG system.
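A cheap guard against that silent bug — an illustrative pattern, not a library feature — is to record the model name in the index metadata at ingestion and verify it at query time:

```python
def stamp_index(index_metadata, model_name):
    """Record at ingestion time which embedding model built this index."""
    index_metadata["embedding_model"] = model_name
    return index_metadata

def check_query_model(index_metadata, model_name):
    """Refuse to search if the query-side model differs from the ingestion-side one."""
    indexed = index_metadata.get("embedding_model")
    if indexed != model_name:
        raise ValueError(
            f"index built with {indexed!r} but query uses {model_name!r}; "
            "re-embed the corpus or switch the query model"
        )

meta = stamp_index({}, "all-MiniLM-L6-v2")
check_query_model(meta, "all-MiniLM-L6-v2")   # passes silently
```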

Install: pip install sentence-transformers

Code
try:
    from sentence_transformers import SentenceTransformer
    st_available = True
except ImportError:
    st_available = False
    print("pip install sentence-transformers to run this cell")

if st_available:
    model = SentenceTransformer("all-MiniLM-L6-v2")

    sentences = [
        "Customer segmentation divides customers into groups by shared traits.",
        "RFM analysis scores customers on recency, frequency, and monetary value.",
        "Machine learning models can predict customer churn probability.",
        "Revenue declined 8% due to increased competition in the SMB segment.",
        "Behavioral segmentation groups customers by their purchasing actions.",
    ]

    embeddings = model.encode(sentences, normalize_embeddings=True)
    print(f"Embedding shape: {embeddings.shape}  (n_sentences x dim)")
    print()

    query = "How do we group customers by their buying behavior?"
    q_emb = model.encode(query, normalize_embeddings=True)

    scores = embeddings @ q_emb   # cosine similarity (vectors are normalized)
    ranked = sorted(zip(scores, sentences), reverse=True)

    print(f"Query: \"{query}\"\n")
    print("Ranked results:")
    for score, sent in ranked:
        print(f"  {score:.3f}  {sent}")

25.5 Reranking

First-pass retrieval with dense embeddings is fast but imprecise. The embedding model encodes a query and each document independently, then measures similarity. It cannot model the fine-grained interaction between query tokens and document tokens.

Reranking adds a second pass that is slower but more accurate. A cross-encoder takes the query and a single candidate document as a pair, processes them jointly through a transformer, and outputs a relevance score. Because the query and document tokens attend to each other, the model can capture subtle interactions that the bi-encoder (embedding) model misses.

The typical pattern: retrieve top-50 with the fast embedding model, rerank with the cross-encoder, pass top-5 to the generator. This gives the accuracy of the cross-encoder at a fraction of the cost of running it over the full corpus.
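The pattern is mechanical enough to express as one function. Here `fast_score` and `slow_score` are stand-ins for the bi-encoder and cross-encoder; the toy word-overlap scorer exists only to make the sketch runnable:

```python
def two_stage_retrieve(query, corpus, fast_score, slow_score, first_k=50, final_k=5):
    """Score everything with the cheap function, then rerank only the
    top first_k survivors with the expensive one."""
    first_pass = sorted(corpus, key=lambda doc: fast_score(query, doc), reverse=True)
    candidates = first_pass[:first_k]
    reranked = sorted(candidates, key=lambda doc: slow_score(query, doc), reverse=True)
    return reranked[:final_k]

def word_overlap(q, d):
    """Toy scorer: number of shared lowercase words."""
    return len(set(q.lower().split()) & set(d.lower().split()))

docs = [
    "customer churn prediction model",
    "customer revenue report",
    "weather forecast model",
]
print(two_stage_retrieve("predict customer churn", docs, word_overlap, word_overlap,
                         first_k=2, final_k=1))
```

In a real pipeline, `fast_score` is a dot product against precomputed embeddings (so the first pass never re-encodes the corpus) and `slow_score` is a cross-encoder call.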

Popular options:

- cross-encoder/ms-marco-MiniLM-L-6-v2 (sentence-transformers): fast, good general performance
- Cohere Rerank API: managed, state-of-the-art, multilingual
- BAAI/bge-reranker-large: strong open-source reranker

Install: pip install sentence-transformers (cross-encoders are in the same package)

Code
if st_available:
    from sentence_transformers import CrossEncoder

    # Use a lightweight cross-encoder for reranking
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    query_rerank = "How do we measure whether customers are likely to leave?"

    # Simulate first-pass retrieval returning 5 candidates
    candidates = [
        "Churn prediction models identify customers likely to cancel.",
        "Net Promoter Score measures customer loyalty and satisfaction.",
        "Cohort analysis tracks groups of users over time from a shared start date.",
        "Customer lifetime value estimates long-term revenue per customer.",
        "RFM analysis scores customers on recency, frequency, and spend.",
    ]

    # Embedding-based scores (first pass)
    q_emb  = model.encode(query_rerank, normalize_embeddings=True)
    c_embs = model.encode(candidates, normalize_embeddings=True)
    bi_scores = (c_embs @ q_emb).tolist()

    # Cross-encoder scores (reranking)
    pairs = [(query_rerank, c) for c in candidates]
    ce_scores = cross_encoder.predict(pairs).tolist()

    import pandas as pd
    df = pd.DataFrame({"candidate": candidates, "bi_score": bi_scores, "ce_score": ce_scores})
    df["bi_rank"] = df["bi_score"].rank(ascending=False).astype(int)
    df["ce_rank"] = df["ce_score"].rank(ascending=False).astype(int)
    print(df[["bi_rank","ce_rank","candidate"]].sort_values("ce_rank").to_string(index=False))
    print()
    print("Cross-encoder reranking can change the top result — especially for nuanced queries.")

25.6 Prompt Construction for RAG

Once we have retrieved and optionally reranked chunks, we assemble them into the context window. How we do this affects both answer quality and citation reliability.

A typical RAG prompt structure:

[System]
You are a knowledge assistant. Answer the question using only the provided context.
If the context does not contain enough information, say "I don't know based on the
available documents." Do not make up information.

[Context]
<document id="1">
[text of chunk 1]
</document>
<document id="2">
[text of chunk 2]
</document>

[Question]
{user_question}

Key choices:

Context ordering: due to the “lost in the middle” phenomenon, models attend more to the beginning and end of long contexts. Place the most relevant chunk first, not in the middle.

Context compression: long chunks with irrelevant content dilute signal. Extractive compression (keeping only sentences that mention query terms) or abstractive compression (a separate LLM call to summarize the chunk) can improve answer quality.
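Extractive compression can be approximated with simple term overlap. This is a crude sketch — no stopword handling, no stemming; production systems usually score sentences with embeddings or a small model instead:

```python
import re

def compress_chunk(chunk, query, min_overlap=1):
    """Keep only sentences sharing at least min_overlap words with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", chunk.strip()):
        terms = set(re.findall(r"\w+", sentence.lower()))
        if len(terms & query_terms) >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)

chunk = ("RFM scores customers on recency, frequency, and spend. "
         "The company was founded in 1998. "
         "High-frequency customers get the top segment.")
print(compress_chunk(chunk, "How does RFM frequency scoring work?"))
```

The sentence about the founding year is dropped because it shares no terms with the query.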

Citations: tagging chunks with IDs allows the model to cite specific sources. Instruct the model to reference the document ID in its answer, then link citations to the original documents in the UI.
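Given the [1]-style citation format above, the cited IDs can be recovered from the answer with a regex for linking in the UI (a sketch that assumes that exact bracket format):

```python
import re

def extract_citations(answer):
    """Return unique cited document IDs like [1] or [2, 3], in order of first mention."""
    ids = []
    for group in re.findall(r"\[([\d,\s]+)\]", answer):
        for token in group.split(","):
            token = token.strip()
            if token.isdigit() and token not in ids:
                ids.append(token)
    return ids

answer = "RFM scores recency, frequency, and spend [1]. Segments are then ranked [2, 3]."
print(extract_citations(answer))   # ['1', '2', '3']
```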

Code
# End-to-end RAG pipeline
# NOTE: relies on `model` (the SentenceTransformer loaded in the embedding cell),
# plus `corpus` (a list of chunk strings) and `call_claude(system, user)` (an LLM
# call helper), both defined earlier in the chapter.

def build_rag_prompt(query, chunks, chunk_ids):
    context_blocks = ""
    for cid, chunk in zip(chunk_ids, chunks):
        context_blocks += f'<document id="{cid}">\n{chunk}\n</document>\n\n'

    system = (
        "You are a knowledge assistant. Answer the question using only the provided documents.\n"
        "If the documents do not contain enough information, say 'I don't know based on these documents.' "
        "Cite the document ID(s) that support your answer in square brackets, e.g. [1]."
    )
    user = f"Documents:\n{context_blocks}\nQuestion: {query}"
    return system, user

def rag_query(query, corpus_texts, top_k=3):
    if not st_available:
        return "[sentence-transformers not available]"

    # 1. Encode query + corpus
    q_emb  = model.encode(query, normalize_embeddings=True).astype("float32").reshape(1,-1)
    c_embs = model.encode(corpus_texts, normalize_embeddings=True).astype("float32")

    # 2. Retrieve top-k
    scores = (c_embs @ q_emb.T).flatten()
    top_idx = scores.argsort()[::-1][:top_k]
    retrieved = [corpus_texts[i] for i in top_idx]
    chunk_ids = [str(i+1) for i in top_idx]

    # 3. Build prompt
    system, user = build_rag_prompt(query, retrieved, chunk_ids)

    # 4. Generate
    return call_claude(system, user)

# Run a query against our small corpus
user_query = "What is RFM analysis and what does each letter stand for?"
print(f"Query: {user_query}")
print()
response = rag_query(user_query, corpus)
print("Answer:")
print(response)

25.7 Retrieval Evaluation

A RAG system has two distinct failure modes: retrieval failure (the right chunks are not retrieved) and generation failure (the chunks are there but the answer is wrong). Diagnosing which is failing requires evaluating both separately.

Retrieval metrics require a set of (query, relevant_document_ids) pairs as ground truth.

Metric        Definition
Precision@k   Fraction of top-k retrieved chunks that are relevant
Recall@k      Fraction of all relevant chunks captured in top-k
MRR           Mean Reciprocal Rank — average of 1/rank for the first relevant chunk
NDCG@k        Normalized Discounted Cumulative Gain — weights highly-ranked relevant chunks more
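The ranking metrics above translate directly into code. In this sketch, `retrieved` is the ranked list of chunk IDs the system returned and `relevant` is the ground-truth set for the query:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks captured in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(all_retrieved, all_relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant chunk per query."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

retrieved = ["d3", "d1", "d7"]     # system's ranking for one query
relevant = {"d1", "d2"}            # ground truth for that query
print(precision_at_k(retrieved, relevant, 3))   # 1 relevant of 3 retrieved
print(recall_at_k(retrieved, relevant, 3))      # 1 of 2 relevant captured
print(mrr([retrieved], [relevant]))             # first hit at rank 2 -> 0.5
```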

End-to-end metrics evaluate the generated answer quality. RAGAS (Retrieval Augmented Generation Assessment) provides four automated metrics:

  • Faithfulness: does the answer contain only information from the retrieved context?
  • Answer Relevance: does the answer address the question?
  • Context Precision: are the retrieved chunks relevant to the question?
  • Context Recall: do the retrieved chunks contain the information needed to answer?

Install: pip install ragas

25.8 Advanced Retrieval Techniques

Several techniques address failure modes in the basic retrieve-then-generate pipeline.

HyDE (Hypothetical Document Embeddings) generates a hypothetical answer to the query using the LLM, then retrieves chunks similar to that hypothetical answer rather than similar to the raw query. This is effective when queries are short and ambiguous but relevant documents are detailed and specific.

Multi-query retrieval rewrites the query in several different ways, retrieves for each variant, and combines the results with deduplication. It improves recall for ambiguous queries at the cost of additional LLM calls.

Contextual compression runs a secondary model over each retrieved chunk to extract only the sentences most relevant to the query, discarding irrelevant context before passing to the generator. This reduces context window usage and “lost in the middle” effects.

Self-RAG adds a reflection step: the model generates a draft answer, then evaluates whether it needs to retrieve more information, and retrieves again if so. This adaptive approach is more expensive but handles multi-hop questions more reliably.

Hybrid search combines dense retrieval (embeddings) with sparse retrieval (BM25 keyword matching) and merges rankings using Reciprocal Rank Fusion. It consistently outperforms either approach alone across a variety of document types.
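Reciprocal Rank Fusion is simple enough to state exactly: each document's fused score is the sum over rankings of 1/(k + rank), with k = 60 as a conventional smoothing constant. A sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs; each list contributes 1/(k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d1", "d4"]    # e.g. embedding-based ranking
sparse = ["d1", "d3", "d2"]   # e.g. BM25 ranking
print(reciprocal_rank_fusion([dense, sparse]))   # ['d1', 'd2', 'd3', 'd4']
```

Note that d1 outranks d2: consistent high placement across both lists beats d2's single first-place finish, which is exactly the behavior that makes fusion robust.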

25.9 Key Takeaways

  • RAG retrieves relevant document chunks at inference time and injects them into the prompt — bridging the gap between a model’s training cutoff and current information
  • Chunking strategy determines retrieval quality; paragraph-aware or semantic chunking outperforms fixed-size splitting for most document types
  • The embedding model used at ingestion and query time must be identical; a mismatch silently destroys retrieval quality
  • Reranking with a cross-encoder adds a second pass that captures query-document interaction — retrieve top-50, rerank, pass top-5 to the generator
  • Evaluate RAG systems on both retrieval (Precision@k, Recall@k, MRR) and generation (faithfulness, answer relevance via RAGAS)
  • Advanced techniques — HyDE, multi-query, hybrid search — each address a specific retrieval failure mode; profile before adding complexity

Recommended reading:

- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al., 2020
- RAGAS documentation: docs.ragas.io
- LangChain RAG cookbook: python.langchain.com/docs/tutorials/rag