26 LLM Applications: RAG, Prompt Engineering, and Fine-Tuning

Chapters 13.1–13.3 covered the mechanics of calling a large language model: the HuggingFace ecosystem, cloud APIs, and local inference. This chapter covers the three practical skills that determine whether an LLM application actually works — and what to do when it does not.

Every LLM has a knowledge cutoff: it knows nothing about events after its training date, and nothing about your organization’s specific data, documents, or processes. Beyond that, LLMs hallucinate — they generate plausible-sounding text that is factually wrong, confidently and without warning. This is not a bug that will be fixed; it is a structural property of how language models work.

The three techniques we cover each address a different aspect of this:

Retrieval-Augmented Generation (RAG): grounds the model’s answers in retrieved documents, addressing both the knowledge cutoff and hallucination problems
Prompt engineering: shapes the model’s output through careful instruction design
Fine-tuning: adapts a base model to a specific domain or behavioral style

26.1 Retrieval-Augmented Generation

RAG is an architecture that augments an LLM with a retrieval system. Instead of asking the model to answer from memory, we retrieve the most relevant documents from an external knowledge base, inject them into the prompt, and ask the model to generate an answer grounded in what we provided. The model’s role shifts from oracle to reader and synthesizer.

User Query
    |                         Knowledge Base
    |                    (documents, chunked)
    v                              |
[Embed query] ----similarity----> [Embed chunks]
    |                              |
    +----> [Top-k chunks] ---------+
                 |
    [Prompt = query + retrieved chunks]
                 |
              [LLM]
                 |
           [Grounded answer]

The retrieval step depends on text embeddings — dense vector representations that capture semantic meaning. Two texts with similar meaning will have embeddings close together in vector space (high cosine similarity), even if they share no words. A query about reducing staff will retrieve documents about layoffs and workforce restructuring even though those words are absent from the query.

Code

# Install: pip install sentence-transformers faiss-cpu

try:
    from sentence_transformers import SentenceTransformer
    import faiss
    HAVE_DEPS = True
except ImportError:
    HAVE_DEPS = False
    print('Run: pip install sentence-transformers faiss-cpu')

import numpy as np
import pandas as pd

if HAVE_DEPS:
    model = SentenceTransformer('all-MiniLM-L6-v2')

    sentences = [
        'The company announced record quarterly earnings.',
        'Q3 revenue exceeded analyst expectations by 15%.',
        'The CEO resigned amid accounting irregularities.',
        'A new product line targeting Gen Z will launch next year.',
        'The board approved a $500M share buyback program.',
    ]

    embeddings = model.encode(sentences)
    print(f'Embedding shape: {embeddings.shape}')  # (5, 384)

    from sklearn.metrics.pairwise import cosine_similarity
    sim = cosine_similarity(embeddings)
    df_sim = pd.DataFrame(sim.round(3),
                           index=[s[:40] for s in sentences],
                           columns=[s[:20]+'...' for s in sentences])
    print()
    print('Cosine similarity matrix:')
    print(df_sim)

26.2 Building a RAG Pipeline

A minimal RAG pipeline has four parts: ingestion, retrieval, prompt construction, and generation.

Ingestion splits documents into chunks, embeds each chunk, and stores the embeddings in a vector index. Chunk size matters. Chunks that are too small lose context; chunks that are too large dilute the signal. Typical sizes are 256–512 tokens with a 10–20% overlap so that content at chunk boundaries is not lost. LangChain’s RecursiveCharacterTextSplitter splits on \n\n, \n, then spaces and is a good default.

Retrieval embeds the user query using the same embedding model and finds the top-\(k\) most similar chunks using approximate nearest-neighbor search. For local prototyping, FAISS (Facebook AI Similarity Search) is fast and requires no external service. For production, managed vector databases such as Chroma, Weaviate, Pinecone, or pgvector add persistence and filtering.

Prompt construction places the retrieved chunks into the prompt alongside the question, with explicit instructions to answer only from the provided context.

Generation passes the assembled prompt to any LLM and returns the answer.

Code

if HAVE_DEPS:
    # Ingestion
    documents = [
        'Our refund policy allows returns within 30 days of purchase with a receipt.',
        'Products purchased during sale events are final sale and cannot be returned.',
        'To initiate a return, visit any store or contact support@example.com.',
        'Refunds are processed within 5-7 business days to the original payment method.',
        'Exchanges for a different size or color are accepted within 60 days.',
        'Damaged or defective items qualify for immediate replacement at no cost.',
        'Gift cards and downloadable software cannot be returned.',
        'International orders follow the same policy but shipping costs are non-refundable.',
    ]
    doc_emb = model.encode(documents)
    faiss.normalize_L2(doc_emb)
    idx = faiss.IndexFlatIP(doc_emb.shape[1])
    idx.add(doc_emb)

    # Retrieval
    def retrieve(query, k=2):
        q = model.encode([query]); faiss.normalize_L2(q)
        scores, idxs = idx.search(q, k)
        return [(scores[0][i], documents[idxs[0][i]]) for i in range(k)]

    # Prompt construction
    def build_prompt(query, chunks):
        context = '\n'.join(f'- {doc}' for _, doc in chunks)
        return (f'Answer the question using ONLY the policy documents below.\n\n'
                f'Policy:\n{context}\n\nQuestion: {query}\n\nAnswer:')

    for q in ['Can I return a sweater I bought last month?',
              'I received a damaged item. What should I do?']:
        chunks = retrieve(q)
        print(f'Query: {q}')
        for score, doc in chunks:
            print(f'  [{score:.3f}] {doc}')
        print()

26.3 Connecting to a Live LLM

The retrieval step above produces a prompt. We pass that prompt to any LLM to get the final answer.

import anthropic

client = anthropic.Anthropic(api_key='your-api-key')

def rag_answer(query):
    chunks = retrieve(query, k=3)
    prompt = build_prompt(query, chunks)
    msg = client.messages.create(
        model='claude-sonnet-4-6',
        max_tokens=512,
        messages=[{'role': 'user', 'content': prompt}]
    )
    return msg.content[0].text

The pattern is identical for the OpenAI API — just swap in openai.OpenAI() and the chat completions endpoint. LangChain and LlamaIndex are higher-level frameworks that automate the pipeline, but for straightforward applications the direct approach above is easier to debug and understand.

26.4 Prompt Engineering

Prompt engineering is not about tricks or magic phrases. It is about communicating intent clearly and completely. Three patterns cover most of what practitioners need.

Zero-shot prompting describes the task with no examples. It works for simple, well-defined tasks where the expected format is obvious.

Classify the sentiment as Positive, Negative, or Neutral. Reply with one word.
Review: The product arrived quickly but stopped working after two days.
Sentiment:

Few-shot prompting provides 2–5 labeled examples before the actual query. This is the first thing to try when zero-shot produces inconsistent output. Each example adds tokens to every request, so keep examples short.

Review: Fast shipping, exactly as described. → Positive
Review: Broken on arrival. → Negative
Review: Arrived on time. → Neutral

Review: Arrived quickly but stopped working after two days. →

Chain-of-thought (CoT) asks the model to reason step by step before answering. Simply adding let’s think step by step significantly improves performance on multi-step reasoning, arithmetic, and logic. Adding a worked example (few-shot CoT) is even more reliable.

Code

def zero_shot(review):
    return (
        'Classify sentiment as Positive, Negative, or Neutral. Reply with one word.\n\n'
        f'Review: {review}\nSentiment:'
    )

def few_shot(review):
    return (
        'Classify sentiment as Positive, Negative, or Neutral.\n\n'
        'Review: Fast shipping, exactly as described. → Positive\n'
        'Review: Broken on arrival, terrible quality. → Negative\n'
        'Review: Arrived on time, nothing special. → Neutral\n\n'
        f'Review: {review} →'
    )

def cot(problem):
    return (
        f'{problem}\n\n'
        'Think step by step, showing each calculation. '
        'Put your final answer on the last line as: Answer: [number]'
    )

reviews = [
    'Arrived quickly but stopped working after two days.',
    'Absolutely love this. Best purchase all year.',
]
for r in reviews:
    print(f'Zero-shot: {zero_shot(r)[:80]}...')
    print(f'Few-shot:  {few_shot(r)[:80]}...')
    print()

problem = 'A store has 120 apples. They sell 35% on Monday and 20 more on Tuesday. How many remain?'
print('CoT prompt:')
print(cot(problem))

26.5 System Prompts and Templates

System prompts define the model’s role, persona, and constraints for an entire conversation. A well-designed system prompt is often more important than any individual user prompt.

Useful things to include:

Role: You are a concise financial analyst assistant.
Output format: Always respond in JSON with keys: summary, sentiment, confidence.
Scope: Only answer questions about our product catalog.
Tone: Be direct and professional. Do not use filler phrases like ‘Certainly!’ or ‘Great question!’
Safety: Do not make specific investment recommendations.

Prompt templates separate the fixed structure from the variable inputs, making prompts maintainable, testable, and reusable across the codebase.

SYSTEM = (
    'You are a customer support assistant for Acme Corp. '
    'Answer only questions about returns, shipping, and product availability. '
    'Respond in 2-3 sentences maximum.'
)

USER_TEMPLATE = (
    'Customer question: {question}\n\n'
    'Relevant policies:\n{policy_context}'
)

# Anthropic API:
message = client.messages.create(
    model='claude-sonnet-4-6', max_tokens=256,
    system=SYSTEM,
    messages=[{'role': 'user', 'content': USER_TEMPLATE.format(
        question='Can I return a sweater?',
        policy_context='Returns accepted within 30 days with receipt.'
    )}]
)

26.6 Cost Optimization

API-based LLMs charge per token (roughly 4 characters = 1 token). At scale, poorly designed prompts can multiply costs significantly.

The most impactful lever is model selection. GPT-4o-mini and Claude Haiku are around 50x cheaper than their flagship counterparts. Simple extraction, classification, and short summaries almost always work fine on smaller models. Complex reasoning, nuanced judgment, and multi-step code generation may require a larger model. Measure both quality and cost before assuming the large model is necessary.

Other useful levers: shorter system prompts (every call pays for them), fewer few-shot examples, limiting max_tokens for short tasks, caching for repeated identical prompts (most providers support this), and batch API for non-real-time workloads (typically 50% cheaper).

26.7 Fine-Tuning

26.7.1 When to fine-tune vs. use RAG

This is one of the most common decisions in LLM application development. The wrong choice wastes significant time and money.

RAG is the right choice when the application needs to answer questions from specific documents, when that knowledge changes frequently, or when the ability to cite sources is important. Fine-tuning is the right choice when a consistent output format or style is needed that prompts alone cannot reliably produce, when the domain requires specific reasoning patterns, or when per-call cost needs to be reduced by using a smaller model.

A good rule of thumb is to start with RAG and prompt engineering, and fine-tune only when there is clear evidence that prompts are insufficient and a labeled dataset of at least a few hundred high-quality examples is available.

26.7.2 LoRA and Parameter-Efficient Fine-Tuning

Full fine-tuning updates all model parameters and requires significant GPU memory. A 7B parameter model requires around 28 GB in float32. It also risks catastrophic forgetting — the model loses its general capabilities.

LoRA (Low-Rank Adaptation, Hu et al. 2021) freezes the original model weights and adds small trainable rank-decomposition matrices to the attention layers. This reduces trainable parameters by roughly 1000x while retaining most of the quality benefit of full fine-tuning. Multiple task-specific LoRA adapters can coexist on the same base model. QLoRA combines 4-bit quantization with LoRA, making it possible to fine-tune a 13B model on a single 24 GB GPU.

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-v0.1')
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 3,756,638,208 || trainable%: 0.11%

Install: pip install peft trl bitsandbytes transformers datasets

26.8 Evaluating LLM Outputs

Standard metrics like accuracy and F1 apply when the output is a classification label or can be compared to a gold string. For open-ended generation they break down.

LLM-as-judge has become the standard approach for evaluating RAG and open-ended responses. We define evaluation criteria — accuracy, faithfulness, relevance — and ask a stronger model (GPT-4, Claude Sonnet) to score responses on a 1–5 scale. Human agreement with LLM judges is now competitive with inter-human agreement on many tasks.

RAGAS (pip install ragas) is a specialized framework for evaluating RAG systems. It measures faithfulness (does the answer contradict the retrieved context?), answer relevance (does it address the question?), context recall (was the right information retrieved?), and context precision (is retrieved content actually used?).

Code

# LLM-as-judge prompt pattern

JUDGE_TEMPLATE = (
    'You are evaluating a question-answering system.\n\n'
    'Question: {question}\n\n'
    'Retrieved context:\n{context}\n\n'
    'Answer: {answer}\n\n'
    'Score on two criteria, each 1-5:\n'
    '1. Faithfulness (1=contradicts context, 5=fully grounded)\n'
    '2. Relevance (1=does not answer, 5=fully answers)\n\n'
    'Respond as JSON: {"faithfulness": <1-5>, "relevance": <1-5>, "reasoning": "brief"}'
)

example = {
    'question': 'Can I return something bought during a sale?',
    'context':  'Products purchased during sale events are final sale and cannot be returned.',
    'good_answer': 'No, sale items are final sale and cannot be returned.',
    'bad_answer':  'Yes, you can return any product within 30 days.',
}

for label, answer in [('Good', example['good_answer']), ('Bad', example['bad_answer'])]:
    prompt = JUDGE_TEMPLATE.format(
        question=example['question'],
        context=example['context'],
        answer=answer
    )
    print(f'[{label} answer] {answer}')
    print(f'Prompt excerpt: {prompt[:200]}...')
    print()

print('Pass each prompt to a strong LLM and parse the JSON.')
print('Faithfulness >= 4 and relevance >= 4 indicates a trustworthy response.')

26.9 Guardrails

Production LLM applications need defenses against both accidental misuse and intentional adversarial inputs. Prompt injection is the most common attack: an adversary embeds instructions inside user-provided content (a document the model is asked to summarize, a customer review it is asked to classify) that override the system prompt.

The most practical defenses are:

Never put untrusted user content in the system prompt — put it in the user turn only
Validate and sanitize inputs before passing them to the model
Filter outputs before returning them: check for PII, off-topic content, or policy violations
Use structured outputs (JSON schema) to limit the model’s freedom to deviate
For higher-stakes applications, use a dedicated moderation model on both inputs and outputs

NeMo Guardrails (NVIDIA, pip install nemoguardrails) provides a framework for defining rules about what the model should and should not engage with. Under the EU AI Act (2024), high-risk AI systems must include input/output filtering, interaction logging, and human-override capabilities. Building safety in from the start is substantially easier than retrofitting it.

26.10 Key Takeaways

The three techniques complement each other and are often used together. RAG provides the knowledge; prompt engineering shapes the format and behavior; fine-tuning bakes in domain style or consistent reasoning patterns.

RAG is the right default when the model needs access to specific documents or frequently changing data
Prompt engineering is always the first thing to try — it is free and often sufficient
Fine-tuning is worth the effort when prompts cannot reliably produce the required behavior
Evaluation must be built in from the start — LLM-as-judge and RAGAS provide practical options
Safety is not optional — output filtering, input validation, and logging belong in every production deployment

pip install sentence-transformers faiss-cpu anthropic openai
pip install peft trl bitsandbytes transformers  # fine-tuning
pip install ragas                                # RAG evaluation
pip install nemoguardrails                       # safety rails

Recommended reading: - Building LLM Applications — Chip Huyen (free online) - HuggingFace PEFT documentation: huggingface.co/docs/peft - RAGAS documentation: docs.ragas.io