# 23 Local LLMs

Unlike OpenAI, where data must be sent to an external party, local LLMs run entirely on your machine.

Reasons to prefer a local LLM:

- Data security: sensitive data never leaves your network
- Cost: no per-token API charges once the hardware is paid for
- Latency / offline: works without an internet connection
- Customisation: fine-tune on proprietary data

The 2025 landscape:

| Tool | Best for |
|---|---|
| Ollama | Easiest setup; production-grade local serving; OpenAI-compatible API |
| HuggingFace Transformers | Programmatic model access, custom pipelines, research |
| llama.cpp / llama-cpp-python | Maximum efficiency on CPU; GGUF quantised models |
| GPT4All | Beginner-friendly GUI + simple Python API |

Hardware reality check:

- Small models, roughly 3–9B parameters (Llama 3.2 3B, Mistral 7B, Gemma 2 9B): run well on 8 GB VRAM, or on CPU with 16 GB RAM at reduced speed
- 13–14B models: need 16 GB VRAM or 32 GB RAM
- 70B models: require multi-GPU or a high-end workstation; most users run 4-bit quantised versions

The recommended starting point for most practitioners in 2025 is Ollama: it handles model downloads, quantisation, and serving behind a local REST API with zero configuration.
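The VRAM figures above follow from simple arithmetic: weights stored at *b* bits cost roughly params × *b* / 8 bytes, plus some headroom for activations and the KV cache. A back-of-envelope sketch (the 20% overhead factor is an illustrative assumption, not a measured constant):

```python
def model_memory_gb(n_params_billion: float, bits_per_param: int,
                    overhead: float = 0.2) -> float:
    """Rough memory footprint of a model's weights.

    n_params_billion -- parameter count in billions (e.g. 7 for a 7B model)
    bits_per_param   -- 16 for fp16/bf16; 8 or 4 for quantised weights
    overhead         -- illustrative allowance for KV cache and activations
    """
    weight_bytes = n_params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * (1 + overhead) / 1e9

# A 7B model in 16-bit needs ~14 GB for weights alone;
# the same model quantised to 4 bits fits in ~3.5 GB (+ overhead).
print(f"7B @ 16-bit: ~{model_memory_gb(7, 16):.1f} GB")
print(f"7B @ 4-bit:  ~{model_memory_gb(7, 4):.1f} GB")
```

This is why 4-bit quantisation (covered in 23.2.2) is what brings 7–14B models within reach of consumer GPUs.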

## 23.1 Ollama

Ollama is the easiest way to run large language models locally in 2025. It bundles the model runtime, quantisation, and a local API server into a single installation.

Installation:

- Windows / Mac: download the installer from https://ollama.com
- Linux: `curl -fsSL https://ollama.com/install.sh | sh`

Pull a model from the command line (first time only):

```bash
ollama pull llama3.2     # 3B parameter model, ~2 GB
ollama pull mistral      # Mistral 7B, ~4 GB
ollama pull gemma2       # Gemma 2 9B by Google, ~5 GB
ollama pull phi4         # Microsoft Phi-4 14B, ~9 GB
ollama pull qwen2.5:7b   # Alibaba Qwen 2.5 7B
ollama pull llava        # LLaVA multimodal (vision + text)
```

Python SDK: `pip install ollama`
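Ollama serves its REST API at http://localhost:11434 by default. Before using the Python SDK, you can confirm the server is up with a plain HTTP request to the `/api/tags` endpoint, which lists pulled models; the `server_is_up` helper name is just an illustration:

```python
import json
import urllib.error
import urllib.request

def server_is_up(base_url: str = 'http://localhost:11434',
                 timeout: float = 2.0) -> bool:
    """Return True if an Ollama server responds at base_url."""
    try:
        with urllib.request.urlopen(f'{base_url}/api/tags', timeout=timeout) as resp:
            json.load(resp)  # /api/tags returns {"models": [...]}
        return True
    except (urllib.error.URLError, OSError, ValueError):
        return False

if server_is_up():
    print('Ollama is running')
else:
    print('Start it with: ollama serve')
```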

```python
# pip install ollama
import ollama

# Simple single-turn chat
response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'What is the capital of China?'}]
)
print(response['message']['content'])
```

### 23.1.1 Available models

You can list models you have already pulled:

```python
import ollama

# List locally available models
models = ollama.list()
for m in models['models']:
    size_gb = m['size'] / 1e9
    print(f"{m['name']:40s}  {size_gb:.1f} GB")

# Note: in newer releases of the ollama package the name field of each
# entry is 'model' rather than 'name' -- adjust the key if this raises.
```

### 23.1.2 Streaming responses

For long responses, streaming lets you display tokens as they are generated rather than waiting for the full reply.

```python
import ollama

stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a short poem about data science.'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()  # newline at end
```

### 23.1.3 Multi-turn conversation

Ollama, like all chat APIs, is stateless between calls: context is maintained by passing the full message history with every request. Build a history list and append each exchange.

```python
import ollama

history = []

def chat(user_input, model='llama3.2'):
    history.append({'role': 'user', 'content': user_input})
    response = ollama.chat(model=model, messages=history)
    reply = response['message']['content']
    history.append({'role': 'assistant', 'content': reply})
    return reply

print(chat('My name is Alex. What is a good first programming language to learn?'))
print()
print(chat('What is my name?'))  # model should remember from prior turn
```
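Because the whole history is resent on every turn, long conversations eventually exceed the model's context window. A simple mitigation is to cap the history while always preserving the system message. A sketch (the window of 6 messages is an arbitrary illustration; production code would count tokens, not messages):

```python
def trim_history(history, max_messages=20):
    """Keep the system message (if any) plus the most recent messages."""
    system = [m for m in history if m['role'] == 'system']
    rest = [m for m in history if m['role'] != 'system']
    return system + rest[-(max_messages - len(system)):]

# Example: a long conversation trimmed to the last few turns
msgs = [{'role': 'system', 'content': 'Be terse.'}]
for i in range(30):
    msgs.append({'role': 'user', 'content': f'question {i}'})
    msgs.append({'role': 'assistant', 'content': f'answer {i}'})

trimmed = trim_history(msgs, max_messages=6)
print(len(trimmed))        # 6: the system message plus the last 5 messages
print(trimmed[0]['role'])  # 'system' is always kept
```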

### 23.1.4 System prompts

A system message sets the model's persona and constraints. It is placed before all user messages in the history and, in a deployed application, is not shown to the end user.

```python
import ollama

messages = [
    {'role': 'system', 'content': 'You are a terse data analyst who answers every question '
                                  'in one sentence and always cites a number.'},
    {'role': 'user',   'content': 'Why is Python popular for data science?'}
]
response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])
```

### 23.1.5 Text summarisation with a local LLM

One of the highest-value use cases for local LLMs is processing confidential documents that cannot be sent to a cloud API.

```python
import ollama

# Read a text file (replace with your document)
with open('op_ed.txt', 'r', encoding='utf-8') as f:
    document = f.read()

word_count = len(document.split())
print(f'Document: {word_count:,} words')

# Summarise with local Llama
prompt = f"""Summarise the following article in 3 bullet points.
Be concise and focus on the main insights.

Article:
{document[:4000]}"""  # trim to fit the context window of small models

response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': prompt}]
)
print(response['message']['content'])
```
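Truncating to the first 4,000 characters discards the rest of a long document. A common alternative is map-reduce summarisation: split the text into chunks, summarise each, then summarise the summaries. A sketch, assuming each chunk fits the model's context window (the 4,000-character chunk size and helper names are illustrative):

```python
def chunk_text(text, size=4000):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarise(text, model='llama3.2'):
    import ollama  # imported lazily so chunk_text is usable without the server
    prompt = f'Summarise the following text in 3 bullet points:\n\n{text}'
    response = ollama.chat(model=model,
                           messages=[{'role': 'user', 'content': prompt}])
    return response['message']['content']

def summarise_long(document, model='llama3.2'):
    # Map: summarise each chunk independently
    partials = [summarise(c, model) for c in chunk_text(document)]
    # Reduce: summarise the concatenated partial summaries
    return summarise('\n\n'.join(partials), model)
```

Splitting on character counts can cut a sentence in half; splitting on paragraph boundaries is a gentler choice when the document has them.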

### 23.1.6 OpenAI-compatible endpoint

Ollama exposes an OpenAI-compatible REST API at http://localhost:11434/v1. This means you can swap `base_url` and use existing OpenAI client code unchanged, which is useful when migrating cloud code to a local model.

```python
from openai import OpenAI

# Point the OpenAI client at the local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the client but not validated by Ollama
)

completion = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Explain gradient descent in one paragraph.'}]
)
print(completion.choices[0].message.content)
```
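Because the call surface is identical, application code can be written against the client object and pointed at either backend. A minimal sketch of that migration pattern; the `explain` helper is an illustration, not part of either SDK:

```python
def explain(client, model, topic):
    """Works with any OpenAI-compatible client: cloud OpenAI or local Ollama."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{'role': 'user', 'content': f'Explain {topic} in one paragraph.'}]
    )
    return completion.choices[0].message.content

# Cloud:  client = OpenAI();  explain(client, 'gpt-4o-mini', 'gradient descent')
# Local:  client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
#         explain(client, 'llama3.2', 'gradient descent')
```

Keeping the backend choice in one place (the client constructor) means switching between cloud and local is a one-line change.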

### 23.1.7 Vision: describing images with LLaVA

LLaVA (Large Language and Vision Assistant) is a multimodal model that can answer questions about images. Pull it once: `ollama pull llava`

```python
import ollama

# Replace with an image file on your machine
IMAGE_PATH = '20240321_194345.jpg'

response = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'Describe what you see in this image.',
        'images': [IMAGE_PATH]
    }]
)
print(response['message']['content'])
```

## 23.2 HuggingFace Transformers

The transformers library gives you direct programmatic access to model weights, which is useful when you need custom pre/post-processing, fine-tuning, or access to models not yet available through Ollama.

Key models available on HuggingFace (2025):

| Model | HF repo | Notes |
|---|---|---|
| Llama 3.2 3B | meta-llama/Llama-3.2-3B-Instruct | Gated: requires HF account + access request |
| Llama 3.1 8B | meta-llama/Meta-Llama-3.1-8B-Instruct | Gated |
| Mistral 7B v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | Open |
| Gemma 2 9B | google/gemma-2-9b-it | Gated |
| Phi-4 | microsoft/phi-4 | Open |
| Qwen2.5 7B | Qwen/Qwen2.5-7B-Instruct | Open |
| SmolLM2 1.7B | HuggingFaceTB/SmolLM2-1.7B-Instruct | Open; runs on laptop CPU |

Required packages:

```bash
pip install transformers torch accelerate bitsandbytes
```

```python
# Authenticate with HuggingFace (required for gated models)
# Store your token in the environment variable HF_TOKEN, not hardcoded here
import os
from huggingface_hub import login

# login(token=os.environ['HF_TOKEN'])  # uncomment and set the env var
# Or run once from a terminal: huggingface-cli login
```

### 23.2.1 Running Llama 3 with the pipeline API

The HuggingFace `pipeline()` function is the simplest way to load a model for inference. Use `device_map='auto'` to spread the model across available GPUs and CPU automatically.

```python
from transformers import pipeline
import torch

# SmolLM2 1.7B: small enough to run on CPU, no login required
model_id = 'HuggingFaceTB/SmolLM2-1.7B-Instruct'
# For larger gated models swap in:
# model_id = 'meta-llama/Meta-Llama-3.1-8B-Instruct'

pipe = pipeline(
    'text-generation',
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user',   'content': 'Explain overfitting in one paragraph.'}
]
output = pipe(messages, max_new_tokens=256)
print(output[0]['generated_text'][-1]['content'])
```

### 23.2.2 4-bit quantisation with BitsAndBytes

4-bit quantisation (the NF4 scheme introduced with QLoRA) reduces a 7B model from ~14 GB to ~4 GB of VRAM with minimal quality loss. It requires bitsandbytes and a CUDA GPU.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model_id = 'Qwen/Qwen2.5-7B-Instruct'  # open model, no login required
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto'
)

messages = [{'role': 'user', 'content': 'What are decision trees?'}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors='pt').to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=200)

# Decode only the new tokens (not the prompt)
new_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
print(tokenizer.batch_decode(new_ids, skip_special_tokens=True)[0])
```

## 23.3 GPT4All

GPT4All (`pip install gpt4all`) is a beginner-friendly library that downloads and runs GGUF-format models locally. Its Python API is minimal but easy to use. Ollama generally provides better performance and a wider model selection, but GPT4All has the advantage of a GUI desktop app.

Source: https://docs.gpt4all.io

```python
# pip install gpt4all
from gpt4all import GPT4All

# Modern GPT4All models (2025); list at https://gpt4all.io/index.html
# GPT4All will download the model on first run (~2–4 GB)
model = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')

with model.chat_session():
    reply = model.generate('Explain random forests in one paragraph.', max_tokens=300)
    print(reply)
```

### 23.3.1 Streaming with GPT4All

```python
from gpt4all import GPT4All

model = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')

with model.chat_session():
    tokens = []
    for token in model.generate('Write a haiku about machine learning.', streaming=True):
        tokens.append(token)
        print(token, end='', flush=True)
print()
```

## 23.4 Key Takeaways

| Approach | When to use |
|---|---|
| Ollama | Default choice: easiest setup, best performance, OpenAI-compatible API |
| HF Transformers | When you need fine-tuning, custom architectures, or research-grade access |
| 4-bit quantisation (BnB) | Run 7–14B models on consumer GPUs (8–16 GB VRAM) |
| GPT4All | Quick demos, GUI, CPU-only environments |

Model recommendations (2025):

- General purpose / chat: llama3.2 (3B) or llama3.1:8b via Ollama
- Coding: qwen2.5-coder:7b or deepseek-coder-v2
- Reasoning: phi4 (14B), with strong benchmark performance per GB of model
- Multilingual: qwen2.5:7b (29 languages)
- Vision: llava or llama3.2-vision
- Embedding: nomic-embed-text via `ollama pull nomic-embed-text`
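The embedding model in the last bullet can power local semantic search. A minimal sketch, assuming `nomic-embed-text` has been pulled; the `ollama.embeddings` call is the SDK's older single-prompt form (newer releases also offer `ollama.embed`), and the cosine helper is ordinary Python:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def embed(text, model='nomic-embed-text'):
    import ollama  # imported lazily so cosine() is usable without the server
    return ollama.embeddings(model=model, prompt=text)['embedding']

# Rank documents by similarity to a query (requires a running Ollama server):
# docs = ['Pandas handles tabular data.', 'The Eiffel Tower is in Paris.']
# q = embed('data analysis in Python')
# print(sorted(docs, key=lambda d: -cosine(q, embed(d))))
```

Because both the documents and the query are embedded locally, this gives a fully offline retrieval pipeline with the same privacy guarantees as the chat examples above.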