## Ollama

Install Ollama on Linux/macOS:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Pull a model from the command line (first time only):

```bash
ollama pull llama3.2     # 3B parameter model, ~2 GB
ollama pull mistral      # Mistral 7B, ~4 GB
ollama pull gemma2       # Gemma 2 9B by Google, ~5 GB
ollama pull phi4         # Microsoft Phi-4 14B, ~9 GB
ollama pull qwen2.5:7b   # Alibaba Qwen 2.5 7B
ollama pull llava        # LLaVA multimodal (vision + text)
```

Python SDK: `pip install ollama`

```python
# pip install ollama
import ollama

# Simple single-turn chat
response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'What is the capital of China?'}]
)
print(response['message']['content'])
```

List locally available models:

```python
import ollama

# List locally available models
models = ollama.list()
for m in models['models']:
    size_gb = m['size'] / 1e9
    print(f"{m['name']:40s} {size_gb:.1f} GB")
```

Streaming responses token by token:

```python
import ollama

stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a short poem about data science.'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()  # newline at end
```

Multi-turn chat with conversation history:

```python
import ollama

history = []

def chat(user_input, model='llama3.2'):
    history.append({'role': 'user', 'content': user_input})
    response = ollama.chat(model=model, messages=history)
    reply = response['message']['content']
    history.append({'role': 'assistant', 'content': reply})
    return reply

print(chat('My name is Alex. What is a good first programming language to learn?'))
print()
print(chat('What is my name?'))  # model should remember from prior turns
```

### System prompts

The system message sets the model's persona and constraints. It is prepended before all user messages and is not visible to the user.

```python
import ollama

messages = [
    {'role': 'system',
     'content': 'You are a terse data analyst who answers every question '
                'in one sentence and always cites a number.'},
    {'role': 'user', 'content': 'Why is Python popular for data science?'}
]
response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])
```

### Summarising a document

```python
import ollama

# Read a text file (replace with your document)
with open('op_ed.txt', 'r', encoding='utf-8') as f:
    document = f.read()

word_count = len(document.split())
print(f'Document: {word_count:,} words')

# Summarise with local Llama
prompt = f"""Summarise the following article in 3 bullet points.
Be concise and focus on the main insights.

Article:
{document[:4000]}"""  # trim to fit the context window of small models

response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': prompt}]
)
print(response['message']['content'])
```

### OpenAI-compatible API

Ollama serves an OpenAI-compatible API at `http://localhost:11434/v1`. This means you can swap `base_url` and use existing OpenAI client code unchanged, which is useful when migrating cloud code to a local model.

```python
from openai import OpenAI

# Point the OpenAI client at the local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the client but not validated by Ollama
)

completion = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Explain gradient descent in one paragraph.'}]
)
print(completion.choices[0].message.content)
```
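The multi-turn helper above grows `history` without bound, while small local models have limited context windows (the summarisation example trims its prompt for the same reason). A minimal sketch of one way to cap it, using a hypothetical `trim_history` helper that keeps any leading system message plus the last few exchanges:

```python
def trim_history(history, max_turns=4):
    """Keep an optional leading system message plus the last
    `max_turns` user/assistant exchanges (2 messages per turn)."""
    system = [m for m in history[:1] if m['role'] == 'system']
    rest = history[len(system):]
    return system + rest[-2 * max_turns:]

# Build a fake 10-turn conversation to show the effect
msgs = [{'role': 'system', 'content': 'Be brief.'}]
for i in range(10):
    msgs.append({'role': 'user', 'content': f'question {i}'})
    msgs.append({'role': 'assistant', 'content': f'answer {i}'})

trimmed = trim_history(msgs, max_turns=4)
print(len(trimmed))            # 9: 1 system message + 8 recent messages
print(trimmed[-1]['content'])  # 'answer 9'
```

Calling `trim_history(history)` before each `ollama.chat(...)` keeps the prompt bounded; the trade-off is that the model forgets anything said before the retained window.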
## Vision: LLaVA

`llava` (Large Language and Vision Assistant) is a multimodal model that can answer questions about images. Pull it once: `ollama pull llava`.

```python
import ollama

# Replace with an image file on your machine
IMAGE_PATH = '20240321_194345.jpg'

response = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'Describe what you see in this image.',
        'images': [IMAGE_PATH]
    }]
)
print(response['message']['content'])
```

## HuggingFace Transformers

The `transformers` library gives you direct programmatic access to model weights, which is useful when you need custom pre/post-processing, fine-tuning, or access to models not yet in Ollama.

Key models available on HuggingFace (2025):

| Model | HF repo | Notes |
|---|---|---|
| Llama 3.2 3B | meta-llama/Llama-3.2-3B-Instruct | Gated: requires HF account + access request |
| Llama 3.1 8B | meta-llama/Meta-Llama-3.1-8B-Instruct | Gated |
| Mistral 7B v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | Open |
| Gemma 2 9B | google/gemma-2-9b-it | Gated |
| Phi-4 | microsoft/phi-4 | Open |
| Qwen2.5 7B | Qwen/Qwen2.5-7B-Instruct | Open |
| SmolLM2 1.7B | HuggingFaceTB/SmolLM2-1.7B-Instruct | Open; runs on laptop CPU |

Required packages:

```bash
pip install transformers torch accelerate bitsandbytes
```

```python
# Authenticate with HuggingFace (required for gated models)
# Store your token in the environment variable HF_TOKEN, not hardcoded here
import os
from huggingface_hub import login

# login(token=os.environ['HF_TOKEN'])  # uncomment and set env var
# Or run once from a terminal: huggingface-cli login
```
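The download sizes quoted earlier (3B at ~2 GB, 7B at ~4 GB) follow from simple arithmetic: weight memory is roughly parameter count times bytes per weight. A back-of-the-envelope helper (a sketch only; real model files add tokenizer, metadata, and runtime overhead, and Ollama's default quantisation is approximately 4-bit):

```python
def rough_weight_gb(n_params_billion, bits_per_weight):
    """Approximate weight memory in GB: params * (bits / 8).
    Ignores activations, KV cache, and file-format overhead."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Llama 3.2 3B at roughly 4-bit quantisation
print(f"{rough_weight_gb(3, 4):.1f} GB")   # 1.5 GB
# Qwen2.5 7B at 16-bit (bfloat16) vs 4-bit
print(f"{rough_weight_gb(7, 16):.1f} GB")  # 14.0 GB
print(f"{rough_weight_gb(7, 4):.1f} GB")   # 3.5 GB
```

This is also why the quantised-loading section below matters: dropping from 16-bit to 4-bit weights cuts a 7B model from ~14 GB to ~3.5 GB, enough to fit on a consumer GPU.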
### Quick inference with pipeline()

The `pipeline()` function is the simplest way to load a model for inference. Use `device_map='auto'` to spread the model across available GPUs/CPU automatically.

```python
from transformers import pipeline
import torch

# SmolLM2 1.7B: small enough to run on CPU, no login required
model_id = 'HuggingFaceTB/SmolLM2-1.7B-Instruct'
# For larger gated models swap in:
# model_id = 'meta-llama/Meta-Llama-3.1-8B-Instruct'

pipe = pipeline(
    'text-generation',
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Explain overfitting in one paragraph.'}
]
output = pipe(messages, max_new_tokens=256)
print(output[0]['generated_text'][-1]['content'])
```

### 4-bit quantised loading

Loading in 4-bit precision cuts weight memory to roughly a quarter of bfloat16; it requires `bitsandbytes` and a CUDA GPU.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model_id = 'Qwen/Qwen2.5-7B-Instruct'  # open model, no login required
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto'
)

messages = [{'role': 'user', 'content': 'What are decision trees?'}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors='pt').to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=200)
# Decode only the new tokens (not the prompt)
new_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
print(tokenizer.batch_decode(new_ids, skip_special_tokens=True)[0])
```
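`model.generate()` returns the prompt tokens followed by the continuation, which is why the snippet above slices each output sequence at `len(inp)` before decoding. The same pattern, illustrated with plain lists of made-up token ids:

```python
# Simulated token ids: generate() echoes the prompt, then appends new tokens
prompt_ids = [[101, 7592, 2088]]                  # one batched prompt
output_ids = [[101, 7592, 2088, 999, 2023, 102]]  # prompt + 3 new tokens

# Keep only the tokens generated after each prompt
new_ids = [out[len(inp):] for inp, out in zip(prompt_ids, output_ids)]
print(new_ids)  # [[999, 2023, 102]]
```

Skipping this step means the decoded string starts with the full chat-templated prompt, which is rarely what you want to display.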
## GPT4All

GPT4All (`pip install gpt4all`) is a beginner-friendly library that downloads and runs GGUF-format models locally. Its Python API is minimal but easy to use. Ollama generally provides better performance and a wider model selection, but GPT4All has the advantage of a GUI desktop app.

Source: https://docs.gpt4all.io

```python
# pip install gpt4all
from gpt4all import GPT4All

# Modern GPT4All models (2025); list at https://gpt4all.io/index.html
# GPT4All will download the model on first run (~2–4 GB)
model = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')

with model.chat_session():
    reply = model.generate('Explain random forests in one paragraph.', max_tokens=300)
    print(reply)
```

Streaming tokens:

```python
from gpt4all import GPT4All

model = GPT4All('Meta-Llama-3-8B-Instruct.Q4_0.gguf')

with model.chat_session():
    tokens = []
    for token in model.generate('Write a haiku about machine learning.', streaming=True):
        tokens.append(token)
        print(token, end='', flush=True)
print()
```

## Which model to choose (2025)

- General chat: `llama3.2` (3B) or `llama3.1:8b` via Ollama
- Coding: `qwen2.5-coder:7b` or `deepseek-coder-v2`
- Reasoning: `phi4` (14B), strong benchmark performance per GB of model
- Multilingual: `qwen2.5:7b`, 29 languages
- Vision: `llava` or `llama3.2-vision`
- Embedding: `nomic-embed-text` via `ollama pull nomic-embed-text`
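Embedding models such as `nomic-embed-text` turn text into vectors that are usually compared with cosine similarity. Once you have the vectors, the comparison step is pure arithmetic; a self-contained sketch using made-up 4-dimensional vectors in place of real embeddings (which have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": similar texts should get similar vectors
v_cat = [0.2, 0.8, 0.1, 0.5]
v_kitten = [0.25, 0.75, 0.15, 0.45]
v_car = [0.9, 0.1, 0.7, 0.0]

print(f"cat vs kitten: {cosine_similarity(v_cat, v_kitten):.3f}")  # close to 1
print(f"cat vs car:    {cosine_similarity(v_cat, v_car):.3f}")     # much lower
```

Scores near 1 mean the texts are semantically close; this is the core operation behind semantic search and retrieval-augmented generation over local documents.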