Single-turn question answering is only one mode of working with language models. The more consequential mode — and the one that most expands what these systems can do — is agentic behavior: a model that uses tools, plans multi-step workflows, maintains state across interactions, and delegates subtasks to specialized systems.
An agent is, at its core, an LLM that can take actions. Instead of producing a text response and stopping, it decides which tool to call, calls it, observes the result, and decides what to do next — repeating until the task is complete. This loop is the foundation of systems that browse the web, write and execute code, query databases, send emails, and coordinate with other agents.
We cover function calling, the ReAct reasoning pattern, memory systems, planning, multi-agent architectures, LangGraph for stateful workflows, and the practical considerations that separate working agent prototypes from reliable production systems.
28.1 What Makes an Agent
The term “agent” is overloaded. A more precise definition: an agent is a system that exhibits at least two of the following four properties.
Tool use: the ability to call external functions — APIs, databases, code interpreters, file systems, other models — and incorporate the results into subsequent reasoning.
Planning: the ability to decompose a complex goal into steps, reason about dependencies between steps, and adapt the plan when a step fails or produces unexpected results.
Memory: the ability to access and update state beyond the current context window — conversation history, retrieved documents, results from prior steps, or a persistent knowledge store.
Multi-step execution: the ability to run a sequence of actions before producing a final result, rather than answering in a single forward pass.
A simple RAG pipeline has memory (the retrieved context) but no tool use or planning. A function-calling assistant has tool use and some planning but no persistent memory. A full autonomous agent has all four. Matching the architecture to the task avoids unnecessary complexity — most business use cases need function calling and maybe simple planning, not a fully autonomous agent.
28.2 Function Calling and Tool Use
Both Anthropic and OpenAI expose function calling (also called tool use) as a first-class API feature. The model is given a list of available tools, each described by a JSON schema, and can respond by requesting a tool call instead of producing text. The application executes the tool, passes the result back, and the model continues.
A tool definition has three parts:

- name: what to call it (used in the model's tool call)
- description: what it does, in natural language — this is what the model reads to decide whether and when to use it
- input_schema: the JSON schema of the expected arguments
The description is more important than the schema. A model with a well-described tool and a loose schema consistently outperforms one with an exact schema and a vague description.
On the Anthropic API, tools are passed in the tools parameter; the model returns a tool_use content block when it wants to call one. The application runs the tool and returns a tool_result block. The conversation continues until the model produces a text response instead of a tool call.
Code
```python
import json

def call_claude_tools(messages, tools, model="claude-haiku-4-5-20251001", max_tokens=1024):
    try:
        import anthropic
        client = anthropic.Anthropic()
        resp = client.messages.create(
            model=model, max_tokens=max_tokens, tools=tools, messages=messages
        )
        return resp
    except Exception as e:
        print(f"[API not available: {e}]")
        return None

# Define tools
TOOLS = [
    {
        "name": "get_revenue_data",
        "description": "Retrieve quarterly revenue figures for a company. "
                       "Use this when the user asks about revenue, growth, or financial performance.",
        "input_schema": {
            "type": "object",
            "properties": {
                "company": {"type": "string", "description": "Company name"},
                "quarters": {"type": "integer", "description": "Number of most recent quarters", "default": 4}
            },
            "required": ["company"]
        }
    },
    {
        "name": "calculate_growth_rate",
        "description": "Calculate the period-over-period growth rate between two values.",
        "input_schema": {
            "type": "object",
            "properties": {
                "current": {"type": "number"},
                "previous": {"type": "number"}
            },
            "required": ["current", "previous"]
        }
    }
]

# Mock tool implementations
def get_revenue_data(company, quarters=4):
    data = {"Acme": [12.4, 11.5, 10.8, 9.9], "BetaCo": [5.2, 4.9, 4.8, 4.7]}
    vals = data.get(company, [10.0, 9.5, 9.0, 8.5])[:quarters]
    return {"company": company, "quarterly_revenue_m": vals, "currency": "USD"}

def calculate_growth_rate(current, previous):
    rate = (current - previous) / previous * 100
    return {"growth_rate_pct": round(rate, 2)}

def run_tool(name, inputs):
    if name == "get_revenue_data":
        return get_revenue_data(**inputs)
    if name == "calculate_growth_rate":
        return calculate_growth_rate(**inputs)
    return {"error": f"Unknown tool: {name}"}

# Run the agentic loop
def agent_loop(user_message):
    messages = [{"role": "user", "content": user_message}]
    print(f"User: {user_message}")
    print()
    for step in range(5):  # safety limit
        resp = call_claude_tools(messages, TOOLS)
        if resp is None:
            print("[No API response — showing expected flow]")
            return
        # Collect the model's tool calls and text in this turn
        tool_calls = [b for b in resp.content if b.type == "tool_use"]
        text_blocks = [b for b in resp.content if b.type == "text"]
        # No tool calls means the model has produced its final text answer
        if not tool_calls:
            if text_blocks:
                print(f"Agent: {text_blocks[0].text}")
            return
        # Execute tools and collect results
        tool_results = []
        for tc in tool_calls:
            result = run_tool(tc.name, tc.input)
            print(f"  [Tool: {tc.name}({tc.input})] -> {result}")
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tc.id,
                "content": json.dumps(result)
            })
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({"role": "user", "content": tool_results})

agent_loop("What was Acme Corp revenue last quarter, and how does it compare to the quarter before?")
```
28.3 The ReAct Loop
ReAct (Reason + Act) is a prompting pattern that interleaves explicit reasoning with tool calls. Before taking each action, the model writes a thought explaining why it is taking that action. After observing the result, it writes another thought before deciding the next step.
```
Thought: I need to know Acme's Q4 revenue to answer the question.
Action: get_revenue_data(company="Acme", quarters=1)
Observation: {"quarterly_revenue_m": [12.4]}
Thought: Now I need Q3 to compute the growth rate.
Action: get_revenue_data(company="Acme", quarters=2)
Observation: {"quarterly_revenue_m": [12.4, 11.5]}
Thought: I have both values. Q4=12.4, Q3=11.5. Growth = (12.4-11.5)/11.5 = 7.8%.
Action: finish("Acme's Q4 revenue was $12.4M, up 7.8% from $11.5M in Q3.")
```
The explicit reasoning step serves several purposes: it makes the agent's decisions auditable, reduces errors by forcing the model to articulate its plan before acting, and makes debugging far easier when the agent takes a wrong action. Modern API-based function calling gains the same benefit when paired with models that support extended thinking.
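A trace like this can be driven by a small controller loop that parses the Action line, executes the named tool, and appends the Observation to the transcript. Here is a minimal sketch in which a scripted list stands in for the model (the SCRIPT, tool names, and regex are illustrative, not any library's API):

```python
import re

# Scripted stand-in for the model: a real agent would make an LLM call per step
SCRIPT = [
    'Thought: I need the two most recent quarters.\n'
    'Action: get_revenue_data(company="Acme", quarters=2)',
    'Thought: Q4=12.4, Q3=11.5. Growth = (12.4-11.5)/11.5 = 7.8%.\n'
    'Action: finish("Acme Q4 revenue was $12.4M, up 7.8% from $11.5M in Q3.")',
]

def get_revenue_data(company, quarters=4):
    return {"quarterly_revenue_m": [12.4, 11.5, 10.8, 9.9][:quarters]}

TOOLS = {"get_revenue_data": get_revenue_data}
ACTION_RE = re.compile(r'Action:\s*(\w+)\((.*)\)')

def react_loop(max_steps=5):
    transcript = []
    for step in range(max_steps):          # always enforce a step limit
        output = SCRIPT[step]              # <- LLM call in a real system
        transcript.append(output)
        name, argstr = ACTION_RE.search(output).groups()
        if name == "finish":               # terminal action ends the loop
            return argstr.strip('"'), transcript
        kwargs = eval(f"dict({argstr})")   # demo only; use a real parser in production
        transcript.append(f"Observation: {TOOLS[name](**kwargs)}")
    return None, transcript

answer, transcript = react_loop()
print(answer)
```

The `eval`-based argument parsing is a shortcut for the demo; a production controller would use the structured tool-call objects returned by the API rather than parsing text at all.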
28.4 Memory Systems
By default, a language model has no memory beyond its context window. In a multi-turn conversation or a long-running task, this is a severe constraint. Agent frameworks address it with several types of memory.
In-context memory is simply the conversation history included in the prompt. It is limited by the context window and grows linearly with each turn. For long conversations, periodic summarization compresses earlier history into a shorter representation.
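The summarization approach can be sketched as a small class that keeps the last few turns verbatim and folds older turns into a running summary. The `ConversationMemory` class and the one-line `summarize` stub are illustrative; a real system would make an LLM call to produce the summary:

```python
def summarize(text_items):
    """Stub summarizer: a real system would call an LLM here."""
    return "Summary: " + "; ".join(text_items)

class ConversationMemory:
    """Keep the last `window` turns verbatim; fold older turns into a summary."""
    def __init__(self, window=4):
        self.window = window
        self.summary = ""    # compressed representation of older turns
        self.turns = []      # recent turns, kept verbatim

    def add(self, role, content):
        self.turns.append({"role": role, "content": content})
        if len(self.turns) > self.window:
            overflow = self.turns[:-self.window]   # turns falling out of the window
            items = ([self.summary] if self.summary else []) + \
                    [f"{t['role']}: {t['content']}" for t in overflow]
            self.summary = summarize(items)        # re-summarize old summary + overflow
            self.turns = self.turns[-self.window:]

    def as_prompt(self):
        parts = ([self.summary] if self.summary else [])
        parts += [f"{t['role']}: {t['content']}" for t in self.turns]
        return "\n".join(parts)

mem = ConversationMemory(window=4)
for i in range(6):
    mem.add("user" if i % 2 == 0 else "assistant", f"turn {i}")
print(mem.as_prompt())
```

The prompt now grows with the window size rather than with the full conversation length; the cost is that details outside the window survive only as well as the summarizer preserves them.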
External / vector memory stores prior interactions, documents, or task results in a vector database. At each step, the agent retrieves the most relevant memories and includes them in the context. This is effectively RAG applied to memory.
Working memory is a structured store — a dict or database — that the agent can read and write during a task. Useful for tracking intermediate results, scratchpad computations, and task progress.
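A minimal sketch of such a store, assuming a hypothetical `WorkingMemory` wrapper around a dict that can also render itself for injection into the prompt:

```python
class WorkingMemory:
    """Structured scratchpad the agent reads and writes during a task."""
    def __init__(self):
        self._store = {}

    def write(self, key, value):
        self._store[key] = value

    def read(self, key, default=None):
        return self._store.get(key, default)

    def render(self):
        """Serialize the store for inclusion in the next prompt."""
        return "\n".join(f"{k}: {v}" for k, v in self._store.items())

# Track intermediate results and task progress across steps
wm = WorkingMemory()
wm.write("q3_revenue_m", 11.5)
wm.write("q4_revenue_m", 12.4)
wm.write("steps_done", ["fetch_revenue"])
print(wm.render())
```

In practice the same interface can be backed by a database or key-value store when the task outlives a single process.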
Episodic memory records summaries of completed tasks — what was attempted, what worked, what failed. A model can consult its episodic memory to avoid repeating mistakes or to apply lessons from similar past tasks.
LangGraph (covered below) provides a native state management layer that handles working memory as typed state, passed between graph nodes.
28.5 Multi-Agent Architectures
Complex tasks often benefit from multiple specialized agents working together, coordinated by an orchestrator.
Orchestrator-worker: a high-level planner (the orchestrator) receives the task, decomposes it into subtasks, and delegates each to a specialized worker agent. Workers report back; the orchestrator synthesizes the results. This mirrors how a manager delegates to specialists. The orchestrator needs broad knowledge; workers need depth in their domain.
Parallel execution: independent subtasks run simultaneously across multiple agents, reducing total latency. For example, a research agent might send the same question to a web search agent, a database agent, and a document retrieval agent in parallel, then synthesize the results.
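The fan-out step can be sketched with a thread pool, with trivial functions standing in for the specialist agents (agent names and return strings here are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Mock specialists: in a real system each would make an LLM or tool call
def web_search_agent(q):
    return f"[web] 3 articles about {q}"

def database_agent(q):
    return f"[db] 12 rows matching {q}"

def docs_agent(q):
    return f"[docs] 2 reports mentioning {q}"

AGENTS = [web_search_agent, database_agent, docs_agent]

def fan_out(question):
    # Run independent agents concurrently: total latency tracks the slowest
    # agent rather than the sum of all three
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        results = list(pool.map(lambda agent: agent(question), AGENTS))
    # In production, a synthesizer LLM call would combine these results
    return results

print(fan_out("Acme Q3 revenue"))
```

Threads are enough here because the work is I/O-bound API calls; `asyncio` is the usual alternative when the agent framework is already async.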
Debate and critique: one agent generates a draft answer; a second agent critiques it; a third synthesizes a final answer incorporating the critique. This pattern reduces hallucination and improves reasoning quality for complex analytical tasks.
Specialized agents: code execution agents, data analysis agents, and browser agents each have tool sets tailored to their domain. A general orchestrator routes tasks to the right specialist.
The main risks: error propagation (a mistake in an early step compounds), cost (each agent call costs tokens), and latency. Start with the simplest architecture that solves the problem and add agents when single-agent approaches hit limits.
Code
```python
# Simple multi-agent orchestrator pattern (without LangGraph dependencies)

def analyst_agent(task):
    """Simulated analyst agent: returns structured analysis."""
    # In production: call LLM with analyst system prompt + task
    return {
        "agent": "analyst",
        "findings": f"Analysis of: {task}",
        "metrics": {"revenue_growth": "7.8%", "margin_improvement": "3pp"},
        "confidence": "medium"
    }

def risk_agent(context):
    """Simulated risk agent: identifies risks from context."""
    return {
        "agent": "risk",
        "risks": [
            "Customer concentration: top 3 accounts = 62% of revenue",
            "Churn increased from 3.8% to 4.2% — early warning"
        ]
    }

def synthesis_agent(analyst_result, risk_result):
    """Simulated synthesis agent: combines specialist outputs."""
    return (
        f"Revenue grew {analyst_result['metrics']['revenue_growth']} with margin "
        f"improvement of {analyst_result['metrics']['margin_improvement']}. "
        f"Key risks: {risk_result['risks'][0]}. "
        f"Confidence: {analyst_result['confidence']}."
    )

def orchestrate(user_task):
    print(f"Task: {user_task}")
    print()
    # Step 1: Run specialist agents (could be parallel with threading/asyncio)
    analyst_result = analyst_agent(user_task)
    print(f"Analyst findings: {analyst_result['findings']}")
    risk_result = risk_agent(analyst_result)
    print(f"Risk findings: {risk_result['risks']}")
    # Step 2: Synthesize
    final = synthesis_agent(analyst_result, risk_result)
    print()
    print(f"Final answer: {final}")

orchestrate("Summarize Acme Corp Q3 2024 financial performance and identify key risks.")
```
28.6 LangGraph for Stateful Workflows
LangGraph (from LangChain) is a framework for building stateful, multi-step agent workflows as directed graphs. Nodes are functions or LLM calls; edges define the control flow; a shared state object is passed between nodes and updated at each step.
The key concepts:
State: a typed dict (using Python TypedDict or Pydantic) that holds the current context — messages, intermediate results, tool outputs, and any application-specific fields.
Nodes: Python functions that receive the current state, perform some operation (an LLM call, a tool call, data transformation), and return an updated state.
Edges: connections between nodes. A conditional edge reads the state and decides which node to visit next — allowing branching, loops, and error handling.
Human-in-the-loop: LangGraph supports “interrupt” nodes that pause execution and wait for human approval before continuing — essential for high-stakes actions like sending emails or modifying databases.
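LangGraph itself expresses this with `StateGraph`, `add_node`, `add_edge`/`add_conditional_edges`, and `compile()`. As a dependency-free sketch of the same state/node/conditional-edge pattern (node names, state fields, and the approval rule are invented for illustration):

```python
from typing import TypedDict

class State(TypedDict):
    draft: str
    attempts: int
    approved: bool

def write_node(state: State) -> State:
    # Stand-in for an LLM call that drafts an answer
    n = state["attempts"] + 1
    return {**state, "draft": f"draft v{n}", "attempts": n}

def review_node(state: State) -> State:
    # Stand-in for a critique step; here, approve from the second attempt on
    return {**state, "approved": state["attempts"] >= 2}

def route(state: State) -> str:
    # Conditional edge: loop back to the writer until approved, with a step cap
    if state["approved"] or state["attempts"] >= 5:
        return "END"
    return "write"

NODES = {"write": write_node, "review": review_node}
EDGES = {"write": lambda s: "review",  # static edge
         "review": route}              # conditional edge

def run(entry="write"):
    state: State = {"draft": "", "attempts": 0, "approved": False}
    node = entry
    while node != "END":
        state = NODES[node](state)
        node = EDGES[node](state)
    return state

final = run()
print(final)
```

The write/review loop runs until the router returns "END"; an interrupt node would be one more node that blocks on human input before the router lets execution continue.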
28.7 Production Considerations
Agent systems that work well in demos often fail in production. A few failure modes are responsible for the majority of reliability problems.
Compounding errors: each step has some probability of error. A 10-step pipeline where each step is 90% accurate has an end-to-end accuracy of \(0.9^{10} \approx 35\%\). Keep pipelines short; validate intermediate results; add checkpoints where a human or a separate classifier confirms correctness before proceeding.
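The arithmetic is worth internalizing; a few lines make the compounding effect and the value of a checkpoint concrete (the one-retry model below assumes failures are independent, which is optimistic):

```python
# End-to-end success probability of a linear pipeline with per-step accuracy p
def pipeline_accuracy(p, steps):
    return p ** steps

for steps in (3, 5, 10):
    print(f"{steps} steps at 90%: {pipeline_accuracy(0.9, steps):.3f}")

# A validation checkpoint that catches failures and retries once lifts
# per-step reliability to p + (1 - p) * p, assuming independent attempts
p_retry = 0.9 + (1 - 0.9) * 0.9   # = 0.99 per step
print(f"10 steps with retry:  {pipeline_accuracy(p_retry, 10):.3f}")
```

Raising per-step reliability from 90% to 99% lifts the 10-step pipeline from roughly 35% to roughly 90% end-to-end, which is why checkpoints pay for themselves quickly.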
Infinite loops: without a step limit or a termination condition, an agent can loop indefinitely — especially when a tool returns an error and the agent keeps retrying. Always enforce a maximum number of steps.
Cost and latency: agents make multiple LLM calls. A 10-step agent using Claude Opus at $15/MTok can cost 10x more per user query than a single-call response. Profile costs before scaling; use smaller models for non-critical steps.
Permission scope: an agent that can only read data is far safer than one that can write to a database, send emails, or call external APIs. Apply the principle of least privilege (Chapter 14) to every tool the agent can access.
Observability: every agent action should be logged with its inputs, outputs, latency, and cost. Tools like LangSmith, Arize Phoenix, and Weights & Biases Weave provide traces for multi-step agent runs.
28.8 Key Takeaways
- An agent is an LLM with tool use, planning, memory, and multi-step execution — match the architecture to the task; most use cases need only function calling
- Function calling lets the model request specific tool executions; the application runs the tool and returns results for the model to continue reasoning
- ReAct (Reason + Act) interleaves explicit reasoning with tool calls, making agent decisions auditable and debugging tractable
- Memory types — in-context, vector, working, episodic — serve different purposes; start with in-context and add external stores when the context window is the bottleneck
- Multi-agent architectures (orchestrator-worker, parallel, debate) handle tasks too complex for a single agent but require care to avoid error propagation
- LangGraph provides typed state and conditional edges for complex stateful workflows, including human-in-the-loop approval steps
- In production: enforce step limits, log every action, apply least-privilege permissions, and profile costs before scaling
Recommended reading:

- LangGraph documentation: langchain-ai.github.io/langgraph
- ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al., 2022
- Anthropic tool use guide: docs.anthropic.com/en/docs/build-with-claude/tool-use