
The Local Hub: Mastering Ollama for Agents
Master the most popular local LLM runtime. Learn how to pull models, manage context, and integrate Ollama into your LangChain and LangGraph workflows.
Ollama as a Local Agent Hub
To run local agents, you need a Model Runtime: software that loads the LLM weights into memory (GPU VRAM, or system RAM as a fallback) and exposes an API your code can talk to. Ollama has become the industry favorite because of its simplicity, its "Docker-like" CLI, and its strong support for tool-calling models.
In this lesson, we will learn how to set up Ollama and connect it as the "Brain" of our LangGraph agents.
1. What is Ollama?
Think of Ollama as "Docker for LLMs."
- It handles the complex math and GPU drivers for you.
- It provides a standardized local API (by default at localhost:11434), including an OpenAI-compatible endpoint under /v1.
Core Commands
ollama pull llama3.1 # Download a model
ollama run llama3.1 # Start a chat in the terminal
ollama serve # Start the background API server
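Before wiring the server into an agent, it helps to confirm it is reachable and see which models are already on disk. Below is a minimal sketch that queries the /api/tags endpoint, assuming the default localhost:11434 address and the requests package; verify the response fields against your installed version.
import requests

OLLAMA_URL = "http://localhost:11434"  # default address; change if the server is remote

def list_local_models() -> list[str]:
    """Return the names of models already pulled into the local Ollama store."""
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

if __name__ == "__main__":
    print(list_local_models())  # e.g. ['llama3.1:latest', 'mistral:latest']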
2. Choosing Agentic Models in Ollama
Not every model in Ollama can handle a Tool Call. You must look for models that were specifically fine-tuned for Function Calling.
Top Recommendations:
- Llama 3.1 (8B or 70B): Currently the gold standard for local tool-calling agents.
- Mistral / Mixtral: Known for high instruction-following accuracy.
- Command R: Specifically designed for RAG and long-context tool use.
3. Connecting Ollama to LangChain
Ollama exposes a local API (including an OpenAI-compatible endpoint), and the langchain-ollama integration wraps it, so pointing your existing LangChain code at local hardware is usually a one-line change.
from langchain_ollama import ChatOllama

# Connect to the local server
llm = ChatOllama(
    model="llama3.1",
    temperature=0,
    # Optional: Set the base URL if Ollama is on a different machine
    # base_url="http://192.168.1.15:11434"
)

# It behaves exactly like ChatOpenAI!
response = llm.invoke("Hello from my own hardware")
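To make this the "Brain" of a LangGraph agent, you can hand the same ChatOllama instance to a prebuilt ReAct agent, which also lets you verify that the model you picked in Section 2 really emits tool calls. A minimal sketch, assuming a recent langgraph release and llama3.1 already pulled; get_word_length is a toy tool invented for illustration.
from langchain_core.tools import tool
from langchain_ollama import ChatOllama
from langgraph.prebuilt import create_react_agent

@tool
def get_word_length(word: str) -> int:
    """Return the number of characters in a word."""
    return len(word)

# llama3.1 is fine-tuned for function calling, so it can drive the tool-calling loop
llm = ChatOllama(model="llama3.1", temperature=0)

agent = create_react_agent(llm, [get_word_length])

result = agent.invoke(
    {"messages": [("user", "How many characters are in 'Ollama'?")]}
)
print(result["messages"][-1].content)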
4. Hardware Management: VRAM is King
When running Ollama, the limiting factor is VRAM (Video RAM) on your Graphics Card.
- 8GB VRAM: Can run 7B or 8B models (Llama 3, Mistral) comfortably.
- 24GB VRAM (NVIDIA 3090/4090): Can run "Medium" models like Command R, or heavily quantized 70B models.
- Mac M1/M2/M3 (Unified Memory): Excellent for local agents because system RAM is shared with the GPU. A 64GB Mac can run much larger models (such as quantized 70B) at usable speeds.
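These sizing guidelines follow a simple rule of thumb: at 4-bit quantization a model needs roughly half a gigabyte per billion parameters for its weights, plus headroom for the KV cache and runtime buffers. The sketch below encodes that estimate; the bits-per-weight figure and the 20% overhead factor are rough planning assumptions, not measurements.
def estimate_vram_gb(params_billions: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weights plus ~20% overhead for the KV cache
    and runtime buffers. Treat the result as a ballpark only."""
    weight_gb = params_billions * bits / 8  # e.g. 8B params at 4-bit ≈ 4 GB of weights
    return round(weight_gb * overhead, 1)

for name, size_b in [("llama3.1:8b", 8), ("command-r:35b", 35), ("llama3.1:70b", 70)]:
    print(f"{name}: ~{estimate_vram_gb(size_b)} GB at 4-bit")
# llama3.1:8b ≈ 4.8 GB, command-r:35b ≈ 21 GB, llama3.1:70b ≈ 42 GB
# (a 70B model needs heavier quantization or CPU offload to fit a 24 GB card)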
5. Ollama in Production (Docker)
To use Ollama in a professional CI/CD pipeline, you should run it inside a Docker container.
FROM ollama/ollama

# Pre-download the model during the build so it's ready on startup.
# "ollama pull" needs a running server, so start one in the background,
# wait for it to come up, then pull. The server only lives for this build step.
RUN ollama serve & sleep 5 && ollama pull llama3.1
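In a pipeline, the container can take a few seconds to start accepting requests, so it helps to gate the agent job on a readiness check. Below is a minimal sketch that polls the server's root endpoint (which responds once Ollama is running); the URL and the 60-second timeout are assumptions to adapt to your setup.
import time
import requests

def wait_for_ollama(base_url: str = "http://localhost:11434", timeout_s: int = 60) -> None:
    """Block until the Ollama server answers, or raise after timeout_s seconds."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(base_url, timeout=2).ok:  # root endpoint answers "Ollama is running"
                return
        except requests.ConnectionError:
            pass  # server not up yet, keep polling
        time.sleep(1)
    raise TimeoutError(f"Ollama did not become ready within {timeout_s}s")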
6. The "Modelfile" Secret
Ollama allows you to create your own "Model Variations" using a Modelfile. This is perfect for setting a permanent System Prompt for an agent.
# Modelfile
FROM llama3.1
PARAMETER temperature 0.2
SYSTEM "You are a specialized Python coder and you only output code blocks."
Create it with: ollama create my-coder-agent -f Modelfile.
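Once the variant is built, you reference it by name like any other model, so the system prompt and temperature travel with the model instead of living in your code. A minimal sketch, assuming the my-coder-agent variant created above:
from langchain_ollama import ChatOllama

# The Modelfile's SYSTEM prompt and temperature are baked into the variant,
# so no extra prompt plumbing is needed here.
coder = ChatOllama(model="my-coder-agent")
print(coder.invoke("Write a function that reverses a string.").content)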
Summary and Mental Model
Think of Ollama as the engine bay of your car: the models (Llama, Mistral) are the engines you drop in.
- You provide the Fuel (Prompts).
- Your code is the Steering Wheel (LangGraph).
Ollama lets you swap out the engine (Llama vs. Mistral) without changing the design of the car.
Exercise: Local Setup
- Installation: Install Ollama from ollama.com and pull the llama3.1 model.
- Connectivity: Try to call the Ollama API using curl:
curl http://localhost:11434/api/generate -d '{ "model": "llama3.1", "prompt": "Why is local AI better?" }'
- Logic: What happens if your code tries to call a model that isn't downloaded yet? (Hint: How would you add a "Check and Pull" node in your graph?)
Ready to compare performance? Next lesson: Local vs Cloud Model Performance.