
The Local Hub: Mastering Ollama for Agents
Master the most popular local LLM runtime. Learn how to pull models, manage context, and integrate Ollama into your LangChain and LangGraph workflows.
Ollama as a Local Agent Hub
To run local agents, you need a Model Runtime: software that loads the LLM weights into memory (GPU VRAM, or system RAM as a fallback) and exposes an API your code can talk to. Ollama has become the industry favorite because of its simplicity, its "Docker-like" CLI, and its strong support for tool-calling models.
In this lesson, we will learn how to set up Ollama and connect it as the "Brain" of our LangGraph agents.
1. What is Ollama?
Think of Ollama as "Docker for LLMs."
- It handles the complex math and GPU drivers for you.
- It provides a standardized local API (by default at localhost:11434), including an OpenAI-compatible endpoint under /v1.
Core Commands
ollama pull llama3.1 # Download a model
ollama run llama3.1 # Start a chat in the terminal
ollama serve # Start the background API server
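Before wiring the server into an agent, it helps to confirm it is reachable and see which models are already on disk. Below is a minimal sketch that queries the /api/tags endpoint, assuming the default localhost:11434 address and the requests package; verify the response fields against your installed version.
import requests

OLLAMA_URL = "http://localhost:11434"  # default address; change if the server is remote

def list_local_models() -> list[str]:
    """Return the names of models already pulled into the local Ollama store."""
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

if __name__ == "__main__":
    print(list_local_models())  # e.g. ['llama3.1:latest', 'mistral:latest']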
2. Choosing Agentic Models in Ollama
Not every model in Ollama can handle a Tool Call. You must look for models that were specifically fine-tuned for Function Calling.
Top Recommendations:
- Llama 3.1 (8B or 70B): Currently the gold standard for local tool-calling agents.
- Mistral / Mixtral: Known for high instruction-following accuracy.
- Command R: Specifically designed for RAG and long-context tool use.
3. Connecting Ollama to LangChain
Ollama exposes a local API (including an OpenAI-compatible endpoint), and the langchain-ollama integration wraps it, so pointing your existing LangChain code at local hardware is usually a one-line change.
from langchain_ollama import ChatOllama

# Connect to the local server
llm = ChatOllama(
    model="llama3.1",
    temperature=0,
    # Optional: Set the base URL if Ollama is on a different machine
    # base_url="http://192.168.1.15:11434"
)

# It behaves exactly like ChatOpenAI!
response = llm.invoke("Hello from my own hardware")
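To make this the "Brain" of a LangGraph agent, you can hand the same ChatOllama instance to a prebuilt ReAct agent, which also lets you verify that the model you picked in Section 2 really emits tool calls. A minimal sketch, assuming a recent langgraph release and llama3.1 already pulled; get_word_length is a toy tool invented for illustration.
from langchain_core.tools import tool
from langchain_ollama import ChatOllama
from langgraph.prebuilt import create_react_agent

@tool
def get_word_length(word: str) -> int:
    """Return the number of characters in a word."""
    return len(word)

# llama3.1 is fine-tuned for function calling, so it can drive the tool-calling loop
llm = ChatOllama(model="llama3.1", temperature=0)

agent = create_react_agent(llm, [get_word_length])

result = agent.invoke(
    {"messages": [("user", "How many characters are in 'Ollama'?")]}
)
print(result["messages"][-1].content)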
4. Hardware Management: VRAM is King
When running Ollama, the limiting factor is VRAM (Video RAM) on your Graphics Card.
- 8GB VRAM: Can run 7B or 8B models (Llama 3, Mistral) comfortably.
- 24GB VRAM (NVIDIA 3090/4090): Can run "Medium" models like Command R, or heavily quantized 70B models.
- Mac M1/M2/M3 (Unified Memory): Excellent for local agents because system RAM is shared with the GPU. A 64GB Mac can run much larger models (such as quantized 70B) at usable speeds.
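These sizing guidelines follow a simple rule of thumb: at 4-bit quantization a model needs roughly half a gigabyte per billion parameters for its weights, plus headroom for the KV cache and runtime buffers. The sketch below encodes that estimate; the bits-per-weight figure and the 20% overhead factor are rough planning assumptions, not measurements.
def estimate_vram_gb(params_billions: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weights plus ~20% overhead for the KV cache
    and runtime buffers. Treat the result as a ballpark only."""
    weight_gb = params_billions * bits / 8  # e.g. 8B params at 4-bit ≈ 4 GB of weights
    return round(weight_gb * overhead, 1)

for name, size_b in [("llama3.1:8b", 8), ("command-r:35b", 35), ("llama3.1:70b", 70)]:
    print(f"{name}: ~{estimate_vram_gb(size_b)} GB at 4-bit")
# llama3.1:8b ≈ 4.8 GB, command-r:35b ≈ 21 GB, llama3.1:70b ≈ 42 GB
# (a 70B model needs heavier quantization or CPU offload to fit a 24 GB card)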
5. Ollama in Production (Docker)
To use Ollama in a professional CI/CD pipeline, you should run it inside a Docker container.
FROM ollama/ollama

# Pre-download the model during the build so it's ready on startup.
# "ollama pull" needs a running server, so start one in the background,
# wait for it to come up, then pull. The server only lives for this build step.
RUN ollama serve & sleep 5 && ollama pull llama3.1
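In a pipeline, the container can take a few seconds to start accepting requests, so it helps to gate the agent job on a readiness check. Below is a minimal sketch that polls the server's root endpoint (which responds once Ollama is running); the URL and the 60-second timeout are assumptions to adapt to your setup.
import time
import requests

def wait_for_ollama(base_url: str = "http://localhost:11434", timeout_s: int = 60) -> None:
    """Block until the Ollama server answers, or raise after timeout_s seconds."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(base_url, timeout=2).ok:  # root endpoint answers "Ollama is running"
                return
        except requests.ConnectionError:
            pass  # server not up yet, keep polling
        time.sleep(1)
    raise TimeoutError(f"Ollama did not become ready within {timeout_s}s")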
6. The "Modelfile" Secret
Ollama allows you to create your own "Model Variations" using a Modelfile. This is perfect for setting a permanent System Prompt for an agent.
# Modelfile
FROM llama3.1
PARAMETER temperature 0.2
SYSTEM "You are a specialized Python coder and you only output code blocks."
Create it with: ollama create my-coder-agent -f Modelfile.
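Once the variant is built, you reference it by name like any other model, so the system prompt and temperature travel with the model instead of living in your code. A minimal sketch, assuming the my-coder-agent variant created above:
from langchain_ollama import ChatOllama

# The Modelfile's SYSTEM prompt and temperature are baked into the variant,
# so no extra prompt plumbing is needed here.
coder = ChatOllama(model="my-coder-agent")
print(coder.invoke("Write a function that reverses a string.").content)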
Summary and Mental Model
Think of Ollama as the engine bay of your car: the models (Llama, Mistral) are the engines you drop in.
- You provide the Fuel (Prompts).
- Your code is the Steering Wheel (LangGraph).
Ollama lets you swap out the engine (Llama vs. Mistral) without changing the design of the car.
Exercise: Local Setup
- Installation: Install Ollama from ollama.com and pull the llama3.1 model.
- Connectivity: Try to call the Ollama API using curl:
curl http://localhost:11434/api/generate -d '{ "model": "llama3.1", "prompt": "Why is local AI better?" }'
- Logic: What happens if your code tries to call a model that isn't downloaded yet? (Hint: How would you add a "Check and Pull" node in your graph?)
Ready to compare performance? Next lesson: Local vs Cloud Model Performance.