Module 2 Lesson 2: Ollama Architecture
How Ollama works under the hood. Understanding the service, the CLI, and the llama.cpp engine.
Ollama Architecture: How It Thinks
To the user, Ollama feels like a magic box that turns prompts into text. To an engineer, it’s a beautifully designed bridge between high-level user requests and low-level hardware instructions.
Let’s look at the three main layers of its architecture.
1. The Interaction Layer (The CLI & API)
Ollama is not a "monolith." When you install it, you are actually installing two things:
- The Ollama Server: A background process (daemon) that listens for instructions.
- The Ollama CLI: The ollama [command] interface you use in your terminal.
When you type ollama run llama3, the CLI sends an HTTP POST request to the local server. The server then manages the model and streams the response back to your terminal window.
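A minimal sketch of the request the CLI issues on your behalf. The endpoint and port come from this lesson; the payload field names (model, prompt, stream) follow the public Ollama REST API, though the exact shape may vary by version:

```python
import json
import urllib.request

# The CLI translates "ollama run llama3" into a POST to the local server,
# which listens on http://localhost:11434 by default.
def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    payload = {"model": model, "prompt": prompt, "stream": True}
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request("llama3", "What is 2+2?")
print(req.full_url)      # http://localhost:11434/api/generate
print(req.get_method())  # POST
```

Any HTTP client can build this same request, which is exactly why tools other than the CLI can drive Ollama.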
2. The Management Layer (The Orchestrator)
This is the "Brain" of Ollama. It handles:
- Model Registry: Checking if you have the model downloaded.
- Memory Management: Deciding how much of the model to put on your GPU vs. your CPU.
- Quantization Handling: Making sure the compressed model weights are read correctly.
- Customization: Reading "Modelfiles" and applying system prompts.
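The customization step reads a Modelfile, a small config file with a Dockerfile-like syntax. A minimal sketch (the model name and prompt text are illustrative):

```
FROM llama3
PARAMETER temperature 0.7
SYSTEM "You are a concise assistant that answers in one sentence."
```

You would register this with a command like ollama create my-assistant -f Modelfile, where my-assistant is whatever name you choose.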
3. The Inference Engine (The Engine Room)
At the very bottom of the stack sits llama.cpp.
Ollama doesn't reinvent the wheel; it uses llama.cpp, an open-source C++ inference engine originally built for Meta's Llama models and since extended to many other model architectures.
- Optimization: It is written to be extremely fast on CPUs and GPUs.
- Portability: It allows Ollama to run on Windows, macOS, and Linux without maintaining a separate port for each platform.
- Unified Format: Ollama uses GGUF (which we’ll cover in Module 4), the standard file format that llama.cpp understands.
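As a small taste of Module 4: every GGUF file starts with the 4-byte magic b"GGUF" followed by a little-endian version number, which is how llama.cpp recognizes the format. A sketch of that check:

```python
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of every GGUF file

def parse_gguf_header(blob: bytes) -> int:
    """Validate the GGUF magic and return the format version.

    A GGUF file begins with the magic b"GGUF", followed by a
    little-endian uint32 version number.
    """
    if blob[:4] != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={blob[:4]!r})")
    (version,) = struct.unpack("<I", blob[4:8])
    return version

# Example: the first eight bytes of a version-3 GGUF file
print(parse_gguf_header(b"GGUF" + bytes([3, 0, 0, 0])))  # 3
```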
The Workflow of a Request
What happens when you ask: "What is 2+2?"
- Request: You type the prompt into the CLI.
- API Call: The CLI hits POST /api/generate.
- Model Loading: The server checks if the model is already in RAM. If not, it reads it from your SSD.
- Inference: The engine (llama.cpp) runs billions of math operations on your GPU/CPU.
- Detokenization: The model's output tokens are converted back into human-readable words.
- Streaming: The words are sent back one-by-one to your CLI.
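The streaming step above can be sketched in code. The server replies with newline-delimited JSON, where each line carries a "response" fragment and a "done" flag (field names follow the public /api/generate response format; the simulated fragments below are illustrative):

```python
import json

def assemble_stream(lines):
    """Reassemble a streamed /api/generate reply.

    The server sends one JSON object per line; each carries a
    "response" text fragment, and the final line sets "done": true.
    """
    text = []
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Simulated stream for the prompt "What is 2+2?"
stream = [
    '{"response": "2+2 ", "done": false}',
    '{"response": "is 4.", "done": false}',
    '{"response": "", "done": true}',
]
print(assemble_stream(stream))  # 2+2 is 4.
```

This is why answers appear word-by-word in your terminal: the CLI prints each fragment as it arrives instead of waiting for the full reply.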
Why This Architecture Wins
By separating the Server from the Inference Engine:
- Crash Protection: If a model crashes, the server stays alive.
- Remote Access: You can run the Ollama server on a powerful PC in your basement and use the CLI on your lightweight laptop upstairs.
- Extensibility: You can build a web-based GUI that talks to the same API the CLI uses.
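Because client and server only talk over HTTP, pointing at a remote machine is just a base-URL change. A hypothetical client-side sketch, assuming the conventional OLLAMA_HOST environment variable and the default port 11434:

```python
import os

def ollama_base_url() -> str:
    """Resolve the server address the way a client might.

    OLLAMA_HOST lets you target a remote server; the default is the
    local daemon on port 11434.
    """
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    # Allow bare "host:port" values by adding a scheme if missing.
    if not host.startswith(("http://", "https://")):
        host = "http://" + host
    return host.rstrip("/")

print(ollama_base_url() + "/api/generate")
```

A web GUI, a script, and the CLI can all resolve the same address this way and share one running server.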
Key Takeaways
- Ollama is a Client-Server system.
- It uses llama.cpp as the high-performance inference engine.
- It exposes a REST API on port 11434, making it easy for external apps to integrate for automation or UI development.
- It manages your hardware automatically, so you don't have to manually configure GPU settings.