
Local Embeddings with Ollama
Learn how to generate high-quality embeddings locally for privacy and cost efficiency.
For many production systems, privacy and cost are major hurdles. Sending millions of chunks to a cloud API (like OpenAI) can be expensive and may violate data residency rules. Ollama allows you to run high-quality embedding models entirely on your own hardware.
Why Local Embeddings?
- Low latency: No network round-trip to a cloud server; response time depends only on your local hardware.
- No per-token costs: You pay for hardware and electricity, not for every token you embed.
- Data security: Sensitive documents never leave your server.
Popular Local Embedding Models
- Nomic Embed Text: A high-performance model designed for long-context retrieval, with an 8,192-token context window.
- all-MiniLM-L6-v2: Lightweight and very fast, great for edge devices.
- BGE (from BAAI, the Beijing Academy of Artificial Intelligence): Its models frequently rank near the top of the MTEB leaderboard for retrieval accuracy.
Implementation with Ollama
First, pull the model:
ollama pull nomic-embed-text
Then, use the Python client:
import ollama

def get_local_embedding(text):
    # Calls the local Ollama server; the model must already be pulled.
    response = ollama.embeddings(
        model='nomic-embed-text',
        prompt=text
    )
    return response['embedding']

vector = get_local_embedding("The quick brown fox jumps over the lazy dog.")
print(len(vector))  # 768 dimensions for nomic-embed-text
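To sanity-check that the vectors capture meaning, you can compare a few sentences with cosine similarity. The sketch below reuses get_local_embedding and plain-Python math; no extra libraries are assumed, and the example sentences are arbitrary.

import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

similar = cosine_similarity(
    get_local_embedding("How do I reset my password?"),
    get_local_embedding("I forgot my login credentials."),
)
unrelated = cosine_similarity(
    get_local_embedding("How do I reset my password?"),
    get_local_embedding("The weather is sunny today."),
)
print(similar, unrelated)  # the first score should be noticeably higher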
Performance Tuning
Local embeddings are CPU/GPU intensive.
- Batching: Send 10-50 chunks per request to maximize GPU throughput (see the sketch after this list).
- Quantization: Using smaller model formats (e.g., 4-bit) can speed up generation with minimal loss in accuracy.
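Here is a minimal batching sketch. It assumes a recent version of the ollama Python client that exposes the batch embedding endpoint as ollama.embed (older clients only offer the single-prompt ollama.embeddings call shown above); the batch_size of 32 is an arbitrary starting point to tune for your hardware.

import ollama

def embed_batch(texts, model='nomic-embed-text', batch_size=32):
    # Embed a list of texts in fixed-size batches to keep the GPU busy.
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = ollama.embed(model=model, input=batch)  # one request per batch
        vectors.extend(response['embeddings'])
    return vectors

chunks = [f"Document chunk number {i}" for i in range(100)]
vectors = embed_batch(chunks)
print(len(vectors), len(vectors[0]))  # 100 vectors, 768 dimensions each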
When to Avoid Local Embeddings
- If you don't have a GPU (CPU embedding is slow for large datasets).
- If your application requires state-of-the-art (SOTA) retrieval quality that, for now, only the largest hosted models provide.
Exercises
- Pull the nomic-embed-text model in Ollama.
- Compare the time it takes to embed 100 sentences locally vs. using an API (see the timing sketch after this list).
- How does the "Dimension" (length of the vector) of a local model compare to OpenAI's models?
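For the timing comparison, a hypothetical local-only harness might look like the sketch below; swap the loop body for your cloud provider's SDK call to measure the API side, and keep the sentence list identical for a fair comparison.

import time
import ollama

sentences = [f"Benchmark sentence number {i}." for i in range(100)]

start = time.perf_counter()
for sentence in sentences:
    # Single-prompt calls; try the batched approach from above as well.
    ollama.embeddings(model='nomic-embed-text', prompt=sentence)
elapsed = time.perf_counter() - start

print(f"Embedded {len(sentences)} sentences locally in {elapsed:.2f} seconds")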