Local Embeddings with Ollama

Learn how to generate high-quality embeddings locally for privacy and cost efficiency.

For many production systems, privacy and cost are major hurdles. Sending millions of chunks to a cloud API (like OpenAI) can be expensive and may violate data residency rules. Ollama allows you to run high-quality embedding models entirely on your own hardware.

Why Local Embeddings?

  1. Low Latency: No network round-trip to a cloud server; inference happens on your own machine.
  2. Predictable Cost: You pay for hardware and electricity, not per token, so large workloads don't inflate your bill.
  3. Data Security: Sensitive documents never leave your server.

Popular Local Embedding Models

  • Nomic Embed Text: A high-performance model designed specifically for long-context retrieval (8192 tokens).
  • all-MiniLM-L6-v2: Lightweight and very fast, great for edge devices.
  • BGE (BAAI General Embedding): From the Beijing Academy of Artificial Intelligence; frequently tops the MTEB leaderboard for accuracy.

Implementation with Ollama

First, pull the model: ollama pull nomic-embed-text

Then, use the Python client:

import ollama

def get_local_embedding(text):
    # Calls the local Ollama server (which must be running) and
    # returns the embedding vector for the given text.
    response = ollama.embeddings(
        model='nomic-embed-text',
        prompt=text
    )
    return response['embedding']

vector = get_local_embedding("The quick brown fox jumps over the lazy dog.")
print(len(vector))  # Should be 768 for nomic-embed-text
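
To sanity-check what these vectors give you, compare two embeddings with cosine similarity. This is a minimal sketch using only the standard library and the get_local_embedding function defined above; the cosine_similarity helper is our own, not part of the ollama client.

import math

def cosine_similarity(a, b):
    # Cosine similarity = dot(a, b) / (|a| * |b|); scores closer
    # to 1.0 mean the texts are semantically more similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = get_local_embedding("How do I reset my password?")
v2 = get_local_embedding("Steps to recover a forgotten password")
v3 = get_local_embedding("Best pizza toppings for a party")

print(cosine_similarity(v1, v2))  # Expect a relatively high score
print(cosine_similarity(v1, v3))  # Expect a noticeably lower score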

Performance Tuning

Local embedding generation is compute-intensive, and throughput depends heavily on your hardware. Two common levers:

  • Batching: Process 10-50 chunks in a single call to maximize GPU throughput (see the sketch after this list).
  • Quantization: Smaller model formats (e.g., 4-bit) can speed up generation with minimal loss in accuracy.
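
Here is a minimal batching sketch. It assumes a recent version of the ollama Python client, which exposes ollama.embed() accepting a list of inputs; the batch size of 32 is an arbitrary starting point to tune for your hardware.

import ollama

def embed_batch(texts, batch_size=32):
    # Send chunks in batches instead of one request per chunk.
    # embed() accepts a list input and returns one vector per text.
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = ollama.embed(
            model='nomic-embed-text',
            input=batch
        )
        vectors.extend(response['embeddings'])
    return vectors

chunks = [f"Document chunk {i}" for i in range(100)]
embeddings = embed_batch(chunks)
print(len(embeddings), len(embeddings[0]))  # 100 vectors, 768 dims each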

When to Avoid Local Embeddings

  • If you don't have a GPU: CPU-only embedding is workable for small jobs but slow for large datasets.
  • If your application requires state-of-the-art (SOTA) retrieval quality that only the largest cloud-hosted models currently deliver.

Exercises

  1. Pull the nomic-embed-text model in Ollama.
  2. Compare the time it takes to embed 100 sentences locally vs. using an API (a timing sketch to start from follows this list).
  3. How does the "Dimension" (length of the vector) of a local model compare to OpenAI's models?
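
As a starting point for exercise 2, here is a minimal timing harness. It measures only the local side; the API comparison is left to you, the sentences are placeholders, and get_local_embedding is the function defined earlier.

import time

sentences = [f"This is test sentence number {i}." for i in range(100)]

start = time.perf_counter()
for s in sentences:
    get_local_embedding(s)
elapsed = time.perf_counter() - start

print(f"Embedded {len(sentences)} sentences in {elapsed:.2f}s "
      f"({elapsed / len(sentences) * 1000:.1f} ms each)")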
