
Local Embeddings with Ollama
Learn how to generate high-quality embeddings locally for privacy and cost efficiency.
For many production systems, privacy and cost are major hurdles. Sending millions of chunks to a cloud API (like OpenAI) can be expensive and may violate data residency rules. Ollama allows you to run high-quality embedding models entirely on your own hardware.
Why Local Embeddings?
- Low latency: No network round-trip to a cloud server; response time depends only on your local hardware.
- No per-token costs: You pay for hardware and electricity, not for every token you embed.
- Data security: Sensitive documents never leave your server.
Popular Local Embedding Models
- Nomic Embed Text: A high-performance model designed for long-context retrieval, with an 8,192-token context window.
- all-MiniLM-L6-v2: Lightweight and very fast, great for edge devices.
- BGE (from BAAI, the Beijing Academy of Artificial Intelligence): Its models frequently rank near the top of the MTEB leaderboard for retrieval accuracy.
Implementation with Ollama
First, pull the model:
ollama pull nomic-embed-text
Then, use the Python client:
import ollama

def get_local_embedding(text):
    # Calls the local Ollama server; the model must already be pulled.
    response = ollama.embeddings(
        model='nomic-embed-text',
        prompt=text
    )
    return response['embedding']

vector = get_local_embedding("The quick brown fox jumps over the lazy dog.")
print(len(vector))  # 768 dimensions for nomic-embed-text
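To sanity-check that the vectors capture meaning, you can compare a few sentences with cosine similarity. The sketch below reuses get_local_embedding and plain-Python math; no extra libraries are assumed, and the example sentences are arbitrary.

import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

similar = cosine_similarity(
    get_local_embedding("How do I reset my password?"),
    get_local_embedding("I forgot my login credentials."),
)
unrelated = cosine_similarity(
    get_local_embedding("How do I reset my password?"),
    get_local_embedding("The weather is sunny today."),
)
print(similar, unrelated)  # the first score should be noticeably higher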
Performance Tuning
Local embeddings are CPU/GPU intensive.
- Batching: Send 10-50 chunks per request to maximize GPU throughput (see the sketch after this list).
- Quantization: Using smaller model formats (e.g., 4-bit) can speed up generation with minimal loss in accuracy.
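Here is a minimal batching sketch. It assumes a recent version of the ollama Python client that exposes the batch embedding endpoint as ollama.embed (older clients only offer the single-prompt ollama.embeddings call shown above); the batch_size of 32 is an arbitrary starting point to tune for your hardware.

import ollama

def embed_batch(texts, model='nomic-embed-text', batch_size=32):
    # Embed a list of texts in fixed-size batches to keep the GPU busy.
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = ollama.embed(model=model, input=batch)  # one request per batch
        vectors.extend(response['embeddings'])
    return vectors

chunks = [f"Document chunk number {i}" for i in range(100)]
vectors = embed_batch(chunks)
print(len(vectors), len(vectors[0]))  # 100 vectors, 768 dimensions each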
When to Avoid Local Embeddings
- If you don't have a GPU (CPU embedding is slow for large datasets).
- If your application requires state-of-the-art (SOTA) retrieval quality that, for now, only the largest hosted models provide.
Exercises
- Pull the nomic-embed-text model in Ollama.
- Compare the time it takes to embed 100 sentences locally vs. using an API (see the timing sketch after this list).
- How does the "Dimension" (length of the vector) of a local model compare to OpenAI's models?
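For the timing comparison, a hypothetical local-only harness might look like the sketch below; swap the loop body for your cloud provider's SDK call to measure the API side, and keep the sentence list identical for a fair comparison.

import time
import ollama

sentences = [f"Benchmark sentence number {i}." for i in range(100)]

start = time.perf_counter()
for sentence in sentences:
    # Single-prompt calls; try the batched approach from above as well.
    ollama.embeddings(model='nomic-embed-text', prompt=sentence)
elapsed = time.perf_counter() - start

print(f"Embedded {len(sentences)} sentences locally in {elapsed:.2f} seconds")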