
Keyword Search vs Semantic Search: Bridging the Semantic Gap
Master the differences between lexical and vector-based search. Learn about inverted indexes, TF-IDF, embeddings, and why Hybrid Search is the production standard for AI applications.
In the previous lesson, we identified the "Semantic Gap"—the disconnect between strings and meaning. Today, we will break down the two primary ways we bridge that gap: Keyword Search (Lexical) and Semantic Search (Vector-based).
For developers building production systems, it is rarely a choice between one or the other. Instead, it is about understanding the strengths and weaknesses of each to build a "Hybrid" system that survives the complexities of real-world user behavior.
1. What is Keyword Search? (Lexical Retrieval)
Keyword search is the technology that powered web search from the early 1990s through the mid-2010s. It is based on string matching and term frequency.
How it works: The Inverted Index
Imagine a physical book. At the back, there is an index. You look for the word "Vector," and it tells you it's on pages 12, 45, and 89.
Computer systems like Elasticsearch, Solr, and SQL full-text indexes use an Inverted Index. When you ingest a document, the system breaks it into "Tokens" (words), removes common words (stop words like "the", "a"), and creates a map of which words appear in which documents.
```mermaid
graph LR
    D1["Doc 1: 'I love cats'"] --> T[Tokenization]
    D2["Doc 2: 'Cats are cute'"] --> T
    T --> I[Inverted Index]
    I --> W1["'love' -> Doc 1"]
    I --> W2["'cats' -> Doc 1, Doc 2"]
    I --> W3["'cute' -> Doc 2"]
```
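Here is a minimal sketch of that pipeline in Python. The documents and stop-word list are toy placeholders; real engines like Elasticsearch add stemming, lowercasing rules, and far smarter tokenization.

```python
from collections import defaultdict

# Toy inverted index: tokenize, drop stop words, map token -> doc IDs
STOP_WORDS = {"i", "are", "the", "a"}

docs = {1: "I love cats", 2: "Cats are cute"}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        if token not in STOP_WORDS:
            inverted_index[token].add(doc_id)

print(dict(inverted_index))
# {'love': {1}, 'cats': {1, 2}, 'cute': {2}}
```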
The Scoring: TF-IDF and BM25
How does the database know which document is better? It uses math like TF-IDF (Term Frequency-Inverse Document Frequency) or its modern successor BM25.
- TF (Term Frequency): How many times does "cat" appear in this document? (More is usually better).
- IDF (Inverse Document Frequency): How rare is the word "cat" across the whole database? (Rare words like "Vector" are more important than common words like "Example").
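To make the formula concrete, here is a hedged sketch of classic TF-IDF in plain Python (the corpus is a toy placeholder). Real engines use BM25, which layers document-length normalization and term-frequency saturation on top of this same idea.

```python
import math

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "vectors encode meaning",
]

def tf_idf(term, doc, corpus):
    tf = doc.split().count(term)                 # term frequency in this doc
    df = sum(term in d.split() for d in corpus)  # how many docs contain it
    idf = math.log(len(corpus) / (1 + df))       # rare terms weigh more
    return tf * idf

# A word found everywhere scores near zero; a rare word dominates.
print(tf_idf("the", corpus[0], corpus))      # 0.0
print(tf_idf("vectors", corpus[2], corpus))  # ~0.405
```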
Strengths of Keyword Search:
- Precision for Unique IDs: Searching for a SKU number (B08XY123), a name (Sudeep), or a specific technical term.
- Speed: Inverted indexes are incredibly fast and memory-efficient.
- No External Models: You don't need an AI model to build or query an inverted index.
Weaknesses:
- Zero-result problem: If the user makes a typo or uses a synonym, they get nothing.
- No context: "How to stop a car" and "A car stop" look the same to a keyword index (see the sketch below).
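You can see the "no context" problem in three lines: once both phrases are lowercased, tokenized into a set, and stripped of stop words, they become indistinguishable.

```python
# Bag-of-words matching discards word order and grammar
STOP_WORDS = {"how", "to", "a"}

tokens_1 = set("How to stop a car".lower().split()) - STOP_WORDS
tokens_2 = set("A car stop".lower().split()) - STOP_WORDS

print(tokens_1)              # {'stop', 'car'}
print(tokens_1 == tokens_2)  # True -- the index cannot tell them apart
```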
2. What is Semantic Search? (Vector Retrieval)
Semantic search doesn't look for words. It looks for Relationships. It uses the geometric distance between vectors to find "neighbors" in meaning.
How it works: The Dense Vector Space
Instead of a sparse map of words, semantic search uses a Dense Vector. Every document is compressed into a fixed-size list of numbers (e.g., 1536 floats).
In this space, "Apple" the fruit is closer to "Orange" than it is to "Apple" the computer company, because the embedding model was trained on millions of sentences and learned that context matters.
```mermaid
graph TD
    subgraph "3D Space"
        A["Query: 'Healthy Snacks'"]
        B["Result 1: 'Organic Apples'"]
        C["Result 2: 'Granola Bars'"]
        D["Result 3: 'Deep Fried Bacon'"]
    end
    A -.-> B
    A -.-> C
    A --x D
```
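You can inspect that "fixed-size list of numbers" directly. This sketch uses the same all-MiniLM-L6-v2 model as the hybrid example later in the lesson; it happens to produce 384-dimensional vectors rather than 1536.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Any text, short or long, is compressed to the same fixed-size vector
vec = model.encode("Healthy snacks for the office")
print(vec.shape)  # (384,)
print(vec[:5])    # the first few coordinates in "meaning space"
```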
The Scoring: Distance Metrics
Instead of BM25, semantic search uses:
- Cosine Similarity: Comparing the angle between two vectors (measures direction/meaning).
- Euclidean Distance (L2): Comparing the straight-line distance between points.
- Dot Product: Measures both direction and magnitude.
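A quick NumPy sketch shows how these metrics can disagree. The toy vectors below point in the same direction but differ in magnitude, so Cosine Similarity calls them identical while Euclidean Distance does not.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

dot = np.dot(a, b)
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

print(f"Dot product: {dot}")        # 28.0 -- direction and magnitude
print(f"Cosine:      {cosine}")     # 1.0  -- the angle between them is zero
print(f"Euclidean:   {euclidean}")  # ~3.74 -- the points are still far apart
```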
Strengths:
- Synonym Handling: Automatically understands that "purchase" and "buy" are related.
- Multilingual: Cross-lingual models can find an English document even if the query is in Spanish.
- Multimodal: Can find an image based on a text description.
Weaknesses:
- "The Black Box": It is hard to explain why a vector database returned a specific result.
- Model Dependency: Your search quality is 100% dependent on your embedding model (e.g., OpenAI's text-embedding-3-small vs. an open-source Llama-3-Embed).
- Computational Cost: Vector calculations are heavier than string comparisons.
3. The Comparison Matrix
| Feature | Keyword Search (BM25) | Semantic Search (Vector) |
|---|---|---|
| Logic | Exact String Match | Meaning & Context |
| Speed | Extremely fast (O(log n) lookups) | Fast with ANN (approximate) |
| Storage | Small (Inverted Index) | Large (Vector Blobs) |
| Tolerance | Fragile (Typos kill it) | Robust (Handles typos & slang) |
| Out-of-Vocabulary | Fails on new words | Flexible |
| Best For | Part numbers, Names, Exact Phrases | FAQs, Recommendations, Discovery |
4. When Keyword Search Actually Wins
Beginners often think "Vectors are the future, so keywords are dead." This is a mistake.
In production AI, keyword search is often superior in these scenarios:
- Searching for "Special" Tokens: Words like iPhone 15 Pro Max or Log4j. Embedding models often "smooth out" these specific tokens into generic "phones" or "software," losing the precision the user needs.
- Short Queries: If a user searches for Nike, a vector database might return "Adidas" or "Shoes" because they are semantically similar. But the user specifically wanted the brand Nike.
- Boolean Filtering: "Show me shoes AND size 10 AND under $100." Traditional databases handle these hard constraints much better than pure vector math (see the sketch below).
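As a sketch of that last point, a hard boolean filter is a one-liner over structured data (the product list here is hypothetical), while pure vector math can only say a result is "roughly shoe-like".

```python
# Hard constraints: trivial for structured filters, awkward for vectors
products = [
    {"name": "Runner X", "category": "shoes", "size": 10, "price": 89.99},
    {"name": "Runner Y", "category": "shoes", "size": 9, "price": 79.99},
    {"name": "Trail Z", "category": "shoes", "size": 10, "price": 129.99},
]

matches = [p for p in products
           if p["category"] == "shoes" and p["size"] == 10 and p["price"] < 100]
print(matches)  # only "Runner X" satisfies all three constraints
```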
5. The Industry Solution: Hybrid Search
To get the "Best of Both Worlds," production systems use Hybrid Search.
In a Hybrid Search architecture:
- We run a Keyword Search (BM25) to find exact matches.
- We run a Vector Search to find conceptual matches.
- We combine the results using a technique called Reciprocal Rank Fusion (RRF).
RRF (Reciprocal Rank Fusion)
RRF gives a score to each document based on its rank in both lists. If a document is #1 in keywords and #2 in vectors, it gets a very high combined score. If it's #1 in vectors but not even in the top 100 for keywords, it still gets a decent score, but less than a "perfect" match.
```python
# Simplified RRF concept
def rrf_score(rank_lexical, rank_vector, k=60):
    # k=60 is the constant from the original RRF paper; it keeps a single
    # #1 ranking from completely dominating the fused score
    return (1.0 / (k + rank_lexical)) + (1.0 / (k + rank_vector))
```
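Continuing with rrf_score, here is a toy fusion of two ranked lists. The document IDs are placeholders, and a document missing from one list receives a large sentinel rank so it is penalized rather than excluded.

```python
lexical_ranking = ["doc_a", "doc_b", "doc_c"]  # ranks 1, 2, 3
vector_ranking = ["doc_b", "doc_c", "doc_a"]

def rank_of(doc, ranking, missing=1000):
    # 1-based rank; absent documents get a punitive sentinel rank
    return ranking.index(doc) + 1 if doc in ranking else missing

for doc in ["doc_a", "doc_b", "doc_c"]:
    score = rrf_score(rank_of(doc, lexical_ranking),
                      rank_of(doc, vector_ranking))
    print(doc, f"{score:.5f}")
# doc_b wins (ranks 2 and 1), just ahead of doc_a (ranks 1 and 3)
```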
6. Python Implementation: Simulating Hybrid Search
Let's use Python to simulate how we might combine a simple keyword check with a vector search.
```python
from sentence_transformers import SentenceTransformer, util

# Dataset
docs = [
    "How to reset your password",
    "Password security best practices",
    "Forgotten my account access code",
    "New user registration guide",
]

# Query
query = "I forgot my password"

# 1. Lexical score (simulated BM25): what fraction of the query's words
#    appear verbatim in the document?
query_words = set(query.lower().split())
lexical_scores = []
for doc in docs:
    doc_words = set(doc.lower().split())
    overlap = len(query_words.intersection(doc_words))
    lexical_scores.append(overlap / len(query_words))

# 2. Vector score (semantic): cosine similarity between the embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(docs, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
vector_scores = util.cos_sim(query_embedding, doc_embeddings)[0].tolist()

# 3. Hybrid combination (simple weighted average for demonstration)
ALPHA = 0.5  # balance between lexical and semantic scores
hybrid_scores = [(ALPHA * l) + ((1 - ALPHA) * v)
                 for l, v in zip(lexical_scores, vector_scores)]

# Sort and display
results = sorted(zip(docs, hybrid_scores), key=lambda x: x[1], reverse=True)
print(f"Query: {query}\n")
print("Top Hybrid Results:")
for doc, score in results:
    print(f"[{score:.4f}] {doc}")
```
Analysis of results:
"How to reset your password" ranks highest because it wins on both sides: direct keyword overlap ("password") and high semantic similarity (the model has learned that "reset" and "forgot" appear in similar contexts). "Forgotten my account access code" also ranks well purely on semantic similarity, even though it never mentions the word "password."
7. Real-World Architecture using OpenSearch
In Module 7, we will dive into OpenSearch, one of the few databases that lets you run lexical and vector search in the same engine.
A production OpenSearch query for hybrid search looks like this (conceptually):
```json
{
  "query": {
    "hybrid": {
      "queries": [
        { "match": { "text": "forgot password" } },                   // Lexical
        { "knn": { "vector_field": { "vector": [...], "k": 10 } } }   // Semantic
      ]
    }
  }
}
```
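For orientation, the client-side call might look like the sketch below using the opensearch-py library. The host, index, field names, and pipeline name are all placeholder assumptions, and hybrid queries only work once the cluster has a search pipeline with a score-normalization processor; we will configure all of this properly in Module 7.

```python
from opensearchpy import OpenSearch

# Placeholder connection details -- adjust for your cluster
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query = {
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"text": "forgot password"}},  # lexical half
                {"knn": {"vector_field": {"vector": [0.1] * 384, "k": 10}}},  # semantic half
            ]
        }
    }
}

# "hybrid-pipeline" is a hypothetical pipeline that normalizes and
# combines the two score distributions before final ranking
response = client.search(
    index="support-docs",
    body=query,
    params={"search_pipeline": "hybrid-pipeline"},
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("text"))
```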
Summary and Key Takeaways
Understanding the trade-offs between Keyword and Semantic search is the difference between a demo and a production system.
- Keyword Search is for precision, specific IDs, and exact phrases.
- Semantic Search is for intent, ambiguity, and multi-modal discovery.
- Hybrid Search is the industry standard, combining both via Reciprocal Rank Fusion.
In the next lesson, we will explore Why Relational Databases are not enough, looking at the specific hardware and algorithmic limitations that prevent PostgreSQL or MySQL from being viable vector engines at scale.
Exercise: Comparing Search Quality
Go to an e-commerce site (like Amazon) and a documentation site (like Microsoft Learn or MDN).
- Search for a specific product ID (e.g., a serial number).
- Search for a vague concept (e.g., "Something to fix a leaky pipe").
- Observe which site handles both well.
Can you tell which one is using pure keyword matching? (Look for sites that fail when you use a synonym like "faucet" instead of "tap").