
Python Masterclass: Implementing Hybrid Search in OpenSearch
Go from theory to code. Build a production-ready Python client for OpenSearch Hybrid Search, including score normalization and pipeline management.
Python: Implementing Hybrid Search
We have reached the culmination of Module 7. You have seen the architecture, the mappings, and the decision frameworks for OpenSearch. Now, we write the code.
In this lesson, we will build a complete Python search client. We will go through the process of:
- Creating a Search Pipeline for normalization.
- Ingesting documents with text and vectors.
- Executing a Hybrid Query that targets both the inverted index and the k-NN index.
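- Handling the results in a way that respects the combined ranking produced by the pipeline (min-max normalization plus a weighted arithmetic mean, rather than RRF, Reciprocal Rank Fusion).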
1. Prerequisites: The "Search Pipeline"
Before we can run a hybrid query in Python, we must ensure OpenSearch is configured to merge the scores. We do this once during the setup of our application.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}], http_auth=('admin', 'admin'))

def create_search_pipeline():
    pipeline_id = "norm-pipeline"
    pipeline_body = {
        "description": "Normalize scores for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        # Weights follow the order of the sub-queries in the hybrid query
                        "parameters": {"weights": [0.3, 0.7]}  # 30% Keywords, 70% Vectors
                    }
                }
            }
        ]
    }
    client.transport.perform_request('PUT', f'/_search/pipeline/{pipeline_id}', body=pipeline_body)
    print(f"Pipeline '{pipeline_id}' created.")

# create_search_pipeline()
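To make the pipeline settings concrete, here is a minimal pure-Python sketch of what min_max normalization followed by a weighted arithmetic mean does to two result lists. It is an illustration only, not OpenSearch's implementation: the function names and sample scores are hypothetical, and it assumes a document missing from one list contributes a score of 0 for that sub-query, which may differ from the engine's exact handling.

```python
def min_max(scores):
    """Scale a {doc_id: score} map into [0, 1]; a constant map becomes all 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_mean(keyword_scores, vector_scores, weights=(0.3, 0.7)):
    """Normalize each list, then combine per document with a weighted mean."""
    kw = min_max(keyword_scores)
    vec = min_max(vector_scores)
    w_kw, w_vec = weights
    combined = {}
    for doc_id in set(kw) | set(vec):
        # Assumption: a document absent from one list scores 0.0 there.
        total = w_kw * kw.get(doc_id, 0.0) + w_vec * vec.get(doc_id, 0.0)
        combined[doc_id] = total / (w_kw + w_vec)
    # Highest combined score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores: BM25 is unbounded, cosine similarity is not.
bm25 = {"doc_a": 12.0, "doc_b": 4.0, "doc_c": 8.0}
knn = {"doc_a": 0.60, "doc_b": 0.90, "doc_d": 0.70}

ranking = weighted_mean(bm25, knn)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

With a 70% vector weight, doc_b ranks first even though it has the lowest keyword score, which is the kind of trade-off the weights parameter controls.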
2. Setting Up the Index Mapping
We need an index that supports both text (for Keywords) and knn_vector (for Semantic).
def create_hybrid_index(index_name):
    settings = {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "content_text": {"type": "text"},
                "content_vector": {
                    "type": "knn_vector",
                    "dimension": 1536,  # must match the output size of your embedding model
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "nmslib"
                    }
                },
                "metadata": {"type": "keyword"}
            }
        }
    }
    client.indices.create(index=index_name, body=settings)
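Because the mapping fixes the vector dimension at 1536, a document whose vector has a different length will be rejected at index time. A small client-side check can catch this earlier. The sketch below is one possible approach; validate_document and the sample documents are hypothetical and not part of opensearch-py.

```python
EXPECTED_DIMENSION = 1536  # must match "dimension" in the index mapping


def validate_document(doc, dimension=EXPECTED_DIMENSION):
    """Return a list of problems; an empty list means the doc fits the mapping."""
    problems = []
    if not isinstance(doc.get("content_text"), str):
        problems.append("content_text must be a string")

    vector = doc.get("content_vector")
    if not isinstance(vector, (list, tuple)):
        problems.append("content_vector must be a list of numbers")
    elif len(vector) != dimension:
        problems.append(
            f"content_vector has {len(vector)} dimensions, expected {dimension}"
        )
    elif not all(
        isinstance(x, (int, float)) and not isinstance(x, bool) for x in vector
    ):
        problems.append("content_vector must contain only numbers")
    return problems


# Hypothetical documents for illustration
good = {
    "content_text": "How to reset your password",
    "content_vector": [0.0] * 1536,
    "metadata": "account",
}
bad = {"content_text": "Too short", "content_vector": [0.1, 0.2]}

print(validate_document(good))
print(validate_document(bad))
```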
3. The Hybrid Search Function
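This is the core of your application. Notice how we pass the query twice: once as a string for the keyword (match) query and once as an embedding vector for the k-NN query. The order matters, because the pipeline weights are applied to the sub-queries in the order they appear.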
def search_hybrid(query_text, query_vector, index_name="my_index"):
    search_query = {
        "size": 5,
        "query": {
            "hybrid": {
                "queries": [
                    {
                        "match": {
                            "content_text": query_text
                        }
                    },
                    {
                        "knn": {
                            "content_vector": {
                                "vector": query_vector,
                                "k": 10
                            }
                        }
                    }
                ]
            }
        }
    }
    # Passing the pipeline via params
    response = client.search(
        index=index_name,
        body=search_query,
        params={"search_pipeline": "norm-pipeline"}
    )
    return response['hits']['hits']
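If every search against the index should use the same pipeline, an alternative to passing search_pipeline on each request is to set it as the index default through the index.search.default_pipeline setting. The sketch below assumes an OpenSearch version that supports this setting; the helper names are hypothetical, and client is expected to be an opensearch-py OpenSearch instance.

```python
def default_pipeline_settings(pipeline_id="norm-pipeline"):
    """Build the settings body that makes a search pipeline the index default."""
    return {"index.search.default_pipeline": pipeline_id}


def set_default_pipeline(client, index_name, pipeline_id="norm-pipeline"):
    """Apply the default search pipeline to an existing index."""
    return client.indices.put_settings(
        index=index_name,
        body=default_pipeline_settings(pipeline_id),
    )


# Usage against a live cluster (left commented out, like the setup call earlier):
# set_default_pipeline(client, "my_index")
```

The trade-off is visibility: a default pipeline keeps call sites shorter, while an explicit parameter makes it obvious at each call which pipeline shaped the scores.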
4. Handling Results: The Metadata Advantage
Unlike Pinecone, where you often just get IDs, OpenSearch returns the full source document. This allows you to build rich UIs immediately.
results = search_hybrid("password reset", [0.12, 0.33, ...])
for hit in results:
    score = hit['_score']
    text = hit['_source']['content_text']
    category = hit['_source']['metadata']
    print(f"[{score:.4f}] Category: {category}")
    print(f"Content: {text[:100]}...")
    print("-" * 30)
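The loop above raises a KeyError if a document lacks one of the fields. A slightly more defensive variant, sketched below, turns hits into plain dictionaries that a UI layer can consume. The summarize_hits name, the "unknown" fallback, and the sample hits are all hypothetical; the hit layout (_id, _score, _source) follows the standard search response.

```python
def summarize_hits(hits, preview_chars=100):
    """Convert raw search hits into display-ready rows, tolerating missing fields."""
    rows = []
    for hit in hits:
        source = hit.get("_source", {})
        text = source.get("content_text", "")
        rows.append(
            {
                "id": hit.get("_id"),
                "score": hit.get("_score"),
                "category": source.get("metadata", "unknown"),
                "preview": text[:preview_chars],
            }
        )
    return rows


# Hypothetical hits shaped like response['hits']['hits']
sample_hits = [
    {
        "_id": "1",
        "_score": 0.82,
        "_source": {
            "content_text": "To reset your password, open Settings.",
            "metadata": "account",
        },
    },
    {
        "_id": "2",
        "_score": 0.41,
        "_source": {"content_text": "Billing cycles start on the first."},
    },
]

rows = summarize_hits(sample_hits)
for row in rows:
    print(f"[{row['score']:.4f}] {row['category']}: {row['preview']}")
```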
5. The "Enterprise" Tip: Bulk Helpers
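When dealing with large enterprise datasets, don't call the standard index() method once per document. Use the helpers.bulk method instead. It batches many documents into each request, which is typically far faster than indexing one at a time, and it can retry documents rejected with HTTP 429 if you pass max_retries.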
from opensearchpy import helpers

def bulk_ingest(docs):
    actions = [
        {
            "_index": "my_index",
            "_source": doc
        }
        for doc in docs
    ]
    helpers.bulk(client, actions)
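For datasets that do not fit comfortably in memory, the actions can be produced lazily with a generator instead of a list. The sketch below is one way to do that, under a few assumptions: generate_actions, bulk_ingest_streaming, and the doc_id field are hypothetical names, and the chunk_size and max_retries keyword arguments are forwarded by helpers.bulk to the underlying streaming helper.

```python
def generate_actions(docs, index_name="my_index", id_field=None):
    """Yield one bulk action per document without building a full list."""
    for doc in docs:
        action = {"_index": index_name, "_source": doc}
        # A stable _id makes re-running the ingest overwrite instead of duplicate.
        if id_field and id_field in doc:
            action["_id"] = doc[id_field]
        yield action


def bulk_ingest_streaming(client, docs, index_name="my_index", chunk_size=500):
    """Send documents in chunks; returns (success_count, errors)."""
    # Imported here so generate_actions can be used and tested without the library.
    from opensearchpy import helpers

    return helpers.bulk(
        client,
        generate_actions(docs, index_name, id_field="doc_id"),
        chunk_size=chunk_size,
        max_retries=3,  # retries documents rejected with HTTP 429
    )


# Hypothetical document; the vector is shortened here for readability.
sample_docs = [{"doc_id": "a1", "content_text": "Reset your password"}]
actions = list(generate_actions(sample_docs, id_field="doc_id"))
print(actions)
```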
Summary and Module 7 Wrap-up
You have completed the transition to Enterprise Vector Search.
- Architecture: You understand that OpenSearch is a document store + a keyword engine + a vector database.
- Configuration: You can define knn_vector mappings and choose a k-NN engine (nmslib/faiss).
- Logic: You can implement Hybrid Retrieval using score normalization and weighted combination.
- Code: You have a working Python pattern for production search pipelines.
What's Next?
In Module 8: CRUD Operations in Vector Databases, we go back to basics but with a "Production" twist. We will learn how to handle updates (UPSERTS), deletions, and the terrifying problem of Re-indexing when your model changes.
Exercise: Building a Re-ranker Hook
Modify the search_hybrid function above:
- Retrieve the top 20 results using the hybrid query.
- If the highest score is below 0.5, print a warning: "Low confidence result."
- Look into the rank_features field in OpenSearch. How could you add "User Popularity" as a third signal in your hybrid search?