Combining Keyword and Vector Search

We have discussed the theoretical benefits of Hybrid Search throughout this course. Now, we implement it. OpenSearch is well suited to hybrid retrieval because it supports both Keyword Search (BM25) and Vector Search (k-NN) as "First-Class Citizens" that can be combined in a single request.

In this lesson, we will explore the Hybrid Query type, how to normalize scores between vectors and text, and how to use the "Search Pipeline" feature in OpenSearch to automate the ranking of results.

1. The Challenge of Score Normalization

The biggest technical hurdle in hybrid search is that text scores and vector scores use different scales:

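BM25 (Text): Scores are unbounded. Values from 0 to 20 or more are common, and they depend on term frequency, document length, and how rare each term is in the index.
k-NN (Vector): Scores typically fall between 0 and 1 (for example, with normalized cosine similarity).

You cannot simply add 15.5 (keyword) to 0.85 (vector). The keyword score would dominate the sum, so the keyword result would almost always win.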

2. Solution 1: Global Score Normalization

OpenSearch provides a feature called the Normalization Processor. It takes the results from both searches and rescales each set of scores to a uniform range (usually 0.0 to 1.0) before combining them.

graph TD
    Q[Query] --> K[Keyword Search: Score 12.0]
    Q --> V[Vector Search: Score 0.88]
    K --> KN[Norm: 0.95]
    V --> VN[Norm: 0.90]
    KN & VN --> C[Combined: 1.85]
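The normalize-then-combine idea can be sketched in plain Python. This is only an illustration of min-max scaling followed by a weighted arithmetic mean, not OpenSearch's internal implementation; the function names and the sample scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal: treat every document as a full match.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(keyword_scores, vector_scores, weights=(0.5, 0.5)):
    """Weighted arithmetic mean of min-max normalized scores.

    A document missing from one result list contributes 0.0 for that list
    (an assumption made for this sketch).
    """
    kw, vec = min_max(keyword_scores), min_max(vector_scores)
    w_kw, w_vec = weights
    combined = {
        doc: (w_kw * kw.get(doc, 0.0) + w_vec * vec.get(doc, 0.0)) / (w_kw + w_vec)
        for doc in set(kw) | set(vec)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores for four documents.
bm25 = {"a": 15.5, "b": 9.0, "c": 2.5}
knn = {"a": 0.62, "b": 0.85, "d": 0.80}

# Naive sum: the BM25 scale alone decides the order.
raw_sum = {doc: bm25.get(doc, 0.0) + knn.get(doc, 0.0) for doc in set(bm25) | set(knn)}
naive_ranking = sorted(raw_sum, key=raw_sum.get, reverse=True)

# Normalized combination: both signals contribute on the same scale.
normalized_ranking = [doc for doc, _ in combine(bm25, knn)]

print(naive_ranking)       # ['a', 'b', 'c', 'd']
print(normalized_ranking)  # ['b', 'a', 'd', 'c']
```

With the naive sum, document "a" wins purely on its BM25 score. After normalization, document "b" moves to the top because it scores well in both lists.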

3. Solution 2: Reciprocal Rank Fusion (RRF)

RRF is often the more robust approach. It ignores the raw scores and looks only at the rank (the position of each document in each result list).

A document that is #1 in vector search and #5 in keyword search will receive a higher RRF score than a document that is #1 in vector search but #10,000 in keyword search. This ensures that documents that satisfy both intents are promoted to the top.
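A minimal sketch of the RRF calculation follows, assuming the commonly used formula in which each result list contributes 1 / (k + rank) to a document's score. The rank constant of 60 is the conventional default from the RRF literature and is an assumption here, as are the function names and document IDs.

```python
def rrf_score(ranks, rank_constant=60):
    """RRF score for one document, given its 1-based rank in each result list."""
    return sum(1.0 / (rank_constant + rank) for rank in ranks)


def rrf_fuse(rankings, rank_constant=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + position)
    return sorted(scores, key=scores.get, reverse=True)


# The comparison from the text: (#1 vector, #5 keyword) vs. (#1 vector, #10,000 keyword).
both_lists = rrf_score([1, 5])
one_list_only = rrf_score([1, 10_000])
print(both_lists > one_list_only)  # True

# Hypothetical ranked results from each retriever.
vector_ranking = ["d1", "d2", "d3", "d4"]
keyword_ranking = ["d3", "d1", "d5", "d2"]
fused = rrf_fuse([vector_ranking, keyword_ranking])
print(fused)  # ['d1', 'd3', 'd2', 'd5', 'd4']
```

Documents that appear near the top of both lists ("d1", "d3") outrank documents that appear in only one list, regardless of what the raw BM25 or k-NN scores were.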

4. The OpenSearch Hybrid Query Syntax

To use hybrid search, you define a hybrid query with multiple clauses:

{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "text_content": "how to change my password"
          }
        },
        {
          "knn": {
            "vector_field": {
              "vector": [0.1, 0.2, ...],
              "k": 10
            }
          }
        }
      ]
    }
  }
}

The Search Pipeline

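To make this work, you must apply a Search Pipeline to your request. The pipeline's processor is what performs the normalization or RRF merging before results are returned. The example below uses min-max normalization with a weighted arithmetic mean. The weights are applied in the order of the clauses in the hybrid query, so 0.3 applies to the keyword clause and 0.7 to the k-NN clause.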

# Example: Creating a simple normalization pipeline (min-max + weighted arithmetic mean)
PUT /_search/pipeline/my_hybrid_pipeline
{
  "description": "A pipeline to normalize and combine hybrid results",
  "phase_results_processors": [
    {
      "normalization": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
            "technique": "arithmetic_mean",
            "parameters": {
                "weights": [0.3, 0.7]
            }
        }
      }
    }
  ]
}
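The same pipeline body can be built and registered from Python. The sketch below is hedged: `build_normalization_pipeline` and `register_pipeline` are hypothetical helper names, `client` is assumed to be an opensearch-py `OpenSearch` instance, and the request goes through the client's generic `transport.perform_request` call. The guard on the weights reflects the assumption that they should sum to 1.0.

```python
def build_normalization_pipeline(keyword_weight, vector_weight):
    """Return a search pipeline body using min-max + weighted arithmetic mean.

    The weights are listed in the same order as the clauses of the hybrid
    query: keyword clause first, k-NN clause second.
    """
    if abs(keyword_weight + vector_weight - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1.0")
    return {
        "description": "A pipeline to normalize and combine hybrid results",
        "phase_results_processors": [
            {
                "normalization": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": [keyword_weight, vector_weight]},
                    },
                }
            }
        ],
    }


def register_pipeline(client, name, body):
    """Send PUT /_search/pipeline/<name> through an opensearch-py client."""
    return client.transport.perform_request(
        "PUT", f"/_search/pipeline/{name}", body=body
    )


# Same weights as the JSON example above.
pipeline_body = build_normalization_pipeline(0.3, 0.7)
# register_pipeline(client, "my_hybrid_pipeline", pipeline_body)  # needs a live cluster
```

Keeping the body in a small builder function makes it easy to try different weightings without editing JSON by hand.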

5. Python Implementation: Running a Hybrid Query

Using opensearch-py, we can run the query and specify our pipeline.

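from opensearchpy import OpenSearch

# Assumption: a local, unsecured cluster. Adjust hosts and auth for your setup.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# query_vec must hold the embedding of the query text (a list of floats),
# produced by the same model that populated the "vec" field at index time.
search_body = {
  "query": {
    "hybrid": {
      "queries": [
        {"match": {"content": "refund policy"}},
        {"knn": {"vec": {"vector": query_vec, "k": 10}}}
      ]
    }
  }
}

# We specify the 'search_pipeline' parameter in our API call.
# The name must match the pipeline created earlier.
response = client.search(
    body=search_body,
    index="customer_support",
    params={"search_pipeline": "my_hybrid_pipeline"}
)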

for hit in response['hits']['hits']:
    print(f"Doc: {hit['_id']} | Hybrid Score: {hit['_score']}")

6. When to Weight Toward Keywords vs. Vectors

Technical Support: Weight towards Keywords. Specific error codes (e.g., Error 404) must match exactly.
Content Discovery (Netflix/Spotify): Weight towards Vectors. The "vibe" and "mood" are more important than exact title matches.
E-commerce: Start at 50/50 and tune based on user behavior.
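To see how the weighting changes the outcome, here is a small sketch that ranks two documents under different keyword weights. It assumes the scores have already been normalized to the 0 to 1 range; the document names and score values are hypothetical.

```python
def rank_with_weights(docs, keyword_weight):
    """Rank documents by a weighted mean of normalized (keyword, vector) scores."""
    vector_weight = 1.0 - keyword_weight
    scored = {
        name: keyword_weight * kw + vector_weight * vec
        for name, (kw, vec) in docs.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# (normalized keyword score, normalized vector score) per document.
docs = {
    "error_404_guide": (1.0, 0.4),          # exact error-code match, weaker semantic match
    "general_troubleshooting": (0.2, 1.0),  # strong semantic match, no exact code
}

vector_heavy = rank_with_weights(docs, keyword_weight=0.3)
keyword_heavy = rank_with_weights(docs, keyword_weight=0.7)

print(vector_heavy[0])   # general_troubleshooting
print(keyword_heavy[0])  # error_404_guide
```

With a vector-heavy weighting the semantically similar document wins; shifting weight toward keywords promotes the document that contains the exact error code, which is the behavior a technical support search usually needs.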

Summary and Key Takeaways

Hybrid search combines the strengths of keyword and vector retrieval, which is why many production search systems treat it as their "Secret Weapon."

Precision + Context: Keywords handle specific tokens; Vectors handle broad intent.
Normalization is required to compare different scoring systems.
RRF (Reciprocal Rank Fusion) combines results by rank rather than by score, so you do not have to worry about score ranges.
Search Pipelines in OpenSearch automate the complexity of result merging.

In the next lesson, we will conclude the "Strategic" part of this module by discussing when to choose OpenSearch over pure vector databases like Pinecone or Chroma.

Exercise: Tuning the Weights

You find that for the query "How do I fix my iPhone?", the vector search is returning "How to fix an Android phone" as the #1 result because the two are semantically very similar.

How would increasing the weight of the Keyword Search fix this?
If you use RRF, will the "iPhone" keyword result naturally move to the top if the original text contains the word "iPhone"?
Propose a "Search Pipeline" configuration that gives 80% weight to keywords and 20% to vectors.

Hybrid Search in OpenSearch: The Best of Both Worlds