Cross-Encoder Concepts

Understand the mathematical and architectural differences between Bi-Encoders and Cross-Encoders in retrieval systems.

To understand why re-ranking works, you must understand the difference between Bi-Encoders and Cross-Encoders.

The Bi-Encoder (First-Pass Retrieval)

  • Ingestion: Documents are embedded once and stored in the vector database.
  • Query Time: The query is embedded once, and we compute a similarity (e.g., cosine) between the query vector and the stored document vectors (see the sketch after this list).
  • Analogy: It's like having a library where books are organized by general topic (e.g., "Gardening"). You just walk to the Gardening section.
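A minimal sketch of this flow with the Sentence-Transformers library; the checkpoint name all-MiniLM-L6-v2 is just one common bi-encoder choice, not a requirement:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint; any bi-encoder embedding model works the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Ingestion: embed documents once; in production the vectors go to a vector DB.
docs = [
    "How to prune hydrangeas in late winter.",
    "A beginner's guide to vegetable gardening.",
    "Caring for indoor succulents.",
]
doc_embeddings = model.encode(docs, convert_to_tensor=True)

# Query time: embed the query once, then compare against ALL stored vectors.
query_embedding = model.encode("When should I prune hydrangeas?", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
print(scores)  # one cosine similarity per document; rank by descending score
```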

The Cross-Encoder (The Re-Ranker)

  • Ingestion: Nothing is stored.
  • Query Time: The model takes the query $Q$ and a document $D$ together as a single pair $(Q, D)$ and runs them through the neural network in one forward pass (see the sketch after this list).
  • Analogy: It's like having an expert librarian who takes your specific question and reads the first page of 20 gardening books to tell you which one exactly answers your question about "Pruning Hydrangeas."
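The same idea as a minimal sketch with Sentence-Transformers' CrossEncoder class; cross-encoder/ms-marco-MiniLM-L-6-v2 is one widely used reranker checkpoint, assumed here for illustration:

```python
from sentence_transformers import CrossEncoder

# Assumed checkpoint; any cross-encoder reranker behaves the same way.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "When should I prune hydrangeas?"
candidates = [
    "How to prune hydrangeas in late winter.",
    "A beginner's guide to vegetable gardening.",
]

# Each (query, document) pair gets its own full forward pass.
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs)
print(scores)  # higher score = more relevant; sort candidates by this
```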

Modern Cross-Encoder Architectures

Most cross-encoders are based on the BERT or RoBERTa architecture. The query and document are concatenated into a single input sequence, and a classification head on the [CLS] token output produces a relevance logit, typically passed through a sigmoid to yield a score between 0 and 1.
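To make the mechanics concrete, here is a sketch using Hugging Face transformers directly, assuming the same ms-marco checkpoint as above; any sequence-classification reranker behaves the same way:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

# Query and document become ONE sequence: [CLS] query [SEP] document [SEP]
inputs = tokenizer(
    "When should I prune hydrangeas?",
    "How to prune hydrangeas in late winter.",
    return_tensors="pt",
)

with torch.no_grad():
    logit = model(**inputs).logits.squeeze()  # relevance logit from the [CLS] head
score = torch.sigmoid(logit).item()           # map to (0, 1)
print(score)
```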

Why are Cross-Encoders Slow?

Because the model must run a full forward pass for every query-document pair. To search 1,000 documents, it has to run 1,000 independent inferences, whereas a bi-encoder embeds the query once and reuses the precomputed document vectors. This is why we only use cross-encoders at the end of the pipeline, re-ranking the Top 20-100 first-pass results.
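Putting the two stages together, a minimal retrieve-then-rerank sketch under the same model assumptions as above:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                  # assumed checkpoints
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "How to prune hydrangeas in late winter.",
    "A beginner's guide to vegetable gardening.",
    "Caring for indoor succulents.",
    "Choosing the right fertilizer for roses.",
]
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)  # done once

query = "When should I prune hydrangeas?"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

# Stage 1: cheap vector search narrows the corpus to a small candidate set.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
candidates = [corpus[hit["corpus_id"]] for hit in hits]

# Stage 2: the expensive cross-encoder scores only those few pairs.
scores = cross_encoder.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
for score, doc in reranked:
    print(f"{score:.3f}  {doc}")
```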

Performance Gains

Adding a cross-encoder re-ranking stage can increase Mean Reciprocal Rank (MRR), the average reciprocal rank of the first relevant result across queries, by as much as 15-20% compared to raw vector search.
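For reference, over a set of queries $Q$, where $\mathrm{rank}_i$ is the position of the first relevant document for query $i$:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

A system that always places the best result at #1 scores an MRR of 1.0; placing it at #2 for every query would score 0.5.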

Practical Tooling

  • Hugging Face: Access hundreds of open-source cross-encoders (e.g., BGE-Reranker).
  • Sentence-Transformers: The easiest Python library for implementing cross-encoder re-ranking (used in the sketches above).

Exercises

  1. Explain why we cannot use a Cross-Encoder for the initial search across 10 million documents.
  2. What is the difference between "Representation-based" (Bi-Encoder) and "Interaction-based" (Cross-Encoder) models?
  3. Find a cross-encoder model on Hugging Face. What datasets was it trained on?
