Hybrid Search Tutorial
Difficulty: Intermediate
Prerequisites: Basic HeliosDB knowledge, familiarity with full-text search concepts
Module: src/search/
This tutorial covers BM25 lexical search, hybrid BM25+vector search, and the RRF/MMR reranking algorithms available in HeliosDB.
1. Overview
HeliosDB provides three search capabilities:
| Capability | Module | Best For |
|---|---|---|
| BM25 | search::bm25 | Keyword-based full-text search |
| Vector | vector/ (HNSW) | Semantic similarity search |
| Hybrid | search::hybrid | Combining keyword + semantic for RAG pipelines |
The search module is designed for RAG (Retrieval-Augmented Generation) workloads where you need both lexical precision and semantic recall.
2. BM25 Inverted Index
Creating an Index
```rust
use heliosdb_lite::search::{Bm25Index, Bm25Params};

// Default Okapi BM25 parameters: k1=1.2, b=0.75
let mut index = Bm25Index::new(Bm25Params::default());

// Custom parameters for short documents
let mut index = Bm25Index::new(Bm25Params { k1: 1.5, b: 0.5 });
```

Indexing Documents
Each document has a 64-bit ID and text content:
```rust
index.add_document(1, "HeliosDB is a fast embedded database engine");
index.add_document(2, "PostgreSQL is a powerful open source database");
index.add_document(3, "Vector search enables semantic similarity queries");
index.add_document(4, "BM25 is a probabilistic information retrieval function");
```

Searching
```rust
let results = index.search("database engine", 10);

for doc in &results {
    println!("Doc {}: score {:.4}", doc.doc_id, doc.score);
}
// Doc 1: score 0.8234 (matches both "database" and "engine")
// Doc 2: score 0.4112 (matches "database" only)
```

BM25 Scoring Formula
The Okapi BM25 score for a document D and query Q is:
```
score(D, Q) = sum_{t in Q} IDF(t) * f(t,D) * (k1 + 1) / (f(t,D) + k1 * (1 - b + b * |D| / avgdl))
```

Where:
- `IDF(t) = ln((N - n(t) + 0.5) / (n(t) + 0.5) + 1)`: inverse document frequency, where `N` is the total number of documents and `n(t)` is the number of documents containing t
- `f(t,D)`: term frequency of t in document D
- `|D|`: document length in tokens
- `avgdl`: average document length across all documents
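The formula above can be checked by hand with a small standalone sketch. It uses only the standard library and is not HeliosDB's implementation: `Params`, `idf`, and `bm25_score` are hypothetical names, documents are assumed to be pre-tokenized, and the corpus is assumed to be non-empty.

```rust
use std::collections::HashMap;

/// Illustrative parameters; the field names mirror k1 and b from the formula.
struct Params {
    k1: f64,
    b: f64,
}

/// IDF(t) = ln((N - n(t) + 0.5) / (n(t) + 0.5) + 1)
fn idf(n_docs: usize, doc_freq: usize) -> f64 {
    let n = n_docs as f64;
    let df = doc_freq as f64;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Scores one tokenized document against a tokenized query.
/// `corpus` supplies N, n(t), and avgdl; it must not be empty.
fn bm25_score(query: &[&str], doc: &[&str], corpus: &[Vec<&str>], p: &Params) -> f64 {
    let n_docs = corpus.len();
    let total_len: usize = corpus.iter().map(|d| d.len()).sum();
    let avgdl = total_len as f64 / n_docs as f64;
    let doc_len = doc.len() as f64;

    // f(t, D): count each term of the document once up front.
    let mut tf: HashMap<&str, usize> = HashMap::new();
    for t in doc {
        *tf.entry(*t).or_insert(0) += 1;
    }

    query
        .iter()
        .map(|t| {
            let f = tf.get(*t).copied().unwrap_or(0) as f64;
            if f == 0.0 {
                return 0.0; // term absent from this document
            }
            // n(t): number of documents containing the term.
            let df = corpus.iter().filter(|d| d.iter().any(|w| *w == *t)).count();
            idf(n_docs, df) * f * (p.k1 + 1.0)
                / (f + p.k1 * (1.0 - p.b + p.b * doc_len / avgdl))
        })
        .sum()
}
```

A document that matches more query terms accumulates one summand per matching term, which is why a two-term match outranks a one-term match of the same length.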
3. Unicode-Aware Tokenizer
The tokenizer handles international text correctly:
```rust
use heliosdb_lite::search::tokenize;

let tokens = tokenize("Café résumé naïve", &Default::default());
// tokens = ["cafe", "resume", "naive"]
// ASCII folding: accented characters mapped to ASCII equivalents
```

Features:
- Lowercase normalization
- ASCII folding (e.g., "é" becomes "e")
- Whitespace and punctuation splitting
- Configurable via `TokenizerConfig`
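The first three features in the list can be illustrated with a minimal standalone sketch. It is not the library's tokenizer: `fold_ascii` and `tokenize_sketch` are hypothetical names, the folding table covers only a handful of Latin-1 letters, and the real `tokenize` may behave differently depending on `TokenizerConfig`.

```rust
/// Hypothetical folding table covering a few Latin-1 letters only.
/// A production tokenizer would handle far more characters.
fn fold_ascii(c: char) -> char {
    match c {
        '\u{e0}'..='\u{e5}' => 'a', // à á â ã ä å
        '\u{e7}' => 'c',            // ç
        '\u{e8}'..='\u{eb}' => 'e', // è é ê ë
        '\u{ec}'..='\u{ef}' => 'i', // ì í î ï
        '\u{f1}' => 'n',            // ñ
        '\u{f2}'..='\u{f6}' => 'o', // ò ó ô õ ö
        '\u{f9}'..='\u{fc}' => 'u', // ù ú û ü
        other => other,
    }
}

/// Splits on anything that is not alphanumeric, then lowercases and folds
/// each remaining token.
fn tokenize_sketch(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|s| !s.is_empty())
        .map(|s| {
            s.chars()
                .flat_map(char::to_lowercase)
                .map(fold_ascii)
                .collect::<String>()
        })
        .collect()
}
```

Lowercasing runs before folding so that uppercase accented letters go through the same table as their lowercase forms.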
4. Bloom Filter Skip
For large corpora, the Bloom filter skip adapter eliminates unnecessary posting list lookups:
```rust
use heliosdb_lite::search::filter_skip::TermFilter;

// If the Bloom filter says "xyzzy" is NOT in any posting list,
// the BM25 index skips the lookup entirely -- zero I/O cost.
```

This is especially effective when queries contain rare or misspelled terms that do not exist in the corpus.
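The skip works because a Bloom filter never reports a false negative: a "no" answer is definitive, while a "yes" only means the term may be present. The following standalone sketch shows that property. `TermBloom` and its methods are hypothetical and say nothing about how `TermFilter` is actually implemented. It assumes `num_bits > 0` and at least one hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal Bloom filter over terms (illustrative only).
struct TermBloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl TermBloom {
    fn new(num_bits: usize, hashes: u64) -> Self {
        TermBloom { bits: vec![false; num_bits], hashes }
    }

    /// Derives the bit position for `term` under the hash function `seed`.
    fn slot(&self, term: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        term.hash(&mut h);
        (h.finish() % self.bits.len() as u64) as usize
    }

    fn insert(&mut self, term: &str) {
        for seed in 0..self.hashes {
            let i = self.slot(term, seed);
            self.bits[i] = true;
        }
    }

    /// false => the term was definitely never inserted (lookup can be skipped);
    /// true  => the term may be present (false positives are possible).
    fn may_contain(&self, term: &str) -> bool {
        (0..self.hashes).all(|seed| self.bits[self.slot(term, seed)])
    }
}
```

An index built this way would insert every indexed term once and consult `may_contain` before touching a posting list. Sizing the bit array relative to the vocabulary controls the false-positive rate.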
5. Hybrid Search
Hybrid search combines BM25 lexical results with vector similarity results into a single ranking.
Setup
```rust
use heliosdb_lite::search::{
    hybrid_search, HybridConfig, HybridScore, Bm25Index, Bm25Params, ScoredDoc,
};

// Build BM25 index
let mut bm25 = Bm25Index::new(Bm25Params::default());
bm25.add_document(1, "machine learning optimization");
bm25.add_document(2, "database query optimization");
bm25.add_document(3, "neural network training");

// Vector results from HNSW index (pre-computed)
let vector_results = vec![
    ScoredDoc { doc_id: 3, score: 0.95 },
    ScoredDoc { doc_id: 1, score: 0.82 },
    ScoredDoc { doc_id: 2, score: 0.45 },
];
```

Linear Fusion
Weight BM25 and vector scores directly:
```rust
let config = HybridConfig {
    per_ranker_k: 50,
    final_k: 10,
    score: HybridScore::Linear { w_bm25: 0.4, w_vec: 0.6 },
};

let results = hybrid_search(&bm25, "optimization", &vector_results, config);
```

RRF Fusion (Default)
Reciprocal Rank Fusion combines rankings without requiring score calibration:
```rust
let config = HybridConfig {
    per_ranker_k: 50,
    final_k: 10,
    score: HybridScore::Rrf(Default::default()), // k=60
};

let results = hybrid_search(&bm25, "optimization", &vector_results, config);
```

RRF scores each document as sum(1 / (k + rank)) over every input list in which it appears, where rank is the document's position in that list. The smoothing constant k (default 60) prevents top-ranked items from dominating.
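Both fusion modes can be sketched in a few lines of standalone Rust. This is an illustration under stated assumptions, not the body of `hybrid_search`: `Scored`, `top_by_score`, `linear_fuse`, and `rrf_fuse` are hypothetical names, ranks are assumed to be 1-based, each input list is assumed to be sorted best-first, and the linear variant assumes raw scores are combined without normalization, with a document missing from one list contributing 0 for that ranker.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the tutorial's ScoredDoc.
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    doc_id: u64,
    score: f64,
}

/// Sorts accumulated scores best-first (ties broken by doc_id) and keeps top_k.
fn top_by_score(acc: HashMap<u64, f64>, top_k: usize) -> Vec<Scored> {
    let mut out: Vec<Scored> = acc
        .into_iter()
        .map(|(doc_id, score)| Scored { doc_id, score })
        .collect();
    out.sort_by(|a, b| b.score.total_cmp(&a.score).then(a.doc_id.cmp(&b.doc_id)));
    out.truncate(top_k);
    out
}

/// Linear fusion: w_bm25 * bm25_score + w_vec * vector_score per document.
fn linear_fuse(
    bm25: &[Scored],
    vector: &[Scored],
    w_bm25: f64,
    w_vec: f64,
    top_k: usize,
) -> Vec<Scored> {
    let mut acc: HashMap<u64, f64> = HashMap::new();
    for d in bm25 {
        *acc.entry(d.doc_id).or_insert(0.0) += w_bm25 * d.score;
    }
    for d in vector {
        *acc.entry(d.doc_id).or_insert(0.0) += w_vec * d.score;
    }
    top_by_score(acc, top_k)
}

/// Reciprocal Rank Fusion: sum of 1 / (k + rank) over all lists.
/// Only positions matter; the input scores themselves are ignored.
fn rrf_fuse(lists: &[&[Scored]], k: f64, top_k: usize) -> Vec<Scored> {
    let mut acc: HashMap<u64, f64> = HashMap::new();
    for list in lists {
        for (i, d) in list.iter().enumerate() {
            let rank = (i + 1) as f64; // 1-based rank (assumption)
            *acc.entry(d.doc_id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    top_by_score(acc, top_k)
}
```

Because RRF looks only at positions, it stays stable when the two rankers produce scores on very different scales. Linear fusion is sensitive to those scales, so its weights usually need tuning against real score distributions.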
6. Rerankers
Reciprocal Rank Fusion (RRF)
Combine any number of ranked result lists:
```rust
use heliosdb_lite::search::reranker::{rrf, RrfParams};
use heliosdb_lite::search::ScoredDoc;

let list_a = vec![
    ScoredDoc { doc_id: 1, score: 0.9 },
    ScoredDoc { doc_id: 2, score: 0.7 },
];
let list_b = vec![
    ScoredDoc { doc_id: 2, score: 0.95 },
    ScoredDoc { doc_id: 3, score: 0.80 },
];

let fused = rrf(&[&list_a, &list_b], RrfParams { k: 60 }, 10);
// Doc 2 ranks highest (appears in both lists)
```

Maximal Marginal Relevance (MMR)
Diversify results by penalizing redundancy:
```rust
use heliosdb_lite::search::reranker::mmr;

// lambda = 1.0: pure relevance (no diversification)
// lambda = 0.5: balance relevance and diversity
// lambda = 0.0: maximize diversity

// `candidates` and `cosine_similarity` are assumed to be defined by the caller.
let diversified = mmr(
    &candidates,
    0.7,                             // lambda: 70% relevance, 30% diversity
    |a, b| cosine_similarity(a, b),  // similarity function
    10,                              // return top 10
);
```

MMR uses the greedy formula:
```
MMR = argmax_{d in candidates} [lambda * relevance(d) - (1 - lambda) * max_{s in selected} sim(d, s)]
```

7. RAG Pipeline Example
A typical RAG pipeline combining all three:
```rust
// 1. Index documents with BM25
let mut bm25 = Bm25Index::new(Bm25Params::default());
for (id, text) in documents {
    bm25.add_document(id, &text);
}

// 2. Get vector results from HNSW
// (using HeliosDB's existing vector search)
let vector_results = hnsw_search(&query_embedding, 50);

// 3. Hybrid fusion
let hybrid = hybrid_search(&bm25, &query_text, &vector_results, HybridConfig::default());

// 4. MMR diversification
let final_results = mmr(&hybrid, 0.7, similarity_fn, 5);

// 5. Pass top-5 to LLM as context
```

8. Parameter Tuning Guide
| Parameter | Default | Effect of Increasing |
|---|---|---|
| BM25 k1 | 1.2 | Term frequency saturates more slowly, so repeated terms contribute more |
| BM25 b | 0.75 | Stronger document length normalization |
| RRF k | 60 | Less dominance by top-ranked items |
| MMR lambda | 0.7 | More relevance, less diversity |
| per_ranker_k | 50 | More candidates from each ranker |
| final_k | 10 | More results returned |
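To see how the MMR lambda row plays out, the greedy selection can be reproduced in a small standalone sketch. Everything here is an assumption made for illustration: `Candidate`, `cosine`, and `mmr_select` are hypothetical names, embeddings are plain `Vec<f64>`, and the redundancy penalty is taken to be 0 while nothing has been selected yet. The library's `mmr` takes a similarity closure instead and may differ in detail.

```rust
/// Hypothetical candidate: a relevance score plus an embedding.
#[derive(Debug, Clone)]
struct Candidate {
    doc_id: u64,
    relevance: f64,
    embedding: Vec<f64>,
}

/// Cosine similarity; returns 0.0 if either vector has zero length.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Greedy MMR: repeatedly pick the candidate maximizing
/// lambda * relevance - (1 - lambda) * max similarity to anything already picked.
fn mmr_select(candidates: &[Candidate], lambda: f64, top_k: usize) -> Vec<u64> {
    let mut remaining: Vec<&Candidate> = candidates.iter().collect();
    let mut selected: Vec<&Candidate> = Vec::new();

    while selected.len() < top_k && !remaining.is_empty() {
        let mut best = 0;
        let mut best_score = f64::NEG_INFINITY;
        for (i, c) in remaining.iter().enumerate() {
            let max_sim = selected
                .iter()
                .map(|s| cosine(&c.embedding, &s.embedding))
                .fold(f64::NEG_INFINITY, f64::max);
            // No penalty before the first pick (assumption).
            let penalty = if selected.is_empty() { 0.0 } else { max_sim };
            let score = lambda * c.relevance - (1.0 - lambda) * penalty;
            if score > best_score {
                best = i;
                best_score = score;
            }
        }
        selected.push(remaining.remove(best));
    }
    selected.iter().map(|c| c.doc_id).collect()
}
```

With a near-duplicate of the top hit in the candidate set, lambda = 1.0 keeps the duplicate in second place, while lambda = 0.7 replaces it with a less relevant but dissimilar document.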
9. Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| All scores are 0 | Query terms not in any document | Check tokenization; verify documents were indexed |
| Poor relevance | Default BM25 params not suited | Tune k1/b for your document length distribution |
| Redundant results | No diversification | Apply MMR with lambda < 1.0 |
| Slow on large corpus | No Bloom filter | Enable filter_skip for term non-existence checks |
Next Steps
- Combine with HeliosDB’s existing FTS (`@@` operator) for SQL-level text search
- Use HNSW indexes for the vector component of hybrid search
- See `tests/bm25_with_bloom.rs` for comprehensive test examples