Hybrid Search Tutorial — BM25 + Vector
Hybrid Search Tutorial — BM25 + Vector
Available since: v3.6.0 (2026-04-15)
Build: default — no feature flag required
Module: heliosdb_lite::search (src/search/)
UVP
Lexical search and dense-vector search complement each other — one finds keyword overlap, the other finds semantic neighbours. The search module gives Lite a first-class hybrid pipeline: a unicode-aware tokenizer, an Okapi BM25 inverted index (k1=1.2, b=0.75), Reciprocal Rank Fusion (k=60) or weighted-linear combination, and Maximal Marginal Relevance for diversification. Plug your existing HNSW results into hybrid_search() and ship a RAG retrieval stack in the same binary as your database — no Elasticsearch, no Pinecone, no glue service.
Prerequisites
- HeliosDB Lite v3.6+
- Rust 1.75+
- A vector store (HNSW from
heliosdb_lite::vector, your own embedder, or any caller-suppliedVec<ScoredDoc>) - 20 minutes
1. The Tokenizer
search::tokenize splits on unicode word boundaries (via unicode-segmentation), lowercases by default, and drops short tokens. Two knobs:
use heliosdb_lite::search::tokenizer::{tokenize, TokenizerConfig};
let cfg = TokenizerConfig::default(); // min_len=2, lowercase=truelet toks = tokenize("The Quick Brown Fox", &cfg);assert_eq!(toks, vec!["the", "quick", "brown", "fox"]);
// Drop articles by raising the floor.let cfg = TokenizerConfig { min_len: 3, lowercase: true };let toks = tokenize("a an the quick", &cfg);assert_eq!(toks, vec!["the", "quick"]);Unicode words are preserved — tokenize("café niño", &cfg) keeps both as-is (lowercased).
2. BM25 Index
use heliosdb_lite::search::{Bm25Index, Bm25Params};use heliosdb_lite::search::tokenizer::TokenizerConfig;
let mut idx = Bm25Index::default(); // k1=1.2, b=0.75idx.add_document(1, "the quick brown fox");idx.add_document(2, "the lazy dog sleeps");idx.add_document(3, "quick brown dogs run quickly");
let results = idx.search("quick brown", 10);for r in &results { println!("doc {} score {}", r.doc_id, r.score);}// Doc 1 and 3 contain both query terms; doc 2 does not.Bm25Index::new gives you full control:
let params = Bm25Params { k1: 1.5, b: 0.6 }; // tweak per corpuslet tok = TokenizerConfig::default();let idx = Bm25Index::new(params, tok);ScoredDoc { doc_id: u64, score: f32 } is the universal currency in this module — every search and re-ranker speaks it.
3. Bloom Filter Skip — BloomTermFilter
If most of your queries contain rare or nonsense terms, the inverted-index probe is wasteful. BloomTermFilter wraps the existing storage::bloom_filter::BloomFilter and lets BM25 skip a term in constant time when it’s guaranteed absent.
use heliosdb_lite::search::{Bm25Index};use heliosdb_lite::search::filter_skip::BloomTermFilter;
// Build a Bloom over the corpus vocabulary once.let vocab = ["rust", "async", "borrow", "checker", "guide", "explained"];let bloom = BloomTermFilter::build(vocab);
let mut idx = Bm25Index::default();idx.add_document(1, "rust async programming guide");idx.add_document(2, "rust borrow checker explained");
// Query containing one in-vocab and one definitely-absent term.let r = idx.search_with_filter("rust zzz_does_not_exist_in_index", 10, Some(&bloom));// "zzz_does_not_exist_in_index" is rejected by the Bloom -> never probed.assert!(!r.is_empty());The contract is conservative: may_contain(term) == false MUST mean the term is definitely absent. False positives are fine — the HashMap probe just returns None and contributes zero score.
4. RRF — Combining Two Ranked Lists
Reciprocal Rank Fusion lets you merge BM25 and vector results without normalising score scales. Each rank r (1-based) contributes 1 / (k + r). Default k = 60 (Cormack/Clarke 2009).
use heliosdb_lite::search::{ScoredDoc, RrfParams, rrf};
let bm25 = vec![ ScoredDoc { doc_id: 1, score: 10.0 }, ScoredDoc { doc_id: 2, score: 9.0 }, ScoredDoc { doc_id: 3, score: 8.0 },];let vec_ = vec![ ScoredDoc { doc_id: 2, score: 0.95 }, ScoredDoc { doc_id: 1, score: 0.90 }, ScoredDoc { doc_id: 4, score: 0.30 },];
let fused = rrf(&[&bm25, &vec_], RrfParams::default(), 5);// Docs 1 and 2 appear in both lists -> they outrank singletons 3 and 4.5. MMR — Diversification
When you only want N results but the top of the list is full of near-duplicates, Maximal Marginal Relevance trades relevance against novelty.
MMR = argmax_d λ · relevance(d) − (1 − λ) · max_{s ∈ selected} sim(d, s)λ = 1.0 is plain relevance ordering; λ = 0.0 is maximally diverse.
use heliosdb_lite::search::{ScoredDoc, mmr};
let cands = vec![ ScoredDoc { doc_id: 1, score: 1.00 }, ScoredDoc { doc_id: 2, score: 0.99 }, // near-duplicate of 1 ScoredDoc { doc_id: 3, score: 0.50 },];
// `sim(a, b)` returns 1.0 if 1 and 2 are duplicates, else 0.0.let sim = |a: u64, b: u64| -> f32 { if (a, b) == (1, 2) || (a, b) == (2, 1) { 1.0 } else { 0.0 }};
let picked = mmr(&cands, 0.0, 2, sim);// First pick is doc 1; second pick is doc 3, NOT doc 2.Pair MMR with your retrieval embedder’s cosine_sim for content-aware diversity.
6. The Hybrid Path — hybrid_search()
This is where it all comes together. You give it a BM25 index, a query string, the vector results from your existing HNSW lookup, and a HybridConfig. The module fuses the two and returns a single ranked list.
use heliosdb_lite::search::{ Bm25Index, ScoredDoc, hybrid_search, HybridConfig, HybridScore, RrfParams,};
let mut idx = Bm25Index::default();idx.add_document(1, "rust async programming guide");idx.add_document(2, "python machine learning basics");idx.add_document(3, "rust borrow checker explained");
// Pretend these came back from your HNSW index.let vector_results = vec![ ScoredDoc { doc_id: 2, score: 0.95 }, ScoredDoc { doc_id: 3, score: 0.90 }, ScoredDoc { doc_id: 1, score: 0.10 },];
// RRF fusion (default).let cfg = HybridConfig::default();let r = hybrid_search(&idx, "rust", &vector_results, cfg);println!("rrf: {:?}", r.iter().map(|d| d.doc_id).collect::<Vec<_>>());
// Weighted-linear fusion (normalises each list, then combines).let cfg = HybridConfig { score: HybridScore::Linear { w_bm25: 0.5, w_vec: 0.5 }, ..HybridConfig::default()};let r = hybrid_search(&idx, "rust", &vector_results, cfg);println!("linear: {:?}", r.iter().map(|d| d.doc_id).collect::<Vec<_>>());HybridConfig:
pub struct HybridConfig { pub per_ranker_k: usize, // top-K to ask each individual ranker for (default 50) pub final_k: usize, // final top-K to return (default 10) pub score: HybridScore, // RRF (default) or Linear { w_bm25, w_vec }}If vector_results is empty, hybrid_search falls back to BM25-only — you can pass an empty slice while you’re wiring in your embedder.
7. End-to-End — RAG Retrieval
use heliosdb_lite::search::{ Bm25Index, ScoredDoc, hybrid_search, HybridConfig,};use heliosdb_lite::search::filter_skip::BloomTermFilter;use heliosdb_lite::search::reranker::{mmr};
fn retrieve( bm25: &Bm25Index, bloom: &BloomTermFilter, query: &str, knn: Vec<ScoredDoc>, // from your HNSW lookup final_k: usize,) -> Vec<ScoredDoc> { // Stage 1: BM25 + skip filter, fuse with vector via RRF. let cfg = HybridConfig { per_ranker_k: 50, final_k: 32, // wider than final_k for MMR ..HybridConfig::default() }; let fused = hybrid_search(bm25, query, &knn, cfg);
// Stage 2: greedy diversification. let sim = |_a: u64, _b: u64| 0.0; // plug in your cosine_sim by id mmr(&fused, 0.7, final_k, sim)}Where Next
- VECTOR_SEARCH_TUTORIAL — HNSW with PQ compression to populate the vector half.
- GRAPH_ENGINE_TUTORIAL — sister RAG-native module for relational hops.
- QUERY_PLAN_CACHING_TUTORIAL — cache the SQL that fetches your candidate IDs.
- GETTING_STARTED_TUTORIAL — install HeliosDB Lite and run your first query.