Skip to content

Hybrid Search Tutorial — BM25 + Vector

Hybrid Search Tutorial — BM25 + Vector

Available since: v3.6.0 (2026-04-15) Build: default — no feature flag required Module: heliosdb_lite::search (src/search/)


UVP

Lexical search and dense-vector search complement each other — one finds keyword overlap, the other finds semantic neighbours. The search module gives Lite a first-class hybrid pipeline: a unicode-aware tokenizer, an Okapi BM25 inverted index (k1=1.2, b=0.75), Reciprocal Rank Fusion (k=60) or weighted-linear combination, and Maximal Marginal Relevance for diversification. Plug your existing HNSW results into hybrid_search() and ship a RAG retrieval stack in the same binary as your database — no Elasticsearch, no Pinecone, no glue service.


Prerequisites

  • HeliosDB Lite v3.6+
  • Rust 1.75+
  • A vector store (HNSW from heliosdb_lite::vector, your own embedder, or any caller-supplied Vec<ScoredDoc>)
  • 20 minutes

1. The Tokenizer

search::tokenize splits on unicode word boundaries (via unicode-segmentation), lowercases by default, and drops short tokens. Two knobs:

use heliosdb_lite::search::tokenizer::{tokenize, TokenizerConfig};
let cfg = TokenizerConfig::default(); // min_len=2, lowercase=true
let toks = tokenize("The Quick Brown Fox", &cfg);
assert_eq!(toks, vec!["the", "quick", "brown", "fox"]);
// Drop articles by raising the floor.
let cfg = TokenizerConfig { min_len: 3, lowercase: true };
let toks = tokenize("a an the quick", &cfg);
assert_eq!(toks, vec!["the", "quick"]);

Unicode words are preserved — tokenize("café niño", &cfg) keeps both as-is (lowercased).


2. BM25 Index

use heliosdb_lite::search::{Bm25Index, Bm25Params};
use heliosdb_lite::search::tokenizer::TokenizerConfig;
let mut idx = Bm25Index::default(); // k1=1.2, b=0.75
idx.add_document(1, "the quick brown fox");
idx.add_document(2, "the lazy dog sleeps");
idx.add_document(3, "quick brown dogs run quickly");
let results = idx.search("quick brown", 10);
for r in &results {
println!("doc {} score {}", r.doc_id, r.score);
}
// Doc 1 and 3 contain both query terms; doc 2 does not.

Bm25Index::new gives you full control:

let params = Bm25Params { k1: 1.5, b: 0.6 }; // tweak per corpus
let tok = TokenizerConfig::default();
let idx = Bm25Index::new(params, tok);

ScoredDoc { doc_id: u64, score: f32 } is the universal currency in this module — every search and re-ranker speaks it.


3. Bloom Filter Skip — BloomTermFilter

If most of your queries contain rare or nonsense terms, the inverted-index probe is wasteful. BloomTermFilter wraps the existing storage::bloom_filter::BloomFilter and lets BM25 skip a term in constant time when it’s guaranteed absent.

use heliosdb_lite::search::{Bm25Index};
use heliosdb_lite::search::filter_skip::BloomTermFilter;
// Build a Bloom over the corpus vocabulary once.
let vocab = ["rust", "async", "borrow", "checker", "guide", "explained"];
let bloom = BloomTermFilter::build(vocab);
let mut idx = Bm25Index::default();
idx.add_document(1, "rust async programming guide");
idx.add_document(2, "rust borrow checker explained");
// Query containing one in-vocab and one definitely-absent term.
let r = idx.search_with_filter("rust zzz_does_not_exist_in_index", 10, Some(&bloom));
// "zzz_does_not_exist_in_index" is rejected by the Bloom -> never probed.
assert!(!r.is_empty());

The contract is conservative: may_contain(term) == false MUST mean the term is definitely absent. False positives are fine — the HashMap probe just returns None and contributes zero score.


4. RRF — Combining Two Ranked Lists

Reciprocal Rank Fusion lets you merge BM25 and vector results without normalising score scales. Each rank r (1-based) contributes 1 / (k + r). Default k = 60 (Cormack/Clarke 2009).

use heliosdb_lite::search::{ScoredDoc, RrfParams, rrf};
let bm25 = vec![
ScoredDoc { doc_id: 1, score: 10.0 },
ScoredDoc { doc_id: 2, score: 9.0 },
ScoredDoc { doc_id: 3, score: 8.0 },
];
let vec_ = vec![
ScoredDoc { doc_id: 2, score: 0.95 },
ScoredDoc { doc_id: 1, score: 0.90 },
ScoredDoc { doc_id: 4, score: 0.30 },
];
let fused = rrf(&[&bm25, &vec_], RrfParams::default(), 5);
// Docs 1 and 2 appear in both lists -> they outrank singletons 3 and 4.

5. MMR — Diversification

When you only want N results but the top of the list is full of near-duplicates, Maximal Marginal Relevance trades relevance against novelty.

MMR = argmax_d λ · relevance(d) − (1 − λ) · max_{s ∈ selected} sim(d, s)

λ = 1.0 is plain relevance ordering; λ = 0.0 is maximally diverse.

use heliosdb_lite::search::{ScoredDoc, mmr};
let cands = vec![
ScoredDoc { doc_id: 1, score: 1.00 },
ScoredDoc { doc_id: 2, score: 0.99 }, // near-duplicate of 1
ScoredDoc { doc_id: 3, score: 0.50 },
];
// `sim(a, b)` returns 1.0 if 1 and 2 are duplicates, else 0.0.
let sim = |a: u64, b: u64| -> f32 {
if (a, b) == (1, 2) || (a, b) == (2, 1) { 1.0 } else { 0.0 }
};
let picked = mmr(&cands, 0.0, 2, sim);
// First pick is doc 1; second pick is doc 3, NOT doc 2.

Pair MMR with your retrieval embedder’s cosine_sim for content-aware diversity.


This is where it all comes together. You give it a BM25 index, a query string, the vector results from your existing HNSW lookup, and a HybridConfig. The module fuses the two and returns a single ranked list.

use heliosdb_lite::search::{
Bm25Index, ScoredDoc,
hybrid_search, HybridConfig, HybridScore, RrfParams,
};
let mut idx = Bm25Index::default();
idx.add_document(1, "rust async programming guide");
idx.add_document(2, "python machine learning basics");
idx.add_document(3, "rust borrow checker explained");
// Pretend these came back from your HNSW index.
let vector_results = vec![
ScoredDoc { doc_id: 2, score: 0.95 },
ScoredDoc { doc_id: 3, score: 0.90 },
ScoredDoc { doc_id: 1, score: 0.10 },
];
// RRF fusion (default).
let cfg = HybridConfig::default();
let r = hybrid_search(&idx, "rust", &vector_results, cfg);
println!("rrf: {:?}", r.iter().map(|d| d.doc_id).collect::<Vec<_>>());
// Weighted-linear fusion (normalises each list, then combines).
let cfg = HybridConfig {
score: HybridScore::Linear { w_bm25: 0.5, w_vec: 0.5 },
..HybridConfig::default()
};
let r = hybrid_search(&idx, "rust", &vector_results, cfg);
println!("linear: {:?}", r.iter().map(|d| d.doc_id).collect::<Vec<_>>());

HybridConfig:

pub struct HybridConfig {
pub per_ranker_k: usize, // top-K to ask each individual ranker for (default 50)
pub final_k: usize, // final top-K to return (default 10)
pub score: HybridScore, // RRF (default) or Linear { w_bm25, w_vec }
}

If vector_results is empty, hybrid_search falls back to BM25-only — you can pass an empty slice while you’re wiring in your embedder.


7. End-to-End — RAG Retrieval

use heliosdb_lite::search::{
Bm25Index, ScoredDoc,
hybrid_search, HybridConfig,
};
use heliosdb_lite::search::filter_skip::BloomTermFilter;
use heliosdb_lite::search::reranker::{mmr};
fn retrieve(
bm25: &Bm25Index,
bloom: &BloomTermFilter,
query: &str,
knn: Vec<ScoredDoc>, // from your HNSW lookup
final_k: usize,
) -> Vec<ScoredDoc> {
// Stage 1: BM25 + skip filter, fuse with vector via RRF.
let cfg = HybridConfig {
per_ranker_k: 50,
final_k: 32, // wider than final_k for MMR
..HybridConfig::default()
};
let fused = hybrid_search(bm25, query, &knn, cfg);
// Stage 2: greedy diversification.
let sim = |_a: u64, _b: u64| 0.0; // plug in your cosine_sim by id
mmr(&fused, 0.7, final_k, sim)
}

Where Next