Hybrid Search Tutorial — BM25 + Vector

Available since: v3.6.0 (2026-04-15) Build: default — no feature flag required Module: heliosdb_lite::search (src/search/)

UVP

Lexical search and dense-vector search complement each other — one finds keyword overlap, the other finds semantic neighbours. The search module gives Lite a first-class hybrid pipeline: a unicode-aware tokenizer, an Okapi BM25 inverted index (k1=1.2, b=0.75), Reciprocal Rank Fusion (k=60) or weighted-linear combination, and Maximal Marginal Relevance for diversification. Plug your existing HNSW results into hybrid_search() and ship a RAG retrieval stack in the same binary as your database — no Elasticsearch, no Pinecone, no glue service.

Prerequisites

HeliosDB Lite v3.6+
Rust 1.75+
A vector store (HNSW from heliosdb_lite::vector, your own embedder, or any caller-supplied Vec<ScoredDoc>)
20 minutes

1. The Tokenizer

search::tokenize splits on unicode word boundaries (via unicode-segmentation), lowercases by default, and drops short tokens. Two knobs:

use heliosdb_lite::search::tokenizer::{tokenize, TokenizerConfig};

let cfg = TokenizerConfig::default();        // min_len=2, lowercase=true
let toks = tokenize("The Quick Brown Fox", &cfg);
assert_eq!(toks, vec!["the", "quick", "brown", "fox"]);

// Drop articles by raising the floor.
let cfg = TokenizerConfig { min_len: 3, lowercase: true };
let toks = tokenize("a an the quick", &cfg);
assert_eq!(toks, vec!["the", "quick"]);

Unicode words are preserved — tokenize("café niño", &cfg) keeps both as-is (lowercased).

2. BM25 Index

use heliosdb_lite::search::{Bm25Index, Bm25Params};
use heliosdb_lite::search::tokenizer::TokenizerConfig;

let mut idx = Bm25Index::default();           // k1=1.2, b=0.75
idx.add_document(1, "the quick brown fox");
idx.add_document(2, "the lazy dog sleeps");
idx.add_document(3, "quick brown dogs run quickly");

let results = idx.search("quick brown", 10);
for r in &results {
    println!("doc {} score {}", r.doc_id, r.score);
}
// Doc 1 and 3 contain both query terms; doc 2 does not.

Bm25Index::new gives you full control:

let params = Bm25Params { k1: 1.5, b: 0.6 };  // tweak per corpus
let tok    = TokenizerConfig::default();
let idx    = Bm25Index::new(params, tok);

ScoredDoc { doc_id: u64, score: f32 } is the universal currency in this module — every search and re-ranker speaks it.

3. Bloom Filter Skip — `BloomTermFilter`

If most of your queries contain rare or nonsense terms, the inverted-index probe is wasteful. BloomTermFilter wraps the existing storage::bloom_filter::BloomFilter and lets BM25 skip a term in constant time when it’s guaranteed absent.

use heliosdb_lite::search::{Bm25Index};
use heliosdb_lite::search::filter_skip::BloomTermFilter;

// Build a Bloom over the corpus vocabulary once.
let vocab = ["rust", "async", "borrow", "checker", "guide", "explained"];
let bloom = BloomTermFilter::build(vocab);

let mut idx = Bm25Index::default();
idx.add_document(1, "rust async programming guide");
idx.add_document(2, "rust borrow checker explained");

// Query containing one in-vocab and one definitely-absent term.
let r = idx.search_with_filter("rust zzz_does_not_exist_in_index", 10, Some(&bloom));
// "zzz_does_not_exist_in_index" is rejected by the Bloom -> never probed.
assert!(!r.is_empty());

The contract is conservative: may_contain(term) == false MUST mean the term is definitely absent. False positives are fine — the HashMap probe just returns None and contributes zero score.

4. RRF — Combining Two Ranked Lists

Reciprocal Rank Fusion lets you merge BM25 and vector results without normalising score scales. Each rank r (1-based) contributes 1 / (k + r). Default k = 60 (Cormack/Clarke 2009).

use heliosdb_lite::search::{ScoredDoc, RrfParams, rrf};

let bm25 = vec![
    ScoredDoc { doc_id: 1, score: 10.0 },
    ScoredDoc { doc_id: 2, score: 9.0 },
    ScoredDoc { doc_id: 3, score: 8.0 },
];
let vec_ = vec![
    ScoredDoc { doc_id: 2, score: 0.95 },
    ScoredDoc { doc_id: 1, score: 0.90 },
    ScoredDoc { doc_id: 4, score: 0.30 },
];

let fused = rrf(&[&bm25, &vec_], RrfParams::default(), 5);
// Docs 1 and 2 appear in both lists -> they outrank singletons 3 and 4.

5. MMR — Diversification

When you only want N results but the top of the list is full of near-duplicates, Maximal Marginal Relevance trades relevance against novelty.

MMR = argmax_d   λ · relevance(d) − (1 − λ) · max_{s ∈ selected} sim(d, s)

λ = 1.0 is plain relevance ordering; λ = 0.0 is maximally diverse.

use heliosdb_lite::search::{ScoredDoc, mmr};

let cands = vec![
    ScoredDoc { doc_id: 1, score: 1.00 },
    ScoredDoc { doc_id: 2, score: 0.99 },   // near-duplicate of 1
    ScoredDoc { doc_id: 3, score: 0.50 },
];

// `sim(a, b)` returns 1.0 if 1 and 2 are duplicates, else 0.0.
let sim = |a: u64, b: u64| -> f32 {
    if (a, b) == (1, 2) || (a, b) == (2, 1) { 1.0 } else { 0.0 }
};

let picked = mmr(&cands, 0.0, 2, sim);
// First pick is doc 1; second pick is doc 3, NOT doc 2.

Pair MMR with your retrieval embedder’s cosine_sim for content-aware diversity.

6. The Hybrid Path — `hybrid_search()`

This is where it all comes together. You give it a BM25 index, a query string, the vector results from your existing HNSW lookup, and a HybridConfig. The module fuses the two and returns a single ranked list.

use heliosdb_lite::search::{
    Bm25Index, ScoredDoc,
    hybrid_search, HybridConfig, HybridScore, RrfParams,
};

let mut idx = Bm25Index::default();
idx.add_document(1, "rust async programming guide");
idx.add_document(2, "python machine learning basics");
idx.add_document(3, "rust borrow checker explained");

// Pretend these came back from your HNSW index.
let vector_results = vec![
    ScoredDoc { doc_id: 2, score: 0.95 },
    ScoredDoc { doc_id: 3, score: 0.90 },
    ScoredDoc { doc_id: 1, score: 0.10 },
];

// RRF fusion (default).
let cfg = HybridConfig::default();
let r   = hybrid_search(&idx, "rust", &vector_results, cfg);
println!("rrf: {:?}", r.iter().map(|d| d.doc_id).collect::<Vec<_>>());

// Weighted-linear fusion (normalises each list, then combines).
let cfg = HybridConfig {
    score: HybridScore::Linear { w_bm25: 0.5, w_vec: 0.5 },
    ..HybridConfig::default()
};
let r = hybrid_search(&idx, "rust", &vector_results, cfg);
println!("linear: {:?}", r.iter().map(|d| d.doc_id).collect::<Vec<_>>());

HybridConfig:

pub struct HybridConfig {
    pub per_ranker_k: usize,    // top-K to ask each individual ranker for (default 50)
    pub final_k: usize,         // final top-K to return (default 10)
    pub score: HybridScore,     // RRF (default) or Linear { w_bm25, w_vec }
}

If vector_results is empty, hybrid_search falls back to BM25-only — you can pass an empty slice while you’re wiring in your embedder.

7. End-to-End — RAG Retrieval

use heliosdb_lite::search::{
    Bm25Index, ScoredDoc,
    hybrid_search, HybridConfig,
};
use heliosdb_lite::search::filter_skip::BloomTermFilter;
use heliosdb_lite::search::reranker::{mmr};

fn retrieve(
    bm25:    &Bm25Index,
    bloom:   &BloomTermFilter,
    query:   &str,
    knn:     Vec<ScoredDoc>,                // from your HNSW lookup
    final_k: usize,
) -> Vec<ScoredDoc> {
    // Stage 1: BM25 + skip filter, fuse with vector via RRF.
    let cfg = HybridConfig {
        per_ranker_k: 50,
        final_k:      32,                   // wider than final_k for MMR
        ..HybridConfig::default()
    };
    let fused = hybrid_search(bm25, query, &knn, cfg);

    // Stage 2: greedy diversification.
    let sim = |_a: u64, _b: u64| 0.0;       // plug in your cosine_sim by id
    mmr(&fused, 0.7, final_k, sim)
}

Where Next

VECTOR_SEARCH_TUTORIAL — HNSW with PQ compression to populate the vector half.
GRAPH_ENGINE_TUTORIAL — sister RAG-native module for relational hops.
QUERY_PLAN_CACHING_TUTORIAL — cache the SQL that fetches your candidate IDs.
GETTING_STARTED_TUTORIAL — install HeliosDB Lite and run your first query.

Hybrid Search Tutorial — BM25 + Vector

Hybrid Search Tutorial — BM25 + Vector

UVP

Prerequisites

1. The Tokenizer

2. BM25 Index

3. Bloom Filter Skip — BloomTermFilter

4. RRF — Combining Two Ranked Lists

5. MMR — Diversification

6. The Hybrid Path — hybrid_search()

7. End-to-End — RAG Retrieval

Where Next

3. Bloom Filter Skip — `BloomTermFilter`

6. The Hybrid Path — `hybrid_search()`