Hybrid Search Tutorial

Difficulty: Intermediate
Prerequisites: Basic HeliosDB knowledge, familiarity with full-text search concepts
Module: src/search/

This tutorial covers BM25 lexical search, hybrid BM25+vector search, and the RRF/MMR reranking algorithms available in HeliosDB.

1. Overview

HeliosDB provides three search capabilities:

Capability	Module	Best For
BM25	`search::bm25`	Keyword-based full-text search
Vector	`vector/` (HNSW)	Semantic similarity search
Hybrid	`search::hybrid`	Combining keyword + semantic for RAG pipelines

The search module is designed for RAG (Retrieval-Augmented Generation) workloads where you need both lexical precision and semantic recall.

2. BM25 Inverted Index

Creating an Index

use heliosdb_lite::search::{Bm25Index, Bm25Params};

// Default Okapi BM25 parameters: k1=1.2, b=0.75
let mut index = Bm25Index::new(Bm25Params::default());

// Custom parameters for short documents
let mut index = Bm25Index::new(Bm25Params { k1: 1.5, b: 0.5 });

Indexing Documents

Each document has a 64-bit ID and text content:

index.add_document(1, "HeliosDB is a fast embedded database engine");
index.add_document(2, "PostgreSQL is a powerful open source database");
index.add_document(3, "Vector search enables semantic similarity queries");
index.add_document(4, "BM25 is a probabilistic information retrieval function");

Searching

let results = index.search("database engine", 10);

for doc in &results {
    println!("Doc {}: score {:.4}", doc.doc_id, doc.score);
}
// Illustrative output; exact scores depend on tokenizer settings and corpus statistics:
// Doc 1: score 0.8234  (matches both "database" and "engine")
// Doc 2: score 0.4112  (matches "database" only)

BM25 Scoring Formula

The Okapi BM25 score for a document D and query Q is:

score(D, Q) = sum_{t in Q} IDF(t) * f(t,D)*(k1+1) / (f(t,D) + k1*(1 - b + b*|D|/avgdl))

Where:

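IDF(t) = ln((N - n(t) + 0.5) / (n(t) + 0.5) + 1): inverse document frequency of term t
N: total number of documents in the index
n(t): number of documents containing term t
f(t,D): term frequency of t in document D
|D|: length of document D in tokens
avgdl: average document length across all documents
k1, b: the Bm25Params values (term frequency saturation and length normalization)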
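To make the formula concrete, here is a minimal standalone sketch that scores a small in-memory corpus. It is not the `Bm25Index` implementation: the names `simple_tokens`, `idf`, and `bm25_scores` are hypothetical, and tokenization is assumed to be plain whitespace splitting plus lowercasing.

```rust
use std::collections::HashMap;

/// Naive stand-in for a real tokenizer: whitespace split + lowercase (an assumption).
fn simple_tokens(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// IDF(t) = ln((N - n(t) + 0.5) / (n(t) + 0.5) + 1)
fn idf(n_docs: usize, doc_freq: usize) -> f64 {
    let (n, df) = (n_docs as f64, doc_freq as f64);
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Returns one BM25 score per document, in the same order as `docs`.
fn bm25_scores(docs: &[&str], query: &str, k1: f64, b: f64) -> Vec<f64> {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| simple_tokens(d)).collect();
    let n_docs = tokenized.len();
    let total_len: usize = tokenized.iter().map(|d| d.len()).sum();
    if total_len == 0 {
        return vec![0.0; n_docs]; // nothing indexed, nothing can match
    }
    let avgdl = total_len as f64 / n_docs as f64;

    // n(t): number of documents containing each query term.
    let query_terms = simple_tokens(query);
    let doc_freq: HashMap<&str, usize> = query_terms
        .iter()
        .map(|term| {
            let df = tokenized.iter().filter(|d| d.contains(term)).count();
            (term.as_str(), df)
        })
        .collect();

    tokenized
        .iter()
        .map(|doc| {
            // k1 * (1 - b + b * |D| / avgdl)
            let norm = k1 * (1.0 - b + b * doc.len() as f64 / avgdl);
            query_terms
                .iter()
                .map(|term| {
                    let tf = doc.iter().filter(|t| *t == term).count() as f64;
                    if tf == 0.0 {
                        return 0.0; // absent terms contribute nothing
                    }
                    idf(n_docs, doc_freq[term.as_str()]) * tf * (k1 + 1.0) / (tf + norm)
                })
                .sum::<f64>()
        })
        .collect()
}
```

Calling `bm25_scores` with the four documents indexed above, the query "database engine", and k1=1.2, b=0.75 ranks document 1 above document 2 and gives the other two a score of zero. The absolute values will differ from a real index whose tokenizer behaves differently.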

3. Unicode-Aware Tokenizer

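The tokenizer lowercases its input and folds accented characters to their ASCII equivalents, so accented and unaccented spellings of a word produce the same token: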

use heliosdb_lite::search::tokenize;

let tokens = tokenize("Café résumé naïve", &Default::default());
// tokens = ["cafe", "resume", "naive"]
// ASCII folding: accented characters mapped to ASCII equivalents

Features:

Lowercase normalization
ASCII folding (e.g., "é" becomes "e")
Whitespace and punctuation splitting
Configurable via TokenizerConfig
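The features listed above can be sketched in a few lines of standalone Rust. This is an illustration only, not the library's `tokenize`: `fold_char` and `tokenize_sketch` are hypothetical names, the folding table covers just a handful of precomposed Latin letters, and text in decomposed Unicode form would need normalization first.

```rust
/// Map a small set of precomposed accented Latin letters to ASCII.
/// Anything not listed passes through unchanged.
fn fold_char(c: char) -> char {
    match c {
        'à' | 'á' | 'â' | 'ã' | 'ä' | 'å' => 'a',
        'ç' => 'c',
        'è' | 'é' | 'ê' | 'ë' => 'e',
        'ì' | 'í' | 'î' | 'ï' => 'i',
        'ñ' => 'n',
        'ò' | 'ó' | 'ô' | 'õ' | 'ö' => 'o',
        'ù' | 'ú' | 'û' | 'ü' => 'u',
        'ý' | 'ÿ' => 'y',
        other => other,
    }
}

/// Split on anything that is not alphanumeric, then lowercase and fold each piece.
fn tokenize_sketch(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|piece| !piece.is_empty())
        .map(|piece| {
            piece
                .chars()
                .flat_map(char::to_lowercase) // lowercase first, so 'É' reaches fold_char as 'é'
                .map(fold_char)
                .collect::<String>()
        })
        .collect()
}
```

Lowercasing before folding keeps the folding table small, because only lowercase forms need an entry.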

4. Bloom Filter Skip

For large corpora, the Bloom filter skip adapter eliminates unnecessary posting list lookups:

use heliosdb_lite::search::filter_skip::TermFilter;

// If the Bloom filter says "xyzzy" is NOT in any posting list,
// the BM25 index skips the lookup entirely -- zero I/O cost.

This is especially effective when queries contain rare or misspelled terms that do not exist in the corpus.
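The idea behind the skip can be illustrated with a tiny Bloom filter built on the standard library. This sketch is not the `TermFilter` adapter: `TermBloom` and its methods are hypothetical, and a `Vec<bool>` is used in place of a packed bit array for readability. The key property is that a Bloom filter can return false positives but never false negatives, so a "no" answer is always safe to act on.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical term-membership filter; not the library's type.
struct TermBloom {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl TermBloom {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        TermBloom {
            bits: vec![false; num_bits.max(1)],
            num_hashes: num_hashes.max(1),
        }
    }

    /// Bit position for the i-th hash of `term`; the seed `i` is mixed in before the term.
    fn position(&self, term: &str, i: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        i.hash(&mut hasher);
        term.hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }

    /// Record a term that has a posting list.
    fn insert(&mut self, term: &str) {
        for i in 0..self.num_hashes {
            let pos = self.position(term, i);
            self.bits[pos] = true;
        }
    }

    /// `false`: the term is definitely absent, so the posting lookup can be skipped.
    /// `true`: the term may be present, so the lookup must still happen.
    fn may_contain(&self, term: &str) -> bool {
        (0..self.num_hashes).all(|i| self.bits[self.position(term, i)])
    }
}
```

A query path would call `may_contain` for each term and only touch the posting list when it returns `true`. Sizing the bit array and hash count for an acceptable false-positive rate is a separate tuning question that this sketch does not address.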

5. Hybrid Search

Hybrid search combines BM25 lexical results with vector similarity results into a single ranking.

Setup

use heliosdb_lite::search::{
    hybrid_search, HybridConfig, HybridScore,
    Bm25Index, Bm25Params, ScoredDoc,
};

// Build BM25 index
let mut bm25 = Bm25Index::new(Bm25Params::default());
bm25.add_document(1, "machine learning optimization");
bm25.add_document(2, "database query optimization");
bm25.add_document(3, "neural network training");

// Vector results from HNSW index (pre-computed)
let vector_results = vec![
    ScoredDoc { doc_id: 3, score: 0.95 },
    ScoredDoc { doc_id: 1, score: 0.82 },
    ScoredDoc { doc_id: 2, score: 0.45 },
];

Linear Fusion

Weight BM25 and vector scores directly:

let config = HybridConfig {
    per_ranker_k: 50,
    final_k: 10,
    score: HybridScore::Linear { w_bm25: 0.4, w_vec: 0.6 },
};

let results = hybrid_search(&bm25, "optimization", &vector_results, config);
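As a rough picture of what a weighted sum involves, the following standalone sketch fuses two scored lists by document ID. It is not the `hybrid_search` implementation: `Scored` and `linear_fuse` are hypothetical, and the sketch assumes that a document missing from one list contributes zero from that ranker. Whether the library normalizes scores before weighting is not stated here, so the sketch applies the weights to raw scores.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a scored document.
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    doc_id: u64,
    score: f32,
}

/// combined(d) = w_bm25 * bm25(d) + w_vec * vec(d), highest first.
fn linear_fuse(
    bm25: &[Scored],
    vector: &[Scored],
    w_bm25: f32,
    w_vec: f32,
    final_k: usize,
) -> Vec<Scored> {
    let mut combined: HashMap<u64, f32> = HashMap::new();
    for d in bm25 {
        *combined.entry(d.doc_id).or_insert(0.0) += w_bm25 * d.score;
    }
    for d in vector {
        *combined.entry(d.doc_id).or_insert(0.0) += w_vec * d.score;
    }

    let mut fused: Vec<Scored> = combined
        .into_iter()
        .map(|(doc_id, score)| Scored { doc_id, score })
        .collect();
    // Sort by score descending; break ties by doc_id so the order is deterministic.
    fused.sort_by(|a, b| b.score.total_cmp(&a.score).then(a.doc_id.cmp(&b.doc_id)));
    fused.truncate(final_k);
    fused
}
```

Raw BM25 scores are unbounded while cosine similarities usually fall in a fixed range, so a weighted sum of raw scores can be dominated by one ranker. If that happens, rescaling each list before fusing, or switching to rank-based fusion, is worth considering.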

RRF Fusion (Default)

Reciprocal Rank Fusion combines rankings without requiring score calibration:

let config = HybridConfig {
    per_ranker_k: 50,
    final_k: 10,
    score: HybridScore::Rrf(Default::default()), // k=60
};

let results = hybrid_search(&bm25, "optimization", &vector_results, config);

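RRF scores each document as sum(1 / (k + rank)) across all input lists, where rank is the document's position in each list. The smoothing constant k (default 60) prevents top-ranked items from dominating: the larger k is, the smaller the gap between the contributions of adjacent ranks.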

6. Rerankers

Reciprocal Rank Fusion (RRF)

Combine any number of ranked result lists:

use heliosdb_lite::search::reranker::{rrf, RrfParams};

let list_a = vec![
    ScoredDoc { doc_id: 1, score: 0.9 },
    ScoredDoc { doc_id: 2, score: 0.7 },
];
let list_b = vec![
    ScoredDoc { doc_id: 2, score: 0.95 },
    ScoredDoc { doc_id: 3, score: 0.80 },
];

let fused = rrf(&[&list_a, &list_b], RrfParams { k: 60 }, 10);
// Doc 2 ranks highest (appears in both lists)
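To check the claim about document 2 by hand, here is a minimal standalone sketch of the RRF sum. It is not the library's `rrf`: `rrf_sketch` is a hypothetical name, it takes bare document IDs in ranked order, and it assumes ranks start at 1.

```rust
use std::collections::HashMap;

/// Fuse ranked ID lists with score(d) = sum over lists of 1 / (k + rank(d)).
/// Ranks are assumed to be 1-based; documents absent from a list add nothing for it.
fn rrf_sketch(lists: &[&[u64]], k: f64, top_n: usize) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in lists {
        for (idx, doc_id) in list.iter().enumerate() {
            let rank = (idx + 1) as f64;
            *scores.entry(*doc_id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }

    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by doc_id for a deterministic order.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused.truncate(top_n);
    fused
}
```

With the two lists above reduced to IDs ([1, 2] and [2, 3]) and k = 60, document 2 scores 1/62 + 1/61, document 1 scores 1/61, and document 3 scores 1/62, so the fused order is 2, 1, 3. The original scores never enter the calculation, which is why the rankers need no calibration against each other.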

Maximal Marginal Relevance (MMR)

Diversify results by penalizing redundancy:

use heliosdb_lite::search::reranker::mmr;

// lambda = 1.0: pure relevance (no diversification)
// lambda = 0.5: balance relevance and diversity
// lambda = 0.0: maximize diversity

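// `candidates` and `cosine_similarity` are supplied by the caller and are not shown here.
let diversified = mmr(
    &candidates,   // ranked results to diversify, e.g. the output of hybrid_search
    0.7,           // lambda: 70% relevance, 30% diversity
    |a, b| cosine_similarity(a, b),  // similarity function
    10,            // return top 10
);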

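MMR builds the result list greedily. At each step it moves the best remaining candidate into the selected set:

next = argmax_{d in candidates \ selected} [lambda * relevance(d) - (1-lambda) * max_{s in selected} sim(d, s)]

For the first pick the selected set is empty, so the penalty term is zero and the most relevant candidate wins.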
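The greedy loop can be sketched in standalone Rust as follows. This is not the library's `mmr`: `mmr_sketch` is a hypothetical name, candidates are identified by their index into a relevance slice, and the similarity function is assumed to be symmetric.

```rust
/// Greedy MMR over candidates 0..relevance.len(); returns the chosen indices in pick order.
fn mmr_sketch<F>(relevance: &[f64], lambda: f64, sim: F, top_n: usize) -> Vec<usize>
where
    F: Fn(usize, usize) -> f64,
{
    let mut selected: Vec<usize> = Vec::new();
    let mut remaining: Vec<usize> = (0..relevance.len()).collect();

    while selected.len() < top_n && !remaining.is_empty() {
        let mut best_pos = 0;
        let mut best_score = f64::NEG_INFINITY;
        for (pos, &cand) in remaining.iter().enumerate() {
            // Redundancy: highest similarity to anything already picked (zero for the first pick).
            let redundancy = if selected.is_empty() {
                0.0
            } else {
                selected
                    .iter()
                    .map(|&s| sim(cand, s))
                    .fold(f64::NEG_INFINITY, f64::max)
            };
            let score = lambda * relevance[cand] - (1.0 - lambda) * redundancy;
            if score > best_score {
                best_score = score;
                best_pos = pos;
            }
        }
        // Move the winner from the remaining pool into the result list.
        selected.push(remaining.remove(best_pos));
    }
    selected
}
```

As a usage example, take relevances [0.9, 0.85, 0.5] where candidates 0 and 1 are near duplicates (similarity 0.95) and every other pair has similarity 0.1. With lambda = 0.5 and top_n = 2 the sketch picks 0 and then 2, skipping the near duplicate. With lambda = 1.0 it picks 0 and then 1, which is plain relevance order.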

7. RAG Pipeline Example

A typical RAG pipeline combines all three capabilities and finishes with MMR diversification:

// 1. Index documents with BM25
let mut bm25 = Bm25Index::new(Bm25Params::default());
for (id, text) in documents {
    bm25.add_document(id, &text);
}

// 2. Get vector results from HNSW
// (using HeliosDB's existing vector search)
let vector_results = hnsw_search(&query_embedding, 50);

// 3. Hybrid fusion
let hybrid = hybrid_search(&bm25, &query_text, &vector_results, HybridConfig::default());

// 4. MMR diversification
let final_results = mmr(&hybrid, 0.7, similarity_fn, 5);

// 5. Pass top-5 to LLM as context

8. Parameter Tuning Guide

Parameter	Default	Effect of Increasing
BM25 k1	1.2	Term frequency saturates more slowly, so repeated terms add more to the score
BM25 b	0.75	Stronger document length normalization
RRF k	60	Less dominance by top-ranked items
MMR lambda	0.7	More relevance, less diversity
per_ranker_k	50	More candidates from each ranker
final_k	10	More results returned
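The effect of k1 is easiest to see in isolation. The sketch below evaluates only the term-frequency factor of the BM25 formula for a document of average length, where the length normalization term equals 1. The function names are hypothetical and exist only for this illustration.

```rust
/// BM25 term-frequency factor when |D| = avgdl: tf * (k1 + 1) / (tf + k1).
/// It equals 1 at tf = 1 and approaches k1 + 1 as tf grows.
fn tf_factor(tf: f64, k1: f64) -> f64 {
    tf * (k1 + 1.0) / (tf + k1)
}

/// How much more a term occurring `tf` times contributes than a term occurring once.
fn repeat_gain(tf: f64, k1: f64) -> f64 {
    tf_factor(tf, k1) / tf_factor(1.0, k1)
}
```

For a term that occurs 10 times, `repeat_gain` is about 1.96 with k1 = 1.2 and 2.5 with k1 = 2.0. Raising k1 therefore rewards repetition more, while the factor stays capped at k1 + 1 however often the term appears.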

9. Troubleshooting

Issue	Cause	Fix
All scores are 0	Query terms not in any document	Check tokenization; verify documents were indexed
Poor relevance	Default BM25 params not suited	Tune k1/b for your document length distribution
Redundant results	No diversification	Apply MMR with lambda < 1.0
Slow on large corpus	No Bloom filter	Enable filter_skip for term non-existence checks

Next Steps

Combine with HeliosDB’s existing FTS (@@ operator) for SQL-level text search
Use HNSW indexes for the vector component of hybrid search
See tests/bm25_with_bloom.rs for comprehensive test examples