Hybrid Search Tutorial

Difficulty: Intermediate
Prerequisites: Basic HeliosDB knowledge, familiarity with full-text search concepts
Module: src/search/

This tutorial covers BM25 lexical search, hybrid BM25+vector search, and the RRF/MMR reranking algorithms available in HeliosDB.


1. Overview

HeliosDB provides three search capabilities:

| Capability | Module | Best For |
| --- | --- | --- |
| BM25 | search::bm25 | Keyword-based full-text search |
| Vector | vector/ (HNSW) | Semantic similarity search |
| Hybrid | search::hybrid | Combining keyword + semantic for RAG pipelines |

The search module is designed for RAG (Retrieval-Augmented Generation) workloads where you need both lexical precision and semantic recall.


2. BM25 Inverted Index

Creating an Index

use heliosdb_lite::search::{Bm25Index, Bm25Params};
// Default Okapi BM25 parameters: k1=1.2, b=0.75
let mut index = Bm25Index::new(Bm25Params::default());
// Alternatively, custom parameters for short documents
let mut index = Bm25Index::new(Bm25Params { k1: 1.5, b: 0.5 });

Indexing Documents

Each document has a 64-bit ID and text content:

index.add_document(1, "HeliosDB is a fast embedded database engine");
index.add_document(2, "PostgreSQL is a powerful open source database");
index.add_document(3, "Vector search enables semantic similarity queries");
index.add_document(4, "BM25 is a probabilistic information retrieval function");

Searching

let results = index.search("database engine", 10);
for doc in &results {
    println!("Doc {}: score {:.4}", doc.doc_id, doc.score);
}
// Example output (exact scores depend on the tokenizer and BM25 parameters):
// Doc 1: score 0.8234 (matches both "database" and "engine")
// Doc 2: score 0.4112 (matches "database" only)

BM25 Scoring Formula

The Okapi BM25 score for a document D and query Q is:

score(D, Q) = sum_{t in Q} IDF(t) * f(t,D)*(k1+1) / (f(t,D) + k1*(1 - b + b*|D|/avgdl))

Where:

  • IDF(t) = ln((N - n(t) + 0.5) / (n(t) + 0.5) + 1) — inverse document frequency
  • f(t,D) — term frequency of t in document D
  • |D| — document length in tokens
  • avgdl — average document length across all documents
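
To make the formula concrete, here is a minimal standalone sketch in plain Rust. It is not HeliosDB's implementation: the function names `idf`, `bm25_term_score`, and `bm25_score` are hypothetical and only mirror the symbols defined above.

```rust
/// Inverse document frequency, following the IDF definition above.
/// `n_docs` is N (corpus size); `doc_freq` is n(t) (documents containing the term).
fn idf(n_docs: usize, doc_freq: usize) -> f64 {
    let n = n_docs as f64;
    let df = doc_freq as f64;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Contribution of one query term to the BM25 score of one document.
fn bm25_term_score(tf: f64, doc_len: f64, avgdl: f64, idf: f64, k1: f64, b: f64) -> f64 {
    // Length normalization: equals 1.0 when the document has average length.
    let norm = 1.0 - b + b * doc_len / avgdl;
    idf * tf * (k1 + 1.0) / (tf + k1 * norm)
}

/// Full score: the sum of per-term contributions.
/// `terms` holds (term frequency in this document, document frequency) per query term.
fn bm25_score(
    terms: &[(f64, usize)],
    doc_len: f64,
    avgdl: f64,
    n_docs: usize,
    k1: f64,
    b: f64,
) -> f64 {
    terms
        .iter()
        .map(|&(tf, df)| bm25_term_score(tf, doc_len, avgdl, idf(n_docs, df), k1, b))
        .sum()
}
```

A useful sanity check: for a document of exactly average length in which the term occurs once, the term-frequency factor reduces to 1, so the term contributes exactly IDF(t). A term that does not occur in the document (f = 0) contributes nothing.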

3. Unicode-Aware Tokenizer

The tokenizer handles international text correctly:

use heliosdb_lite::search::tokenize;
let tokens = tokenize("Café résumé naïve", &Default::default());
// tokens = ["cafe", "resume", "naive"]
// ASCII folding: accented characters mapped to ASCII equivalents

Features:

  • Lowercase normalization
  • ASCII folding (e.g., “é” to “e”)
  • Whitespace and punctuation splitting
  • Configurable via TokenizerConfig
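
The three normalization steps listed above can be sketched with the standard library alone. This is an illustration, not the HeliosDB tokenizer: `fold_ascii` and `tokenize_sketch` are hypothetical names, the folding table covers only a few Latin characters, and the input is assumed to use precomposed characters.

```rust
/// Map a small set of accented Latin letters to their ASCII base letter.
/// Illustrative only: a production folding table covers far more of Unicode.
fn fold_ascii(c: char) -> char {
    match c {
        'à' | 'á' | 'â' | 'ã' | 'ä' | 'å' => 'a',
        'ç' => 'c',
        'è' | 'é' | 'ê' | 'ë' => 'e',
        'ì' | 'í' | 'î' | 'ï' => 'i',
        'ñ' => 'n',
        'ò' | 'ó' | 'ô' | 'õ' | 'ö' => 'o',
        'ù' | 'ú' | 'û' | 'ü' => 'u',
        other => other,
    }
}

/// Split on anything that is not alphanumeric, then lowercase and fold each piece.
fn tokenize_sketch(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|piece| !piece.is_empty())
        .map(|piece| {
            piece
                .chars()
                .flat_map(char::to_lowercase) // lowercase first, so 'É' becomes 'é'
                .map(fold_ascii)              // then fold 'é' to 'e'
                .collect::<String>()
        })
        .collect()
}
```

Lowercasing before folding keeps the folding table small, since only lowercase entries are needed.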

4. Bloom Filter Skip

For large corpora, the Bloom filter skip adapter eliminates unnecessary posting list lookups:

use heliosdb_lite::search::filter_skip::TermFilter;
// If the Bloom filter says "xyzzy" is NOT in any posting list,
// the BM25 index skips the lookup entirely -- zero I/O cost.

This is especially effective when queries contain rare or misspelled terms that do not exist in the corpus.
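
The tutorial does not show the `TermFilter` API, so the following is a generic Bloom filter sketch that illustrates the skip logic only. The names `TermBloom`, `insert`, and `may_contain` are hypothetical, and the sizing (bit count, number of hashes) is arbitrary.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A tiny Bloom filter over terms (hypothetical; not HeliosDB's `TermFilter`).
struct TermBloom {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl TermBloom {
    /// `num_bits` must be greater than zero.
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        TermBloom { bits: vec![false; num_bits], num_hashes }
    }

    /// Bit position for one hash of `term`; the seed is hashed together with the term.
    fn position(&self, term: &str, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        term.hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }

    fn insert(&mut self, term: &str) {
        for seed in 0..self.num_hashes {
            let pos = self.position(term, seed);
            self.bits[pos] = true;
        }
    }

    /// `false`: the term is definitely absent, so the posting-list lookup can be skipped.
    /// `true`: the term is possibly present, so the lookup must still run.
    fn may_contain(&self, term: &str) -> bool {
        (0..self.num_hashes).all(|seed| self.bits[self.position(term, seed)])
    }
}
```

A Bloom filter never reports a false negative, which is what makes skipping safe: a `false` answer guarantees the term was never inserted. False positives are possible and only cost a lookup that would have happened anyway.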


5. Hybrid Search

Hybrid search combines BM25 lexical results with vector similarity results into a single ranking.

Setup

use heliosdb_lite::search::{
    hybrid_search, HybridConfig, HybridScore,
    Bm25Index, Bm25Params, ScoredDoc,
};
// Build BM25 index
let mut bm25 = Bm25Index::new(Bm25Params::default());
bm25.add_document(1, "machine learning optimization");
bm25.add_document(2, "database query optimization");
bm25.add_document(3, "neural network training");
// Vector results from HNSW index (pre-computed)
let vector_results = vec![
    ScoredDoc { doc_id: 3, score: 0.95 },
    ScoredDoc { doc_id: 1, score: 0.82 },
    ScoredDoc { doc_id: 2, score: 0.45 },
];

Linear Fusion

Weight BM25 and vector scores directly:

let config = HybridConfig {
    per_ranker_k: 50,
    final_k: 10,
    score: HybridScore::Linear { w_bm25: 0.4, w_vec: 0.6 },
};
let results = hybrid_search(&bm25, "optimization", &vector_results, config);
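
As a rough picture of what a weighted combination does, the sketch below sums weighted scores per document ID. It is an assumption-laden illustration rather than the library's code: `linear_fuse` is a hypothetical name, a document missing from one list is assumed to contribute 0 for that list, and no score normalization is applied (the tutorial does not say whether HeliosDB normalizes scores before weighting).

```rust
use std::collections::HashMap;

/// Weighted sum of two score lists keyed by document ID.
/// A document absent from one list contributes 0.0 for that list (an assumption).
fn linear_fuse(
    bm25: &[(u64, f64)],
    vector: &[(u64, f64)],
    w_bm25: f64,
    w_vec: f64,
    final_k: usize,
) -> Vec<(u64, f64)> {
    let mut combined: HashMap<u64, f64> = HashMap::new();
    for &(id, score) in bm25 {
        *combined.entry(id).or_insert(0.0) += w_bm25 * score;
    }
    for &(id, score) in vector {
        *combined.entry(id).or_insert(0.0) += w_vec * score;
    }
    let mut fused: Vec<(u64, f64)> = combined.into_iter().collect();
    // Highest score first; ties broken by document ID for a deterministic order.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused.truncate(final_k);
    fused
}
```

Because raw BM25 scores are unbounded while similarity scores usually fall in a fixed range, the effective balance between the two rankers depends on their score scales as well as on the weights. That sensitivity is the main reason to consider rank-based fusion instead.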

RRF Fusion (Default)

Reciprocal Rank Fusion combines rankings without requiring score calibration:

let config = HybridConfig {
    per_ranker_k: 50,
    final_k: 10,
    score: HybridScore::Rrf(Default::default()), // k=60
};
let results = hybrid_search(&bm25, "optimization", &vector_results, config);

RRF scores each document as sum(1 / (k + rank)) across all input lists. The smoothing constant k (default 60) prevents top-ranked items from dominating.
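
The scoring rule is short enough to write out in full. The sketch below is a standalone illustration, not the library's `rrf` function: `rrf_sketch` is a hypothetical name, input lists hold document IDs in best-first order, and ranks are assumed to start at 1.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over ranked lists of document IDs (best first).
/// Ranks are 1-based here, which is an assumption about the convention.
fn rrf_sketch(lists: &[&[u64]], k: f64, top_n: usize) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in lists {
        for (i, &doc_id) in list.iter().enumerate() {
            let rank = (i + 1) as f64;
            // A document absent from a list simply adds nothing for that list.
            *scores.entry(doc_id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by document ID.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused.truncate(top_n);
    fused
}
```

With k = 60, rank 1 contributes 1/61 ≈ 0.0164 and rank 2 contributes 1/62 ≈ 0.0161. The gap between adjacent ranks is small, so a document that appears in both lists (for example 1/62 + 1/61 ≈ 0.0325) outranks a document that tops only one list.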


6. Rerankers

Reciprocal Rank Fusion (RRF)

Combine any number of ranked result lists:

use heliosdb_lite::search::reranker::{rrf, RrfParams};
use heliosdb_lite::search::ScoredDoc;
let list_a = vec![
    ScoredDoc { doc_id: 1, score: 0.9 },
    ScoredDoc { doc_id: 2, score: 0.7 },
];
let list_b = vec![
    ScoredDoc { doc_id: 2, score: 0.95 },
    ScoredDoc { doc_id: 3, score: 0.80 },
];
let fused = rrf(&[&list_a, &list_b], RrfParams { k: 60 }, 10);
// Doc 2 ranks highest (appears in both lists)

Maximal Marginal Relevance (MMR)

Diversify results by penalizing redundancy:

use heliosdb_lite::search::reranker::mmr;
// lambda = 1.0: pure relevance (no diversification)
// lambda = 0.5: balance relevance and diversity
// lambda = 0.0: maximize diversity
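// `candidates` is a relevance-scored result list (for example, the output of
// hybrid_search); `cosine_similarity` is assumed to be defined elsewhere.
let diversified = mmr(
    &candidates,
    0.7,                            // lambda: 70% relevance, 30% diversity
    |a, b| cosine_similarity(a, b), // similarity function
    10,                             // return top 10
);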

MMR builds the result list greedily. At each step it picks, from the candidates not yet selected, the one that maximizes:

MMR = argmax_{d in candidates} [lambda * relevance(d) - (1-lambda) * max_{s in selected} sim(d, s)]
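
The greedy loop can be sketched as follows. This is an illustration under stated assumptions, not the library's `mmr`: `mmr_sketch` is a hypothetical name, candidates are `(doc_id, relevance)` pairs, the similarity callback takes two document IDs, and similarities are assumed to be non-negative.

```rust
/// Greedy MMR over candidates given as (doc_id, relevance).
/// `sim` returns the similarity between two documents, identified by ID.
fn mmr_sketch<F>(candidates: &[(u64, f64)], lambda: f64, sim: F, top_n: usize) -> Vec<u64>
where
    F: Fn(u64, u64) -> f64,
{
    let mut remaining: Vec<(u64, f64)> = candidates.to_vec();
    let mut selected: Vec<u64> = Vec::new();
    while selected.len() < top_n && !remaining.is_empty() {
        let mut best_idx = 0;
        let mut best_score = f64::NEG_INFINITY;
        for (i, &(id, relevance)) in remaining.iter().enumerate() {
            // Redundancy: highest similarity to anything already selected.
            // Starts at 0.0, so the first pick is by relevance alone
            // (this assumes similarities are non-negative).
            let redundancy = selected
                .iter()
                .map(|&s| sim(id, s))
                .fold(0.0_f64, f64::max);
            let score = lambda * relevance - (1.0 - lambda) * redundancy;
            if score > best_score {
                best_score = score;
                best_idx = i;
            }
        }
        // Move the winner from the candidate pool to the output.
        selected.push(remaining.remove(best_idx).0);
    }
    selected
}
```

For example, with relevances 0.9, 0.85, and 0.5 for documents 1, 2, and 3, where documents 1 and 2 are near-duplicates (similarity 0.95) and all other pairs have similarity 0.1, lambda = 0.5 selects document 1 first and then document 3, because document 2 is penalized for overlapping with document 1. With lambda = 1.0 the order is simply by relevance.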

7. RAG Pipeline Example

A typical RAG pipeline that combines BM25, vector search, and hybrid fusion, then diversifies with MMR:

// 1. Index documents with BM25
let mut bm25 = Bm25Index::new(Bm25Params::default());
for (id, text) in documents {
    bm25.add_document(id, &text);
}
// 2. Get vector results from HNSW
// (using HeliosDB's existing vector search; hnsw_search stands in for that call)
let vector_results = hnsw_search(&query_embedding, 50);
// 3. Hybrid fusion
let hybrid = hybrid_search(&bm25, &query_text, &vector_results, HybridConfig::default());
// 4. MMR diversification (similarity_fn stands in for your similarity function)
let final_results = mmr(&hybrid, 0.7, similarity_fn, 5);
// 5. Pass top-5 to LLM as context

8. Parameter Tuning Guide

| Parameter | Default | Effect of Increasing |
| --- | --- | --- |
| BM25 k1 | 1.2 | Term frequency saturates more slowly, so repeated terms add more to the score |
| BM25 b | 0.75 | Stronger document length normalization |
| RRF k | 60 | Less dominance by top-ranked items |
| MMR lambda | 0.7 | More relevance, less diversity |
| per_ranker_k | 50 | More candidates from each ranker |
| final_k | 10 | More results returned |

9. Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| All scores are 0 | Query terms not in any document | Check tokenization; verify documents were indexed |
| Poor relevance | Default BM25 params not suited | Tune k1/b for your document length distribution |
| Redundant results | No diversification | Apply MMR with lambda < 1.0 |
| Slow on large corpus | No Bloom filter | Enable filter_skip for term non-existence checks |

Next Steps

  • Combine with HeliosDB’s existing FTS (@@ operator) for SQL-level text search
  • Use HNSW indexes for the vector component of hybrid search
  • See tests/bm25_with_bloom.rs for comprehensive test examples