HeliosDB Full-Text Search

High-performance full-text search engine for HeliosDB v3.0 with advanced query capabilities and multi-language support.

Features

  • Inverted Index: Efficient term-to-document mapping with positional information
  • BM25 Ranking: Industry-standard probabilistic ranking algorithm
  • Multi-language Support: Tokenization and stemming for 10+ languages
  • Advanced Query Syntax: Boolean operators, phrase search, proximity queries, fuzzy matching
  • Highlighting: Context-aware snippet generation with term highlighting
  • Compression: Roaring bitmaps and efficient storage

Quick Start

use heliosdb_fulltext::{FullTextIndex, Language};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Create index
    let mut index = FullTextIndex::new(Language::English)?;

    // Add documents
    index.add_document(1, "The quick brown fox jumps over the lazy dog").await?;
    index.add_document(2, "A fast brown fox leaps across the sleeping hound").await?;
    index.add_document(3, "The lazy dog sleeps in the sun").await?;

    // Search
    let results = index.search("quick fox", 10).await?;
    for result in results {
        println!("Doc {}: score={:.4}", result.doc_id, result.score);
        println!("  {}", result.snippet.text);
    }

    Ok(())
}

Query Syntax

Simple Terms

hello world

Terms are implicitly ANDed together.

Boolean Operators

hello AND world
hello OR world
NOT spam
(quick OR fast) AND fox

Phrase Search

"quick brown fox"

Finds the exact phrase, with terms in sequence.

Proximity Search

"quick fox"~5

Finds terms within 5 words of each other.
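
A proximity check can be sketched on top of the term positions stored in the index. The function below is illustrative only and is not part of the heliosdb_fulltext API: `within_proximity` is a hypothetical name, and it assumes "within N words" means the position difference is at most N.

```rust
/// Returns true if some occurrence of term A lies within `max_distance`
/// positions of some occurrence of term B.
/// Both slices must be sorted in ascending order.
/// Assumption: "within N words" means |pos_a - pos_b| <= N.
fn within_proximity(a: &[u32], b: &[u32], max_distance: u32) -> bool {
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i].abs_diff(b[j]) <= max_distance {
            return true;
        }
        // Advance the pointer at the smaller position; moving the other
        // one could only widen the gap.
        if a[i] < b[j] {
            i += 1;
        } else {
            j += 1;
        }
    }
    false
}
```

Because both position lists are sorted, the two-pointer walk visits each position at most once, so the check is linear in the combined list length.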

Fuzzy Search

hello~2

Allows up to 2 character edits (Levenshtein distance).
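
For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch follows; `levenshtein` and `fuzzy_matches` are hypothetical helper names, and the crate's internal matcher may work differently (for example with an automaton over the term dictionary).

```rust
/// Classic dynamic-programming Levenshtein distance, keeping only two rows.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i chars of `a` and first j of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// True if `candidate` is within `max_edits` edits of `query`,
/// mirroring a query such as `hello~2`.
fn fuzzy_matches(query: &str, candidate: &str, max_edits: usize) -> bool {
    levenshtein(query, candidate) <= max_edits
}
```

With this definition, `hello~2` would accept both "hallo" (one substitution) and "help" (one substitution plus one deletion).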

Wildcards

test* # Matches test, tests, testing, etc.
te?t # Matches test, text, etc.
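
The wildcard semantics above can be sketched as a small matcher. This is an illustration, not the crate's implementation: `wildcard_match` is a hypothetical name, and it assumes `*` matches zero or more characters and `?` matches exactly one.

```rust
/// Matches `term` against `pattern`, where `*` matches any run of
/// characters (including none) and `?` matches exactly one character.
fn wildcard_match(pattern: &str, term: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = term.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Index of the most recent `*` and the term index it was tried at.
    let mut backtrack: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            // First try letting `*` match nothing.
            backtrack = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && (p[pi] == '?' || p[pi] == t[ti]) {
            pi += 1;
            ti += 1;
        } else if let Some((star_pi, star_ti)) = backtrack {
            // Let the last `*` absorb one more character and retry.
            pi = star_pi + 1;
            ti = star_ti + 1;
            backtrack = Some((star_pi, star_ti + 1));
        } else {
            return false;
        }
    }
    // Only trailing `*` characters may remain unmatched in the pattern.
    p[pi..].iter().all(|&c| c == '*')
}
```

A search engine would typically run such a matcher (or a prefix scan for patterns like `test*`) over the term dictionary and then union the posting lists of the matching terms.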

Supported Languages

  • English
  • Spanish
  • French
  • German
  • Italian
  • Portuguese
  • Russian
  • Arabic
  • Dutch
  • Swedish

Architecture

Inverted Index

The inverted index maps terms to posting lists containing:

  • Document IDs
  • Term frequencies
  • Term positions (for phrase queries)
  • Compressed bitmaps for fast intersection

Term: "fox"
├── Posting: doc_id=1, tf=1, positions=[3]
├── Posting: doc_id=2, tf=1, positions=[3]
└── ...
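
To make the structure concrete, here is a minimal in-memory sketch of a positional inverted index. The type and method names (`Posting`, `InvertedIndex`, `docs_with_all`) are hypothetical and do not reflect the crate's internals; tokenization is reduced to whitespace splitting plus lowercasing, and the compressed bitmaps are left out.

```rust
use std::collections::{BTreeMap, HashMap};

/// One entry of a posting list: how often and where a term occurs in a document.
#[derive(Debug, Default, PartialEq)]
struct Posting {
    tf: u32,
    positions: Vec<u32>,
}

#[derive(Default)]
struct InvertedIndex {
    // term -> (doc_id -> posting); the BTreeMap keeps doc IDs sorted.
    postings: HashMap<String, BTreeMap<u64, Posting>>,
}

impl InvertedIndex {
    /// Indexes `text` under `doc_id`, recording zero-based token positions.
    fn add_document(&mut self, doc_id: u64, text: &str) {
        for (pos, token) in text.split_whitespace().enumerate() {
            let posting = self
                .postings
                .entry(token.to_lowercase())
                .or_default()
                .entry(doc_id)
                .or_default();
            posting.tf += 1;
            posting.positions.push(pos as u32);
        }
    }

    /// Returns the sorted IDs of documents that contain every term (implicit AND).
    fn docs_with_all(&self, terms: &[&str]) -> Vec<u64> {
        let mut lists = Vec::with_capacity(terms.len());
        for term in terms {
            match self.postings.get(&term.to_lowercase()) {
                Some(list) => lists.push(list),
                // A term with no postings makes the intersection empty.
                None => return Vec::new(),
            }
        }
        let Some((first, rest)) = lists.split_first() else {
            return Vec::new();
        };
        first
            .keys()
            .copied()
            .filter(|id| rest.iter().all(|list| list.contains_key(id)))
            .collect()
    }
}
```

Indexing the three Quick Start documents with this sketch gives "fox" the postings shown above: position 3 in both document 1 and document 2.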

Tokenization Pipeline

  1. Unicode Normalization: NFC normalization
  2. Word Segmentation: Unicode word boundaries
  3. Lowercasing: Optional case normalization
  4. Stopword Removal: Language-specific stopword lists
  5. Stemming: Porter stemmer for word root extraction
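
The later stages of this pipeline can be approximated with the standard library alone. The sketch below is deliberately simplified and every name in it is hypothetical: NFC normalization is skipped (it needs an external crate), word segmentation is approximated by splitting on non-alphanumeric characters, and `strip_suffix` is a crude stand-in for a real Porter stemmer.

```rust
use std::collections::HashSet;

/// Simplified pipeline: segment, lowercase, drop stopwords, then stem.
/// Assumption: NFC normalization has already been applied to `text`.
fn tokenize(text: &str, stopwords: &HashSet<&str>) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|word| !word.is_empty())
        .map(|word| word.to_lowercase())
        .filter(|word| !stopwords.contains(word.as_str()))
        .map(|word| strip_suffix(&word))
        .collect()
}

/// Toy stemmer: removes one common English suffix if at least three
/// characters remain. This is NOT the Porter algorithm.
fn strip_suffix(word: &str) -> String {
    for suffix in ["ing", "es", "s"] {
        if let Some(stem) = word.strip_suffix(suffix) {
            if stem.chars().count() >= 3 {
                return stem.to_string();
            }
        }
    }
    word.to_string()
}
```

Stopwords are removed before stemming here so that the stopword list can hold plain surface forms; the same ordering is used in the numbered steps above.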

Ranking

BM25 Formula:

score(D,Q) = Σ IDF(qi) × (f(qi,D) × (k1 + 1)) / (f(qi,D) + k1 × (1 - b + b × |D|/avgdl))

Where:

  • f(qi,D): Term frequency of query term qi in document D
  • |D|: Document length
  • avgdl: Average document length
  • k1, b: Tuning parameters (default: k1=1.2, b=0.75)
  • IDF(qi): Inverse document frequency
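
The formula translates almost directly into code. The sketch below is not the crate's scorer: `Bm25` is a hypothetical type, and because the exact IDF variant is not specified here, it assumes the common smoothed form ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number containing the term.

```rust
/// BM25 tuning parameters.
struct Bm25 {
    k1: f64,
    b: f64,
}

impl Bm25 {
    /// Assumed IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
    fn idf(total_docs: u64, docs_with_term: u64) -> f64 {
        let n = docs_with_term as f64;
        let total = total_docs as f64;
        (1.0 + (total - n + 0.5) / (n + 0.5)).ln()
    }

    /// Contribution of a single query term to the score of one document.
    fn term_score(&self, tf: f64, doc_len: f64, avgdl: f64, idf: f64) -> f64 {
        // Length normalization: documents longer than average are penalized.
        let norm = 1.0 - self.b + self.b * doc_len / avgdl;
        idf * (tf * (self.k1 + 1.0)) / (tf + self.k1 * norm)
    }
}
```

The document score is the sum of `term_score` over all query terms. For a document of exactly average length with tf = 1, the fraction reduces to 1, so the term contributes exactly its IDF.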

Highlighting

  1. Find passages with most query term matches
  2. Extract context around matches
  3. Generate HTML/text with highlight markers
  4. Support multiple snippets per document
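
Steps 1 to 3 can be sketched as a sliding window over the document's words. This is an illustration under stated assumptions rather than the crate's highlighter: `best_snippet` is a hypothetical function, the window size is counted in words, matching is exact after lowercasing, and `<em>` is an assumed highlight marker.

```rust
/// Returns the `window`-word passage with the most query-term matches,
/// with each matching word wrapped in <em>...</em>.
fn best_snippet(text: &str, query_terms: &[&str], window: usize) -> String {
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.is_empty() || window == 0 {
        return String::new();
    }
    // A word matches if, stripped of surrounding punctuation and lowercased,
    // it equals one of the query terms.
    let is_match = |w: &str| {
        let w = w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase();
        query_terms.iter().any(|t| t.to_lowercase() == w)
    };
    let hits: Vec<bool> = words.iter().map(|&w| is_match(w)).collect();
    let window = window.min(words.len());

    // Slide a fixed-size window and keep the earliest start with the most hits.
    let mut best_start = 0;
    let mut best_count = 0;
    for start in 0..=words.len() - window {
        let count = hits[start..start + window].iter().filter(|&&h| h).count();
        if count > best_count {
            best_count = count;
            best_start = start;
        }
    }

    words[best_start..best_start + window]
        .iter()
        .zip(&hits[best_start..best_start + window])
        .map(|(w, &hit)| if hit { format!("<em>{w}</em>") } else { w.to_string() })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Supporting multiple snippets per document (step 4) would amount to keeping the top few non-overlapping windows instead of only the best one.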

Configuration

use heliosdb_fulltext::{FullTextIndex, HighlightConfig, IndexConfig, Language, RankingConfig};

let config = IndexConfig {
    language: Language::English,
    ranking: RankingConfig {
        bm25_k1: 1.2,
        bm25_b: 0.75,
        phrase_boost: 2.0,
        proximity_boost: 1.5,
        ..Default::default()
    },
    highlight: HighlightConfig {
        max_snippet_length: 300,
        context_before: 5,
        context_after: 5,
        ..Default::default()
    },
    enable_fuzzy: true,
    fuzzy_threshold: 0.8,
    ..Default::default()
};

let index = FullTextIndex::with_config(config)?;

Performance

  • Indexing: ~100K docs/sec (varies by document size)
  • Search: Sub-millisecond for most queries on a million-document corpus
  • Memory: ~10-20 bytes per token indexed
  • Compression: 5-10x with Roaring bitmaps

Integration with HeliosDB

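// Example integration with the HeliosDB storage layer.
// Assumptions: RowId implements Copy, Eq, and Hash; Result is anyhow::Result,
// as in the Quick Start example.
use std::collections::HashMap;

use anyhow::Result;
use heliosdb_fulltext::FullTextIndex;
use heliosdb_storage::RowId;

struct FullTextColumn {
    index: FullTextIndex,
    row_to_doc_id: HashMap<RowId, u64>,
}

impl FullTextColumn {
    // Note: doc IDs are derived from the map size, so each row must be
    // indexed only once; re-indexing a row would produce colliding IDs.
    async fn index_row(&mut self, row_id: RowId, text: &str) -> Result<()> {
        let doc_id = self.row_to_doc_id.len() as u64;
        self.row_to_doc_id.insert(row_id, doc_id);
        self.index.add_document(doc_id, text).await?;
        Ok(())
    }

    async fn search_rows(&self, query: &str, limit: usize) -> Result<Vec<RowId>> {
        let results = self.index.search(query, limit).await?;

        // Invert the mapping so search hits can be translated back to row IDs.
        let doc_to_row: HashMap<u64, RowId> = self
            .row_to_doc_id
            .iter()
            .map(|(row_id, doc_id)| (*doc_id, *row_id))
            .collect();

        Ok(results
            .iter()
            .filter_map(|r| doc_to_row.get(&r.doc_id).copied())
            .collect())
    }
}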

Testing

# Run unit tests
cargo test
# Run with logging
RUST_LOG=debug cargo test
# Run benchmarks
cargo bench

Future Enhancements

  • CJK (Chinese, Japanese, Korean) tokenization
  • Geo-spatial search integration
  • Faceted search support
  • Spell correction and suggestions
  • Learning to rank (LTR) integration
  • Distributed indexing
  • Real-time index updates with segments
  • Query result caching

License

Apache-2.0