HeliosDB Full-Text Search
High-performance full-text search engine for HeliosDB v3.0 with advanced query capabilities and multi-language support.
Features
- Inverted Index: Efficient term-to-document mapping with positional information
- BM25 Ranking: Industry-standard probabilistic ranking algorithm
- Multi-language Support: Tokenization and stemming for 10+ languages
- Advanced Query Syntax: Boolean operators, phrase search, proximity queries, fuzzy matching
- Highlighting: Context-aware snippet generation with term highlighting
- Compression: Roaring bitmaps and efficient storage
Quick Start
    use heliosdb_fulltext::{FullTextIndex, Language};

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        // Create index
        let mut index = FullTextIndex::new(Language::English)?;

        // Add documents
        index.add_document(1, "The quick brown fox jumps over the lazy dog").await?;
        index.add_document(2, "A fast brown fox leaps across the sleeping hound").await?;
        index.add_document(3, "The lazy dog sleeps in the sun").await?;

        // Search
        let results = index.search("quick fox", 10).await?;

        for result in results {
            println!("Doc {}: score={:.4}", result.doc_id, result.score);
            println!(" {}", result.snippet.text);
        }

        Ok(())
    }
Query Syntax
Simple Terms
    hello world
Terms are implicitly ANDed together.
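Implicit AND can be read as intersecting the sets of documents that contain each term. The sketch below illustrates that idea only; the `intersect_terms` function and the term-to-document-ID map are hypothetical and are not part of the `heliosdb_fulltext` API.

```rust
use std::collections::{BTreeSet, HashMap};

/// Returns the IDs of documents that contain every term in `terms`.
/// `postings` is a hypothetical term -> document-ID map used for illustration.
fn intersect_terms(
    postings: &HashMap<String, BTreeSet<u64>>,
    terms: &[&str],
) -> BTreeSet<u64> {
    let mut iter = terms.iter();
    // Start from the first term's documents; an empty query or an unknown
    // term matches nothing.
    let mut acc = match iter.next().and_then(|t| postings.get(*t)) {
        Some(set) => set.clone(),
        None => return BTreeSet::new(),
    };
    for term in iter {
        match postings.get(*term) {
            // Keep only documents that also contain this term.
            Some(set) => acc = acc.intersection(set).copied().collect(),
            None => return BTreeSet::new(),
        }
    }
    acc
}
```

With postings `quick -> {1}` and `fox -> {1, 2}`, the query `quick fox` would return only document 1 under this model.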
Boolean Operators
    hello AND world
    hello OR world
    NOT spam
    (quick OR fast) AND fox
Phrase Search
    "quick brown fox"
Finds exact phrase with terms in sequence.
Proximity Search
    "quick fox"~5
Finds terms within 5 words of each other.
Fuzzy Search
    hello~2
Allows up to 2 character edits (Levenshtein distance).
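For readers unfamiliar with Levenshtein distance, here is a minimal sketch of the standard dynamic-programming computation. The function names `levenshtein` and `fuzzy_matches` are illustrative assumptions; how HeliosDB evaluates fuzzy terms internally is not described here.

```rust
/// Number of single-character insertions, deletions, or substitutions
/// needed to turn `a` into `b`.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the first i-1 chars of `a`
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let substitution = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            let deletion = prev[j] + 1;
            let insertion = curr[j - 1] + 1;
            curr[j] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Hypothetical helper: does `candidate` satisfy a query like `query~max_edits`?
fn fuzzy_matches(query: &str, candidate: &str, max_edits: usize) -> bool {
    levenshtein(query, candidate) <= max_edits
}
```

Under this definition, `hello~2` would accept `hallo` (one substitution) and `help` (one substitution plus one deletion), but not `world`.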
Wildcards
    test*   # Matches test, tests, testing, etc.
    te?t    # Matches test, text, etc.
Supported Languages
- English
- Spanish
- French
- German
- Italian
- Portuguese
- Russian
- Arabic
- Dutch
- Swedish
Architecture
Inverted Index
The inverted index maps terms to posting lists containing:
- Document IDs
- Term frequencies
- Term positions (for phrase queries)
- Compressed bitmaps for fast intersection
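The posting-list fields listed above can be sketched as a small in-memory structure. This is an illustration under simplifying assumptions (whitespace tokenization, no compression or bitmaps); `Posting`, `InvertedIndex`, and `phrase_match` are hypothetical names, not the crate's actual types.

```rust
use std::collections::HashMap;

/// One entry of a posting list: where a term occurs in one document.
#[derive(Debug, Clone, PartialEq)]
struct Posting {
    doc_id: u64,
    term_frequency: u32,
    positions: Vec<u32>, // 0-based token offsets, in ascending order
}

/// Hypothetical in-memory index; compression and bitmaps are left out.
#[derive(Default)]
struct InvertedIndex {
    terms: HashMap<String, Vec<Posting>>,
}

impl InvertedIndex {
    /// Indexes `text` using whitespace splitting and lowercasing only.
    /// Each document is expected to be added once, in a single call.
    fn add_document(&mut self, doc_id: u64, text: &str) {
        for (position, token) in text.split_whitespace().enumerate() {
            let list = self.terms.entry(token.to_lowercase()).or_default();
            // The posting for the current document, if any, is the last entry.
            let seen = list.last().map_or(false, |p| p.doc_id == doc_id);
            if !seen {
                list.push(Posting { doc_id, term_frequency: 0, positions: Vec::new() });
            }
            let posting = list.last_mut().expect("posting exists after push");
            posting.term_frequency += 1;
            posting.positions.push(position as u32);
        }
    }

    /// Returns the posting list for `term`, or an empty slice if it is unknown.
    fn postings(&self, term: &str) -> &[Posting] {
        self.terms.get(term).map(Vec::as_slice).unwrap_or(&[])
    }

    /// Positions of `term` inside `doc_id`, or an empty slice.
    fn positions(&self, term: &str, doc_id: u64) -> &[u32] {
        self.postings(term)
            .iter()
            .find(|p| p.doc_id == doc_id)
            .map(|p| p.positions.as_slice())
            .unwrap_or(&[])
    }
}

/// True if the terms occur consecutively: term i must sit at `start + i`.
/// `positions[i]` holds the sorted positions of the i-th phrase term.
fn phrase_match(positions: &[&[u32]]) -> bool {
    let (first, rest) = match positions.split_first() {
        Some(parts) => parts,
        None => return false,
    };
    first.iter().any(|&start| {
        rest.iter()
            .enumerate()
            .all(|(i, later)| later.binary_search(&(start + i as u32 + 1)).is_ok())
    })
}
```

Storing positions is what makes the phrase check possible: document IDs and term frequencies alone can say that a document contains all the terms, but not that they are adjacent.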
    Term: "fox"
    ├── Posting: doc_id=1, tf=1, positions=[3]
    ├── Posting: doc_id=2, tf=1, positions=[3]
    └── ...
Tokenization Pipeline
- Unicode Normalization: NFC normalization
- Word Segmentation: Unicode word boundaries
- Lowercasing: Optional case normalization
- Stopword Removal: Language-specific stopword lists
- Stemming: Porter stemmer for word root extraction
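The stages above can be chained as a sequence of iterator adapters. The following is a simplified sketch using only the standard library: NFC normalization is omitted, splitting on non-alphanumeric characters stands in for Unicode word boundaries, and the stemmer is passed in by the caller. `tokenize` and `strip_plural` are hypothetical names, and `strip_plural` is a toy placeholder, not the Porter algorithm.

```rust
use std::collections::HashSet;

/// Runs a simplified pipeline: segmentation, lowercasing, stopword removal,
/// then a caller-supplied stemming step.
fn tokenize<F>(text: &str, stopwords: &HashSet<&str>, stem: F) -> Vec<String>
where
    F: Fn(&str) -> String,
{
    text.split(|c: char| !c.is_alphanumeric()) // crude word segmentation
        .filter(|word| !word.is_empty())
        .map(|word| word.to_lowercase())
        .filter(|word| !stopwords.contains(word.as_str()))
        .map(|word| stem(&word))
        .collect()
}

/// Toy stemmer that strips one trailing "s"; a placeholder only.
fn strip_plural(word: &str) -> String {
    word.strip_suffix('s').unwrap_or(word).to_string()
}
```

The order matters in this sketch: stopwords are matched after lowercasing so that "The" and "the" are treated alike, and stemming runs last so the stopword list can contain unstemmed forms.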
Ranking
BM25 Formula:
    score(D,Q) = Σ IDF(qi) × (f(qi,D) × (k1 + 1)) / (f(qi,D) + k1 × (1 - b + b × |D|/avgdl))
Where:
- f(qi,D): Term frequency of query term qi in document D
- |D|: Document length
- avgdl: Average document length
- k1, b: Tuning parameters (default: k1=1.2, b=0.75)
- IDF(qi): Inverse document frequency
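The formula translates almost directly into code. The sketch below is illustrative: the function and struct names are hypothetical, and because the exact IDF variant is not stated here, it assumes the common smoothed form ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number containing the term.

```rust
/// Inverse document frequency (assumed variant, see the note above).
fn idf(total_docs: f64, docs_with_term: f64) -> f64 {
    (1.0 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5)).ln()
}

/// Contribution of a single query term to score(D, Q).
fn bm25_term(tf: f64, doc_len: f64, avgdl: f64, idf_weight: f64, k1: f64, b: f64) -> f64 {
    let length_norm = 1.0 - b + b * doc_len / avgdl;
    idf_weight * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
}

/// Per-term statistics for one document (hypothetical helper type).
struct TermStats {
    tf: f64,             // f(qi, D)
    docs_with_term: f64, // number of documents containing qi
}

/// Sums the per-term contributions over all query terms.
fn bm25_score(
    terms: &[TermStats],
    doc_len: f64,
    avgdl: f64,
    total_docs: f64,
    k1: f64,
    b: f64,
) -> f64 {
    terms
        .iter()
        .map(|t| bm25_term(t.tf, doc_len, avgdl, idf(total_docs, t.docs_with_term), k1, b))
        .sum()
}
```

Two properties are easy to check against the formula: when a document has average length and the term occurs once, the term's contribution equals its IDF; and with b > 0, a longer document scores lower than a shorter one for the same term frequency.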
Highlighting
- Find passages with most query term matches
- Extract context around matches
- Generate HTML/text with highlight markers
- Support multiple snippets per document
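A simple way to picture the first three steps is a sliding window over the document's tokens. This sketch assumes whitespace tokenization, lowercase query terms, and `<b>` tags as highlight markers; `best_snippet` and `is_match` are hypothetical names, and the output is not HTML-escaped.

```rust
/// True if `token`, lowercased, equals one of the (lowercase) query terms.
fn is_match(token: &str, query_terms: &[&str]) -> bool {
    let lowered = token.to_lowercase();
    query_terms.iter().any(|term| lowered == *term)
}

/// Picks the first `window`-token passage with the most query-term matches
/// and wraps each matching token in <b>...</b>.
fn best_snippet(text: &str, query_terms: &[&str], window: usize) -> String {
    let tokens: Vec<&str> = text.split_whitespace().collect();
    if tokens.is_empty() || window == 0 {
        return String::new();
    }
    let window = window.min(tokens.len());

    // Score every window start by the number of matching tokens it covers.
    let mut best_start = 0;
    let mut best_count = 0;
    for start in 0..=tokens.len() - window {
        let count = tokens[start..start + window]
            .iter()
            .filter(|&&t| is_match(t, query_terms))
            .count();
        if count > best_count {
            best_start = start;
            best_count = count;
        }
    }

    // Render the chosen passage with highlight markers around matches.
    tokens[best_start..best_start + window]
        .iter()
        .map(|&t| {
            if is_match(t, query_terms) {
                format!("<b>{}</b>", t)
            } else {
                t.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Supporting multiple snippets per document would mean keeping the top few non-overlapping windows instead of only the best one.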
Configuration
    use heliosdb_fulltext::{FullTextIndex, HighlightConfig, IndexConfig, Language, RankingConfig};

    let config = IndexConfig {
        language: Language::English,
        ranking: RankingConfig {
            bm25_k1: 1.2,
            bm25_b: 0.75,
            phrase_boost: 2.0,
            proximity_boost: 1.5,
            ..Default::default()
        },
        highlight: HighlightConfig {
            max_snippet_length: 300,
            context_before: 5,
            context_after: 5,
            ..Default::default()
        },
        enable_fuzzy: true,
        fuzzy_threshold: 0.8,
        ..Default::default()
    };

    let index = FullTextIndex::with_config(config)?;
Performance
- Indexing: ~100K docs/sec (varies by document size)
- Search: Sub-millisecond for most queries on a million-document corpus
- Memory: ~10-20 bytes per token indexed
- Compression: 5-10x with Roaring bitmaps
Integration with HeliosDB
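    // Example integration with the HeliosDB storage layer
    use std::collections::HashMap;

    // `Result` is assumed to be `anyhow::Result`, matching the Quick Start example.
    use anyhow::Result;
    use heliosdb_fulltext::FullTextIndex;
    use heliosdb_storage::RowId;

    struct FullTextColumn {
        index: FullTextIndex,
        row_to_doc_id: HashMap<RowId, u64>,
    }

    impl FullTextColumn {
        async fn index_row(&mut self, row_id: RowId, text: &str) -> Result<()> {
            // Assign the next sequential document ID.
            // This assumes each row is indexed only once.
            let doc_id = self.row_to_doc_id.len() as u64;
            self.row_to_doc_id.insert(row_id, doc_id);
            self.index.add_document(doc_id, text).await
        }

        async fn search_rows(&self, query: &str, limit: usize) -> Result<Vec<RowId>> {
            let results = self.index.search(query, limit).await?;

            // Invert the mapping so search hits can be translated back to rows.
            // Copying the keys and values assumes `RowId` implements `Copy`.
            let doc_to_row: HashMap<u64, RowId> = self.row_to_doc_id
                .iter()
                .map(|(row_id, doc_id)| (*doc_id, *row_id))
                .collect();

            Ok(results
                .iter()
                .filter_map(|r| doc_to_row.get(&r.doc_id).copied())
                .collect())
        }
    }
Testing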
    # Run unit tests
    cargo test

    # Run with logging
    RUST_LOG=debug cargo test

    # Run benchmarks
    cargo bench
Future Enhancements
- CJK (Chinese, Japanese, Korean) tokenization
- Geo-spatial search integration
- Faceted search support
- Spell correction and suggestions
- Learning to rank (LTR) integration
- Distributed indexing
- Real-time index updates with segments
- Query result caching
License
Apache-2.0