HeliosDB Full-Text Search

High-performance full-text search engine for HeliosDB v3.0 with advanced query capabilities and multi-language support.

Features

Inverted Index: Efficient term-to-document mapping with positional information
BM25 Ranking: Industry-standard probabilistic ranking algorithm
Multi-language Support: Tokenization and stemming for 10 languages (see Supported Languages)
Advanced Query Syntax: Boolean operators, phrase search, proximity queries, fuzzy matching
Highlighting: Context-aware snippet generation with term highlighting
Compression: Roaring bitmaps and efficient storage

Quick Start

use heliosdb_fulltext::{FullTextIndex, Language};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Create index
    let mut index = FullTextIndex::new(Language::English)?;

    // Add documents
    index.add_document(1, "The quick brown fox jumps over the lazy dog").await?;
    index.add_document(2, "A fast brown fox leaps across the sleeping hound").await?;
    index.add_document(3, "The lazy dog sleeps in the sun").await?;

    // Search
    let results = index.search("quick fox", 10).await?;

    for result in results {
        println!("Doc {}: score={:.4}", result.doc_id, result.score);
        println!("  {}", result.snippet.text);
    }

    Ok(())
}

Query Syntax

Simple Terms

hello world

Terms are implicitly ANDed together, so this query is equivalent to hello AND world.

Boolean Operators

hello AND world
hello OR world
NOT spam
(quick OR fast) AND fox
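
One way to read these operators is as set operations on the document IDs in each term's posting list: AND is an intersection, OR is a union, and NOT is a complement against all indexed documents. The sketch below illustrates that reading only. The Query enum and eval function are hypothetical names and are not part of the heliosdb_fulltext API.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical query tree; not the crate's actual AST.
enum Query {
    Term(String),
    And(Box<Query>, Box<Query>),
    Or(Box<Query>, Box<Query>),
    Not(Box<Query>),
}

/// Evaluates a query against a term -> doc-ID-set map.
/// `all_docs` is needed so NOT can be computed as a complement.
fn eval(
    q: &Query,
    postings: &HashMap<String, BTreeSet<u64>>,
    all_docs: &BTreeSet<u64>,
) -> BTreeSet<u64> {
    match q {
        // Unknown terms match no documents.
        Query::Term(t) => postings.get(t).cloned().unwrap_or_default(),
        Query::And(a, b) => {
            let (l, r) = (eval(a, postings, all_docs), eval(b, postings, all_docs));
            l.intersection(&r).copied().collect()
        }
        Query::Or(a, b) => {
            let (l, r) = (eval(a, postings, all_docs), eval(b, postings, all_docs));
            l.union(&r).copied().collect()
        }
        Query::Not(a) => {
            let inner = eval(a, postings, all_docs);
            all_docs.difference(&inner).copied().collect()
        }
    }
}
```

With the Quick Start documents, (quick OR fast) AND fox would resolve to documents 1 and 2 under this model.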

Phrase Search

"quick brown fox"

Matches the exact phrase: all terms must appear consecutively and in the given order.
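
Phrase matching relies on the term positions stored in the index. A minimal sketch, assuming each phrase term's positions within one document are available as a sorted list (the function name and data layout are hypothetical):

```rust
/// Returns true if the terms occur consecutively and in order.
/// `positions[i]` holds the sorted positions of the i-th phrase term
/// within a single document (hypothetical layout).
fn phrase_match(positions: &[Vec<u32>]) -> bool {
    let Some((first, rest)) = positions.split_first() else {
        return false; // an empty phrase matches nothing
    };
    // Try every occurrence of the first term as the phrase start and
    // require the (i+1)-th following term at exactly start + i + 1.
    first.iter().any(|&start| {
        rest.iter()
            .enumerate()
            .all(|(i, list)| list.binary_search(&(start + i as u32 + 1)).is_ok())
    })
}
```

For document 1 of the Quick Start, "quick brown fox" has positions [1], [2], [3] and matches, while "quick fox" has [1], [3] and does not.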

Proximity Search

"quick fox"~5

Matches documents in which the terms occur within 5 words of each other.
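
For two terms, a proximity check can be sketched as a two-pointer walk over their sorted position lists. This assumes "within N words" means the position difference is at most N, in either order; the engine's exact definition may differ, and within_proximity is a hypothetical helper.

```rust
/// True if some occurrence of term A lies within `max_gap` positions of
/// some occurrence of term B. Both slices must be sorted ascending.
fn within_proximity(a: &[u32], b: &[u32], max_gap: u32) -> bool {
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i].abs_diff(b[j]) <= max_gap {
            return true;
        }
        // Advance the pointer at the smaller position; advancing the
        // other one could only widen the gap.
        if a[i] < b[j] {
            i += 1;
        } else {
            j += 1;
        }
    }
    false
}
```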

Fuzzy Search

hello~2

Matches terms within 2 character edits (insertions, deletions, or substitutions) of the query term, measured by Levenshtein distance.
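
For reference, Levenshtein distance can be computed with the classic dynamic-programming recurrence. The sketch below is a plain implementation for illustration; how the engine itself evaluates fuzzy queries is not specified here, and both function names are hypothetical.

```rust
/// Two-row dynamic-programming Levenshtein distance over chars.
fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// True if `candidate` is within `max_edits` edits of `term`.
fn fuzzy_match(term: &str, candidate: &str, max_edits: usize) -> bool {
    levenshtein(term, candidate) <= max_edits
}
```

Under this definition, hello~2 would accept "hallo" (1 edit) and "help" (2 edits) but not "world" (4 edits).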

Wildcards

test*        # Matches test, tests, testing, etc.
te?t         # Matches test, text, etc.
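
The comments above imply that * matches any run of characters (including none) and ? matches exactly one character. A minimal matcher with those semantics is sketched below, assuming patterns are tested against individual terms; wildcard_match is a hypothetical helper, and a real engine would more likely walk the term dictionary than test terms one by one.

```rust
/// Glob-style match: `*` matches any run of characters (including none),
/// `?` matches exactly one character.
fn wildcard_match(pattern: &str, term: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = term.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Index of the last `*` seen and the term index it was tried at.
    let mut backtrack: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            backtrack = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && (p[pi] == '?' || p[pi] == t[ti]) {
            pi += 1;
            ti += 1;
        } else if let Some((star_pi, star_ti)) = backtrack {
            // Let the last `*` absorb one more character and retry.
            pi = star_pi + 1;
            ti = star_ti + 1;
            backtrack = Some((star_pi, star_ti + 1));
        } else {
            return false;
        }
    }
    // Only trailing `*`s may remain in the pattern.
    p[pi..].iter().all(|&c| c == '*')
}
```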

Supported Languages

English
Spanish
French
German
Italian
Portuguese
Russian
Arabic
Dutch
Swedish

Architecture

Inverted Index

The inverted index maps terms to posting lists containing:

Document IDs
Term frequencies
Term positions (for phrase queries)
Compressed bitmaps for fast intersection

Term: "fox"
├── Posting: doc_id=1, tf=1, positions=[3]
├── Posting: doc_id=2, tf=1, positions=[3]
└── ...
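
The structure in the diagram can be sketched with standard collections. This is an illustration only: the type and field names are hypothetical, tokenization is reduced to lowercased whitespace splitting, and the compressed bitmaps are left out.

```rust
use std::collections::{BTreeMap, HashMap};

/// One posting per (term, document) pair, mirroring the diagram above.
#[derive(Debug, PartialEq)]
struct Posting {
    doc_id: u64,
    positions: Vec<u32>,
}

impl Posting {
    /// Term frequency is the number of recorded positions.
    fn tf(&self) -> usize {
        self.positions.len()
    }
}

#[derive(Default)]
struct InvertedIndex {
    postings: HashMap<String, Vec<Posting>>,
}

impl InvertedIndex {
    /// Splits on whitespace and lowercases; real tokenization does more.
    fn add_document(&mut self, doc_id: u64, text: &str) {
        // Collect this document's positions per term first.
        let mut by_term: BTreeMap<String, Vec<u32>> = BTreeMap::new();
        for (pos, word) in text.split_whitespace().enumerate() {
            by_term.entry(word.to_lowercase()).or_default().push(pos as u32);
        }
        // Then append one posting per term to the global lists.
        for (term, positions) in by_term {
            self.postings
                .entry(term)
                .or_default()
                .push(Posting { doc_id, positions });
        }
    }
}
```

Indexing the first two Quick Start documents this way yields the "fox" postings shown in the diagram: position 3 in both documents.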

Tokenization Pipeline

Unicode Normalization: NFC normalization
Word Segmentation: Unicode word boundaries
Lowercasing: Optional case normalization
Stopword Removal: Language-specific stopword lists
Stemming: Porter stemmer for word root extraction
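
The order of the stages can be sketched with the standard library alone. The sketch makes several simplifications, all of them assumptions: NFC normalization is skipped (it needs a Unicode library), word segmentation is a split on non-alphanumeric characters rather than true Unicode word boundaries, and toy_stem is a crude suffix stripper standing in for the Porter stemmer, not an implementation of it.

```rust
use std::collections::HashSet;

/// Toy stand-in for a real stemmer: strips a few English suffixes.
/// This is NOT the Porter algorithm.
fn toy_stem(word: &str) -> String {
    for suffix in ["ing", "es", "s"] {
        if let Some(stem) = word.strip_suffix(suffix) {
            // Keep very short words intact.
            if stem.chars().count() >= 3 {
                return stem.to_string();
            }
        }
    }
    word.to_string()
}

/// Runs segmentation, lowercasing, stopword removal and stemming in order.
fn tokenize(text: &str, stopwords: &HashSet<&str>) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric()) // crude word segmentation
        .filter(|w| !w.is_empty())
        .map(str::to_lowercase) // case normalization
        .filter(|w| !stopwords.contains(w.as_str())) // stopword removal
        .map(|w| toy_stem(&w)) // stemming (toy version)
        .collect()
}
```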

Ranking

BM25 Formula:

score(D,Q) = Σ IDF(qi) × (f(qi,D) × (k1 + 1)) / (f(qi,D) + k1 × (1 - b + b × |D|/avgdl))

Where:

f(qi,D): Term frequency of query term qi in document D
|D|: Document length
avgdl: Average document length
k1, b: Tuning parameters (default: k1=1.2, b=0.75)
IDF(qi): Inverse document frequency
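
The formula translates directly into code. The IDF definition is not given above, so the sketch assumes a commonly used smoothed variant, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number containing the term. All function names are hypothetical.

```rust
/// Smoothed IDF; this particular variant is an assumption.
fn idf(total_docs: f64, docs_with_term: f64) -> f64 {
    (1.0 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5)).ln()
}

/// Contribution of a single query term to the BM25 score of one document.
fn bm25_term(tf: f64, doc_len: f64, avgdl: f64, idf_value: f64, k1: f64, b: f64) -> f64 {
    // Length normalization: 1.0 when the document has average length.
    let norm = 1.0 - b + b * doc_len / avgdl;
    idf_value * (tf * (k1 + 1.0)) / (tf + k1 * norm)
}

/// Sums per-term contributions using the default k1 = 1.2, b = 0.75.
/// Each tuple is (term frequency in the document, documents containing the term).
fn bm25(terms: &[(f64, f64)], doc_len: f64, avgdl: f64, total_docs: f64) -> f64 {
    let (k1, b) = (1.2, 0.75);
    terms
        .iter()
        .map(|&(tf, df)| bm25_term(tf, doc_len, avgdl, idf(total_docs, df), k1, b))
        .sum()
}
```

Two properties are easy to check from the formula: a term that occurs once in an average-length document contributes exactly its IDF, and for equal term frequency a shorter document scores higher than a longer one.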

Highlighting

Find passages with most query term matches
Extract context around matches
Generate HTML/text with highlight markers
Support multiple snippets per document
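
The first three steps can be sketched as a sliding window over the document's tokens. This is a simplified illustration, not the engine's highlighter: best_snippet is a hypothetical function, matching is exact and case-insensitive on whitespace-separated tokens (so punctuation attached to a word prevents a match), and only one snippet is produced.

```rust
/// Picks the `window`-token span containing the most query-term hits and
/// wraps each hit in <b> tags.
fn best_snippet(text: &str, query_terms: &[&str], window: usize) -> String {
    let tokens: Vec<&str> = text.split_whitespace().collect();
    let is_hit = |tok: &str| query_terms.iter().any(|q| q.eq_ignore_ascii_case(tok));
    let window = window.clamp(1, tokens.len().max(1));
    // Score every window start by its number of hits; keep the earliest best.
    let mut best = (0, 0); // (hits, start)
    for start in 0..=tokens.len().saturating_sub(window) {
        let end = (start + window).min(tokens.len());
        let hits = tokens[start..end].iter().filter(|&&t| is_hit(t)).count();
        if hits > best.0 {
            best = (hits, start);
        }
    }
    let end = (best.1 + window).min(tokens.len());
    tokens[best.1..end]
        .iter()
        .map(|&t| if is_hit(t) { format!("<b>{t}</b>") } else { t.to_string() })
        .collect::<Vec<_>>()
        .join(" ")
}
```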

Configuration

use heliosdb_fulltext::{FullTextIndex, HighlightConfig, IndexConfig, Language, RankingConfig};

let config = IndexConfig {
    language: Language::English,
    ranking: RankingConfig {
        bm25_k1: 1.2,
        bm25_b: 0.75,
        phrase_boost: 2.0,
        proximity_boost: 1.5,
        ..Default::default()
    },
    highlight: HighlightConfig {
        max_snippet_length: 300,
        context_before: 5,
        context_after: 5,
        ..Default::default()
    },
    enable_fuzzy: true,
    fuzzy_threshold: 0.8,
    ..Default::default()
};

let index = FullTextIndex::with_config(config)?;

Performance

Indexing: ~100K docs/sec (varies by document size)
Search: Sub-millisecond for most queries on a million-document corpus
Memory: ~10-20 bytes per indexed token
Compression: 5-10x with Roaring bitmaps

Integration with HeliosDB

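// Example integration with the HeliosDB storage layer
use std::collections::HashMap;

// Assumption: anyhow::Result, matching the Quick Start example
use anyhow::Result;
use heliosdb_fulltext::FullTextIndex;
use heliosdb_storage::RowId;

// Assumes RowId is Copy + Eq + Hash
struct FullTextColumn {
    index: FullTextIndex,
    row_to_doc_id: HashMap<RowId, u64>,
    // Reverse map, kept in sync so searches do not have to rebuild it
    doc_id_to_row: HashMap<u64, RowId>,
    // Monotonic counter: deriving IDs from the map length would hand out
    // a duplicate ID after a row is indexed a second time
    next_doc_id: u64,
}

impl FullTextColumn {
    async fn index_row(&mut self, row_id: RowId, text: &str) -> Result<()> {
        let doc_id = self.next_doc_id;
        self.next_doc_id += 1;
        self.row_to_doc_id.insert(row_id, doc_id);
        self.doc_id_to_row.insert(doc_id, row_id);
        self.index.add_document(doc_id, text).await?;
        Ok(())
    }

    async fn search_rows(&self, query: &str, limit: usize) -> Result<Vec<RowId>> {
        let results = self.index.search(query, limit).await?;

        // Translate matching document IDs back to row IDs, preserving rank order
        Ok(results
            .iter()
            .filter_map(|r| self.doc_id_to_row.get(&r.doc_id).copied())
            .collect())
    }
}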

Testing

# Run unit tests
cargo test

# Run with logging
RUST_LOG=debug cargo test

# Run benchmarks
cargo bench

Future Enhancements

CJK (Chinese, Japanese, Korean) tokenization
Geo-spatial search integration
Faceted search support
Spell correction and suggestions
Learning to rank (LTR) integration
Distributed indexing
Real-time index updates with segments
Query result caching

License

Apache-2.0