FSST Compression Quick Reference
TL;DR
FSST (Fast Static Symbol Table) provides 2-5x string compression at GB/sec speeds with automatic dictionary management.
Quick Start
1. Train Dictionary for Table Column
use heliosdb_nano::storage::StorageEngine;
// Open engine
let engine = StorageEngine::open("./data", &config)?;
// Train dictionary (samples up to 10K rows automatically)
let dict_key = engine.train_fsst_for_table("users", "email")?;
// Returns: "users:email"
// New inserts automatically use the trained dictionary
2. Load Dictionary on Startup
// Load dictionaries for frequently accessed columns
engine.load_fsst_dictionary("users", "email")?;
engine.load_fsst_dictionary("logs", "message")?;
// Compression now uses loaded dictionaries
3. Monitor Cache Performance
let stats = engine.fsst_cache_stats();
println!("Hit rate: {:.2}%", stats.hit_rate() * 100.0);
println!("Dictionaries: {}", stats.dictionary_count);
println!("Cache size: {} MB", stats.size_bytes / 1024 / 1024);
API Reference
StorageEngine Methods
train_fsst_for_table(table: &str, column: &str) -> Result<String>
Train FSST dictionary for a string column.
- Samples up to 10,000 rows from the table
- Stores dictionary in persistent storage
- Caches dictionary for fast access
- Returns dictionary key (e.g., “users:email”)
load_fsst_dictionary(table: &str, column: &str) -> Result<bool>
Load dictionary from persistent storage.
- Returns true if the dictionary was found and loaded
- Returns false if the dictionary doesn’t exist
- Caches the dictionary in memory for fast access
fsst_cache_stats() -> DictionaryCacheStats
Get cache statistics.
- hits: Number of cache hits
- misses: Number of cache misses
- evictions: Number of LRU evictions
- size_bytes: Current cache size
- dictionary_count: Number of cached dictionaries
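The hit and miss counters combine into the hit rate that the Quick Start prints. A minimal sketch of that arithmetic follows. `CacheCounters` is a hypothetical stand-in written for illustration, not the crate's `DictionaryCacheStats`, and the zero-lookup behavior is an assumption.

```rust
/// Hypothetical stand-in for the hit/miss counters; not the crate's own type.
#[derive(Debug, Default, Clone, Copy)]
pub struct CacheCounters {
    pub hits: u64,
    pub misses: u64,
}

impl CacheCounters {
    /// Fraction of lookups served from the cache, in 0.0..=1.0.
    /// Returns 0.0 before any lookup has happened, avoiding a 0/0 division.
    pub fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 {
            0.0
        } else {
            self.hits as f64 / total as f64
        }
    }

    /// Fraction of lookups that missed the cache (0.0 before any lookup).
    pub fn miss_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 {
            0.0
        } else {
            self.misses as f64 / total as f64
        }
    }
}
```

With 3 hits and 1 miss, `hit_rate()` is 0.75 and `miss_rate()` is 0.25.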
list_fsst_dictionaries() -> Vec<String>
List all stored dictionary keys.
- Returns keys like “table_name:column_name”
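Because keys follow the table_name:column_name shape, they can be built and split with plain string handling. The helpers below are hypothetical (they are not part of the crate) and assume that neither table nor column names contain a colon.

```rust
/// Builds a key in the "table:column" shape.
/// Hypothetical helper for illustration; not part of the crate's API.
pub fn dictionary_key(table: &str, column: &str) -> String {
    format!("{}:{}", table, column)
}

/// Splits a key at the first ':' into (table, column).
/// Returns None when the separator is missing or either side is empty.
pub fn split_dictionary_key(key: &str) -> Option<(&str, &str)> {
    let (table, column) = key.split_once(':')?;
    if table.is_empty() || column.is_empty() {
        None
    } else {
        Some((table, column))
    }
}
```

This can be handy for grouping the output of list_fsst_dictionaries() by table.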
clear_fsst_cache()
Clear in-memory cache (keeps persistent storage).
- Useful for memory management
- Dictionaries can be reloaded on demand
Configuration
Dictionary Cache Config
use heliosdb_nano::storage::compression::fsst::DictionaryCacheConfig;
// EvictionPolicy must also be in scope; its import path is not shown here.
let dict_config = DictionaryCacheConfig {
    max_size_mb: 100,        // Max cache size (default: 100 MB)
    max_dictionaries: 1000,  // Max cached dictionaries (default: 1000)
    eviction_policy: EvictionPolicy::LRU,
};
Compression Config
use std::collections::HashMap;
use heliosdb_nano::storage::compression::CompressionConfig;
let config = CompressionConfig {
    enabled: true,               // Enable compression (default: true)
    min_data_size: 1024,         // Min size to compress (default: 1 KB)
    min_compression_ratio: 1.2,  // Min ratio to keep compressed (default: 1.2)
    column_overrides: HashMap::new(),
};
Best Practices
When to Train Dictionaries
✅ Good Cases:
- Columns with repeated patterns (emails, URLs, JSON)
- High-cardinality string columns
- After bulk inserts (>1000 rows)
- During off-peak hours for large tables
❌ Bad Cases:
- Tables with <100 rows
- Completely random strings (UUIDs without patterns)
- Columns whose values share no common substrings (no repetition)
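To see why repeated patterns matter, here is a toy static symbol table in the spirit of the published FSST design: symbols of 1 to 8 bytes are replaced by one-byte codes, and code 255 escapes a literal byte. This is a sketch for intuition only. The symbols are hand-picked rather than trained, `SymbolTable` is a hypothetical type, and whether heliosdb_nano uses the same layout internally is not stated here.

```rust
/// Escape code: the next output byte is a literal input byte.
const ESCAPE: u8 = 255;

/// Toy static symbol table; each symbol is addressed by its index.
/// Hypothetical type for illustration, not the crate's dictionary.
pub struct SymbolTable {
    symbols: Vec<Vec<u8>>,
}

impl SymbolTable {
    /// Builds a table from hand-picked symbols. Real training derives
    /// the symbols from sampled data; that step is not shown.
    pub fn new(symbols: &[&str]) -> Self {
        assert!(symbols.len() <= 255, "code 255 is reserved for the escape");
        let symbols: Vec<Vec<u8>> =
            symbols.iter().map(|s| s.as_bytes().to_vec()).collect();
        assert!(symbols.iter().all(|s| (1..=8).contains(&s.len())));
        Self { symbols }
    }

    /// Greedy encoding: at each position emit the code of the longest
    /// matching symbol, or ESCAPE followed by the literal byte.
    pub fn encode(&self, input: &[u8]) -> Vec<u8> {
        let mut out = Vec::with_capacity(input.len());
        let mut pos = 0;
        while pos < input.len() {
            let rest = &input[pos..];
            let best = self
                .symbols
                .iter()
                .enumerate()
                .filter(|(_, sym)| rest.starts_with(sym.as_slice()))
                .max_by_key(|(_, sym)| sym.len());
            match best {
                Some((code, sym)) => {
                    out.push(code as u8);
                    pos += sym.len();
                }
                None => {
                    out.push(ESCAPE);
                    out.push(rest[0]);
                    pos += 1;
                }
            }
        }
        out
    }

    /// Reverses `encode`. Returns None for a dangling escape or an unknown code.
    pub fn decode(&self, encoded: &[u8]) -> Option<Vec<u8>> {
        let mut out = Vec::new();
        let mut bytes = encoded.iter().copied();
        while let Some(code) = bytes.next() {
            if code == ESCAPE {
                out.push(bytes.next()?);
            } else {
                out.extend_from_slice(self.symbols.get(code as usize)?);
            }
        }
        Some(out)
    }
}
```

With the symbols "@example", ".com", and "user", the 16-byte string "user@example.com" encodes to 3 bytes. A string that matches no symbol, such as "x9Qz", doubles in size because every byte needs an escape, which is why pattern-free data is a poor fit.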
Training Workflow
// 1. Create table
catalog.create_table("users", schema)?;
// 2. Bulk insert data
for user in users.iter() {
    engine.insert_tuple("users", user.to_tuple())?;
}
// 3. Train dictionary after bulk load
engine.train_fsst_for_table("users", "email")?;
engine.train_fsst_for_table("users", "address")?;
// 4. Subsequent inserts use trained dictionaries automatically
Memory Management
// Check cache size
let stats = engine.fsst_cache_stats();
if stats.size_bytes > 90 * 1024 * 1024 {
    // Approaching 100 MB limit
    // Clear cache if needed (dictionaries persist)
    engine.clear_fsst_cache();
}
// Reload critical dictionaries
engine.load_fsst_dictionary("users", "email")?;
Performance Tips
Optimal Sample Size
- Default: 10,000 rows or 16 KB (whichever is smaller)
- More rows = better dictionary quality
- Diminishing returns after 10K rows
- Training time: <1 second for typical datasets
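One way to read the "10,000 rows or 16 KB, whichever is smaller" default is as two caps applied together: stop sampling when either the row limit or the byte budget is reached. The sketch below takes rows from the front of a slice under both caps. The function and constant names are hypothetical, and the engine may well sample differently (for example, randomly across the table).

```rust
/// Default caps as described in the list above.
pub const MAX_SAMPLE_ROWS: usize = 10_000;
pub const MAX_SAMPLE_BYTES: usize = 16 * 1024;

/// Collects leading rows until either cap would be exceeded.
/// Hypothetical helper; not the engine's actual sampling routine.
pub fn take_sample<'a>(rows: &[&'a str], max_rows: usize, max_bytes: usize) -> Vec<&'a str> {
    let mut sample = Vec::new();
    let mut bytes = 0usize;
    for &row in rows.iter().take(max_rows) {
        // Stop before the byte budget is exceeded.
        if bytes + row.len() > max_bytes {
            break;
        }
        bytes += row.len();
        sample.push(row);
    }
    sample
}
```

For short values such as email addresses, the byte budget is usually hit long before the row cap under this reading.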
Cache Hit Optimization
// Pre-load dictionaries for hot columns
let hot_columns = vec![
    ("users", "email"),
    ("logs", "message"),
    ("products", "description"),
];
for (table, column) in hot_columns {
    engine.load_fsst_dictionary(table, column)?;
}
Compression Ratio Validation
// Check if compression is beneficial
let compression_mgr = engine.compression_manager();
if let Some(stats) = compression_mgr.get_stats("users") {
    if stats.overall_ratio < 1.2 {
        println!("Warning: Low compression ratio");
        // Consider disabling compression for this table
    }
}
Typical Compression Ratios
| Data Type | Compression Ratio |
|---|---|
| Email addresses | 2-3x |
| URLs | 3-4x |
| JSON logs | 4-5x |
| UUIDs | 2-3x |
| Repeated text | 3-5x |
| Random strings | 1.0-1.5x |
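Assuming the ratios in the table mean original size divided by compressed size, the sketch below shows that arithmetic together with a keep-or-discard check modeled on the min_data_size and min_compression_ratio settings. Both functions are hypothetical, and the exact comparison rules the engine applies are an assumption.

```rust
/// Original size divided by compressed size; None if the compressed size is zero.
pub fn compression_ratio(original_len: usize, compressed_len: usize) -> Option<f64> {
    if compressed_len == 0 {
        None
    } else {
        Some(original_len as f64 / compressed_len as f64)
    }
}

/// Decides whether a compressed value is worth keeping.
/// Hypothetical check: inputs below `min_data_size` are never kept compressed,
/// and the ratio must reach `min_ratio`.
pub fn keep_compressed(
    original_len: usize,
    compressed_len: usize,
    min_data_size: usize,
    min_ratio: f64,
) -> bool {
    if original_len < min_data_size {
        return false;
    }
    match compression_ratio(original_len, compressed_len) {
        Some(ratio) => ratio >= min_ratio,
        None => false,
    }
}
```

With the defaults (1024 bytes, 1.2), a 3000-byte value compressed to 1000 bytes has a ratio of 3.0 and is kept, while one compressed to only 2900 bytes (about 1.03x) is not.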
Troubleshooting
Dictionary Not Working
// Check if dictionary exists
let dicts = engine.list_fsst_dictionaries();
if !dicts.contains(&"users:email".to_string()) {
    // Train dictionary
    engine.train_fsst_for_table("users", "email")?;
}
Low Compression Ratio
// Check compression stats
let stats = engine.compression_manager().get_stats("users");
if let Some(s) = stats {
    println!("Compression ratio: {:.2}x", s.overall_ratio);
    if s.overall_ratio < 1.5 {
        // Data might not have patterns suitable for FSST
        // Consider disabling compression for this table
    }
}
High Cache Misses
let stats = engine.fsst_cache_stats();
let miss_rate = stats.misses as f64 / (stats.hits + stats.misses) as f64;
if miss_rate > 0.5 {
    // High miss rate - consider:
    // 1. Increasing cache size
    // 2. Pre-loading critical dictionaries
    // 3. Reducing number of string columns
}
Advanced Features
Multiple Columns Per Table
// Train dictionaries for all string columns
let string_columns = vec!["email", "address", "phone"];
for column in string_columns {
    engine.train_fsst_for_table("users", column)?;
}
Dictionary Versioning (Future)
// Not yet implemented - planned for future release
// Will support:
// - Multiple dictionary versions per column
// - Automatic retraining on data distribution changes
// - Dictionary migration
Testing
Basic Test
#[test]
fn test_fsst_compression() -> Result<(), Box<dyn std::error::Error>> {
    // Returning a Result lets the test use `?`; this assumes the crate's
    // error type implements std::error::Error.
    let config = Config::in_memory();
    let engine = StorageEngine::open_in_memory(&config)?;
    // Create table and insert data
    // ... (see tests/fsst_integration_tests.rs)
    // Train dictionary
    let dict_key = engine.train_fsst_for_table("users", "email")?;
    assert_eq!(dict_key, "users:email");
    // Verify compression
    let results = engine.scan_table("users")?;
    assert_eq!(results.len(), 100);
    Ok(())
}
Run All Tests
cargo test --test fsst_integration_tests