FSST Compression Quick Reference

TL;DR

FSST (Fast Static Symbol Table) compresses string columns at GB/sec speeds, typically by 2-5x on data with repeated patterns, and manages its dictionaries automatically.
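To make the idea concrete, here is a toy sketch of symbol-table encoding in the FSST style: frequent substrings of 1 to 8 bytes are replaced by 1-byte codes, and an escape code marks bytes that no symbol covers. The `SymbolTable` type and its hand-written symbols are assumptions for illustration only. This is not the heliosdb_nano implementation, which trains the table from sampled rows and matches symbols far more efficiently.

```rust
/// Escape code: the byte that follows it in the stream is a literal.
const ESCAPE: u8 = 255;

/// Toy static symbol table (hypothetical type). A symbol's index is its 1-byte code.
pub struct SymbolTable {
    symbols: Vec<Vec<u8>>, // at most 255 entries, each 1 to 8 bytes long
}

impl SymbolTable {
    pub fn new(symbols: &[&str]) -> Self {
        assert!(symbols.len() <= 255, "codes 0..=254 only; 255 is the escape");
        for s in symbols {
            assert!((1..=8).contains(&s.len()), "symbols are 1 to 8 bytes long");
        }
        Self {
            symbols: symbols.iter().map(|s| s.as_bytes().to_vec()).collect(),
        }
    }

    /// Greedy encoding: at each position emit the code of the longest
    /// matching symbol, or ESCAPE followed by the literal byte.
    pub fn encode(&self, input: &[u8]) -> Vec<u8> {
        let mut out = Vec::with_capacity(input.len());
        let mut pos = 0;
        while pos < input.len() {
            let rest = &input[pos..];
            let best = self
                .symbols
                .iter()
                .enumerate()
                .filter(|(_, sym)| rest.starts_with(sym.as_slice()))
                .max_by_key(|(_, sym)| sym.len());
            match best {
                Some((code, sym)) => {
                    out.push(code as u8);
                    pos += sym.len();
                }
                None => {
                    out.push(ESCAPE);
                    out.push(input[pos]);
                    pos += 1;
                }
            }
        }
        out
    }

    /// Decoding is a table lookup per code. Panics on a code that is not
    /// in the table, which is acceptable for a sketch.
    pub fn decode(&self, encoded: &[u8]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut iter = encoded.iter();
        while let Some(&code) = iter.next() {
            if code == ESCAPE {
                if let Some(&literal) = iter.next() {
                    out.push(literal);
                }
            } else {
                out.extend_from_slice(&self.symbols[code as usize]);
            }
        }
        out
    }
}
```

With the symbols "@example", ".com" and "user", the 17-byte string "user1@example.com" encodes to 5 bytes: three symbol codes plus one escaped literal. Strings that share few substrings with the table fall back to escapes and can grow, which is why random data compresses poorly.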

Quick Start

1. Train Dictionary for Table Column

use heliosdb_nano::storage::StorageEngine;

// Open engine (`config` is your existing engine configuration)
let engine = StorageEngine::open("./data", &config)?;

// Train dictionary (samples up to 10K rows automatically)
let dict_key = engine.train_fsst_for_table("users", "email")?;
// Returns: "users:email"

// New inserts automatically use the trained dictionary

2. Load Dictionary on Startup

// Load dictionaries for frequently accessed columns
engine.load_fsst_dictionary("users", "email")?;
engine.load_fsst_dictionary("logs", "message")?;
// Compression now uses loaded dictionaries

3. Monitor Cache Performance

let stats = engine.fsst_cache_stats();
println!("Hit rate: {:.2}%", stats.hit_rate() * 100.0);
println!("Dictionaries: {}", stats.dictionary_count);
println!("Cache size: {} MB", stats.size_bytes / 1024 / 1024);

API Reference

StorageEngine Methods

train_fsst_for_table(table: &str, column: &str) -> Result<String>

Train FSST dictionary for a string column.

  • Samples up to 10,000 rows from the table
  • Stores dictionary in persistent storage
  • Caches dictionary for fast access
  • Returns dictionary key (e.g., “users:email”)

load_fsst_dictionary(table: &str, column: &str) -> Result<bool>

Load dictionary from persistent storage.

  • Returns true if dictionary was found and loaded
  • Returns false if dictionary doesn’t exist
  • Caches dictionary in memory for fast access

fsst_cache_stats() -> DictionaryCacheStats

Get cache statistics.

  • hits: Number of cache hits
  • misses: Number of cache misses
  • evictions: Number of LRU evictions
  • size_bytes: Current cache size
  • dictionary_count: Number of cached dictionaries
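The counters above are enough to derive hit and miss rates. The sketch below mirrors the listed fields in a stand-in struct (`CacheStatsSketch` is a hypothetical name, not the real `DictionaryCacheStats`) and shows one way to compute the rates while guarding against a cache that has not served any lookups yet.

```rust
/// Stand-in for the statistics struct; field names follow the list above,
/// field types are assumptions.
#[derive(Debug, Default, Clone, Copy)]
pub struct CacheStatsSketch {
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
    pub size_bytes: u64,
    pub dictionary_count: usize,
}

impl CacheStatsSketch {
    /// Fraction of lookups served from the cache, in 0.0..=1.0.
    /// Returns 0.0 when there have been no lookups, avoiding a NaN.
    pub fn hit_rate(&self) -> f64 {
        let lookups = self.hits + self.misses;
        if lookups == 0 {
            0.0
        } else {
            self.hits as f64 / lookups as f64
        }
    }

    /// Fraction of lookups that missed the cache.
    pub fn miss_rate(&self) -> f64 {
        let lookups = self.hits + self.misses;
        if lookups == 0 {
            0.0
        } else {
            self.misses as f64 / lookups as f64
        }
    }
}
```

Whether the engine's own `hit_rate()` handles the zero-lookup case the same way is not stated in this reference, so treat the guard as a defensive choice rather than documented behavior.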

list_fsst_dictionaries() -> Vec<String>

List all stored dictionary keys.

  • Returns keys like “table_name:column_name”
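Since keys follow the "table_name:column_name" shape, the returned list can be filtered per table with plain string handling. The helpers below are hypothetical (they are not part of the engine API) and assume that table names contain no ':' character.

```rust
/// Builds a key in the "table_name:column_name" form.
pub fn dictionary_key(table: &str, column: &str) -> String {
    format!("{table}:{column}")
}

/// Splits a key at the first ':' into (table, column).
/// Returns None when the key contains no ':'.
pub fn parse_dictionary_key(key: &str) -> Option<(&str, &str)> {
    key.split_once(':')
}

/// Lists the columns of `table` that have a dictionary, given the key list.
pub fn columns_for_table<'a>(keys: &'a [String], table: &str) -> Vec<&'a str> {
    keys.iter()
        .filter_map(|k| parse_dictionary_key(k))
        .filter(|(t, _)| *t == table)
        .map(|(_, column)| column)
        .collect()
}
```

Passing the result of `list_fsst_dictionaries()` to `columns_for_table` would then show which columns of a table already have a trained dictionary.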

clear_fsst_cache()

Clear in-memory cache (keeps persistent storage).

  • Useful for memory management
  • Dictionaries can be reloaded on demand

Configuration

Dictionary Cache Config

use heliosdb_nano::storage::compression::fsst::DictionaryCacheConfig;

let dict_config = DictionaryCacheConfig {
    max_size_mb: 100,                     // Max cache size (default: 100 MB)
    max_dictionaries: 1000,               // Max cached dictionaries (default: 1000)
    eviction_policy: EvictionPolicy::LRU, // Evict the least recently used dictionary first
};
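The two limits interact with the LRU policy as follows: whenever an insert pushes the cache past either the byte budget or the dictionary count, the least recently used entries are dropped until both limits hold again. A minimal sketch of that behavior, using a hypothetical `LruDictCache` type rather than the engine's actual cache, might look like this:

```rust
use std::collections::{HashMap, VecDeque};

/// Toy LRU cache for serialized dictionaries (hypothetical type).
pub struct LruDictCache {
    max_bytes: usize,
    max_entries: usize,
    used_bytes: usize,
    entries: HashMap<String, Vec<u8>>,
    order: VecDeque<String>, // front = least recently used
    pub evictions: u64,
}

impl LruDictCache {
    pub fn new(max_bytes: usize, max_entries: usize) -> Self {
        Self {
            max_bytes,
            max_entries,
            used_bytes: 0,
            entries: HashMap::new(),
            order: VecDeque::new(),
            evictions: 0,
        }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Looks up a dictionary and marks it as most recently used.
    pub fn get(&mut self, key: &str) -> Option<&[u8]> {
        if let Some(i) = self.order.iter().position(|k| k.as_str() == key) {
            if let Some(k) = self.order.remove(i) {
                self.order.push_back(k);
            }
        }
        self.entries.get(key).map(|v| v.as_slice())
    }

    /// Inserts or replaces a dictionary, then evicts from the LRU end
    /// until both the entry limit and the byte limit are satisfied.
    pub fn insert(&mut self, key: &str, dict: Vec<u8>) {
        if let Some(old) = self.entries.remove(key) {
            self.used_bytes -= old.len();
            self.order.retain(|k| k.as_str() != key);
        }
        self.used_bytes += dict.len();
        self.entries.insert(key.to_string(), dict);
        self.order.push_back(key.to_string());

        while self.entries.len() > self.max_entries || self.used_bytes > self.max_bytes {
            match self.order.pop_front() {
                Some(victim) => {
                    if let Some(v) = self.entries.remove(&victim) {
                        self.used_bytes -= v.len();
                        self.evictions += 1;
                    }
                }
                None => break,
            }
        }
    }
}
```

The linear scan in `get` keeps the sketch short; a production cache would use a linked structure for constant-time reordering. Note that in this sketch a single dictionary larger than the byte budget evicts itself immediately.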

Compression Config

use std::collections::HashMap;

use heliosdb_nano::storage::compression::CompressionConfig;

let config = CompressionConfig {
    enabled: true,              // Enable compression (default: true)
    min_data_size: 1024,        // Min size to compress (default: 1 KB)
    min_compression_ratio: 1.2, // Min ratio to keep compressed (default: 1.2)
    column_overrides: HashMap::new(),
};
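One plausible reading of these thresholds is a two-step check: values smaller than `min_data_size` are never compressed, and a compressed result is kept only if it beats `min_compression_ratio`. The sketch below encodes that reading with a hypothetical `CompressionPolicy` struct; the exact semantics inside heliosdb_nano are an assumption here.

```rust
/// Stand-in for the relevant fields of the compression config (hypothetical type).
pub struct CompressionPolicy {
    pub enabled: bool,
    pub min_data_size: usize,
    pub min_compression_ratio: f64,
}

/// Step 1: is the value worth trying to compress at all?
pub fn should_attempt(policy: &CompressionPolicy, original_len: usize) -> bool {
    policy.enabled && original_len >= policy.min_data_size
}

/// Step 2: after compressing, is the result good enough to store?
/// The ratio is original size divided by compressed size.
pub fn keep_compressed(
    policy: &CompressionPolicy,
    original_len: usize,
    compressed_len: usize,
) -> bool {
    if compressed_len == 0 {
        // Degenerate input: nothing was produced, so keep the original.
        return false;
    }
    let ratio = original_len as f64 / compressed_len as f64;
    ratio >= policy.min_compression_ratio
}
```

With the defaults shown above, a 2048-byte value that compresses to 1024 bytes (ratio 2.0) would be stored compressed, while one that only shrinks to 2000 bytes (ratio about 1.02) would be stored as-is.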

Best Practices

When to Train Dictionaries

Good Cases:

  • Columns with repeated patterns (emails, URLs, JSON)
  • High-cardinality string columns
  • After bulk inserts (>1000 rows)
  • During off-peak hours for large tables

Bad Cases:

  • Tables with <100 rows
  • Completely random strings (UUIDs without patterns)
  • Columns with unique values (no repetition)

Training Workflow

// 1. Create table
catalog.create_table("users", schema)?;
// 2. Bulk insert data
for user in users.iter() {
    engine.insert_tuple("users", user.to_tuple())?;
}
// 3. Train dictionary after bulk load
engine.train_fsst_for_table("users", "email")?;
engine.train_fsst_for_table("users", "address")?;
// 4. Subsequent inserts use trained dictionaries automatically

Memory Management

// Check cache size
let stats = engine.fsst_cache_stats();
if stats.size_bytes > 90 * 1024 * 1024 {
    // Approaching 100 MB limit
    // Clear cache if needed (dictionaries persist)
    engine.clear_fsst_cache();
}
// Reload critical dictionaries
engine.load_fsst_dictionary("users", "email")?;

Performance Tips

Optimal Sample Size

  • Default: sampling stops at 10,000 rows or 16 KB of data, whichever limit is reached first
  • More rows = better dictionary quality
  • Diminishing returns after 10K rows
  • Training time: <1 second for typical datasets
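The two sampling limits can be pictured as a loop that stops at whichever bound is hit first. The function below is a hypothetical sketch that takes rows in order from the start of the input; the engine may well pick rows differently (for example, at random), which this reference does not specify.

```rust
/// Collects rows for training until either `max_rows` rows or `max_bytes`
/// bytes have been gathered, whichever limit is reached first.
pub fn take_sample<'a>(rows: &[&'a str], max_rows: usize, max_bytes: usize) -> Vec<&'a str> {
    let mut sample = Vec::new();
    let mut bytes = 0usize;
    for &row in rows.iter().take(max_rows) {
        // Stop before the byte budget would be exceeded.
        if bytes + row.len() > max_bytes {
            break;
        }
        bytes += row.len();
        sample.push(row);
    }
    sample
}
```

Calling `take_sample(&rows, 10_000, 16 * 1024)` corresponds to the defaults listed above. For short values such as email addresses the byte limit is likely to be reached well before the row limit.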

Cache Hit Optimization

// Pre-load dictionaries for hot columns
let hot_columns = vec![
    ("users", "email"),
    ("logs", "message"),
    ("products", "description"),
];
for (table, column) in hot_columns {
    engine.load_fsst_dictionary(table, column)?;
}

Compression Ratio Validation

// Check if compression is beneficial
let compression_mgr = engine.compression_manager();
if let Some(stats) = compression_mgr.get_stats("users") {
    if stats.overall_ratio < 1.2 {
        println!("Warning: Low compression ratio");
        // Consider disabling compression for this table
    }
}

Typical Compression Ratios

Data Type          Compression Ratio
Email addresses    2-3x
URLs               3-4x
JSON logs          4-5x
UUIDs              2-3x
Repeated text      3-5x
Random strings     1.0-1.5x
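A ratio translates into storage savings through simple arithmetic: the ratio is the original size divided by the compressed size, and the fraction of space saved is 1 - 1/ratio. The helper names below are illustrative only.

```rust
/// Compression ratio: original size divided by compressed size.
/// A zero `compressed_bytes` yields infinity, so callers should avoid it.
pub fn compression_ratio(original_bytes: u64, compressed_bytes: u64) -> f64 {
    original_bytes as f64 / compressed_bytes as f64
}

/// Fraction of storage saved for a given ratio, e.g. 0.5 for a 2x ratio.
pub fn space_saved(ratio: f64) -> f64 {
    1.0 - 1.0 / ratio
}
```

So a 2x ratio halves storage, a 4x ratio saves 75%, and a ratio near 1.0 saves almost nothing.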

Troubleshooting

Dictionary Not Working

// Check if dictionary exists
let dicts = engine.list_fsst_dictionaries();
if !dicts.iter().any(|key| key == "users:email") {
    // Train dictionary
    engine.train_fsst_for_table("users", "email")?;
}

Low Compression Ratio

// Check compression stats
let stats = engine.compression_manager().get_stats("users");
if let Some(s) = stats {
    println!("Compression ratio: {:.2}x", s.overall_ratio);
    if s.overall_ratio < 1.5 {
        // Data might not have patterns suitable for FSST
        // Consider disabling compression for this table
    }
}

High Cache Misses

let stats = engine.fsst_cache_stats();
let lookups = stats.hits + stats.misses;
// Guard against division by zero before any lookups have happened
let miss_rate = if lookups == 0 {
    0.0
} else {
    stats.misses as f64 / lookups as f64
};
if miss_rate > 0.5 {
    // High miss rate - consider:
    // 1. Increasing cache size
    // 2. Pre-loading critical dictionaries
    // 3. Reducing number of string columns
}

Advanced Features

Multiple Columns Per Table

// Train dictionaries for all string columns
let string_columns = vec!["email", "address", "phone"];
for column in string_columns {
    engine.train_fsst_for_table("users", column)?;
}

Dictionary Versioning (Future)

// Not yet implemented - planned for future release
// Will support:
// - Multiple dictionary versions per column
// - Automatic retraining on data distribution changes
// - Dictionary migration

Testing

Basic Test

#[test]
fn test_fsst_compression() -> Result<(), Box<dyn std::error::Error>> {
    let config = Config::in_memory();
    let engine = StorageEngine::open_in_memory(&config)?;

    // Create table and insert data
    // ... (see tests/fsst_integration_tests.rs)

    // Train dictionary
    let dict_key = engine.train_fsst_for_table("users", "email")?;
    assert_eq!(dict_key, "users:email");

    // Verify that all rows are still readable after training
    let results = engine.scan_table("users")?;
    assert_eq!(results.len(), 100);
    Ok(())
}

Run All Tests

cargo test --test fsst_integration_tests

See Also