# Compression & Storage Quick Start Guide
HeliosDB Nano v3.6+
This guide explains HeliosDB Nano’s compression architecture and per-column storage optimization features.
## Compression Architecture
HeliosDB Nano uses a two-layer compression strategy:
1. **RocksDB LZ4 Block Compression** (automatic, always-on)
   - Applied to all data blocks
   - 3.7-7x compression depending on data patterns
   - Zero configuration required
2. **Per-Column Storage Modes** (optional, per-column)
   - Dictionary encoding for low-cardinality strings
   - Content-addressed deduplication for large values
   - Columnar batching for analytics workloads
## Automatic Compression (Zero Configuration)
All data in HeliosDB Nano is automatically compressed using RocksDB’s LZ4 block compression:
```rust
use heliosdb_nano::EmbeddedDatabase;

// Create database - compression is automatic
let db = EmbeddedDatabase::new("./mydb.helio")?;

// Create and populate table
db.execute("CREATE TABLE sensors (id INT, reading FLOAT8, location TEXT)")?;
db.execute("INSERT INTO sensors VALUES (1, 23.5, 'Building A, Floor 3, Room 301')")?;

// Query - decompression is automatic
let results = db.query("SELECT * FROM sensors", &[])?;
```

That's it! No configuration is needed for basic compression.
## Compression Ratios (10GB Benchmarks)
| Data Pattern | Compression Ratio | Example |
|---|---|---|
| Email addresses | 6.95x | 10.5 GB → 1.5 GB |
| Sorted text (regions) | 6.26x | 10.7 GB → 1.7 GB |
| Sequential integers | 3.96x | 10.7 GB → 2.5 GB |
| Low-cardinality strings | 3.96x | 10.7 GB → 2.7 GB |
| Floating-point prices | 3.70x | 18.3 GB → 2.9 GB |
## Per-Column Storage Modes
For additional optimization, you can specify storage modes per column:
### 1. Dictionary Encoding
Best for columns with few unique values (< 64K distinct):
```sql
CREATE TABLE orders (
    id INT PRIMARY KEY,
    status TEXT STORAGE DICTIONARY,    -- 6 values: pending, shipped, etc.
    country TEXT STORAGE DICTIONARY    -- ~200 country codes
);
```

Benefits:
- 50-95% storage reduction for low-cardinality strings
- 4-byte IDs instead of full strings in row tuples
- Faster equality comparisons
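The idea behind dictionary encoding can be sketched in a few lines: each distinct string gets a 4-byte ID, and row tuples store the ID instead of the string. This is a minimal standalone illustration of the technique, not HeliosDB Nano's internal implementation:

```rust
use std::collections::HashMap;

/// Minimal dictionary encoder: each distinct string maps to a u32 ID,
/// so row tuples can store 4 bytes instead of the full string.
struct Dictionary {
    ids: HashMap<String, u32>,
    values: Vec<String>,
}

impl Dictionary {
    fn new() -> Self {
        Dictionary { ids: HashMap::new(), values: Vec::new() }
    }

    /// Return the existing ID for `value`, or assign the next free one.
    fn encode(&mut self, value: &str) -> u32 {
        if let Some(&id) = self.ids.get(value) {
            return id;
        }
        let id = self.values.len() as u32;
        self.ids.insert(value.to_string(), id);
        self.values.push(value.to_string());
        id
    }

    /// Look a value back up by its ID.
    fn decode(&self, id: u32) -> &str {
        &self.values[id as usize]
    }
}

fn main() {
    let mut dict = Dictionary::new();
    let statuses = ["pending", "shipped", "pending", "delivered", "pending"];
    let encoded: Vec<u32> = statuses.iter().map(|s| dict.encode(s)).collect();
    // 5 rows but only 3 dictionary entries; equal strings share one ID,
    // and equality checks become cheap integer comparisons.
    assert_eq!(encoded, vec![0, 1, 0, 2, 0]);
    assert_eq!(dict.decode(1), "shipped");
    println!("{:?} ({} entries)", encoded, dict.values.len());
}
```

This also shows why the mode only pays off at low cardinality: a column of unique values would grow the dictionary to the size of the data itself.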
### 2. Content-Addressed Storage
Best for large values (>1KB) with duplicates:
```sql
CREATE TABLE documents (
    id INT PRIMARY KEY,
    content TEXT STORAGE CONTENT_ADDRESSED,     -- Many docs share templates
    thumbnail BYTEA STORAGE CONTENT_ADDRESSED   -- Default images
);
```

Benefits:
- Near-100% deduplication for duplicate large values
- Blake3 hashing for fast lookups
- Only stores each unique value once
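Content addressing means the key for a value is a hash of its bytes, so identical values collapse to a single stored copy. A toy sketch of the principle (HeliosDB Nano uses Blake3; this standalone example substitutes the standard library's `DefaultHasher` to stay dependency-free):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy content-addressed store: values are keyed by a hash of their bytes,
/// so duplicate values are stored exactly once.
struct ContentStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl ContentStore {
    fn new() -> Self {
        ContentStore { blobs: HashMap::new() }
    }

    /// Store a value and return its content hash; re-inserting an
    /// identical value returns the same key without storing a copy.
    fn put(&mut self, value: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        value.hash(&mut h);
        let key = h.finish();
        self.blobs.entry(key).or_insert_with(|| value.to_vec());
        key
    }

    fn get(&self, key: u64) -> Option<&[u8]> {
        self.blobs.get(&key).map(|v| v.as_slice())
    }
}

fn main() {
    let mut store = ContentStore::new();
    let template = b"<html>Welcome, {name}!</html>";
    // Two documents share the same template: one stored copy, one key.
    let k1 = store.put(template);
    let k2 = store.put(template);
    let k3 = store.put(b"<html>Goodbye!</html>");
    assert_eq!(k1, k2);
    assert_ne!(k1, k3);
    assert_eq!(store.blobs.len(), 2);
    assert_eq!(store.get(k1), Some(&template[..]));
    println!("{} unique blobs stored", store.blobs.len());
}
```

A real engine pairs this with reference counting so blobs can be garbage-collected when the last row pointing at them is deleted; that bookkeeping is omitted here.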
### 3. Columnar Storage
Best for analytics and aggregations:
```sql
CREATE TABLE metrics (
    id INT PRIMARY KEY,
    timestamp INT8 STORAGE COLUMNAR,   -- Time-series data
    value FLOAT8 STORAGE COLUMNAR      -- Aggregation targets
);
```

Benefits:
- 20-50% better compression for numeric sequences
- 2-4x faster aggregations (SUM, AVG, COUNT)
- Efficient column-only scans
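The aggregation speedup comes from the memory layout: column-major storage keeps each column contiguous, so a `SUM` scans only the bytes it needs. A minimal sketch of the layout difference (illustrative only, not the engine's actual format):

```rust
/// Row-major layout: each row's fields are stored together.
struct RowMetric {
    timestamp: i64,
    value: f64,
}

/// Column-major layout: each column is one contiguous, cache-friendly vector.
struct ColumnarMetrics {
    timestamps: Vec<i64>,
    values: Vec<f64>,
}

fn main() {
    let rows = vec![
        RowMetric { timestamp: 1_700_000_000, value: 1.5 },
        RowMetric { timestamp: 1_700_000_060, value: 2.5 },
        RowMetric { timestamp: 1_700_000_120, value: 4.0 },
    ];

    // Transpose rows into columns.
    let cols = ColumnarMetrics {
        timestamps: rows.iter().map(|r| r.timestamp).collect(),
        values: rows.iter().map(|r| r.value).collect(),
    };

    // SUM(value) touches only the `values` column, never the timestamps;
    // sorted, similar-valued columns also compress better as a block.
    let sum: f64 = cols.values.iter().sum();
    let avg = sum / cols.values.len() as f64;
    assert_eq!(sum, 8.0);
    println!("SUM = {sum}, AVG = {avg}, cols = {}", cols.timestamps.len());
}
```
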
## Changing Storage Modes (Online Migration)
You can change storage modes on existing tables:
```sql
-- Migrate existing data to dictionary encoding
ALTER TABLE orders ALTER COLUMN status SET STORAGE DICTIONARY;

-- Migrate to content-addressed storage
ALTER TABLE documents ALTER COLUMN content SET STORAGE CONTENT_ADDRESSED;

-- Migrate to columnar storage
ALTER TABLE metrics ALTER COLUMN value SET STORAGE COLUMNAR;

-- Revert to default
ALTER TABLE orders ALTER COLUMN status SET STORAGE DEFAULT;
```

Note: Migrations are performed online and restructure all existing data.
## When to Use Each Mode
| Scenario | Recommended | Reason |
|---|---|---|
| Status codes, enums | DICTIONARY | Few unique values |
| Country/region codes | DICTIONARY | ~200 unique values |
| Email templates | CONTENT_ADDRESSED | Many duplicates |
| Default profile images | CONTENT_ADDRESSED | High duplication |
| Time-series timestamps | COLUMNAR | Sequential, aggregations |
| Sensor readings | COLUMNAR | Numeric, analytics |
| Unique IDs (UUIDs) | DEFAULT | High cardinality |
| Primary keys | DEFAULT | Point lookups |
## Fast Bulk Loading
For maximum ingestion performance, use bulk loading mode:
```rust
// Enable bulk load mode
db.execute("SET bulk_load_mode = true")?;

// Load data at 700K-900K rows/sec
let rows = db.direct_bulk_load("table_name", &batches)?;

// Disable bulk load mode
db.execute("SET bulk_load_mode = false")?;
```

### Bulk Loading Performance
| Method | Throughput | Use Case |
|---|---|---|
| `direct_bulk_load` | 700K-900K rows/sec | ETL, data warehouse loading |
| Standard INSERT | 30K-50K rows/sec | OLTP operations |
| Bulk Load Mode | 500K-700K rows/sec | Batch processing |
## Monitoring Compression

### Check Storage Usage
```sql
-- RocksDB statistics (when available)
SELECT * FROM pg_stat_storage;
```

### Estimate Compression Ratio
Compare logical vs physical size:
- Logical: Row count × average row size
- Physical: Database file size
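The estimate is simple arithmetic: logical bytes (row count × average row size) divided by the on-disk file size. A sketch with hypothetical inputs chosen to reproduce the 3.96x figure from the benchmark table above (10.7 GB logical → 2.7 GB physical):

```rust
/// Back-of-envelope compression ratio: logical size divided by the
/// physical size of the database file on disk.
fn compression_ratio(row_count: u64, avg_row_bytes: u64, physical_bytes: u64) -> f64 {
    let logical_bytes = (row_count * avg_row_bytes) as f64;
    logical_bytes / physical_bytes as f64
}

fn main() {
    // Hypothetical: 100M rows averaging 107 bytes (10.7 GB logical),
    // stored in a 2.7 GB database file.
    let ratio = compression_ratio(100_000_000, 107, 2_700_000_000);
    assert!((ratio - 3.96).abs() < 0.01);
    println!("~{ratio:.2}x compression");
}
```

Average row size can itself be estimated from the schema (fixed-width columns) plus sampled lengths for variable-width text columns.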
## Performance Tips

### 1. Use Appropriate Storage Modes
```sql
-- Good: Dictionary for enums
status TEXT STORAGE DICTIONARY     -- 6 unique values

-- Bad: Dictionary for UUIDs
user_id TEXT STORAGE DICTIONARY    -- Millions of unique values
```

### 2. Batch Inserts for Better Compression
LZ4 compression works better with larger data blocks:
```rust
// Good: Batch inserts
db.execute("BEGIN")?;
for batch in data.chunks(1000) {
    db.execute(&insert_batch(batch))?;
}
db.execute("COMMIT")?;

// Also good: direct_bulk_load
db.direct_bulk_load("table", &batches)?;
```

### 3. Consider Data Order
Sorted data compresses better:
```sql
-- Better compression if data is pre-sorted
CREATE TABLE events (
    timestamp INT8 STORAGE COLUMNAR,   -- Sorted timestamps compress well
    event_type TEXT STORAGE DICTIONARY
);
```

## Troubleshooting

### Low Compression Ratio
Check data patterns:
- Random data (UUIDs, hashes) compresses poorly
- Already-compressed data (images, videos) won’t compress further
- Encrypted data has no compressible patterns
Solution: Use DEFAULT storage for incompressible data.
### Dictionary Too Large
Symptom: Dictionary encoding not helping
Check:
```sql
SELECT COUNT(DISTINCT column_name) FROM table_name;
```

Solution: If there are more than 64K unique values, switch to DEFAULT.
### Slow Aggregations
Symptom: SUM/AVG queries slow on large tables
Solution: Use COLUMNAR storage for aggregation targets:
```sql
ALTER TABLE metrics ALTER COLUMN value SET STORAGE COLUMNAR;
```

## Further Reading
- Per-Column Storage Modes Guide - Detailed storage mode documentation
- Bulk Loading Analysis - Performance optimization
- Benchmark Report - 10GB benchmark results
Last Updated: 2026-01-17 · Version: HeliosDB Nano v3.6+