# Compression & Storage Quick Start Guide
HeliosDB Nano v3.6+
This guide explains HeliosDB Nano’s compression architecture and per-column storage optimization features.
## Compression Architecture
HeliosDB Nano uses a two-layer compression strategy:
1. **RocksDB LZ4 Block Compression** (automatic, always-on)
   - Applied to all data blocks
   - 3.7-7x compression depending on data patterns
   - Zero configuration required
2. **Per-Column Storage Modes** (optional, per-column)
   - Dictionary encoding for low-cardinality strings
   - Content-addressed deduplication for large values
   - Columnar batching for analytics workloads
## Automatic Compression (Zero Configuration)
All data in HeliosDB Nano is automatically compressed using RocksDB’s LZ4 block compression:
```rust
use heliosdb_nano::EmbeddedDatabase;

// Create database - compression is automatic
let db = EmbeddedDatabase::new("./mydb.helio")?;

// Create and populate table
db.execute("CREATE TABLE sensors (id INT, reading FLOAT8, location TEXT)")?;
db.execute("INSERT INTO sensors VALUES (1, 23.5, 'Building A, Floor 3, Room 301')")?;

// Query - decompression is automatic
let results = db.query("SELECT * FROM sensors", &[])?;
```

That's it! No configuration is needed for basic compression.
## Compression Ratios (10GB Benchmarks)
| Data Pattern | Compression Ratio | Example |
|---|---|---|
| Email addresses | 6.95x | 10.5 GB → 1.5 GB |
| Sorted text (regions) | 6.26x | 10.7 GB → 1.7 GB |
| Sequential integers | 3.96x | 10.7 GB → 2.5 GB |
| Low-cardinality strings | 3.96x | 10.7 GB → 2.7 GB |
| Floating-point prices | 3.70x | 18.3 GB → 2.9 GB |
## Per-Column Storage Modes
For additional optimization, you can specify storage modes per column:
### 1. Dictionary Encoding
Best for columns with few unique values (< 64K distinct):
```sql
CREATE TABLE orders (
    id INT PRIMARY KEY,
    status TEXT STORAGE DICTIONARY,    -- 6 values: pending, shipped, etc.
    country TEXT STORAGE DICTIONARY    -- ~200 country codes
);
```

Benefits:
- 50-95% storage reduction for low-cardinality strings
- 4-byte IDs instead of full strings in row tuples
- Faster equality comparisons
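The idea behind dictionary encoding can be sketched in a few lines: each distinct string gets a 4-byte ID, and row tuples store the ID instead of the string. This is a minimal standalone illustration of the technique, not HeliosDB Nano's internal implementation:

```rust
use std::collections::HashMap;

/// Minimal dictionary encoder: each distinct string maps to a u32 ID,
/// so row tuples can store 4 bytes instead of the full string.
struct Dictionary {
    ids: HashMap<String, u32>,
    values: Vec<String>,
}

impl Dictionary {
    fn new() -> Self {
        Dictionary { ids: HashMap::new(), values: Vec::new() }
    }

    /// Return the existing ID for `value`, or assign the next free one.
    fn encode(&mut self, value: &str) -> u32 {
        if let Some(&id) = self.ids.get(value) {
            return id;
        }
        let id = self.values.len() as u32;
        self.ids.insert(value.to_string(), id);
        self.values.push(value.to_string());
        id
    }

    /// Look a value back up by its ID.
    fn decode(&self, id: u32) -> &str {
        &self.values[id as usize]
    }
}

fn main() {
    let mut dict = Dictionary::new();
    let statuses = ["pending", "shipped", "pending", "delivered", "pending"];
    let encoded: Vec<u32> = statuses.iter().map(|s| dict.encode(s)).collect();
    // 5 rows but only 3 dictionary entries; equal strings share one ID,
    // and equality checks become cheap integer comparisons.
    assert_eq!(encoded, vec![0, 1, 0, 2, 0]);
    assert_eq!(dict.decode(1), "shipped");
    println!("{:?} ({} entries)", encoded, dict.values.len());
}
```

This also shows why the mode only pays off at low cardinality: a column of unique values would grow the dictionary to the size of the data itself.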
### 2. Content-Addressed Storage
Best for large values (>1KB) with duplicates:
```sql
CREATE TABLE documents (
    id INT PRIMARY KEY,
    content TEXT STORAGE CONTENT_ADDRESSED,     -- Many docs share templates
    thumbnail BYTEA STORAGE CONTENT_ADDRESSED   -- Default images
);
```

Benefits:
- Near-100% deduplication for duplicate large values
- Blake3 hashing for fast lookups
- Only stores each unique value once
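Content addressing means the key for a value is a hash of its bytes, so identical values collapse to a single stored copy. A toy sketch of the principle (HeliosDB Nano uses Blake3; this standalone example substitutes the standard library's `DefaultHasher` to stay dependency-free):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy content-addressed store: values are keyed by a hash of their bytes,
/// so duplicate values are stored exactly once.
struct ContentStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl ContentStore {
    fn new() -> Self {
        ContentStore { blobs: HashMap::new() }
    }

    /// Store a value and return its content hash; re-inserting an
    /// identical value returns the same key without storing a copy.
    fn put(&mut self, value: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        value.hash(&mut h);
        let key = h.finish();
        self.blobs.entry(key).or_insert_with(|| value.to_vec());
        key
    }

    fn get(&self, key: u64) -> Option<&[u8]> {
        self.blobs.get(&key).map(|v| v.as_slice())
    }
}

fn main() {
    let mut store = ContentStore::new();
    let template = b"<html>Welcome, {name}!</html>";
    // Two documents share the same template: one stored copy, one key.
    let k1 = store.put(template);
    let k2 = store.put(template);
    let k3 = store.put(b"<html>Goodbye!</html>");
    assert_eq!(k1, k2);
    assert_ne!(k1, k3);
    assert_eq!(store.blobs.len(), 2);
    assert_eq!(store.get(k1), Some(&template[..]));
    println!("{} unique blobs stored", store.blobs.len());
}
```

A real engine pairs this with reference counting so blobs can be garbage-collected when the last row pointing at them is deleted; that bookkeeping is omitted here.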
### 3. Columnar Storage
Best for analytics and aggregations:
```sql
CREATE TABLE metrics (
    id INT PRIMARY KEY,
    timestamp INT8 STORAGE COLUMNAR,   -- Time-series data
    value FLOAT8 STORAGE COLUMNAR      -- Aggregation targets
);
```

Benefits:
- 20-50% better compression for numeric sequences
- 2-4x faster aggregations (SUM, AVG, COUNT)
- Efficient column-only scans
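The aggregation speedup comes from the memory layout: column-major storage keeps each column contiguous, so a `SUM` scans only the bytes it needs. A minimal sketch of the layout difference (illustrative only, not the engine's actual format):

```rust
/// Row-major layout: each row's fields are stored together.
struct RowMetric {
    timestamp: i64,
    value: f64,
}

/// Column-major layout: each column is one contiguous, cache-friendly vector.
struct ColumnarMetrics {
    timestamps: Vec<i64>,
    values: Vec<f64>,
}

fn main() {
    let rows = vec![
        RowMetric { timestamp: 1_700_000_000, value: 1.5 },
        RowMetric { timestamp: 1_700_000_060, value: 2.5 },
        RowMetric { timestamp: 1_700_000_120, value: 4.0 },
    ];

    // Transpose rows into columns.
    let cols = ColumnarMetrics {
        timestamps: rows.iter().map(|r| r.timestamp).collect(),
        values: rows.iter().map(|r| r.value).collect(),
    };

    // SUM(value) touches only the `values` column, never the timestamps;
    // sorted, similar-valued columns also compress better as a block.
    let sum: f64 = cols.values.iter().sum();
    let avg = sum / cols.values.len() as f64;
    assert_eq!(sum, 8.0);
    println!("SUM = {sum}, AVG = {avg}, cols = {}", cols.timestamps.len());
}
```
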
## Changing Storage Modes (Online Migration)
You can change storage modes on existing tables:
```sql
-- Migrate existing data to dictionary encoding
ALTER TABLE orders ALTER COLUMN status SET STORAGE DICTIONARY;

-- Migrate to content-addressed storage
ALTER TABLE documents ALTER COLUMN content SET STORAGE CONTENT_ADDRESSED;

-- Migrate to columnar storage
ALTER TABLE metrics ALTER COLUMN value SET STORAGE COLUMNAR;

-- Revert to default
ALTER TABLE orders ALTER COLUMN status SET STORAGE DEFAULT;
```

Note: Migrations are performed online and restructure all existing data.
## When to Use Each Mode
| Scenario | Recommended | Reason |
|---|---|---|
| Status codes, enums | DICTIONARY | Few unique values |
| Country/region codes | DICTIONARY | ~200 unique values |
| Email templates | CONTENT_ADDRESSED | Many duplicates |
| Default profile images | CONTENT_ADDRESSED | High duplication |
| Time-series timestamps | COLUMNAR | Sequential, aggregations |
| Sensor readings | COLUMNAR | Numeric, analytics |
| Unique IDs (UUIDs) | DEFAULT | High cardinality |
| Primary keys | DEFAULT | Point lookups |
## Fast Bulk Loading
For maximum ingestion performance, use bulk loading mode:
```rust
// Enable bulk load mode
db.execute("SET bulk_load_mode = true")?;

// Load data at 700K-900K rows/sec
let rows = db.direct_bulk_load("table_name", &batches)?;

// Disable bulk load mode
db.execute("SET bulk_load_mode = false")?;
```

### Bulk Loading Performance
| Method | Throughput | Use Case |
|---|---|---|
| `direct_bulk_load` | 700K-900K rows/sec | ETL, data warehouse loading |
| Standard INSERT | 30K-50K rows/sec | OLTP operations |
| Bulk Load Mode | 500K-700K rows/sec | Batch processing |
## Monitoring Compression

### Check Storage Usage
```sql
-- RocksDB statistics (when available)
SELECT * FROM pg_stat_storage;
```

### Estimate Compression Ratio
Compare logical vs physical size:
- Logical: Row count × average row size
- Physical: Database file size
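The estimate is simple arithmetic: logical bytes (row count × average row size) divided by the on-disk file size. A sketch with hypothetical inputs chosen to reproduce the 3.96x figure from the benchmark table above (10.7 GB logical → 2.7 GB physical):

```rust
/// Back-of-envelope compression ratio: logical size divided by the
/// physical size of the database file on disk.
fn compression_ratio(row_count: u64, avg_row_bytes: u64, physical_bytes: u64) -> f64 {
    let logical_bytes = (row_count * avg_row_bytes) as f64;
    logical_bytes / physical_bytes as f64
}

fn main() {
    // Hypothetical: 100M rows averaging 107 bytes (10.7 GB logical),
    // stored in a 2.7 GB database file.
    let ratio = compression_ratio(100_000_000, 107, 2_700_000_000);
    assert!((ratio - 3.96).abs() < 0.01);
    println!("~{ratio:.2}x compression");
}
```

Average row size can itself be estimated from the schema (fixed-width columns) plus sampled lengths for variable-width text columns.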
## Performance Tips

### 1. Use Appropriate Storage Modes
```sql
-- Good: Dictionary for enums
status TEXT STORAGE DICTIONARY     -- 6 unique values

-- Bad: Dictionary for UUIDs
user_id TEXT STORAGE DICTIONARY    -- Millions of unique values
```

### 2. Batch Inserts for Better Compression
LZ4 compression works better with larger data blocks:
```rust
// Good: Batch inserts
db.execute("BEGIN")?;
for batch in data.chunks(1000) {
    db.execute(&insert_batch(batch))?;
}
db.execute("COMMIT")?;

// Also good: direct_bulk_load
db.direct_bulk_load("table", &batches)?;
```

### 3. Consider Data Order
Sorted data compresses better:
```sql
-- Better compression if data is pre-sorted
CREATE TABLE events (
    timestamp INT8 STORAGE COLUMNAR,   -- Sorted timestamps compress well
    event_type TEXT STORAGE DICTIONARY
);
```

## Troubleshooting

### Low Compression Ratio
Check data patterns:
- Random data (UUIDs, hashes) compresses poorly
- Already-compressed data (images, videos) won’t compress further
- Encrypted data has no compressible patterns
Solution: Use DEFAULT storage for incompressible data.
### Dictionary Too Large
Symptom: Dictionary encoding not helping
Check:
```sql
SELECT COUNT(DISTINCT column_name) FROM table_name;
```

Solution: If there are more than 64K unique values, switch to DEFAULT.
### Slow Aggregations
Symptom: SUM/AVG queries slow on large tables
Solution: Use COLUMNAR storage for aggregation targets:
```sql
ALTER TABLE metrics ALTER COLUMN value SET STORAGE COLUMNAR;
```

## Further Reading
- Per-Column Storage Modes Guide - Detailed storage mode documentation
- Bulk Loading Analysis - Performance optimization
- Benchmark Report - 10GB benchmark results
Last Updated: 2026-01-17 · Version: HeliosDB Nano v3.6+