
Compression & Storage Quick Start Guide

HeliosDB Nano v3.6+

This guide explains HeliosDB Nano’s compression architecture and per-column storage optimization features.


Compression Architecture

HeliosDB Nano uses a two-layer compression strategy:

  1. RocksDB LZ4 Block Compression (automatic, always-on)

    • Applied to all data blocks
    • 3.7-7x compression depending on data patterns
    • Zero configuration required
  2. Per-Column Storage Modes (optional, per-column)

    • Dictionary encoding for low-cardinality strings
    • Content-addressed deduplication for large values
    • Columnar batching for analytics workloads

Automatic Compression (Zero Configuration)

All data in HeliosDB Nano is automatically compressed using RocksDB’s LZ4 block compression:

use heliosdb_nano::EmbeddedDatabase;
// Create database - compression is automatic
let db = EmbeddedDatabase::new("./mydb.helio")?;
// Create and populate table
db.execute("CREATE TABLE sensors (id INT, reading FLOAT8, location TEXT)")?;
db.execute("INSERT INTO sensors VALUES (1, 23.5, 'Building A, Floor 3, Room 301')")?;
// Query - decompression is automatic
let results = db.query("SELECT * FROM sensors", &[])?;

That’s it! No configuration needed for basic compression.

Compression Ratios (10GB Benchmarks)

Data Pattern            | Compression Ratio | Example
Email addresses         | 6.95x             | 10.5 GB → 1.5 GB
Sorted text (regions)   | 6.26x             | 10.7 GB → 1.7 GB
Sequential integers     | 3.96x             | 10.7 GB → 2.5 GB
Low-cardinality strings | 3.96x             | 10.7 GB → 2.7 GB
Floating-point prices   | 3.70x             | 10.7 GB → 2.9 GB

Per-Column Storage Modes

For additional optimization, you can specify storage modes per column:

1. Dictionary Encoding

Best for columns with few unique values (< 64K distinct):

CREATE TABLE orders (
  id INT PRIMARY KEY,
  status TEXT STORAGE DICTIONARY,  -- 6 values: pending, shipped, etc.
  country TEXT STORAGE DICTIONARY  -- ~200 country codes
);

Benefits:

  • 50-95% storage reduction for low-cardinality strings
  • 4-byte IDs instead of full strings in row tuples
  • Faster equality comparisons
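
As a rough end-to-end sketch using the embedded API from the earlier example (the file name, sample rows, and a main function returning Box<dyn std::error::Error> are illustrative assumptions; only the STORAGE DICTIONARY syntax comes from the guide):

use heliosdb_nano::EmbeddedDatabase;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db = EmbeddedDatabase::new("./orders.helio")?;
    db.execute(
        "CREATE TABLE orders (
            id INT PRIMARY KEY,
            status TEXT STORAGE DICTIONARY,
            country TEXT STORAGE DICTIONARY
        )",
    )?;
    // Each distinct string is stored once in the dictionary; rows keep a
    // compact ID instead of the full text.
    db.execute("INSERT INTO orders VALUES (1, 'pending', 'US')")?;
    db.execute("INSERT INTO orders VALUES (2, 'shipped', 'DE')")?;
    // Equality filters can compare dictionary IDs rather than full strings.
    let _shipped = db.query("SELECT id FROM orders WHERE status = 'shipped'", &[])?;
    Ok(())
}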

2. Content-Addressed Storage

Best for large values (>1KB) with duplicates:

CREATE TABLE documents (
  id INT PRIMARY KEY,
  content TEXT STORAGE CONTENT_ADDRESSED,    -- Many docs share templates
  thumbnail BYTEA STORAGE CONTENT_ADDRESSED  -- Default images
);

Benefits:

  • Near-100% deduplication for duplicate large values
  • Blake3 hashing for fast lookups
  • Only stores each unique value once
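
A small sketch of the deduplication case, using the same embedded API (the file name, table contents, and error-handling style are illustrative assumptions):

use heliosdb_nano::EmbeddedDatabase;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db = EmbeddedDatabase::new("./docs.helio")?;
    db.execute(
        "CREATE TABLE documents (
            id INT PRIMARY KEY,
            content TEXT STORAGE CONTENT_ADDRESSED
        )",
    )?;
    // A >1KB value repeated across many rows: the value is hashed (Blake3)
    // and stored once; each row keeps only a reference to it.
    let template = "Dear customer, thank you for your order. ".repeat(50);
    for i in 0..1_000 {
        db.execute(&format!("INSERT INTO documents VALUES ({i}, '{template}')"))?;
    }
    Ok(())
}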

3. Columnar Storage

Best for analytics and aggregations:

CREATE TABLE metrics (
  id INT PRIMARY KEY,
  timestamp INT8 STORAGE COLUMNAR,  -- Time-series data
  value FLOAT8 STORAGE COLUMNAR     -- Aggregation targets
);

Benefits:

  • 20-50% better compression for numeric sequences
  • 2-4x faster aggregations (SUM, AVG, COUNT)
  • Efficient column-only scans
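
A short sketch of the aggregation case (illustrative file name and values; it assumes standard aggregate support, consistent with the SUM/AVG/COUNT speedups listed above):

use heliosdb_nano::EmbeddedDatabase;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db = EmbeddedDatabase::new("./metrics.helio")?;
    db.execute(
        "CREATE TABLE metrics (
            id INT PRIMARY KEY,
            timestamp INT8 STORAGE COLUMNAR,
            value FLOAT8 STORAGE COLUMNAR
        )",
    )?;
    db.execute("INSERT INTO metrics VALUES (1, 1700000000, 21.5)")?;
    db.execute("INSERT INTO metrics VALUES (2, 1700000060, 22.1)")?;
    // An aggregation like this only needs to scan the 'value' column,
    // which is where the 2-4x speedup comes from.
    let _avg = db.query("SELECT AVG(value) FROM metrics", &[])?;
    Ok(())
}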

Changing Storage Modes (Online Migration)

You can change storage modes on existing tables:

-- Migrate existing data to dictionary encoding
ALTER TABLE orders ALTER COLUMN status SET STORAGE DICTIONARY;
-- Migrate to content-addressed storage
ALTER TABLE documents ALTER COLUMN content SET STORAGE CONTENT_ADDRESSED;
-- Migrate to columnar storage
ALTER TABLE metrics ALTER COLUMN value SET STORAGE COLUMNAR;
-- Revert to default
ALTER TABLE orders ALTER COLUMN status SET STORAGE DEFAULT;

Note: Migrations are performed online and restructure all existing data.


When to Use Each Mode

Scenario               | Recommended       | Reason
Status codes, enums    | DICTIONARY        | Few unique values
Country/region codes   | DICTIONARY        | ~200 unique values
Email templates        | CONTENT_ADDRESSED | Many duplicates
Default profile images | CONTENT_ADDRESSED | High duplication
Time-series timestamps | COLUMNAR          | Sequential, aggregations
Sensor readings        | COLUMNAR          | Numeric, analytics
Unique IDs (UUIDs)     | DEFAULT           | High cardinality
Primary keys           | DEFAULT           | Point lookups

Fast Bulk Loading

For maximum ingestion performance, use bulk loading mode:

// Enable bulk load mode
db.execute("SET bulk_load_mode = true")?;
// Load data at 700K-900K rows/sec
let rows = db.direct_bulk_load("table_name", &batches)?;
// Disable bulk load mode
db.execute("SET bulk_load_mode = false")?;

Bulk Loading Performance

Method           | Throughput         | Use Case
direct_bulk_load | 700K-900K rows/sec | ETL, data warehouse loading
Standard INSERT  | 30K-50K rows/sec   | OLTP operations
Bulk Load Mode   | 500K-700K rows/sec | Batch processing

Monitoring Compression

Check Storage Usage

-- RocksDB statistics (when available)
SELECT * FROM pg_stat_storage;

Estimate Compression Ratio

Compare logical vs physical size:

  • Logical: Row count × average row size
  • Physical: Database file size
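
A small helper along these lines (a sketch, not part of the HeliosDB API: you supply the row count and an estimated average row size, and it sums file sizes in case the database path is a directory rather than a single file):

use std::fs;
use std::io;
use std::path::Path;

// Total bytes on disk for a file or, recursively, a directory.
fn physical_size(path: &Path) -> io::Result<u64> {
    let meta = fs::metadata(path)?;
    if meta.is_file() {
        return Ok(meta.len());
    }
    let mut total = 0;
    for entry in fs::read_dir(path)? {
        total += physical_size(&entry?.path())?;
    }
    Ok(total)
}

// Estimated compression ratio = logical size / physical size.
fn estimated_ratio(db_path: &Path, row_count: u64, avg_row_bytes: u64) -> io::Result<f64> {
    let logical = (row_count * avg_row_bytes) as f64;
    let physical = physical_size(db_path)? as f64;
    Ok(logical / physical)
}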

Performance Tips

1. Use Appropriate Storage Modes

-- Good: Dictionary for enums
status TEXT STORAGE DICTIONARY -- 6 unique values
-- Bad: Dictionary for UUIDs
user_id TEXT STORAGE DICTIONARY -- Millions of unique values

2. Batch Inserts for Better Compression

LZ4 compression works better with larger data blocks:

// Good: Batch inserts
db.execute("BEGIN")?;
for batch in data.chunks(1000) {
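// insert_batch stands in for your own helper that builds one multi-row
// INSERT statement for the chunk; it is not part of the HeliosDB API.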
db.execute(&insert_batch(batch))?;
}
db.execute("COMMIT")?;
// Also good: direct_bulk_load
db.direct_bulk_load("table", &batches)?;

3. Consider Data Order

Sorted data compresses better:

-- Better compression if data is pre-sorted
CREATE TABLE events (
  timestamp INT8 STORAGE COLUMNAR,  -- Sorted timestamps compress well
  event_type TEXT STORAGE DICTIONARY
);

Troubleshooting

Low Compression Ratio

Check data patterns:

  • Random data (UUIDs, hashes) compresses poorly
  • Already-compressed data (images, videos) won’t compress further
  • Encrypted data has no compressible patterns

Solution: Use DEFAULT storage for incompressible data.

Dictionary Too Large

Symptom: Dictionary encoding yields little or no storage reduction

Check:

SELECT COUNT(DISTINCT column_name) FROM table_name;

Solution: If >64K unique values, switch to DEFAULT.

Slow Aggregations

Symptom: SUM/AVG queries slow on large tables

Solution: Use COLUMNAR storage for aggregation targets:

ALTER TABLE metrics ALTER COLUMN value SET STORAGE COLUMNAR;

Last Updated: 2026-01-17 · Version: HeliosDB Nano v3.6+