Compression Codecs User Guide
HeliosDB Nano v3.0.1 - Complete Compression Reference
This guide provides detailed information about each compression codec in HeliosDB Nano, including optimal use cases, data characteristics, and estimated compression ratios for 10GB data scenarios.
Table of Contents
- Overview
- ALP - Adaptive Lossless floating-Point
- FSST - Fast Static Symbol Table
- Dictionary Encoding
- RLE - Run-Length Encoding
- Delta Encoding
- Codec Selection Guide
- Performance Comparison
- SQL Configuration
Overview
HeliosDB Nano includes five specialized compression codecs, each optimized for different data patterns:
| Codec | Target Data | Typical Ratio | Speed |
|---|---|---|---|
| ALP | Floating-point numbers | 2-4x | Very Fast |
| FSST | Strings with patterns | 2-3x | Fast |
| Dictionary | Low-cardinality columns | 5-20x | Very Fast |
| RLE | Repetitive/sorted data | 10-100x | Fastest |
| Delta | Sequential numbers | 2-10x | Very Fast |
ALP - Adaptive Lossless floating-Point
Description
ALP (Adaptive Lossless floating-Point) is a state-of-the-art compression algorithm for IEEE 754 floating-point data. Based on ACM SIGMOD 2024 research, it automatically adapts between two strategies:
- ALP Classic: For decimal-origin data (financial, percentages, measurements)
- ALP-RD: For high-precision floats (scientific, ML weights)
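The core idea behind ALP Classic can be sketched in a few lines: decimal-origin doubles round-trip exactly through a scaled-integer representation, and the resulting integers are cheap to store. This is an illustrative simplification, not the HeliosDB implementation; real ALP selects (exponent, factor) pairs per 1024-value vector with sampling and SIMD, and falls back to ALP-RD when values do not round-trip.

```python
def alp_classic_encode(values, max_exp=10):
    """Find the smallest decimal exponent that makes every value round-trip."""
    for e in range(max_exp + 1):
        scale = 10 ** e
        ints = [round(v * scale) for v in values]
        # Lossless only if decoding reproduces the exact input doubles
        if all(i / scale == v for i, v in zip(ints, values)):
            return e, ints
    return None  # no exact decimal representation: fall back (ALP-RD territory)

def alp_classic_decode(e, ints):
    scale = 10 ** e
    return [i / scale for i in ints]

prices = [10.12, 99.95, 1234.56]
e, ints = alp_classic_encode(prices)          # e == 2: two decimal places
assert alp_classic_decode(e, ints) == prices  # zero precision loss
```

Financial data with two decimal places encodes with exponent 2, turning every price into a small integer; random doubles fail the round-trip check at every exponent, which is why they compress poorly.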
Technical Characteristics
- Encoding Speed: ~0.5 doubles per CPU cycle
- Decoding Speed: ~2.6 doubles per CPU cycle
- Compression: 100% lossless (zero precision loss)
- Block Size: 1024 values (optimized for CPU cache)
Good Use Cases
| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Financial Data | Prices: $10.12, $99.95, $1234.56 | 3-4x | 10GB → 2.5-3.3GB |
| Sensor Readings | Temperature: 23.5°C, 24.1°C, 23.8°C | 3-4x | 10GB → 2.5-3.3GB |
| Percentages | Values: 0.25, 0.50, 0.75, 0.33 | 4x | 10GB → 2.5GB |
| Coordinates | GPS: -122.4194, 37.7749 | 2.5-3x | 10GB → 3.3-4GB |
| Measurements | Scientific: 9.81, 3.14159, 2.718 | 2-3x | 10GB → 3.3-5GB |
Example - Price Data (10GB):
```sql
CREATE TABLE orders (
    id INT PRIMARY KEY,
    price FLOAT8,   -- ALP: 10GB → ~2.5GB
    quantity INT
) WITH (compression = 'alp');
```
Bad Use Cases
| Use Case | Why It’s Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| Random doubles | No patterns to exploit | 1.0-1.2x | Don’t compress |
| ML weights | Full precision, random distribution | 1.2-1.5x | Consider storing as binary |
| Encrypted data | Appears random | ~1.0x | Don’t compress |
| Already compressed | No further reduction | ~1.0x | Store raw |
Example - Poor Compression:
```text
Input: Random f64 values from rand::random()
Original: 10GB
Compressed: ~9GB (only 10% savings)
Overhead may not be worth it
```
10GB Data Estimates
| Data Pattern | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|
| Financial prices (2 decimals) | 4.0x | 2.5 GB | 7.5 GB (75%) |
| Scientific measurements | 3.0x | 3.3 GB | 6.7 GB (67%) |
| GPS coordinates | 2.5x | 4.0 GB | 6.0 GB (60%) |
| Time-series sensor data | 3.5x | 2.9 GB | 7.1 GB (71%) |
| Random doubles | 1.2x | 8.3 GB | 1.7 GB (17%) |
FSST - Fast Static Symbol Table
Description
FSST (Fast Static Symbol Table) is a lightweight string compression algorithm that encodes common substrings (1-8 bytes) using a symbol table trained on sample data. It provides random access to individual strings.
Technical Characteristics
- Compression Speed: 1-3 GB/sec
- Decompression Speed: 1-3 GB/sec
- Symbol Table Size: ~2-3 KB per column
- Random Access: Yes (individual strings decompressible)
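A toy version of the FSST idea makes the mechanism concrete: frequent 1-8 byte substrings become one-byte codes from a table, and unmatched bytes are escaped. This sketch hardcodes a three-symbol table and uses greedy longest-match; real FSST trains up to 255 symbols on sample data and is heavily optimized, so treat the names below as illustrative only.

```python
SYMBOLS = {0: b"@example.com", 1: b"user", 2: b".org"}       # toy symbol table
LOOKUP = sorted(SYMBOLS.items(), key=lambda kv: -len(kv[1]))  # longest match first
ESCAPE = 255                                                  # prefix for literal bytes

def fsst_encode(s: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(s):
        for code, sym in LOOKUP:
            if s.startswith(sym, i):
                out.append(code)          # one byte replaces up to 8
                i += len(sym)
                break
        else:
            out += bytes([ESCAPE, s[i]])  # escape a literal byte (costs 2 bytes)
            i += 1
    return bytes(out)

def fsst_decode(enc: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(enc):
        if enc[i] == ESCAPE:
            out.append(enc[i + 1]); i += 2
        else:
            out += SYMBOLS[enc[i]]; i += 1
    return bytes(out)

email = b"user42@example.com"
enc = fsst_encode(email)                  # 6 bytes instead of 18
assert fsst_decode(enc) == email
```

Note that each string decodes independently of its neighbors, which is why FSST supports random access, and why data with no common substrings (UUIDs, hashes) gains nothing: every byte pays the escape penalty.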
Good Use Cases
| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Email Addresses | user@example.com patterns | 2.5-3x | 10GB → 3.3-4GB |
| URLs | https://example.com/path patterns | 2-3x | 10GB → 3.3-5GB |
| Log Messages | Repetitive log formats | 2.5-3x | 10GB → 3.3-4GB |
| JSON Records | Structured text patterns | 2-2.5x | 10GB → 4-5GB |
| File Paths | /home/user/docs patterns | 2.5-3x | 10GB → 3.3-4GB |
| Request Logs | GET /api/v1/users patterns | 2.5-3.5x | 10GB → 2.9-4GB |
Example - Email Data (10GB):
```sql
CREATE TABLE users (
    id INT PRIMARY KEY,
    email TEXT,   -- FSST: 10GB → ~3.5GB
    name TEXT
) WITH (compression = 'fsst');
```
Bad Use Cases
| Use Case | Why It’s Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| UUIDs | No substring patterns | 1.1-1.3x | Use binary storage |
| Base64 data | Uniform character distribution | 1.0-1.2x | Store raw |
| Hashes (SHA/MD5) | Random character patterns | ~1.0x | Store raw |
| Encrypted text | No compressible patterns | ~1.0x | Don’t compress |
| Random strings | No common substrings | 1.0-1.2x | Don’t compress |
Example - Poor Compression:
```text
Input: 10GB of UUIDs (550a8400-e29b-41d4-a716-446655440000)
Compressed: ~8.5GB (only 15% savings)
Symbol table overhead may exceed savings
```
10GB Data Estimates
| Data Pattern | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|
| Email addresses (common domains) | 3.0x | 3.3 GB | 6.7 GB (67%) |
| URLs (same site) | 2.5x | 4.0 GB | 6.0 GB (60%) |
| Server logs (structured) | 3.0x | 3.3 GB | 6.7 GB (67%) |
| JSON records | 2.0x | 5.0 GB | 5.0 GB (50%) |
| UUIDs | 1.2x | 8.3 GB | 1.7 GB (17%) |
| Random text | 1.1x | 9.1 GB | 0.9 GB (9%) |
Dictionary Encoding
Description
Dictionary encoding replaces repeated values with compact integer indices into a dictionary of unique values. Ideal for columns with few unique values (low cardinality).
Technical Characteristics
- Max Dictionary Size: 65,536 unique values
- Index Width: 1, 2, or 4 bytes (auto-selected)
- Encoding Speed: Very fast (hash lookup)
- Decoding Speed: Very fast (array index)
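The mechanism is simple enough to sketch directly: each value becomes an integer index into a table of unique values, and the index width follows from the dictionary size, mirroring the 1/2/4-byte auto-selection above. This is an illustrative sketch, not the HeliosDB internals.

```python
def dict_encode(values):
    dictionary, indices = [], []
    positions = {}                       # value -> index (the hash lookup)
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    n = len(dictionary)
    width = 1 if n <= 256 else 2 if n <= 65536 else 4   # bytes per index
    return dictionary, indices, width

def dict_decode(dictionary, indices):
    return [dictionary[i] for i in indices]   # one array index per value

statuses = ["active", "inactive", "active", "pending", "active"]
d, idx, w = dict_encode(statuses)
assert d == ["active", "inactive", "pending"]
assert idx == [0, 1, 0, 2, 0]
assert w == 1                                 # 3 unique values fit in 1 byte
assert dict_decode(d, idx) == statuses
```

With 3 unique statuses, every row costs 1 byte instead of the full string, which is where the 10-50x ratios in the table above come from; once unique values approach the row count, the dictionary itself is as large as the data.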
Good Use Cases
| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Status Fields | active/inactive/pending | 10-50x | 10GB → 200MB-1GB |
| Country Codes | US, UK, DE, FR (~200 values) | 5-10x | 10GB → 1-2GB |
| Category Tags | electronics/clothing/food | 8-20x | 10GB → 500MB-1.25GB |
| Boolean-like | yes/no, true/false | 50-100x | 10GB → 100-200MB |
| Day of Week | Mon, Tue, Wed… (7 values) | 15-30x | 10GB → 333-666MB |
| Enum Fields | Predefined value sets | 10-50x | 10GB → 200MB-1GB |
Example - Status Field (10GB):
```sql
CREATE TABLE orders (
    id INT PRIMARY KEY,
    status TEXT,   -- Dictionary: 10GB → ~200MB (3 unique values)
    product_id INT
) WITH (compression_columns = 'status:dictionary');
```
Bad Use Cases
| Use Case | Why It’s Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| High cardinality | >50% unique values | 0.8-1.5x | Use FSST |
| User IDs | Mostly unique values | ~1.0x | Don’t compress |
| Timestamps | All different | ~1.0x | Use Delta |
| Free-text fields | High uniqueness | ~1.0x | Use FSST |
| >65,536 unique | Exceeds dictionary limit | Fails | Use FSST |
Example - Poor Compression:
```text
Input: 10GB of unique user IDs (user_12345, user_12346, ...)
Dictionary Size: Would exceed 65,536 limit or be nearly 1:1
Recommendation: Use FSST or store uncompressed
```
10GB Data Estimates
| Data Pattern | Unique Values | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|---|
| Status (3 values) | 3 | 50x | 200 MB | 9.8 GB (98%) |
| Country codes | 200 | 8x | 1.25 GB | 8.75 GB (87.5%) |
| Product categories | 500 | 6x | 1.67 GB | 8.33 GB (83.3%) |
| User types | 10 | 20x | 500 MB | 9.5 GB (95%) |
| City names | 10,000 | 4x | 2.5 GB | 7.5 GB (75%) |
| Unique emails | 1,000,000+ | ~1.0x | ~10 GB | ~0 GB (0%) |
RLE - Run-Length Encoding
Description
Run-Length Encoding compresses sequences of repeated values into (value, count) pairs. Extremely effective for sorted columns or data with long runs of identical values.
Technical Characteristics
- Minimum Run Length: 3 (shorter runs stored verbatim)
- Maximum Run Length: 4.2 billion per entry
- Encoding Speed: Fastest (simple counting)
- Decoding Speed: Fastest (simple expansion)
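A minimal sketch of the (value, count) transformation shows both the upside and the failure mode. Production RLE also stores runs shorter than the minimum run length of 3 verbatim, which this sketch omits for brevity.

```python
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1              # extend the current run
        else:
            runs.append([v, 1])           # start a new run
    return [(v, c) for v, c in runs]

def rle_decode(runs):
    out = []
    for v, c in runs:
        out.extend([v] * c)               # simple expansion
    return out

sorted_regions = ["east"] * 4 + ["west"] * 3
runs = rle_encode(sorted_regions)
assert runs == [("east", 4), ("west", 3)]     # 7 values -> 2 runs
assert rle_decode(runs) == sorted_regions

alternating = ["a", "b"] * 3
assert len(rle_encode(alternating)) == 6      # one run PER VALUE: RLE expands data
```

The last assertion is the pathological case described below: alternating values produce one run per value, so the (value, count) overhead makes the output larger than the input.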
Good Use Cases
| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Sorted partition keys | Same value for millions of rows | 100-10,000x | 10GB → 1-100MB |
| Time-bucketed data | Same hour/day for many rows | 50-500x | 10GB → 20-200MB |
| Flag columns (sorted) | 0,0,0,…,1,1,1 | 100-1000x | 10GB → 10-100MB |
| Sparse data | Mostly NULLs or zeros | 50-200x | 10GB → 50-200MB |
| Clustered keys | Same foreign key in batches | 20-100x | 10GB → 100-500MB |
Example - Sorted Partition (10GB):
```sql
-- Data sorted by region (4 regions, 2.5GB each)
CREATE TABLE events (
    region TEXT,   -- RLE: 10GB → ~1MB (only 4 runs!)
    event_time TIMESTAMP,
    data TEXT
) WITH (compression_columns = 'region:rle');
```
Extreme Example:
```text
Input: 10GB of "active" status (all same value)
Runs: 1
Compressed: ~20 bytes (value + count)
Ratio: ~500,000,000x
```
Bad Use Cases
| Use Case | Why It’s Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| Random/unsorted | No consecutive duplicates | 0.5-1.0x | Use Dictionary |
| High cardinality unsorted | Every value different | ~0.5x (worse!) | Don’t use RLE |
| Alternating values | A,B,A,B,A,B… | ~0.3x (worse!) | Use Dictionary |
| UUIDs | All unique | ~0.5x (worse!) | Don’t compress |
Example - Poor Compression:
```text
Input: 10GB of alternating true/false values
Runs: 5 billion (one per value)
Compressed: ~40GB (4x LARGER due to overhead!)
CRITICAL: RLE makes this data WORSE
```
10GB Data Estimates
| Data Pattern | Run Count | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|---|
| Sorted partition (4 values) | 4 | 10,000x+ | ~1 MB | 9.999 GB (99.99%) |
| Hourly buckets (8760/year) | 8,760 | 1,000x | 10 MB | 9.99 GB (99.9%) |
| Daily flags (sorted) | 365 | 5,000x | 2 MB | 9.998 GB (99.98%) |
| Clustered FK (1000 groups) | 1,000 | 500x | 20 MB | 9.98 GB (99.8%) |
| Unsorted random | 5 billion | 0.25x | 40 GB | -30 GB (WORSE) |
Delta Encoding
Description
Delta encoding stores differences between consecutive values instead of absolute values. Uses zigzag + variable-length encoding for compact storage of small deltas.
Technical Characteristics
- Supported Types: INT4, INT8 (32/64-bit integers)
- Encoding: Zigzag encoding for signed deltas
- Storage: Variable-length integers (1-10 bytes per delta)
- Decoding: Sequential (requires reading from start)
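The zigzag + varint pipeline described above can be sketched end to end: deltas are mapped to unsigned integers (so small negatives stay small), then packed into 7-bit groups with a continuation bit. This is an illustrative sketch of the standard technique, not the HeliosDB source.

```python
def zigzag(n: int) -> int:
    return (n << 1) ^ (n >> 63)        # 0,-1,1,-2,... -> 0,1,2,3,...

def unzigzag(u: int) -> int:
    return (u >> 1) ^ -(u & 1)

def varint(u: int) -> bytes:
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        out.append(b | (0x80 if u else 0))   # high bit = "more bytes follow"
        if not u:
            return bytes(out)

def delta_encode(values):
    payload, prev = bytearray(), values[0]
    for v in values[1:]:
        payload += varint(zigzag(v - prev))
        prev = v
    return values[0], bytes(payload)         # (base value, packed deltas)

def delta_decode(base, payload):
    values, cur, i = [base], base, 0
    while i < len(payload):
        u, shift = 0, 0
        while True:                          # read one varint
            b = payload[i]; i += 1
            u |= (b & 0x7F) << shift
            shift += 7
            if not b & 0x80:
                break
        cur += unzigzag(u)                   # sequential: each value needs the last
        values.append(cur)
    return values

base, payload = delta_encode([100, 101, 102, 103])
assert len(payload) == 3                     # one byte per delta of +1
assert delta_decode(base, payload) == [100, 101, 102, 103]
```

Sequential IDs shrink each 8-byte integer to a 1-byte delta (the 8x best case in the table below), while random integers produce large deltas needing 4-5 bytes each, erasing the gain. The `delta_decode` loop also shows why decoding is sequential: every value depends on the running sum.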
Good Use Cases
| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Auto-increment IDs | 1, 2, 3, 4, 5… (delta=1) | 6-8x | 10GB → 1.25-1.67GB |
| Timestamps (ordered) | Regular intervals (delta~1000ms) | 4-8x | 10GB → 1.25-2.5GB |
| Counters | Monotonically increasing | 5-10x | 10GB → 1-2GB |
| Sequence numbers | 100, 101, 102… | 6-8x | 10GB → 1.25-1.67GB |
| Version numbers | 1, 2, 3… with gaps | 3-6x | 10GB → 1.67-3.3GB |
Example - Timestamps (10GB):
```sql
CREATE TABLE events (
    id INT,
    event_time BIGINT,   -- Delta: 10GB → ~1.5GB (uniform intervals)
    data TEXT
) WITH (compression_columns = 'event_time:delta');
```
Optimal Case - Sequential IDs:
```text
Input: 10GB of sequential integers (1, 2, 3, 4, ...)
Base: 1, Deltas: [1, 1, 1, 1, ...]
Each delta = 1 byte (varint encoding)
Compressed: ~1.25GB (8x compression)
```
Bad Use Cases
| Use Case | Why It’s Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| Random integers | Large deltas need more bytes | 0.8-1.2x | Don’t compress |
| Unsorted data | Deltas vary wildly | ~1.0x | Sort first or skip |
| Floating-point | Not supported | N/A | Use ALP |
| Sparse sequences | Large gaps = large deltas | 1.0-1.5x | Use Dictionary |
| Non-sequential | [1000, 5, 999999, 100] | ~1.0x | Don’t use Delta |
Example - Poor Compression:
```text
Input: 10GB of random integers [1000000, 5, 999999, 100, 888888]
Deltas: [-999995, 999994, -999899, 888788]
Each delta = 4-5 bytes (large varints)
Compressed: ~10GB (no savings)
```
10GB Data Estimates
| Data Pattern | Average Delta | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|---|
| Sequential IDs (delta=1) | 1 | 8x | 1.25 GB | 8.75 GB (87.5%) |
| Timestamps (1s intervals) | 1,000 | 5x | 2 GB | 8 GB (80%) |
| Timestamps (1ms intervals) | 1 | 8x | 1.25 GB | 8.75 GB (87.5%) |
| Version numbers (gaps) | ~100 | 4x | 2.5 GB | 7.5 GB (75%) |
| Random integers | varies | 1.0x | 10 GB | 0 GB (0%) |
Codec Selection Guide
Decision Tree
```text
Is your data...

FLOATING-POINT (FLOAT4/FLOAT8)?
├─ Yes → Use ALP
│   └─ Expected: 2-4x compression
└─ No ↓

TEXT/VARCHAR?
├─ Yes → Check cardinality
│   ├─ < 65,536 unique AND > 50% repetition → Use Dictionary (5-20x)
│   ├─ Has substring patterns → Use FSST (2-3x)
│   └─ Random/encrypted → Don't compress
└─ No ↓

INTEGER (INT4/INT8)?
├─ Yes → Check pattern
│   ├─ Sequential/sorted → Use Delta (2-10x)
│   ├─ Sorted with long runs → Use RLE (10-10000x)
│   ├─ Low cardinality → Use Dictionary (5-20x)
│   └─ Random → Don't compress
└─ No ↓

BINARY/BLOB?
└─ Don't compress (already efficient or encrypted)
```
Quick Reference Matrix
| Data Characteristic | Best Codec | Second Choice | Avoid |
|---|---|---|---|
| Financial prices | ALP | - | RLE |
| Status flags (sorted) | RLE | Dictionary | Delta |
| Status flags (unsorted) | Dictionary | - | RLE |
| Email addresses | FSST | Dictionary | RLE |
| Sequential IDs | Delta | RLE (if sorted) | Dictionary |
| Timestamps (ordered) | Delta | - | RLE |
| Country codes | Dictionary | FSST | RLE |
| UUIDs | - (don’t compress) | FSST | RLE, Dictionary |
| Sensor readings | ALP | Delta (if integer) | - |
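The decision tree can be restated as a small helper. The thresholds (the 65,536-entry dictionary limit, >50% repetition) come from this guide; the function itself is illustrative, not a HeliosDB API, and real selection would sample the column rather than scan it.

```python
def suggest_codec(sql_type: str, values) -> str:
    """Pick a codec from the column type and a sample of its values."""
    vals = list(values)
    unique_ratio = len(set(vals)) / len(vals)
    if sql_type in ("FLOAT4", "FLOAT8"):
        return "alp"
    if sql_type in ("TEXT", "VARCHAR"):
        if len(set(vals)) < 65536 and unique_ratio < 0.5:
            return "dictionary"       # low cardinality, high repetition
        return "fsst"                 # fall back to substring compression
    if sql_type in ("INT4", "INT8"):
        if all(b >= a for a, b in zip(vals, vals[1:])):
            return "delta"            # sorted/sequential: small deltas
        if len(set(vals)) < 65536 and unique_ratio < 0.5:
            return "dictionary"
        return "none"                 # random integers: don't compress
    return "none"                     # BINARY/BLOB etc.

assert suggest_codec("FLOAT8", [10.12, 99.95]) == "alp"
assert suggest_codec("TEXT", ["active"] * 4 + ["pending"]) == "dictionary"
assert suggest_codec("INT8", [100, 101, 102]) == "delta"
```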
Performance Comparison
Compression Speed (GB/sec)
| Codec | Encode Speed | Decode Speed | Notes |
|---|---|---|---|
| RLE | 5-10 GB/s | 5-10 GB/s | Fastest, CPU-bound |
| Dictionary | 2-4 GB/s | 4-8 GB/s | Fast hash lookup |
| Delta | 3-6 GB/s | 4-8 GB/s | Simple arithmetic |
| ALP | 1-2 GB/s | 3-5 GB/s | SIMD accelerated |
| FSST | 1-3 GB/s | 1-3 GB/s | Symbol table lookup |
Memory Overhead
| Codec | Per-Column Overhead | Per-Value Overhead |
|---|---|---|
| RLE | 12 bytes (header) | 8 bytes/run |
| Dictionary | Dictionary size + 16 bytes | 1-4 bytes/value |
| Delta | 20 bytes (header) | 1-10 bytes/delta |
| ALP | ~100 bytes (metadata) | Variable |
| FSST | 2-3 KB (symbol table) | Variable |
SQL Configuration
CREATE TABLE WITH Clause
```sql
-- Single codec for entire table
CREATE TABLE measurements (
    id INT PRIMARY KEY,
    value FLOAT8,
    label TEXT
) WITH (compression = 'auto');

-- Per-column codec specification
CREATE TABLE events (
    id INT,
    status TEXT,
    event_time BIGINT,
    temperature FLOAT8
) WITH (
    compression = 'auto',
    compression_level = 6,
    compression_columns = 'status:dictionary,event_time:delta,temperature:alp'
);
```
ALTER TABLE Configuration
```sql
-- Enable/disable compression
ALTER TABLE events SET COMPRESSION = 'auto';
ALTER TABLE events SET COMPRESSION = 'none';

-- Set compression level (1-9)
ALTER TABLE events SET COMPRESSION_LEVEL = 9;

-- Configure per-column
ALTER TABLE events SET COMPRESSION_COLUMN status = 'dictionary';
ALTER TABLE events SET COMPRESSION_COLUMN event_time = 'delta';
```
Monitor Compression Statistics
```sql
-- Overall compression stats
SELECT * FROM heliosdb_compression_stats;

-- Pattern analysis
SELECT * FROM heliosdb_pattern_stats;

-- Recent compression events
SELECT * FROM heliosdb_compression_events;

-- Current configuration
SELECT * FROM heliosdb_config WHERE setting LIKE 'compression%';
```
Summary: 10GB Compression Estimates
| Codec | Best Case | Typical Case | Worst Case |
|---|---|---|---|
| ALP | 4x (2.5 GB) | 3x (3.3 GB) | 1.2x (8.3 GB) |
| FSST | 3x (3.3 GB) | 2.5x (4 GB) | 1.1x (9.1 GB) |
| Dictionary | 50x (200 MB) | 8x (1.25 GB) | 1x (10 GB) |
| RLE | 10000x+ (1 MB) | 100x (100 MB) | 0.5x (20 GB)* |
| Delta | 8x (1.25 GB) | 5x (2 GB) | 1x (10 GB) |
*RLE can make data larger if used incorrectly!
Last Updated: 2026-01-16 Version: HeliosDB Nano v3.0.1