Compression Codecs User Guide

HeliosDB-Lite v3.0.1 - Complete Compression Reference

This guide provides detailed information about each compression codec in HeliosDB-Lite, including optimal use cases, data characteristics, and estimated compression ratios for 10GB data scenarios.

Overview
ALP - Adaptive Lossless floating-Point
FSST - Fast Static Symbol Table
Dictionary Encoding
RLE - Run-Length Encoding
Delta Encoding
Codec Selection Guide
Performance Comparison
SQL Configuration

Overview

HeliosDB-Lite includes five specialized compression codecs, each optimized for different data patterns:

Codec	Target Data	Typical Ratio	Speed
ALP	Floating-point numbers	2-4x	Very Fast
FSST	Strings with patterns	2-3x	Fast
Dictionary	Low-cardinality columns	5-20x	Very Fast
RLE	Repetitive/sorted data	10-100x	Fastest
Delta	Sequential numbers	2-10x	Very Fast

ALP - Adaptive Lossless floating-Point

Description

ALP (Adaptive Lossless floating-Point) is a state-of-the-art compression algorithm for IEEE 754 floating-point data. Based on ACM SIGMOD 2024 research, it automatically adapts between two strategies:

ALP Classic: For decimal-origin data (financial, percentages, measurements)
ALP-RD: For high-precision floats (scientific, ML weights)

Technical Characteristics

Encoding Speed: ~0.5 doubles per CPU cycle
Decoding Speed: ~2.6 doubles per CPU cycle
Compression: 100% lossless (zero precision loss)
Block Size: 1024 values (optimized for CPU cache)
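
The idea behind ALP Classic can be illustrated with a simplified sketch: doubles that originated as decimals become exact integers when scaled by a power of ten, and integers are much easier to pack tightly. The function names, the exponent search, and the `max_exponent` limit below are illustrative assumptions, not the HeliosDB-Lite implementation, which also handles exceptions and bit-packing.

```python
def alp_style_encode(values, max_exponent=10):
    """Find the smallest power of ten that turns every value into an integer
    that decodes back to exactly the same double. Returns (exponent, ints),
    or None when no exponent up to max_exponent round-trips losslessly."""
    for exponent in range(max_exponent + 1):
        factor = 10 ** exponent
        ints = [round(v * factor) for v in values]
        # Lossless check: decoding must reproduce every input bit-for-bit.
        if all(i / factor == v for i, v in zip(ints, values)):
            return exponent, ints
    return None


def alp_style_decode(exponent, ints):
    """Reverse the scaling applied by alp_style_encode."""
    factor = 10 ** exponent
    return [i / factor for i in ints]


prices = [10.12, 99.95, 1234.56]
exponent, ints = alp_style_encode(prices)
restored = alp_style_decode(exponent, ints)
```

Prices with two decimals map to small integers, while a value such as 1/3 has no short decimal form and falls outside this scheme, which is why high-precision data needs a different strategy.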

Good Use Cases

Use Case	Description	Expected Ratio	10GB Scenario
Financial Data	Prices: $10.12, $99.95, $1234.56	3-4x	10GB → 2.5-3.3GB
Sensor Readings	Temperature: 23.5°C, 24.1°C, 23.8°C	3-4x	10GB → 2.5-3.3GB
Percentages	Values: 0.25, 0.50, 0.75, 0.33	4x	10GB → 2.5GB
Coordinates	GPS: -122.4194, 37.7749	2.5-3x	10GB → 3.3-4GB
Measurements	Scientific: 9.81, 3.14159, 2.718	2-3x	10GB → 3.3-5GB

Example - Price Data (10GB):

CREATE TABLE orders (
    id INT PRIMARY KEY,
    price FLOAT8,        -- ALP: 10GB → ~2.5GB
    quantity INT
) WITH (compression = 'alp');

Bad Use Cases

Use Case	Why It’s Bad	Expected Ratio	Recommendation
Random doubles	No patterns to exploit	1.0-1.2x	Don’t compress
ML weights	Full precision, random distribution	1.2-1.5x	Consider storing as binary
Encrypted data	Appears random	~1.0x	Don’t compress
Already compressed	No further reduction	~1.0x	Store raw

Example - Poor Compression:

Input: Random f64 values from rand::random()
Original: 10GB
Compressed: ~9GB (only 10% savings)
Overhead may not be worth it

10GB Data Estimates

Data Pattern	Compression Ratio	Compressed Size	Space Saved
Financial prices (2 decimals)	4.0x	2.5 GB	7.5 GB (75%)
Scientific measurements	3.0x	3.3 GB	6.7 GB (67%)
GPS coordinates	2.5x	4.0 GB	6.0 GB (60%)
Time-series sensor data	3.5x	2.9 GB	7.1 GB (71%)
Random doubles	1.2x	8.3 GB	1.7 GB (17%)

FSST - Fast Static Symbol Table

Description

FSST (Fast Static Symbol Table) is a lightweight string compression algorithm that encodes common substrings (1-8 bytes) using a symbol table trained on sample data. It provides random access to individual strings.

Technical Characteristics

Compression Speed: 1-3 GB/sec
Decompression Speed: 1-3 GB/sec
Symbol Table Size: ~2-3 KB per column
Random Access: Yes (individual strings decompressible)
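
A toy version of symbol-table string compression might look like the following. It assumes a hand-written symbol table, a greedy longest-match encoder, and code 255 as an escape for bytes that match no symbol; real FSST trains its table on sample data, and none of these names come from HeliosDB-Lite.

```python
ESCAPE = 255  # assumed escape code: the next byte is a literal


def fsst_style_encode(data: bytes, symbols: list[bytes]) -> bytes:
    """Replace each matched symbol with its 1-byte code (longest match first)."""
    by_length = sorted(range(len(symbols)), key=lambda c: -len(symbols[c]))
    out = bytearray()
    pos = 0
    while pos < len(data):
        for code in by_length:
            if data.startswith(symbols[code], pos):
                out.append(code)
                pos += len(symbols[code])
                break
        else:
            # No symbol matches here: emit escape + the literal byte.
            out += bytes([ESCAPE, data[pos]])
            pos += 1
    return bytes(out)


def fsst_style_decode(encoded: bytes, symbols: list[bytes]) -> bytes:
    """Expand codes back to symbols; each string decodes independently."""
    out = bytearray()
    pos = 0
    while pos < len(encoded):
        code = encoded[pos]
        if code == ESCAPE:
            out.append(encoded[pos + 1])
            pos += 2
        else:
            out += symbols[code]
            pos += 1
    return bytes(out)


symbols = [b"@example", b".com", b"user"]  # hypothetical trained table
email = b"user42@example.com"
encoded = fsst_style_encode(email, symbols)
```

Because every string is encoded against the same static table, a single value can be decoded without touching its neighbors, which is what makes random access possible. Unmatched bytes cost two bytes each, which is why random strings can grow slightly.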

Good Use Cases

Use Case	Description	Expected Ratio	10GB Scenario
Email Addresses	user@example.com patterns	2.5-3x	10GB → 3.3-4GB
URLs	https://example.com/path patterns	2-3x	10GB → 3.3-5GB
Log Messages	Repetitive log formats	2.5-3x	10GB → 3.3-4GB
JSON Records	Structured text patterns	2-2.5x	10GB → 4-5GB
File Paths	/home/user/docs patterns	2.5-3x	10GB → 3.3-4GB
Request Logs	GET /api/v1/users patterns	2.5-3.5x	10GB → 2.9-4GB

Example - Email Data (10GB):

CREATE TABLE users (
    id INT PRIMARY KEY,
    email TEXT,          -- FSST: 10GB → ~3.5GB
    name TEXT
) WITH (compression = 'fsst');

Bad Use Cases

Use Case	Why It’s Bad	Expected Ratio	Recommendation
UUIDs	No substring patterns	1.1-1.3x	Use binary storage
Base64 data	Uniform character distribution	1.0-1.2x	Store raw
Hashes (SHA/MD5)	Random character patterns	~1.0x	Store raw
Encrypted text	No compressible patterns	~1.0x	Don’t compress
Random strings	No common substrings	1.0-1.2x	Don’t compress

Example - Poor Compression:

Input: 10GB of UUIDs (550a8400-e29b-41d4-a716-446655440000)
Compressed: ~8.5GB (only 15% savings)
Symbol table overhead may exceed savings

10GB Data Estimates

Data Pattern	Compression Ratio	Compressed Size	Space Saved
Email addresses (common domains)	3.0x	3.3 GB	6.7 GB (67%)
URLs (same site)	2.5x	4.0 GB	6.0 GB (60%)
Server logs (structured)	3.0x	3.3 GB	6.7 GB (67%)
JSON records	2.0x	5.0 GB	5.0 GB (50%)
UUIDs	1.2x	8.3 GB	1.7 GB (17%)
Random text	1.1x	9.1 GB	0.9 GB (9%)

Dictionary Encoding

Description

Dictionary encoding replaces repeated values with compact integer indices into a dictionary of unique values. It is ideal for columns with few unique values (low cardinality).

Technical Characteristics

Max Dictionary Size: 65,536 unique values
Index Width: 1, 2, or 4 bytes (auto-selected)
Encoding Speed: Very fast (hash lookup)
Decoding Speed: Very fast (array index)
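
As a minimal sketch of the technique, the following builds a dictionary with a hash lookup on encode and uses plain array indexing on decode. The function names are hypothetical, and the width thresholds (256 and 65,536 entries) are assumptions derived from what 1-byte and 2-byte indices can address; the guide does not state when the 4-byte width is chosen, so that case is not modeled.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) for a column of hashable values."""
    dictionary = []   # unique values in first-seen order
    positions = {}    # value -> index (the hash lookup used while encoding)
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Decoding is a plain array index per value."""
    return [dictionary[i] for i in indices]


def index_width_bytes(unique_count):
    """Assumed width selection: smallest index that can address every entry."""
    if unique_count <= 256:
        return 1
    if unique_count <= 65_536:
        return 2
    raise ValueError("more unique values than the documented 65,536 limit")


statuses = ["active", "pending", "active", "inactive", "active"]
dictionary, indices = dictionary_encode(statuses)
```

With three unique statuses, every row shrinks to a 1-byte index plus one shared copy of each string, which is where the large ratios for low-cardinality columns come from.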

Good Use Cases

Use Case	Description	Expected Ratio	10GB Scenario
State Fields	active/inactive/pending	10-50x	10GB → 200MB-1GB
Country Codes	US, UK, DE, FR (~200 values)	5-10x	10GB → 1-2GB
Category Tags	electronics/clothing/food	8-20x	10GB → 500MB-1.25GB
Boolean-like	yes/no, true/false	50-100x	10GB → 100-200MB
Day of Week	Mon, Tue, Wed… (7 values)	15-30x	10GB → 333-666MB
Enum Fields	Predefined value sets	10-50x	10GB → 200MB-1GB

Example - State Field (10GB):

CREATE TABLE orders (
    id INT PRIMARY KEY,
    status TEXT,         -- Dictionary: 10GB → ~200MB (3 unique values)
    product_id INT
) WITH (compression_columns = 'status:dictionary');

Bad Use Cases

Use Case	Why It’s Bad	Expected Ratio	Recommendation
High cardinality	>50% unique values	0.8-1.5x	Use FSST
User IDs	Mostly unique values	~1.0x	Don’t compress
Timestamps	All different	~1.0x	Use Delta
Free-text fields	High uniqueness	~1.0x	Use FSST
>65,536 unique	Exceeds dictionary limit	Fails	Use FSST

Example - Poor Compression:

Input: 10GB of unique user IDs (user_12345, user_12346, ...)
Dictionary Size: Would exceed 65,536 limit or be nearly 1:1
Recommendation: Use FSST or store uncompressed

10GB Data Estimates

Data Pattern	Unique Values	Compression Ratio	Compressed Size	Space Saved
State (3 values)	3	50x	200 MB	9.8 GB (98%)
Country codes	200	8x	1.25 GB	8.75 GB (87.5%)
Product categories	500	6x	1.67 GB	8.33 GB (83.3%)
User types	10	20x	500 MB	9.5 GB (95%)
City names	10,000	4x	2.5 GB	7.5 GB (75%)
Unique emails	1,000,000+	~1.0x	~10 GB	~0 GB (0%)

RLE - Run-Length Encoding

Description

Run-Length Encoding compresses sequences of repeated values into (value, count) pairs. It is extremely effective for sorted columns or data with long runs of identical values.

Technical Characteristics

Minimum Run Length: 3 (shorter runs stored verbatim)
Maximum Run Length: 4.2 billion per entry
Encoding Speed: Fastest (simple counting)
Decoding Speed: Fastest (simple expansion)
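
The (value, count) representation can be sketched in a few lines. This is a simplified illustration with hypothetical function names; it does not model the minimum run length of 3 or the per-entry count limit listed above.

```python
def rle_encode(values):
    """Collapse consecutive duplicates into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((value, 1))              # start a new run
    return runs


def rle_decode(runs):
    """Expand each (value, count) pair back into repeated values."""
    return [value for value, count in runs for _ in range(count)]


sorted_flags = [0, 0, 0, 1, 1, 1, 1]
alternating = [0, 1, 0, 1, 0, 1]
```

Sorted input collapses to one pair per distinct value, while alternating input produces one pair per value, so the encoded form carries more entries than a sorted column of the same length.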

Good Use Cases

Use Case	Description	Expected Ratio	10GB Scenario
Sorted partition keys	Same value for millions of rows	100-10,000x	10GB → 1-100MB
Time-bucketed data	Same hour/day for many rows	50-500x	10GB → 20-200MB
Flag columns (sorted)	0,0,0,…,1,1,1	100-1000x	10GB → 10-100MB
Sparse data	Mostly NULLs or zeros	50-200x	10GB → 50-200MB
Clustered keys	Same foreign key in batches	20-100x	10GB → 100-500MB

Example - Sorted Partition (10GB):

-- Data sorted by region (4 regions, 2.5GB each)
CREATE TABLE events (
    region TEXT,         -- RLE: 10GB → ~1MB (only 4 runs!)
    event_time TIMESTAMP,
    data TEXT
) WITH (compression_columns = 'region:rle');

Extreme Example:

Input: 10GB of "active" status (all same value)
Runs: 1
Compressed: ~20 bytes (value + count)
Ratio: ~500,000,000x

Bad Use Cases

Use Case	Why It’s Bad	Expected Ratio	Recommendation
Random/unsorted	No consecutive duplicates	0.5-1.0x	Use Dictionary
High cardinality unsorted	Every value different	~0.5x (worse!)	Don’t use RLE
Alternating values	A,B,A,B,A,B…	~0.3x (worse!)	Use Dictionary
UUIDs	All unique	~0.5x (worse!)	Don’t compress

Example - Poor Compression:

Input: 10GB of alternating true/false values
Runs: 5 billion (one per value)
Compressed: ~40GB (4x LARGER due to overhead!)
CRITICAL: RLE makes this data WORSE
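
The arithmetic behind both the extreme example and the poor-compression example can be reproduced from the overhead figures in the Memory Overhead table (a 12-byte header plus 8 bytes per run). The helper name is hypothetical, and treating those two figures as the complete on-disk cost is a simplifying assumption.

```python
RLE_HEADER_BYTES = 12   # per-column header, from the Memory Overhead table
RLE_BYTES_PER_RUN = 8   # per-run cost, from the Memory Overhead table


def rle_size_bytes(run_count):
    """Estimated encoded size when every run is stored as one entry."""
    return RLE_HEADER_BYTES + RLE_BYTES_PER_RUN * run_count


single_run = rle_size_bytes(1)                    # one value repeated throughout
alternating = rle_size_bytes(5_000_000_000)       # one run per value
```

One run costs 20 bytes regardless of how many rows it covers, whereas 5 billion runs cost about 40 GB, four times the 10GB input.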

10GB Data Estimates

Data Pattern	Run Count	Compression Ratio	Compressed Size	Space Saved
Sorted partition (4 values)	4	10,000x+	~1 MB	9.999 GB (99.99%)
Hourly buckets (8760/year)	8,760	1,000x	10 MB	9.99 GB (99.9%)
Daily flags (sorted)	365	5,000x	2 MB	9.998 GB (99.98%)
Clustered FK (1000 groups)	1,000	500x	20 MB	9.98 GB (99.8%)
Unsorted random	5 billion	0.5x	20 GB	-10 GB (WORSE)

Delta Encoding

Description

Delta encoding stores differences between consecutive values instead of absolute values. It applies zigzag encoding to the signed deltas and then writes them as variable-length integers, so small deltas take only a byte or two.

Technical Characteristics

Supported Types: INT4, INT8 (32/64-bit integers)
Encoding: Zigzag encoding for signed deltas
Storage: Variable-length integers (1-10 bytes per delta)
Decoding: Sequential (requires reading from start)
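
A minimal sketch of delta + zigzag + varint encoding follows. It assumes values fit in signed 64-bit integers and stores the first value as a delta from zero; the function names and byte layout are illustrative rather than the HeliosDB-Lite on-disk format.

```python
def zigzag(n):
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) ^ (n >> 63)


def unzigzag(z):
    return (z >> 1) ^ -(z & 1)


def write_varint(u, out):
    """Append u using 7 payload bits per byte; the high bit marks continuation."""
    while u >= 0x80:
        out.append((u & 0x7F) | 0x80)
        u >>= 7
    out.append(u)


def read_varint(buf, pos):
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def delta_encode(values):
    out = bytearray()
    previous = 0
    for value in values:
        write_varint(zigzag(value - previous), out)
        previous = value
    return bytes(out)


def delta_decode(buf):
    values, previous, pos = [], 0, 0
    while pos < len(buf):          # sequential: each value depends on the last
        z, pos = read_varint(buf, pos)
        previous += unzigzag(z)
        values.append(previous)
    return values


ids = [100, 101, 102, 103]
encoded = delta_encode(ids)
```

Each delta of 1 occupies a single byte, while the decoder has to walk the buffer from the start because every value is reconstructed from its predecessor.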

Good Use Cases

Use Case	Description	Expected Ratio	10GB Scenario
Auto-increment IDs	1, 2, 3, 4, 5… (delta=1)	6-8x	10GB → 1.25-1.67GB
Timestamps (ordered)	Regular intervals (delta~1000ms)	4-8x	10GB → 1.25-2.5GB
Counters	Monotonically increasing	5-10x	10GB → 1-2GB
Sequence numbers	100, 101, 102…	6-8x	10GB → 1.25-1.67GB
Version numbers	1, 2, 3… with gaps	3-6x	10GB → 1.67-3.3GB

Example - Timestamps (10GB):

CREATE TABLE events (
    id INT,
    event_time BIGINT,   -- Delta: 10GB → ~1.5GB (uniform intervals)
    data TEXT
) WITH (compression_columns = 'event_time:delta');

Optimal Case - Sequential IDs:

Input: 10GB of sequential 64-bit integers (1, 2, 3, 4, ...)
Base: 1, Deltas: [1, 1, 1, 1, ...]
Each delta = 1 byte (varint encoding)
Compressed: ~1.25GB (8x compression)

Bad Use Cases

Use Case	Why It’s Bad	Expected Ratio	Recommendation
Random integers	Large deltas need more bytes	0.8-1.2x	Don’t compress
Unsorted data	Deltas vary wildly	~1.0x	Sort first or skip
Floating-point	Not supported	N/A	Use ALP
Sparse sequences	Large gaps = large deltas	1.0-1.5x	Use Dictionary
Non-sequential	[1000, 5, 999999, 100]	~1.0x	Don’t use Delta

Example - Poor Compression:

Input: 10GB of random integers [1000000, 5, 999999, 100, 888888]
Deltas: [-999995, 999994, -999899, 888788]
Each delta = 3 bytes here as a zigzag varint; deltas spanning the full 32-bit range need 5 bytes
Compressed: ~10GB (no savings)

10GB Data Estimates

Data Pattern	Average Delta	Compression Ratio	Compressed Size	Space Saved
Sequential IDs (delta=1)	1	8x	1.25 GB	8.75 GB (87.5%)
Timestamps (1s intervals)	1,000	5x	2 GB	8 GB (80%)
Timestamps (1ms intervals)	1	8x	1.25 GB	8.75 GB (87.5%)
Version numbers (gaps)	~100	4x	2.5 GB	7.5 GB (75%)
Random integers	varies	1.0x	10 GB	0 GB (0%)

Codec Selection Guide

Decision Tree

Is your data...

FLOATING-POINT (FLOAT4/FLOAT8)?
├─ Yes → Use ALP
│   └─ Expected: 2-4x compression
└─ No ↓

TEXT/VARCHAR?
├─ Yes → Check cardinality
│   ├─ < 65,536 unique AND > 50% repetition → Use Dictionary (5-20x)
│   ├─ Has substring patterns → Use FSST (2-3x)
│   └─ Random/encrypted → Don't compress
└─ No ↓

INTEGER (INT4/INT8)?
├─ Yes → Check pattern
│   ├─ Sequential/sorted → Use Delta (2-10x)
│   ├─ Sorted with long runs → Use RLE (10-10000x)
│   ├─ Low cardinality → Use Dictionary (5-20x)
│   └─ Random → Don't compress
└─ No ↓

BINARY/BLOB?
└─ Don't compress (already efficient or encrypted)
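
The decision tree can be expressed as a small function. This is a hypothetical helper for illustration, not part of HeliosDB-Lite; the parameter names and the `pattern` labels are assumptions, and the caller is expected to have profiled the column already.

```python
MAX_DICTIONARY_ENTRIES = 65_536  # documented dictionary limit


def choose_codec(sql_type, *, unique_values=None, repetition=0.0,
                 has_substring_patterns=False, pattern=None):
    """Walk the decision tree and return a codec name, or "none".

    repetition: fraction of rows whose value repeats an earlier one (0.0-1.0).
    pattern: for integers, one of "sequential", "long_runs",
             "low_cardinality", or None for random data.
    """
    kind = sql_type.upper()
    if kind in ("FLOAT4", "FLOAT8"):
        return "alp"
    if kind in ("TEXT", "VARCHAR"):
        if (unique_values is not None
                and unique_values < MAX_DICTIONARY_ENTRIES
                and repetition > 0.5):
            return "dictionary"
        if has_substring_patterns:
            return "fsst"
        return "none"
    if kind in ("INT4", "INT8"):
        by_pattern = {
            "sequential": "delta",
            "long_runs": "rle",
            "low_cardinality": "dictionary",
        }
        return by_pattern.get(pattern, "none")
    return "none"  # BINARY/BLOB and anything else: leave uncompressed


status_codec = choose_codec("TEXT", unique_values=3, repetition=0.99)
```

The returned name could then be placed in a `compression_columns` entry such as `status:dictionary`.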

Quick Reference Matrix

Data Characteristic	Best Codec	Second Choice	Avoid
Financial prices	ALP	-	RLE
State flags (sorted)	RLE	Dictionary	Delta
State flags (unsorted)	Dictionary	-	RLE
Email addresses	FSST	Dictionary	RLE
Sequential IDs	Delta	-	RLE, Dictionary
Timestamps (ordered)	Delta	-	RLE
Country codes	Dictionary	FSST	RLE
UUIDs	- (don’t compress)	FSST	RLE, Dictionary
Sensor readings	ALP	Delta (if integer)	-

Performance Comparison

Compression Speed (GB/sec)

Codec	Encode Speed	Decode Speed	Notes
RLE	5-10 GB/s	5-10 GB/s	Fastest, CPU-bound
Dictionary	2-4 GB/s	4-8 GB/s	Fast hash lookup
Delta	3-6 GB/s	4-8 GB/s	Simple arithmetic
ALP	1-2 GB/s	3-5 GB/s	SIMD accelerated
FSST	1-3 GB/s	1-3 GB/s	Symbol table lookup

Memory Overhead

Codec	Per-Column Overhead	Per-Value Overhead
RLE	12 bytes (header)	8 bytes/run
Dictionary	Dictionary size + 16 bytes	1-4 bytes/value
Delta	20 bytes (header)	1-10 bytes/delta
ALP	~100 bytes (metadata)	Variable
FSST	2-3 KB (symbol table)	Variable

SQL Configuration

CREATE TABLE WITH Clause

-- Table-wide compression setting
CREATE TABLE measurements (
    id INT PRIMARY KEY,
    value FLOAT8,
    label TEXT
) WITH (compression = 'auto');

-- Per-column codec specification
CREATE TABLE events (
    id INT,
    status TEXT,
    event_time BIGINT,
    temperature FLOAT8
) WITH (
    compression = 'auto',
    compression_level = 6,
    compression_columns = 'status:dictionary,event_time:delta,temperature:alp'
);

ALTER TABLE Configuration

-- Enable/disable compression
ALTER TABLE events SET COMPRESSION = 'auto';
ALTER TABLE events SET COMPRESSION = 'none';

-- Set compression level (1-9)
ALTER TABLE events SET COMPRESSION_LEVEL = 9;

-- Configure per-column
ALTER TABLE events SET COMPRESSION_COLUMN status = 'dictionary';
ALTER TABLE events SET COMPRESSION_COLUMN event_time = 'delta';

Monitor Compression Statistics

-- Overall compression stats
SELECT * FROM heliosdb_compression_stats;

-- Pattern analysis
SELECT * FROM heliosdb_pattern_stats;

-- Recent compression events
SELECT * FROM heliosdb_compression_events;

-- Current configuration
SELECT * FROM heliosdb_config WHERE setting LIKE 'compression%';

Summary: 10GB Compression Estimates

Codec	Best Case	Typical Case	Worst Case
ALP	4x (2.5 GB)	3x (3.3 GB)	1.2x (8.3 GB)
FSST	3x (3.3 GB)	2.5x (4 GB)	1.1x (9.1 GB)
Dictionary	50x (200 MB)	8x (1.25 GB)	1x (10 GB)
RLE	10000x+ (1 MB)	100x (100 MB)	0.5x (20 GB)*
Delta	8x (1.25 GB)	5x (2 GB)	1x (10 GB)

*RLE can make data larger if used incorrectly!
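
Every size and savings figure in this guide follows from the compression ratio alone. A small helper (hypothetical, for checking estimates against a different data volume) makes the arithmetic explicit:

```python
def estimate(original_gb, ratio):
    """Return (compressed_gb, saved_gb, saved_percent) for a given ratio."""
    compressed = original_gb / ratio
    saved = original_gb - compressed
    return compressed, saved, 100.0 * saved / original_gb


alp_best = estimate(10, 4.0)    # ALP best case
rle_worst = estimate(10, 0.5)   # RLE worst case: ratio below 1 means growth
```

A ratio below 1.0 yields a negative saving, which is how the RLE worst case ends up at 20 GB for 10 GB of input.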
