Compression Codecs User Guide

HeliosDB Nano v3.0.1 - Complete Compression Reference

This guide provides detailed information about each compression codec in HeliosDB Nano, including optimal use cases, data characteristics, and estimated compression ratios for 10GB data scenarios.


Table of Contents

  1. Overview
  2. ALP - Adaptive Lossless Floating-Point
  3. FSST - Fast Static Symbol Table
  4. Dictionary Encoding
  5. RLE - Run-Length Encoding
  6. Delta Encoding
  7. Codec Selection Guide
  8. Performance Comparison
  9. SQL Configuration

Overview

HeliosDB Nano includes five specialized compression codecs, each optimized for different data patterns:

Codec      | Target Data             | Typical Ratio | Speed
ALP        | Floating-point numbers  | 2-4x          | Very Fast
FSST       | Strings with patterns   | 2-3x          | Fast
Dictionary | Low-cardinality columns | 5-20x         | Very Fast
RLE        | Repetitive/sorted data  | 10-100x       | Fastest
Delta      | Sequential numbers      | 2-10x         | Very Fast

ALP - Adaptive Lossless Floating-Point

Description

ALP (Adaptive Lossless Floating-Point) is a state-of-the-art compression algorithm for IEEE 754 floating-point data. Based on research published at ACM SIGMOD 2024, it automatically adapts between two strategies:

  • ALP Classic: For decimal-origin data (financial, percentages, measurements)
  • ALP-RD: For high-precision floats (scientific, ML weights)

Technical Characteristics

  • Encoding Speed: ~0.5 doubles per CPU cycle
  • Decoding Speed: ~2.6 doubles per CPU cycle
  • Compression: 100% lossless (zero precision loss)
  • Block Size: 1024 values (optimized for CPU cache)
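
The core trick behind ALP Classic can be sketched in a few lines of Python. This is a simplified illustration, not HeliosDB's actual implementation: decimal-origin doubles such as 10.12 are really integers scaled by a power of ten, so scaling and rounding recovers a small integer, and a round-trip check guarantees the encoding is lossless before it is used.

```python
def alp_classic_encode(values, max_exp=10):
    # Try successive powers of ten until every value survives a
    # round-trip exactly; decimal-origin data succeeds at a small exponent.
    for e in range(max_exp + 1):
        scale = 10.0 ** e
        ints = [round(v * scale) for v in values]
        # Losslessness check: decoding must reproduce every input bit-for-bit.
        if all(i / scale == v for i, v in zip(ints, values)):
            return e, ints  # store small integers plus one exponent
    return None  # no decimal structure: fall back to ALP-RD or raw storage

def alp_classic_decode(e, ints):
    scale = 10.0 ** e
    return [i / scale for i in ints]
```

The resulting small integers compress far better than raw 8-byte doubles (e.g. with bit-packing or delta encoding), which is where the 2-4x ratios come from.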

Good Use Cases

Use Case        | Description                         | Expected Ratio | 10GB Scenario
Financial Data  | Prices: $10.12, $99.95, $1234.56    | 3-4x           | 10GB → 2.5-3.3GB
Sensor Readings | Temperature: 23.5°C, 24.1°C, 23.8°C | 3-4x           | 10GB → 2.5-3.3GB
Percentages     | Values: 0.25, 0.50, 0.75, 0.33      | 4x             | 10GB → 2.5GB
Coordinates     | GPS: -122.4194, 37.7749             | 2.5-3x         | 10GB → 3.3-4GB
Measurements    | Scientific: 9.81, 3.14159, 2.718    | 2-3x           | 10GB → 3.3-5GB

Example - Price Data (10GB):

CREATE TABLE orders (
    id INT PRIMARY KEY,
    price FLOAT8,  -- ALP: 10GB → ~2.5GB
    quantity INT
) WITH (compression = 'alp');

Bad Use Cases

Use Case           | Why It’s Bad                        | Expected Ratio | Recommendation
Random doubles     | No patterns to exploit              | 1.0-1.2x       | Don’t compress
ML weights         | Full precision, random distribution | 1.2-1.5x       | Consider storing as binary
Encrypted data     | Appears random                      | ~1.0x          | Don’t compress
Already compressed | No further reduction                | ~1.0x          | Store raw

Example - Poor Compression:

Input: Random f64 values from rand::random()
Original: 10GB
Compressed: ~9GB (only 10% savings)
Overhead may not be worth it

10GB Data Estimates

Data Pattern                  | Compression Ratio | Compressed Size | Space Saved
Financial prices (2 decimals) | 4.0x              | 2.5 GB          | 7.5 GB (75%)
Scientific measurements       | 3.0x              | 3.3 GB          | 6.7 GB (67%)
GPS coordinates               | 2.5x              | 4.0 GB          | 6.0 GB (60%)
Time-series sensor data       | 3.5x              | 2.9 GB          | 7.1 GB (71%)
Random doubles                | 1.2x              | 8.3 GB          | 1.7 GB (17%)

FSST - Fast Static Symbol Table

Description

FSST (Fast Static Symbol Table) is a lightweight string compression algorithm that encodes common substrings (1-8 bytes) using a symbol table trained on sample data. It provides random access to individual strings.
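
The symbol-table idea can be illustrated with a toy Python codec. This is a deliberately simplified sketch, not the real FSST format or training algorithm: a table of frequent substrings maps to one-byte codes, and byte 0xFF escapes any literal byte that matches no symbol.

```python
from collections import Counter

def build_table(samples, max_symbols=254):
    # Toy training: count 2-8 byte substrings in the sample,
    # weighted by the bytes each occurrence would cover.
    counts = Counter()
    for s in samples:
        for n in range(2, 9):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += n
    return [sym for sym, _ in counts.most_common(max_symbols)]

def fsst_encode(s, table):
    out, i = bytearray(), 0
    while i < len(s):
        for code, sym in enumerate(table):   # greedy: first (highest-weight) match
            if s.startswith(sym, i):
                out.append(code)
                i += len(sym)
                break
        else:
            out += bytes([0xFF]) + s[i:i + 1]  # escaped literal byte
            i += 1
    return bytes(out)

def fsst_decode(data, table):
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == 0xFF:
            out.append(data[i + 1])
            i += 2
        else:
            out += table[data[i]]
            i += 1
    return bytes(out)
```

Because each encoded string is a self-contained byte sequence, any single string can be decoded without touching its neighbors; that is the random-access property noted above.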

Technical Characteristics

  • Compression Speed: 1-3 GB/sec
  • Decompression Speed: 1-3 GB/sec
  • Symbol Table Size: ~2-3 KB per column
  • Random Access: Yes (individual strings decompressible)

Good Use Cases

Use Case        | Description                       | Expected Ratio | 10GB Scenario
Email Addresses | user@example.com patterns         | 2.5-3x         | 10GB → 3.3-4GB
URLs            | https://example.com/path patterns | 2-3x           | 10GB → 3.3-5GB
Log Messages    | Repetitive log formats            | 2.5-3x         | 10GB → 3.3-4GB
JSON Records    | Structured text patterns          | 2-2.5x         | 10GB → 4-5GB
File Paths      | /home/user/docs patterns          | 2.5-3x         | 10GB → 3.3-4GB
Request Logs    | GET /api/v1/users patterns        | 2.5-3.5x       | 10GB → 2.9-4GB

Example - Email Data (10GB):

CREATE TABLE users (
    id INT PRIMARY KEY,
    email TEXT,  -- FSST: 10GB → ~3.5GB
    name TEXT
) WITH (compression = 'fsst');

Bad Use Cases

Use Case         | Why It’s Bad                   | Expected Ratio | Recommendation
UUIDs            | No substring patterns          | 1.1-1.3x       | Use binary storage
Base64 data      | Uniform character distribution | 1.0-1.2x       | Store raw
Hashes (SHA/MD5) | Random character patterns      | ~1.0x          | Store raw
Encrypted text   | No compressible patterns       | ~1.0x          | Don’t compress
Random strings   | No common substrings           | 1.0-1.2x       | Don’t compress

Example - Poor Compression:

Input: 10GB of UUIDs (550a8400-e29b-41d4-a716-446655440000)
Compressed: ~8.5GB (only 15% savings)
Symbol table overhead may exceed savings

10GB Data Estimates

Data Pattern                     | Compression Ratio | Compressed Size | Space Saved
Email addresses (common domains) | 3.0x              | 3.3 GB          | 6.7 GB (67%)
URLs (same site)                 | 2.5x              | 4.0 GB          | 6.0 GB (60%)
Server logs (structured)         | 3.0x              | 3.3 GB          | 6.7 GB (67%)
JSON records                     | 2.0x              | 5.0 GB          | 5.0 GB (50%)
UUIDs                            | 1.2x              | 8.3 GB          | 1.7 GB (17%)
Random text                      | 1.1x              | 9.1 GB          | 0.9 GB (9%)

Dictionary Encoding

Description

Dictionary encoding replaces repeated values with compact integer indices into a dictionary of unique values. Ideal for columns with few unique values (low cardinality).

Technical Characteristics

  • Max Dictionary Size: 65,536 unique values
  • Index Width: 1, 2, or 4 bytes (auto-selected)
  • Encoding Speed: Very fast (hash lookup)
  • Decoding Speed: Very fast (array index)
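
The mechanism is simple enough to sketch in full. The following is an illustrative Python version (not HeliosDB's implementation) showing the dictionary build, the cardinality limit, and the auto-selected index width:

```python
def dict_encode(values, max_size=65536):
    # Assign each distinct value the next integer index, in first-seen order.
    dictionary, indices, positions = [], [], {}
    for v in values:
        if v not in positions:
            if len(dictionary) >= max_size:
                raise ValueError("cardinality exceeds dictionary limit; use FSST instead")
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    # Auto-select index width: 1 byte covers 256 codes, 2 bytes cover 65,536.
    width = 1 if len(dictionary) <= 256 else 2
    return dictionary, indices, width

def dict_decode(dictionary, indices):
    # Decoding is a plain array lookup per value.
    return [dictionary[i] for i in indices]
```

With three distinct statuses, every row costs one index byte instead of the full string, which is where the 10-50x ratios for status-like columns come from.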

Good Use Cases

Use Case      | Description                  | Expected Ratio | 10GB Scenario
Status Fields | active/inactive/pending      | 10-50x         | 10GB → 200MB-1GB
Country Codes | US, UK, DE, FR (~200 values) | 5-10x          | 10GB → 1-2GB
Category Tags | electronics/clothing/food    | 8-20x          | 10GB → 500MB-1.25GB
Boolean-like  | yes/no, true/false           | 50-100x        | 10GB → 100-200MB
Day of Week   | Mon, Tue, Wed… (7 values)    | 15-30x         | 10GB → 333-666MB
Enum Fields   | Predefined value sets        | 10-50x         | 10GB → 200MB-1GB

Example - Status Field (10GB):

CREATE TABLE orders (
    id INT PRIMARY KEY,
    status TEXT,  -- Dictionary: 10GB → ~200MB (3 unique values)
    product_id INT
) WITH (compression_columns = 'status:dictionary');

Bad Use Cases

Use Case         | Why It’s Bad             | Expected Ratio | Recommendation
High cardinality | >50% unique values       | 0.8-1.5x       | Use FSST
User IDs         | Mostly unique values     | ~1.0x          | Don’t compress
Timestamps       | All different            | ~1.0x          | Use Delta
Free-text fields | High uniqueness          | ~1.0x          | Use FSST
>65,536 unique   | Exceeds dictionary limit | Fails          | Use FSST

Example - Poor Compression:

Input: 10GB of unique user IDs (user_12345, user_12346, ...)
Dictionary Size: Would exceed 65,536 limit or be nearly 1:1
Recommendation: Use FSST or store uncompressed

10GB Data Estimates

Data Pattern       | Unique Values | Compression Ratio | Compressed Size | Space Saved
Status (3 values)  | 3             | 50x               | 200 MB          | 9.8 GB (98%)
Country codes      | 200           | 8x                | 1.25 GB         | 8.75 GB (87.5%)
Product categories | 500           | 6x                | 1.67 GB         | 8.33 GB (83.3%)
User types         | 10            | 20x               | 500 MB          | 9.5 GB (95%)
City names         | 10,000        | 4x                | 2.5 GB          | 7.5 GB (75%)
Unique emails      | 1,000,000+    | ~1.0x             | ~10 GB          | ~0 GB (0%)

RLE - Run-Length Encoding

Description

Run-Length Encoding compresses sequences of repeated values into (value, count) pairs. Extremely effective for sorted columns or data with long runs of identical values.

Technical Characteristics

  • Minimum Run Length: 3 (shorter runs stored verbatim)
  • Maximum Run Length: 4.2 billion per entry
  • Encoding Speed: Fastest (simple counting)
  • Decoding Speed: Fastest (simple expansion)
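
The codec reduces to a few lines. This minimal Python sketch shows the (value, count) representation; for clarity it omits the verbatim mode that the real codec uses for runs shorter than 3:

```python
def rle_encode(values):
    # Collapse consecutive duplicates into [value, run_length] pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    # Expansion is the inverse: repeat each value run_length times.
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out
```

The sketch also makes the failure mode obvious: alternating values produce one run per input value, so the "compressed" form carries a full pair per value and ends up larger than the input.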

Good Use Cases

Use Case              | Description                     | Expected Ratio | 10GB Scenario
Sorted partition keys | Same value for millions of rows | 100-10,000x    | 10GB → 1-100MB
Time-bucketed data    | Same hour/day for many rows     | 50-500x        | 10GB → 20-200MB
Flag columns (sorted) | 0,0,0,…,1,1,1                   | 100-1000x      | 10GB → 10-100MB
Sparse data           | Mostly NULLs or zeros           | 50-200x        | 10GB → 50-200MB
Clustered keys        | Same foreign key in batches     | 20-100x        | 10GB → 100-500MB

Example - Sorted Partition (10GB):

-- Data sorted by region (4 regions, 2.5GB each)
CREATE TABLE events (
    region TEXT,  -- RLE: 10GB → ~1MB (only 4 runs!)
    event_time TIMESTAMP,
    data TEXT
) WITH (compression_columns = 'region:rle');

Extreme Example:

Input: 10GB of "active" status (all same value)
Runs: 1
Compressed: ~20 bytes (value + count)
Ratio: ~500,000,000x

Bad Use Cases

Use Case                  | Why It’s Bad              | Expected Ratio | Recommendation
Random/unsorted           | No consecutive duplicates | 0.5-1.0x       | Use Dictionary
High cardinality unsorted | Every value different     | ~0.5x (worse!) | Don’t use RLE
Alternating values        | A,B,A,B,A,B…              | ~0.3x (worse!) | Use Dictionary
UUIDs                     | All unique                | ~0.5x (worse!) | Don’t compress

Example - Poor Compression:

Input: 10GB of alternating true/false values
Runs: 5 billion (one per value)
Compressed: ~40GB (4x LARGER due to overhead!)
CRITICAL: RLE makes this data WORSE

10GB Data Estimates

Data Pattern                | Run Count | Compression Ratio | Compressed Size | Space Saved
Sorted partition (4 values) | 4         | 10,000x+          | ~1 MB           | 9.999 GB (99.99%)
Hourly buckets (8760/year)  | 8,760     | 1,000x            | 10 MB           | 9.99 GB (99.9%)
Daily flags (sorted)        | 365       | 5,000x            | 2 MB            | 9.998 GB (99.98%)
Clustered FK (1000 groups)  | 1,000     | 500x              | 20 MB           | 9.98 GB (99.8%)
Unsorted random             | 5 billion | 0.5x              | 20 GB           | -10 GB (WORSE)

Delta Encoding

Description

Delta encoding stores differences between consecutive values instead of absolute values. Uses zigzag + variable-length encoding for compact storage of small deltas.

Technical Characteristics

  • Supported Types: INT4, INT8 (32/64-bit integers)
  • Encoding: Zigzag encoding for signed deltas
  • Storage: Variable-length integers (1-10 bytes per delta)
  • Decoding: Sequential (requires reading from start)
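
The zigzag + varint scheme described above can be sketched in Python. This is an illustrative version, not HeliosDB's implementation; it shows why a delta of 1 costs exactly one byte:

```python
def zigzag(n):
    # Map signed deltas onto unsigned ints: 0,-1,1,-2,2 -> 0,1,2,3,4
    return (n << 1) ^ (n >> 63)

def unzigzag(u):
    return (u >> 1) ^ -(u & 1)

def varint(u):
    # Little-endian base-128: 7 payload bits per byte, high bit = "more follows".
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def delta_encode(values):
    # Store the base value, then the zigzag-varint encoded differences.
    out = bytearray(varint(zigzag(values[0])))
    for prev, cur in zip(values, values[1:]):
        out += varint(zigzag(cur - prev))
    return bytes(out)

def delta_decode(data):
    # Sequential decode: each delta is accumulated onto the running value,
    # which is why random access requires reading from the start.
    values, acc, shift = [], 0, 0
    for b in data:
        acc |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:  # last byte of this varint
            d = unzigzag(acc)
            values.append(d if not values else values[-1] + d)
            acc, shift = 0, 0
    return values
```

Sequential 8-byte integers encode to one byte each, matching the 8x best case in the tables below; large, sign-alternating deltas expand to 4-5 byte varints and erase the benefit.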

Good Use Cases

Use Case             | Description                       | Expected Ratio | 10GB Scenario
Auto-increment IDs   | 1, 2, 3, 4, 5… (delta=1)          | 6-8x           | 10GB → 1.25-1.67GB
Timestamps (ordered) | Regular intervals (delta ~1000ms) | 4-8x           | 10GB → 1.25-2.5GB
Counters             | Monotonically increasing          | 5-10x          | 10GB → 1-2GB
Sequence numbers     | 100, 101, 102…                    | 6-8x           | 10GB → 1.25-1.67GB
Version numbers      | 1, 2, 3… with gaps                | 3-6x           | 10GB → 1.67-3.3GB

Example - Timestamps (10GB):

CREATE TABLE events (
    id INT,
    event_time BIGINT,  -- Delta: 10GB → ~1.5GB (uniform intervals)
    data TEXT
) WITH (compression_columns = 'event_time:delta');

Optimal Case - Sequential IDs:

Input: 10GB of sequential integers (1, 2, 3, 4, ...)
Base: 1, Deltas: [1, 1, 1, 1, ...]
Each delta = 1 byte (varint encoding)
Compressed: ~1.25GB (8x compression)

Bad Use Cases

Use Case         | Why It’s Bad                 | Expected Ratio | Recommendation
Random integers  | Large deltas need more bytes | 0.8-1.2x       | Don’t compress
Unsorted data    | Deltas vary wildly           | ~1.0x          | Sort first or skip
Floating-point   | Not supported                | N/A            | Use ALP
Sparse sequences | Large gaps = large deltas    | 1.0-1.5x       | Use Dictionary
Non-sequential   | [1000, 5, 999999, 100]       | ~1.0x          | Don’t use Delta

Example - Poor Compression:

Input: 10GB of random integers [1000000, 5, 999999, 100, 888888]
Deltas: [-999995, 999994, -999899, 888788]
Each delta = 4-5 bytes (large varints)
Compressed: ~10GB (no savings)

10GB Data Estimates

Data Pattern               | Average Delta | Compression Ratio | Compressed Size | Space Saved
Sequential IDs (delta=1)   | 1             | 8x                | 1.25 GB         | 8.75 GB (87.5%)
Timestamps (1s intervals)  | 1,000         | 5x                | 2 GB            | 8 GB (80%)
Timestamps (1ms intervals) | 1             | 8x                | 1.25 GB         | 8.75 GB (87.5%)
Version numbers (gaps)     | ~100          | 4x                | 2.5 GB          | 7.5 GB (75%)
Random integers            | varies        | 1.0x              | 10 GB           | 0 GB (0%)

Codec Selection Guide

Decision Tree

Is your data...
FLOATING-POINT (FLOAT4/FLOAT8)?
├─ Yes → Use ALP
│ └─ Expected: 2-4x compression
└─ No ↓
TEXT/VARCHAR?
├─ Yes → Check cardinality
│ ├─ < 65,536 unique AND > 50% repetition → Use Dictionary (5-20x)
│ ├─ Has substring patterns → Use FSST (2-3x)
│ └─ Random/encrypted → Don't compress
└─ No ↓
INTEGER (INT4/INT8)?
├─ Yes → Check pattern
│ ├─ Sequential/sorted → Use Delta (2-10x)
│ ├─ Sorted with long runs → Use RLE (10-10000x)
│ ├─ Low cardinality → Use Dictionary (5-20x)
│ └─ Random → Don't compress
└─ No ↓
BINARY/BLOB?
└─ Don't compress (already efficient or encrypted)
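
The tree above can be mirrored in a small helper function. This is a hypothetical illustration only; the dict keys (`type`, `unique`, `repetitive`, `sequential`, `sorted_runs`, `patterned`) are invented for this sketch and are not a HeliosDB API:

```python
def choose_codec(col):
    # `col` describes a column, e.g. {"type": "TEXT", "unique": 3, "repetitive": True}.
    # Returns a codec name, or None for "don't compress".
    t = col["type"]
    if t in ("FLOAT4", "FLOAT8"):
        return "alp"
    if t in ("TEXT", "VARCHAR"):
        if col.get("unique", 1 << 32) < 65536 and col.get("repetitive"):
            return "dictionary"
        return "fsst" if col.get("patterned") else None
    if t in ("INT4", "INT8"):
        if col.get("sequential"):
            return "delta"
        if col.get("sorted_runs"):
            return "rle"
        if col.get("unique", 1 << 32) < 65536 and col.get("repetitive"):
            return "dictionary"
        return None
    return None  # BINARY/BLOB and unknown types: store raw
```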

Quick Reference Matrix

Data Characteristic     | Best Codec         | Second Choice      | Avoid
Financial prices        | ALP                | -                  | RLE
Status flags (sorted)   | RLE                | Dictionary         | Delta
Status flags (unsorted) | Dictionary         | -                  | RLE
Email addresses         | FSST               | Dictionary         | RLE
Sequential IDs          | Delta              | RLE (if sorted)    | Dictionary
Timestamps (ordered)    | Delta              | -                  | RLE
Country codes           | Dictionary         | FSST               | RLE
UUIDs                   | - (don’t compress) | FSST               | RLE, Dictionary
Sensor readings         | ALP                | Delta (if integer) | -

Performance Comparison

Compression Speed (GB/sec)

Codec      | Encode Speed | Decode Speed | Notes
RLE        | 5-10 GB/s    | 5-10 GB/s    | Fastest, CPU-bound
Dictionary | 2-4 GB/s     | 4-8 GB/s     | Fast hash lookup
Delta      | 3-6 GB/s     | 4-8 GB/s     | Simple arithmetic
ALP        | 1-2 GB/s     | 3-5 GB/s     | SIMD accelerated
FSST       | 1-3 GB/s     | 1-3 GB/s     | Symbol table lookup

Memory Overhead

Codec      | Per-Column Overhead        | Per-Value Overhead
RLE        | 12 bytes (header)          | 8 bytes/run
Dictionary | Dictionary size + 16 bytes | 1-4 bytes/value
Delta      | 20 bytes (header)          | 1-10 bytes/delta
ALP        | ~100 bytes (metadata)      | Variable
FSST       | 2-3 KB (symbol table)      | Variable

SQL Configuration

CREATE TABLE WITH Clause

-- Single codec for entire table
CREATE TABLE measurements (
    id INT PRIMARY KEY,
    value FLOAT8,
    label TEXT
) WITH (compression = 'auto');

-- Per-column codec specification
CREATE TABLE events (
    id INT,
    status TEXT,
    event_time BIGINT,
    temperature FLOAT8
) WITH (
    compression = 'auto',
    compression_level = 6,
    compression_columns = 'status:dictionary,event_time:delta,temperature:alp'
);

ALTER TABLE Configuration

-- Enable/disable compression
ALTER TABLE events SET COMPRESSION = 'auto';
ALTER TABLE events SET COMPRESSION = 'none';

-- Set compression level (1-9)
ALTER TABLE events SET COMPRESSION_LEVEL = 9;

-- Configure per-column
ALTER TABLE events SET COMPRESSION_COLUMN status = 'dictionary';
ALTER TABLE events SET COMPRESSION_COLUMN event_time = 'delta';

Monitor Compression Statistics

-- Overall compression stats
SELECT * FROM heliosdb_compression_stats;

-- Pattern analysis
SELECT * FROM heliosdb_pattern_stats;

-- Recent compression events
SELECT * FROM heliosdb_compression_events;

-- Current configuration
SELECT * FROM heliosdb_config WHERE setting LIKE 'compression%';

Summary: 10GB Compression Estimates

Codec      | Best Case      | Typical Case  | Worst Case
ALP        | 4x (2.5 GB)    | 3x (3.3 GB)   | 1.2x (8.3 GB)
FSST       | 3x (3.3 GB)    | 2.5x (4 GB)   | 1.1x (9.1 GB)
Dictionary | 50x (200 MB)   | 8x (1.25 GB)  | 1x (10 GB)
RLE        | 10000x+ (1 MB) | 100x (100 MB) | 0.5x (20 GB)*
Delta      | 8x (1.25 GB)   | 5x (2 GB)     | 1x (10 GB)

*RLE can make data larger if used incorrectly!


Last Updated: 2026-01-16 Version: HeliosDB Nano v3.0.1