Compression Codecs User Guide

HeliosDB-Lite v3.0.1 - Complete Compression Reference

This guide provides detailed information about each compression codec in HeliosDB-Lite, including optimal use cases, data characteristics, and estimated compression ratios for 10GB data scenarios.


Table of Contents

  1. Overview
  2. ALP - Adaptive Lossless floating-Point
  3. FSST - Fast Static Symbol Table
  4. Dictionary Encoding
  5. RLE - Run-Length Encoding
  6. Delta Encoding
  7. Codec Selection Guide
  8. Performance Comparison
  9. SQL Configuration
  10. Summary: 10GB Compression Estimates

Overview

HeliosDB-Lite includes five specialized compression codecs, each optimized for different data patterns:

| Codec | Target Data | Typical Ratio | Speed |
|---|---|---|---|
| ALP | Floating-point numbers | 2-4x | Very Fast |
| FSST | Strings with patterns | 2-3x | Fast |
| Dictionary | Low-cardinality columns | 5-20x | Very Fast |
| RLE | Repetitive/sorted data | 10-100x | Fastest |
| Delta | Sequential numbers | 2-10x | Very Fast |

ALP - Adaptive Lossless floating-Point

Description

ALP (Adaptive Lossless floating-Point) is a state-of-the-art compression algorithm for IEEE 754 floating-point data. Based on ACM SIGMOD 2024 research, it automatically adapts between two strategies:

  • ALP Classic: For decimal-origin data (financial, percentages, measurements)
  • ALP-RD: For high-precision floats (scientific, ML weights)

Technical Characteristics

  • Encoding Speed: ~0.5 doubles per CPU cycle
  • Decoding Speed: ~2.6 doubles per CPU cycle
  • Compression: 100% lossless (zero precision loss)
  • Block Size: 1024 values (optimized for CPU cache)
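
The decimal-origin strategy can be illustrated with a small sketch. This is a simplified model of the ALP Classic idea, not HeliosDB-Lite's implementation: it scales each value by a power of ten, keeps the result as an integer when it converts back to a value equal to the original, and stores anything else as an exception. The function names, the exponent search limit, and the frame-of-reference step are assumptions made for illustration, and the sketch expects a non-empty block of moderate, finite values.

```python
def alp_encode(values, max_exponent=14):
    """Return (exponent, base, offsets, exceptions) for a non-empty block of floats."""
    best = None
    for e in range(max_exponent + 1):
        scale = 10 ** e
        ints = [round(v * scale) for v in values]
        # A value is an exception if scaling back does not compare equal to it.
        exceptions = {i: v for i, v in enumerate(values) if ints[i] / scale != v}
        if best is None or len(exceptions) < len(best[2]):
            best = (e, ints, exceptions)
        if not exceptions:
            break  # smallest exponent that round-trips every value
    e, ints, exceptions = best
    base = min(ints)                     # frame of reference
    offsets = [d - base for d in ints]   # small non-negative integers
    return e, base, offsets, exceptions


def alp_decode(e, base, offsets, exceptions):
    """Rebuild the original floats; exceptions override the scaled integers."""
    scale = 10 ** e
    return [exceptions.get(i, (base + off) / scale) for i, off in enumerate(offsets)]


prices = [10.12, 99.95, 1234.56]
e, base, offsets, exceptions = alp_encode(prices)
bits_per_value = max(offsets).bit_length()
assert alp_decode(e, base, offsets, exceptions) == prices
```

For two-decimal prices the offsets fit in far fewer than 64 bits each, which is where the savings on decimal-origin data come from. Random doubles rarely round-trip through a decimal scale, so most of them end up as exceptions.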

Good Use Cases

| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Financial Data | Prices: $10.12, $99.95, $1234.56 | 3-4x | 10GB → 2.5-3.3GB |
| Sensor Readings | Temperature: 23.5°C, 24.1°C, 23.8°C | 3-4x | 10GB → 2.5-3.3GB |
| Percentages | Values: 0.25, 0.50, 0.75, 0.33 | 4x | 10GB → 2.5GB |
| Coordinates | GPS: -122.4194, 37.7749 | 2.5-3x | 10GB → 3.3-4GB |
| Measurements | Scientific: 9.81, 3.14159, 2.718 | 2-3x | 10GB → 3.3-5GB |

Example - Price Data (10GB):

CREATE TABLE orders (
    id INT PRIMARY KEY,
    price FLOAT8,  -- ALP: 10GB → ~2.5GB
    quantity INT
) WITH (compression = 'alp');

Bad Use Cases

| Use Case | Why It’s Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| Random doubles | No patterns to exploit | 1.0-1.2x | Don’t compress |
| ML weights | Full precision, random distribution | 1.2-1.5x | Consider storing as binary |
| Encrypted data | Appears random | ~1.0x | Don’t compress |
| Already compressed | No further reduction | ~1.0x | Store raw |

Example - Poor Compression:

Input: Random f64 values from rand::random()
Original: 10GB
Compressed: ~9GB (only 10% savings)
Overhead may not be worth it

10GB Data Estimates

| Data Pattern | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|
| Financial prices (2 decimals) | 4.0x | 2.5 GB | 7.5 GB (75%) |
| Scientific measurements | 3.0x | 3.3 GB | 6.7 GB (67%) |
| GPS coordinates | 2.5x | 4.0 GB | 6.0 GB (60%) |
| Time-series sensor data | 3.5x | 2.9 GB | 7.1 GB (71%) |
| Random doubles | 1.2x | 8.3 GB | 1.7 GB (17%) |

FSST - Fast Static Symbol Table

Description

FSST (Fast Static Symbol Table) is a lightweight string compression algorithm that encodes common substrings (1-8 bytes) using a symbol table trained on sample data. It provides random access to individual strings.

Technical Characteristics

  • Compression Speed: 1-3 GB/sec
  • Decompression Speed: 1-3 GB/sec
  • Symbol Table Size: ~2-3 KB per column
  • Random Access: Yes (individual strings decompressible)
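
The substitution step can be sketched as follows. This is a toy model only: the symbol table here is written by hand rather than trained on sample data, matching is a plain greedy longest-match scan, and code 255 is reserved in this sketch as an escape marker for bytes that no symbol covers. None of the names below are HeliosDB-Lite APIs.

```python
ESCAPE = 255  # marks a literal byte that no symbol covers


def fsst_encode(data: bytes, symbols: list) -> bytes:
    """Replace known substrings (1-8 bytes each) with one-byte codes."""
    # Try longer symbols first so the longest match wins.
    order = sorted(range(len(symbols)), key=lambda code: -len(symbols[code]))
    out = bytearray()
    pos = 0
    while pos < len(data):
        for code in order:
            if data.startswith(symbols[code], pos):
                out.append(code)
                pos += len(symbols[code])
                break
        else:
            out += bytes([ESCAPE, data[pos]])  # escaped literal costs 2 bytes
            pos += 1
    return bytes(out)


def fsst_decode(encoded: bytes, symbols: list) -> bytes:
    """Expand codes back into their substrings."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == ESCAPE:
            out.append(encoded[i + 1])
            i += 2
        else:
            out += symbols[encoded[i]]
            i += 1
    return bytes(out)


# Hypothetical hand-written table; a real table would be trained per column.
table = [b"@example", b".com", b"user", b"https://"]
email = b"user42@example.com"
packed = fsst_encode(email, table)
assert fsst_decode(packed, table) == email
```

Because each string is encoded independently against the same static table, any single string can be decoded without touching its neighbours, which is what makes random access possible. The escape path also shows why UUIDs and hashes compress poorly: bytes with no matching symbol cost two bytes instead of one.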

Good Use Cases

| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Email Addresses | user@example.com patterns | 2.5-3x | 10GB → 3.3-4GB |
| URLs | https://example.com/path patterns | 2-3x | 10GB → 3.3-5GB |
| Log Messages | Repetitive log formats | 2.5-3x | 10GB → 3.3-4GB |
| JSON Records | Structured text patterns | 2-2.5x | 10GB → 4-5GB |
| File Paths | /home/user/docs patterns | 2.5-3x | 10GB → 3.3-4GB |
| Request Logs | GET /api/v1/users patterns | 2.5-3.5x | 10GB → 2.9-4GB |

Example - Email Data (10GB):

CREATE TABLE users (
    id INT PRIMARY KEY,
    email TEXT,  -- FSST: 10GB → ~3.5GB
    name TEXT
) WITH (compression = 'fsst');

Bad Use Cases

| Use Case | Why It’s Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| UUIDs | No substring patterns | 1.1-1.3x | Use binary storage |
| Base64 data | Uniform character distribution | 1.0-1.2x | Store raw |
| Hashes (SHA/MD5) | Random character patterns | ~1.0x | Store raw |
| Encrypted text | No compressible patterns | ~1.0x | Don’t compress |
| Random strings | No common substrings | 1.0-1.2x | Don’t compress |

Example - Poor Compression:

Input: 10GB of UUIDs (550a8400-e29b-41d4-a716-446655440000)
Compressed: ~8.5GB (only 15% savings)
Symbol table overhead may exceed savings

10GB Data Estimates

| Data Pattern | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|
| Email addresses (common domains) | 3.0x | 3.3 GB | 6.7 GB (67%) |
| URLs (same site) | 2.5x | 4.0 GB | 6.0 GB (60%) |
| Server logs (structured) | 3.0x | 3.3 GB | 6.7 GB (67%) |
| JSON records | 2.0x | 5.0 GB | 5.0 GB (50%) |
| UUIDs | 1.2x | 8.3 GB | 1.7 GB (17%) |
| Random text | 1.1x | 9.1 GB | 0.9 GB (9%) |

Dictionary Encoding

Description

Dictionary encoding replaces repeated values with compact integer indices into a dictionary of unique values. Ideal for columns with few unique values (low cardinality).

Technical Characteristics

  • Max Dictionary Size: 65,536 unique values
  • Index Width: 1, 2, or 4 bytes (auto-selected)
  • Encoding Speed: Very fast (hash lookup)
  • Decoding Speed: Very fast (array index)
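
A minimal sketch of the technique, assuming a plain in-memory column: each distinct value gets the next free index, the column is stored as a list of indices, and the index width is chosen from the number of distinct values. The function names and the error raised at the limit are illustrative assumptions rather than HeliosDB-Lite behaviour.

```python
def dictionary_encode(values, max_entries=65_536):
    """Return (entries, indices, index_width_bytes) for a column of hashable values."""
    positions = {}  # value -> dictionary index (hash lookup on encode)
    indices = []
    for v in values:
        if v not in positions:
            if len(positions) == max_entries:
                raise ValueError("too many unique values for dictionary encoding")
            positions[v] = len(positions)
        indices.append(positions[v])
    entries = list(positions)  # dicts preserve insertion order
    n = len(entries)
    # With the default limit the 4-byte width is never reached.
    index_width = 1 if n <= 2**8 else 2 if n <= 2**16 else 4
    return entries, indices, index_width


def dictionary_decode(entries, indices):
    """Decoding is a plain array lookup per value."""
    return [entries[i] for i in indices]


statuses = ["active", "pending", "active", "inactive", "active"]
entries, indices, width = dictionary_encode(statuses)
assert dictionary_decode(entries, indices) == statuses
```

With three distinct statuses every row shrinks to a one-byte index, while a column of mostly unique values pays for the dictionary and the indices without removing any repetition.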

Good Use Cases

| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Status Fields | active/inactive/pending | 10-50x | 10GB → 200MB-1GB |
| Country Codes | US, UK, DE, FR (~200 values) | 5-10x | 10GB → 1-2GB |
| Category Tags | electronics/clothing/food | 8-20x | 10GB → 500MB-1.25GB |
| Boolean-like | yes/no, true/false | 50-100x | 10GB → 100-200MB |
| Day of Week | Mon, Tue, Wed… (7 values) | 15-30x | 10GB → 333-666MB |
| Enum Fields | Predefined value sets | 10-50x | 10GB → 200MB-1GB |

Example - Status Field (10GB):

CREATE TABLE orders (
    id INT PRIMARY KEY,
    status TEXT,  -- Dictionary: 10GB → ~200MB (3 unique values)
    product_id INT
) WITH (compression_columns = 'status:dictionary');

Bad Use Cases

| Use Case | Why It’s Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| High cardinality | >50% unique values | 0.8-1.5x | Use FSST |
| User IDs | Mostly unique values | ~1.0x | Don’t compress |
| Timestamps | All different | ~1.0x | Use Delta |
| Free-text fields | High uniqueness | ~1.0x | Use FSST |
| >65,536 unique | Exceeds dictionary limit | Fails | Use FSST |

Example - Poor Compression:

Input: 10GB of unique user IDs (user_12345, user_12346, ...)
Dictionary Size: Would exceed 65,536 limit or be nearly 1:1
Recommendation: Use FSST or store uncompressed

10GB Data Estimates

| Data Pattern | Unique Values | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|---|
| Status (3 values) | 3 | 50x | 200 MB | 9.8 GB (98%) |
| Country codes | 200 | 8x | 1.25 GB | 8.75 GB (87.5%) |
| Product categories | 500 | 6x | 1.67 GB | 8.33 GB (83.3%) |
| User types | 10 | 20x | 500 MB | 9.5 GB (95%) |
| City names | 10,000 | 4x | 2.5 GB | 7.5 GB (75%) |
| Unique emails | 1,000,000+ | ~1.0x | ~10 GB | ~0 GB (0%) |

RLE - Run-Length Encoding

Description

Run-Length Encoding compresses sequences of repeated values into (value, count) pairs. Extremely effective for sorted columns or data with long runs of identical values.

Technical Characteristics

  • Minimum Run Length: 3 (shorter runs stored verbatim)
  • Maximum Run Length: 4.2 billion per entry
  • Encoding Speed: Fastest (simple counting)
  • Decoding Speed: Fastest (simple expansion)
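
The core of the technique fits in a few lines. The sketch below is a simplified model that always emits (value, count) pairs; it does not implement the minimum-run-length rule listed above, under which short runs are stored verbatim, and the function names are illustrative only.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive duplicates into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand each (value, count) pair back into a run."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


sorted_column = ["eu"] * 5 + ["us"] * 3
alternating = [True, False] * 4

assert rle_encode(sorted_column) == [("eu", 5), ("us", 3)]
assert rle_decode(rle_encode(sorted_column)) == sorted_column
# Alternating data yields one pair per value, so nothing is saved.
assert len(rle_encode(alternating)) == len(alternating)
```

The number of pairs depends only on how many times the value changes, which is why sort order matters more to RLE than cardinality does.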

Good Use Cases

| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Sorted partition keys | Same value for millions of rows | 100-10,000x | 10GB → 1-100MB |
| Time-bucketed data | Same hour/day for many rows | 50-500x | 10GB → 20-200MB |
| Flag columns (sorted) | 0,0,0,…,1,1,1 | 100-1000x | 10GB → 10-100MB |
| Sparse data | Mostly NULLs or zeros | 50-200x | 10GB → 50-200MB |
| Clustered keys | Same foreign key in batches | 20-100x | 10GB → 100-500MB |

Example - Sorted Partition (10GB):

-- Data sorted by region (4 regions, 2.5GB each)
CREATE TABLE events (
    region TEXT,  -- RLE: 10GB → ~1MB (only 4 runs!)
    event_time TIMESTAMP,
    data TEXT
) WITH (compression_columns = 'region:rle');

Extreme Example:

Input: 10GB of "active" status (all same value)
Runs: 1
Compressed: ~20 bytes (value + count)
Ratio: ~500,000,000x

Bad Use Cases

| Use Case | Why It’s Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| Random/unsorted | No consecutive duplicates | 0.5-1.0x | Use Dictionary |
| High cardinality unsorted | Every value different | ~0.5x (worse!) | Don’t use RLE |
| Alternating values | A,B,A,B,A,B… | ~0.3x (worse!) | Use Dictionary |
| UUIDs | All unique | ~0.5x (worse!) | Don’t compress |

Example - Poor Compression:

Input: 10GB of alternating true/false values
Runs: 5 billion (one per value)
Compressed: ~40GB (4x LARGER due to overhead!)
CRITICAL: RLE makes this data WORSE
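
The arithmetic behind this example can be checked with a short calculation. The sketch assumes the figures from the Memory Overhead table in this guide (a 12-byte header and 8 bytes per run) and treats every value change as a new run; the function name is hypothetical.

```python
GB = 10**9


def rle_size_bytes(run_count, bytes_per_run=8, header_bytes=12):
    """Estimated encoded size: fixed header plus a fixed cost per run."""
    return header_bytes + run_count * bytes_per_run


single_run = rle_size_bytes(1)               # whole column is one value
alternating = rle_size_bytes(5_000_000_000)  # one run per value

print(f"single run:  {single_run} bytes")
print(f"alternating: {alternating / GB:.1f} GB")
```

The size grows linearly with the run count, so checking how often a column changes value is a quick way to judge whether RLE will help or hurt.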

10GB Data Estimates

| Data Pattern | Run Count | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|---|
| Sorted partition (4 values) | 4 | 10,000x+ | ~1 MB | 9.999 GB (99.99%) |
| Hourly buckets (8760/year) | 8,760 | 1,000x | 10 MB | 9.99 GB (99.9%) |
| Daily flags (sorted) | 365 | 5,000x | 2 MB | 9.998 GB (99.98%) |
| Clustered FK (1000 groups) | 1,000 | 500x | 20 MB | 9.98 GB (99.8%) |
| Unsorted random | 5 billion | 0.5x | 20 GB | -10 GB (WORSE) |

Delta Encoding

Description

Delta encoding stores differences between consecutive values instead of absolute values. Uses zigzag + variable-length encoding for compact storage of small deltas.

Technical Characteristics

  • Supported Types: INT4, INT8 (32/64-bit integers)
  • Encoding: Zigzag encoding for signed deltas
  • Storage: Variable-length integers (1-10 bytes per delta)
  • Decoding: Sequential (requires reading from start)
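
The three steps (delta, zigzag, variable-length integer) can be sketched as below. This is an illustrative model, not HeliosDB-Lite's on-disk format: it assumes the first value is stored as a delta from zero and uses a LEB128-style varint with 7 payload bits per byte. All function names are made up for the example.

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag."""
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)


def varint(z: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, high bit = more bytes follow."""
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def delta_encode(values):
    """Store each value as the zigzag varint of its difference from the previous one."""
    out = bytearray()
    prev = 0  # assumption: the first value is a delta from zero
    for v in values:
        out += varint(zigzag(v - prev))
        prev = v
    return bytes(out)


def delta_decode(data: bytes):
    """Decoding is sequential: each value depends on the running total."""
    values, prev, z, shift = [], 0, 0, 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += unzigzag(z)
            values.append(prev)
            z, shift = 0, 0
    return values


ids = [100, 101, 102, 103]
assert delta_decode(delta_encode(ids)) == ids
```

The running total in `delta_decode` is the reason decoding has to start from the beginning: no value can be recovered without summing every delta before it.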

Good Use Cases

| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Auto-increment IDs | 1, 2, 3, 4, 5… (delta=1) | 6-8x | 10GB → 1.25-1.67GB |
| Timestamps (ordered) | Regular intervals (delta~1000ms) | 4-8x | 10GB → 1.25-2.5GB |
| Counters | Monotonically increasing | 5-10x | 10GB → 1-2GB |
| Sequence numbers | 100, 101, 102… | 6-8x | 10GB → 1.25-1.67GB |
| Version numbers | 1, 2, 3… with gaps | 3-6x | 10GB → 1.67-3.3GB |

Example - Timestamps (10GB):

CREATE TABLE events (
    id INT,
    event_time BIGINT,  -- Delta: 10GB → ~1.5GB (uniform intervals)
    data TEXT
) WITH (compression_columns = 'event_time:delta');

Optimal Case - Sequential IDs:

Input: 10GB of sequential 64-bit integers (1, 2, 3, 4, ...)
Base: 1, Deltas: [1, 1, 1, 1, ...]
Each delta = 1 byte (varint encoding) instead of 8 bytes per value
Compressed: ~1.25GB (8x compression)

Bad Use Cases

| Use Case | Why It’s Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| Random integers | Large deltas need more bytes | 0.8-1.2x | Don’t compress |
| Unsorted data | Deltas vary wildly | ~1.0x | Sort first or skip |
| Floating-point | Not supported | N/A | Use ALP |
| Sparse sequences | Large gaps = large deltas | 1.0-1.5x | Use Dictionary |
| Non-sequential | [1000, 5, 999999, 100] | ~1.0x | Don’t use Delta |

Example - Poor Compression:

Input: 10GB of random integers [1000000, 5, 999999, 100, 888888]
Deltas: [-999995, 999994, -999899, 888788]
Each delta = 3 bytes in this small sample; deltas spread over the full INT4 range need 4-5 bytes (large varints)
Compressed: ~10GB (no savings)

10GB Data Estimates

| Data Pattern | Average Delta | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|---|
| Sequential IDs (delta=1) | 1 | 8x | 1.25 GB | 8.75 GB (87.5%) |
| Timestamps (1s intervals) | 1,000 | 5x | 2 GB | 8 GB (80%) |
| Timestamps (1ms intervals) | 1 | 8x | 1.25 GB | 8.75 GB (87.5%) |
| Version numbers (gaps) | ~100 | 4x | 2.5 GB | 7.5 GB (75%) |
| Random integers | varies | 1.0x | 10 GB | 0 GB (0%) |

Codec Selection Guide

Decision Tree

Is your data...
FLOATING-POINT (FLOAT4/FLOAT8)?
├─ Yes → Use ALP
│ └─ Expected: 2-4x compression
└─ No ↓
TEXT/VARCHAR?
├─ Yes → Check cardinality
│ ├─ < 65,536 unique AND > 50% repetition → Use Dictionary (5-20x)
│ ├─ Has substring patterns → Use FSST (2-3x)
│ └─ Random/encrypted → Don't compress
└─ No ↓
INTEGER (INT4/INT8)?
├─ Yes → Check pattern
│ ├─ Sequential/sorted → Use Delta (2-10x)
│ ├─ Sorted with long runs → Use RLE (10-10000x)
│ ├─ Low cardinality → Use Dictionary (5-20x)
│ └─ Random → Don't compress
└─ No ↓
BINARY/BLOB?
└─ Don't compress (already efficient or encrypted)
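
The tree above can also be written as a small function, which is handy for scripting a first-pass recommendation. This is a sketch that follows the branches in the order shown; the parameter names are hypothetical, `repetition` is assumed to mean the fraction of rows that repeat an earlier value, and `INT`/`BIGINT` are treated as aliases for INT4/INT8 because the SQL examples in this guide use them.

```python
def choose_codec(data_type, *, unique_values=None, repetition=0.0,
                 substring_patterns=False, sequential=False,
                 long_runs=False, low_cardinality=False):
    """Return a codec name ('alp', 'fsst', 'dictionary', 'rle', 'delta') or 'none'."""
    kind = data_type.upper()
    if kind in {"FLOAT4", "FLOAT8"}:
        return "alp"
    if kind in {"TEXT", "VARCHAR"}:
        if unique_values is not None and unique_values < 65_536 and repetition > 0.5:
            return "dictionary"
        if substring_patterns:
            return "fsst"
        return "none"  # random or encrypted text
    if kind in {"INT", "INT4", "INT8", "BIGINT"}:
        if sequential:
            return "delta"
        if long_runs:
            return "rle"
        if low_cardinality:
            return "dictionary"
        return "none"  # random integers
    return "none"  # BINARY/BLOB and anything else


print(choose_codec("TEXT", unique_values=3, repetition=0.99))
print(choose_codec("INT8", sequential=True))
```

The flags are inputs the caller has to measure or know; the function only encodes the order in which the tree tests them.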

Quick Reference Matrix

| Data Characteristic | Best Codec | Second Choice | Avoid |
|---|---|---|---|
| Financial prices | ALP | - | RLE |
| Status flags (sorted) | RLE | Dictionary | Delta |
| Status flags (unsorted) | Dictionary | - | RLE |
| Email addresses | FSST | Dictionary | RLE |
| Sequential IDs | Delta | RLE (if sorted) | Dictionary |
| Timestamps (ordered) | Delta | - | RLE |
| Country codes | Dictionary | FSST | RLE |
| UUIDs | - (don’t compress) | FSST | RLE, Dictionary |
| Sensor readings | ALP | Delta (if integer) | - |

Performance Comparison

Compression Speed (GB/sec)

| Codec | Encode Speed | Decode Speed | Notes |
|---|---|---|---|
| RLE | 5-10 GB/s | 5-10 GB/s | Fastest, CPU-bound |
| Dictionary | 2-4 GB/s | 4-8 GB/s | Fast hash lookup |
| Delta | 3-6 GB/s | 4-8 GB/s | Simple arithmetic |
| ALP | 1-2 GB/s | 3-5 GB/s | SIMD accelerated |
| FSST | 1-3 GB/s | 1-3 GB/s | Symbol table lookup |

Memory Overhead

| Codec | Per-Column Overhead | Per-Value Overhead |
|---|---|---|
| RLE | 12 bytes (header) | 8 bytes/run |
| Dictionary | Dictionary size + 16 bytes | 1-4 bytes/value |
| Delta | 20 bytes (header) | 1-10 bytes/delta |
| ALP | ~100 bytes (metadata) | Variable |
| FSST | 2-3 KB (symbol table) | Variable |

SQL Configuration

CREATE TABLE WITH Clause

-- Single codec for entire table
CREATE TABLE measurements (
    id INT PRIMARY KEY,
    value FLOAT8,
    label TEXT
) WITH (compression = 'auto');

-- Per-column codec specification
CREATE TABLE events (
    id INT,
    status TEXT,
    event_time BIGINT,
    temperature FLOAT8
) WITH (
    compression = 'auto',
    compression_level = 6,
    compression_columns = 'status:dictionary,event_time:delta,temperature:alp'
);
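
When table definitions are generated by a script, the `compression_columns` value can be assembled from a column-to-codec mapping. The helper below is a hypothetical convenience, not part of HeliosDB-Lite; it only reproduces the `column:codec,column:codec` format shown above and checks codec names against the five codecs described in this guide.

```python
KNOWN_CODECS = {"alp", "fsst", "dictionary", "rle", "delta"}


def compression_columns(mapping):
    """Build the 'column:codec,...' string for the WITH clause."""
    for column, codec in mapping.items():
        if codec not in KNOWN_CODECS:
            raise ValueError(f"unknown codec {codec!r} for column {column!r}")
    return ",".join(f"{column}:{codec}" for column, codec in mapping.items())


spec = compression_columns({
    "status": "dictionary",
    "event_time": "delta",
    "temperature": "alp",
})
print(f"compression_columns = '{spec}'")
```

Validating codec names before the statement is sent keeps typos from reaching the database, where the failure mode is not described in this guide.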

ALTER TABLE Configuration

-- Enable/disable compression
ALTER TABLE events SET COMPRESSION = 'auto';
ALTER TABLE events SET COMPRESSION = 'none';
-- Set compression level (1-9)
ALTER TABLE events SET COMPRESSION_LEVEL = 9;
-- Configure per-column
ALTER TABLE events SET COMPRESSION_COLUMN status = 'dictionary';
ALTER TABLE events SET COMPRESSION_COLUMN event_time = 'delta';

Monitor Compression Statistics

-- Overall compression stats
SELECT * FROM heliosdb_compression_stats;
-- Pattern analysis
SELECT * FROM heliosdb_pattern_stats;
-- Recent compression events
SELECT * FROM heliosdb_compression_events;
-- Current configuration
SELECT * FROM heliosdb_config WHERE setting LIKE 'compression%';

Summary: 10GB Compression Estimates

| Codec | Best Case | Typical Case | Worst Case |
|---|---|---|---|
| ALP | 4x (2.5 GB) | 3x (3.3 GB) | 1.2x (8.3 GB) |
| FSST | 3x (3.3 GB) | 2.5x (4 GB) | 1.1x (9.1 GB) |
| Dictionary | 50x (200 MB) | 8x (1.25 GB) | 1x (10 GB) |
| RLE | 10000x+ (1 MB) | 100x (100 MB) | 0.5x (20 GB)* |
| Delta | 8x (1.25 GB) | 5x (2 GB) | 1x (10 GB) |

*RLE can make data larger if used incorrectly!
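
The size and savings figures in the estimate tables follow directly from the compression ratio. A small helper (hypothetical, for checking the arithmetic or plugging in a different data volume) makes the relationship explicit, including the negative savings when a ratio falls below 1.

```python
def estimate(original_gb, ratio):
    """Return (compressed_gb, saved_gb, saved_percent) for a given compression ratio."""
    compressed = original_gb / ratio
    saved = original_gb - compressed
    return compressed, saved, 100 * saved / original_gb


for codec, ratio in [("ALP best case", 4), ("Dictionary typical", 8), ("RLE worst case", 0.5)]:
    compressed, saved, percent = estimate(10, ratio)
    print(f"{codec}: {compressed:.2f} GB stored, {saved:.2f} GB saved ({percent:.1f}%)")
```

A ratio of 0.5x doubles the stored size, which is the case the RLE footnote warns about.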


Last Updated: 2026-01-16
Version: HeliosDB-Lite v3.0.1