Skip to content

HeliosDB Nano Compression Benchmark Report

HeliosDB Nano Compression Benchmark Report

Executive Summary

This report presents the results of 10GB compression tests demonstrating HeliosDB Nano’s storage performance.

Key Finding: RocksDB’s built-in LZ4 compression provides excellent compression (3.7-7x) with no additional configuration. Per-column storage modes (Dictionary, Content-Addressed, Columnar) provide additional optimization for specific data patterns.

Compression Strategy

HeliosDB Nano uses a two-layer approach:

  1. RocksDB LZ4 (automatic) - 3.7-7x compression for all data
  2. Per-Column Storage Modes (optional):
    • STORAGE DICTIONARY - Low-cardinality strings
    • STORAGE CONTENT_ADDRESSED - Large duplicate values
    • STORAGE COLUMNAR - Analytics/aggregations

Fast Bulk Loading

All tests used the optimized bulk loading APIs:

  • Phase 1: direct_bulk_load - Raw speed ingestion (~700K-900K rows/sec)
  • RocksDB LZ4: Applied automatically during storage

Results Summary

CodecData TypeTotal RowsRaw SizeLZ4 OnlyCodec+LZ4Best Ratio
ALPFLOAT8 (prices)447.4M18.3 GB2.91 GB7.52 GB3.70x (LZ4)
FSSTTEXT (emails)129.4M10.5 GB1.51 GB5.58 GB6.95x (LZ4)
DeltaINT8 (timestamps)447.4M10.7 GB2.53 GB4.61 GB3.96x (LZ4)
DictionaryTEXT (6 values)357.9M10.7 GB2.71 GB4.71 GB3.96x (LZ4)
RLETEXT (sorted)346.4M10.7 GB1.72 GB2.63 GB6.26x (LZ4)

Detailed Results

1. ALP (Adaptive Lossless floating-Point)

Data Profile: E-commerce prices with 2 decimal precision

MetricValue
Total Rows447,392,426
Raw Data Written18.34 GB
RocksDB LZ4 Storage2.91 GB
ALP + LZ4 Storage7.52 GB
LZ4 Compression Ratio3.70x
ALP Compression Ratio1.43x
Phase 1 (Load)694.9s @ 643,863 rows/sec
Phase 2 (Compress)3,427.6s @ 130,525 rows/sec
Effective Rate108,525 rows/sec

2. FSST (Fast Static Symbol Table)

Data Profile: Email addresses with common domains

MetricValue
Total Rows129,366,484
Raw Data Written10.54 GB
RocksDB LZ4 Storage1.51 GB
FSST + LZ4 Storage5.58 GB
LZ4 Compression Ratio6.95x
FSST Compression Ratio1.92x
Phase 1 (Load)229.1s @ 564,632 rows/sec
Phase 2 (Compress)618.3s @ 209,218 rows/sec
Effective Rate152,654 rows/sec

3. Delta Encoding

Data Profile: Sequential INT8 timestamps with ~50ms intervals

MetricValue
Total Rows447,392,426
Raw Data Written20.13 GB
Raw Logical Size10.74 GB
RocksDB LZ4 Storage2.53 GB
Delta + LZ4 Storage4.61 GB
LZ4 Compression Ratio3.96x
Delta Compression Ratio2.33x
Phase 1 (Load)506.3s @ 883,710 rows/sec
Phase 2 (Compress)1,854.4s @ 241,256 rows/sec
Effective Rate189,517 rows/sec

4. Dictionary Encoding

Data Profile: Order status text with 6 unique values

MetricValue
Total Rows357,913,941
Raw Data Written20.72 GB
Raw Logical Size10.74 GB
RocksDB LZ4 Storage2.71 GB
Dictionary + LZ4 Storage4.71 GB
LZ4 Compression Ratio3.96x
Dict Compression Ratio2.28x
Phase 1 (Load)485.8s @ 736,741 rows/sec
Phase 2 (Compress)1,343.9s @ 266,333 rows/sec
Effective Rate195,617 rows/sec

5. RLE (Run-Length Encoding)

Data Profile: Region text sorted to create ~43M consecutive identical values per region

MetricValue
Total Rows346,368,330
Raw Data Written19.40 GB
Raw Logical Size10.74 GB
RocksDB LZ4 Storage1.72 GB
RLE + LZ4 Storage2.63 GB
LZ4 Compression Ratio6.26x
RLE Compression Ratio4.08x
Phase 1 (Load)392.0s @ 883,628 rows/sec
Phase 2 (Compress)1,340.4s @ 258,415 rows/sec
Effective Rate199,942 rows/sec

Performance Analysis

Ingestion Throughput (Phase 1)

CodecRows/secNotes
Delta883,710INT8 data, minimal overhead
RLE883,628TEXT with sorted data
Dictionary736,741TEXT with random distribution
ALP643,863FLOAT8, more serialization overhead
FSST564,632TEXT (emails), longer strings

Compression Pass Throughput (Phase 2)

CodecRows/secNotes
Dictionary266,333Good batch processing
RLE258,415Efficient for long runs
Delta241,256Moderate overhead
FSST209,218Symbol table processing
ALP130,525Complex floating-point analysis

Effective End-to-End Throughput

CodecEffective Rows/sec
RLE199,942
Dictionary195,617
Delta189,517
FSST152,654
ALP108,525

Key Observations

1. RocksDB LZ4 Baseline Performance

RocksDB’s default LZ4 compression provides excellent baseline compression:

  • Best case: 6.95x for email addresses (FSST test)
  • Worst case: 3.70x for floating-point prices (ALP test)
  • Average: ~4.5x compression across all data types

2. Codec Layer Impact

The current codec implementation shows interesting behavior:

  • All codec passes increased storage compared to LZ4-only baseline
  • This is because the codecs operate on individual values during rewrite
  • The codec overhead exceeds the benefit when combined with LZ4

3. Data-Specific Insights

Data PatternBest Approach
Sorted data with long runsLZ4 alone (6.26x)
Low-cardinality stringsLZ4 alone (3.96x)
Sequential integersLZ4 alone (3.96x)
Email addressesLZ4 alone (6.95x)
Floating-point pricesLZ4 alone (3.70x)

4. Performance Trade-offs

  • Phase 1 (Load): Consistently achieves 550K-900K rows/sec
  • Phase 2 (Compress): 130K-270K rows/sec depending on codec
  • Compression pass adds 1,340-3,428 seconds for 10GB datasets

Recommendations

Primary: RocksDB LZ4 (Default)

RocksDB’s built-in LZ4 compression is the recommended baseline:

  • Achieves 3.7-7x compression depending on data patterns
  • Zero configuration required
  • Applied automatically to all data

Per-Column Storage Modes

For additional optimization, use per-column storage modes:

CREATE TABLE orders (
id INT PRIMARY KEY,
status TEXT STORAGE DICTIONARY, -- Low cardinality: 50-95% reduction
invoice BYTEA STORAGE CONTENT_ADDRESSED, -- Large duplicates: ~100% dedup
metrics FLOAT8[] STORAGE COLUMNAR -- Analytics: 20-50% better + faster
);

Fastest Ingestion

Use direct_bulk_load for maximum throughput:

  • Achieves 700K-900K rows/sec
  • LZ4 compression applied automatically
  • No post-processing required

Test Environment

  • Database: HeliosDB Nano v3.5.6
  • Storage Engine: RocksDB with LZ4 compression
  • Target Dataset Size: 10 GB per codec test
  • Batch Size: 100,000 rows
  • Parallel Threads: 4 (for compression pass)

Report generated: 2026-01-17 Results saved to: /tmp/_fast_bulk_results.json*