Hybrid Columnar Compression (HCC) for HeliosDB

Overview

HeliosDB’s Hybrid Columnar Compression (HCC) is inspired by Oracle Exadata’s columnar compression technology. It provides industry-leading compression ratios (6-10x average, up to 15x for specific workloads) while maintaining excellent query performance through intelligent encoding techniques and predicate pushdown support.

Architecture

Compression Modes

HCC offers three compression modes, each optimized for a different workload; a short code sketch of the mode-to-unit-size mapping follows the list:

1. Online Mode (HTAP Workloads)

  • Compression Ratio: 3-5x
  • Compression Unit Size: 8,192 rows
  • Use Case: Mixed OLTP/OLAP workloads with frequent updates
  • Performance: Fastest compression and decompression
  • Best For: Real-time analytics, operational reporting

2. Warehouse Mode (OLAP Workloads)

  • Compression Ratio: 6-10x
  • Compression Unit Size: 32,768 rows
  • Use Case: Analytical queries with infrequent updates
  • Performance: Balanced compression ratio and speed
  • Best For: Data warehouses, business intelligence

3. Archive Mode (Cold Storage)

  • Compression Ratio: 10-15x
  • Compression Unit Size: 65,536 rows
  • Use Case: Long-term storage with rare access
  • Performance: Maximum compression, slower decompression
  • Best For: Historical data, compliance archives
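
The unit sizes above can be captured in code. This is a minimal sketch: CompressionMode is the enum used throughout the API examples below, but unit_rows is a hypothetical helper that merely restates the figures from the list:

use heliosdb_storage::hcc::CompressionMode;

/// Hypothetical helper restating the unit sizes listed above;
/// not part of the HeliosDB API.
fn unit_rows(mode: CompressionMode) -> usize {
    match mode {
        CompressionMode::Online => 8_192,     // fastest, HTAP-friendly
        CompressionMode::Warehouse => 32_768, // balanced OLAP default
        CompressionMode::Archive => 65_536,   // maximum compression
    }
}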

Compression Techniques

HCC automatically selects the optimal encoding technique for each column based on its data characteristics. Each technique below is followed by a short illustrative sketch:

Dictionary Encoding

  • Best For: Low-cardinality string columns
  • Compression Ratio: 8-15x
  • How It Works:
    • Builds a dictionary of unique values
    • Replaces values with compact integer codes
    • Bit-packs codes to minimum required width
  • Threshold: Applied when unique values < 70% of total rows
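
The steps above can be illustrated in plain Rust with no HeliosDB types; dictionary_encode is an illustrative helper, not the engine's internal implementation:

use std::collections::HashMap;

/// Replace each string with a compact integer code into a dictionary
/// of unique values (bit-packing of the codes is omitted here).
fn dictionary_encode<'a>(values: &[&'a str]) -> (Vec<&'a str>, Vec<u32>) {
    let mut dict: Vec<&str> = Vec::new();
    let mut index: HashMap<&str, u32> = HashMap::new();
    let codes = values
        .iter()
        .map(|&v| {
            *index.entry(v).or_insert_with(|| {
                dict.push(v);
                (dict.len() - 1) as u32
            })
        })
        .collect();
    (dict, codes)
}

fn main() {
    let column = ["US", "DE", "US", "US", "FR", "DE"];
    let (dict, codes) = dictionary_encode(&column);
    // 3 unique values, so each code needs only 2 bits after bit-packing.
    println!("dict = {:?}, codes = {:?}", dict, codes);
}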

Run-Length Encoding (RLE)

  • Best For: Columns with consecutive repeated values
  • Compression Ratio: 5-20x (highly variable)
  • How It Works: Stores (value, count) pairs instead of repeated values
  • Use Cases: Status flags, sorted columns, sparse data
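
A minimal sketch of the (value, count) scheme, again using an illustrative helper rather than HeliosDB internals:

/// Collapse consecutive repeats into (value, run_length) pairs.
fn rle_encode<T: PartialEq + Clone>(values: &[T]) -> Vec<(T, u32)> {
    let mut runs: Vec<(T, u32)> = Vec::new();
    for v in values {
        if let Some(last) = runs.last_mut() {
            if &last.0 == v {
                last.1 += 1;
                continue;
            }
        }
        runs.push((v.clone(), 1));
    }
    runs
}

fn main() {
    let status = ["OK", "OK", "OK", "FAIL", "OK", "OK"];
    // 6 values collapse to 3 runs: [("OK", 3), ("FAIL", 1), ("OK", 2)]
    println!("{:?}", rle_encode(&status));
}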

Delta + Bit-Packing

  • Best For: Sequential numeric data (timestamps, IDs)
  • Compression Ratio: 10-20x
  • How It Works:
    • Stores base value
    • Calculates deltas between consecutive values
    • Bit-packs deltas to minimum required width
  • Use Cases: Time-series data, auto-increment IDs
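
A sketch of the base-plus-deltas step (bit-packing of the deltas is omitted); delta_encode is an illustrative helper:

/// Store the first value as a base and each subsequent value
/// as its delta from the previous one.
fn delta_encode(values: &[i64]) -> Option<(i64, Vec<i64>)> {
    let (&base, rest) = values.split_first()?;
    let mut prev = base;
    let deltas = rest
        .iter()
        .map(|&v| {
            let d = v - prev;
            prev = v;
            d
        })
        .collect();
    Some((base, deltas))
}

fn main() {
    // Millisecond timestamps arriving at a fixed 10 ms interval.
    let ts: Vec<i64> = (0..8).map(|i| 1_700_000_000_000 + i * 10).collect();
    let (base, deltas) = delta_encode(&ts).unwrap();
    // Every delta is 10, so each fits in 4 bits instead of 64.
    println!("base = {}, deltas = {:?}", base, deltas);
}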

Bit-Packing

  • Best For: Integer columns, boolean columns
  • Compression Ratio: 4-8x
  • How It Works: Packs values using only the required number of bits
  • Example: Values 0-255 use 8 bits instead of 64 bits
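
Computing the required width is a one-liner; this standalone helper mirrors the example above:

/// Minimum bit width needed for the largest value in a column.
fn required_bits(values: &[u64]) -> u32 {
    let max = values.iter().copied().max().unwrap_or(0);
    u64::BITS - max.leading_zeros()
}

fn main() {
    let col: Vec<u64> = vec![3, 200, 17, 255, 42];
    // max = 255 -> 8 bits per value instead of 64: an 8x saving.
    println!("{} bits/value", required_bits(&col));
}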

Generic Compression

  • Best For: Binary data, high-cardinality strings
  • Compression Ratio: 2-4x
  • How It Works: Uses LZ4 (Online/Warehouse) or Zstd (Archive)
  • Fallback: Applied when other techniques aren’t effective
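
A sketch of the fallback path, assuming the widely used lz4_flex and zstd crates; HeliosDB's actual codec bindings and compression levels may differ:

use std::io;

/// Fall back to a general-purpose codec when no structured
/// encoding is effective.
fn generic_compress(data: &[u8], archive: bool) -> io::Result<Vec<u8>> {
    if archive {
        // Archive mode trades CPU for ratio: Zstd at a higher level.
        zstd::encode_all(data, 9)
    } else {
        // Online/Warehouse favor speed: LZ4 with the length prepended.
        Ok(lz4_flex::compress_prepend_size(data))
    }
}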

Columnar Storage

HCC stores data in columnar format within compression units, transposing each batch of rows into per-column arrays (sketched in code after the benefits list):

Traditional Row Format:
[key1, value1, ts1] [key2, value2, ts2] [key3, value3, ts3] ...
HCC Columnar Format:
Column 0: [key1, key2, key3, ...]
Column 1: [value1, value2, value3, ...]
Column 2: [ts1, ts2, ts3, ...]

Benefits:

  • Better compression (similar values grouped together)
  • Predicate pushdown (read only needed columns)
  • Cache-friendly access patterns
  • Vectorization opportunities
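
The transposition is easy to sketch; the Row type and to_columns helper below are illustrative, not part of the HeliosDB API:

struct Row {
    key: String,
    value: String,
    ts: u64,
}

/// Transpose row-oriented entries into per-column vectors,
/// mirroring the layout inside a compression unit.
fn to_columns(rows: &[Row]) -> (Vec<&str>, Vec<&str>, Vec<u64>) {
    let keys = rows.iter().map(|r| r.key.as_str()).collect();
    let values = rows.iter().map(|r| r.value.as_str()).collect();
    let timestamps = rows.iter().map(|r| r.ts).collect();
    (keys, values, timestamps)
}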

Performance Characteristics

Compression Performance

Based on comprehensive benchmarks:

| Workload Type | Mode | Compression Ratio | Throughput |
| --- | --- | --- | --- |
| Low-cardinality strings | Warehouse | 8-15x | 150-250 MB/s |
| Sequential timestamps | Warehouse | 10-20x | 200-400 MB/s |
| Mixed integers | Warehouse | 4-8x | 100-200 MB/s |
| High-cardinality data | Warehouse | 2-4x | 80-150 MB/s |
| Mixed workload | Online | 3-5x | 200-350 MB/s |
| Mixed workload | Warehouse | 6-10x | 120-200 MB/s |
| Mixed workload | Archive | 10-15x | 50-100 MB/s |

Query Performance

Point Lookups:

  • Online mode: 10-50 µs
  • Warehouse mode: 20-100 µs
  • Archive mode: 50-200 µs

Range Scans (1000 rows):

  • Without projection: 5-15 ms
  • With projection: 2-5 ms
  • Speedup: 2-5x with predicate pushdown

Storage Savings

Example: 100GB uncompressed data

| Mode | Compressed Size | Storage Savings | Annual Cost Savings* |
| --- | --- | --- | --- |
| Online | ~25 GB | 75% | ~$90 |
| Warehouse | ~12 GB | 88% | ~$106 |
| Archive | ~8 GB | 92% | ~$110 |

*Assuming $0.10/GB/month cloud storage; e.g., Warehouse saves 88 GB × $0.10/GB/month × 12 months ≈ $106/year.

API Usage

Basic Compression

use heliosdb_storage::hcc::{HccCompressor, CompressionMode, ColumnData, ColumnType};
use bytes::Bytes;

// Create compressor
let compressor = HccCompressor::new(CompressionMode::Warehouse);

// Prepare column data
let mut column = ColumnData::new(0, ColumnType::Integer);
for i in 0..10000 {
    let value = (1000 + i * 10).to_le_bytes();
    column.add_value(Some(Bytes::copy_from_slice(&value)));
}

// Compress
let unit = compressor.compress(vec![column])?;
println!("Compression ratio: {:.2}x", unit.metadata.compression_ratio());

Decompression

use heliosdb_storage::hcc::HccDecompressor;

// Create decompressor
let decompressor = HccDecompressor::new();

// Decompress
let columns = decompressor.decompress(&unit)?;

// Access decompressed data
for value in &columns[0].values {
    if let Some(v) = value {
        // Process value
    }
}

HCC SSTable Integration

use heliosdb_storage::{HccSSTable, CompressionMode};
use heliosdb_storage::sstable::SSTableEntry;
use bytes::Bytes;

// Create entries
let mut entries = Vec::new();
for i in 0..100000 {
    entries.push(SSTableEntry {
        key: Bytes::from(format!("key{:08}", i)),
        value: Some(Bytes::from(format!("value{:08}", i))),
        timestamp: i as u64,
    });
}

// Create HCC-compressed SSTable
let sstable = HccSSTable::create(
    "data.hcc.sst",
    entries,
    CompressionMode::Warehouse,
)?;

// Point lookup
let result = sstable.get(&Bytes::from("key00050000"))?;

// Range scan
let results = sstable.scan(
    &Bytes::from("key00010000"),
    &Bytes::from("key00020000"),
)?;

// Scan with projection (predicate pushdown)
let results = sstable.scan_with_projection(
    &Bytes::from("key00010000"),
    &Bytes::from("key00020000"),
    &[0, 2], // Only decompress key and timestamp columns
)?;

Predicate Pushdown

// Decompress only specific columns, skipping units whose
// statistics fail the predicate
let columns = decompressor.decompress_with_predicate(
    &unit,
    &[0, 2], // Column IDs to decompress
    |stats| {
        // Custom predicate logic
        stats.null_count < stats.row_count / 2
    },
)?;

Choosing the Right Compression Mode

Use Online Mode When:

  • Data is frequently updated or inserted
  • Query latency is critical (<100ms requirements)
  • Running mixed OLTP/OLAP workloads
  • Need fast compression for streaming ingestion
  • Working with hot data (frequently accessed)

Use Warehouse Mode When:

  • Data is mostly read-only or append-only
  • Running primarily analytical queries
  • Storage cost is a concern but performance matters
  • Data is warm (regularly accessed but not hot)
  • Need balanced compression/performance trade-off

Use Archive Mode When:

  • Data is rarely accessed (cold storage)
  • Storage cost is the primary concern
  • Can tolerate higher query latency
  • Compliance requires long-term retention
  • Data is historical and immutable
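
Condensed into code, the guidance above might look like the following hedged sketch; choose_mode and its thresholds are illustrative, not tuned values:

use heliosdb_storage::hcc::CompressionMode;

/// Hypothetical heuristic condensing the checklists above.
fn choose_mode(updates_per_hour: u64, reads_per_day: u64) -> CompressionMode {
    if updates_per_hour > 0 {
        CompressionMode::Online // hot: frequently updated, latency-critical
    } else if reads_per_day >= 1 {
        CompressionMode::Warehouse // warm: read-mostly analytics
    } else {
        CompressionMode::Archive // cold: rare access, cheapest storage
    }
}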

Column Type Selection

Proper column type selection is crucial for optimal compression:

// Integer columns - use Delta encoding
ColumnType::Integer // For IDs, counters, numeric values
// String columns - use Dictionary encoding
ColumnType::String // For categories, tags, repeated strings
// Timestamp columns - use Delta encoding
ColumnType::Timestamp // For event times, created_at fields
// Boolean columns - use Bit-packing
ColumnType::Boolean // For flags, binary indicators
// Binary columns - use Generic compression
ColumnType::Binary // For BLOBs, serialized data

Compression Statistics

HCC provides detailed statistics for monitoring and optimization:

let metadata = unit.metadata;
println!("Row count: {}", metadata.row_count);
println!("Compression ratio: {:.2}x", metadata.compression_ratio());
println!("Uncompressed size: {} bytes", metadata.uncompressed_size);
println!("Compressed size: {} bytes", metadata.compressed_size);

for stats in &metadata.column_stats {
    println!(
        "Column {}: {} rows, {} unique values, {} nulls",
        stats.column_id, stats.row_count, stats.unique_count, stats.null_count
    );
}

Best Practices

1. Batch Writes

Compress data in large batches (compression unit size) for optimal ratios:

// Good: batch a full compression unit (32K rows for Warehouse)
let entries: Vec<_> = (0..32768).map(|i| create_entry(i)).collect();
let sstable = HccSSTable::create(path, entries, CompressionMode::Warehouse)?;

// Avoid: many small batches
for entry in entries {
    // Don't create individual SSTables per entry
}

2. Sort Before Compression

Sorted data compresses better with RLE and delta encoding:

entries.sort_by(|a, b| a.key.cmp(&b.key));
let sstable = HccSSTable::create(path, entries, mode)?;

3. Use Predicate Pushdown

Only decompress columns you need:

// Good: Only decompress needed columns
let results = sstable.scan_with_projection(start, end, &[0, 2])?;
// Avoid: Decompress everything
let results = sstable.scan(start, end)?;

4. Monitor Compression Ratios

Track compression effectiveness:

if metadata.compression_ratio() < 2.0 {
    println!("Warning: low compression ratio for this data");
    // Consider a different compression mode or data preprocessing
}

5. Match Mode to Access Pattern

// Hot data - frequent access
HccSSTable::create(path, entries, CompressionMode::Online)?;

// Warm data - regular analytics
HccSSTable::create(path, entries, CompressionMode::Warehouse)?;

// Cold data - archives
HccSSTable::create(path, entries, CompressionMode::Archive)?;

Integration with HeliosDB Features

LSM Tree Integration

HCC SSTables work seamlessly with HeliosDB’s LSM storage engine:

  • L0: Online mode (recent data, frequent updates)
  • L1-L4: Warehouse mode (compacted data)
  • L5+: Archive mode (cold data)

Compaction Strategy

// Different compression modes per level
let mode = match level {
    0 => CompressionMode::Online,
    1..=4 => CompressionMode::Warehouse,
    _ => CompressionMode::Archive,
};

Backup Integration

HCC SSTables are already compressed, avoiding double compression in backups.

TOAST Integration

Large values are stored externally via TOAST before HCC compression.

Performance Tuning

CPU vs Storage Trade-off

  • More compression = Less storage + More CPU
  • Less compression = More storage + Less CPU

Memory Usage

Compression unit size determines per-unit memory usage; the ranges below correspond to roughly 128-512 bytes per row:

  • Online (8K rows): ~1-4 MB per unit
  • Warehouse (32K rows): ~4-16 MB per unit
  • Archive (64K rows): ~8-32 MB per unit

Parallelization

Compression units are independent and can be processed in parallel:

use rayon::prelude::*;

let compressed_units: Vec<_> = chunks
    .par_iter()
    .map(|chunk| compressor.compress(chunk.clone()))
    .collect();

Troubleshooting

Low Compression Ratios

Problem: Achieving <2x compression

Solutions:

  1. Check data characteristics (high cardinality?)
  2. Try different compression mode
  3. Ensure proper column types
  4. Sort data before compression

Slow Compression

Problem: Compression throughput <50 MB/s

Solutions:

  1. Use Online mode instead of Archive
  2. Reduce compression unit size
  3. Parallelize compression
  4. Check CPU/memory resources

Slow Decompression

Problem: Query latency >500ms

Solutions:

  1. Use predicate pushdown (column projection)
  2. Switch to Online or Warehouse mode
  3. Add indexes for point lookups
  4. Cache decompressed data

Benchmarking

Run comprehensive benchmarks:

cargo bench --bench hcc_bench

This generates detailed performance reports including:

  • Compression ratios by workload type
  • Compression/decompression throughput
  • Memory usage statistics
  • Query performance metrics

Conclusion

HeliosDB’s HCC implementation successfully achieves:

  • Target compression ratio: 6-10x for mixed workloads
  • High performance: 100-400 MB/s compression throughput
  • Query optimization: 2-5x speedup with predicate pushdown
  • Flexible modes: Online, Warehouse, Archive for different use cases
  • Advanced techniques: Dictionary, RLE, Delta, Bit-packing, Generic

HCC provides Oracle Exadata-class compression with modern, efficient implementation suitable for cloud-native HTAP workloads.