Skip to content

HeliosDB Design Philosophy

HeliosDB Design Philosophy

Executive Summary

HeliosDB is a write-optimized HTAP database built on a distributed-parallel by design paradigm. Rather than retrofitting parallelism onto a traditional architecture, HeliosDB avoids single points of coordination from the ground up.

Core Mission: Provide immediate data access at maximum ingestion speed while maintaining ACID guarantees through intelligent buffering and asynchronous processing.


Core Principles

1. Zero Coordination Bottlenecks

Principle: No global thread or single point of coordination. The database scales linearly with available resources.

1.1 Lock-Free Row ID Generation

Row IDs use a hierarchical format that requires no global coordination:

| Partition (16-bit) | Timestamp (32-bit) | Sequence (16-bit) |
  • Each thread owns a partition ID (allocated once at startup)
  • Thread-local timestamp and sequence counters
  • 65,536 unique IDs per second per thread without contention
  • Implementation: src/storage/lockfree/row_id.rs

1.2 Partitioned WAL Architecture

Write-Ahead Log is partitioned for linear I/O scaling:

  • One WAL partition per CPU core (configurable)
  • Single-partition transactions use fast path (no coordination)
  • Cross-partition operations use Two-Phase Commit only when necessary
  • Implementation: src/storage/lockfree/wal_manager.rs

1.3 Work-Stealing Thread Pool

Parallel execution uses work-stealing for automatic load balancing:

  • Per-worker local queues (LIFO for cache locality)
  • Global injector queue for new tasks
  • Idle workers steal from busy workers’ queues
  • No central scheduler bottleneck
  • Implementation: src/storage/parallel/worker_pool.rs

1.4 Known Coordination Points (Documented Trade-offs)

Some minimal coordination exists by design:

Coordination PointRationaleMitigation
Global timestamp counterRequired for MVCC ordering across partitionsAtomic with Acquire/Release ordering; consider HLC for >64 cores
2PC for cross-partitionEnsures atomicity across WAL partitionsFast path for single-partition ops (majority case)

2. Staging-First Architecture (HeliosCore)

Principle: Capture data as fast as possible first, transform to optimal format asynchronously. Query Engine remains functional during transformation via in-memory hints.

2.1 Four-Stage Ingestion Pipeline

Bulk Data → StagingBuffer → StagingSegments → Compaction → MainStorage
↓ ↑
[QE Hint Data] [Async Transform]
  1. StagingBuffer: Append-only in-memory buffer (64MB default, 32MB flush threshold)
  2. StagingSegments: Immutable sorted runs from flushed buffers
  3. Compaction: K-way merge with deduplication
  4. MainStorage: Optimally formatted for table’s access patterns

Implementation: src/storage/staging/

2.2 Write Amplification Minimization

  • Data copied AS IS at maximum underlying I/O speed
  • Minimal mapping: only enough structure for immediate queryability
  • BTreeMap-sorted entries enable efficient merge without random I/O
  • Tombstone markers for deletes (no immediate compaction required)

2.3 Hint Data for Query Engine

During async transformation, a subset of staged data remains in memory:

  • Enables Query Engine to access recently-ingested data immediately
  • Lightweight index over staging buffer for point lookups
  • Progressive visibility: data queryable before full transformation completes

2.4 Multi-Store Capability

Each table (and column) can use the optimal storage format:

ModeBest ForCharacteristics
RowOLTP, point queriesFast single-row access
ColumnarAnalytics, aggregationsCompression, vectorized scans
HybridMixed workloadsHot columns row-based, cold columnar
  • Per-table storage mode selection
  • Per-column override capability
  • Async background transformation between modes
  • Implementation: Previously in HeliosDB-Lite, adapting to NativeBackend

3. Write-Optimized ACID

Principle: User acknowledgment is the highest priority. Capture writes to durable buffers immediately, consolidate asynchronously while maintaining full ACID guarantees.

3.1 Safety Level Spectrum

HeliosDB offers configurable durability levels:

LevelDurabilityThroughputUse Case
FullZero data loss1x baselineFinancial transactions, critical data
BatchedUp to N ops on crash3-5xHigh throughput with small loss tolerance
AsyncRecent ops at risk5-10xAnalytics ingestion, logs
UnsafeAll unflushed lost10-50xBulk load, temporary data

Implementation: src/storage/lockfree/config.rs

3.2 Group Commit Optimization

Batches multiple commits into single fsync:

  • 10ms batch window (configurable)
  • 10-100x throughput improvement over per-commit fsync
  • MPSC queue for commit coordination
  • Client acknowledged after batch is durable
  • Implementation: src/storage/wal.rs

3.3 MVCC Without Read Locks

Multi-Version Concurrency Control enables non-blocking reads:

  • 32-byte version header per row version
  • Snapshot isolation via read timestamps
  • Time-travel queries (AS OF TIMESTAMP/TRANSACTION)
  • No read locks required; writers never block readers
  • Implementation: src/storage/versioning/mvcc.rs

3.4 Background Async Consolidation

Write buffers are consolidated asynchronously:

  • Disk buffers capture changes immediately for durability
  • Background workers merge buffers into main storage
  • Delta tracking for incremental materialized view refresh
  • WAL cleanup after successful checkpoint

4. Adaptive Parallelism

Principle: If an operation takes too long, the database should dynamically increase parallelism. Cost-based decisions determine whether to buffer remaining work or re-execute with more resources.

4.1 SIMD Vectorization

Automatic CPU feature detection and utilization:

Detection Priority: AVX-512 → AVX2 → SSE4.2 → NEON → Scalar
  • Vectorized aggregations (SUM, MIN/MAX, COUNT)
  • Bitmask operations for filtering
  • Variable vector width based on capability (8-64 bytes)
  • Implementation: src/storage/parallel/simd_engine.rs

4.2 Three-Level Parallelism

Level 1 (Intra-segment): SIMD - 4-8 values per instruction
Level 2 (Inter-segment): Workers on different segments
Level 3 (Pipeline): Overlap I/O with compute

4.3 Scatter-Gather Scanning

Parallel segment processing with coordinated result aggregation:

  • Per-segment isolation for failure containment
  • Configurable parallelism degree (default: 16 concurrent)
  • Results collected via lock-free scatter-gather
  • Implementation: src/storage/parallel/scatter_gather.rs

4.4 Dynamic Scaling (Vision)

For long-running operations:

  1. Monitor progress rate and estimated completion time
  2. Re-evaluate: can adding workers reduce total time?
  3. Decision: continue with buffer OR re-execute with more parallelism
  4. Spawn additional workers locally or on remote nodes
  5. Cost model determines cheapest path to completion

Implementation Guidelines

Thread Model

  • Worker threads: One per CPU core (configurable via get_cpu_count())
  • Background threads: Compaction, WAL sync, maintenance
  • No busy-waiting: Threads yield when idle (crossbeam parking)

Memory Management

  • Buffer pools: Pre-allocated WriteBufferPool for predictable latency
  • 4KB alignment: Required for O_DIRECT I/O
  • Memory-mapped segments: For large sequential scans

Error Handling

  • Graceful degradation: Reduce parallelism rather than fail
  • Operation retry: Exponential backoff for transient failures
  • Circuit breaker: For repeated failures to same resource

Trade-off Documentation

Why Global Timestamp Counter?

Despite “zero coordination” philosophy, a global timestamp exists for MVCC ordering. This is an acceptable trade-off because:

  1. Atomic operations with Acquire/Release are extremely fast (<10ns)
  2. MVCC correctness requires total ordering of commits
  3. The alternative (logical clocks per partition) adds complexity for cross-partition queries
  4. Mitigation: Hybrid Logical Clocks (HLC) or batched allocation for >64 core systems

Why 2PC for Cross-Partition?

Two-Phase Commit is used for transactions spanning multiple WAL partitions:

  1. Single-partition transactions (the common case) use fast path
  2. Cross-partition atomicity requires coordination by definition
  3. Pipelining optimization can reduce latency
  4. Alternative: eventual consistency was rejected for ACID compliance

Why Multiple Safety Levels?

One-size-fits-all durability doesn’t match real-world needs:

  • Financial systems need zero data loss (Full)
  • Log ingestion can tolerate small gaps (Async)
  • Bulk ETL just needs speed (Unsafe with post-load verification)

Explicit levels force users to make conscious decisions about their durability requirements.


File Reference

ComponentPrimary FilePurpose
Lock-free Row IDssrc/storage/lockfree/row_id.rsHierarchical ID generation
Partitioned WALsrc/storage/lockfree/wal_manager.rsPer-core WAL with 2PC
Work-stealing Poolsrc/storage/parallel/worker_pool.rsCrossbeam-based execution
Staging Buffersrc/storage/staging/buffer.rsAppend-only ingestion
Staging Compactorsrc/storage/staging/compactor.rsK-way merge
Bulk Loadersrc/storage/staging/loader.rsHigh-speed ingestion API
Safety Levelssrc/storage/lockfree/config.rsDurability configuration
Group Commitsrc/storage/wal.rsBatched fsync
MVCCsrc/storage/versioning/mvcc.rsMulti-version concurrency
SIMD Enginesrc/storage/parallel/simd_engine.rsVectorized operations
Scatter-Gathersrc/storage/parallel/scatter_gather.rsParallel scanning
Shard Routersrc/storage/shard/router.rsDistributed routing