HeliosDB Design Philosophy

Executive Summary

HeliosDB is a write-optimized HTAP database built on a distributed-parallel by design paradigm. Rather than retrofitting parallelism onto a traditional architecture, HeliosDB avoids single points of coordination from the ground up.

Core Mission: Provide immediate data access at maximum ingestion speed while maintaining ACID guarantees through intelligent buffering and asynchronous processing.

Core Principles

1. Zero Coordination Bottlenecks

Principle: No global thread or single point of coordination. The database scales linearly with available resources.

1.1 Lock-Free Row ID Generation

Row IDs use a hierarchical format that requires no global coordination:

| Partition (16-bit) | Timestamp (32-bit) | Sequence (16-bit) |

Each thread owns a partition ID (allocated once at startup)
Thread-local timestamp and sequence counters
65,536 unique IDs per second per thread without contention
Implementation: src/storage/lockfree/row_id.rs

1.2 Partitioned WAL Architecture

Write-Ahead Log is partitioned for linear I/O scaling:

One WAL partition per CPU core (configurable)
Single-partition transactions use fast path (no coordination)
Cross-partition operations use Two-Phase Commit only when necessary
Implementation: src/storage/lockfree/wal_manager.rs

1.3 Work-Stealing Thread Pool

Parallel execution uses work-stealing for automatic load balancing:

Per-worker local queues (LIFO for cache locality)
Global injector queue for new tasks
Idle workers steal from busy workers’ queues
No central scheduler bottleneck
Implementation: src/storage/parallel/worker_pool.rs

1.4 Known Coordination Points (Documented Trade-offs)

Some minimal coordination exists by design:

Coordination Point	Rationale	Mitigation
Global timestamp counter	Required for MVCC ordering across partitions	Atomic with Acquire/Release ordering; consider HLC for >64 cores
2PC for cross-partition	Ensures atomicity across WAL partitions	Fast path for single-partition ops (majority case)

2. Staging-First Architecture (HeliosCore)

Principle: Capture data as fast as possible first, transform to optimal format asynchronously. Query Engine remains functional during transformation via in-memory hints.

2.1 Four-Stage Ingestion Pipeline

Bulk Data → StagingBuffer → StagingSegments → Compaction → MainStorage
                 ↓                                    ↑
           [QE Hint Data]                    [Async Transform]

StagingBuffer: Append-only in-memory buffer (64MB default, 32MB flush threshold)
StagingSegments: Immutable sorted runs from flushed buffers
Compaction: K-way merge with deduplication
MainStorage: Optimally formatted for table’s access patterns

Implementation: src/storage/staging/

2.2 Write Amplification Minimization

Data copied AS IS at maximum underlying I/O speed
Minimal mapping: only enough structure for immediate queryability
BTreeMap-sorted entries enable efficient merge without random I/O
Tombstone markers for deletes (no immediate compaction required)

2.3 Hint Data for Query Engine

During async transformation, a subset of staged data remains in memory:

Enables Query Engine to access recently-ingested data immediately
Lightweight index over staging buffer for point lookups
Progressive visibility: data queryable before full transformation completes

2.4 Multi-Store Capability

Each table (and column) can use the optimal storage format:

Mode	Best For	Characteristics
Row	OLTP, point queries	Fast single-row access
Columnar	Analytics, aggregations	Compression, vectorized scans
Hybrid	Mixed workloads	Hot columns row-based, cold columnar

Per-table storage mode selection
Per-column override capability
Async background transformation between modes
Implementation: Previously in HeliosDB-Lite, adapting to NativeBackend

3. Write-Optimized ACID

Principle: User acknowledgment is the highest priority. Capture writes to durable buffers immediately, consolidate asynchronously while maintaining full ACID guarantees.

3.1 Safety Level Spectrum

HeliosDB offers configurable durability levels:

Level	Durability	Throughput	Use Case
Full	Zero data loss	1x baseline	Financial transactions, critical data
Batched	Up to N ops on crash	3-5x	High throughput with small loss tolerance
Async	Recent ops at risk	5-10x	Analytics ingestion, logs
Unsafe	All unflushed lost	10-50x	Bulk load, temporary data

Implementation: src/storage/lockfree/config.rs

3.2 Group Commit Optimization

Batches multiple commits into single fsync:

10ms batch window (configurable)
10-100x throughput improvement over per-commit fsync
MPSC queue for commit coordination
Client acknowledged after batch is durable
Implementation: src/storage/wal.rs

3.3 MVCC Without Read Locks

Multi-Version Concurrency Control enables non-blocking reads:

32-byte version header per row version
Snapshot isolation via read timestamps
Time-travel queries (AS OF TIMESTAMP/TRANSACTION)
No read locks required; writers never block readers
Implementation: src/storage/versioning/mvcc.rs

3.4 Background Async Consolidation

Write buffers are consolidated asynchronously:

Disk buffers capture changes immediately for durability
Background workers merge buffers into main storage
Delta tracking for incremental materialized view refresh
WAL cleanup after successful checkpoint

4. Adaptive Parallelism

Principle: If an operation takes too long, the database should dynamically increase parallelism. Cost-based decisions determine whether to buffer remaining work or re-execute with more resources.

4.1 SIMD Vectorization

Automatic CPU feature detection and utilization:

Detection Priority: AVX-512 → AVX2 → SSE4.2 → NEON → Scalar

Vectorized aggregations (SUM, MIN/MAX, COUNT)
Bitmask operations for filtering
Variable vector width based on capability (8-64 bytes)
Implementation: src/storage/parallel/simd_engine.rs

4.2 Three-Level Parallelism

Level 1 (Intra-segment):  SIMD - 4-8 values per instruction
Level 2 (Inter-segment):  Workers on different segments
Level 3 (Pipeline):       Overlap I/O with compute

4.3 Scatter-Gather Scanning

Parallel segment processing with coordinated result aggregation:

Per-segment isolation for failure containment
Configurable parallelism degree (default: 16 concurrent)
Results collected via lock-free scatter-gather
Implementation: src/storage/parallel/scatter_gather.rs

4.4 Dynamic Scaling (Vision)

For long-running operations:

Monitor progress rate and estimated completion time
Re-evaluate: can adding workers reduce total time?
Decision: continue with buffer OR re-execute with more parallelism
Spawn additional workers locally or on remote nodes
Cost model determines cheapest path to completion

Implementation Guidelines

Thread Model

Worker threads: One per CPU core (configurable via get_cpu_count())
Background threads: Compaction, WAL sync, maintenance
No busy-waiting: Threads yield when idle (crossbeam parking)

Memory Management

Buffer pools: Pre-allocated WriteBufferPool for predictable latency
4KB alignment: Required for O_DIRECT I/O
Memory-mapped segments: For large sequential scans

Error Handling

Graceful degradation: Reduce parallelism rather than fail
Operation retry: Exponential backoff for transient failures
Circuit breaker: For repeated failures to same resource

Trade-off Documentation

Why Global Timestamp Counter?

Despite “zero coordination” philosophy, a global timestamp exists for MVCC ordering. This is an acceptable trade-off because:

Atomic operations with Acquire/Release are extremely fast (<10ns)
MVCC correctness requires total ordering of commits
The alternative (logical clocks per partition) adds complexity for cross-partition queries
Mitigation: Hybrid Logical Clocks (HLC) or batched allocation for >64 core systems

Why 2PC for Cross-Partition?

Two-Phase Commit is used for transactions spanning multiple WAL partitions:

Single-partition transactions (the common case) use fast path
Cross-partition atomicity requires coordination by definition
Pipelining optimization can reduce latency
Alternative: eventual consistency was rejected for ACID compliance

Why Multiple Safety Levels?

One-size-fits-all durability doesn’t match real-world needs:

Financial systems need zero data loss (Full)
Log ingestion can tolerate small gaps (Async)
Bulk ETL just needs speed (Unsafe with post-load verification)

Explicit levels force users to make conscious decisions about their durability requirements.

File Reference

Component	Primary File	Purpose
Lock-free Row IDs	`src/storage/lockfree/row_id.rs`	Hierarchical ID generation
Partitioned WAL	`src/storage/lockfree/wal_manager.rs`	Per-core WAL with 2PC
Work-stealing Pool	`src/storage/parallel/worker_pool.rs`	Crossbeam-based execution
Staging Buffer	`src/storage/staging/buffer.rs`	Append-only ingestion
Staging Compactor	`src/storage/staging/compactor.rs`	K-way merge
Bulk Loader	`src/storage/staging/loader.rs`	High-speed ingestion API
Safety Levels	`src/storage/lockfree/config.rs`	Durability configuration
Group Commit	`src/storage/wal.rs`	Batched fsync
MVCC	`src/storage/versioning/mvcc.rs`	Multi-version concurrency
SIMD Engine	`src/storage/parallel/simd_engine.rs`	Vectorized operations
Scatter-Gather	`src/storage/parallel/scatter_gather.rs`	Parallel scanning
Shard Router	`src/storage/shard/router.rs`	Distributed routing

Gap Analysis - Current Implementation details vs. philosophy
NativeBackend Architecture - Storage engine details
Compression Integration - Data compression strategies

HeliosDB Design Philosophy

HeliosDB Design Philosophy

Executive Summary

Core Principles

1. Zero Coordination Bottlenecks

1.1 Lock-Free Row ID Generation

1.2 Partitioned WAL Architecture

1.3 Work-Stealing Thread Pool

1.4 Known Coordination Points (Documented Trade-offs)

2. Staging-First Architecture (HeliosCore)

2.1 Four-Stage Ingestion Pipeline

2.2 Write Amplification Minimization

2.3 Hint Data for Query Engine

2.4 Multi-Store Capability

3. Write-Optimized ACID

3.1 Safety Level Spectrum

3.2 Group Commit Optimization

3.3 MVCC Without Read Locks

3.4 Background Async Consolidation

4. Adaptive Parallelism

4.1 SIMD Vectorization

4.2 Three-Level Parallelism

4.3 Scatter-Gather Scanning

4.4 Dynamic Scaling (Vision)

Implementation Guidelines

Thread Model

Memory Management

Error Handling

Trade-off Documentation

Why Global Timestamp Counter?

Why 2PC for Cross-Partition?

Why Multiple Safety Levels?

File Reference

Related Documents