HeliosDB Design Philosophy
HeliosDB Design Philosophy
Executive Summary
HeliosDB is a write-optimized HTAP database built on a distributed-parallel by design paradigm. Rather than retrofitting parallelism onto a traditional architecture, HeliosDB avoids single points of coordination from the ground up.
Core Mission: Provide immediate data access at maximum ingestion speed while maintaining ACID guarantees through intelligent buffering and asynchronous processing.
Core Principles
1. Zero Coordination Bottlenecks
Principle: No global thread or single point of coordination. The database scales linearly with available resources.
1.1 Lock-Free Row ID Generation
Row IDs use a hierarchical format that requires no global coordination:
| Partition (16-bit) | Timestamp (32-bit) | Sequence (16-bit) |- Each thread owns a partition ID (allocated once at startup)
- Thread-local timestamp and sequence counters
- 65,536 unique IDs per second per thread without contention
- Implementation:
src/storage/lockfree/row_id.rs
1.2 Partitioned WAL Architecture
Write-Ahead Log is partitioned for linear I/O scaling:
- One WAL partition per CPU core (configurable)
- Single-partition transactions use fast path (no coordination)
- Cross-partition operations use Two-Phase Commit only when necessary
- Implementation:
src/storage/lockfree/wal_manager.rs
1.3 Work-Stealing Thread Pool
Parallel execution uses work-stealing for automatic load balancing:
- Per-worker local queues (LIFO for cache locality)
- Global injector queue for new tasks
- Idle workers steal from busy workers’ queues
- No central scheduler bottleneck
- Implementation:
src/storage/parallel/worker_pool.rs
1.4 Known Coordination Points (Documented Trade-offs)
Some minimal coordination exists by design:
| Coordination Point | Rationale | Mitigation |
|---|---|---|
| Global timestamp counter | Required for MVCC ordering across partitions | Atomic with Acquire/Release ordering; consider HLC for >64 cores |
| 2PC for cross-partition | Ensures atomicity across WAL partitions | Fast path for single-partition ops (majority case) |
2. Staging-First Architecture (HeliosCore)
Principle: Capture data as fast as possible first, transform to optimal format asynchronously. Query Engine remains functional during transformation via in-memory hints.
2.1 Four-Stage Ingestion Pipeline
Bulk Data → StagingBuffer → StagingSegments → Compaction → MainStorage ↓ ↑ [QE Hint Data] [Async Transform]- StagingBuffer: Append-only in-memory buffer (64MB default, 32MB flush threshold)
- StagingSegments: Immutable sorted runs from flushed buffers
- Compaction: K-way merge with deduplication
- MainStorage: Optimally formatted for table’s access patterns
Implementation: src/storage/staging/
2.2 Write Amplification Minimization
- Data copied AS IS at maximum underlying I/O speed
- Minimal mapping: only enough structure for immediate queryability
- BTreeMap-sorted entries enable efficient merge without random I/O
- Tombstone markers for deletes (no immediate compaction required)
2.3 Hint Data for Query Engine
During async transformation, a subset of staged data remains in memory:
- Enables Query Engine to access recently-ingested data immediately
- Lightweight index over staging buffer for point lookups
- Progressive visibility: data queryable before full transformation completes
2.4 Multi-Store Capability
Each table (and column) can use the optimal storage format:
| Mode | Best For | Characteristics |
|---|---|---|
| Row | OLTP, point queries | Fast single-row access |
| Columnar | Analytics, aggregations | Compression, vectorized scans |
| Hybrid | Mixed workloads | Hot columns row-based, cold columnar |
- Per-table storage mode selection
- Per-column override capability
- Async background transformation between modes
- Implementation: Previously in HeliosDB-Lite, adapting to NativeBackend
3. Write-Optimized ACID
Principle: User acknowledgment is the highest priority. Capture writes to durable buffers immediately, consolidate asynchronously while maintaining full ACID guarantees.
3.1 Safety Level Spectrum
HeliosDB offers configurable durability levels:
| Level | Durability | Throughput | Use Case |
|---|---|---|---|
| Full | Zero data loss | 1x baseline | Financial transactions, critical data |
| Batched | Up to N ops on crash | 3-5x | High throughput with small loss tolerance |
| Async | Recent ops at risk | 5-10x | Analytics ingestion, logs |
| Unsafe | All unflushed lost | 10-50x | Bulk load, temporary data |
Implementation: src/storage/lockfree/config.rs
3.2 Group Commit Optimization
Batches multiple commits into single fsync:
- 10ms batch window (configurable)
- 10-100x throughput improvement over per-commit fsync
- MPSC queue for commit coordination
- Client acknowledged after batch is durable
- Implementation:
src/storage/wal.rs
3.3 MVCC Without Read Locks
Multi-Version Concurrency Control enables non-blocking reads:
- 32-byte version header per row version
- Snapshot isolation via read timestamps
- Time-travel queries (AS OF TIMESTAMP/TRANSACTION)
- No read locks required; writers never block readers
- Implementation:
src/storage/versioning/mvcc.rs
3.4 Background Async Consolidation
Write buffers are consolidated asynchronously:
- Disk buffers capture changes immediately for durability
- Background workers merge buffers into main storage
- Delta tracking for incremental materialized view refresh
- WAL cleanup after successful checkpoint
4. Adaptive Parallelism
Principle: If an operation takes too long, the database should dynamically increase parallelism. Cost-based decisions determine whether to buffer remaining work or re-execute with more resources.
4.1 SIMD Vectorization
Automatic CPU feature detection and utilization:
Detection Priority: AVX-512 → AVX2 → SSE4.2 → NEON → Scalar- Vectorized aggregations (SUM, MIN/MAX, COUNT)
- Bitmask operations for filtering
- Variable vector width based on capability (8-64 bytes)
- Implementation:
src/storage/parallel/simd_engine.rs
4.2 Three-Level Parallelism
Level 1 (Intra-segment): SIMD - 4-8 values per instructionLevel 2 (Inter-segment): Workers on different segmentsLevel 3 (Pipeline): Overlap I/O with compute4.3 Scatter-Gather Scanning
Parallel segment processing with coordinated result aggregation:
- Per-segment isolation for failure containment
- Configurable parallelism degree (default: 16 concurrent)
- Results collected via lock-free scatter-gather
- Implementation:
src/storage/parallel/scatter_gather.rs
4.4 Dynamic Scaling (Vision)
For long-running operations:
- Monitor progress rate and estimated completion time
- Re-evaluate: can adding workers reduce total time?
- Decision: continue with buffer OR re-execute with more parallelism
- Spawn additional workers locally or on remote nodes
- Cost model determines cheapest path to completion
Implementation Guidelines
Thread Model
- Worker threads: One per CPU core (configurable via
get_cpu_count()) - Background threads: Compaction, WAL sync, maintenance
- No busy-waiting: Threads yield when idle (crossbeam parking)
Memory Management
- Buffer pools: Pre-allocated
WriteBufferPoolfor predictable latency - 4KB alignment: Required for O_DIRECT I/O
- Memory-mapped segments: For large sequential scans
Error Handling
- Graceful degradation: Reduce parallelism rather than fail
- Operation retry: Exponential backoff for transient failures
- Circuit breaker: For repeated failures to same resource
Trade-off Documentation
Why Global Timestamp Counter?
Despite “zero coordination” philosophy, a global timestamp exists for MVCC ordering. This is an acceptable trade-off because:
- Atomic operations with Acquire/Release are extremely fast (<10ns)
- MVCC correctness requires total ordering of commits
- The alternative (logical clocks per partition) adds complexity for cross-partition queries
- Mitigation: Hybrid Logical Clocks (HLC) or batched allocation for >64 core systems
Why 2PC for Cross-Partition?
Two-Phase Commit is used for transactions spanning multiple WAL partitions:
- Single-partition transactions (the common case) use fast path
- Cross-partition atomicity requires coordination by definition
- Pipelining optimization can reduce latency
- Alternative: eventual consistency was rejected for ACID compliance
Why Multiple Safety Levels?
One-size-fits-all durability doesn’t match real-world needs:
- Financial systems need zero data loss (Full)
- Log ingestion can tolerate small gaps (Async)
- Bulk ETL just needs speed (Unsafe with post-load verification)
Explicit levels force users to make conscious decisions about their durability requirements.
File Reference
| Component | Primary File | Purpose |
|---|---|---|
| Lock-free Row IDs | src/storage/lockfree/row_id.rs | Hierarchical ID generation |
| Partitioned WAL | src/storage/lockfree/wal_manager.rs | Per-core WAL with 2PC |
| Work-stealing Pool | src/storage/parallel/worker_pool.rs | Crossbeam-based execution |
| Staging Buffer | src/storage/staging/buffer.rs | Append-only ingestion |
| Staging Compactor | src/storage/staging/compactor.rs | K-way merge |
| Bulk Loader | src/storage/staging/loader.rs | High-speed ingestion API |
| Safety Levels | src/storage/lockfree/config.rs | Durability configuration |
| Group Commit | src/storage/wal.rs | Batched fsync |
| MVCC | src/storage/versioning/mvcc.rs | Multi-version concurrency |
| SIMD Engine | src/storage/parallel/simd_engine.rs | Vectorized operations |
| Scatter-Gather | src/storage/parallel/scatter_gather.rs | Parallel scanning |
| Shard Router | src/storage/shard/router.rs | Distributed routing |
Related Documents
- Gap Analysis - Current implementation status vs. philosophy
- NativeBackend Architecture - Storage engine details
- Compression Integration - Data compression strategies