Skip to content

HeliosCore - Native Storage Engine

HeliosCore - Native Storage Engine

Overview

HeliosCore is HeliosDB-Lite’s custom storage engine that decouples storage from compute, designed to eventually replace RocksDB. All storage primitives (versioning, branching, time-travel, MVCC) are implemented natively in Rust.

Key Principles:

  • File-based block storage (disk sections) instead of raw devices
  • Adaptive segments (8KB-16MB) instead of fixed pages
  • Native MVCC with branching and time-travel built into the storage format
  • Multi-node distributed architecture from day one
  • Elastic scaling (1-N storage nodes)
  • Workload-aware automatic optimization

Architecture

+-----------------------------------------------------------------------------+
| COMPUTE TIER (Query Processing) |
| +------------+ +----------+ +-----------+ +-----------------+ |
| | SQL Parser | | Planner | | Optimizer | | ParallelScanner | |
| +-----+------+ +----+-----+ +-----+-----+ +-------+---------+ |
| +---------------+-------------+----------------+ |
| | |
| +---------+---------+ |
| | StorageClient | <-- Shard-aware routing |
| | (gRPC client) | |
| +---------+---------+ |
+--------------------------|--------------------------------------------------+
| gRPC (predicates, projections, parallel ops)
+--------------------------|--------------------------------------------------+
| STORAGE TIER (Data Management) |
| +---------+---------+ |
| | StorageNodeServer | |
| +---------+---------+ |
| | |
| +---------------------------+-------------------------------------------+ |
| | PARALLEL EXECUTION ENGINE | |
| | +-----------+ +------------+ +-------------------+ | |
| | | WorkerPool| | SIMD Engine| | Cross-Shard Coord | | |
| | | (per-core)| | (vectorized)| | (scatter-gather) | | |
| | +-----------+ +------------+ +-------------------+ | |
| +--------------------------------------------------------------------+ |
| | |
| +-----------------+ +------+----------+ +----------------+ |
| | PredicateEngine | | SegmentManager | | WAL Manager | |
| | (Bloom/ZoneMap/ | | (8KB-16MB, | | (Durability) | |
| | SIMD filtering)| | adaptive) | | | |
| +--------+--------+ +-------+--------+ +-------+--------+ |
| | | | |
| +--------+--------------+----+--------------------+---------+ |
| | SMFI Pipeline | |
| | (Background: index maintenance, filter updates) | |
| +----------+------------------------------------------------+ |
| | |
| +----------+--------------------------------------------------------+ |
| | INTELLIGENT FILTER ADVISOR | |
| | +----------------+ +----------------+ +--------------------+ | |
| | | FilterRegistry | | FilterAdvisor | | WorkloadAnalyzer | | |
| | | (Catalog) | | (ML-based) | | (Pattern Detection)| | |
| | +----------------+ +----------------+ +--------------------+ | |
| | +----------------+ +----------------+ +--------------------+ | |
| | | StorageBudget | | BuildCoord | | Filter Impls | | |
| | | Manager | | (Background) | | Bloom/Cuckoo/XOR | | |
| | +----------------+ +----------------+ +--------------------+ | |
| +--------------------------------------------------------------------+ |
| | |
| +---------------------------+-------------------------------------------+ |
| | BlockFileManager | |
| | (Dynamic file add/merge/relocate, extent allocation) | |
| +--------------------------------------------------------------------+ |
+-----------------------------------------------------------------------------+

Key Design Decisions

AspectDecisionRationale
Segment SizeAdaptive 8KB-16MBOLTP: small (8-64KB), OLAP: large (1-16MB), auto-selected
Segment TypesRow/Column/Hybrid/CompressedDifferent formats for different workloads
AllocationExtent-based (8-262K blocks)Fast allocation, buddy coalescing
ProtocolgRPC + protobufEfficient binary, streaming, parallel scatter-gather
File ManagementDynamic add/merge/relocateOnline capacity expansion
ReplicationConfigurable quorumSync/Async/Quorum modes per workload
ShardingHash + Range, intra + cross-nodeParallelism at every level
Bulk LoadStaging area with eventual consistencyUltra-fast ingestion, query-accessible
ParallelismPer-core workers + SIMDMaximize hardware utilization
TimestampsHLC (Hybrid Logical Clock)Causal ordering across distributed nodes
Lock-freeThread-local partitions, DashMap, atomicsZero contention on hot paths
WALDistributed partitioned WALLinear I/O scaling, parallel recovery
Aux StructuresWorkload-aware intelligent creationCreate indexes only when beneficial
MaintenanceDeferred updates, daily/weekly reorgMinimize write amplification

Implementation Status

Phase 1: HeliosCore API (IMPLEMENTED)

The HTAP-aware Query Engine to Storage collaboration API is fully implemented in src/helioscore/api/.

Implemented Components:

  • access.rs - HTAP-aware data access methods
  • hints.rs - Query hints, SLA, priority, workload classification
  • stats.rs - Statistics provider for optimizer with adaptive TTL caching
  • throttle.rs - Throttling, backpressure, admission control
  • pushdown.rs - Predicate/projection/aggregation pushdown evaluation
  • suggest.rs - Storage suggestions to query engine

Key Traits:

pub trait HeliosCoreAccess: Send + Sync {
// Data access methods (workload-aware, storage-type flexible)
fn get(&self, req: PointGetRequest) -> Result<Option<Row>>;
fn multi_get(&self, req: MultiGetRequest) -> Result<Vec<Option<Row>>>;
fn scan(&self, req: ScanRequest) -> Result<ScanIterator>;
fn aggregate(&self, req: AggregateRequest) -> Result<AggregateResult>;
// Optimizer collaboration (bidirectional)
fn get_statistics(&self, req: StatisticsRequest) -> Result<TableStatistics>;
fn estimate_cost(&self, plan: &AccessPlan) -> Result<CostEstimate>;
fn suggest_access(&self, query: &QueryProfile) -> Result<AccessSuggestion>;
fn evaluate_pushdown(&self, ops: &PushdownOps) -> Result<PushdownEvaluation>;
}

Phase 2: Block File Manager (IMPLEMENTED)

Located in src/storage/block/:

  • O_DIRECT file handles
  • Pre-allocated block files
  • Extent-based allocation with buddy system

Phase 2.1: O_DIRECT WAL Manager (IMPLEMENTED)

Located in src/storage/wal_direct/:

  • aligned_buffer.rs - 4KB-aligned buffers for O_DIRECT I/O
  • direct_wal.rs - DirectWalManager with staging buffer
  • group_commit.rs - 10ms batching window for efficient fsync
  • flusher.rs - Async background writer thread
  • platform.rs - Cross-platform O_DIRECT/F_NOCACHE support

Performance Benefits:

MetricRocksDBO_DIRECT WAL
Write Latency~1ms5-7us
Throughput~1K writes/sec150-200K writes/sec
Memory Overhead100% (page cache)20% (staging only)

Phase 3: Adaptive Segment Manager (IMPLEMENTED)

Located in src/storage/segment/:

  • Variable-size segments (8KB-16MB)
  • Row/Column/Hybrid/Compressed segment types
  • 128-byte self-describing headers
  • Inline bloom filters and zone maps

Phase 4: Native Versioning (IMPLEMENTED)

Located in src/storage/versioning/:

  • Native MVCC with inline version chains
  • Branch management at storage level
  • Time-travel query resolution
  • Version garbage collection

Phase 5: Storage-Tier Predicate Engine (IMPLEMENTED)

Located in src/storage/predicate/:

  • Zone map evaluation
  • Bloom filter integration
  • SIMD vectorized filtering

Phase 11: Intelligent Auxiliary Structures (IMPLEMENTED)

Intelligent Filter Advisor System

What Makes HeliosDB-Lite Unique: A self-optimizing filter management system that differentiates from all existing databases.

Features

  1. Automatic Filter Creation - Creates optimal secondary filters based on workload patterns
  2. ML-Driven Selection - Learns which filters benefit specific tables/columns
  3. Online Maintenance - Lock-free reads during rebuild
  4. Budget-Aware - Respects user-defined storage limits
  5. Query Engine Integration - Exposes filter availability to optimizer

Architecture

+-----------------------------------------------------------------------+
| HeliosQE (Query Engine) |
| +---------------------------------------------------------------+ |
| | Query Optimizer | |
| | - Receives filter catalog from FilterRegistry | |
| | - Incorporates filter costs into execution plans | |
| | - Requests filter creation hints to FilterAdvisor | |
| +---------------------------------------------------------------+ |
+----------------------------------+------------------------------------+
|
v
+-----------------------------------------------------------------------+
| HeliosCore Storage Engine |
| |
| +------------------+ +------------------+ +------------------+ |
| | FilterRegistry |<---| FilterAdvisor |<---| WorkloadAnalyzer | |
| | | | (ML-based) | | (Pattern Detect) | |
| | - Catalog | | - Scoring | | - Query patterns | |
| | - Visibility | | - Prioritization | | - Access freq | |
| | - Statistics | | - Eviction | | - Selectivity | |
| +--------+---------+ +--------+---------+ +------------------+ |
| | | |
| v v |
| +---------------------------------------------------------------+ |
| | FilterBuildCoordinator (Background Async) | |
| | | |
| | +---------------+ +---------------+ +---------------+ | |
| | | BuildQueue | | BuildWorkers | | DeltaTracker | | |
| | | (Priority) | | (Parallel) | | (Lock-free) | | |
| | +---------------+ +---------------+ +---------------+ | |
| +---------------------------------------------------------------+ |
| |
| +---------------------------------------------------------------+ |
| | Filter Implementations | |
| | | |
| | +--------+ +--------+ +--------+ +--------+ +--------+ | |
| | | Bloom | | Cuckoo | | XOR | | Ribbon | | MPHF | | |
| | +--------+ +--------+ +--------+ +--------+ +--------+ | |
| | +--------+ +--------+ +--------+ | |
| | |Count- | |Quotient| | ART | | |
| | |Min | | | | | | |
| | +--------+ +--------+ +--------+ | |
| +---------------------------------------------------------------+ |
| |
| +---------------------------------------------------------------+ |
| | StorageBudgetManager | |
| | - Secondary filter storage limit (%) | |
| | - Online reconfiguration (no restart) | |
| | - Eviction when over budget | |
| +---------------------------------------------------------------+ |
+-----------------------------------------------------------------------+

Implemented Components

ComponentFileStatus
FilterRegistrysrc/storage/filter_registry.rsImplemented
StorageBudgetManagersrc/storage/storage_budget_manager.rsImplemented
CuckooFiltersrc/storage/cuckoo_filter.rsImplemented
XorFiltersrc/storage/xor_filter.rsImplemented
BloomFiltersrc/storage/bloom_filter.rsImplemented
ZoneMapsrc/storage/zone_map.rsImplemented
FilterBuildCoordinatorsrc/storage/filter_build_coordinator.rsImplemented
FilterAdvisor (ML)src/storage/filter_advisor.rsImplemented
WorkloadAnalyzersrc/storage/workload_analyzer.rsImplemented
RibbonFiltersrc/storage/ribbon_filter.rsImplemented
MPHFsrc/storage/mphf.rsImplemented
CountMinSketchsrc/storage/count_min_sketch.rsImplemented
FilterCostModelsrc/storage/integration/filter_cost_model.rsImplemented

Filter Types

FilterUse CaseFPRDeletionSpace
BloomGeneral equality0.1-10%No~10 bits/item
CuckooEquality with deletes0.1-3%Yes~12 bits/item
XORStatic datasets0.1-1%No~9 bits/item
RibbonVery low FPR<0.1%No~8 bits/item
MPHFDictionary encoding0%No~3 bits/item
CountMinSketchFrequency estimationN/AN/AFixed width
Zone MapRange predicatesN/AN/AMin/Max only

Eviction Rules

Eviction Score Components:

ComponentWeightDescription
Optimizer Benefit40%How often optimizer selects this filter
Query Impact30%Actual latency/IO improvement
Recency15%How recently used
Efficiency10%Benefit per byte
Rebuild Cost5%Cost to recreate

Eviction Priority (First to Last):

PriorityCategoryConditionScore Range
1 (FIRST)Never Useduse_count=0, optimizer never selected0.00-0.05
2StaleFPR >10%, status=Stale0.05-0.15
3RedundantAnother filter dominates0.10-0.25
4Low Impacthit_rate <50%, latency <5%0.15-0.35
5InfrequentNot used in 24+ hours0.25-0.50
6Inefficient>100MB with <10% improvement0.30-0.60
7Budget PressureNormal filters, over budget0.50-0.80
8 (LAST)High ValueHigh hit rate, recent, efficient0.80-1.00
NEVERProtectedPrimary, user-protected, activeN/A

SQL Interface

-- View/change budget
SHOW helios.filter_budget_pct;
SET helios.filter_budget_pct = 0.15;
-- Per-table settings
ALTER TABLE orders SET (
helios.auto_filter = true,
helios.preferred_filter_type = 'cuckoo',
helios.filter_fpr = 0.01
);
-- Monitoring views
SELECT * FROM helios_filters;
SELECT * FROM helios_filter_recommendations;
SELECT * FROM helios_filter_build_queue;
SELECT * FROM helios_filter_budget;

Phase 17: SMFI Integration (IMPLEMENTED)

Located in src/storage/integration/:

SMFI Storage Integration connects the Self-Maintaining Filter Index with the native storage engine for optimized query execution.

Implemented Components

ComponentFileDescription
SmfiStorageIntegrationsmfi_storage.rsCore integration layer
QueryOptimizerquery_optimizer.rsFilter-aware query optimization
FilterCoordinatorfilter_coordinator.rsCoordinates filter operations
FilterCostModelfilter_cost_model.rsCost estimation for optimizer

FilterCostModel

The FilterCostModel provides cost estimation for using filters in query execution plans:

pub struct FilterCostModel {
registry: Option<Arc<FilterRegistry>>,
catalog: RwLock<Option<FilterCatalog>>,
config: FilterCostModelConfig,
}
impl FilterCostModel {
// Estimate cost of using a specific filter
fn estimate_filter_cost(&self, filter: &FilterInfo, predicate: &PredicateInfo)
-> FilterCostEstimate;
// Select best filter for a predicate
fn select_filter(&self, predicate: &PredicateInfo) -> FilterSelection;
}

Cost Model Formula

Without filter (baseline):

baseline_cost = row_count × FULL_SCAN_COST_PER_ROW

With filter:

filter_cost = lookup_cost + match_scan_cost + false_positive_scan_cost
Where:
lookup_cost = row_count × filter_lookup_cost
match_scan_cost = expected_matches × FULL_SCAN_COST_PER_ROW
fp_scan_cost = expected_rejects × FPR × FULL_SCAN_COST_PER_ROW

Net benefit:

net_benefit = baseline_cost - filter_cost

Filter Lookup Costs (Relative)

Filter TypeLookup CostNotes
MPHF0.0005Perfect O(1), fastest
XOR0.00083 memory accesses, no branching
Bloom0.001k hash operations
Cuckoo0.0012 memory accesses
Ribbon0.001Similar to Bloom
Count-Min0.002Multiple hash operations

Predicate Support Matrix

FilterEqualityIN ListRangePrefix
Bloom
Cuckoo
XOR
Ribbon
MPHF
Zone Map
ART Index

HTAP (Hybrid Transactional/Analytical Processing)

Dual Storage Architecture

+----------------------------------+
| WORKLOAD ANALYZER |
| - Query pattern tracking |
| - Access frequency per column |
| - OLTP vs OLAP classification |
+------------------+---------------+
|
+------------------------+------------------------+
| | |
v v v
+---------------------+ +---------------------+ +---------------------+
| ROW STORE | | FORMAT ROUTER | | COLUMN STORE |
| (OLTP Primary) | | (Auto-selects) | | (OLAP Replica) |
+---------------------+ +---------------------+ +---------------------+
| - Point lookups | | Query Type -> Format| | - Aggregations |
| - Range scans (PK) | | SELECT * WHERE id= | | - Full scans |
| - Single-row CRUD | | -> Row Store | | - Column projections|
| - Transaction heavy | | SELECT SUM(x)... | | - Analytics heavy |
| - 8KB slotted pages | | -> Column Store | | - Compressed columns|
+---------------------+ +---------------------+ +---------------------+

Format Router Logic

The router automatically selects the optimal storage format:

Query PatternSelected Format
Point lookup by PKRow Store
Full scan with aggregationColumn Store
SELECT few columns, large scanColumn Store
SELECT * with PK rangeRow Store
Mixed HTAP workloadAuto-detect

Storage Trade-offs

+-----------------------------------------------------------+
| Table: orders (10 GB logical) |
+-----------------------------------------------------------+
| Row Store: 10 GB (primary, always present) |
| Column Store: 6 GB (compressed, ~40% savings) |
| --------------------------- |
| Total: 16 GB (1.6x for HTAP capability) |
| |
| Trade-off: 60% more disk -> 10-100x faster analytics |
+-----------------------------------------------------------+

Segment Size Guidelines

WorkloadSegment TypeSize RangeRationale
OLTP (point lookups)Row8-32 KBFast single-row access
OLTP (range scans)Row32-64 KBModerate prefetch
OLAP (aggregations)Column1-16 MBMaximize sequential I/O
OLAP (filters)Column256KB-1MBGood filter granularity
HTAP (mixed)Hybrid64-256 KBBalance both patterns
Bulk loadStaging4-16 MBMinimize allocation overhead
Archive/coldCompressed4-16 MBMaximum compression

Maintenance Schedule

+---------------------------------------------------------------------+
| MAINTENANCE SCHEDULE |
+---------------------------------------------------------------------+
| |
| CONTINUOUS (background, idle CPU): |
| - Deferred auxiliary updates |
| - Data relocation (hot/cold zones) |
| - Auto-vacuum (when dead_ratio > threshold) |
| |
| DAILY (scheduled, low impact): |
| - Update table/column statistics |
| - Flush all deferred updates |
| - Compact small extents |
| - Column store sync verification |
| |
| WEEKLY (scheduled, medium impact): |
| - Rebuild bloom filters (if stale) |
| - Defragment pages (>30% fragmented) |
| - Rebalance shards across nodes |
| - Analyze workload and adjust store types |
| |
| MONTHLY (scheduled, maintenance window): |
| - Full table reorganization (if needed) |
| - Rebuild indexes |
| - Full vacuum with space reclamation |
| - Workload analysis + auxiliary structure recommendations |
| |
+---------------------------------------------------------------------+

Multi-Node Architecture

Cluster Overview

+-----------------------------------------------------------------------------+
| CLUSTER COORDINATOR |
| - Shard map management - Node membership - Rebalance orchestration |
+----------------------------------+------------------------------------------+
|
+---------------------------+---------------------------+
| | |
v v v
+-----------------+ +-----------------+ +-----------------+
| Storage Node 1 |<--->| Storage Node 2 |<--->| Storage Node 3 |
+-----------------+ +-----------------+ +-----------------+
| Intra-node: | | Intra-node: | | Intra-node: |
| +-----+-----+ | | +-----+-----+ | | +-----+-----+ |
| |Sh 0 |Sh 1 | | | |Sh 0 |Sh 1 | | | |Sh 2 |Sh 3 | |
| |(pri)|(pri)| | | |(rep)|(rep)| | | |(pri)|(pri)| |
| +-----+-----+ | | +-----+-----+ | | +-----+-----+ |
| +-----+-----+ | | +-----+-----+ | | +-----+-----+ |
| |Sh 2 |Sh 3 | | | |Sh 2 |Sh 3 | | | |Sh 0 |Sh 1 | |
| |(rep)|(rep)| | | |(rep)|(rep)| | | |(rep)|(rep)| |
| +-----+-----+ | | +-----+-----+ | | +-----+-----+ |
+-----------------+ +-----------------+ +-----------------+

Elastic Scaling

  • Scale Up: Add nodes, auto-migrate shards
  • Scale Down: Drain nodes, redistribute shards
  • Online Migration: Zero-downtime shard movement
  • Auto-Scaling: CPU/storage threshold triggers

Implementation Phases

PhaseComponentsStatus
1HeliosCore APIImplemented
2Block File ManagerImplemented
2.1O_DIRECT WALImplemented
2.2O_DIRECT Data SegmentsImplemented (bypass page cache for segment I/O)
3Adaptive SegmentsImplemented
4Native VersioningImplemented
5Predicate EngineImplemented
6Parallel ExecutionImplemented (WorkerPool, SIMD, ScatterGather, Pipeline)
7Shard ManagerImplemented (metadata, partition, router)
8Bulk Load StagingImplemented (buffer, compactor, loader)
9HTAP Dual StorageImplemented (analyzer, classifier, resource, columnar store)
10Multi-Store TypesImplemented (Row, Column, Hybrid, Compressed segments)
11Intelligent Aux StructuresImplemented (Filter Advisor System complete)
12Maintenance SchedulerImplemented (checkpoint, compaction, vacuum, stats)
13Access TrackingImplemented (WorkloadAnalyzer pattern detection)
14Elastic ShardingImplemented (router, reshard_manager, shard migration)
15gRPC ProtocolImplemented (proto/storage.proto, grpc module, StorageClient, StorageNodeServer)
16Multi-Node ReplicationImplemented (WAL streaming, logical replication, failover)
17SMFI IntegrationImplemented (FilterCostModel, SmfiStorage, QueryOptimizer)
18Migration ToolsImplemented (exporter, importer, validator, coordinator)

Success Metrics

MetricTarget
Query latency reduction>30% on filtered queries
Filter hit rate>80% average
False positive rate<2% average
Storage overhead<15% of data
Build impact on queries<5% latency increase
ML prediction accuracy>70%
OLTP point lookup latency<1ms p99
OLAP aggregation speedup10-100x vs row scan

Dependencies

  • libc - O_DIRECT system calls
  • tonic + prost - gRPC implementation
  • bitvec - Allocation bitmaps
  • parking_lot - Fast locks
  • dashmap - Concurrent access
  • crc32c - Fast checksums
  • crossbeam - Lock-free queues
  • rayon - Parallel iterators
  • xxhash-rust - Fast hashing
  • lz4 / zstd - Compression
  • tokio - Async runtime