HeliosCore - Native Storage Engine
HeliosCore - Native Storage Engine
Overview
HeliosCore is HeliosDB-Lite’s custom storage engine that decouples storage from compute, designed to eventually replace RocksDB. All storage primitives (versioning, branching, time-travel, MVCC) are implemented natively in Rust.
Key Principles:
- File-based block storage (disk sections) instead of raw devices
- Adaptive segments (8KB-16MB) instead of fixed pages
- Native MVCC with branching and time-travel built into the storage format
- Multi-node distributed architecture from day one
- Elastic scaling (1-N storage nodes)
- Workload-aware automatic optimization
Architecture
+-----------------------------------------------------------------------------+| COMPUTE TIER (Query Processing) || +------------+ +----------+ +-----------+ +-----------------+ || | SQL Parser | | Planner | | Optimizer | | ParallelScanner | || +-----+------+ +----+-----+ +-----+-----+ +-------+---------+ || +---------------+-------------+----------------+ || | || +---------+---------+ || | StorageClient | <-- Shard-aware routing || | (gRPC client) | || +---------+---------+ |+--------------------------|--------------------------------------------------+ | gRPC (predicates, projections, parallel ops)+--------------------------|--------------------------------------------------+| STORAGE TIER (Data Management) || +---------+---------+ || | StorageNodeServer | || +---------+---------+ || | || +---------------------------+-------------------------------------------+ || | PARALLEL EXECUTION ENGINE | || | +-----------+ +------------+ +-------------------+ | || | | WorkerPool| | SIMD Engine| | Cross-Shard Coord | | || | | (per-core)| | (vectorized)| | (scatter-gather) | | || | +-----------+ +------------+ +-------------------+ | || +--------------------------------------------------------------------+ || | || +-----------------+ +------+----------+ +----------------+ || | PredicateEngine | | SegmentManager | | WAL Manager | || | (Bloom/ZoneMap/ | | (8KB-16MB, | | (Durability) | || | SIMD filtering)| | adaptive) | | | || +--------+--------+ +-------+--------+ +-------+--------+ || | | | || +--------+--------------+----+--------------------+---------+ || | SMFI Pipeline | || | (Background: index maintenance, filter updates) | || +----------+------------------------------------------------+ || | || +----------+--------------------------------------------------------+ || | INTELLIGENT FILTER ADVISOR | || | +----------------+ +----------------+ +--------------------+ | || | | FilterRegistry | | FilterAdvisor | | WorkloadAnalyzer | | || | | (Catalog) | | (ML-based) | | (Pattern Detection)| | || | +----------------+ +----------------+ +--------------------+ | || | +----------------+ +----------------+ +--------------------+ | || | | StorageBudget | | BuildCoord | | Filter Impls | | || | | Manager | | (Background) | | Bloom/Cuckoo/XOR | | || | +----------------+ +----------------+ +--------------------+ | || +--------------------------------------------------------------------+ || | || +---------------------------+-------------------------------------------+ || | BlockFileManager | || | (Dynamic file add/merge/relocate, extent allocation) | || +--------------------------------------------------------------------+ |+-----------------------------------------------------------------------------+Key Design Decisions
| Aspect | Decision | Rationale |
|---|---|---|
| Segment Size | Adaptive 8KB-16MB | OLTP: small (8-64KB), OLAP: large (1-16MB), auto-selected |
| Segment Types | Row/Column/Hybrid/Compressed | Different formats for different workloads |
| Allocation | Extent-based (8-262K blocks) | Fast allocation, buddy coalescing |
| Protocol | gRPC + protobuf | Efficient binary, streaming, parallel scatter-gather |
| File Management | Dynamic add/merge/relocate | Online capacity expansion |
| Replication | Configurable quorum | Sync/Async/Quorum modes per workload |
| Sharding | Hash + Range, intra + cross-node | Parallelism at every level |
| Bulk Load | Staging area with eventual consistency | Ultra-fast ingestion, query-accessible |
| Parallelism | Per-core workers + SIMD | Maximize hardware utilization |
| Timestamps | HLC (Hybrid Logical Clock) | Causal ordering across distributed nodes |
| Lock-free | Thread-local partitions, DashMap, atomics | Zero contention on hot paths |
| WAL | Distributed partitioned WAL | Linear I/O scaling, parallel recovery |
| Aux Structures | Workload-aware intelligent creation | Create indexes only when beneficial |
| Maintenance | Deferred updates, daily/weekly reorg | Minimize write amplification |
Implementation Status
Phase 1: HeliosCore API (IMPLEMENTED)
The HTAP-aware Query Engine to Storage collaboration API is fully implemented in src/helioscore/api/.
Implemented Components:
access.rs- HTAP-aware data access methodshints.rs- Query hints, SLA, priority, workload classificationstats.rs- Statistics provider for optimizer with adaptive TTL cachingthrottle.rs- Throttling, backpressure, admission controlpushdown.rs- Predicate/projection/aggregation pushdown evaluationsuggest.rs- Storage suggestions to query engine
Key Traits:
pub trait HeliosCoreAccess: Send + Sync { // Data access methods (workload-aware, storage-type flexible) fn get(&self, req: PointGetRequest) -> Result<Option<Row>>; fn multi_get(&self, req: MultiGetRequest) -> Result<Vec<Option<Row>>>; fn scan(&self, req: ScanRequest) -> Result<ScanIterator>; fn aggregate(&self, req: AggregateRequest) -> Result<AggregateResult>;
// Optimizer collaboration (bidirectional) fn get_statistics(&self, req: StatisticsRequest) -> Result<TableStatistics>; fn estimate_cost(&self, plan: &AccessPlan) -> Result<CostEstimate>; fn suggest_access(&self, query: &QueryProfile) -> Result<AccessSuggestion>; fn evaluate_pushdown(&self, ops: &PushdownOps) -> Result<PushdownEvaluation>;}Phase 2: Block File Manager (IMPLEMENTED)
Located in src/storage/block/:
- O_DIRECT file handles
- Pre-allocated block files
- Extent-based allocation with buddy system
Phase 2.1: O_DIRECT WAL Manager (IMPLEMENTED)
Located in src/storage/wal_direct/:
aligned_buffer.rs- 4KB-aligned buffers for O_DIRECT I/Odirect_wal.rs- DirectWalManager with staging buffergroup_commit.rs- 10ms batching window for efficient fsyncflusher.rs- Async background writer threadplatform.rs- Cross-platform O_DIRECT/F_NOCACHE support
Performance Benefits:
| Metric | RocksDB | O_DIRECT WAL |
|---|---|---|
| Write Latency | ~1ms | 5-7us |
| Throughput | ~1K writes/sec | 150-200K writes/sec |
| Memory Overhead | 100% (page cache) | 20% (staging only) |
Phase 3: Adaptive Segment Manager (IMPLEMENTED)
Located in src/storage/segment/:
- Variable-size segments (8KB-16MB)
- Row/Column/Hybrid/Compressed segment types
- 128-byte self-describing headers
- Inline bloom filters and zone maps
Phase 4: Native Versioning (IMPLEMENTED)
Located in src/storage/versioning/:
- Native MVCC with inline version chains
- Branch management at storage level
- Time-travel query resolution
- Version garbage collection
Phase 5: Storage-Tier Predicate Engine (IMPLEMENTED)
Located in src/storage/predicate/:
- Zone map evaluation
- Bloom filter integration
- SIMD vectorized filtering
Phase 11: Intelligent Auxiliary Structures (IMPLEMENTED)
Intelligent Filter Advisor System
What Makes HeliosDB-Lite Unique: A self-optimizing filter management system that differentiates from all existing databases.
Features
- Automatic Filter Creation - Creates optimal secondary filters based on workload patterns
- ML-Driven Selection - Learns which filters benefit specific tables/columns
- Online Maintenance - Lock-free reads during rebuild
- Budget-Aware - Respects user-defined storage limits
- Query Engine Integration - Exposes filter availability to optimizer
Architecture
+-----------------------------------------------------------------------+| HeliosQE (Query Engine) || +---------------------------------------------------------------+ || | Query Optimizer | || | - Receives filter catalog from FilterRegistry | || | - Incorporates filter costs into execution plans | || | - Requests filter creation hints to FilterAdvisor | || +---------------------------------------------------------------+ |+----------------------------------+------------------------------------+ | v+-----------------------------------------------------------------------+| HeliosCore Storage Engine || || +------------------+ +------------------+ +------------------+ || | FilterRegistry |<---| FilterAdvisor |<---| WorkloadAnalyzer | || | | | (ML-based) | | (Pattern Detect) | || | - Catalog | | - Scoring | | - Query patterns | || | - Visibility | | - Prioritization | | - Access freq | || | - Statistics | | - Eviction | | - Selectivity | || +--------+---------+ +--------+---------+ +------------------+ || | | || v v || +---------------------------------------------------------------+ || | FilterBuildCoordinator (Background Async) | || | | || | +---------------+ +---------------+ +---------------+ | || | | BuildQueue | | BuildWorkers | | DeltaTracker | | || | | (Priority) | | (Parallel) | | (Lock-free) | | || | +---------------+ +---------------+ +---------------+ | || +---------------------------------------------------------------+ || || +---------------------------------------------------------------+ || | Filter Implementations | || | | || | +--------+ +--------+ +--------+ +--------+ +--------+ | || | | Bloom | | Cuckoo | | XOR | | Ribbon | | MPHF | | || | +--------+ +--------+ +--------+ +--------+ +--------+ | || | +--------+ +--------+ +--------+ | || | |Count- | |Quotient| | ART | | || | |Min | | | | | | || | +--------+ +--------+ +--------+ | || +---------------------------------------------------------------+ || || +---------------------------------------------------------------+ || | StorageBudgetManager | || | - Secondary filter storage limit (%) | || | - Online reconfiguration (no restart) | || | - Eviction when over budget | || +---------------------------------------------------------------+ |+-----------------------------------------------------------------------+Implemented Components
| Component | File | Status |
|---|---|---|
| FilterRegistry | src/storage/filter_registry.rs | Implemented |
| StorageBudgetManager | src/storage/storage_budget_manager.rs | Implemented |
| CuckooFilter | src/storage/cuckoo_filter.rs | Implemented |
| XorFilter | src/storage/xor_filter.rs | Implemented |
| BloomFilter | src/storage/bloom_filter.rs | Implemented |
| ZoneMap | src/storage/zone_map.rs | Implemented |
| FilterBuildCoordinator | src/storage/filter_build_coordinator.rs | Implemented |
| FilterAdvisor (ML) | src/storage/filter_advisor.rs | Implemented |
| WorkloadAnalyzer | src/storage/workload_analyzer.rs | Implemented |
| RibbonFilter | src/storage/ribbon_filter.rs | Implemented |
| MPHF | src/storage/mphf.rs | Implemented |
| CountMinSketch | src/storage/count_min_sketch.rs | Implemented |
| FilterCostModel | src/storage/integration/filter_cost_model.rs | Implemented |
Filter Types
| Filter | Use Case | FPR | Deletion | Space |
|---|---|---|---|---|
| Bloom | General equality | 0.1-10% | No | ~10 bits/item |
| Cuckoo | Equality with deletes | 0.1-3% | Yes | ~12 bits/item |
| XOR | Static datasets | 0.1-1% | No | ~9 bits/item |
| Ribbon | Very low FPR | <0.1% | No | ~8 bits/item |
| MPHF | Dictionary encoding | 0% | No | ~3 bits/item |
| CountMinSketch | Frequency estimation | N/A | N/A | Fixed width |
| Zone Map | Range predicates | N/A | N/A | Min/Max only |
Eviction Rules
Eviction Score Components:
| Component | Weight | Description |
|---|---|---|
| Optimizer Benefit | 40% | How often optimizer selects this filter |
| Query Impact | 30% | Actual latency/IO improvement |
| Recency | 15% | How recently used |
| Efficiency | 10% | Benefit per byte |
| Rebuild Cost | 5% | Cost to recreate |
Eviction Priority (First to Last):
| Priority | Category | Condition | Score Range |
|---|---|---|---|
| 1 (FIRST) | Never Used | use_count=0, optimizer never selected | 0.00-0.05 |
| 2 | Stale | FPR >10%, status=Stale | 0.05-0.15 |
| 3 | Redundant | Another filter dominates | 0.10-0.25 |
| 4 | Low Impact | hit_rate <50%, latency <5% | 0.15-0.35 |
| 5 | Infrequent | Not used in 24+ hours | 0.25-0.50 |
| 6 | Inefficient | >100MB with <10% improvement | 0.30-0.60 |
| 7 | Budget Pressure | Normal filters, over budget | 0.50-0.80 |
| 8 (LAST) | High Value | High hit rate, recent, efficient | 0.80-1.00 |
| NEVER | Protected | Primary, user-protected, active | N/A |
SQL Interface
-- View/change budgetSHOW helios.filter_budget_pct;SET helios.filter_budget_pct = 0.15;
-- Per-table settingsALTER TABLE orders SET ( helios.auto_filter = true, helios.preferred_filter_type = 'cuckoo', helios.filter_fpr = 0.01);
-- Monitoring viewsSELECT * FROM helios_filters;SELECT * FROM helios_filter_recommendations;SELECT * FROM helios_filter_build_queue;SELECT * FROM helios_filter_budget;Phase 17: SMFI Integration (IMPLEMENTED)
Located in src/storage/integration/:
SMFI Storage Integration connects the Self-Maintaining Filter Index with the native storage engine for optimized query execution.
Implemented Components
| Component | File | Description |
|---|---|---|
| SmfiStorageIntegration | smfi_storage.rs | Core integration layer |
| QueryOptimizer | query_optimizer.rs | Filter-aware query optimization |
| FilterCoordinator | filter_coordinator.rs | Coordinates filter operations |
| FilterCostModel | filter_cost_model.rs | Cost estimation for optimizer |
FilterCostModel
The FilterCostModel provides cost estimation for using filters in query execution plans:
pub struct FilterCostModel { registry: Option<Arc<FilterRegistry>>, catalog: RwLock<Option<FilterCatalog>>, config: FilterCostModelConfig,}
impl FilterCostModel { // Estimate cost of using a specific filter fn estimate_filter_cost(&self, filter: &FilterInfo, predicate: &PredicateInfo) -> FilterCostEstimate;
// Select best filter for a predicate fn select_filter(&self, predicate: &PredicateInfo) -> FilterSelection;}Cost Model Formula
Without filter (baseline):
baseline_cost = row_count × FULL_SCAN_COST_PER_ROWWith filter:
filter_cost = lookup_cost + match_scan_cost + false_positive_scan_cost
Where: lookup_cost = row_count × filter_lookup_cost match_scan_cost = expected_matches × FULL_SCAN_COST_PER_ROW fp_scan_cost = expected_rejects × FPR × FULL_SCAN_COST_PER_ROWNet benefit:
net_benefit = baseline_cost - filter_costFilter Lookup Costs (Relative)
| Filter Type | Lookup Cost | Notes |
|---|---|---|
| MPHF | 0.0005 | Perfect O(1), fastest |
| XOR | 0.0008 | 3 memory accesses, no branching |
| Bloom | 0.001 | k hash operations |
| Cuckoo | 0.001 | 2 memory accesses |
| Ribbon | 0.001 | Similar to Bloom |
| Count-Min | 0.002 | Multiple hash operations |
Predicate Support Matrix
| Filter | Equality | IN List | Range | Prefix |
|---|---|---|---|---|
| Bloom | ✓ | ✓ | ✗ | ✗ |
| Cuckoo | ✓ | ✓ | ✗ | ✗ |
| XOR | ✓ | ✓ | ✗ | ✗ |
| Ribbon | ✓ | ✓ | ✗ | ✗ |
| MPHF | ✓ | ✗ | ✗ | ✗ |
| Zone Map | ✓ | ✗ | ✓ | ✗ |
| ART Index | ✓ | ✗ | ✓ | ✓ |
HTAP (Hybrid Transactional/Analytical Processing)
Dual Storage Architecture
+----------------------------------+ | WORKLOAD ANALYZER | | - Query pattern tracking | | - Access frequency per column | | - OLTP vs OLAP classification | +------------------+---------------+ | +------------------------+------------------------+ | | | v v v+---------------------+ +---------------------+ +---------------------+| ROW STORE | | FORMAT ROUTER | | COLUMN STORE || (OLTP Primary) | | (Auto-selects) | | (OLAP Replica) |+---------------------+ +---------------------+ +---------------------+| - Point lookups | | Query Type -> Format| | - Aggregations || - Range scans (PK) | | SELECT * WHERE id= | | - Full scans || - Single-row CRUD | | -> Row Store | | - Column projections|| - Transaction heavy | | SELECT SUM(x)... | | - Analytics heavy || - 8KB slotted pages | | -> Column Store | | - Compressed columns|+---------------------+ +---------------------+ +---------------------+Format Router Logic
The router automatically selects the optimal storage format:
| Query Pattern | Selected Format |
|---|---|
| Point lookup by PK | Row Store |
| Full scan with aggregation | Column Store |
| SELECT few columns, large scan | Column Store |
| SELECT * with PK range | Row Store |
| Mixed HTAP workload | Auto-detect |
Storage Trade-offs
+-----------------------------------------------------------+| Table: orders (10 GB logical) |+-----------------------------------------------------------+| Row Store: 10 GB (primary, always present) || Column Store: 6 GB (compressed, ~40% savings) || --------------------------- || Total: 16 GB (1.6x for HTAP capability) || || Trade-off: 60% more disk -> 10-100x faster analytics |+-----------------------------------------------------------+Segment Size Guidelines
| Workload | Segment Type | Size Range | Rationale |
|---|---|---|---|
| OLTP (point lookups) | Row | 8-32 KB | Fast single-row access |
| OLTP (range scans) | Row | 32-64 KB | Moderate prefetch |
| OLAP (aggregations) | Column | 1-16 MB | Maximize sequential I/O |
| OLAP (filters) | Column | 256KB-1MB | Good filter granularity |
| HTAP (mixed) | Hybrid | 64-256 KB | Balance both patterns |
| Bulk load | Staging | 4-16 MB | Minimize allocation overhead |
| Archive/cold | Compressed | 4-16 MB | Maximum compression |
Maintenance Schedule
+---------------------------------------------------------------------+| MAINTENANCE SCHEDULE |+---------------------------------------------------------------------+| || CONTINUOUS (background, idle CPU): || - Deferred auxiliary updates || - Data relocation (hot/cold zones) || - Auto-vacuum (when dead_ratio > threshold) || || DAILY (scheduled, low impact): || - Update table/column statistics || - Flush all deferred updates || - Compact small extents || - Column store sync verification || || WEEKLY (scheduled, medium impact): || - Rebuild bloom filters (if stale) || - Defragment pages (>30% fragmented) || - Rebalance shards across nodes || - Analyze workload and adjust store types || || MONTHLY (scheduled, maintenance window): || - Full table reorganization (if needed) || - Rebuild indexes || - Full vacuum with space reclamation || - Workload analysis + auxiliary structure recommendations || |+---------------------------------------------------------------------+Multi-Node Architecture
Cluster Overview
+-----------------------------------------------------------------------------+| CLUSTER COORDINATOR || - Shard map management - Node membership - Rebalance orchestration |+----------------------------------+------------------------------------------+ | +---------------------------+---------------------------+ | | | v v v+-----------------+ +-----------------+ +-----------------+| Storage Node 1 |<--->| Storage Node 2 |<--->| Storage Node 3 |+-----------------+ +-----------------+ +-----------------+| Intra-node: | | Intra-node: | | Intra-node: || +-----+-----+ | | +-----+-----+ | | +-----+-----+ || |Sh 0 |Sh 1 | | | |Sh 0 |Sh 1 | | | |Sh 2 |Sh 3 | || |(pri)|(pri)| | | |(rep)|(rep)| | | |(pri)|(pri)| || +-----+-----+ | | +-----+-----+ | | +-----+-----+ || +-----+-----+ | | +-----+-----+ | | +-----+-----+ || |Sh 2 |Sh 3 | | | |Sh 2 |Sh 3 | | | |Sh 0 |Sh 1 | || |(rep)|(rep)| | | |(rep)|(rep)| | | |(rep)|(rep)| || +-----+-----+ | | +-----+-----+ | | +-----+-----+ |+-----------------+ +-----------------+ +-----------------+Elastic Scaling
- Scale Up: Add nodes, auto-migrate shards
- Scale Down: Drain nodes, redistribute shards
- Online Migration: Zero-downtime shard movement
- Auto-Scaling: CPU/storage threshold triggers
Implementation Phases
| Phase | Components | Status |
|---|---|---|
| 1 | HeliosCore API | Implemented |
| 2 | Block File Manager | Implemented |
| 2.1 | O_DIRECT WAL | Implemented |
| 2.2 | O_DIRECT Data Segments | Implemented (bypass page cache for segment I/O) |
| 3 | Adaptive Segments | Implemented |
| 4 | Native Versioning | Implemented |
| 5 | Predicate Engine | Implemented |
| 6 | Parallel Execution | Implemented (WorkerPool, SIMD, ScatterGather, Pipeline) |
| 7 | Shard Manager | Implemented (metadata, partition, router) |
| 8 | Bulk Load Staging | Implemented (buffer, compactor, loader) |
| 9 | HTAP Dual Storage | Implemented (analyzer, classifier, resource, columnar store) |
| 10 | Multi-Store Types | Implemented (Row, Column, Hybrid, Compressed segments) |
| 11 | Intelligent Aux Structures | Implemented (Filter Advisor System complete) |
| 12 | Maintenance Scheduler | Implemented (checkpoint, compaction, vacuum, stats) |
| 13 | Access Tracking | Implemented (WorkloadAnalyzer pattern detection) |
| 14 | Elastic Sharding | Implemented (router, reshard_manager, shard migration) |
| 15 | gRPC Protocol | Implemented (proto/storage.proto, grpc module, StorageClient, StorageNodeServer) |
| 16 | Multi-Node Replication | Implemented (WAL streaming, logical replication, failover) |
| 17 | SMFI Integration | Implemented (FilterCostModel, SmfiStorage, QueryOptimizer) |
| 18 | Migration Tools | Implemented (exporter, importer, validator, coordinator) |
Success Metrics
| Metric | Target |
|---|---|
| Query latency reduction | >30% on filtered queries |
| Filter hit rate | >80% average |
| False positive rate | <2% average |
| Storage overhead | <15% of data |
| Build impact on queries | <5% latency increase |
| ML prediction accuracy | >70% |
| OLTP point lookup latency | <1ms p99 |
| OLAP aggregation speedup | 10-100x vs row scan |
Dependencies
libc- O_DIRECT system callstonic+prost- gRPC implementationbitvec- Allocation bitmapsparking_lot- Fast locksdashmap- Concurrent accesscrc32c- Fast checksumscrossbeam- Lock-free queuesrayon- Parallel iteratorsxxhash-rust- Fast hashinglz4/zstd- Compressiontokio- Async runtime