HeliosCore Intelligent Filter Advisor System
HeliosCore Intelligent Filter Advisor System
Overview
The HeliosCore Storage Engine implements a unique, self-optimizing filter management system that:
- Automatically creates optimal secondary filters based on workload patterns
- ML-driven selection learns which filters benefit specific tables/columns
- Online maintenance with lock-free reads during rebuild
- Budget-aware respects user-defined storage limits (%)
- Query Engine integration exposes filter availability to optimizer
This differentiates HeliosDB-Lite from all existing databases.
Architecture
+-------------------------------------------------------------------------+| HeliosQE (Query Engine) || +-------------------------------------------------------------------+ || | Query Optimizer | || | - Receives filter catalog from FilterRegistry | || | - Incorporates filter costs into execution plans | || | - Requests filter creation hints to FilterAdvisor | || +-------------------------------------------------------------------+ |+------------------------------------+------------------------------------+ | v+-------------------------------------------------------------------------+| HeliosCore Storage Engine || || +------------------+ +------------------+ +--------------------+ || | FilterRegistry |<---| FilterAdvisor |<---| WorkloadAnalyzer | || | | | (ML-based) | | (Pattern Detection)| || | - Catalog | | | | | || | - Visibility | | - Scoring | | - Query patterns | || | - Statistics | | - Prioritization | | - Access frequency | || +--------+---------+ | - Eviction | | - Selectivity | || | +--------+---------+ +--------------------+ || | | || v v || +-------------------------------------------------------------------+ || | FilterBuildCoordinator (Background Async) | || | | || | +---------------+ +---------------+ +---------------+ | || | | BuildQueue | | BuildWorkers | | DeltaTracker | | || | | (Priority) | | (Parallel) | | (Lock-free) | | || | +---------------+ +---------------+ +---------------+ | || +-------------------------------------------------------------------+ || || +-------------------------------------------------------------------+ || | Filter Implementations | || | | || | +--------+ +--------+ +--------+ +--------+ +--------+ | || | | Bloom | | Cuckoo | | XOR | | Ribbon | | MPHF | | || | +--------+ +--------+ +--------+ +--------+ +--------+ | || | +--------+ +--------+ +--------+ | || | |Count- | |Quotient| | ART | (existing) | || | |Min | | | | | | || | +--------+ +--------+ +--------+ | || +-------------------------------------------------------------------+ || || +-------------------------------------------------------------------+ || | StorageBudgetManager | || | - Secondary filter storage limit (%) | || | - Online reconfiguration (no restart) | || | - Eviction when over budget | || +-------------------------------------------------------------------+ |+-------------------------------------------------------------------------+Component Specifications
1. FilterRegistry
Purpose: Central catalog of all filters with visibility/usability status.
File: src/storage/filter_registry.rs
pub struct FilterRegistry { /// All registered filters by ID filters: DashMap<FilterId, FilterEntry>,
/// Index: (table, column) -> Vec<FilterId> column_index: DashMap<(TableId, ColumnId), Vec<FilterId>>,
/// Index: table -> Vec<FilterId> table_index: DashMap<TableId, Vec<FilterId>>,
/// Statistics for optimizer statistics: FilterStatistics,}
pub struct FilterEntry { pub id: FilterId, pub filter_type: FilterType, pub target: FilterTarget, pub status: AtomicFilterStatus, pub visibility: AtomicBool, pub created_at: Instant, pub last_used: AtomicInstant, pub use_count: AtomicU64, pub hit_rate: AtomicF64, pub false_positive_rate: AtomicF64, pub size_bytes: AtomicU64, pub build_cost_ms: u64, pub version: AtomicU64,}
pub enum FilterStatus { Pending, // Queued for build Building, // Currently being built Ready, // Available for queries Rebuilding, // Being rebuilt (old version still serving) Stale, // Needs rebuild due to data changes Evicting, // Being removed Disabled, // Manually disabled}
pub enum FilterType { Bloom { fpr: f64 }, Cuckoo { fpr: f64, supports_delete: bool }, Xor { fpr: f64 }, Ribbon { fpr: f64 }, Mphf, CountMinSketch { width: usize, depth: usize }, Quotient { fpr: f64 }, ArtIndex,}
pub enum FilterTarget { Table(TableId), Column(TableId, ColumnId), Segment(SegmentId), ColumnSegment(SegmentId, ColumnId),}Key Methods:
impl FilterRegistry { /// Get all usable filters for a table/column (for optimizer) pub fn get_usable_filters(&self, table: TableId, column: Option<ColumnId>) -> Vec<&FilterEntry>;
/// Mark filter as visible/invisible to Query Engine pub fn set_visibility(&self, id: FilterId, visible: bool);
/// Update statistics after query uses filter pub fn record_usage(&self, id: FilterId, hit: bool, latency_ns: u64);
/// Get total secondary filter storage used pub fn total_secondary_storage(&self) -> u64;
/// Export catalog for optimizer (lightweight snapshot) pub fn export_catalog(&self) -> FilterCatalog;}2. FilterAdvisor (ML-Based)
Purpose: Intelligent decision-making for filter creation, using ML to learn from workload patterns.
File: src/storage/filter_advisor.rs
pub struct FilterAdvisor { /// ML model for filter benefit prediction model: FilterPredictionModel,
/// Historical performance data history: FilterPerformanceHistory,
/// Current recommendations queue recommendations: PriorityQueue<FilterRecommendation>,
/// Configuration config: AdvisorConfig,}
pub struct FilterRecommendation { pub target: FilterTarget, pub recommended_type: FilterType, pub priority_score: f64, pub estimated_benefit: BenefitEstimate, pub estimated_cost: CostEstimate, pub confidence: f64, pub reason: RecommendationReason,}
pub struct BenefitEstimate { pub queries_per_hour_benefited: f64, pub avg_latency_reduction_ms: f64, pub io_reduction_bytes: u64, pub cpu_reduction_pct: f64,}
pub struct CostEstimate { pub storage_bytes: u64, pub build_time_ms: u64, pub maintenance_overhead_pct: f64,}
pub enum RecommendationReason { HighSelectivityColumn, FrequentJoinColumn, HotAccessPattern, DeleteHeavyTable, LargeStaticSegment, StringDictionaryCandidate, FrequencyAnalyticsColumn, ColdDataOptimization,}ML Model Features:
pub struct FilterFeatureVector { // Column characteristics pub cardinality: u64, pub null_ratio: f64, pub avg_value_size: u64, pub data_type: DataTypeCategory,
// Access patterns pub read_frequency: f64, pub write_frequency: f64, pub delete_frequency: f64, pub scan_vs_point_ratio: f64,
// Query patterns pub equality_predicate_count: u64, pub range_predicate_count: u64, pub join_count: u64, pub in_list_count: u64,
// Current filter performance (if exists) pub current_filter_type: Option<FilterType>, pub current_hit_rate: Option<f64>, pub current_fpr: Option<f64>,
// Table characteristics pub table_row_count: u64, pub table_size_bytes: u64, pub segment_count: u64, pub data_temperature: DataTemperature,}
pub enum DataTemperature { Hot, // >100 accesses/min Warm, // 1-100 accesses/min Cold, // <1 access/min}3. FilterBuildCoordinator
Purpose: Background async filter creation with lock-free access during builds.
File: src/storage/filter_build_coordinator.rs
pub struct FilterBuildCoordinator { /// Priority queue of pending builds build_queue: PriorityQueue<BuildTask>,
/// Active build workers workers: Vec<JoinHandle<()>>,
/// Delta tracker for lock-free rebuilds delta_tracker: DeltaTracker,
/// Budget manager reference budget_manager: Arc<StorageBudgetManager>,
/// Registry reference registry: Arc<FilterRegistry>,
/// Configuration config: BuildConfig,}
pub struct BuildConfig { /// Number of parallel build workers pub worker_count: usize, // Default: num_cpus / 2
/// Max concurrent builds pub max_concurrent_builds: usize, // Default: 4
/// Build batch size (rows per batch) pub batch_size: usize, // Default: 10_000
/// CPU limit for builds (0.0-1.0) pub cpu_budget: f64, // Default: 0.3 (30%)
/// Pause builds during high query load pub adaptive_throttling: bool, // Default: true}Online Rebuild Process:
impl FilterBuildCoordinator { pub async fn build_filter_online(&self, task: BuildTask) -> Result<FilterId> { // 1. Mark filter as "Rebuilding" (old version still serving) self.registry.set_status(filter_id, FilterStatus::Rebuilding);
// 2. Start delta tracking for writes during build let delta_handle = self.delta_tracker.start_tracking(task.target.clone());
// 3. Build new filter in background let new_filter = self.build_filter_async(&task).await?;
// 4. Pause briefly to capture final deltas let final_delta = delta_handle.finalize();
// 5. Apply deltas to new filter self.apply_deltas(&new_filter, final_delta)?;
// 6. Atomic swap self.registry.atomic_swap(filter_id, new_filter);
// 7. Mark as Ready self.registry.set_status(filter_id, FilterStatus::Ready); self.registry.set_visibility(filter_id, true);
Ok(filter_id) }}4. StorageBudgetManager
Purpose: Manage secondary filter storage limits with online reconfiguration.
File: src/storage/storage_budget_manager.rs
pub struct StorageBudgetManager { /// Current budget percentage (0.0-1.0) budget_pct: AtomicF64,
/// Absolute budget in bytes budget_bytes: AtomicU64,
/// Current usage current_usage: AtomicU64,
/// Total storage available total_storage: AtomicU64,
/// Filter registry reference registry: Arc<FilterRegistry>,
/// Configuration config: BudgetConfig,}
pub struct BudgetConfig { /// Default budget percentage pub default_pct: f64, // Default: 0.10 (10%)
/// Minimum budget pub min_bytes: u64, // Default: 100MB
/// Maximum budget pub max_bytes: Option<u64>, // Default: None
/// Recalculation interval pub recalc_interval: Duration, // Default: 1 minute
/// Eviction strategy pub eviction_strategy: EvictionStrategy,}
pub enum EvictionStrategy { LeastRecentlyUsed, LowestBenefit, LargestFirst, OldestFirst, Hybrid, // Default - recommended}Online Reconfiguration:
impl StorageBudgetManager { /// Change budget percentage without restart pub fn set_budget_pct(&self, new_pct: f64) -> Result<()> { if new_pct < 0.0 || new_pct > 0.5 { return Err(Error::InvalidBudget("Must be 0-50%")); }
self.budget_pct.store(new_pct, Ordering::Release); self.recalculate_budget();
if self.is_over_budget() { self.trigger_eviction(); }
Ok(()) }}Eviction Rules
Eviction Score Components
| Component | Weight | Description |
|---|---|---|
| Optimizer Benefit | 40% | How often optimizer selects this filter |
| Query Impact | 30% | Actual latency/IO improvement |
| Recency | 15% | How recently used |
| Efficiency | 10% | Benefit per byte |
| Rebuild Cost | 5% | Cost to recreate |
Eviction Priority Order (First to Last)
| Priority | Category | Condition | Score Range |
|---|---|---|---|
| 1 (FIRST) | Never Used | use_count=0, optimizer never selected | 0.00-0.05 |
| 2 | Stale | FPR >10%, status=Stale | 0.05-0.15 |
| 3 | Redundant | Another filter dominates | 0.10-0.25 |
| 4 | Low Impact | hit_rate <50%, latency <5% | 0.15-0.35 |
| 5 | Infrequent | Not used in 24+ hours | 0.25-0.50 |
| 6 | Inefficient | >100MB with <10% improvement | 0.30-0.60 |
| 7 | Budget Pressure | Normal filters, over budget | 0.50-0.80 |
| 8 (LAST) | High Value | High hit rate, recent, efficient | 0.80-1.00 |
| NEVER | Protected | Primary, user-protected, active | N/A |
O_DIRECT Storage Support
Filter persistence uses O_DIRECT for consistency with HeliosCore:
pub struct DirectIOConfig { pub alignment: usize, // Default: 4096 (4KB) pub min_io_size: usize, // Default: 4096 pub max_io_size: usize, // Default: 1MB pub use_io_uring: bool, // Linux 5.1+ async I/O pub fallback_buffered: bool, // Graceful degradation}Benefits:
- Predictable I/O latency
- Better memory utilization
- No double-buffering
SQL Interface
Configuration
-- View/change budgetSHOW helios.filter_budget_pct;SET helios.filter_budget_pct = 0.15;
-- Per-table settingsALTER TABLE orders SET ( helios.auto_filter = true, helios.preferred_filter_type = 'cuckoo', helios.filter_fpr = 0.01);Monitoring Views
-- All filtersSELECT * FROM helios_filters;
-- RecommendationsSELECT * FROM helios_filter_recommendations;
-- Build queueSELECT * FROM helios_filter_build_queue;
-- Budget statusSELECT * FROM helios_filter_budget;Implementation Phases
| Phase | Components | Priority |
|---|---|---|
| 1 | FilterRegistry, StorageBudgetManager, CuckooFilter, XorFilter | High |
| 2 | FilterBuildCoordinator, DeltaTracker | High |
| 3 | WorkloadAnalyzer, FilterAdvisor, FilterDecisionEngine | Medium |
| 4 | RibbonFilter, MPHF, CountMinSketch | Medium |
| 5 | Query Engine integration (FilterCostModel) | Medium |
| 6 | ML model training pipeline | Low |
Files to Implement
| File | Description | Lines Est. |
|---|---|---|
src/storage/filter_registry.rs | Central catalog | ~500 |
src/storage/storage_budget_manager.rs | Budget management | ~400 |
src/storage/cuckoo_filter.rs | Cuckoo filter | ~800 |
src/storage/xor_filter.rs | XOR filter | ~400 |
src/storage/filter_build_coordinator.rs | Background builds | ~600 |
src/storage/delta_tracker.rs | Lock-free tracking | ~300 |
src/storage/workload_analyzer.rs | Pattern analysis | ~400 |
src/storage/filter_advisor.rs | ML advisor | ~500 |
src/storage/ribbon_filter.rs | Ribbon filter | ~1200 |
src/storage/mphf.rs | MPHF | ~600 |
src/storage/count_min_sketch.rs | CMS | ~300 |
src/qe/optimizer/filter_cost_model.rs | Optimizer integration | ~400 |
Success Metrics
| Metric | Target |
|---|---|
| Query latency reduction | >30% on filtered queries |
| Filter hit rate | >80% average |
| False positive rate | <2% average |
| Storage overhead | <15% of data |
| Build impact on queries | <5% latency increase |
| ML prediction accuracy | >70% |