Skip to content

HeliosCore Intelligent Filter Advisor System

HeliosCore Intelligent Filter Advisor System

Overview

The HeliosCore Storage Engine implements a unique, self-optimizing filter management system that:

  1. Automatically creates optimal secondary filters based on workload patterns
  2. ML-driven selection learns which filters benefit specific tables/columns
  3. Online maintenance with lock-free reads during rebuild
  4. Budget-aware respects user-defined storage limits (%)
  5. Query Engine integration exposes filter availability to optimizer

This differentiates HeliosDB-Lite from all existing databases.


Architecture

+-------------------------------------------------------------------------+
| HeliosQE (Query Engine) |
| +-------------------------------------------------------------------+ |
| | Query Optimizer | |
| | - Receives filter catalog from FilterRegistry | |
| | - Incorporates filter costs into execution plans | |
| | - Requests filter creation hints to FilterAdvisor | |
| +-------------------------------------------------------------------+ |
+------------------------------------+------------------------------------+
|
v
+-------------------------------------------------------------------------+
| HeliosCore Storage Engine |
| |
| +------------------+ +------------------+ +--------------------+ |
| | FilterRegistry |<---| FilterAdvisor |<---| WorkloadAnalyzer | |
| | | | (ML-based) | | (Pattern Detection)| |
| | - Catalog | | | | | |
| | - Visibility | | - Scoring | | - Query patterns | |
| | - Statistics | | - Prioritization | | - Access frequency | |
| +--------+---------+ | - Eviction | | - Selectivity | |
| | +--------+---------+ +--------------------+ |
| | | |
| v v |
| +-------------------------------------------------------------------+ |
| | FilterBuildCoordinator (Background Async) | |
| | | |
| | +---------------+ +---------------+ +---------------+ | |
| | | BuildQueue | | BuildWorkers | | DeltaTracker | | |
| | | (Priority) | | (Parallel) | | (Lock-free) | | |
| | +---------------+ +---------------+ +---------------+ | |
| +-------------------------------------------------------------------+ |
| |
| +-------------------------------------------------------------------+ |
| | Filter Implementations | |
| | | |
| | +--------+ +--------+ +--------+ +--------+ +--------+ | |
| | | Bloom | | Cuckoo | | XOR | | Ribbon | | MPHF | | |
| | +--------+ +--------+ +--------+ +--------+ +--------+ | |
| | +--------+ +--------+ +--------+ | |
| | |Count- | |Quotient| | ART | (existing) | |
| | |Min | | | | | | |
| | +--------+ +--------+ +--------+ | |
| +-------------------------------------------------------------------+ |
| |
| +-------------------------------------------------------------------+ |
| | StorageBudgetManager | |
| | - Secondary filter storage limit (%) | |
| | - Online reconfiguration (no restart) | |
| | - Eviction when over budget | |
| +-------------------------------------------------------------------+ |
+-------------------------------------------------------------------------+

Component Specifications

1. FilterRegistry

Purpose: Central catalog of all filters with visibility/usability status.

File: src/storage/filter_registry.rs

pub struct FilterRegistry {
/// All registered filters by ID
filters: DashMap<FilterId, FilterEntry>,
/// Index: (table, column) -> Vec<FilterId>
column_index: DashMap<(TableId, ColumnId), Vec<FilterId>>,
/// Index: table -> Vec<FilterId>
table_index: DashMap<TableId, Vec<FilterId>>,
/// Statistics for optimizer
statistics: FilterStatistics,
}
pub struct FilterEntry {
pub id: FilterId,
pub filter_type: FilterType,
pub target: FilterTarget,
pub status: AtomicFilterStatus,
pub visibility: AtomicBool,
pub created_at: Instant,
pub last_used: AtomicInstant,
pub use_count: AtomicU64,
pub hit_rate: AtomicF64,
pub false_positive_rate: AtomicF64,
pub size_bytes: AtomicU64,
pub build_cost_ms: u64,
pub version: AtomicU64,
}
pub enum FilterStatus {
Pending, // Queued for build
Building, // Currently being built
Ready, // Available for queries
Rebuilding, // Being rebuilt (old version still serving)
Stale, // Needs rebuild due to data changes
Evicting, // Being removed
Disabled, // Manually disabled
}
pub enum FilterType {
Bloom { fpr: f64 },
Cuckoo { fpr: f64, supports_delete: bool },
Xor { fpr: f64 },
Ribbon { fpr: f64 },
Mphf,
CountMinSketch { width: usize, depth: usize },
Quotient { fpr: f64 },
ArtIndex,
}
pub enum FilterTarget {
Table(TableId),
Column(TableId, ColumnId),
Segment(SegmentId),
ColumnSegment(SegmentId, ColumnId),
}

Key Methods:

impl FilterRegistry {
/// Get all usable filters for a table/column (for optimizer)
pub fn get_usable_filters(&self, table: TableId, column: Option<ColumnId>)
-> Vec<&FilterEntry>;
/// Mark filter as visible/invisible to Query Engine
pub fn set_visibility(&self, id: FilterId, visible: bool);
/// Update statistics after query uses filter
pub fn record_usage(&self, id: FilterId, hit: bool, latency_ns: u64);
/// Get total secondary filter storage used
pub fn total_secondary_storage(&self) -> u64;
/// Export catalog for optimizer (lightweight snapshot)
pub fn export_catalog(&self) -> FilterCatalog;
}

2. FilterAdvisor (ML-Based)

Purpose: Intelligent decision-making for filter creation, using ML to learn from workload patterns.

File: src/storage/filter_advisor.rs

pub struct FilterAdvisor {
/// ML model for filter benefit prediction
model: FilterPredictionModel,
/// Historical performance data
history: FilterPerformanceHistory,
/// Current recommendations queue
recommendations: PriorityQueue<FilterRecommendation>,
/// Configuration
config: AdvisorConfig,
}
pub struct FilterRecommendation {
pub target: FilterTarget,
pub recommended_type: FilterType,
pub priority_score: f64,
pub estimated_benefit: BenefitEstimate,
pub estimated_cost: CostEstimate,
pub confidence: f64,
pub reason: RecommendationReason,
}
pub struct BenefitEstimate {
pub queries_per_hour_benefited: f64,
pub avg_latency_reduction_ms: f64,
pub io_reduction_bytes: u64,
pub cpu_reduction_pct: f64,
}
pub struct CostEstimate {
pub storage_bytes: u64,
pub build_time_ms: u64,
pub maintenance_overhead_pct: f64,
}
pub enum RecommendationReason {
HighSelectivityColumn,
FrequentJoinColumn,
HotAccessPattern,
DeleteHeavyTable,
LargeStaticSegment,
StringDictionaryCandidate,
FrequencyAnalyticsColumn,
ColdDataOptimization,
}

ML Model Features:

pub struct FilterFeatureVector {
// Column characteristics
pub cardinality: u64,
pub null_ratio: f64,
pub avg_value_size: u64,
pub data_type: DataTypeCategory,
// Access patterns
pub read_frequency: f64,
pub write_frequency: f64,
pub delete_frequency: f64,
pub scan_vs_point_ratio: f64,
// Query patterns
pub equality_predicate_count: u64,
pub range_predicate_count: u64,
pub join_count: u64,
pub in_list_count: u64,
// Current filter performance (if exists)
pub current_filter_type: Option<FilterType>,
pub current_hit_rate: Option<f64>,
pub current_fpr: Option<f64>,
// Table characteristics
pub table_row_count: u64,
pub table_size_bytes: u64,
pub segment_count: u64,
pub data_temperature: DataTemperature,
}
pub enum DataTemperature {
Hot, // >100 accesses/min
Warm, // 1-100 accesses/min
Cold, // <1 access/min
}

3. FilterBuildCoordinator

Purpose: Background async filter creation with lock-free access during builds.

File: src/storage/filter_build_coordinator.rs

pub struct FilterBuildCoordinator {
/// Priority queue of pending builds
build_queue: PriorityQueue<BuildTask>,
/// Active build workers
workers: Vec<JoinHandle<()>>,
/// Delta tracker for lock-free rebuilds
delta_tracker: DeltaTracker,
/// Budget manager reference
budget_manager: Arc<StorageBudgetManager>,
/// Registry reference
registry: Arc<FilterRegistry>,
/// Configuration
config: BuildConfig,
}
pub struct BuildConfig {
/// Number of parallel build workers
pub worker_count: usize, // Default: num_cpus / 2
/// Max concurrent builds
pub max_concurrent_builds: usize, // Default: 4
/// Build batch size (rows per batch)
pub batch_size: usize, // Default: 10_000
/// CPU limit for builds (0.0-1.0)
pub cpu_budget: f64, // Default: 0.3 (30%)
/// Pause builds during high query load
pub adaptive_throttling: bool, // Default: true
}

Online Rebuild Process:

impl FilterBuildCoordinator {
pub async fn build_filter_online(&self, task: BuildTask) -> Result<FilterId> {
// 1. Mark filter as "Rebuilding" (old version still serving)
self.registry.set_status(filter_id, FilterStatus::Rebuilding);
// 2. Start delta tracking for writes during build
let delta_handle = self.delta_tracker.start_tracking(task.target.clone());
// 3. Build new filter in background
let new_filter = self.build_filter_async(&task).await?;
// 4. Pause briefly to capture final deltas
let final_delta = delta_handle.finalize();
// 5. Apply deltas to new filter
self.apply_deltas(&new_filter, final_delta)?;
// 6. Atomic swap
self.registry.atomic_swap(filter_id, new_filter);
// 7. Mark as Ready
self.registry.set_status(filter_id, FilterStatus::Ready);
self.registry.set_visibility(filter_id, true);
Ok(filter_id)
}
}

4. StorageBudgetManager

Purpose: Manage secondary filter storage limits with online reconfiguration.

File: src/storage/storage_budget_manager.rs

pub struct StorageBudgetManager {
/// Current budget percentage (0.0-1.0)
budget_pct: AtomicF64,
/// Absolute budget in bytes
budget_bytes: AtomicU64,
/// Current usage
current_usage: AtomicU64,
/// Total storage available
total_storage: AtomicU64,
/// Filter registry reference
registry: Arc<FilterRegistry>,
/// Configuration
config: BudgetConfig,
}
pub struct BudgetConfig {
/// Default budget percentage
pub default_pct: f64, // Default: 0.10 (10%)
/// Minimum budget
pub min_bytes: u64, // Default: 100MB
/// Maximum budget
pub max_bytes: Option<u64>, // Default: None
/// Recalculation interval
pub recalc_interval: Duration, // Default: 1 minute
/// Eviction strategy
pub eviction_strategy: EvictionStrategy,
}
pub enum EvictionStrategy {
LeastRecentlyUsed,
LowestBenefit,
LargestFirst,
OldestFirst,
Hybrid, // Default - recommended
}

Online Reconfiguration:

impl StorageBudgetManager {
/// Change budget percentage without restart
pub fn set_budget_pct(&self, new_pct: f64) -> Result<()> {
if new_pct < 0.0 || new_pct > 0.5 {
return Err(Error::InvalidBudget("Must be 0-50%"));
}
self.budget_pct.store(new_pct, Ordering::Release);
self.recalculate_budget();
if self.is_over_budget() {
self.trigger_eviction();
}
Ok(())
}
}

Eviction Rules

Eviction Score Components

ComponentWeightDescription
Optimizer Benefit40%How often optimizer selects this filter
Query Impact30%Actual latency/IO improvement
Recency15%How recently used
Efficiency10%Benefit per byte
Rebuild Cost5%Cost to recreate

Eviction Priority Order (First to Last)

PriorityCategoryConditionScore Range
1 (FIRST)Never Useduse_count=0, optimizer never selected0.00-0.05
2StaleFPR >10%, status=Stale0.05-0.15
3RedundantAnother filter dominates0.10-0.25
4Low Impacthit_rate <50%, latency <5%0.15-0.35
5InfrequentNot used in 24+ hours0.25-0.50
6Inefficient>100MB with <10% improvement0.30-0.60
7Budget PressureNormal filters, over budget0.50-0.80
8 (LAST)High ValueHigh hit rate, recent, efficient0.80-1.00
NEVERProtectedPrimary, user-protected, activeN/A

O_DIRECT Storage Support

Filter persistence uses O_DIRECT for consistency with HeliosCore:

pub struct DirectIOConfig {
pub alignment: usize, // Default: 4096 (4KB)
pub min_io_size: usize, // Default: 4096
pub max_io_size: usize, // Default: 1MB
pub use_io_uring: bool, // Linux 5.1+ async I/O
pub fallback_buffered: bool, // Graceful degradation
}

Benefits:

  • Predictable I/O latency
  • Better memory utilization
  • No double-buffering

SQL Interface

Configuration

-- View/change budget
SHOW helios.filter_budget_pct;
SET helios.filter_budget_pct = 0.15;
-- Per-table settings
ALTER TABLE orders SET (
helios.auto_filter = true,
helios.preferred_filter_type = 'cuckoo',
helios.filter_fpr = 0.01
);

Monitoring Views

-- All filters
SELECT * FROM helios_filters;
-- Recommendations
SELECT * FROM helios_filter_recommendations;
-- Build queue
SELECT * FROM helios_filter_build_queue;
-- Budget status
SELECT * FROM helios_filter_budget;

Implementation Phases

PhaseComponentsPriority
1FilterRegistry, StorageBudgetManager, CuckooFilter, XorFilterHigh
2FilterBuildCoordinator, DeltaTrackerHigh
3WorkloadAnalyzer, FilterAdvisor, FilterDecisionEngineMedium
4RibbonFilter, MPHF, CountMinSketchMedium
5Query Engine integration (FilterCostModel)Medium
6ML model training pipelineLow

Files to Implement

FileDescriptionLines Est.
src/storage/filter_registry.rsCentral catalog~500
src/storage/storage_budget_manager.rsBudget management~400
src/storage/cuckoo_filter.rsCuckoo filter~800
src/storage/xor_filter.rsXOR filter~400
src/storage/filter_build_coordinator.rsBackground builds~600
src/storage/delta_tracker.rsLock-free tracking~300
src/storage/workload_analyzer.rsPattern analysis~400
src/storage/filter_advisor.rsML advisor~500
src/storage/ribbon_filter.rsRibbon filter~1200
src/storage/mphf.rsMPHF~600
src/storage/count_min_sketch.rsCMS~300
src/qe/optimizer/filter_cost_model.rsOptimizer integration~400

Success Metrics

MetricTarget
Query latency reduction>30% on filtered queries
Filter hit rate>80% average
False positive rate<2% average
Storage overhead<15% of data
Build impact on queries<5% latency increase
ML prediction accuracy>70%