Storage-Level Self-Maintaining Filter Architecture
Executive Summary
This document analyzes how to enhance HeliosDB Nano’s storage-level filtering to achieve:
- Zero memory overhead for concurrent connections
- Self-maintaining filter structures updated during DML operations
- Parallel-ready architecture for optimizer-driven parallelism
- CPU-aware background maintenance inspired by MV auto-refresh
Current State Analysis
Existing Components
| Component | Location | Status | Self-Maintaining? |
|---|---|---|---|
| Bloom Filters | bloom_filter.rs | ✓ Implemented | ✗ Manual rebuild |
| Zone Maps | zone_map.rs | ✓ Implemented | ✗ Manual rebuild |
| SIMD Filtering | simd_filter.rs | ✓ Implemented | N/A (stateless) |
| Delta Tracker | mv_delta.rs | ✓ Implemented | ✓ Auto on DML |
| CPU Monitor | mv_scheduler.rs | ✓ Implemented | N/A |
Gap: Filter structures require explicit rebuilding
Current flow:
```
INSERT → Delta recorded (for MV) → Bloom/Zone NOT updated
              ↓
Requires manual:
  engine.build_bloom_filters_for_table("t")
  engine.build_zone_maps_for_table("t", 1000)
```
Comparative Analysis: Database Filtering Architectures
Oracle Exadata Storage Indexes
Architecture:
```
Exadata Storage Cell
├─ Storage Index (per 1MB region, auto-maintained)
│    Col1: min/max │ Col2: min/max │ Col3: min/max │ ...
├─ Maintenance: Updated on EACH write (synchronous)
├─ Storage: ~1% of data size
└─ Memory: None (disk-resident, cached in storage cell RAM)
```
Key Innovation:
- Automatic maintenance with NO DBA intervention
- Per-region granularity (1MB default)
- Synchronous update on write (no background job needed)
PostgreSQL BRIN Indexes
Architecture:
```
BRIN Index Structure
├─ Block Range Index (per N pages, default 128)
│    Range ID │ Min Value │ Max Value │ Has Null
├─ Maintenance: Lazy - requires VACUUM or explicit refresh
├─ Storage: Very small (~0.003% of table size)
└─ Memory: Index pages loaded on demand
```
Limitation:
- NOT self-maintaining (needs VACUUM)
- Best for naturally-ordered data (timestamps, sequences)
- No bloom filter capability
ClickHouse Data Skipping Indexes
Architecture:
```
ClickHouse Granule System
├─ Per-granule (8192 rows default):
│    minmax │ set(N) │ bloom_filter │ ngrambf indexes
├─ Maintenance: Built during MERGE (background)
├─ Storage: Embedded in data parts
└─ Memory: Loaded lazily during query
```
Key Innovation:
- Multiple index types per column
- Background maintenance during compaction
- No memory overhead for idle tables
ScyllaDB/Cassandra Bloom Filters
Architecture:
```
Per-SSTable Bloom Filter
├─ Built during compaction/flush, stored with SSTable
│    Partition Key Bloom Filter (per SSTable)
│    Updated: On SSTable creation (immutable)
│    Size: ~10 bits per key
├─ Memory: Optionally memory-mapped
└─ Maintenance: Zero (immutable once created)
```
Proposed Architecture: Self-Maintaining Filter Index (SMFI)
Design Principles
- Storage-First: All filter structures persisted to RocksDB, not memory
- Delta-Driven: Leverage the existing mv_delta_tracker for incremental updates
- CPU-Aware: Background consolidation with CPU throttling (like MV refresh)
- Parallel-Safe: Lock-free reads, append-only writes
Architecture Overview
```
STORAGE-LEVEL FILTERING ARCHITECTURE

DML OPERATIONS
  INSERT ──┬
  UPDATE ──┼─► mv_delta_tracker.record_*()
  DELETE ──┘            │
                        ▼
FILTER INDEX DELTA TRACKER (NEW)
  Synchronous (Lightweight):
    • Bloom filter bit-set (OR operation, ~100ns)
    • Zone map min/max update (comparison, ~50ns)
    • Delta log append (for background consolidation)
                        │
                        ▼
STORAGE-PERSISTED FILTER STRUCTURES
  RocksDB keys:
    fidx:bloom:{table}:{column}   → BloomFilter bytes
    fidx:zone:{table}:{block_id}  → BlockZoneMap bytes
    fidx:delta:{table}:{seq}      → FilterDelta bytes
    fidx:meta:{table}             → FilterMetadata bytes
  Memory Usage: ZERO (all disk-resident, loaded on-demand)
                        │
                        ▼
BACKGROUND CONSOLIDATION WORKER (inspired by AutoRefreshWorker)
  Triggers:
    • Delta count threshold (e.g., 1000 deltas)
    • Time-based (e.g., every 5 minutes)
    • CPU availability (< 50% usage)
  Operations:
    • Merge delta bloom filters into base filter
    • Recalculate zone maps from delta log
    • Compact delta log
    • Update filter metadata
```
Detailed Component Design
1. Filter Index Delta Tracker
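The tracker below stores a config: FilterIndexConfig and later reads num_hashes and bloom_bits from it, but the type itself is not shown anywhere in this document. A minimal sketch of what it might contain; only the first two fields are actually referenced, the rest are illustrative assumptions.

```rust
/// Sketch only: configuration assumed by FilterIndexDeltaTracker below.
/// num_hashes and bloom_bits are the fields the original code reads;
/// the remaining fields are illustrative assumptions.
#[derive(Clone)]
pub struct FilterIndexConfig {
    /// Number of hash functions applied per bloom probe
    pub num_hashes: u32,
    /// Total bloom filter size in bits
    pub bloom_bits: u64,
    /// Rows per zone-map block
    pub zone_block_size: u64,
    /// Flush in-memory deltas to RocksDB after this many buffered rows
    pub delta_flush_threshold: u64,
}
```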
```rust
/// Lightweight synchronous filter index updates during DML
pub struct FilterIndexDeltaTracker {
    /// Storage engine reference for persistence
    db: Arc<DB>,
    /// Per-table bloom filter deltas (in-memory buffer, flushed periodically)
    bloom_deltas: DashMap<String, BloomFilterDelta>,
    /// Per-table zone map deltas
    zone_deltas: DashMap<String, ZoneMapDelta>,
    /// Delta sequence counter
    delta_seq: AtomicU64,
    /// Configuration
    config: FilterIndexConfig,
}

/// Minimal bloom filter delta (append-only)
/// Default is needed for DashMap::entry(...).or_default() in update_bloom_delta below
#[derive(Default)]
pub struct BloomFilterDelta {
    /// Bits to OR into base filter
    pub bits_to_set: Vec<(usize, u64)>, // (word_index, bits_mask)
    /// Row count added
    pub rows_added: u64,
}

/// Zone map delta for incremental updates
pub struct ZoneMapDelta {
    /// Column updates: (column_name, new_min, new_max)
    pub column_updates: Vec<(String, Option<Value>, Option<Value>)>,
    /// Block ID this delta applies to
    pub block_id: u64,
}
```
```rust
impl FilterIndexDeltaTracker {
    /// Called synchronously during INSERT (cost: ~150ns per row)
    pub fn on_insert(&self, table: &str, row_id: u64, tuple: &Tuple, schema: &Schema) {
        // 1. Update bloom filter delta (fast bit operations)
        self.update_bloom_delta(table, tuple, schema);

        // 2. Update zone map delta (fast min/max comparison)
        self.update_zone_delta(table, row_id, tuple, schema);

        // 3. Append to delta log (for crash recovery)
        self.append_delta_log(table, DeltaType::Insert, row_id, tuple);
    }

    /// Bloom update is O(1) - just compute hash and set bits
    fn update_bloom_delta(&self, table: &str, tuple: &Tuple, schema: &Schema) {
        let mut delta = self.bloom_deltas.entry(table.to_string()).or_default();
        for value in tuple.values.iter() {
            if !matches!(value, Value::Null) {
                let (h1, h2) = self.hash_value(value);
                // Compute bit positions using Kirsch-Mitzenmacher
                for k in 0..self.config.num_hashes {
                    let bit_pos = (h1.wrapping_add(k as u64 * h2)) % self.config.bloom_bits;
                    let word_idx = bit_pos / 64;
                    let bit_mask = 1u64 << (bit_pos % 64);
                    delta.bits_to_set.push((word_idx as usize, bit_mask));
                }
            }
        }
        delta.rows_added += 1;
    }
}
```
2. Storage-Persisted Filter Structures
```rust
/// Filter metadata stored in RocksDB
#[derive(Serialize, Deserialize)]
pub struct FilterIndexMetadata {
    /// Table name
    pub table_name: String,
    /// Number of rows indexed
    pub indexed_rows: u64,
    /// Number of pending deltas
    pub pending_deltas: u64,
    /// Last consolidation timestamp
    pub last_consolidation: DateTime<Utc>,
    /// Bloom filter config
    pub bloom_config: BloomFilterConfig,
    /// Zone map block size
    pub zone_block_size: u64,
    /// Filter index version (for upgrades)
    pub version: u32,
}
```
```rust
/// RocksDB key prefixes for filter index storage
pub mod filter_index_keys {
    pub const BLOOM_PREFIX: &str = "fidx:bloom:";
    pub const ZONE_PREFIX: &str = "fidx:zone:";
    pub const DELTA_PREFIX: &str = "fidx:delta:";
    pub const META_PREFIX: &str = "fidx:meta:";
}
```
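To make the storage-first principle concrete, here is a hedged sketch of writing and reading a consolidated bloom filter under the fidx:bloom: key scheme above. The rocksdb put/get calls are standard; the bincode serialization and a serde-serializable BloomFilter type are assumptions, not something this design specifies.

```rust
use rocksdb::DB;

/// Sketch: persist a consolidated bloom filter for (table, column).
/// Assumes BloomFilter implements serde Serialize/Deserialize (assumption).
fn persist_bloom_filter(
    db: &DB,
    table: &str,
    column: &str,
    filter: &BloomFilter,
) -> Result<(), Box<dyn std::error::Error>> {
    let key = format!("{}{}:{}", filter_index_keys::BLOOM_PREFIX, table, column);
    let bytes = bincode::serialize(filter)?; // assumed serialization format
    db.put(key.as_bytes(), &bytes)?;         // disk-resident; no per-connection memory
    Ok(())
}

/// Sketch: load the filter on demand during query planning.
fn load_bloom_filter(
    db: &DB,
    table: &str,
    column: &str,
) -> Result<Option<BloomFilter>, Box<dyn std::error::Error>> {
    let key = format!("{}{}:{}", filter_index_keys::BLOOM_PREFIX, table, column);
    Ok(match db.get(key.as_bytes())? {
        Some(bytes) => Some(bincode::deserialize(&bytes)?),
        None => None,
    })
}
```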
3. CPU-Aware Background Consolidation Worker
```rust
/// Background worker for filter index maintenance
/// Inspired by AutoRefreshWorker pattern
pub struct FilterIndexMaintenanceWorker {
    /// Storage engine
    storage: Arc<StorageEngine>,
    /// Configuration
    config: MaintenanceConfig,
    /// CPU monitor (reuse from mv_scheduler)
    cpu_monitor: Arc<CpuMonitor>,
    /// Shutdown signal
    shutdown: Arc<AtomicBool>,
}

#[derive(Clone)]
pub struct MaintenanceConfig {
    /// Maximum CPU usage before throttling (default: 15%)
    pub max_cpu_percent: f32,
    /// Delta threshold before consolidation (default: 1000)
    pub delta_threshold: u64,
    /// Time threshold before consolidation (default: 300 seconds)
    pub time_threshold_seconds: u64,
    /// Check interval (default: 10 seconds)
    pub check_interval_seconds: u64,
    /// Enable parallel consolidation
    pub parallel_consolidation: bool,
}
```
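The defaults called out in the field comments can be captured directly; a small sketch (the numeric values come from the comments above, the choice to enable parallel consolidation by default is an assumption):

```rust
impl Default for MaintenanceConfig {
    fn default() -> Self {
        Self {
            max_cpu_percent: 15.0,        // throttle when CPU is above 15%
            delta_threshold: 1000,        // consolidate after 1000 pending deltas
            time_threshold_seconds: 300,  // or after 5 minutes
            check_interval_seconds: 10,   // poll every 10 seconds
            parallel_consolidation: true, // assumption: parallel path on by default
        }
    }
}
```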
```rust
impl FilterIndexMaintenanceWorker {
    pub async fn run(&self) {
        loop {
            if self.shutdown.load(Ordering::Relaxed) {
                break;
            }

            // Check CPU usage before proceeding
            let cpu = self.cpu_monitor.current_usage();
            if cpu > self.config.max_cpu_percent {
                debug!("CPU too high ({:.1}%), skipping filter maintenance", cpu);
                tokio::time::sleep(Duration::from_secs(self.config.check_interval_seconds)).await;
                continue;
            }

            // Find tables needing consolidation
            let tables = self.find_tables_needing_consolidation().await;

            for table in tables {
                // Consolidate in parallel if enabled
                if self.config.parallel_consolidation {
                    self.consolidate_parallel(&table).await;
                } else {
                    self.consolidate_sequential(&table).await;
                }
            }

            tokio::time::sleep(Duration::from_secs(self.config.check_interval_seconds)).await;
        }
    }

    /// Parallel consolidation using rayon
    async fn consolidate_parallel(&self, table: &str) {
        use rayon::prelude::*;

        let deltas = self.load_pending_deltas(table);

        // Parallel bloom filter merge
        let bloom_bits: Vec<_> = deltas.par_iter()
            .flat_map(|d| d.bloom_delta.bits_to_set.par_iter())
            .collect();

        // Apply to base filter atomically
        self.apply_bloom_updates(table, &bloom_bits).await;

        // Parallel zone map updates per block
        let block_groups = self.group_zone_deltas_by_block(&deltas);
        block_groups.par_iter().for_each(|(block_id, updates)| {
            self.merge_zone_updates(table, *block_id, updates);
        });
    }
}
```
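run() above calls find_tables_needing_consolidation(), which is not shown in this document. A hedged sketch of the trigger check it implies, reading the pending_deltas and last_consolidation fields from the FilterIndexMetadata defined earlier; list_tables_with_filters() and load_filter_metadata() are hypothetical storage helpers.

```rust
impl FilterIndexMaintenanceWorker {
    /// Sketch: pick tables whose pending delta count or age exceeds the thresholds.
    async fn find_tables_needing_consolidation(&self) -> Vec<String> {
        let mut due = Vec::new();
        // Hypothetical helper: enumerate tables that have filter index metadata.
        for table in self.storage.list_tables_with_filters() {
            // Hypothetical helper: read fidx:meta:{table} and deserialize it.
            if let Some(meta) = self.storage.load_filter_metadata(&table) {
                let age_seconds = Utc::now()
                    .signed_duration_since(meta.last_consolidation)
                    .num_seconds()
                    .max(0) as u64;
                if meta.pending_deltas >= self.config.delta_threshold
                    || age_seconds >= self.config.time_threshold_seconds
                {
                    due.push(table);
                }
            }
        }
        due
    }
}
```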
4. Parallel-Ready Query Execution
```rust
/// Parallel filter evaluation for FilteredScan
impl PredicatePushdownManager {
    /// Parallel scan with storage-level filtering
    pub fn scan_with_pushdown_parallel(
        &self,
        table_name: &str,
        tuples: Vec<Tuple>,
        predicates: &[AnalyzedPredicate],
        schema: &Schema,
        parallelism: usize,
    ) -> Vec<Tuple> {
        use rayon::prelude::*;

        // Phase 1: Parallel bloom filter check
        let bloom_filtered: Vec<_> = tuples.par_iter()
            .filter(|tuple| self.check_bloom_filters_parallel(table_name, tuple, predicates))
            .cloned()
            .collect();

        // Phase 2: Parallel zone map block pruning (already done at block level)

        // Phase 3: Parallel SIMD predicate evaluation
        // (parallelism is clamped to at least 1 to avoid division by zero)
        let chunk_size = (bloom_filtered.len() / parallelism.max(1)).max(64);
        let results: Vec<_> = bloom_filtered
            .par_chunks(chunk_size)
            .flat_map(|chunk| {
                self.simd_engine.filter_batch_parallel(chunk, predicates, schema)
            })
            .collect();

        results
    }
}
```
Innovation: Speculative Filter Indexes (SFI)
Concept
Unlike traditional indexes that require explicit creation, SFI automatically creates lightweight filter structures based on query patterns:
```
SPECULATIVE FILTER INDEX SYSTEM

Query Pattern Analyzer:
  SELECT * FROM orders WHERE customer_id = ?
      ↓
  Pattern: equality predicate on orders.customer_id
  Frequency: 847 times in last hour
  Selectivity: ~0.01%
      ↓
  Decision: Create bloom filter for orders.customer_id

Auto-Created Filters (no DBA intervention):
  Column       │ Filter Type │ Trigger
  ─────────────┼─────────────┼───────────────────────────────
  customer_id  │ Bloom       │ High equality query frequency
  order_date   │ Zone Map    │ Range queries detected
  status       │ Enum Set    │ Low cardinality column

Lifecycle:
  • Created speculatively after pattern detected
  • Maintained incrementally during DML
  • Dropped automatically if unused for configurable period
  • Zero memory footprint (all storage-persisted)
```
Implementation
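The manager below is written against a QueryPattern type and per-pattern statistics that this document never defines. An assumed minimal shape, just enough for the following code to read coherently:

```rust
/// Assumed shape of the pattern the tracker records (not defined in the original).
#[derive(Clone, Hash, PartialEq, Eq)]
pub struct QueryPattern {
    /// Column the predicate touches
    pub column: String,
    /// Kind of predicate observed
    pub predicate_kind: PredicateKind,
}

#[derive(Clone, Hash, PartialEq, Eq)]
pub enum PredicateKind {
    Equality,
    Range,
    InList,
}

/// Assumed aggregate returned by QueryPatternTracker::get_stats.
pub struct PatternStats {
    pub frequency: u64,
    pub avg_selectivity: f64,
    pub avg_execution_time_ms: u64,
    pub last_seen: DateTime<Utc>,
}
```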
```rust
/// Speculative Filter Index Manager
pub struct SpeculativeFilterManager {
    /// Query pattern tracker
    pattern_tracker: QueryPatternTracker,
    /// Filter index manager
    filter_manager: FilterIndexManager,
    /// Configuration
    config: SpeculativeConfig,
}

#[derive(Clone)]
pub struct SpeculativeConfig {
    /// Minimum query frequency before considering filter creation
    pub min_query_frequency: u64,
    /// Minimum selectivity improvement required (default: 10x)
    pub min_selectivity_improvement: f64,
    /// Days of inactivity before dropping speculative filter
    pub drop_after_days: u32,
    /// Maximum speculative filters per table
    pub max_filters_per_table: usize,
}
```
```rust
impl SpeculativeFilterManager {
    /// Called by optimizer after query execution
    pub fn record_query_pattern(
        &self,
        table: &str,
        predicate: &LogicalExpr,
        selectivity: f64,
        execution_time_ms: u64,
    ) {
        let pattern = self.extract_pattern(predicate);
        // Clone so the pattern can still be inspected after being recorded
        self.pattern_tracker.record(table, pattern.clone(), selectivity, execution_time_ms);

        // Check if we should create a speculative filter
        if self.should_create_filter(table, &pattern) {
            self.create_speculative_filter(table, &pattern);
        }
    }

    fn should_create_filter(&self, table: &str, pattern: &QueryPattern) -> bool {
        let stats = self.pattern_tracker.get_stats(table, pattern);

        // Criteria for filter creation:
        // 1. High query frequency
        // 2. Low selectivity (few rows returned)
        // 3. No existing filter for this column
        // 4. Below max filters per table
        stats.frequency >= self.config.min_query_frequency
            && stats.avg_selectivity < 0.1 // Less than 10% of rows
            && !self.filter_exists(table, &pattern.column)
            && self.filter_count(table) < self.config.max_filters_per_table
    }
}
```
Innovation: Columnar Zone Summaries (CZS)
Concept
Inspired by Exadata Storage Indexes but optimized for row-store with column awareness:
```
COLUMNAR ZONE SUMMARY ARCHITECTURE

Traditional Zone Map:
  Block 0: min=1, max=100 (all columns combined)
  Block 1: min=50, max=200

Columnar Zone Summary (CZS):
  Block 0:
  ├── id:       min=1,     max=100,     distinct≈100, nulls=0
  ├── customer: min=1001,  max=9999,    distinct≈45,  nulls=0
  ├── total:    min=10.50, max=9999.99, avg=156.23,   nulls=2
  └── status:   values={pending,shipped,delivered},   nulls=0

  Additional Summaries:
  ├── Histogram buckets (for range selectivity estimation)
  ├── Most common values (for equality optimization)
  └── Correlation hints (for multi-column predicates)

Storage:     ~2KB per block (persisted in RocksDB)
Memory:      Zero (loaded on-demand during query planning)
Maintenance: Incremental during DML
```
Implementation
```rust
/// Enhanced per-block column summary
#[derive(Serialize, Deserialize)]
pub struct ColumnZoneSummary {
    /// Column name
    pub column_name: String,
    /// Basic statistics
    pub min: Option<Value>,
    pub max: Option<Value>,
    pub null_count: u64,
    pub row_count: u64,
    /// Approximate distinct count (HyperLogLog)
    pub approx_distinct: u64,
    /// Average (for numeric columns)
    pub avg: Option<f64>,
    /// Sum (for numeric columns, useful for aggregation pushdown)
    pub sum: Option<f64>,
    /// Most common values (top 5) with frequencies
    pub mcv: Vec<(Value, u64)>,
    /// Histogram buckets (for selectivity estimation)
    pub histogram: Option<HistogramBuckets>,
}

/// Block-level columnar zone summary
#[derive(Serialize, Deserialize)]
pub struct BlockZoneSummary {
    pub block_id: u64,
    pub first_row_id: u64,
    pub last_row_id: u64,
    pub row_count: u64,
    pub columns: HashMap<String, ColumnZoneSummary>,
    /// Cross-column correlation hints
    pub correlations: Vec<ColumnCorrelation>,
}
```
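update_on_insert below delegates per-value work to ColumnZoneSummary::update_incremental, which is not spelled out in this document. A hedged sketch of an O(1) update covering min/max, null count, and the running sum/average; it assumes Value implements PartialOrd and exposes an as_f64() helper for numeric types (both assumptions).

```rust
impl ColumnZoneSummary {
    /// Sketch: fold a single value into the block summary in O(1).
    pub fn update_incremental(&mut self, value: &Value) {
        self.row_count += 1;
        if matches!(value, Value::Null) {
            self.null_count += 1;
            return;
        }
        // Min/max tracking (assumes Value: PartialOrd).
        match &self.min {
            Some(current_min) if value >= current_min => {}
            _ => self.min = Some(value.clone()),
        }
        match &self.max {
            Some(current_max) if value <= current_max => {}
            _ => self.max = Some(value.clone()),
        }
        // Running sum / average for numeric columns (as_f64 is an assumed helper).
        if let Some(n) = value.as_f64() {
            let sum = self.sum.unwrap_or(0.0) + n;
            self.sum = Some(sum);
            let non_null_rows = (self.row_count - self.null_count) as f64;
            self.avg = Some(sum / non_null_rows);
        }
        // approx_distinct, mcv and histogram would be updated here too
        // (HyperLogLog / top-k sketches); omitted to keep the sketch short.
    }
}
```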
```rust
impl BlockZoneSummary {
    /// Incremental update on INSERT (O(1) per column)
    pub fn update_on_insert(&mut self, tuple: &Tuple, schema: &Schema) {
        self.row_count += 1;

        for (i, col) in schema.columns.iter().enumerate() {
            let summary = self.columns.entry(col.name.clone())
                .or_insert_with(|| ColumnZoneSummary::new(&col.name));

            let value = tuple.get(i).unwrap_or(&Value::Null);
            summary.update_incremental(value);
        }
    }

    /// Can this block satisfy the predicate? (Fast check)
    pub fn can_satisfy(&self, predicates: &[AnalyzedPredicate]) -> BlockDecision {
        for pred in predicates {
            if let Some(summary) = self.columns.get(&pred.column_name) {
                match pred.evaluate_against_summary(summary) {
                    SummaryMatch::Impossible => return BlockDecision::Skip,
                    SummaryMatch::Partial => continue,
                    SummaryMatch::Full => continue,
                }
            }
        }
        BlockDecision::Scan
    }
}
```
Memory vs Storage Comparison
Per-Connection Memory Usage
| System | Memory per Connection | Where Filtering Happens |
|---|---|---|
| Oracle (In-Memory) | ~10MB+ | SIMD in memory |
| PostgreSQL | ~5-10MB | Filter in executor |
| HeliosDB Nano (Current) | ~2-5MB | Filter after load |
| HeliosDB Nano (SMFI) | ~1-2MB | Filter at storage |
Storage Overhead for Filter Indexes
| Table Size | Bloom Filters | Zone Maps | Total Overhead |
|---|---|---|---|
| 1GB | ~10MB (1%) | ~5MB (0.5%) | ~15MB (1.5%) |
| 10GB | ~100MB | ~50MB | ~150MB |
| 100GB | ~1GB | ~500MB | ~1.5GB |
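As a sanity check on the bloom filter column above, the standard sizing formula m = -n · ln(p) / (ln 2)² works out to roughly 1.2 bytes per key at a 1% false-positive rate. A small helper reproducing the estimate; the 100-bytes-per-row assumption used to turn table size into a row count is illustrative only.

```rust
/// Sketch: bloom filter size in bytes for n keys at false-positive rate p.
fn bloom_filter_bytes(n_keys: u64, false_positive_rate: f64) -> u64 {
    // m = -n * ln(p) / (ln 2)^2 bits, rounded up to whole bytes
    let bits = -(n_keys as f64) * false_positive_rate.ln() / (2f64.ln().powi(2));
    (bits / 8.0).ceil() as u64
}

fn main() {
    // Illustrative assumption: a 1GB table with ~100-byte rows holds ~10M rows.
    let rows = 10_000_000u64;
    let bytes = bloom_filter_bytes(rows, 0.01);
    println!("~{} MB for {} rows at 1% FPP", bytes / (1024 * 1024), rows);
    // Prints roughly 11 MB, consistent with the ~10MB (1%) figure in the table above.
}
```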
Query Latency Impact
```
Scenario:   SELECT * FROM orders WHERE customer_id = 12345 AND total > 100
Table size: 10 million rows, 10% match customer_id, 50% match total > 100

Without SMFI:
  Load all 10M rows → Filter in memory → Return 500K rows
  Memory: ~2GB peak
  Time:   ~5 seconds

With SMFI:
  Bloom check customer_id → Zone map prune blocks → SIMD filter remainder
  Memory: ~50MB peak (only matching blocks loaded)
  Time:   ~200ms
```
Implementation Roadmap
Phase 1: Self-Maintaining Filter Updates (Core)
- Integrate filter delta tracking into insert_tuple(), update_tuple(), delete_tuple()
- Persist filter deltas to RocksDB
- Background consolidation worker with CPU awareness
- Crash recovery from delta log (a minimal recovery sketch follows this list)
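A hedged sketch of what the recovery step might look like: scan the fidx:delta:{table}: key range on startup and re-apply any deltas that were logged but not yet consolidated. The rocksdb prefix iteration is standard; FilterDelta, its bincode encoding, and the apply_recovered_delta helper are assumptions.

```rust
use rocksdb::{DB, Direction, IteratorMode};

/// Sketch: replay logged-but-unconsolidated filter deltas after a crash.
fn recover_filter_deltas(
    db: &DB,
    table: &str,
    tracker: &FilterIndexDeltaTracker,
) -> Result<u64, Box<dyn std::error::Error>> {
    let prefix = format!("{}{}:", filter_index_keys::DELTA_PREFIX, table);
    let mut replayed = 0u64;
    for item in db.iterator(IteratorMode::From(prefix.as_bytes(), Direction::Forward)) {
        let (key, value) = item?;
        if !key.starts_with(prefix.as_bytes()) {
            break; // walked past the fidx:delta:{table}: key range
        }
        // FilterDelta and apply_recovered_delta are assumed shapes, not shown above.
        let delta: FilterDelta = bincode::deserialize(&value)?;
        tracker.apply_recovered_delta(table, delta);
        replayed += 1;
    }
    Ok(replayed)
}
```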
Phase 2: Columnar Zone Summaries (Enhanced Statistics)
- Implement BlockZoneSummary with incremental updates
- HyperLogLog for approximate distinct counts (a minimal sketch follows this list)
- Histogram buckets for selectivity estimation
- Integration with optimizer cost model
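For the HyperLogLog item above, a self-contained sketch of the core register update and the raw cardinality estimate. This is the textbook algorithm; the precision (2^14 registers) and the use of std's DefaultHasher are illustrative choices, not decisions from this design.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal HyperLogLog sketch with 2^14 registers (~0.8% relative error).
pub struct HyperLogLog {
    registers: Vec<u8>,
}

impl HyperLogLog {
    const P: u32 = 14; // precision bits, m = 2^14 registers

    pub fn new() -> Self {
        Self { registers: vec![0; 1 << Self::P] }
    }

    pub fn insert<T: Hash>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let h = hasher.finish();
        let idx = (h >> (64 - Self::P)) as usize;  // top P bits choose the register
        let rest = h << Self::P;                   // remaining bits
        let rank = (rest.leading_zeros() + 1).min(64 - Self::P + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;            // keep the maximum rank seen
        }
    }

    /// Raw estimate (small/large-range corrections omitted for brevity).
    pub fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        alpha * m * m / sum
    }
}
```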
Phase 3: Speculative Filter Indexes (Intelligence)
- Query pattern tracker
- Automatic filter creation based on patterns
- Filter lifecycle management (creation, maintenance, dropping)
- Optimizer hints for speculative filter usage
Phase 4: Parallel Query Support
- Parallel bloom filter evaluation with rayon
- Parallel zone map block pruning
- Parallel SIMD filtering
- Optimizer-driven parallelism decisions
Conclusion
The proposed Self-Maintaining Filter Index (SMFI) architecture provides:
- Zero additional memory for concurrent connections
- Automatic maintenance integrated with existing DML paths
- CPU-aware background consolidation preventing system overload
- Parallel-ready design for optimizer-driven parallelism
- Innovative speculative indexing for workload adaptation
This aligns with HeliosDB Nano’s philosophy of memory optimization while maximizing storage-level filtering efficiency.