Storage-Level Self-Maintaining Filter Architecture

Executive Summary

This document analyzes how to enhance HeliosDB Nano’s storage-level filtering to achieve:

  • Zero memory overhead for concurrent connections
  • Self-maintaining filter structures updated during DML operations
  • Parallel-ready architecture for optimizer-driven parallelism
  • CPU-aware background maintenance inspired by MV auto-refresh

Current State Analysis

Existing Components

| Component      | Location        | Status        | Self-Maintaining? |
| -------------- | --------------- | ------------- | ----------------- |
| Bloom Filters  | bloom_filter.rs | ✓ Implemented | ✗ Manual rebuild  |
| Zone Maps      | zone_map.rs     | ✓ Implemented | ✗ Manual rebuild  |
| SIMD Filtering | simd_filter.rs  | ✓ Implemented | N/A (stateless)   |
| Delta Tracker  | mv_delta.rs     | ✓ Implemented | ✓ Auto on DML     |
| CPU Monitor    | mv_scheduler.rs | ✓ Implemented | N/A               |

Gap: Filter structures require explicit rebuilding

Current flow:

INSERT → Delta recorded (for MV) → Bloom/Zone NOT updated
Requires manual:
engine.build_bloom_filters_for_table("t")
engine.build_zone_maps_for_table("t", 1000)

Comparative Analysis: Database Filtering Architectures

Oracle Exadata Storage Indexes

Architecture:

┌─────────────────────────────────────────────────────────────┐
│ Exadata Storage Cell │
├─────────────────────────────────────────────────────────────┤
│ Storage Index (per 1MB region, auto-maintained) │
│ ┌─────────┬─────────┬─────────┬─────────┐ │
│ │ Col1 │ Col2 │ Col3 │ ... │ │
│ │ min/max │ min/max │ min/max │ │ │
│ └─────────┴─────────┴─────────┴─────────┘ │
│ │
│ Maintenance: Updated on EACH write (synchronous) │
│ Storage: ~1% of data size │
│ Memory: None (disk-resident, cached in storage cell RAM) │
└─────────────────────────────────────────────────────────────┘

Key Innovation:

  • Automatic maintenance with NO DBA intervention
  • Per-region granularity (1MB default)
  • Synchronous update on write (no background job needed)

PostgreSQL BRIN Indexes

Architecture:

┌─────────────────────────────────────────────────────────────┐
│ BRIN Index Structure │
├─────────────────────────────────────────────────────────────┤
│ Block Range Index (per N pages, default 128) │
│ ┌─────────────────────────────────────────────┐ │
│ │ Range ID │ Min Value │ Max Value │ Has Null │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Maintenance: Lazy - requires VACUUM or explicit refresh │
│ Storage: Very small (~0.003% of table size) │
│ Memory: Index pages loaded on demand │
└─────────────────────────────────────────────────────────────┘

Limitation:

  • NOT self-maintaining (needs VACUUM)
  • Best for naturally-ordered data (timestamps, sequences)
  • No bloom filter capability

ClickHouse Data Skipping Indexes

Architecture:

┌─────────────────────────────────────────────────────────────┐
│ ClickHouse Granule System │
├─────────────────────────────────────────────────────────────┤
│ Per-granule (8192 rows default): │
│ ┌─────────────────────────────────────────────┐ │
│ │ minmax │ set(N) │ bloom_filter │ ngrambf │ │
│ │ indexes │ indexes │ indexes │ indexes │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Maintenance: Built during MERGE (background) │
│ Storage: Embedded in data parts │
│ Memory: Loaded lazily during query │
└─────────────────────────────────────────────────────────────┘

Key Innovation:

  • Multiple index types per column
  • Background maintenance during compaction
  • No memory overhead for idle tables

ScyllaDB/Cassandra Bloom Filters

Architecture:

┌─────────────────────────────────────────────────────────────┐
│ Per-SSTable Bloom Filter │
├─────────────────────────────────────────────────────────────┤
│ Built during compaction/flush, stored with SSTable │
│ ┌─────────────────────────────────────────────┐ │
│ │ Partition Key Bloom Filter (per SSTable) │ │
│ │ Updated: On SSTable creation (immutable) │ │
│ │ Size: ~10 bits per key │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Memory: Optionally memory-mapped │
│ Maintenance: Zero (immutable once created) │
└─────────────────────────────────────────────────────────────┘
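
For reference, the "~10 bits per key" figure above follows from the standard Bloom filter sizing formulas. A minimal, self-contained sketch (not HeliosDB code) that computes bit and hash counts for a target false-positive rate:

/// Classical Bloom filter sizing: for n keys and false-positive rate p,
/// the optimal bit count is m = -n * ln(p) / (ln 2)^2 and the optimal
/// hash count is k = (m / n) * ln 2. With p = 1% this works out to
/// roughly 9.6 bits per key, i.e. the "~10 bits per key" above.
fn bloom_sizing(n: u64, p: f64) -> (u64, u32) {
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil() as u64; // total bits
    let k = ((m as f64 / n as f64) * ln2).round().max(1.0) as u32; // hash functions
    (m, k)
}

fn main() {
    let (bits, hashes) = bloom_sizing(1_000_000, 0.01);
    println!("{} bits (~{:.1} per key), {} hash functions", bits, bits as f64 / 1e6, hashes);
}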

Proposed Architecture: Self-Maintaining Filter Index (SMFI)

Design Principles

  1. Storage-First: All filter structures persisted to RocksDB, not memory
  2. Delta-Driven: Leverage existing mv_delta_tracker for incremental updates
  3. CPU-Aware: Background consolidation with CPU throttling (like MV refresh)
  4. Parallel-Safe: Lock-free reads, append-only writes

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────┐
│ STORAGE-LEVEL FILTERING ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ DML OPERATIONS │ │
│ │ INSERT ─────┬──────────────────────────────────────────────────── │ │
│ │ UPDATE ─────┼─► mv_delta_tracker.record_*() │ │
│ │ DELETE ─────┘ │ │ │
│ └──────────────────────────┼─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ FILTER INDEX DELTA TRACKER (NEW) │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Synchronous (Lightweight): │ │ │
│ │ │ • Bloom filter bit-set (OR operation, ~100ns) │ │ │
│ │ │ • Zone map min/max update (comparison, ~50ns) │ │ │
│ │ │ • Delta log append (for background consolidation) │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ STORAGE-PERSISTED FILTER STRUCTURES │ │
│ │ │ │
│ │ RocksDB Keys: │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ fidx:bloom:{table}:{column} → BloomFilter bytes │ │ │
│ │ │ fidx:zone:{table}:{block_id} → BlockZoneMap bytes │ │ │
│ │ │ fidx:delta:{table}:{seq} → FilterDelta bytes │ │ │
│ │ │ fidx:meta:{table} → FilterMetadata bytes │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Memory Usage: ZERO (all disk-resident, loaded on-demand) │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ BACKGROUND CONSOLIDATION WORKER │ │
│ │ (Inspired by AutoRefreshWorker) │ │
│ │ │ │
│ │ Triggers: │ │
│ │ • Delta count threshold (e.g., 1000 deltas) │ │
│ │ • Time-based (e.g., every 5 minutes) │ │
│ │ • CPU availability (< 50% usage) │ │
│ │ │ │
│ │ Operations: │ │
│ │ • Merge delta bloom filters into base filter │ │
│ │ • Recalculate zone maps from delta log │ │
│ │ • Compact delta log │ │
│ │ • Update filter metadata │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
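
The diagram assumes the existing DML entry points call the new tracker synchronously. A hedged sketch of that wiring is shown below; `write_tuple`, `schema`, the tracker field names, and `record_insert` are assumed names, and the real integration point may differ:

impl StorageEngine {
    /// Hypothetical insert path showing where the synchronous filter
    /// update hooks in alongside the existing MV delta tracking.
    pub fn insert_tuple(&self, table: &str, tuple: Tuple) -> anyhow::Result<u64> {
        let row_id = self.write_tuple(table, &tuple)?; // existing write path
        let schema = self.schema(table)?;
        // Existing MV delta tracking (unchanged)
        self.mv_delta_tracker.record_insert(table, row_id, &tuple);
        // NEW: lightweight synchronous filter index update (~150 ns per row)
        self.filter_delta_tracker.on_insert(table, row_id, &tuple, &schema);
        Ok(row_id)
    }
}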

Detailed Component Design

1. Filter Index Delta Tracker

/// Lightweight synchronous filter index updates during DML
pub struct FilterIndexDeltaTracker {
    /// Storage engine reference for persistence
    db: Arc<DB>,
    /// Per-table bloom filter deltas (in-memory buffer, flushed periodically)
    bloom_deltas: DashMap<String, BloomFilterDelta>,
    /// Per-table zone map deltas
    zone_deltas: DashMap<String, ZoneMapDelta>,
    /// Delta sequence counter
    delta_seq: AtomicU64,
    /// Configuration
    config: FilterIndexConfig,
}

/// Minimal bloom filter delta (append-only)
#[derive(Default)]
pub struct BloomFilterDelta {
    /// Bits to OR into base filter
    pub bits_to_set: Vec<(usize, u64)>, // (word_index, bits_mask)
    /// Row count added
    pub rows_added: u64,
}

/// Zone map delta for incremental updates
pub struct ZoneMapDelta {
    /// Column updates: (column_name, new_min, new_max)
    pub column_updates: Vec<(String, Option<Value>, Option<Value>)>,
    /// Block ID this delta applies to
    pub block_id: u64,
}

impl FilterIndexDeltaTracker {
    /// Called synchronously during INSERT (cost: ~150ns per row)
    pub fn on_insert(&self, table: &str, row_id: u64, tuple: &Tuple, schema: &Schema) {
        // 1. Update bloom filter delta (fast bit operations)
        self.update_bloom_delta(table, tuple, schema);
        // 2. Update zone map delta (fast min/max comparison)
        self.update_zone_delta(table, row_id, tuple, schema);
        // 3. Append to delta log (for crash recovery)
        self.append_delta_log(table, DeltaType::Insert, row_id, tuple);
    }

    /// Bloom update is O(1) - just compute hash and set bits
    fn update_bloom_delta(&self, table: &str, tuple: &Tuple, schema: &Schema) {
        let mut delta = self.bloom_deltas.entry(table.to_string()).or_default();
        for value in tuple.values.iter() {
            if !matches!(value, Value::Null) {
                let (h1, h2) = self.hash_value(value);
                // Compute bit positions using Kirsch-Mitzenmacher
                for k in 0..self.config.num_hashes {
                    let bit_pos = (h1.wrapping_add(k as u64 * h2)) % self.config.bloom_bits;
                    let word_idx = bit_pos / 64;
                    let bit_mask = 1u64 << (bit_pos % 64);
                    delta.bits_to_set.push((word_idx as usize, bit_mask));
                }
            }
        }
        delta.rows_added += 1;
    }
}
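
The tracker above references a FilterIndexConfig that is not shown. The following is an assumed shape covering the fields used in update_bloom_delta (num_hashes, bloom_bits); the remaining fields and default values are illustrative only:

/// Sketch of the configuration the tracker above assumes; not an
/// existing HeliosDB type, and defaults are illustrative.
#[derive(Clone)]
pub struct FilterIndexConfig {
    /// Number of hash functions per bloom filter (k)
    pub num_hashes: u32,
    /// Total bloom filter size in bits (m)
    pub bloom_bits: u64,
    /// Rows per zone-map block
    pub zone_block_size: u64,
    /// Buffered deltas to hold before flushing to RocksDB
    pub flush_threshold: u64,
}

impl Default for FilterIndexConfig {
    fn default() -> Self {
        Self {
            num_hashes: 7,
            bloom_bits: 8 * 1024 * 1024,
            zone_block_size: 1000,
            flush_threshold: 1000,
        }
    }
}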

2. Storage-Persisted Filter Structures

/// Filter metadata stored in RocksDB
#[derive(Serialize, Deserialize)]
pub struct FilterIndexMetadata {
    /// Table name
    pub table_name: String,
    /// Number of rows indexed
    pub indexed_rows: u64,
    /// Number of pending deltas
    pub pending_deltas: u64,
    /// Last consolidation timestamp
    pub last_consolidation: DateTime<Utc>,
    /// Bloom filter config
    pub bloom_config: BloomFilterConfig,
    /// Zone map block size
    pub zone_block_size: u64,
    /// Filter index version (for upgrades)
    pub version: u32,
}

/// RocksDB key prefixes for filter index storage
pub mod filter_index_keys {
    pub const BLOOM_PREFIX: &str = "fidx:bloom:";
    pub const ZONE_PREFIX: &str = "fidx:zone:";
    pub const DELTA_PREFIX: &str = "fidx:delta:";
    pub const META_PREFIX: &str = "fidx:meta:";
}
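
Key construction is not shown above. The helpers below illustrate the layout listed in the architecture diagram; the exact encoding (e.g. zero-padding numeric suffixes so lexicographic iteration matches numeric order) is an assumption:

/// Illustrative key builders for the prefixes above; not existing HeliosDB code.
pub fn bloom_key(table: &str, column: &str) -> String {
    format!("{}{}:{}", filter_index_keys::BLOOM_PREFIX, table, column)
}

pub fn zone_key(table: &str, block_id: u64) -> String {
    // Zero-padded block id keeps keys in numeric order under lexicographic iteration
    format!("{}{}:{:020}", filter_index_keys::ZONE_PREFIX, table, block_id)
}

pub fn delta_key(table: &str, seq: u64) -> String {
    format!("{}{}:{:020}", filter_index_keys::DELTA_PREFIX, table, seq)
}

pub fn meta_key(table: &str) -> String {
    format!("{}{}", filter_index_keys::META_PREFIX, table)
}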

3. CPU-Aware Background Consolidation Worker

/// Background worker for filter index maintenance
/// Inspired by AutoRefreshWorker pattern
pub struct FilterIndexMaintenanceWorker {
    /// Storage engine
    storage: Arc<StorageEngine>,
    /// Configuration
    config: MaintenanceConfig,
    /// CPU monitor (reuse from mv_scheduler)
    cpu_monitor: Arc<CpuMonitor>,
    /// Shutdown signal
    shutdown: Arc<AtomicBool>,
}

#[derive(Clone)]
pub struct MaintenanceConfig {
    /// Maximum CPU usage before throttling (default: 15%)
    pub max_cpu_percent: f32,
    /// Delta threshold before consolidation (default: 1000)
    pub delta_threshold: u64,
    /// Time threshold before consolidation (default: 300 seconds)
    pub time_threshold_seconds: u64,
    /// Check interval (default: 10 seconds)
    pub check_interval_seconds: u64,
    /// Enable parallel consolidation
    pub parallel_consolidation: bool,
}

impl FilterIndexMaintenanceWorker {
    pub async fn run(&self) {
        loop {
            if self.shutdown.load(Ordering::Relaxed) {
                break;
            }
            // Check CPU usage before proceeding
            let cpu = self.cpu_monitor.current_usage();
            if cpu > self.config.max_cpu_percent {
                debug!("CPU too high ({:.1}%), skipping filter maintenance", cpu);
                tokio::time::sleep(Duration::from_secs(self.config.check_interval_seconds)).await;
                continue;
            }
            // Find tables needing consolidation
            let tables = self.find_tables_needing_consolidation().await;
            for table in tables {
                // Consolidate in parallel if enabled
                if self.config.parallel_consolidation {
                    self.consolidate_parallel(&table).await;
                } else {
                    self.consolidate_sequential(&table).await;
                }
            }
            tokio::time::sleep(Duration::from_secs(self.config.check_interval_seconds)).await;
        }
    }

    /// Parallel consolidation using rayon
    async fn consolidate_parallel(&self, table: &str) {
        use rayon::prelude::*;
        let deltas = self.load_pending_deltas(table);
        // Parallel bloom filter merge
        let bloom_bits: Vec<_> = deltas.par_iter()
            .flat_map(|d| d.bloom_delta.bits_to_set.par_iter())
            .collect();
        // Apply to base filter atomically
        self.apply_bloom_updates(table, &bloom_bits).await;
        // Parallel zone map updates per block
        let block_groups = self.group_zone_deltas_by_block(&deltas);
        block_groups.par_iter().for_each(|(block_id, updates)| {
            self.merge_zone_updates(table, *block_id, updates);
        });
    }
}
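
A possible startup path, mirroring how the MV AutoRefreshWorker is launched; the constructor, the configuration values, and the tokio::spawn wiring below are assumptions rather than existing HeliosDB code:

/// Hypothetical startup wiring for the maintenance worker.
fn start_filter_maintenance(
    storage: Arc<StorageEngine>,
    cpu_monitor: Arc<CpuMonitor>,
) -> Arc<AtomicBool> {
    let shutdown = Arc::new(AtomicBool::new(false));
    // `new` is an assumed constructor taking the fields shown above
    let worker = FilterIndexMaintenanceWorker::new(
        storage,
        MaintenanceConfig {
            max_cpu_percent: 15.0,
            delta_threshold: 1000,
            time_threshold_seconds: 300,
            check_interval_seconds: 10,
            parallel_consolidation: true,
        },
        cpu_monitor,
        Arc::clone(&shutdown),
    );
    // The worker loops until `shutdown` is flipped to true
    tokio::spawn(async move { worker.run().await });
    shutdown
}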

4. Parallel-Ready Query Execution

/// Parallel filter evaluation for FilteredScan
impl PredicatePushdownManager {
    /// Parallel scan with storage-level filtering
    pub fn scan_with_pushdown_parallel(
        &self,
        table_name: &str,
        tuples: Vec<Tuple>,
        predicates: &[AnalyzedPredicate],
        schema: &Schema,
        parallelism: usize,
    ) -> Vec<Tuple> {
        use rayon::prelude::*;
        // Phase 1: Parallel bloom filter check
        let bloom_filtered: Vec<_> = tuples.par_iter()
            .filter(|tuple| self.check_bloom_filters_parallel(table_name, tuple, predicates))
            .cloned()
            .collect();
        // Phase 2: Parallel zone map block pruning (already done at block level)
        // Phase 3: Parallel SIMD predicate evaluation
        let chunk_size = (bloom_filtered.len() / parallelism).max(64);
        let results: Vec<_> = bloom_filtered
            .par_chunks(chunk_size)
            .flat_map(|chunk| {
                self.simd_engine.filter_batch_parallel(chunk, predicates, schema)
            })
            .collect();
        results
    }
}
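
An illustrative call site, assuming the optimizer (or a simple heuristic like the one below) chooses the degree of parallelism and that the input tuples come from blocks that already survived zone-map pruning:

/// Hypothetical call site; names of the surrounding values are assumptions.
fn scan_orders(
    mgr: &PredicatePushdownManager,
    tuples: Vec<Tuple>,
    predicates: &[AnalyzedPredicate],
    schema: &Schema,
) -> Vec<Tuple> {
    let parallelism = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4);
    mgr.scan_with_pushdown_parallel("orders", tuples, predicates, schema, parallelism)
}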

Innovation: Speculative Filter Indexes (SFI)

Concept

Unlike traditional indexes that require explicit creation, SFI automatically creates lightweight filter structures based on query patterns:

┌─────────────────────────────────────────────────────────────────────────┐
│ SPECULATIVE FILTER INDEX SYSTEM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Query Pattern Analyzer: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ SELECT * FROM orders WHERE customer_id = ? │ │
│ │ ↓ │ │
│ │ Pattern: equality predicate on orders.customer_id │ │
│ │ Frequency: 847 times in last hour │ │
│ │ Selectivity: ~0.01% │ │
│ │ ↓ │ │
│ │ Decision: Create bloom filter for orders.customer_id │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Auto-Created Filters (no DBA intervention): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Column │ Filter Type │ Trigger │ │
│ │ ───────────────┼─────────────┼─────────────────────────────── │ │
│ │ customer_id │ Bloom │ High equality query frequency │ │
│ │ order_date │ Zone Map │ Range queries detected │ │
│ │ status │ Enum Set │ Low cardinality column │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Lifecycle: │
│ • Created speculatively after pattern detected │
│ • Maintained incrementally during DML │
│ • Dropped automatically if unused for configurable period │
│ • Zero memory footprint (all storage-persisted) │
│ │
└─────────────────────────────────────────────────────────────────────────┘

Implementation

/// Speculative Filter Index Manager
pub struct SpeculativeFilterManager {
    /// Query pattern tracker
    pattern_tracker: QueryPatternTracker,
    /// Filter index manager
    filter_manager: FilterIndexManager,
    /// Configuration
    config: SpeculativeConfig,
}

#[derive(Clone)]
pub struct SpeculativeConfig {
    /// Minimum query frequency before considering filter creation
    pub min_query_frequency: u64,
    /// Minimum selectivity improvement required (default: 10x)
    pub min_selectivity_improvement: f64,
    /// Days of inactivity before dropping speculative filter
    pub drop_after_days: u32,
    /// Maximum speculative filters per table
    pub max_filters_per_table: usize,
}

impl SpeculativeFilterManager {
    /// Called by optimizer after query execution
    pub fn record_query_pattern(&self, table: &str, predicate: &LogicalExpr,
                                selectivity: f64, execution_time_ms: u64) {
        let pattern = self.extract_pattern(predicate);
        self.pattern_tracker.record(table, pattern.clone(), selectivity, execution_time_ms);
        // Check if we should create a speculative filter
        if self.should_create_filter(table, &pattern) {
            self.create_speculative_filter(table, &pattern);
        }
    }

    fn should_create_filter(&self, table: &str, pattern: &QueryPattern) -> bool {
        let stats = self.pattern_tracker.get_stats(table, pattern);
        // Criteria for filter creation:
        // 1. High query frequency
        // 2. Low selectivity (few rows returned)
        // 3. No existing filter for this column
        // 4. Below max filters per table
        stats.frequency >= self.config.min_query_frequency
            && stats.avg_selectivity < 0.1 // Less than 10% of rows
            && !self.filter_exists(table, &pattern.column)
            && self.filter_count(table) < self.config.max_filters_per_table
    }
}
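
QueryPatternTracker and QueryPattern are referenced above but not defined. A minimal sketch of what they could look like, assuming DashMap-backed statistics keyed by (table, pattern); none of these types exist in HeliosDB today:

/// Sketch of the pattern-tracking types assumed by the manager above.
#[derive(Clone, Hash, PartialEq, Eq)]
pub struct QueryPattern {
    pub column: String,
    pub kind: PredicateKind,
}

#[derive(Clone, Hash, PartialEq, Eq)]
pub enum PredicateKind { Equality, Range, InList }

#[derive(Default, Clone)]
pub struct PatternStats {
    pub frequency: u64,
    pub avg_selectivity: f64,
    pub total_execution_ms: u64,
}

#[derive(Default)]
pub struct QueryPatternTracker {
    stats: DashMap<(String, QueryPattern), PatternStats>,
}

impl QueryPatternTracker {
    pub fn record(&self, table: &str, pattern: QueryPattern, selectivity: f64, ms: u64) {
        let mut entry = self.stats.entry((table.to_string(), pattern)).or_default();
        // Running average keeps the stats O(1) per query
        let n = entry.frequency as f64;
        entry.avg_selectivity = (entry.avg_selectivity * n + selectivity) / (n + 1.0);
        entry.frequency += 1;
        entry.total_execution_ms += ms;
    }

    pub fn get_stats(&self, table: &str, pattern: &QueryPattern) -> PatternStats {
        self.stats.get(&(table.to_string(), pattern.clone()))
            .map(|s| s.value().clone())
            .unwrap_or_default()
    }
}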

Innovation: Columnar Zone Summaries (CZS)

Concept

Inspired by Exadata Storage Indexes, but optimized for a row store with per-column awareness:

┌─────────────────────────────────────────────────────────────────────────┐
│ COLUMNAR ZONE SUMMARY ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Zone Map: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Block 0: min=1, max=100 (all columns combined) │ │
│ │ Block 1: min=50, max=200 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Columnar Zone Summary (CZS): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Block 0: │ │
│ │ ├── id: min=1, max=100, distinct≈100, nulls=0 │ │
│ │ ├── customer: min=1001, max=9999, distinct≈45, nulls=0 │ │
│ │ ├── total: min=10.50, max=9999.99, avg=156.23, nulls=2 │ │
│ │ └── status: values={pending,shipped,delivered}, nulls=0 │ │
│ │ │ │
│ │ Additional Summaries: │ │
│ │ ├── Histogram buckets (for range selectivity estimation) │ │
│ │ ├── Most common values (for equality optimization) │ │
│ │ └── Correlation hints (for multi-column predicates) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Storage: ~2KB per block (persisted in RocksDB) │
│ Memory: Zero (loaded on-demand during query planning) │
│ Maintenance: Incremental during DML │
│ │
└─────────────────────────────────────────────────────────────────────────┘

Implementation

/// Enhanced per-block column summary
#[derive(Serialize, Deserialize)]
pub struct ColumnZoneSummary {
    /// Column name
    pub column_name: String,
    /// Basic statistics
    pub min: Option<Value>,
    pub max: Option<Value>,
    pub null_count: u64,
    pub row_count: u64,
    /// Approximate distinct count (HyperLogLog)
    pub approx_distinct: u64,
    /// Average (for numeric columns)
    pub avg: Option<f64>,
    /// Sum (for numeric columns, useful for aggregation pushdown)
    pub sum: Option<f64>,
    /// Most common values (top 5) with frequencies
    pub mcv: Vec<(Value, u64)>,
    /// Histogram buckets (for selectivity estimation)
    pub histogram: Option<HistogramBuckets>,
}

/// Block-level columnar zone summary
#[derive(Serialize, Deserialize)]
pub struct BlockZoneSummary {
    pub block_id: u64,
    pub first_row_id: u64,
    pub last_row_id: u64,
    pub row_count: u64,
    pub columns: HashMap<String, ColumnZoneSummary>,
    /// Cross-column correlation hints
    pub correlations: Vec<ColumnCorrelation>,
}

impl BlockZoneSummary {
    /// Incremental update on INSERT (O(1) per column)
    pub fn update_on_insert(&mut self, tuple: &Tuple, schema: &Schema) {
        self.row_count += 1;
        for (i, col) in schema.columns.iter().enumerate() {
            let summary = self.columns.entry(col.name.clone())
                .or_insert_with(|| ColumnZoneSummary::new(&col.name));
            let value = tuple.get(i).unwrap_or(&Value::Null);
            summary.update_incremental(value);
        }
    }

    /// Can this block satisfy the predicate? (Fast check)
    pub fn can_satisfy(&self, predicates: &[AnalyzedPredicate]) -> BlockDecision {
        for pred in predicates {
            if let Some(summary) = self.columns.get(&pred.column_name) {
                match pred.evaluate_against_summary(summary) {
                    SummaryMatch::Impossible => return BlockDecision::Skip,
                    SummaryMatch::Partial => continue,
                    SummaryMatch::Full => continue,
                }
            }
        }
        BlockDecision::Scan
    }
}
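
ColumnZoneSummary::update_incremental is called above but not shown. A sketch of the O(1) per-value update follows, assuming Value implements ordering and cloning and exposes a numeric accessor (as_f64); HyperLogLog, MCV, and histogram maintenance are elided:

impl ColumnZoneSummary {
    pub fn new(name: &str) -> Self {
        Self {
            column_name: name.to_string(),
            min: None, max: None, null_count: 0, row_count: 0,
            approx_distinct: 0, avg: None, sum: None,
            mcv: Vec::new(), histogram: None,
        }
    }

    /// O(1) incremental update per value (sketch; assumes Value: PartialOrd + Clone
    /// and a hypothetical `as_f64` accessor for numeric columns).
    pub fn update_incremental(&mut self, value: &Value) {
        self.row_count += 1;
        if matches!(value, Value::Null) {
            self.null_count += 1;
            return;
        }
        // min/max: two comparisons per value
        if self.min.as_ref().map_or(true, |m| value < m) { self.min = Some(value.clone()); }
        if self.max.as_ref().map_or(true, |m| value > m) { self.max = Some(value.clone()); }
        // Running sum and average for numeric columns
        if let Some(x) = value.as_f64() {
            let s = self.sum.unwrap_or(0.0) + x;
            self.sum = Some(s);
            self.avg = Some(s / (self.row_count - self.null_count) as f64);
        }
    }
}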

Memory vs Storage Comparison

Per-Connection Memory Usage

| System                  | Memory per Connection | Filtering Overhead |
| ----------------------- | --------------------- | ------------------ |
| Oracle (In-Memory)      | ~10MB+                | SIMD in memory     |
| PostgreSQL              | ~5-10MB               | Filter in executor |
| HeliosDB Nano (Current) | ~2-5MB                | Filter after load  |
| HeliosDB Nano (SMFI)    | ~1-2MB                | Filter at storage  |

Storage Overhead for Filter Indexes

| Table Size | Bloom Filters | Zone Maps    | Total Overhead |
| ---------- | ------------- | ------------ | -------------- |
| 1GB        | ~10MB (1%)    | ~5MB (0.5%)  | ~15MB (1.5%)   |
| 10GB       | ~100MB        | ~50MB        | ~150MB         |
| 100GB      | ~1GB          | ~500MB       | ~1.5GB         |

Query Latency Impact

Scenario: SELECT * FROM orders WHERE customer_id = 12345 AND total > 100
Table size: 10 million rows, 10% match customer_id, 50% match total > 100

Without SMFI:
  Load all 10M rows → Filter in memory → Return 500K rows
  Memory: ~2GB peak
  Time:   ~5 seconds

With SMFI:
  Bloom check customer_id → Zone map prune blocks → SIMD filter remainder
  Memory: ~50MB peak (only matching blocks loaded)
  Time:   ~200ms

Implementation Roadmap

Phase 1: Self-Maintaining Filter Updates (Core)

  • Integrate filter delta tracking into insert_tuple(), update_tuple(), delete_tuple()
  • Persist filter deltas to RocksDB
  • Background consolidation worker with CPU awareness
  • Crash recovery from delta log (see the sketch below)
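
A sketch of the Phase 1 recovery step: replay the persisted delta log on startup and let the next consolidation fold the deltas into the base filters. The RocksDB iteration and serialization details below (prefix_iterator, bincode) are assumptions, and apply_recovered_delta is a hypothetical helper:

impl FilterIndexDeltaTracker {
    /// Replay fidx:delta:{table}:{seq} entries after a crash (sketch).
    pub fn recover_table(&self, table: &str) -> anyhow::Result<u64> {
        let prefix = format!("{}{}:", filter_index_keys::DELTA_PREFIX, table);
        let mut replayed = 0u64;
        for item in self.db.prefix_iterator(prefix.as_bytes()) {
            let (_key, value) = item?;
            let delta: FilterDelta = bincode::deserialize(&value)?;
            // Re-apply the delta to the in-memory buffers; background
            // consolidation merges it into the base structures later.
            self.apply_recovered_delta(table, delta);
            replayed += 1;
        }
        Ok(replayed)
    }
}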

Phase 2: Columnar Zone Summaries (Enhanced Statistics)

  • Implement BlockZoneSummary with incremental updates
  • HyperLogLog for approximate distinct counts
  • Histogram buckets for selectivity estimation
  • Integration with optimizer cost model

Phase 3: Speculative Filter Indexes (Intelligence)

  • Query pattern tracker
  • Automatic filter creation based on patterns
  • Filter lifecycle management (creation, maintenance, dropping)
  • Optimizer hints for speculative filter usage

Phase 4: Parallel Query Support

  • Parallel bloom filter evaluation with rayon
  • Parallel zone map block pruning
  • Parallel SIMD filtering
  • Optimizer-driven parallelism decisions

Conclusion

The proposed Self-Maintaining Filter Index (SMFI) architecture provides:

  1. Zero additional memory for concurrent connections
  2. Automatic maintenance integrated with existing DML paths
  3. CPU-aware background consolidation preventing system overload
  4. Parallel-ready design for optimizer-driven parallelism
  5. Innovative speculative indexing for workload adaptation

This aligns with HeliosDB Nano’s philosophy of memory optimization while maximizing storage-level filtering efficiency.