Skip to content

Self-Maintaining Filter Index (SMFI)

Self-Maintaining Filter Index (SMFI)

Overview

HeliosDB Nano implements a Self-Maintaining Filter Index (SMFI) architecture that provides automatic storage-level filtering with zero memory overhead. Unlike traditional database indexes that require manual creation and maintenance, SMFI structures are automatically updated during DML operations and intelligently created based on query patterns.

Key Benefits

BenefitDescription
Zero Memory OverheadAll filter structures stored on disk, loaded on-demand
Self-MaintainingAutomatic updates during INSERT/UPDATE/DELETE
CPU-AwareBackground consolidation respects system load
IntelligentAuto-creates filters based on query patterns
Parallel-ReadyFull parallel query support with rayon

Architecture

DML OPERATIONS
INSERT ───────────┼──────────── UPDATE ──────────── DELETE
┌───────────────────────┐
│ Filter Delta Tracker │ (~150ns overhead)
│ - Bloom filter bits │
│ - Zone map updates │
│ - Delta log append │
└───────────────────────┘
┌─────────────┴─────────────┐
▼ ▼
┌─────────────────┐ ┌─────────────────────────┐
│ Speculative │ │ Background Consolidation│
│ Filter Manager │ │ Worker (CPU-aware) │
│ - Pattern track │ │ - Merge deltas │
│ - Auto-create │ │ - Rebuild when idle │
└─────────────────┘ └─────────────────────────┘
┌───────────────────────┐
│ QUERY EXECUTION │
│ Phase 1: Bloom check │
│ Phase 2: Zone prune │
│ Phase 3: SIMD filter │
└───────────────────────┘

Components

1. Bloom Filters

Probabilistic data structures for fast membership testing. Ideal for equality predicates.

How It Works:

  • Each value is hashed to multiple bit positions
  • Query: Check all bit positions (if any 0, definitely not present)
  • False positive rate configurable (default: 1%)

When Used:

-- Bloom filter effective for equality checks
SELECT * FROM orders WHERE customer_id = 12345;
SELECT * FROM users WHERE email = 'user@example.com';

Performance:

  • Check time: ~10ns per value
  • Memory: ~10 bits per distinct value
  • False positive rate: ~1% (configurable)

2. Zone Maps

Block-level min/max statistics for range predicates. Enables block pruning.

How It Works:

  • Each block tracks min/max values per column
  • Query: Skip blocks where predicate can’t match
  • Zero false positives (exact pruning)

When Used:

-- Zone map effective for range queries
SELECT * FROM events WHERE timestamp > '2025-01-01';
SELECT * FROM products WHERE price BETWEEN 100 AND 500;

Performance:

  • Check time: ~1ns per block
  • Storage: ~100 bytes per block per column
  • Pruning: 50-99% of blocks skipped (data-dependent)

3. Columnar Zone Summaries (CZS)

Enhanced per-block column statistics with cardinality estimation.

Statistics Tracked:

  • Min/Max values
  • Null count
  • Approximate distinct count (HyperLogLog)
  • Most Common Values (MCV)
  • Histogram buckets

When Used:

-- Optimizer uses CZS for cost estimation
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
-- Shows selectivity estimate from MCV
-- HyperLogLog for distinct estimation
SELECT COUNT(DISTINCT customer_id) FROM orders;
-- Uses pre-computed approximate distinct count

4. Speculative Filter Indexes

Automatically created filters based on query patterns.

How It Works:

  1. Query pattern tracker records predicate usage
  2. High-frequency low-selectivity patterns detected
  3. Bloom filter automatically created for column
  4. Unused filters dropped after configurable period

Example Pattern Detection:

Query: SELECT * FROM orders WHERE customer_id = ?
Executions: 847 in last hour
Selectivity: 0.01% (very selective)
→ Auto-create bloom filter for orders.customer_id

5. Parallel Filter Evaluation

Three-phase parallel filtering using rayon.

Execution Phases:

Phase 1: Parallel Bloom Filter Checks
└─ Eliminate non-matching rows early
Phase 2: Parallel Zone Map Block Pruning
└─ Skip entire blocks that can't match
Phase 3: Parallel SIMD Predicate Evaluation
└─ High-throughput vectorized filtering

Parallelism:

  • Uses all available CPU cores
  • Adaptive chunk sizing based on CPU load
  • Work-stealing for load balancing

Performance Impact

Query Performance

ScenarioWithout SMFIWith SMFIImprovement
Point lookup (1M rows)100ms5ms20x
Range scan (10M rows, 1% match)500ms50ms10x
Multi-predicate (100M rows)5s200ms25x
Parallel 8-core (100M rows)5s30ms166x

Write Overhead

OperationSMFI OverheadTotal Impact
INSERT~150ns< 0.02%
UPDATE~200ns< 0.02%
DELETE~100ns< 0.02%

Storage Overhead

Table SizeBloom FiltersZone MapsTotal
1 GB10 MB (1%)5 MB (0.5%)~1.5%
10 GB100 MB50 MB~1.5%
100 GB1 GB500 MB~1.5%

Implicit Benefits

SMFI provides several implicit benefits that require no configuration:

1. Automatic Query Acceleration

All SELECT queries with WHERE clauses automatically benefit:

-- These queries automatically use SMFI filtering
SELECT * FROM orders WHERE customer_id = 123;
SELECT * FROM events WHERE timestamp > '2025-01-01';
SELECT * FROM products WHERE category = 'electronics' AND price < 1000;

2. Reduced I/O

Block pruning via zone maps reduces disk reads:

-- Query on ordered timestamp column
SELECT * FROM logs WHERE timestamp > NOW() - INTERVAL '1 hour';
-- With zone maps: Only reads recent blocks
-- Without zone maps: Full table scan

3. Memory Efficiency

All filter structures are disk-resident:

  • 100 concurrent connections use same filter structures
  • No per-connection memory allocation for filters
  • Filters loaded on-demand, evicted after use

4. Automatic Index Suggestions

Speculative Filter Manager tracks queries and suggests optimizations:

-- View automatically detected patterns
SELECT * FROM pg_speculative_filters();
-- Shows: table, column, pattern_type, frequency, selectivity

5. Parallel Query Execution

Multi-core utilization is automatic:

-- Large scan automatically parallelized
SELECT COUNT(*) FROM orders WHERE amount > 1000;
-- Uses all available CPU cores for filtering

Configuration

Filter Consolidation Settings

-- Maximum CPU for background consolidation (default: 15%)
SET smfi_max_cpu_percent = 15;
-- Delta threshold before consolidation (default: 1000)
SET smfi_delta_threshold = 1000;
-- Time threshold in seconds (default: 300)
SET smfi_time_threshold = 300;

Speculative Filter Settings

-- Minimum query frequency for auto-filter (default: 10)
SET smfi_min_query_frequency = 10;
-- Days before dropping unused filter (default: 7)
SET smfi_drop_after_days = 7;
-- Maximum speculative filters per table (default: 10)
SET smfi_max_filters_per_table = 10;

Parallel Filtering Settings

-- Enable/disable parallel filtering (default: on)
SET smfi_parallel_enabled = on;
-- Minimum rows for parallel execution (default: 10000)
SET smfi_parallel_threshold = 10000;
-- Maximum parallel workers (default: CPU cores)
SET smfi_max_workers = 8;

Bloom Filter Settings

-- False positive rate (default: 0.01 = 1%)
SET smfi_bloom_fpr = 0.01;
-- Number of hash functions (default: 7)
SET smfi_bloom_hashes = 7;

System Views

pg_smfi_status()

Overview of SMFI system status:

SELECT * FROM pg_smfi_status();
ColumnDescription
tables_trackedNumber of tables with SMFI
bloom_filtersTotal bloom filters
zone_mapsTotal zone maps
pending_deltasDeltas awaiting consolidation
last_consolidationLast consolidation timestamp

pg_smfi_table_stats()

Per-table SMFI statistics:

SELECT * FROM pg_smfi_table_stats();
ColumnDescription
table_nameTable name
bloom_filter_countNumber of bloom filters
zone_map_blocksNumber of zone map blocks
speculative_filtersAuto-created filters
bloom_checksTotal bloom filter checks
bloom_hitsBloom filter positive matches
blocks_prunedBlocks skipped via zone maps

pg_speculative_filters()

Currently active speculative filters:

SELECT * FROM pg_speculative_filters();
ColumnDescription
table_nameTable name
column_nameColumn name
filter_typeFilter type (bloom, zone)
pattern_typeQuery pattern (equality, range)
query_frequencyTimes pattern executed
selectivityEstimated selectivity
created_atFilter creation time
last_usedLast usage time

Best Practices

1. Let SMFI Work Automatically

SMFI is designed to be zero-configuration. Avoid over-tuning:

-- Don't do this unless you have specific needs
SET smfi_bloom_fpr = 0.001; -- Very low FPR wastes space
-- Let defaults work for most cases
RESET smfi_bloom_fpr;

2. Monitor Speculative Filters

Check which filters are being auto-created:

-- See what SMFI has learned about your workload
SELECT table_name, column_name, query_frequency, selectivity
FROM pg_speculative_filters()
ORDER BY query_frequency DESC;

3. Use EXPLAIN to Verify

Check if SMFI is being used:

EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- Look for: "Bloom Filter Check" or "Zone Map Prune"

4. Bulk Load Optimization (Automatic)

SMFI automatically detects and optimizes bulk load operations:

-- Bulk INSERT (100+ rows) - SMFI auto-suspends
INSERT INTO orders VALUES
(1, 'A', 100),
(2, 'B', 200),
-- ... 100+ rows
(1000, 'Z', 999);
-- SMFI auto-resumes and schedules rebuild
-- INSERT ... SELECT - auto-suspends (when implemented)
INSERT INTO archive SELECT * FROM orders WHERE date < '2024-01-01';
-- COPY FROM - auto-suspends (when implemented)
COPY orders FROM '/data/orders.csv';

How It Works:

  1. Detects operations with ≥100 rows
  2. Suspends per-row tracking for affected table
  3. Counts rows modified during suspension
  4. Resumes tracking when operation completes
  5. Schedules filter rebuild automatically

Manual Override (Optional):

For very large imports, you can still manually control:

-- Temporarily disable for bulk insert
SET smfi_tracking_enabled = off;
-- Bulk load data
COPY orders FROM '/data/orders.csv';
-- Re-enable and rebuild
SET smfi_tracking_enabled = on;
CALL smfi_rebuild_table('orders');

Comparison with Traditional Indexes

FeatureB-Tree IndexSMFI Bloom FilterSMFI Zone Map
CreationManual CREATE INDEXAutomaticAutomatic
MaintenanceManual REINDEXSelf-maintainingSelf-maintaining
MemoryIn-memory pagesDisk-onlyDisk-only
False PositivesNone~1%None
Best ForRange + equalityEqualityRange
Write Overhead10-30%< 0.02%< 0.02%

Troubleshooting

SMFI Not Being Used

-- Check if SMFI is enabled
SHOW smfi_enabled;
-- Verify table has been tracked
SELECT * FROM pg_smfi_table_stats() WHERE table_name = 'orders';
-- Force SMFI rebuild
CALL smfi_rebuild_table('orders');

High Consolidation CPU Usage

-- Lower max CPU for consolidation
SET smfi_max_cpu_percent = 10;
-- Increase delta threshold (less frequent consolidation)
SET smfi_delta_threshold = 5000;

Speculative Filters Not Created

-- Check query frequency threshold
SHOW smfi_min_query_frequency;
-- Lower threshold if needed
SET smfi_min_query_frequency = 5;

See Also