Self-Maintaining Filter Index (SMFI)
Self-Maintaining Filter Index (SMFI)
Overview
HeliosDB Nano implements a Self-Maintaining Filter Index (SMFI) architecture that provides automatic storage-level filtering with zero memory overhead. Unlike traditional database indexes that require manual creation and maintenance, SMFI structures are automatically updated during DML operations and intelligently created based on query patterns.
Key Benefits
| Benefit | Description |
|---|---|
| Zero Memory Overhead | All filter structures stored on disk, loaded on-demand |
| Self-Maintaining | Automatic updates during INSERT/UPDATE/DELETE |
| CPU-Aware | Background consolidation respects system load |
| Intelligent | Auto-creates filters based on query patterns |
| Parallel-Ready | Full parallel query support with rayon |
Architecture
DML OPERATIONS │ INSERT ───────────┼──────────── UPDATE ──────────── DELETE │ ▼ ┌───────────────────────┐ │ Filter Delta Tracker │ (~150ns overhead) │ - Bloom filter bits │ │ - Zone map updates │ │ - Delta log append │ └───────────────────────┘ │ ┌─────────────┴─────────────┐ ▼ ▼ ┌─────────────────┐ ┌─────────────────────────┐ │ Speculative │ │ Background Consolidation│ │ Filter Manager │ │ Worker (CPU-aware) │ │ - Pattern track │ │ - Merge deltas │ │ - Auto-create │ │ - Rebuild when idle │ └─────────────────┘ └─────────────────────────┘ │ ▼ ┌───────────────────────┐ │ QUERY EXECUTION │ │ Phase 1: Bloom check │ │ Phase 2: Zone prune │ │ Phase 3: SIMD filter │ └───────────────────────┘Components
1. Bloom Filters
Probabilistic data structures for fast membership testing. Ideal for equality predicates.
How It Works:
- Each value is hashed to multiple bit positions
- Query: Check all bit positions (if any 0, definitely not present)
- False positive rate configurable (default: 1%)
When Used:
-- Bloom filter effective for equality checksSELECT * FROM orders WHERE customer_id = 12345;SELECT * FROM users WHERE email = 'user@example.com';Performance:
- Check time: ~10ns per value
- Memory: ~10 bits per distinct value
- False positive rate: ~1% (configurable)
2. Zone Maps
Block-level min/max statistics for range predicates. Enables block pruning.
How It Works:
- Each block tracks min/max values per column
- Query: Skip blocks where predicate can’t match
- Zero false positives (exact pruning)
When Used:
-- Zone map effective for range queriesSELECT * FROM events WHERE timestamp > '2025-01-01';SELECT * FROM products WHERE price BETWEEN 100 AND 500;Performance:
- Check time: ~1ns per block
- Storage: ~100 bytes per block per column
- Pruning: 50-99% of blocks skipped (data-dependent)
3. Columnar Zone Summaries (CZS)
Enhanced per-block column statistics with cardinality estimation.
Statistics Tracked:
- Min/Max values
- Null count
- Approximate distinct count (HyperLogLog)
- Most Common Values (MCV)
- Histogram buckets
When Used:
-- Optimizer uses CZS for cost estimationEXPLAIN SELECT * FROM orders WHERE status = 'pending';-- Shows selectivity estimate from MCV
-- HyperLogLog for distinct estimationSELECT COUNT(DISTINCT customer_id) FROM orders;-- Uses pre-computed approximate distinct count4. Speculative Filter Indexes
Automatically created filters based on query patterns.
How It Works:
- Query pattern tracker records predicate usage
- High-frequency low-selectivity patterns detected
- Bloom filter automatically created for column
- Unused filters dropped after configurable period
Example Pattern Detection:
Query: SELECT * FROM orders WHERE customer_id = ?Executions: 847 in last hourSelectivity: 0.01% (very selective)→ Auto-create bloom filter for orders.customer_id5. Parallel Filter Evaluation
Three-phase parallel filtering using rayon.
Execution Phases:
Phase 1: Parallel Bloom Filter Checks └─ Eliminate non-matching rows early
Phase 2: Parallel Zone Map Block Pruning └─ Skip entire blocks that can't match
Phase 3: Parallel SIMD Predicate Evaluation └─ High-throughput vectorized filteringParallelism:
- Uses all available CPU cores
- Adaptive chunk sizing based on CPU load
- Work-stealing for load balancing
Performance Impact
Query Performance
| Scenario | Without SMFI | With SMFI | Improvement |
|---|---|---|---|
| Point lookup (1M rows) | 100ms | 5ms | 20x |
| Range scan (10M rows, 1% match) | 500ms | 50ms | 10x |
| Multi-predicate (100M rows) | 5s | 200ms | 25x |
| Parallel 8-core (100M rows) | 5s | 30ms | 166x |
Write Overhead
| Operation | SMFI Overhead | Total Impact |
|---|---|---|
| INSERT | ~150ns | < 0.02% |
| UPDATE | ~200ns | < 0.02% |
| DELETE | ~100ns | < 0.02% |
Storage Overhead
| Table Size | Bloom Filters | Zone Maps | Total |
|---|---|---|---|
| 1 GB | 10 MB (1%) | 5 MB (0.5%) | ~1.5% |
| 10 GB | 100 MB | 50 MB | ~1.5% |
| 100 GB | 1 GB | 500 MB | ~1.5% |
Implicit Benefits
SMFI provides several implicit benefits that require no configuration:
1. Automatic Query Acceleration
All SELECT queries with WHERE clauses automatically benefit:
-- These queries automatically use SMFI filteringSELECT * FROM orders WHERE customer_id = 123;SELECT * FROM events WHERE timestamp > '2025-01-01';SELECT * FROM products WHERE category = 'electronics' AND price < 1000;2. Reduced I/O
Block pruning via zone maps reduces disk reads:
-- Query on ordered timestamp columnSELECT * FROM logs WHERE timestamp > NOW() - INTERVAL '1 hour';-- With zone maps: Only reads recent blocks-- Without zone maps: Full table scan3. Memory Efficiency
All filter structures are disk-resident:
- 100 concurrent connections use same filter structures
- No per-connection memory allocation for filters
- Filters loaded on-demand, evicted after use
4. Automatic Index Suggestions
Speculative Filter Manager tracks queries and suggests optimizations:
-- View automatically detected patternsSELECT * FROM pg_speculative_filters();
-- Shows: table, column, pattern_type, frequency, selectivity5. Parallel Query Execution
Multi-core utilization is automatic:
-- Large scan automatically parallelizedSELECT COUNT(*) FROM orders WHERE amount > 1000;-- Uses all available CPU cores for filteringConfiguration
Filter Consolidation Settings
-- Maximum CPU for background consolidation (default: 15%)SET smfi_max_cpu_percent = 15;
-- Delta threshold before consolidation (default: 1000)SET smfi_delta_threshold = 1000;
-- Time threshold in seconds (default: 300)SET smfi_time_threshold = 300;Speculative Filter Settings
-- Minimum query frequency for auto-filter (default: 10)SET smfi_min_query_frequency = 10;
-- Days before dropping unused filter (default: 7)SET smfi_drop_after_days = 7;
-- Maximum speculative filters per table (default: 10)SET smfi_max_filters_per_table = 10;Parallel Filtering Settings
-- Enable/disable parallel filtering (default: on)SET smfi_parallel_enabled = on;
-- Minimum rows for parallel execution (default: 10000)SET smfi_parallel_threshold = 10000;
-- Maximum parallel workers (default: CPU cores)SET smfi_max_workers = 8;Bloom Filter Settings
-- False positive rate (default: 0.01 = 1%)SET smfi_bloom_fpr = 0.01;
-- Number of hash functions (default: 7)SET smfi_bloom_hashes = 7;System Views
pg_smfi_status()
Overview of SMFI system status:
SELECT * FROM pg_smfi_status();| Column | Description |
|---|---|
| tables_tracked | Number of tables with SMFI |
| bloom_filters | Total bloom filters |
| zone_maps | Total zone maps |
| pending_deltas | Deltas awaiting consolidation |
| last_consolidation | Last consolidation timestamp |
pg_smfi_table_stats()
Per-table SMFI statistics:
SELECT * FROM pg_smfi_table_stats();| Column | Description |
|---|---|
| table_name | Table name |
| bloom_filter_count | Number of bloom filters |
| zone_map_blocks | Number of zone map blocks |
| speculative_filters | Auto-created filters |
| bloom_checks | Total bloom filter checks |
| bloom_hits | Bloom filter positive matches |
| blocks_pruned | Blocks skipped via zone maps |
pg_speculative_filters()
Currently active speculative filters:
SELECT * FROM pg_speculative_filters();| Column | Description |
|---|---|
| table_name | Table name |
| column_name | Column name |
| filter_type | Filter type (bloom, zone) |
| pattern_type | Query pattern (equality, range) |
| query_frequency | Times pattern executed |
| selectivity | Estimated selectivity |
| created_at | Filter creation time |
| last_used | Last usage time |
Best Practices
1. Let SMFI Work Automatically
SMFI is designed to be zero-configuration. Avoid over-tuning:
-- Don't do this unless you have specific needsSET smfi_bloom_fpr = 0.001; -- Very low FPR wastes space
-- Let defaults work for most casesRESET smfi_bloom_fpr;2. Monitor Speculative Filters
Check which filters are being auto-created:
-- See what SMFI has learned about your workloadSELECT table_name, column_name, query_frequency, selectivityFROM pg_speculative_filters()ORDER BY query_frequency DESC;3. Use EXPLAIN to Verify
Check if SMFI is being used:
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;-- Look for: "Bloom Filter Check" or "Zone Map Prune"4. Bulk Load Optimization (Automatic)
SMFI automatically detects and optimizes bulk load operations:
-- Bulk INSERT (100+ rows) - SMFI auto-suspendsINSERT INTO orders VALUES (1, 'A', 100), (2, 'B', 200), -- ... 100+ rows (1000, 'Z', 999);-- SMFI auto-resumes and schedules rebuild
-- INSERT ... SELECT - auto-suspends (when implemented)INSERT INTO archive SELECT * FROM orders WHERE date < '2024-01-01';
-- COPY FROM - auto-suspends (when implemented)COPY orders FROM '/data/orders.csv';How It Works:
- Detects operations with ≥100 rows
- Suspends per-row tracking for affected table
- Counts rows modified during suspension
- Resumes tracking when operation completes
- Schedules filter rebuild automatically
Manual Override (Optional):
For very large imports, you can still manually control:
-- Temporarily disable for bulk insertSET smfi_tracking_enabled = off;
-- Bulk load dataCOPY orders FROM '/data/orders.csv';
-- Re-enable and rebuildSET smfi_tracking_enabled = on;CALL smfi_rebuild_table('orders');Comparison with Traditional Indexes
| Feature | B-Tree Index | SMFI Bloom Filter | SMFI Zone Map |
|---|---|---|---|
| Creation | Manual CREATE INDEX | Automatic | Automatic |
| Maintenance | Manual REINDEX | Self-maintaining | Self-maintaining |
| Memory | In-memory pages | Disk-only | Disk-only |
| False Positives | None | ~1% | None |
| Best For | Range + equality | Equality | Range |
| Write Overhead | 10-30% | < 0.02% | < 0.02% |
Troubleshooting
SMFI Not Being Used
-- Check if SMFI is enabledSHOW smfi_enabled;
-- Verify table has been trackedSELECT * FROM pg_smfi_table_stats() WHERE table_name = 'orders';
-- Force SMFI rebuildCALL smfi_rebuild_table('orders');High Consolidation CPU Usage
-- Lower max CPU for consolidationSET smfi_max_cpu_percent = 10;
-- Increase delta threshold (less frequent consolidation)SET smfi_delta_threshold = 5000;Speculative Filters Not Created
-- Check query frequency thresholdSHOW smfi_min_query_frequency;
-- Lower threshold if neededSET smfi_min_query_frequency = 5;See Also
- SQL Settings Reference - All SMFI settings
- Query Optimization - General optimization techniques
- EXPLAIN Guide - Query plan analysis
- Scalability Report - Detailed benchmarks