HeliosDB Adaptive Indexing - Architecture
This document describes the internal architecture and design decisions of the adaptive indexing system.
System Overview
```
┌─────────────────────────────────────────────────────┐
│                  AdaptiveIndexer                    │
│                (Main Coordinator)                   │
└─────────────────────────────────────────────────────┘
        │                 │                 │
        ▼                 ▼                 ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Pattern    │   │      ML      │   │ Cost-Benefit │
│   Analyzer   │──▶│ Recommender  │──▶│   Analyzer   │
└──────────────┘   └──────────────┘   └──────────────┘
        │                       │
        ▼                       ▼
┌──────────────┐   ┌──────────────────┐
│    Query     │   │   IndexAdvisor   │
│   History    │   │ (Recommendation  │
└──────────────┘   │     Engine)      │
                   └──────────────────┘
        │                 │                 │
        ▼                 ▼                 ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│     Auto     │   │  Statistics  │   │    Index     │
│   Manager    │   │   Tracker    │   │   Executor   │
└──────────────┘   └──────────────┘   └──────────────┘
```
Component Details
1. Pattern Analyzer
Purpose: Extract and analyze query patterns from SQL queries
Key Data Structures:
```rust
struct QueryPattern {
    table: String,
    columns: Vec<String>,
    operations: Vec<Operation>,
    frequency: usize,
    avg_latency_ms: f64,
    last_seen: SystemTime,
}

enum Operation {
    Filter(String),    // WHERE clause
    Sort(String),      // ORDER BY
    Join(String),      // JOIN ON
    Aggregate(String), // COUNT, SUM, etc.
    GroupBy(String),   // GROUP BY
}
```
Algorithm:
- Parse SQL query (simple regex-based parser)
- Extract table name from FROM clause
- Extract columns from SELECT clause
- Parse WHERE clause for filter operations
- Parse ORDER BY for sort operations
- Parse JOIN ON for join columns
- Parse GROUP BY for grouping columns
- Create pattern signature for deduplication
- Update or create pattern record
Performance: O(n) where n = query length
Concurrency: Uses DashMap for lock-free concurrent access
2. ML Recommender
Purpose: Use machine learning to cluster similar queries and suggest indexes
Algorithm: K-means Clustering
Feature Extraction:
- Column presence (binary features)
- Operation presence (binary features)
- Query frequency (log scale)
- Query latency (log scale)
Feature Vector Example:
```
[col1, col2, ..., colN, filter1, ..., filterM, log(freq), log(latency)]
```
Clustering Process:
- Extract features from all query patterns
- Normalize features to [0, 1] range
- Run K-means with k=5 (configurable)
- Analyze each cluster for common patterns
- Generate index suggestions from clusters
Why K-means?
- Fast convergence (O(n * k * i) where i = iterations)
- Works well for query pattern similarity
- Easy to interpret clusters
- Low memory footprint
Alternatives Considered:
- DBSCAN: Better for arbitrary shapes, but slower
- Hierarchical: Better for visualization, but O(n²)
- Decision Trees: Better for classification, not clustering
3. Cost-Benefit Analyzer
Purpose: Evaluate the economic trade-off of creating indexes
Cost Model:
```
Storage Cost = index_size_mb × disk_cost_per_mb_per_day
Write Cost   = (insert + update + delete overhead) × writes_per_day × cpu_cost_per_ms
Total Cost   = Storage Cost + Write Cost
```
Benefit Model:
```
Query Speedup = scan_cost / index_cost = row_count / (log₂(row_count) + result_size)
Time Saved    = query_latency × (1 - 1/speedup)
Benefit       = time_saved × queries_per_day × value_per_ms_saved
```
Score Calculation:
```
Score = Benefit / Cost
```
Index Size Estimation:
```
Row Size   = (avg_column_size × num_columns) + pointer_size
Total Size = row_size × row_count × (1 + overhead_factor)
```
Selectivity Impact:
- High cardinality columns → better indexes
- Low cardinality columns → less effective
- Multi-column indexes → multiplicative selectivity
4. Index Advisor
Purpose: Coordinate all components and generate recommendations
Recommendation Process:
1. Single-Column Indexes:
- Group patterns by (table, column)
- Calculate total query count
- Run cost-benefit analysis
- Filter by score threshold
2. Multi-Column Indexes:
- Find columns that appear together
- Evaluate composite index cost/benefit
- Compare to multiple single-column indexes
- Recommend if composite is better
3. Partial Indexes:
- Identify common filter predicates
- Estimate filter selectivity
- Calculate reduced maintenance cost
- Recommend if score > threshold × 1.5
4. ML-Based Recommendations:
- Cluster query patterns (if enough data)
- Extract common columns per cluster
- Run cost-benefit analysis
- Merge with other recommendations
5. Deduplication:
- Remove duplicate recommendations
- Keep highest-score version
6. Ranking:
- Sort by priority (Critical > High > Medium > Low)
- Within priority, sort by score
- Limit to top N recommendations
5. Auto Index Manager
Purpose: Manage automatic index creation and deletion
Approval Workflow:
```
Recommendation
      ↓
Score >= Threshold? ──No──→ Add to Pending Actions
      ↓ Yes
Auto-Create Enabled? ──No──→ Add to Pending Actions
      ↓ Yes
Create Index
      ↓
Register in Statistics
```
Deletion Workflow:
```
Periodic Check
      ↓
Find Unused Indexes (last_used > threshold_days)
      ↓
Auto-Delete Enabled? ──No──→ Add to Pending Actions
      ↓ Yes
Delete Index
      ↓
Unregister from Statistics
```
Safety Features:
- Duplicate index detection
- Manual approval for low-score recommendations
- Gradual rollout (max concurrent creates)
- Rollback support via pending actions
6. Statistics Tracker
Purpose: Monitor index usage and effectiveness
Metrics Tracked:
- Usage count (queries served)
- Last used timestamp
- Average selectivity (result_size / table_size)
- Average speedup (without_index_time / with_index_time)
- Index size (MB)
Staleness Detection:
```
if last_used.is_none() {
    unused_days = days_since(created_at)
} else {
    unused_days = days_since(last_used)
}

is_unused = unused_days > threshold
```
Performance Metrics:
```
avg_speedup      = sum(speedup_per_query) / usage_count
total_time_saved = sum(time_saved_per_query)
roi              = total_time_saved × value_per_ms / index_cost
```
Data Flow
Query Recording Flow
```
User Query
    ↓
Record Query (async)
    ↓
Parse SQL
    ↓
Extract Pattern
    ↓
Update Pattern Map (DashMap)
    ↓
Add to History (DashMap)
    ↓
Return (non-blocking)
```
Recommendation Flow
```
Request Recommendations
    ↓
Get Significant Patterns
    ↓
┌───────────────┬────────────────┬──────────────────┐
│ Single-Column │  Multi-Column  │  ML Clustering   │
│   Analysis    │    Analysis    │    Analysis      │
└───────┬───────┴────────┬───────┴────────┬─────────┘
        │                │                │
        └────────────────┼────────────────┘
                         ↓
                  Merge Results
                         ↓
                   Deduplicate
                         ↓
             Cost-Benefit Analysis
                         ↓
              Filter by Threshold
                         ↓
            Rank by Priority/Score
                         ↓
                  Return Top N
```
Index Creation Flow
```
Recommendation
    ↓
Check if Exists
    ↓
Calculate Score
    ↓
Auto-Approve? ──No──→ Add to Pending
    ↓ Yes
Call Executor.create_index()
    ↓
Register in Statistics
    ↓
Return Success
```
Concurrency Model
Thread Safety
All shared state uses concurrent data structures:
- DashMap: Lock-free hash map for patterns and statistics
- Arc<RwLock>: For pending actions (rare writes)
- Arc: For immutable shared state
Async Design
All public APIs are async:
- Non-blocking query recording
- Async index creation/deletion
- Concurrent recommendation generation
Scalability
- Pattern Storage: O(P) where P = unique patterns (~1000s)
- Query History: O(H) where H = history size (configurable)
- Recommendation: O(P × log P) for sorting
- ML Clustering: O(P × k × i) where k=5, i=10
Expected memory: ~10 MB for 10K patterns
Design Decisions
Why Lock-Free Data Structures?
- High concurrency for query recording
- No lock contention on hot paths
- Better performance under load
Why Simple SQL Parser?
- Fast (regex-based)
- Good enough for pattern extraction
- Doesn’t need 100% accuracy
- Easy to extend
Trade-off: May miss complex queries, but catches roughly 90% of common patterns
Why K-means for Clustering?
- Fast and scalable
- Interpretable results
- Low memory usage
- Good for our use case
Trade-off: Requires predefined k, but we use heuristics
Why Approval Workflow?
- Safety first approach
- Prevents index explosion
- Allows human review
- Easy to disable for automation
Why Score-Based Ranking?
- Economic justification
- Easy to explain
- Allows threshold tuning
- Prevents low-value indexes
Extension Points
Custom Index Types
```rust
pub enum IndexType {
    BTree,
    Hash,
    Partial(String),
    MultiColumn(Vec<String>),
    // Add custom types:
    FullText(String),
    Geospatial(String),
    Bitmap(String),
}
```
Custom Cost Models
```rust
impl CostBenefitAnalyzer {
    pub fn with_custom_costs(
        disk_cost: f64,
        cpu_cost: f64,
        value_per_ms: f64,
    ) -> Self {
        // Custom pricing model
    }
}
```
Custom ML Algorithms
```rust
trait PatternClusterer {
    fn cluster(&self, patterns: &[QueryPattern]) -> Vec<Cluster>;
}

// Implement for different algorithms
impl PatternClusterer for DBSCANClusterer { ... }
impl PatternClusterer for HierarchicalClusterer { ... }
```
Custom Executors
```rust
#[async_trait]
impl IndexExecutor for PostgresExecutor {
    async fn create_index(...) -> Result<(), String> {
        // PostgreSQL-specific implementation
    }
}

impl IndexExecutor for MySQLExecutor { ... }
impl IndexExecutor for MongoExecutor { ... }
```
Performance Characteristics
Time Complexity
- Query Recording: O(n) where n = query length
- Pattern Lookup: O(1) amortized (DashMap)
- Recommendation Generation: O(P log P) where P = patterns
- ML Clustering: O(P × k × i) typical case
- Cost-Benefit Analysis: O(1) per recommendation
Space Complexity
- Patterns: O(P) where P = unique patterns
- History: O(H) where H = max_history_size
- ML Features: O(P × F) where F = feature count
- Statistics: O(I) where I = index count
Throughput
- Query Recording: 100K+ ops/sec
- Recommendations: 1K+ ops/sec
- Index Creation: 10+ ops/sec (limited by database)
Testing Strategy
Unit Tests (70+ tests)
- Pattern extraction accuracy
- Cost calculation correctness
- ML clustering behavior
- Statistics tracking
- Approval workflow
Integration Tests (10+ tests)
- End-to-end workflows
- Component interaction
- Concurrent access
- Error handling
- Edge cases
Property-Based Tests
- Pattern signature uniqueness
- Score monotonicity
- Cost always positive
- Recommendations always sorted
Monitoring and Observability
Metrics (Prometheus)
```
// Pattern analysis
adaptive_patterns_total
adaptive_queries_recorded_total
adaptive_query_latency_histogram

// Recommendations
adaptive_recommendations_generated_total
adaptive_recommendations_score_histogram
adaptive_recommendations_by_priority

// Index management
adaptive_indexes_created_total
adaptive_indexes_deleted_total
adaptive_index_creation_errors_total

// Statistics
adaptive_index_usage_count
adaptive_index_speedup_avg
adaptive_index_size_mb
```
Logging
- Query recording (DEBUG)
- Pattern analysis (INFO)
- Recommendations (INFO)
- Index creation/deletion (WARN)
- Errors (ERROR)
Tracing
- Distributed tracing support
- Span per recommendation
- Attribute propagation
Future Enhancements
Phase 2
- Online index creation (non-blocking)
- What-if analysis API
- Index compression strategies
- Workload replay for testing
Phase 3
- Deep learning for pattern prediction
- Time-series analysis for seasonal patterns
- Automatic index consolidation
- Multi-tenant support
Phase 4
- Distributed index recommendations
- Cross-shard index analysis
- Federated learning for patterns
- Automatic index tuning