
HeliosDB Adaptive Indexing - Architecture


This document describes the internal architecture and design decisions of the adaptive indexing system.

System Overview

                 ┌───────────────────────────┐
                 │      AdaptiveIndexer      │
                 │    (Main Coordinator)     │
                 └─────────────┬─────────────┘
                               │
       ┌───────────────────────┼───────────────────────┐
       ▼                       ▼                       ▼
┌──────────────┐        ┌──────────────┐        ┌──────────────┐
│   Pattern    │        │      ML      │        │ Cost-Benefit │
│   Analyzer   │───────▶│ Recommender  │───────▶│   Analyzer   │
└──────┬───────┘        └──────┬───────┘        └──────┬───────┘
       │                       │                       │
       │                       └───────────┬───────────┘
       │                                   │
       ▼                                   ▼
┌──────────────┐                  ┌──────────────────┐
│    Query     │                  │   IndexAdvisor   │
│   History    │                  │  (Recommendation │
└──────────────┘                  │      Engine)     │
                                  └────────┬─────────┘
                                           │
                   ┌───────────────────────┼───────────────────────┐
                   ▼                       ▼                       ▼
            ┌──────────────┐        ┌──────────────┐        ┌──────────────┐
            │     Auto     │        │  Statistics  │        │    Index     │
            │   Manager    │        │   Tracker    │        │   Executor   │
            └──────────────┘        └──────────────┘        └──────────────┘

Component Details

1. Pattern Analyzer

Purpose: Extract and analyze query patterns from SQL queries

Key Data Structures:

struct QueryPattern {
    table: String,
    columns: Vec<String>,
    operations: Vec<Operation>,
    frequency: usize,
    avg_latency_ms: f64,
    last_seen: SystemTime,
}

enum Operation {
    Filter(String),    // WHERE clause
    Sort(String),      // ORDER BY
    Join(String),      // JOIN ON
    Aggregate(String), // COUNT, SUM, etc.
    GroupBy(String),   // GROUP BY
}

Algorithm:

  1. Parse SQL query (simple regex-based parser)
  2. Extract table name from FROM clause
  3. Extract columns from SELECT clause
  4. Parse WHERE clause for filter operations
  5. Parse ORDER BY for sort operations
  6. Parse JOIN ON for join columns
  7. Parse GROUP BY for grouping columns
  8. Create pattern signature for deduplication
  9. Update or create pattern record

Performance: O(n) where n = query length

Concurrency: Uses DashMap for lock-free concurrent access
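Steps 2 and 8 can be sketched with plain string handling in place of the regex parser; `extract_table` and `pattern_signature` are illustrative helpers, not the crate's actual API, and the matching assumes ASCII SQL:

```rust
/// Extract the table name that follows the FROM keyword, if any.
/// (ASCII-only sketch; the real parser is regex-based.)
fn extract_table(sql: &str) -> Option<String> {
    let upper = sql.to_uppercase();
    let idx = upper.find(" FROM ")?;
    let rest = &sql[idx + 6..];
    rest.split_whitespace()
        .next()
        .map(|t| t.trim_end_matches(';').to_string())
}

/// Build a deduplication signature from the table plus sorted columns,
/// so that column order does not produce distinct patterns.
fn pattern_signature(table: &str, mut columns: Vec<String>) -> String {
    columns.sort();
    format!("{}:{}", table, columns.join(","))
}
```

Sorting the columns before joining is what makes the signature order-independent, which is the property step 8 relies on for deduplication.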

2. ML Recommender

Purpose: Use machine learning to cluster similar queries and suggest indexes

Algorithm: K-means Clustering

Feature Extraction:

  • Column presence (binary features)
  • Operation presence (binary features)
  • Query frequency (log scale)
  • Query latency (log scale)

Feature Vector Example:

[col1, col2, ..., colN, filter1, ..., filterM, log(freq), log(latency)]

Clustering Process:

  1. Extract features from all query patterns
  2. Normalize features to [0, 1] range
  3. Run K-means with k=5 (configurable)
  4. Analyze each cluster for common patterns
  5. Generate index suggestions from clusters
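Step 2 is plain min-max scaling per feature column; a dependency-free sketch (the real pipeline also applies the log transforms listed above before normalizing):

```rust
/// Min-max normalize one feature column to [0, 1].
/// Constant columns (zero range) map to 0 to avoid division by zero.
fn normalize(values: &[f64]) -> Vec<f64> {
    let min = values.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = values.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    values
        .iter()
        .map(|v| if range == 0.0 { 0.0 } else { (v - min) / range })
        .collect()
}
```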

Why K-means?

  • Fast convergence (O(n * k * i) where i = iterations)
  • Works well for query pattern similarity
  • Easy to interpret clusters
  • Low memory footprint

Alternatives Considered:

  • DBSCAN: Better for arbitrary shapes, but slower
  • Hierarchical: Better for visualization, but O(n²)
  • Decision Trees: Better for classification, not clustering

3. Cost-Benefit Analyzer

Purpose: Evaluate the economic trade-off of creating indexes

Cost Model:

Storage Cost = index_size_mb × disk_cost_per_mb_per_day
Write Cost   = (insert + update + delete overhead) × writes_per_day × cpu_cost_per_ms
Total Cost   = Storage Cost + Write Cost

Benefit Model:

Query Speedup = scan_cost / index_cost
              = row_count / (log₂(row_count) + result_size)
Time Saved    = query_latency × (1 - 1/speedup)
Benefit       = time_saved × queries_per_day × value_per_ms_saved

Score Calculation:

Score = Benefit / Cost

Index Size Estimation:

Row Size = (avg_column_size × num_columns) + pointer_size
Total Size = row_size × row_count × (1 + overhead_factor)
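The size formula translates directly to code; the 8-byte pointer size and the bytes-to-MB conversion below are assumptions for illustration, not measured constants:

```rust
/// Estimate index size in MB from the formula above.
/// POINTER_SIZE_BYTES and the byte -> MB conversion are assumed values.
fn estimate_index_size_mb(
    avg_column_size_bytes: f64,
    num_columns: usize,
    row_count: u64,
    overhead_factor: f64,
) -> f64 {
    const POINTER_SIZE_BYTES: f64 = 8.0;
    let row_size = avg_column_size_bytes * num_columns as f64 + POINTER_SIZE_BYTES;
    row_size * row_count as f64 * (1.0 + overhead_factor) / (1024.0 * 1024.0)
}
```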

Selectivity Impact:

  • High cardinality columns → better indexes
  • Low cardinality columns → less effective
  • Multi-column indexes → multiplicative selectivity
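The cost, benefit, and score formulas combine into a single function. The struct and parameter names here are illustrative stand-ins, not the real analyzer's types:

```rust
/// Illustrative pricing parameters (names are hypothetical).
struct CostParams {
    disk_cost_per_mb_per_day: f64,
    write_overhead_ms: f64,
    writes_per_day: f64,
    cpu_cost_per_ms: f64,
    value_per_ms_saved: f64,
}

/// Score = Benefit / Cost, per the models above.
fn index_score(
    p: &CostParams,
    index_size_mb: f64,
    row_count: f64,
    result_size: f64,
    query_latency_ms: f64,
    queries_per_day: f64,
) -> f64 {
    // Cost model
    let storage_cost = index_size_mb * p.disk_cost_per_mb_per_day;
    let write_cost = p.write_overhead_ms * p.writes_per_day * p.cpu_cost_per_ms;
    let total_cost = storage_cost + write_cost;

    // Benefit model: speedup = full-scan cost / index probe cost
    let speedup = row_count / (row_count.log2() + result_size);
    let time_saved = query_latency_ms * (1.0 - 1.0 / speedup);
    let benefit = time_saved * queries_per_day * p.value_per_ms_saved;

    benefit / total_cost
}
```

Note that benefit scales linearly with query volume while cost does not, so hot queries dominate the ranking.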

4. Index Advisor

Purpose: Coordinate all components and generate recommendations

Recommendation Process:

  1. Single-Column Indexes:

    • Group patterns by (table, column)
    • Calculate total query count
    • Run cost-benefit analysis
    • Filter by score threshold
  2. Multi-Column Indexes:

    • Find columns that appear together
    • Evaluate composite index cost/benefit
    • Compare to multiple single-column indexes
    • Recommend if composite is better
  3. Partial Indexes:

    • Identify common filter predicates
    • Estimate filter selectivity
    • Calculate reduced maintenance cost
    • Recommend if score > threshold × 1.5
  4. ML-Based Recommendations:

    • Cluster query patterns (if enough data)
    • Extract common columns per cluster
    • Run cost-benefit analysis
    • Merge with other recommendations
  5. Deduplication:

    • Remove duplicate recommendations
    • Keep highest-score version
  6. Ranking:

    • Sort by priority (Critical > High > Medium > Low)
    • Within priority, sort by score
    • Limit to top N recommendations
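Step 6 reduces to a sort-and-truncate; `Priority` and `Recommendation` here are minimal stand-ins for the real types (deriving `Ord` on the enum makes variant order define priority order):

```rust
/// Variant order defines priority order: Low < Medium < High < Critical.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Priority { Low, Medium, High, Critical }

#[derive(Debug, Clone)]
struct Recommendation {
    priority: Priority,
    score: f64,
}

/// Sort by priority descending, then score descending, and keep the top N.
fn rank(mut recs: Vec<Recommendation>, top_n: usize) -> Vec<Recommendation> {
    recs.sort_by(|a, b| {
        b.priority
            .cmp(&a.priority)
            .then(b.score.partial_cmp(&a.score).unwrap_or(std::cmp::Ordering::Equal))
    });
    recs.truncate(top_n);
    recs
}
```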

5. Auto Index Manager

Purpose: Manage automatic index creation and deletion

Approval Workflow:

Recommendation
  ↓
Score >= Threshold? ──No──→ Add to Pending Actions
  ↓ Yes
Auto-Create Enabled? ──No──→ Add to Pending Actions
  ↓ Yes
Create Index
  ↓
Register in Statistics
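The approval workflow collapses to one decision function; the names below are hypothetical, and the real manager queues a richer pending-action type rather than a bare enum:

```rust
/// Outcome of the approval check (illustrative names).
#[derive(Debug, PartialEq)]
enum Decision {
    CreateIndex,
    AddToPending,
}

/// Create only when the score clears the threshold AND auto-create is on;
/// everything else lands in the pending-actions queue for manual review.
fn approve(score: f64, threshold: f64, auto_create_enabled: bool) -> Decision {
    if score >= threshold && auto_create_enabled {
        Decision::CreateIndex
    } else {
        Decision::AddToPending
    }
}
```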

Deletion Workflow:

Periodic Check
  ↓
Find Unused Indexes (last_used > threshold_days)
  ↓
Auto-Delete Enabled? ──No──→ Add to Pending Actions
  ↓ Yes
Delete Index
  ↓
Unregister from Statistics

Safety Features:

  • Duplicate index detection
  • Manual approval for low-score recommendations
  • Gradual rollout (max concurrent creates)
  • Rollback support via pending actions

6. Statistics Tracker

Purpose: Monitor index usage and effectiveness

Metrics Tracked:

  • Usage count (queries served)
  • Last used timestamp
  • Average selectivity (result_size / table_size)
  • Average speedup (without_index_time / with_index_time)
  • Index size (MB)

Staleness Detection:

if last_used.is_none() {
    unused_days = days_since(created_at)
} else {
    unused_days = days_since(last_used)
}
is_unused = unused_days > threshold

Performance Metrics:

avg_speedup = sum(speedup_per_query) / usage_count
total_time_saved = sum(time_saved_per_query)
roi = total_time_saved × value_per_ms / index_cost
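These metric formulas are straightforward to write out; function and parameter names are illustrative:

```rust
/// avg_speedup = sum(speedup_per_query) / usage_count
fn avg_speedup(speedups_per_query: &[f64]) -> f64 {
    speedups_per_query.iter().sum::<f64>() / speedups_per_query.len() as f64
}

/// roi = total_time_saved × value_per_ms / index_cost
fn roi(total_time_saved_ms: f64, value_per_ms: f64, index_cost: f64) -> f64 {
    total_time_saved_ms * value_per_ms / index_cost
}
```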

Data Flow

Query Recording Flow

User Query
  ↓
Record Query (async)
  ↓
Parse SQL
  ↓
Extract Pattern
  ↓
Update Pattern Map (DashMap)
  ↓
Add to History (DashMap)
  ↓
Return (non-blocking)

Recommendation Flow

Request Recommendations
  ↓
Get Significant Patterns
  ↓
┌───────────────┬───────────────┬───────────────┐
│ Single-Column │ Multi-Column  │ ML Clustering │
│   Analysis    │   Analysis    │   Analysis    │
└───────┬───────┴───────┬───────┴───────┬───────┘
        │               │               │
        └───────────────┼───────────────┘
                        ↓
Merge Results
  ↓
Deduplicate
  ↓
Cost-Benefit Analysis
  ↓
Filter by Threshold
  ↓
Rank by Priority/Score
  ↓
Return Top N

Index Creation Flow

Recommendation
  ↓
Check if Exists
  ↓
Calculate Score
  ↓
Auto-Approve? ──No──→ Add to Pending
  ↓ Yes
Call Executor.create_index()
  ↓
Register in Statistics
  ↓
Return Success

Concurrency Model

Thread Safety

All shared state uses concurrent data structures:

  • DashMap: Sharded concurrent hash map for patterns and statistics
  • Arc<RwLock>: For pending actions (rare writes)
  • Arc: For immutable shared state

Async Design

All public APIs are async:

  • Non-blocking query recording
  • Async index creation/deletion
  • Concurrent recommendation generation

Scalability

  • Pattern Storage: O(P) where P = unique patterns (~1000s)
  • Query History: O(H) where H = history size (configurable)
  • Recommendation: O(P × log P) for sorting
  • ML Clustering: O(P × k × i) where k=5, i=10

Expected memory: ~10 MB for 10K patterns

Design Decisions

Why Lock-Free Data Structures?

  • High concurrency for query recording
  • No lock contention on hot paths
  • Better performance under load

Why Simple SQL Parser?

  • Fast (regex-based)
  • Good enough for pattern extraction
  • Doesn’t need 100% accuracy
  • Easy to extend

Trade-off: May miss complex queries, but catches 90% of patterns

Why K-means for Clustering?

  • Fast and scalable
  • Interpretable results
  • Low memory usage
  • Good for our use case

Trade-off: Requires predefined k, but we use heuristics

Why Approval Workflow?

  • Safety first approach
  • Prevents index explosion
  • Allows human review
  • Easy to disable for automation

Why Score-Based Ranking?

  • Economic justification
  • Easy to explain
  • Allows threshold tuning
  • Prevents low-value indexes

Extension Points

Custom Index Types

pub enum IndexType {
    BTree,
    Hash,
    Partial(String),
    MultiColumn(Vec<String>),
    // Add custom types:
    FullText(String),
    Geospatial(String),
    Bitmap(String),
}

Custom Cost Models

impl CostBenefitAnalyzer {
    pub fn with_custom_costs(
        disk_cost: f64,
        cpu_cost: f64,
        value_per_ms: f64,
    ) -> Self {
        // Custom pricing model
    }
}

Custom ML Algorithms

trait PatternClusterer {
    fn cluster(&self, patterns: &[QueryPattern]) -> Vec<Cluster>;
}

// Implement for different algorithms
impl PatternClusterer for DBSCANClusterer { ... }
impl PatternClusterer for HierarchicalClusterer { ... }

Custom Executors

#[async_trait]
impl IndexExecutor for PostgresExecutor {
    async fn create_index(...) -> Result<(), String> {
        // PostgreSQL-specific implementation
    }
}

impl IndexExecutor for MySQLExecutor { ... }
impl IndexExecutor for MongoExecutor { ... }

Performance Characteristics

Time Complexity

  • Query Recording: O(n) where n = query length
  • Pattern Lookup: O(1) amortized (DashMap)
  • Recommendation Generation: O(P log P) where P = patterns
  • ML Clustering: O(P × k × i) typical case
  • Cost-Benefit Analysis: O(1) per recommendation

Space Complexity

  • Patterns: O(P) where P = unique patterns
  • History: O(H) where H = max_history_size
  • ML Features: O(P × F) where F = feature count
  • Statistics: O(I) where I = index count

Throughput

  • Query Recording: 100K+ ops/sec
  • Recommendations: 1K+ ops/sec
  • Index Creation: 10+ ops/sec (limited by database)

Testing Strategy

Unit Tests (70+ tests)

  • Pattern extraction accuracy
  • Cost calculation correctness
  • ML clustering behavior
  • Statistics tracking
  • Approval workflow

Integration Tests (10+ tests)

  • End-to-end workflows
  • Component interaction
  • Concurrent access
  • Error handling
  • Edge cases

Property-Based Tests

  • Pattern signature uniqueness
  • Score monotonicity
  • Cost always positive
  • Recommendations always sorted

Monitoring and Observability

Metrics (Prometheus)

// Pattern analysis
adaptive_patterns_total
adaptive_queries_recorded_total
adaptive_query_latency_histogram

// Recommendations
adaptive_recommendations_generated_total
adaptive_recommendations_score_histogram
adaptive_recommendations_by_priority

// Index management
adaptive_indexes_created_total
adaptive_indexes_deleted_total
adaptive_index_creation_errors_total

// Statistics
adaptive_index_usage_count
adaptive_index_speedup_avg
adaptive_index_size_mb

Logging

  • Query recording (DEBUG)
  • Pattern analysis (INFO)
  • Recommendations (INFO)
  • Index creation/deletion (WARN)
  • Errors (ERROR)

Tracing

  • Distributed tracing support
  • Span per recommendation
  • Attribute propagation

Future Enhancements

Phase 2

  • Online index creation (non-blocking)
  • What-if analysis API
  • Index compression strategies
  • Workload replay for testing

Phase 3

  • Deep learning for pattern prediction
  • Time-series analysis for seasonal patterns
  • Automatic index consolidation
  • Multi-tenant support

Phase 4

  • Distributed index recommendations
  • Cross-shard index analysis
  • Federated learning for patterns
  • Automatic index tuning

References