HeliosDB Adaptive Indexing - Architecture
This document describes the internal architecture and design decisions of the adaptive indexing system.
System Overview
```
┌─────────────────────────────────────────────────────┐
│                  AdaptiveIndexer                    │
│                (Main Coordinator)                   │
└─────────────────────────────────────────────────────┘
        │                 │                 │
        ▼                 ▼                 ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Pattern    │   │      ML      │   │ Cost-Benefit │
│   Analyzer   │──▶│ Recommender  │──▶│   Analyzer   │
└──────────────┘   └──────────────┘   └──────────────┘
        │                       │
        ▼                       ▼
┌──────────────┐   ┌──────────────────┐
│    Query     │   │   IndexAdvisor   │
│   History    │   │ (Recommendation  │
└──────────────┘   │     Engine)      │
                   └──────────────────┘
        │                 │                 │
        ▼                 ▼                 ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│     Auto     │   │  Statistics  │   │    Index     │
│   Manager    │   │   Tracker    │   │   Executor   │
└──────────────┘   └──────────────┘   └──────────────┘
```
Component Details
1. Pattern Analyzer
Purpose: Extract and analyze query patterns from SQL queries
Key Data Structures:
```rust
struct QueryPattern {
    table: String,
    columns: Vec<String>,
    operations: Vec<Operation>,
    frequency: usize,
    avg_latency_ms: f64,
    last_seen: SystemTime,
}

enum Operation {
    Filter(String),    // WHERE clause
    Sort(String),      // ORDER BY
    Join(String),      // JOIN ON
    Aggregate(String), // COUNT, SUM, etc.
    GroupBy(String),   // GROUP BY
}
```
Algorithm:
- Parse SQL query (simple regex-based parser)
- Extract table name from FROM clause
- Extract columns from SELECT clause
- Parse WHERE clause for filter operations
- Parse ORDER BY for sort operations
- Parse JOIN ON for join columns
- Parse GROUP BY for grouping columns
- Create pattern signature for deduplication
- Update or create pattern record
Performance: O(n) where n = query length
Concurrency: Uses DashMap for lock-free concurrent access
2. ML Recommender
Purpose: Use machine learning to cluster similar queries and suggest indexes
Algorithm: K-means Clustering
Feature Extraction:
- Column presence (binary features)
- Operation presence (binary features)
- Query frequency (log scale)
- Query latency (log scale)
Feature Vector Example:
```
[col1, col2, ..., colN, filter1, ..., filterM, log(freq), log(latency)]
```
Clustering Process:
- Extract features from all query patterns
- Normalize features to [0, 1] range
- Run K-means with k=5 (configurable)
- Analyze each cluster for common patterns
- Generate index suggestions from clusters
Why K-means?
- Fast convergence (O(n * k * i) where i = iterations)
- Works well for query pattern similarity
- Easy to interpret clusters
- Low memory footprint
Alternatives Considered:
- DBSCAN: Better for arbitrary shapes, but slower
- Hierarchical: Better for visualization, but O(n²)
- Decision Trees: Better for classification, not clustering
3. Cost-Benefit Analyzer
Purpose: Evaluate the economic trade-off of creating indexes
Cost Model:
```
Storage Cost = index_size_mb × disk_cost_per_mb_per_day
Write Cost   = (insert + update + delete overhead) × writes_per_day × cpu_cost_per_ms
Total Cost   = Storage Cost + Write Cost
```
Benefit Model:
```
Query Speedup = scan_cost / index_cost = row_count / (log₂(row_count) + result_size)
Time Saved    = query_latency × (1 - 1/speedup)
Benefit       = time_saved × queries_per_day × value_per_ms_saved
```
Score Calculation:
```
Score = Benefit / Cost
```
Index Size Estimation:
```
Row Size   = (avg_column_size × num_columns) + pointer_size
Total Size = row_size × row_count × (1 + overhead_factor)
```
Selectivity Impact:
- High cardinality columns → better indexes
- Low cardinality columns → less effective
- Multi-column indexes → multiplicative selectivity
4. Index Advisor
Purpose: Coordinate all components and generate recommendations
Recommendation Process:
1. Single-Column Indexes:
- Group patterns by (table, column)
- Calculate total query count
- Run cost-benefit analysis
- Filter by score threshold
2. Multi-Column Indexes:
- Find columns that appear together
- Evaluate composite index cost/benefit
- Compare to multiple single-column indexes
- Recommend if composite is better
3. Partial Indexes:
- Identify common filter predicates
- Estimate filter selectivity
- Calculate reduced maintenance cost
- Recommend if score > threshold × 1.5
4. ML-Based Recommendations:
- Cluster query patterns (if enough data)
- Extract common columns per cluster
- Run cost-benefit analysis
- Merge with other recommendations
5. Deduplication:
- Remove duplicate recommendations
- Keep highest-score version
6. Ranking:
- Sort by priority (Critical > High > Medium > Low)
- Within priority, sort by score
- Limit to top N recommendations
5. Auto Index Manager
Purpose: Manage automatic index creation and deletion
Approval Workflow:
```
Recommendation
      ↓
Score >= Threshold? ──No──→ Add to Pending Actions
      ↓ Yes
Auto-Create Enabled? ──No──→ Add to Pending Actions
      ↓ Yes
Create Index
      ↓
Register in Statistics
```
Deletion Workflow:
```
Periodic Check
      ↓
Find Unused Indexes (last_used > threshold_days)
      ↓
Auto-Delete Enabled? ──No──→ Add to Pending Actions
      ↓ Yes
Delete Index
      ↓
Unregister from Statistics
```
Safety Features:
- Duplicate index detection
- Manual approval for low-score recommendations
- Gradual rollout (max concurrent creates)
- Rollback support via pending actions
6. Statistics Tracker
Purpose: Monitor index usage and effectiveness
Metrics Tracked:
- Usage count (queries served)
- Last used timestamp
- Average selectivity (result_size / table_size)
- Average speedup (without_index_time / with_index_time)
- Index size (MB)
Staleness Detection:
```
if last_used.is_none() {
    unused_days = days_since(created_at)
} else {
    unused_days = days_since(last_used)
}

is_unused = unused_days > threshold
```
Performance Metrics:
```
avg_speedup      = sum(speedup_per_query) / usage_count
total_time_saved = sum(time_saved_per_query)
roi              = total_time_saved × value_per_ms / index_cost
```
Data Flow
Query Recording Flow
```
User Query
    ↓
Record Query (async)
    ↓
Parse SQL
    ↓
Extract Pattern
    ↓
Update Pattern Map (DashMap)
    ↓
Add to History (DashMap)
    ↓
Return (non-blocking)
```
Recommendation Flow
```
Request Recommendations
    ↓
Get Significant Patterns
    ↓
┌───────────────┬────────────────┬──────────────────┐
│ Single-Column │  Multi-Column  │  ML Clustering   │
│   Analysis    │    Analysis    │    Analysis      │
└───────┬───────┴────────┬───────┴────────┬─────────┘
        │                │                │
        └────────────────┼────────────────┘
                         ↓
                  Merge Results
                         ↓
                   Deduplicate
                         ↓
             Cost-Benefit Analysis
                         ↓
              Filter by Threshold
                         ↓
            Rank by Priority/Score
                         ↓
                  Return Top N
```
Index Creation Flow
```
Recommendation
    ↓
Check if Exists
    ↓
Calculate Score
    ↓
Auto-Approve? ──No──→ Add to Pending
    ↓ Yes
Call Executor.create_index()
    ↓
Register in Statistics
    ↓
Return Success
```
Concurrency Model
Thread Safety
All shared state uses concurrent data structures:
- DashMap: Lock-free hash map for patterns and statistics
- Arc<RwLock>: For pending actions (rare writes)
- Arc: For immutable shared state
Async Design
All public APIs are async:
- Non-blocking query recording
- Async index creation/deletion
- Concurrent recommendation generation
Scalability
- Pattern Storage: O(P) where P = unique patterns (~1000s)
- Query History: O(H) where H = history size (configurable)
- Recommendation: O(P × log P) for sorting
- ML Clustering: O(P × k × i) where k=5, i=10
Expected memory: ~10 MB for 10K patterns
Design Decisions
Why Lock-Free Data Structures?
- High concurrency for query recording
- No lock contention on hot paths
- Better performance under load
Why Simple SQL Parser?
- Fast (regex-based)
- Good enough for pattern extraction
- Doesn’t need 100% accuracy
- Easy to extend
Trade-off: May miss complex queries, but catches roughly 90% of common patterns
Why K-means for Clustering?
- Fast and scalable
- Interpretable results
- Low memory usage
- Good for our use case
Trade-off: Requires predefined k, but we use heuristics
Why Approval Workflow?
- Safety first approach
- Prevents index explosion
- Allows human review
- Easy to disable for automation
Why Score-Based Ranking?
- Economic justification
- Easy to explain
- Allows threshold tuning
- Prevents low-value indexes
Extension Points
Custom Index Types
```rust
pub enum IndexType {
    BTree,
    Hash,
    Partial(String),
    MultiColumn(Vec<String>),
    // Add custom types:
    FullText(String),
    Geospatial(String),
    Bitmap(String),
}
```
Custom Cost Models
```rust
impl CostBenefitAnalyzer {
    pub fn with_custom_costs(
        disk_cost: f64,
        cpu_cost: f64,
        value_per_ms: f64,
    ) -> Self {
        // Custom pricing model
    }
}
```
Custom ML Algorithms
```rust
trait PatternClusterer {
    fn cluster(&self, patterns: &[QueryPattern]) -> Vec<Cluster>;
}

// Implement for different algorithms
impl PatternClusterer for DBSCANClusterer { ... }
impl PatternClusterer for HierarchicalClusterer { ... }
```
Custom Executors
```rust
#[async_trait]
impl IndexExecutor for PostgresExecutor {
    async fn create_index(...) -> Result<(), String> {
        // PostgreSQL-specific implementation
    }
}

impl IndexExecutor for MySQLExecutor { ... }
impl IndexExecutor for MongoExecutor { ... }
```
Performance Characteristics
Time Complexity
- Query Recording: O(n) where n = query length
- Pattern Lookup: O(1) amortized (DashMap)
- Recommendation Generation: O(P log P) where P = patterns
- ML Clustering: O(P × k × i) typical case
- Cost-Benefit Analysis: O(1) per recommendation
Space Complexity
- Patterns: O(P) where P = unique patterns
- History: O(H) where H = max_history_size
- ML Features: O(P × F) where F = feature count
- Statistics: O(I) where I = index count
Throughput
- Query Recording: 100K+ ops/sec
- Recommendations: 1K+ ops/sec
- Index Creation: 10+ ops/sec (limited by database)
Testing Strategy
Unit Tests (70+ tests)
- Pattern extraction accuracy
- Cost calculation correctness
- ML clustering behavior
- Statistics tracking
- Approval workflow
Integration Tests (10+ tests)
- End-to-end workflows
- Component interaction
- Concurrent access
- Error handling
- Edge cases
Property-Based Tests
- Pattern signature uniqueness
- Score monotonicity
- Cost always positive
- Recommendations always sorted
Monitoring and Observability
Metrics (Prometheus)
```
// Pattern analysis
adaptive_patterns_total
adaptive_queries_recorded_total
adaptive_query_latency_histogram

// Recommendations
adaptive_recommendations_generated_total
adaptive_recommendations_score_histogram
adaptive_recommendations_by_priority

// Index management
adaptive_indexes_created_total
adaptive_indexes_deleted_total
adaptive_index_creation_errors_total

// Statistics
adaptive_index_usage_count
adaptive_index_speedup_avg
adaptive_index_size_mb
```
Logging
- Query recording (DEBUG)
- Pattern analysis (INFO)
- Recommendations (INFO)
- Index creation/deletion (WARN)
- Errors (ERROR)
Tracing
- Distributed tracing support
- Span per recommendation
- Attribute propagation
Future Enhancements
Phase 2
- Online index creation (non-blocking)
- What-if analysis API
- Index compression strategies
- Workload replay for testing
Phase 3
- Deep learning for pattern prediction
- Time-series analysis for seasonal patterns
- Automatic index consolidation
- Multi-tenant support
Phase 4
- Distributed index recommendations
- Cross-shard index analysis
- Federated learning for patterns
- Automatic index tuning