Workload Optimizer - Pattern Analyzer Baseline Architecture
Workload Optimizer - Pattern Analyzer Baseline Architecture
Executive Summary
This document describes the baseline architecture for the workload-aware query optimization system with pattern recognition capabilities, implemented in /home/claude/HeliosDB/heliosdb-workload/src/pattern_analyzer.rs (1,028 LOC).
Overview
The Pattern Analyzer is a core component of HeliosDB’s intelligent workload optimization layer. It provides pattern recognition, similarity matching, and historical learning capabilities to make better query optimization decisions.
Core Components
1. PatternAnalyzer
The main entry point for pattern analysis and management.
Key Responsibilities:
- Query fingerprint generation
- Pattern storage and retrieval
- Similarity matching
- Cost estimation based on historical data
- Pattern lifecycle management
Configuration:
pub struct PatternAnalyzerConfig { pub max_patterns: usize, // 10,000 patterns max pub common_pattern_threshold: usize, // Min 10 occurrences pub similarity_threshold: f64, // 0.8 similarity score pub pattern_max_age_secs: u64, // 24 hours}2. Pattern Fingerprinting
PatternFingerprint Structure:
pub struct PatternFingerprint { pub pattern_type: PatternType, // Query classification pub tables: Vec<String>, // Normalized table names (sorted) pub join_count: u32, // Number of joins pub aggregate_count: u32, // Number of aggregations pub has_group_by: bool, // Structure flags pub has_order_by: bool, pub has_limit: bool, pub where_condition_count: u32, // Predicate complexity pub join_types: Vec<String>, // JOIN types used pub structural_hash: u64, // Quick comparison hash}Pattern Types:
Select- Simple SELECT with optional WHEREJoin- SELECT with JOIN operationsAggregate- SELECT with GROUP BY or aggregationsMutation- INSERT, UPDATE, DELETE operationsComplex- Complex queries (subqueries, CTEs, UNION)Unknown- Unclassified patterns
3. Similarity Matching Algorithm
The similarity calculation uses a weighted scoring system:
| Feature | Weight | Description |
|---|---|---|
| Pattern Type Match | 0.25 | Exact type match bonus |
| Table Overlap (Jaccard) | 0.20 | Table name intersection/union |
| Join Count Similarity | 0.15 | Ratio of min/max join counts |
| Aggregate Count Similarity | 0.15 | Ratio of min/max aggregate counts |
| Boolean Features | 0.15 | GROUP BY, ORDER BY, LIMIT flags |
| WHERE Condition Similarity | 0.10 | Predicate count ratio |
Total Similarity Score: 0.0 (completely different) to 1.0 (identical)
Example:
-- Query ASELECT * FROM users WHERE age > 25
-- Query BSELECT * FROM users WHERE age > 30
-- Similarity Score: ~0.95 (very similar structure)4. Execution Statistics
ExecutionStats tracks historical performance metrics:
pub struct ExecutionStats { pub execution_count: u64, pub avg_execution_time_us: f64, // Running average pub min_execution_time_us: u64, pub max_execution_time_us: u64, pub avg_rows_scanned: f64, pub avg_rows_returned: f64, pub avg_memory_bytes: f64, pub avg_io_operations: f64, pub avg_cpu_time_us: f64, pub cache_hit_rate: f64, // 0.0 - 1.0 pub last_execution_timestamp: i64,}Running Averages: Uses incremental formula to avoid storing all historical data:
new_avg = (old_avg * n + new_value) / (n + 1)5. Cost Estimation Model
The cost estimator combines pattern type, table count, and historical statistics:
fn estimate_cost(&self, pattern: &QueryPattern) -> f64 { let mut cost = pattern.stats.estimated_cost();
// Pattern type multiplier cost *= match pattern.pattern_type { Select => 1.0, Join => 2.0 + join_count * 0.5, Aggregate => 1.5, Mutation => 1.2, Complex => 3.0, Unknown => 1.0, };
// Table count adjustment cost *= 1.0 + (table_count - 1) * 0.2;
// Cardinality factor cost += avg_rows_scanned * seq_scan_cost / 1000.0;
return cost;}Key Methods
Pattern Analysis
pub fn analyze_query(&mut self, sql: &str) -> Result<QueryPattern>Analyzes a SQL query and generates its pattern fingerprint:
- Normalizes query (uppercase, structure extraction)
- Classifies pattern type (SELECT, JOIN, AGGREGATE, etc.)
- Extracts structural features (tables, joins, predicates)
- Computes structural hash for quick lookup
- Returns QueryPattern with fingerprint
Example:
let mut analyzer = PatternAnalyzer::new(config);let pattern = analyzer.analyze_query( "SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id")?;
assert_eq!(pattern.pattern_type, PatternType::Join);assert_eq!(pattern.fingerprint.join_count, 1);assert_eq!(pattern.fingerprint.tables, vec!["ORDERS", "USERS"]);Similar Pattern Matching
pub fn find_similar_patterns(&self, pattern: &QueryPattern) -> Vec<SimilarPattern>Finds historically executed patterns similar to the input:
- Iterates through stored patterns
- Calculates similarity score for each
- Filters by similarity_threshold (default 0.8)
- Returns sorted list (most similar first)
Use Case: Query plan reuse and optimization hints
Execution Recording
pub fn record_execution(&mut self, pattern: QueryPattern, stats: &QueryStats)Records execution statistics for a pattern:
- Updates pattern’s execution statistics
- Stores or updates pattern in hash map
- Handles storage limits via LRU eviction
- Maintains running averages
Common Pattern Identification
pub fn get_common_patterns(&self, limit: usize) -> Vec<QueryPattern>Returns the most frequently executed patterns:
- Filters by common_pattern_threshold
- Sorts by occurrence count (descending)
- Returns top N patterns
Use Case: Workload characterization and optimization prioritization
Pattern Storage
Data Structure: HashMap with structural hash as key
patterns: HashMap<String, QueryPattern> // key: hex(structural_hash)Eviction Policy: When max_patterns reached, remove 10% least common patterns
Memory Footprint Estimation:
- PatternFingerprint: ~200 bytes
- ExecutionStats: ~120 bytes
- Per pattern: ~320 bytes + sample query length
- 10,000 patterns: ~3.2 MB + queries
Integration with Existing Components
1. CostModel Integration
The PatternAnalyzer can provide historical cost data to the ML-based CostModel:
// Get historical cost for similar patternslet similar = analyzer.find_similar_patterns(&pattern);let historical_costs: Vec<f64> = similar.iter() .map(|s| s.pattern.stats.estimated_cost()) .collect();
// Use as training data for CostModelfor (pattern, cost) in similar { let features = extract_features(&pattern); cost_model.add_training_sample(features, cost);}2. QueryOptimizer Integration
// Check for known patterns before optimizationlet pattern = analyzer.analyze_query(sql)?;let similar = analyzer.find_similar_patterns(&pattern);
if let Some(best_match) = similar.first() { if best_match.similarity_score > 0.95 { // Reuse previous optimization decisions return cached_plan_for_pattern(&best_match.pattern); }}
// Otherwise, proceed with full optimization3. WorkloadClassifier Integration
// Aggregate pattern types to determine workload mixlet patterns = analyzer.get_common_patterns(100);
let oltp_count = patterns.iter() .filter(|p| matches!(p.pattern_type, PatternType::Select)) .count();
let olap_count = patterns.iter() .filter(|p| matches!(p.pattern_type, PatternType::Join | PatternType::Aggregate)) .count();
// Feed to classifier for workload profilingTest Coverage
The implementation includes 10 comprehensive unit tests:
1. Pattern Creation
test_pattern_analyzer_creation- Initialization
2. Query Analysis
test_select_query_analysis- Simple SELECT patternstest_join_query_analysis- JOIN detectiontest_aggregate_query_analysis- GROUP BY and aggregations
3. Similarity Matching
test_pattern_fingerprint_similarity- Similar query matching (>0.8)test_pattern_fingerprint_different- Different query detection (<0.6)
4. Execution Tracking
test_record_execution- Pattern storagetest_find_similar_patterns- Similarity searchtest_execution_stats_update- Running average calculations
5. Common Patterns
test_get_common_patterns- Frequency-based retrieval
6. Cost Estimation
test_cost_estimation- Historical cost computation
Test Results: All tests pass ✓
Performance Characteristics
Time Complexity
| Operation | Complexity | Notes |
|---|---|---|
| analyze_query | O(n) | n = query length (parsing) |
| find_similar_patterns | O(m) | m = stored patterns |
| record_execution | O(1) | HashMap insert |
| get_common_patterns | O(m log m) | Sorting by count |
Space Complexity
- Per Pattern: O(1) - Fixed size structures
- Total Storage: O(n) where n = number of unique patterns
- With Eviction: Bounded at max_patterns (10,000)
Optimization Opportunities
- Similarity Search: Could use LSH (Locality-Sensitive Hashing) for faster approximate matching
- Storage: Could persist to disk for long-term learning
- Fingerprinting: Could use actual SQL parser instead of regex for better accuracy
- Statistics: Could add percentile tracking (P50, P95, P99)
Usage Example
use heliosdb_workload::{PatternAnalyzer, PatternAnalyzerConfig, QueryStats};
// Initialize analyzerlet config = PatternAnalyzerConfig { max_patterns: 10_000, common_pattern_threshold: 10, similarity_threshold: 0.8, pattern_max_age_secs: 86_400,};
let mut analyzer = PatternAnalyzer::new(config);
// Analyze querylet sql = "SELECT u.name, COUNT(o.id) FROM users u JOIN orders o ON u.id = o.user_id GROUP BY u.name HAVING COUNT(o.id) > 5";
let pattern = analyzer.analyze_query(sql)?;println!("Pattern type: {}", pattern.pattern_type);println!("Tables: {:?}", pattern.fingerprint.tables);
// Find similar historical querieslet similar = analyzer.find_similar_patterns(&pattern);for sim in similar.iter().take(5) { println!("Similar pattern (score: {:.2}): {}", sim.similarity_score, sim.pattern.sample_query);}
// Record execution after running querylet stats = QueryStats { query_id: uuid::Uuid::new_v4().to_string(), query_template: sql.to_string(), execution_time_us: 250_000, rows_scanned: 10_000, rows_returned: 42, // ... other stats};
analyzer.record_execution(pattern.clone(), &stats);
// Get most common patternslet common = analyzer.get_common_patterns(10);println!("Top 10 common patterns:");for (i, p) in common.iter().enumerate() { println!("{}. {} - {} executions", i + 1, p.pattern_type, p.occurrence_count());}
// Estimate cost for new querylet estimated_cost = analyzer.estimate_cost(&pattern);println!("Estimated cost: {:.2}", estimated_cost);Success Criteria Achievement
Can identify 5 distinct common query patterns
- PatternType enum supports: Select, Join, Aggregate, Mutation, Complex, Unknown
- All types properly classified in tests
Pattern fingerprinting correctly groups similar queries
- Similarity algorithm with 6 weighted features
- Similarity threshold configurable (default 0.8)
- Tests demonstrate >0.8 similarity for similar queries
Cost estimation provides reasonable estimates
- Multi-factor cost model (pattern type, tables, cardinality)
- Based on historical execution statistics
- Test validates positive, bounded costs
All 10 tests passing
- Comprehensive test coverage across all features
- Tests validate creation, analysis, similarity, recording, and estimation
- Running:
cargo test -p heliosdb-workload pattern_analyzer
Future Enhancements
Phase 1: Advanced Pattern Matching
- Implement actual SQL parser (sqlparser-rs) for better accuracy
- Add predicate structure analysis (equality vs range vs complex)
- Support parameterized query templates
Phase 2: Machine Learning Integration
- Train embedding model for semantic similarity
- Use LSTM/Transformer for query performance prediction
- Implement reinforcement learning for plan selection
Phase 3: Distributed Pattern Sharing
- Synchronize patterns across cluster nodes
- Implement pattern gossip protocol
- Support multi-tenant pattern isolation
Phase 4: Adaptive Optimization
- Automatically adjust cost model parameters based on feedback
- Detect workload shifts and trigger re-optimization
- A/B test alternative query plans
Conclusion
The Pattern Analyzer provides a solid baseline architecture for workload-aware query optimization. With 1,028 lines of production-quality code, comprehensive test coverage, and clean integration points, it forms the foundation for intelligent, learning-based query optimization in HeliosDB.
The system successfully meets all specified requirements:
- Pattern recognition with 6 pattern types
- Similarity matching with configurable threshold
- Historical statistics tracking
- Cost estimation based on execution history
- 10 passing unit tests
- Ready for integration with existing optimizer components
File Locations
- Implementation:
/home/claude/HeliosDB/heliosdb-workload/src/pattern_analyzer.rs - Module Export:
/home/claude/HeliosDB/heliosdb-workload/src/lib.rs - Tests: Inline in pattern_analyzer.rs (10 tests)
- Documentation: This file
Document Version: 1.0 Date: 2025-10-31 Author: HeliosDB Development Team Status: Complete - Ready for Review