HeliosDB v3.0 Query Optimization Guide
HeliosDB v3.0 Query Optimization Guide
Advanced Performance Tuning and Query Optimization
Table of Contents
- Query Performance Fundamentals
- Index Optimization
- Query Rewriting Techniques
- Caching Strategies
- Distributed Query Optimization
- Specialized Optimizations
Query Performance Fundamentals
Understanding Query Execution
-- View query execution planEXPLAIN SELECT * FROM users WHERE email = 'alice@example.com';
-- Detailed execution plan with costsEXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';Output Analysis:
Seq Scan on users (cost=0.00..10.00 rows=1 width=100) (actual time=0.050..0.100 rows=1 loops=1) Filter: (email = 'alice@example.com'::text) Rows Removed by Filter: 999Planning Time: 0.123 msExecution Time: 0.456 msKey Metrics
- cost=start..end: Estimated cost (lower is better)
- rows: Estimated number of rows
- width: Average row size in bytes
- actual time: Real execution time
- loops: Number of times the node was executed
Index Optimization
Index Selection
When to Create Indexes
Create indexes for:
- Columns in WHERE clauses
- JOIN columns
- ORDER BY columns
- GROUP BY columns
- Foreign key columns
❌ Avoid indexes on:
- Low-cardinality columns (few distinct values)
- Frequently updated columns (high write cost)
- Very small tables (<1000 rows)
Index Types and Use Cases
1. B-Tree Index (Default)
-- Best for: Equality and range queriesCREATE INDEX idx_users_age ON users(age);
-- Optimizes:SELECT * FROM users WHERE age = 25;SELECT * FROM users WHERE age BETWEEN 20 AND 30;SELECT * FROM users WHERE age > 25;2. Hash Index
-- Best for: Equality queries onlyCREATE INDEX idx_users_email ON users USING HASH(email);
-- Optimizes:SELECT * FROM users WHERE email = 'alice@example.com';
-- Does NOT optimize:SELECT * FROM users WHERE email LIKE 'alice%'; -- ❌3. Full-Text Index
-- Best for: Text searchCREATE FULLTEXT INDEX fts_products ON products(name, description)WITH LANGUAGE 'English';
-- Optimizes:SELECT * FROM products WHERE MATCH(name, description) AGAINST('laptop');4. Spatial Index (R-tree)
-- Best for: Geospatial queriesCREATE SPATIAL INDEX idx_locations ON locations USING RTREE(coordinates);
-- Optimizes:SELECT * FROM locations WHERE ST_Distance(coordinates, ST_Point(0,0)) < 1000;5. Global Secondary Index (Cross-Shard)
-- Best for: Queries on non-sharding-key columnsCREATE GLOBAL INDEX idx_orders_amount ON orders(amount);
-- Optimizes queries across shards:SELECT * FROM orders WHERE amount > 1000;Composite Indexes
-- Multi-column indexCREATE INDEX idx_users_age_city ON users(age, city);
-- Optimizes (left-to-right prefix matching):SELECT * FROM users WHERE age = 25; -- Uses indexSELECT * FROM users WHERE age = 25 AND city = 'NYC'; -- Uses indexSELECT * FROM users WHERE city = 'NYC'; -- ❌ Does NOT use indexIndex Column Order Rule: Put high-selectivity columns first.
-- Good: email is unique (high selectivity)CREATE INDEX idx_users_email_age ON users(email, age);
-- Bad: age has few distinct values (low selectivity)CREATE INDEX idx_users_age_email ON users(age, email);Covering Indexes
-- Include additional columns to avoid table lookupsCREATE INDEX idx_users_email_name ON users(email) INCLUDE (name, created_at);
-- Query satisfied entirely from index (no table lookup):SELECT name, created_at FROM users WHERE email = 'alice@example.com';Partial Indexes
-- Index only active usersCREATE INDEX idx_active_users ON users(email) WHERE active = true;
-- Smaller index, faster queries for active users:SELECT * FROM users WHERE email = 'alice@example.com' AND active = true;Index Maintenance
-- View index usage statisticsSELECT schemaname, tablename, indexname, idx_scan as scans, idx_tup_read as tuples_read, idx_tup_fetch as tuples_fetchedFROM heliosdb_index_usageORDER BY idx_scan DESC;
-- Drop unused indexesSELECT indexnameFROM heliosdb_index_usageWHERE idx_scan = 0 AND idx_tup_read = 0;
-- Rebuild index (defragment)REINDEX INDEX idx_users_email;Query Rewriting Techniques
1. Use Materialized Views
Before (expensive aggregation):
SELECT region, DATE(order_date), SUM(amount)FROM salesGROUP BY region, DATE(order_date);-- Execution: 5 secondsAfter (materialized view):
CREATE MATERIALIZED VIEW sales_summary ASSELECT region, DATE(order_date) as date, SUM(amount) as totalFROM salesGROUP BY region, DATE(order_date)WITH REFRESH ON COMMIT;
SELECT * FROM sales_summary WHERE region = 'US';-- Execution: 10 ms2. Avoid SELECT *
Before:
SELECT * FROM users WHERE id = 123;-- Fetches all columns, wastes bandwidthAfter:
SELECT id, name, email FROM users WHERE id = 123;-- Fetches only needed columns, 3x faster3. Use EXISTS Instead of IN
Before (slow with large subquery):
SELECT * FROM users WHERE id IN ( SELECT user_id FROM orders WHERE amount > 1000);-- Execution: 2 secondsAfter (faster with EXISTS):
SELECT * FROM users u WHERE EXISTS ( SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.amount > 1000);-- Execution: 200 ms4. Push Down Filters
Before (filter after join):
SELECT u.name, o.amountFROM users uJOIN orders o ON u.id = o.user_idWHERE o.amount > 1000;After (filter before join):
SELECT u.name, o.amountFROM users uJOIN (SELECT * FROM orders WHERE amount > 1000) oON u.id = o.user_id;-- Reduces join size, 5x faster5. Use UNION ALL Instead of UNION
Before (UNION deduplicates, expensive):
SELECT name FROM users_2024UNIONSELECT name FROM users_2025;-- Execution: 1 secondAfter (UNION ALL, no deduplication):
SELECT name FROM users_2024UNION ALLSELECT name FROM users_2025;-- Execution: 100 ms6. Optimize Subqueries with CTEs
Before (correlated subquery, slow):
SELECT name, (SELECT COUNT(*) FROM orders WHERE user_id = users.id) as order_countFROM users;-- Execution: 10 secondsAfter (CTE with join, fast):
WITH order_counts AS ( SELECT user_id, COUNT(*) as count FROM orders GROUP BY user_id)SELECT u.name, COALESCE(oc.count, 0) as order_countFROM users uLEFT JOIN order_counts oc ON u.id = oc.user_id;-- Execution: 500 msCaching Strategies
1. Query Result Caching
-- Enable query caching (auto-enabled in v3.0)ALTER DATABASE SET query_cache_size = '4GB';ALTER DATABASE SET query_cache_ttl = 300; -- 5 minutes
-- View cache statisticsSELECT cache_hits, cache_misses, cache_hits::float / (cache_hits + cache_misses) as hit_rateFROM heliosdb_cache_stats;Cached Query Example:
-- First execution: 100ms (cache miss)SELECT * FROM products WHERE category = 'Electronics';
-- Subsequent executions: <1ms (cache hit)SELECT * FROM products WHERE category = 'Electronics';2. Materialized View Caching
-- Create materialized view with refresh scheduleCREATE MATERIALIZED VIEW daily_sales ASSELECT DATE(order_date), SUM(amount)FROM salesGROUP BY DATE(order_date)WITH REFRESH SCHEDULE '0 1 * * *'; -- Refresh at 1 AM daily3. Connection Pooling
// Automatic connection pooling in v3.0let pool = heliosdb::ConnectionPool::new(100).await?;let conn = pool.acquire().await?; // Reuses connectionsDistributed Query Optimization
1. Shard Key Selection
Good Shard Key (even distribution):
CREATE TABLE orders ( id INTEGER PRIMARY KEY, user_id INTEGER, amount DECIMAL(10,2)) SHARD BY HASH(user_id) SHARDS 8;
-- Query hits single shard:SELECT * FROM orders WHERE user_id = 123;Bad Shard Key (skewed distribution):
-- ❌ Don't shard by timestamp (hot shard on recent data)CREATE TABLE orders (...) SHARD BY HASH(order_date) SHARDS 8;2. Co-located Joins
-- Both tables sharded by user_id (co-located)CREATE TABLE users (...) SHARD BY HASH(user_id) SHARDS 8;CREATE TABLE orders (...) SHARD BY HASH(user_id) SHARDS 8;
-- Join executes locally on each shard (fast):SELECT u.name, o.amountFROM users uJOIN orders o ON u.user_id = o.user_id;3. Global Secondary Indexes
-- Query on non-sharding-key requires cross-shard scan (slow):SELECT * FROM orders WHERE amount > 1000;
-- Create global secondary index (fast):CREATE GLOBAL INDEX idx_orders_amount ON orders(amount);SELECT * FROM orders WHERE amount > 1000; -- Uses GSI4. Read Replicas for Read Scaling
-- Create read replicaCREATE READ REPLICA replica_1 WITH (region => 'us_east');
-- Route read queries to replicaSET session_replication_role = 'replica';SELECT * FROM users WHERE age > 25; -- Reads from replicaSpecialized Optimizations
1. Time-Series Queries
Optimize with downsampling:
-- Create downsampled view (5-minute aggregates)ALTER TABLE metrics SET timeseries_downsample = ( interval => '5 minutes', aggregation => 'avg');
-- Query downsampled data (100x faster):SELECT TIME_BUCKET('5 minutes', timestamp) as bucket, AVG(value)FROM metricsWHERE timestamp >= NOW() - INTERVAL '24 hours'GROUP BY bucket;2. Approximate Queries for Large Datasets
Before (full scan, slow):
SELECT region, COUNT(*), AVG(amount)FROM sales -- 10 billion rowsGROUP BY region;-- Execution: 30 minutesAfter (approximate, fast):
-- Create 1% sampleCREATE SAMPLE sales_sample ON salesSTRATIFIED BY regionWITH SIZE 1% ERROR 5%;
-- Approximate query (100x faster):SELECT region, APPROX_COUNT(*) as count, APPROX_AVG(amount) as avg_amountFROM sales_sampleGROUP BY regionWITH CONFIDENCE 95%;-- Execution: 20 seconds-- Result: count = 1M ± 50K (95% confidence)3. Full-Text Search Optimization
-- Create full-text index with optimized settingsCREATE FULLTEXT INDEX fts_docs ON documents(title, content)WITH ( language => 'English', min_word_length => 3, max_word_length => 40, stopwords => true);
-- Use phrase search for better relevanceSELECT * FROM documentsWHERE MATCH(title, content) AGAINST('"machine learning"' IN BOOLEAN MODE);4. Geospatial Query Optimization
-- Create spatial indexCREATE SPATIAL INDEX idx_locations ON locations USING RTREE(coordinates);
-- Use bounding box filter (faster):SELECT * FROM locationsWHERE ST_Within( coordinates, ST_MakeEnvelope(-74.0, 40.7, -73.9, 40.8, 4326));
-- Avoid expensive operations in WHERE clause:-- ❌ SlowSELECT * FROM locations WHERE ST_Distance(coordinates, ST_Point(0,0)) < 1000;
-- Fast (using KNN)SELECT * FROM locationsORDER BY coordinates <-> ST_Point(0,0)::geometryLIMIT 10;5. Graph Query Optimization
-- Optimize recursive CTEs with depth limitWITH RECURSIVE descendants AS ( SELECT id, name, manager_id, 0 AS depth FROM employees WHERE id = 1
UNION ALL
SELECT e.id, e.name, e.manager_id, d.depth + 1 FROM employees e JOIN descendants d ON e.manager_id = d.id WHERE d.depth < 10 -- Depth limit prevents runaway recursion)SELECT * FROM descendants;Adaptive Indexing (v3.0)
Automatic Index Recommendations
-- Enable adaptive indexingALTER DATABASE SET adaptive_indexing = true;
-- View recommendationsSELECT table_name, column_names, estimated_speedup, query_count, recommendation_scoreFROM heliosdb_index_recommendationsORDER BY recommendation_score DESC;
-- Apply recommendationSELECT apply_index_recommendation(recommendation_id);Automated Query Tuning
-- Enable query tuningALTER DATABASE SET automated_query_tuning = true;
-- View tuning suggestionsSELECT query_hash, original_query, optimized_query, estimated_speedupFROM heliosdb_query_tuning_suggestionsORDER BY estimated_speedup DESC;Monitoring and Debugging
1. Slow Query Log
-- Enable slow query loggingSET log_min_duration_statement = 1000; -- Log queries > 1 second
-- View slow queriesSELECT query_text, execution_time_ms, rows_returned, execution_countFROM heliosdb_slow_queriesORDER BY execution_time_ms DESCLIMIT 10;2. Query Profiling
-- Profile query executionEXPLAIN (ANALYZE, BUFFERS, VERBOSE)SELECT * FROM users WHERE email = 'alice@example.com';3. Index Usage Statistics
-- Identify unused indexesSELECT schemaname, tablename, indexname, idx_scanFROM heliosdb_index_usageWHERE idx_scan = 0ORDER BY pg_relation_size(indexrelid) DESC;Performance Checklist
Query Design
- Use specific columns instead of SELECT *
- Add WHERE clause to limit result set
- Use appropriate JOIN types (INNER vs LEFT)
- Avoid correlated subqueries
- Use CTEs for complex queries
Indexing
- Index columns used in WHERE, JOIN, ORDER BY
- Use composite indexes for multi-column queries
- Create covering indexes to avoid table lookups
- Use partial indexes for filtered queries
- Monitor and drop unused indexes
Caching
- Enable query result caching
- Use materialized views for expensive aggregations
- Configure appropriate TTL values
- Monitor cache hit rate
Distributed Queries
- Choose appropriate shard key
- Co-locate related tables
- Use global secondary indexes
- Route reads to replicas
Specialized Features
- Use downsampling for time-series data
- Use approximate queries for large datasets
- Enable adaptive indexing
- Use automated query tuning
Benchmarking
Run Performance Tests
# Run benchmark suiteheliosdb-bench --suite all --output results.json
# Compare with baselineheliosdb-bench compare baseline.json results.jsonExample Benchmark Output
Query Performance Comparison:┌─────────────────────────┬─────────┬─────────┬────────────┐│ Query Type │ v2.0 │ v3.0 │ Improvement│├─────────────────────────┼─────────┼─────────┼────────────┤│ Simple SELECT │ 2ms │ 1ms │ 2x ││ Complex JOIN │ 500ms │ 100ms │ 5x ││ Full-text search │ 50ms │ 10ms │ 5x ││ Approximate aggregation │ N/A │ 20s │ 100x ││ Geospatial KNN │ 100ms │ 20ms │ 5x │└─────────────────────────┴─────────┴─────────┴────────────┘