Skip to content

HeliosDB v3.0 Query Optimization Guide

HeliosDB v3.0 Query Optimization Guide

Advanced Performance Tuning and Query Optimization


Table of Contents

  1. Query Performance Fundamentals
  2. Index Optimization
  3. Query Rewriting Techniques
  4. Caching Strategies
  5. Distributed Query Optimization
  6. Specialized Optimizations

Query Performance Fundamentals

Understanding Query Execution

-- View query execution plan
EXPLAIN SELECT * FROM users WHERE email = 'alice@example.com';
-- Detailed execution plan with costs
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';

Output Analysis:

Seq Scan on users (cost=0.00..10.00 rows=1 width=100) (actual time=0.050..0.100 rows=1 loops=1)
Filter: (email = 'alice@example.com'::text)
Rows Removed by Filter: 999
Planning Time: 0.123 ms
Execution Time: 0.456 ms

Key Metrics

  • cost=start..end: Estimated cost (lower is better)
  • rows: Estimated number of rows
  • width: Average row size in bytes
  • actual time: Real execution time
  • loops: Number of times the node was executed

Index Optimization

Index Selection

When to Create Indexes

Create indexes for:

  • Columns in WHERE clauses
  • JOIN columns
  • ORDER BY columns
  • GROUP BY columns
  • Foreign key columns

Avoid indexes on:

  • Low-cardinality columns (few distinct values)
  • Frequently updated columns (high write cost)
  • Very small tables (<1000 rows)

Index Types and Use Cases

1. B-Tree Index (Default)

-- Best for: Equality and range queries
CREATE INDEX idx_users_age ON users(age);
-- Optimizes:
SELECT * FROM users WHERE age = 25;
SELECT * FROM users WHERE age BETWEEN 20 AND 30;
SELECT * FROM users WHERE age > 25;

2. Hash Index

-- Best for: Equality queries only
CREATE INDEX idx_users_email ON users USING HASH(email);
-- Optimizes:
SELECT * FROM users WHERE email = 'alice@example.com';
-- Does NOT optimize:
SELECT * FROM users WHERE email LIKE 'alice%'; -- ❌

3. Full-Text Index

-- Best for: Text search
CREATE FULLTEXT INDEX fts_products ON products(name, description)
WITH LANGUAGE 'English';
-- Optimizes:
SELECT * FROM products WHERE MATCH(name, description) AGAINST('laptop');

4. Spatial Index (R-tree)

-- Best for: Geospatial queries
CREATE SPATIAL INDEX idx_locations ON locations USING RTREE(coordinates);
-- Optimizes:
SELECT * FROM locations WHERE ST_Distance(coordinates, ST_Point(0,0)) < 1000;

5. Global Secondary Index (Cross-Shard)

-- Best for: Queries on non-sharding-key columns
CREATE GLOBAL INDEX idx_orders_amount ON orders(amount);
-- Optimizes queries across shards:
SELECT * FROM orders WHERE amount > 1000;

Composite Indexes

-- Multi-column index
CREATE INDEX idx_users_age_city ON users(age, city);
-- Optimizes (left-to-right prefix matching):
SELECT * FROM users WHERE age = 25; -- Uses index
SELECT * FROM users WHERE age = 25 AND city = 'NYC'; -- Uses index
SELECT * FROM users WHERE city = 'NYC'; -- ❌ Does NOT use index

Index Column Order Rule: Put high-selectivity columns first.

-- Good: email is unique (high selectivity)
CREATE INDEX idx_users_email_age ON users(email, age);
-- Bad: age has few distinct values (low selectivity)
CREATE INDEX idx_users_age_email ON users(age, email);

Covering Indexes

-- Include additional columns to avoid table lookups
CREATE INDEX idx_users_email_name ON users(email) INCLUDE (name, created_at);
-- Query satisfied entirely from index (no table lookup):
SELECT name, created_at FROM users WHERE email = 'alice@example.com';

Partial Indexes

-- Index only active users
CREATE INDEX idx_active_users ON users(email) WHERE active = true;
-- Smaller index, faster queries for active users:
SELECT * FROM users WHERE email = 'alice@example.com' AND active = true;

Index Maintenance

-- View index usage statistics
SELECT
schemaname,
tablename,
indexname,
idx_scan as scans,
idx_tup_read as tuples_read,
idx_tup_fetch as tuples_fetched
FROM heliosdb_index_usage
ORDER BY idx_scan DESC;
-- Drop unused indexes
SELECT indexname
FROM heliosdb_index_usage
WHERE idx_scan = 0 AND idx_tup_read = 0;
-- Rebuild index (defragment)
REINDEX INDEX idx_users_email;

Query Rewriting Techniques

1. Use Materialized Views

Before (expensive aggregation):

SELECT region, DATE(order_date), SUM(amount)
FROM sales
GROUP BY region, DATE(order_date);
-- Execution: 5 seconds

After (materialized view):

CREATE MATERIALIZED VIEW sales_summary AS
SELECT region, DATE(order_date) as date, SUM(amount) as total
FROM sales
GROUP BY region, DATE(order_date)
WITH REFRESH ON COMMIT;
SELECT * FROM sales_summary WHERE region = 'US';
-- Execution: 10 ms

2. Avoid SELECT *

Before:

SELECT * FROM users WHERE id = 123;
-- Fetches all columns, wastes bandwidth

After:

SELECT id, name, email FROM users WHERE id = 123;
-- Fetches only needed columns, 3x faster

3. Use EXISTS Instead of IN

Before (slow with large subquery):

SELECT * FROM users WHERE id IN (
SELECT user_id FROM orders WHERE amount > 1000
);
-- Execution: 2 seconds

After (faster with EXISTS):

SELECT * FROM users u WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.amount > 1000
);
-- Execution: 200 ms

4. Push Down Filters

Before (filter after join):

SELECT u.name, o.amount
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.amount > 1000;

After (filter before join):

SELECT u.name, o.amount
FROM users u
JOIN (SELECT * FROM orders WHERE amount > 1000) o
ON u.id = o.user_id;
-- Reduces join size, 5x faster

5. Use UNION ALL Instead of UNION

Before (UNION deduplicates, expensive):

SELECT name FROM users_2024
UNION
SELECT name FROM users_2025;
-- Execution: 1 second

After (UNION ALL, no deduplication):

SELECT name FROM users_2024
UNION ALL
SELECT name FROM users_2025;
-- Execution: 100 ms

6. Optimize Subqueries with CTEs

Before (correlated subquery, slow):

SELECT
name,
(SELECT COUNT(*) FROM orders WHERE user_id = users.id) as order_count
FROM users;
-- Execution: 10 seconds

After (CTE with join, fast):

WITH order_counts AS (
SELECT user_id, COUNT(*) as count
FROM orders
GROUP BY user_id
)
SELECT u.name, COALESCE(oc.count, 0) as order_count
FROM users u
LEFT JOIN order_counts oc ON u.id = oc.user_id;
-- Execution: 500 ms

Caching Strategies

1. Query Result Caching

-- Enable query caching (auto-enabled in v3.0)
ALTER DATABASE SET query_cache_size = '4GB';
ALTER DATABASE SET query_cache_ttl = 300; -- 5 minutes
-- View cache statistics
SELECT
cache_hits,
cache_misses,
cache_hits::float / (cache_hits + cache_misses) as hit_rate
FROM heliosdb_cache_stats;

Cached Query Example:

-- First execution: 100ms (cache miss)
SELECT * FROM products WHERE category = 'Electronics';
-- Subsequent executions: <1ms (cache hit)
SELECT * FROM products WHERE category = 'Electronics';

2. Materialized View Caching

-- Create materialized view with refresh schedule
CREATE MATERIALIZED VIEW daily_sales AS
SELECT DATE(order_date), SUM(amount)
FROM sales
GROUP BY DATE(order_date)
WITH REFRESH SCHEDULE '0 1 * * *'; -- Refresh at 1 AM daily

3. Connection Pooling

// Automatic connection pooling in v3.0
let pool = heliosdb::ConnectionPool::new(100).await?;
let conn = pool.acquire().await?; // Reuses connections

Distributed Query Optimization

1. Shard Key Selection

Good Shard Key (even distribution):

CREATE TABLE orders (
id INTEGER PRIMARY KEY,
user_id INTEGER,
amount DECIMAL(10,2)
) SHARD BY HASH(user_id) SHARDS 8;
-- Query hits single shard:
SELECT * FROM orders WHERE user_id = 123;

Bad Shard Key (skewed distribution):

-- ❌ Don't shard by timestamp (hot shard on recent data)
CREATE TABLE orders (...) SHARD BY HASH(order_date) SHARDS 8;

2. Co-located Joins

-- Both tables sharded by user_id (co-located)
CREATE TABLE users (...) SHARD BY HASH(user_id) SHARDS 8;
CREATE TABLE orders (...) SHARD BY HASH(user_id) SHARDS 8;
-- Join executes locally on each shard (fast):
SELECT u.name, o.amount
FROM users u
JOIN orders o ON u.user_id = o.user_id;

3. Global Secondary Indexes

-- Query on non-sharding-key requires cross-shard scan (slow):
SELECT * FROM orders WHERE amount > 1000;
-- Create global secondary index (fast):
CREATE GLOBAL INDEX idx_orders_amount ON orders(amount);
SELECT * FROM orders WHERE amount > 1000; -- Uses GSI

4. Read Replicas for Read Scaling

-- Create read replica
CREATE READ REPLICA replica_1 WITH (region => 'us_east');
-- Route read queries to replica
SET session_replication_role = 'replica';
SELECT * FROM users WHERE age > 25; -- Reads from replica

Specialized Optimizations

1. Time-Series Queries

Optimize with downsampling:

-- Create downsampled view (5-minute aggregates)
ALTER TABLE metrics SET timeseries_downsample = (
interval => '5 minutes',
aggregation => 'avg'
);
-- Query downsampled data (100x faster):
SELECT
TIME_BUCKET('5 minutes', timestamp) as bucket,
AVG(value)
FROM metrics
WHERE timestamp >= NOW() - INTERVAL '24 hours'
GROUP BY bucket;

2. Approximate Queries for Large Datasets

Before (full scan, slow):

SELECT region, COUNT(*), AVG(amount)
FROM sales -- 10 billion rows
GROUP BY region;
-- Execution: 30 minutes

After (approximate, fast):

-- Create 1% sample
CREATE SAMPLE sales_sample ON sales
STRATIFIED BY region
WITH SIZE 1% ERROR 5%;
-- Approximate query (100x faster):
SELECT
region,
APPROX_COUNT(*) as count,
APPROX_AVG(amount) as avg_amount
FROM sales_sample
GROUP BY region
WITH CONFIDENCE 95%;
-- Execution: 20 seconds
-- Result: count = 1M ± 50K (95% confidence)

3. Full-Text Search Optimization

-- Create full-text index with optimized settings
CREATE FULLTEXT INDEX fts_docs ON documents(title, content)
WITH (
language => 'English',
min_word_length => 3,
max_word_length => 40,
stopwords => true
);
-- Use phrase search for better relevance
SELECT * FROM documents
WHERE MATCH(title, content) AGAINST('"machine learning"' IN BOOLEAN MODE);

4. Geospatial Query Optimization

-- Create spatial index
CREATE SPATIAL INDEX idx_locations ON locations USING RTREE(coordinates);
-- Use bounding box filter (faster):
SELECT * FROM locations
WHERE ST_Within(
coordinates,
ST_MakeEnvelope(-74.0, 40.7, -73.9, 40.8, 4326)
);
-- Avoid expensive operations in WHERE clause:
-- ❌ Slow
SELECT * FROM locations WHERE ST_Distance(coordinates, ST_Point(0,0)) < 1000;
-- Fast (using KNN)
SELECT * FROM locations
ORDER BY coordinates <-> ST_Point(0,0)::geometry
LIMIT 10;

5. Graph Query Optimization

-- Optimize recursive CTEs with depth limit
WITH RECURSIVE descendants AS (
SELECT id, name, manager_id, 0 AS depth
FROM employees
WHERE id = 1
UNION ALL
SELECT e.id, e.name, e.manager_id, d.depth + 1
FROM employees e
JOIN descendants d ON e.manager_id = d.id
WHERE d.depth < 10 -- Depth limit prevents runaway recursion
)
SELECT * FROM descendants;

Adaptive Indexing (v3.0)

Automatic Index Recommendations

-- Enable adaptive indexing
ALTER DATABASE SET adaptive_indexing = true;
-- View recommendations
SELECT
table_name,
column_names,
estimated_speedup,
query_count,
recommendation_score
FROM heliosdb_index_recommendations
ORDER BY recommendation_score DESC;
-- Apply recommendation
SELECT apply_index_recommendation(recommendation_id);

Automated Query Tuning

-- Enable query tuning
ALTER DATABASE SET automated_query_tuning = true;
-- View tuning suggestions
SELECT
query_hash,
original_query,
optimized_query,
estimated_speedup
FROM heliosdb_query_tuning_suggestions
ORDER BY estimated_speedup DESC;

Monitoring and Debugging

1. Slow Query Log

-- Enable slow query logging
SET log_min_duration_statement = 1000; -- Log queries > 1 second
-- View slow queries
SELECT
query_text,
execution_time_ms,
rows_returned,
execution_count
FROM heliosdb_slow_queries
ORDER BY execution_time_ms DESC
LIMIT 10;

2. Query Profiling

-- Profile query execution
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT * FROM users WHERE email = 'alice@example.com';

3. Index Usage Statistics

-- Identify unused indexes
SELECT
schemaname,
tablename,
indexname,
idx_scan
FROM heliosdb_index_usage
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

Performance Checklist

Query Design

  • Use specific columns instead of SELECT *
  • Add WHERE clause to limit result set
  • Use appropriate JOIN types (INNER vs LEFT)
  • Avoid correlated subqueries
  • Use CTEs for complex queries

Indexing

  • Index columns used in WHERE, JOIN, ORDER BY
  • Use composite indexes for multi-column queries
  • Create covering indexes to avoid table lookups
  • Use partial indexes for filtered queries
  • Monitor and drop unused indexes

Caching

  • Enable query result caching
  • Use materialized views for expensive aggregations
  • Configure appropriate TTL values
  • Monitor cache hit rate

Distributed Queries

  • Choose appropriate shard key
  • Co-locate related tables
  • Use global secondary indexes
  • Route reads to replicas

Specialized Features

  • Use downsampling for time-series data
  • Use approximate queries for large datasets
  • Enable adaptive indexing
  • Use automated query tuning

Benchmarking

Run Performance Tests

Terminal window
# Run benchmark suite
heliosdb-bench --suite all --output results.json
# Compare with baseline
heliosdb-bench compare baseline.json results.json

Example Benchmark Output

Query Performance Comparison:
┌─────────────────────────┬─────────┬─────────┬────────────┐
│ Query Type │ v2.0 │ v3.0 │ Improvement│
├─────────────────────────┼─────────┼─────────┼────────────┤
│ Simple SELECT │ 2ms │ 1ms │ 2x │
│ Complex JOIN │ 500ms │ 100ms │ 5x │
│ Full-text search │ 50ms │ 10ms │ 5x │
│ Approximate aggregation │ N/A │ 20s │ 100x │
│ Geospatial KNN │ 100ms │ 20ms │ 5x │
└─────────────────────────┴─────────┴─────────┴────────────┘

See Also