Conversational BI Performance Optimization Guide
Version: 7.0 Last Updated: 2025-11-14 Status: Production Ready
Table of Contents
- Overview
- Performance Targets
- Quick Start - Top 10 Optimizations
- Latency Optimization
- Memory Optimization
- Throughput Optimization
- Cache Optimization
- Database Optimization
- LLM Integration Optimization
- Monitoring & Profiling
- Production Tuning
- Troubleshooting
Overview
This guide provides comprehensive performance optimization strategies for HeliosDB Conversational BI in production environments.
Performance Philosophy
Goal: Sub-200ms P99 latency for 95%+ accurate NL2SQL translation
Approach:
- Optimize the critical path (query → SQL generation)
- Minimize LLM calls through aggressive caching
- Reduce memory allocation overhead
- Maximize throughput through concurrency
Baseline Performance (Week 7-8 Validation)
| Metric | Current | Target | Status |
|---|---|---|---|
| Session Creation | ~5ms | <10ms | EXCEEDED |
| Query (no LLM) | ~30ms | <50ms | EXCEEDED |
| Query (cached) | ~80ms | <100ms | ACHIEVED |
| Query (uncached) | ~250ms | <300ms | ⏳ TARGET |
| P99 Latency | Pending | <200ms | ⏳ TARGET |
| Memory/Session | ~2.5MB | <2MB | ⏳ CLOSE |
| QPS | 100+ | 150+ | ⏳ TARGET |
Performance Targets
Latency Targets
```yaml
# Target latencies (P99)
session_creation: <10ms
cache_lookup: <5ms
schema_loading: <20ms
llm_call: <200ms  # External dependency
sql_generation_total: <200ms
context_tracking: <10ms
response_formatting: <5ms
```

Memory Targets

```yaml
# Memory usage per component
session_state: <500KB
conversation_context: <1MB
schema_cache: <5MB per database
semantic_cache: <100MB total
total_per_session: <2MB
```

Throughput Targets

```yaml
# Concurrent requests
single_instance: 150+ QPS
with_caching: 500+ QPS
cluster: 1000+ QPS
```

Quick Start - Top 10 Optimizations
1. Enable Semantic Caching
Impact: 70-80% latency reduction for similar queries
```toml
[cache]
enabled = true
max_entries = 10000
ttl_seconds = 3600
similarity_threshold = 0.85
```

Why: Avoids expensive LLM calls for semantically similar queries.
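As a simplified sketch of the idea, the cache below reduces "semantic" matching to exact matching on normalized keys — a real implementation would embed the query and compare cosine similarity against `similarity_threshold`. The `SemanticCache` type and its methods are illustrative, not HeliosDB's actual API.

```rust
use std::collections::HashMap;

// Toy semantic cache: normalized query text -> generated SQL.
struct SemanticCache {
    entries: HashMap<String, String>,
    max_entries: usize,
}

impl SemanticCache {
    fn new(max_entries: usize) -> Self {
        Self { entries: HashMap::new(), max_entries }
    }

    // Normalize casing and whitespace so trivially different phrasings collide.
    fn key(query: &str) -> String {
        query.to_lowercase().split_whitespace().collect::<Vec<_>>().join(" ")
    }

    fn lookup(&self, query: &str) -> Option<&String> {
        self.entries.get(&Self::key(query))
    }

    fn insert(&mut self, query: &str, sql: String) {
        if self.entries.len() < self.max_entries {
            self.entries.insert(Self::key(query), sql);
        }
    }
}

fn main() {
    let mut cache = SemanticCache::new(10_000);
    cache.insert("Show all users", "SELECT * FROM users".into());
    // Different whitespace/casing still hits the cache, skipping the LLM call.
    assert!(cache.lookup("  show ALL users ").is_some());
    println!("hit");
}
```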
2. Optimize Connection Pooling
Impact: 40-50% latency reduction for database operations
```toml
[database]
pool_size = 20  # 2x CPU cores
max_connections = 100
connection_timeout_ms = 5000
idle_timeout_seconds = 300
```

Why: Eliminates connection establishment overhead.
3. Configure Rate Limiting
Impact: Prevents overload, ensures consistent performance
```toml
[rate_limiting]
enabled = true
queries_per_minute = 60  # Per tenant
burst_allowance = 10
```

Why: Protects against traffic spikes and ensures fair resource allocation.
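A per-tenant limiter like this is commonly implemented as a token bucket: `queries_per_minute` sets the refill rate and `burst_allowance` caps how many tokens may pool up. The sketch below is illustrative (the `TokenBucket` type is not HeliosDB's internal API).

```rust
use std::time::Instant;

// Minimal token bucket: refills continuously, admits a request per token.
struct TokenBucket {
    capacity: f64,       // burst_allowance
    tokens: f64,
    refill_per_sec: f64, // queries_per_minute / 60
    last_refill: Instant,
}

impl TokenBucket {
    fn new(queries_per_minute: f64, burst_allowance: f64) -> Self {
        Self {
            capacity: burst_allowance,
            tokens: burst_allowance,
            refill_per_sec: queries_per_minute / 60.0,
            last_refill: Instant::now(),
        }
    }

    fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.last_refill = now;
        // Refill proportionally to elapsed time, capped at the burst allowance.
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut bucket = TokenBucket::new(60.0, 10.0);
    // A sudden spike: the burst allowance admits ~10, the rest are rejected.
    let admitted = (0..20).filter(|_| bucket.try_acquire()).count();
    assert!((10..=11).contains(&admitted));
    println!("admitted {admitted}");
}
```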
4. Enable Circuit Breakers
Impact: Graceful degradation, prevents cascade failures
```toml
[circuit_breaker]
enabled = true
failure_threshold = 5
reset_timeout_seconds = 60
half_open_requests = 3
```

Why: Isolates failures and prevents system-wide outages.
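The core of a circuit breaker is a small state machine. The sketch below covers only the Closed → Open transition after `failure_threshold` consecutive failures; a real breaker (including HeliosDB's, per the config above) also moves Open → Half-Open after `reset_timeout_seconds` and probes with `half_open_requests`. Names here are illustrative.

```rust
#[derive(PartialEq)]
enum State { Closed, Open }

struct CircuitBreaker {
    state: State,
    consecutive_failures: u32,
    failure_threshold: u32,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32) -> Self {
        Self { state: State::Closed, consecutive_failures: 0, failure_threshold }
    }

    // While Open, fail fast instead of calling the struggling dependency.
    fn allow_request(&self) -> bool {
        self.state == State::Closed
    }

    fn record(&mut self, success: bool) {
        if success {
            self.consecutive_failures = 0;
            self.state = State::Closed;
        } else {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= self.failure_threshold {
                self.state = State::Open; // Trip: stop hammering the LLM backend
            }
        }
    }
}

fn main() {
    let mut cb = CircuitBreaker::new(5);
    for _ in 0..5 { cb.record(false); }
    assert!(!cb.allow_request()); // Breaker tripped after 5 consecutive failures
    println!("open");
}
```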
5. Optimize Context Window
Impact: 20-30% memory reduction
```toml
[context]
max_turns = 10  # Down from 20
max_context_length = 2000  # Down from 4000
context_compression = true
```

Why: Reduces memory footprint and LLM token usage.
6. Configure Aggressive Timeouts
Impact: Prevents slow queries from blocking resources
```toml
[timeouts]
llm_timeout_ms = 5000  # 5 seconds
sql_execution_timeout_ms = 10000  # 10 seconds
total_request_timeout_ms = 15000  # 15 seconds
```

Why: Ensures predictable latency and resource release.
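The principle can be shown with a std-only sketch (the real service enforces timeouts in its async runtime): run the blocking work on a worker thread and bound the wait with `recv_timeout`. Note this sketch abandons, rather than cancels, the slow work.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run `work` on a worker thread; give up after `timeout` and return None.
fn with_timeout<T: Send + 'static>(
    timeout: Duration,
    work: impl FnOnce() -> T + Send + 'static,
) -> Option<T> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(work()); // Receiver may be gone; ignore the error
    });
    rx.recv_timeout(timeout).ok()
}

fn main() {
    // Fast work completes within the budget.
    assert_eq!(with_timeout(Duration::from_millis(100), || 42), Some(42));
    // Slow work is cut off; the caller gets its resources back immediately.
    let slow = with_timeout(Duration::from_millis(50), || {
        thread::sleep(Duration::from_millis(500));
        1
    });
    assert_eq!(slow, None);
    println!("ok");
}
```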
7. Enable Request Batching
Impact: 2-3x throughput improvement
```toml
[batching]
enabled = true
max_batch_size = 10
batch_timeout_ms = 100
```

Why: Amortizes LLM call overhead across multiple requests.
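A batcher like this typically flushes on whichever comes first: the batch fills up (`max_batch_size`) or the oldest pending request has waited `batch_timeout_ms`. The `Batcher` below is an illustrative std-only sketch, not the service's actual implementation.

```rust
use std::time::{Duration, Instant};

struct Batcher {
    pending: Vec<String>,
    max_batch_size: usize,
    batch_timeout: Duration,
    oldest: Option<Instant>, // When the oldest pending request arrived
}

impl Batcher {
    fn new(max_batch_size: usize, batch_timeout: Duration) -> Self {
        Self { pending: Vec::new(), max_batch_size, batch_timeout, oldest: None }
    }

    // Returns a full batch to flush to the LLM, or None to keep waiting.
    fn push(&mut self, query: String) -> Option<Vec<String>> {
        if self.pending.is_empty() {
            self.oldest = Some(Instant::now());
        }
        self.pending.push(query);
        let timed_out = self
            .oldest
            .map_or(false, |t| t.elapsed() >= self.batch_timeout);
        if self.pending.len() >= self.max_batch_size || timed_out {
            self.oldest = None;
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }
}

fn main() {
    let mut b = Batcher::new(3, Duration::from_millis(100));
    assert!(b.push("q1".into()).is_none());
    assert!(b.push("q2".into()).is_none());
    let batch = b.push("q3".into()).expect("third request fills the batch");
    assert_eq!(batch.len(), 3); // One LLM call now serves three requests
    println!("flushed {}", batch.len());
}
```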
8. Optimize Schema Caching
Impact: 50-60% reduction in schema loading time
```toml
[schema_cache]
enabled = true
max_schemas = 100
preload_popular = true
eviction_policy = "LRU"
```

Why: Eliminates repeated schema extraction overhead.
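The `eviction_policy = "LRU"` behavior can be sketched with std collections (illustrative only; production code would typically reach for the `lru` crate rather than hand-rolling this, and the O(n) `touch` below would be an intrusive list):

```rust
use std::collections::{HashMap, VecDeque};

struct SchemaCache {
    max_schemas: usize,
    map: HashMap<String, String>, // db_id -> serialized schema
    order: VecDeque<String>,      // front = least recently used
}

impl SchemaCache {
    fn new(max_schemas: usize) -> Self {
        Self { max_schemas, map: HashMap::new(), order: VecDeque::new() }
    }

    fn get(&mut self, db_id: &str) -> Option<&String> {
        if self.map.contains_key(db_id) {
            self.touch(db_id);
            self.map.get(db_id)
        } else {
            None
        }
    }

    fn insert(&mut self, db_id: String, schema: String) {
        if self.map.len() >= self.max_schemas && !self.map.contains_key(&db_id) {
            if let Some(lru) = self.order.pop_front() {
                self.map.remove(&lru); // Evict the least recently used schema
            }
        }
        self.map.insert(db_id.clone(), schema);
        self.touch(&db_id);
    }

    fn touch(&mut self, db_id: &str) {
        self.order.retain(|k| k != db_id);
        self.order.push_back(db_id.to_string());
    }
}

fn main() {
    let mut cache = SchemaCache::new(2);
    cache.insert("sales".into(), "schema_a".into());
    cache.insert("hr".into(), "schema_b".into());
    cache.get("sales");                            // "sales" is now most recent
    cache.insert("ops".into(), "schema_c".into()); // Evicts "hr"
    assert!(cache.get("hr").is_none());
    assert!(cache.get("sales").is_some());
    println!("ok");
}
```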
9. Configure LLM Model Routing
Impact: 30-40% cost and latency reduction
```toml
[llm]
fast_model = "gpt-3.5-turbo"  # Simple queries
accurate_model = "gpt-4"  # Complex queries
complexity_threshold = 0.7
```

Why: Uses cheaper/faster models when appropriate.
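Routing reduces to a threshold comparison on a complexity score. The scoring heuristic below (`estimate_complexity`, counting SQL-complexity signals in the text) is a hypothetical stand-in; real scoring would inspect the parsed query and schema.

```rust
// Pick a model based on estimated query complexity and the configured threshold.
fn route_model(complexity: f64, threshold: f64) -> &'static str {
    if complexity < threshold { "gpt-3.5-turbo" } else { "gpt-4" }
}

// Toy heuristic: more join/aggregation signals -> higher complexity score.
fn estimate_complexity(query: &str) -> f64 {
    let q = query.to_lowercase();
    let signals = ["join", "group by", "having", "subquery", "window", "nested"];
    let hits = signals.iter().filter(|s| q.contains(*s)).count() as f64;
    (hits / signals.len() as f64).min(1.0)
}

fn main() {
    let simple = estimate_complexity("show all users");
    let complex = estimate_complexity(
        "join orders, group by region, having a nested subquery",
    );
    assert_eq!(route_model(simple, 0.7), "gpt-3.5-turbo"); // Fast path
    assert_eq!(route_model(complex, 0.7), "gpt-4");        // Accurate path
    println!("ok");
}
```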
10. Enable Parallel Processing
Impact: 2x throughput improvement
```toml
[concurrency]
max_concurrent_requests = 50
worker_threads = 8  # CPU cores
async_processing = true
```

Why: Maximizes resource utilization and throughput.
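A worker pool fanning queries out over `worker_threads` can be sketched with std threads and channels (the real service uses an async runtime; this only illustrates the shape):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Fan queries out across a fixed pool of workers; collect all results.
fn process_parallel(queries: Vec<String>, workers: usize) -> Vec<String> {
    let (job_tx, job_rx) = mpsc::channel::<String>();
    let (res_tx, res_rx) = mpsc::channel::<String>();
    let job_rx = Arc::new(Mutex::new(job_rx)); // Shared work queue
    let n = queries.len();
    for _ in 0..workers {
        let job_rx = Arc::clone(&job_rx);
        let res_tx = res_tx.clone();
        thread::spawn(move || {
            // Lock only long enough to take one job, then release.
            while let Ok(q) = { let rx = job_rx.lock().unwrap(); rx.recv() } {
                res_tx.send(format!("SQL for {q}")).unwrap();
            }
        });
    }
    for q in queries {
        job_tx.send(q).unwrap();
    }
    drop(job_tx); // Closing the queue lets idle workers exit
    res_rx.iter().take(n).collect()
}

fn main() {
    let out = process_parallel(vec!["a".into(), "b".into(), "c".into()], 8);
    assert_eq!(out.len(), 3); // All queries processed, order not guaranteed
    println!("ok");
}
```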
Latency Optimization
Critical Path Analysis
```text
Total Latency = Cache + Schema +  LLM  + Tracking + Formatting
       ~200ms =   5ms +   20ms + 150ms +     10ms +        5ms
```

Optimization Priority:
- LLM call (150ms) - Cache aggressively
- Schema loading (20ms) - Preload and cache
- Context tracking (10ms) - Optimize data structures
- Cache lookup (5ms) - Use in-memory hash maps
- Response formatting (5ms) - Minimize serialization
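As a sanity check, the per-component budgets above can be summed against the end-to-end target; the components total 190ms, leaving roughly 10ms of headroom under the 200ms goal.

```rust
// Sum the per-component P99 budgets from the critical path above.
fn main() {
    let cache_lookup = 5;        // ms
    let schema_loading = 20;     // ms
    let llm_call = 150;          // ms
    let context_tracking = 10;   // ms
    let response_formatting = 5; // ms
    let total = cache_lookup + schema_loading + llm_call
        + context_tracking + response_formatting;
    assert!(total <= 200); // Within the 200ms end-to-end target
    println!("critical path budget: {total}ms");
}
```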
LLM Call Optimization
Strategy 1: Semantic Caching
```rust
// High-impact optimization
let cache_key = generate_semantic_key(&query, &schema_hash);
if let Some(cached_sql) = semantic_cache.get(&cache_key) {
    return Ok(cached_sql); // Save 150ms+
}
```

Strategy 2: Model Selection
```rust
// Use faster model for simple queries
let model = if query_complexity < 0.7 {
    "gpt-3.5-turbo" // 50-100ms
} else {
    "gpt-4" // 150-200ms
};
```

Strategy 3: Timeout Configuration
```rust
// Aggressive timeout to prevent slow queries
let llm_timeout = Duration::from_millis(5000);
let result = timeout(llm_timeout, llm_call(prompt)).await?;
```

Schema Loading Optimization
Strategy 1: Preload Common Schemas
```rust
// Load popular schemas at startup
async fn preload_schemas(config: &Config) {
    for db in &config.popular_databases {
        schema_cache.load(db).await;
    }
}
```

Strategy 2: Lazy Schema Loading
```rust
// Only load schema when needed
if !schema_cache.contains(&db_id) {
    let schema = extract_schema(&db_id).await?;
    schema_cache.insert(db_id, schema);
}
```

Strategy 3: Schema Compression
```rust
// Compress schema representation
let compressed = compress_schema(&full_schema);
// Reduces memory and serialization overhead
```

Context Tracking Optimization
Strategy 1: Circular Buffer
```rust
// Use circular buffer for conversation history
struct ConversationContext {
    turns: VecDeque<Turn>, // O(1) push/pop
    max_turns: usize,
}
```

Strategy 2: Lazy Materialization
```rust
// Only materialize context when needed
fn get_context(&self) -> String {
    self.turns.iter()
        .rev()
        .take(5) // Only recent turns
        .map(|t| t.summary())
        .collect()
}
```

Memory Optimization
Memory Allocation Strategy
Goal: <2MB per session
Breakdown:
```text
Session State:     500KB  (25%)
Context History:   800KB  (40%)
Schema Cache:      400KB  (20%)
Temp Buffers:      300KB  (15%)
-------------------------------
Total:            2000KB (100%)
```

Reduce Session State
Strategy 1: Store References
```rust
struct Session {
    session_id: Uuid,
    user_id: Arc<str>,     // Shared string
    database_id: Arc<str>, // Shared string
    context: ConversationContext,
    // Avoid: Full copies of strings
}
```

Strategy 2: Compact Representation
```rust
// Use compact data structures
struct Turn {
    turn_id: u16,    // 2 bytes vs 8
    timestamp: u32,  // 4 bytes (unix seconds)
    query_hash: u64, // 8 bytes vs full string
    entities: SmallVec<[String; 4]>, // Stack allocation
}
```

Context Compression
Strategy 1: Summarization
```rust
// Summarize old turns
impl ConversationContext {
    fn compress(&mut self) {
        if self.turns.len() > 10 {
            let old_turns = self.turns.drain(..5);
            self.summary = summarize(old_turns);
        }
    }
}
```

Strategy 2: Entity Extraction
```rust
// Store entities instead of full text
struct Turn {
    entities: Vec<String>, // Just "users", "orders"
    intent: Intent,        // Enum (1 byte)
    // Avoid: Full query text
}
```

Cache Size Management
Strategy 1: LRU Eviction
```rust
// Automatic cache size management (e.g. via the `lru` crate)
let mut cache = LruCache::new(10000);
cache.insert(key, value); // Auto-evicts the least recently used entry
```

Strategy 2: Memory Budget
```rust
// Set hard memory limits
const MAX_CACHE_SIZE: usize = 100 * 1024 * 1024; // 100MB
if cache.memory_usage() > MAX_CACHE_SIZE {
    cache.evict_until(MAX_CACHE_SIZE);
}
```

Throughput Optimization
Concurrency Configuration
Optimal Settings (8-core system):
```toml
[concurrency]
worker_threads = 8            # Match CPU cores
max_concurrent_requests = 50  # 5-10x workers
task_queue_size = 200         # 4x concurrent requests
```

Async I/O Optimization
Strategy 1: Parallel LLM Calls
```rust
// Process multiple queries concurrently
let futures: Vec<_> = queries.iter()
    .map(|q| generate_sql(q))
    .collect();
let results = join_all(futures).await;
```

Strategy 2: Pipeline Processing
```rust
// Run independent pipeline stages in parallel
async fn process_query(query: Query) -> Result<SQL> {
    let (schema, context) = join!(
        load_schema(query.db),      // Parallel
        get_context(query.session)  // Parallel
    );
    generate_sql(query, schema, context).await
}
```

Load Balancing
Strategy 1: Session Affinity
```rust
// Route sessions to same worker for cache hits
let worker_id = hash(session_id) % num_workers;
workers[worker_id].process(query).await
```

Strategy 2: Least Loaded
```rust
// Route to least loaded worker
let worker = workers.iter()
    .min_by_key(|w| w.queue_size())
    .unwrap();
worker.process(query).await
```

Cache Optimization
Semantic Cache Strategy
Configuration:
```toml
[semantic_cache]
enabled = true
max_entries = 10000
ttl_seconds = 3600
similarity_threshold = 0.85
embedding_model = "text-embedding-ada-002"
```

Hit Rate Target: >90% for production workloads
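A hit is decided by comparing query embeddings against `similarity_threshold`. The similarity test itself is cosine similarity; a minimal sketch (with toy 3-dimensional vectors standing in for real embedding-model output):

```rust
// Cosine similarity between two embedding vectors.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}

fn main() {
    let cached = [0.9, 0.1, 0.3];   // Embedding of a cached query
    let incoming = [0.85, 0.15, 0.35]; // Embedding of the new query
    let sim = cosine_similarity(&cached, &incoming);
    assert!(sim > 0.85); // Above threshold -> serve the cached SQL, skip the LLM
    println!("similarity: {sim:.3}");
}
```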
Cache Key Generation
Strategy 1: Schema-Aware Keys
```rust
fn cache_key(query: &str, schema_hash: &str) -> String {
    format!("{}:{}", normalize(query), schema_hash)
}
```

Strategy 2: Canonical Form
```rust
fn normalize(query: &str) -> String {
    query.to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}
```

Cache Warming
Strategy 1: Preload Common Queries
```rust
// At startup
for query in common_queries {
    let sql = generate_sql(&query).await?;
    cache.insert(query, sql);
}
```

Strategy 2: Background Refresh
```rust
// Refresh cache before expiry
tokio::spawn(async move {
    loop {
        sleep(Duration::from_secs(1800)).await; // 30 min
        cache.refresh_expiring_entries().await;
    }
});
```

Database Optimization
Connection Pool Tuning
Optimal Configuration:
```toml
[database.pool]
min_connections = 5
max_connections = 20  # 2x CPU cores
connection_timeout = 5000
idle_timeout = 300
max_lifetime = 1800
```

Query Optimization
Strategy 1: Prepared Statements
```rust
// Use prepared statements for repeated queries
let stmt = conn.prepare("SELECT * FROM users WHERE id = ?")?;
let user = stmt.query_row([user_id], |row| ...)?;
```

Strategy 2: Batch Operations
```rust
// Batch schema extractions
let tables = conn.query("SELECT * FROM information_schema.tables")?;
// Process all at once instead of one-by-one
```

Schema Extraction Optimization
Strategy 1: Incremental Updates
```rust
// Only update changed schemas
if schema_version_changed(db_id) {
    update_schema(db_id).await?;
}
```

Strategy 2: Parallel Extraction
```rust
// Extract multiple schemas in parallel
let futures: Vec<_> = databases.iter()
    .map(|db| extract_schema(db))
    .collect();
let schemas = join_all(futures).await;
```

LLM Integration Optimization
Model Selection
Decision Tree:
```text
Query Complexity < 0.7 → gpt-3.5-turbo (fast, cheap)
Query Complexity ≥ 0.7 → gpt-4 (accurate, slower)
Has cache hit          → Skip LLM entirely
```

Prompt Optimization
Strategy 1: Minimal Prompts
```rust
// Provide only necessary context
let prompt = format!(
    "Schema: {}\nQuery: {}\nSQL:",
    essential_schema, // Not full schema
    query
);
```

Strategy 2: Few-Shot Examples
```rust
// Include 2-3 examples, not 10+
let examples = get_relevant_examples(&query, 3); // Limit to 3
let prompt = format_prompt(&schema, &query, &examples);
```

Retry Strategy
Exponential Backoff:
```rust
let mut delay = Duration::from_millis(100);
for attempt in 0..3 {
    match llm_call(&prompt).await {
        Ok(result) => return Ok(result),
        Err(e) if e.is_retryable() && attempt < 2 => {
            sleep(delay).await;
            delay *= 2; // Exponential backoff
        }
        Err(e) => return Err(e), // Non-retryable, or retries exhausted
    }
}
```

Monitoring & Profiling
Key Metrics to Track
Latency Metrics:
```promql
# P50, P90, P95, P99
histogram_quantile(0.99, rate(query_duration_seconds_bucket[5m]))

# By component
histogram_quantile(0.99, rate(llm_call_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(cache_lookup_duration_seconds_bucket[5m]))
```

Throughput Metrics:

```promql
# Queries per second
rate(queries_total[1m])

# Cache hit rate
rate(cache_hits_total[5m]) / rate(cache_requests_total[5m])
```

Resource Metrics:

```promql
# Memory usage
process_resident_memory_bytes

# CPU usage
rate(process_cpu_seconds_total[1m])
```

Profiling Tools
CPU Profiling:
# Generate flamegraphcd heliosdb-conversational-bi./scripts/profile_performance.sh flamegraphMemory Profiling:
# Generate memory profile./scripts/profile_performance.sh memoryBenchmarking:
# Run criterion benchmarkscargo bench --package heliosdb-conversational-biProduction Tuning
Recommended Production Configuration
```toml
# Production-optimized configuration

[general]
environment = "production"
log_level = "info"

[cache]
enabled = true
max_entries = 50000         # Increased for production
ttl_seconds = 7200          # 2 hours
similarity_threshold = 0.85

[database]
pool_size = 20
max_connections = 100
connection_timeout_ms = 5000

[rate_limiting]
enabled = true
queries_per_minute = 100    # Adjusted per SLA
burst_allowance = 20

[circuit_breaker]
enabled = true
failure_threshold = 5
reset_timeout_seconds = 60

[concurrency]
worker_threads = 16         # Production server
max_concurrent_requests = 100

[timeouts]
llm_timeout_ms = 5000
total_request_timeout_ms = 15000

[monitoring]
metrics_enabled = true
metrics_port = 9090
tracing_enabled = true
```

Scaling Strategies
Vertical Scaling (Single Instance):
- 16+ CPU cores
- 32+ GB RAM
- SSD storage
- Expected: 150-200 QPS
Horizontal Scaling (Cluster):
- Load balancer (session affinity)
- 3+ instances
- Shared cache (Redis)
- Expected: 500+ QPS
Auto-Scaling Rules:
```yaml
scale_up:
  - cpu_usage > 70% for 5 minutes
  - p99_latency > 300ms for 5 minutes
  - queue_depth > 100

scale_down:
  - cpu_usage < 30% for 15 minutes
  - queue_depth < 20
```

Troubleshooting
High Latency
Symptom: P99 > 500ms
Diagnosis:
```bash
# Check cache hit rate
curl http://localhost:9090/metrics | grep cache_hit_rate

# Check LLM latency
curl http://localhost:9090/metrics | grep llm_duration
```

Solutions:
- Increase cache size → Improve hit rate
- Enable semantic caching → Reduce LLM calls
- Use faster LLM model → Reduce LLM latency
- Add more workers → Increase concurrency
High Memory Usage
Symptom: Memory > 4GB per instance
Diagnosis:
```bash
# Memory profile
./scripts/profile_performance.sh memory

# Check cache size
curl http://localhost:9090/metrics | grep cache_size_bytes
```

Solutions:
- Reduce cache size → Lower memory footprint
- Enable context compression → Smaller sessions
- Reduce max_turns → Less history
- Implement aggressive eviction → Free memory
Low Throughput
Symptom: QPS < 50
Diagnosis:
```bash
# Check queue depth
curl http://localhost:9090/metrics | grep queue_depth

# Check worker utilization
curl http://localhost:9090/metrics | grep worker_busy_ratio
```

Solutions:
- Increase worker_threads → More parallelism
- Increase max_concurrent_requests → Higher queue
- Optimize database pool → Reduce contention
- Enable request batching → Better efficiency
Cache Misses
Symptom: Hit rate < 50%
Diagnosis:
```bash
# Analyze cache patterns
curl http://localhost:9090/metrics | grep cache_

# Check similarity threshold
grep similarity_threshold config.toml
```

Solutions:
- Lower similarity_threshold → More liberal matching
- Improve query normalization → Better key generation
- Preload common queries → Warm cache
- Increase cache TTL → Longer retention
Performance Checklist
Pre-Production
- Run load tests (100+ concurrent users)
- Profile CPU and memory usage
- Execute BIRD benchmark subset
- Validate cache hit rate (>70%)
- Test circuit breaker activation
- Verify rate limiting enforcement
- Check monitoring and alerting
- Review timeout configuration
Post-Deployment
- Monitor P99 latency (<200ms target)
- Track cache hit rate (>90% target)
- Watch memory usage (<2MB per session)
- Measure throughput (>150 QPS)
- Collect user feedback
- Analyze query patterns
- Optimize based on production data
- Plan capacity scaling
Conclusion
Performance optimization is an ongoing process. Use this guide as a reference for:
- Initial Configuration - Start with recommended settings
- Production Monitoring - Track key metrics continuously
- Iterative Tuning - Adjust based on real workload
- Problem Solving - Use troubleshooting section for issues
Key Takeaways
- Cache Aggressively - 70-80% latency reduction
- Optimize Critical Path - Focus on LLM and schema loading
- Monitor Continuously - Metrics drive optimization
- Scale Appropriately - Vertical first, horizontal for growth
- Test Thoroughly - Load test before production
Target Summary
| Metric | Target | Optimization |
|---|---|---|
| P99 Latency | <200ms | Cache + Fast Models |
| Memory | <2MB | Compression + Eviction |
| Throughput | 150+ QPS | Concurrency + Batching |
| Cache Hit | >90% | Semantic + Preload |
With these optimizations, HeliosDB Conversational BI delivers world-class NL2SQL performance at scale.
Version: 7.0 Last Updated: 2025-11-14 Maintained By: HeliosDB Team