Conversational BI Performance Optimization Guide

Version: 7.0 | Last Updated: 2025-11-14 | Status: Production Ready

Table of Contents

  1. Overview
  2. Performance Targets
  3. Quick Start - Top 10 Optimizations
  4. Latency Optimization
  5. Memory Optimization
  6. Throughput Optimization
  7. Cache Optimization
  8. Database Optimization
  9. LLM Integration Optimization
  10. Monitoring & Profiling
  11. Production Tuning
  12. Troubleshooting

Overview

This guide provides comprehensive performance optimization strategies for HeliosDB Conversational BI in production environments.

Performance Philosophy

Goal: Sub-200ms P99 latency for 95%+ accurate NL2SQL translation

Approach:

  • Optimize the critical path (query → SQL generation)
  • Minimize LLM calls through aggressive caching
  • Reduce memory allocation overhead
  • Maximize throughput through concurrency

Baseline Performance (Week 7-8 Validation)

| Metric | Current | Target | Status |
| --- | --- | --- | --- |
| Session Creation | ~5ms | <10ms | ✅ EXCEEDED |
| Query (no LLM) | ~30ms | <50ms | ✅ EXCEEDED |
| Query (cached) | ~80ms | <100ms | ✅ ACHIEVED |
| Query (uncached) | ~250ms | <300ms | ⏳ TARGET |
| P99 Latency | Pending | <200ms | ⏳ TARGET |
| Memory/Session | ~2.5MB | <2MB | ⏳ CLOSE |
| QPS | 100+ | 150+ | ⏳ TARGET |

Performance Targets

Latency Targets

# Target latencies (P99)
session_creation: <10ms
cache_lookup: <5ms
schema_loading: <20ms
llm_call: <200ms # External dependency
sql_generation_total: <200ms
context_tracking: <10ms
response_formatting: <5ms

Memory Targets

# Memory usage per component
session_state: <500KB
conversation_context: <1MB
schema_cache: <5MB per database
semantic_cache: <100MB total
total_per_session: <2MB

Throughput Targets

# Concurrent requests
single_instance: 150+ QPS
with_caching: 500+ QPS
cluster: 1000+ QPS

Quick Start - Top 10 Optimizations

1. Enable Semantic Caching

Impact: 70-80% latency reduction for similar queries

[cache]
enabled = true
max_entries = 10000
ttl_seconds = 3600
similarity_threshold = 0.85

Why: Avoids expensive LLM calls for semantically similar queries.

2. Optimize Connection Pooling

Impact: 40-50% latency reduction for database operations

[database]
pool_size = 20 # 2x CPU cores
max_connections = 100
connection_timeout_ms = 5000
idle_timeout_seconds = 300

Why: Eliminates connection establishment overhead.

3. Configure Rate Limiting

Impact: Prevents overload, ensures consistent performance

[rate_limiting]
enabled = true
queries_per_minute = 60 # Per tenant
burst_allowance = 10

Why: Protects against traffic spikes and ensures fair resource allocation.
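
Conceptually the limiter is a token bucket: queries_per_minute sets the refill rate and burst_allowance the extra headroom. A minimal sketch of that semantics in Rust (the TokenBucket type is illustrative, not the shipped limiter):

use std::time::Instant;

/// Illustrative token bucket: refills continuously and caps
/// outstanding tokens at queries_per_minute + burst_allowance.
struct TokenBucket {
    capacity: f64,       // queries_per_minute + burst_allowance
    tokens: f64,         // currently available tokens
    refill_per_sec: f64, // queries_per_minute / 60
    last_refill: Instant,
}

impl TokenBucket {
    fn new(queries_per_minute: f64, burst_allowance: f64) -> Self {
        Self {
            capacity: queries_per_minute + burst_allowance,
            tokens: queries_per_minute + burst_allowance,
            refill_per_sec: queries_per_minute / 60.0,
            last_refill: Instant::now(),
        }
    }

    /// Returns true if the request is admitted.
    fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        // Refill proportionally to elapsed time, capped at capacity.
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}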

4. Enable Circuit Breakers

Impact: Graceful degradation, prevents cascade failures

[circuit_breaker]
enabled = true
failure_threshold = 5
reset_timeout_seconds = 60
half_open_requests = 3

Why: Isolates failures and prevents system-wide outages.
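
For reference, the three states behind these settings, sketched in Rust (illustrative, not the internal implementation): Closed passes traffic, Open rejects it until reset_timeout_seconds elapses, and Half-Open admits up to half_open_requests probe requests.

use std::time::{Duration, Instant};

#[derive(Clone, Copy)]
enum BreakerState {
    Closed,
    Open { since: Instant },
    HalfOpen { probes: u32 },
}

struct CircuitBreaker {
    state: BreakerState,
    failures: u32,
    failure_threshold: u32,  // 5, per the config above
    reset_timeout: Duration, // 60s, per the config above
    half_open_requests: u32, // 3, per the config above
}

impl CircuitBreaker {
    /// Returns true if the request may proceed.
    fn allow(&mut self) -> bool {
        match self.state {
            BreakerState::Closed => true,
            BreakerState::Open { since } if since.elapsed() >= self.reset_timeout => {
                self.state = BreakerState::HalfOpen { probes: 1 }; // start probing
                true
            }
            BreakerState::Open { .. } => false,
            BreakerState::HalfOpen { probes } if probes < self.half_open_requests => {
                self.state = BreakerState::HalfOpen { probes: probes + 1 };
                true
            }
            BreakerState::HalfOpen { .. } => false,
        }
    }

    fn record_success(&mut self) {
        self.failures = 0;
        self.state = BreakerState::Closed; // recovered
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        if self.failures >= self.failure_threshold {
            self.state = BreakerState::Open { since: Instant::now() };
        }
    }
}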

5. Optimize Context Window

Impact: 20-30% memory reduction

[context]
max_turns = 10 # Down from 20
max_context_length = 2000 # Down from 4000
context_compression = true

Why: Reduces memory footprint and LLM token usage.

6. Configure Aggressive Timeouts

Impact: Prevents slow queries from blocking resources

[timeouts]
llm_timeout_ms = 5000 # 5 seconds
sql_execution_timeout_ms = 10000 # 10 seconds
total_request_timeout_ms = 15000 # 15 seconds

Why: Ensures predictable latency and resource release.

7. Enable Request Batching

Impact: 2-3x throughput improvement

[batching]
enabled = true
max_batch_size = 10
batch_timeout_ms = 100

Why: Amortizes LLM call overhead across multiple requests.
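
A sketch of the batching loop under these settings, using a tokio channel (Request and process_batch are placeholders for the real types): the loop drains up to max_batch_size requests, or whatever arrived within batch_timeout_ms, then issues one LLM call for the whole batch.

use std::time::Duration;
use tokio::sync::mpsc;

struct Request; // placeholder for the real request type

async fn process_batch(_batch: Vec<Request>) {
    // placeholder: one LLM round-trip covering every query in the batch
}

async fn batch_loop(mut rx: mpsc::Receiver<Request>) {
    const MAX_BATCH_SIZE: usize = 10; // matches max_batch_size above
    const BATCH_TIMEOUT: Duration = Duration::from_millis(100);

    while let Some(first) = rx.recv().await {
        let mut batch = vec![first];
        let deadline = tokio::time::sleep(BATCH_TIMEOUT);
        tokio::pin!(deadline);

        // Fill the batch until it is full, the timeout fires, or senders close.
        while batch.len() < MAX_BATCH_SIZE {
            tokio::select! {
                _ = &mut deadline => break,
                maybe = rx.recv() => match maybe {
                    Some(req) => batch.push(req),
                    None => break,
                },
            }
        }
        process_batch(batch).await;
    }
}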

8. Optimize Schema Caching

Impact: 50-60% reduction in schema loading time

[schema_cache]
enabled = true
max_schemas = 100
preload_popular = true
eviction_policy = "LRU"

Why: Eliminates repeated schema extraction overhead.

9. Configure LLM Model Routing

Impact: 30-40% cost and latency reduction

[llm]
fast_model = "gpt-3.5-turbo" # Simple queries
accurate_model = "gpt-4" # Complex queries
complexity_threshold = 0.7

Why: Uses cheaper/faster models when appropriate.

10. Enable Parallel Processing

Impact: 2x throughput improvement

[concurrency]
max_concurrent_requests = 50
worker_threads = 8 # CPU cores
async_processing = true

Why: Maximizes resource utilization and throughput.

Latency Optimization

Critical Path Analysis

Total Latency = Cache + Schema + LLM + Tracking + Formatting
~190ms = 5ms + 20ms + 150ms + 10ms + 5ms

Optimization Priority:

  1. LLM call (150ms) - Cache aggressively
  2. Schema loading (20ms) - Preload and cache
  3. Context tracking (10ms) - Optimize data structures
  4. Cache lookup (5ms) - Use in-memory hash maps
  5. Response formatting (5ms) - Minimize serialization

LLM Call Optimization

Strategy 1: Semantic Caching

// High-impact optimization
let cache_key = generate_semantic_key(&query, &schema_hash);
if let Some(cached_sql) = semantic_cache.get(&cache_key) {
    return Ok(cached_sql); // Save 150ms+
}

Strategy 2: Model Selection

// Use the faster model for simple queries
let model = if query_complexity < 0.7 {
    "gpt-3.5-turbo" // 50-100ms
} else {
    "gpt-4" // 150-200ms
};

Strategy 3: Timeout Configuration

// Aggressive timeout so slow LLM calls cannot hold resources
let limit = Duration::from_millis(5000);
// Outer ? handles the timeout, inner ? the LLM error
let result = tokio::time::timeout(limit, llm_call(prompt)).await??;

Schema Loading Optimization

Strategy 1: Preload Common Schemas

// Load popular schemas at startup
async fn preload_schemas(config: &Config) {
for db in &config.popular_databases {
schema_cache.load(db).await;
}
}

Strategy 2: Lazy Schema Loading

// Only load schema when needed
if !schema_cache.contains(&db_id) {
let schema = extract_schema(&db_id).await?;
schema_cache.insert(db_id, schema);
}

Strategy 3: Schema Compression

// Compress schema representation
let compressed = compress_schema(&full_schema);
// Reduces memory and serialization overhead
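
As a concrete (illustrative) example of what compression can mean here: reduce the schema to table and column names before caching, since that is all prompt construction needs. The types below are assumptions, not the actual schema structs.

/// Illustrative types: the real schema structs will differ.
struct Column { name: String, sql_type: String, comment: Option<String> }
struct Table { name: String, columns: Vec<Column> }

/// Compact form keeps only what prompt construction needs.
struct CompactTable { name: String, columns: Vec<String> }

fn compress_schema(tables: &[Table]) -> Vec<CompactTable> {
    tables.iter()
        .map(|t| CompactTable {
            name: t.name.clone(),
            // Drop types and comments; keep column names for the prompt.
            columns: t.columns.iter().map(|c| c.name.clone()).collect(),
        })
        .collect()
}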

Context Tracking Optimization

Strategy 1: Circular Buffer

// Use a circular buffer for conversation history
struct ConversationContext {
    turns: VecDeque<Turn>, // O(1) push/pop at both ends
    max_turns: usize,
}
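
A hedged sketch of the matching insert path, assuming the struct above: push new turns at the back and drop the oldest from the front once max_turns is reached, keeping both operations O(1).

impl ConversationContext {
    fn push_turn(&mut self, turn: Turn) {
        if self.turns.len() == self.max_turns {
            self.turns.pop_front(); // evict the oldest turn, O(1)
        }
        self.turns.push_back(turn); // append the new turn, O(1)
    }
}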

Strategy 2: Lazy Materialization

// Only materialize context when needed
fn get_context(&self) -> String {
    self.turns.iter()
        .rev()
        .take(5) // Only the most recent turns
        .map(|t| t.summary())
        .collect()
}

Memory Optimization

Memory Allocation Strategy

Goal: <2MB per session

Breakdown:

Session State:    500KB  (25%)
Context History:  800KB  (40%)
Schema Cache:     400KB  (20%)
Temp Buffers:     300KB  (15%)
------------------------------
Total:           2000KB (100%)

Reduce Session State

Strategy 1: Store References

struct Session {
    session_id: Uuid,
    user_id: Arc<str>,     // Shared string
    database_id: Arc<str>, // Shared string
    context: ConversationContext,
    // Avoid: full owned copies of strings
}

Strategy 2: Compact Representation

// Use compact data structures
struct Turn {
turn_id: u16, // 2 bytes vs 8
timestamp: u32, // 4 bytes (unix seconds)
query_hash: u64, // 8 bytes vs full string
entities: SmallVec<[String; 4]>, // Stack allocation
}

Context Compression

Strategy 1: Summarization

// Summarize old turns
impl ConversationContext {
    fn compress(&mut self) {
        if self.turns.len() > 10 {
            let old_turns = self.turns.drain(..5);
            self.summary = summarize(old_turns);
        }
    }
}

Strategy 2: Entity Extraction

// Store entities instead of full text
struct Turn {
entities: Vec<String>, // Just "users", "orders"
intent: Intent, // Enum (1 byte)
// Avoid: Full query text
}

Cache Size Management

Strategy 1: LRU Eviction

// Automatic cache size management
let mut cache = LruCache::new(10000);
cache.insert(key, value); // Auto-evicts the oldest entry when full

Strategy 2: Memory Budget

// Set hard memory limits
const MAX_CACHE_SIZE: usize = 100 * 1024 * 1024; // 100MB
if cache.memory_usage() > MAX_CACHE_SIZE {
    cache.evict_until(MAX_CACHE_SIZE);
}

Throughput Optimization

Concurrency Configuration

Optimal Settings (8-core system):

[concurrency]
worker_threads = 8 # Match CPU cores
max_concurrent_requests = 50 # 5-10x workers
task_queue_size = 200 # 4x concurrent requests

Async I/O Optimization

Strategy 1: Parallel LLM Calls

// Process multiple queries concurrently
let futures: Vec<_> = queries.iter()
    .map(|q| generate_sql(q))
    .collect();
let results = join_all(futures).await;

Strategy 2: Pipeline Processing

// Pipeline stages
async fn process_query(query: Query) -> Result<SQL> {
    let (schema, context) = join!(
        load_schema(query.db),      // runs in parallel
        get_context(query.session)  // runs in parallel
    );
    generate_sql(query, schema, context).await
}

Load Balancing

Strategy 1: Session Affinity

// Route sessions to same worker for cache hits
let worker_id = hash(session_id) % num_workers;
workers[worker_id].process(query).await

Strategy 2: Least Loaded

// Route to the least loaded worker
let worker = workers.iter()
    .min_by_key(|w| w.queue_size())
    .unwrap();
worker.process(query).await

Cache Optimization

Semantic Cache Strategy

Configuration:

[semantic_cache]
enabled = true
max_entries = 10000
ttl_seconds = 3600
similarity_threshold = 0.85
embedding_model = "text-embedding-ada-002"

Hit Rate Target: >90% for production workloads
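
Unlike an exact-match cache, lookups compare the query's embedding against stored entries and accept the best match above similarity_threshold. A minimal cosine-similarity sketch (linear scan for clarity; the entry layout is an assumption, and production code would use an ANN index):

struct CacheEntry {
    embedding: Vec<f32>, // precomputed query embedding
    sql: String,
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Returns the cached SQL of the closest entry above `threshold`, if any.
fn lookup<'a>(entries: &'a [CacheEntry], query_embedding: &[f32], threshold: f32) -> Option<&'a str> {
    entries.iter()
        .map(|e| (cosine_similarity(&e.embedding, query_embedding), e))
        .filter(|(sim, _)| *sim >= threshold)
        .max_by(|(a, _), (b, _)| a.total_cmp(b))
        .map(|(_, e)| e.sql.as_str())
}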

Cache Key Generation

Strategy 1: Schema-Aware Keys

fn cache_key(query: &str, schema_hash: &str) -> String {
    format!("{}:{}", normalize(query), schema_hash)
}

Strategy 2: Canonical Form

fn normalize(query: &str) -> String {
    query.to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

Cache Warming

Strategy 1: Preload Common Queries

// At startup
for query in common_queries {
    let sql = generate_sql(&query).await?;
    cache.insert(query, sql);
}

Strategy 2: Background Refresh

// Refresh entries shortly before they expire
tokio::spawn(async move {
    loop {
        sleep(Duration::from_secs(1800)).await; // every 30 min
        cache.refresh_expiring_entries().await;
    }
});

Database Optimization

Connection Pool Tuning

Optimal Configuration:

[database.pool]
min_connections = 5
max_connections = 20 # 2x CPU cores
connection_timeout = 5000
idle_timeout = 300
max_lifetime = 1800
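
If the service talks to Postgres through sqlx (an assumption here; adjust for the actual driver), the settings above map onto the pool builder roughly like this:

use std::time::Duration;
use sqlx::postgres::PgPoolOptions;

// Sketch assuming sqlx 0.7+ with the "postgres" feature enabled.
async fn build_pool(url: &str) -> Result<sqlx::PgPool, sqlx::Error> {
    PgPoolOptions::new()
        .min_connections(5)
        .max_connections(20)                        // 2x CPU cores
        .acquire_timeout(Duration::from_millis(5000))
        .idle_timeout(Duration::from_secs(300))
        .max_lifetime(Duration::from_secs(1800))
        .connect(url)
        .await
}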

Query Optimization

Strategy 1: Prepared Statements

// Use prepared statements for repeated queries
let stmt = conn.prepare("SELECT * FROM users WHERE id = ?")?;
let user = stmt.query_row([user_id], |row| ...)?;

Strategy 2: Batch Operations

// Batch schema extractions
let tables = conn.query("SELECT * FROM information_schema.tables")?;
// Process all at once instead of one-by-one

Schema Extraction Optimization

Strategy 1: Incremental Updates

// Only update changed schemas
if schema_version_changed(db_id) {
    update_schema(db_id).await?;
}

Strategy 2: Parallel Extraction

// Extract multiple schemas in parallel
let futures: Vec<_> = databases.iter()
    .map(|db| extract_schema(db))
    .collect();
let schemas = join_all(futures).await;

LLM Integration Optimization

Model Selection

Decision Tree:

Query Complexity < 0.7 → gpt-3.5-turbo (fast, cheap)
Query Complexity ≥ 0.7 → gpt-4 (accurate, slower)
Has cache hit → Skip LLM entirely
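
The guide leaves the complexity score itself open. One cheap heuristic, purely illustrative (the feature list and weights are assumptions), is to count structural markers in the natural-language query and normalize to [0, 1]:

/// Illustrative complexity heuristic in [0.0, 1.0].
fn query_complexity(query: &str) -> f64 {
    let q = query.to_lowercase();
    let mut score: f64 = 0.0;
    // Aggregations, grouping, and comparisons tend to need joins or subqueries.
    for marker in ["average", "per ", "group", "compare", "trend", "versus", "top "] {
        if q.contains(marker) {
            score += 0.15;
        }
    }
    // Longer questions usually encode more constraints.
    score += (q.split_whitespace().count() as f64 / 50.0).min(0.3);
    score.min(1.0)
}

// Routing, matching the decision tree above:
// if query_complexity(q) < 0.7 { fast model } else { accurate model }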

Prompt Optimization

Strategy 1: Minimal Prompts

// Provide only the necessary context
let prompt = format!(
    "Schema: {}\nQuery: {}\nSQL:",
    essential_schema, // Not the full schema
    query
);

Strategy 2: Few-Shot Examples

// Include 2-3 examples, not 10+
let examples = get_relevant_examples(&query, 3); // limit to 3
let prompt = format_prompt(&schema, &query, &examples);

Retry Strategy

Exponential Backoff:

let mut delay = Duration::from_millis(100);
let mut last_err = None;
for _attempt in 0..3 {
    match llm_call(&prompt).await {
        Ok(result) => return Ok(result),
        Err(e) if e.is_retryable() => {
            last_err = Some(e);
            sleep(delay).await;
            delay *= 2; // Exponential backoff: 100ms, 200ms, 400ms
        }
        Err(e) => return Err(e),
    }
}
// All retries exhausted: surface the last retryable error.
Err(last_err.expect("at least one attempt was made"))

Monitoring & Profiling

Key Metrics to Track

Latency Metrics:

# P50, P90, P95, P99
histogram_quantile(0.99, rate(query_duration_seconds_bucket[5m]))
# By component
histogram_quantile(0.99, rate(llm_call_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(cache_lookup_duration_seconds_bucket[5m]))

Throughput Metrics:

# Queries per second
rate(queries_total[1m])
# Cache hit rate
rate(cache_hits_total[5m]) / rate(cache_requests_total[5m])

Resource Metrics:

# Memory usage
process_resident_memory_bytes
# CPU usage
rate(process_cpu_seconds_total[1m])
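
These metrics are most useful when wired to alerts. A sample Prometheus alerting rule for the P99 target (rule name and labels are illustrative):

groups:
  - name: conversational-bi
    rules:
      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(query_duration_seconds_bucket[5m])) > 0.2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "P99 query latency above 200ms for 5 minutes"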

Profiling Tools

CPU Profiling:

# Generate flamegraph
cd heliosdb-conversational-bi
./scripts/profile_performance.sh flamegraph

Memory Profiling:

# Generate memory profile
./scripts/profile_performance.sh memory

Benchmarking:

# Run criterion benchmarks
cargo bench --package heliosdb-conversational-bi

Production Tuning

# Production-optimized configuration
[general]
environment = "production"
log_level = "info"

[cache]
enabled = true
max_entries = 50000 # Increased for production
ttl_seconds = 7200 # 2 hours
similarity_threshold = 0.85

[database]
pool_size = 20
max_connections = 100
connection_timeout_ms = 5000

[rate_limiting]
enabled = true
queries_per_minute = 100 # Adjusted per SLA
burst_allowance = 20

[circuit_breaker]
enabled = true
failure_threshold = 5
reset_timeout_seconds = 60

[concurrency]
worker_threads = 16 # Production server
max_concurrent_requests = 100

[timeouts]
llm_timeout_ms = 5000
total_request_timeout_ms = 15000

[monitoring]
metrics_enabled = true
metrics_port = 9090
tracing_enabled = true

Scaling Strategies

Vertical Scaling (Single Instance):

  • 16+ CPU cores
  • 32+ GB RAM
  • SSD storage
  • Expected: 150-200 QPS

Horizontal Scaling (Cluster):

  • Load balancer (session affinity)
  • 3+ instances
  • Shared cache (Redis)
  • Expected: 500+ QPS

Auto-Scaling Rules:

scale_up:
  - cpu_usage > 70% for 5 minutes
  - p99_latency > 300ms for 5 minutes
  - queue_depth > 100
scale_down:
  - cpu_usage < 30% for 15 minutes
  - queue_depth < 20
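
On Kubernetes, the CPU rule maps directly onto a HorizontalPodAutoscaler (the deployment name is illustrative; the latency- and queue-based rules require custom metrics):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: conversational-bi
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: conversational-bi
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70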

Troubleshooting

High Latency

Symptom: P99 > 500ms

Diagnosis:

# Check cache hit rate
curl http://localhost:9090/metrics | grep cache_hit_rate
# Check LLM latency
curl http://localhost:9090/metrics | grep llm_duration

Solutions:

  1. Increase cache size → Improve hit rate
  2. Enable semantic caching → Reduce LLM calls
  3. Use faster LLM model → Reduce LLM latency
  4. Add more workers → Increase concurrency

High Memory Usage

Symptom: Memory > 4GB per instance

Diagnosis:

# Memory profile
./scripts/profile_performance.sh memory
# Check cache size
curl http://localhost:9090/metrics | grep cache_size_bytes

Solutions:

  1. Reduce cache size → Lower memory footprint
  2. Enable context compression → Smaller sessions
  3. Reduce max_turns → Less history
  4. Implement aggressive eviction → Free memory

Low Throughput

Symptom: QPS < 50

Diagnosis:

# Check queue depth
curl http://localhost:9090/metrics | grep queue_depth
# Check worker utilization
curl http://localhost:9090/metrics | grep worker_busy_ratio

Solutions:

  1. Increase worker_threads → More parallelism
  2. Increase max_concurrent_requests → Higher queue
  3. Optimize database pool → Reduce contention
  4. Enable request batching → Better efficiency

Cache Misses

Symptom: Hit rate < 50%

Diagnosis:

# Analyze cache patterns
curl http://localhost:9090/metrics | grep cache_
# Check similarity threshold
grep similarity_threshold config.toml

Solutions:

  1. Lower similarity_threshold → More liberal matching
  2. Improve query normalization → Better key generation
  3. Preload common queries → Warm cache
  4. Increase cache TTL → Longer retention

Performance Checklist

Pre-Production

  • Run load tests (100+ concurrent users)
  • Profile CPU and memory usage
  • Execute BIRD benchmark subset
  • Validate cache hit rate (>70%)
  • Test circuit breaker activation
  • Verify rate limiting enforcement
  • Check monitoring and alerting
  • Review timeout configuration

Post-Deployment

  • Monitor P99 latency (<200ms target)
  • Track cache hit rate (>90% target)
  • Watch memory usage (<2MB per session)
  • Measure throughput (>150 QPS)
  • Collect user feedback
  • Analyze query patterns
  • Optimize based on production data
  • Plan capacity scaling

Conclusion

Performance optimization is an ongoing process. Use this guide as a reference for:

  1. Initial Configuration - Start with recommended settings
  2. Production Monitoring - Track key metrics continuously
  3. Iterative Tuning - Adjust based on real workload
  4. Problem Solving - Use troubleshooting section for issues

Key Takeaways

  • Cache Aggressively - 70-80% latency reduction
  • Optimize Critical Path - Focus on LLM and schema loading
  • Monitor Continuously - Metrics drive optimization
  • Scale Appropriately - Vertical first, horizontal for growth
  • Test Thoroughly - Load test before production

Target Summary

| Metric | Target | Optimization |
| --- | --- | --- |
| P99 Latency | <200ms | Cache + Fast Models |
| Memory | <2MB | Compression + Eviction |
| Throughput | 150+ QPS | Concurrency + Batching |
| Cache Hit | >90% | Semantic + Preload |

With these optimizations, HeliosDB Conversational BI delivers world-class NL2SQL performance at scale.


Version: 7.0 | Last Updated: 2025-11-14 | Maintained By: HeliosDB Team