HeliosDB NL2SQL Production Deployment Guide
Version: 1.0 Last Updated: October 28, 2025 Status: Production Ready
Table of Contents
- Deployment Overview
- Configuration
- Installation & Setup
- Monitoring & Observability
- Operational Runbooks
- Production Checklist
- Scaling Guidelines
- Troubleshooting Guide
1. Deployment Overview
1.1 Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│                        CLIENT LAYER                         │
│                 (Applications, Web UI, APIs)                │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    NL2SQL ENGINE (Core)                     │
│                                                             │
│  Query Complexity ──▶ Agentic     ◀── Code Advisor          │
│  Assessment           Decomposer      (Performance          │
│                           │            Analysis)            │
│                           ▼                                 │
│  Schema Retriever & Cache                                   │
│  (Skeleton Cache + Metadata Integration)                    │
│                           │                                 │
│                           ▼                                 │
│  LLM Translation Layer                                      │
│  (OpenAI, Anthropic, DeepSeek, Grok - Multi-provider)       │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                 STATISTICS & METADATA LAYER                 │
│                                                             │
│  Statistics Service    Metadata Adapter    Refresh Manager  │
│  (Cardinality,         (Schema             (Priority        │
│   Selectivity)          Integration)        Queue)          │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                   DATABASE METADATA LAYER                   │
│            (heliosdb-metadata: Tables, Columns,             │
│             Indexes, Constraints, Statistics)               │
└─────────────────────────────────────────────────────────────┘

1.2 Component Dependencies
Critical Dependencies:
- heliosdb-metadata: Schema service, constraint metadata
- heliosdb-nl2sql: Core NL2SQL engine, agentic decomposer, code advisor
- LLM Provider APIs: OpenAI, Anthropic, DeepSeek, Grok
- Statistics cache: In-memory (parking_lot RwLock) or Redis for distributed deployments
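The in-memory statistics cache is a read-mostly TTL map behind a lock (a parking_lot RwLock in the Rust implementation). The following Python sketch illustrates the intended semantics only; `StatisticsCache`, `put`, and `get` are illustrative names, not HeliosDB's actual API:

```python
import threading
import time


class StatisticsCache:
    """Minimal sketch of a TTL-bounded statistics cache.

    Entries older than `ttl_seconds` are treated as absent and lazily
    evicted on read. The clock is injectable so expiry is testable.
    """

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.RLock()
        self._entries = {}  # key -> (value, inserted_at)

    def put(self, key, value):
        with self._lock:
            self._entries[key] = (value, self._clock())

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return None
            value, inserted_at = entry
            if self._clock() - inserted_at > self._ttl:
                # Stale entry: evict and report a miss.
                del self._entries[key]
                return None
            return value
```

In a distributed deployment the same get/put-with-TTL contract is served by Redis instead of process memory, which is why the two backends are interchangeable in the configuration.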
Data Flow:
- Natural language query → Complexity assessment
- Simple queries → Direct translation
- Complex queries → Agentic decomposition → Task execution
- Generated SQL → Code Advisor analysis → Performance recommendations
- Statistics service → Real-time cardinality/selectivity estimates
- Metadata adapter → Schema integration → Index recommendations
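The flow above amounts to two small decisions: route on the complexity score (mirroring `nl2sql.agentic.complexity_threshold`), and walk the provider fallback chain when the primary LLM fails. A hedged sketch of that control flow — all function names here are hypothetical, not the engine's code:

```python
def route_query(complexity: int, threshold: int = 5) -> str:
    """Route a query by its complexity score (1-10).

    `threshold` mirrors nl2sql.agentic.complexity_threshold: queries
    scoring above it go through agentic decomposition, the rest are
    translated directly.
    """
    return "agentic_decomposition" if complexity > threshold else "direct_translation"


def translate_with_fallback(query, providers, translate):
    """Try providers in order (default_provider first, then the
    fallback_providers list), returning the first success."""
    errors = {}
    for provider in providers:
        try:
            return provider, translate(provider, query)
        except RuntimeError as exc:  # e.g. provider timeout or outage
            errors[provider] = str(exc)
    raise RuntimeError(f"all providers failed: {errors}")
```

Raising or lowering the threshold trades latency for decomposition quality, which is why the tuning sections below adjust it per workload.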
1.3 System Requirements
Minimum Production Requirements:
- CPU: 4 cores (8 cores recommended for high throughput)
- Memory: 8GB RAM (16GB+ for large schemas)
- Storage: 50GB SSD (for caching, logs, statistics)
- Network: 1Gbps (low latency to LLM APIs critical)
- OS: Linux (Ubuntu 22.04 LTS, RHEL 8+, or equivalent)
Recommended Production Configuration:
- CPU: 16+ cores for parallel task execution
- Memory: 32GB+ RAM for large-scale statistics caching
- Storage: 500GB+ NVMe SSD for high IOPS
- Network: 10Gbps with dedicated LLM API egress
- Database: PostgreSQL 14+, MySQL 8.0+, or HeliosDB native
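As a rough cross-check of the figures above, hardware can be sized from expected peak query rate. This is a sketch under the same rule of thumb the capacity planning section uses (~100 qps per 4 cores / 8GB RAM); `estimate_resources` is a hypothetical helper, not a HeliosDB tool:

```python
import math


def estimate_resources(peak_qps: float) -> tuple:
    """Estimate (cpu_cores, memory_gb) for a given peak query rate.

    Assumption: ~100 qps needs about 4 CPU cores and 8GB RAM, with the
    4-core / 8GB minimum production floor from this guide applied.
    """
    scale = peak_qps / 100.0
    cores = max(4, math.ceil(scale * 4))
    memory_gb = max(8, math.ceil(scale * 8))
    return cores, memory_gb
```

For example, a 400 qps peak lands at roughly the recommended 16-core / 32GB configuration, while low-traffic deployments bottom out at the stated minimums.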
1.4 Prerequisites
Software:
- Rust 1.70+ (for compilation)
- PostgreSQL 14+ or MySQL 8.0+ (for metadata storage)
- Redis 7+ (optional, for distributed caching)
- Prometheus + Grafana (for monitoring)
- Docker + Docker Compose (optional, for containerized deployment)
API Keys:
- OpenAI API key (GPT-4, GPT-3.5-turbo)
- Anthropic API key (Claude 3 Opus/Sonnet)
- DeepSeek API key (optional)
- Grok API key (optional)
Network Access:
- Outbound HTTPS to LLM provider endpoints
- Inbound on application ports (default: 8080 for HTTP API)
- Database connectivity (PostgreSQL: 5432, MySQL: 3306)
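Before first start it is worth verifying that the required secrets from this section are actually present in the environment. A minimal preflight sketch (the `missing_required` helper and the variable list reflect this guide's setup, not a shipped HeliosDB command):

```python
import os

# Required by the deployment described in this guide; provider keys for
# DeepSeek/Grok and REDIS_URL are optional and deliberately omitted.
REQUIRED_ENV = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "DATABASE_URL", "JWT_SECRET"]


def missing_required(env: dict) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


# Example usage before starting the service:
#   missing = missing_required(dict(os.environ))
#   if missing: abort with the list of missing variables
```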
2. Configuration
2.1 Complete TOML Configuration Reference
/etc/heliosdb/nl2sql.toml:
# ===== NL2SQL Core Configuration =====
[nl2sql]
# Enable/disable NL2SQL engine
enabled = true

# Query timeout (seconds)
query_timeout = 30

# Maximum concurrent queries
max_concurrent_queries = 100

# Cache size (MB) for skeleton cache
cache_size_mb = 500

# Cache TTL (seconds)
cache_ttl = 3600

# Default LLM provider (openai, anthropic, deepseek, grok)
default_provider = "openai"

# Fallback providers (used if primary fails)
fallback_providers = ["anthropic", "deepseek"]
# ===== Agentic Decomposition =====
[nl2sql.agentic]
# Enable agentic task decomposition for complex queries
enabled = true

# Complexity threshold (1-10) - queries above this trigger decomposition
complexity_threshold = 5

# Maximum tasks per decomposition
max_tasks = 10

# Maximum execution stages
max_stages = 5

# Agentic execution timeout (seconds)
timeout_seconds = 30

# Enable parallel task execution
parallel_execution = true

# Maximum parallel tasks per stage
max_parallel_tasks = 4
# ===== Code Advisor =====
[nl2sql.advisor]
# Enable Code Advisor for query optimization
enabled = true

# Automatically recommend indexes
auto_recommend_indexes = true

# Minimum quality score threshold (0-100)
min_quality_score = 50

# Enable statistics integration
use_statistics = true

# Enable metadata integration
use_metadata = true

# Bottleneck detection threshold
bottleneck_severity_threshold = 6
# ===== Statistics Service =====
[nl2sql.statistics]
# Enable statistics service
enabled = true

# Statistics refresh interval (seconds)
refresh_interval_seconds = 3600

# Cache size for table statistics (MB)
cache_size_mb = 100

# Cache size for column statistics (MB)
column_cache_size_mb = 200

# Use histogram-based estimates
enable_histograms = true

# Histogram bucket count
histogram_buckets = 100

# Minimum table size (rows) for statistics collection
min_table_size = 1000
# ===== Metadata Adapter =====
[nl2sql.metadata]
# Enable metadata integration
enabled = true

# Metadata refresh interval (seconds)
refresh_interval = 3600

# Collect constraint metadata
collect_constraints = true

# Collect index metadata
collect_indexes = true

# Collect foreign key metadata
collect_foreign_keys = true

# Max concurrent metadata collection tasks
max_concurrent_collections = 10
# ===== Statistics Refresh Manager =====
[nl2sql.refresh_manager]
# Enable background statistics refresh
enabled = true

# Refresh queue capacity
queue_capacity = 1000

# Worker threads for refresh tasks
worker_threads = 4

# Priority levels (higher = more urgent)
high_priority_threshold = 8
medium_priority_threshold = 5

# Auto-refresh on query (adaptive)
auto_refresh_on_query = true

# Staleness threshold (seconds) - trigger refresh if stats older than this
staleness_threshold = 7200
# ===== LLM Provider Configuration =====
[llm.openai]
# API key (or use OPENAI_API_KEY environment variable)
api_key = "${OPENAI_API_KEY}"

# Model to use
model = "gpt-4-turbo-preview"

# API endpoint
endpoint = "https://api.openai.com/v1"

# Request timeout (seconds)
timeout = 30

# Max retries
max_retries = 3

# Rate limit (requests per minute)
rate_limit = 500

# Temperature (0.0-1.0)
temperature = 0.0

# Max tokens
max_tokens = 2000
[llm.anthropic]
api_key = "${ANTHROPIC_API_KEY}"
model = "claude-3-5-sonnet-20241022"
endpoint = "https://api.anthropic.com/v1"
timeout = 30
max_retries = 3
rate_limit = 100
temperature = 0.0
max_tokens = 4000

[llm.deepseek]
api_key = "${DEEPSEEK_API_KEY}"
model = "deepseek-chat"
endpoint = "https://api.deepseek.com/v1"
timeout = 30
max_retries = 3
rate_limit = 200
temperature = 0.0
max_tokens = 2000

[llm.grok]
api_key = "${GROK_API_KEY}"
model = "grok-1"
endpoint = "https://api.x.ai/v1"
timeout = 30
max_retries = 3
rate_limit = 100
temperature = 0.0
max_tokens = 2000
# ===== Database Connection =====
[database]
# Database type (postgres, mysql, heliosdb)
type = "postgres"

# Connection string
connection_string = "postgresql://heliosdb:password@localhost:5432/heliosdb_metadata"

# Connection pool size
pool_size = 20

# Connection timeout (seconds)
connection_timeout = 10

# Query timeout (seconds)
query_timeout = 30
# ===== Distributed Caching (Redis) =====
[cache.redis]
# Enable Redis for distributed caching
enabled = false

# Redis connection string
url = "redis://localhost:6379"

# Key prefix for NL2SQL cache entries
key_prefix = "nl2sql:"

# Connection pool size
pool_size = 10
# ===== Observability =====
[observability]
# Enable metrics collection
metrics_enabled = true

# Metrics port (Prometheus scrape endpoint)
metrics_port = 9090

# Metrics path
metrics_path = "/metrics"

# Enable distributed tracing
tracing_enabled = true

# Tracing exporter (jaeger, zipkin, otlp)
tracing_exporter = "jaeger"

# Jaeger agent endpoint
jaeger_agent_endpoint = "localhost:6831"

# Log level (trace, debug, info, warn, error)
log_level = "info"

# Structured logging format (json, text)
log_format = "json"

# Log output (stdout, file)
log_output = "stdout"

# Log file path (if log_output = "file")
log_file = "/var/log/heliosdb/nl2sql.log"
# ===== Security =====
[security]
# Enable API authentication
require_auth = true

# JWT secret (or use JWT_SECRET environment variable)
jwt_secret = "${JWT_SECRET}"

# JWT expiration (seconds)
jwt_expiration = 3600

# Enable TLS
enable_tls = true

# TLS certificate path
tls_cert = "/etc/heliosdb/certs/server.crt"

# TLS key path
tls_key = "/etc/heliosdb/certs/server.key"

# Rate limiting (requests per minute per client)
rate_limit_per_client = 100
# ===== Feature Flags =====
[features]
# Enable experimental features
experimental = false

# Enable A/B testing
ab_testing = false

# A/B test split (0.0-1.0)
ab_test_split = 0.5

# Enable verbose error messages (disable in production)
verbose_errors = false

2.2 Environment Variables
Required:
# LLM API Keys
export OPENAI_API_KEY="sk-proj-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export DEEPSEEK_API_KEY="..."  # Optional
export GROK_API_KEY="..."      # Optional

# Database
export DATABASE_URL="postgresql://user:pass@localhost:5432/heliosdb"

# Security
export JWT_SECRET="your-256-bit-secret-key-here"

# Deployment
export RUST_LOG="info,heliosdb_nl2sql=debug"
export RUST_BACKTRACE="1"

Optional:

# Redis (for distributed caching)
export REDIS_URL="redis://localhost:6379"

# Observability
export JAEGER_AGENT_HOST="localhost"
export JAEGER_AGENT_PORT="6831"

# Performance tuning
export NL2SQL_WORKER_THREADS="16"
export NL2SQL_MAX_CONCURRENT_QUERIES="200"

2.3 Performance Tuning Parameters
CPU-Bound Workloads (complex decomposition):
[nl2sql.agentic]
max_parallel_tasks = 8  # Match CPU core count
parallel_execution = true
timeout_seconds = 60

[nl2sql.refresh_manager]
worker_threads = 8  # Match CPU core count

Memory-Bound Workloads (large schemas):

[nl2sql]
cache_size_mb = 2000  # Increase for large skeleton cache

[nl2sql.statistics]
cache_size_mb = 500
column_cache_size_mb = 1000
enable_histograms = true
histogram_buckets = 200

High-Throughput Workloads (many concurrent queries):

[nl2sql]
max_concurrent_queries = 500

[database]
pool_size = 50

[cache.redis]
enabled = true
pool_size = 20

Low-Latency Workloads (fast response times):

[nl2sql]
query_timeout = 10
cache_ttl = 7200  # Longer cache retention

[llm.openai]
model = "gpt-3.5-turbo"  # Faster, lower latency
timeout = 10

[nl2sql.agentic]
complexity_threshold = 7  # Avoid decomposition for most queries

2.4 Security Settings
Production Security Hardening:
[security]
require_auth = true
enable_tls = true
rate_limit_per_client = 50  # Stricter rate limit

# TLS 1.3 only
tls_min_version = "1.3"

# Disable verbose errors (prevent information disclosure)
[features]
verbose_errors = false

# Enable audit logging
[observability]
audit_log_enabled = true
audit_log_path = "/var/log/heliosdb/audit.log"

API Key Rotation:

# Store API keys in a secret manager (AWS Secrets Manager, HashiCorp Vault)
export OPENAI_API_KEY=$(aws secretsmanager get-secret-value \
  --secret-id heliosdb/nl2sql/openai --query SecretString --output text)

3. Installation & Setup
3.1 Step-by-Step Deployment Procedure
Option A: Docker Deployment (Recommended for Quick Start)
Step 1: Build Docker image
cd /path/to/heliosdb
docker build -t heliosdb-nl2sql:latest -f docker/Dockerfile.nl2sql .

Step 2: Create configuration

mkdir -p /etc/heliosdb
cp config/nl2sql.toml.example /etc/heliosdb/nl2sql.toml
# Edit /etc/heliosdb/nl2sql.toml with your settings

Step 3: Run with Docker Compose

version: '3.8'

services:
  nl2sql:
    image: heliosdb-nl2sql:latest
    ports:
      - "8080:8080"
      - "9090:9090"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - DATABASE_URL=${DATABASE_URL}
      - JWT_SECRET=${JWT_SECRET}
      - RUST_LOG=info
    volumes:
      - /etc/heliosdb/nl2sql.toml:/app/config.toml:ro
      - nl2sql-cache:/var/cache/nl2sql
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  postgres:
    image: postgres:14
    environment:
      POSTGRES_DB: heliosdb_metadata
      POSTGRES_USER: heliosdb
      POSTGRES_PASSWORD: secure_password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data

volumes:
  nl2sql-cache:
  postgres-data:
  redis-data:

docker-compose up -d

Option B: Native Deployment (Production)
Step 1: Compile HeliosDB with NL2SQL feature
cd /path/to/heliosdb
cargo build --release --features nl2sql,statistics,advisor

sudo cp target/release/heliosdb /usr/local/bin/

Step 2: Create system user

sudo useradd -r -s /bin/false -d /var/lib/heliosdb heliosdb
sudo mkdir -p /var/lib/heliosdb /var/log/heliosdb /etc/heliosdb
sudo chown -R heliosdb:heliosdb /var/lib/heliosdb /var/log/heliosdb

Step 3: Install configuration

sudo cp config/nl2sql.toml /etc/heliosdb/
sudo chmod 640 /etc/heliosdb/nl2sql.toml
sudo chown root:heliosdb /etc/heliosdb/nl2sql.toml

Step 4: Create systemd service

sudo tee /etc/systemd/system/heliosdb-nl2sql.service > /dev/null <<EOF
[Unit]
Description=HeliosDB NL2SQL Engine
After=network.target postgresql.service

[Service]
Type=simple
User=heliosdb
Group=heliosdb
WorkingDirectory=/var/lib/heliosdb
ExecStart=/usr/local/bin/heliosdb nl2sql --config /etc/heliosdb/nl2sql.toml
Restart=on-failure
RestartSec=10
LimitNOFILE=65536

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/heliosdb /var/log/heliosdb

# Environment
Environment="RUST_LOG=info,heliosdb_nl2sql=debug"
EnvironmentFile=/etc/heliosdb/nl2sql.env

[Install]
WantedBy=multi-user.target
EOF

Step 5: Create environment file

sudo tee /etc/heliosdb/nl2sql.env > /dev/null <<EOF
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
DATABASE_URL=postgresql://heliosdb:password@localhost:5432/heliosdb_metadata
JWT_SECRET=your-256-bit-secret-here
EOF

sudo chmod 600 /etc/heliosdb/nl2sql.env
sudo chown root:heliosdb /etc/heliosdb/nl2sql.env

Step 6: Enable and start service

sudo systemctl daemon-reload
sudo systemctl enable heliosdb-nl2sql
sudo systemctl start heliosdb-nl2sql
sudo systemctl status heliosdb-nl2sql

3.2 Database Initialization
Step 1: Create metadata database
-- PostgreSQL
CREATE DATABASE heliosdb_metadata;
CREATE USER heliosdb WITH PASSWORD 'secure_password';
GRANT ALL PRIVILEGES ON DATABASE heliosdb_metadata TO heliosdb;

-- MySQL
CREATE DATABASE heliosdb_metadata CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'heliosdb'@'localhost' IDENTIFIED BY 'secure_password';
GRANT ALL PRIVILEGES ON heliosdb_metadata.* TO 'heliosdb'@'localhost';
FLUSH PRIVILEGES;

Step 2: Run schema migrations

heliosdb migrate --database $DATABASE_URL --migrations ./migrations/nl2sql

Step 3: Verify schema

-- PostgreSQL
\dt
\di

-- Expected tables:
-- - nl2sql_cache
-- - nl2sql_statistics_table
-- - nl2sql_statistics_column
-- - nl2sql_refresh_queue
-- - nl2sql_audit_log

3.3 Schema Setup
Step 1: Initialize schema metadata (for target database)
# Connect NL2SQL to your application database
heliosdb nl2sql schema sync \
  --source-db "postgresql://app_user:pass@localhost:5432/app_db" \
  --target-metadata "postgresql://heliosdb:pass@localhost:5432/heliosdb_metadata"

Step 2: Verify schema sync

SELECT table_name, column_count, last_updated
FROM metadata_tables
ORDER BY table_name;

Step 3: Configure schema refresh

[nl2sql.metadata]
enabled = true
refresh_interval = 3600  # Re-sync every hour

# Auto-detect schema changes
auto_sync_on_ddl = true

3.4 Initial Statistics Collection
Step 1: Collect baseline statistics
heliosdb nl2sql statistics collect --all-tables

Step 2: Verify statistics

SELECT
  ts.table_name,
  ts.row_count,
  ts.size_bytes,
  COUNT(cs.column_name) AS columns_with_stats
FROM nl2sql_statistics_table ts
LEFT JOIN nl2sql_statistics_column cs ON ts.table_name = cs.table_name
GROUP BY ts.table_name, ts.row_count, ts.size_bytes
ORDER BY ts.row_count DESC;

Step 3: Schedule periodic statistics refresh

# Cron job for daily statistics refresh
0 2 * * * /usr/local/bin/heliosdb nl2sql statistics refresh --priority auto

Step 4: Monitor statistics staleness

SELECT
  table_name,
  last_updated,
  NOW() - last_updated AS staleness
FROM nl2sql_statistics_table
WHERE NOW() - last_updated > INTERVAL '24 hours'
ORDER BY staleness DESC;

4. Monitoring & Observability
4.1 Key Metrics to Monitor
Query Performance Metrics:
# Query latency (histogram)
nl2sql_query_duration_seconds{quantile="0.5"}
nl2sql_query_duration_seconds{quantile="0.95"}
nl2sql_query_duration_seconds{quantile="0.99"}

# Query throughput
rate(nl2sql_queries_total[5m])

# Query success rate
rate(nl2sql_queries_total{status="success"}[5m]) /
rate(nl2sql_queries_total[5m])

# LLM latency
nl2sql_llm_request_duration_seconds{provider="openai"}
nl2sql_llm_request_duration_seconds{provider="anthropic"}

# Cache hit rate
rate(nl2sql_cache_hits[5m]) /
(rate(nl2sql_cache_hits[5m]) + rate(nl2sql_cache_misses[5m]))

Agentic Decomposition Metrics:

# Decomposition frequency
rate(nl2sql_agentic_decompositions_total[5m])

# Average task count per decomposition
nl2sql_agentic_tasks_per_query{quantile="0.95"}

# Stage execution time
nl2sql_agentic_stage_duration_seconds{stage="1"}
nl2sql_agentic_stage_duration_seconds{stage="2"}

# Parallel efficiency
nl2sql_agentic_parallel_speedup_ratio

Code Advisor Metrics:

# Advisor analysis frequency
rate(nl2sql_advisor_analyses_total[5m])

# Quality scores
nl2sql_advisor_quality_score{quantile="0.5"}

# Bottleneck detection rate
rate(nl2sql_advisor_bottlenecks_detected[5m])

# Index recommendation rate
rate(nl2sql_advisor_index_recommendations[5m])

Statistics Service Metrics:

# Statistics cache size
nl2sql_statistics_cache_entries{type="table"}
nl2sql_statistics_cache_entries{type="column"}

# Statistics refresh rate
rate(nl2sql_statistics_refreshes_total[5m])

# Statistics staleness
nl2sql_statistics_age_seconds{quantile="0.95"}

# Cardinality estimation accuracy (if ground truth available)
nl2sql_statistics_estimation_error_ratio

System Resource Metrics:

# Memory usage
process_resident_memory_bytes
nl2sql_cache_memory_bytes

# CPU utilization
rate(process_cpu_seconds_total[5m])

# Connection pool usage
nl2sql_db_connections_active
nl2sql_db_connections_idle

4.2 Grafana Dashboard Configuration
Dashboard JSON (/etc/grafana/dashboards/nl2sql.json):
{
  "dashboard": {
    "title": "HeliosDB NL2SQL Production Dashboard",
    "panels": [
      {
        "title": "Query Throughput",
        "type": "graph",
        "targets": [
          { "expr": "rate(nl2sql_queries_total[5m])", "legendFormat": "Total Queries/sec" },
          { "expr": "rate(nl2sql_queries_total{status=\"success\"}[5m])", "legendFormat": "Success" },
          { "expr": "rate(nl2sql_queries_total{status=\"error\"}[5m])", "legendFormat": "Errors" }
        ]
      },
      {
        "title": "Query Latency (p50, p95, p99)",
        "type": "graph",
        "targets": [
          { "expr": "nl2sql_query_duration_seconds{quantile=\"0.5\"}", "legendFormat": "p50" },
          { "expr": "nl2sql_query_duration_seconds{quantile=\"0.95\"}", "legendFormat": "p95" },
          { "expr": "nl2sql_query_duration_seconds{quantile=\"0.99\"}", "legendFormat": "p99" }
        ]
      },
      {
        "title": "Cache Hit Rate",
        "type": "gauge",
        "targets": [
          { "expr": "rate(nl2sql_cache_hits[5m]) / (rate(nl2sql_cache_hits[5m]) + rate(nl2sql_cache_misses[5m]))", "legendFormat": "Hit Rate" }
        ],
        "thresholds": [
          { "value": 0.7, "color": "green" },
          { "value": 0.5, "color": "yellow" },
          { "value": 0, "color": "red" }
        ]
      },
      {
        "title": "LLM Provider Latency",
        "type": "graph",
        "targets": [
          { "expr": "nl2sql_llm_request_duration_seconds{provider=\"openai\", quantile=\"0.95\"}", "legendFormat": "OpenAI p95" },
          { "expr": "nl2sql_llm_request_duration_seconds{provider=\"anthropic\", quantile=\"0.95\"}", "legendFormat": "Anthropic p95" }
        ]
      },
      {
        "title": "Agentic Decomposition Rate",
        "type": "graph",
        "targets": [
          { "expr": "rate(nl2sql_agentic_decompositions_total[5m])", "legendFormat": "Decompositions/sec" },
          { "expr": "nl2sql_agentic_tasks_per_query{quantile=\"0.95\"}", "legendFormat": "Tasks per Query (p95)" }
        ]
      },
      {
        "title": "Statistics Cache Health",
        "type": "stat",
        "targets": [
          { "expr": "nl2sql_statistics_cache_entries{type=\"table\"}", "legendFormat": "Table Stats" },
          { "expr": "nl2sql_statistics_cache_entries{type=\"column\"}", "legendFormat": "Column Stats" }
        ]
      }
    ]
  }
}

Quick Setup:

# Import dashboard
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @/etc/grafana/dashboards/nl2sql.json

4.3 Alert Thresholds
Prometheus Alerting Rules (/etc/prometheus/rules/nl2sql.yml):
groups:
  - name: nl2sql_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: NL2SQLHighErrorRate
        expr: |
          rate(nl2sql_queries_total{status="error"}[5m]) /
          rate(nl2sql_queries_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NL2SQL error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Query latency too high
      - alert: NL2SQLHighLatency
        expr: nl2sql_query_duration_seconds{quantile="0.95"} > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NL2SQL p95 latency above 10s"
          description: "p95 latency is {{ $value }}s"

      # LLM provider failure
      - alert: NL2SQLLLMProviderDown
        expr: |
          rate(nl2sql_llm_requests_total{status="error"}[5m]) /
          rate(nl2sql_llm_requests_total[5m]) > 0.5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM provider {{ $labels.provider }} has high failure rate"
          description: "Failure rate is {{ $value | humanizePercentage }}"

      # Cache hit rate too low
      - alert: NL2SQLLowCacheHitRate
        expr: |
          rate(nl2sql_cache_hits[5m]) /
          (rate(nl2sql_cache_hits[5m]) + rate(nl2sql_cache_misses[5m])) < 0.5
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "NL2SQL cache hit rate below 50%"
          description: "Hit rate is {{ $value | humanizePercentage }}"

      # Statistics stale
      - alert: NL2SQLStaleStatistics
        expr: nl2sql_statistics_age_seconds{quantile="0.95"} > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "NL2SQL statistics are stale (>24h)"
          description: "p95 age is {{ $value | humanizeDuration }}"

      # High memory usage
      - alert: NL2SQLHighMemoryUsage
        expr: nl2sql_cache_memory_bytes > 10737418240  # 10GB
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "NL2SQL cache memory usage above 10GB"
          description: "Current usage: {{ $value | humanize1024 }}B"

      # Connection pool exhaustion
      - alert: NL2SQLConnectionPoolExhausted
        expr: |
          nl2sql_db_connections_active /
          (nl2sql_db_connections_active + nl2sql_db_connections_idle) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "NL2SQL connection pool >90% utilized"
          description: "Utilization: {{ $value | humanizePercentage }}"

4.4 Log Aggregation Setup
Structured JSON Logging (configured in nl2sql.toml):
[observability]
log_format = "json"
log_output = "stdout"

Example JSON Log:

{
  "timestamp": "2025-10-28T10:23:45.123Z",
  "level": "INFO",
  "target": "heliosdb_nl2sql::engine",
  "message": "Query translation completed",
  "fields": {
    "query_id": "q-123456",
    "nl_query": "Show top 10 customers by revenue",
    "complexity": 7,
    "agentic_decomposition": true,
    "tasks": 4,
    "latency_ms": 1523,
    "llm_provider": "openai",
    "cache_hit": false,
    "confidence": 0.92
  },
  "span": {
    "name": "translate_query",
    "trace_id": "abc123def456"
  }
}

Elasticsearch Integration (via Filebeat):
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    json.keys_under_root: true
    json.add_error_key: true
    processors:
      - add_fields:
          target: ''
          fields:
            service: heliosdb-nl2sql

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "heliosdb-nl2sql-%{+yyyy.MM.dd}"

Log Queries (Elasticsearch/Kibana):
# Find slow queries
GET heliosdb-nl2sql-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "target": "heliosdb_nl2sql::engine" }},
        { "range": { "fields.latency_ms": { "gte": 5000 }}}
      ]
    }
  },
  "sort": [{ "fields.latency_ms": "desc" }]
}

# Analyze error patterns
GET heliosdb-nl2sql-*/_search
{
  "query": { "match": { "level": "ERROR" }},
  "aggs": {
    "error_types": {
      "terms": { "field": "fields.error_type.keyword" }
    }
  }
}

4.5 Health Check Endpoints
Liveness Probe (checks if service is running):
curl -f http://localhost:8080/health/liveness
# Response: { "status": "alive", "timestamp": "2025-10-28T10:23:45Z" }

Readiness Probe (checks if service can accept traffic):

curl -f http://localhost:8080/health/readiness
# Response:
# {
#   "status": "ready",
#   "checks": {
#     "database": "ok",
#     "llm_providers": "ok",
#     "statistics_cache": "ok"
#   },
#   "timestamp": "2025-10-28T10:23:45Z"
# }

Startup Probe (checks if service has started):

curl -f http://localhost:8080/health/startup
# Response: { "status": "started", "uptime_seconds": 3600 }

5. Operational Runbooks
5.1 Common Operational Tasks
Task 1: Refresh Statistics Manually
When to Use: After bulk data load, schema changes, or when statistics are stale.
Procedure:
# Refresh all tables
heliosdb nl2sql statistics refresh --all

# Refresh specific table
heliosdb nl2sql statistics refresh --table customers --priority high

# Refresh based on staleness
heliosdb nl2sql statistics refresh --stale-threshold 24h --priority auto

Verification:

SELECT table_name, last_updated
FROM nl2sql_statistics_table
WHERE table_name = 'customers';

Task 2: Adjust Refresh Priorities
When to Use: To optimize statistics freshness for critical tables.
Procedure:
# Set high priority for critical tables
heliosdb nl2sql refresh-manager set-priority \
  --table orders --priority 9

heliosdb nl2sql refresh-manager set-priority \
  --table customers --priority 8

# Lower priority for archival tables
heliosdb nl2sql refresh-manager set-priority \
  --table historical_data --priority 2

Configuration:

[nl2sql.refresh_manager]
high_priority_threshold = 8
medium_priority_threshold = 5

Task 3: Clear Cache (Emergency)
When to Use: After schema changes, corrupt cache, or to force re-translation.
Procedure:
# Clear entire cache
heliosdb nl2sql cache clear --all

# Clear specific table entries
heliosdb nl2sql cache clear --table customers

# Clear stale entries only
heliosdb nl2sql cache clear --older-than 7d

Redis Cache Clear:

redis-cli --scan --pattern "nl2sql:*" | xargs redis-cli DEL

Task 4: Update LLM Provider Priorities
When to Use: When primary provider is down or has degraded performance.
Procedure:
# Update config
[nl2sql]
default_provider = "anthropic"  # Switch from OpenAI
fallback_providers = ["openai", "deepseek"]

# Or via environment variable
export NL2SQL_DEFAULT_PROVIDER="anthropic"

Reload Configuration (without restart):

sudo systemctl reload heliosdb-nl2sql
# Or send SIGHUP
sudo kill -HUP $(pgrep heliosdb)

Task 5: Enable/Disable Agentic Decomposition
When to Use: To reduce latency or troubleshoot complex query issues.
Procedure:
# Disable agentic decomposition
[nl2sql.agentic]
enabled = false

# Or increase threshold to reduce usage
complexity_threshold = 8  # Only for very complex queries

Runtime Toggle (via API):

curl -X POST http://localhost:8080/admin/features \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -d '{"agentic_enabled": false}'

5.2 Troubleshooting Procedures
Issue: High Query Latency
Symptoms:
- p95 latency >5s
- Timeout errors
- User complaints
Decision Tree:
High Latency?
├─ Check LLM Provider Latency
│  ├─ >3s? → Switch provider or increase timeout
│  └─ <3s → Continue
├─ Check Agentic Decomposition
│  ├─ Many tasks (>6)? → Increase complexity_threshold
│  └─ Few tasks → Continue
├─ Check Statistics Staleness
│  ├─ >24h? → Refresh statistics
│  └─ Fresh → Continue
└─ Check Database Connection
   ├─ Pool exhausted? → Increase pool_size
   └─ OK → Check cache hit rate

Actions:
# 1. Check current latency breakdown
curl http://localhost:9090/metrics | grep nl2sql_query_duration

# 2. Identify bottleneck
# - If LLM latency high: Switch provider
# - If many tasks: Increase complexity threshold
# - If cache miss: Warm up cache

# 3. Temporarily disable agentic for latency-sensitive queries
curl -X POST http://localhost:8080/admin/features \
  -d '{"agentic_complexity_threshold": 10}'

Issue: LLM Provider Failures
Symptoms:
- `llm_request_error` entries in logs
- Increased error rate
- Fallback provider usage spiking
Diagnosis:
# Check LLM provider status
curl http://localhost:9090/metrics | grep nl2sql_llm_requests_total

# Check error logs
journalctl -u heliosdb-nl2sql -n 100 | grep "llm_request_error"

Actions:

# 1. Verify API key validity
echo $OPENAI_API_KEY | wc -c  # Should be ~50+ chars

# 2. Test API connectivity
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"

# 3. Switch to fallback provider
export NL2SQL_DEFAULT_PROVIDER="anthropic"
sudo systemctl reload heliosdb-nl2sql

# 4. Monitor fallback performance
watch -n 5 'curl -s http://localhost:9090/metrics | grep nl2sql_llm_requests_total'

Issue: Stale Statistics
Symptoms:
- Poor query plans
- Unexpected full table scans
- Inaccurate cardinality estimates
Diagnosis:
SELECT
  table_name,
  last_updated,
  NOW() - last_updated AS staleness
FROM nl2sql_statistics_table
WHERE NOW() - last_updated > INTERVAL '24 hours'
ORDER BY staleness DESC;

Actions:

# 1. Identify stale tables
heliosdb nl2sql statistics list --stale-threshold 24h

# 2. Prioritize refresh
for table in orders customers products; do
  heliosdb nl2sql statistics refresh --table $table --priority high
done

# 3. Verify refresh
watch -n 10 'psql -c "SELECT table_name, last_updated FROM nl2sql_statistics_table WHERE table_name IN ('"'"'orders'"'"', '"'"'customers'"'"');"'

# 4. Update refresh schedule
# Edit cron job to run more frequently
crontab -e
# Change: 0 2 * * * → 0 */6 * * * (every 6 hours)

Issue: Memory Exhaustion
Symptoms:
- OOM kills
- High swap usage
- Cache eviction rate high
Diagnosis:
# Check memory usage
free -h
ps aux | grep heliosdb | awk '{print $2, $4, $6}' | sort -k3 -rn | head -1

# Check cache size
curl http://localhost:9090/metrics | grep nl2sql_cache_memory_bytes

Actions:

# 1. Reduce cache sizes
[nl2sql]
cache_size_mb = 250  # Reduce from 500

[nl2sql.statistics]
cache_size_mb = 50  # Reduce from 100
column_cache_size_mb = 100  # Reduce from 200

# 2. Enable Redis for distributed caching (offload memory)
[cache.redis]
enabled = true

# 3. Restart service
sudo systemctl restart heliosdb-nl2sql

# 4. Monitor memory
watch -n 5 'free -h && curl -s http://localhost:9090/metrics | grep cache_memory'

5.3 Performance Optimization
Optimization 1: Increase Cache Hit Rate
Target: >70% cache hit rate
Actions:
# 1. Increase cache TTL
[nl2sql]
cache_ttl = 7200  # 2 hours

# 2. Pre-warm cache with common queries
# Create warmup script
#!/bin/bash
QUERIES=(
  "Show all active customers"
  "List top 10 products by sales"
  "Count orders by status"
)

for query in "${QUERIES[@]}"; do
  curl -X POST http://localhost:8080/api/nl2sql/translate \
    -H "Content-Type: application/json" \
    -d "{\"query\": \"$query\"}"
done

Optimization 2: Reduce Agentic Decomposition Overhead
Target: <20% of queries use decomposition
Actions:
# Increase complexity threshold
[nl2sql.agentic]
complexity_threshold = 7  # Up from 5
max_tasks = 6  # Cap at 6 tasks

Monitor Impact:

# Before and after comparison
curl -s http://localhost:9090/metrics | grep nl2sql_agentic_decompositions_total

Optimization 3: Optimize Statistics Collection
Target: Statistics refresh <5 minutes
Actions:
# 1. Identify slow tables
heliosdb nl2sql statistics analyze --report

# 2. Disable histogram collection for large tables
[nl2sql.statistics]
enable_histograms = false  # Or reduce buckets
histogram_buckets = 50  # Down from 100
min_table_size = 10000  # Skip small tables

5.4 Capacity Planning
Metrics to Track:
# Query rate growth
rate(nl2sql_queries_total[1d]) - rate(nl2sql_queries_total[1d] offset 7d)

# P95 latency trend
nl2sql_query_duration_seconds{quantile="0.95"} offset 7d

# Memory growth rate
delta(nl2sql_cache_memory_bytes[1d])

Capacity Projections:
```python
# Example: Estimate capacity needs
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Historical data (queries per day)
data = pd.Series([1000, 1200, 1350, 1500, 1700, 2000, 2400])

# Linear regression for 30-day projection
X = np.array(range(len(data))).reshape(-1, 1)
y = data.values
model = LinearRegression().fit(X, y)

# Predict day 30
predicted = model.predict([[30]])[0]
print(f"Projected queries/day in 30 days: {predicted:.0f}")

# Calculate required resources
# Assume: 100 qps = 4 CPU cores, 8GB RAM
current_qps = 2400 / 86400    # ~0.03 qps daily average; size for peak qps instead
projected_qps = predicted / 86400
cpu_cores_needed = (projected_qps / 100) * 4
memory_gb_needed = (projected_qps / 100) * 8

print(f"Resources needed:")
print(f"  CPU cores: {cpu_cores_needed:.0f}")
print(f"  Memory: {memory_gb_needed:.0f}GB")
```

5.5 Backup and Recovery
Backup Strategy:
1. Configuration Backup:
```bash
#!/bin/bash
BACKUP_DIR="/var/backups/heliosdb-nl2sql"
DATE=$(date +%Y%m%d_%H%M%S)

mkdir -p $BACKUP_DIR
tar -czf $BACKUP_DIR/config-$DATE.tar.gz \
  /etc/heliosdb/nl2sql.toml \
  /etc/heliosdb/nl2sql.env

# Retain 30 days
find $BACKUP_DIR -name "config-*.tar.gz" -mtime +30 -delete
```

2. Statistics Backup (PostgreSQL):
```bash
#!/bin/bash
BACKUP_DIR="/var/backups/heliosdb-nl2sql"
DATE=$(date +%Y%m%d_%H%M%S)

pg_dump -h localhost -U heliosdb \
  -t nl2sql_statistics_table \
  -t nl2sql_statistics_column \
  heliosdb_metadata | gzip > $BACKUP_DIR/statistics-$DATE.sql.gz

# Retain 7 days
find $BACKUP_DIR -name "statistics-*.sql.gz" -mtime +7 -delete
```

3. Cache Snapshot (Redis):
```bash
#!/bin/bash
redis-cli --rdb /var/backups/heliosdb-nl2sql/cache-$(date +%Y%m%d).rdb
```

Recovery Procedure:

```bash
# 1. Restore configuration
tar -xzf /var/backups/heliosdb-nl2sql/config-20251028_120000.tar.gz -C /

# 2. Restore statistics
gunzip < /var/backups/heliosdb-nl2sql/statistics-20251028.sql.gz | \
  psql -h localhost -U heliosdb heliosdb_metadata

# 3. Restart service
sudo systemctl restart heliosdb-nl2sql

# 4. Verify health
curl http://localhost:8080/health/readiness
```

6. Production Checklist
6.1 Pre-Deployment Validation
Configuration Validation:
```bash
# Validate TOML syntax
heliosdb nl2sql config validate --config /etc/heliosdb/nl2sql.toml

# Test database connectivity
heliosdb nl2sql config test-db --config /etc/heliosdb/nl2sql.toml

# Test LLM provider connectivity
heliosdb nl2sql config test-llm --provider all --config /etc/heliosdb/nl2sql.toml

# Verify API keys are set
env | grep -E '(OPENAI|ANTHROPIC|DEEPSEEK|GROK)_API_KEY'

# Check filesystem permissions
ls -la /etc/heliosdb/nl2sql.toml  # Should be 640, root:heliosdb
ls -ld /var/lib/heliosdb          # Should be 755, heliosdb:heliosdb
```

Security Validation:
```bash
# Verify TLS certificates
openssl x509 -in /etc/heliosdb/certs/server.crt -noout -dates

# Test JWT secret strength
echo -n $JWT_SECRET | wc -c  # Should be ≥32 characters

# Verify rate limiting
curl -I http://localhost:8080/health  # Check for X-RateLimit headers

# Check firewall rules
sudo iptables -L -n | grep 8080
```

Performance Validation:
```bash
# Load test with sample queries
ab -n 100 -c 10 -T 'application/json' \
  -p test-query.json \
  http://localhost:8080/api/nl2sql/translate

# Verify connection pool size
curl http://localhost:9090/metrics | grep nl2sql_db_connections

# Check memory limits
cat /proc/$(pgrep heliosdb)/limits | grep "Max resident set"
```

6.2 Post-Deployment Verification
Functional Tests:
```bash
# Test simple query translation
curl -X POST http://localhost:8080/api/nl2sql/translate \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -d '{"query": "Show all customers"}' | jq .

# Test complex query (agentic decomposition)
curl -X POST http://localhost:8080/api/nl2sql/translate \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -d '{"query": "Find top 10 customers by revenue in Q1 2024 with order counts"}' | jq .

# Test Code Advisor
curl -X POST http://localhost:8080/api/nl2sql/advisor/analyze \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -d '{"sql": "SELECT * FROM orders WHERE customer_id = 123"}' | jq .

# Test statistics endpoint
curl http://localhost:8080/api/nl2sql/statistics/tables | jq .
```

Monitoring Tests:
```bash
# Verify metrics endpoint
curl http://localhost:9090/metrics | grep nl2sql_queries_total

# Check health endpoints
curl http://localhost:8080/health/liveness
curl http://localhost:8080/health/readiness
curl http://localhost:8080/health/startup

# Verify logs are being written
journalctl -u heliosdb-nl2sql -n 10 --no-pager

# Test alerting (trigger an error condition)
# Check that Prometheus alerts fire and PagerDuty notifications are sent
```

Integration Tests:
```bash
# Verify schema sync
heliosdb nl2sql schema list | grep -q "customers" && echo "Schema synced"

# Verify statistics collection
psql -c "SELECT COUNT(*) FROM nl2sql_statistics_table;" | grep -q "10" && echo "Statistics collected"

# Test cache functionality: run the same query twice, verify the second is faster
time1=$(curl -w "%{time_total}" -o /dev/null -s -X POST http://localhost:8080/api/nl2sql/translate -d '{"query":"test"}')
time2=$(curl -w "%{time_total}" -o /dev/null -s -X POST http://localhost:8080/api/nl2sql/translate -d '{"query":"test"}')
echo "First: ${time1}s, Second: ${time2}s (should be faster)"
```

6.3 Performance Baseline
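The baseline figures below are latency percentiles; if the benchmark tooling is unavailable, the same numbers can be computed from raw per-query timings with the standard library (the latencies here are illustrative samples — collect real ones from logs):

```python
import statistics

# Illustrative per-query latencies in seconds.
latencies = [0.10, 0.11, 0.11, 0.12, 0.12, 0.13, 0.15, 0.35, 0.40, 1.80]

p50 = statistics.median(latencies)
# quantiles(n=100) yields 99 cut points; index 94 is the 95th percentile.
p95 = statistics.quantiles(latencies, n=100)[94]
print(f"p50={p50:.2f}s p95={p95:.2f}s")
```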
Establish Baseline Metrics:
```bash
# Run benchmark suite
heliosdb nl2sql benchmark run --output baseline-$(date +%Y%m%d).json

# Example baseline results (durations in seconds):
# {
#   "simple_query_p50": 0.12,
#   "simple_query_p95": 0.35,
#   "complex_query_p50": 1.8,
#   "complex_query_p95": 3.2,
#   "cache_hit_rate": 0.72,
#   "throughput_qps": 350,
#   "agentic_decomposition_rate": 0.18
# }
```

Compare Against Baseline:

```bash
# After 1 week, re-run benchmark
heliosdb nl2sql benchmark run --output week1-$(date +%Y%m%d).json

# Compare
heliosdb nl2sql benchmark compare \
  --baseline baseline-20251028.json \
  --current week1-20251104.json
```

6.4 Monitoring Setup Verification
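The target-health check in this section can also run unattended in CI; the filtering logic is simple enough to script. A sketch operating on data shaped like the `.data.activeTargets` field of the Prometheus `/api/v1/targets` response (the sample payload is illustrative):

```python
def job_healthy(active_targets, job="heliosdb-nl2sql"):
    """True when at least one target for `job` exists and all report health 'up'."""
    ours = [t for t in active_targets if t.get("labels", {}).get("job") == job]
    return bool(ours) and all(t.get("health") == "up" for t in ours)

# Shaped like .data.activeTargets from Prometheus /api/v1/targets.
sample = [
    {"labels": {"job": "heliosdb-nl2sql"}, "health": "up"},
    {"labels": {"job": "heliosdb-nl2sql"}, "health": "down"},
]
print(job_healthy(sample))  # False — one replica is down
```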
Prometheus Targets:
```bash
# Verify Prometheus is scraping NL2SQL metrics
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="heliosdb-nl2sql")'

# Should show:
# {
#   "health": "up",
#   "lastScrape": "2025-10-28T10:23:45Z",
#   "scrapeUrl": "http://heliosdb-nl2sql:9090/metrics"
# }
```

Grafana Dashboards:

```bash
# Verify dashboard exists
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
  http://localhost:3000/api/search?query=NL2SQL | jq .

# Test dashboard data
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
  'http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=nl2sql_queries_total'
```

Alerting:

```bash
# Verify alert rules loaded
curl http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="nl2sql_alerts")'

# Test alert firing (optional - trigger the condition manually)
# Disable an LLM provider temporarily to trigger the NL2SQLLLMProviderDown alert
```

6.5 Documentation Handoff
Required Documentation:
- Architecture diagram (saved to /docs/nl2sql/PRODUCTION_DEPLOYMENT_GUIDE.md)
- Configuration reference (TOML + env vars)
- Runbook procedures (common tasks, troubleshooting)
- Monitoring dashboard URLs
- Alert contact information (PagerDuty, Slack)
- Backup/restore procedures
- Escalation matrix

Handoff Checklist:

- [ ] Operations team trained on deployment guide
- [ ] On-call rotation established
- [ ] Runbook procedures tested in staging
- [ ] Grafana dashboard access granted
- [ ] PagerDuty integration tested
- [ ] Backup/restore procedures verified
- [ ] Emergency contacts documented
- [ ] Knowledge transfer session completed

7. Scaling Guidelines
7.1 Horizontal Scaling Strategies
Load Balancer Configuration (NGINX):
```nginx
upstream nl2sql_backend {
    least_conn;  # Use least-connections algorithm

    server nl2sql-1:8080 max_fails=3 fail_timeout=30s;
    server nl2sql-2:8080 max_fails=3 fail_timeout=30s;
    server nl2sql-3:8080 max_fails=3 fail_timeout=30s;

    keepalive 32;
}

server {
    listen 80;
    server_name nl2sql.heliosdb.com;

    location /api/nl2sql/ {
        proxy_pass http://nl2sql_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Increase timeout for complex queries
        proxy_read_timeout 60s;
        proxy_connect_timeout 10s;
    }

    location /health/ {
        proxy_pass http://nl2sql_backend;
        proxy_read_timeout 5s;
    }
}
```

Kubernetes Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: heliosdb-nl2sql
  labels:
    app: heliosdb-nl2sql
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: heliosdb-nl2sql
  template:
    metadata:
      labels:
        app: heliosdb-nl2sql
    spec:
      containers:
        - name: nl2sql
          image: heliosdb-nl2sql:latest
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-api-keys
                  key: openai
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: database-credentials
                  key: url
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
            limits:
              cpu: "8"
              memory: "16Gi"
          livenessProbe:
            httpGet:
              path: /health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /app/config.toml
              subPath: nl2sql.toml
      volumes:
        - name: config
          configMap:
            name: nl2sql-config
---
apiVersion: v1
kind: Service
metadata:
  name: heliosdb-nl2sql
spec:
  type: LoadBalancer
  selector:
    app: heliosdb-nl2sql
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: metrics
      port: 9090
      targetPort: 9090
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: heliosdb-nl2sql-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: heliosdb-nl2sql
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: nl2sql_queries_total
        target:
          type: AverageValue
          averageValue: "100"  # 100 qps per pod
```

7.2 Resource Allocation
Sizing Guide (per instance):
| Workload Type | CPU Cores | Memory | Queries/sec |
|---|---|---|---|
| Light (simple queries) | 2-4 | 4-8GB | 50-100 |
| Medium (mixed) | 4-8 | 8-16GB | 100-200 |
| Heavy (complex, agentic) | 8-16 | 16-32GB | 200-400 |
| Very Heavy (high concurrency) | 16-32 | 32-64GB | 400-800 |
Formula:
```
CPU_cores = (target_qps × avg_query_latency_seconds × 1.5 overhead) / concurrency_factor
Memory_GB = (cache_size_mb + column_cache_mb + 2048 base) / 1024
```

Example:
```python
# Target: 300 qps, avg latency 1.5s
target_qps = 300
avg_latency = 1.5
overhead = 1.5
concurrency = 0.8  # 80% of queries can run in parallel

cpu_cores = (target_qps * avg_latency * overhead) / concurrency
# = (300 * 1.5 * 1.5) / 0.8 ≈ 844 concurrent request slots
# Most of a query's latency is spent waiting on the LLM (I/O-bound),
# so slots map to far fewer physical cores — roughly 12 in practice.

# Memory
cache_mb = 500
column_cache_mb = 200
base_mb = 2048

memory_gb = (cache_mb + column_cache_mb + base_mb) / 1024
# = 2748 / 1024 ≈ 2.7GB → Recommend 8GB with headroom
```

7.3 Load Balancing
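Session affinity pays off because repeat queries from the same client land on the node whose caches are already warm. The mechanism behind NGINX's `ip_hash` is a stable hash over the client address; a minimal sketch (backend names are illustrative):

```python
import hashlib

backends = ["nl2sql-1:8080", "nl2sql-2:8080", "nl2sql-3:8080"]

def pick_backend(client_ip: str) -> str:
    """Deterministic hash: the same client always maps to the same backend."""
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]

# The mapping is stable across calls, which is what keeps caches warm.
print(pick_backend("203.0.113.7") == pick_backend("203.0.113.7"))  # True
```

Note that adding or removing a backend reshuffles most assignments under plain modulo hashing; this is why service meshes offer consistent hashing, as in the Istio example below's `consistentHash` policy.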
Session Affinity (for cache locality):
```nginx
# NGINX - IP hash for session affinity
upstream nl2sql_backend {
    ip_hash;  # Route same IP to same backend
    server nl2sql-1:8080;
    server nl2sql-2:8080;
    server nl2sql-3:8080;
}
```

Kubernetes Service Mesh (Istio):
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: nl2sql-lb
spec:
  host: heliosdb-nl2sql
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: "X-User-ID"  # Route based on user ID
```

Health Check-Based Routing:
```nginx
# Remove unhealthy backends automatically
# (the check directives require the nginx_upstream_check_module)
upstream nl2sql_backend {
    server nl2sql-1:8080 max_fails=3 fail_timeout=30s;
    server nl2sql-2:8080 max_fails=3 fail_timeout=30s backup;  # Backup server

    check interval=5000 rise=2 fall=3 timeout=3000 type=http;
    check_http_send "GET /health/readiness HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}
```

7.4 Multi-Region Deployment
Architecture:
```
                 ┌─────────────────┐
                 │   Global DNS    │
                 │   (Route 53)    │
                 └────────┬────────┘
                          │
         ┌────────────────┼────────────────┐
         ▼                ▼                ▼
 ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
 │  US-EAST-1   │  │  EU-WEST-1   │  │  AP-SOUTH-1  │
 │  (Primary)   │  │ (Secondary)  │  │ (Secondary)  │
 └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
        │                 │                 │
 ┌──────▼───────┐  ┌──────▼───────┐  ┌──────▼───────┐
 │    NL2SQL    │  │    NL2SQL    │  │    NL2SQL    │
 │   Cluster    │  │   Cluster    │  │   Cluster    │
 │  (3 nodes)   │  │  (3 nodes)   │  │  (3 nodes)   │
 └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
        │                 │                 │
 ┌──────▼─────────────────▼─────────────────▼──────┐
 │          Global Statistics Replication          │
 │      (PostgreSQL Multi-Master / CockroachDB)    │
 └─────────────────────────────────────────────────┘
```

Configuration (per region):
```toml
[deployment]
region = "us-east-1"
primary_region = true

# Replicate statistics to other regions
[statistics.replication]
enabled = true
targets = [
  "eu-west-1.nl2sql.heliosdb.com",
  "ap-south-1.nl2sql.heliosdb.com"
]
sync_interval = 300  # 5 minutes
```

Failover Strategy:
```yaml
# Route 53 health check
Type: HTTPS
ResourcePath: /health/readiness
FailureThreshold: 3
MeasureLatency: true
Regions:
  - us-east-1 (Primary - Weight: 100)
  - eu-west-1 (Failover - Weight: 0)
  - ap-south-1 (Failover - Weight: 0)

# Automatic failover if primary unhealthy for >1 minute
```

8. Troubleshooting Guide
8.1 Common Issues and Solutions
Issue 1: Query Translation Failures
Symptoms:
- `translation_error` in the response
- Error: "Failed to parse SQL from LLM response"
- Confidence score consistently low (<0.5)
Root Causes:
- Ambiguous natural language query
- LLM provider issue
- Schema mismatch
- Complex query beyond LLM capabilities
Solutions:
Step 1: Check query complexity
```bash
curl -X POST http://localhost:8080/api/nl2sql/analyze \
  -d '{"query": "YOUR_QUERY"}' | jq '.complexity'

# If complexity >8, the query may be too complex
```

Step 2: Verify schema is synced

```sql
SELECT table_name, last_updated
FROM metadata_tables
WHERE table_name IN ('referenced_table_1', 'referenced_table_2');
```

```bash
# If last_updated is old, re-sync
heliosdb nl2sql schema sync --force
```

Step 3: Try a different LLM provider

```bash
curl -X POST http://localhost:8080/api/nl2sql/translate \
  -d '{"query": "YOUR_QUERY", "provider": "anthropic"}' | jq .
```

Step 4: Simplify the query

```
# Instead of:
"Show me the top 10 customers by total revenue in Q1 2024, grouped by region, with running totals and percent of total"

# Try:
"Show top 10 customers by revenue in 2024"
```

Issue 2: Agentic Decomposition Timeouts
Symptoms:
- `agentic_timeout` errors
- Queries taking >30s
- Many tasks (>8) in decomposition
Root Causes:
- Too many tasks generated
- Sequential execution (not parallelized)
- LLM provider slow
- Database connection slow
Solutions:
Step 1: Increase timeout temporarily
```toml
[nl2sql.agentic]
timeout_seconds = 60  # Increase from 30
```

Step 2: Enable parallel execution

```toml
[nl2sql.agentic]
parallel_execution = true
max_parallel_tasks = 6  # Increase from 4
```

Step 3: Reduce max tasks

```toml
[nl2sql.agentic]
max_tasks = 6             # Reduce from 10
complexity_threshold = 7  # Increase to reduce decomposition frequency
```

Step 4: Check LLM provider latency

```bash
curl http://localhost:9090/metrics | grep nl2sql_llm_request_duration_seconds

# If >3s, switch providers
```

Issue 3: Statistics Collection Failures
Symptoms:
- `statistics_collection_error` in logs
- `get_table_stats` returns `None`
- Cardinality estimates are wrong
Root Causes:
- Database permission issues
- Table not in metadata
- Statistics refresh manager queue full
- Database connection timeout
Solutions:
Step 1: Check database permissions
```sql
-- PostgreSQL
GRANT SELECT ON ALL TABLES IN SCHEMA public TO heliosdb;
GRANT SELECT ON pg_statistic TO heliosdb;

-- MySQL
GRANT SELECT ON *.* TO 'heliosdb'@'localhost';
GRANT SELECT ON mysql.innodb_table_stats TO 'heliosdb'@'localhost';
```

Step 2: Verify the table exists in metadata

```sql
SELECT * FROM metadata_tables WHERE table_name = 'your_table';
-- If not found, re-sync the schema
```

Step 3: Check the refresh queue

```bash
heliosdb nl2sql refresh-manager status
```

```toml
# If the queue is full, increase capacity
[nl2sql.refresh_manager]
queue_capacity = 2000  # Increase from 1000
worker_threads = 8     # Increase from 4
```

Step 4: Manual statistics collection

```bash
heliosdb nl2sql statistics refresh --table your_table --force
```

Issue 4: Code Advisor Performance Issues
Symptoms:
- Advisor analysis taking >5s
- Memory spikes during analysis
- Incomplete recommendations
Root Causes:
- Large SQL queries (>10KB)
- Statistics lookup overhead
- Metadata adapter slow
- Too many index recommendations
Solutions:
Step 1: Disable statistics integration temporarily
```toml
[nl2sql.advisor]
use_statistics = false  # Temporarily disable
```

Step 2: Limit index recommendations

```toml
[nl2sql.advisor]
max_index_recommendations = 5  # Cap at 5
```

Step 3: Cache advisor results

```toml
# Enable Redis caching for advisor results
[cache.redis]
enabled = true
advisor_cache_ttl = 1800  # 30 minutes
```

Step 4: Increase the advisor timeout

```toml
[nl2sql.advisor]
analysis_timeout_seconds = 10  # Increase from 5
```

8.2 Debug Procedures
Enable Debug Logging:
```bash
# Temporary (runtime)
export RUST_LOG="debug,heliosdb_nl2sql=trace"
sudo systemctl restart heliosdb-nl2sql
```

```toml
# Permanent (config)
[observability]
log_level = "debug"
```

Trace Specific Query:

```bash
# Enable tracing for a single query
curl -X POST http://localhost:8080/api/nl2sql/translate \
  -H "X-Trace-Enabled: true" \
  -d '{"query": "YOUR_QUERY"}' | jq .

# View the trace in Jaeger
open http://localhost:16686
```

Profile Memory Usage:

```bash
# Using heaptrack
heaptrack /usr/local/bin/heliosdb nl2sql --config /etc/heliosdb/nl2sql.toml

# Analyze
heaptrack_gui heaptrack.heliosdb.*.gz
```

SQL Query Inspection:

```sql
-- Enable SQL query logging (PostgreSQL)
ALTER SYSTEM SET log_statement = 'all';
SELECT pg_reload_conf();
```

```bash
# View logs
tail -f /var/lib/postgresql/14/main/log/postgresql-*.log | grep heliosdb
```

8.3 Log Analysis
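The jq pipelines in this section assume structured JSON logs. The same error-pattern analysis can be done in Python for hosts without jq (the log shape — `level` plus `fields.error_type` — is inferred from the fields referenced below; sample lines are illustrative):

```python
import json
from collections import Counter

# Illustrative journalctl JSON lines.
lines = [
    '{"level": "ERROR", "fields": {"error_type": "translation_error"}}',
    '{"level": "INFO",  "fields": {}}',
    '{"level": "ERROR", "fields": {"error_type": "agentic_timeout"}}',
    '{"level": "ERROR", "fields": {"error_type": "translation_error"}}',
]

records = [json.loads(line) for line in lines]
errors = Counter(
    r["fields"]["error_type"] for r in records if r["level"] == "ERROR"
)
for error_type, count in errors.most_common():
    print(count, error_type)  # most frequent error types first
```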
Find Slow Queries:
```bash
journalctl -u heliosdb-nl2sql --since "1 hour ago" | \
  jq -r 'select(.fields.latency_ms > 5000) | "\(.timestamp) \(.fields.query_id) \(.fields.latency_ms)ms"'
```

Analyze Error Patterns:

```bash
journalctl -u heliosdb-nl2sql --since "1 day ago" | \
  jq -r 'select(.level == "ERROR") | .fields.error_type' | \
  sort | uniq -c | sort -rn
```

Track LLM Provider Usage:

```bash
journalctl -u heliosdb-nl2sql --since "1 hour ago" | \
  jq -r 'select(.target == "heliosdb_nl2sql::llm") | .fields.provider' | \
  sort | uniq -c
```

8.4 Performance Tuning
Identify Bottlenecks:
```bash
# CPU profiling
perf record -F 99 -p $(pgrep heliosdb) -g -- sleep 30
perf report

# I/O profiling
iotop -p $(pgrep heliosdb)

# Network profiling
nethogs -p $(pgrep heliosdb)
```

Optimize Configuration:

```bash
# Benchmark different configurations
for threshold in 5 6 7 8; do
  echo "Testing complexity_threshold=$threshold"
  sed -i "s/complexity_threshold = .*/complexity_threshold = $threshold/" /etc/heliosdb/nl2sql.toml
  sudo systemctl reload heliosdb-nl2sql
  sleep 10
  heliosdb nl2sql benchmark run --duration 60s --output "benchmark-$threshold.json"
done

# Compare results
heliosdb nl2sql benchmark compare benchmark-*.json
```

Appendix
A. Configuration Template
See Section 2.1 for complete TOML configuration template.
B. Metrics Reference
Complete metrics list:
```
# Query metrics
nl2sql_queries_total{status="success|error"}
nl2sql_query_duration_seconds{quantile="0.5|0.95|0.99"}
nl2sql_query_complexity_score{quantile="0.5|0.95|0.99"}

# Agentic metrics
nl2sql_agentic_decompositions_total
nl2sql_agentic_tasks_per_query{quantile="0.5|0.95|0.99"}
nl2sql_agentic_stages_per_query{quantile="0.5|0.95|0.99"}
nl2sql_agentic_stage_duration_seconds{stage="1|2|3|..."}
nl2sql_agentic_parallel_speedup_ratio

# Code Advisor metrics
nl2sql_advisor_analyses_total
nl2sql_advisor_quality_score{quantile="0.5|0.95|0.99"}
nl2sql_advisor_bottlenecks_detected{kind="FullTableScan|MissingIndex|..."}
nl2sql_advisor_index_recommendations_total

# Statistics metrics
nl2sql_statistics_cache_entries{type="table|column"}
nl2sql_statistics_cache_hits
nl2sql_statistics_cache_misses
nl2sql_statistics_refreshes_total{priority="high|medium|low"}
nl2sql_statistics_refresh_duration_seconds
nl2sql_statistics_age_seconds{quantile="0.5|0.95|0.99"}

# LLM metrics
nl2sql_llm_requests_total{provider="openai|anthropic|...", status="success|error"}
nl2sql_llm_request_duration_seconds{provider="...", quantile="0.5|0.95|0.99"}
nl2sql_llm_tokens_used_total{provider="...", type="input|output"}

# Cache metrics
nl2sql_cache_hits
nl2sql_cache_misses
nl2sql_cache_entries
nl2sql_cache_memory_bytes
nl2sql_cache_evictions_total

# Database metrics
nl2sql_db_connections_active
nl2sql_db_connections_idle
nl2sql_db_query_duration_seconds{quantile="0.5|0.95|0.99"}

# System metrics
process_cpu_seconds_total
process_resident_memory_bytes
process_open_fds
```

C. API Endpoints
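Scripts that exercise several of these endpoints benefit from a thin client wrapper; a sketch around the translation endpoint (the base URL default and bearer-token handling are deployment-specific assumptions):

```python
import json
import urllib.request

class NL2SQLClient:
    """Minimal wrapper for the NL2SQL HTTP API."""

    def __init__(self, base_url="http://localhost:8080", token=None):
        self.base_url = base_url.rstrip("/")
        self.token = token

    def _headers(self):
        headers = {"Content-Type": "application/json"}
        if self.token:
            headers["Authorization"] = f"Bearer {self.token}"
        return headers

    def translate(self, query: str) -> dict:
        """POST /api/nl2sql/translate with a natural-language query."""
        req = urllib.request.Request(
            f"{self.base_url}/api/nl2sql/translate",
            data=json.dumps({"query": query}).encode(),
            headers=self._headers(),
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
```

Usage: `NL2SQLClient(token=os.environ["JWT_TOKEN"]).translate("Show all customers")`.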
Translation API:
- `POST /api/nl2sql/translate` - Translate natural language to SQL
- `POST /api/nl2sql/analyze` - Analyze query complexity (without translation)
Code Advisor API:
- `POST /api/nl2sql/advisor/analyze` - Analyze a SQL query
- `POST /api/nl2sql/advisor/recommend-indexes` - Get index recommendations
Statistics API:
- `GET /api/nl2sql/statistics/tables` - List table statistics
- `GET /api/nl2sql/statistics/tables/{table}` - Get statistics for one table
- `POST /api/nl2sql/statistics/refresh` - Trigger a statistics refresh
Admin API:
- `POST /admin/features` - Toggle feature flags
- `POST /admin/cache/clear` - Clear cache
- `GET /admin/config` - Get current configuration
Health API:
- `GET /health/liveness` - Liveness probe
- `GET /health/readiness` - Readiness probe
- `GET /health/startup` - Startup probe
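Deployment scripts typically gate traffic cutover on the readiness probe; a minimal polling helper (probe path from the list above; the host, port, and timeout defaults are assumptions):

```python
import time
import urllib.request
import urllib.error

def wait_ready(base_url="http://localhost:8080", timeout_s=60, interval_s=2):
    """Poll /health/readiness until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health/readiness", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service not up yet; retry
        time.sleep(interval_s)
    return False

# A zero timeout returns immediately without probing.
print(wait_ready(timeout_s=0))  # False
```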
D. Emergency Contacts
On-Call Rotation:
- Primary: oncall-nl2sql@heliosdb.com
- Secondary: oncall-platform@heliosdb.com
- Escalation: engineering-leads@heliosdb.com
PagerDuty:
- Service: HeliosDB-NL2SQL-Production
- Escalation Policy: Engineering-NL2SQL
Slack Channels:
- #heliosdb-nl2sql-prod (production alerts)
- #heliosdb-nl2sql-oncall (on-call coordination)
- #heliosdb-platform (general platform issues)
End of Production Deployment Guide
Version: 1.0 Maintained by: HeliosDB Platform Team Last Review: October 28, 2025 Next Review: January 28, 2026