HeliosDB NL2SQL Production Deployment Guide


Version: 1.0 | Last Updated: October 28, 2025 | Status: Production Ready


Table of Contents

  1. Deployment Overview
  2. Configuration
  3. Installation & Setup
  4. Monitoring & Observability
  5. Operational Runbooks
  6. Production Checklist
  7. Scaling Guidelines
  8. Troubleshooting Guide

1. Deployment Overview

1.1 Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ (Applications, Web UI, APIs) │
└────────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ NL2SQL ENGINE (Core) │
│ ┌───────────────┐ ┌──────────────┐ ┌────────────────────┐ │
│ │ Query │ │ Agentic │ │ Code Advisor │ │
│ │ Complexity │─▶│ Decomposer │◀─│ (Performance │ │
│ │ Assessment │ │ │ │ Analysis) │ │
│ └───────────────┘ └──────┬───────┘ └──────────┬─────────┘ │
│ │ │ │ │
│ ▼ ▼ │ │
│ ┌──────────────────────────────────────────┐ │ │
│ │ Schema Retriever & Cache │ │ │
│ │ (Skeleton Cache + Metadata Integration) │ │ │
│ └────────────┬─────────────────────────────┘ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ LLM Translation Layer │ │
│ │ (OpenAI, Anthropic, DeepSeek, Grok - Multi-provider) │ │
│ └────────────┬─────────────────────────────────────────────┘ │
└───────────────┼────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STATISTICS & METADATA LAYER │
│ ┌──────────────────┐ ┌─────────────────┐ ┌────────────────┐ │
│ │ Statistics │ │ Metadata │ │ Refresh │ │
│ │ Service │ │ Adapter │ │ Manager │ │
│ │ (Cardinality, │ │ (Schema │ │ (Priority │ │
│ │ Selectivity) │ │ Integration) │ │ Queue) │ │
│ └──────────────────┘ └─────────────────┘ └────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ DATABASE METADATA LAYER │
│ (heliosdb-metadata: Tables, Columns, │
│ Indexes, Constraints, Statistics) │
└─────────────────────────────────────────────────────────────────┘

1.2 Component Dependencies

Critical Dependencies:

  • heliosdb-metadata: Schema service, constraint metadata
  • heliosdb-nl2sql: Core NL2SQL engine, agentic decomposer, code advisor
  • LLM Provider APIs: OpenAI, Anthropic, DeepSeek, Grok
  • Statistics cache: In-memory (parking_lot RwLock) or Redis for distributed deployments

Data Flow:

  1. Natural language query → Complexity assessment
  2. Simple queries → Direct translation
  3. Complex queries → Agentic decomposition → Task execution
  4. Generated SQL → Code Advisor analysis → Performance recommendations
  5. Statistics service → Real-time cardinality/selectivity estimates
  6. Metadata adapter → Schema integration → Index recommendations
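The routing step above (complexity assessment deciding between direct translation and agentic decomposition) can be sketched as follows. The scoring heuristic is illustrative only, not HeliosDB's actual implementation; only the threshold mirrors the configuration.

```python
# Sketch of the NL2SQL routing step. The keyword-based scorer below is a toy
# stand-in for the real complexity assessor; the threshold mirrors the
# [nl2sql.agentic] complexity_threshold default.

COMPLEXITY_THRESHOLD = 5

def assess_complexity(nl_query: str) -> int:
    """Toy 1-10 score: count words that typically imply joins or aggregation."""
    signals = ["per", "each", "group", "compare", "trend", "top", "join",
               "average", "between", "versus"]
    words = nl_query.lower().split()
    return min(10, 1 + sum(1 for s in signals if s in words))

def route(nl_query: str) -> str:
    """Queries above the threshold go through agentic decomposition."""
    score = assess_complexity(nl_query)
    return "agentic_decomposition" if score > COMPLEXITY_THRESHOLD else "direct_translation"
```

A simple lookup query scores low and is translated directly; a multi-clause analytical question crosses the threshold and is decomposed into tasks.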

1.3 System Requirements

Minimum Production Requirements:

  • CPU: 4 cores (8 cores recommended for high throughput)
  • Memory: 8GB RAM (16GB+ for large schemas)
  • Storage: 50GB SSD (for caching, logs, statistics)
  • Network: 1Gbps (low-latency access to LLM APIs is critical)
  • OS: Linux (Ubuntu 22.04 LTS, RHEL 8+, or equivalent)

Recommended Production Configuration:

  • CPU: 16+ cores for parallel task execution
  • Memory: 32GB+ RAM for large-scale statistics caching
  • Storage: 500GB+ NVMe SSD for high IOPS
  • Network: 10Gbps with dedicated LLM API egress
  • Database: PostgreSQL 14+, MySQL 8.0+, or HeliosDB native

1.4 Prerequisites

Software:

  • Rust 1.70+ (for compilation)
  • PostgreSQL 14+ or MySQL 8.0+ (for metadata storage)
  • Redis 7+ (optional, for distributed caching)
  • Prometheus + Grafana (for monitoring)
  • Docker + Docker Compose (optional, for containerized deployment)

API Keys:

  • OpenAI API key (GPT-4, GPT-3.5-turbo)
  • Anthropic API key (Claude 3 Opus/Sonnet)
  • DeepSeek API key (optional)
  • Grok API key (optional)

Network Access:

  • Outbound HTTPS to LLM provider endpoints
  • Inbound on application ports (default: 8080 for HTTP API)
  • Database connectivity (PostgreSQL: 5432, MySQL: 3306)

2. Configuration

2.1 Complete TOML Configuration Reference

/etc/heliosdb/nl2sql.toml:

# ===== NL2SQL Core Configuration =====
[nl2sql]
# Enable/disable NL2SQL engine
enabled = true
# Query timeout (seconds)
query_timeout = 30
# Maximum concurrent queries
max_concurrent_queries = 100
# Cache size (MB) for skeleton cache
cache_size_mb = 500
# Cache TTL (seconds)
cache_ttl = 3600
# Default LLM provider (openai, anthropic, deepseek, grok)
default_provider = "openai"
# Fallback providers (used if primary fails)
fallback_providers = ["anthropic", "deepseek"]
# ===== Agentic Decomposition =====
[nl2sql.agentic]
# Enable agentic task decomposition for complex queries
enabled = true
# Complexity threshold (1-10) - queries above this trigger decomposition
complexity_threshold = 5
# Maximum tasks per decomposition
max_tasks = 10
# Maximum execution stages
max_stages = 5
# Agentic execution timeout (seconds)
timeout_seconds = 30
# Enable parallel task execution
parallel_execution = true
# Maximum parallel tasks per stage
max_parallel_tasks = 4
# ===== Code Advisor =====
[nl2sql.advisor]
# Enable Code Advisor for query optimization
enabled = true
# Automatically recommend indexes
auto_recommend_indexes = true
# Minimum quality score threshold (0-100)
min_quality_score = 50
# Enable statistics integration
use_statistics = true
# Enable metadata integration
use_metadata = true
# Bottleneck detection threshold
bottleneck_severity_threshold = 6
# ===== Statistics Service =====
[nl2sql.statistics]
# Enable statistics service
enabled = true
# Statistics refresh interval (seconds)
refresh_interval_seconds = 3600
# Cache size for table statistics (MB)
cache_size_mb = 100
# Cache size for column statistics (MB)
column_cache_size_mb = 200
# Use histogram-based estimates
enable_histograms = true
# Histogram bucket count
histogram_buckets = 100
# Minimum table size (rows) for statistics collection
min_table_size = 1000
# ===== Metadata Adapter =====
[nl2sql.metadata]
# Enable metadata integration
enabled = true
# Metadata refresh interval (seconds)
refresh_interval = 3600
# Collect constraint metadata
collect_constraints = true
# Collect index metadata
collect_indexes = true
# Collect foreign key metadata
collect_foreign_keys = true
# Max concurrent metadata collection tasks
max_concurrent_collections = 10
# ===== Statistics Refresh Manager =====
[nl2sql.refresh_manager]
# Enable background statistics refresh
enabled = true
# Refresh queue capacity
queue_capacity = 1000
# Worker threads for refresh tasks
worker_threads = 4
# Priority levels (higher = more urgent)
high_priority_threshold = 8
medium_priority_threshold = 5
# Auto-refresh on query (adaptive)
auto_refresh_on_query = true
# Staleness threshold (seconds) - trigger refresh if stats older than this
staleness_threshold = 7200
# ===== LLM Provider Configuration =====
[llm.openai]
# API key (or use OPENAI_API_KEY environment variable)
api_key = "${OPENAI_API_KEY}"
# Model to use
model = "gpt-4-turbo-preview"
# API endpoint
endpoint = "https://api.openai.com/v1"
# Request timeout (seconds)
timeout = 30
# Max retries
max_retries = 3
# Rate limit (requests per minute)
rate_limit = 500
# Temperature (0.0-1.0)
temperature = 0.0
# Max tokens
max_tokens = 2000
[llm.anthropic]
api_key = "${ANTHROPIC_API_KEY}"
model = "claude-3-5-sonnet-20241022"
endpoint = "https://api.anthropic.com/v1"
timeout = 30
max_retries = 3
rate_limit = 100
temperature = 0.0
max_tokens = 4000
[llm.deepseek]
api_key = "${DEEPSEEK_API_KEY}"
model = "deepseek-chat"
endpoint = "https://api.deepseek.com/v1"
timeout = 30
max_retries = 3
rate_limit = 200
temperature = 0.0
max_tokens = 2000
[llm.grok]
api_key = "${GROK_API_KEY}"
model = "grok-1"
endpoint = "https://api.x.ai/v1"
timeout = 30
max_retries = 3
rate_limit = 100
temperature = 0.0
max_tokens = 2000
# ===== Database Connection =====
[database]
# Database type (postgres, mysql, heliosdb)
type = "postgres"
# Connection string
connection_string = "postgresql://heliosdb:password@localhost:5432/heliosdb_metadata"
# Connection pool size
pool_size = 20
# Connection timeout (seconds)
connection_timeout = 10
# Query timeout (seconds)
query_timeout = 30
# ===== Distributed Caching (Redis) =====
[cache.redis]
# Enable Redis for distributed caching
enabled = false
# Redis connection string
url = "redis://localhost:6379"
# Key prefix for NL2SQL cache entries
key_prefix = "nl2sql:"
# Connection pool size
pool_size = 10
# ===== Observability =====
[observability]
# Enable metrics collection
metrics_enabled = true
# Metrics port (Prometheus scrape endpoint)
metrics_port = 9090
# Metrics path
metrics_path = "/metrics"
# Enable distributed tracing
tracing_enabled = true
# Tracing exporter (jaeger, zipkin, otlp)
tracing_exporter = "jaeger"
# Jaeger agent endpoint
jaeger_agent_endpoint = "localhost:6831"
# Log level (trace, debug, info, warn, error)
log_level = "info"
# Structured logging format (json, text)
log_format = "json"
# Log output (stdout, file)
log_output = "stdout"
# Log file path (if log_output = "file")
log_file = "/var/log/heliosdb/nl2sql.log"
# ===== Security =====
[security]
# Enable API authentication
require_auth = true
# JWT secret (or use JWT_SECRET environment variable)
jwt_secret = "${JWT_SECRET}"
# JWT expiration (seconds)
jwt_expiration = 3600
# Enable TLS
enable_tls = true
# TLS certificate path
tls_cert = "/etc/heliosdb/certs/server.crt"
# TLS key path
tls_key = "/etc/heliosdb/certs/server.key"
# Rate limiting (requests per minute per client)
rate_limit_per_client = 100
# ===== Feature Flags =====
[features]
# Enable experimental features
experimental = false
# Enable A/B testing
ab_testing = false
# A/B test split (0.0-1.0)
ab_test_split = 0.5
# Enable verbose error messages (disable in production)
verbose_errors = false
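The ${OPENAI_API_KEY}-style placeholders in the configuration above imply environment-variable substitution at load time. A minimal sketch of that expansion (the real config loader may behave differently, e.g. fail hard on unset variables):

```python
# Sketch of ${VAR} expansion as used in nl2sql.toml values. Unset variables
# are left as-is here; a production loader might raise instead.
import os
import re

def expand_env(value: str) -> str:
    """Replace ${VAR} placeholders with environment values."""
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: os.environ.get(m.group(1), m.group(0)),
                  value)

os.environ["OPENAI_API_KEY"] = "sk-test"
print(expand_env('api_key = "${OPENAI_API_KEY}"'))  # api_key = "sk-test"
```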

2.2 Environment Variables

Required:

# LLM API Keys
export OPENAI_API_KEY="sk-proj-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export DEEPSEEK_API_KEY="..." # Optional
export GROK_API_KEY="..." # Optional
# Database
export DATABASE_URL="postgresql://user:pass@localhost:5432/heliosdb"
# Security
export JWT_SECRET="your-256-bit-secret-key-here"
# Deployment
export RUST_LOG="info,heliosdb_nl2sql=debug"
export RUST_BACKTRACE="1"

Optional:

# Redis (for distributed caching)
export REDIS_URL="redis://localhost:6379"
# Observability
export JAEGER_AGENT_HOST="localhost"
export JAEGER_AGENT_PORT="6831"
# Performance tuning
export NL2SQL_WORKER_THREADS="16"
export NL2SQL_MAX_CONCURRENT_QUERIES="200"

2.3 Performance Tuning Parameters

CPU-Bound Workloads (complex decomposition):

[nl2sql.agentic]
max_parallel_tasks = 8 # Match CPU core count
parallel_execution = true
timeout_seconds = 60
[nl2sql.refresh_manager]
worker_threads = 8 # Match CPU core count

Memory-Bound Workloads (large schemas):

[nl2sql]
cache_size_mb = 2000 # Increase for large skeleton cache
[nl2sql.statistics]
cache_size_mb = 500
column_cache_size_mb = 1000
enable_histograms = true
histogram_buckets = 200

High-Throughput Workloads (many concurrent queries):

[nl2sql]
max_concurrent_queries = 500
[database]
pool_size = 50
[cache.redis]
enabled = true
pool_size = 20

Low-Latency Workloads (fast response times):

[nl2sql]
query_timeout = 10
cache_ttl = 7200 # Longer cache retention
[llm.openai]
model = "gpt-3.5-turbo" # Faster, lower latency
timeout = 10
[nl2sql.agentic]
complexity_threshold = 7 # Avoid decomposition for most queries

2.4 Security Settings

Production Security Hardening:

[security]
require_auth = true
enable_tls = true
rate_limit_per_client = 50 # Stricter rate limit
# TLS 1.3 only
tls_min_version = "1.3"
# Disable verbose errors (prevent information disclosure)
[features]
verbose_errors = false
# Enable audit logging
[observability]
audit_log_enabled = true
audit_log_path = "/var/log/heliosdb/audit.log"

API Key Rotation:

# Store API keys in secret manager (AWS Secrets Manager, HashiCorp Vault)
export OPENAI_API_KEY=$(aws secretsmanager get-secret-value \
--secret-id heliosdb/nl2sql/openai --query SecretString --output text)
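The rate_limit_per_client setting implies per-client request throttling; a token-bucket sketch of that behavior (the class name and API are hypothetical, not HeliosDB's code):

```python
# Token-bucket rate limiter sketch: allow `rate_per_minute` requests per
# client, with continuous refill. Illustrative only.
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate_per_minute: int):
        self.rate = rate_per_minute
        self.tokens = defaultdict(lambda: float(rate_per_minute))  # start full
        self.last = defaultdict(time.monotonic)

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[client_id]
        self.last[client_id] = now
        # refill proportionally to elapsed time, capped at the bucket size
        self.tokens[client_id] = min(
            self.rate, self.tokens[client_id] + elapsed * self.rate / 60.0)
        if self.tokens[client_id] >= 1.0:
            self.tokens[client_id] -= 1.0
            return True
        return False
```

With rate_limit_per_client = 50, a client can burst up to 50 requests, then is throttled to roughly 50/minute; each client_id gets an independent bucket.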

3. Installation & Setup

3.1 Step-by-Step Deployment Procedure

Option A: Docker Deployment

Step 1: Build Docker image

cd /path/to/heliosdb
docker build -t heliosdb-nl2sql:latest -f docker/Dockerfile.nl2sql .

Step 2: Create configuration

mkdir -p /etc/heliosdb
cp config/nl2sql.toml.example /etc/heliosdb/nl2sql.toml
# Edit /etc/heliosdb/nl2sql.toml with your settings

Step 3: Run with Docker Compose

docker-compose.yml
version: '3.8'
services:
  nl2sql:
    image: heliosdb-nl2sql:latest
    ports:
      - "8080:8080"
      - "9090:9090"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - DATABASE_URL=${DATABASE_URL}
      - JWT_SECRET=${JWT_SECRET}
      - RUST_LOG=info
    volumes:
      - /etc/heliosdb/nl2sql.toml:/app/config.toml:ro
      - nl2sql-cache:/var/cache/nl2sql
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
  postgres:
    image: postgres:14
    environment:
      POSTGRES_DB: heliosdb_metadata
      POSTGRES_USER: heliosdb
      POSTGRES_PASSWORD: secure_password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
volumes:
  nl2sql-cache:
  postgres-data:
  redis-data:
docker-compose up -d

Option B: Native Deployment (Production)

Step 1: Compile HeliosDB with NL2SQL feature

cd /path/to/heliosdb
cargo build --release --features nl2sql,statistics,advisor
sudo cp target/release/heliosdb /usr/local/bin/

Step 2: Create system user

sudo useradd -r -s /bin/false -d /var/lib/heliosdb heliosdb
sudo mkdir -p /var/lib/heliosdb /var/log/heliosdb /etc/heliosdb
sudo chown -R heliosdb:heliosdb /var/lib/heliosdb /var/log/heliosdb

Step 3: Install configuration

sudo cp config/nl2sql.toml /etc/heliosdb/
sudo chmod 640 /etc/heliosdb/nl2sql.toml
sudo chown root:heliosdb /etc/heliosdb/nl2sql.toml

Step 4: Create systemd service

sudo tee /etc/systemd/system/heliosdb-nl2sql.service > /dev/null <<EOF
[Unit]
Description=HeliosDB NL2SQL Engine
After=network.target postgresql.service
[Service]
Type=simple
User=heliosdb
Group=heliosdb
WorkingDirectory=/var/lib/heliosdb
ExecStart=/usr/local/bin/heliosdb nl2sql --config /etc/heliosdb/nl2sql.toml
Restart=on-failure
RestartSec=10
LimitNOFILE=65536
# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/heliosdb /var/log/heliosdb
# Environment
Environment="RUST_LOG=info,heliosdb_nl2sql=debug"
EnvironmentFile=/etc/heliosdb/nl2sql.env
[Install]
WantedBy=multi-user.target
EOF

Step 5: Create environment file

sudo tee /etc/heliosdb/nl2sql.env > /dev/null <<EOF
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
DATABASE_URL=postgresql://heliosdb:password@localhost:5432/heliosdb_metadata
JWT_SECRET=your-256-bit-secret-here
EOF
sudo chmod 600 /etc/heliosdb/nl2sql.env
sudo chown root:heliosdb /etc/heliosdb/nl2sql.env

Step 6: Enable and start service

sudo systemctl daemon-reload
sudo systemctl enable heliosdb-nl2sql
sudo systemctl start heliosdb-nl2sql
sudo systemctl status heliosdb-nl2sql

3.2 Database Initialization

Step 1: Create metadata database

-- PostgreSQL
CREATE DATABASE heliosdb_metadata;
CREATE USER heliosdb WITH PASSWORD 'secure_password';
GRANT ALL PRIVILEGES ON DATABASE heliosdb_metadata TO heliosdb;
-- MySQL
CREATE DATABASE heliosdb_metadata CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'heliosdb'@'localhost' IDENTIFIED BY 'secure_password';
GRANT ALL PRIVILEGES ON heliosdb_metadata.* TO 'heliosdb'@'localhost';
FLUSH PRIVILEGES;

Step 2: Run schema migrations

heliosdb migrate --database $DATABASE_URL --migrations ./migrations/nl2sql

Step 3: Verify schema

-- PostgreSQL
\dt
\di
-- Expected tables:
-- - nl2sql_cache
-- - nl2sql_statistics_table
-- - nl2sql_statistics_column
-- - nl2sql_refresh_queue
-- - nl2sql_audit_log

3.3 Schema Setup

Step 1: Initialize schema metadata (for target database)

# Connect NL2SQL to your application database
heliosdb nl2sql schema sync \
--source-db "postgresql://app_user:pass@localhost:5432/app_db" \
--target-metadata "postgresql://heliosdb:pass@localhost:5432/heliosdb_metadata"

Step 2: Verify schema sync

SELECT table_name, column_count, last_updated
FROM metadata_tables
ORDER BY table_name;

Step 3: Configure schema refresh

[nl2sql.metadata]
enabled = true
refresh_interval = 3600 # Re-sync every hour
# Auto-detect schema changes
auto_sync_on_ddl = true

3.4 Initial Statistics Collection

Step 1: Collect baseline statistics

heliosdb nl2sql statistics collect --all-tables

Step 2: Verify statistics

SELECT
ts.table_name,
ts.row_count,
ts.size_bytes,
COUNT(cs.column_name) AS columns_with_stats
FROM nl2sql_statistics_table ts
LEFT JOIN nl2sql_statistics_column cs ON ts.table_name = cs.table_name
GROUP BY ts.table_name, ts.row_count, ts.size_bytes
ORDER BY ts.row_count DESC;

Step 3: Schedule periodic statistics refresh

# Cron job for daily statistics refresh
0 2 * * * /usr/local/bin/heliosdb nl2sql statistics refresh --priority auto

Step 4: Monitor statistics staleness

SELECT
table_name,
last_updated,
NOW() - last_updated AS staleness
FROM nl2sql_statistics_table
WHERE NOW() - last_updated > INTERVAL '24 hours'
ORDER BY staleness DESC;
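The histogram-based estimates enabled via enable_histograms work roughly like this equi-width sketch; the function names and bucket layout here are illustrative, not HeliosDB's internal representation:

```python
# Equi-width histogram sketch for selectivity estimation (illustrative).
def build_histogram(values, buckets=100):
    """Return (lo, bucket_width, per-bucket counts) over numeric values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets or 1  # avoid zero width for constant columns
    counts = [0] * buckets
    for v in values:
        idx = min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return lo, width, counts

def estimate_selectivity(hist, predicate_value):
    """Estimated fraction of rows satisfying `col <= predicate_value`."""
    lo, width, counts = hist
    total = sum(counts)
    idx = min(max(int((predicate_value - lo) / width), 0), len(counts) - 1)
    # full buckets below, plus linear interpolation within the boundary bucket
    full = sum(counts[:idx])
    frac = (predicate_value - (lo + idx * width)) / width
    return (full + counts[idx] * max(0.0, min(1.0, frac))) / total
```

With more buckets (histogram_buckets) the interpolation error shrinks, at the cost of larger column-statistics cache entries.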

4. Monitoring & Observability

4.1 Key Metrics to Monitor

Query Performance Metrics:

# Query latency (histogram)
nl2sql_query_duration_seconds{quantile="0.5"}
nl2sql_query_duration_seconds{quantile="0.95"}
nl2sql_query_duration_seconds{quantile="0.99"}
# Query throughput
rate(nl2sql_queries_total[5m])
# Query success rate
rate(nl2sql_queries_total{status="success"}[5m]) /
rate(nl2sql_queries_total[5m])
# LLM latency
nl2sql_llm_request_duration_seconds{provider="openai"}
nl2sql_llm_request_duration_seconds{provider="anthropic"}
# Cache hit rate
rate(nl2sql_cache_hits[5m]) /
(rate(nl2sql_cache_hits[5m]) + rate(nl2sql_cache_misses[5m]))

Agentic Decomposition Metrics:

# Decomposition frequency
rate(nl2sql_agentic_decompositions_total[5m])
# Average task count per decomposition
nl2sql_agentic_tasks_per_query{quantile="0.95"}
# Stage execution time
nl2sql_agentic_stage_duration_seconds{stage="1"}
nl2sql_agentic_stage_duration_seconds{stage="2"}
# Parallel efficiency
nl2sql_agentic_parallel_speedup_ratio

Code Advisor Metrics:

# Advisor analysis frequency
rate(nl2sql_advisor_analyses_total[5m])
# Quality scores
nl2sql_advisor_quality_score{quantile="0.5"}
# Bottleneck detection rate
rate(nl2sql_advisor_bottlenecks_detected[5m])
# Index recommendation rate
rate(nl2sql_advisor_index_recommendations[5m])

Statistics Service Metrics:

# Statistics cache size
nl2sql_statistics_cache_entries{type="table"}
nl2sql_statistics_cache_entries{type="column"}
# Statistics refresh rate
rate(nl2sql_statistics_refreshes_total[5m])
# Statistics staleness
nl2sql_statistics_age_seconds{quantile="0.95"}
# Cardinality estimation accuracy (if ground truth available)
nl2sql_statistics_estimation_error_ratio

System Resource Metrics:

# Memory usage
process_resident_memory_bytes
nl2sql_cache_memory_bytes
# CPU utilization
rate(process_cpu_seconds_total[5m])
# Connection pool usage
nl2sql_db_connections_active
nl2sql_db_connections_idle

4.2 Grafana Dashboard Configuration

Dashboard JSON (/etc/grafana/dashboards/nl2sql.json):

{
  "dashboard": {
    "title": "HeliosDB NL2SQL Production Dashboard",
    "panels": [
      {
        "title": "Query Throughput",
        "type": "graph",
        "targets": [
          { "expr": "rate(nl2sql_queries_total[5m])", "legendFormat": "Total Queries/sec" },
          { "expr": "rate(nl2sql_queries_total{status=\"success\"}[5m])", "legendFormat": "Success" },
          { "expr": "rate(nl2sql_queries_total{status=\"error\"}[5m])", "legendFormat": "Errors" }
        ]
      },
      {
        "title": "Query Latency (p50, p95, p99)",
        "type": "graph",
        "targets": [
          { "expr": "nl2sql_query_duration_seconds{quantile=\"0.5\"}", "legendFormat": "p50" },
          { "expr": "nl2sql_query_duration_seconds{quantile=\"0.95\"}", "legendFormat": "p95" },
          { "expr": "nl2sql_query_duration_seconds{quantile=\"0.99\"}", "legendFormat": "p99" }
        ]
      },
      {
        "title": "Cache Hit Rate",
        "type": "gauge",
        "targets": [
          { "expr": "rate(nl2sql_cache_hits[5m]) / (rate(nl2sql_cache_hits[5m]) + rate(nl2sql_cache_misses[5m]))", "legendFormat": "Hit Rate" }
        ],
        "thresholds": [
          { "value": 0.7, "color": "green" },
          { "value": 0.5, "color": "yellow" },
          { "value": 0, "color": "red" }
        ]
      },
      {
        "title": "LLM Provider Latency",
        "type": "graph",
        "targets": [
          { "expr": "nl2sql_llm_request_duration_seconds{provider=\"openai\", quantile=\"0.95\"}", "legendFormat": "OpenAI p95" },
          { "expr": "nl2sql_llm_request_duration_seconds{provider=\"anthropic\", quantile=\"0.95\"}", "legendFormat": "Anthropic p95" }
        ]
      },
      {
        "title": "Agentic Decomposition Rate",
        "type": "graph",
        "targets": [
          { "expr": "rate(nl2sql_agentic_decompositions_total[5m])", "legendFormat": "Decompositions/sec" },
          { "expr": "nl2sql_agentic_tasks_per_query{quantile=\"0.95\"}", "legendFormat": "Tasks per Query (p95)" }
        ]
      },
      {
        "title": "Statistics Cache Health",
        "type": "stat",
        "targets": [
          { "expr": "nl2sql_statistics_cache_entries{type=\"table\"}", "legendFormat": "Table Stats" },
          { "expr": "nl2sql_statistics_cache_entries{type=\"column\"}", "legendFormat": "Column Stats" }
        ]
      }
    ]
  }
}

Quick Setup:

# Import dashboard
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-d @/etc/grafana/dashboards/nl2sql.json

4.3 Alert Thresholds

Prometheus Alerting Rules (/etc/prometheus/rules/nl2sql.yml):

groups:
  - name: nl2sql_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: NL2SQLHighErrorRate
        expr: |
          rate(nl2sql_queries_total{status="error"}[5m]) /
          rate(nl2sql_queries_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NL2SQL error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }}"
      # Query latency too high
      - alert: NL2SQLHighLatency
        expr: nl2sql_query_duration_seconds{quantile="0.95"} > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NL2SQL p95 latency above 10s"
          description: "p95 latency is {{ $value }}s"
      # LLM provider failure
      - alert: NL2SQLLLMProviderDown
        expr: |
          rate(nl2sql_llm_requests_total{status="error"}[5m]) /
          rate(nl2sql_llm_requests_total[5m]) > 0.5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM provider {{ $labels.provider }} has high failure rate"
          description: "Failure rate is {{ $value | humanizePercentage }}"
      # Cache hit rate too low
      - alert: NL2SQLLowCacheHitRate
        expr: |
          rate(nl2sql_cache_hits[5m]) /
          (rate(nl2sql_cache_hits[5m]) + rate(nl2sql_cache_misses[5m])) < 0.5
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "NL2SQL cache hit rate below 50%"
          description: "Hit rate is {{ $value | humanizePercentage }}"
      # Statistics stale
      - alert: NL2SQLStaleStatistics
        expr: nl2sql_statistics_age_seconds{quantile="0.95"} > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "NL2SQL statistics are stale (>24h)"
          description: "p95 age is {{ $value | humanizeDuration }}"
      # High memory usage
      - alert: NL2SQLHighMemoryUsage
        expr: nl2sql_cache_memory_bytes > 10737418240 # 10GB
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "NL2SQL cache memory usage above 10GB"
          description: "Current usage: {{ $value | humanize1024 }}B"
      # Connection pool exhaustion
      - alert: NL2SQLConnectionPoolExhausted
        expr: |
          nl2sql_db_connections_active /
          (nl2sql_db_connections_active + nl2sql_db_connections_idle) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "NL2SQL connection pool >90% utilized"
          description: "Utilization: {{ $value | humanizePercentage }}"

4.4 Log Aggregation Setup

Structured JSON Logging (configured in nl2sql.toml):

[observability]
log_format = "json"
log_output = "stdout"

Example JSON Log:

{
  "timestamp": "2025-10-28T10:23:45.123Z",
  "level": "INFO",
  "target": "heliosdb_nl2sql::engine",
  "message": "Query translation completed",
  "fields": {
    "query_id": "q-123456",
    "nl_query": "Show top 10 customers by revenue",
    "complexity": 7,
    "agentic_decomposition": true,
    "tasks": 4,
    "latency_ms": 1523,
    "llm_provider": "openai",
    "cache_hit": false,
    "confidence": 0.92
  },
  "span": {
    "name": "translate_query",
    "trace_id": "abc123def456"
  }
}

Elasticsearch Integration (via Filebeat):

/etc/filebeat/filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    json.keys_under_root: true
    json.add_error_key: true
processors:
  - add_fields:
      target: ''
      fields:
        service: heliosdb-nl2sql
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "heliosdb-nl2sql-%{+yyyy.MM.dd}"

Log Queries (Elasticsearch/Kibana):

# Find slow queries
GET heliosdb-nl2sql-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "target": "heliosdb_nl2sql::engine" }},
        { "range": { "fields.latency_ms": { "gte": 5000 }}}
      ]
    }
  },
  "sort": [{ "fields.latency_ms": "desc" }]
}
# Analyze error patterns
GET heliosdb-nl2sql-*/_search
{
  "query": { "match": { "level": "ERROR" }},
  "aggs": {
    "error_types": {
      "terms": { "field": "fields.error_type.keyword" }
    }
  }
}

4.5 Health Check Endpoints

Liveness Probe (checks if service is running):

curl -f http://localhost:8080/health/liveness
# Response: { "status": "alive", "timestamp": "2025-10-28T10:23:45Z" }

Readiness Probe (checks if service can accept traffic):

curl -f http://localhost:8080/health/readiness
# Response:
# {
# "status": "ready",
# "checks": {
# "database": "ok",
# "llm_providers": "ok",
# "statistics_cache": "ok"
# },
# "timestamp": "2025-10-28T10:23:45Z"
# }

Startup Probe (checks if service has started):

curl -f http://localhost:8080/health/startup
# Response: { "status": "started", "uptime_seconds": 3600 }
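Deploy scripts and load balancers that cannot express Kubernetes-style probes can poll these endpoints with a small retry loop; a sketch (the endpoint path and response shape are taken from the examples above, everything else is illustrative):

```python
# Poll the readiness endpoint until the service reports ready.
import json
import time
import urllib.request

def wait_until_ready(base_url="http://localhost:8080", attempts=10, delay=3.0):
    """Return the readiness payload once status == "ready", else raise."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"{base_url}/health/readiness",
                                        timeout=5) as resp:
                body = json.load(resp)
                if body.get("status") == "ready":
                    return body  # includes per-dependency "checks"
        except OSError:
            pass  # connection refused / timeout: service still starting
        time.sleep(delay)
    raise TimeoutError(f"{base_url} not ready after {attempts} attempts")
```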

5. Operational Runbooks

5.1 Common Operational Tasks

Task 1: Refresh Statistics Manually

When to Use: After bulk data load, schema changes, or when statistics are stale.

Procedure:

# Refresh all tables
heliosdb nl2sql statistics refresh --all
# Refresh specific table
heliosdb nl2sql statistics refresh --table customers --priority high
# Refresh based on staleness
heliosdb nl2sql statistics refresh --stale-threshold 24h --priority auto

Verification:

SELECT table_name, last_updated
FROM nl2sql_statistics_table
WHERE table_name = 'customers';

Task 2: Adjust Refresh Priorities

When to Use: To optimize statistics freshness for critical tables.

Procedure:

# Set high priority for critical tables
heliosdb nl2sql refresh-manager set-priority \
--table orders --priority 9
heliosdb nl2sql refresh-manager set-priority \
--table customers --priority 8
# Lower priority for archival tables
heliosdb nl2sql refresh-manager set-priority \
--table historical_data --priority 2

Configuration:

[nl2sql.refresh_manager]
high_priority_threshold = 8
medium_priority_threshold = 5
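The priority thresholds above map a 0-10 task priority onto queue tiers, with higher-priority tables refreshed first; a heapq-based sketch of such a refresh queue (the tier mapping mirrors the config defaults, the queue API is hypothetical):

```python
# Priority-queue sketch for the statistics refresh manager (illustrative).
import heapq

HIGH_THRESHOLD = 8    # mirrors high_priority_threshold
MEDIUM_THRESHOLD = 5  # mirrors medium_priority_threshold

def tier(priority: int) -> str:
    """Map a 0-10 priority onto the configured tiers."""
    if priority >= HIGH_THRESHOLD:
        return "high"
    if priority >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"

class RefreshQueue:
    """Max-priority queue of (table, priority) refresh tasks."""
    def __init__(self):
        self._heap = []

    def push(self, table: str, priority: int):
        heapq.heappush(self._heap, (-priority, table))  # negate: heapq is a min-heap

    def pop(self):
        neg, table = heapq.heappop(self._heap)
        return table, -neg
```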

Task 3: Clear Cache (Emergency)

When to Use: After schema changes, corrupt cache, or to force re-translation.

Procedure:

# Clear entire cache
heliosdb nl2sql cache clear --all
# Clear specific table entries
heliosdb nl2sql cache clear --table customers
# Clear stale entries only
heliosdb nl2sql cache clear --older-than 7d

Redis Cache Clear:

redis-cli --scan --pattern "nl2sql:*" | xargs redis-cli DEL

Task 4: Update LLM Provider Priorities

When to Use: When primary provider is down or has degraded performance.

Procedure:

# Update config
[nl2sql]
default_provider = "anthropic" # Switch from OpenAI
fallback_providers = ["openai", "deepseek"]
# Or via environment variable
export NL2SQL_DEFAULT_PROVIDER="anthropic"

Reload Configuration (without restart):

sudo systemctl reload heliosdb-nl2sql
# Or send SIGHUP
sudo kill -HUP $(pgrep heliosdb)
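The primary/fallback chain (default_provider plus fallback_providers) amounts to a try-in-order loop; a sketch where translate stands in for the real LLM call:

```python
# Failover sketch: try providers in configured order, return first success.
def translate_with_failover(nl_query, providers, translate):
    """`translate(provider, query)` is a placeholder for the real LLM call."""
    errors = {}
    for provider in providers:
        try:
            return provider, translate(provider, nl_query)
        except Exception as exc:  # real code would catch provider-specific errors
            errors[provider] = exc
    raise RuntimeError(f"all providers failed: {errors}")
```

If OpenAI is failing, the loop transparently returns the Anthropic result, which is why fallback-provider usage spiking (see 5.2) is a useful early signal of primary-provider trouble.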

Task 5: Enable/Disable Agentic Decomposition

When to Use: To reduce latency or troubleshoot complex query issues.

Procedure:

# Disable agentic decomposition
[nl2sql.agentic]
enabled = false
# Or increase threshold to reduce usage
complexity_threshold = 8 # Only for very complex queries

Runtime Toggle (via API):

curl -X POST http://localhost:8080/admin/features \
-H "Authorization: Bearer $JWT_TOKEN" \
-d '{"agentic_enabled": false}'

5.2 Troubleshooting Procedures

Issue: High Query Latency

Symptoms:

  • p95 latency >5s
  • Timeout errors
  • User complaints

Decision Tree:

High Latency?
├─ Check LLM Provider Latency
│ ├─ >3s? → Switch provider or increase timeout
│ └─ <3s → Continue
├─ Check Agentic Decomposition
│ ├─ Many tasks (>6)? → Increase complexity_threshold
│ └─ Few tasks → Continue
├─ Check Statistics Staleness
│ ├─ >24h? → Refresh statistics
│ └─ Fresh → Continue
└─ Check Database Connection
├─ Pool exhausted? → Increase pool_size
└─ OK → Check cache hit rate
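The decision tree above can be encoded as an ordered list of checks; a sketch over pre-collected metric values (the names and thresholds mirror the tree; the input dict is assumed to be scraped from /metrics):

```python
# Latency triage sketch: walk the decision tree over collected metric values.
def diagnose_latency(metrics: dict) -> str:
    if metrics.get("llm_p95_seconds", 0) > 3:
        return "switch LLM provider or increase timeout"
    if metrics.get("agentic_tasks_p95", 0) > 6:
        return "increase complexity_threshold"
    if metrics.get("stats_age_hours", 0) > 24:
        return "refresh statistics"
    if metrics.get("pool_utilization", 0) > 0.9:
        return "increase pool_size"
    return "check cache hit rate"
```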

Actions:

# 1. Check current latency breakdown
curl http://localhost:9090/metrics | grep nl2sql_query_duration
# 2. Identify bottleneck
# - If LLM latency high: Switch provider
# - If many tasks: Increase complexity threshold
# - If cache miss: Warm up cache
# 3. Temporarily disable agentic for latency-sensitive queries
curl -X POST http://localhost:8080/admin/features \
-d '{"agentic_complexity_threshold": 10}'

Issue: LLM Provider Failures

Symptoms:

  • llm_request_error in logs
  • Increased error rate
  • Fallback provider usage spiking

Diagnosis:

# Check LLM provider status
curl http://localhost:9090/metrics | grep nl2sql_llm_requests_total
# Check error logs
journalctl -u heliosdb-nl2sql -n 100 | grep "llm_request_error"

Actions:

# 1. Verify API key validity
echo $OPENAI_API_KEY | wc -c # Should be ~50+ chars
# 2. Test API connectivity
curl https://api.openai.com/v1/models \
-H "Authorization: Bearer $OPENAI_API_KEY"
# 3. Switch to fallback provider
export NL2SQL_DEFAULT_PROVIDER="anthropic"
sudo systemctl reload heliosdb-nl2sql
# 4. Monitor fallback performance
watch -n 5 'curl -s http://localhost:9090/metrics | grep nl2sql_llm_requests_total'

Issue: Stale Statistics

Symptoms:

  • Poor query plans
  • Unexpected full table scans
  • Inaccurate cardinality estimates

Diagnosis:

SELECT
table_name,
last_updated,
NOW() - last_updated AS staleness
FROM nl2sql_statistics_table
WHERE NOW() - last_updated > INTERVAL '24 hours'
ORDER BY staleness DESC;

Actions:

# 1. Identify stale tables
heliosdb nl2sql statistics list --stale-threshold 24h
# 2. Prioritize refresh
for table in orders customers products; do
heliosdb nl2sql statistics refresh --table $table --priority high
done
# 3. Verify refresh
watch -n 10 'psql -c "SELECT table_name, last_updated FROM nl2sql_statistics_table WHERE table_name IN ('"'"'orders'"'"', '"'"'customers'"'"');"'
# 4. Update refresh schedule
# Edit cron job to run more frequently
crontab -e
# Change: 0 2 * * * → 0 */6 * * * (every 6 hours)

Issue: Memory Exhaustion

Symptoms:

  • OOM kills
  • High swap usage
  • Cache eviction rate high

Diagnosis:

# Check memory usage
free -h
ps aux | grep heliosdb | awk '{print $2, $4, $6}' | sort -k3 -rn | head -1
# Check cache size
curl http://localhost:9090/metrics | grep nl2sql_cache_memory_bytes

Actions:

# 1. Reduce cache sizes
[nl2sql]
cache_size_mb = 250 # Reduce from 500
[nl2sql.statistics]
cache_size_mb = 50 # Reduce from 100
column_cache_size_mb = 100 # Reduce from 200
# 2. Enable Redis for distributed caching (offload memory)
[cache.redis]
enabled = true
# 3. Restart service
sudo systemctl restart heliosdb-nl2sql
# 4. Monitor memory
watch -n 5 'free -h && curl -s http://localhost:9090/metrics | grep cache_memory'

5.3 Performance Optimization

Optimization 1: Increase Cache Hit Rate

Target: >70% cache hit rate

Actions:

# 1. Increase cache TTL
[nl2sql]
cache_ttl = 7200 # 2 hours
# 2. Pre-warm cache with common queries
# Create warmup script
warmup-cache.sh
#!/bin/bash
QUERIES=(
"Show all active customers"
"List top 10 products by sales"
"Count orders by status"
)
for query in "${QUERIES[@]}"; do
curl -X POST http://localhost:8080/api/nl2sql/translate \
-H "Content-Type: application/json" \
-d "{\"query\": \"$query\"}"
done

Optimization 2: Reduce Agentic Decomposition Overhead

Target: <20% of queries use decomposition

Actions:

# Increase complexity threshold
[nl2sql.agentic]
complexity_threshold = 7 # Up from 5
max_tasks = 6 # Cap at 6 tasks

Monitor Impact:

Terminal window
# Before and after comparison
curl -s http://localhost:9090/metrics | grep nl2sql_agentic_decompositions_total

Optimization 3: Optimize Statistics Collection

Target: Statistics refresh <5 minutes

Actions:

Terminal window
# 1. Identify slow tables
heliosdb nl2sql statistics analyze --report
# 2. Disable histogram collection for large tables
[nl2sql.statistics]
enable_histograms = false # Or reduce buckets
histogram_buckets = 50 # Down from 100
min_table_size = 10000 # Skip small tables

5.4 Capacity Planning

Metrics to Track:

# Query rate growth
rate(nl2sql_queries_total[1d]) - rate(nl2sql_queries_total[1d] offset 7d)
# P95 latency trend
nl2sql_query_duration_seconds{quantile="0.95"} offset 7d
# Memory growth rate
delta(nl2sql_cache_memory_bytes[1d])

Capacity Projections:

# Example: Estimate capacity needs
import pandas as pd
import numpy as np
# Historical data (queries per day)
data = pd.Series([1000, 1200, 1350, 1500, 1700, 2000, 2400])
# Linear regression for 30-day projection
from sklearn.linear_model import LinearRegression
X = np.array(range(len(data))).reshape(-1, 1)
y = data.values
model = LinearRegression().fit(X, y)
# Predict day 30
predicted = model.predict([[30]])[0]
print(f"Projected queries/day in 30 days: {predicted:.0f}")
# Calculate required resources
# Assume: one 4-core / 8GB unit sustains ~25 qps
current_qps = 2400 / 86400 # ≈ 0.03 average qps
projected_qps = predicted / 86400
cpu_cores_needed = (projected_qps / 25) * 4
memory_gb_needed = (projected_qps / 25) * 8
print(f"Resources needed:")
print(f" CPU cores: {cpu_cores_needed:.0f}")
print(f" Memory: {memory_gb_needed:.0f}GB")

5.5 Backup and Recovery

Backup Strategy:

1. Configuration Backup:

backup-config.sh
#!/bin/bash
BACKUP_DIR="/var/backups/heliosdb-nl2sql"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p $BACKUP_DIR
tar -czf $BACKUP_DIR/config-$DATE.tar.gz \
/etc/heliosdb/nl2sql.toml \
/etc/heliosdb/nl2sql.env
# Retain 30 days
find $BACKUP_DIR -name "config-*.tar.gz" -mtime +30 -delete

2. Statistics Backup (PostgreSQL):

backup-statistics.sh
#!/bin/bash
BACKUP_DIR="/var/backups/heliosdb-nl2sql"
DATE=$(date +%Y%m%d_%H%M%S)
pg_dump -h localhost -U heliosdb \
-t nl2sql_statistics_table \
-t nl2sql_statistics_column \
heliosdb_metadata | gzip > $BACKUP_DIR/statistics-$DATE.sql.gz
# Retain 7 days
find $BACKUP_DIR -name "statistics-*.sql.gz" -mtime +7 -delete

3. Cache Snapshot (Redis):

backup-cache.sh
#!/bin/bash
redis-cli --rdb /var/backups/heliosdb-nl2sql/cache-$(date +%Y%m%d).rdb

Recovery Procedure:

Terminal window
# 1. Restore configuration
tar -xzf /var/backups/heliosdb-nl2sql/config-20251028_120000.tar.gz -C /
# 2. Restore statistics
gunzip < /var/backups/heliosdb-nl2sql/statistics-20251028.sql.gz | \
psql -h localhost -U heliosdb heliosdb_metadata
# 3. Restart service
sudo systemctl restart heliosdb-nl2sql
# 4. Verify health
curl http://localhost:8080/health/readiness
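Before restoring, confirm which archive is newest. This hypothetical helper keys off the `YYYYMMDD_HHMMSS` stamp the backup scripts above embed in each filename:

```python
from datetime import datetime

def newest_backup(filenames, prefix="config-"):
    """Pick the most recent backup by its embedded timestamp."""
    def stamp(name):
        # "config-20251028_120000.tar.gz" -> datetime(2025, 10, 28, 12, 0)
        base = name[len(prefix):].split(".")[0]
        return datetime.strptime(base, "%Y%m%d_%H%M%S")
    return max(filenames, key=stamp)

names = [
    "config-20251026_020000.tar.gz",
    "config-20251028_120000.tar.gz",
    "config-20251027_020000.tar.gz",
]
print(newest_backup(names))  # config-20251028_120000.tar.gz
```

Feed it the listing of `/var/backups/heliosdb-nl2sql` (e.g. via `glob`) to select the archive for step 1.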

6. Production Checklist

6.1 Pre-Deployment Validation

Configuration Validation:

Terminal window
# Validate TOML syntax
heliosdb nl2sql config validate --config /etc/heliosdb/nl2sql.toml
# Test database connectivity
heliosdb nl2sql config test-db --config /etc/heliosdb/nl2sql.toml
# Test LLM provider connectivity
heliosdb nl2sql config test-llm --provider all --config /etc/heliosdb/nl2sql.toml
# Verify API keys are set
env | grep -E '(OPENAI|ANTHROPIC|DEEPSEEK|GROK)_API_KEY'
# Check filesystem permissions
ls -la /etc/heliosdb/nl2sql.toml # Should be 640, root:heliosdb
ls -ld /var/lib/heliosdb # Should be 755, heliosdb:heliosdb

Security Validation:

Terminal window
# Verify TLS certificates
openssl x509 -in /etc/heliosdb/certs/server.crt -noout -dates
# Test JWT secret strength
echo -n $JWT_SECRET | wc -c # Should be ≥32 characters
# Verify rate limiting
curl -I http://localhost:8080/health # Check for X-RateLimit headers
# Check firewall rules
sudo iptables -L -n | grep 8080

Performance Validation:

Terminal window
# Load test with sample queries
ab -n 100 -c 10 -T 'application/json' \
-p test-query.json \
http://localhost:8080/api/nl2sql/translate
# Verify connection pool size
curl http://localhost:9090/metrics | grep nl2sql_db_connections
# Check memory limits
cat /proc/$(pgrep heliosdb)/limits | grep "Max resident set"

6.2 Post-Deployment Verification

Functional Tests:

Terminal window
# Test simple query translation
curl -X POST http://localhost:8080/api/nl2sql/translate \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $JWT_TOKEN" \
-d '{"query": "Show all customers"}' | jq .
# Test complex query (agentic decomposition)
curl -X POST http://localhost:8080/api/nl2sql/translate \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $JWT_TOKEN" \
-d '{"query": "Find top 10 customers by revenue in Q1 2024 with order counts"}' | jq .
# Test Code Advisor
curl -X POST http://localhost:8080/api/nl2sql/advisor/analyze \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $JWT_TOKEN" \
-d '{"sql": "SELECT * FROM orders WHERE customer_id = 123"}' | jq .
# Test statistics endpoint
curl http://localhost:8080/api/nl2sql/statistics/tables | jq .

Monitoring Tests:

Terminal window
# Verify metrics endpoint
curl http://localhost:9090/metrics | grep nl2sql_queries_total
# Check health endpoints
curl http://localhost:8080/health/liveness
curl http://localhost:8080/health/readiness
curl http://localhost:8080/health/startup
# Verify logs are being written
journalctl -u heliosdb-nl2sql -n 10 --no-pager
# Test alerting (trigger error condition)
# Check that Prometheus alerts fire and PagerDuty notifications sent

Integration Tests:

Terminal window
# Verify schema sync
heliosdb nl2sql schema list | grep -q "customers" && echo "✓ Schema synced"
# Verify statistics collection (any rows at all)
[ "$(psql -t -A -c 'SELECT COUNT(*) FROM nl2sql_statistics_table;')" -gt 0 ] && echo "✓ Statistics collected"
# Test cache functionality
# Run same query twice, verify second is faster
time1=$(curl -w "%{time_total}" -o /dev/null -s -X POST -H "Content-Type: application/json" http://localhost:8080/api/nl2sql/translate -d '{"query":"test"}')
time2=$(curl -w "%{time_total}" -o /dev/null -s -X POST -H "Content-Type: application/json" http://localhost:8080/api/nl2sql/translate -d '{"query":"test"}')
echo "First: ${time1}s, Second: ${time2}s (should be faster)"

6.3 Performance Baseline

Establish Baseline Metrics:

Terminal window
# Run benchmark suite
heliosdb nl2sql benchmark run --output baseline-$(date +%Y%m%d).json
# Example baseline results:
# {
# "simple_query_p50": 0.12s,
# "simple_query_p95": 0.35s,
# "complex_query_p50": 1.8s,
# "complex_query_p95": 3.2s,
# "cache_hit_rate": 0.72,
# "throughput_qps": 350,
# "agentic_decomposition_rate": 0.18
# }

Compare Against Baseline:

Terminal window
# After 1 week, re-run benchmark
heliosdb nl2sql benchmark run --output week1-$(date +%Y%m%d).json
# Compare
heliosdb nl2sql benchmark compare \
--baseline baseline-20251028.json \
--current week1-20251104.json
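The comparison can be sketched as a regression check over the two result files. Field names follow the example baseline in 6.3; the 10% tolerance and the `compare_benchmarks` helper are illustrative assumptions, not the CLI's actual report format:

```python
def compare_benchmarks(baseline, current, tolerance=0.10):
    """Flag metrics that regressed more than `tolerance` vs. baseline.
    Latency-style metrics regress upward; rates/throughput regress downward."""
    higher_is_better = {"cache_hit_rate", "throughput_qps"}
    regressions = {}
    for key, base in baseline.items():
        cur = current.get(key)
        if cur is None or base == 0:
            continue
        delta = (cur - base) / base  # relative change
        if key in higher_is_better and delta < -tolerance:
            regressions[key] = delta
        elif key not in higher_is_better and delta > tolerance:
            regressions[key] = delta
    return regressions

baseline = {"simple_query_p95": 0.35, "throughput_qps": 350}
week1 = {"simple_query_p95": 0.50, "throughput_qps": 340}
print(compare_benchmarks(baseline, week1))  # flags simple_query_p95 (+43%)
```

A ~3% throughput dip stays within tolerance, while the p95 latency jump is flagged for investigation.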

6.4 Monitoring Setup Verification

Prometheus Targets:

Terminal window
# Verify Prometheus is scraping NL2SQL metrics
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="heliosdb-nl2sql")'
# Should show:
# {
# "health": "up",
# "lastScrape": "2025-10-28T10:23:45Z",
# "scrapeUrl": "http://heliosdb-nl2sql:9090/metrics"
# }

Grafana Dashboards:

Terminal window
# Verify dashboard exists
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
http://localhost:3000/api/search?query=NL2SQL | jq .
# Test dashboard data
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
'http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=nl2sql_queries_total'

Alerting:

Terminal window
# Verify alert rules loaded
curl http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="nl2sql_alerts")'
# Test alert firing (optional - trigger condition manually)
# Disable LLM provider temporarily to trigger NL2SQLLLMProviderDown alert

6.5 Documentation Handoff

Required Documentation:

  • Architecture diagram (saved to /docs/nl2sql/PRODUCTION_DEPLOYMENT_GUIDE.md)
  • Configuration reference (TOML + env vars)
  • Runbook procedures (common tasks, troubleshooting)
  • Monitoring dashboard URLs
  • Alert contact information (PagerDuty, Slack)
  • Backup/restore procedures
  • Escalation matrix

Handoff Checklist:

- [ ] Operations team trained on deployment guide
- [ ] On-call rotation established
- [ ] Runbook procedures tested in staging
- [ ] Grafana dashboard access granted
- [ ] PagerDuty integration tested
- [ ] Backup/restore procedures verified
- [ ] Emergency contacts documented
- [ ] Knowledge transfer session completed

7. Scaling Guidelines

7.1 Horizontal Scaling Strategies

Load Balancer Configuration (NGINX):

upstream nl2sql_backend {
    least_conn; # Use least-connections algorithm
    server nl2sql-1:8080 max_fails=3 fail_timeout=30s;
    server nl2sql-2:8080 max_fails=3 fail_timeout=30s;
    server nl2sql-3:8080 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

server {
    listen 80;
    server_name nl2sql.heliosdb.com;

    location /api/nl2sql/ {
        proxy_pass http://nl2sql_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Increase timeouts for complex queries
        proxy_read_timeout 60s;
        proxy_connect_timeout 10s;
    }

    location /health/ {
        proxy_pass http://nl2sql_backend;
        proxy_read_timeout 5s;
    }
}

Kubernetes Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: heliosdb-nl2sql
  labels:
    app: heliosdb-nl2sql
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: heliosdb-nl2sql
  template:
    metadata:
      labels:
        app: heliosdb-nl2sql
    spec:
      containers:
        - name: nl2sql
          image: heliosdb-nl2sql:latest
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-api-keys
                  key: openai
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: database-credentials
                  key: url
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
            limits:
              cpu: "8"
              memory: "16Gi"
          livenessProbe:
            httpGet:
              path: /health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /app/config.toml
              subPath: nl2sql.toml
      volumes:
        - name: config
          configMap:
            name: nl2sql-config
---
apiVersion: v1
kind: Service
metadata:
  name: heliosdb-nl2sql
spec:
  type: LoadBalancer
  selector:
    app: heliosdb-nl2sql
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: metrics
      port: 9090
      targetPort: 9090
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: heliosdb-nl2sql-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: heliosdb-nl2sql
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: nl2sql_queries_total
        target:
          type: AverageValue
          averageValue: "100" # 100 qps per pod

7.2 Resource Allocation

Sizing Guide (per instance):

| Workload Type | CPU Cores | Memory | Queries/sec |
|---|---|---|---|
| Light (simple queries) | 2-4 | 4-8GB | 50-100 |
| Medium (mixed) | 4-8 | 8-16GB | 100-200 |
| Heavy (complex, agentic) | 8-16 | 16-32GB | 200-400 |
| Very Heavy (high concurrency) | 16-32 | 32-64GB | 400-800 |

Formula:

CPU_cores = (target_qps * avg_query_latency_s * overhead_factor) / concurrency_factor
Memory_GB = (cache_size_mb + column_cache_size_mb + base_mb) / 1024

Example:

# Target: 300 qps, avg latency 1.5s
target_qps = 300
avg_latency = 1.5
overhead = 1.5
concurrency = 0.8 # 80% of queries can run in parallel
concurrent_requests = (target_qps * avg_latency * overhead) / concurrency
# = (300 * 1.5 * 1.5) / 0.8 ≈ 844 requests in flight
# Most of that time is spent waiting on the LLM (I/O-bound), so one
# core can service many in-flight requests; at roughly 70 per core
# that works out to ~12 cores
# Memory
cache_mb = 500
column_cache_mb = 200
base_mb = 2048
memory_gb = (cache_mb + column_cache_mb + base_mb) / 1024
# = 2748 / 1024 ≈ 2.7GB → Recommend 8GB with headroom

7.3 Load Balancing

Session Affinity (for cache locality):

# NGINX - IP hash for session affinity
upstream nl2sql_backend {
ip_hash; # Route same IP to same backend
server nl2sql-1:8080;
server nl2sql-2:8080;
server nl2sql-3:8080;
}

Kubernetes Service Mesh (Istio):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: nl2sql-lb
spec:
  host: heliosdb-nl2sql
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: "X-User-ID" # Route based on user ID
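The effect of hashing on the `X-User-ID` header can be illustrated with a few lines of Python. This is a simplified modulo sketch: it shows the stable user-to-backend mapping, while real consistent hashing additionally minimizes remapping when the backend set changes:

```python
import hashlib

def pick_backend(user_id, backends):
    """Deterministically map a user ID to one backend (stable routing)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]

backends = ["nl2sql-1:8080", "nl2sql-2:8080", "nl2sql-3:8080"]
first = pick_backend("user-42", backends)
assert first == pick_backend("user-42", backends)  # same user, same backend
print(first)
```

Stable routing means repeated queries from the same user hit the same instance's warm cache, which is the point of session affinity here.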

Health Check-Based Routing:

# Remove unhealthy backends automatically
# Note: the `check*` directives below require the upstream-check module
# (shipped with Tengine); stock NGINX relies on passive max_fails only
upstream nl2sql_backend {
    server nl2sql-1:8080 max_fails=3 fail_timeout=30s;
    server nl2sql-2:8080 max_fails=3 fail_timeout=30s backup; # Backup server
    check interval=5000 rise=2 fall=3 timeout=3000 type=http;
    check_http_send "GET /health/readiness HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

7.4 Multi-Region Deployment

Architecture:

┌─────────────────┐
│ Global DNS │
│ (Route 53) │
└────────┬────────┘
┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ US-EAST-1 │ │ EU-WEST-1 │ │ AP-SOUTH-1 │
│ (Primary) │ │ (Secondary) │ │ (Secondary) │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
┌──────▼───────┐ ┌──────▼───────┐ ┌──────▼───────┐
│ NL2SQL │ │ NL2SQL │ │ NL2SQL │
│ Cluster │ │ Cluster │ │ Cluster │
│ (3 nodes) │ │ (3 nodes) │ │ (3 nodes) │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
┌──────▼───────────────────────────────────▼──────┐
│ Global Statistics Replication │
│ (PostgreSQL Multi-Master / CockroachDB) │
└──────────────────────────────────────────────────┘

Configuration (per region):

[deployment]
region = "us-east-1"
primary_region = true
# Replicate statistics to other regions
[statistics.replication]
enabled = true
targets = [
"eu-west-1.nl2sql.heliosdb.com",
"ap-south-1.nl2sql.heliosdb.com"
]
sync_interval = 300 # 5 minutes

Failover Strategy:

# Route 53 health check
Type: HTTPS
ResourcePath: /health/readiness
FailureThreshold: 3
MeasureLatency: true
Regions:
  - us-east-1 (Primary - Weight: 100)
  - eu-west-1 (Failover - Weight: 0)
  - ap-south-1 (Failover - Weight: 0)
# Automatic failover if primary unhealthy for >1 minute

8. Troubleshooting Guide

8.1 Common Issues and Solutions

Issue 1: Query Translation Failures

Symptoms:

  • translation_error in response
  • Error: “Failed to parse SQL from LLM response”
  • Confidence score consistently low (<0.5)

Root Causes:

  1. Ambiguous natural language query
  2. LLM provider issue
  3. Schema mismatch
  4. Complex query beyond LLM capabilities

Solutions:

Step 1: Check query complexity

Terminal window
curl -X POST http://localhost:8080/api/nl2sql/analyze \
-d '{"query": "YOUR_QUERY"}' | jq '.complexity'
# If complexity >8, query may be too complex

Step 2: Verify schema is synced

SELECT table_name, last_updated
FROM metadata_tables
WHERE table_name IN ('referenced_table_1', 'referenced_table_2');
-- If last_updated is old, re-sync
Terminal window
heliosdb nl2sql schema sync --force

Step 3: Try different LLM provider

Terminal window
curl -X POST http://localhost:8080/api/nl2sql/translate \
-d '{"query": "YOUR_QUERY", "provider": "anthropic"}' | jq .

Step 4: Simplify query

# Instead of:
"Show me the top 10 customers by total revenue in Q1 2024, grouped by region, with running totals and percent of total"
# Try:
"Show top 10 customers by revenue in 2024"

Issue 2: Agentic Decomposition Timeouts

Symptoms:

  • agentic_timeout errors
  • Queries taking >30s
  • Many tasks (>8) in decomposition

Root Causes:

  1. Too many tasks generated
  2. Sequential execution (not parallelized)
  3. LLM provider slow
  4. Database connection slow

Solutions:

Step 1: Increase timeout temporarily

[nl2sql.agentic]
timeout_seconds = 60 # Increase from 30

Step 2: Enable parallel execution

[nl2sql.agentic]
parallel_execution = true
max_parallel_tasks = 6 # Increase from 4

Step 3: Reduce max tasks

[nl2sql.agentic]
max_tasks = 6 # Reduce from 10
complexity_threshold = 7 # Increase to reduce decomposition frequency

Step 4: Check LLM provider latency

Terminal window
curl http://localhost:9090/metrics | grep nl2sql_llm_request_duration_seconds
# If >3s, switch provider
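The `parallel_execution` / `max_parallel_tasks` settings in Step 2 amount to bounded concurrency over the decomposed subtasks. This asyncio sketch shows the idea under stated assumptions (it is not the engine's actual scheduler):

```python
import asyncio
import time

async def run_tasks(tasks, max_parallel=4):
    """Run task factories concurrently, capped at `max_parallel` in flight."""
    sem = asyncio.Semaphore(max_parallel)
    async def guarded(task):
        async with sem:
            return await task()
    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(guarded(t) for t in tasks))

async def subquery(i):
    await asyncio.sleep(0.05)  # stand-in for an LLM call
    return f"task-{i}"

async def main():
    tasks = [lambda i=i: subquery(i) for i in range(6)]
    start = time.perf_counter()
    results = await run_tasks(tasks, max_parallel=6)
    print(results, f"{time.perf_counter() - start:.2f}s")

asyncio.run(main())
```

With six 50 ms subtasks, full parallelism finishes in roughly one task's latency instead of six, which is why raising `max_parallel_tasks` helps when decomposition times out.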

Issue 3: Statistics Collection Failures

Symptoms:

  • statistics_collection_error in logs
  • get_table_stats returns None
  • Cardinality estimates are wrong

Root Causes:

  1. Database permission issues
  2. Table not in metadata
  3. Statistics refresh manager queue full
  4. Database connection timeout

Solutions:

Step 1: Check database permissions

-- PostgreSQL
GRANT SELECT ON ALL TABLES IN SCHEMA public TO heliosdb;
GRANT SELECT ON pg_statistic TO heliosdb;
-- MySQL
GRANT SELECT ON *.* TO 'heliosdb'@'localhost';
GRANT SELECT ON mysql.innodb_table_stats TO 'heliosdb'@'localhost';

Step 2: Verify table exists in metadata

SELECT * FROM metadata_tables WHERE table_name = 'your_table';
-- If not found, re-sync schema

Step 3: Check refresh queue

Terminal window
heliosdb nl2sql refresh-manager status
# If queue full, increase capacity
[nl2sql.refresh_manager]
queue_capacity = 2000 # Increase from 1000
worker_threads = 8 # Increase from 4

Step 4: Manual statistics collection

Terminal window
heliosdb nl2sql statistics refresh --table your_table --force

Issue 4: Code Advisor Performance Issues

Symptoms:

  • Advisor analysis taking >5s
  • Memory spikes during analysis
  • Incomplete recommendations

Root Causes:

  1. Large SQL queries (>10KB)
  2. Statistics lookup overhead
  3. Metadata adapter slow
  4. Too many index recommendations

Solutions:

Step 1: Disable statistics integration temporarily

[nl2sql.advisor]
use_statistics = false # Temporarily disable

Step 2: Limit index recommendations

[nl2sql.advisor]
max_index_recommendations = 5 # Cap at 5

Step 3: Cache advisor results

# Enable Redis caching for advisor results
[cache.redis]
enabled = true
advisor_cache_ttl = 1800 # 30 minutes

Step 4: Increase advisor timeout

[nl2sql.advisor]
analysis_timeout_seconds = 10 # Increase from 5

8.2 Debug Procedures

Enable Debug Logging:

Terminal window
# Temporary (runtime) — a plain `export` does not reach the systemd unit
sudo systemctl set-environment RUST_LOG="debug,heliosdb_nl2sql=trace"
sudo systemctl restart heliosdb-nl2sql
# Permanent (config)
[observability]
log_level = "debug"

Trace Specific Query:

Terminal window
# Enable tracing for single query
curl -X POST http://localhost:8080/api/nl2sql/translate \
-H "X-Trace-Enabled: true" \
-d '{"query": "YOUR_QUERY"}' | jq .
# View trace in Jaeger
open http://localhost:16686

Profile Memory Usage:

Terminal window
# Using heaptrack
heaptrack /usr/local/bin/heliosdb nl2sql --config /etc/heliosdb/nl2sql.toml
# Analyze
heaptrack_gui heaptrack.heliosdb.*.gz

SQL Query Inspection:

Terminal window
# Enable SQL query logging (PostgreSQL)
psql -c "ALTER SYSTEM SET log_statement = 'all';"
psql -c "SELECT pg_reload_conf();"
# View logs
tail -f /var/lib/postgresql/14/main/log/postgresql-*.log | grep heliosdb

8.3 Log Analysis

Find Slow Queries:

Terminal window
# -o cat strips journald's timestamp prefix so jq sees raw JSON log lines
journalctl -u heliosdb-nl2sql --since "1 hour ago" -o cat | \
jq -r 'select(.fields.latency_ms > 5000) | "\(.timestamp) \(.fields.query_id) \(.fields.latency_ms)ms"'

Analyze Error Patterns:

Terminal window
journalctl -u heliosdb-nl2sql --since "1 day ago" -o cat | \
jq -r 'select(.level == "ERROR") | .fields.error_type' | \
sort | uniq -c | sort -rn

Track LLM Provider Usage:

Terminal window
journalctl -u heliosdb-nl2sql --since "1 hour ago" -o cat | \
jq -r 'select(.target == "heliosdb_nl2sql::llm") | .fields.provider' | \
sort | uniq -c
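The jq pipelines above can also be expressed in Python, which is handy when jq is unavailable or the aggregation gets more complex. The log shape (`fields.error_type`, `fields.provider`) follows the pipelines above; the helper itself is illustrative:

```python
import json
from collections import Counter

def count_by(lines, field):
    """Aggregate JSON-per-line log records by a key under `fields`."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines journalctl may interleave
        value = record.get("fields", {}).get(field)
        if value is not None:
            counts[value] += 1
    return counts

logs = [
    '{"level": "ERROR", "fields": {"error_type": "llm_timeout"}}',
    '{"level": "ERROR", "fields": {"error_type": "llm_timeout"}}',
    '{"level": "ERROR", "fields": {"error_type": "schema_mismatch"}}',
    'not json',
]
print(count_by(logs, "error_type").most_common())
# [('llm_timeout', 2), ('schema_mismatch', 1)]
```

Pipe `journalctl -u heliosdb-nl2sql -o cat` into a script built on this to get the same `sort | uniq -c` style summary.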

8.4 Performance Tuning

Identify Bottlenecks:

Terminal window
# CPU profiling
perf record -F 99 -p $(pgrep heliosdb) -g -- sleep 30
perf report
# I/O profiling
iotop -p $(pgrep heliosdb)
# Network profiling
nethogs # per-process network usage; find heliosdb in the list (nethogs has no per-PID filter)

Optimize Configuration:

Terminal window
# Benchmark different configurations
for threshold in 5 6 7 8; do
echo "Testing complexity_threshold=$threshold"
sed -i "s/complexity_threshold = .*/complexity_threshold = $threshold/" /etc/heliosdb/nl2sql.toml
sudo systemctl reload heliosdb-nl2sql
sleep 10
heliosdb nl2sql benchmark run --duration 60s --output "benchmark-$threshold.json"
done
# Compare results
heliosdb nl2sql benchmark compare benchmark-*.json

Appendix

A. Configuration Template

See Section 2.1 for complete TOML configuration template.

B. Metrics Reference

Complete metrics list:

# Query metrics
nl2sql_queries_total{status="success|error"}
nl2sql_query_duration_seconds{quantile="0.5|0.95|0.99"}
nl2sql_query_complexity_score{quantile="0.5|0.95|0.99"}
# Agentic metrics
nl2sql_agentic_decompositions_total
nl2sql_agentic_tasks_per_query{quantile="0.5|0.95|0.99"}
nl2sql_agentic_stages_per_query{quantile="0.5|0.95|0.99"}
nl2sql_agentic_stage_duration_seconds{stage="1|2|3|..."}
nl2sql_agentic_parallel_speedup_ratio
# Code Advisor metrics
nl2sql_advisor_analyses_total
nl2sql_advisor_quality_score{quantile="0.5|0.95|0.99"}
nl2sql_advisor_bottlenecks_detected{kind="FullTableScan|MissingIndex|..."}
nl2sql_advisor_index_recommendations_total
# Statistics metrics
nl2sql_statistics_cache_entries{type="table|column"}
nl2sql_statistics_cache_hits
nl2sql_statistics_cache_misses
nl2sql_statistics_refreshes_total{priority="high|medium|low"}
nl2sql_statistics_refresh_duration_seconds
nl2sql_statistics_age_seconds{quantile="0.5|0.95|0.99"}
# LLM metrics
nl2sql_llm_requests_total{provider="openai|anthropic|...", status="success|error"}
nl2sql_llm_request_duration_seconds{provider="...", quantile="0.5|0.95|0.99"}
nl2sql_llm_tokens_used_total{provider="...", type="input|output"}
# Cache metrics
nl2sql_cache_hits
nl2sql_cache_misses
nl2sql_cache_entries
nl2sql_cache_memory_bytes
nl2sql_cache_evictions_total
# Database metrics
nl2sql_db_connections_active
nl2sql_db_connections_idle
nl2sql_db_query_duration_seconds{quantile="0.5|0.95|0.99"}
# System metrics
process_cpu_seconds_total
process_resident_memory_bytes
process_open_fds

C. API Endpoints

Translation API:

  • POST /api/nl2sql/translate - Translate natural language to SQL
  • POST /api/nl2sql/analyze - Analyze query complexity (without translation)

Code Advisor API:

  • POST /api/nl2sql/advisor/analyze - Analyze SQL query
  • POST /api/nl2sql/advisor/recommend-indexes - Get index recommendations

Statistics API:

  • GET /api/nl2sql/statistics/tables - List table statistics
  • GET /api/nl2sql/statistics/tables/{table} - Get table statistics
  • POST /api/nl2sql/statistics/refresh - Trigger statistics refresh

Admin API:

  • POST /admin/features - Toggle feature flags
  • POST /admin/cache/clear - Clear cache
  • GET /admin/config - Get current configuration

Health API:

  • GET /health/liveness - Liveness probe
  • GET /health/readiness - Readiness probe
  • GET /health/startup - Startup probe

D. Emergency Contacts

On-Call Rotation:

PagerDuty:

  • Service: HeliosDB-NL2SQL-Production
  • Escalation Policy: Engineering-NL2SQL

Slack Channels:

  • #heliosdb-nl2sql-prod (production alerts)
  • #heliosdb-nl2sql-oncall (on-call coordination)
  • #heliosdb-platform (general platform issues)

End of Production Deployment Guide

Version: 1.0 Maintained by: HeliosDB Platform Team Last Review: October 28, 2025 Next Review: January 28, 2026