Phase 2 Production Deployment - Quick Reference Guide
Phase 2 Production Deployment - Quick Reference Guide
Version: 1.0 Created: November 24, 2025 Target Audience: DevOps Engineers, SREs
One-Page Summary
Deployment Checklist
☐ Pre-Deployment ☐ Run pre-flight checks ☐ Create full backup ☐ Verify monitoring setup ☐ Review rollback plan ☐ Notify stakeholders
☐ Deployment ☐ Execute rolling deployment ☐ Monitor metrics continuously ☐ Run smoke tests ☐ Verify all nodes healthy
☐ Post-Deployment ☐ Run validation suite ☐ Update documentation ☐ Send completion notice ☐ Schedule retrospectiveKey Commands
# Health checkheliosdb-cli health-check
# Deploy./runbooks/rolling-deployment.sh v7.0.0 canary
# Backup./runbooks/backup.sh full
# Rollback./runbooks/rollback-deployment.sh v6.0.0
# Monitorwatch -n 5 'heliosdb-cli cluster-status'Emergency Contacts
- P0 Hotline: +1-XXX-XXX-XXXX
- Slack: #incidents
- PagerDuty: heliosdb-production
Quick Start Commands
Pre-Deployment
# Pre-flight checks./runbooks/pre-deployment-checklist.sh
# Create backup./runbooks/backup.sh full
# Verify monitoringcurl -sf http://localhost:9090/-/healthycurl -sf http://localhost:3000/api/healthDeployment
# Canary deployment (recommended)./runbooks/rolling-deployment.sh v7.0.0 canary
# Rolling deployment./runbooks/rolling-deployment.sh v7.0.0 rolling
# Blue-green deployment./runbooks/rolling-deployment.sh v7.0.0 blue-greenPost-Deployment
# Validation./runbooks/post-deployment-validation.sh
# Check versionheliosdb-cli version --all-nodes
# Monitor./scripts/monitor-deployment.sh 3600Monitoring URLs
| Service | URL | Port |
|---|---|---|
| Prometheus | http://localhost:9090 | 9090 |
| Grafana | http://localhost:3000 | 3000 |
| Alertmanager | http://localhost:9093 | 9093 |
| HeliosDB Metrics | http://localhost:9091 | 9091 |
Key Dashboards
- Cluster Overview: Grafana → Dashboards → HeliosDB Overview
- Performance: Grafana → Dashboards → HeliosDB Performance
- Reliability: Grafana → Dashboards → HeliosDB Reliability
Critical Metrics
Health Indicators
# Cluster healthheliosdb-cli health-check
# Node statusheliosdb-cli list-nodes --status
# Quorum statusheliosdb-cli quorum-status
# Replication lagheliosdb-cli replication-lagPerformance Metrics
# Query latencyheliosdb-cli metrics --query "query_latency_p99"
# Throughputheliosdb-cli metrics --query "qps"
# Resource usageheliosdb-cli resource-usage --all-nodesCommon Incidents & Quick Fixes
P0: Cluster Down
# 1. Check statusheliosdb-cli cluster-status
# 2. Attempt recovery./runbooks/incident-p0-cluster-outage.sh
# 3. If failed, DR failover./runbooks/dr-failover.shP1: High Latency
# 1. Identify slow queriesheliosdb-cli slow-query-log --last 10m
# 2. Check resourcesheliosdb-cli resource-usage
# 3. Apply fixes./runbooks/incident-p1-high-latency.shP2: Single Node Down
# Restart nodeheliosdb-cli restart-node <node-id>
# If restart fails, replaceheliosdb-cli replace-node <node-id>P2: Disk Space Low
# Free up spaceheliosdb-cli cleanup-old-logs <node-id>heliosdb-cli cleanup-old-snapshots <node-id>
# Expand storageheliosdb-cli expand-storage <node-id> --size +100GBBackup & Restore
Create Backup
# Full backup./runbooks/backup.sh full
# Incremental backup./runbooks/backup.sh incremental
# WAL archive./runbooks/backup.sh walVerify Backup
# Verify latest backupheliosdb-backup verify --latest
# Verify specific backupheliosdb-backup verify --path /backup/heliosdb/2025-11-24Restore
# Full restore./runbooks/restore.sh full 2025-11-24
# Point-in-time restore./runbooks/pitr-restore.sh "2025-11-24T14:30:00Z"Configuration Files
Locations
- Main config:
/etc/heliosdb/config.yaml - Secrets:
/etc/heliosdb/secrets.yaml - TLS certs:
/etc/heliosdb/tls/ - Logs:
/var/log/heliosdb/ - Data:
/var/lib/heliosdb/data/
Key Settings
cluster: name: production nodes: 3 replication_factor: 3
performance: cache_size: 32GB max_connections: 1000 query_timeout: 30s
reliability: backup_schedule: "0 2 * * *" wal_archive: true auto_failover: trueAlert Reference
Critical Alerts (P0)
| Alert | Threshold | Action |
|---|---|---|
| ClusterQuorumLost | quorum < 2 | Run incident-p0-cluster-outage.sh |
| NodeDown | up == 0 for 2min | Restart/replace node |
| DataCorruption | checksum_mismatches > 0 | Run integrity check |
| BackupFailed | backup_failed > 0 | Investigate, retry backup |
Warning Alerts (P1)
| Alert | Threshold | Action |
|---|---|---|
| HighQueryLatency | p99 > 100ms for 5min | Run incident-p1-high-latency.sh |
| HighCPUUsage | cpu > 85% for 10min | Check queries, scale if needed |
| DiskSpaceLow | free < 15% | Cleanup or expand storage |
| ReplicationLag | lag > 5min | Check network, resources |
Rollback Procedures
Immediate Rollback
# Within 1 hour of deployment./runbooks/rollback-deployment.sh v6.0.0Delayed Rollback
# After 1 hour (requires data sync)./runbooks/delayed-rollback.sh v6.0.0Rollback Verification
# Check versionheliosdb-cli version --all-nodes
# Verify healthheliosdb-cli health-check
# Run tests./tests/smoke-tests.shMigration Quick Reference
Zero-Downtime Migration
# Phase 1: Prepare./scripts/deployment/zero-downtime-migration.sh $DB $CLUSTER prepare
# Phase 2: Initial sync./scripts/deployment/zero-downtime-migration.sh $DB $CLUSTER initial-sync
# Phase 3: Delta sync./scripts/deployment/zero-downtime-migration.sh $DB $CLUSTER delta-sync
# Phase 4: Cutover./scripts/deployment/zero-downtime-migration.sh $DB $CLUSTER cutover
# Phase 5: Verify./scripts/deployment/zero-downtime-migration.sh $DB $CLUSTER verifyMigration Status
# Check progressheliosdb-migrate status
# Check lagheliosdb-migrate lag-check
# Verify data./scripts/deployment/data-validation.shPerformance Tuning
Quick Wins
# Increase cache sizeheliosdb-cli set-config cache.size=64GB
# Increase connection poolheliosdb-cli set-config connections.max=2000
# Enable query cacheheliosdb-cli set-config query_cache.enabled=true
# Update statisticsheliosdb-cli update-statistics --all-tablesIndex Optimization
# Get recommendationsheliosdb-cli index-recommendations
# Create recommended indexesheliosdb-cli create-recommended-indexes
# Rebuild indexesheliosdb-cli rebuild-indexes --onlineSecurity Quick Checks
Pre-Deployment Security
# Network securitysudo ufw status
# TLS certificatesopenssl x509 -in /etc/heliosdb/tls/server.crt -noout -dates
# Encryption at restheliosdb-cli get-config encryption.at_rest
# Audit loggingheliosdb-cli get-config audit.enabledCompliance Validation
# SOC2 controlsheliosdb-cli compliance-check --framework soc2
# HIPAA controlsheliosdb-cli compliance-check --framework hipaa
# GDPR controlsheliosdb-cli compliance-check --framework gdprCapacity Planning
Current Capacity
# Cluster capacityheliosdb-cli capacity-report
# Resource usageheliosdb-cli resource-usage --detailed
# Growth projectionheliosdb-cli capacity-forecast --days 90Scaling Decisions
Scale Up When:
- CPU > 70% sustained for 1 hour
- Memory > 80% sustained for 30 minutes
- Disk > 75% usage
- P99 latency > 50ms sustained
Scale Down When:
- CPU < 30% sustained for 7 days
- Memory < 40% sustained for 7 days
- Cost optimization opportunity identified
Troubleshooting Quick Tips
Slow Queries
# Identifyheliosdb-cli slow-query-log --last 1h
# Analyzeheliosdb-cli explain-slow-queries
# Fixheliosdb-cli create-recommended-indexesHigh CPU
# Identify culpritheliosdb-cli queries --sort-by cpu
# Kill slow queriesheliosdb-cli kill-slow-queries --min-duration 30s
# Check for hotspotsheliosdb-cli hotspot-analysisMemory Issues
# Check usageheliosdb-cli memory-usage
# Flush cacheheliosdb-cli cache-flush
# Reduce cache sizeheliosdb-cli set-config cache.size=16GBUseful CLI Commands
Cluster Management
# Statusheliosdb-cli cluster-statusheliosdb-cli list-nodesheliosdb-cli quorum-status
# Operationsheliosdb-cli cluster-startheliosdb-cli cluster-stop --gracefulheliosdb-cli cluster-restartNode Management
# Node operationsheliosdb-cli node-status <node-id>heliosdb-cli restart-node <node-id>heliosdb-cli drain-node <node-id>heliosdb-cli replace-node <node-id>Query Management
# Active queriesheliosdb-cli queriesheliosdb-cli slow-query-log
# Query operationsheliosdb-cli kill-query <query-id>heliosdb-cli kill-slow-queriesheliosdb-cli explain-query <query-id>Backup/Restore
# Backupheliosdb-backup full --output /backup/todayheliosdb-backup incremental --base /backup/baseheliosdb-backup wal-archive
# Restoreheliosdb-restore full --input /backup/todayheliosdb-restore pitr --target-time "2025-11-24T14:00:00Z"Log Locations
# Application logs/var/log/heliosdb/heliosdb.log
# Slow query log/var/log/heliosdb/slow-queries.log
# Audit log/var/log/heliosdb/audit.log
# Error log/var/log/heliosdb/error.log
# View recent logstail -f /var/log/heliosdb/heliosdb.log
# Search logsgrep "ERROR" /var/log/heliosdb/heliosdb.logSupport Resources
Documentation
- Full Deployment Plan: PHASE2_PRODUCTION_DEPLOYMENT_PLAN.md
- Operations Runbooks: ../operations/PHASE2_OPERATIONS_RUNBOOKS.md
- Executive Summary: PHASE2_PRODUCTION_DEPLOYMENT_EXECUTIVE_SUMMARY.md
Contact
- Slack: #production-support
- Email: support@heliosdb.io
- Hotline: +1-XXX-XXX-XXXX (24/7)
- PagerDuty: heliosdb-production
On-Call
# Check on-callheliosdb-cli oncall-status
# Page on-callheliosdb-cli page-oncall "Brief description"Keyboard Shortcuts for Common Tasks
# Add to ~/.bashrc for quick access
alias hdb-health='heliosdb-cli health-check'alias hdb-status='heliosdb-cli cluster-status'alias hdb-nodes='heliosdb-cli list-nodes'alias hdb-backup='./runbooks/backup.sh full'alias hdb-logs='tail -f /var/log/heliosdb/heliosdb.log'alias hdb-queries='heliosdb-cli queries'alias hdb-metrics='heliosdb-cli metrics'alias hdb-incident='./runbooks/incident-p0-cluster-outage.sh'Keep This Handy: Print or bookmark for quick reference during deployments and incidents.
Last Updated: November 24, 2025 Document Status: Production Ready