# HeliosDB Infrastructure Setup Guide
## Architecture Overview
HeliosDB production infrastructure uses:
- AWS EKS for Kubernetes orchestration
- VPC with public/private subnets across 3 AZs
- RDS PostgreSQL (optional) for metadata
- ElastiCache Redis for caching
- S3 for backups and logs
- Route53 for DNS
- CloudWatch for monitoring
## Terraform Deployment

### Directory Structure
```
infrastructure/terraform/
├── main.tf              # Main configuration
├── variables.tf         # Variable definitions
├── modules/
│   ├── vpc/             # VPC with 3 AZs
│   ├── eks/             # EKS cluster + node groups
│   ├── rds/             # RDS PostgreSQL
│   ├── elasticache/     # Redis cluster
│   ├── s3/              # S3 buckets
│   ├── iam/             # IAM roles and policies
│   ├── route53/         # DNS configuration
│   ├── monitoring/      # CloudWatch alarms
│   └── security-groups/ # Security groups
└── environments/
    ├── production.tfvars
    └── staging.tfvars
```

### Production Configuration
Create `infrastructure/terraform/environments/production.tfvars`:
```hcl
environment = "production"
aws_region  = "us-west-2"

# VPC
vpc_cidr            = "10.0.0.0/16"
allowed_cidr_blocks = ["10.0.0.0/8"]

# EKS
kubernetes_version     = "1.28"
primary_instance_types = ["r6i.2xlarge", "r6i.4xlarge"]
compute_instance_types = ["c6i.2xlarge", "c6i.4xlarge"]
min_nodes     = 5
max_nodes     = 10
desired_nodes = 5

# RDS (optional)
enable_rds_metadata       = false
rds_instance_class        = "db.r6i.xlarge"
rds_allocated_storage     = 100
rds_max_allocated_storage = 1000

# ElastiCache
elasticache_node_type = "cache.r6g.xlarge"

# S3
backup_retention_days = 30
log_retention_days    = 90

# Route53
enable_route53  = true
domain_name     = "heliosdb.example.com"
route53_zone_id = "Z1234567890ABC"

# Monitoring
alarm_email = "ops@example.com"

# HeliosDB
heliosdb_replicas       = 5
heliosdb_cpu_request    = 2000
heliosdb_memory_request = 4096
heliosdb_cpu_limit      = 4000
heliosdb_memory_limit   = 8192
heliosdb_storage_size   = 100
```

### Deployment Steps
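When several operators share this configuration, Terraform state is usually kept in a remote backend rather than on one machine. A minimal sketch, assuming `main.tf` declares an empty `backend "s3" {}` block and that the state bucket and DynamoDB lock table already exist (both names below are placeholders, not outputs of this configuration):

```shell
# Hypothetical remote-state init; bucket and table names are placeholders
terraform init \
  -backend-config="bucket=heliosdb-terraform-state" \
  -backend-config="key=production/terraform.tfstate" \
  -backend-config="region=us-west-2" \
  -backend-config="dynamodb_table=terraform-locks" \
  -backend-config="encrypt=true"
```

With this in place, `terraform apply` acquires a lock in DynamoDB, so concurrent applies from two operators fail fast instead of corrupting state.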
```bash
# 1. Navigate to the Terraform directory
cd infrastructure/terraform

# 2. Initialize Terraform
terraform init

# 3. Validate configuration
terraform validate

# 4. Review plan
terraform plan -var-file="environments/production.tfvars" -out=plan.tfplan

# 5. Apply infrastructure
terraform apply plan.tfplan

# 6. Save outputs
terraform output > outputs.txt
```

### Important Outputs
```bash
# VPC ID
terraform output vpc_id

# EKS cluster endpoint
terraform output eks_cluster_endpoint

# Load balancer DNS
terraform output load_balancer_dns

# S3 backup bucket
terraform output s3_backup_bucket
```

## Post-Deployment Configuration
### Configure kubectl
```bash
# Get cluster name and region from Terraform output
CLUSTER_NAME=$(terraform output -raw eks_cluster_name)
AWS_REGION=$(terraform output -raw aws_region)

# Update kubeconfig
aws eks update-kubeconfig \
  --name "$CLUSTER_NAME" \
  --region "$AWS_REGION"

# Verify connectivity
kubectl cluster-info
kubectl get nodes
```

### Install Kubernetes Add-ons
```bash
# AWS Load Balancer Controller
helm repo add eks https://aws.github.io/eks-charts
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=$CLUSTER_NAME

# Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# EBS CSI Driver (for persistent volumes)
kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
```

### Install Monitoring Stack
```bash
# Prometheus and Grafana
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi

# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Default credentials: admin/prom-operator
```

## Resource Requirements
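If the admin password has been overridden (or the chart version in use generates one), it can be read back from the secret the chart manages; the secret name below follows the `prometheus` release name used above and is an assumption if you renamed the release:

```shell
# Read the Grafana admin password from the chart-managed secret
# (secret name is <release>-grafana; namespace matches the install above)
kubectl get secret prometheus-grafana -n monitoring \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo
```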
### Per Environment
Production:
- Nodes: 5-10 (r6i.2xlarge or r6i.4xlarge)
- Storage: 500 GB total (100 GB per pod)
- Network: Network Load Balancer (NLB)
- Cache: ElastiCache Redis (cache.r6g.xlarge)
Staging:
- Nodes: 3-5 (r6i.xlarge)
- Storage: 300 GB total
- Network: Application Load Balancer (ALB)
- Cache: ElastiCache Redis (cache.r6g.large)
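The production storage figure is derived rather than independent: 5 replicas × 100 GB per pod, following the `heliosdb_replicas` and `heliosdb_storage_size` values in `production.tfvars`. A quick sanity check:

```shell
# Total persistent storage = replicas x per-pod volume size
replicas=5
storage_per_pod_gb=100
echo "total storage: $(( replicas * storage_per_pod_gb )) GB"
```

Re-run the arithmetic whenever either tfvars value changes, so the node group's EBS capacity planning stays consistent with the StatefulSet.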
## Cost Optimization

### Recommendations
- Use Spot Instances for non-critical workloads:

  ```hcl
  node_groups = {
    spot_group = {
      capacity_type  = "SPOT"
      instance_types = ["r6i.2xlarge", "r5.2xlarge", "r5n.2xlarge"]
    }
  }
  ```

- Enable the cluster autoscaler:

  ```bash
  kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
  ```

- Use S3 Intelligent-Tiering:

  ```hcl
  lifecycle_rules = [
    {
      id      = "intelligent-tiering"
      enabled = true
      transitions = [{
        days          = 30
        storage_class = "INTELLIGENT_TIERING"
      }]
    }
  ]
  ```

## Security Best Practices
- Enable VPC Flow Logs
- Use IAM Roles for Service Accounts (IRSA)
- Enable encryption at rest
- Restrict security group ingress
- Enable audit logging
- Use AWS Secrets Manager for sensitive data
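IRSA maps a Kubernetes service account to an IAM role, so pods get scoped AWS credentials without node-level keys. A hedged sketch using `eksctl` — the cluster name, namespace, service account name, and policy ARN are placeholders, and the EKS Terraform module here may already associate the OIDC provider for you:

```shell
# One-time: associate the cluster's OIDC provider (skip if already done)
eksctl utils associate-iam-oidc-provider \
  --cluster heliosdb-production --region us-west-2 --approve

# Create a service account bound to an IAM role with the given policy
eksctl create iamserviceaccount \
  --cluster heliosdb-production --region us-west-2 \
  --namespace heliosdb --name heliosdb \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve
```

Pods that run under this service account then receive the role's credentials via the injected web identity token, with no static keys in the cluster.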
## Disaster Recovery

### Backup Strategy
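The retention window below can be enforced locally with a simple prune step; a hypothetical sketch (the directory layout and `.tar.gz` naming are assumptions, and the S3 copies are better expired by the bucket lifecycle rule shown in the cost-optimization section):

```shell
# Remove backup archives older than the retention window.
# -mtime +N matches files last modified more than N*24h ago.
prune_backups() {
  local dir=$1 retention_days=$2
  find "$dir" -name '*.tar.gz' -type f -mtime "+$retention_days" -delete
}
```

For example, `prune_backups /var/backups/heliosdb 30` after each nightly run keeps local staging in line with the 30-day policy.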
```bash
# Automated daily backups to S3
# Retention: 30 days
# Recovery Time Objective (RTO): 1 hour
# Recovery Point Objective (RPO): 24 hours
```

### Recovery Procedure
```bash
# 1. Restore infrastructure
terraform apply -var-file="environments/production.tfvars"

# 2. Restore from backup
helm upgrade heliosdb ./helm/heliosdb-prod \
  --set restore.enabled=true \
  --set restore.source=s3://heliosdb-production-backups/latest.tar.gz
```
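Verification is usually a retry loop around a probe rather than a single call, since pods restored from a cold backup take time to become ready. A hypothetical sketch of such a loop — the actual contents of `health-check.sh` are not shown in this guide:

```shell
# Retry a probe command until it succeeds or attempts run out.
check_with_retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@" > /dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "unhealthy after $attempts attempts" >&2
  return 1
}
```

A plausible invocation against a (hypothetical) health endpoint: `check_with_retry 30 10 curl -fsS https://heliosdb.example.com/health`.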
```bash
# 3. Verify
./scripts/deploy/health-check.sh
```

## Maintenance
### Upgrade Kubernetes
```hcl
# 1. Update the Terraform variable
kubernetes_version = "1.29"
```

```bash
# 2. Apply the update
terraform apply -var-file="environments/production.tfvars"

# 3. Node groups are updated automatically via rolling replacement,
#    with zero downtime
```

### Scale Cluster
```hcl
# Update variables
desired_nodes = 7
```

```bash
# Apply
terraform apply -var-file="environments/production.tfvars"
```

## Troubleshooting
### Terraform Issues
```bash
# State locked
terraform force-unlock <LOCK_ID>

# Drift detection
terraform plan -refresh-only

# Import existing resources
terraform import module.vpc.aws_vpc.main vpc-12345678
```

### EKS Issues
```bash
# Cluster not accessible
aws eks update-kubeconfig --name <cluster-name> --region <region>

# Nodes not joining
kubectl get nodes
kubectl describe node <node-name>

# Check cluster status and configuration
aws eks list-clusters
aws eks describe-cluster --name <cluster-name>
# Control plane logs (if enabled) are in CloudWatch Logs:
#   /aws/eks/<cluster-name>/cluster
```