
HeliosDB Infrastructure Setup Guide

Architecture Overview

HeliosDB production infrastructure uses:

  • AWS EKS for Kubernetes orchestration
  • VPC with public/private subnets across 3 AZs
  • RDS PostgreSQL (optional) for metadata
  • ElastiCache Redis for caching
  • S3 for backups and logs
  • Route53 for DNS
  • CloudWatch for monitoring

Terraform Deployment

Directory Structure

infrastructure/terraform/
├── main.tf                  # Main configuration
├── variables.tf             # Variable definitions
├── modules/
│   ├── vpc/                 # VPC with 3 AZs
│   ├── eks/                 # EKS cluster + node groups
│   ├── rds/                 # RDS PostgreSQL
│   ├── elasticache/         # Redis cluster
│   ├── s3/                  # S3 buckets
│   ├── iam/                 # IAM roles and policies
│   ├── route53/             # DNS configuration
│   ├── monitoring/          # CloudWatch alarms
│   └── security-groups/     # Security groups
└── environments/
    ├── production.tfvars
    └── staging.tfvars

Production Configuration

Create infrastructure/terraform/environments/production.tfvars:

environment = "production"
aws_region = "us-west-2"
# VPC
vpc_cidr = "10.0.0.0/16"
allowed_cidr_blocks = ["10.0.0.0/8"]
# EKS
kubernetes_version = "1.28"
primary_instance_types = ["r6i.2xlarge", "r6i.4xlarge"]
compute_instance_types = ["c6i.2xlarge", "c6i.4xlarge"]
min_nodes = 5
max_nodes = 10
desired_nodes = 5
# RDS (optional)
enable_rds_metadata = false
rds_instance_class = "db.r6i.xlarge"
rds_allocated_storage = 100
rds_max_allocated_storage = 1000
# ElastiCache
elasticache_node_type = "cache.r6g.xlarge"
# S3
backup_retention_days = 30
log_retention_days = 90
# Route53
enable_route53 = true
domain_name = "heliosdb.example.com"
route53_zone_id = "Z1234567890ABC"
# Monitoring
alarm_email = "ops@example.com"
# HeliosDB
heliosdb_replicas = 5
heliosdb_cpu_request = 2000
heliosdb_memory_request = 4096
heliosdb_cpu_limit = 4000
heliosdb_memory_limit = 8192
heliosdb_storage_size = 100

Deployment Steps

# 1. Navigate to terraform directory
cd infrastructure/terraform
# 2. Initialize Terraform
terraform init
# 3. Validate configuration
terraform validate
# 4. Review plan
terraform plan -var-file="environments/production.tfvars" -out=plan.tfplan
# 5. Apply infrastructure
terraform apply plan.tfplan
# 6. Save outputs
terraform output > outputs.txt
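Run end to end, the six steps above can be collected into a small wrapper so that a failed step aborts the whole run. The function below is a sketch, not a script shipped with HeliosDB; it assumes it is invoked from infrastructure/terraform with the environment name as its argument.

```shell
# Sketch: run the full plan/apply cycle for one environment.
# Assumes the working directory is infrastructure/terraform.
deploy_env() {
  local env="${1:?usage: deploy_env <environment>}"
  local tfvars="environments/${env}.tfvars"
  local plan="plan.tfplan"

  terraform init
  terraform validate
  terraform plan -var-file="$tfvars" -out="$plan"
  terraform apply "$plan"
  terraform output > outputs.txt
  echo "applied ${tfvars}"
}

# Example: deploy_env production
```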

Important Outputs

# VPC ID
terraform output vpc_id
# EKS cluster endpoint
terraform output eks_cluster_endpoint
# Load balancer DNS
terraform output load_balancer_dns
# S3 backup bucket
terraform output s3_backup_bucket
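Later steps reuse these values, so it is convenient to load them into shell variables once. The output names below are the ones shown above; `-raw` strips the quotes from string outputs.

```shell
# Sketch: load the Terraform outputs above into shell variables.
load_outputs() {
  VPC_ID=$(terraform output -raw vpc_id)
  EKS_ENDPOINT=$(terraform output -raw eks_cluster_endpoint)
  LB_DNS=$(terraform output -raw load_balancer_dns)
  BACKUP_BUCKET=$(terraform output -raw s3_backup_bucket)
}

# Example: load_outputs && echo "$BACKUP_BUCKET"
```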

Post-Deployment Configuration

Configure kubectl

# Get cluster name from Terraform output
CLUSTER_NAME=$(terraform output -raw eks_cluster_name)
AWS_REGION=$(terraform output -raw aws_region)
# Update kubeconfig
aws eks update-kubeconfig \
  --name "$CLUSTER_NAME" \
  --region "$AWS_REGION"
# Verify connectivity
kubectl cluster-info
kubectl get nodes
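Before installing add-ons it is worth blocking until every node reports Ready; `kubectl wait` handles the polling. The 300s default timeout here is an assumption, tune it to your node group size.

```shell
# Sketch: wait until all nodes are Ready, then print them.
wait_for_nodes() {
  kubectl wait --for=condition=Ready nodes --all --timeout="${1:-300s}"
  kubectl get nodes
}

# Example: wait_for_nodes 600s
```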

Install Kubernetes Add-ons

# AWS Load Balancer Controller
helm repo add eks https://aws.github.io/eks-charts
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName="$CLUSTER_NAME"
# Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# EBS CSI Driver (for persistent volumes)
kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
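A quick check that each add-on actually finished rolling out avoids chasing errors in later steps. The deployment names below are the defaults created by the chart and manifests above; verify them against your cluster if you renamed the releases.

```shell
# Sketch: confirm the add-ons above rolled out successfully.
check_addons() {
  kubectl rollout status deployment/aws-load-balancer-controller -n kube-system --timeout=180s
  kubectl rollout status deployment/metrics-server -n kube-system --timeout=180s
  kubectl get pods -n kube-system -l app=ebs-csi-controller
}
```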

Install Monitoring Stack

# Prometheus and Grafana
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi
# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Default credentials: admin/prom-operator
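If the admin password was changed from the chart default, it can be read back from the secret kube-prometheus-stack creates. The secret name and key below are the chart defaults for a release named "prometheus"; adjust them if your release is named differently.

```shell
# Sketch: read the Grafana admin password from the chart's secret.
grafana_password() {
  kubectl get secret prometheus-grafana -n monitoring \
    -o jsonpath='{.data.admin-password}' | base64 -d
}
```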

Resource Requirements

Per Environment

Production:

  • Nodes: 5-10 (r6i.2xlarge or r6i.4xlarge)
  • Storage: 500 GB total (100 GB per pod)
  • Network: Network Load Balancer (NLB)
  • Cache: ElastiCache Redis (cache.r6g.xlarge)

Staging:

  • Nodes: 3-5 (r6i.xlarge)
  • Storage: 300 GB total
  • Network: Application Load Balancer (ALB)
  • Cache: ElastiCache Redis (cache.r6g.large)

Cost Optimization

Recommendations

  1. Use Spot Instances for non-critical workloads
     node_groups = {
       spot_group = {
         capacity_type  = "SPOT"
         instance_types = ["r6i.2xlarge", "r5.2xlarge", "r5n.2xlarge"]
       }
     }
  2. Enable cluster autoscaler
     kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
  3. Use S3 Intelligent-Tiering
     lifecycle_rules = [
       {
         id      = "intelligent-tiering"
         enabled = true
         transitions = [{
           days          = 30
           storage_class = "INTELLIGENT_TIERING"
         }]
       }
     ]
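After adding a spot node group it is useful to confirm which nodes actually run on spot capacity. The label below is the one EKS managed node groups apply (values ON_DEMAND or SPOT); this is a quick check, not a HeliosDB-specific tool.

```shell
# Sketch: list nodes with the capacity-type label EKS applies.
list_capacity_types() {
  kubectl get nodes -L eks.amazonaws.com/capacityType
}
```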

Security Best Practices

  1. Enable VPC Flow Logs
  2. Use IAM Roles for Service Accounts (IRSA)
  3. Enable encryption at rest
  4. Restrict security group ingress
  5. Enable audit logging
  6. Use AWS Secrets Manager for sensitive data

Disaster Recovery

Backup Strategy

  • Automated daily backups to S3
  • Retention: 30 days
  • Recovery Time Objective (RTO): 1 hour
  • Recovery Point Objective (RPO): 24 hours
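For daily backups, a dated S3 key keeps the 30-day retention easy to enforce with a lifecycle rule. The key layout and the BACKUP_BUCKET variable below are illustrative assumptions, not a documented HeliosDB convention; the bucket name would come from the s3_backup_bucket Terraform output.

```shell
# Sketch: build a timestamped S3 key for today's backup.
backup_key() {
  printf 'backups/%s/heliosdb.tar.gz' "$(date -u +%Y-%m-%d)"
}

# Example upload:
#   aws s3 cp backup.tar.gz "s3://${BACKUP_BUCKET}/$(backup_key)"
```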

Recovery Procedure

# 1. Restore infrastructure
terraform apply -var-file="environments/production.tfvars"
# 2. Restore from backup
helm upgrade heliosdb ./helm/heliosdb-prod \
  --set restore.enabled=true \
  --set restore.source=s3://heliosdb-production-backups/latest.tar.gz
# 3. Verify
./scripts/deploy/health-check.sh

Maintenance

Upgrade Kubernetes

# 1. Update Terraform variable
kubernetes_version = "1.29"
# 2. Apply update
terraform apply -var-file="environments/production.tfvars"
# 3. Node groups roll out the new version automatically:
#    nodes are drained and replaced one at a time, so configure
#    PodDisruptionBudgets to keep workload disruption minimal
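After the apply completes, confirm the control plane and every kubelet report the new version; a rolling node replacement can leave a mixed cluster for a while.

```shell
# Sketch: show control-plane and per-node kubelet versions after an upgrade.
check_versions() {
  kubectl version
  kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
}
```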

Scale Cluster

# Update variables
desired_nodes = 7
# Apply
terraform apply -var-file="environments/production.tfvars"

Troubleshooting

Terraform Issues

# State locked
terraform force-unlock <LOCK_ID>
# Drift detection
terraform plan -refresh-only
# Import existing resources
terraform import module.vpc.aws_vpc.main vpc-12345678

EKS Issues

# Cluster not accessible
aws eks update-kubeconfig --name <cluster-name> --region <region>
# Nodes not joining
kubectl get nodes
kubectl describe node <node-name>
# Check EKS control plane logs
aws eks list-clusters
aws eks describe-cluster --name <cluster-name>
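When nodes fail to join, the two most common causes are a missing role mapping in the aws-auth ConfigMap and node group health issues surfaced by EKS. The helper below is a sketch that bundles both checks; it takes the cluster and node group names as arguments.

```shell
# Sketch: common checks when nodes are not joining the cluster.
debug_node_join() {
  # The aws-auth ConfigMap must map the node group's instance role.
  kubectl get configmap aws-auth -n kube-system -o yaml
  # EKS reports node group problems under nodegroup.health.issues.
  aws eks describe-nodegroup \
    --cluster-name "$1" --nodegroup-name "$2" \
    --query 'nodegroup.health.issues'
}
```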