
HeliosDB Infrastructure Setup Guide

Architecture Overview

HeliosDB production infrastructure uses:

  • AWS EKS for Kubernetes orchestration
  • VPC with public/private subnets across 3 AZs
  • RDS PostgreSQL (optional) for metadata
  • ElastiCache Redis for caching
  • S3 for backups and logs
  • Route53 for DNS
  • CloudWatch for monitoring

Terraform Deployment

Directory Structure

infrastructure/terraform/
├── main.tf                  # Main configuration
├── variables.tf             # Variable definitions
├── modules/
│   ├── vpc/                 # VPC with 3 AZs
│   ├── eks/                 # EKS cluster + node groups
│   ├── rds/                 # RDS PostgreSQL
│   ├── elasticache/         # Redis cluster
│   ├── s3/                  # S3 buckets
│   ├── iam/                 # IAM roles and policies
│   ├── route53/             # DNS configuration
│   ├── monitoring/          # CloudWatch alarms
│   └── security-groups/     # Security groups
└── environments/
    ├── production.tfvars
    └── staging.tfvars

Production Configuration

Create infrastructure/terraform/environments/production.tfvars:

environment = "production"
aws_region = "us-west-2"
# VPC
vpc_cidr = "10.0.0.0/16"
allowed_cidr_blocks = ["10.0.0.0/8"]
# EKS
kubernetes_version = "1.28"
primary_instance_types = ["r6i.2xlarge", "r6i.4xlarge"]
compute_instance_types = ["c6i.2xlarge", "c6i.4xlarge"]
min_nodes = 5
max_nodes = 10
desired_nodes = 5
# RDS (optional)
enable_rds_metadata = false
rds_instance_class = "db.r6i.xlarge"
rds_allocated_storage = 100
rds_max_allocated_storage = 1000
# ElastiCache
elasticache_node_type = "cache.r6g.xlarge"
# S3
backup_retention_days = 30
log_retention_days = 90
# Route53
enable_route53 = true
domain_name = "heliosdb.example.com"
route53_zone_id = "Z1234567890ABC"
# Monitoring
alarm_email = "ops@example.com"
# HeliosDB
heliosdb_replicas = 5
heliosdb_cpu_request = 2000
heliosdb_memory_request = 4096
heliosdb_cpu_limit = 4000
heliosdb_memory_limit = 8192
heliosdb_storage_size = 100

Deployment Steps

# 1. Navigate to terraform directory
cd infrastructure/terraform
# 2. Initialize Terraform
terraform init
# 3. Validate configuration
terraform validate
# 4. Review plan
terraform plan -var-file="environments/production.tfvars" -out=plan.tfplan
# 5. Apply infrastructure
terraform apply plan.tfplan
# 6. Save outputs
terraform output > outputs.txt
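Run end to end, the six steps above can be collected into a small wrapper so that a failed step aborts the whole run. The function below is a sketch, not a script shipped with HeliosDB; it assumes it is invoked from infrastructure/terraform with the environment name as its argument.

```shell
# Sketch: run the full plan/apply cycle for one environment.
# Assumes the working directory is infrastructure/terraform.
deploy_env() {
  local env="${1:?usage: deploy_env <environment>}"
  local tfvars="environments/${env}.tfvars"
  local plan="plan.tfplan"

  terraform init
  terraform validate
  terraform plan -var-file="$tfvars" -out="$plan"
  terraform apply "$plan"
  terraform output > outputs.txt
  echo "applied ${tfvars}"
}

# Example: deploy_env production
```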

Important Outputs

# VPC ID
terraform output vpc_id
# EKS cluster endpoint
terraform output eks_cluster_endpoint
# Load balancer DNS
terraform output load_balancer_dns
# S3 backup bucket
terraform output s3_backup_bucket
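Later steps reuse these values, so it is convenient to load them into shell variables once. The output names below are the ones shown above; `-raw` strips the quotes from string outputs.

```shell
# Sketch: load the Terraform outputs above into shell variables.
load_outputs() {
  VPC_ID=$(terraform output -raw vpc_id)
  EKS_ENDPOINT=$(terraform output -raw eks_cluster_endpoint)
  LB_DNS=$(terraform output -raw load_balancer_dns)
  BACKUP_BUCKET=$(terraform output -raw s3_backup_bucket)
}

# Example: load_outputs && echo "$BACKUP_BUCKET"
```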

Post-Deployment Configuration

Configure kubectl

# Get cluster name from Terraform output
CLUSTER_NAME=$(terraform output -raw eks_cluster_name)
AWS_REGION=$(terraform output -raw aws_region)
# Update kubeconfig
aws eks update-kubeconfig \
  --name "$CLUSTER_NAME" \
  --region "$AWS_REGION"
# Verify connectivity
kubectl cluster-info
kubectl get nodes
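Before installing add-ons it is worth blocking until every node reports Ready; `kubectl wait` handles the polling. The 300s default timeout here is an assumption, tune it to your node group size.

```shell
# Sketch: wait until all nodes are Ready, then print them.
wait_for_nodes() {
  kubectl wait --for=condition=Ready nodes --all --timeout="${1:-300s}"
  kubectl get nodes
}

# Example: wait_for_nodes 600s
```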

Install Kubernetes Add-ons

# AWS Load Balancer Controller
helm repo add eks https://aws.github.io/eks-charts
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName="$CLUSTER_NAME"
# Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# EBS CSI Driver (for persistent volumes)
kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
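A quick check that each add-on actually finished rolling out avoids chasing errors in later steps. The deployment names below are the defaults created by the chart and manifests above; verify them against your cluster if you renamed the releases.

```shell
# Sketch: confirm the add-ons above rolled out successfully.
check_addons() {
  kubectl rollout status deployment/aws-load-balancer-controller -n kube-system --timeout=180s
  kubectl rollout status deployment/metrics-server -n kube-system --timeout=180s
  kubectl get pods -n kube-system -l app=ebs-csi-controller
}
```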

Install Monitoring Stack

# Prometheus and Grafana
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi
# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Default credentials: admin/prom-operator
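If the admin password was changed from the chart default, it can be read back from the secret kube-prometheus-stack creates. The secret name and key below are the chart defaults for a release named "prometheus"; adjust them if your release is named differently.

```shell
# Sketch: read the Grafana admin password from the chart's secret.
grafana_password() {
  kubectl get secret prometheus-grafana -n monitoring \
    -o jsonpath='{.data.admin-password}' | base64 -d
}
```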

Resource Requirements

Per Environment

Production:

  • Nodes: 5-10 (r6i.2xlarge or r6i.4xlarge)
  • Storage: 500 GB total (100 GB per pod)
  • Network: Network Load Balancer (NLB)
  • Cache: ElastiCache Redis (cache.r6g.xlarge)

Staging:

  • Nodes: 3-5 (r6i.xlarge)
  • Storage: 300 GB total
  • Network: Application Load Balancer (ALB)
  • Cache: ElastiCache Redis (cache.r6g.large)

Cost Optimization

Recommendations

  1. Use Spot Instances for non-critical workloads
     node_groups = {
       spot_group = {
         capacity_type  = "SPOT"
         instance_types = ["r6i.2xlarge", "r5.2xlarge", "r5n.2xlarge"]
       }
     }
  2. Enable cluster autoscaler
     kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
  3. Use S3 Intelligent-Tiering
     lifecycle_rules = [
       {
         id      = "intelligent-tiering"
         enabled = true
         transitions = [{
           days          = 30
           storage_class = "INTELLIGENT_TIERING"
         }]
       }
     ]
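After adding a spot node group it is useful to confirm which nodes actually run on spot capacity. The label below is the one EKS managed node groups apply (values ON_DEMAND or SPOT); this is a quick check, not a HeliosDB-specific tool.

```shell
# Sketch: list nodes with the capacity-type label EKS applies.
list_capacity_types() {
  kubectl get nodes -L eks.amazonaws.com/capacityType
}
```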

Security Best Practices

  1. Enable VPC Flow Logs
  2. Use IAM Roles for Service Accounts (IRSA)
  3. Enable encryption at rest
  4. Restrict security group ingress
  5. Enable audit logging
  6. Use AWS Secrets Manager for sensitive data

Disaster Recovery

Backup Strategy

  • Automated daily backups to S3
  • Retention: 30 days
  • Recovery Time Objective (RTO): 1 hour
  • Recovery Point Objective (RPO): 24 hours
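For daily backups, a dated S3 key keeps the 30-day retention easy to enforce with a lifecycle rule. The key layout and the BACKUP_BUCKET variable below are illustrative assumptions, not a documented HeliosDB convention; the bucket name would come from the s3_backup_bucket Terraform output.

```shell
# Sketch: build a timestamped S3 key for today's backup.
backup_key() {
  printf 'backups/%s/heliosdb.tar.gz' "$(date -u +%Y-%m-%d)"
}

# Example upload:
#   aws s3 cp backup.tar.gz "s3://${BACKUP_BUCKET}/$(backup_key)"
```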

Recovery Procedure

# 1. Restore infrastructure
terraform apply -var-file="environments/production.tfvars"
# 2. Restore from backup
helm upgrade heliosdb ./helm/heliosdb-prod \
  --set restore.enabled=true \
  --set restore.source=s3://heliosdb-production-backups/latest.tar.gz
# 3. Verify
./scripts/deploy/health-check.sh

Maintenance

Upgrade Kubernetes

# 1. Update Terraform variable
kubernetes_version = "1.29"
# 2. Apply update
terraform apply -var-file="environments/production.tfvars"
# 3. Node groups roll out the new version automatically:
#    nodes are drained and replaced one at a time, so configure
#    PodDisruptionBudgets to keep workload disruption minimal
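After the apply completes, confirm the control plane and every kubelet report the new version; a rolling node replacement can leave a mixed cluster for a while.

```shell
# Sketch: show control-plane and per-node kubelet versions after an upgrade.
check_versions() {
  kubectl version
  kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
}
```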

Scale Cluster

# Update variables
desired_nodes = 7
# Apply
terraform apply -var-file="environments/production.tfvars"

Troubleshooting

Terraform Issues

# State locked
terraform force-unlock <LOCK_ID>
# Drift detection
terraform plan -refresh-only
# Import existing resources
terraform import module.vpc.aws_vpc.main vpc-12345678

EKS Issues

# Cluster not accessible
aws eks update-kubeconfig --name <cluster-name> --region <region>
# Nodes not joining
kubectl get nodes
kubectl describe node <node-name>
# Check EKS control plane logs
aws eks list-clusters
aws eks describe-cluster --name <cluster-name>
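When nodes fail to join, the two most common causes are a missing role mapping in the aws-auth ConfigMap and node group health issues surfaced by EKS. The helper below is a sketch that bundles both checks; it takes the cluster and node group names as arguments.

```shell
# Sketch: common checks when nodes are not joining the cluster.
debug_node_join() {
  # The aws-auth ConfigMap must map the node group's instance role.
  kubectl get configmap aws-auth -n kube-system -o yaml
  # EKS reports node group problems under nodegroup.health.issues.
  aws eks describe-nodegroup \
    --cluster-name "$1" --nodegroup-name "$2" \
    --query 'nodegroup.health.issues'
}
```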