HeliosDB Streaming - Production Deployment Guide
Table of Contents
- Prerequisites
- Single-Node Deployment
- Kubernetes Production Deployment
- Docker Compose Setup
- Monitoring & Observability
- Security Configuration
- Performance Tuning
- Operational Runbook
- Troubleshooting
- Capacity Planning
1. Prerequisites
System Requirements
Minimum (Development/Testing)
- CPU: 2 cores
- RAM: 4 GB
- Storage: 20 GB SSD
- OS: Linux (Ubuntu 20.04+, RHEL 8+, Amazon Linux 2)
Recommended (Production)
- CPU: 8 cores (16 vCPU)
- RAM: 16 GB (32 GB for high throughput)
- Storage: 100 GB NVMe SSD
- OS: Linux (Ubuntu 22.04 LTS, RHEL 9, Amazon Linux 2023)
- Network: 10 Gbps
Software Dependencies
```bash
# Rust toolchain (for building from source)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default stable

# Docker (for containerized deployment)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Kubernetes CLI (for k8s deployment)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Helm (for k8s package management)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```
External Services
Required:
- Apache Kafka or Apache Pulsar (message broker)
- S3/Azure Blob/GCS (checkpoint storage)
- AWS KMS/Azure Key Vault/GCP KMS (encryption key management)
Optional:
- PostgreSQL (metadata storage)
- Prometheus (metrics collection)
- Grafana (visualization)
- Jaeger (distributed tracing)
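Before installing anything, it can help to confirm that the required external services are reachable from the target host. The following is a minimal sketch (not part of HeliosDB) using plain TCP connects; the endpoint list is illustrative and should be adjusted to your environment:

```python
import socket

def check_endpoint(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical endpoints for a typical deployment; adjust to yours.
    endpoints = {
        "kafka": ("localhost", 9092),
        "prometheus": ("localhost", 9091),
    }
    for name, (host, port) in endpoints.items():
        status = "ok" if check_endpoint(host, port, timeout=0.5) else "UNREACHABLE"
        print(f"{name:12s} {host}:{port} -> {status}")
```

Note that this only verifies TCP reachability, not authentication, TLS, or broker health.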
2. Single-Node Deployment
2.1 Build from Source
```bash
# Clone repository
git clone https://github.com/heliosdb/heliosdb.git
cd heliosdb/heliosdb-streaming

# Build release binary
cargo build --release

# Binary location
ls -lh target/release/heliosdb-streaming

# Run tests
cargo test --all

# Run E2E integration tests
cargo test --test e2e_integration_test
```
2.2 Configuration
Create /etc/heliosdb/streaming.toml:
```toml
[server]
host = "0.0.0.0"
port = 8080
health_check_port = 8081
metrics_port = 9090

[streaming]
checkpoint_interval_secs = 60
watermark_interval_secs = 1
allowed_lateness_secs = 60
max_parallelism = 8

[state]
backend = "file"  # Options: "memory", "file", "s3"
path = "/var/lib/heliosdb/state"
encryption_enabled = true

[kafka]
bootstrap_servers = "localhost:9092"
group_id = "heliosdb-streaming"
auto_offset_reset = "earliest"
enable_ssl = false

[security]
jwt_secret = "your-secret-key-here-change-in-production"
jwt_expiration_hours = 24
rate_limit_enabled = true
rate_limit_requests_per_minute = 100

[kms]
provider = "local"  # Options: "local", "aws", "azure", "gcp"
# For AWS:
# aws_region = "us-east-1"
# For Azure:
# azure_vault_url = "https://your-vault.vault.azure.net"
# For GCP:
# gcp_project_id = "your-project"
# gcp_location = "us-central1"
# gcp_keyring = "heliosdb"

[logging]
level = "info"   # Options: "error", "warn", "info", "debug", "trace"
format = "json"  # Options: "json", "pretty"
```
2.3 SystemD Service
Create /etc/systemd/system/heliosdb-streaming.service:
```ini
[Unit]
Description=HeliosDB Streaming Analytics
After=network.target kafka.service

[Service]
Type=simple
User=heliosdb
Group=heliosdb
WorkingDirectory=/opt/heliosdb
ExecStart=/opt/heliosdb/bin/heliosdb-streaming --config /etc/heliosdb/streaming.toml
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=heliosdb-streaming

# Security
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/heliosdb

# Resource limits
LimitNOFILE=65536
LimitNPROC=4096

[Install]
WantedBy=multi-user.target
```
2.4 Start Service
```bash
# Create user and directories
sudo useradd -r -s /bin/false heliosdb
sudo mkdir -p /opt/heliosdb/bin
sudo mkdir -p /var/lib/heliosdb/state
sudo mkdir -p /var/log/heliosdb
sudo chown -R heliosdb:heliosdb /opt/heliosdb /var/lib/heliosdb /var/log/heliosdb

# Copy binary
sudo cp target/release/heliosdb-streaming /opt/heliosdb/bin/

# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable heliosdb-streaming
sudo systemctl start heliosdb-streaming

# Check status
sudo systemctl status heliosdb-streaming
sudo journalctl -u heliosdb-streaming -f
```
2.5 Verify Deployment
```bash
# Health check
curl http://localhost:8081/health

# Metrics
curl http://localhost:9090/metrics

# Create test job
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-jwt-token>" \
  -d '{
    "name": "test-job",
    "source": "kafka",
    "config": {
      "topic": "test-input",
      "parallelism": 4
    }
  }'
```
3. Kubernetes Production Deployment
3.1 Namespace Setup
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: heliosdb-streaming
  labels:
    name: heliosdb-streaming
    environment: production
```
3.2 ConfigMap
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: heliosdb-config
  namespace: heliosdb-streaming
data:
  streaming.toml: |
    [server]
    host = "0.0.0.0"
    port = 8080
    health_check_port = 8081
    metrics_port = 9090

    [streaming]
    checkpoint_interval_secs = 60
    watermark_interval_secs = 1
    allowed_lateness_secs = 60
    max_parallelism = 8

    [state]
    backend = "s3"
    path = "s3://heliosdb-production/checkpoints"
    encryption_enabled = true

    [kafka]
    bootstrap_servers = "kafka-bootstrap.kafka:9092"
    group_id = "heliosdb-streaming-prod"
    auto_offset_reset = "earliest"
    enable_ssl = true

    [security]
    rate_limit_enabled = true
    rate_limit_requests_per_minute = 500

    [kms]
    provider = "aws"
    aws_region = "us-east-1"

    [logging]
    level = "info"
    format = "json"
```
3.3 Secrets
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: heliosdb-secrets
  namespace: heliosdb-streaming
type: Opaque
stringData:
  JWT_SECRET: "your-production-secret-change-this"
  AWS_ACCESS_KEY_ID: "AKIAIOSFODNN7EXAMPLE"
  AWS_SECRET_ACCESS_KEY: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  KAFKA_PASSWORD: "your-kafka-password"
```
Generate secrets securely:
```bash
# Generate JWT secret
openssl rand -base64 32

# Create secret from file
kubectl create secret generic heliosdb-secrets \
  --from-literal=JWT_SECRET=$(openssl rand -base64 32) \
  --from-file=aws-credentials=/path/to/credentials \
  -n heliosdb-streaming
```
3.4 StatefulSet
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: heliosdb-streaming
  namespace: heliosdb-streaming
spec:
  serviceName: heliosdb-streaming
  replicas: 3
  selector:
    matchLabels:
      app: heliosdb-streaming
  template:
    metadata:
      labels:
        app: heliosdb-streaming
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - heliosdb-streaming
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: heliosdb-streaming
          image: heliosdb/heliosdb-streaming:v4.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
            - name: health
              containerPort: 8081
              protocol: TCP
            - name: metrics
              containerPort: 9090
              protocol: TCP
          env:
            - name: RUST_LOG
              value: "info"
            - name: RUST_BACKTRACE
              value: "1"
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: heliosdb-secrets
                  key: JWT_SECRET
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: heliosdb-secrets
                  key: AWS_ACCESS_KEY_ID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: heliosdb-secrets
                  key: AWS_SECRET_ACCESS_KEY
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
          livenessProbe:
            httpGet:
              path: /health
              port: health
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: health
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          volumeMounts:
            - name: config
              mountPath: /etc/heliosdb
            - name: data
              mountPath: /var/lib/heliosdb
      volumes:
        - name: config
          configMap:
            name: heliosdb-config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "gp3"  # AWS EBS gp3
        resources:
          requests:
            storage: 50Gi
```
3.5 Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: heliosdb-streaming
  namespace: heliosdb-streaming
  labels:
    app: heliosdb-streaming
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 8080
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    app: heliosdb-streaming
---
apiVersion: v1
kind: Service
metadata:
  name: heliosdb-streaming-headless
  namespace: heliosdb-streaming
spec:
  clusterIP: None
  ports:
    - name: http
      port: 8080
      targetPort: 8080
  selector:
    app: heliosdb-streaming
```
3.6 HorizontalPodAutoscaler
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: heliosdb-streaming-hpa
  namespace: heliosdb-streaming
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: heliosdb-streaming
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: events_per_second
        target:
          type: AverageValue
          averageValue: "50000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 2
          periodSeconds: 15
      selectPolicy: Max
```
3.7 PodDisruptionBudget
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: heliosdb-streaming-pdb
  namespace: heliosdb-streaming
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: heliosdb-streaming
```
3.8 Ingress (with TLS)
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: heliosdb-streaming-ingress
  namespace: heliosdb-streaming
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - streaming.heliosdb.example.com
      secretName: heliosdb-tls
  rules:
    - host: streaming.heliosdb.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: heliosdb-streaming
                port:
                  number: 8080
          - path: /metrics
            pathType: Prefix
            backend:
              service:
                name: heliosdb-streaming
                port:
                  number: 9090
```
3.9 Deploy to Kubernetes
```bash
# Apply all manifests
kubectl apply -f namespace.yaml
kubectl apply -f configmap.yaml
kubectl apply -f secrets.yaml
kubectl apply -f statefulset.yaml
kubectl apply -f service.yaml
kubectl apply -f hpa.yaml
kubectl apply -f pdb.yaml
kubectl apply -f ingress.yaml

# Verify deployment
kubectl get pods -n heliosdb-streaming
kubectl get svc -n heliosdb-streaming
kubectl logs -f statefulset/heliosdb-streaming -n heliosdb-streaming

# Check events
kubectl get events -n heliosdb-streaming --sort-by='.lastTimestamp'
```
4. Docker Compose Setup
4.1 Complete Stack
```yaml
version: '3.9'

services:
  # Zookeeper for Kafka
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    volumes:
      - zookeeper-data:/var/lib/zookeeper/data
      - zookeeper-logs:/var/lib/zookeeper/log

  # Kafka
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    hostname: kafka
    container_name: kafka
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
      - "9101:9101"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_JMX_PORT: 9101
      KAFKA_JMX_HOSTNAME: localhost
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'true'
    volumes:
      - kafka-data:/var/lib/kafka/data

  # HeliosDB Streaming
  heliosdb-streaming:
    image: heliosdb/heliosdb-streaming:v4.0.0
    container_name: heliosdb-streaming
    depends_on:
      - kafka
      - prometheus
    ports:
      - "8080:8080"
      - "8081:8081"
      - "9090:9090"
    environment:
      RUST_LOG: info
      JWT_SECRET: ${JWT_SECRET:-change-this-in-production}
    volumes:
      - ./config/streaming.toml:/etc/heliosdb/streaming.toml
      - heliosdb-state:/var/lib/heliosdb
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8081/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  # Prometheus
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    ports:
      - "9091:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--storage.tsdb.retention.time=15d'
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus

  # Grafana
  grafana:
    image: grafana/grafana:10.1.0
    container_name: grafana
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_INSTALL_PLUGINS: grafana-piechart-panel
    volumes:
      - ./config/grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./config/grafana/datasources:/etc/grafana/provisioning/datasources
      - grafana-data:/var/lib/grafana

volumes:
  zookeeper-data:
  zookeeper-logs:
  kafka-data:
  heliosdb-state:
  prometheus-data:
  grafana-data:

networks:
  default:
    name: heliosdb-network
```
4.2 Prometheus Configuration
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'heliosdb-streaming'
    static_configs:
      - targets: ['heliosdb-streaming:9090']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: heliosdb-streaming

  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka:9101']

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

rule_files:
  - "/etc/prometheus/alerts.yml"
```
4.3 Grafana Datasource
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
```
4.4 Start Stack
```bash
# Start all services
docker-compose up -d

# View logs
docker-compose logs -f heliosdb-streaming

# Check health
curl http://localhost:8081/health

# Access Grafana
open http://localhost:3000  # admin/admin

# Stop stack
docker-compose down

# Clean up volumes
docker-compose down -v
```
5. Monitoring & Observability
5.1 Key Metrics
Throughput Metrics:
```promql
# Events processed per second
rate(events_processed_total[1m])

# Events ingested per second
rate(events_ingested_total[1m])

# Backpressure ratio
backpressure_ratio
```
Latency Metrics:
```promql
# P50 latency
histogram_quantile(0.50, rate(event_processing_duration_bucket[5m]))

# P95 latency
histogram_quantile(0.95, rate(event_processing_duration_bucket[5m]))

# P99 latency
histogram_quantile(0.99, rate(event_processing_duration_bucket[5m]))
```
Resource Metrics:
```promql
# CPU usage
process_cpu_seconds_total

# Memory usage
process_resident_memory_bytes

# Open file descriptors
process_open_fds
```
Job Metrics:
```promql
# Active jobs
sum(job_status{status="running"})

# Failed jobs
sum(job_status{status="failed"})

# Checkpoint duration
histogram_quantile(0.95, rate(checkpoint_duration_bucket[5m]))
```
5.2 Grafana Dashboard
```json
{
  "dashboard": {
    "title": "HeliosDB Streaming Overview",
    "panels": [
      {
        "title": "Events Per Second",
        "targets": [
          {
            "expr": "rate(events_processed_total[1m])",
            "legendFormat": "{{instance}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Latency (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(event_processing_duration_bucket[5m]))",
            "legendFormat": "P95"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Jobs",
        "targets": [
          {
            "expr": "sum(job_status{status=\"running\"})",
            "legendFormat": "Running"
          }
        ],
        "type": "stat"
      }
    ]
  }
}
```
5.3 Alert Rules
```yaml
groups:
  - name: heliosdb_streaming
    interval: 30s
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(event_processing_duration_bucket[5m])) > 0.100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High processing latency detected"
          description: "P95 latency is {{ $value }}s (threshold: 100ms)"

      - alert: LowThroughput
        expr: rate(events_processed_total[5m]) < 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low throughput detected"
          description: "Processing {{ $value }} events/sec (expected: >10K)"

      - alert: HighBackpressure
        expr: backpressure_ratio > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High backpressure detected"
          description: "Backpressure ratio is {{ $value }} (threshold: 0.8)"

      - alert: JobFailed
        expr: increase(job_status{status="failed"}[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Job failure detected"
          description: "Job {{ $labels.job_name }} has failed"

      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / 1024 / 1024 / 1024 > 14
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Using {{ $value }}GB of RAM (limit: 16GB)"

      - alert: CheckpointFailed
        expr: increase(checkpoint_failures_total[10m]) > 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Checkpoint failures detected"
          description: "{{ $value }} checkpoint failures in 10 minutes"
```
6. Security Configuration
6.1 TLS/SSL Setup
Generate self-signed certificate (development):
```bash
openssl req -x509 -newkey rsa:4096 -nodes \
  -keyout key.pem -out cert.pem -days 365 \
  -subj "/CN=streaming.heliosdb.local"
```
Production certificate (Let's Encrypt):
```bash
# Install certbot
sudo apt install certbot

# Get certificate
sudo certbot certonly --standalone -d streaming.heliosdb.example.com

# Certificates will be in:
# /etc/letsencrypt/live/streaming.heliosdb.example.com/
```
6.2 JWT Configuration
```bash
# Generate secure JWT secret
openssl rand -base64 32

# Add to environment
export JWT_SECRET="<generated-secret>"

# Or add to config file
echo "jwt_secret = \"$(openssl rand -base64 32)\"" >> /etc/heliosdb/streaming.toml
```
6.3 Create Admin User
```bash
# Use API to create initial admin
curl -X POST http://localhost:8080/api/v1/auth/register \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "password": "secure-password-here",
    "roles": ["Admin"]
  }'

# Login to get JWT token
curl -X POST http://localhost:8080/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "password": "secure-password-here"
  }'
```
6.4 RBAC Setup
```yaml
users:
  - username: admin
    password_hash: "$2b$12$..."  # bcrypt hash
    roles:
      - Admin
    enabled: true

  - username: operator
    password_hash: "$2b$12$..."
    roles:
      - Operator
    enabled: true

  - username: viewer
    password_hash: "$2b$12$..."
    roles:
      - Viewer
    enabled: true

roles:
  - name: Admin
    permissions:
      - Read
      - Write
      - Execute
      - Delete
      - Admin
      - Cancel
      - Manage

  - name: Operator
    permissions:
      - Read
      - Execute
      - Cancel

  - name: Viewer
    permissions:
      - Read
```
6.5 KMS Configuration
AWS KMS:
```toml
[kms]
provider = "aws"
aws_region = "us-east-1"
aws_key_id = "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
```
Azure Key Vault:
```toml
[kms]
provider = "azure"
azure_vault_url = "https://heliosdb-vault.vault.azure.net"
azure_key_name = "heliosdb-encryption-key"
```
GCP KMS:
```toml
[kms]
provider = "gcp"
gcp_project_id = "heliosdb-production"
gcp_location = "us-central1"
gcp_keyring = "heliosdb"
gcp_key_name = "streaming-encryption-key"
```
7. Performance Tuning
7.1 OS Tuning (Linux)
```bash
# Increase file descriptor limits
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nofile 65536" >> /etc/security/limits.conf

# Increase TCP buffer sizes
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Increase connection tracking
sysctl -w net.netfilter.nf_conntrack_max=1048576

# Minimize swapping for performance (swappiness=1 does not fully disable swap)
sysctl -w vm.swappiness=1

# Make changes persistent
echo "vm.swappiness=1" >> /etc/sysctl.conf
```
7.2 JVM Tuning (for Kafka)
```bash
# Kafka environment
export KAFKA_HEAP_OPTS="-Xmx4G -Xms4G"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
```
7.3 HeliosDB Configuration
```toml
[streaming]
# Adjust based on CPU cores
max_parallelism = 16  # 2x CPU cores

# Faster checkpoints for high throughput
checkpoint_interval_secs = 30

# Lower latency with more frequent watermarks
watermark_interval_secs = 0.5

[state]
# Use memory for lowest latency (if enough RAM)
backend = "memory"

# Or S3 with local caching (comment out the line above if you use this):
# backend = "s3"
# cache_size_mb = 1024

[kafka]
# Increase batch size for higher throughput
fetch_min_bytes = 1048576  # 1 MB
fetch_max_wait_ms = 500

# Parallel consumers
num_consumer_threads = 8
```
7.4 Kafka Topic Configuration
```bash
# Create topic with optimal settings
kafka-topics --create \
  --bootstrap-server localhost:9092 \
  --topic heliosdb-input \
  --partitions 16 \
  --replication-factor 3 \
  --config compression.type=lz4 \
  --config min.insync.replicas=2 \
  --config retention.ms=86400000  # 1 day
```
8. Operational Runbook
8.1 Startup Procedure
```bash
# 1. Verify prerequisites
systemctl status kafka
systemctl status zookeeper

# 2. Check disk space
df -h /var/lib/heliosdb

# 3. Verify configuration
cat /etc/heliosdb/streaming.toml

# 4. Start service
systemctl start heliosdb-streaming

# 5. Monitor startup
journalctl -u heliosdb-streaming -f

# 6. Verify health
curl http://localhost:8081/health

# 7. Check metrics
curl http://localhost:9090/metrics | grep events_processed
```
8.2 Shutdown Procedure
```bash
# 1. Stop accepting new jobs
curl -X POST http://localhost:8080/api/v1/admin/pause

# 2. Wait for current jobs to complete
watch -n 5 'curl -s http://localhost:8080/api/v1/jobs | jq ".active_jobs"'

# 3. Create savepoint for safe recovery
curl -X POST http://localhost:8080/api/v1/admin/savepoint

# 4. Stop service
systemctl stop heliosdb-streaming

# 5. Verify shutdown
systemctl status heliosdb-streaming
```
8.3 Rolling Update (Kubernetes)
```bash
# 1. Update image
kubectl set image statefulset/heliosdb-streaming \
  heliosdb-streaming=heliosdb/heliosdb-streaming:v4.1.0 \
  -n heliosdb-streaming

# 2. Monitor rollout
kubectl rollout status statefulset/heliosdb-streaming -n heliosdb-streaming

# 3. Verify new version
kubectl get pods -n heliosdb-streaming -o jsonpath='{.items[*].spec.containers[*].image}'

# 4. Rollback if needed
kubectl rollout undo statefulset/heliosdb-streaming -n heliosdb-streaming
```
8.4 Backup & Restore
Backup:
```bash
# 1. Create savepoint
SAVEPOINT_ID=$(curl -X POST http://localhost:8080/api/v1/admin/savepoint | jq -r '.id')

# 2. Copy to backup location
aws s3 cp \
  s3://heliosdb-production/checkpoints/savepoint-${SAVEPOINT_ID} \
  s3://heliosdb-backups/$(date +%Y%m%d)/savepoint-${SAVEPOINT_ID} \
  --recursive

# 3. Backup configuration
tar czf /backup/heliosdb-config-$(date +%Y%m%d).tar.gz /etc/heliosdb
```
Restore:
```bash
# 1. Stop service
systemctl stop heliosdb-streaming

# 2. Restore savepoint
aws s3 cp \
  s3://heliosdb-backups/20231225/savepoint-12345 \
  s3://heliosdb-production/checkpoints/savepoint-12345 \
  --recursive

# 3. Restore configuration
tar xzf /backup/heliosdb-config-20231225.tar.gz -C /

# 4. Start with specific savepoint
heliosdb-streaming --config /etc/heliosdb/streaming.toml \
  --restore-from-savepoint savepoint-12345
```
8.5 Scaling Operations
Scale Up:
```bash
# Kubernetes
kubectl scale statefulset heliosdb-streaming --replicas=5 -n heliosdb-streaming

# Verify
kubectl get pods -n heliosdb-streaming -w
```
Scale Down:
```bash
# 1. Identify pods to remove
kubectl get pods -n heliosdb-streaming

# 2. If node maintenance is the goal, cordon and drain the node hosting the pod
#    (kubectl drain operates on nodes, not individual pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# 3. Scale down (the StatefulSet removes the highest-ordinal pods first)
kubectl scale statefulset heliosdb-streaming --replicas=3 -n heliosdb-streaming
```
9. Troubleshooting
9.1 Common Issues
Issue: High Latency
Symptoms:
- P99 latency > 100ms
- Slow dashboard updates
Diagnosis:
```bash
# Check backpressure
curl http://localhost:9090/metrics | grep backpressure

# Check resource usage
top -p $(pgrep heliosdb-streaming)

# Check Kafka lag
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group heliosdb-streaming-prod --describe
```
Solutions:
- Increase parallelism
- Add more nodes
- Optimize window size
- Check network latency
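As a rough rule for the first two remedies: only add parallelism or nodes while consumer lag is still growing; scaling after lag has already peaked wastes capacity. A sketch of that heuristic (illustrative only, with made-up thresholds; this is not HeliosDB's built-in autoscaling):

```python
def lag_is_growing(lag_samples: list[int], min_increase: int = 1000) -> bool:
    """True if consumer lag rose by at least min_increase over the sample window."""
    if len(lag_samples) < 2:
        return False
    return lag_samples[-1] - lag_samples[0] >= min_increase

def suggest_parallelism(current: int, lag_samples: list[int],
                        max_parallelism: int = 16) -> int:
    """Double parallelism while lag grows, capped at max_parallelism."""
    if lag_is_growing(lag_samples):
        return min(current * 2, max_parallelism)
    return current
```

Feed it the LAG values reported by the `kafka-consumer-groups --describe` command above, sampled a few minutes apart.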
Issue: Out of Memory
Symptoms:
- OOMKilled in Kubernetes
- Process crashes with “out of memory”
Diagnosis:
```bash
# Check memory usage
curl http://localhost:9090/metrics | grep memory

# Check for memory leaks
valgrind --leak-check=full ./heliosdb-streaming
```
Solutions:
- Increase memory limits
- Reduce window sizes
- Enable state compression
- Use file-based state backend
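For the "reduce window sizes" remedy, the memory rule of thumb from the capacity planning section (memory roughly 2x state size plus 4 GB overhead) can be inverted to find the largest window a given memory budget supports. A sketch under that assumption:

```python
def max_window_secs(memory_gb: float, events_per_sec: int,
                    avg_event_size_bytes: int) -> int:
    """Largest event-time window (in seconds) whose state fits the memory budget,
    assuming the capacity-planning rule of thumb: memory = 2x state size + 4 GB."""
    # Invert memory = 2 * state + 4 to get the state budget in bytes.
    state_budget_bytes = max(0.0, (memory_gb - 4) / 2) * 1024**3
    bytes_per_sec = events_per_sec * avg_event_size_bytes
    return int(state_budget_bytes // bytes_per_sec)
```

For example, a 16 GB node at 50K events/sec with 512-byte events supports windows of only a few minutes, which is often the real cause of OOMKilled pods configured with hour-long windows.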
Issue: Checkpoint Failures
Symptoms:
- Checkpoints taking too long
- Checkpoint failures in logs
Diagnosis:
```bash
# Check checkpoint metrics
curl http://localhost:9090/metrics | grep checkpoint

# Check S3/storage performance
aws s3 ls s3://heliosdb-production/checkpoints/ --summarize

# Check disk I/O
iostat -x 5
```
Solutions:
- Increase checkpoint interval
- Use faster storage (SSD/NVMe)
- Enable incremental checkpoints
- Check network connectivity to S3
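Transient S3 or network errors are a common cause of one-off checkpoint failures. HeliosDB's internal retry behavior is not documented here, but external tooling (for example, a savepoint-copy script) can wrap uploads in a standard exponential-backoff retry. A minimal sketch:

```python
import time

def retry_with_backoff(op, max_attempts: int = 5, base_delay: float = 0.5,
                       sleep=time.sleep):
    """Call op(); on failure wait base_delay * 2**attempt, then retry.
    Re-raises the last exception after max_attempts failures."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so the logic can be tested without real delays.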
Issue: Kafka Connection Errors
Symptoms:
- “Failed to connect to Kafka” errors
- No events being processed
Diagnosis:
```bash
# Test Kafka connectivity
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic test --from-beginning

# Check Kafka status
systemctl status kafka

# Verify DNS resolution
nslookup kafka-bootstrap.kafka
```
Solutions:
- Verify Kafka is running
- Check firewall rules
- Verify Kafka advertised listeners
- Check SSL/SASL configuration
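Many "Failed to connect" reports trace back to a malformed `bootstrap_servers` string. A small sketch for validating the string before it reaches the client (illustrative, not part of HeliosDB):

```python
def parse_bootstrap_servers(servers: str) -> list[tuple[str, int]]:
    """Split 'host1:9092,host2:9092' into (host, port) pairs, validating ports."""
    endpoints = []
    for entry in servers.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit() or not 0 < int(port) < 65536:
            raise ValueError(f"invalid bootstrap server entry: {entry!r}")
        endpoints.append((host, int(port)))
    return endpoints
```

Running each returned `(host, port)` pair through a DNS lookup and TCP connect (as in the prerequisites section) narrows the failure down to config, DNS, or firewall.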
9.2 Debug Mode
```bash
# Enable debug logging
export RUST_LOG=heliosdb_streaming=debug

# Or in config:
# [logging]
# level = "debug"

# Enable backtrace
export RUST_BACKTRACE=full

# Run with profiling
cargo build --release --features profiling
./target/release/heliosdb-streaming --profile
```
9.3 Useful Commands
```bash
# Check process status
ps aux | grep heliosdb-streaming

# Check open connections
lsof -i -P -n | grep heliosdb

# Check file descriptors
lsof -p $(pgrep heliosdb-streaming) | wc -l

# Monitor logs in real-time
tail -f /var/log/heliosdb/streaming.log | jq .

# Export metrics for analysis
curl -s http://localhost:9090/metrics > metrics-$(date +%Y%m%d-%H%M%S).txt

# Dump thread stacktraces
kill -USR1 $(pgrep heliosdb-streaming)
```
10. Capacity Planning
10.1 Sizing Guidelines
Small Workload (< 10K events/sec)
- Nodes: 1-2
- CPU: 4 cores per node
- RAM: 8 GB per node
- Storage: 50 GB SSD
- Estimated cost: $200-400/month (AWS)
Medium Workload (10K-100K events/sec)
- Nodes: 3-5
- CPU: 8 cores per node
- RAM: 16 GB per node
- Storage: 100 GB NVMe SSD
- Estimated cost: $1,500-3,000/month (AWS)
Large Workload (100K-1M events/sec)
- Nodes: 10-20
- CPU: 16 cores per node
- RAM: 32 GB per node
- Storage: 200 GB NVMe SSD
- Estimated cost: $8,000-15,000/month (AWS)
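For provisioning scripts, the tiers above can be encoded directly. A small helper (tier boundaries taken from the tables above; the node counts in the comments are the same guideline figures):

```python
def sizing_tier(events_per_sec: int) -> str:
    """Map sustained throughput to the sizing tiers above."""
    if events_per_sec < 10_000:
        return "small"   # 1-2 nodes, 4 cores / 8 GB each
    if events_per_sec < 100_000:
        return "medium"  # 3-5 nodes, 8 cores / 16 GB each
    if events_per_sec <= 1_000_000:
        return "large"   # 10-20 nodes, 16 cores / 32 GB each
    return "custom"      # beyond these tables; plan separately
```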
10.2 Capacity Calculator
```python
def calculate_resources(events_per_sec, avg_event_size_bytes, window_size_secs):
    """Calculate required resources for HeliosDB Streaming."""
    # Throughput calculation
    throughput_mbps = (events_per_sec * avg_event_size_bytes * 8) / 1_000_000

    # CPU cores needed (assuming 30K events/sec per core)
    cpu_cores = max(4, int(events_per_sec / 30_000) * 2)

    # Memory needed (state size + overhead)
    state_size_gb = (events_per_sec * window_size_secs * avg_event_size_bytes) / (1024**3)
    memory_gb = max(8, int(state_size_gb * 2 + 4))  # 2x state + 4GB overhead

    # Storage needed (checkpoints, assuming 10 kept)
    storage_gb = max(50, int(state_size_gb * 10 * 1.5))  # 10 checkpoints + 50% overhead

    # Number of nodes (assuming 8 cores, 16GB per node)
    nodes = max(1, int(cpu_cores / 8))

    return {
        "throughput_mbps": throughput_mbps,
        "cpu_cores": cpu_cores,
        "memory_gb": memory_gb,
        "storage_gb": storage_gb,
        "nodes": nodes,
        "cost_per_month_usd": nodes * 500,  # Rough estimate
    }

# Example usage
resources = calculate_resources(
    events_per_sec=50_000,
    avg_event_size_bytes=512,
    window_size_secs=3600,
)
print(resources)
# Output: {'throughput_mbps': 204.8, 'cpu_cores': 4, 'memory_gb': 175,
#          'storage_gb': 1287, 'nodes': 1, 'cost_per_month_usd': 500}
```
10.3 Growth Planning
Monthly Events → Required Resources

```
1B events/month    →  1 node   (4 cores, 8 GB)   →   $200/month
10B events/month   →  2 nodes  (8 cores, 16 GB)  →   $600/month
100B events/month  →  5 nodes  (16 cores, 32 GB) → $2,500/month
1T events/month    → 15 nodes  (32 cores, 64 GB) → $7,500/month
```
Summary
This production deployment guide covers:
- Single-node deployment with SystemD
- Kubernetes production setup with HA
- Docker Compose for local development
- Complete monitoring with Prometheus + Grafana
- Enterprise security (TLS, JWT, RBAC, KMS)
- Performance tuning guidelines
- Operational runbook (startup, shutdown, backup, restore)
- Troubleshooting common issues
- Capacity planning and cost estimation
For additional support:
- Documentation: https://docs.heliosdb.com
- GitHub: https://github.com/heliosdb/heliosdb
- Community: https://community.heliosdb.com
- Enterprise Support: support@heliosdb.com
Next Steps:
- Choose deployment method (K8s recommended for production)
- Set up monitoring (Prometheus + Grafana)
- Configure security (TLS, JWT, KMS)
- Run load tests to validate capacity
- Set up alerting for critical metrics
- Document your specific deployment for your team