F1.3 Flink Streaming - Production Deployment Guide
Feature: Real-Time Stream Processing Engine Version: v5.0-v5.4 Status: Production-Ready (152 Tests Passing) Date: October 2025
📋 Table of Contents
- Overview
- Quick Start
- Docker Deployment
- Kubernetes Deployment
- Monitoring & Observability
- Operational Runbook
- Security Configuration
- Capacity Planning
- Troubleshooting
1. Overview
1.1 Deployment Options
| Option | Use Case | Complexity | Scalability | Cost |
|---|---|---|---|---|
| Docker Compose | Dev/Test, Small prod | Low | Limited (1-4 nodes) | $100-500/month |
| Kubernetes | Production, Enterprise | Medium | High (unlimited) | $500-5000/month |
| Managed K8s (EKS/AKS/GKE) | Production, Cloud-native | Medium-High | Very High | $1000-10000/month |
| Bare Metal | On-premise, High performance | High | Medium | Hardware dependent |
Recommended: Kubernetes (EKS/AKS/GKE) for production
1.2 System Requirements
Minimum (Single Node):
- CPU: 4 cores (8 threads)
- RAM: 8 GB
- Storage: 50 GB SSD
- Network: 1 Gbps
- OS: Linux (Ubuntu 22.04, RHEL 8+, Amazon Linux 2)
Recommended (Production Node):
- CPU: 8-16 cores
- RAM: 16-32 GB
- Storage: 100-500 GB NVMe SSD
- Network: 10 Gbps
- OS: Ubuntu 22.04 LTS
High-Performance (Large Deployments):
- CPU: 32-64 cores
- RAM: 64-128 GB
- Storage: 1-2 TB NVMe SSD
- Network: 25-100 Gbps
- OS: Ubuntu 22.04 LTS + kernel tuning
1.3 Prerequisites
Required:
- Docker 24.0+ (if using containers)
- Kubernetes 1.28+ (if using K8s)
- kubectl CLI
- Helm 3.12+ (optional, for easy K8s deployment)
- Git (for cloning repository)
- Rust 1.75+ (for building from source)
Optional:
- Prometheus (monitoring)
- Grafana (dashboards)
- Kafka/Pulsar (event sources)
- Redis (for distributed caching)
2. Quick Start
2.1 Docker Compose (5 Minutes)
For: Development, testing, small production deployments
```bash
# 1. Clone repository
git clone https://github.com/danimoya/HeliosDB.git
cd HeliosDB/heliosdb-streaming

# 2. Create docker-compose.yml (see section 3.2)
# Copy the provided docker-compose.yml to your directory

# 3. Start services
docker-compose up -d

# 4. Check status
docker-compose ps

# 5. View logs
docker-compose logs -f heliosdb-streaming

# 6. Run tests
docker-compose exec heliosdb-streaming cargo test
# Expected: 152 tests passing in ~2 seconds
```

Access:
- HeliosDB Streaming: http://localhost:8080
- Prometheus: http://localhost:9091 (host port 9091, since HeliosDB metrics occupy 9090)
- Grafana: http://localhost:3000 (admin/admin)
2.2 Kubernetes (15 Minutes)
For: Production deployments, auto-scaling, high availability
```bash
# 1. Create namespace
kubectl create namespace heliosdb

# 2. Deploy HeliosDB Streaming
kubectl apply -f k8s/heliosdb-streaming-deployment.yaml

# 3. Create service
kubectl apply -f k8s/heliosdb-streaming-service.yaml

# 4. Deploy monitoring (optional)
kubectl apply -f k8s/prometheus.yaml
kubectl apply -f k8s/grafana.yaml

# 5. Check deployment
kubectl get pods -n heliosdb

# Expected output:
# NAME                   READY   STATUS    RESTARTS   AGE
# heliosdb-streaming-0   1/1     Running   0          2m
# heliosdb-streaming-1   1/1     Running   0          2m
# heliosdb-streaming-2   1/1     Running   0          2m
# heliosdb-streaming-3   1/1     Running   0          2m
```

3. Docker Deployment
3.1 Dockerfile
Create Dockerfile in heliosdb-streaming/:
```dockerfile
# Multi-stage build for optimal image size

# Stage 1: Builder
FROM rust:1.75-slim AS builder

# Install build dependencies
RUN apt-get update && apt-get install -y \
    pkg-config \
    libssl-dev \
    cmake \
    protobuf-compiler \
    && rm -rf /var/lib/apt/lists/*

# Create app directory
WORKDIR /app

# Copy Cargo files
COPY Cargo.toml Cargo.lock ./
COPY heliosdb-streaming/Cargo.toml ./heliosdb-streaming/

# Copy source code
COPY heliosdb-streaming/src ./heliosdb-streaming/src

# Build release binary
RUN cargo build --release --manifest-path heliosdb-streaming/Cargo.toml

# Run tests to validate
RUN cargo test --release --manifest-path heliosdb-streaming/Cargo.toml

# Stage 2: Runtime
FROM ubuntu:22.04

# Install runtime dependencies (curl is required by the HEALTHCHECK below)
RUN apt-get update && apt-get install -y \
    ca-certificates \
    curl \
    libssl3 \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user
RUN useradd -m -u 1000 heliosdb

# Create directories
RUN mkdir -p /app/data /app/logs /app/checkpoints && \
    chown -R heliosdb:heliosdb /app

# Switch to non-root user
USER heliosdb
WORKDIR /app

# Copy binary from builder
COPY --from=builder --chown=heliosdb:heliosdb \
    /app/target/release/heliosdb-streaming /app/

# Copy config
COPY --chown=heliosdb:heliosdb heliosdb-streaming/config.yaml /app/

# Expose ports
EXPOSE 8080 9090

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

# Set environment variables
ENV RUST_LOG=info
ENV RUST_BACKTRACE=1
ENV HELIOSDB_DATA_DIR=/app/data
ENV HELIOSDB_CHECKPOINT_DIR=/app/checkpoints

# Run application
CMD ["/app/heliosdb-streaming"]
```

Build:

```bash
docker build -t heliosdb/streaming:v5.4 -f heliosdb-streaming/Dockerfile .
```

Image Size: ~150 MB (optimized)
3.2 docker-compose.yml
Create docker-compose.yml:
```yaml
version: '3.8'

services:
  # HeliosDB Streaming
  heliosdb-streaming:
    image: heliosdb/streaming:v5.4
    container_name: heliosdb-streaming
    restart: unless-stopped
    ports:
      - "8080:8080"   # API
      - "9090:9090"   # Metrics
    volumes:
      - ./data:/app/data
      - ./checkpoints:/app/checkpoints
      - ./logs:/app/logs
      - ./config.yaml:/app/config.yaml:ro
    environment:
      - RUST_LOG=info
      - RUST_BACKTRACE=1
      - HELIOSDB_THREADS=4
      - HELIOSDB_MAX_MEMORY_MB=4096
      - HELIOSDB_CHECKPOINT_INTERVAL_SECS=60
      - HELIOSDB_KMS_PROVIDER=local  # or aws_kms, azure_keyvault
    networks:
      - heliosdb-network
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G
        reservations:
          cpus: '2'
          memory: 4G

  # Kafka (optional, for event sources)
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    container_name: kafka
    restart: unless-stopped
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    networks:
      - heliosdb-network
    depends_on:
      - zookeeper

  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    container_name: zookeeper
    restart: unless-stopped
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    networks:
      - heliosdb-network

  # Redis (optional, for distributed caching)
  redis:
    image: redis:7.2-alpine
    container_name: redis
    restart: unless-stopped
    ports:
      - "6379:6379"
    command: redis-server --appendonly yes
    volumes:
      - redis-data:/data
    networks:
      - heliosdb-network

  # Prometheus (monitoring)
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9091:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    networks:
      - heliosdb-network

  # Grafana (dashboards)
  grafana:
    image: grafana/grafana:10.1.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./monitoring/grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml:ro
      - ./monitoring/grafana-dashboards.yml:/etc/grafana/provisioning/dashboards/dashboards.yml:ro
      - ./monitoring/dashboards:/var/lib/grafana/dashboards:ro
      - grafana-data:/var/lib/grafana
    networks:
      - heliosdb-network
    depends_on:
      - prometheus

networks:
  heliosdb-network:
    driver: bridge

volumes:
  redis-data:
  prometheus-data:
  grafana-data:
```

Usage:

```bash
# Start all services
docker-compose up -d

# Scale HeliosDB instances
# (remove `container_name` and the fixed host-port mappings from the service
#  first; fixed names and ports cannot be replicated)
docker-compose up -d --scale heliosdb-streaming=4

# Stop services
docker-compose down

# Stop and remove volumes
docker-compose down -v
```

3.3 Configuration File
Create config.yaml:
```yaml
# HeliosDB Streaming Configuration

# Server Configuration
server:
  host: "0.0.0.0"
  port: 8080
  metrics_port: 9090
  max_connections: 1000

# Threading
threads:
  worker_threads: 4
  blocking_threads: 4

# Memory Management
memory:
  max_memory_mb: 4096
  buffer_pool_size_mb: 512
  gc_interval_secs: 300

# Checkpoint Configuration
checkpoint:
  enabled: true
  interval_secs: 60
  timeout_secs: 10
  min_pause_secs: 30
  max_concurrent: 1
  directory: "/app/checkpoints"
  compression: true
  encryption: true

# State Backend
state_backend:
  type: "rocksdb"  # or "inmemory" for testing
  path: "/app/data/rocksdb"
  cache_size_mb: 1024
  block_cache_mb: 512

# KMS Configuration (choose one)
kms:
  provider: "local"  # or "aws_kms", "azure_keyvault", "gcp_kms"

  # AWS KMS (uncomment if using AWS)
  # aws:
  #   key_id: "arn:aws:kms:us-east-1:123456789012:key/abc-123"
  #   region: "us-east-1"

  # Azure Key Vault (uncomment if using Azure)
  # azure:
  #   vault_url: "https://my-vault.vault.azure.net/"
  #   key_name: "heliosdb-master-key"

  # Local (for dev/test)
  local:
    master_key_file: "/app/data/master.key"

# Key Rotation Policy
key_rotation:
  enabled: true
  interval_secs: 2592000  # 30 days
  max_previous_keys: 3
  auto_rotate: true

# Backpressure Configuration
backpressure:
  strategy: "adaptive"  # or "block", "drop_oldest", "drop_newest", "signal"
  initial_buffer_size: 100
  min_buffer_size: 10
  max_buffer_size: 200

# Connectors
connectors:
  kafka:
    enabled: true
    bootstrap_servers:
      - "kafka:9092"
    consumer_group: "heliosdb-streaming"

  redis:
    enabled: true
    host: "redis"
    port: 6379
    pool_size: 10

# Monitoring
monitoring:
  prometheus:
    enabled: true
    port: 9090

logging:
  level: "info"   # debug, info, warn, error
  format: "json"  # or "text"
  output: "/app/logs/heliosdb.log"

# Performance Tuning
performance:
  event_batch_size: 1000
  window_size_secs: 60
  watermark_interval_secs: 1
  allowed_lateness_secs: 60
```

4. Kubernetes Deployment
4.1 Namespace
Create k8s/namespace.yaml:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: heliosdb
  labels:
    app: heliosdb
    environment: production
```

4.2 ConfigMap
Create k8s/configmap.yaml:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: heliosdb-streaming-config
  namespace: heliosdb
data:
  config.yaml: |
    server:
      host: "0.0.0.0"
      port: 8080
      metrics_port: 9090

    threads:
      worker_threads: 8
      blocking_threads: 4

    memory:
      max_memory_mb: 8192
      buffer_pool_size_mb: 1024

    checkpoint:
      enabled: true
      interval_secs: 60
      directory: "/data/checkpoints"
      compression: true
      encryption: true

    state_backend:
      type: "rocksdb"
      path: "/data/rocksdb"
      cache_size_mb: 2048

    kms:
      provider: "aws_kms"
      aws:
        key_id: "${AWS_KMS_KEY_ID}"
        region: "${AWS_REGION}"

    backpressure:
      strategy: "adaptive"
      initial_buffer_size: 100

    monitoring:
      prometheus:
        enabled: true
        port: 9090

    logging:
      level: "info"
      format: "json"
```

4.3 Secret
Create k8s/secret.yaml:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: heliosdb-streaming-secrets
  namespace: heliosdb
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "your-aws-access-key"
  AWS_SECRET_ACCESS_KEY: "your-aws-secret-key"
  AWS_REGION: "us-east-1"
  AWS_KMS_KEY_ID: "arn:aws:kms:us-east-1:123456789012:key/abc-123"
```

Note: Use Kubernetes Secrets or external secret managers (AWS Secrets Manager, Azure Key Vault, HashiCorp Vault) for production; never commit real credentials to version control.
4.4 StatefulSet
Create k8s/statefulset.yaml:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: heliosdb-streaming
  namespace: heliosdb
  labels:
    app: heliosdb-streaming
    version: v5.4
spec:
  serviceName: heliosdb-streaming
  replicas: 4
  selector:
    matchLabels:
      app: heliosdb-streaming
  template:
    metadata:
      labels:
        app: heliosdb-streaming
        version: v5.4
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: heliosdb-streaming
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsNonRoot: true
      containers:
        - name: heliosdb-streaming
          image: heliosdb/streaming:v5.4
          imagePullPolicy: IfNotPresent
          ports:
            - name: api
              containerPort: 8080
              protocol: TCP
            - name: metrics
              containerPort: 9090
              protocol: TCP
          env:
            - name: RUST_LOG
              value: "info"
            - name: RUST_BACKTRACE
              value: "1"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          envFrom:
            - secretRef:
                name: heliosdb-streaming-secrets
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
            - name: data
              mountPath: /data
            - name: logs
              mountPath: /app/logs
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
            limits:
              cpu: "8"
              memory: "16Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
      volumes:
        - name: config
          configMap:
            name: heliosdb-streaming-config
        - name: logs
          emptyDir: {}
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3  # or "standard", "premium-rwo" depending on cloud provider
        resources:
          requests:
            storage: 100Gi
```

4.5 Service
Create k8s/service.yaml:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: heliosdb-streaming
  namespace: heliosdb
  labels:
    app: heliosdb-streaming
spec:
  type: ClusterIP
  clusterIP: None  # Headless service for StatefulSet
  ports:
    - name: api
      port: 8080
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    app: heliosdb-streaming
---
apiVersion: v1
kind: Service
metadata:
  name: heliosdb-streaming-lb
  namespace: heliosdb
  labels:
    app: heliosdb-streaming
spec:
  type: LoadBalancer
  ports:
    - name: api
      port: 8080
      targetPort: 8080
      protocol: TCP
  selector:
    app: heliosdb-streaming
```

4.6 ServiceAccount & RBAC
Create k8s/rbac.yaml:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: heliosdb-streaming
  namespace: heliosdb
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: heliosdb-streaming
  namespace: heliosdb
rules:
  - apiGroups: [""]
    resources: ["pods", "configmaps", "secrets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: heliosdb-streaming
  namespace: heliosdb
subjects:
  - kind: ServiceAccount
    name: heliosdb-streaming
    namespace: heliosdb
roleRef:
  kind: Role
  name: heliosdb-streaming
  apiGroup: rbac.authorization.k8s.io
```

4.7 HorizontalPodAutoscaler (HPA)
Create k8s/hpa.yaml:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: heliosdb-streaming-hpa
  namespace: heliosdb
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: heliosdb-streaming
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 2
          periodSeconds: 30
      selectPolicy: Max
```

4.8 Deployment Commands
```bash
# Create namespace
kubectl apply -f k8s/namespace.yaml

# Deploy RBAC
kubectl apply -f k8s/rbac.yaml

# Create ConfigMap and Secrets
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/secret.yaml

# Deploy StatefulSet
kubectl apply -f k8s/statefulset.yaml

# Create Services
kubectl apply -f k8s/service.yaml

# Deploy HPA (optional)
kubectl apply -f k8s/hpa.yaml

# Check deployment
kubectl get all -n heliosdb

# Check logs
kubectl logs -f heliosdb-streaming-0 -n heliosdb

# Scale manually
kubectl scale statefulset heliosdb-streaming --replicas=8 -n heliosdb

# Rolling update
kubectl set image statefulset/heliosdb-streaming heliosdb-streaming=heliosdb/streaming:v5.5 -n heliosdb
```

5. Monitoring & Observability
5.1 Prometheus Configuration
Create monitoring/prometheus.yml:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'heliosdb-production'
    environment: 'prod'

scrape_configs:
  # HeliosDB Streaming metrics
  - job_name: 'heliosdb-streaming'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - heliosdb
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: heliosdb-streaming
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name
```

5.2 Grafana Dashboards
Create monitoring/dashboards/heliosdb-streaming.json:
```json
{
  "dashboard": {
    "title": "HeliosDB Streaming - Production",
    "tags": ["heliosdb", "streaming", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Throughput (Events/sec)",
        "type": "graph",
        "targets": [
          { "expr": "rate(heliosdb_events_processed_total[5m])", "legendFormat": "{{pod}}" }
        ]
      },
      {
        "title": "Latency (p99)",
        "type": "graph",
        "targets": [
          { "expr": "histogram_quantile(0.99, heliosdb_latency_seconds_bucket)", "legendFormat": "p99" }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          { "expr": "heliosdb_memory_usage_bytes / 1024 / 1024", "legendFormat": "{{pod}} - Memory (MB)" }
        ]
      },
      {
        "title": "Backpressure Events",
        "type": "graph",
        "targets": [
          { "expr": "rate(heliosdb_backpressure_events_total[5m])", "legendFormat": "{{pod}}" }
        ]
      },
      {
        "title": "Checkpoint Duration",
        "type": "graph",
        "targets": [
          { "expr": "heliosdb_checkpoint_duration_seconds", "legendFormat": "{{pod}}" }
        ]
      }
    ]
  }
}
```

5.3 Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| `heliosdb_events_processed_total` | Total events processed | < 100/sec (low throughput) |
| `heliosdb_latency_seconds` | Event processing latency | p99 > 10 ms |
| `heliosdb_memory_usage_bytes` | Current memory usage | > 80% of limit |
| `heliosdb_backpressure_events_total` | Backpressure triggers | > 100/min |
| `heliosdb_checkpoint_duration_seconds` | Checkpoint time | > 1 second |
| `heliosdb_checkpoint_failures_total` | Failed checkpoints | > 0 |
| `heliosdb_errors_total` | Total errors | > 10/min |
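The thresholds above can also be checked ad hoc against Prometheus's instant-query HTTP API, e.g. from a cron job or smoke test. A minimal Python sketch; the `PROM_URL` value and the metric names are taken from this guide's setup, while the `breaches()`/`query()` helpers are illustrative, not part of HeliosDB:

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus from the docker-compose setup, published on host port 9091
PROM_URL = "http://localhost:9091"

def breaches(value: float, threshold: float, above: bool = True) -> bool:
    """Return True when a sampled value violates its alert threshold.

    above=True flags values over the threshold (latency, memory);
    above=False flags values under it (throughput).
    """
    return value > threshold if above else value < threshold

def query(expr: str) -> list[float]:
    """Run an instant PromQL query and return the sampled values."""
    url = f"{PROM_URL}/api/v1/query?query={urllib.parse.quote(expr)}"
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    return [float(r["value"][1]) for r in body["data"]["result"]]

# Example: flag pods whose p99 latency exceeds the 10 ms threshold
# for v in query("histogram_quantile(0.99, heliosdb_latency_seconds_bucket)"):
#     print("ALERT" if breaches(v, 0.01) else "ok", v)
```

This duplicates what the alerting rules in section 5.4 do continuously; it is only useful for one-off verification from a shell.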
5.4 Alerting Rules
Create monitoring/alerts.yaml:
```yaml
groups:
  - name: heliosdb_streaming
    interval: 30s
    rules:
      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.99, heliosdb_latency_seconds_bucket) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected (p99 > 10ms)"
          description: "Pod {{ $labels.pod }} has p99 latency of {{ $value }}s"

      # Low throughput
      - alert: LowThroughput
        expr: rate(heliosdb_events_processed_total[5m]) < 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low throughput detected (< 100 events/sec)"

      # High memory usage
      - alert: HighMemoryUsage
        expr: heliosdb_memory_usage_bytes / heliosdb_memory_limit_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage (> 80%)"

      # Checkpoint failures
      - alert: CheckpointFailures
        expr: increase(heliosdb_checkpoint_failures_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Checkpoint failures detected"

      # Pod down
      - alert: PodDown
        expr: up{job="heliosdb-streaming"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "HeliosDB pod is down"
```

6. Operational Runbook
6.1 Starting the Service
Docker Compose:
```bash
# Start all services
docker-compose up -d

# Verify
docker-compose ps
docker-compose logs -f heliosdb-streaming

# Check health
curl http://localhost:8080/health
```

Kubernetes:

```bash
# Deploy (if not already deployed)
kubectl apply -f k8s/

# Check status
kubectl get pods -n heliosdb
kubectl logs -f heliosdb-streaming-0 -n heliosdb

# Check health
kubectl port-forward -n heliosdb heliosdb-streaming-0 8080:8080
curl http://localhost:8080/health
```

6.2 Stopping the Service
Graceful Shutdown (recommended):
```bash
# Docker
docker-compose stop heliosdb-streaming

# Kubernetes
kubectl scale statefulset heliosdb-streaming --replicas=0 -n heliosdb
```

Force Shutdown (if needed):

```bash
# Docker
docker-compose kill heliosdb-streaming

# Kubernetes
kubectl delete pod heliosdb-streaming-0 -n heliosdb --grace-period=0 --force
```

6.3 Scaling
Horizontal Scaling:
```bash
# Docker Compose
docker-compose up -d --scale heliosdb-streaming=8

# Kubernetes (manual)
kubectl scale statefulset heliosdb-streaming --replicas=8 -n heliosdb

# Kubernetes (auto-scaling with HPA)
# HPA will automatically scale based on CPU/memory
kubectl get hpa -n heliosdb
```

Vertical Scaling (increase resources):

```bash
# Edit StatefulSet
kubectl edit statefulset heliosdb-streaming -n heliosdb

# Update resources:
#   resources:
#     requests:
#       cpu: "4"
#       memory: "16Gi"
#     limits:
#       cpu: "16"
#       memory: "32Gi"

# Rolling restart
kubectl rollout restart statefulset/heliosdb-streaming -n heliosdb
```

6.4 Rolling Updates
```bash
# Update image
kubectl set image statefulset/heliosdb-streaming \
  heliosdb-streaming=heliosdb/streaming:v5.5 \
  -n heliosdb

# Check rollout status
kubectl rollout status statefulset/heliosdb-streaming -n heliosdb

# Rollback if needed
kubectl rollout undo statefulset/heliosdb-streaming -n heliosdb
```

6.5 Backup & Restore
Backup Checkpoints:
```bash
# Docker (copy from container)
docker cp heliosdb-streaming:/app/checkpoints ./backup/checkpoints-$(date +%Y%m%d)

# Kubernetes (copy from pod)
kubectl cp heliosdb/heliosdb-streaming-0:/data/checkpoints \
  ./backup/checkpoints-$(date +%Y%m%d)

# Backup to S3
aws s3 sync /data/checkpoints s3://heliosdb-backups/checkpoints/$(date +%Y%m%d)/
```

Restore Checkpoints:

```bash
# Docker
docker cp ./backup/checkpoints-20251029 heliosdb-streaming:/app/checkpoints

# Kubernetes
kubectl cp ./backup/checkpoints-20251029 \
  heliosdb/heliosdb-streaming-0:/data/checkpoints

# Restore from S3
aws s3 sync s3://heliosdb-backups/checkpoints/20251029/ /data/checkpoints/
```

6.6 Log Management
View Logs:
```bash
# Docker
docker-compose logs -f heliosdb-streaming

# Kubernetes
kubectl logs -f heliosdb-streaming-0 -n heliosdb

# All pods
kubectl logs -l app=heliosdb-streaming -n heliosdb --tail=100
```

Log Aggregation (ELK/EFK Stack):

```bash
# Install Elasticsearch + Kibana (optional)
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch -n heliosdb
helm install kibana elastic/kibana -n heliosdb

# Configure Fluentd/Filebeat to ship logs
```

7. Security Configuration
7.1 TLS/SSL
Create TLS secret:
```bash
# Generate self-signed cert (dev/test only)
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=heliosdb-streaming.heliosdb.svc.cluster.local"

# Create Kubernetes secret
kubectl create secret tls heliosdb-streaming-tls \
  --cert=tls.crt --key=tls.key \
  -n heliosdb
```

Update StatefulSet to use TLS:

```yaml
# Add to pod volumes
- name: tls
  secret:
    secretName: heliosdb-streaming-tls

# Add to container volumeMounts
- name: tls
  mountPath: /app/tls
  readOnly: true

# Add environment variables
- name: HELIOSDB_TLS_ENABLED
  value: "true"
- name: HELIOSDB_TLS_CERT
  value: "/app/tls/tls.crt"
- name: HELIOSDB_TLS_KEY
  value: "/app/tls/tls.key"
```

7.2 Network Policies
Create k8s/network-policy.yaml:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: heliosdb-streaming-network-policy
  namespace: heliosdb
spec:
  podSelector:
    matchLabels:
      app: heliosdb-streaming
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
        - podSelector:
            matchLabels:
              app: prometheus
      ports:
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 9090
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 9092  # Kafka
        - protocol: TCP
          port: 6379  # Redis
        - protocol: TCP
          port: 443   # HTTPS (for KMS)
```

7.3 Pod Security Policy
Create k8s/pod-security-policy.yaml:
Note: PodSecurityPolicy was removed in Kubernetes 1.25, so on the 1.28+ clusters this guide targets, enforce equivalent rules with Pod Security Admission instead (e.g. label the namespace with `pod-security.kubernetes.io/enforce: restricted`). The manifest below is kept as a reference for legacy clusters only.

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: heliosdb-streaming-psp
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
  readOnlyRootFilesystem: false
```

8. Capacity Planning
8.1 Sizing Guidelines
Small Deployment (< 10K events/sec):
- Nodes: 2-4
- CPU: 4 cores per node
- RAM: 8 GB per node
- Storage: 100 GB per node
- Cost: $500-1000/month
Medium Deployment (10K-100K events/sec):
- Nodes: 4-8
- CPU: 8 cores per node
- RAM: 16 GB per node
- Storage: 200 GB per node
- Cost: $2000-4000/month
Large Deployment (100K-1M events/sec):
- Nodes: 8-32
- CPU: 16 cores per node
- RAM: 32 GB per node
- Storage: 500 GB per node
- Cost: $8000-20000/month
8.2 Resource Formulas
Memory Calculation:
```
Required Memory = Base Memory + (Buffer Size × Number of Streams) + State Size

- Base Memory: ~2 GB
- Buffer Size: 100 MB per stream (configurable)
- State Size: depends on window size and event rate
```

Storage Calculation:

```
Required Storage = (Checkpoint Size × Retention Count) + Log Size

- Checkpoint Size: ~10-20% of memory
- Retention Count: 3-10 (configurable)
- Log Size: ~1 GB per day (with log rotation)
```

CPU Calculation:

```
Required CPU = Base CPU + (Events/sec × CPU per Event)

- Base CPU: 1 core
- CPU per Event: ~0.1 ms (0.0001 core-seconds)
- Example: 10K events/sec → 1 + (10000 × 0.0001) = 2 cores minimum
```

9. Troubleshooting
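The three formulas translate directly into a back-of-the-envelope sizing helper. A minimal Python sketch using the constants above (2 GB base memory, 100 MB buffer per stream, 0.0001 core-seconds per event); the 15% checkpoint fraction, 5-checkpoint retention, and 7 days of retained logs are illustrative defaults, not measurements:

```python
def required_memory_mb(num_streams: int, state_size_mb: float,
                       base_mb: int = 2048, buffer_mb: int = 100) -> float:
    """Memory = base + buffer per stream + state."""
    return base_mb + buffer_mb * num_streams + state_size_mb

def required_storage_gb(memory_mb: float, retention_count: int = 5,
                        checkpoint_frac: float = 0.15,
                        log_gb_per_day: float = 1.0, log_days: int = 7) -> float:
    """Storage = retained checkpoints + rotated logs (log_days is an assumption)."""
    checkpoint_gb = memory_mb * checkpoint_frac / 1024
    return checkpoint_gb * retention_count + log_gb_per_day * log_days

def required_cpu_cores(events_per_sec: float, base_cores: float = 1.0,
                       core_secs_per_event: float = 0.0001) -> float:
    """CPU = base + per-event cost."""
    return base_cores + events_per_sec * core_secs_per_event

# Example: 10K events/sec, 8 streams, ~4 GB of window state
print(required_cpu_cores(10_000))   # 2.0 cores, matching the worked example above
print(required_memory_mb(8, 4096))  # 6944 MB (2048 + 800 + 4096)
```

Round the results up to the node sizes in section 8.1 and leave headroom for checkpoint spikes.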
9.1 Common Issues
Issue 1: High Latency (p99 > 10ms)
Symptoms:
- Slow event processing
- Increasing backlog
Diagnosis:
```bash
# Check metrics
kubectl exec -it heliosdb-streaming-0 -n heliosdb -- \
  curl localhost:9090/metrics | grep latency

# Check CPU/memory
kubectl top pod heliosdb-streaming-0 -n heliosdb
```

Solutions:
- Scale horizontally (add more pods)
- Increase CPU/memory resources
- Optimize window sizes
- Enable compression for checkpoints
Issue 2: Checkpoint Failures
Symptoms:
- `heliosdb_checkpoint_failures_total` > 0
- Errors in logs
Diagnosis:
```bash
# Check logs
kubectl logs heliosdb-streaming-0 -n heliosdb | grep checkpoint

# Check disk space
kubectl exec -it heliosdb-streaming-0 -n heliosdb -- df -h
```

Solutions:
- Increase checkpoint timeout
- Check disk space (increase PVC size)
- Verify KMS access (check AWS/Azure credentials)
- Check state backend configuration
Issue 3: Memory Leaks
Symptoms:
- Memory usage continuously increasing
- OOMKilled pods
Diagnosis:
```bash
# Check memory metrics
kubectl exec -it heliosdb-streaming-0 -n heliosdb -- \
  curl localhost:9090/metrics | grep memory

# Check pod restarts
kubectl get pods -n heliosdb
```

Solutions:
- Reduce buffer sizes
- Enable aggressive garbage collection
- Check for event accumulation
- Increase memory limits
Issue 4: Pod Not Starting
Symptoms:
- Pod stuck in `Pending` or `CrashLoopBackOff`
Diagnosis:
```bash
# Check pod status
kubectl describe pod heliosdb-streaming-0 -n heliosdb

# Check events
kubectl get events -n heliosdb --sort-by='.lastTimestamp'
```

Solutions:
- Check resource requests (CPU/memory)
- Verify PVC is bound
- Check secrets/configmaps exist
- Verify image pull secrets (if using private registry)
9.2 Debug Commands
```bash
# Get pod shell
kubectl exec -it heliosdb-streaming-0 -n heliosdb -- /bin/bash

# Check running processes
kubectl exec heliosdb-streaming-0 -n heliosdb -- ps aux

# Check network connectivity
kubectl exec heliosdb-streaming-0 -n heliosdb -- curl kafka:9092

# Check file permissions
kubectl exec heliosdb-streaming-0 -n heliosdb -- ls -la /data

# Run cargo tests (only if the image ships the Rust toolchain;
# the slim runtime image in section 3.1 does not)
kubectl exec heliosdb-streaming-0 -n heliosdb -- cargo test

# Check Rust binary
kubectl exec heliosdb-streaming-0 -n heliosdb -- /app/heliosdb-streaming --version
```

📞 Support
- Documentation: https://docs.heliosdb.com
- GitHub Issues: https://github.com/danimoya/HeliosDB/issues
- Email: support@heliosdb.com
- Slack: heliosdb.slack.com
Document Version: 1.0 Last Updated: October 29, 2025 Status: Production Deployment Guide Next Review: January 2026