F1.3 Flink Streaming - Production Deployment Guide
Feature: Real-Time Stream Processing Engine Version: v5.0-v5.4 Status: Production-Ready (152 Tests Passing) Date: October 2025
📋 Table of Contents
- Overview
- Quick Start
- Docker Deployment
- Kubernetes Deployment
- Monitoring & Observability
- Operational Runbook
- Security Configuration
- Capacity Planning
- Troubleshooting
1. Overview
1.1 Deployment Options
| Option | Use Case | Complexity | Scalability | Cost |
|---|---|---|---|---|
| Docker Compose | Dev/Test, Small prod | Low | Limited (1-4 nodes) | $100-500/month |
| Kubernetes | Production, Enterprise | Medium | High (unlimited) | $500-5000/month |
| Managed K8s (EKS/AKS/GKE) | Production, Cloud-native | Medium-High | Very High | $1000-10000/month |
| Bare Metal | On-premise, High performance | High | Medium | Hardware dependent |
Recommended: Kubernetes (EKS/AKS/GKE) for production
1.2 System Requirements
Minimum (Single Node):
- CPU: 4 cores (8 threads)
- RAM: 8 GB
- Storage: 50 GB SSD
- Network: 1 Gbps
- OS: Linux (Ubuntu 22.04, RHEL 8+, Amazon Linux 2)
Recommended (Production Node):
- CPU: 8-16 cores
- RAM: 16-32 GB
- Storage: 100-500 GB NVMe SSD
- Network: 10 Gbps
- OS: Ubuntu 22.04 LTS
High-Performance (Large Deployments):
- CPU: 32-64 cores
- RAM: 64-128 GB
- Storage: 1-2 TB NVMe SSD
- Network: 25-100 Gbps
- OS: Ubuntu 22.04 LTS + kernel tuning
1.3 Prerequisites
Required:
- Docker 24.0+ (if using containers)
- Kubernetes 1.28+ (if using K8s)
- kubectl CLI
- Helm 3.12+ (optional, for easy K8s deployment)
- Git (for cloning repository)
- Rust 1.75+ (for building from source)
Optional:
- Prometheus (monitoring)
- Grafana (dashboards)
- Kafka/Pulsar (event sources)
- Redis (for distributed caching)
2. Quick Start
2.1 Docker Compose (5 Minutes)
For: Development, testing, small production deployments
```bash
# 1. Clone repository
git clone https://github.com/danimoya/HeliosDB.git
cd HeliosDB/heliosdb-streaming

# 2. Create docker-compose.yml (see section 3.2)
# Copy the provided docker-compose.yml to your directory

# 3. Start services
docker-compose up -d

# 4. Check status
docker-compose ps

# 5. View logs
docker-compose logs -f heliosdb-streaming

# 6. Run tests
docker-compose exec heliosdb-streaming cargo test
# Expected: 152 tests passing in ~2 seconds
```

Access:
- HeliosDB Streaming: http://localhost:8080
- Prometheus: http://localhost:9091 (host port 9091, since HeliosDB metrics occupy 9090)
- Grafana: http://localhost:3000 (admin/admin)
2.2 Kubernetes (15 Minutes)
For: Production deployments, auto-scaling, high availability
```bash
# 1. Create namespace
kubectl create namespace heliosdb

# 2. Deploy HeliosDB Streaming
kubectl apply -f k8s/heliosdb-streaming-deployment.yaml

# 3. Create service
kubectl apply -f k8s/heliosdb-streaming-service.yaml

# 4. Deploy monitoring (optional)
kubectl apply -f k8s/prometheus.yaml
kubectl apply -f k8s/grafana.yaml

# 5. Check deployment
kubectl get pods -n heliosdb

# Expected output:
# NAME                   READY   STATUS    RESTARTS   AGE
# heliosdb-streaming-0   1/1     Running   0          2m
# heliosdb-streaming-1   1/1     Running   0          2m
# heliosdb-streaming-2   1/1     Running   0          2m
# heliosdb-streaming-3   1/1     Running   0          2m
```

3. Docker Deployment
3.1 Dockerfile
Create Dockerfile in heliosdb-streaming/:
```dockerfile
# Multi-stage build for optimal image size

# Stage 1: Builder
FROM rust:1.75-slim AS builder

# Install build dependencies
RUN apt-get update && apt-get install -y \
    pkg-config \
    libssl-dev \
    cmake \
    protobuf-compiler \
    && rm -rf /var/lib/apt/lists/*

# Create app directory
WORKDIR /app

# Copy Cargo files
COPY Cargo.toml Cargo.lock ./
COPY heliosdb-streaming/Cargo.toml ./heliosdb-streaming/

# Copy source code
COPY heliosdb-streaming/src ./heliosdb-streaming/src

# Build release binary
RUN cargo build --release --manifest-path heliosdb-streaming/Cargo.toml

# Run tests to validate
RUN cargo test --release --manifest-path heliosdb-streaming/Cargo.toml

# Stage 2: Runtime
FROM ubuntu:22.04

# Install runtime dependencies (curl is required by the HEALTHCHECK below)
RUN apt-get update && apt-get install -y \
    ca-certificates \
    curl \
    libssl3 \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user
RUN useradd -m -u 1000 heliosdb

# Create directories
RUN mkdir -p /app/data /app/logs /app/checkpoints && \
    chown -R heliosdb:heliosdb /app

# Switch to non-root user
USER heliosdb
WORKDIR /app

# Copy binary from builder
COPY --from=builder --chown=heliosdb:heliosdb \
    /app/target/release/heliosdb-streaming /app/

# Copy config
COPY --chown=heliosdb:heliosdb heliosdb-streaming/config.yaml /app/

# Expose ports
EXPOSE 8080 9090

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

# Set environment variables
ENV RUST_LOG=info
ENV RUST_BACKTRACE=1
ENV HELIOSDB_DATA_DIR=/app/data
ENV HELIOSDB_CHECKPOINT_DIR=/app/checkpoints

# Run application
CMD ["/app/heliosdb-streaming"]
```

Build:

```bash
docker build -t heliosdb/streaming:v5.4 -f heliosdb-streaming/Dockerfile .
```

Image Size: ~150 MB (optimized)
3.2 docker-compose.yml
Create docker-compose.yml:
```yaml
version: '3.8'

services:
  # HeliosDB Streaming
  heliosdb-streaming:
    image: heliosdb/streaming:v5.4
    container_name: heliosdb-streaming
    restart: unless-stopped
    ports:
      - "8080:8080"   # API
      - "9090:9090"   # Metrics
    volumes:
      - ./data:/app/data
      - ./checkpoints:/app/checkpoints
      - ./logs:/app/logs
      - ./config.yaml:/app/config.yaml:ro
    environment:
      - RUST_LOG=info
      - RUST_BACKTRACE=1
      - HELIOSDB_THREADS=4
      - HELIOSDB_MAX_MEMORY_MB=4096
      - HELIOSDB_CHECKPOINT_INTERVAL_SECS=60
      - HELIOSDB_KMS_PROVIDER=local  # or aws_kms, azure_keyvault
    networks:
      - heliosdb-network
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G
        reservations:
          cpus: '2'
          memory: 4G

  # Kafka (optional, for event sources)
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    container_name: kafka
    restart: unless-stopped
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    networks:
      - heliosdb-network
    depends_on:
      - zookeeper

  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    container_name: zookeeper
    restart: unless-stopped
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    networks:
      - heliosdb-network

  # Redis (optional, for distributed caching)
  redis:
    image: redis:7.2-alpine
    container_name: redis
    restart: unless-stopped
    ports:
      - "6379:6379"
    command: redis-server --appendonly yes
    volumes:
      - redis-data:/data
    networks:
      - heliosdb-network

  # Prometheus (monitoring)
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9091:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    networks:
      - heliosdb-network

  # Grafana (dashboards)
  grafana:
    image: grafana/grafana:10.1.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./monitoring/grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml:ro
      - ./monitoring/grafana-dashboards.yml:/etc/grafana/provisioning/dashboards/dashboards.yml:ro
      - ./monitoring/dashboards:/var/lib/grafana/dashboards:ro
      - grafana-data:/var/lib/grafana
    networks:
      - heliosdb-network
    depends_on:
      - prometheus

networks:
  heliosdb-network:
    driver: bridge

volumes:
  redis-data:
  prometheus-data:
  grafana-data:
```

Usage:

```bash
# Start all services
docker-compose up -d

# Scale HeliosDB instances
# (remove `container_name` and the fixed host-port mappings from the service
#  first; fixed names and ports cannot be replicated)
docker-compose up -d --scale heliosdb-streaming=4

# Stop services
docker-compose down

# Stop and remove volumes
docker-compose down -v
```

3.3 Configuration File
Create config.yaml:
```yaml
# HeliosDB Streaming Configuration

# Server Configuration
server:
  host: "0.0.0.0"
  port: 8080
  metrics_port: 9090
  max_connections: 1000

# Threading
threads:
  worker_threads: 4
  blocking_threads: 4

# Memory Management
memory:
  max_memory_mb: 4096
  buffer_pool_size_mb: 512
  gc_interval_secs: 300

# Checkpoint Configuration
checkpoint:
  enabled: true
  interval_secs: 60
  timeout_secs: 10
  min_pause_secs: 30
  max_concurrent: 1
  directory: "/app/checkpoints"
  compression: true
  encryption: true

# State Backend
state_backend:
  type: "rocksdb"  # or "inmemory" for testing
  path: "/app/data/rocksdb"
  cache_size_mb: 1024
  block_cache_mb: 512

# KMS Configuration (choose one)
kms:
  provider: "local"  # or "aws_kms", "azure_keyvault", "gcp_kms"

  # AWS KMS (uncomment if using AWS)
  # aws:
  #   key_id: "arn:aws:kms:us-east-1:123456789012:key/abc-123"
  #   region: "us-east-1"

  # Azure Key Vault (uncomment if using Azure)
  # azure:
  #   vault_url: "https://my-vault.vault.azure.net/"
  #   key_name: "heliosdb-master-key"

  # Local (for dev/test)
  local:
    master_key_file: "/app/data/master.key"

# Key Rotation Policy
key_rotation:
  enabled: true
  interval_secs: 2592000  # 30 days
  max_previous_keys: 3
  auto_rotate: true

# Backpressure Configuration
backpressure:
  strategy: "adaptive"  # or "block", "drop_oldest", "drop_newest", "signal"
  initial_buffer_size: 100
  min_buffer_size: 10
  max_buffer_size: 200

# Connectors
connectors:
  kafka:
    enabled: true
    bootstrap_servers:
      - "kafka:9092"
    consumer_group: "heliosdb-streaming"

  redis:
    enabled: true
    host: "redis"
    port: 6379
    pool_size: 10

# Monitoring
monitoring:
  prometheus:
    enabled: true
    port: 9090

logging:
  level: "info"   # debug, info, warn, error
  format: "json"  # or "text"
  output: "/app/logs/heliosdb.log"

# Performance Tuning
performance:
  event_batch_size: 1000
  window_size_secs: 60
  watermark_interval_secs: 1
  allowed_lateness_secs: 60
```

4. Kubernetes Deployment
4.1 Namespace
Create k8s/namespace.yaml:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: heliosdb
  labels:
    app: heliosdb
    environment: production
```

4.2 ConfigMap
Create k8s/configmap.yaml:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: heliosdb-streaming-config
  namespace: heliosdb
data:
  config.yaml: |
    server:
      host: "0.0.0.0"
      port: 8080
      metrics_port: 9090

    threads:
      worker_threads: 8
      blocking_threads: 4

    memory:
      max_memory_mb: 8192
      buffer_pool_size_mb: 1024

    checkpoint:
      enabled: true
      interval_secs: 60
      directory: "/data/checkpoints"
      compression: true
      encryption: true

    state_backend:
      type: "rocksdb"
      path: "/data/rocksdb"
      cache_size_mb: 2048

    kms:
      provider: "aws_kms"
      aws:
        key_id: "${AWS_KMS_KEY_ID}"
        region: "${AWS_REGION}"

    backpressure:
      strategy: "adaptive"
      initial_buffer_size: 100

    monitoring:
      prometheus:
        enabled: true
        port: 9090

    logging:
      level: "info"
      format: "json"
```

4.3 Secret
Create k8s/secret.yaml:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: heliosdb-streaming-secrets
  namespace: heliosdb
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "your-aws-access-key"
  AWS_SECRET_ACCESS_KEY: "your-aws-secret-key"
  AWS_REGION: "us-east-1"
  AWS_KMS_KEY_ID: "arn:aws:kms:us-east-1:123456789012:key/abc-123"
```

Note: Use Kubernetes Secrets or external secret managers (AWS Secrets Manager, Azure Key Vault, HashiCorp Vault) for production; never commit real credentials to version control.
4.4 StatefulSet
Create k8s/statefulset.yaml:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: heliosdb-streaming
  namespace: heliosdb
  labels:
    app: heliosdb-streaming
    version: v5.4
spec:
  serviceName: heliosdb-streaming
  replicas: 4
  selector:
    matchLabels:
      app: heliosdb-streaming
  template:
    metadata:
      labels:
        app: heliosdb-streaming
        version: v5.4
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: heliosdb-streaming
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsNonRoot: true
      containers:
        - name: heliosdb-streaming
          image: heliosdb/streaming:v5.4
          imagePullPolicy: IfNotPresent
          ports:
            - name: api
              containerPort: 8080
              protocol: TCP
            - name: metrics
              containerPort: 9090
              protocol: TCP
          env:
            - name: RUST_LOG
              value: "info"
            - name: RUST_BACKTRACE
              value: "1"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          envFrom:
            - secretRef:
                name: heliosdb-streaming-secrets
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
            - name: data
              mountPath: /data
            - name: logs
              mountPath: /app/logs
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
            limits:
              cpu: "8"
              memory: "16Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
      volumes:
        - name: config
          configMap:
            name: heliosdb-streaming-config
        - name: logs
          emptyDir: {}
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3  # or "standard", "premium-rwo" depending on cloud provider
        resources:
          requests:
            storage: 100Gi
```

4.5 Service
Create k8s/service.yaml:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: heliosdb-streaming
  namespace: heliosdb
  labels:
    app: heliosdb-streaming
spec:
  type: ClusterIP
  clusterIP: None  # Headless service for StatefulSet
  ports:
    - name: api
      port: 8080
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    app: heliosdb-streaming
---
apiVersion: v1
kind: Service
metadata:
  name: heliosdb-streaming-lb
  namespace: heliosdb
  labels:
    app: heliosdb-streaming
spec:
  type: LoadBalancer
  ports:
    - name: api
      port: 8080
      targetPort: 8080
      protocol: TCP
  selector:
    app: heliosdb-streaming
```

4.6 ServiceAccount & RBAC
Create k8s/rbac.yaml:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: heliosdb-streaming
  namespace: heliosdb
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: heliosdb-streaming
  namespace: heliosdb
rules:
  - apiGroups: [""]
    resources: ["pods", "configmaps", "secrets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: heliosdb-streaming
  namespace: heliosdb
subjects:
  - kind: ServiceAccount
    name: heliosdb-streaming
    namespace: heliosdb
roleRef:
  kind: Role
  name: heliosdb-streaming
  apiGroup: rbac.authorization.k8s.io
```

4.7 HorizontalPodAutoscaler (HPA)
Create k8s/hpa.yaml:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: heliosdb-streaming-hpa
  namespace: heliosdb
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: heliosdb-streaming
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 2
          periodSeconds: 30
      selectPolicy: Max
```

4.8 Deployment Commands
```bash
# Create namespace
kubectl apply -f k8s/namespace.yaml

# Deploy RBAC
kubectl apply -f k8s/rbac.yaml

# Create ConfigMap and Secrets
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/secret.yaml

# Deploy StatefulSet
kubectl apply -f k8s/statefulset.yaml

# Create Services
kubectl apply -f k8s/service.yaml

# Deploy HPA (optional)
kubectl apply -f k8s/hpa.yaml

# Check deployment
kubectl get all -n heliosdb

# Check logs
kubectl logs -f heliosdb-streaming-0 -n heliosdb

# Scale manually
kubectl scale statefulset heliosdb-streaming --replicas=8 -n heliosdb

# Rolling update
kubectl set image statefulset/heliosdb-streaming heliosdb-streaming=heliosdb/streaming:v5.5 -n heliosdb
```

5. Monitoring & Observability
5.1 Prometheus Configuration
Create monitoring/prometheus.yml:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'heliosdb-production'
    environment: 'prod'

scrape_configs:
  # HeliosDB Streaming metrics
  - job_name: 'heliosdb-streaming'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - heliosdb
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: heliosdb-streaming
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name
```

5.2 Grafana Dashboards
Create monitoring/dashboards/heliosdb-streaming.json:
```json
{
  "dashboard": {
    "title": "HeliosDB Streaming - Production",
    "tags": ["heliosdb", "streaming", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Throughput (Events/sec)",
        "type": "graph",
        "targets": [
          { "expr": "rate(heliosdb_events_processed_total[5m])", "legendFormat": "{{pod}}" }
        ]
      },
      {
        "title": "Latency (p99)",
        "type": "graph",
        "targets": [
          { "expr": "histogram_quantile(0.99, heliosdb_latency_seconds_bucket)", "legendFormat": "p99" }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          { "expr": "heliosdb_memory_usage_bytes / 1024 / 1024", "legendFormat": "{{pod}} - Memory (MB)" }
        ]
      },
      {
        "title": "Backpressure Events",
        "type": "graph",
        "targets": [
          { "expr": "rate(heliosdb_backpressure_events_total[5m])", "legendFormat": "{{pod}}" }
        ]
      },
      {
        "title": "Checkpoint Duration",
        "type": "graph",
        "targets": [
          { "expr": "heliosdb_checkpoint_duration_seconds", "legendFormat": "{{pod}}" }
        ]
      }
    ]
  }
}
```

5.3 Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| `heliosdb_events_processed_total` | Total events processed | < 100/sec (low throughput) |
| `heliosdb_latency_seconds` | Event processing latency | p99 > 10 ms |
| `heliosdb_memory_usage_bytes` | Current memory usage | > 80% of limit |
| `heliosdb_backpressure_events_total` | Backpressure triggers | > 100/min |
| `heliosdb_checkpoint_duration_seconds` | Checkpoint time | > 1 second |
| `heliosdb_checkpoint_failures_total` | Failed checkpoints | > 0 |
| `heliosdb_errors_total` | Total errors | > 10/min |
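The thresholds above can also be checked ad hoc against Prometheus's instant-query HTTP API, e.g. from a cron job or smoke test. A minimal Python sketch; the `PROM_URL` value and the metric names are taken from this guide's setup, while the `breaches()`/`query()` helpers are illustrative, not part of HeliosDB:

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus from the docker-compose setup, published on host port 9091
PROM_URL = "http://localhost:9091"

def breaches(value: float, threshold: float, above: bool = True) -> bool:
    """Return True when a sampled value violates its alert threshold.

    above=True flags values over the threshold (latency, memory);
    above=False flags values under it (throughput).
    """
    return value > threshold if above else value < threshold

def query(expr: str) -> list[float]:
    """Run an instant PromQL query and return the sampled values."""
    url = f"{PROM_URL}/api/v1/query?query={urllib.parse.quote(expr)}"
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    return [float(r["value"][1]) for r in body["data"]["result"]]

# Example: flag pods whose p99 latency exceeds the 10 ms threshold
# for v in query("histogram_quantile(0.99, heliosdb_latency_seconds_bucket)"):
#     print("ALERT" if breaches(v, 0.01) else "ok", v)
```

This duplicates what the alerting rules in section 5.4 do continuously; it is only useful for one-off verification from a shell.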
5.4 Alerting Rules
Create monitoring/alerts.yaml:
```yaml
groups:
  - name: heliosdb_streaming
    interval: 30s
    rules:
      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.99, heliosdb_latency_seconds_bucket) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected (p99 > 10ms)"
          description: "Pod {{ $labels.pod }} has p99 latency of {{ $value }}s"

      # Low throughput
      - alert: LowThroughput
        expr: rate(heliosdb_events_processed_total[5m]) < 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low throughput detected (< 100 events/sec)"

      # High memory usage
      - alert: HighMemoryUsage
        expr: heliosdb_memory_usage_bytes / heliosdb_memory_limit_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage (> 80%)"

      # Checkpoint failures
      - alert: CheckpointFailures
        expr: increase(heliosdb_checkpoint_failures_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Checkpoint failures detected"

      # Pod down
      - alert: PodDown
        expr: up{job="heliosdb-streaming"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "HeliosDB pod is down"
```

6. Operational Runbook
6.1 Starting the Service
Docker Compose:
```bash
# Start all services
docker-compose up -d

# Verify
docker-compose ps
docker-compose logs -f heliosdb-streaming

# Check health
curl http://localhost:8080/health
```

Kubernetes:

```bash
# Deploy (if not already deployed)
kubectl apply -f k8s/

# Check status
kubectl get pods -n heliosdb
kubectl logs -f heliosdb-streaming-0 -n heliosdb

# Check health
kubectl port-forward -n heliosdb heliosdb-streaming-0 8080:8080
curl http://localhost:8080/health
```

6.2 Stopping the Service
Graceful Shutdown (recommended):
```bash
# Docker
docker-compose stop heliosdb-streaming

# Kubernetes
kubectl scale statefulset heliosdb-streaming --replicas=0 -n heliosdb
```

Force Shutdown (if needed):

```bash
# Docker
docker-compose kill heliosdb-streaming

# Kubernetes
kubectl delete pod heliosdb-streaming-0 -n heliosdb --grace-period=0 --force
```

6.3 Scaling
Horizontal Scaling:
```bash
# Docker Compose
docker-compose up -d --scale heliosdb-streaming=8

# Kubernetes (manual)
kubectl scale statefulset heliosdb-streaming --replicas=8 -n heliosdb

# Kubernetes (auto-scaling with HPA)
# HPA will automatically scale based on CPU/memory
kubectl get hpa -n heliosdb
```

Vertical Scaling (increase resources):

```bash
# Edit StatefulSet
kubectl edit statefulset heliosdb-streaming -n heliosdb

# Update resources:
#   resources:
#     requests:
#       cpu: "4"
#       memory: "16Gi"
#     limits:
#       cpu: "16"
#       memory: "32Gi"

# Rolling restart
kubectl rollout restart statefulset/heliosdb-streaming -n heliosdb
```

6.4 Rolling Updates
```bash
# Update image
kubectl set image statefulset/heliosdb-streaming \
  heliosdb-streaming=heliosdb/streaming:v5.5 \
  -n heliosdb

# Check rollout status
kubectl rollout status statefulset/heliosdb-streaming -n heliosdb

# Rollback if needed
kubectl rollout undo statefulset/heliosdb-streaming -n heliosdb
```

6.5 Backup & Restore
Backup Checkpoints:
```bash
# Docker (copy from container)
docker cp heliosdb-streaming:/app/checkpoints ./backup/checkpoints-$(date +%Y%m%d)

# Kubernetes (copy from pod)
kubectl cp heliosdb/heliosdb-streaming-0:/data/checkpoints \
  ./backup/checkpoints-$(date +%Y%m%d)

# Backup to S3
aws s3 sync /data/checkpoints s3://heliosdb-backups/checkpoints/$(date +%Y%m%d)/
```

Restore Checkpoints:

```bash
# Docker
docker cp ./backup/checkpoints-20251029 heliosdb-streaming:/app/checkpoints

# Kubernetes
kubectl cp ./backup/checkpoints-20251029 \
  heliosdb/heliosdb-streaming-0:/data/checkpoints

# Restore from S3
aws s3 sync s3://heliosdb-backups/checkpoints/20251029/ /data/checkpoints/
```

6.6 Log Management
View Logs:
```bash
# Docker
docker-compose logs -f heliosdb-streaming

# Kubernetes
kubectl logs -f heliosdb-streaming-0 -n heliosdb

# All pods
kubectl logs -l app=heliosdb-streaming -n heliosdb --tail=100
```

Log Aggregation (ELK/EFK Stack):

```bash
# Install Elasticsearch + Kibana (optional)
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch -n heliosdb
helm install kibana elastic/kibana -n heliosdb

# Configure Fluentd/Filebeat to ship logs
```

7. Security Configuration
7.1 TLS/SSL
Create TLS secret:
```bash
# Generate self-signed cert (dev/test only)
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=heliosdb-streaming.heliosdb.svc.cluster.local"

# Create Kubernetes secret
kubectl create secret tls heliosdb-streaming-tls \
  --cert=tls.crt --key=tls.key \
  -n heliosdb
```

Update StatefulSet to use TLS:

```yaml
# Add to pod volumes
- name: tls
  secret:
    secretName: heliosdb-streaming-tls

# Add to container volumeMounts
- name: tls
  mountPath: /app/tls
  readOnly: true

# Add environment variables
- name: HELIOSDB_TLS_ENABLED
  value: "true"
- name: HELIOSDB_TLS_CERT
  value: "/app/tls/tls.crt"
- name: HELIOSDB_TLS_KEY
  value: "/app/tls/tls.key"
```

7.2 Network Policies
Create k8s/network-policy.yaml:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: heliosdb-streaming-network-policy
  namespace: heliosdb
spec:
  podSelector:
    matchLabels:
      app: heliosdb-streaming
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
        - podSelector:
            matchLabels:
              app: prometheus
      ports:
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 9090
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 9092  # Kafka
        - protocol: TCP
          port: 6379  # Redis
        - protocol: TCP
          port: 443   # HTTPS (for KMS)
```

7.3 Pod Security Policy
Create k8s/pod-security-policy.yaml:
Note: PodSecurityPolicy was removed in Kubernetes 1.25, so on the 1.28+ clusters this guide targets, enforce equivalent rules with Pod Security Admission instead (e.g. label the namespace with `pod-security.kubernetes.io/enforce: restricted`). The manifest below is kept as a reference for legacy clusters only.

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: heliosdb-streaming-psp
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
  readOnlyRootFilesystem: false
```

8. Capacity Planning
8.1 Sizing Guidelines
Small Deployment (< 10K events/sec):
- Nodes: 2-4
- CPU: 4 cores per node
- RAM: 8 GB per node
- Storage: 100 GB per node
- Cost: $500-1000/month
Medium Deployment (10K-100K events/sec):
- Nodes: 4-8
- CPU: 8 cores per node
- RAM: 16 GB per node
- Storage: 200 GB per node
- Cost: $2000-4000/month
Large Deployment (100K-1M events/sec):
- Nodes: 8-32
- CPU: 16 cores per node
- RAM: 32 GB per node
- Storage: 500 GB per node
- Cost: $8000-20000/month
8.2 Resource Formulas
Memory Calculation:
```
Required Memory = Base Memory + (Buffer Size × Number of Streams) + State Size

- Base Memory: ~2 GB
- Buffer Size: 100 MB per stream (configurable)
- State Size: depends on window size and event rate
```

Storage Calculation:

```
Required Storage = (Checkpoint Size × Retention Count) + Log Size

- Checkpoint Size: ~10-20% of memory
- Retention Count: 3-10 (configurable)
- Log Size: ~1 GB per day (with log rotation)
```

CPU Calculation:

```
Required CPU = Base CPU + (Events/sec × CPU per Event)

- Base CPU: 1 core
- CPU per Event: ~0.1 ms (0.0001 core-seconds)
- Example: 10K events/sec → 1 + (10000 × 0.0001) = 2 cores minimum
```

9. Troubleshooting
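The three formulas translate directly into a back-of-the-envelope sizing helper. A minimal Python sketch using the constants above (2 GB base memory, 100 MB buffer per stream, 0.0001 core-seconds per event); the 15% checkpoint fraction, 5-checkpoint retention, and 7 days of retained logs are illustrative defaults, not measurements:

```python
def required_memory_mb(num_streams: int, state_size_mb: float,
                       base_mb: int = 2048, buffer_mb: int = 100) -> float:
    """Memory = base + buffer per stream + state."""
    return base_mb + buffer_mb * num_streams + state_size_mb

def required_storage_gb(memory_mb: float, retention_count: int = 5,
                        checkpoint_frac: float = 0.15,
                        log_gb_per_day: float = 1.0, log_days: int = 7) -> float:
    """Storage = retained checkpoints + rotated logs (log_days is an assumption)."""
    checkpoint_gb = memory_mb * checkpoint_frac / 1024
    return checkpoint_gb * retention_count + log_gb_per_day * log_days

def required_cpu_cores(events_per_sec: float, base_cores: float = 1.0,
                       core_secs_per_event: float = 0.0001) -> float:
    """CPU = base + per-event cost."""
    return base_cores + events_per_sec * core_secs_per_event

# Example: 10K events/sec, 8 streams, ~4 GB of window state
print(required_cpu_cores(10_000))   # 2.0 cores, matching the worked example above
print(required_memory_mb(8, 4096))  # 6944 MB (2048 + 800 + 4096)
```

Round the results up to the node sizes in section 8.1 and leave headroom for checkpoint spikes.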
9.1 Common Issues
Issue 1: High Latency (p99 > 10ms)
Symptoms:
- Slow event processing
- Increasing backlog
Diagnosis:
```bash
# Check metrics
kubectl exec -it heliosdb-streaming-0 -n heliosdb -- \
  curl localhost:9090/metrics | grep latency

# Check CPU/memory
kubectl top pod heliosdb-streaming-0 -n heliosdb
```

Solutions:
- Scale horizontally (add more pods)
- Increase CPU/memory resources
- Optimize window sizes
- Enable compression for checkpoints
Issue 2: Checkpoint Failures
Symptoms:
- `heliosdb_checkpoint_failures_total` > 0
- Errors in logs
Diagnosis:
```bash
# Check logs
kubectl logs heliosdb-streaming-0 -n heliosdb | grep checkpoint

# Check disk space
kubectl exec -it heliosdb-streaming-0 -n heliosdb -- df -h
```

Solutions:
- Increase checkpoint timeout
- Check disk space (increase PVC size)
- Verify KMS access (check AWS/Azure credentials)
- Check state backend configuration
Issue 3: Memory Leaks
Symptoms:
- Memory usage continuously increasing
- OOMKilled pods
Diagnosis:
```bash
# Check memory metrics
kubectl exec -it heliosdb-streaming-0 -n heliosdb -- \
  curl localhost:9090/metrics | grep memory

# Check pod restarts
kubectl get pods -n heliosdb
```

Solutions:
- Reduce buffer sizes
- Enable aggressive garbage collection
- Check for event accumulation
- Increase memory limits
Issue 4: Pod Not Starting
Symptoms:
- Pod stuck in `Pending` or `CrashLoopBackOff`
Diagnosis:
```bash
# Check pod status
kubectl describe pod heliosdb-streaming-0 -n heliosdb

# Check events
kubectl get events -n heliosdb --sort-by='.lastTimestamp'
```

Solutions:
- Check resource requests (CPU/memory)
- Verify PVC is bound
- Check secrets/configmaps exist
- Verify image pull secrets (if using private registry)
9.2 Debug Commands
```bash
# Get pod shell
kubectl exec -it heliosdb-streaming-0 -n heliosdb -- /bin/bash

# Check running processes
kubectl exec heliosdb-streaming-0 -n heliosdb -- ps aux

# Check network connectivity
kubectl exec heliosdb-streaming-0 -n heliosdb -- curl kafka:9092

# Check file permissions
kubectl exec heliosdb-streaming-0 -n heliosdb -- ls -la /data

# Run cargo tests (only if the image ships the Rust toolchain;
# the slim runtime image in section 3.1 does not)
kubectl exec heliosdb-streaming-0 -n heliosdb -- cargo test

# Check Rust binary
kubectl exec heliosdb-streaming-0 -n heliosdb -- /app/heliosdb-streaming --version
```

📞 Support
- Documentation: https://docs.heliosdb.com
- GitHub Issues: https://github.com/danimoya/HeliosDB/issues
- Email: support@heliosdb.com
- Slack: heliosdb.slack.com
Document Version: 1.0 Last Updated: October 29, 2025 Status: Production Deployment Guide Next Review: January 2026