HeliosDB Streaming - Production Deployment Guide
Table of Contents
- Prerequisites
- Single-Node Deployment
- Kubernetes Production Deployment
- Docker Compose Setup
- Monitoring & Observability
- Security Configuration
- Performance Tuning
- Operational Runbook
- Troubleshooting
- Capacity Planning
1. Prerequisites
System Requirements
Minimum (Development/Testing)
- CPU: 2 cores
- RAM: 4 GB
- Storage: 20 GB SSD
- OS: Linux (Ubuntu 20.04+, RHEL 8+, Amazon Linux 2)
Recommended (Production)
- CPU: 8 cores (16 vCPU)
- RAM: 16 GB (32 GB for high throughput)
- Storage: 100 GB NVMe SSD
- OS: Linux (Ubuntu 22.04 LTS, RHEL 9, Amazon Linux 2023)
- Network: 10 Gbps
Software Dependencies
```bash
# Rust toolchain (for building from source)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default stable

# Docker (for containerized deployment)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Kubernetes CLI (for k8s deployment)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Helm (for k8s package management)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```
External Services
Required:
- Apache Kafka or Apache Pulsar (message broker)
- S3/Azure Blob/GCS (checkpoint storage)
- AWS KMS/Azure Key Vault/GCP KMS (encryption key management)
Optional:
- PostgreSQL (metadata storage)
- Prometheus (metrics collection)
- Grafana (visualization)
- Jaeger (distributed tracing)
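Before installing anything, it can help to confirm that the required external services are reachable from the target host. The following is a minimal sketch (not part of HeliosDB) using plain TCP connects; the endpoint list is illustrative and should be adjusted to your environment:

```python
import socket

def check_endpoint(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical endpoints for a typical deployment; adjust to yours.
    endpoints = {
        "kafka": ("localhost", 9092),
        "prometheus": ("localhost", 9091),
    }
    for name, (host, port) in endpoints.items():
        status = "ok" if check_endpoint(host, port, timeout=0.5) else "UNREACHABLE"
        print(f"{name:12s} {host}:{port} -> {status}")
```

Note that this only verifies TCP reachability, not authentication, TLS, or broker health.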
2. Single-Node Deployment
2.1 Build from Source
```bash
# Clone repository
git clone https://github.com/heliosdb/heliosdb.git
cd heliosdb/heliosdb-streaming

# Build release binary
cargo build --release

# Binary location
ls -lh target/release/heliosdb-streaming

# Run tests
cargo test --all

# Run E2E integration tests
cargo test --test e2e_integration_test
```
2.2 Configuration
Create /etc/heliosdb/streaming.toml:
```toml
[server]
host = "0.0.0.0"
port = 8080
health_check_port = 8081
metrics_port = 9090

[streaming]
checkpoint_interval_secs = 60
watermark_interval_secs = 1
allowed_lateness_secs = 60
max_parallelism = 8

[state]
backend = "file"  # Options: "memory", "file", "s3"
path = "/var/lib/heliosdb/state"
encryption_enabled = true

[kafka]
bootstrap_servers = "localhost:9092"
group_id = "heliosdb-streaming"
auto_offset_reset = "earliest"
enable_ssl = false

[security]
jwt_secret = "your-secret-key-here-change-in-production"
jwt_expiration_hours = 24
rate_limit_enabled = true
rate_limit_requests_per_minute = 100

[kms]
provider = "local"  # Options: "local", "aws", "azure", "gcp"
# For AWS:
# aws_region = "us-east-1"
# For Azure:
# azure_vault_url = "https://your-vault.vault.azure.net"
# For GCP:
# gcp_project_id = "your-project"
# gcp_location = "us-central1"
# gcp_keyring = "heliosdb"

[logging]
level = "info"   # Options: "error", "warn", "info", "debug", "trace"
format = "json"  # Options: "json", "pretty"
```
2.3 SystemD Service
Create /etc/systemd/system/heliosdb-streaming.service:
```ini
[Unit]
Description=HeliosDB Streaming Analytics
After=network.target kafka.service

[Service]
Type=simple
User=heliosdb
Group=heliosdb
WorkingDirectory=/opt/heliosdb
ExecStart=/opt/heliosdb/bin/heliosdb-streaming --config /etc/heliosdb/streaming.toml
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=heliosdb-streaming

# Security
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/heliosdb

# Resource limits
LimitNOFILE=65536
LimitNPROC=4096

[Install]
WantedBy=multi-user.target
```
2.4 Start Service
```bash
# Create user and directories
sudo useradd -r -s /bin/false heliosdb
sudo mkdir -p /opt/heliosdb/bin
sudo mkdir -p /var/lib/heliosdb/state
sudo mkdir -p /var/log/heliosdb
sudo chown -R heliosdb:heliosdb /opt/heliosdb /var/lib/heliosdb /var/log/heliosdb

# Copy binary
sudo cp target/release/heliosdb-streaming /opt/heliosdb/bin/

# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable heliosdb-streaming
sudo systemctl start heliosdb-streaming

# Check status
sudo systemctl status heliosdb-streaming
sudo journalctl -u heliosdb-streaming -f
```
2.5 Verify Deployment
```bash
# Health check
curl http://localhost:8081/health

# Metrics
curl http://localhost:9090/metrics

# Create test job
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-jwt-token>" \
  -d '{
    "name": "test-job",
    "source": "kafka",
    "config": {
      "topic": "test-input",
      "parallelism": 4
    }
  }'
```
3. Kubernetes Production Deployment
3.1 Namespace Setup
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: heliosdb-streaming
  labels:
    name: heliosdb-streaming
    environment: production
```
3.2 ConfigMap
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: heliosdb-config
  namespace: heliosdb-streaming
data:
  streaming.toml: |
    [server]
    host = "0.0.0.0"
    port = 8080
    health_check_port = 8081
    metrics_port = 9090

    [streaming]
    checkpoint_interval_secs = 60
    watermark_interval_secs = 1
    allowed_lateness_secs = 60
    max_parallelism = 8

    [state]
    backend = "s3"
    path = "s3://heliosdb-production/checkpoints"
    encryption_enabled = true

    [kafka]
    bootstrap_servers = "kafka-bootstrap.kafka:9092"
    group_id = "heliosdb-streaming-prod"
    auto_offset_reset = "earliest"
    enable_ssl = true

    [security]
    rate_limit_enabled = true
    rate_limit_requests_per_minute = 500

    [kms]
    provider = "aws"
    aws_region = "us-east-1"

    [logging]
    level = "info"
    format = "json"
```
3.3 Secrets
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: heliosdb-secrets
  namespace: heliosdb-streaming
type: Opaque
stringData:
  JWT_SECRET: "your-production-secret-change-this"
  AWS_ACCESS_KEY_ID: "AKIAIOSFODNN7EXAMPLE"
  AWS_SECRET_ACCESS_KEY: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  KAFKA_PASSWORD: "your-kafka-password"
```
Generate secrets securely:
```bash
# Generate JWT secret
openssl rand -base64 32

# Create secret from file
kubectl create secret generic heliosdb-secrets \
  --from-literal=JWT_SECRET=$(openssl rand -base64 32) \
  --from-file=aws-credentials=/path/to/credentials \
  -n heliosdb-streaming
```
3.4 StatefulSet
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: heliosdb-streaming
  namespace: heliosdb-streaming
spec:
  serviceName: heliosdb-streaming
  replicas: 3
  selector:
    matchLabels:
      app: heliosdb-streaming
  template:
    metadata:
      labels:
        app: heliosdb-streaming
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - heliosdb-streaming
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: heliosdb-streaming
          image: heliosdb/heliosdb-streaming:v4.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
            - name: health
              containerPort: 8081
              protocol: TCP
            - name: metrics
              containerPort: 9090
              protocol: TCP
          env:
            - name: RUST_LOG
              value: "info"
            - name: RUST_BACKTRACE
              value: "1"
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: heliosdb-secrets
                  key: JWT_SECRET
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: heliosdb-secrets
                  key: AWS_ACCESS_KEY_ID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: heliosdb-secrets
                  key: AWS_SECRET_ACCESS_KEY
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
          livenessProbe:
            httpGet:
              path: /health
              port: health
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: health
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          volumeMounts:
            - name: config
              mountPath: /etc/heliosdb
            - name: data
              mountPath: /var/lib/heliosdb
      volumes:
        - name: config
          configMap:
            name: heliosdb-config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "gp3"  # AWS EBS gp3
        resources:
          requests:
            storage: 50Gi
```
3.5 Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: heliosdb-streaming
  namespace: heliosdb-streaming
  labels:
    app: heliosdb-streaming
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 8080
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    app: heliosdb-streaming
---
apiVersion: v1
kind: Service
metadata:
  name: heliosdb-streaming-headless
  namespace: heliosdb-streaming
spec:
  clusterIP: None
  ports:
    - name: http
      port: 8080
      targetPort: 8080
  selector:
    app: heliosdb-streaming
```
3.6 HorizontalPodAutoscaler
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: heliosdb-streaming-hpa
  namespace: heliosdb-streaming
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: heliosdb-streaming
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: events_per_second
        target:
          type: AverageValue
          averageValue: "50000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 2
          periodSeconds: 15
      selectPolicy: Max
```
3.7 PodDisruptionBudget
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: heliosdb-streaming-pdb
  namespace: heliosdb-streaming
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: heliosdb-streaming
```
3.8 Ingress (with TLS)
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: heliosdb-streaming-ingress
  namespace: heliosdb-streaming
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - streaming.heliosdb.example.com
      secretName: heliosdb-tls
  rules:
    - host: streaming.heliosdb.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: heliosdb-streaming
                port:
                  number: 8080
          - path: /metrics
            pathType: Prefix
            backend:
              service:
                name: heliosdb-streaming
                port:
                  number: 9090
```
3.9 Deploy to Kubernetes
```bash
# Apply all manifests
kubectl apply -f namespace.yaml
kubectl apply -f configmap.yaml
kubectl apply -f secrets.yaml
kubectl apply -f statefulset.yaml
kubectl apply -f service.yaml
kubectl apply -f hpa.yaml
kubectl apply -f pdb.yaml
kubectl apply -f ingress.yaml

# Verify deployment
kubectl get pods -n heliosdb-streaming
kubectl get svc -n heliosdb-streaming
kubectl logs -f statefulset/heliosdb-streaming -n heliosdb-streaming

# Check events
kubectl get events -n heliosdb-streaming --sort-by='.lastTimestamp'
```
4. Docker Compose Setup
4.1 Complete Stack
```yaml
version: '3.9'

services:
  # Zookeeper for Kafka
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    volumes:
      - zookeeper-data:/var/lib/zookeeper/data
      - zookeeper-logs:/var/lib/zookeeper/log

  # Kafka
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    hostname: kafka
    container_name: kafka
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
      - "9101:9101"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_JMX_PORT: 9101
      KAFKA_JMX_HOSTNAME: localhost
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'true'
    volumes:
      - kafka-data:/var/lib/kafka/data

  # HeliosDB Streaming
  heliosdb-streaming:
    image: heliosdb/heliosdb-streaming:v4.0.0
    container_name: heliosdb-streaming
    depends_on:
      - kafka
      - prometheus
    ports:
      - "8080:8080"
      - "8081:8081"
      - "9090:9090"
    environment:
      RUST_LOG: info
      JWT_SECRET: ${JWT_SECRET:-change-this-in-production}
    volumes:
      - ./config/streaming.toml:/etc/heliosdb/streaming.toml
      - heliosdb-state:/var/lib/heliosdb
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8081/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  # Prometheus
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    ports:
      - "9091:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--storage.tsdb.retention.time=15d'
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus

  # Grafana
  grafana:
    image: grafana/grafana:10.1.0
    container_name: grafana
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_INSTALL_PLUGINS: grafana-piechart-panel
    volumes:
      - ./config/grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./config/grafana/datasources:/etc/grafana/provisioning/datasources
      - grafana-data:/var/lib/grafana

volumes:
  zookeeper-data:
  zookeeper-logs:
  kafka-data:
  heliosdb-state:
  prometheus-data:
  grafana-data:

networks:
  default:
    name: heliosdb-network
```
4.2 Prometheus Configuration
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'heliosdb-streaming'
    static_configs:
      - targets: ['heliosdb-streaming:9090']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: heliosdb-streaming

  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka:9101']

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

rule_files:
  - "/etc/prometheus/alerts.yml"
```
4.3 Grafana Datasource
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
```
4.4 Start Stack
```bash
# Start all services
docker-compose up -d

# View logs
docker-compose logs -f heliosdb-streaming

# Check health
curl http://localhost:8081/health

# Access Grafana
open http://localhost:3000  # admin/admin

# Stop stack
docker-compose down

# Clean up volumes
docker-compose down -v
```
5. Monitoring & Observability
5.1 Key Metrics
Throughput Metrics:
```promql
# Events processed per second
rate(events_processed_total[1m])

# Events ingested per second
rate(events_ingested_total[1m])

# Backpressure ratio
backpressure_ratio
```
Latency Metrics:
```promql
# P50 latency
histogram_quantile(0.50, rate(event_processing_duration_bucket[5m]))

# P95 latency
histogram_quantile(0.95, rate(event_processing_duration_bucket[5m]))

# P99 latency
histogram_quantile(0.99, rate(event_processing_duration_bucket[5m]))
```
Resource Metrics:
```promql
# CPU usage
process_cpu_seconds_total

# Memory usage
process_resident_memory_bytes

# Open file descriptors
process_open_fds
```
Job Metrics:
```promql
# Active jobs
sum(job_status{status="running"})

# Failed jobs
sum(job_status{status="failed"})

# Checkpoint duration
histogram_quantile(0.95, rate(checkpoint_duration_bucket[5m]))
```
5.2 Grafana Dashboard
```json
{
  "dashboard": {
    "title": "HeliosDB Streaming Overview",
    "panels": [
      {
        "title": "Events Per Second",
        "targets": [
          {
            "expr": "rate(events_processed_total[1m])",
            "legendFormat": "{{instance}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Latency (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(event_processing_duration_bucket[5m]))",
            "legendFormat": "P95"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Jobs",
        "targets": [
          {
            "expr": "sum(job_status{status=\"running\"})",
            "legendFormat": "Running"
          }
        ],
        "type": "stat"
      }
    ]
  }
}
```
5.3 Alert Rules
```yaml
groups:
  - name: heliosdb_streaming
    interval: 30s
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(event_processing_duration_bucket[5m])) > 0.100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High processing latency detected"
          description: "P95 latency is {{ $value }}s (threshold: 100ms)"

      - alert: LowThroughput
        expr: rate(events_processed_total[5m]) < 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low throughput detected"
          description: "Processing {{ $value }} events/sec (expected: >10K)"

      - alert: HighBackpressure
        expr: backpressure_ratio > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High backpressure detected"
          description: "Backpressure ratio is {{ $value }} (threshold: 0.8)"

      - alert: JobFailed
        expr: increase(job_status{status="failed"}[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Job failure detected"
          description: "Job {{ $labels.job_name }} has failed"

      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / 1024 / 1024 / 1024 > 14
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Using {{ $value }}GB of RAM (limit: 16GB)"

      - alert: CheckpointFailed
        expr: increase(checkpoint_failures_total[10m]) > 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Checkpoint failures detected"
          description: "{{ $value }} checkpoint failures in 10 minutes"
```
6. Security Configuration
6.1 TLS/SSL Setup
Generate self-signed certificate (development):
```bash
openssl req -x509 -newkey rsa:4096 -nodes \
  -keyout key.pem -out cert.pem -days 365 \
  -subj "/CN=streaming.heliosdb.local"
```
Production certificate (Let's Encrypt):
```bash
# Install certbot
sudo apt install certbot

# Get certificate
sudo certbot certonly --standalone -d streaming.heliosdb.example.com

# Certificates will be in:
# /etc/letsencrypt/live/streaming.heliosdb.example.com/
```
6.2 JWT Configuration
```bash
# Generate secure JWT secret
openssl rand -base64 32

# Add to environment
export JWT_SECRET="<generated-secret>"

# Or add to config file
echo "jwt_secret = \"$(openssl rand -base64 32)\"" >> /etc/heliosdb/streaming.toml
```
6.3 Create Admin User
```bash
# Use API to create initial admin
curl -X POST http://localhost:8080/api/v1/auth/register \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "password": "secure-password-here",
    "roles": ["Admin"]
  }'

# Login to get JWT token
curl -X POST http://localhost:8080/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "password": "secure-password-here"
  }'
```
6.4 RBAC Setup
```yaml
users:
  - username: admin
    password_hash: "$2b$12$..."  # bcrypt hash
    roles:
      - Admin
    enabled: true

  - username: operator
    password_hash: "$2b$12$..."
    roles:
      - Operator
    enabled: true

  - username: viewer
    password_hash: "$2b$12$..."
    roles:
      - Viewer
    enabled: true

roles:
  - name: Admin
    permissions:
      - Read
      - Write
      - Execute
      - Delete
      - Admin
      - Cancel
      - Manage

  - name: Operator
    permissions:
      - Read
      - Execute
      - Cancel

  - name: Viewer
    permissions:
      - Read
```
6.5 KMS Configuration
AWS KMS:
```toml
[kms]
provider = "aws"
aws_region = "us-east-1"
aws_key_id = "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
```
Azure Key Vault:
```toml
[kms]
provider = "azure"
azure_vault_url = "https://heliosdb-vault.vault.azure.net"
azure_key_name = "heliosdb-encryption-key"
```
GCP KMS:
```toml
[kms]
provider = "gcp"
gcp_project_id = "heliosdb-production"
gcp_location = "us-central1"
gcp_keyring = "heliosdb"
gcp_key_name = "streaming-encryption-key"
```
7. Performance Tuning
7.1 OS Tuning (Linux)
```bash
# Increase file descriptor limits
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nofile 65536" >> /etc/security/limits.conf

# Increase TCP buffer sizes
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Increase connection tracking
sysctl -w net.netfilter.nf_conntrack_max=1048576

# Minimize swapping for performance (swappiness=1 does not fully disable swap)
sysctl -w vm.swappiness=1

# Make changes persistent
echo "vm.swappiness=1" >> /etc/sysctl.conf
```
7.2 JVM Tuning (for Kafka)
```bash
# Kafka environment
export KAFKA_HEAP_OPTS="-Xmx4G -Xms4G"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
```
7.3 HeliosDB Configuration
```toml
[streaming]
# Adjust based on CPU cores
max_parallelism = 16  # 2x CPU cores

# Faster checkpoints for high throughput
checkpoint_interval_secs = 30

# Lower latency with more frequent watermarks
watermark_interval_secs = 0.5

[state]
# Use memory for lowest latency (if enough RAM)
backend = "memory"

# Or S3 with local caching (comment out the line above if you use this):
# backend = "s3"
# cache_size_mb = 1024

[kafka]
# Increase batch size for higher throughput
fetch_min_bytes = 1048576  # 1 MB
fetch_max_wait_ms = 500

# Parallel consumers
num_consumer_threads = 8
```
7.4 Kafka Topic Configuration
```bash
# Create topic with optimal settings
kafka-topics --create \
  --bootstrap-server localhost:9092 \
  --topic heliosdb-input \
  --partitions 16 \
  --replication-factor 3 \
  --config compression.type=lz4 \
  --config min.insync.replicas=2 \
  --config retention.ms=86400000  # 1 day
```
8. Operational Runbook
8.1 Startup Procedure
```bash
# 1. Verify prerequisites
systemctl status kafka
systemctl status zookeeper

# 2. Check disk space
df -h /var/lib/heliosdb

# 3. Verify configuration
cat /etc/heliosdb/streaming.toml

# 4. Start service
systemctl start heliosdb-streaming

# 5. Monitor startup
journalctl -u heliosdb-streaming -f

# 6. Verify health
curl http://localhost:8081/health

# 7. Check metrics
curl http://localhost:9090/metrics | grep events_processed
```
8.2 Shutdown Procedure
```bash
# 1. Stop accepting new jobs
curl -X POST http://localhost:8080/api/v1/admin/pause

# 2. Wait for current jobs to complete
watch -n 5 'curl -s http://localhost:8080/api/v1/jobs | jq ".active_jobs"'

# 3. Create savepoint for safe recovery
curl -X POST http://localhost:8080/api/v1/admin/savepoint

# 4. Stop service
systemctl stop heliosdb-streaming

# 5. Verify shutdown
systemctl status heliosdb-streaming
```
8.3 Rolling Update (Kubernetes)
```bash
# 1. Update image
kubectl set image statefulset/heliosdb-streaming \
  heliosdb-streaming=heliosdb/heliosdb-streaming:v4.1.0 \
  -n heliosdb-streaming

# 2. Monitor rollout
kubectl rollout status statefulset/heliosdb-streaming -n heliosdb-streaming

# 3. Verify new version
kubectl get pods -n heliosdb-streaming -o jsonpath='{.items[*].spec.containers[*].image}'

# 4. Rollback if needed
kubectl rollout undo statefulset/heliosdb-streaming -n heliosdb-streaming
```
8.4 Backup & Restore
Backup:
```bash
# 1. Create savepoint
SAVEPOINT_ID=$(curl -X POST http://localhost:8080/api/v1/admin/savepoint | jq -r '.id')

# 2. Copy to backup location
aws s3 cp \
  s3://heliosdb-production/checkpoints/savepoint-${SAVEPOINT_ID} \
  s3://heliosdb-backups/$(date +%Y%m%d)/savepoint-${SAVEPOINT_ID} \
  --recursive

# 3. Backup configuration
tar czf /backup/heliosdb-config-$(date +%Y%m%d).tar.gz /etc/heliosdb
```
Restore:
```bash
# 1. Stop service
systemctl stop heliosdb-streaming

# 2. Restore savepoint
aws s3 cp \
  s3://heliosdb-backups/20231225/savepoint-12345 \
  s3://heliosdb-production/checkpoints/savepoint-12345 \
  --recursive

# 3. Restore configuration
tar xzf /backup/heliosdb-config-20231225.tar.gz -C /

# 4. Start with specific savepoint
heliosdb-streaming --config /etc/heliosdb/streaming.toml \
  --restore-from-savepoint savepoint-12345
```
8.5 Scaling Operations
Scale Up:
```bash
# Kubernetes
kubectl scale statefulset heliosdb-streaming --replicas=5 -n heliosdb-streaming

# Verify
kubectl get pods -n heliosdb-streaming -w
```
Scale Down:
```bash
# 1. Identify pods to remove
kubectl get pods -n heliosdb-streaming

# 2. If node maintenance is the goal, cordon and drain the node hosting the pod
#    (kubectl drain operates on nodes, not individual pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# 3. Scale down (the StatefulSet removes the highest-ordinal pods first)
kubectl scale statefulset heliosdb-streaming --replicas=3 -n heliosdb-streaming
```
9. Troubleshooting
9.1 Common Issues
Issue: High Latency
Symptoms:
- P99 latency > 100ms
- Slow dashboard updates
Diagnosis:
```bash
# Check backpressure
curl http://localhost:9090/metrics | grep backpressure

# Check resource usage
top -p $(pgrep heliosdb-streaming)

# Check Kafka lag
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group heliosdb-streaming-prod --describe
```
Solutions:
- Increase parallelism
- Add more nodes
- Optimize window size
- Check network latency
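As a rough rule for the first two remedies: only add parallelism or nodes while consumer lag is still growing; scaling after lag has already peaked wastes capacity. A sketch of that heuristic (illustrative only, with made-up thresholds; this is not HeliosDB's built-in autoscaling):

```python
def lag_is_growing(lag_samples: list[int], min_increase: int = 1000) -> bool:
    """True if consumer lag rose by at least min_increase over the sample window."""
    if len(lag_samples) < 2:
        return False
    return lag_samples[-1] - lag_samples[0] >= min_increase

def suggest_parallelism(current: int, lag_samples: list[int],
                        max_parallelism: int = 16) -> int:
    """Double parallelism while lag grows, capped at max_parallelism."""
    if lag_is_growing(lag_samples):
        return min(current * 2, max_parallelism)
    return current
```

Feed it the LAG values reported by the `kafka-consumer-groups --describe` command above, sampled a few minutes apart.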
Issue: Out of Memory
Symptoms:
- OOMKilled in Kubernetes
- Process crashes with “out of memory”
Diagnosis:
```bash
# Check memory usage
curl http://localhost:9090/metrics | grep memory

# Check for memory leaks
valgrind --leak-check=full ./heliosdb-streaming
```
Solutions:
- Increase memory limits
- Reduce window sizes
- Enable state compression
- Use file-based state backend
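For the "reduce window sizes" remedy, the memory rule of thumb from the capacity planning section (memory roughly 2x state size plus 4 GB overhead) can be inverted to find the largest window a given memory budget supports. A sketch under that assumption:

```python
def max_window_secs(memory_gb: float, events_per_sec: int,
                    avg_event_size_bytes: int) -> int:
    """Largest event-time window (in seconds) whose state fits the memory budget,
    assuming the capacity-planning rule of thumb: memory = 2x state size + 4 GB."""
    # Invert memory = 2 * state + 4 to get the state budget in bytes.
    state_budget_bytes = max(0.0, (memory_gb - 4) / 2) * 1024**3
    bytes_per_sec = events_per_sec * avg_event_size_bytes
    return int(state_budget_bytes // bytes_per_sec)
```

For example, a 16 GB node at 50K events/sec with 512-byte events supports windows of only a few minutes, which is often the real cause of OOMKilled pods configured with hour-long windows.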
Issue: Checkpoint Failures
Symptoms:
- Checkpoints taking too long
- Checkpoint failures in logs
Diagnosis:
```bash
# Check checkpoint metrics
curl http://localhost:9090/metrics | grep checkpoint

# Check S3/storage performance
aws s3 ls s3://heliosdb-production/checkpoints/ --summarize

# Check disk I/O
iostat -x 5
```
Solutions:
- Increase checkpoint interval
- Use faster storage (SSD/NVMe)
- Enable incremental checkpoints
- Check network connectivity to S3
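Transient S3 or network errors are a common cause of one-off checkpoint failures. HeliosDB's internal retry behavior is not documented here, but external tooling (for example, a savepoint-copy script) can wrap uploads in a standard exponential-backoff retry. A minimal sketch:

```python
import time

def retry_with_backoff(op, max_attempts: int = 5, base_delay: float = 0.5,
                       sleep=time.sleep):
    """Call op(); on failure wait base_delay * 2**attempt, then retry.
    Re-raises the last exception after max_attempts failures."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so the logic can be tested without real delays.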
Issue: Kafka Connection Errors
Symptoms:
- “Failed to connect to Kafka” errors
- No events being processed
Diagnosis:
```bash
# Test Kafka connectivity
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic test --from-beginning

# Check Kafka status
systemctl status kafka

# Verify DNS resolution
nslookup kafka-bootstrap.kafka
```
Solutions:
- Verify Kafka is running
- Check firewall rules
- Verify Kafka advertised listeners
- Check SSL/SASL configuration
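Many "Failed to connect" reports trace back to a malformed `bootstrap_servers` string. A small sketch for validating the string before it reaches the client (illustrative, not part of HeliosDB):

```python
def parse_bootstrap_servers(servers: str) -> list[tuple[str, int]]:
    """Split 'host1:9092,host2:9092' into (host, port) pairs, validating ports."""
    endpoints = []
    for entry in servers.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit() or not 0 < int(port) < 65536:
            raise ValueError(f"invalid bootstrap server entry: {entry!r}")
        endpoints.append((host, int(port)))
    return endpoints
```

Running each returned `(host, port)` pair through a DNS lookup and TCP connect (as in the prerequisites section) narrows the failure down to config, DNS, or firewall.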
9.2 Debug Mode
```bash
# Enable debug logging
export RUST_LOG=heliosdb_streaming=debug

# Or in config:
# [logging]
# level = "debug"

# Enable backtrace
export RUST_BACKTRACE=full

# Run with profiling
cargo build --release --features profiling
./target/release/heliosdb-streaming --profile
```
9.3 Useful Commands
```bash
# Check process status
ps aux | grep heliosdb-streaming

# Check open connections
lsof -i -P -n | grep heliosdb

# Check file descriptors
lsof -p $(pgrep heliosdb-streaming) | wc -l

# Monitor logs in real-time
tail -f /var/log/heliosdb/streaming.log | jq .

# Export metrics for analysis
curl -s http://localhost:9090/metrics > metrics-$(date +%Y%m%d-%H%M%S).txt

# Dump thread stacktraces
kill -USR1 $(pgrep heliosdb-streaming)
```
10. Capacity Planning
10.1 Sizing Guidelines
Small Workload (< 10K events/sec)
- Nodes: 1-2
- CPU: 4 cores per node
- RAM: 8 GB per node
- Storage: 50 GB SSD
- Estimated cost: $200-400/month (AWS)
Medium Workload (10K-100K events/sec)
- Nodes: 3-5
- CPU: 8 cores per node
- RAM: 16 GB per node
- Storage: 100 GB NVMe SSD
- Estimated cost: $1,500-3,000/month (AWS)
Large Workload (100K-1M events/sec)
- Nodes: 10-20
- CPU: 16 cores per node
- RAM: 32 GB per node
- Storage: 200 GB NVMe SSD
- Estimated cost: $8,000-15,000/month (AWS)
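For provisioning scripts, the tiers above can be encoded directly. A small helper (tier boundaries taken from the tables above; the node counts in the comments are the same guideline figures):

```python
def sizing_tier(events_per_sec: int) -> str:
    """Map sustained throughput to the sizing tiers above."""
    if events_per_sec < 10_000:
        return "small"   # 1-2 nodes, 4 cores / 8 GB each
    if events_per_sec < 100_000:
        return "medium"  # 3-5 nodes, 8 cores / 16 GB each
    if events_per_sec <= 1_000_000:
        return "large"   # 10-20 nodes, 16 cores / 32 GB each
    return "custom"      # beyond these tables; plan separately
```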
10.2 Capacity Calculator
```python
def calculate_resources(events_per_sec, avg_event_size_bytes, window_size_secs):
    """Calculate required resources for HeliosDB Streaming."""
    # Throughput calculation
    throughput_mbps = (events_per_sec * avg_event_size_bytes * 8) / 1_000_000

    # CPU cores needed (assuming 30K events/sec per core)
    cpu_cores = max(4, int(events_per_sec / 30_000) * 2)

    # Memory needed (state size + overhead)
    state_size_gb = (events_per_sec * window_size_secs * avg_event_size_bytes) / (1024**3)
    memory_gb = max(8, int(state_size_gb * 2 + 4))  # 2x state + 4GB overhead

    # Storage needed (checkpoints, assuming 10 kept)
    storage_gb = max(50, int(state_size_gb * 10 * 1.5))  # 10 checkpoints + 50% overhead

    # Number of nodes (assuming 8 cores, 16GB per node)
    nodes = max(1, int(cpu_cores / 8))

    return {
        "throughput_mbps": throughput_mbps,
        "cpu_cores": cpu_cores,
        "memory_gb": memory_gb,
        "storage_gb": storage_gb,
        "nodes": nodes,
        "cost_per_month_usd": nodes * 500,  # Rough estimate
    }

# Example usage
resources = calculate_resources(
    events_per_sec=50_000,
    avg_event_size_bytes=512,
    window_size_secs=3600,
)
print(resources)
# Output: {'throughput_mbps': 204.8, 'cpu_cores': 4, 'memory_gb': 175,
#          'storage_gb': 1287, 'nodes': 1, 'cost_per_month_usd': 500}
```
10.3 Growth Planning
Monthly Events → Required Resources

```
1B events/month    →  1 node   (4 cores, 8 GB)   →   $200/month
10B events/month   →  2 nodes  (8 cores, 16 GB)  →   $600/month
100B events/month  →  5 nodes  (16 cores, 32 GB) → $2,500/month
1T events/month    → 15 nodes  (32 cores, 64 GB) → $7,500/month
```
Summary
This production deployment guide covers:
- Single-node deployment with SystemD
- Kubernetes production setup with HA
- Docker Compose for local development
- Complete monitoring with Prometheus + Grafana
- Enterprise security (TLS, JWT, RBAC, KMS)
- Performance tuning guidelines
- Operational runbook (startup, shutdown, backup, restore)
- Troubleshooting common issues
- Capacity planning and cost estimation
For additional support:
- Documentation: https://docs.heliosdb.com
- GitHub: https://github.com/heliosdb/heliosdb
- Community: https://community.heliosdb.com
- Enterprise Support: support@heliosdb.com
Next Steps:
- Choose deployment method (K8s recommended for production)
- Set up monitoring (Prometheus + Grafana)
- Configure security (TLS, JWT, KMS)
- Run load tests to validate capacity
- Set up alerting for critical metrics
- Document your specific deployment for your team