HeliosDB Streaming - Production Deployment Guide


Table of Contents

  1. Prerequisites
  2. Single-Node Deployment
  3. Kubernetes Production Deployment
  4. Docker Compose Setup
  5. Monitoring & Observability
  6. Security Configuration
  7. Performance Tuning
  8. Operational Runbook
  9. Troubleshooting
  10. Capacity Planning

1. Prerequisites

System Requirements

Minimum (Development/Testing)

  • CPU: 2 cores
  • RAM: 4 GB
  • Storage: 20 GB SSD
  • OS: Linux (Ubuntu 20.04+, RHEL 8+, Amazon Linux 2)

Recommended (Production)

  • CPU: 8 cores (16 vCPU)
  • RAM: 16 GB (32 GB for high throughput)
  • Storage: 100 GB NVMe SSD
  • OS: Linux (Ubuntu 22.04 LTS, RHEL 9, Amazon Linux 2023)
  • Network: 10 Gbps

Software Dependencies

Terminal window
# Rust toolchain (for building from source)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default stable
# Docker (for containerized deployment)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Kubernetes CLI (for k8s deployment)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
# Helm (for k8s package management)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

External Services

Required:

  • Apache Kafka or Apache Pulsar (message broker)
  • S3/Azure Blob/GCS (checkpoint storage)
  • AWS KMS/Azure Key Vault/GCP KMS (encryption key management)

Optional:

  • PostgreSQL (metadata storage)
  • Prometheus (metrics collection)
  • Grafana (visualization)
  • Jaeger (distributed tracing)

2. Single-Node Deployment

2.1 Build from Source

Terminal window
# Clone repository
git clone https://github.com/heliosdb/heliosdb.git
cd heliosdb/heliosdb-streaming
# Build release binary
cargo build --release
# Binary location
ls -lh target/release/heliosdb-streaming
# Run tests
cargo test --all
# Run E2E integration tests
cargo test --test e2e_integration_test

2.2 Configuration

Create /etc/heliosdb/streaming.toml:

[server]
host = "0.0.0.0"
port = 8080
health_check_port = 8081
metrics_port = 9090
[streaming]
checkpoint_interval_secs = 60
watermark_interval_secs = 1
allowed_lateness_secs = 60
max_parallelism = 8
[state]
backend = "file" # Options: "memory", "file", "s3"
path = "/var/lib/heliosdb/state"
encryption_enabled = true
[kafka]
bootstrap_servers = "localhost:9092"
group_id = "heliosdb-streaming"
auto_offset_reset = "earliest"
enable_ssl = false
[security]
jwt_secret = "your-secret-key-here-change-in-production"
jwt_expiration_hours = 24
rate_limit_enabled = true
rate_limit_requests_per_minute = 100
[kms]
provider = "local" # Options: "local", "aws", "azure", "gcp"
# For AWS:
# aws_region = "us-east-1"
# For Azure:
# azure_vault_url = "https://your-vault.vault.azure.net"
# For GCP:
# gcp_project_id = "your-project"
# gcp_location = "us-central1"
# gcp_keyring = "heliosdb"
[logging]
level = "info" # Options: "error", "warn", "info", "debug", "trace"
format = "json" # Options: "json", "pretty"

2.3 SystemD Service

Create /etc/systemd/system/heliosdb-streaming.service:

[Unit]
Description=HeliosDB Streaming Analytics
After=network.target kafka.service
[Service]
Type=simple
User=heliosdb
Group=heliosdb
WorkingDirectory=/opt/heliosdb
ExecStart=/opt/heliosdb/bin/heliosdb-streaming --config /etc/heliosdb/streaming.toml
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=heliosdb-streaming
# Security
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/heliosdb
# Resource limits
LimitNOFILE=65536
LimitNPROC=4096
[Install]
WantedBy=multi-user.target

2.4 Start Service

Terminal window
# Create user and directories
sudo useradd -r -s /bin/false heliosdb
sudo mkdir -p /opt/heliosdb/bin
sudo mkdir -p /var/lib/heliosdb/state
sudo mkdir -p /var/log/heliosdb
sudo chown -R heliosdb:heliosdb /opt/heliosdb /var/lib/heliosdb /var/log/heliosdb
# Copy binary
sudo cp target/release/heliosdb-streaming /opt/heliosdb/bin/
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable heliosdb-streaming
sudo systemctl start heliosdb-streaming
# Check status
sudo systemctl status heliosdb-streaming
sudo journalctl -u heliosdb-streaming -f

2.5 Verify Deployment

Terminal window
# Health check
curl http://localhost:8081/health
# Metrics
curl http://localhost:9090/metrics
# Create test job
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-jwt-token>" \
  -d '{
    "name": "test-job",
    "source": "kafka",
    "config": {
      "topic": "test-input",
      "parallelism": 4
    }
  }'

3. Kubernetes Production Deployment

3.1 Namespace Setup

namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: heliosdb-streaming
  labels:
    name: heliosdb-streaming
    environment: production

3.2 ConfigMap

configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: heliosdb-config
  namespace: heliosdb-streaming
data:
  streaming.toml: |
    [server]
    host = "0.0.0.0"
    port = 8080
    health_check_port = 8081
    metrics_port = 9090

    [streaming]
    checkpoint_interval_secs = 60
    watermark_interval_secs = 1
    allowed_lateness_secs = 60
    max_parallelism = 8

    [state]
    backend = "s3"
    path = "s3://heliosdb-production/checkpoints"
    encryption_enabled = true

    [kafka]
    bootstrap_servers = "kafka-bootstrap.kafka:9092"
    group_id = "heliosdb-streaming-prod"
    auto_offset_reset = "earliest"
    enable_ssl = true

    [security]
    rate_limit_enabled = true
    rate_limit_requests_per_minute = 500

    [kms]
    provider = "aws"
    aws_region = "us-east-1"

    [logging]
    level = "info"
    format = "json"

3.3 Secrets

secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: heliosdb-secrets
  namespace: heliosdb-streaming
type: Opaque
stringData:
  JWT_SECRET: "your-production-secret-change-this"
  AWS_ACCESS_KEY_ID: "AKIAIOSFODNN7EXAMPLE"
  AWS_SECRET_ACCESS_KEY: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  KAFKA_PASSWORD: "your-kafka-password"

Generate secrets securely:

Terminal window
# Generate JWT secret
openssl rand -base64 32
# Create secret from file
kubectl create secret generic heliosdb-secrets \
  --from-literal=JWT_SECRET=$(openssl rand -base64 32) \
  --from-file=aws-credentials=/path/to/credentials \
  -n heliosdb-streaming
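
The same 32-byte secret can be produced from Python's stdlib if `openssl` is not available — a small sketch (the function name is mine):

```python
# Stdlib equivalent of `openssl rand -base64 32` for generating a JWT secret.
import base64
import secrets

def generate_jwt_secret(num_bytes: int = 32) -> str:
    """Return `num_bytes` of CSPRNG output, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")

secret = generate_jwt_secret()
print(secret)  # e.g. 'qL0c...' (44 characters for 32 input bytes)
```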

3.4 StatefulSet

statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: heliosdb-streaming
  namespace: heliosdb-streaming
spec:
  serviceName: heliosdb-streaming
  replicas: 3
  selector:
    matchLabels:
      app: heliosdb-streaming
  template:
    metadata:
      labels:
        app: heliosdb-streaming
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - heliosdb-streaming
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: heliosdb-streaming
          image: heliosdb/heliosdb-streaming:v4.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
            - name: health
              containerPort: 8081
              protocol: TCP
            - name: metrics
              containerPort: 9090
              protocol: TCP
          env:
            - name: RUST_LOG
              value: "info"
            - name: RUST_BACKTRACE
              value: "1"
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: heliosdb-secrets
                  key: JWT_SECRET
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: heliosdb-secrets
                  key: AWS_ACCESS_KEY_ID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: heliosdb-secrets
                  key: AWS_SECRET_ACCESS_KEY
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
          livenessProbe:
            httpGet:
              path: /health
              port: health
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: health
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          volumeMounts:
            - name: config
              mountPath: /etc/heliosdb
            - name: data
              mountPath: /var/lib/heliosdb
      volumes:
        - name: config
          configMap:
            name: heliosdb-config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "gp3" # AWS EBS gp3
        resources:
          requests:
            storage: 50Gi

3.5 Service

service.yaml
apiVersion: v1
kind: Service
metadata:
  name: heliosdb-streaming
  namespace: heliosdb-streaming
  labels:
    app: heliosdb-streaming
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 8080
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    app: heliosdb-streaming
---
apiVersion: v1
kind: Service
metadata:
  name: heliosdb-streaming-headless
  namespace: heliosdb-streaming
spec:
  clusterIP: None
  ports:
    - name: http
      port: 8080
      targetPort: 8080
  selector:
    app: heliosdb-streaming

3.6 HorizontalPodAutoscaler

hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: heliosdb-streaming-hpa
  namespace: heliosdb-streaming
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: heliosdb-streaming
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: events_per_second
        target:
          type: AverageValue
          averageValue: "50000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 2
          periodSeconds: 15
      selectPolicy: Max
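
The `scaleUp` policies above combine per 15-second period: `Percent: 100` allows doubling, `Pods: 2` allows adding two pods, and `selectPolicy: Max` takes whichever allows more. A rough model of that arithmetic (the real controller also consults metric readings and stabilization windows):

```python
# Per-period replica ceiling implied by the scaleUp policies above.
import math

def max_scale_up(current: int, max_replicas: int = 10) -> int:
    """Upper bound on replicas after one 15s scale-up period."""
    by_percent = current + math.ceil(current * 100 / 100)  # Percent: 100
    by_pods = current + 2                                  # Pods: 2
    return min(max_replicas, max(by_percent, by_pods))     # selectPolicy: Max

steps = [3]  # starting at minReplicas
while steps[-1] < 10:
    steps.append(max_scale_up(steps[-1]))
print(steps)  # [3, 6, 10] — minReplicas to maxReplicas in two periods
```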

3.7 PodDisruptionBudget

pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: heliosdb-streaming-pdb
  namespace: heliosdb-streaming
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: heliosdb-streaming

3.8 Ingress (with TLS)

ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: heliosdb-streaming-ingress
  namespace: heliosdb-streaming
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - streaming.heliosdb.example.com
      secretName: heliosdb-tls
  rules:
    - host: streaming.heliosdb.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: heliosdb-streaming
                port:
                  number: 8080
          - path: /metrics
            pathType: Prefix
            backend:
              service:
                name: heliosdb-streaming
                port:
                  number: 9090

3.9 Deploy to Kubernetes

Terminal window
# Apply all manifests
kubectl apply -f namespace.yaml
kubectl apply -f configmap.yaml
kubectl apply -f secrets.yaml
kubectl apply -f statefulset.yaml
kubectl apply -f service.yaml
kubectl apply -f hpa.yaml
kubectl apply -f pdb.yaml
kubectl apply -f ingress.yaml
# Verify deployment
kubectl get pods -n heliosdb-streaming
kubectl get svc -n heliosdb-streaming
kubectl logs -f statefulset/heliosdb-streaming -n heliosdb-streaming
# Check events
kubectl get events -n heliosdb-streaming --sort-by='.lastTimestamp'

4. Docker Compose Setup

4.1 Complete Stack

docker-compose.yml
version: '3.9'

services:
  # Zookeeper for Kafka
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    volumes:
      - zookeeper-data:/var/lib/zookeeper/data
      - zookeeper-logs:/var/lib/zookeeper/log

  # Kafka
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    hostname: kafka
    container_name: kafka
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
      - "9101:9101"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_JMX_PORT: 9101
      KAFKA_JMX_HOSTNAME: localhost
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'true'
    volumes:
      - kafka-data:/var/lib/kafka/data

  # HeliosDB Streaming
  heliosdb-streaming:
    image: heliosdb/heliosdb-streaming:v4.0.0
    container_name: heliosdb-streaming
    depends_on:
      - kafka
      - prometheus
    ports:
      - "8080:8080"
      - "8081:8081"
      - "9090:9090"
    environment:
      RUST_LOG: info
      JWT_SECRET: ${JWT_SECRET:-change-this-in-production}
    volumes:
      - ./config/streaming.toml:/etc/heliosdb/streaming.toml
      - heliosdb-state:/var/lib/heliosdb
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8081/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  # Prometheus
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    ports:
      - "9091:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--storage.tsdb.retention.time=15d'
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus

  # Grafana
  grafana:
    image: grafana/grafana:10.1.0
    container_name: grafana
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_INSTALL_PLUGINS: grafana-piechart-panel
    volumes:
      - ./config/grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./config/grafana/datasources:/etc/grafana/provisioning/datasources
      - grafana-data:/var/lib/grafana

volumes:
  zookeeper-data:
  zookeeper-logs:
  kafka-data:
  heliosdb-state:
  prometheus-data:
  grafana-data:

networks:
  default:
    name: heliosdb-network

4.2 Prometheus Configuration

config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'heliosdb-streaming'
    static_configs:
      - targets: ['heliosdb-streaming:9090']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: heliosdb-streaming
  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka:9101']

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

rule_files:
  - "/etc/prometheus/alerts.yml"

4.3 Grafana Datasource

config/grafana/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

4.4 Start Stack

Terminal window
# Start all services
docker-compose up -d
# View logs
docker-compose logs -f heliosdb-streaming
# Check health
curl http://localhost:8081/health
# Access Grafana (admin/admin)
xdg-open http://localhost:3000 # use `open` on macOS
# Stop stack
docker-compose down
# Clean up volumes
docker-compose down -v

5. Monitoring & Observability

5.1 Key Metrics

Throughput Metrics:

# Events processed per second
rate(events_processed_total[1m])
# Events ingested per second
rate(events_ingested_total[1m])
# Backpressure ratio
backpressure_ratio

Latency Metrics:

# P50 latency
histogram_quantile(0.50, rate(event_processing_duration_bucket[5m]))
# P95 latency
histogram_quantile(0.95, rate(event_processing_duration_bucket[5m]))
# P99 latency
histogram_quantile(0.99, rate(event_processing_duration_bucket[5m]))

Resource Metrics:

# CPU usage
process_cpu_seconds_total
# Memory usage
process_resident_memory_bytes
# Open file descriptors
process_open_fds

Job Metrics:

# Active jobs
sum(job_status{status="running"})
# Failed jobs
sum(job_status{status="failed"})
# Checkpoint duration
histogram_quantile(0.95, rate(checkpoint_duration_bucket[5m]))
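
These `histogram_quantile` queries estimate quantiles by linear interpolation within the cumulative `_bucket` series. A simplified Python model of that estimate (ignores edge cases such as NaN handling and empty histograms):

```python
# What histogram_quantile does with cumulative `_bucket` counts:
# find the bucket containing the target rank, then interpolate linearly.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count), ending with +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the +Inf bucket
            # Interpolate within [prev_bound, bound]
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Cumulative counts for buckets le=0.01, 0.05, 0.1, +Inf (seconds)
buckets = [(0.01, 600), (0.05, 900), (0.1, 990), (float("inf"), 1000)]
print(histogram_quantile(0.95, buckets))  # 0.05 + 0.05 * (950-900)/(990-900) ≈ 0.0778
```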

5.2 Grafana Dashboard

{
  "dashboard": {
    "title": "HeliosDB Streaming Overview",
    "panels": [
      {
        "title": "Events Per Second",
        "targets": [
          {
            "expr": "rate(events_processed_total[1m])",
            "legendFormat": "{{instance}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Latency (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(event_processing_duration_bucket[5m]))",
            "legendFormat": "P95"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Jobs",
        "targets": [
          {
            "expr": "sum(job_status{status=\"running\"})",
            "legendFormat": "Running"
          }
        ],
        "type": "stat"
      }
    ]
  }
}

5.3 Alert Rules

config/prometheus/alerts.yml
groups:
  - name: heliosdb_streaming
    interval: 30s
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(event_processing_duration_bucket[5m])) > 0.100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High processing latency detected"
          description: "P95 latency is {{ $value }}s (threshold: 100ms)"
      - alert: LowThroughput
        expr: rate(events_processed_total[5m]) < 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low throughput detected"
          description: "Processing {{ $value }} events/sec (expected: >10K)"
      - alert: HighBackpressure
        expr: backpressure_ratio > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High backpressure detected"
          description: "Backpressure ratio is {{ $value }} (threshold: 0.8)"
      - alert: JobFailed
        expr: increase(job_status{status="failed"}[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Job failure detected"
          description: "Job {{ $labels.job_name }} has failed"
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / 1024 / 1024 / 1024 > 14
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Using {{ $value }}GB of RAM (limit: 16GB)"
      - alert: CheckpointFailed
        expr: increase(checkpoint_failures_total[10m]) > 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Checkpoint failures detected"
          description: "{{ $value }} checkpoint failures in 10 minutes"

6. Security Configuration

6.1 TLS/SSL Setup

Generate self-signed certificate (development):

Terminal window
openssl req -x509 -newkey rsa:4096 -nodes \
  -keyout key.pem -out cert.pem -days 365 \
  -subj "/CN=streaming.heliosdb.local"

Production certificate (Let’s Encrypt):

Terminal window
# Install certbot
sudo apt install certbot
# Get certificate
sudo certbot certonly --standalone -d streaming.heliosdb.example.com
# Certificates will be in:
# /etc/letsencrypt/live/streaming.heliosdb.example.com/

6.2 JWT Configuration

Terminal window
# Generate secure JWT secret
openssl rand -base64 32
# Add to environment
export JWT_SECRET="<generated-secret>"
# Or add to config file
echo "jwt_secret = \"$(openssl rand -base64 32)\"" >> /etc/heliosdb/streaming.toml
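
To see what the server does with this secret, here is a stdlib-only sketch of minting and verifying an HS256 JWT (illustrative of the mechanism only; HeliosDB's internals are not shown here, and real deployments should use a maintained JWT library):

```python
# Mint and verify an HS256 JWT with only the standard library.
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def mint(secret: str, claims: dict) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"

def verify(secret: str, token: str) -> bool:
    header, payload, sig = token.split(".")
    expected = hmac.new(secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)

# 24-hour expiry, matching jwt_expiration_hours = 24 in the config
token = mint("test-secret", {"sub": "admin", "exp": int(time.time()) + 24 * 3600})
assert verify("test-secret", token)
assert not verify("wrong-secret", token)
```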

6.3 Create Admin User

Terminal window
# Use API to create initial admin
curl -X POST http://localhost:8080/api/v1/auth/register \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "password": "secure-password-here",
    "roles": ["Admin"]
  }'
# Login to get JWT token
curl -X POST http://localhost:8080/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "password": "secure-password-here"
  }'

6.4 RBAC Setup

rbac-config.yaml
users:
  - username: admin
    password_hash: "$2b$12$..." # bcrypt hash
    roles:
      - Admin
    enabled: true
  - username: operator
    password_hash: "$2b$12$..."
    roles:
      - Operator
    enabled: true
  - username: viewer
    password_hash: "$2b$12$..."
    roles:
      - Viewer
    enabled: true

roles:
  - name: Admin
    permissions:
      - Read
      - Write
      - Execute
      - Delete
      - Admin
      - Cancel
      - Manage
  - name: Operator
    permissions:
      - Read
      - Execute
      - Cancel
  - name: Viewer
    permissions:
      - Read
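
The intended semantics is a union over roles: a request is allowed if any role held by the user grants the permission. A minimal model of that check (how HeliosDB evaluates it internally is not shown here):

```python
# Union-of-roles permission check matching the RBAC config above.
ROLES = {
    "Admin": {"Read", "Write", "Execute", "Delete", "Admin", "Cancel", "Manage"},
    "Operator": {"Read", "Execute", "Cancel"},
    "Viewer": {"Read"},
}

def is_allowed(user_roles: list[str], permission: str) -> bool:
    """Allowed if any of the user's roles grants the permission."""
    return any(permission in ROLES.get(role, set()) for role in user_roles)

assert is_allowed(["Operator"], "Cancel")
assert not is_allowed(["Viewer"], "Execute")
assert is_allowed(["Viewer", "Operator"], "Execute")  # roles are additive
```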

6.5 KMS Configuration

AWS KMS:

[kms]
provider = "aws"
aws_region = "us-east-1"
aws_key_id = "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"

Azure Key Vault:

[kms]
provider = "azure"
azure_vault_url = "https://heliosdb-vault.vault.azure.net"
azure_key_name = "heliosdb-encryption-key"

GCP KMS:

[kms]
provider = "gcp"
gcp_project_id = "heliosdb-production"
gcp_location = "us-central1"
gcp_keyring = "heliosdb"
gcp_key_name = "streaming-encryption-key"

7. Performance Tuning

7.1 OS Tuning (Linux)

Terminal window
# Increase file descriptor limits
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nofile 65536" >> /etc/security/limits.conf
# Increase TCP buffer sizes
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Increase connection tracking
sysctl -w net.netfilter.nf_conntrack_max=1048576
# Disable swapping for performance
sysctl -w vm.swappiness=1
# Make changes persistent
echo "vm.swappiness=1" >> /etc/sysctl.conf

7.2 JVM Tuning (for Kafka)

Terminal window
# Kafka environment
export KAFKA_HEAP_OPTS="-Xmx4G -Xms4G"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"

7.3 HeliosDB Configuration

[streaming]
# Adjust based on CPU cores
max_parallelism = 16 # 2x CPU cores
# Faster checkpoints for high throughput
checkpoint_interval_secs = 30
# Lower latency with more frequent watermarks
watermark_interval_secs = 0.5
[state]
# Use memory for lowest latency (if enough RAM)
backend = "memory"
# Or S3 with local caching (TOML forbids duplicate keys, so pick one):
# backend = "s3"
# cache_size_mb = 1024
[kafka]
# Increase batch size for higher throughput
fetch_min_bytes = 1048576 # 1 MB
fetch_max_wait_ms = 500
# Parallel consumers
num_consumer_threads = 8

7.4 Kafka Topic Configuration

Terminal window
# Create topic with optimal settings
kafka-topics --create \
  --bootstrap-server localhost:9092 \
  --topic heliosdb-input \
  --partitions 16 \
  --replication-factor 3 \
  --config compression.type=lz4 \
  --config min.insync.replicas=2 \
  --config retention.ms=86400000 # 1 day
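
A partition count like `--partitions 16` is typically derived from target throughput, then rounded up to a multiple of the consumer parallelism so partitions divide evenly across consumers. A sketch of that rule of thumb (the 5K events/sec-per-partition figure is an assumption, not a HeliosDB benchmark — measure your own workload):

```python
# Partition sizing rule of thumb: throughput-driven, rounded to consumer count.
import math

def partitions_needed(target_eps: int, per_partition_eps: int = 5_000,
                      consumer_parallelism: int = 8) -> int:
    p = math.ceil(target_eps / per_partition_eps)
    # Round up so partitions divide evenly across consumers
    return math.ceil(p / consumer_parallelism) * consumer_parallelism

print(partitions_needed(75_000))  # 16
```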

8. Operational Runbook

8.1 Startup Procedure

Terminal window
# 1. Verify prerequisites
systemctl status kafka
systemctl status zookeeper
# 2. Check disk space
df -h /var/lib/heliosdb
# 3. Verify configuration
cat /etc/heliosdb/streaming.toml
# 4. Start service
systemctl start heliosdb-streaming
# 5. Monitor startup
journalctl -u heliosdb-streaming -f
# 6. Verify health
curl http://localhost:8081/health
# 7. Check metrics
curl http://localhost:9090/metrics | grep events_processed

8.2 Shutdown Procedure

Terminal window
# 1. Stop accepting new jobs
curl -X POST http://localhost:8080/api/v1/admin/pause
# 2. Wait for current jobs to complete
watch -n 5 'curl -s http://localhost:8080/api/v1/jobs | jq ".active_jobs"'
# 3. Create savepoint for safe recovery
curl -X POST http://localhost:8080/api/v1/admin/savepoint
# 4. Stop service
systemctl stop heliosdb-streaming
# 5. Verify shutdown
systemctl status heliosdb-streaming

8.3 Rolling Update (Kubernetes)

Terminal window
# 1. Update image
kubectl set image statefulset/heliosdb-streaming \
  heliosdb-streaming=heliosdb/heliosdb-streaming:v4.1.0 \
  -n heliosdb-streaming
# 2. Monitor rollout
kubectl rollout status statefulset/heliosdb-streaming -n heliosdb-streaming
# 3. Verify new version
kubectl get pods -n heliosdb-streaming -o jsonpath='{.items[*].spec.containers[*].image}'
# 4. Rollback if needed
kubectl rollout undo statefulset/heliosdb-streaming -n heliosdb-streaming

8.4 Backup & Restore

Backup:

Terminal window
# 1. Create savepoint
SAVEPOINT_ID=$(curl -X POST http://localhost:8080/api/v1/admin/savepoint | jq -r '.id')
# 2. Copy to backup location
aws s3 cp \
  s3://heliosdb-production/checkpoints/savepoint-${SAVEPOINT_ID} \
  s3://heliosdb-backups/$(date +%Y%m%d)/savepoint-${SAVEPOINT_ID} \
  --recursive
# 3. Backup configuration
tar czf /backup/heliosdb-config-$(date +%Y%m%d).tar.gz /etc/heliosdb

Restore:

Terminal window
# 1. Stop service
systemctl stop heliosdb-streaming
# 2. Restore savepoint
aws s3 cp \
  s3://heliosdb-backups/20231225/savepoint-12345 \
  s3://heliosdb-production/checkpoints/savepoint-12345 \
  --recursive
# 3. Restore configuration
tar xzf /backup/heliosdb-config-20231225.tar.gz -C /
# 4. Start with specific savepoint
heliosdb-streaming --config /etc/heliosdb/streaming.toml \
  --restore-from-savepoint savepoint-12345

8.5 Scaling Operations

Scale Up:

Terminal window
# Kubernetes
kubectl scale statefulset heliosdb-streaming --replicas=5 -n heliosdb-streaming
# Verify
kubectl get pods -n heliosdb-streaming -w

Scale Down:

Terminal window
# 1. Identify the pods that will be removed (StatefulSets scale down from
#    the highest ordinal, e.g. heliosdb-streaming-4 before -3)
kubectl get pods -n heliosdb-streaming
# 2. Pause job intake and create a savepoint first (see section 8.2);
#    note that `kubectl drain` operates on nodes, not individual pods
# 3. Scale down
kubectl scale statefulset heliosdb-streaming --replicas=3 -n heliosdb-streaming

9. Troubleshooting

9.1 Common Issues

Issue: High Latency

Symptoms:

  • P99 latency > 100ms
  • Slow dashboard updates

Diagnosis:

Terminal window
# Check backpressure
curl http://localhost:9090/metrics | grep backpressure
# Check resource usage
top -p $(pgrep heliosdb-streaming)
# Check Kafka lag
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group heliosdb-streaming-prod --describe

Solutions:

  • Increase parallelism
  • Add more nodes
  • Optimize window size
  • Check network latency

Issue: Out of Memory

Symptoms:

  • OOMKilled in Kubernetes
  • Process crashes with “out of memory”

Diagnosis:

Terminal window
# Check memory usage
curl http://localhost:9090/metrics | grep memory
# Check for memory leaks
valgrind --leak-check=full ./heliosdb-streaming

Solutions:

  • Increase memory limits
  • Reduce window sizes
  • Enable state compression
  • Use file-based state backend

Issue: Checkpoint Failures

Symptoms:

  • Checkpoints taking too long
  • Checkpoint failures in logs

Diagnosis:

Terminal window
# Check checkpoint metrics
curl http://localhost:9090/metrics | grep checkpoint
# Check S3/storage performance
aws s3 ls s3://heliosdb-production/checkpoints/ --summarize
# Check disk I/O
iostat -x 5

Solutions:

  • Increase checkpoint interval
  • Use faster storage (SSD/NVMe)
  • Enable incremental checkpoints
  • Check network connectivity to S3

Issue: Kafka Connection Errors

Symptoms:

  • “Failed to connect to Kafka” errors
  • No events being processed

Diagnosis:

Terminal window
# Test Kafka connectivity
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic test --from-beginning
# Check Kafka status
systemctl status kafka
# Verify DNS resolution
nslookup kafka-bootstrap.kafka

Solutions:

  • Verify Kafka is running
  • Check firewall rules
  • Verify Kafka advertised listeners
  • Check SSL/SASL configuration

9.2 Debug Mode

Terminal window
# Enable debug logging
export RUST_LOG=heliosdb_streaming=debug
# Or set it in /etc/heliosdb/streaming.toml:
#   [logging]
#   level = "debug"
# Enable full backtraces
export RUST_BACKTRACE=full
# Run with profiling
cargo build --release --features profiling
./target/release/heliosdb-streaming --profile

9.3 Useful Commands

Terminal window
# Check process status
ps aux | grep heliosdb-streaming
# Check open connections
lsof -i -P -n | grep heliosdb
# Check file descriptors
lsof -p $(pgrep heliosdb-streaming) | wc -l
# Monitor logs in real-time
tail -f /var/log/heliosdb/streaming.log | jq .
# Export metrics for analysis
curl -s http://localhost:9090/metrics > metrics-$(date +%Y%m%d-%H%M%S).txt
# Dump thread stacktraces
kill -USR1 $(pgrep heliosdb-streaming)

10. Capacity Planning

10.1 Sizing Guidelines

Small Workload (< 10K events/sec)

  • Nodes: 1-2
  • CPU: 4 cores per node
  • RAM: 8 GB per node
  • Storage: 50 GB SSD
  • Estimated cost: $200-400/month (AWS)

Medium Workload (10K-100K events/sec)

  • Nodes: 3-5
  • CPU: 8 cores per node
  • RAM: 16 GB per node
  • Storage: 100 GB NVMe SSD
  • Estimated cost: $1,500-3,000/month (AWS)

Large Workload (100K-1M events/sec)

  • Nodes: 10-20
  • CPU: 16 cores per node
  • RAM: 32 GB per node
  • Storage: 200 GB NVMe SSD
  • Estimated cost: $8,000-15,000/month (AWS)

10.2 Capacity Calculator

capacity_calculator.py
import math

def calculate_resources(events_per_sec, avg_event_size_bytes, window_size_secs):
    """
    Calculate required resources for HeliosDB Streaming
    """
    # Throughput calculation (megabits per second)
    throughput_mbps = (events_per_sec * avg_event_size_bytes * 8) / 1_000_000
    # CPU cores needed (assuming 30K events/sec per core, with 2x headroom)
    cpu_cores = max(4, math.ceil(events_per_sec / 30_000) * 2)
    # Memory needed (state size + overhead)
    state_size_gb = (events_per_sec * window_size_secs * avg_event_size_bytes) / (1024**3)
    memory_gb = max(8, int(state_size_gb * 2 + 4))  # 2x state + 4GB overhead
    # Storage needed (10 retained checkpoints + 50% overhead)
    storage_gb = max(50, int(state_size_gb * 10 * 1.5))
    # Number of nodes (assuming 8 cores, 16GB per node; large windowed
    # state makes memory, not CPU, the binding constraint)
    nodes = max(1, math.ceil(cpu_cores / 8), math.ceil(memory_gb / 16))
    return {
        "throughput_mbps": throughput_mbps,
        "cpu_cores": cpu_cores,
        "memory_gb": memory_gb,
        "storage_gb": storage_gb,
        "nodes": nodes,
        "cost_per_month_usd": nodes * 500,  # Rough estimate
    }

# Example usage
resources = calculate_resources(
    events_per_sec=50_000,
    avg_event_size_bytes=512,
    window_size_secs=3600,
)
print(resources)
# Output: {'throughput_mbps': 204.8, 'cpu_cores': 4, 'memory_gb': 175,
#          'storage_gb': 1287, 'nodes': 11, 'cost_per_month_usd': 5500}

10.3 Growth Planning

Monthly Events → Required Resources
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1B events/month → 1 node (4 cores, 8GB) → $200/month
10B events/month → 2 nodes (8 cores, 16GB) → $600/month
100B events/month → 5 nodes (16 cores, 32GB) → $2,500/month
1T events/month → 15 nodes (32 cores, 64GB) → $7,500/month
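
To map these monthly volumes onto the events/sec figures used in the sizing guidelines above, divide by the seconds in a month (assuming a 30-day month and uniform traffic; real traffic is bursty, so size for peak rather than average):

```python
# Convert monthly event volume to a sustained events/sec rate.
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 (30-day month assumed)

def monthly_to_eps(events_per_month: float) -> float:
    return events_per_month / SECONDS_PER_MONTH

for label, monthly in [("1B", 1e9), ("10B", 1e10), ("100B", 1e11), ("1T", 1e12)]:
    print(f"{label}/month ≈ {monthly_to_eps(monthly):,.0f} events/sec")
# 1B/month ≈ 386 events/sec average — well inside the "small workload" tier.
```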

Summary

This production deployment guide covers:

  • Single-node deployment with SystemD
  • Kubernetes production setup with HA
  • Docker Compose for local development
  • Complete monitoring with Prometheus + Grafana
  • Enterprise security (TLS, JWT, RBAC, KMS)
  • Performance tuning guidelines
  • Operational runbook (startup, shutdown, backup, restore)
  • Troubleshooting common issues
  • Capacity planning and cost estimation

Next Steps:

  1. Choose deployment method (K8s recommended for production)
  2. Set up monitoring (Prometheus + Grafana)
  3. Configure security (TLS, JWT, KMS)
  4. Run load tests to validate capacity
  5. Set up alerting for critical metrics
  6. Document your specific deployment for your team