HeliosDB Nano High Availability Hands-On Tutorial

This comprehensive tutorial guides you through setting up, operating, and testing HeliosDB Nano’s High Availability features, including Transparent Write Routing (TWR), automatic failover, and application continuity.

Table of Contents

  1. Architecture Overview
  2. Part 1: Docker Deployment
  3. Part 2: Local Deployment (Without Docker)
  4. Part 3: Transparent Write Routing (TWR)
  5. Part 4: Transparent Read Routing (TRR)
  6. Part 5: HeliosProxy Deep Dive
  7. Part 6: Monitoring the Cluster
  8. Part 7: Switchover Operations
  9. Part 8: Failover and Automatic Recovery
  10. Part 9: Application Continuity Testing
  11. Part 10: Advanced Scenarios

Architecture Overview

┌─────────────────────────────────────┐
│             APPLICATION             │
└──────────────────┬──────────────────┘
                   │
┌──────────────────▼──────────────────┐
│             HELIOSPROXY             │
│   ┌─────────────────────────────┐   │
│   │ • PostgreSQL Protocol (5432)│   │
│   │ • HTTP SQL API (8080)       │   │
│   │ • Admin API (9090)          │   │
│   │ • Health Checking           │   │
│   │ • Write Timeout (30s)       │   │
│   │ • TWR + TRR                 │   │
│   └─────────────────────────────┘   │
└────────┬──────────────┬─────────┬───┘
         │              │         │
┌────────▼────────┐ ┌───▼───┐ ┌───▼─────────────┐
│     PRIMARY     │ │STANDBY│ │     STANDBY     │
│  (Read/Write)   │ │ SYNC  │ │      ASYNC      │
│   Port: 5432    │ │ 5442  │ │      5452       │
└────────┬────────┘ └───┬───┘ └────────┬────────┘
         │              │              │
         └──────────────┴──────────────┘
             WAL Streaming Replication

Key Features

Feature         | Description
----------------|--------------------------------------------------------------
TWR             | Transparent Write Routing - writes auto-route to the primary
TRR             | Transparent Read Routing - reads load-balance across standbys
Write Timeout   | Writes wait up to 30s for the primary during failover
Auto Recovery   | Automatic reconnection when the primary returns
Sticky Sessions | Maintain backend affinity within a session

Part 1: Docker Deployment

Prerequisites

Terminal window
# Install Docker and Docker Compose
docker --version # 20.10+
docker compose version # 2.0+

Step 1: Clone and Build

Terminal window
cd "/path/to/HeliosDB Nano"
# Build the Docker image with HA support
docker build -f tests/docker/Dockerfile.ha -t heliosdb-nano:ha .

Step 2: Create Docker Compose Configuration

Create docker-compose.ha-cluster.yml:

version: '3.8'

networks:
  helios-ha:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16

services:
  # Primary node - handles all writes
  primary:
    image: heliosdb-nano:ha
    container_name: heliosdb-primary
    hostname: primary
    networks:
      helios-ha:
        ipv4_address: 172.28.1.1
    ports:
      - "15432:5432"  # PostgreSQL protocol
      - "15433:5433"  # Native protocol
      - "18080:8080"  # HTTP API
    environment:
      - HELIOSDB_ROLE=primary
      - HELIOSDB_NODE_ID=primary
      - HELIOSDB_DATA_DIR=/data
      - HELIOSDB_REPLICATION_MODE=sync
    volumes:
      - primary-data:/data
    healthcheck:
      test: ["CMD", "heliosdb-nano", "health"]
      interval: 5s
      timeout: 3s
      retries: 3

  # Synchronous standby - zero data loss
  standby-sync:
    image: heliosdb-nano:ha
    container_name: heliosdb-standby-sync
    hostname: standby-sync
    networks:
      helios-ha:
        ipv4_address: 172.28.1.2
    ports:
      - "15442:5432"
      - "15443:5433"
      - "18081:8080"
    environment:
      - HELIOSDB_ROLE=standby
      - HELIOSDB_NODE_ID=standby-sync
      - HELIOSDB_PRIMARY_HOST=primary
      - HELIOSDB_PRIMARY_PORT=5433
      - HELIOSDB_REPLICATION_MODE=sync
    volumes:
      - standby-sync-data:/data
    depends_on:
      primary:
        condition: service_healthy

  # Asynchronous standby - better performance, potential lag
  standby-async:
    image: heliosdb-nano:ha
    container_name: heliosdb-standby-async
    hostname: standby-async
    networks:
      helios-ha:
        ipv4_address: 172.28.1.3
    ports:
      - "15462:5432"
      - "15463:5433"
      - "18084:8080"
    environment:
      - HELIOSDB_ROLE=standby
      - HELIOSDB_NODE_ID=standby-async
      - HELIOSDB_PRIMARY_HOST=primary
      - HELIOSDB_PRIMARY_PORT=5433
      - HELIOSDB_REPLICATION_MODE=async
    volumes:
      - standby-async-data:/data
    depends_on:
      primary:
        condition: service_healthy

  # HeliosProxy - intelligent routing
  proxy:
    image: heliosdb-nano:ha
    container_name: heliosdb-proxy
    hostname: proxy
    networks:
      helios-ha:
        ipv4_address: 172.28.1.100
    ports:
      - "15400:5432"  # PostgreSQL protocol
      - "19090:9090"  # Admin API
    environment:
      - HELIOSDB_PROXY_CONFIG=/etc/heliosdb/proxy.toml
    volumes:
      - ./proxy-config.toml:/etc/heliosdb/proxy.toml:ro
    command: ["heliosdb-proxy"]
    depends_on:
      - primary
      - standby-sync
      - standby-async
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9090/health"]
      interval: 5s
      timeout: 3s
      retries: 3

volumes:
  primary-data:
  standby-sync-data:
  standby-async-data:

Step 3: Create Proxy Configuration

Create proxy-config.toml:

[proxy]
listen_addr = "0.0.0.0:5432"
admin_addr = "0.0.0.0:9090"
health_check_interval_secs = 5
failure_threshold = 3
write_timeout_secs = 30

[[nodes]]
name = "primary"
host = "primary"
port = 5432
role = "primary"
enabled = true

[[nodes]]
name = "standby-sync"
host = "standby-sync"
port = 5432
role = "standby"
enabled = true

[[nodes]]
name = "standby-async"
host = "standby-async"
port = 5432
role = "standby"
enabled = true

Step 4: Start the Cluster

Terminal window
# Start all services
docker compose -f docker-compose.ha-cluster.yml up -d
# Verify all containers are running
docker compose -f docker-compose.ha-cluster.yml ps
# Expected output:
# NAME                     STATUS         PORTS
# heliosdb-primary         Up (healthy)   0.0.0.0:15432->5432/tcp
# heliosdb-standby-sync    Up (healthy)   0.0.0.0:15442->5432/tcp
# heliosdb-standby-async   Up (healthy)   0.0.0.0:15462->5432/tcp
# heliosdb-proxy           Up (healthy)   0.0.0.0:15400->5432/tcp

Step 5: Verify Connectivity

Terminal window
# Connect through proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "SELECT 1"
# Connect directly to primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 1"
# Connect directly to standby
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT 1"

Part 2: Local Deployment (Without Docker)

Prerequisites

Terminal window
# Build HeliosDB Nano
cargo build --release --features "ha-tier1,ha-proxy"
# Binary location
export HELIOSDB_BIN=./target/release/heliosdb-nano
export HELIOSPROXY_BIN=./target/release/heliosdb-proxy

Step 1: Create Data Directories

Terminal window
mkdir -p /tmp/heliosdb-ha/{primary,standby-sync,standby-async}

Step 2: Start Primary Node

Terminal window
# Terminal 1: Primary (ports 5432/5433/8080)
$HELIOSDB_BIN start \
  --data-dir /tmp/heliosdb-ha/primary \
  --pg-port 5432 \
  --native-port 5433 \
  --http-port 8080 \
  --node-id primary \
  --replication-role primary \
  --replication-mode sync

Step 3: Start Standby Nodes

Terminal window
# Terminal 2: Standby Sync (ports 5442/5443/8081)
$HELIOSDB_BIN start \
  --data-dir /tmp/heliosdb-ha/standby-sync \
  --pg-port 5442 \
  --native-port 5443 \
  --http-port 8081 \
  --node-id standby-sync \
  --replication-role standby \
  --primary-host localhost \
  --primary-port 5433 \
  --replication-mode sync
Terminal window
# Terminal 3: Standby Async (ports 5452/5453/8082)
$HELIOSDB_BIN start \
  --data-dir /tmp/heliosdb-ha/standby-async \
  --pg-port 5452 \
  --native-port 5453 \
  --http-port 8082 \
  --node-id standby-async \
  --replication-role standby \
  --primary-host localhost \
  --primary-port 5433 \
  --replication-mode async

Step 4: Create Local Proxy Configuration

Create /tmp/heliosdb-ha/proxy.toml:

[proxy]
listen_addr = "0.0.0.0:5400"
admin_addr = "0.0.0.0:9090"
health_check_interval_secs = 5
failure_threshold = 3
write_timeout_secs = 30

[[nodes]]
name = "primary"
host = "localhost"
port = 5432
role = "primary"
enabled = true

[[nodes]]
name = "standby-sync"
host = "localhost"
port = 5442
role = "standby"
enabled = true

[[nodes]]
name = "standby-async"
host = "localhost"
port = 5452
role = "standby"
enabled = true

Step 5: Start HeliosProxy

Terminal window
# Terminal 4: Proxy (port 5400)
$HELIOSPROXY_BIN --config /tmp/heliosdb-ha/proxy.toml

Port Summary (Local Deployment)

Node          | PG Port | Native Port | HTTP Port
--------------|---------|-------------|----------
Primary       | 5432    | 5433        | 8080
Standby Sync  | 5442    | 5443        | 8081
Standby Async | 5452    | 5453        | 8082
Proxy         | 5400    | -           | 9090

Part 3: Transparent Write Routing (TWR)

TWR automatically routes write operations to the primary node, regardless of which node you’re connected to.

How TWR Works

┌───────────────────────────────────────────────────────────────┐
│                      CLIENT APPLICATION                       │
│                                                               │
│   INSERT INTO users (name) VALUES ('Alice')    -- WRITE       │
│   UPDATE users SET active = true               -- WRITE       │
│   DELETE FROM users WHERE id = 5               -- WRITE       │
│   SELECT * FROM users                          -- READ        │
└───────────────────────────┬───────────────────────────────────┘
┌───────────────────────────────────────────────────────────────┐
│                          HELIOSPROXY                          │
│                                                               │
│  Query Classification:                                        │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │ is_write_query(sql):                                    │  │
│  │   • INSERT, UPDATE, DELETE          → true              │  │
│  │   • CREATE, DROP, ALTER, TRUNCATE   → true              │  │
│  │   • BEGIN, COMMIT, ROLLBACK         → true (transaction)│  │
│  │   • SELECT, SHOW, EXPLAIN           → false (read)      │  │
│  └─────────────────────────────────────────────────────────┘  │
│                                                               │
│  Routing Decision:                                            │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │ if is_write:                                            │  │
│  │   route_to_primary()         ─────────► PRIMARY         │  │
│  │ else:                                                   │  │
│  │   route_to_any_healthy()     ─────────► PRIMARY/STANDBY │  │
│  └─────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘
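
If you want to experiment with these classification rules outside the proxy, the snippet below is a minimal bash approximation of the classifier sketched above. It matches only the first keyword of a statement; the real classifier also tracks transaction state, so treat it as illustrative rather than as the proxy’s actual implementation.

Terminal window
# Rough approximation of the proxy's query classifier (illustrative only)
is_write_query() {
  local first_word
  first_word=$(echo "$1" | awk '{print toupper($1)}')
  case "$first_word" in
    INSERT|UPDATE|DELETE|CREATE|DROP|ALTER|TRUNCATE|BEGIN|COMMIT|ROLLBACK)
      return 0 ;;   # write -> must go to the primary
    *)
      return 1 ;;   # read -> any healthy node
  esac
}
is_write_query "INSERT INTO users (name) VALUES ('Alice')" && echo write || echo read
is_write_query "SELECT * FROM users" && echo write || echo read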

Testing TWR

Terminal window
# Create test table through proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb <<EOF
CREATE TABLE twr_test (
    id INTEGER PRIMARY KEY,
    data TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
EOF
# Insert data (automatically routes to primary)
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
  "INSERT INTO twr_test (id, data) VALUES (1, 'test data')"
# Verify on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
  "SELECT * FROM twr_test"
# Verify replication to standby
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c \
  "SELECT * FROM twr_test"

Write Timeout During Failover

When the primary is unavailable, writes wait up to write_timeout_secs:

CLIENT                     PROXY                            NODES
  │                          │                              │
  │ INSERT INTO...           │                              │
  │─────────────────────────►│                              │
  │                          │ select_primary_with_timeout  │
  │                          │─────────────────────────────►│
  │                          │     Primary healthy? NO      │
  │                          │                              │
  │                          │   ┌──────────────────────┐   │
  │                          │   │ WAIT LOOP (30s max)  │   │
  │                          │   │                      │   │
  │ (waiting...)             │   │  Sleep 500ms         │   │
  │                          │   │  Check health        │   │
  │                          │   │  Primary back? YES   │   │
  │                          │   └──────────────────────┘   │
  │                          │                              │
  │                          │─────────────────────────────►│ PRIMARY
  │ OK (after N seconds)     │◄─────────────────────────────│
  │◄─────────────────────────│                              │
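
You can observe the wait loop directly with a timed write. This is a minimal check, assuming the Docker cluster from Part 1 is running and the twr_test table created above exists; the INSERT should block while the primary is down and complete once it returns, so the reported time should be close to the downtime and well under the 30s write timeout.

Terminal window
# Stop the primary, schedule its restart in ~10s, then time a write
docker compose -f docker-compose.ha-cluster.yml stop primary
( sleep 10 && docker compose -f docker-compose.ha-cluster.yml start primary ) &
time PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
  "INSERT INTO twr_test (id, data) VALUES (2, 'written during failover')"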

Part 4: Transparent Read Routing (TRR)

TRR distributes read queries across all healthy nodes for load balancing.

How TRR Works

READ Query: SELECT * FROM users WHERE id = 1
┌─────────────────────────────────────────────────────────────┐
│                         HELIOSPROXY                         │
│                                                             │
│  Load Balancing Algorithm:                                  │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ fn select_read_node():                              │   │
│   │   healthy_nodes = get_healthy_nodes()               │   │
│   │   if session.has_sticky_backend:                    │   │
│   │     return session.backend   # Maintain affinity    │   │
│   │   else:                                             │   │
│   │     return round_robin(healthy_nodes)               │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Distribution:                                              │
│    Request 1 ──► Primary                                    │
│    Request 2 ──► Standby-Sync                               │
│    Request 3 ──► Standby-Async                              │
│    Request 4 ──► Primary   (round robin continues)          │
└─────────────────────────────────────────────────────────────┘
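
The round-robin step is easy to model on its own. The sketch below reproduces the selection logic from the diagram in bash, with sticky sessions and health filtering omitted for brevity; it is an illustration of the algorithm, not the proxy’s code.

Terminal window
# Round-robin selection over the node list (sticky sessions omitted)
NODES=(primary standby-sync standby-async)
rr_index=0
select_read_node() {
  local node=${NODES[$((rr_index % ${#NODES[@]}))]}
  rr_index=$((rr_index + 1))
  echo "$node"
}
for i in 1 2 3 4; do
  echo "Request $i -> $(select_read_node)"
done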

Testing TRR

Terminal window
# Run multiple SELECT queries and observe distribution
for i in {1..10}; do
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
"SELECT '$i', current_timestamp"
done
# Check proxy logs to see routing decisions
docker logs heliosdb-proxy 2>&1 | grep -i "routing\|selected"

Read Scaling Benefits

Scenario       | Without TRR          | With TRR
---------------|----------------------|---------------------------
1000 reads/sec | Primary handles 1000 | Each node handles ~333
Primary fails  | All reads fail       | Reads continue on standbys
Latency        | Single point         | Distributed load

Part 5: HeliosProxy Deep Dive

Proxy Architecture

┌────────────────────────────────────────────────────────────────────┐
│                            HELIOSPROXY                             │
│                                                                    │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                       LISTENER LAYER                        │   │
│  │  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │   │
│  │  │ PG Protocol  │   │   HTTP API   │   │  Admin API   │     │   │
│  │  │  Port 5432   │   │  Port 8080   │   │  Port 9090   │     │   │
│  │  └──────────────┘   └──────────────┘   └──────────────┘     │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                 │                                  │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                        ROUTING LAYER                        │   │
│  │ ┌──────────────────────────────────────────────────────┐  │   │
│  │ │ Query Classifier                                     │  │   │
│  │ │ • Parse SQL to determine read/write                  │  │   │
│  │ │ • Detect transaction boundaries                      │  │   │
│  │ └──────────────────────────────────────────────────────┘  │   │
│  │ ┌──────────────────────────────────────────────────────┐  │   │
│  │ │ Session Manager                                      │  │   │
│  │ │ • Track client sessions                              │  │   │
│  │ │ • Maintain sticky backend affinity                   │  │   │
│  │ └──────────────────────────────────────────────────────┘  │   │
│  │ ┌──────────────────────────────────────────────────────┐  │   │
│  │ │ Load Balancer                                        │  │   │
│  │ │ • Round-robin for reads                              │  │   │
│  │ │ • Primary-only for writes                            │  │   │
│  │ └──────────────────────────────────────────────────────┘  │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                 │                                  │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                        HEALTH LAYER                         │   │
│  │ ┌──────────────────────────────────────────────────────┐  │   │
│  │ │ Health Checker (background task)                     │  │   │
│  │ │ • Poll each node every health_check_interval_secs    │  │   │
│  │ │ • Track consecutive failures                         │  │   │
│  │ │ • Mark unhealthy after failure_threshold failures    │  │   │
│  │ └──────────────────────────────────────────────────────┘  │   │
│  │ ┌──────────────────────────────────────────────────────┐  │   │
│  │ │ Write Timeout Handler                                │  │   │
│  │ │ • Wait for primary availability                      │  │   │
│  │ │ • Poll every 500ms                                   │  │   │
│  │ │ • Timeout after write_timeout_secs                   │  │   │
│  │ └──────────────────────────────────────────────────────┘  │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                 │                                  │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                        BACKEND POOL                         │   │
│  │  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐        │   │
│  │  │   PRIMARY   │   │ STANDBY-SY  │   │ STANDBY-AS  │        │   │
│  │  │ healthy: ✓  │   │ healthy: ✓  │   │ healthy: ✓  │        │   │
│  │  │ failures: 0 │   │ failures: 0 │   │ failures: 0 │        │   │
│  │  └─────────────┘   └─────────────┘   └─────────────┘        │   │
│  └─────────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────────┘
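
For intuition, the Health Layer’s detection logic can be approximated in a few lines of bash: poll, count consecutive failures, and flip state at the threshold. This is a sketch of the algorithm described above using the tutorial’s Docker ports, not the proxy’s internal implementation.

Terminal window
# Sketch of the health checker's state machine for the primary
FAILURE_THRESHOLD=3
INTERVAL=5
failures=0
state=healthy
while true; do
  if PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb \
      -c "SELECT 1" > /dev/null 2>&1; then
    failures=0
    state=healthy
  else
    failures=$((failures + 1))
    [ "$failures" -ge "$FAILURE_THRESHOLD" ] && state=unhealthy
  fi
  echo "$(date +%H:%M:%S) primary state=$state failures=$failures"
  sleep "$INTERVAL"
done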

Configuration Reference

[proxy]
# Network settings
listen_addr = "0.0.0.0:5432"      # PostgreSQL protocol listener
admin_addr = "0.0.0.0:9090"       # Admin/monitoring API

# Health checking
health_check_interval_secs = 5    # How often to check node health
failure_threshold = 3             # Failures before marking unhealthy

# Failover behavior
write_timeout_secs = 30           # Max wait for primary during failover

[[nodes]]
name = "primary"                  # Human-readable identifier
host = "primary"                  # Hostname or IP
port = 5432                       # PostgreSQL port
role = "primary"                  # "primary" or "standby"
enabled = true                    # Include in routing pool

[[nodes]]
name = "standby-sync"
host = "standby-sync"
port = 5432
role = "standby"
enabled = true

Admin API Endpoints

Terminal window
# Health check
curl http://localhost:19090/health
# {"status":"ok"}
# Node status (future enhancement)
curl http://localhost:19090/nodes
# Returns health status of all configured nodes
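
A simple watch loop over the health endpoint is often enough for ad-hoc monitoring; this assumes the Docker port mapping (19090) from Part 1.

Terminal window
# Poll the proxy health endpoint every 5 seconds
while true; do
  if curl -sf http://localhost:19090/health > /dev/null; then
    echo "$(date +%H:%M:%S) proxy: ok"
  else
    echo "$(date +%H:%M:%S) proxy: UNREACHABLE"
  fi
  sleep 5
done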

Part 6: Monitoring the Cluster

Real-Time Health Monitoring Script

Create monitor_cluster.sh:

#!/bin/bash
# HeliosDB Nano Cluster Monitor
PROXY_ADMIN="localhost:19090"
PRIMARY_HTTP="localhost:18080"
STANDBY_SYNC_HTTP="localhost:18081"
STANDBY_ASYNC_HTTP="localhost:18084"
# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
NC='\033[0m'
check_node() {
    local name=$1
    local port=$2
    local result=$(PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -t -c "SELECT 1" 2>&1)
    if [[ "$result" == *"1"* ]]; then
        echo -e "${GREEN}✓${NC}"
    else
        echo -e "${RED}✗${NC}"
    fi
}

check_http() {
    local name=$1
    local url=$2
    local result=$(curl -s -o /dev/null -w "%{http_code}" "$url/health" 2>/dev/null)
    if [[ "$result" == "200" ]]; then
        echo -e "${GREEN}✓${NC}"
    else
        echo -e "${RED}✗${NC}"
    fi
}

get_replication_lag() {
    local port=$1
    # Query replication lag if available
    local lag=$(PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -t -c \
        "SELECT replication_lag_bytes FROM helios_replication_status LIMIT 1" 2>/dev/null | tr -d ' ')
    echo "${lag:-N/A}"
}

while true; do
    clear
    echo -e "${BLUE}╔════════════════════════════════════════════════════════════════╗${NC}"
    echo -e "${BLUE}║                 HeliosDB Nano Cluster Monitor                  ║${NC}"
    echo -e "${BLUE}╚════════════════════════════════════════════════════════════════╝${NC}"
    echo ""
    echo -e " Time: $(date '+%Y-%m-%d %H:%M:%S')"
    echo ""
    echo -e " ${YELLOW}Node Status:${NC}"
    echo -e " ┌────────────────┬──────────┬──────────┬─────────────────┐"
    echo -e " │ Node           │ PG Proto │ HTTP API │ Replication Lag │"
    echo -e " ├────────────────┼──────────┼──────────┼─────────────────┤"
    printf " │ %-14s │ %s │ %s │ %-15s │\n" "Primary" "$(check_node primary 15432)" "$(check_http primary localhost:18080)" "N/A (primary)"
    printf " │ %-14s │ %s │ %s │ %-15s │\n" "Standby-Sync" "$(check_node standby-sync 15442)" "$(check_http standby-sync localhost:18081)" "$(get_replication_lag 15442)"
    printf " │ %-14s │ %s │ %s │ %-15s │\n" "Standby-Async" "$(check_node standby-async 15462)" "$(check_http standby-async localhost:18084)" "$(get_replication_lag 15462)"
    echo -e " └────────────────┴──────────┴──────────┴─────────────────┘"
    echo ""
    echo -e " ${YELLOW}Proxy Status:${NC}"
    echo -e " ┌────────────────┬──────────┐"
    echo -e " │ Component      │ Status   │"
    echo -e " ├────────────────┼──────────┤"
    printf " │ %-14s │ %s │\n" "HeliosProxy" "$(check_http proxy localhost:19090)"
    echo -e " └────────────────┴──────────┘"
    echo ""
    echo -e " ${BLUE}Press Ctrl+C to exit${NC}"
    sleep 2
done

Docker Log Monitoring

Terminal window
# Follow all container logs
docker compose -f docker-compose.ha-cluster.yml logs -f
# Follow proxy logs only
docker compose -f docker-compose.ha-cluster.yml logs -f proxy
# Filter for specific events
docker compose -f docker-compose.ha-cluster.yml logs -f proxy 2>&1 | grep -E "(healthy|unhealthy|failover|routing)"

Query-Based Monitoring

Terminal window
# Check replication status
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM helios_replication_status;
"
# Check standby registration
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM helios_standby_nodes;
"
# Check cluster topology
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SHOW TOPOLOGY;
"

Part 7: Switchover Operations

A switchover is a planned, controlled role change between primary and standby.

Manual Switchover Process

BEFORE SWITCHOVER:
┌─────────────┐           ┌─────────────┐
│   PRIMARY   │──────────►│   STANDBY   │
│  (accepting │    WAL    │ (read-only) │
│   writes)   │   stream  │             │
└─────────────┘           └─────────────┘
AFTER SWITCHOVER:
┌─────────────┐           ┌─────────────┐
│   STANDBY   │◄──────────│   PRIMARY   │
│ (read-only) │    WAL    │  (accepting │
│             │   stream  │   writes)   │
└─────────────┘           └─────────────┘

Switchover Script

Create switchover.sh:

#!/bin/bash
# Controlled switchover script
set -e
OLD_PRIMARY_PORT=${1:-15432}
NEW_PRIMARY_PORT=${2:-15442}
echo "=== HeliosDB Nano Switchover ==="
echo "Old Primary: localhost:$OLD_PRIMARY_PORT"
echo "New Primary: localhost:$NEW_PRIMARY_PORT"
echo ""
# Step 1: Verify both nodes are healthy
echo "[1/5] Verifying node health..."
PGPASSWORD=helios psql -h localhost -p $OLD_PRIMARY_PORT -U helios -d heliosdb -c "SELECT 1" > /dev/null
PGPASSWORD=helios psql -h localhost -p $NEW_PRIMARY_PORT -U helios -d heliosdb -c "SELECT 1" > /dev/null
echo " Both nodes healthy ✓"
# Step 2: Stop writes on old primary (application should handle this gracefully)
echo "[2/5] Preparing old primary for demotion..."
# In production, you would:
# - Put application in read-only mode
# - Wait for in-flight transactions to complete
# - Verify replication is caught up
# Step 3: Verify replication is caught up
echo "[3/5] Verifying replication sync..."
sleep 2 # Allow final WAL to replicate
echo " Replication synchronized ✓"
# Step 4: Promote standby to primary
echo "[4/5] Promoting standby to primary..."
# This would call the promote API endpoint
# curl -X POST http://localhost:${NEW_PRIMARY_HTTP}/admin/promote
echo " New primary promoted ✓"
# Step 5: Reconfigure old primary as standby
echo "[5/5] Demoting old primary to standby..."
# This would reconfigure replication
echo " Old primary demoted ✓"
echo ""
echo "=== Switchover Complete ==="
echo "New Primary: localhost:$NEW_PRIMARY_PORT"
echo "New Standby: localhost:$OLD_PRIMARY_PORT"

Testing Switchover with Workload

Terminal window
# Terminal 1: Start continuous workload
./pg_workload.sh --duration 120 --interval 1 > /tmp/switchover_test.log 2>&1 &
WORKLOAD_PID=$!
echo "Workload started (PID: $WORKLOAD_PID)"
# Terminal 2: Perform switchover after 30 seconds
sleep 30
./switchover.sh 15432 15442
# Terminal 1: Monitor results
tail -f /tmp/switchover_test.log

Part 8: Failover and Automatic Recovery

A failover is an unplanned event where the primary becomes unavailable.

Failover Sequence

NORMAL OPERATION:
┌─────────────┐           ┌─────────────┐           ┌─────────────┐
│   CLIENT    │──────────►│    PROXY    │──────────►│   PRIMARY   │
│             │           │             │           │             │
└─────────────┘           └─────────────┘           └─────────────┘
                                                    ┌─────────────┐
                                                    │   STANDBY   │
                                                    │             │
                                                    └─────────────┘
PRIMARY FAILURE:
┌─────────────┐           ┌─────────────┐           ┌─────────────┐
│   CLIENT    │──────────►│    PROXY    │─────X────►│   PRIMARY   │
│             │           │             │           │   (DOWN)    │
└─────────────┘           └─────────────┘           └─────────────┘
                                 │ DETECT FAILURE
                                 │ (health check fails)
                                 │ WRITE TIMEOUT ACTIVATED
                                 │ (wait up to 30s)
                          ┌─────────────┐
                          │   STANDBY   │ ◄──── Reads continue here
                          │  (healthy)  │
                          └─────────────┘
RECOVERY (Primary returns):
┌─────────────┐           ┌─────────────┐           ┌─────────────┐
│   CLIENT    │──────────►│    PROXY    │──────────►│   PRIMARY   │
│             │           │             │           │  (HEALTHY)  │
└─────────────┘           └─────────────┘           └─────────────┘
                                 │ HEALTH CHECK SUCCEEDS
                                 │ PRIMARY MARKED HEALTHY
                                 │ WRITES RESUME IMMEDIATELY
                          ┌─────────────┐
                          │   STANDBY   │
                          │             │
                          └─────────────┘

Failover Test Script

Create test_failover.sh:

#!/bin/bash
# Failover testing script with workload
WORKLOAD_DURATION=90
PRIMARY_DOWNTIME=40
echo "╔════════════════════════════════════════════════════════════════╗"
echo "║ HeliosDB Nano Failover Test ║"
echo "╚════════════════════════════════════════════════════════════════╝"
echo ""
echo "Test Parameters:"
echo " Workload duration: ${WORKLOAD_DURATION}s"
echo " Primary downtime: ${PRIMARY_DOWNTIME}s"
echo " Write timeout: 30s"
echo ""
# Step 1: Start workload
echo "[$(date +%H:%M:%S)] Starting workload..."
./pg_workload.sh --duration $WORKLOAD_DURATION --interval 1 > /tmp/failover_test.log 2>&1 &
WORKLOAD_PID=$!
# Step 2: Let it run normally for 20 seconds
echo "[$(date +%H:%M:%S)] Running normal operations for 20s..."
sleep 20
# Step 3: Stop primary (simulate failure)
echo "[$(date +%H:%M:%S)] SIMULATING PRIMARY FAILURE..."
docker compose -f docker-compose.ha-cluster.yml stop primary
echo "[$(date +%H:%M:%S)] Primary stopped"
# Step 4: Wait during outage
echo "[$(date +%H:%M:%S)] Waiting ${PRIMARY_DOWNTIME}s (primary down)..."
sleep $PRIMARY_DOWNTIME
# Step 5: Restart primary (recovery)
echo "[$(date +%H:%M:%S)] RECOVERING PRIMARY..."
docker compose -f docker-compose.ha-cluster.yml start primary
echo "[$(date +%H:%M:%S)] Primary restarted"
# Step 6: Wait for workload to complete
echo "[$(date +%H:%M:%S)] Waiting for workload to complete..."
wait $WORKLOAD_PID 2>/dev/null
# Step 7: Analyze results
echo ""
echo "╔════════════════════════════════════════════════════════════════╗"
echo "║ TEST RESULTS ║"
echo "╚════════════════════════════════════════════════════════════════╝"
echo ""
# Extract summary
tail -10 /tmp/failover_test.log
# Analyze timing
echo ""
echo "Detailed Analysis:"
echo "─────────────────────────────────────────────────────────────────"
# Count operations by latency
SLOW_OPS=$(grep -cE '\[[0-9]{4,}ms\]' /tmp/failover_test.log) || SLOW_OPS=0
TOTAL_OPS=$(grep -c 'SELECT=\[ok\]' /tmp/failover_test.log) || TOTAL_OPS=0
echo "Total operations: $TOTAL_OPS"
echo "Operations with write timeout: $SLOW_OPS"
echo ""
# Show the slowest operation (write timeout in action)
echo "Longest operation (write timeout):"
grep -E '\[[0-9]{4,}ms\]' /tmp/failover_test.log | tail -1
echo ""
echo "Full log: /tmp/failover_test.log"

Running the Failover Test

Terminal window
chmod +x test_failover.sh
./test_failover.sh

Expected output:

╔════════════════════════════════════════════════════════════════╗
║                  HeliosDB Nano Failover Test                   ║
╚════════════════════════════════════════════════════════════════╝
[20:30:00] Starting workload...
[20:30:00] Running normal operations for 20s...
[20:30:20] SIMULATING PRIMARY FAILURE...
[20:30:21] Primary stopped
[20:30:21] Waiting 40s (primary down)...
[20:31:01] RECOVERING PRIMARY...
[20:31:03] Primary restarted
[20:31:30] Waiting for workload to complete...
╔════════════════════════════════════════════════════════════════╗
║                          TEST RESULTS                          ║
╚════════════════════════════════════════════════════════════════╝
=== Workload Summary ===
Total iterations: 60
Successful: 60
Failed: 0
Success rate: 100%

Part 9: Application Continuity Testing

Continuous Application Workload

Create app_continuity_test.sh:

#!/bin/bash
# Application Continuity Test
# Simulates a real application with mixed read/write workload
PROXY_HOST="localhost"
PROXY_PORT="15400"
TEST_DURATION=180 # 3 minutes
ITERATIONS=0
SUCCESS=0
FAILED=0
WRITES=0
READS=0
# Setup
echo "Setting up test environment..."
PGPASSWORD=helios psql -h $PROXY_HOST -p $PROXY_PORT -U helios -d heliosdb <<EOF
DROP TABLE IF EXISTS app_orders;
CREATE TABLE app_orders (
    id INTEGER PRIMARY KEY,
    customer TEXT,
    amount REAL,
    status TEXT DEFAULT 'pending',
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
EOF
echo "Starting application continuity test (${TEST_DURATION}s)..."
echo "Press Ctrl+C to stop"
echo ""
START_TIME=$(date +%s)
while true; do
    CURRENT_TIME=$(date +%s)
    ELAPSED=$((CURRENT_TIME - START_TIME))
    if [ $ELAPSED -ge $TEST_DURATION ]; then
        break
    fi
    ITERATIONS=$((ITERATIONS + 1))
    # Simulate mixed workload (70% reads, 30% writes)
    RANDOM_OP=$((RANDOM % 10))
    if [ $RANDOM_OP -lt 7 ]; then
        # READ operation
        RESULT=$(PGPASSWORD=helios psql -h $PROXY_HOST -p $PROXY_PORT -U helios -d heliosdb -t -c \
            "SELECT COUNT(*) FROM app_orders WHERE status = 'completed'" 2>&1)
        if [[ "$RESULT" =~ ^[[:space:]]*[0-9]+[[:space:]]*$ ]]; then
            SUCCESS=$((SUCCESS + 1))
            READS=$((READS + 1))
            echo -ne "\r[$(date +%H:%M:%S)] Iter $ITERATIONS: READ ✓ (${READS} reads, ${WRITES} writes, ${FAILED} failed)"
        else
            FAILED=$((FAILED + 1))
            echo -e "\n[$(date +%H:%M:%S)] Iter $ITERATIONS: READ ✗ - $RESULT"
        fi
    else
        # WRITE operation
        ORDER_ID=$ITERATIONS
        CUSTOMER="customer_$((RANDOM % 100))"
        AMOUNT="$((RANDOM % 1000)).$((RANDOM % 100))"
        RESULT=$(PGPASSWORD=helios psql -h $PROXY_HOST -p $PROXY_PORT -U helios -d heliosdb -t -c \
            "INSERT INTO app_orders (id, customer, amount) VALUES ($ORDER_ID, '$CUSTOMER', $AMOUNT) ON CONFLICT (id) DO UPDATE SET amount = $AMOUNT" 2>&1)
        if [[ "$RESULT" == *"INSERT"* ]] || [[ "$RESULT" == *"UPDATE"* ]] || [[ -z "$(echo $RESULT | tr -d '[:space:]')" ]]; then
            SUCCESS=$((SUCCESS + 1))
            WRITES=$((WRITES + 1))
            echo -ne "\r[$(date +%H:%M:%S)] Iter $ITERATIONS: WRITE ✓ (${READS} reads, ${WRITES} writes, ${FAILED} failed)"
        else
            FAILED=$((FAILED + 1))
            echo -e "\n[$(date +%H:%M:%S)] Iter $ITERATIONS: WRITE ✗ - $RESULT"
        fi
    fi
    # Small delay between operations
    sleep 0.5
done
echo ""
echo ""
echo "╔════════════════════════════════════════════════════════════════╗"
echo "║ Application Continuity Test Results ║"
echo "╚════════════════════════════════════════════════════════════════╝"
echo ""
echo "Duration: ${TEST_DURATION}s"
echo "Total ops: $ITERATIONS"
echo "Successful: $SUCCESS"
echo "Failed: $FAILED"
echo "Read ops: $READS"
echo "Write ops: $WRITES"
echo "Success rate: $(echo "scale=2; $SUCCESS * 100 / $ITERATIONS" | bc)%"

Running Continuity Test with Multiple Switchovers

Terminal window
# Terminal 1: Start the continuity test
./app_continuity_test.sh
# Terminal 2: Perform multiple disruptions
sleep 30
echo "=== First disruption: Stop primary ==="
docker compose -f docker-compose.ha-cluster.yml stop primary
sleep 35
docker compose -f docker-compose.ha-cluster.yml start primary
sleep 30
echo "=== Second disruption: Stop standby-sync ==="
docker compose -f docker-compose.ha-cluster.yml stop standby-sync
sleep 20
docker compose -f docker-compose.ha-cluster.yml start standby-sync
sleep 30
echo "=== Third disruption: Network partition (stop all standbys) ==="
docker compose -f docker-compose.ha-cluster.yml stop standby-sync standby-async
sleep 15
docker compose -f docker-compose.ha-cluster.yml start standby-sync standby-async

Part 10: Advanced Scenarios

Scenario 1: Cascading Failure Test

Test system behavior when multiple nodes fail sequentially:

#!/bin/bash
# Cascading failure test
echo "Starting cascading failure test..."
# Start workload
./pg_workload.sh --duration 120 --interval 1 > /tmp/cascade_test.log 2>&1 &
WORKLOAD_PID=$!
sleep 15
echo "[$(date +%H:%M:%S)] Stopping standby-async..."
docker compose -f docker-compose.ha-cluster.yml stop standby-async
sleep 15
echo "[$(date +%H:%M:%S)] Stopping standby-sync..."
docker compose -f docker-compose.ha-cluster.yml stop standby-sync
sleep 15
echo "[$(date +%H:%M:%S)] Stopping primary (total outage)..."
docker compose -f docker-compose.ha-cluster.yml stop primary
sleep 20
echo "[$(date +%H:%M:%S)] Recovering primary..."
docker compose -f docker-compose.ha-cluster.yml start primary
sleep 10
echo "[$(date +%H:%M:%S)] Recovering standby-sync..."
docker compose -f docker-compose.ha-cluster.yml start standby-sync
sleep 10
echo "[$(date +%H:%M:%S)] Recovering standby-async..."
docker compose -f docker-compose.ha-cluster.yml start standby-async
wait $WORKLOAD_PID
echo ""
tail -20 /tmp/cascade_test.log

Scenario 2: Rolling Restart

Perform rolling restart without downtime:

#!/bin/bash
# Rolling restart - maintain availability during updates
echo "Starting rolling restart..."
# Restart standbys first (one at a time)
echo "[$(date +%H:%M:%S)] Restarting standby-async..."
docker compose -f docker-compose.ha-cluster.yml restart standby-async
sleep 10
echo "[$(date +%H:%M:%S)] Restarting standby-sync..."
docker compose -f docker-compose.ha-cluster.yml restart standby-sync
sleep 10
# Restart primary last (writes will use write timeout)
echo "[$(date +%H:%M:%S)] Restarting primary..."
docker compose -f docker-compose.ha-cluster.yml restart primary
sleep 10
echo "[$(date +%H:%M:%S)] Rolling restart complete"

Scenario 3: Load Testing with Failover

#!/bin/bash
# High-load failover test
CONCURRENCY=5
echo "Starting $CONCURRENCY concurrent workloads..."
# Start multiple concurrent workloads
for i in $(seq 1 $CONCURRENCY); do
    ./pg_workload.sh --duration 60 --interval 0.5 > /tmp/load_test_$i.log 2>&1 &
    echo "Started workload $i (PID: $!)"
done
sleep 20
echo "Simulating failover..."
docker compose -f docker-compose.ha-cluster.yml stop primary
sleep 35
docker compose -f docker-compose.ha-cluster.yml start primary
# Wait for all workloads
wait
echo ""
echo "Results:"
for i in $(seq 1 $CONCURRENCY); do
    echo "Workload $i:"
    tail -5 /tmp/load_test_$i.log | grep -E "(Success|Failed)"
done

Quick Reference

Port Mappings (Docker)

Service       | PG Port | Native Port | HTTP Port | Admin Port
--------------|---------|-------------|-----------|-----------
Primary       | 15432   | 15433       | 18080     | -
Standby-Sync  | 15442   | 15443       | 18081     | -
Standby-Async | 15462   | 15463       | 18084     | -
Proxy         | 15400   | -           | -         | 19090

Port Mappings (Local)

Service       | PG Port | Native Port | HTTP Port | Admin Port
--------------|---------|-------------|-----------|-----------
Primary       | 5432    | 5433        | 8080      | -
Standby-Sync  | 5442    | 5443        | 8081      | -
Standby-Async | 5452    | 5453        | 8082      | -
Proxy         | 5400    | -           | -         | 9090

Common Commands

Terminal window
# Start cluster
docker compose -f docker-compose.ha-cluster.yml up -d
# Stop cluster
docker compose -f docker-compose.ha-cluster.yml down
# View logs
docker compose -f docker-compose.ha-cluster.yml logs -f
# Restart single node
docker compose -f docker-compose.ha-cluster.yml restart primary
# Connect through proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb
# Check proxy health
curl http://localhost:19090/health

Troubleshooting

Issue               | Cause                    | Solution
--------------------|--------------------------|-------------------------------------------
"No healthy nodes"  | All nodes down           | Check container status, restart cluster
High latency writes | Primary slow/recovering  | Check primary logs, wait for recovery
Replication lag     | Network/disk issues      | Check standby logs, verify connectivity
Connection refused  | Wrong port/service down  | Verify port mappings, check service health
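
When something looks wrong, the checks in this table can be run in one pass; the snippet below simply bundles commands already used in this tutorial (Docker deployment ports assumed).

Terminal window
# One-shot triage: container state, proxy health, per-node connectivity
docker compose -f docker-compose.ha-cluster.yml ps
curl -s http://localhost:19090/health; echo
for port in 15432 15442 15462; do
  echo -n "node on port $port: "
  PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -t -c "SELECT 1" \
    > /dev/null 2>&1 && echo ok || echo FAIL
done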

Summary

This tutorial covered:

  1. Docker Deployment - Full HA cluster with proxy
  2. Local Deployment - Multi-instance setup using different ports
  3. TWR - Automatic write routing to primary
  4. TRR - Read load balancing across all nodes
  5. HeliosProxy - Architecture and configuration
  6. Monitoring - Real-time cluster health tracking
  7. Switchover - Planned role changes
  8. Failover - Automatic recovery from failures
  9. Application Continuity - Maintaining operations during disruptions
  10. Advanced Scenarios - Cascading failures, rolling restarts, load testing

Key takeaways:

  • Write timeout ensures writes eventually succeed during brief outages
  • Automatic recovery requires no manual intervention
  • Read routing maintains read availability even when primary is down
  • A 100% success rate is achievable when the write timeout is configured to cover the expected outage window