HeliosDB-Lite High Availability Hands-On Tutorial

This tutorial walks through setting up, operating, and testing HeliosDB-Lite’s High Availability features, including Transparent Write Routing (TWR), Transparent Read Routing (TRR), automatic failover, and application continuity.

Table of Contents

  1. Architecture Overview
  2. Part 1: Docker Deployment
  3. Part 2: Local Deployment (Without Docker)
  4. Part 3: Transparent Write Routing (TWR)
  5. Part 4: Transparent Read Routing (TRR)
  6. Part 5: HeliosProxy Deep Dive
  7. Part 6: Monitoring the Cluster
  8. Part 7: Switchover Operations
  9. Part 8: Failover and Automatic Recovery
  10. Part 9: Application Continuity Testing
  11. Part 10: Advanced Scenarios

Architecture Overview

     ┌─────────────────────────────────────┐
     │             APPLICATION             │
     └──────────────────┬──────────────────┘
     ┌──────────────────▼──────────────────┐
     │             HELIOSPROXY             │
     │   ┌─────────────────────────────┐   │
     │   │ • PostgreSQL Protocol (5432)│   │
     │   │ • HTTP SQL API (8080)       │   │
     │   │ • Admin API (9090)          │   │
     │   │ • Health Checking           │   │
     │   │ • Write Timeout (30s)       │   │
     │   │ • TWR + TRR                 │   │
     │   └─────────────────────────────┘   │
     └────────┬─────────┬─────────┬────────┘
              │         │         │
┌─────────────▼───┐ ┌───▼───┐ ┌───▼─────────────┐
│     PRIMARY     │ │STANDBY│ │     STANDBY     │
│  (Read/Write)   │ │ SYNC  │ │      ASYNC      │
│   Port: 5432    │ │ 5442  │ │      5452       │
└────────┬────────┘ └───┬───┘ └────────┬────────┘
         │              │              │
         └──────────────┴──────────────┘
            WAL Streaming Replication

Key Features

| Feature | Description |
| --- | --- |
| TWR | Transparent Write Routing - writes auto-route to the primary |
| TRR | Transparent Read Routing - reads load-balance across healthy nodes (primary and standbys) |
| Write Timeout | Writes wait up to 30s for the primary during failover |
| Auto Recovery | Automatic reconnection when the primary returns |
| Sticky Sessions | Maintain backend affinity within a session |

Part 1: Docker Deployment

Prerequisites

# Verify that Docker and Docker Compose are installed (minimum versions shown)
docker --version # 20.10+
docker compose version # 2.0+

Step 1: Clone and Build

cd /path/to/HeliosDB-Lite
# Build the Docker image with HA support
docker build -f tests/docker/Dockerfile.ha -t heliosdb-lite:ha .

Step 2: Create Docker Compose Configuration

Create docker-compose.ha-cluster.yml:

version: '3.8'
networks:
  helios-ha:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16
services:
  # Primary node - handles all writes
  primary:
    image: heliosdb-lite:ha
    container_name: heliosdb-primary
    hostname: primary
    networks:
      helios-ha:
        ipv4_address: 172.28.1.1
    ports:
      - "15432:5432" # PostgreSQL protocol
      - "15433:5433" # Native protocol
      - "18080:8080" # HTTP API
    environment:
      - HELIOSDB_ROLE=primary
      - HELIOSDB_NODE_ID=primary
      - HELIOSDB_DATA_DIR=/data
      - HELIOSDB_REPLICATION_MODE=sync
    volumes:
      - primary-data:/data
    healthcheck:
      test: ["CMD", "heliosdb-lite", "health"]
      interval: 5s
      timeout: 3s
      retries: 3
  # Synchronous standby - zero data loss
  standby-sync:
    image: heliosdb-lite:ha
    container_name: heliosdb-standby-sync
    hostname: standby-sync
    networks:
      helios-ha:
        ipv4_address: 172.28.1.2
    ports:
      - "15442:5432"
      - "15443:5433"
      - "18081:8080"
    environment:
      - HELIOSDB_ROLE=standby
      - HELIOSDB_NODE_ID=standby-sync
      - HELIOSDB_PRIMARY_HOST=primary
      - HELIOSDB_PRIMARY_PORT=5433
      - HELIOSDB_REPLICATION_MODE=sync
    volumes:
      - standby-sync-data:/data
    depends_on:
      primary:
        condition: service_healthy
  # Asynchronous standby - better performance, potential lag
  standby-async:
    image: heliosdb-lite:ha
    container_name: heliosdb-standby-async
    hostname: standby-async
    networks:
      helios-ha:
        ipv4_address: 172.28.1.3
    ports:
      - "15462:5432"
      - "15463:5433"
      - "18084:8080"
    environment:
      - HELIOSDB_ROLE=standby
      - HELIOSDB_NODE_ID=standby-async
      - HELIOSDB_PRIMARY_HOST=primary
      - HELIOSDB_PRIMARY_PORT=5433
      - HELIOSDB_REPLICATION_MODE=async
    volumes:
      - standby-async-data:/data
    depends_on:
      primary:
        condition: service_healthy
  # HeliosProxy - intelligent routing
  proxy:
    image: heliosdb-lite:ha
    container_name: heliosdb-proxy
    hostname: proxy
    networks:
      helios-ha:
        ipv4_address: 172.28.1.100
    ports:
      - "15400:5432" # PostgreSQL protocol
      - "19090:9090" # Admin API
    environment:
      - HELIOSDB_PROXY_CONFIG=/etc/heliosdb/proxy.toml
    volumes:
      - ./proxy-config.toml:/etc/heliosdb/proxy.toml:ro
    command: ["heliosdb-proxy"]
    depends_on:
      - primary
      - standby-sync
      - standby-async
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9090/health"]
      interval: 5s
      timeout: 3s
      retries: 3
volumes:
  primary-data:
  standby-sync-data:
  standby-async-data:

Step 3: Create Proxy Configuration

Create proxy-config.toml:

[proxy]
listen_addr = "0.0.0.0:5432"
admin_addr = "0.0.0.0:9090"
health_check_interval_secs = 5
failure_threshold = 3
write_timeout_secs = 30
[[nodes]]
name = "primary"
host = "primary"
port = 5432
role = "primary"
enabled = true
[[nodes]]
name = "standby-sync"
host = "standby-sync"
port = 5432
role = "standby"
enabled = true
[[nodes]]
name = "standby-async"
host = "standby-async"
port = 5432
role = "standby"
enabled = true

Step 4: Start the Cluster

# Start all services
docker compose -f docker-compose.ha-cluster.yml up -d
# Verify all containers are running
docker compose -f docker-compose.ha-cluster.yml ps
# Expected output:
# NAME STATUS PORTS
# heliosdb-primary Up (healthy) 0.0.0.0:15432->5432/tcp
# heliosdb-standby-sync Up (healthy) 0.0.0.0:15442->5432/tcp
# heliosdb-standby-async Up (healthy) 0.0.0.0:15462->5432/tcp
# heliosdb-proxy Up (healthy) 0.0.0.0:15400->5432/tcp

Step 5: Verify Connectivity

# Connect through proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "SELECT 1"
# Connect directly to primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 1"
# Connect directly to standby
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT 1"

Part 2: Local Deployment (Without Docker)

Prerequisites

# Build HeliosDB-Lite
cargo build --release --features "ha-tier1,ha-proxy"
# Binary location
export HELIOSDB_BIN=./target/release/heliosdb-lite
export HELIOSPROXY_BIN=./target/release/heliosdb-proxy

Step 1: Create Data Directories

mkdir -p /tmp/heliosdb-ha/{primary,standby-sync,standby-async}

Step 2: Start Primary Node

# Terminal 1: Primary (ports 5432/5433/8080)
$HELIOSDB_BIN start \
--data-dir /tmp/heliosdb-ha/primary \
--pg-port 5432 \
--native-port 5433 \
--http-port 8080 \
--node-id primary \
--replication-role primary \
--replication-mode sync

Step 3: Start Standby Nodes

# Terminal 2: Standby Sync (ports 5442/5443/8081)
$HELIOSDB_BIN start \
--data-dir /tmp/heliosdb-ha/standby-sync \
--pg-port 5442 \
--native-port 5443 \
--http-port 8081 \
--node-id standby-sync \
--replication-role standby \
--primary-host localhost \
--primary-port 5433 \
--replication-mode sync
# Terminal 3: Standby Async (ports 5452/5453/8082)
$HELIOSDB_BIN start \
--data-dir /tmp/heliosdb-ha/standby-async \
--pg-port 5452 \
--native-port 5453 \
--http-port 8082 \
--node-id standby-async \
--replication-role standby \
--primary-host localhost \
--primary-port 5433 \
--replication-mode async

Step 4: Create Local Proxy Configuration

Create /tmp/heliosdb-ha/proxy.toml:

[proxy]
listen_addr = "0.0.0.0:5400"
admin_addr = "0.0.0.0:9090"
health_check_interval_secs = 5
failure_threshold = 3
write_timeout_secs = 30
[[nodes]]
name = "primary"
host = "localhost"
port = 5432
role = "primary"
enabled = true
[[nodes]]
name = "standby-sync"
host = "localhost"
port = 5442
role = "standby"
enabled = true
[[nodes]]
name = "standby-async"
host = "localhost"
port = 5452
role = "standby"
enabled = true

Step 5: Start HeliosProxy

# Terminal 4: Proxy (port 5400)
$HELIOSPROXY_BIN --config /tmp/heliosdb-ha/proxy.toml

Port Summary (Local Deployment)

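| Node | PG Port | Native Port | HTTP Port | Admin Port |
| --- | --- | --- | --- | --- |
| Primary | 5432 | 5433 | 8080 | - |
| Standby Sync | 5442 | 5443 | 8081 | - |
| Standby Async | 5452 | 5453 | 8082 | - |
| Proxy | 5400 | - | - | 9090 |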

Part 3: Transparent Write Routing (TWR)

TWR automatically routes write operations to the primary node, regardless of which node you’re connected to.

How TWR Works

┌───────────────────────────────────────────────────────────────┐
│ CLIENT APPLICATION │
│ │
│ INSERT INTO users (name) VALUES ('Alice') -- WRITE │
│ UPDATE users SET active = true -- WRITE │
│ DELETE FROM users WHERE id = 5 -- WRITE │
│ SELECT * FROM users -- READ │
└───────────────────────────┬───────────────────────────────────┘
┌───────────────────────────────────────────────────────────────┐
│ HELIOSPROXY │
│ │
│ Query Classification: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ is_write_query(sql): │ │
│ │ • INSERT, UPDATE, DELETE → true │ │
│ │ • CREATE, DROP, ALTER, TRUNCATE → true │ │
│ │ • BEGIN, COMMIT, ROLLBACK → true (transaction) │ │
│ │ • SELECT, SHOW, EXPLAIN → false (read) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Routing Decision: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ if is_write: │ │
│ │ route_to_primary() ───────────► PRIMARY │ │
│ │ else: │ │
│ │ route_to_any_healthy() ───────────► PRIMARY/STANDBY │ │
│ └─────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
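
The classification rules in the diagram can be sketched as a small shell function. This is only an illustration of keyword-based classification, not HeliosProxy's actual implementation: the function name `classify_query` is hypothetical, and sending unrecognized statements to the primary is an assumption made here for safety.

```shell
#!/bin/bash
# Keyword-based read/write classification (illustrative, not the proxy's code).

# classify_query SQL: prints "write" or "read" for one SQL statement.
classify_query() {
    local keyword
    # First word of the statement, upper-cased, with a trailing ";" or "(" removed.
    keyword=$(printf '%s\n' "$1" |
        awk 'NF { w = $1; sub(/[;(].*$/, "", w); print toupper(w); exit }')

    case "$keyword" in
        INSERT|UPDATE|DELETE)        echo "write" ;;
        CREATE|DROP|ALTER|TRUNCATE)  echo "write" ;;
        BEGIN|COMMIT|ROLLBACK)       echo "write" ;;  # transaction control
        SELECT|SHOW|EXPLAIN)         echo "read" ;;
        *)                           echo "write" ;;  # assumption: unknown goes to the primary
    esac
}

classify_query "INSERT INTO users (name) VALUES ('Alice')"   # prints: write
classify_query "  select * from users"                       # prints: read
```

Matching on the first keyword is deliberately naive: a statement such as `WITH ... INSERT` would need real SQL parsing to classify correctly.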

Testing TWR

# Create test table through proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb <<EOF
CREATE TABLE twr_test (
id INTEGER PRIMARY KEY,
data TEXT,
created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
EOF
# Insert data (automatically routes to primary)
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
"INSERT INTO twr_test (id, data) VALUES (1, 'test data')"
# Verify on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
"SELECT * FROM twr_test"
# Verify replication to standby
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c \
"SELECT * FROM twr_test"

Write Timeout During Failover

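When the primary is unavailable, the proxy holds each write and polls for the primary every 500ms, for up to write_timeout_secs (30 seconds in this tutorial's configuration):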

CLIENT                    PROXY                          NODES
  │                         │                              │
  │  INSERT INTO...         │                              │
  │────────────────────────►│                              │
  │                         │ select_primary_with_timeout  │
  │                         │──────────────────────────────│
  │                         │ Primary healthy? NO          │
  │                         │                              │
  │                         │  ┌──────────────────────┐    │
  │                         │  │ WAIT LOOP (30s max)  │    │
  │                         │  │                      │    │
  │  (waiting...)           │  │ Sleep 500ms          │    │
  │                         │  │ Check health         │    │
  │                         │  │ Primary back? YES    │    │
  │                         │  └──────────────────────┘    │
  │                         │                              │
  │                         │─────────────────────────────►│ PRIMARY
  │  OK (after N seconds)   │◄─────────────────────────────│
  │◄────────────────────────│                              │
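
The wait loop in the diagram amounts to polling a health probe until it succeeds or the timeout expires. The sketch below shows that pattern in shell; `wait_for_primary` and `fake_probe` are hypothetical names, and the probe is a stand-in so the example runs without a cluster. Elapsed time is approximated by summing the sleep intervals, which ignores the time the probe itself takes.

```shell
#!/bin/bash
# Sketch of the write-timeout wait loop. Times are in milliseconds so the
# arithmetic stays integer-only.

# wait_for_primary TIMEOUT_MS POLL_MS PROBE [ARGS...]
# Runs PROBE until it succeeds (return 0) or TIMEOUT_MS has elapsed (return 1).
wait_for_primary() {
    local timeout_ms="$1" poll_ms="$2" waited_ms=0
    shift 2
    while ! "$@"; do
        if [ "$waited_ms" -ge "$timeout_ms" ]; then
            return 1                      # primary never came back: give up
        fi
        sleep "$(printf '%d.%03d' $((poll_ms / 1000)) $((poll_ms % 1000)))"
        waited_ms=$((waited_ms + poll_ms))
    done
    return 0
}

# Stand-in probe for the demo: reports "healthy" on the third check.
checks=0
fake_probe() {
    checks=$((checks + 1))
    [ "$checks" -ge 3 ]
}

if wait_for_primary 30000 500 fake_probe; then
    echo "primary available after $checks health checks"
else
    echo "write timed out"
fi
```

Against the Docker cluster, the probe could instead be a function that runs `PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 1"` and discards the output.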

Part 4: Transparent Read Routing (TRR)

TRR distributes read queries across all healthy nodes for load balancing.

How TRR Works

READ Query: SELECT * FROM users WHERE id = 1
┌─────────────────────────────────────────────────────────────┐
│ HELIOSPROXY │
│ │
│ Load Balancing Algorithm: │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ fn select_read_node(): │ │
│ │ healthy_nodes = get_healthy_nodes() │ │
│ │ if session.has_sticky_backend: │ │
│ │ return session.backend # Maintain affinity │ │
│ │ else: │ │
│ │ return round_robin(healthy_nodes) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Distribution: │
│ Request 1 ──► Primary │
│ Request 2 ──► Standby-Sync │
│ Request 3 ──► Standby-Async │
│ Request 4 ──► Primary (round robin continues) │
└─────────────────────────────────────────────────────────────┘
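
The selection logic above can be simulated in a few lines of shell. This is a sketch of round-robin with optional stickiness, not the proxy's code; the variable and function names are hypothetical, and the node list is hard-coded where the proxy would use its health checker's results.

```shell
#!/bin/bash
# Sketch of round-robin read routing with optional session stickiness.

HEALTHY_NODES="primary standby-sync standby-async"   # space-separated list
NEXT_INDEX=0        # number of round-robin picks made so far
STICKY_BACKEND=""   # set to a node name to pin the session to that node

# Stores the chosen node in SELECTED_NODE. A variable is used instead of
# stdout so the counter is not lost in a command-substitution subshell.
select_read_node() {
    if [ -n "$STICKY_BACKEND" ]; then
        SELECTED_NODE="$STICKY_BACKEND"      # maintain affinity
        return 0
    fi
    set -- $HEALTHY_NODES                    # split the list into $1..$N
    [ "$#" -gt 0 ] || return 1               # no healthy nodes to choose from
    shift $((NEXT_INDEX % $#))               # skip ahead to this turn's node
    SELECTED_NODE="$1"
    NEXT_INDEX=$((NEXT_INDEX + 1))
}

for request in 1 2 3 4; do
    select_read_node
    echo "Request $request -> $SELECTED_NODE"
done
# Request 1 -> primary
# Request 2 -> standby-sync
# Request 3 -> standby-async
# Request 4 -> primary
```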

Testing TRR

# Run multiple SELECT queries and observe distribution
for i in {1..10}; do
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
"SELECT '$i', current_timestamp"
done
# Check proxy logs to see routing decisions
docker logs heliosdb-proxy 2>&1 | grep -i "routing\|selected"

Read Scaling Benefits

| Scenario | Without TRR | With TRR |
| --- | --- | --- |
| 1000 reads/sec | Primary handles 1000 | Each node handles ~333 |
| Primary fails | All reads fail | Reads continue on standbys |
| Latency | Single point | Distributed load |

Part 5: HeliosProxy Deep Dive

Proxy Architecture

┌────────────────────────────────────────────────────────────────────┐
│ HELIOSPROXY │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ LISTENER LAYER │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ PG Protocol │ │ HTTP API │ │ Admin API │ │ │
│ │ │ Port 5432 │ │ Port 8080 │ │ Port 9090 │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ ROUTING LAYER │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Query Classifier │ │ │
│ │ │ • Parse SQL to determine read/write │ │ │
│ │ │ • Detect transaction boundaries │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Session Manager │ │ │
│ │ │ • Track client sessions │ │ │
│ │ │ • Maintain sticky backend affinity │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Load Balancer │ │ │
│ │ │ • Round-robin for reads │ │ │
│ │ │ • Primary-only for writes │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ HEALTH LAYER │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Health Checker (background task) │ │ │
│ │ │ • Poll each node every health_check_interval_secs │ │ │
│ │ │ • Track consecutive failures │ │ │
│ │ │ • Mark unhealthy after failure_threshold failures │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Write Timeout Handler │ │ │
│ │ │ • Wait for primary availability │ │ │
│ │ │ • Poll every 500ms │ │ │
│ │ │ • Timeout after write_timeout_secs │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ BACKEND POOL │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ PRIMARY │ │ STANDBY-SY │ │ STANDBY-AS │ │ │
│ │ │ healthy: ✓ │ │ healthy: ✓ │ │ healthy: ✓ │ │ │
│ │ │ failures: 0 │ │ failures: 0 │ │ failures: 0 │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
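
The health checker's bookkeeping (count consecutive failures, mark a node unhealthy at the threshold) can be sketched as a tiny state machine. The names below are hypothetical, and treating a single successful check as enough to mark the node healthy again is an assumption based on the recovery behavior described in this tutorial.

```shell
#!/bin/bash
# Sketch of consecutive-failure tracking for one backend node.

FAILURE_THRESHOLD=3   # same value as failure_threshold in proxy.toml
failures=0            # consecutive failed checks
healthy=yes

# record_check RESULT, where RESULT is "ok" or "fail".
record_check() {
    if [ "$1" = "ok" ]; then
        failures=0                # assumption: one success resets the node
        healthy=yes
    else
        failures=$((failures + 1))
        if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then
            healthy=no
        fi
    fi
}

# Two failures are tolerated; the third consecutive one flips the state.
for result in ok fail fail ok fail fail fail ok; do
    record_check "$result"
    echo "check=$result failures=$failures healthy=$healthy"
done
```

With health_check_interval_secs = 5 and failure_threshold = 3, a node that stops responding would be marked unhealthy roughly 10 to 15 seconds later under this model, ignoring probe timeouts.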

Configuration Reference

[proxy]
# Network settings
listen_addr = "0.0.0.0:5432" # PostgreSQL protocol listener
admin_addr = "0.0.0.0:9090" # Admin/monitoring API
# Health checking
health_check_interval_secs = 5 # How often to check node health
failure_threshold = 3 # Failures before marking unhealthy
# Failover behavior
write_timeout_secs = 30 # Max wait for primary during failover
[[nodes]]
name = "primary" # Human-readable identifier
host = "primary" # Hostname or IP
port = 5432 # PostgreSQL port
role = "primary" # "primary" or "standby"
enabled = true # Include in routing pool
[[nodes]]
name = "standby-sync"
host = "standby-sync"
port = 5432
role = "standby"
enabled = true

Admin API Endpoints

# Health check
curl http://localhost:19090/health
# {"status":"ok"}
# Node status (future enhancement)
curl http://localhost:19090/nodes
# Returns health status of all configured nodes
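
A script can check the response body as well as the HTTP status code. The helper below is a sketch that extracts the `status` field from the JSON shown above using sed; `health_status` is a hypothetical name, and the sample body is hard-coded so the example runs without a cluster.

```shell
#!/bin/bash
# health_status JSON: prints the value of the "status" field, or nothing
# if the input does not contain one.
health_status() {
    printf '%s\n' "$1" |
        sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Against the running cluster, fetch the body with:
#   body=$(curl -s http://localhost:19090/health)
body='{"status":"ok"}'

if [ "$(health_status "$body")" = "ok" ]; then
    echo "proxy healthy"
else
    echo "proxy unhealthy"
fi
```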

Part 6: Monitoring the Cluster

Real-Time Health Monitoring Script

Create monitor_cluster.sh:

#!/bin/bash
# HeliosDB-Lite Cluster Monitor
PROXY_ADMIN="localhost:19090"
PRIMARY_HTTP="localhost:18080"
STANDBY_SYNC_HTTP="localhost:18081"
STANDBY_ASYNC_HTTP="localhost:18084"
# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
NC='\033[0m'
check_node() {
    local name=$1
    local port=$2
    # Judge health by psql's exit status. Matching "1" in the combined output
    # would also match error text such as "port 15432".
    if PGPASSWORD=helios psql -h localhost -p "$port" -U helios -d heliosdb -t -c "SELECT 1" > /dev/null 2>&1; then
        echo -e "${GREEN}✓${NC}"
    else
        echo -e "${RED}✗${NC}"
    fi
}
check_http() {
    local name=$1
    local url=$2
    local result=$(curl -s -o /dev/null -w "%{http_code}" "$url/health" 2>/dev/null)
    if [[ "$result" == "200" ]]; then
        echo -e "${GREEN}✓${NC}"
    else
        echo -e "${RED}✗${NC}"
    fi
}
get_replication_lag() {
    local port=$1
    # Query replication lag if available
    local lag=$(PGPASSWORD=helios psql -h localhost -p "$port" -U helios -d heliosdb -t -c \
        "SELECT replication_lag_bytes FROM helios_replication_status LIMIT 1" 2>/dev/null | tr -d ' ')
    echo "${lag:-N/A}"
}
while true; do
clear
echo -e "${BLUE}╔════════════════════════════════════════════════════════════════╗${NC}"
echo -e "${BLUE}║ HeliosDB-Lite Cluster Monitor ║${NC}"
echo -e "${BLUE}╚════════════════════════════════════════════════════════════════╝${NC}"
echo ""
echo -e " Time: $(date '+%Y-%m-%d %H:%M:%S')"
echo ""
echo -e " ${YELLOW}Node Status:${NC}"
echo -e " ┌────────────────┬──────────┬──────────┬─────────────────┐"
echo -e " │ Node │ PG Proto │ HTTP API │ Replication Lag │"
echo -e " ├────────────────┼──────────┼──────────┼─────────────────┤"
printf " │ %-14s │ %s │ %s │ %-15s │\n" "Primary" "$(check_node primary 15432)" "$(check_http primary localhost:18080)" "N/A (primary)"
printf " │ %-14s │ %s │ %s │ %-15s │\n" "Standby-Sync" "$(check_node standby-sync 15442)" "$(check_http standby-sync localhost:18081)" "$(get_replication_lag 15442)"
printf " │ %-14s │ %s │ %s │ %-15s │\n" "Standby-Async" "$(check_node standby-async 15462)" "$(check_http standby-async localhost:18084)" "$(get_replication_lag 15462)"
echo -e " └────────────────┴──────────┴──────────┴─────────────────┘"
echo ""
echo -e " ${YELLOW}Proxy Status:${NC}"
echo -e " ┌────────────────┬──────────┐"
echo -e " │ Component │ Status │"
echo -e " ├────────────────┼──────────┤"
printf " │ %-14s │ %s │\n" "HeliosProxy" "$(check_http proxy localhost:19090)"
echo -e " └────────────────┴──────────┘"
echo ""
echo -e " ${BLUE}Press Ctrl+C to exit${NC}"
sleep 2
done

Docker Log Monitoring

# Follow all container logs
docker compose -f docker-compose.ha-cluster.yml logs -f
# Follow proxy logs only
docker compose -f docker-compose.ha-cluster.yml logs -f proxy
# Filter for specific events
docker compose -f docker-compose.ha-cluster.yml logs -f proxy 2>&1 | grep -E "(healthy|unhealthy|failover|routing)"

Query-Based Monitoring

# Check replication status
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM helios_replication_status;
"
# Check standby registration
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM helios_standby_nodes;
"
# Check cluster topology
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SHOW TOPOLOGY;
"

Part 7: Switchover Operations

A switchover is a planned, controlled role change between primary and standby.

Manual Switchover Process

BEFORE SWITCHOVER:
┌─────────────┐ ┌─────────────┐
│ PRIMARY │──────────►│ STANDBY │
│ (accepting │ WAL │ (read-only) │
│ writes) │ stream │ │
└─────────────┘ └─────────────┘
AFTER SWITCHOVER:
┌─────────────┐ ┌─────────────┐
│ STANDBY │◄──────────│ PRIMARY │
│ (read-only) │ WAL │ (accepting │
│ │ stream │ writes) │
└─────────────┘ └─────────────┘

Switchover Script

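Create switchover.sh. The script is a skeleton: the replication check, promotion, and demotion steps are placeholders to replace with the calls for your deployment: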

#!/bin/bash
# Controlled switchover script
set -e
OLD_PRIMARY_PORT=${1:-15432}
NEW_PRIMARY_PORT=${2:-15442}
echo "=== HeliosDB-Lite Switchover ==="
echo "Old Primary: localhost:$OLD_PRIMARY_PORT"
echo "New Primary: localhost:$NEW_PRIMARY_PORT"
echo ""
# Step 1: Verify both nodes are healthy
echo "[1/5] Verifying node health..."
PGPASSWORD=helios psql -h localhost -p $OLD_PRIMARY_PORT -U helios -d heliosdb -c "SELECT 1" > /dev/null
PGPASSWORD=helios psql -h localhost -p $NEW_PRIMARY_PORT -U helios -d heliosdb -c "SELECT 1" > /dev/null
echo " Both nodes healthy ✓"
# Step 2: Stop writes on old primary (application should handle this gracefully)
echo "[2/5] Preparing old primary for demotion..."
# In production, you would:
# - Put application in read-only mode
# - Wait for in-flight transactions to complete
# - Verify replication is caught up
# Step 3: Verify replication is caught up
echo "[3/5] Verifying replication sync..."
sleep 2 # Allow final WAL to replicate
echo " Replication synchronized ✓"
# Step 4: Promote standby to primary
echo "[4/5] Promoting standby to primary..."
# This would call the promote API endpoint
# curl -X POST http://localhost:${NEW_PRIMARY_HTTP}/admin/promote
echo " New primary promoted ✓"
# Step 5: Reconfigure old primary as standby
echo "[5/5] Demoting old primary to standby..."
# This would reconfigure replication
echo " Old primary demoted ✓"
echo ""
echo "=== Switchover Complete ==="
echo "New Primary: localhost:$NEW_PRIMARY_PORT"
echo "New Standby: localhost:$OLD_PRIMARY_PORT"
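
Step 3 of the script sleeps for a fixed two seconds. A sketch of an explicit check is shown below: it polls a lag command until it reports zero bytes. `wait_for_zero_lag` and `standby_lag` are hypothetical names, and using `replication_lag_bytes` from `helios_replication_status` on the standby is an assumption borrowed from the monitoring script earlier in this tutorial.

```shell
#!/bin/bash
# wait_for_zero_lag MAX_ATTEMPTS LAG_CMD [ARGS...]
# LAG_CMD must print the current replication lag in bytes.
# Returns 0 once it prints 0, or 1 after MAX_ATTEMPTS checks.
wait_for_zero_lag() {
    local max_attempts="$1" attempt=1 lag
    shift
    while [ "$attempt" -le "$max_attempts" ]; do
        lag=$("$@" | tr -d '[:space:]')
        if [ "$lag" = "0" ]; then
            echo "replication caught up after $attempt check(s)"
            return 0
        fi
        echo "check $attempt: lag=${lag:-unknown}"
        attempt=$((attempt + 1))
        sleep 1
    done
    return 1
}

# Hypothetical lag command for the standby that is about to be promoted.
standby_lag() {
    PGPASSWORD=helios psql -h localhost -p "${NEW_PRIMARY_PORT:-15442}" \
        -U helios -d heliosdb -t -c \
        "SELECT replication_lag_bytes FROM helios_replication_status LIMIT 1"
}

# In switchover.sh, step 3 could then become:
#   wait_for_zero_lag 30 standby_lag || { echo "standby still lagging"; exit 1; }
```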

Testing Switchover with Workload

# Terminal 1: Start continuous workload
./pg_workload.sh --duration 120 --interval 1 > /tmp/switchover_test.log 2>&1 &
WORKLOAD_PID=$!
echo "Workload started (PID: $WORKLOAD_PID)"
# Terminal 2: Perform switchover after 30 seconds
sleep 30
./switchover.sh 15432 15442
# Terminal 1: Monitor results
tail -f /tmp/switchover_test.log

Part 8: Failover and Automatic Recovery

A failover is an unplanned event where the primary becomes unavailable.

Failover Sequence

NORMAL OPERATION:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CLIENT │──────────►│ PROXY │──────────►│ PRIMARY │
│ │ │ │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
┌─────────────┐
│ STANDBY │
│ │
└─────────────┘
PRIMARY FAILURE:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CLIENT │──────────►│ PROXY │─────X────►│ PRIMARY │
│ │ │ │ │ (DOWN) │
└─────────────┘ └─────────────┘ └─────────────┘
│ DETECT FAILURE
│ (health check fails)
│ WRITE TIMEOUT ACTIVATED
│ (wait up to 30s)
┌─────────────┐
│ STANDBY │ ◄──── Reads continue here
│ (healthy) │
└─────────────┘
RECOVERY (Primary returns):
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CLIENT │──────────►│ PROXY │──────────►│ PRIMARY │
│ │ │ │ │ (HEALTHY) │
└─────────────┘ └─────────────┘ └─────────────┘
│ HEALTH CHECK SUCCEEDS
│ PRIMARY MARKED HEALTHY
│ WRITES RESUME IMMEDIATELY
┌─────────────┐
│ STANDBY │
│ │
└─────────────┘

Failover Test Script

Create test_failover.sh:

#!/bin/bash
# Failover testing script with workload
WORKLOAD_DURATION=90
PRIMARY_DOWNTIME=40
echo "╔════════════════════════════════════════════════════════════════╗"
echo "║ HeliosDB-Lite Failover Test ║"
echo "╚════════════════════════════════════════════════════════════════╝"
echo ""
echo "Test Parameters:"
echo " Workload duration: ${WORKLOAD_DURATION}s"
echo " Primary downtime: ${PRIMARY_DOWNTIME}s"
echo " Write timeout: 30s"
echo ""
# Step 1: Start workload
echo "[$(date +%H:%M:%S)] Starting workload..."
./pg_workload.sh --duration $WORKLOAD_DURATION --interval 1 > /tmp/failover_test.log 2>&1 &
WORKLOAD_PID=$!
# Step 2: Let it run normally for 20 seconds
echo "[$(date +%H:%M:%S)] Running normal operations for 20s..."
sleep 20
# Step 3: Stop primary (simulate failure)
echo "[$(date +%H:%M:%S)] SIMULATING PRIMARY FAILURE..."
docker compose -f docker-compose.ha-cluster.yml stop primary
echo "[$(date +%H:%M:%S)] Primary stopped"
# Step 4: Wait during outage
echo "[$(date +%H:%M:%S)] Waiting ${PRIMARY_DOWNTIME}s (primary down)..."
sleep $PRIMARY_DOWNTIME
# Step 5: Restart primary (recovery)
echo "[$(date +%H:%M:%S)] RECOVERING PRIMARY..."
docker compose -f docker-compose.ha-cluster.yml start primary
echo "[$(date +%H:%M:%S)] Primary restarted"
# Step 6: Wait for workload to complete
echo "[$(date +%H:%M:%S)] Waiting for workload to complete..."
wait $WORKLOAD_PID 2>/dev/null
# Step 7: Analyze results
echo ""
echo "╔════════════════════════════════════════════════════════════════╗"
echo "║ TEST RESULTS ║"
echo "╚════════════════════════════════════════════════════════════════╝"
echo ""
# Extract summary
tail -10 /tmp/failover_test.log
# Analyze timing
echo ""
echo "Detailed Analysis:"
echo "─────────────────────────────────────────────────────────────────"
# Count operations by latency. grep -c prints 0 itself when nothing matches
# (and exits non-zero), so "|| true" avoids a duplicated "0" in the result.
SLOW_OPS=$(grep -cE '\[[0-9]{4,}ms\]' /tmp/failover_test.log || true)
TOTAL_OPS=$(grep -c 'SELECT=\[ok\]' /tmp/failover_test.log || true)
echo "Total operations: $TOTAL_OPS"
echo "Operations with write timeout: $SLOW_OPS"
echo ""
# Show the slowest operation (write timeout in action)
echo "Longest operation (write timeout):"
grep -E '\[[0-9]{4,}ms\]' /tmp/failover_test.log | tail -1
echo ""
echo "Full log: /tmp/failover_test.log"
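
The latency analysis can be generalized to any threshold. The sketch below extracts every `[<n>ms]` value from a log and counts those at or above a limit; `count_slow_ops` is a hypothetical helper, and the sample log lines only imitate the format implied by the grep patterns above, since the real output of pg_workload.sh is not shown here.

```shell
#!/bin/bash
# count_slow_ops LOG_FILE THRESHOLD_MS
# Prints how many "[<n>ms]" latencies in LOG_FILE are >= THRESHOLD_MS.
count_slow_ops() {
    grep -oE '\[[0-9]+ms\]' "$1" |
        sed 's/[^0-9]//g' |
        awk -v limit="$2" '$1 >= limit { n++ } END { print n + 0 }'
}

# Demo with made-up log lines in the assumed format.
log=$(mktemp)
printf '%s\n' \
    'iter 1 INSERT=[ok] [12ms]' \
    'iter 2 INSERT=[ok] [28450ms]' \
    'iter 3 INSERT=[ok] [9ms]' > "$log"

echo "operations slower than 1s: $(count_slow_ops "$log" 1000)"   # prints 1
rm -f "$log"
```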

Running the Failover Test

chmod +x test_failover.sh
./test_failover.sh

Expected output:

╔════════════════════════════════════════════════════════════════╗
║ HeliosDB-Lite Failover Test ║
╚════════════════════════════════════════════════════════════════╝
[20:30:00] Starting workload...
[20:30:00] Running normal operations for 20s...
[20:30:20] SIMULATING PRIMARY FAILURE...
[20:30:21] Primary stopped
[20:30:21] Waiting 40s (primary down)...
[20:31:01] RECOVERING PRIMARY...
[20:31:03] Primary restarted
[20:31:03] Waiting for workload to complete...
╔════════════════════════════════════════════════════════════════╗
║ TEST RESULTS ║
╚════════════════════════════════════════════════════════════════╝
=== Workload Summary ===
Total iterations: 60
Successful: 60
Failed: 0
Success rate: 100%

Part 9: Application Continuity Testing

Continuous Application Workload

Create app_continuity_test.sh:

#!/bin/bash
# Application Continuity Test
# Simulates a real application with mixed read/write workload
PROXY_HOST="localhost"
PROXY_PORT="15400"
TEST_DURATION=180 # 3 minutes
ITERATIONS=0
SUCCESS=0
FAILED=0
WRITES=0
READS=0
# Setup
echo "Setting up test environment..."
PGPASSWORD=helios psql -h $PROXY_HOST -p $PROXY_PORT -U helios -d heliosdb <<EOF
DROP TABLE IF EXISTS app_orders;
CREATE TABLE app_orders (
id INTEGER PRIMARY KEY,
customer TEXT,
amount REAL,
status TEXT DEFAULT 'pending',
created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
EOF
echo "Starting application continuity test (${TEST_DURATION}s)..."
echo "Press Ctrl+C to stop"
echo ""
START_TIME=$(date +%s)
while true; do
    CURRENT_TIME=$(date +%s)
    ELAPSED=$((CURRENT_TIME - START_TIME))
    if [ $ELAPSED -ge $TEST_DURATION ]; then
        break
    fi
    ITERATIONS=$((ITERATIONS + 1))
    # Simulate mixed workload (70% reads, 30% writes)
    RANDOM_OP=$((RANDOM % 10))
    if [ $RANDOM_OP -lt 7 ]; then
        # READ operation
        RESULT=$(PGPASSWORD=helios psql -h $PROXY_HOST -p $PROXY_PORT -U helios -d heliosdb -t -c \
            "SELECT COUNT(*) FROM app_orders WHERE status = 'completed'" 2>&1)
        if [[ "$RESULT" =~ ^[[:space:]]*[0-9]+[[:space:]]*$ ]]; then
            SUCCESS=$((SUCCESS + 1))
            READS=$((READS + 1))
            echo -ne "\r[$(date +%H:%M:%S)] Iter $ITERATIONS: READ ✓ (${READS} reads, ${WRITES} writes, ${FAILED} failed)"
        else
            FAILED=$((FAILED + 1))
            echo -e "\n[$(date +%H:%M:%S)] Iter $ITERATIONS: READ ✗ - $RESULT"
        fi
    else
        # WRITE operation
        ORDER_ID=$ITERATIONS
        CUSTOMER="customer_$((RANDOM % 100))"
        # Zero-pad the cents so that 7 becomes .07 rather than .7
        AMOUNT=$(printf '%d.%02d' $((RANDOM % 1000)) $((RANDOM % 100)))
        RESULT=$(PGPASSWORD=helios psql -h $PROXY_HOST -p $PROXY_PORT -U helios -d heliosdb -t -c \
            "INSERT INTO app_orders (id, customer, amount) VALUES ($ORDER_ID, '$CUSTOMER', $AMOUNT) ON CONFLICT (id) DO UPDATE SET amount = $AMOUNT" 2>&1)
        if [[ "$RESULT" == *"INSERT"* ]] || [[ "$RESULT" == *"UPDATE"* ]] || [[ -z "$(echo "$RESULT" | tr -d '[:space:]')" ]]; then
            SUCCESS=$((SUCCESS + 1))
            WRITES=$((WRITES + 1))
            echo -ne "\r[$(date +%H:%M:%S)] Iter $ITERATIONS: WRITE ✓ (${READS} reads, ${WRITES} writes, ${FAILED} failed)"
        else
            FAILED=$((FAILED + 1))
            echo -e "\n[$(date +%H:%M:%S)] Iter $ITERATIONS: WRITE ✗ - $RESULT"
        fi
    fi
    # Small delay between operations
    sleep 0.5
done
echo ""
echo ""
echo "╔════════════════════════════════════════════════════════════════╗"
echo "║ Application Continuity Test Results ║"
echo "╚════════════════════════════════════════════════════════════════╝"
echo ""
echo "Duration: ${TEST_DURATION}s"
echo "Total ops: $ITERATIONS"
echo "Successful: $SUCCESS"
echo "Failed: $FAILED"
echo "Read ops: $READS"
echo "Write ops: $WRITES"
echo "Success rate: $(echo "scale=2; $SUCCESS * 100 / $ITERATIONS" | bc)%"

Running the Continuity Test with Multiple Disruptions

# Terminal 1: Start the continuity test
./app_continuity_test.sh
# Terminal 2: Perform multiple disruptions
sleep 30
echo "=== First disruption: Stop primary ==="
docker compose -f docker-compose.ha-cluster.yml stop primary
sleep 35
docker compose -f docker-compose.ha-cluster.yml start primary
sleep 30
echo "=== Second disruption: Stop standby-sync ==="
docker compose -f docker-compose.ha-cluster.yml stop standby-sync
sleep 20
docker compose -f docker-compose.ha-cluster.yml start standby-sync
sleep 30
echo "=== Third disruption: Stop all standbys ==="
docker compose -f docker-compose.ha-cluster.yml stop standby-sync standby-async
sleep 15
docker compose -f docker-compose.ha-cluster.yml start standby-sync standby-async

Part 10: Advanced Scenarios

Scenario 1: Cascading Failure Test

Test system behavior when multiple nodes fail sequentially:

#!/bin/bash
# Cascading failure test
echo "Starting cascading failure test..."
# Start workload
./pg_workload.sh --duration 120 --interval 1 > /tmp/cascade_test.log 2>&1 &
WORKLOAD_PID=$!
sleep 15
echo "[$(date +%H:%M:%S)] Stopping standby-async..."
docker compose -f docker-compose.ha-cluster.yml stop standby-async
sleep 15
echo "[$(date +%H:%M:%S)] Stopping standby-sync..."
docker compose -f docker-compose.ha-cluster.yml stop standby-sync
sleep 15
echo "[$(date +%H:%M:%S)] Stopping primary (total outage)..."
docker compose -f docker-compose.ha-cluster.yml stop primary
sleep 20
echo "[$(date +%H:%M:%S)] Recovering primary..."
docker compose -f docker-compose.ha-cluster.yml start primary
sleep 10
echo "[$(date +%H:%M:%S)] Recovering standby-sync..."
docker compose -f docker-compose.ha-cluster.yml start standby-sync
sleep 10
echo "[$(date +%H:%M:%S)] Recovering standby-async..."
docker compose -f docker-compose.ha-cluster.yml start standby-async
wait $WORKLOAD_PID
echo ""
tail -20 /tmp/cascade_test.log

Scenario 2: Rolling Restart

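Perform a rolling restart that keeps the cluster available. Standbys restart first, one at a time; writes issued while the primary restarts wait for it, up to write_timeout_secs: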

#!/bin/bash
# Rolling restart - maintain availability during updates
echo "Starting rolling restart..."
# Restart standbys first (one at a time)
echo "[$(date +%H:%M:%S)] Restarting standby-async..."
docker compose -f docker-compose.ha-cluster.yml restart standby-async
sleep 10
echo "[$(date +%H:%M:%S)] Restarting standby-sync..."
docker compose -f docker-compose.ha-cluster.yml restart standby-sync
sleep 10
# Restart primary last (writes will use write timeout)
echo "[$(date +%H:%M:%S)] Restarting primary..."
docker compose -f docker-compose.ha-cluster.yml restart primary
sleep 10
echo "[$(date +%H:%M:%S)] Rolling restart complete"

Scenario 3: Load Testing with Failover

#!/bin/bash
# High-load failover test
CONCURRENCY=5
echo "Starting $CONCURRENCY concurrent workloads..."
# Start multiple concurrent workloads
for i in $(seq 1 $CONCURRENCY); do
./pg_workload.sh --duration 60 --interval 0.5 > /tmp/load_test_$i.log 2>&1 &
echo "Started workload $i (PID: $!)"
done
sleep 20
echo "Simulating failover..."
docker compose -f docker-compose.ha-cluster.yml stop primary
sleep 35
docker compose -f docker-compose.ha-cluster.yml start primary
# Wait for all workloads
wait
echo ""
echo "Results:"
for i in $(seq 1 $CONCURRENCY); do
echo "Workload $i:"
tail -5 /tmp/load_test_$i.log | grep -E "(Success|Failed)"
done
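
To get one overall figure instead of reading five summaries, the per-workload counts can be summed. The sketch below assumes each log ends with `Successful: <n>` and `Failed: <n>` lines, as in the workload summary shown in the failover test's expected output; `sum_field` is a hypothetical helper and the demo uses made-up numbers in temporary files.

```shell
#!/bin/bash
# sum_field LABEL FILE...
# Adds up the numbers on every "LABEL: <n>" line across the given files.
sum_field() {
    local label="$1"
    shift
    awk -F': *' -v label="$label" \
        '$1 == label { total += $2 } END { print total + 0 }' "$@"
}

# Demo with made-up summaries in the assumed format.
dir=$(mktemp -d)
printf 'Successful: 118\nFailed: 2\nSuccess rate: 98%%\n' > "$dir/load_test_1.log"
printf 'Successful: 120\nFailed: 0\nSuccess rate: 100%%\n' > "$dir/load_test_2.log"

echo "successful=$(sum_field Successful "$dir"/load_test_*.log)"   # successful=238
echo "failed=$(sum_field Failed "$dir"/load_test_*.log)"           # failed=2
rm -rf "$dir"
```

For the load test above, the call would be `sum_field Failed /tmp/load_test_*.log`.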

Quick Reference

Port Mappings (Docker)

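| Service | PG Port | Native Port | HTTP Port | Admin Port |
| --- | --- | --- | --- | --- |
| Primary | 15432 | 15433 | 18080 | - |
| Standby-Sync | 15442 | 15443 | 18081 | - |
| Standby-Async | 15462 | 15463 | 18084 | - |
| Proxy | 15400 | - | - | 19090 |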

Port Mappings (Local)

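| Service | PG Port | Native Port | HTTP Port | Admin Port |
| --- | --- | --- | --- | --- |
| Primary | 5432 | 5433 | 8080 | - |
| Standby-Sync | 5442 | 5443 | 8081 | - |
| Standby-Async | 5452 | 5453 | 8082 | - |
| Proxy | 5400 | - | - | 9090 |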

Common Commands

# Start cluster
docker compose -f docker-compose.ha-cluster.yml up -d
# Stop cluster
docker compose -f docker-compose.ha-cluster.yml down
# View logs
docker compose -f docker-compose.ha-cluster.yml logs -f
# Restart single node
docker compose -f docker-compose.ha-cluster.yml restart primary
# Connect through proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb
# Check proxy health
curl http://localhost:19090/health

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| "No healthy nodes" | All nodes down | Check container status, restart cluster |
| High latency writes | Primary slow/recovering | Check primary logs, wait for recovery |
| Replication lag | Network/disk issues | Check standby logs, verify connectivity |
| Connection refused | Wrong port/service down | Verify port mappings, check service health |

Summary

This tutorial covered:

  1. Docker Deployment - Full HA cluster with proxy
  2. Local Deployment - Multi-instance setup using different ports
  3. TWR - Automatic write routing to primary
  4. TRR - Read load balancing across all nodes
  5. HeliosProxy - Architecture and configuration
  6. Monitoring - Real-time cluster health tracking
  7. Switchover - Planned role changes
  8. Failover - Automatic recovery from failures
  9. Application Continuity - Maintaining operations during disruptions
  10. Advanced Scenarios - Cascading failures, rolling restarts, load testing

Key takeaways:

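  • The write timeout lets writes succeed during outages shorter than write_timeout_secs (30s in this tutorial's configuration)
  • Once the primary passes its health check again, the proxy resumes routing writes to it with no manual intervention
  • Read routing keeps reads available on the standbys while the primary is down
  • A 100% success rate is achievable when write_timeout_secs covers the time the primary is unavailable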