Skip to content

HeliosDB Nano High Availability Interactive Tutorial

HeliosDB Nano High Availability Interactive Tutorial

This hands-on tutorial walks you through HeliosDB Nano’s High Availability features with interactive checkpoints where you can pause, observe, and verify cluster behavior in real-time.


Table of Contents

  1. Features & Benefits Overview
  2. Prerequisites
  3. Part 1: Starting the Cluster
  4. Part 2: Understanding Replication Modes
  5. Part 3: Data Replication Verification
  6. Part 4: Transparent Write Routing (TWR)
  7. Part 5: Failover & Recovery
  8. Part 6: Automatic Reconnection
  9. Part 7: Advanced Scenarios
  10. Quick Reference

Features & Benefits Overview

What is HeliosDB Nano HA?

HeliosDB Nano provides enterprise-grade High Availability through WAL (Write-Ahead Log) streaming replication with multiple sync modes, automatic failover detection, and intelligent connection routing.

Key Features

FeatureDescriptionBenefit
WAL Streaming ReplicationContinuous log-based replication from primary to standbysNear-zero data loss, real-time sync
Multiple Sync ModesSync, Semi-sync, and Async modesBalance between durability and performance
Transparent Write Routing (TWR)Writes auto-route to primary from any nodeSimplified application connectivity
Automatic ReconnectionStandbys auto-reconnect with exponential backoffSelf-healing cluster
Split-Brain ProtectionObserver nodes and fencing tokens prevent dual-primaryData consistency guaranteed
HeliosProxyIntelligent connection router with health monitoringZero-downtime failover for applications

Replication Mode Comparison

┌─────────────────────────────────────────────────────────────────────────────┐
│ REPLICATION MODE SPECTRUM │
│ │
│ ◄─────────────────────────────────────────────────────────────────────► │
│ PERFORMANCE DURABILITY │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ ASYNC │ │ SEMI-SYNC │ │ SYNC │ │
│ │ │ │ │ │ │ │
│ │ Fire-and- │ │ Wait for 1 │ │ Wait for 1+ │ │
│ │ forget │ │ standby ACK │ │ standby ACK │ │
│ │ │ │ (flush_lsn) │ │ (flush_lsn) │ │
│ │ Fastest │ │ Balanced │ │ Safest │ │
│ │ May lose │ │ Low risk │ │ Zero loss* │ │
│ │ recent data │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ *Zero loss when at least one sync standby is available │
└─────────────────────────────────────────────────────────────────────────────┘

Architecture

┌─────────────────────────────────────┐
│ APPLICATION │
└─────────────────┬───────────────────┘
┌─────────────────▼───────────────────┐
│ HELIOSPROXY │
│ • Automatic primary detection │
│ • Health-based routing │
│ • Write timeout (waits for primary)│
└───────┬───────────┬───────────┬────┘
│ │ │
┌─────────────▼───┐ ┌───▼───────┐ ┌▼─────────────┐
│ PRIMARY │ │ STANDBY │ │ STANDBY │
│ (Read/Write) │ │ SYNC │ │ ASYNC │
│ │ │ │ │ │
│ Port: 15432 │ │ 15442 │ │ 15462 │
└────────┬────────┘ └─────┬─────┘ └──────┬───────┘
│ │ │
└──────────────────┴───────────────┘
WAL Streaming Replication

Prerequisites

Before starting, ensure you have:

Terminal window
# Docker and Docker Compose
docker --version # 20.10+
docker compose version # 2.0+
# psql client (for verification)
psql --version # Any version
# curl (for API checks)
curl --version
# jq (optional, for JSON formatting)
jq --version

Part 1: Starting the Cluster

Step 1.1: Navigate to the Docker Test Directory

Terminal window
cd /home/app/HeliosDB Nano/tests/docker

Step 1.2: Build and Start the Cluster

Terminal window
# Build with fresh binaries (recommended)
docker compose -f docker-compose.ha-cluster.yml build --no-cache
# Start all services
docker compose -f docker-compose.ha-cluster.yml up -d

Wait approximately 30-45 seconds for all nodes to start and establish replication.


CHECKPOINT 1: Verify Cluster Startup

Open a new terminal and run:

Terminal window
cd /home/app/HeliosDB Nano/tests/docker
./ha_interactive_tutorial.sh checkpoint1

Or manually verify:

Terminal window
# Check all containers are running
docker compose -f docker-compose.ha-cluster.yml ps
# Expected output (all should show "Up" or "healthy"):
# NAME STATUS
# heliosdb-primary Up (healthy)
# heliosdb-standby-sync Up (healthy)
# heliosdb-standby-semisync Up (healthy)
# heliosdb-standby-async Up (healthy)
# heliosdb-observer Up
# heliosdb-proxy Up (healthy)
Terminal window
# Verify each node is accessible
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 'Primary OK' as status"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT 'Standby-Sync OK' as status"
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "SELECT 'Standby-Semi OK' as status"
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "SELECT 'Standby-Async OK' as status"

What you should see:

  • All 6 containers running
  • Each node responds to queries
  • Primary accepts reads and writes, standbys are read-only

Take a moment to:

  1. Run the monitor script: ./monitor_cluster.sh
  2. Observe all nodes showing “UP”
  3. Note the latency values for each node

Press Enter to continue when ready…


Part 2: Understanding Replication Modes

The Cluster Configuration

Our test cluster has 4 database nodes with different sync modes:

NodePortSync ModeDurabilityPerformance
Primary15432N/ASourceFastest
Standby-Sync15442syncGuaranteedSlower writes
Standby-Semi15452semi-syncHighBalanced
Standby-Async15462asyncEventualFast writes

Step 2.1: Check Replication Status on Primary

Terminal window
# Connect to primary and check standbys
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM pg_replication_standbys;
"

Expected output:

node_id | sync_mode | state | flush_lsn | apply_lsn | lag_ms
------------------+-----------+-----------+-----------+-----------+--------
standby-sync | sync | streaming | 1000 | 1000 | 0
standby-semisync | semi-sync | streaming | 1000 | 1000 | 5
standby-async | async | streaming | 995 | 990 | 15

Step 2.2: Check Primary Status from a Standby

Terminal window
# Connect to standby-sync and check primary connection
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT * FROM pg_replication_primary;
"

CHECKPOINT 2: Verify Replication Status

Run the verification script:

Terminal window
./ha_interactive_tutorial.sh checkpoint2

Or manually verify:

Terminal window
# Check primary's view of standbys
echo "=== PRIMARY'S VIEW OF STANDBYS ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT
node_id,
sync_mode,
state,
flush_lsn,
apply_lsn,
lag_bytes,
lag_ms
FROM pg_replication_standbys;
"
# Check each standby's view of primary
for port in 15442 15452 15462; do
echo ""
echo "=== STANDBY ON PORT $port ==="
PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -c "
SELECT
node_id as primary_id,
state,
primary_lsn,
local_lsn,
lag_bytes
FROM pg_replication_primary;
"
done

What you should see:

  • Primary shows 3-4 connected standbys
  • Each standby shows connection to primary
  • state should be streaming for all
  • lag_ms should be low (< 100ms typically)

Take a moment to:

  1. Compare flush_lsn and apply_lsn values
  2. Note that sync standby has lowest lag
  3. Async standby may show slightly higher lag

Press Enter to continue when ready…


Part 3: Data Replication Verification

Step 3.1: Create a Test Table on Primary

Terminal window
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
CREATE TABLE IF NOT EXISTS replication_test (
id SERIAL PRIMARY KEY,
message TEXT,
sync_mode TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"

Step 3.2: Insert Data and Watch it Replicate

Terminal window
# Insert a row
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Hello from primary!', 'sync mode test');
"

CHECKPOINT 3: Verify Data Replication

Run the verification script:

Terminal window
./ha_interactive_tutorial.sh checkpoint3

Or manually verify:

Terminal window
echo "=== DATA ON PRIMARY (port 15432) ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"
echo ""
echo "=== DATA ON STANDBY-SYNC (port 15442) ==="
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"
echo ""
echo "=== DATA ON STANDBY-SEMISYNC (port 15452) ==="
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"
echo ""
echo "=== DATA ON STANDBY-ASYNC (port 15462) ==="
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"

What you should see:

  • Identical data on ALL nodes (primary + 3 standbys)
  • Same row count, same values
  • Timestamps are identical (replicated, not re-generated)

Take a moment to:

  1. Insert more rows and repeat the check
  2. Note how fast replication occurs (typically < 100ms)
  3. Try inserting from different connections simultaneously

Press Enter to continue when ready…


Part 4: Transparent Write Routing (TWR)

TWR allows you to connect to ANY node (including standbys) and have writes automatically routed to the primary.

Step 4.1: Connect to a Standby and Try to Write

Without TWR, writing to a standby would fail. With TWR enabled on sync/semi-sync standbys, writes are transparently forwarded.

Terminal window
# Connect to STANDBY-SYNC and insert data
# This should work because TWR forwards to primary!
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Written via standby-sync with TWR!', 'TWR test');
"

Step 4.2: Verify the Write Went to Primary

Terminal window
# Check data on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT id, message, sync_mode FROM replication_test WHERE message LIKE '%TWR%';
"

CHECKPOINT 4: Verify TWR Functionality

Run the verification script:

Terminal window
./ha_interactive_tutorial.sh checkpoint4

Or manually verify:

Terminal window
echo "=== TWR TEST: Writing to each node ==="
# Write via standby-sync
echo "1. Writing via standby-sync (port 15442)..."
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-sync', 'sync');
" 2>&1
# Write via standby-semisync
echo "2. Writing via standby-semisync (port 15452)..."
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-semisync', 'semi-sync');
" 2>&1
# Write via standby-async (should fail - async doesn't support TWR)
echo "3. Writing via standby-async (port 15462)..."
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-async', 'async');
" 2>&1
echo ""
echo "=== FINAL DATA CHECK ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT id, message, sync_mode FROM replication_test ORDER BY id;
"

What you should see:

  • Writes via sync and semi-sync standbys: SUCCESS (TWR forwarded to primary)
  • Write via async standby: REJECTED (async mode doesn’t support TWR)
  • All successful writes appear on primary and replicate to all standbys

Take a moment to:

  1. Notice the latency difference when writing via standby vs direct to primary
  2. Understand why async doesn’t support TWR (data consistency)
  3. Consider use cases for TWR (simplified application configuration)

Press Enter to continue when ready…


Part 5: Failover & Recovery

This section demonstrates what happens when the primary becomes unavailable.

Step 5.1: Start a Background Workload

In one terminal, start a continuous workload:

Terminal window
./pg_workload.sh --duration 120 --interval 2 | tee /tmp/failover_test.log

Step 5.2: Simulate Primary Failure

In another terminal, stop the primary:

Terminal window
# Stop the primary (simulate crash)
docker compose -f docker-compose.ha-cluster.yml stop primary

Step 5.3: Observe Workload Behavior

Watch the workload terminal. You should see:

  • Writes pause briefly (write timeout kicks in)
  • Reads continue (routed to standbys)
  • After ~30 seconds, write operations timeout if primary doesn’t return

CHECKPOINT 5: Verify Failover Behavior

Run the verification script:

Terminal window
./ha_interactive_tutorial.sh checkpoint5

Or manually verify (while primary is stopped):

Terminal window
echo "=== CLUSTER STATUS WITH PRIMARY DOWN ==="
echo ""
echo "1. Primary (port 15432) - should fail:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 1" 2>&1 | head -3
echo ""
echo "2. Standby-sync (port 15442) - should work (read-only):"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT COUNT(*) as row_count FROM replication_test" 2>&1
echo ""
echo "3. Via Proxy (port 15400) - READ should work:"
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "SELECT COUNT(*) as row_count FROM replication_test" 2>&1
echo ""
echo "4. Via Proxy (port 15400) - WRITE will timeout:"
echo " (This will wait up to 30s for primary to return)"
timeout 35 bash -c 'PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "INSERT INTO replication_test (message) VALUES ('\''failover test'\'')"' 2>&1 || echo " Write timed out (expected)"

What you should see:

  • Primary connection: REFUSED (container stopped)
  • Standby connections: WORKING (read-only)
  • Proxy reads: WORKING (routed to standbys)
  • Proxy writes: TIMEOUT after ~30 seconds (waiting for primary)

Take a moment to:

  1. Check the workload log for timing patterns
  2. Notice reads continue while writes pause
  3. Understand the write timeout behavior

Press Enter to continue when ready…


Part 6: Automatic Reconnection

Step 6.1: Restart the Primary

Terminal window
# Restart the primary
docker compose -f docker-compose.ha-cluster.yml start primary

Step 6.2: Watch Standbys Reconnect

Monitor the standby logs to see reconnection:

Terminal window
# Follow standby-sync logs
docker logs -f heliosdb-standby-sync 2>&1 | grep -E "(connect|reconnect|streaming|primary)"

You should see messages like:

INFO Attempting reconnection to primary (attempt 1)...
INFO Connected to primary, starting streaming...
INFO Streaming replication established

CHECKPOINT 6: Verify Recovery and Reconnection

Run the verification script:

Terminal window
./ha_interactive_tutorial.sh checkpoint6

Or manually verify (after primary restart):

Terminal window
echo "=== WAITING FOR PRIMARY TO START ==="
sleep 10
echo ""
echo "1. Primary (port 15432) - should be back:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 'Primary RECOVERED' as status" 2>&1
echo ""
echo "2. Replication status from primary:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT
node_id,
state,
flush_lsn,
lag_ms
FROM pg_replication_standbys;
"
echo ""
echo "3. Test write capability restored:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Post-recovery write', 'recovery test')
RETURNING id, message;
"
echo ""
echo "4. Verify replication still works (check standby-sync):"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT id, message FROM replication_test WHERE message = 'Post-recovery write';
"

What you should see:

  • Primary: RECOVERED and accepting connections
  • All standbys: state = streaming (reconnected)
  • Writes working: New data inserted successfully
  • Replication working: Data appears on standbys

Take a moment to:

  1. Check reconnection attempts in standby logs
  2. Note the exponential backoff (5s → 10s → 20s → …)
  3. Observe how quickly full functionality is restored

Press Enter to continue when ready…


Part 7: Advanced Scenarios

Scenario 7.1: Batch Durability Test

Test that rapid commits in sync mode are preserved after failover:

Terminal window
# Insert 100 rows rapidly
for i in $(seq 1 100); do
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
"INSERT INTO replication_test (message) VALUES ('Batch row $i')" > /dev/null
done
echo "Inserted 100 rows"
# Check count on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
"SELECT COUNT(*) FROM replication_test WHERE message LIKE 'Batch row%'"
# Stop primary
docker compose -f docker-compose.ha-cluster.yml stop primary
# Check count on standby-sync (should match!)
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c \
"SELECT COUNT(*) FROM replication_test WHERE message LIKE 'Batch row%'"
# Restart primary
docker compose -f docker-compose.ha-cluster.yml start primary

Expected: All 100 rows are preserved on the sync standby.

Scenario 7.2: Read Scaling Test

Observe how reads are distributed across nodes through the proxy:

Terminal window
# Run 10 reads through proxy and check routing
for i in $(seq 1 10); do
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
"SELECT '$i' as query_num, current_timestamp" 2>/dev/null
done
# Check proxy logs for routing decisions
docker logs heliosdb-proxy 2>&1 | tail -20 | grep -E "(routing|selected|backend)"

Scenario 7.3: Connection Through Proxy During Failover

Terminal window
# Start continuous reads through proxy in background
(for i in $(seq 1 60); do
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -t -c \
"SELECT COUNT(*) FROM replication_test" 2>&1 | tr -d ' '
sleep 1
done) &
READER_PID=$!
# Wait, then stop primary
sleep 10
docker compose -f docker-compose.ha-cluster.yml stop primary
# Wait during outage
sleep 20
# Restart primary
docker compose -f docker-compose.ha-cluster.yml start primary
# Wait for reads to complete
wait $READER_PID
# All reads should have succeeded (routed to standbys)
echo "Reader completed - check output above for continuous counts"

CHECKPOINT 7: Final Verification

Run the final verification:

Terminal window
./ha_interactive_tutorial.sh checkpoint7

Or manually verify:

Terminal window
echo "╔══════════════════════════════════════════════════════════════╗"
echo "║ FINAL CLUSTER VERIFICATION ║"
echo "╚══════════════════════════════════════════════════════════════╝"
echo ""
echo "1. Cluster Node Status:"
echo " ─────────────────────────────────────────"
for port in 15432 15442 15452 15462; do
result=$(PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -t -c "SELECT 1" 2>&1)
if [[ "$result" == *"1"* ]]; then
echo " Port $port: UP"
else
echo " Port $port: DOWN"
fi
done
echo ""
echo "2. Proxy Status:"
curl -s http://localhost:19090/health 2>/dev/null && echo " Proxy: HEALTHY" || echo " Proxy: UNHEALTHY"
echo ""
echo "3. Replication Status (from primary):"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT
node_id,
sync_mode,
state,
lag_ms
FROM pg_replication_standbys
ORDER BY node_id;
"
echo ""
echo "4. Data Consistency Check:"
PRIMARY_COUNT=$(PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -t -c "SELECT COUNT(*) FROM replication_test" 2>/dev/null | tr -d ' ')
SYNC_COUNT=$(PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -t -c "SELECT COUNT(*) FROM replication_test" 2>/dev/null | tr -d ' ')
echo " Primary count: $PRIMARY_COUNT"
echo " Standby-sync count: $SYNC_COUNT"
if [ "$PRIMARY_COUNT" == "$SYNC_COUNT" ]; then
echo " Status: CONSISTENT"
else
echo " Status: INCONSISTENT (may need to wait for replication)"
fi
echo ""
echo "╔══════════════════════════════════════════════════════════════╗"
echo "║ TUTORIAL COMPLETE! ║"
echo "╚══════════════════════════════════════════════════════════════╝"

Quick Reference

Port Mappings

ServicePostgreSQLNativeHTTPAdmin
Primary154321543318080-
Standby-Sync154421544318081-
Standby-Semi154521545318082-
Standby-Async154621546318084-
Observer--18083-
Proxy15400--19090

Common Commands

Terminal window
# Start cluster
docker compose -f docker-compose.ha-cluster.yml up -d
# Stop cluster
docker compose -f docker-compose.ha-cluster.yml down
# View logs (all)
docker compose -f docker-compose.ha-cluster.yml logs -f
# View logs (specific node)
docker logs -f heliosdb-standby-sync
# Restart specific node
docker compose -f docker-compose.ha-cluster.yml restart standby-sync
# Connect via proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb
# Connect directly to primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb
# Run monitor
./monitor_cluster.sh
# Run interactive tutorial
./ha_interactive_tutorial.sh

Key System Views

ViewPurpose
pg_replication_statusNode’s role and configuration
pg_replication_standbysConnected standbys (primary only)
pg_replication_primaryPrimary connection info (standby only)
pg_replication_metricsPerformance counters

Cleanup

Terminal window
# Stop and remove all containers and volumes
docker compose -f docker-compose.ha-cluster.yml down -v

Summary: Key Takeaways

  1. Sync Mode Guarantees Zero Data Loss: When writing in sync mode, the transaction is only committed after at least one sync standby acknowledges receipt.

  2. TWR Simplifies Application Design: Applications can connect to any sync/semi-sync node and writes are automatically forwarded to the primary.

  3. Automatic Reconnection Provides Self-Healing: Standbys automatically reconnect to the primary with exponential backoff after connection loss.

  4. The Proxy Provides Seamless Failover: Applications connecting through HeliosProxy experience minimal disruption during primary outages - reads continue immediately, writes wait for primary recovery.

  5. Multiple Sync Modes Offer Flexibility: Choose between performance (async), balance (semi-sync), and durability (sync) based on your application’s needs.


Next Steps