HeliosDB Nano High Availability Interactive Tutorial
This hands-on tutorial walks you through HeliosDB Nano’s High Availability features with interactive checkpoints where you can pause, observe, and verify cluster behavior in real-time.
Table of Contents
- Features & Benefits Overview
- Prerequisites
- Part 1: Starting the Cluster
- Part 2: Understanding Replication Modes
- Part 3: Data Replication Verification
- Part 4: Transparent Write Routing (TWR)
- Part 5: Failover & Recovery
- Part 6: Automatic Reconnection
- Part 7: Advanced Scenarios
- Quick Reference
Features & Benefits Overview
What is HeliosDB Nano HA?
HeliosDB Nano provides enterprise-grade High Availability through WAL (Write-Ahead Log) streaming replication with multiple sync modes, automatic failover detection, and intelligent connection routing.
Key Features
| Feature | Description | Benefit |
|---|---|---|
| WAL Streaming Replication | Continuous log-based replication from primary to standbys | Near-zero data loss, real-time sync |
| Multiple Sync Modes | Sync, Semi-sync, and Async modes | Balance between durability and performance |
| Transparent Write Routing (TWR) | Writes auto-route to primary from any node | Simplified application connectivity |
| Automatic Reconnection | Standbys auto-reconnect with exponential backoff | Self-healing cluster |
| Split-Brain Protection | Observer nodes and fencing tokens prevent dual-primary | Data consistency guaranteed |
| HeliosProxy | Intelligent connection router with health monitoring | Zero-downtime failover for applications |
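The fencing tokens mentioned in the split-brain row can be modeled in a few lines. This is an illustrative Python sketch of the general technique, not HeliosDB's implementation: each elected primary holds a strictly increasing token, and standbys refuse WAL from any sender whose token is older than the newest one they have seen.

```python
# Illustrative model of fencing-token checks (not HeliosDB's actual code).
# Each primary election issues a higher token; a standby rejects WAL from
# any sender whose token is older than the highest it has seen, so a
# stale "zombie" primary cannot keep replicating after failover.

class Standby:
    def __init__(self):
        self.highest_token = 0

    def accept_wal(self, sender_token):
        """Accept WAL only from the holder of the newest fencing token."""
        if sender_token < self.highest_token:
            return False          # stale primary: fenced off
        self.highest_token = sender_token
        return True

standby = Standby()
assert standby.accept_wal(1)      # original primary (token 1) accepted
assert standby.accept_wal(2)      # new primary elected with token 2
assert not standby.accept_wal(1)  # old primary comes back: rejected
```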
Replication Mode Comparison
The replication modes span a spectrum from performance (async) to durability (sync):

| Mode | Commit behavior | Performance | Durability |
|---|---|---|---|
| Async | Fire-and-forget | Fastest | May lose recent data |
| Semi-sync | Wait for 1 standby ACK (flush_lsn) | Balanced | Low risk |
| Sync | Wait for 1+ standby ACK (flush_lsn) | Safest | Zero loss* |

*Zero loss when at least one sync standby is available.

Architecture
```
        ┌─────────────────────────────────────┐
        │             APPLICATION             │
        └─────────────────┬───────────────────┘
                          │
        ┌─────────────────▼───────────────────┐
        │             HELIOSPROXY             │
        │ • Automatic primary detection       │
        │ • Health-based routing              │
        │ • Write timeout (waits for primary) │
        └───────┬───────────┬───────────┬─────┘
                │           │           │
  ┌─────────────▼───┐ ┌─────▼─────┐ ┌──▼───────────┐
  │     PRIMARY     │ │  STANDBY  │ │   STANDBY    │
  │  (Read/Write)   │ │   SYNC    │ │    ASYNC     │
  │                 │ │           │ │              │
  │  Port: 15432    │ │   15442   │ │    15462     │
  └────────┬────────┘ └─────┬─────┘ └──────┬───────┘
           │                │              │
           └────────────────┴──────────────┘
                WAL Streaming Replication
```
Prerequisites
Before starting, ensure you have:
```shell
# Docker and Docker Compose
docker --version         # 20.10+
docker compose version   # 2.0+

# psql client (for verification)
psql --version           # Any version

# curl (for API checks)
curl --version

# jq (optional, for JSON formatting)
jq --version
```
Part 1: Starting the Cluster
Step 1.1: Navigate to the Docker Test Directory
```shell
cd "/home/app/HeliosDB Nano/tests/docker"
```
Step 1.2: Build and Start the Cluster
```shell
# Build with fresh binaries (recommended)
docker compose -f docker-compose.ha-cluster.yml build --no-cache

# Start all services
docker compose -f docker-compose.ha-cluster.yml up -d
```
Wait approximately 30-45 seconds for all nodes to start and establish replication.
CHECKPOINT 1: Verify Cluster Startup
Open a new terminal and run:
```shell
cd "/home/app/HeliosDB Nano/tests/docker"
./ha_interactive_tutorial.sh checkpoint1
```
Or manually verify:
```shell
# Check all containers are running
docker compose -f docker-compose.ha-cluster.yml ps

# Expected output (all should show "Up" or "healthy"):
# NAME                        STATUS
# heliosdb-primary            Up (healthy)
# heliosdb-standby-sync       Up (healthy)
# heliosdb-standby-semisync   Up (healthy)
# heliosdb-standby-async      Up (healthy)
# heliosdb-observer           Up
# heliosdb-proxy              Up (healthy)

# Verify each node is accessible
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 'Primary OK' as status"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT 'Standby-Sync OK' as status"
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "SELECT 'Standby-Semi OK' as status"
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "SELECT 'Standby-Async OK' as status"
```
What you should see:
- All 6 containers running
- Each node responds to queries
- Primary accepts reads and writes, standbys are read-only
Take a moment to:
- Run the monitor script: `./monitor_cluster.sh`
- Observe all nodes showing "UP"
- Note the latency values for each node
Press Enter to continue when ready…
Part 2: Understanding Replication Modes
The Cluster Configuration
Our test cluster has 4 database nodes with different sync modes:
| Node | Port | Sync Mode | Durability | Performance |
|---|---|---|---|---|
| Primary | 15432 | N/A | Source | Fastest |
| Standby-Sync | 15442 | sync | Guaranteed | Slower writes |
| Standby-Semi | 15452 | semi-sync | High | Balanced |
| Standby-Async | 15462 | async | Eventual | Fast writes |
Step 2.1: Check Replication Status on Primary
```shell
# Connect to primary and check standbys
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT * FROM pg_replication_standbys;"
```
Expected output:
```
 node_id          | sync_mode | state     | flush_lsn | apply_lsn | lag_ms
------------------+-----------+-----------+-----------+-----------+--------
 standby-sync     | sync      | streaming | 1000      | 1000      | 0
 standby-semisync | semi-sync | streaming | 1000      | 1000      | 5
 standby-async    | async     | streaming | 995       | 990       | 15
```
Step 2.2: Check Primary Status from a Standby
```shell
# Connect to standby-sync and check primary connection
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT * FROM pg_replication_primary;"
```
CHECKPOINT 2: Verify Replication Status
Run the verification script:
```shell
./ha_interactive_tutorial.sh checkpoint2
```
Or manually verify:
```shell
# Check primary's view of standbys
echo "=== PRIMARY'S VIEW OF STANDBYS ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT node_id, sync_mode, state, flush_lsn, apply_lsn, lag_bytes, lag_ms
FROM pg_replication_standbys;"

# Check each standby's view of primary
for port in 15442 15452 15462; do
  echo ""
  echo "=== STANDBY ON PORT $port ==="
  PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -c "
    SELECT node_id as primary_id, state, primary_lsn, local_lsn, lag_bytes
    FROM pg_replication_primary;
  "
done
```
What you should see:
- Primary shows 3-4 connected standbys
- Each standby shows connection to primary
- `state` should be `streaming` for all
- `lag_ms` should be low (< 100ms typically)
Take a moment to:
- Compare flush_lsn and apply_lsn values
- Note that sync standby has lowest lag
- Async standby may show slightly higher lag
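The lag columns you are comparing come directly from LSN arithmetic: an LSN is a byte position in the WAL stream, so the primary's position minus a standby's flush_lsn gives lag in bytes. A minimal Python sketch using the sample values from Step 2.1:

```python
# LSNs are byte offsets into the WAL stream, so lag in bytes is just a
# subtraction. The values below mirror the Step 2.1 sample output.

def lag_bytes(primary_lsn, standby_flush_lsn):
    return max(0, primary_lsn - standby_flush_lsn)

primary_lsn = 1000
flush_positions = {
    "standby-sync": 1000,
    "standby-semisync": 1000,
    "standby-async": 995,
}

for node, flush_lsn in flush_positions.items():
    print(f"{node}: {lag_bytes(primary_lsn, flush_lsn)} bytes behind")
# standby-async trails by 5 bytes; the other two are fully caught up
```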
Press Enter to continue when ready…
Part 3: Data Replication Verification
Step 3.1: Create a Test Table on Primary
```shell
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
CREATE TABLE IF NOT EXISTS replication_test (
  id SERIAL PRIMARY KEY,
  message TEXT,
  sync_mode TEXT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);"
```
Step 3.2: Insert Data and Watch it Replicate
```shell
# Insert a row
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Hello from primary!', 'sync mode test');"
```
CHECKPOINT 3: Verify Data Replication
Run the verification script:
```shell
./ha_interactive_tutorial.sh checkpoint3
```
Or manually verify:
```shell
echo "=== DATA ON PRIMARY (port 15432) ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT * FROM replication_test ORDER BY id;"

echo ""
echo "=== DATA ON STANDBY-SYNC (port 15442) ==="
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT * FROM replication_test ORDER BY id;"

echo ""
echo "=== DATA ON STANDBY-SEMISYNC (port 15452) ==="
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "SELECT * FROM replication_test ORDER BY id;"

echo ""
echo "=== DATA ON STANDBY-ASYNC (port 15462) ==="
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "SELECT * FROM replication_test ORDER BY id;"
```
What you should see:
- Identical data on ALL nodes (primary + 3 standbys)
- Same row count, same values
- Timestamps are identical (replicated, not re-generated)
Take a moment to:
- Insert more rows and repeat the check
- Note how fast replication occurs (typically < 100ms)
- Try inserting from different connections simultaneously
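The identical timestamps are worth dwelling on: the note above says they are replicated, not re-generated, because WAL shipping copies the already-materialized row, so a default like CURRENT_TIMESTAMP is evaluated exactly once, on the primary. A toy Python contrast (illustrative only, not HeliosDB internals) between row shipping and naive statement replay:

```python
# Toy contrast between WAL-style row shipping (copies the materialized
# row) and statement replay (re-evaluates defaults), illustrating why
# replicated timestamps are identical across nodes.
import itertools

clock = itertools.count(100)  # fake clock: each call returns a later "time"

def insert_on_primary():
    # The created_at default is evaluated here, at insert time.
    return {"message": "hello", "created_at": next(clock)}

row = insert_on_primary()

wal_replica = dict(row)                  # row shipping: copy the row as-is
statement_replica = insert_on_primary()  # statement replay: default re-run

assert wal_replica["created_at"] == row["created_at"]        # identical
assert statement_replica["created_at"] != row["created_at"]  # drifted
```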
Press Enter to continue when ready…
Part 4: Transparent Write Routing (TWR)
TWR allows you to connect to ANY node (including standbys) and have writes automatically routed to the primary.
Step 4.1: Connect to a Standby and Try to Write
Without TWR, writing to a standby would fail. With TWR enabled on sync/semi-sync standbys, writes are transparently forwarded.
```shell
# Connect to STANDBY-SYNC and insert data
# This should work because TWR forwards to primary!
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Written via standby-sync with TWR!', 'TWR test');"
```
Step 4.2: Verify the Write Went to Primary
```shell
# Check data on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT id, message, sync_mode FROM replication_test WHERE message LIKE '%TWR%';"
```
CHECKPOINT 4: Verify TWR Functionality
Run the verification script:
```shell
./ha_interactive_tutorial.sh checkpoint4
```
Or manually verify:
```shell
echo "=== TWR TEST: Writing to each node ==="

# Write via standby-sync
echo "1. Writing via standby-sync (port 15442)..."
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-sync', 'sync');" 2>&1

# Write via standby-semisync
echo "2. Writing via standby-semisync (port 15452)..."
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-semisync', 'semi-sync');" 2>&1

# Write via standby-async (should fail - async doesn't support TWR)
echo "3. Writing via standby-async (port 15462)..."
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-async', 'async');" 2>&1

echo ""
echo "=== FINAL DATA CHECK ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT id, message, sync_mode FROM replication_test ORDER BY id;"
```
What you should see:
- Writes via sync and semi-sync standbys: SUCCESS (TWR forwarded to primary)
- Write via async standby: REJECTED (async mode doesn’t support TWR)
- All successful writes appear on primary and replicate to all standbys
Take a moment to:
- Notice the latency difference when writing via standby vs direct to primary
- Understand why async doesn’t support TWR (data consistency)
- Consider use cases for TWR (simplified application configuration)
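The behavior you just observed reduces to a small routing rule. The sketch below is a hypothetical Python restatement of that rule, not HeliosDB's source: only sync and semi-sync standbys forward writes, since forwarding from a lagging async node would undermine the consistency guarantees noted above.

```python
# Hypothetical restatement of the TWR rule observed in Checkpoint 4
# (not HeliosDB's actual code).

def route_write(node_role, sync_mode=None):
    if node_role == "primary":
        return "execute-locally"
    if sync_mode in ("sync", "semi-sync"):
        return "forward-to-primary"
    return "reject"  # async standbys refuse writes

assert route_write("primary") == "execute-locally"
assert route_write("standby", "sync") == "forward-to-primary"
assert route_write("standby", "semi-sync") == "forward-to-primary"
assert route_write("standby", "async") == "reject"
```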
Press Enter to continue when ready…
Part 5: Failover & Recovery
This section demonstrates what happens when the primary becomes unavailable.
Step 5.1: Start a Background Workload
In one terminal, start a continuous workload:
```shell
./pg_workload.sh --duration 120 --interval 2 | tee /tmp/failover_test.log
```
Step 5.2: Simulate Primary Failure
In another terminal, stop the primary:
```shell
# Stop the primary (simulate crash)
docker compose -f docker-compose.ha-cluster.yml stop primary
```
Step 5.3: Observe Workload Behavior
Watch the workload terminal. You should see:
- Writes pause briefly (write timeout kicks in)
- Reads continue (routed to standbys)
- After ~30 seconds, write operations time out if the primary doesn't return
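Conceptually, the proxy produces this pause-then-timeout pattern by holding each write and polling for a healthy primary until a deadline. A simplified Python sketch; the 30 s default matches this cluster's observed behavior, but the poll interval is an assumption for illustration:

```python
# Simplified model of the proxy's write-holding loop (illustrative only):
# poll for a healthy primary until the deadline, then report a timeout.
import time

def hold_write(primary_available, timeout_s=30.0, poll_s=0.5,
               clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout_s
    while clock() < deadline:
        if primary_available():
            return "routed-to-primary"
        sleep(poll_s)
    return "timed-out"

# Simulated outage: the primary comes back on the 4th health check.
checks = {"count": 0}
def primary_available():
    checks["count"] += 1
    return checks["count"] >= 4

result = hold_write(primary_available, poll_s=0, sleep=lambda s: None)
print(result)  # routed-to-primary
```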
CHECKPOINT 5: Verify Failover Behavior
Run the verification script:
```shell
./ha_interactive_tutorial.sh checkpoint5
```
Or manually verify (while primary is stopped):
```shell
echo "=== CLUSTER STATUS WITH PRIMARY DOWN ==="
echo ""

echo "1. Primary (port 15432) - should fail:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 1" 2>&1 | head -3

echo ""
echo "2. Standby-sync (port 15442) - should work (read-only):"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT COUNT(*) as row_count FROM replication_test" 2>&1

echo ""
echo "3. Via Proxy (port 15400) - READ should work:"
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "SELECT COUNT(*) as row_count FROM replication_test" 2>&1

echo ""
echo "4. Via Proxy (port 15400) - WRITE will timeout:"
echo "   (This will wait up to 30s for primary to return)"
timeout 35 bash -c 'PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "INSERT INTO replication_test (message) VALUES ('\''failover test'\'')"' 2>&1 || echo "   Write timed out (expected)"
```
What you should see:
- Primary connection: REFUSED (container stopped)
- Standby connections: WORKING (read-only)
- Proxy reads: WORKING (routed to standbys)
- Proxy writes: TIMEOUT after ~30 seconds (waiting for primary)
Take a moment to:
- Check the workload log for timing patterns
- Notice reads continue while writes pause
- Understand the write timeout behavior
Press Enter to continue when ready…
Part 6: Automatic Reconnection
Step 6.1: Restart the Primary
```shell
# Restart the primary
docker compose -f docker-compose.ha-cluster.yml start primary
```
Step 6.2: Watch Standbys Reconnect
Monitor the standby logs to see reconnection:
```shell
# Follow standby-sync logs
docker logs -f heliosdb-standby-sync 2>&1 | grep -E "(connect|reconnect|streaming|primary)"
```
You should see messages like:
```
INFO  Attempting reconnection to primary (attempt 1)...
INFO  Connected to primary, starting streaming...
INFO  Streaming replication established
```
CHECKPOINT 6: Verify Recovery and Reconnection
Run the verification script:
```shell
./ha_interactive_tutorial.sh checkpoint6
```
Or manually verify (after primary restart):
```shell
echo "=== WAITING FOR PRIMARY TO START ==="
sleep 10

echo ""
echo "1. Primary (port 15432) - should be back:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 'Primary RECOVERED' as status" 2>&1

echo ""
echo "2. Replication status from primary:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT node_id, state, flush_lsn, lag_ms
FROM pg_replication_standbys;"

echo ""
echo "3. Test write capability restored:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Post-recovery write', 'recovery test')
RETURNING id, message;"

echo ""
echo "4. Verify replication still works (check standby-sync):"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT id, message FROM replication_test WHERE message = 'Post-recovery write';"
```
What you should see:
- Primary: RECOVERED and accepting connections
- All standbys: `state = streaming` (reconnected)
- Writes working: New data inserted successfully
- Replication working: Data appears on standbys
Take a moment to:
- Check reconnection attempts in standby logs
- Note the exponential backoff (5s → 10s → 20s → …)
- Observe how quickly full functionality is restored
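The backoff pattern above is plain doubling from a 5 s base, normally with a ceiling so retries do not grow unbounded. A short Python sketch; the 60 s cap here is an assumption for illustration, not a documented HeliosDB value:

```python
# Exponential backoff schedule matching the observed 5s -> 10s -> 20s
# doubling. The 60 s cap is an assumption for illustration; HeliosDB's
# actual ceiling may differ.

def backoff_schedule(base=5.0, cap=60.0, attempts=6):
    return [min(base * (2 ** i), cap) for i in range(attempts)]

print(backoff_schedule())
# [5.0, 10.0, 20.0, 40.0, 60.0, 60.0]
```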
Press Enter to continue when ready…
Part 7: Advanced Scenarios
Scenario 7.1: Batch Durability Test
Test that rapid commits in sync mode are preserved after failover:
```shell
# Insert 100 rows rapidly
for i in $(seq 1 100); do
  PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
    "INSERT INTO replication_test (message) VALUES ('Batch row $i')" > /dev/null
done
echo "Inserted 100 rows"

# Check count on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
  "SELECT COUNT(*) FROM replication_test WHERE message LIKE 'Batch row%'"

# Stop primary
docker compose -f docker-compose.ha-cluster.yml stop primary

# Check count on standby-sync (should match!)
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c \
  "SELECT COUNT(*) FROM replication_test WHERE message LIKE 'Batch row%'"

# Restart primary
docker compose -f docker-compose.ha-cluster.yml start primary
```
Expected: All 100 rows are preserved on the sync standby.
Scenario 7.2: Read Scaling Test
Observe how reads are distributed across nodes through the proxy:
```shell
# Run 10 reads through proxy and check routing
for i in $(seq 1 10); do
  PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
    "SELECT '$i' as query_num, current_timestamp" 2>/dev/null
done

# Check proxy logs for routing decisions
docker logs heliosdb-proxy 2>&1 | tail -20 | grep -E "(routing|selected|backend)"
```
Scenario 7.3: Connection Through Proxy During Failover
```shell
# Start continuous reads through proxy in background
(for i in $(seq 1 60); do
  PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -t -c \
    "SELECT COUNT(*) FROM replication_test" 2>&1 | tr -d ' '
  sleep 1
done) &
READER_PID=$!

# Wait, then stop primary
sleep 10
docker compose -f docker-compose.ha-cluster.yml stop primary

# Wait during outage
sleep 20

# Restart primary
docker compose -f docker-compose.ha-cluster.yml start primary

# Wait for reads to complete
wait $READER_PID

# All reads should have succeeded (routed to standbys)
echo "Reader completed - check output above for continuous counts"
```
CHECKPOINT 7: Final Verification
Run the final verification:
```shell
./ha_interactive_tutorial.sh checkpoint7
```
Or manually verify:
```shell
echo "╔══════════════════════════════════════════════════════════════╗"
echo "║                 FINAL CLUSTER VERIFICATION                   ║"
echo "╚══════════════════════════════════════════════════════════════╝"
echo ""

echo "1. Cluster Node Status:"
echo "   ─────────────────────────────────────────"
for port in 15432 15442 15452 15462; do
  result=$(PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -t -c "SELECT 1" 2>&1)
  if [[ "$result" == *"1"* ]]; then
    echo "   Port $port: UP"
  else
    echo "   Port $port: DOWN"
  fi
done

echo ""
echo "2. Proxy Status:"
curl -s http://localhost:19090/health 2>/dev/null && echo "   Proxy: HEALTHY" || echo "   Proxy: UNHEALTHY"

echo ""
echo "3. Replication Status (from primary):"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT node_id, sync_mode, state, lag_ms
FROM pg_replication_standbys
ORDER BY node_id;"

echo ""
echo "4. Data Consistency Check:"
PRIMARY_COUNT=$(PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -t -c "SELECT COUNT(*) FROM replication_test" 2>/dev/null | tr -d ' ')
SYNC_COUNT=$(PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -t -c "SELECT COUNT(*) FROM replication_test" 2>/dev/null | tr -d ' ')
echo "   Primary count:      $PRIMARY_COUNT"
echo "   Standby-sync count: $SYNC_COUNT"
if [ "$PRIMARY_COUNT" == "$SYNC_COUNT" ]; then
  echo "   Status: CONSISTENT"
else
  echo "   Status: INCONSISTENT (may need to wait for replication)"
fi

echo ""
echo "╔══════════════════════════════════════════════════════════════╗"
echo "║                     TUTORIAL COMPLETE!                       ║"
echo "╚══════════════════════════════════════════════════════════════╝"
```
Quick Reference
Port Mappings
| Service | PostgreSQL | Native | HTTP | Admin |
|---|---|---|---|---|
| Primary | 15432 | 15433 | 18080 | - |
| Standby-Sync | 15442 | 15443 | 18081 | - |
| Standby-Semi | 15452 | 15453 | 18082 | - |
| Standby-Async | 15462 | 15463 | 18084 | - |
| Observer | - | - | 18083 | - |
| Proxy | 15400 | - | - | 19090 |
Common Commands
```shell
# Start cluster
docker compose -f docker-compose.ha-cluster.yml up -d

# Stop cluster
docker compose -f docker-compose.ha-cluster.yml down

# View logs (all)
docker compose -f docker-compose.ha-cluster.yml logs -f

# View logs (specific node)
docker logs -f heliosdb-standby-sync

# Restart specific node
docker compose -f docker-compose.ha-cluster.yml restart standby-sync

# Connect via proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb

# Connect directly to primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb

# Run monitor
./monitor_cluster.sh

# Run interactive tutorial
./ha_interactive_tutorial.sh
```
Key System Views
| View | Purpose |
|---|---|
| `pg_replication_status` | Node's role and configuration |
| `pg_replication_standbys` | Connected standbys (primary only) |
| `pg_replication_primary` | Primary connection info (standby only) |
| `pg_replication_metrics` | Performance counters |
Cleanup
```shell
# Stop and remove all containers and volumes
docker compose -f docker-compose.ha-cluster.yml down -v
```
Summary: Key Takeaways
1. Sync Mode Guarantees Zero Data Loss: When writing in sync mode, the transaction is only committed after at least one sync standby acknowledges receipt.
2. TWR Simplifies Application Design: Applications can connect to any sync/semi-sync node and writes are automatically forwarded to the primary.
3. Automatic Reconnection Provides Self-Healing: Standbys automatically reconnect to the primary with exponential backoff after connection loss.
4. The Proxy Provides Seamless Failover: Applications connecting through HeliosProxy experience minimal disruption during primary outages: reads continue immediately, writes wait for primary recovery.
5. Multiple Sync Modes Offer Flexibility: Choose between performance (async), balance (semi-sync), and durability (sync) based on your application's needs.
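Takeaway 1 can be stated precisely in terms of LSNs: a commit at WAL position X is acknowledged only once at least one sync standby has flushed up to X. An illustrative Python model of that rule (not HeliosDB's code):

```python
# Illustrative model of the sync-commit rule: a commit at LSN X is
# acknowledged only once some designated sync standby has flushed WAL
# up to X. Not HeliosDB's actual implementation.

def commit_acknowledged(commit_lsn, standby_flush_lsns, sync_standbys):
    return any(standby_flush_lsns[s] >= commit_lsn for s in sync_standbys)

flush = {"standby-sync": 1000, "standby-semisync": 1000, "standby-async": 990}

assert commit_acknowledged(1000, flush, ["standby-sync"])      # flushed: ack
assert not commit_acknowledged(1005, flush, ["standby-sync"])  # still waiting
```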
Next Steps
- Review the High Availability Feature Documentation
- Explore the HA Hands-On Tutorial for more scenarios
- Check the HA Implementation Plan for roadmap details