Skip to content

Chaos Failover Stress Test

Chaos Failover Stress Test

A 5-minute stress test that runs continuous database workload while randomly killing and restarting PostgreSQL nodes. Proves HeliosProxy maintains high availability under sustained failure conditions.

Architecture

+------------------+
| HeliosProxy |
| (port 36432) |
+--------+---------+
|
+------------------+------------------+
| | |
+--------v-------+ +-------v--------+ +-------v--------+
| pg-primary | | pg-standby-sync| | pg-standby-async|
| (port 35432) | | (port 35442) | | (port 35462) |
+----------------+ +----------------+ +----------------+
  • pg-primary: Read-write primary
  • pg-standby-sync: Synchronous streaming standby
  • pg-standby-async: Asynchronous streaming standby
  • HeliosProxy: Connection router with TR, lag-aware routing, auto-failover

The chaos monkey kills nodes with weighted probability: 50% primary, 25% each standby.

Prerequisites

  • Docker and Docker Compose
  • psql (PostgreSQL client)
  • curl and python3
  • Three terminal windows

How to Run

1. Start the cluster

Terminal window
docker compose up -d --build

Wait for all services to be healthy:

Terminal window
docker compose ps

2. Open three terminals

Terminal 1 — Workload generator:

Terminal window
./workload.sh # Runs until Ctrl+C
./workload.sh 300 # Runs for 300 seconds

Terminal 2 — Chaos monkey:

Terminal window
./chaos.sh # 300s default
./chaos.sh 600 # 600s of chaos

Terminal 3 — Live dashboard:

Terminal window
./dashboard.sh

3. Observe

Watch the dashboard for:

  • Node status changes (healthy -> unhealthy -> healthy)
  • Primary failover and promotion events
  • Pool metrics during failures

Watch the workload for:

  • Failed operations during kills (should be minimal)
  • Recovery after restarts
  • Overall success rate

4. Verify

After the chaos test completes:

Terminal window
./verify.sh

This checks:

  • Total rows in the workload table
  • No gaps in the iteration sequence
  • All nodes have consistent data

Expected Results

  • Success rate: 95%+ of operations succeed despite continuous node kills
  • Failover time: Proxy detects failures within ~4 seconds (2s interval, 2 failures)
  • Data consistency: All reachable nodes converge to the same data after recovery
  • Zero data loss: Transaction Replay ensures committed transactions survive failover

Cleanup

Terminal window
docker compose down -v --remove-orphans