Raft Consensus Setup — Deploy a 3-Node HeliosDB Cluster
Crate: heliosdb-cluster (cluster coordinator + Raft consensus + health monitoring + failover)
Status: Production
Source: heliosdb-cluster/src + sub-crate heliosdb-cluster/crates/multi-master
UVP
A single HeliosDB node is fast, but it’s also a single point of failure. The heliosdb-cluster crate ships a built-in Raft consensus implementation — leader election, log replication, automatic failover — wrapped behind one struct: ClusterCoordinator. Point three nodes at each other, set replication_factor: 2, and you get sub-second failover, quorum-protected writes, and live cluster metrics out of the box. No ZooKeeper. No etcd. No external coordinator. Same binary, same protocol, same SQL — just three of them.
Prerequisites
- Three Linux hosts (or three Docker containers on a shared bridge)
- HeliosDB Full v7.x+ binary on each
- Bidirectional TCP between nodes on the cluster port (default 5432)
- About 15 minutes
1. The ClusterCoordinator Surface
The whole API for bringing a cluster up lives in one struct. This is the canonical Quick Start from the crate’s README:
```rust
use heliosdb_cluster::prelude::*;

let config = ClusterConfig {
    node_id: "node1".into(),
    listen_addr: "0.0.0.0:5432".into(),
    peer_addrs: vec![
        "node2:5432".into(),
        "node3:5432".into(),
    ],
    ..Default::default()
};

let cluster = ClusterCoordinator::new(config).await?;
cluster.start().await?;

let state = cluster.get_state().await;
println!("Leader: {:?}", state.leader);
println!("Term: {}", state.term);
println!("Nodes: {}", state.nodes.len());
```
That’s it. ClusterCoordinator::new wires up three internal components in one go:
| Component | Responsibility |
|---|---|
| Consensus Manager (Raft) | Leader election, log replication, term management |
| Health Checker | Periodic node health checks, alert generation |
| Failover Coordinator | Failure detection, automatic failover, quorum management |
2. Pick Your Three Hosts
Before you write a config file, decide:
- Node IDs — short, stable, unique strings (node1, node2, node3).
- Listen addresses — what each node binds to (often 0.0.0.0:5432).
- Peer addresses — DNS names or IPs each node uses to reach the others.
- Replication factor — how many copies of each log entry must exist before a write is acknowledged. RF=2 with 3 nodes means every write lands on 2 of the 3 nodes before it’s acknowledged.
Quorum math is fixed: floor(N/2) + 1 nodes must be reachable to elect a leader and accept writes. With 3 nodes that’s 2.
| Cluster size | Tolerates | Quorum |
|---|---|---|
| 3 | 1 failure | 2 |
| 5 | 2 failures | 3 |
| 7 | 3 failures | 4 |
Even-sized clusters are wasteful: a 4-node cluster needs 3 nodes for quorum yet still tolerates only one failure, the same as a 3-node cluster.
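If it helps to see the table as executable arithmetic, here is a small illustrative helper (not part of the crate; the function names are made up for this sketch):

```rust
/// Quorum size for an N-node Raft group: floor(N/2) + 1.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Failures the group can absorb while still holding quorum.
fn tolerated_failures(n: usize) -> usize {
    n - quorum(n)
}

#[test]
fn quorum_table() {
    assert_eq!((quorum(3), tolerated_failures(3)), (2, 1));
    assert_eq!((quorum(5), tolerated_failures(5)), (3, 2));
    assert_eq!((quorum(7), tolerated_failures(7)), (4, 3));
    // A 4-node cluster needs 3 of 4 for quorum yet still tolerates only 1 failure.
    assert_eq!((quorum(4), tolerated_failures(4)), (3, 1));
}
```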
3. The Default Timings
```rust
let config = ClusterConfig {
    node_id: "node1".into(),
    listen_addr: "0.0.0.0:5432".into(),
    peer_addrs: vec!["node2:5432".into(), "node3:5432".into()],
    election_timeout_ms: 300,
    heartbeat_interval_ms: 100,
    replication_factor: 2,
};
```
Two knobs matter:
- heartbeat_interval_ms: 100 — how often the leader pings followers. Smaller = faster failure detection, more network chatter.
- election_timeout_ms: 300 — how long a follower waits without a heartbeat before starting a new election. Must be greater than heartbeat_interval_ms (typically 3-5x), with jitter so two followers don’t time out simultaneously.
The defaults give you sub-second failover on a healthy LAN. For cross-DC, multiply both by 5-10x to absorb latency variance — see multi-region-active-active.md.
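As a rough starting point for cross-DC timings (a sketch, not a crate-provided preset, and the peer hostnames are placeholders), scale the defaults and then tune against measured inter-DC round trips:

```rust
// Cross-DC sketch: roughly 5x the LAN defaults; adjust to your measured RTTs.
let config = ClusterConfig {
    node_id: "node1".into(),
    listen_addr: "0.0.0.0:5432".into(),
    peer_addrs: vec![
        "dc2.example.internal:5432".into(), // placeholder hostnames
        "dc3.example.internal:5432".into(),
    ],
    heartbeat_interval_ms: 500,  // 5x the 100 ms default
    election_timeout_ms: 1_500,  // keeps the 3-5x margin over the heartbeat
    ..Default::default()
};
```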
4. Bring Up the Cluster
Node 1 (node1)
```rust
use heliosdb_cluster::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ClusterConfig {
        node_id: "node1".into(),
        listen_addr: "0.0.0.0:5432".into(),
        peer_addrs: vec![
            "10.0.0.2:5432".into(),
            "10.0.0.3:5432".into(),
        ],
        ..Default::default()
    };

    let cluster = ClusterCoordinator::new(config).await?;
    cluster.start().await?;

    println!("node1 up — waiting for quorum...");
    tokio::signal::ctrl_c().await?;
    Ok(())
}
```
Nodes 2 and 3
Identical, with node_id and peer_addrs rotated. Boot order doesn’t matter — the first two to see each other will elect a leader, and the third will join as a follower.
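For example, node2’s config (assuming node1 sits at 10.0.0.1, mirroring the addresses above) looks like this:

```rust
// node2: same shape as node1, with node_id and peer_addrs rotated.
let config = ClusterConfig {
    node_id: "node2".into(),
    listen_addr: "0.0.0.0:5432".into(),
    peer_addrs: vec![
        "10.0.0.1:5432".into(), // node1 (assumed address)
        "10.0.0.3:5432".into(), // node3
    ],
    ..Default::default()
};
```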
5. Verify Election
From any node:
```rust
let state = cluster.get_state().await;

assert!(state.leader.is_some());         // exactly one leader
assert_eq!(state.nodes.len(), 3);        // all three nodes seen
println!("Leader: {:?}", state.leader);  // e.g. "node2"
println!("Term: {}", state.term);        // monotonically increasing
```
state.term is the Raft term number. It increases every time an election happens. In a healthy cluster with no failures it stays put.
6. Read the Cluster Metrics
```rust
let metrics = cluster.get_metrics();

println!("Consensus: {:?}", metrics.consensus_metrics);
println!("Health: {:?}", metrics.health_metrics);
```
consensus_metrics includes:
- log entries appended
- log entries committed
- last applied index
- election count
health_metrics includes:
- last heartbeat per peer
- node status (Healthy / Degraded / Unreachable)
These are also exposed via the standard observability surface — see Operations Guide for Prometheus and OpenTelemetry wiring.
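For a quick look without the full observability stack, a background polling task is enough. A minimal sketch, assuming you wrap the coordinator in an Arc so a spawned task can read it:

```rust
use std::{sync::Arc, time::Duration};

// Assumes: let cluster = Arc::new(ClusterCoordinator::new(config).await?);
let reader = Arc::clone(&cluster);
tokio::spawn(async move {
    let mut tick = tokio::time::interval(Duration::from_secs(10));
    loop {
        tick.tick().await;
        let m = reader.get_metrics();
        println!("consensus: {:?} | health: {:?}", m.consensus_metrics, m.health_metrics);
    }
});
```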
7. Trigger a Failover (Drill)
The simplest disaster drill is to kill the leader and watch a follower take over.
```bash
# On the leader host
sudo systemctl stop heliosdb
```
On either remaining node:
```rust
loop {
    let state = cluster.get_state().await;
    if state.leader.as_deref() != Some("node-that-just-died") {
        println!("Failover complete. New leader: {:?}", state.leader);
        break;
    }
    tokio::time::sleep(std::time::Duration::from_millis(50)).await;
}
```
With default timings the new leader is elected in 300-600 ms. Bring the dead node back up — it rejoins as a follower and catches up via log replication.
8. Replication Factor and Durability
replication_factor: 2 means:
- A write is acknowledged once it’s on the leader plus one follower.
- You can lose any one node and not lose data.
- A network partition that isolates the leader from both followers will cause writes to stall (quorum lost) — the leader steps down, no split-brain.
Bumping to RF=3 (write to all three before ack) gives stronger durability at the cost of latency — every write waits for the slowest follower.
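If your latency budget allows it, the stricter setting is a one-line change (same shape as the configs above):

```rust
// Acknowledge writes only after all three replicas hold the entry.
// Write latency now tracks the slowest follower, so benchmark before adopting.
let config = ClusterConfig {
    node_id: "node1".into(),
    listen_addr: "0.0.0.0:5432".into(),
    peer_addrs: vec!["node2:5432".into(), "node3:5432".into()],
    replication_factor: 3,
    ..Default::default()
};
```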
9. Adding a Fourth Node Later
Live membership changes go through the same Raft log. Bring the new node up with all three existing peers in peer_addrs, and existing nodes will pick it up via gossip + Raft membership change. You don’t need to restart the cluster.
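A sketch of the new node’s config (the node_id and addresses are illustrative; the key point is that peer_addrs lists all three existing members):

```rust
// node4 joins a running cluster: point it at every existing member and start it.
let config = ClusterConfig {
    node_id: "node4".into(),
    listen_addr: "0.0.0.0:5432".into(),
    peer_addrs: vec![
        "10.0.0.1:5432".into(),
        "10.0.0.2:5432".into(),
        "10.0.0.3:5432".into(),
    ],
    ..Default::default()
};

let cluster = ClusterCoordinator::new(config).await?;
cluster.start().await?; // existing members add it via a Raft membership change
```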
Cross-Region Deployments
Three nodes in one DC tolerate host failures; three nodes spread across three DCs tolerate a DC failure but pay cross-DC latency on every write. For active-active multi-region with sub-100ms cross-region writes, use the dedicated heliosdb-cluster/crates/active-active subsystem — see multi-region-active-active.md.
Where Next
- multi-region-active-active.md — cross-region writes with multiple consistency models.
- sharding-config.md — scale beyond a single Raft group.
- pitr-recovery.md — point-in-time recovery on top of the Raft log.
- heliosdb-cluster/README.md — full crate reference.