Raft Consensus Setup — Deploy a 3-Node HeliosDB Cluster
Crate: heliosdb-cluster (cluster coordinator + Raft consensus + health monitoring + failover)
Status: Production
Source: heliosdb-cluster/src + sub-crate heliosdb-cluster/crates/multi-master
UVP
A single HeliosDB node is fast, but it’s also a single point of failure. The heliosdb-cluster crate ships a built-in Raft consensus implementation — leader election, log replication, automatic failover — wrapped behind one struct: ClusterCoordinator. Point three nodes at each other, set replication_factor: 2, and you get sub-second failover, quorum-protected writes, and live cluster metrics out of the box. No ZooKeeper. No etcd. No external coordinator. Same binary, same protocol, same SQL — just three of them.
Prerequisites
- Three Linux hosts (or three Docker containers on a shared bridge)
- HeliosDB Full v7.x+ binary on each
- Bidirectional TCP between nodes on the cluster port (default 5432)
- About 15 minutes
1. The ClusterCoordinator Surface
The whole API for bringing a cluster up lives in one struct. This is the canonical Quick Start from the crate’s README:
```rust
use heliosdb_cluster::prelude::*;

let config = ClusterConfig {
    node_id: "node1".into(),
    listen_addr: "0.0.0.0:5432".into(),
    peer_addrs: vec![
        "node2:5432".into(),
        "node3:5432".into(),
    ],
    ..Default::default()
};

let cluster = ClusterCoordinator::new(config).await?;
cluster.start().await?;

let state = cluster.get_state().await;
println!("Leader: {:?}", state.leader);
println!("Term: {}", state.term);
println!("Nodes: {}", state.nodes.len());
```
That’s it. ClusterCoordinator::new wires up three internal components in one go:
| Component | Responsibility |
|---|---|
| Consensus Manager (Raft) | Leader election, log replication, term management |
| Health Checker | Periodic node health checks, alert generation |
| Failover Coordinator | Failure detection, automatic failover, quorum management |
2. Pick Your Three Hosts
Before you write a config file, decide:
- Node IDs — short, stable, unique strings (node1, node2, node3).
- Listen addresses — what each node binds to (often 0.0.0.0:5432).
- Peer addresses — DNS names or IPs each node uses to reach the others.
- Replication factor — how many copies of each log entry must exist before a write is acknowledged. RF=2 with 3 nodes means every write lands on 2 of the 3 nodes before it’s acknowledged.
Quorum math is fixed: floor(N/2) + 1 nodes must be reachable to elect a leader and accept writes. With 3 nodes that’s 2.
| Cluster size | Tolerates | Quorum |
|---|---|---|
| 3 | 1 failure | 2 |
| 5 | 2 failures | 3 |
| 7 | 3 failures | 4 |
Even-sized clusters are wasteful: a 4-node cluster needs 3 nodes for quorum yet still tolerates only one failure, the same as a 3-node cluster.
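If it helps to see the table as executable arithmetic, here is a small illustrative helper (not part of the crate; the function names are made up for this sketch):

```rust
/// Quorum size for an N-node Raft group: floor(N/2) + 1.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Failures the group can absorb while still holding quorum.
fn tolerated_failures(n: usize) -> usize {
    n - quorum(n)
}

#[test]
fn quorum_table() {
    assert_eq!((quorum(3), tolerated_failures(3)), (2, 1));
    assert_eq!((quorum(5), tolerated_failures(5)), (3, 2));
    assert_eq!((quorum(7), tolerated_failures(7)), (4, 3));
    // A 4-node cluster needs 3 of 4 for quorum yet still tolerates only 1 failure.
    assert_eq!((quorum(4), tolerated_failures(4)), (3, 1));
}
```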
3. The Default Timings
```rust
let config = ClusterConfig {
    node_id: "node1".into(),
    listen_addr: "0.0.0.0:5432".into(),
    peer_addrs: vec!["node2:5432".into(), "node3:5432".into()],
    election_timeout_ms: 300,
    heartbeat_interval_ms: 100,
    replication_factor: 2,
};
```
Two knobs matter:
- heartbeat_interval_ms: 100 — how often the leader pings followers. Smaller = faster failure detection, more network chatter.
- election_timeout_ms: 300 — how long a follower waits without a heartbeat before starting a new election. Must be greater than heartbeat_interval_ms (typically 3-5x), with jitter so two followers don’t time out simultaneously.
The defaults give you sub-second failover on a healthy LAN. For cross-DC, multiply both by 5-10x to absorb latency variance — see multi-region-active-active.md.
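As a rough starting point for cross-DC timings (a sketch, not a crate-provided preset, and the peer hostnames are placeholders), scale the defaults and then tune against measured inter-DC round trips:

```rust
// Cross-DC sketch: roughly 5x the LAN defaults; adjust to your measured RTTs.
let config = ClusterConfig {
    node_id: "node1".into(),
    listen_addr: "0.0.0.0:5432".into(),
    peer_addrs: vec![
        "dc2.example.internal:5432".into(), // placeholder hostnames
        "dc3.example.internal:5432".into(),
    ],
    heartbeat_interval_ms: 500,  // 5x the 100 ms default
    election_timeout_ms: 1_500,  // keeps the 3-5x margin over the heartbeat
    ..Default::default()
};
```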
4. Bring Up the Cluster
Node 1 (node1)
```rust
use heliosdb_cluster::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ClusterConfig {
        node_id: "node1".into(),
        listen_addr: "0.0.0.0:5432".into(),
        peer_addrs: vec![
            "10.0.0.2:5432".into(),
            "10.0.0.3:5432".into(),
        ],
        ..Default::default()
    };

    let cluster = ClusterCoordinator::new(config).await?;
    cluster.start().await?;

    println!("node1 up — waiting for quorum...");
    tokio::signal::ctrl_c().await?;
    Ok(())
}
```
Nodes 2 and 3
Identical, with node_id and peer_addrs rotated. Boot order doesn’t matter — the first two to see each other will elect a leader, and the third will join as a follower.
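For example, node2’s config (assuming node1 sits at 10.0.0.1, mirroring the addresses above) looks like this:

```rust
// node2: same shape as node1, with node_id and peer_addrs rotated.
let config = ClusterConfig {
    node_id: "node2".into(),
    listen_addr: "0.0.0.0:5432".into(),
    peer_addrs: vec![
        "10.0.0.1:5432".into(), // node1 (assumed address)
        "10.0.0.3:5432".into(), // node3
    ],
    ..Default::default()
};
```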
5. Verify Election
From any node:
```rust
let state = cluster.get_state().await;

assert!(state.leader.is_some());         // exactly one leader
assert_eq!(state.nodes.len(), 3);        // all three nodes seen
println!("Leader: {:?}", state.leader);  // e.g. "node2"
println!("Term: {}", state.term);        // monotonically increasing
```
state.term is the Raft term number. It increases every time an election happens. In a healthy cluster with no failures it stays put.
6. Read the Cluster Metrics
```rust
let metrics = cluster.get_metrics();

println!("Consensus: {:?}", metrics.consensus_metrics);
println!("Health: {:?}", metrics.health_metrics);
```
consensus_metrics includes:
- log entries appended
- log entries committed
- last applied index
- election count
health_metrics includes:
- last heartbeat per peer
- node status (Healthy / Degraded / Unreachable)
These are also exposed via the standard observability surface — see Operations Guide for Prometheus and OpenTelemetry wiring.
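For a quick look without the full observability stack, a background polling task is enough. A minimal sketch, assuming you wrap the coordinator in an Arc so a spawned task can read it:

```rust
use std::{sync::Arc, time::Duration};

// Assumes: let cluster = Arc::new(ClusterCoordinator::new(config).await?);
let reader = Arc::clone(&cluster);
tokio::spawn(async move {
    let mut tick = tokio::time::interval(Duration::from_secs(10));
    loop {
        tick.tick().await;
        let m = reader.get_metrics();
        println!("consensus: {:?} | health: {:?}", m.consensus_metrics, m.health_metrics);
    }
});
```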
7. Trigger a Failover (Drill)
The simplest disaster drill is to kill the leader and watch a follower take over.
```bash
# On the leader host
sudo systemctl stop heliosdb
```
On either remaining node:
```rust
loop {
    let state = cluster.get_state().await;
    if state.leader.as_deref() != Some("node-that-just-died") {
        println!("Failover complete. New leader: {:?}", state.leader);
        break;
    }
    tokio::time::sleep(std::time::Duration::from_millis(50)).await;
}
```
With default timings the new leader is elected in 300-600 ms. Bring the dead node back up — it rejoins as a follower and catches up via log replication.
8. Replication Factor and Durability
replication_factor: 2 means:
- A write is acknowledged once it’s on the leader plus one follower.
- You can lose any one node and not lose data.
- A network partition that isolates the leader from both followers will cause writes to stall (quorum lost) — the leader steps down, no split-brain.
Bumping to RF=3 (write to all three before ack) gives stronger durability at the cost of latency — every write waits for the slowest follower.
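If your latency budget allows it, the stricter setting is a one-line change (same shape as the configs above):

```rust
// Acknowledge writes only after all three replicas hold the entry.
// Write latency now tracks the slowest follower, so benchmark before adopting.
let config = ClusterConfig {
    node_id: "node1".into(),
    listen_addr: "0.0.0.0:5432".into(),
    peer_addrs: vec!["node2:5432".into(), "node3:5432".into()],
    replication_factor: 3,
    ..Default::default()
};
```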
9. Adding a Fourth Node Later
Live membership changes go through the same Raft log. Bring the new node up with all three existing peers in peer_addrs, and existing nodes will pick it up via gossip + Raft membership change. You don’t need to restart the cluster.
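A sketch of the new node’s config (the node_id and addresses are illustrative; the key point is that peer_addrs lists all three existing members):

```rust
// node4 joins a running cluster: point it at every existing member and start it.
let config = ClusterConfig {
    node_id: "node4".into(),
    listen_addr: "0.0.0.0:5432".into(),
    peer_addrs: vec![
        "10.0.0.1:5432".into(),
        "10.0.0.2:5432".into(),
        "10.0.0.3:5432".into(),
    ],
    ..Default::default()
};

let cluster = ClusterCoordinator::new(config).await?;
cluster.start().await?; // existing members add it via a Raft membership change
```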
Cross-Region Deployments
Three nodes in one DC tolerate host failures; three nodes spread across three DCs tolerate a DC failure but pay cross-DC latency on every write. For active-active multi-region with sub-100ms cross-region writes, use the dedicated heliosdb-cluster/crates/active-active subsystem — see multi-region-active-active.md.
Where Next
- multi-region-active-active.md — cross-region writes with multiple consistency models.
- sharding-config.md — scale beyond a single Raft group.
- pitr-recovery.md — point-in-time recovery on top of the Raft log.
- heliosdb-cluster/README.md — full crate reference.