
Multi-Master Replication Operations Guide

Document Version: 1.0
Status: Production
Last Updated: 2025-12-30

Table of Contents

  1. Introduction
  2. Multi-Master Replication Concepts
  3. Supported Replication Topologies
  4. Conflict Detection and Resolution
  5. Setup and Configuration
  6. Operational Procedures
  7. Performance Tuning
  8. Conflict Resolution in Practice
  9. Real-World Topologies and Examples
  10. Troubleshooting Replication Issues
  11. High-Availability Patterns

Introduction

Multi-master replication in HeliosDB enables distributed database deployments where multiple nodes can accept write operations independently. This guide provides operational procedures for deploying, managing, and troubleshooting multi-master replication clusters.

Key Capabilities

  • Bidirectional Synchronization: Write to any master node with automatic propagation
  • Conflict Resolution: Last-Write-Wins (LWW), application-defined, and CRDT-based strategies
  • Geographic Distribution: Deploy masters across multiple regions
  • Automatic Failover: Seamless failover with conflict-free consistency
  • Causal Consistency: Maintain ordering of dependent operations
  • 99.999% Write Availability: No single point of failure for writes

Benefits Over Single-Master Replication

Aspect | Single-Master | Multi-Master
Write Availability | Single point of failure | Distributed, resilient
Write Latency | Network round-trip | Local write + async replication
Failover Strategy | Promote read replica | Multiple masters available
Geographic Redundancy | Limited | Excellent across regions
Maintenance Downtime | Full cluster outage | Zero downtime

Multi-Master Replication Concepts

Architecture Overview

Multi-master replication creates a peer-to-peer topology where each master node:

  1. Accepts local writes from connected clients
  2. Commits transactions to local storage immediately
  3. Queues changes for asynchronous replication
  4. Sends changes to peer masters
  5. Applies remote changes with conflict detection/resolution
┌──────────────────────────────────────────────────────────────┐
│ Multi-Master Topology │
├──────────────────────────────────────────────────────────────┤
│ │
│ Client A Client B Client C │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ [US-East]◄──────────────►[EU-West]◄────────[AP-South] │
│ Master 1 Master 2 Master 3 │
│ │ │ │ │
│ │ Commit locally │ Commit locally │ │
│ │ Queue for replication │ Queue │ │
│ │ │ │ │
│ └────────Async Replication──────────────────┘ │
│ │
│ - Each master accepts writes independently │
│ - Changes propagate asynchronously to peers │
│ - Conflicts detected and resolved automatically │
│ - Causal consistency maintained via vector clocks │
│ │
└──────────────────────────────────────────────────────────────┘

Key Concepts

Version Vectors / Vector Clocks

Version vectors track causality and detect concurrent writes: each master maintains a logical clock entry for every master in the cluster, including itself.

Master 1: [10, 0, 0] # 10 local writes, 0 from Master 2, 0 from Master 3
Master 2: [5, 15, 0] # 5 from Master 1, 15 local writes, 0 from Master 3
Master 3: [10, 12, 20] # 10 from Master 1, 12 from Master 2, 20 local writes

Replication Lag

Time between commit on master A and application on master B. Typical values: 50-500ms (depends on network).

Application writes to US-East Master
├─ Commit locally (1ms)
├─ Queue for replication (1ms)
├─ Send to EU-West (85ms network)
├─ Receive on EU-West (1ms)
├─ Apply to storage (10ms)
└─ Acknowledge (1ms)
Total Replication Lag: ~100ms

Conflict Detection

A conflict occurs when two masters write to the same row within one replication window, before either change has replicated to the other.

Time (ms) | US-East Master | EU-West Master
0 | UPDATE users SET x=1 | UPDATE users SET x=2
1 | Commit locally | Commit locally
2 | Queue for replication | Queue for replication
85 | Receive EU update: x=2 (conflict detected) | Receive US update: x=1 (conflict detected)
86 | Apply conflict resolution (e.g., LWW: x=2 wins due to later timestamp)

Supported Replication Topologies

Active-Active Topology

All masters actively accept writes. Best for geographic distribution.

┌─────────────────────────────────────────┐
│ Client Access Patterns │
├─────────────────────────────────────────┤
│ │
│ Region 1: Route to Master 1 │
│ Region 2: Route to Master 2 │
│ Region 3: Route to Master 3 │
│ │
│ ┌────────────────┐ │
│ │ Master 1 │◄───────────────┐ │
│ │ US-East │ │ │
│ └────────────────┘ │ │
│ ▲ ▲ │ │
│ │ └────────────────────┼───┼──┐
│ │ │ │ │
│ │ │ │ │
│ ┌─────┴────────────┐ ┌──────┴─┐ │ │
│ │ Master 2 │◄────►│Master 3│ │ │
│ │ EU-West │ │ AP-South│ │ │
│ └──────────────────┘ └────────┘ │ │
│ │ │
│ All write paths bidirectional └──┘
│ │
└─────────────────────────────────────────┘
Benefits:
• Writes near-locally for all regions
• No single point of failure
• Minimal write latency
Trade-offs:
• Conflict resolution complexity
• Coordination overhead
• Eventual consistency (not strong)

Multi-Region Topology

Masters deployed across geographic regions with optional read replicas per region.

┌──────────────────────────────────────────────┐
│ Multi-Region Architecture │
├──────────────────────────────────────────────┤
│ │
│ [US Region] [EU Region] │
│ ┌────────────────┐ ┌──────────────────┐ │
│ │ Master 1 │ │ Master 2 │ │
│ │ (Writable) │◄─│ (Writable) │ │
│ └────────────────┘ └──────────────────┘ │
│ ▲ │ ▲ │ │
│ │ ├─Read/Write │ ├─Read/Write │
│ │ │ │ │ │
│ ┌────┴─┴──────────┐ ┌────┴─┴──────────┐ │
│ │ Replica 1 │ │ Replica 2 │ │
│ │ (Read-only) │ │ (Read-only) │ │
│ └─────────────────┘ └─────────────────┘ │
│ ▲ ▲ │
│ │ Async replication │ │
│ └───────────────────────┘ │
│ │
│ [Asia Region] [Australia Region] │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Master 3 │ │ Master 4 │ │
│ │ (Writable) │ │ (Writable) │ │
│ └──────────────────┘ └──────────────────┘ │
│ ▲ ▲ │
│ └───────────────────────┘ │
│ │
│ All masters replicate to all masters │
│ Replicas provide read-scaling per region │
│ │
└──────────────────────────────────────────────┘

Failover Topology

Supports automatic failover when a master becomes unavailable.

Normal Operation (All Masters Healthy)
┌────────────────────────────────────┐
│ │
│ Master 1 ◄────────────► Master 2│
│ Primary Replication Standby │
│ │
│ Writes → Master 1 │
│ Reads → Both │
│ │
└────────────────────────────────────┘
Master 1 Failure Detected
┌────────────────────────────────────┐
│ │
│ Master 1 ✗ │
│ (Unreachable) │
│ │
│ Master 2 │
│ (Promoted) │
│ │
│ Writes → Master 2 │
│ Reads → Master 2 │
│ │
└────────────────────────────────────┘
Master 1 Recovery
┌────────────────────────────────────┐
│ │
│ Master 1 ◄────────────► Master 2│
│ (Recovering) Catch-up Primary │
│ │
│ Writes → Master 2 │
│ Reads → Both │
│ (Master 1 synchronized) │
│ │
└────────────────────────────────────┘

Conflict Detection and Resolution

Conflict Detection Mechanisms

HeliosDB detects conflicts at the row and column level using multiple strategies:

Version Vector Detection

Detects concurrent writes from different masters:

State on Master 1: [row_id=100, value=10, vv=[5,0,0]]
Master 1 writes value=15: new vv=[6,0,0]
Concurrent write on Master 2:
Master 2 writes value=20: new vv=[0,1,0]
When Master 1 receives update from Master 2:
Local vv=[6,0,0], Incoming vv=[0,1,0]
Neither dominates (no causal order)
→ CONFLICT DETECTED
Resolution: Apply conflict strategy (LWW, CRDT, or custom)
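The dominance check above can be sketched in a few lines of Rust. This is an illustration of the comparison rule only (the types and names are not HeliosDB APIs): a write is causally newer only if its vector is greater or equal in every position and strictly greater in at least one; otherwise the two writes are concurrent and a conflict is raised:

/// One logical clock entry per master, indexed by master position.
type VersionVector = Vec<u64>;

#[derive(Debug, PartialEq)]
enum Causality {
    LocalDominates,  // local write already saw the remote write
    RemoteDominates, // remote write already saw the local write
    Equal,           // identical vectors: same write delivered twice
    Concurrent,      // neither dominates: conflict detected
}

fn compare(local: &VersionVector, remote: &VersionVector) -> Causality {
    let mut local_ahead = false;
    let mut remote_ahead = false;
    for i in 0..local.len().max(remote.len()) {
        let l = local.get(i).copied().unwrap_or(0);
        let r = remote.get(i).copied().unwrap_or(0);
        if l > r { local_ahead = true; }
        if r > l { remote_ahead = true; }
    }
    match (local_ahead, remote_ahead) {
        (true, false) => Causality::LocalDominates,
        (false, true) => Causality::RemoteDominates,
        (false, false) => Causality::Equal,
        (true, true) => Causality::Concurrent,
    }
}

fn main() {
    // The example above: local vv=[6,0,0], incoming vv=[0,1,0]
    let local: VersionVector = vec![6, 0, 0];
    let incoming: VersionVector = vec![0, 1, 0];
    assert_eq!(compare(&local, &incoming), Causality::Concurrent); // CONFLICT DETECTED
}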

Timestamp-Based Detection

Last-Write-Wins strategy with logical or wall-clock timestamps:

-- Table with conflict detection columns
CREATE TABLE accounts (
id UUID PRIMARY KEY,
balance DECIMAL,
master_id INT,
timestamp BIGINT,
write_counter INT,
CONSTRAINT timestamp_check CHECK (timestamp > 0)
);
-- Write from Master 1
UPDATE accounts
SET balance = 100, master_id = 1, timestamp = 1000, write_counter = 5
WHERE id = 'acc-001';
-- Concurrent write from Master 2
UPDATE accounts
SET balance = 150, master_id = 2, timestamp = 1001, write_counter = 3
WHERE id = 'acc-001';
-- Conflict Resolution: Compare timestamps
-- Master 2's write (timestamp=1001) > Master 1's write (timestamp=1000)
-- Result: balance = 150, master_id = 2

Last-Write-Wins (LWW) Resolution

Simplest strategy: keep the write with the latest timestamp.

Configuration:

[replication]
conflict_resolution = "last_write_wins"
timestamp_source = "server_timestamp" # server_timestamp | client_timestamp | logical_clock
timestamp_precision_ms = true

SQL Example:

-- Enable LWW for a table
CREATE TABLE products (
id INT PRIMARY KEY,
name VARCHAR(255),
price DECIMAL(10,2),
_master_ts BIGINT DEFAULT CURRENT_TIMESTAMP,
_master_id INT DEFAULT CURRENT_MASTER_ID
);
-- System automatically selects write with max(_master_ts)
-- To create conflict:
-- Master 1: UPDATE products SET price = 100 WHERE id = 1; -- ts=1000
-- Master 2: UPDATE products SET price = 150 WHERE id = 1; -- ts=1001
-- Result: price = 150 (Master 2 wins)

Pros and Cons:

Aspect | Details
Pros | Simple, deterministic, low overhead
Cons | May lose legitimate writes from slower master
Best For | Non-critical data, eventually consistent apps
Worst For | Financial transactions, inventory counts
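As a rough sketch, the LWW decision reduces to a single comparison. The tie-break on _master_id below is an assumption added for illustration (this guide does not specify how equal timestamps are broken); the rest mirrors the SQL example above:

// Minimal LWW sketch; struct and field names are illustrative, not HeliosDB internals.
#[derive(Debug)]
struct VersionedWrite {
    value: String,   // the conflicting column value
    timestamp: i64,  // _master_ts
    master_id: u32,  // _master_id (assumed tie-breaker)
}

fn lww_resolve(local: VersionedWrite, remote: VersionedWrite) -> VersionedWrite {
    // Keep the write with the larger timestamp; break ties deterministically
    // so that every master converges on the same winner.
    if (remote.timestamp, remote.master_id) > (local.timestamp, local.master_id) {
        remote
    } else {
        local
    }
}

fn main() {
    let m1 = VersionedWrite { value: "price=100".into(), timestamp: 1000, master_id: 1 };
    let m2 = VersionedWrite { value: "price=150".into(), timestamp: 1001, master_id: 2 };
    println!("winner: {:?}", lww_resolve(m1, m2)); // price=150, Master 2 wins
}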

Application-Defined Resolution

Custom resolver logic using SQL or application code.

Configuration:

[replication]
conflict_resolution = "custom"
resolver_function = "resolve_product_conflict"
resolver_language = "sql" # sql | rust | javascript

SQL Custom Resolver Example:

-- Create custom resolver function
CREATE FUNCTION resolve_product_conflict(
local_value JSONB,
remote_value JSONB,
conflict_type VARCHAR
) RETURNS JSONB AS $$
BEGIN
-- Example: Keep higher price for inventory management
IF (local_value->>'price')::DECIMAL > (remote_value->>'price')::DECIMAL THEN
RETURN local_value;
ELSE
RETURN remote_value;
END IF;
END;
$$ LANGUAGE plpgsql;
-- Register resolver for table
ALTER TABLE products
SET REPLICATION CONFLICT_RESOLVER = 'resolve_product_conflict';

Rust Custom Resolver Example:

use heliosdb_multi_master::ConflictResolver;

struct InventoryConflictResolver;

impl ConflictResolver for InventoryConflictResolver {
    fn resolve(&self, local: &Value, remote: &Value) -> Value {
        // Business rule: Keep highest inventory count
        let local_qty = local.get("quantity").and_then(|v| v.as_i64()).unwrap_or(0);
        let remote_qty = remote.get("quantity").and_then(|v| v.as_i64()).unwrap_or(0);
        if local_qty > remote_qty {
            local.clone()
        } else {
            remote.clone()
        }
    }
}

// Register resolver
let config = MultiMasterConfig {
    conflict_resolution: ConflictResolution::Custom {
        resolver: Box::new(InventoryConflictResolver),
    },
    ..Default::default()
};

CRDT-Based Resolution

Conflict-free replicated data types guarantee eventual consistency.

Supported CRDT Types:

-- Counter CRDT (supports distributed increment/decrement)
CREATE TABLE page_views (
page_id UUID PRIMARY KEY,
view_count CRDT_COUNTER,
last_modified TIMESTAMP
);
INSERT INTO page_views (page_id, view_count) VALUES ('page-1', CRDT_COUNTER(0));
-- Increment from any master - no conflicts!
-- Master 1: SELECT CRDT_INCREMENT(view_count, 5)
-- Master 2: SELECT CRDT_INCREMENT(view_count, 3)
-- Both masters converge to same value after replication
-- Set CRDT (supports distributed set operations)
CREATE TABLE user_tags (
user_id UUID PRIMARY KEY,
tags CRDT_SET<VARCHAR(255)>,
last_modified TIMESTAMP
);
-- Add tags from any master
-- Master 1: SELECT CRDT_ADD(tags, 'premium')
-- Master 2: SELECT CRDT_ADD(tags, 'verified')
-- Result: tags = {'premium', 'verified'} on all masters
-- Map CRDT (supports distributed map operations)
CREATE TABLE user_preferences (
user_id UUID PRIMARY KEY,
preferences CRDT_MAP<VARCHAR(255), VARCHAR(255)>,
last_modified TIMESTAMP
);
-- Set preferences from any master
-- Master 1: SELECT CRDT_SET(preferences, 'theme', 'dark')
-- Master 2: SELECT CRDT_SET(preferences, 'language', 'en')
-- Result: preferences = {theme: 'dark', language: 'en'} on all masters
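To see why CRDT_COUNTER never conflicts, here is a minimal grow-only counter in Rust. It is a sketch of the general G-counter technique, not HeliosDB's internal representation: each master increments only its own slot, and merge is an element-wise max, so replicas converge regardless of delivery order:

use std::collections::HashMap;

#[derive(Clone, Default)]
struct GCounter {
    // One monotonically increasing slot per master.
    counts: HashMap<String, u64>,
}

impl GCounter {
    fn increment(&mut self, master_id: &str, by: u64) {
        *self.counts.entry(master_id.to_string()).or_insert(0) += by;
    }
    // Element-wise max: commutative, associative, and idempotent.
    fn merge(&mut self, other: &GCounter) {
        for (master, &count) in &other.counts {
            let slot = self.counts.entry(master.clone()).or_insert(0);
            *slot = (*slot).max(count);
        }
    }
    fn value(&self) -> u64 {
        self.counts.values().sum()
    }
}

fn main() {
    let mut on_master_1 = GCounter::default();
    let mut on_master_2 = GCounter::default();
    on_master_1.increment("us-east-1", 5); // Master 1: CRDT_INCREMENT(view_count, 5)
    on_master_2.increment("eu-west-1", 3); // Master 2: CRDT_INCREMENT(view_count, 3)
    // Replicate in both directions:
    let snapshot_1 = on_master_1.clone();
    on_master_1.merge(&on_master_2);
    on_master_2.merge(&snapshot_1);
    assert_eq!(on_master_1.value(), 8);
    assert_eq!(on_master_2.value(), 8); // both masters converge with no conflict
}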

Configuration:

[replication]
conflict_resolution = "crdt"
crdt_types = ["counter", "set", "map", "register"]

Benefits:

  • No conflicts - mathematically proven convergence
  • Deterministic merge
  • Suitable for append-only data
  • High availability

Limitations:

  • Limited to CRDT-compatible operations
  • Requires schema design changes
  • Not suitable for strong consistency

Setup and Configuration

Prerequisites

  • HeliosDB v5.3 or later
  • Network connectivity between all master nodes
  • Synchronized time across masters (NTP recommended, <100ms skew)
  • At least 500MB free disk space per master
  • Port 5432 (PostgreSQL protocol) accessible between masters

Enabling Multi-Master Replication

Step 1: Configuration File

/etc/heliosdb/heliosdb.conf
[server]
node_id = "us-east-1"
listen_address = "10.0.1.100"
port = 5432
[replication]
# Enable multi-master mode
mode = "multi_master"
# Local master configuration
local_master_id = "us-east-1"
local_region = "us-east-1"
# Replication behavior
conflict_resolution = "last_write_wins"
timestamp_source = "server_timestamp"
replication_lag_threshold_ms = 500
max_retries = 3
batch_size = 1000
# Causal consistency
enable_causal_consistency = true
vector_clock_update_interval_ms = 10
# Network
replication_timeout_ms = 5000
connection_pool_size = 10
enable_compression = true
# Monitoring
health_check_interval_sec = 10
replication_monitor_interval_sec = 30

Step 2: Configure Replication Nodes

-- Connect to each master and configure peers
-- On Master 1 (us-east-1)
SELECT replicate.add_master(
master_id => 'us-east-1',
endpoint => '10.0.1.100:5432',
region => 'us-east-1',
priority => 10
);
SELECT replicate.add_master(
master_id => 'eu-west-1',
endpoint => '10.0.2.100:5432',
region => 'eu-west-1',
priority => 5
);
SELECT replicate.add_master(
master_id => 'ap-south-1',
endpoint => '10.0.3.100:5432',
region => 'ap-south-1',
priority => 5
);
-- Verify topology
SELECT * FROM replicate.masters;

Step 3: Start Replication Service

# Start HeliosDB with multi-master enabled
systemctl start heliosdb
# Verify replication is active
heliosdb-cli \
--host 10.0.1.100 \
--port 5432 \
-c "SELECT replicate.status();"
# Expected output:
# status: ACTIVE
# master_id: us-east-1
# replication_lag_ms: 0
# healthy_peers: 2
# conflicts_detected: 0

Network and Latency Considerations

Internet / Private Network
┌─────────┼─────────┐
│ │ │
[Firewall] │ [Firewall]
│ │ │
┌─────────┴──┐ ┌──┴─────────┴──┐
│ │ │ │
[VPC-US] [VPC-EU] [VPC-Asia]
│ │ │ │
[Master-1] [Master-2] [Master-3]
[Replica-1] [Replica-2] [Replica-3]
Network Requirements:
• Latency: <100ms between masters (LAN) or <500ms (WAN)
• Bandwidth: 10 Mbps minimum, 100 Mbps recommended
• Packet loss: <0.1%
• TCP connections: Keep-alive enabled

Configuration for Different Network Conditions

Low-Latency (LAN, <10ms):

[replication]
heartbeat_interval_ms = 50
heartbeat_timeout_ms = 200
replication_batch_size = 100
replication_timeout_ms = 2000

Medium-Latency (WAN, 50-100ms):

[replication]
heartbeat_interval_ms = 100
heartbeat_timeout_ms = 500
replication_batch_size = 1000
replication_timeout_ms = 5000

High-Latency (Satellite/Intercontinental, >200ms):

[replication]
heartbeat_interval_ms = 500
heartbeat_timeout_ms = 2000
replication_batch_size = 5000
replication_timeout_ms = 15000
enable_compression = true
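These three profiles can be selected from a measured round-trip time. The sketch below captures that rule of thumb only; the thresholds mirror the figures above, and HeliosDB does not pick a profile automatically:

#[derive(Debug)]
struct ReplicationProfile {
    heartbeat_interval_ms: u64,
    heartbeat_timeout_ms: u64,
    replication_batch_size: u32,
    replication_timeout_ms: u64,
}

fn profile_for_rtt(rtt_ms: u64) -> ReplicationProfile {
    if rtt_ms < 10 {
        // Low-latency LAN profile
        ReplicationProfile { heartbeat_interval_ms: 50, heartbeat_timeout_ms: 200,
                             replication_batch_size: 100, replication_timeout_ms: 2_000 }
    } else if rtt_ms <= 100 {
        // Medium-latency WAN profile
        ReplicationProfile { heartbeat_interval_ms: 100, heartbeat_timeout_ms: 500,
                             replication_batch_size: 1_000, replication_timeout_ms: 5_000 }
    } else {
        // High-latency intercontinental profile (keep compression enabled)
        ReplicationProfile { heartbeat_interval_ms: 500, heartbeat_timeout_ms: 2_000,
                             replication_batch_size: 5_000, replication_timeout_ms: 15_000 }
    }
}

fn main() {
    // The 85 ms US-East to EU-West link lands in the WAN profile.
    println!("{:?}", profile_for_rtt(85));
}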

Network Monitoring Queries

-- Check latency to peer masters
SELECT
peer_id,
region,
ROUND(AVG(rtt_ms), 2) as avg_latency_ms,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY rtt_ms) as p99_latency_ms,
COUNT(*) as ping_count
FROM replicate.peer_latency_history
WHERE recorded_at > NOW() - INTERVAL '1 hour'
GROUP BY peer_id, region
ORDER BY avg_latency_ms DESC;
-- Monitor replication lag
SELECT
master_id,
current_lsn,
last_replicated_lsn,
current_lsn - last_replicated_lsn as pending_operations,
ROUND((current_lsn - last_replicated_lsn) / 1048576.0, 2) as lag_mb,
ROUND(EXTRACT(EPOCH FROM last_replication_time) * 1000, 0) as last_replication_ms_ago
FROM replicate.masters
ORDER BY pending_operations DESC;
-- Network statistics
SELECT
direction,
bytes_sent,
bytes_received,
ROUND(bytes_sent / NULLIF(compression_ratio, 0), 0) as bytes_before_compression,
ROUND(100 * (1 - 1/NULLIF(compression_ratio, 1)), 1) as compression_percent
FROM replicate.network_stats
WHERE recorded_at > NOW() - INTERVAL '1 hour'
ORDER BY recorded_at DESC
LIMIT 10;

Connection Management

Connection Pool Configuration:

[replication.connection_pool]
pool_size = 10
max_connections = 50
idle_timeout_sec = 300
connection_timeout_sec = 10
# Per-peer configuration
[[replication.peer_connections]]
peer_id = "eu-west-1"
pool_size = 15 # More connections for critical peers
priority = 10

SQL Connection Management:

-- View active replication connections
SELECT
peer_id,
connection_id,
state,
connected_since,
bytes_sent,
bytes_received
FROM replicate.connections
WHERE state = 'active'
ORDER BY peer_id;
-- Close stale connections
SELECT replicate.close_connection(
peer_id => 'eu-west-1',
connection_id => 12345
);
-- Reset connection pool
SELECT replicate.reset_connection_pool('eu-west-1');

Operational Procedures

Adding a New Replication Node

Procedure: Add Master to Active Cluster

Duration: 5-10 minutes
Impact: Minimal, online operation
Validation: Required

Prerequisites:

  • New master fully provisioned and running HeliosDB
  • All existing masters healthy
  • Network connectivity verified
  • Sufficient disk space for backlog

Step 1: Initialize New Master

-- On the new master (ap-northeast-1)
SELECT replicate.initialize_master(
master_id => 'ap-northeast-1',
region => 'ap-northeast-1',
priority => 10
);
-- Verify initialization
SELECT * FROM replicate.status();
-- Expected: status=INITIALIZING, replication_mode=catching_up

Step 2: Add New Master to Existing Masters

-- On existing master (us-east-1)
SELECT replicate.add_master(
master_id => 'ap-northeast-1',
endpoint => '10.0.4.100:5432',
region => 'ap-northeast-1',
priority => 10,
initial_role => 'replica'
);
-- Verify master added
SELECT * FROM replicate.masters WHERE master_id = 'ap-northeast-1';

Step 3: Establish Replication Connections

# Wait for connection establishment (30-60 seconds)
# Monitor connection status
heliosdb-cli -c "SELECT replicate.status();" | grep -E "peer_count|healthy_peers"
# Expected output:
# peer_count: 4
# healthy_peers: 3

Step 4: Wait for Catch-Up

-- Monitor catch-up progress
SELECT
master_id,
replication_lag_bytes,
replication_lag_ms,
catch_up_progress_percent,
CASE
WHEN catch_up_progress_percent >= 100 THEN 'COMPLETE'
ELSE 'IN_PROGRESS'
END as catch_up_status
FROM replicate.masters
WHERE master_id = 'ap-northeast-1';
-- Loop until catch_up_progress_percent = 100
-- Typical catch-up time: 5-10 minutes for 1GB lag
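Operationally, the wait in this step is a poll-until-done loop. The sketch below assumes a caller-supplied fetch_progress closure standing in for the catch_up_progress_percent query against replicate.masters; it is illustrative, not a heliosdb-cli feature:

use std::time::{Duration, Instant};
use std::thread::sleep;

fn wait_for_catch_up<F>(mut fetch_progress: F, poll_every: Duration, deadline: Duration) -> Result<(), String>
where
    F: FnMut() -> f64, // returns catch_up_progress_percent
{
    let started = Instant::now();
    loop {
        let progress = fetch_progress();
        if progress >= 100.0 {
            return Ok(());
        }
        if started.elapsed() > deadline {
            return Err(format!("catch-up stalled at {progress:.1}%"));
        }
        sleep(poll_every);
    }
}

fn main() {
    // Simulated progress source: advances 25% per poll.
    let mut progress = 0.0;
    let result = wait_for_catch_up(
        || { progress += 25.0; progress },
        Duration::from_millis(10),
        Duration::from_secs(5),
    );
    println!("catch-up result: {result:?}");
}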

Step 5: Activate New Master

-- On new master
SELECT replicate.promote_to_master(
master_id => 'ap-northeast-1',
validate_topology => true
);
-- Verify promotion
SELECT role FROM replicate.status();
-- Expected: role = master
-- Verify it can accept writes
INSERT INTO test_table VALUES (1, 'test from new master');
COMMIT;

Step 6: Validation

-- Verify replication from new master
-- On new master
INSERT INTO validation_test (id, value, created_at)
VALUES (1, 'from_ap_northeast', NOW());
-- On existing master (wait 1-2 seconds)
SELECT * FROM validation_test WHERE id = 1;
-- Should see the record
-- Check replication health
SELECT replicate.validate_topology();
-- Expected: VALID, all masters healthy

Removing a Replication Node

Procedure: Remove Master from Cluster

Duration: 3-5 minutes
Impact: No impact to cluster
Validation: Required

Prerequisites:

  • Target master can be shut down
  • Cluster has at least 2 remaining masters
  • No ongoing maintenance on other masters

Step 1: Pre-Removal Validation

-- Verify healthy cluster state
SELECT replicate.validate_topology();
-- Expected: VALID
-- Check master being removed has no pinned transactions
SELECT COUNT(*) as pinned_count FROM replicate.pinned_transactions
WHERE master_id = 'eu-west-1';
-- If count > 0, wait for transactions to complete
-- Check replication lag is low
SELECT master_id, replication_lag_ms FROM replicate.masters;
-- All should be < 1000ms

Step 2: Drain Master (Stop Accepting New Writes)

-- On the master being removed
SELECT replicate.set_read_only(master_id => 'eu-west-1');
-- Verify reads still work
SELECT COUNT(*) FROM large_table; -- Should succeed
-- Try to write (should fail)
INSERT INTO test_table VALUES (1);
-- Error: Read-only mode

Step 3: Wait for Replication to Catch-Up

-- Monitor until this master is fully replicated
SELECT
master_id,
pending_operations,
replication_lag_ms
FROM replicate.masters
WHERE master_id = 'eu-west-1';
-- Wait until pending_operations = 0 and replication_lag_ms < 100

Step 4: Remove Master from Topology

-- On any remaining master
SELECT replicate.remove_master(
master_id => 'eu-west-1',
verify_replication => true
);
-- Verify removal
SELECT COUNT(*) FROM replicate.masters;
-- Count should decrease by 1

Step 5: Shutdown Master

# Stop HeliosDB on removed master
systemctl stop heliosdb
# Verify it's offline
heliosdb-cli --host 10.0.2.100 --port 5432 -c "SELECT 1"
# Should fail with connection refused

Step 6: Post-Removal Validation

-- Verify cluster is healthy with remaining masters
SELECT replicate.validate_topology();
-- Expected: VALID
-- Verify quorum is sufficient
SELECT
master_count,
min_quorum,
CASE WHEN master_count >= min_quorum THEN 'OK' ELSE 'ERROR' END as quorum_status
FROM replicate.cluster_info;

Handling Node Failures

Scenario 1: Master Goes Offline Unexpectedly

Timeline:
0ms: Master 2 becomes unreachable (network partition)
100ms: Heartbeat missed
500ms: Suspected failure
2000ms: Confirmed failure, triggers failover logic
Detection Query:
-- Detect failed nodes
SELECT
master_id,
last_heartbeat,
NOW() - last_heartbeat as time_since_heartbeat,
state,
is_healthy
FROM replicate.masters
WHERE is_healthy = false
ORDER BY last_heartbeat DESC;
-- Troubleshoot failure reason
SELECT
master_id,
failure_type, -- network_timeout, connection_refused, unresponsive
last_error,
failure_count,
first_failure_time,
last_failure_time
FROM replicate.failure_log
WHERE recorded_at > NOW() - INTERVAL '30 minutes'
ORDER BY last_failure_time DESC;
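The detection timeline above boils down to two thresholds. The sketch below uses the 500 ms (suspected) and 2000 ms (confirmed) figures from that timeline; the state names are illustrative, not values returned by replicate.masters:

use std::time::Duration;

#[derive(Debug, PartialEq)]
enum PeerState {
    Healthy,   // heartbeats arriving on schedule (every ~100 ms)
    Suspected, // missed heartbeats; keep retrying, no failover yet
    Failed,    // confirmed failure; trigger failover logic
}

fn classify(since_last_heartbeat: Duration) -> PeerState {
    const SUSPECT_AFTER: Duration = Duration::from_millis(500);
    const CONFIRM_AFTER: Duration = Duration::from_millis(2000);
    if since_last_heartbeat >= CONFIRM_AFTER {
        PeerState::Failed
    } else if since_last_heartbeat >= SUSPECT_AFTER {
        PeerState::Suspected
    } else {
        PeerState::Healthy
    }
}

fn main() {
    assert_eq!(classify(Duration::from_millis(200)), PeerState::Healthy);
    assert_eq!(classify(Duration::from_millis(800)), PeerState::Suspected);
    assert_eq!(classify(Duration::from_millis(2500)), PeerState::Failed);
}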

Recovery Steps:

-- Option 1: Restart failed master (if it's just a process crash)
-- On failed master
systemctl restart heliosdb
-- Monitor reconnection (30-60 seconds)
SELECT * FROM replicate.masters WHERE master_id = 'master_2';
-- status should change from FAILED to CATCHING_UP then HEALTHY
-- Option 2: Investigate network partition
-- Check network connectivity
-- (Execute from operational master)
-- Note: You would need to check actual network status via ping/traceroute
-- Option 3: If permanent failure - remove master (see Remove procedure above)
-- After recovery, validate topology
SELECT replicate.validate_topology();

Monitoring Replication Lag

Real-Time Lag Monitoring:

-- Current lag for all masters
SELECT
master_id,
region,
CASE
WHEN replication_lag_ms < 100 THEN 'EXCELLENT'
WHEN replication_lag_ms < 500 THEN 'GOOD'
WHEN replication_lag_ms < 1000 THEN 'ACCEPTABLE'
ELSE 'POOR'
END as lag_status,
replication_lag_ms,
replication_lag_bytes,
last_replication_time
FROM replicate.masters
ORDER BY replication_lag_ms DESC;
-- Detailed lag breakdown
SELECT
peer_id,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY lag_ms) as p50_lag_ms,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY lag_ms) as p95_lag_ms,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY lag_ms) as p99_lag_ms,
MAX(lag_ms) as max_lag_ms,
COUNT(*) as samples
FROM replicate.lag_history
WHERE recorded_at > NOW() - INTERVAL '1 hour'
GROUP BY peer_id
ORDER BY p99_lag_ms DESC;
-- Operations pending replication
SELECT
master_id,
COUNT(*) as pending_ops
FROM replicate.replication_queue
WHERE state = 'pending'
GROUP BY master_id
ORDER BY pending_ops DESC;

Setting Alerts:

-- Create alert for high lag
CREATE ALERT high_replication_lag AS
SELECT master_id, replication_lag_ms
FROM replicate.masters
WHERE replication_lag_ms > 1000
EVERY 30 SECONDS
ACTION NOTIFY 'replication-alerts@company.com';
-- Create alert for failed master
CREATE ALERT master_failure AS
SELECT master_id, COUNT(*) as consecutive_failures
FROM replicate.failure_log
WHERE recorded_at > NOW() - INTERVAL '5 minutes'
GROUP BY master_id
HAVING COUNT(*) > 10
ACTION NOTIFY 'critical-alerts@company.com';

Handling Split-Brain Scenarios

Split-Brain Definition: Two or more masters operate independently when they should be synchronized.

Prevention:

[replication]
# Require quorum for writes to prevent split-brain
require_quorum_for_writes = true
quorum_size = 2 # Majority of 3 masters
# Use witness node to break ties
use_witness_node = true
witness_endpoints = ["witness.example.com:6379"]
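The write-quorum rule can be sketched as a simple vote count. Treating the witness as an extra vote that never stores data is an assumption made for illustration; the exact voting rule HeliosDB applies is not spelled out here:

fn majority(total_votes: usize) -> usize {
    total_votes / 2 + 1
}

fn can_accept_writes(visible_masters: usize, total_masters: usize,
                     use_witness: bool, witness_visible: bool) -> bool {
    let total_votes = total_masters + if use_witness { 1 } else { 0 };
    let visible_votes = visible_masters + if use_witness && witness_visible { 1 } else { 0 };
    visible_votes >= majority(total_votes)
}

fn main() {
    // Healthy 3-master cluster with a witness: 4 of 4 votes visible.
    assert!(can_accept_writes(3, 3, true, true));
    // A lone partitioned master that can still see the witness has only
    // 2 of 4 votes, so it must refuse writes; this is what prevents split-brain.
    assert!(!can_accept_writes(1, 3, true, true));
    // The other side of the partition (2 masters + witness = 3 votes) keeps writing.
    assert!(can_accept_writes(2, 3, true, true));
}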

Detection:

-- Detect split-brain (different data on different masters)
-- Run the following on every master and compare the results; differing
-- counts for the same table indicate diverged partitions:
SELECT COUNT(DISTINCT master_id) as master_count, COUNT(*) as row_count
FROM replicate.split_brain_detection;
-- If same table has different row counts on different masters
SELECT * FROM replicate.split_brain_status;
-- View conflicting rows
SELECT
table_name,
row_id,
master_1_value,
master_2_value,
conflict_detected_at
FROM replicate.split_brain_conflicts
ORDER BY table_name, row_id;

Recovery from Split-Brain:

-- Step 1: Isolate the cluster partition
-- (Disable network between partitions)
-- Step 2: Identify which partition has majority of masters
-- (If masters = 3, need 2 for majority)
-- Step 3: On majority partition, mark partition as canonical
SELECT replicate.mark_as_canonical_partition(
partition_id => 'partition_1',
force => true
);
-- Step 4: On minority partition, stop replication
SELECT replicate.pause_replication();
-- Step 5: Heal network partition
-- Step 6: Sync minority partition from canonical
SELECT replicate.sync_from_canonical_partition(
target_partition => 'partition_2'
);
-- Step 7: Resume replication
SELECT replicate.resume_replication();
-- Step 8: Validate topology
SELECT replicate.validate_topology();

Performance Tuning

Replication Buffer Sizes

Configuration:

[replication.buffers]
# Main replication queue
queue_buffer_size_mb = 256
# Per-peer replication buffers
per_peer_buffer_mb = 64
# WAL archive buffer
wal_buffer_size_mb = 512
# Conflict detection buffer
conflict_detection_buffer_mb = 128

Tuning Guidance:

Available Memory: 32 GB
Recommended Allocation:
- Queue buffer: 256 MB (0.8%)
- Per-peer (4 peers): 256 MB (0.8%)
- WAL buffer: 512 MB (1.6%)
- Conflict buffer: 128 MB (0.4%)
Total: ~1.15 GB (~3.5% of available)
For lower memory servers (4GB):
- Queue buffer: 64 MB
- Per-peer (2 peers): 64 MB
- WAL buffer: 128 MB
- Conflict buffer: 32 MB
Total: ~288 MB (7% of available)
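The allocation above follows a fraction-of-RAM rule of thumb. The sketch below reproduces the 32 GB worked example; the percentages are guidance from this section, not a formula HeliosDB applies itself:

struct BufferPlanMb {
    queue: u64,
    per_peer_total: u64,
    wal: u64,
    conflict: u64,
}

fn plan_buffers(available_mb: u64, peer_count: u64) -> BufferPlanMb {
    BufferPlanMb {
        queue: available_mb / 128,                          // ~0.8% of RAM
        per_peer_total: (available_mb / 512) * peer_count,  // ~0.2% of RAM per peer
        wal: available_mb / 64,                             // ~1.6% of RAM
        conflict: available_mb / 256,                       // ~0.4% of RAM
    }
}

fn main() {
    let plan = plan_buffers(32 * 1024, 4); // 32 GB server, 4 peers
    let total = plan.queue + plan.per_peer_total + plan.wal + plan.conflict;
    println!("queue={}MB per-peer={}MB wal={}MB conflict={}MB total={}MB",
             plan.queue, plan.per_peer_total, plan.wal, plan.conflict, total);
    // Prints 256 + 256 + 512 + 128 = 1152 MB, roughly 3.5% of available memory.
}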

Monitoring Buffer Usage:

-- View current buffer utilization
SELECT
buffer_type,
used_mb,
allocated_mb,
ROUND(100.0 * used_mb / allocated_mb, 1) as utilization_percent,
peak_mb,
peak_at
FROM replicate.buffer_stats
ORDER BY utilization_percent DESC;
-- Alert if approaching capacity
SELECT replicate.set_buffer_alert(
buffer_type => 'queue',
alert_threshold_percent => 80
);

Batch Configuration

Settings:

[replication]
# Operations per replication batch
batch_size = 1000
# Time before forcing a batch (ms)
batch_timeout_ms = 100
# Maximum batch size (bytes)
max_batch_size_mb = 10
# Compression threshold
compression_threshold_bytes = 1024

Tuning Recommendations:

Workload | Batch Size | Timeout | Max Size
Small transactions | 100-500 | 50ms | 1MB
Mixed | 1000 | 100ms | 10MB
Large bulk ops | 5000 | 200ms | 50MB
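The interaction between batch_size and batch_timeout_ms is a standard size-or-timeout flush: a batch ships as soon as either limit is reached. The sketch below illustrates that rule only; the struct and field names are not HeliosDB internals:

use std::time::{Duration, Instant};

struct ReplicationBatcher {
    pending: Vec<String>,      // queued change records
    opened_at: Option<Instant>,
    batch_size: usize,         // e.g. 1000
    batch_timeout: Duration,   // e.g. 100 ms
}

impl ReplicationBatcher {
    fn new(batch_size: usize, batch_timeout_ms: u64) -> Self {
        Self {
            pending: Vec::new(),
            opened_at: None,
            batch_size,
            batch_timeout: Duration::from_millis(batch_timeout_ms),
        }
    }

    /// Queue one change; returns a full batch if either limit is reached.
    fn push(&mut self, change: String) -> Option<Vec<String>> {
        if self.pending.is_empty() {
            self.opened_at = Some(Instant::now());
        }
        self.pending.push(change);
        self.flush_if_due()
    }

    fn flush_if_due(&mut self) -> Option<Vec<String>> {
        let timed_out = self.opened_at.map_or(false, |t| t.elapsed() >= self.batch_timeout);
        if self.pending.len() >= self.batch_size || (timed_out && !self.pending.is_empty()) {
            self.opened_at = None;
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }
}

fn main() {
    let mut batcher = ReplicationBatcher::new(3, 100);
    assert!(batcher.push("op-1".into()).is_none());
    assert!(batcher.push("op-2".into()).is_none());
    // Third operation fills the batch, so it is handed back for replication.
    let batch = batcher.push("op-3".into()).expect("batch should flush at batch_size");
    assert_eq!(batch.len(), 3);
    println!("flushed {} operations", batch.len());
}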

Performance Impact Query:

-- Analyze batch performance
SELECT
AVG(batch_size) as avg_batch_size,
MAX(batch_size) as max_batch_size,
AVG(batch_latency_ms) as avg_latency_ms,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY batch_latency_ms) as p99_latency_ms,
COUNT(*) as total_batches
FROM replicate.batch_stats
WHERE recorded_at > NOW() - INTERVAL '1 hour';
-- Find optimal batch size
SELECT
batch_size_bucket,
COUNT(*) as frequency,
AVG(latency_ms) as avg_latency,
(SELECT batch_size_bucket FROM (
SELECT batch_size_bucket, AVG(latency_ms) as avg_lat
FROM replicate.batch_stats
WHERE recorded_at > NOW() - INTERVAL '1 hour'
GROUP BY batch_size_bucket
) AS bucket_latency ORDER BY avg_lat ASC LIMIT 1)
as recommended_batch_size
FROM replicate.batch_stats
WHERE recorded_at > NOW() - INTERVAL '1 hour'
GROUP BY batch_size_bucket
ORDER BY frequency DESC;

Network Optimization

TCP Configuration:

# /etc/sysctl.conf - Linux kernel tuning
# Increase TCP buffers for replication
net.core.rmem_max = 134217728 # 128MB
net.core.wmem_max = 134217728 # 128MB
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
# Enable TCP keep-alive
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3
# Increase TCP backlog
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
# Apply changes
sysctl -p

Connection Pooling Optimization:

-- Monitor connection pool efficiency
SELECT
peer_id,
total_connections,
active_connections,
idle_connections,
ROUND(100.0 * active_connections / total_connections, 1) as active_percent,
ROUND(100.0 * idle_connections / total_connections, 1) as idle_percent
FROM replicate.connection_stats
ORDER BY active_percent DESC;
-- Tune pool size if consistently high utilization
-- Rule: If active_percent > 80%, increase pool size
SELECT replicate.set_connection_pool_size(
peer_id => 'eu-west-1',
pool_size => 20 -- Increase from 15
);

Conflict Resolution in Practice

Detecting Conflicts

Conflict Detection Queries:

-- Show all detected conflicts in last hour
SELECT
conflict_id,
table_name,
row_id,
master_1_id,
master_2_id,
conflict_type, -- write_conflict, delete_conflict, reference_conflict
master_1_value,
master_2_value,
resolved_value,
resolution_strategy,
detected_at
FROM replicate.conflicts
WHERE detected_at > NOW() - INTERVAL '1 hour'
ORDER BY detected_at DESC;
-- Conflict rate by table
SELECT
table_name,
COUNT(*) as conflict_count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER(), 1) as percent_of_total
FROM replicate.conflicts
WHERE detected_at > NOW() - INTERVAL '1 day'
GROUP BY table_name
ORDER BY conflict_count DESC;
-- Conflict rate trend
SELECT
DATE_TRUNC('hour', detected_at) as hour,
COUNT(*) as conflicts_per_hour,
ROUND(AVG(resolution_latency_ms), 1) as avg_resolution_ms
FROM replicate.conflicts
WHERE detected_at > NOW() - INTERVAL '7 days'
GROUP BY hour
ORDER BY hour DESC;

Resolving Conflicts Manually

Case 1: Last-Write-Wins Not Appropriate

-- View conflicted row
SELECT * FROM replicate.conflicts WHERE conflict_id = 'c-12345';
-- Result:
-- master_1 wrote: price = 100 at 2025-12-30 10:00:00
-- master_2 wrote: price = 150 at 2025-12-30 10:00:01
-- Automatically resolved to: price = 150 (LWW)
-- But business rule: Keep higher price only if > 5% increase
-- Current: (150-100)/100 = 50%, so 150 is correct
-- For another conflict:
-- master_1 wrote: price = 100 at 10:00:00
-- master_2 wrote: price = 2 at 10:00:01 (data entry error)
-- Automatically resolved to: price = 2 (LWW by timestamp)
-- Manual override:
SELECT replicate.override_conflict_resolution(
conflict_id => 'c-12346',
resolved_value => '100',
reason => 'Data entry error on master_2, price=2 is invalid'
);
-- Update actual row
UPDATE products
SET price = 100
WHERE id = 123;

Case 2: CRDT Merge Failed

-- Check CRDT merge result
SELECT * FROM replicate.conflicts WHERE conflict_id = 'c-12347';
-- Result:
-- Type: CRDT_SET merge
-- Master 1: tags = {'electronics', 'active'}
-- Master 2: tags = {'electronics', 'discontinued'}
-- Merged: tags = {'electronics', 'active', 'discontinued'}
-- Business rule: Item can't be both active and discontinued
-- Need custom resolution:
SELECT replicate.manual_crdt_merge(
conflict_id => 'c-12347',
resolved_value => '["electronics", "discontinued"]'::jsonb,
reason => 'Item no longer active, discontinued takes precedence'
);
-- Verify
SELECT tags FROM products WHERE id = 456;
-- Result: ['electronics', 'discontinued']

Testing Conflict Scenarios

Test Setup: Create Controlled Conflict

-- Create test table
CREATE TABLE conflict_test (
id INT PRIMARY KEY,
value INT,
_master_ts BIGINT,
_master_id INT
);
-- Insert initial value
INSERT INTO conflict_test (id, value) VALUES (1, 100);
-- Simulate concurrent writes
-- On Master 1:
UPDATE conflict_test SET value = 150 WHERE id = 1;
-- On Master 2 (quickly after):
UPDATE conflict_test SET value = 200 WHERE id = 1;
-- Wait for replication (100-500ms typically)
SELECT pg_sleep(1);
-- Check conflict detection
SELECT * FROM replicate.conflicts
WHERE table_name = 'conflict_test'
ORDER BY detected_at DESC
LIMIT 1;
-- View resolution
SELECT id, value FROM conflict_test;
-- Value should be 200 (higher, from later write)

Test: Verify Conflict Resolution Strategy

-- Test 1: Last-Write-Wins
-- Expected: Timestamp-based winner
-- Test 2: CRDT Counter
CREATE TABLE crdt_counter_test (
id INT PRIMARY KEY,
count CRDT_COUNTER
);
INSERT INTO crdt_counter_test VALUES (1, CRDT_COUNTER(0));
-- Master 1: Increment by 5
SELECT CRDT_INCREMENT(count, 5) FROM crdt_counter_test WHERE id = 1;
-- Master 2: Increment by 3
SELECT CRDT_INCREMENT(count, 3) FROM crdt_counter_test WHERE id = 1;
-- Wait for replication
SELECT pg_sleep(1);
-- Both should see: count = 8
SELECT count FROM crdt_counter_test WHERE id = 1;
-- Expected: 8 on both masters, no conflicts

Chaos Testing: Network Partitions

test_partition.sh
#!/bin/bash
# Simulate network partition
tc qdisc add dev eth0 root netem loss 100%
# Create conflicts while partitioned
psql -h master2.example.com -c "UPDATE products SET price = 999 WHERE id = 1;"
# Wait
sleep 5
# Heal partition
tc qdisc del dev eth0 root netem
# Monitor conflict resolution
psql -h master1.example.com -c "
SELECT * FROM replicate.conflicts
WHERE detected_at > NOW() - INTERVAL '5 minutes';"

Real-World Topologies and Examples

Example 1: Global E-Commerce Platform

Requirements:

  • 3 continents (US, EU, Asia)
  • Users write to nearest region
  • <100ms write latency
  • Last-Write-Wins conflict resolution

Topology:

[US East Master] [EU West Master] [AP South Master]
│ │ │
├─ Read Replica 1 ├─ Read Replica 2 ├─ Read Replica 3
│ │ │
[Clients US] [Clients EU] [Clients Asia]
Replication Path:
US → EU: 85ms
US → AP: 180ms
EU → AP: 140ms

Configuration:

[server]
node_id = "us-east-1"
[replication]
mode = "multi_master"
conflict_resolution = "last_write_wins"
timestamp_source = "server_timestamp"
replication_lag_threshold_ms = 500
batch_size = 2000
enable_compression = true
[[replication.masters]]
master_id = "us-east-1"
endpoint = "us.db.company.com:5432"
priority = 10
[[replication.masters]]
master_id = "eu-west-1"
endpoint = "eu.db.company.com:5432"
priority = 5
[[replication.masters]]
master_id = "ap-south-1"
endpoint = "ap.db.company.com:5432"
priority = 5

Operational SQL:

-- Route write based on user region
PREPARE write_product AS
SELECT
CASE
WHEN $1 IN ('US', 'CA', 'MX') THEN 'us-east-1'
WHEN $1 IN ('GB', 'DE', 'FR') THEN 'eu-west-1'
WHEN $1 IN ('JP', 'SG', 'IN') THEN 'ap-south-1'
ELSE 'us-east-1'
END as target_master;
-- Monitor global replication
SELECT
master_id,
region,
ROUND(AVG(replication_lag_ms), 0) as avg_lag_ms,
active_users,
writes_per_sec
FROM replicate.master_stats
WHERE recorded_at > NOW() - INTERVAL '5 minutes'
GROUP BY master_id, region
ORDER BY master_id;

Example 2: Financial Trading System

Requirements:

  • 2 data centers in same city
  • 99.999% uptime
  • Sub-50ms latency
  • Custom conflict resolution (prevents duplicate trades)

Topology:

[NYC-1 Master] [Chicago-1 Master]
│ │
[Traders] [Traders]
Trade Order Flow:
1. Submit to nearest master (< 5ms)
2. Commit locally
3. Replicate to peer (< 10ms)
4. Both confirm trade
5. Send execution to exchange

Configuration:

[replication]
mode = "multi_master"
conflict_resolution = "custom"
custom_resolver = "trade_conflict_resolver"
require_quorum_for_writes = true
quorum_size = 2
heartbeat_interval_ms = 50 # Fast failure detection
heartbeat_timeout_ms = 200
election_timeout_min_ms = 500
election_timeout_max_ms = 1000

Custom Conflict Resolver (Rust):

pub struct TradeConflictResolver;

impl ConflictResolver for TradeConflictResolver {
    fn resolve(&self, local: &Value, remote: &Value) -> Value {
        // Extract trade IDs
        let local_trade_id = local.get("trade_id").and_then(|v| v.as_str());
        let remote_trade_id = remote.get("trade_id").and_then(|v| v.as_str());
        // Different trades: both valid, keep both
        if local_trade_id != remote_trade_id {
            // Return both (merge at application layer)
            return local.clone();
        }
        // Same trade from different masters: reject duplicate
        // Keep the one with valid checksum and earlier timestamp
        let local_ts = local.get("timestamp").and_then(|v| v.as_i64()).unwrap_or(i64::MAX);
        let remote_ts = remote.get("timestamp").and_then(|v| v.as_i64()).unwrap_or(i64::MAX);
        if local_ts < remote_ts {
            local.clone()
        } else {
            remote.clone()
        }
    }
}

Operational Queries:

-- Monitor trade execution lag
SELECT
symbol,
COUNT(*) as trades_executed,
ROUND(AVG(replication_lag_ms), 2) as avg_replication_lag_ms,
ROUND(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY execution_latency_ms), 2) as p99_execution_ms
FROM replicate.trade_execution_stats
WHERE execution_time > NOW() - INTERVAL '1 hour'
GROUP BY symbol
ORDER BY avg_replication_lag_ms DESC
LIMIT 10;
-- Alert on duplicate trade attempts
SELECT
trade_id,
symbol,
COUNT(*) as submission_count,
MAX(submitted_at) - MIN(submitted_at) as time_span_ms
FROM replicate.trade_submissions
WHERE submitted_at > NOW() - INTERVAL '1 minute'
GROUP BY trade_id, symbol
HAVING COUNT(*) > 1
ORDER BY submission_count DESC;

Example 3: Multi-Tenant SaaS Application

Requirements:

  • Customers in multiple regions
  • GDPR compliance (data residency)
  • CRDT-based conflict resolution
  • Custom metadata per tenant

Topology:

┌─ US Region ─────────────────────┐
│ Master (US-West-2) │
│ ├─ Tenant: Company-A (US) │
│ ├─ Tenant: Company-B (US+CA) │
│ └─ Replicas for read scaling │
└─────────────────────────────────┘
┌─ EU Region ─────────────────────┐
│ Master (EU-Central-1) │
│ ├─ Tenant: Company-C (EU) │
│ ├─ Tenant: Company-D (EU+UK) │
│ └─ Replicas for read scaling │
└─────────────────────────────────┘
Replication: US ↔ EU only for shared tenants
Data residency: Company-A stays in US only

Configuration:

[replication]
mode = "multi_master"
conflict_resolution = "crdt"
crdt_types = ["set", "counter", "map", "register"]
# Multi-tenancy settings
enable_tenant_isolation = true
default_tenant_master = "auto" # Route to tenant's primary region
[replication.tenant_config.Company-A]
primary_region = "us-west-2"
allowed_regions = ["us-west-2"] # Residency: US only
replication_targets = [] # No replication
[replication.tenant_config.Company-B]
primary_region = "us-west-2"
allowed_regions = ["us-west-2", "eu-central-1"]
replication_targets = ["eu-central-1"]
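The residency rule the tenant_config entries encode can be checked with a small policy lookup before a change is shipped to a peer region. This is an illustrative sketch of that rule, not HeliosDB's tenant-isolation implementation:

use std::collections::HashMap;

struct TenantPolicy {
    primary_region: &'static str,
    allowed_regions: Vec<&'static str>,
}

fn may_replicate(policies: &HashMap<&str, TenantPolicy>, tenant: &str, target_region: &str) -> bool {
    policies
        .get(tenant)
        .map(|p| p.allowed_regions.iter().any(|r| *r == target_region))
        .unwrap_or(false) // unknown tenant: never ship data across regions
}

fn main() {
    let mut policies: HashMap<&str, TenantPolicy> = HashMap::new();
    policies.insert("Company-A", TenantPolicy {
        primary_region: "us-west-2",
        allowed_regions: vec!["us-west-2"],                 // residency: US only
    });
    policies.insert("Company-B", TenantPolicy {
        primary_region: "us-west-2",
        allowed_regions: vec!["us-west-2", "eu-central-1"], // shared tenant
    });
    assert!(!may_replicate(&policies, "Company-A", "eu-central-1")); // residency enforced
    assert!(may_replicate(&policies, "Company-B", "eu-central-1"));  // cross-region allowed
    println!("Company-A primary region: {}", policies["Company-A"].primary_region);
}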

Operational Queries:

-- Verify data residency compliance
SELECT
tenant_id,
table_name,
storage_region,
row_count,
data_size_mb,
CASE
WHEN storage_region = primary_region THEN 'COMPLIANT'
ELSE 'VIOLATION'
END as residency_status
FROM replicate.tenant_storage_location;
-- Monitor CRDT merge operations
SELECT
tenant_id,
operation_type, -- set_add, counter_increment, map_put
COUNT(*) as operation_count,
ROUND(AVG(merge_latency_ms), 2) as avg_merge_ms
FROM replicate.crdt_operations
WHERE executed_at > NOW() - INTERVAL '1 day'
GROUP BY tenant_id, operation_type
ORDER BY tenant_id, operation_count DESC;
-- Check cross-region replication for shared tenants
SELECT
tenant_id,
source_region,
target_region,
bytes_replicated,
ROUND(bytes_replicated::NUMERIC / 1024 / 1024, 2) as mb_replicated
FROM replicate.cross_region_replication
WHERE replicated_at > NOW() - INTERVAL '1 hour'
ORDER BY bytes_replicated DESC;

Troubleshooting Replication Issues

Troubleshooting Decision Tree

SYMPTOM: Replication Lag High (> 1 second)
├─ Check network latency
│ └─ Run: SELECT replicate.check_peer_latency('eu-west-1');
│ ├─ If > 200ms: Network issue, escalate to ops team
│ └─ If normal: Proceed to next check
├─ Check replication queue size
│ └─ Run: SELECT replicate.get_queue_depth();
│ ├─ If > 10,000 ops: Master slow to replicate
│ │ └─ Check CPU/Memory/Disk on remote master
│ └─ If normal: Proceed to next check
├─ Check batch size
│ └─ Run: SELECT replicate.get_batch_metrics();
│ ├─ If too small (< 100): Increase batch_size in config
│ ├─ If too large (> 10,000): Network issue or slow master
│ └─ Otherwise: Normal
└─ Check remote master health
└─ Restart remote master or escalate
SYMPTOM: Conflicts Not Resolving
├─ Verify conflict resolution strategy
│ └─ Run: SELECT replicate.get_conflict_strategy();
│ ├─ If LWW: Check if timestamps are synchronized
│ ├─ If CRDT: Verify CRDT type support for table
│ └─ If custom: Check if resolver function exists
├─ Check conflict log
│ └─ Run: SELECT * FROM replicate.conflicts LIMIT 10;
│ ├─ If empty: No conflicts detected
│ └─ If rows exist: Check resolution column
└─ Test resolution manually
└─ See "Testing Conflict Scenarios" section above
SYMPTOM: Master Unreachable / Failed
├─ Check network connectivity
│ └─ ping master2.example.com
│ ├─ If no response: Network down
│ └─ If responds: Process issue
├─ Check master process
│ └─ systemctl status heliosdb
│ ├─ If inactive: Start service (systemctl start heliosdb)
│ └─ If active: Check logs (journalctl -u heliosdb -n 100)
├─ Check master port
│ └─ telnet master2.example.com 5432
│ ├─ If refused: Port blocked or not listening
│ └─ If connected: Protocol issue
└─ Failover if permanent
└─ Follow "Remove Node" procedure and promote replica

Common Issues and Solutions

Issue 1: “Replication lag continuously increases”

-- Diagnose
SELECT
lag_trend,
lag_ms,
operations_pending,
network_bandwidth_mbps,
disk_write_iops
FROM replicate.replication_diagnostics;
-- Solutions:
-- 1. Reduce write rate (application-level throttling)
-- 2. Increase batch_size in configuration
-- 3. Enable compression
-- 4. Add more disk I/O capacity
-- 5. Check for slow queries on remote master
-- Example fix:
ALTER SYSTEM SET replication.batch_size = 5000;
SELECT pg_reload_conf();

Issue 2: “Conflicts spike after network partition”

-- Investigate
SELECT * FROM replicate.conflicts
WHERE detected_at > NOW() - INTERVAL '10 minutes'
ORDER BY detected_at DESC;
-- Expected: Conflicts should resolve within seconds
SELECT
COUNT(*) as unresolved_conflicts
FROM replicate.conflicts
WHERE resolution_strategy = 'pending';
-- If unresolved conflicts > 100:
-- 1. Check custom resolver is working
-- 2. Manual intervention might be needed
-- Manual resolution:
SELECT replicate.resolve_pending_conflicts(
strategy => 'last_write_wins',
timeout_sec => 30
);

Issue 3: “Master won’t promote to writable”

-- Check promotion blockers
SELECT * FROM replicate.promotion_blockers
WHERE master_id = 'ap-south-1';
-- Common blockers:
-- - Replication lag too high: Wait for catch-up
-- - Topology invalid: Verify other masters healthy
-- - Missing quorum: Need majority of masters healthy
-- Force promotion (use carefully):
SELECT replicate.force_promote(
master_id => 'ap-south-1',
override_validations => false
);

High-Availability Patterns

Pattern 1: Active-Active with Automatic Failover

Setup: 3 masters, quorum-based writes

[replication]
mode = "multi_master"
require_quorum_for_writes = true
quorum_size = 2
enable_automatic_failover = true
failover_timeout_sec = 10

RPO (Recovery Point Objective): 0 (no data loss)
RTO (Recovery Time Objective): 2-5 seconds
Availability: 99.95%

Validation:

-- Verify quorum requirements
SELECT
master_count,
min_quorum,
healthy_count,
CASE
WHEN healthy_count >= min_quorum THEN 'OK'
ELSE 'QUORUM LOST'
END as status
FROM replicate.cluster_status;
-- Simulate failure
-- (Kill one master process)
-- Monitor automatic failover
SELECT * FROM replicate.failover_history
ORDER BY failover_time DESC LIMIT 1;

Pattern 2: Read-Heavy with Single Writer

Setup: 1 writable master, 3+ read replicas

┌───────────────────────────────┐
│ Writable Master │
│ All writes accepted here │
└───────────────────────────────┘
├─ Async Replication ──────┐
│ │
├─ Replica 1 (Read-only) │
├─ Replica 2 (Read-only) │
└─ Replica 3 (Read-only) │
Benefits:
- No conflict resolution needed
- Excellent read scaling
- Simple topology
Trade-offs:
- Master is single point of failure
- Failover requires promotion

Configuration:

[replication]
mode = "single_master_with_replicas"
primary_master = "us-east-1"
replica_addresses = [
"us-east-1-rep1:5432",
"us-east-1-rep2:5432",
"us-east-1-rep3:5432"
]
auto_failover = true
min_replicas_for_failover = 1

Pattern 3: Geographic Read-Write Locality

Setup: Write to nearest master, read from nearest replica

┌──────────────────────────────────────┐
│ Client in US │
│ 1. Write to US Master (5ms) │
│ 2. Read from US Master (2ms) │
│ 3. Background sync to EU/AP │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│ Client in EU │
│ 1. Write to EU Master (5ms) │
│ 2. Read from EU Master (2ms) │
│ 3. Background sync to US/AP │
└──────────────────────────────────────┘

Routing Logic:

-- Automatic routing based on client location
CREATE OR REPLACE FUNCTION get_master_for_region(region_code VARCHAR)
RETURNS VARCHAR AS $$
BEGIN
RETURN CASE region_code
WHEN 'US' THEN 'us-east-1'
WHEN 'EU' THEN 'eu-west-1'
WHEN 'AP' THEN 'ap-south-1'
ELSE 'us-east-1'
END;
END;
$$ LANGUAGE plpgsql;
-- Application layer
-- SELECT get_master_for_region('US'); -- returns 'us-east-1'
-- The application then connects to the corresponding master
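The same routing decision is usually made client-side before a connection is opened. A minimal sketch, reusing the endpoints from Example 1 (illustrative hostnames, not real infrastructure):

fn master_endpoint_for_region(region_code: &str) -> &'static str {
    match region_code {
        "US" => "us.db.company.com:5432",   // us-east-1
        "EU" => "eu.db.company.com:5432",   // eu-west-1
        "AP" => "ap.db.company.com:5432",   // ap-south-1
        _ => "us.db.company.com:5432",      // default, matching the SQL fallback
    }
}

fn main() {
    assert_eq!(master_endpoint_for_region("EU"), "eu.db.company.com:5432");
    println!("EU clients write to {}", master_endpoint_for_region("EU"));
}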

Pattern 4: Disaster Recovery (Cold Standby)

Setup: Active master + warm standby in different region

┌──────────────────────┐
│ Active Master │
│ Accepts all writes │
│ Region: US │
└──────────────────────┘
│ Async Replication
│ (Low priority)
┌──────────────────────┐
│ Warm Standby │
│ Read-only mode │
│ Region: EU │
│ RTO: 5-10 minutes │
└──────────────────────┘

Activation Procedure:

#!/bin/bash
# Activate standby master (manual failover)
# Step 1: Verify connectivity to EU standby
ping eu-master.example.com
# Step 2: Stop replication (isolate)
psql -h eu-master.example.com -c "SELECT replicate.pause_replication();"
# Step 3: Promote to writable
psql -h eu-master.example.com -c "SELECT replicate.promote_to_master();"
# Step 4: Point applications to EU master
# (Update connection strings, DNS, etc.)
# Step 5: Verify writes working
psql -h eu-master.example.com -c "INSERT INTO test VALUES (1);"
# Step 6: Failback when original master recovered
# (Reverse steps above)

Conclusion

Multi-master replication provides HeliosDB with unparalleled write availability and geographic distribution. By following the operational procedures and best practices outlined in this guide, you can:

  1. Deploy and manage multi-master clusters confidently
  2. Handle failures gracefully with automatic failover
  3. Optimize performance for your workload
  4. Detect and resolve conflicts systematically
  5. Maintain 99.999% write availability

Key Takeaways

  • Start simple: Single-master with replicas before multi-master
  • Monitor constantly: Replication lag, conflicts, and health
  • Test thoroughly: Especially conflict scenarios and failover
  • Plan for failure: Network partitions and master crashes
  • Document your topology: Keep clear records of master assignments

Additional Resources

  • Architecture Documentation: /home/app/HeliosDB/docs/architecture/12-replication-architecture.md
  • API Reference: HeliosDB SQL Function Documentation
  • Troubleshooting Tools: heliosdb-cli --help for diagnostic commands
  • Support: HeliosDB Support Team

Document Maintained By: HeliosDB Operations Team
Last Review: 2025-12-30
Next Review: 2026-03-30