Multi-Master Replication Operations Guide
Document Version: 1.0 Status: Production Last Updated: 2025-12-30
Table of Contents
- Introduction
- Multi-Master Replication Concepts
- Supported Replication Topologies
- Conflict Detection and Resolution
- Setup and Configuration
- Operational Procedures
- Performance Tuning
- Conflict Resolution in Practice
- Real-World Topologies and Examples
- Troubleshooting Replication Issues
- High-Availability Patterns
Introduction
Multi-master replication in HeliosDB enables distributed database deployments where multiple nodes can accept write operations independently. This guide provides operational procedures for deploying, managing, and troubleshooting multi-master replication clusters.
Key Capabilities
- Bidirectional Synchronization: Write to any master node with automatic propagation
- Conflict Resolution: Last-Write-Wins (LWW), application-defined, and CRDT-based strategies
- Geographic Distribution: Deploy masters across multiple regions
- Automatic Failover: Seamless failover with conflict-free consistency
- Causal Consistency: Maintain ordering of dependent operations
- 99.999% Write Availability: No single point of failure for writes
Benefits Over Single-Master Replication
| Aspect | Single-Master | Multi-Master |
|---|---|---|
| Write Availability | Single point of failure | Distributed, resilient |
| Write Latency | Network round-trip | Local write + async replication |
| Failover Strategy | Promote read replica | Multiple masters available |
| Geographic Redundancy | Limited | Excellent across regions |
| Maintenance Downtime | Full cluster outage | Zero downtime |
Multi-Master Replication Concepts
Architecture Overview
Multi-master replication creates a peer-to-peer topology where each master node:
- Accepts local writes from connected clients
- Commits transactions to local storage immediately
- Queues changes for asynchronous replication
- Sends changes to peer masters
- Applies remote changes with conflict detection/resolution
Multi-Master Topology

  Client A               Client B               Client C
     │                      │                      │
     ▼                      ▼                      ▼
 [US-East] ◄──────────► [EU-West] ◄──────────► [AP-South]
  Master 1               Master 2               Master 3
     │                      │                      │
     │  Commit locally      │  Commit locally      │
     │  Queue for replication                      │
     └───────────── Async Replication ─────────────┘

- Each master accepts writes independently
- Changes propagate asynchronously to peers
- Conflicts detected and resolved automatically
- Causal consistency maintained via vector clocks

Key Concepts
Version Vectors / Vector Clocks
Track causality and detect concurrent writes. Each master maintains a logical clock entry for itself and for every peer master.
Master 1: [10, 0, 0]   # 10 local writes, 0 from Master 2, 0 from Master 3
Master 2: [5, 15, 0]   # 5 from Master 1, 15 local writes, 0 from Master 3
Master 3: [10, 12, 20] # 10 from Master 1, 12 from Master 2, 20 local writes

Replication Lag
Time between commit on master A and application on master B. Typical values: 50-500ms (depends on network).
Application writes to US-East Master
  ├─ Commit locally (1ms)
  ├─ Queue for replication (1ms)
  ├─ Send to EU-West (85ms network)
  ├─ Receive on EU-West (1ms)
  ├─ Apply to storage (10ms)
  └─ Acknowledge (1ms)

Total Replication Lag: ~100ms

Conflict Detection
Two writes to the same row from different masters within a replication window.
Time   US-East Master               EU-West Master
0      UPDATE users SET x=1         UPDATE users SET x=2
1      Commit locally               Commit locally
2      Queue for replication        Queue for replication
85     Receive EU update: x=2       Receive US update: x=1
       (conflict detected)          (conflict detected)
86     Apply conflict resolution (e.g., LWW: x=2 wins due to later timestamp)

Supported Replication Topologies
Active-Active Topology
All masters actively accept writes. Best for geographic distribution.
Client Access Patterns
  Region 1: Route to Master 1 (US-East)
  Region 2: Route to Master 2 (EU-West)
  Region 3: Route to Master 3 (AP-South)

  [Master 1: US-East] ◄──► [Master 2: EU-West] ◄──► [Master 3: AP-South]
        └──────────── all write paths bidirectional ────────────┘
Benefits:
- Writes stay local for all regions
- No single point of failure
- Minimal write latency

Trade-offs:
- Conflict resolution complexity
- Coordination overhead
- Eventual consistency (not strong)

Multi-Region Topology
Masters deployed across geographic regions with optional read replicas per region.
Multi-Region Architecture
  US Region:        Master 1 (Writable) + Replica 1 (Read-only)
  EU Region:        Master 2 (Writable) + Replica 2 (Read-only)
  Asia Region:      Master 3 (Writable)
  Australia Region: Master 4 (Writable)

  All masters replicate to all masters.
  Replicas provide read-scaling per region.

Failover Topology
Supports automatic failover when master becomes unavailable.
Normal Operation (All Masters Healthy)
  Master 1 (Primary) ◄── Replication ──► Master 2 (Standby)
  Writes → Master 1
  Reads  → Both

Master 1 Failure Detected
  Master 1 ✗ (Unreachable)
  Master 2 (Promoted)
  Writes → Master 2
  Reads  → Master 2

Master 1 Recovery
  Master 1 (Recovering) ◄── Catch-up ──► Master 2 (Primary)
  Writes → Master 2
  Reads  → Both (once Master 1 is synchronized)

Conflict Detection and Resolution
Conflict Detection Mechanisms
HeliosDB detects conflicts at the row and column level using multiple strategies:
Version Vector Detection
Detects concurrent writes from different masters:
State on Master 1: [row_id=100, value=10, vv=[5,0,0]]
Master 1 writes value=15: new vv=[6,0,0]

Concurrent write on Master 2:
Master 2 writes value=20: new vv=[0,1,0]

When Master 1 receives the update from Master 2:
  Local vv=[6,0,0], Incoming vv=[0,1,0]
  Neither dominates (no causal order) → CONFLICT DETECTED
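To make the dominance check concrete, here is a minimal sketch of the comparison, assuming a fixed-size vector with one counter per master. The names and layout are illustrative, not HeliosDB's internal API.

/// Illustrative version-vector comparison (not HeliosDB's internal representation).
#[derive(Debug, PartialEq)]
enum Causality {
    LocalDominates,   // local write already saw everything the remote write saw
    RemoteDominates,  // remote write already saw everything the local write saw
    Concurrent,       // neither dominates -> conflict detected
    Equal,
}

fn compare(local: &[u64], remote: &[u64]) -> Causality {
    // Assumes both vectors have one entry per master, in the same order.
    let mut local_ahead = false;
    let mut remote_ahead = false;
    for (l, r) in local.iter().zip(remote.iter()) {
        if l > r { local_ahead = true; }
        if r > l { remote_ahead = true; }
    }
    match (local_ahead, remote_ahead) {
        (true, false)  => Causality::LocalDominates,
        (false, true)  => Causality::RemoteDominates,
        (true, true)   => Causality::Concurrent,
        (false, false) => Causality::Equal,
    }
}

fn main() {
    // The example above: local vv = [6,0,0], incoming vv = [0,1,0] -> concurrent.
    assert_eq!(compare(&[6, 0, 0], &[0, 1, 0]), Causality::Concurrent);
    // A causally ordered case: [5,0,0] happened before [6,0,0].
    assert_eq!(compare(&[6, 0, 0], &[5, 0, 0]), Causality::LocalDominates);
}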
Resolution: Apply conflict strategy (LWW, CRDT, or custom)

Timestamp-Based Detection
Last-Write-Wins strategy with logical or wall-clock timestamps:
-- Table with conflict detection columns
CREATE TABLE accounts (
    id UUID PRIMARY KEY,
    balance DECIMAL,
    master_id INT,
    timestamp BIGINT,
    write_counter INT,
    CONSTRAINT timestamp_check CHECK (timestamp > 0)
);

-- Write from Master 1
UPDATE accounts
SET balance = 100, master_id = 1, timestamp = 1000, write_counter = 5
WHERE id = 'acc-001';

-- Concurrent write from Master 2
UPDATE accounts
SET balance = 150, master_id = 2, timestamp = 1001, write_counter = 3
WHERE id = 'acc-001';

-- Conflict Resolution: Compare timestamps
-- Master 2's write (timestamp=1001) > Master 1's write (timestamp=1000)
-- Result: balance = 150, master_id = 2

Last-Write-Wins (LWW) Resolution
Simplest strategy: keep the write with the latest timestamp.
Configuration:
[replication]
conflict_resolution = "last_write_wins"
timestamp_source = "server_timestamp"  # server_timestamp | client_timestamp | logical_clock
timestamp_precision_ms = true

SQL Example:

-- Enable LWW for a table
CREATE TABLE products (
    id INT PRIMARY KEY,
    name VARCHAR(255),
    price DECIMAL(10,2),
    _master_ts BIGINT DEFAULT CURRENT_TIMESTAMP,
    _master_id INT DEFAULT CURRENT_MASTER_ID
);

-- System automatically selects write with max(_master_ts)
-- To create conflict:
-- Master 1: UPDATE products SET price = 100 WHERE id = 1;  -- ts=1000
-- Master 2: UPDATE products SET price = 150 WHERE id = 1;  -- ts=1001
-- Result: price = 150 (Master 2 wins)

Pros and Cons:
| Aspect | Details |
|---|---|
| Pros | Simple, deterministic, low overhead |
| Cons | May lose legitimate writes from slower master |
| Best For | Non-critical data, eventually consistent apps |
| Worst For | Financial transactions, inventory counts |
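For intuition, the LWW decision itself is a simple comparison. The sketch below assumes timestamps plus a master-id tie-break so every node picks the same winner; the tie-break rule and field names are illustrative assumptions, not the documented behavior of timestamp_source.

/// Illustrative LWW pick; struct and field names are hypothetical.
#[derive(Clone, Debug)]
struct VersionedWrite {
    value: String,
    timestamp: i64, // e.g. _master_ts
    master_id: u32, // e.g. _master_id, used only to break exact timestamp ties
}

fn last_write_wins(local: VersionedWrite, remote: VersionedWrite) -> VersionedWrite {
    // Later timestamp wins; on an exact tie, fall back to the higher master_id
    // so that the choice is deterministic on every node.
    if (remote.timestamp, remote.master_id) > (local.timestamp, local.master_id) {
        remote
    } else {
        local
    }
}

fn main() {
    let m1 = VersionedWrite { value: "price=100".into(), timestamp: 1000, master_id: 1 };
    let m2 = VersionedWrite { value: "price=150".into(), timestamp: 1001, master_id: 2 };
    // Matches the products example above: ts=1001 beats ts=1000.
    assert_eq!(last_write_wins(m1, m2).value, "price=150");
}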
Application-Defined Resolution
Custom resolver logic using SQL or application code.
Configuration:
[replication]
conflict_resolution = "custom"
resolver_function = "resolve_product_conflict"
resolver_language = "sql"  # sql | rust | javascript

SQL Custom Resolver Example:

-- Create custom resolver function
CREATE FUNCTION resolve_product_conflict(
    local_value JSONB,
    remote_value JSONB,
    conflict_type VARCHAR
) RETURNS JSONB AS $$
BEGIN
    -- Example: Keep higher price for inventory management
    IF (local_value->>'price')::DECIMAL > (remote_value->>'price')::DECIMAL THEN
        RETURN local_value;
    ELSE
        RETURN remote_value;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Register resolver for table
ALTER TABLE products
SET REPLICATION CONFLICT_RESOLVER = 'resolve_product_conflict';

Rust Custom Resolver Example:
use heliosdb_multi_master::ConflictResolver;

struct InventoryConflictResolver;

impl ConflictResolver for InventoryConflictResolver {
    fn resolve(&self, local: &Value, remote: &Value) -> Value {
        // Business rule: Keep highest inventory count
        let local_qty = local.get("quantity").and_then(|v| v.as_i64()).unwrap_or(0);
        let remote_qty = remote.get("quantity").and_then(|v| v.as_i64()).unwrap_or(0);

        if local_qty > remote_qty {
            local.clone()
        } else {
            remote.clone()
        }
    }
}

// Register resolver
let config = MultiMasterConfig {
    conflict_resolution: ConflictResolution::Custom {
        resolver: Box::new(InventoryConflictResolver),
    },
    ..Default::default()
};

CRDT-Based Resolution
Conflict-free replicated data types guarantee eventual consistency.
Supported CRDT Types:
-- Counter CRDT (supports distributed increment/decrement)
CREATE TABLE page_views (
    page_id UUID PRIMARY KEY,
    view_count CRDT_COUNTER,
    last_modified TIMESTAMP
);

INSERT INTO page_views VALUES ('page-1', CRDT_COUNTER(0));

-- Increment from any master - no conflicts!
-- Master 1: SELECT CRDT_INCREMENT(view_count, 5)
-- Master 2: SELECT CRDT_INCREMENT(view_count, 3)
-- Both masters converge to same value after replication

-- Set CRDT (supports distributed set operations)
CREATE TABLE user_tags (
    user_id UUID PRIMARY KEY,
    tags CRDT_SET<VARCHAR(255)>,
    last_modified TIMESTAMP
);

-- Add tags from any master
-- Master 1: SELECT CRDT_ADD(tags, 'premium')
-- Master 2: SELECT CRDT_ADD(tags, 'verified')
-- Result: tags = {'premium', 'verified'} on all masters

-- Map CRDT (supports distributed map operations)
CREATE TABLE user_preferences (
    user_id UUID PRIMARY KEY,
    preferences CRDT_MAP<VARCHAR(255), VARCHAR(255)>,
    last_modified TIMESTAMP
);

-- Set preferences from any master
-- Master 1: SELECT CRDT_SET(preferences, 'theme', 'dark')
-- Master 2: SELECT CRDT_SET(preferences, 'language', 'en')
-- Result: preferences = {theme: 'dark', language: 'en'} on all masters

Configuration:

[replication]
conflict_resolution = "crdt"
crdt_types = ["counter", "set", "map", "register"]

Benefits:
- No conflicts - mathematically proven convergence
- Deterministic merge
- Suitable for append-only data
- High availability
Limitations:
- Limited to CRDT-compatible operations
- Requires schema design changes
- Not suitable for strong consistency
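To see why counter CRDTs converge without conflicts, here is a minimal PN-counter-style sketch: each master only adds to its own slot and merge takes the element-wise maximum, so replication order does not matter. This is illustrative only; the CRDT_COUNTER column type encapsulates the equivalent logic inside HeliosDB.

use std::collections::HashMap;

/// Illustrative PN-counter: per-master increment and decrement totals.
#[derive(Clone, Default)]
struct PnCounter {
    increments: HashMap<String, u64>,
    decrements: HashMap<String, u64>,
}

impl PnCounter {
    fn increment(&mut self, master: &str, by: u64) {
        *self.increments.entry(master.to_string()).or_default() += by;
    }
    fn decrement(&mut self, master: &str, by: u64) {
        *self.decrements.entry(master.to_string()).or_default() += by;
    }
    /// Merge = element-wise max; commutative, associative, idempotent.
    fn merge(&mut self, other: &PnCounter) {
        for (m, v) in &other.increments {
            let e = self.increments.entry(m.clone()).or_default();
            *e = (*e).max(*v);
        }
        for (m, v) in &other.decrements {
            let e = self.decrements.entry(m.clone()).or_default();
            *e = (*e).max(*v);
        }
    }
    fn value(&self) -> i64 {
        let inc: u64 = self.increments.values().sum();
        let dec: u64 = self.decrements.values().sum();
        inc as i64 - dec as i64
    }
}

fn main() {
    // Mirrors the page_views example: +5 on Master 1, +3 on Master 2.
    let mut on_master1 = PnCounter::default();
    let mut on_master2 = PnCounter::default();
    on_master1.increment("us-east-1", 5);
    on_master2.increment("eu-west-1", 3);

    // Replicate in either order; both converge to 8 with no conflict.
    on_master1.merge(&on_master2);
    on_master2.merge(&on_master1);
    assert_eq!(on_master1.value(), 8);
    assert_eq!(on_master2.value(), 8);
}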
Setup and Configuration
Prerequisites
- HeliosDB v5.3 or later
- Network connectivity between all master nodes
- Synchronized time across masters (NTP recommended, <100ms skew)
- At least 500MB free disk space per master
- Port 5432 (PostgreSQL protocol) accessible between masters
Enabling Multi-Master Replication
Step 1: Configuration File
[server]
node_id = "us-east-1"
listen_address = "10.0.1.100"
port = 5432

[replication]
# Enable multi-master mode
mode = "multi_master"

# Local master configuration
local_master_id = "us-east-1"
local_region = "us-east-1"

# Replication behavior
conflict_resolution = "last_write_wins"
timestamp_source = "server_timestamp"
replication_lag_threshold_ms = 500
max_retries = 3
batch_size = 1000

# Causal consistency
enable_causal_consistency = true
vector_clock_update_interval_ms = 10

# Network
replication_timeout_ms = 5000
connection_pool_size = 10
enable_compression = true

# Monitoring
health_check_interval_sec = 10
replication_monitor_interval_sec = 30

Step 2: Configure Replication Nodes
-- Connect to each master and configure peers
-- On Master 1 (us-east-1)
SELECT replicate.add_master(
    master_id => 'us-east-1',
    endpoint  => '10.0.1.100:5432',
    region    => 'us-east-1',
    priority  => 10
);

SELECT replicate.add_master(
    master_id => 'eu-west-1',
    endpoint  => '10.0.2.100:5432',
    region    => 'eu-west-1',
    priority  => 5
);

SELECT replicate.add_master(
    master_id => 'ap-south-1',
    endpoint  => '10.0.3.100:5432',
    region    => 'ap-south-1',
    priority  => 5
);

-- Verify topology
SELECT * FROM replicate.masters;

Step 3: Start Replication Service

# Start HeliosDB with multi-master enabled
systemctl start heliosdb

# Verify replication is active
heliosdb-cli \
  --host 10.0.1.100 \
  --port 5432 \
  -c "SELECT replicate.status();"

# Expected output:
# status: ACTIVE
# master_id: us-east-1
# replication_lag_ms: 0
# healthy_peers: 2
# conflicts_detected: 0

Network and Latency Considerations
Recommended Network Topology
Internet / Private Network
          │
   [Firewall]           [Firewall]           [Firewall]
          │                  │                    │
      [VPC-US]           [VPC-EU]            [VPC-Asia]
     [Master-1]         [Master-2]          [Master-3]
     [Replica-1]        [Replica-2]         [Replica-3]

Network Requirements:
- Latency: <100ms between masters (LAN) or <500ms (WAN)
- Bandwidth: 10 Mbps minimum, 100 Mbps recommended
- Packet loss: <0.1%
- TCP connections: Keep-alive enabled

Configuration for Different Network Conditions

Low-Latency (LAN, <10ms):

[replication]
heartbeat_interval_ms = 50
heartbeat_timeout_ms = 200
replication_batch_size = 100
replication_timeout_ms = 2000

Medium-Latency (WAN, 50-100ms):

[replication]
heartbeat_interval_ms = 100
heartbeat_timeout_ms = 500
replication_batch_size = 1000
replication_timeout_ms = 5000

High-Latency (Satellite/Intercontinental, >200ms):

[replication]
heartbeat_interval_ms = 500
heartbeat_timeout_ms = 2000
replication_batch_size = 5000
replication_timeout_ms = 15000
enable_compression = true

Network Monitoring Queries
-- Check latency to peer masters
SELECT
    peer_id,
    region,
    ROUND(AVG(rtt_ms), 2) AS avg_latency_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY rtt_ms) AS p99_latency_ms,
    COUNT(*) AS ping_count
FROM replicate.peer_latency_history
WHERE recorded_at > NOW() - INTERVAL '1 hour'
GROUP BY peer_id, region
ORDER BY avg_latency_ms DESC;

-- Monitor replication lag
SELECT
    master_id,
    current_lsn,
    last_replicated_lsn,
    current_lsn - last_replicated_lsn AS pending_operations,
    ROUND((current_lsn - last_replicated_lsn) / 1000.0, 2) AS lag_mb,
    ROUND(EXTRACT(EPOCH FROM last_replication_time) * 1000, 0) AS last_replication_ms_ago
FROM replicate.masters
ORDER BY pending_operations DESC;

-- Network statistics
SELECT
    direction,
    bytes_sent,
    bytes_received,
    ROUND(bytes_sent / NULLIF(compression_ratio, 0), 0) AS bytes_before_compression,
    ROUND(100 * (1 - 1 / NULLIF(compression_ratio, 1)), 1) AS compression_percent
FROM replicate.network_stats
WHERE recorded_at > NOW() - INTERVAL '1 hour'
ORDER BY recorded_at DESC
LIMIT 10;

Connection Management
Connection Pool Configuration:
[replication.connection_pool]
pool_size = 10
max_connections = 50
idle_timeout_sec = 300
connection_timeout_sec = 10

# Per-peer configuration
[[replication.peer_connections]]
peer_id = "eu-west-1"
pool_size = 15  # More connections for critical peers
priority = 10

SQL Connection Management:

-- View active replication connections
SELECT
    peer_id,
    connection_id,
    state,
    connected_since,
    bytes_sent,
    bytes_received
FROM replicate.connections
WHERE state = 'active'
ORDER BY peer_id;

-- Close stale connections
SELECT replicate.close_connection(
    peer_id => 'eu-west-1',
    connection_id => 12345
);

-- Reset connection pool
SELECT replicate.reset_connection_pool('eu-west-1');

Operational Procedures
Adding a New Replication Node
Procedure: Add Master to Active Cluster
Duration: 5-10 minutes Impact: Minimal, online operation Validation: Required
Prerequisites:
- New master fully provisioned and running HeliosDB
- All existing masters healthy
- Network connectivity verified
- Sufficient disk space for backlog
Step 1: Initialize New Master
-- On the new master (ap-northeast-1)
SELECT replicate.initialize_master(
    master_id => 'ap-northeast-1',
    region    => 'ap-northeast-1',
    priority  => 10
);

-- Verify initialization
SELECT * FROM replicate.status();
-- Expected: status=INITIALIZING, replication_mode=catching_up

Step 2: Add New Master to Existing Masters

-- On existing master (us-east-1)
SELECT replicate.add_master(
    master_id    => 'ap-northeast-1',
    endpoint     => '10.0.4.100:5432',
    region       => 'ap-northeast-1',
    priority     => 10,
    initial_role => 'replica'
);

-- Verify master added
SELECT * FROM replicate.masters WHERE master_id = 'ap-northeast-1';

Step 3: Establish Replication Connections

# Wait for connection establishment (30-60 seconds)
# Monitor connection status
heliosdb-cli -c "SELECT replicate.status();" | grep -E "peer_count|healthy_peers"

# Expected output:
# peer_count: 4
# healthy_peers: 3

Step 4: Wait for Catch-Up
-- Monitor catch-up progress
SELECT
    master_id,
    replication_lag_bytes,
    replication_lag_ms,
    catch_up_progress_percent,
    CASE
        WHEN catch_up_progress_percent >= 100 THEN 'COMPLETE'
        ELSE 'IN_PROGRESS'
    END AS catch_up_status
FROM replicate.masters
WHERE master_id = 'ap-northeast-1';

-- Loop until catch_up_progress_percent = 100
-- Typical catch-up time: 5-10 minutes for 1GB lag
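If you prefer to script this wait rather than re-running the query by hand, a small polling loop works. The sketch below uses the Rust `postgres` crate over the PostgreSQL wire protocol mentioned in the prerequisites; the connection string, credentials, and column types are placeholder assumptions, so adjust them to your environment.

use std::{thread, time::Duration};
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    // Placeholder connection string; point it at any healthy master.
    let mut client = Client::connect(
        "host=10.0.1.100 port=5432 user=ops dbname=heliosdb",
        NoTls,
    )?;

    loop {
        // Polls the same view used in the manual procedure above.
        // Column types (FLOAT8 / BIGINT) are assumptions; adjust to the actual schema.
        let row = client.query_one(
            "SELECT catch_up_progress_percent, replication_lag_ms \
             FROM replicate.masters WHERE master_id = $1",
            &[&"ap-northeast-1"],
        )?;
        let progress: f64 = row.get(0);
        let lag_ms: i64 = row.get(1);
        println!("catch-up {:.1}% (lag {} ms)", progress, lag_ms);

        if progress >= 100.0 {
            break; // safe to proceed to Step 5
        }
        thread::sleep(Duration::from_secs(10));
    }
    Ok(())
}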
Step 5: Activate New Master

-- On new master
SELECT replicate.promote_to_master(
    master_id => 'ap-northeast-1',
    validate_topology => true
);

-- Verify promotion
SELECT role FROM replicate.status();
-- Expected: role = master

-- Verify it can accept writes
INSERT INTO test_table VALUES (1, 'test from new master');
COMMIT;

Step 6: Validation

-- Verify replication from new master
-- On new master
INSERT INTO validation_test (id, value, created_at)
VALUES (1, 'from_ap_northeast', NOW());

-- On existing master (wait 1-2 seconds)
SELECT * FROM validation_test WHERE id = 1;
-- Should see the record

-- Check replication health
SELECT replicate.validate_topology();
-- Expected: VALID, all masters healthy

Removing a Replication Node
Procedure: Remove Master from Cluster
Duration: 3-5 minutes Impact: No impact to cluster Validation: Required
Prerequisites:
- Target master can be shut down
- Cluster has at least 2 remaining masters
- No ongoing maintenance on other masters
Step 1: Pre-Removal Validation
-- Verify healthy cluster state
SELECT replicate.validate_topology();
-- Expected: VALID

-- Check master being removed has no pinned transactions
SELECT COUNT(*) AS pinned_count
FROM replicate.pinned_transactions
WHERE master_id = 'eu-west-1';
-- If count > 0, wait for transactions to complete

-- Check replication lag is low
SELECT master_id, replication_lag_ms FROM replicate.masters;
-- All should be < 1000ms

Step 2: Drain Master (Stop Accepting New Writes)

-- On the master being removed
SELECT replicate.set_read_only(master_id => 'eu-west-1');

-- Verify reads still work
SELECT COUNT(*) FROM large_table;  -- Should succeed

-- Try to write (should fail)
INSERT INTO test_table VALUES (1);
-- Error: Read-only mode

Step 3: Wait for Replication to Catch-Up

-- Monitor until this master is fully replicated
SELECT
    master_id,
    pending_operations,
    replication_lag_ms
FROM replicate.masters
WHERE master_id = 'eu-west-1';

-- Wait until pending_operations = 0 and replication_lag_ms < 100

Step 4: Remove Master from Topology

-- On any remaining master
SELECT replicate.remove_master(
    master_id => 'eu-west-1',
    verify_replication => true
);

-- Verify removal
SELECT COUNT(*) FROM replicate.masters;
-- Count should decrease by 1

Step 5: Shutdown Master

# Stop HeliosDB on removed master
systemctl stop heliosdb

# Verify it's offline
heliosdb-cli --host 10.0.2.100 --port 5432 -c "SELECT 1"
# Should fail with connection refused

Step 6: Post-Removal Validation

-- Verify cluster is healthy with remaining masters
SELECT replicate.validate_topology();
-- Expected: VALID

-- Verify quorum is sufficient
SELECT
    master_count,
    min_quorum,
    CASE
        WHEN master_count >= min_quorum THEN 'OK'
        ELSE 'ERROR'
    END AS quorum_status
FROM replicate.cluster_info;

Handling Node Failures
Scenario 1: Master Goes Offline Unexpectedly
Timeline:
  0ms:    Master 2 becomes unreachable (network partition)
  100ms:  Heartbeat missed
  500ms:  Suspected failure
  2000ms: Confirmed failure, triggers failover logic
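This timeline corresponds to a simple heartbeat-silence detector: the peer moves from healthy to suspected to failed as the time since its last heartbeat grows. The sketch below is illustrative logic only, with thresholds mirroring the heartbeat_timeout_ms style settings shown earlier; it is not the engine's failure detector.

use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Clone, Copy)]
enum PeerState {
    Healthy,
    Suspected,
    Failed,
}

/// Illustrative heartbeat-based failure detector for a single peer.
struct FailureDetector {
    last_heartbeat: Instant,
    suspect_after: Duration, // e.g. 500ms of silence -> Suspected
    fail_after: Duration,    // e.g. 2000ms of silence -> Failed, trigger failover
}

impl FailureDetector {
    fn heartbeat(&mut self) {
        self.last_heartbeat = Instant::now();
    }

    fn state(&self) -> PeerState {
        let silence = self.last_heartbeat.elapsed();
        if silence >= self.fail_after {
            PeerState::Failed
        } else if silence >= self.suspect_after {
            PeerState::Suspected
        } else {
            PeerState::Healthy
        }
    }
}

fn main() {
    let mut detector = FailureDetector {
        last_heartbeat: Instant::now(),
        suspect_after: Duration::from_millis(500),
        fail_after: Duration::from_millis(2000),
    };
    detector.heartbeat();
    // Immediately after a heartbeat the peer is considered healthy.
    assert_eq!(detector.state(), PeerState::Healthy);
}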
Detection Query:

-- Detect failed nodes
SELECT
    master_id,
    last_heartbeat,
    NOW() - last_heartbeat AS time_since_heartbeat,
    state,
    is_healthy
FROM replicate.masters
WHERE is_healthy = false
ORDER BY last_heartbeat DESC;

-- Troubleshoot failure reason
SELECT
    master_id,
    failure_type,  -- network_timeout, connection_refused, unresponsive
    last_error,
    failure_count,
    first_failure_time,
    last_failure_time
FROM replicate.failure_log
WHERE recorded_at > NOW() - INTERVAL '30 minutes'
ORDER BY last_failure_time DESC;

Recovery Steps:

-- Option 1: Restart failed master (if it's just a process crash)
-- On failed master:
systemctl restart heliosdb

-- Monitor reconnection (30-60 seconds)
SELECT * FROM replicate.masters WHERE master_id = 'master_2';
-- status should change from FAILED to CATCHING_UP then HEALTHY

-- Option 2: Investigate network partition
-- Check network connectivity
-- (Execute from operational master)
-- Note: You would need to check actual network status via ping/traceroute

-- Option 3: If permanent failure - remove master (see Remove procedure above)

-- After recovery, validate topology
SELECT replicate.validate_topology();

Monitoring Replication Lag
Real-Time Lag Monitoring:
-- Current lag for all masters
SELECT
    master_id,
    region,
    CASE
        WHEN replication_lag_ms < 100 THEN 'EXCELLENT'
        WHEN replication_lag_ms < 500 THEN 'GOOD'
        WHEN replication_lag_ms < 1000 THEN 'ACCEPTABLE'
        ELSE 'POOR'
    END AS lag_status,
    replication_lag_ms,
    replication_lag_bytes,
    last_replication_time
FROM replicate.masters
ORDER BY replication_lag_ms DESC;

-- Detailed lag breakdown
SELECT
    peer_id,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY lag_ms) AS p50_lag_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY lag_ms) AS p95_lag_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY lag_ms) AS p99_lag_ms,
    MAX(lag_ms) AS max_lag_ms,
    COUNT(*) AS samples
FROM replicate.lag_history
WHERE recorded_at > NOW() - INTERVAL '1 hour'
GROUP BY peer_id
ORDER BY p99_lag_ms DESC;

-- Operations pending replication
SELECT
    master_id,
    COUNT(*) AS pending_ops
FROM replicate.replication_queue
WHERE state = 'pending'
GROUP BY master_id
ORDER BY pending_ops DESC;

Setting Alerts:

-- Create alert for high lag
CREATE ALERT high_replication_lag AS
SELECT master_id, replication_lag_ms
FROM replicate.masters
WHERE replication_lag_ms > 1000
EVERY 30 SECONDS
ACTION NOTIFY 'replication-alerts@company.com';

-- Create alert for failed master
CREATE ALERT master_failure AS
SELECT master_id, COUNT(*) AS consecutive_failures
FROM replicate.failure_log
WHERE recorded_at > NOW() - INTERVAL '5 minutes'
GROUP BY master_id
HAVING COUNT(*) > 10
ACTION NOTIFY 'critical-alerts@company.com';

Handling Split-Brain Scenarios
Split-Brain Definition: Two or more masters operate independently when they should be synchronized.
Prevention:
[replication]
# Require quorum for writes to prevent split-brain
require_quorum_for_writes = true
quorum_size = 2  # Majority of 3 masters

# Use witness node to break ties
use_witness_node = true
witness_endpoints = ["witness.example.com:6379"]
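Operationally, the quorum rule is just a majority count: a partition keeps accepting writes only while it can still reach enough masters. A minimal sketch of that check, assuming the quorum_size semantics above (illustrative logic, not the engine's implementation):

/// Illustrative quorum check used to refuse writes in a minority partition.
fn has_write_quorum(reachable_masters: usize, total_masters: usize, quorum_size: usize) -> bool {
    // quorum_size is normally a strict majority: total/2 + 1.
    let majority = total_masters / 2 + 1;
    reachable_masters >= quorum_size.max(majority)
}

fn main() {
    // 3-master cluster, quorum_size = 2 (as in the config above).
    assert!(has_write_quorum(2, 3, 2));  // majority partition keeps accepting writes
    assert!(!has_write_quorum(1, 3, 2)); // minority partition must refuse writes
}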
Detection:

-- Detect split-brain (different data on different masters)
-- Run the following on every master and compare the results:
SELECT COUNT(DISTINCT master_id) AS master_count, COUNT(*) AS row_count
FROM replicate.split_brain_detection;

-- If the same table has different row counts on different masters
SELECT * FROM replicate.split_brain_status;

-- View conflicting rows
SELECT
    table_name,
    row_id,
    master_1_value,
    master_2_value,
    conflict_detected_at
FROM replicate.split_brain_conflicts
ORDER BY table_name, row_id;

Recovery from Split-Brain:

-- Step 1: Isolate the cluster partition
-- (Disable network between partitions)

-- Step 2: Identify which partition has the majority of masters
-- (If masters = 3, need 2 for majority)

-- Step 3: On the majority partition, mark the partition as canonical
SELECT replicate.mark_as_canonical_partition(
    partition_id => 'partition_1',
    force => true
);

-- Step 4: On the minority partition, stop replication
SELECT replicate.pause_replication();

-- Step 5: Heal the network partition

-- Step 6: Sync the minority partition from the canonical partition
SELECT replicate.sync_from_canonical_partition(
    target_partition => 'partition_2'
);

-- Step 7: Resume replication
SELECT replicate.resume_replication();

-- Step 8: Validate topology
SELECT replicate.validate_topology();

Performance Tuning
Replication Buffer Sizes
Configuration:
[replication.buffers]
# Main replication queue
queue_buffer_size_mb = 256

# Per-peer replication buffers
per_peer_buffer_mb = 64

# WAL archive buffer
wal_buffer_size_mb = 512

# Conflict detection buffer
conflict_detection_buffer_mb = 128

Tuning Guidance:
Available Memory: 32 GB
Recommended Allocation:
  - Queue buffer:        256 MB (0.8%)
  - Per-peer (4 peers):  256 MB (0.8%)
  - WAL buffer:          512 MB (1.6%)
  - Conflict buffer:     128 MB (0.4%)
  Total:                ~1.2 GB (about 3.5% of available)

For lower-memory servers (4 GB):
  - Queue buffer:        64 MB
  - Per-peer (2 peers):  64 MB
  - WAL buffer:          128 MB
  - Conflict buffer:     32 MB
  Total:                ~288 MB (7% of available)

Monitoring Buffer Usage:
-- View current buffer utilization
SELECT
    buffer_type,
    used_mb,
    allocated_mb,
    ROUND(100.0 * used_mb / allocated_mb, 1) AS utilization_percent,
    peak_mb,
    peak_at
FROM replicate.buffer_stats
ORDER BY utilization_percent DESC;

-- Alert if approaching capacity
SELECT replicate.set_buffer_alert(
    buffer_type => 'queue',
    alert_threshold_percent => 80
);

Batch Configuration
Settings:
[replication]
# Operations per replication batch
batch_size = 1000

# Time before forcing a batch (ms)
batch_timeout_ms = 100

# Maximum batch size (MB)
max_batch_size_mb = 10

# Compression threshold
compression_threshold_bytes = 1024

Tuning Recommendations:
| Workload | Batch Size | Timeout | Max Size |
|---|---|---|---|
| Small transactions | 100-500 | 50ms | 1MB |
| Mixed | 1000 | 100ms | 10MB |
| Large bulk ops | 5000 | 200ms | 50MB |
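These parameters interact in a simple way: a batch ships as soon as any one of the three limits is reached. A minimal sketch of that flush decision, with names mirroring the settings above (the logic is illustrative, not the replication engine's code):

use std::time::{Duration, Instant};

/// Illustrative flush rule combining batch_size, batch_timeout_ms and max_batch_size_mb.
struct BatchState {
    ops: usize,
    bytes: usize,
    opened_at: Instant,
}

struct BatchLimits {
    batch_size: usize,       // e.g. 1000 operations
    batch_timeout: Duration, // e.g. 100 ms
    max_batch_bytes: usize,  // e.g. 10 MB
}

fn should_flush(state: &BatchState, limits: &BatchLimits) -> bool {
    state.ops >= limits.batch_size
        || state.bytes >= limits.max_batch_bytes
        || state.opened_at.elapsed() >= limits.batch_timeout
}

fn main() {
    let limits = BatchLimits {
        batch_size: 1000,
        batch_timeout: Duration::from_millis(100),
        max_batch_bytes: 10 * 1024 * 1024,
    };
    let state = BatchState { ops: 1000, bytes: 512 * 1024, opened_at: Instant::now() };
    // The operation-count limit is reached, so this batch ships immediately.
    assert!(should_flush(&state, &limits));
}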
Performance Impact Query:
-- Analyze batch performance
SELECT
    AVG(batch_size) AS avg_batch_size,
    MAX(batch_size) AS max_batch_size,
    AVG(batch_latency_ms) AS avg_latency_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY batch_latency_ms) AS p99_latency_ms,
    COUNT(*) AS total_batches
FROM replicate.batch_stats
WHERE recorded_at > NOW() - INTERVAL '1 hour';

-- Find optimal batch size
SELECT
    batch_size_bucket,
    COUNT(*) AS frequency,
    AVG(latency_ms) AS avg_latency,
    (SELECT batch_size_bucket FROM (
        SELECT batch_size_bucket, AVG(latency_ms) AS avg_lat
        FROM replicate.batch_stats
        WHERE recorded_at > NOW() - INTERVAL '1 hour'
        GROUP BY batch_size_bucket
    ) AS by_bucket ORDER BY avg_lat ASC LIMIT 1) AS recommended_batch_size
FROM replicate.batch_stats
WHERE recorded_at > NOW() - INTERVAL '1 hour'
GROUP BY batch_size_bucket
ORDER BY frequency DESC;

Network Optimization
TCP Configuration:
# /etc/sysctl.conf - Linux kernel tuning
# Increase TCP buffers for replication
net.core.rmem_max = 134217728  # 128MB
net.core.wmem_max = 134217728  # 128MB
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# Enable TCP keep-alive
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3

# Increase TCP backlog
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096

# Apply changes
sysctl -p

Connection Pooling Optimization:
-- Monitor connection pool efficiency
SELECT
    peer_id,
    total_connections,
    active_connections,
    idle_connections,
    ROUND(100.0 * active_connections / total_connections, 1) AS active_percent,
    ROUND(100.0 * idle_connections / total_connections, 1) AS idle_percent
FROM replicate.connection_stats
ORDER BY active_percent DESC;

-- Tune pool size if consistently high utilization
-- Rule: If active_percent > 80%, increase pool size
SELECT replicate.set_connection_pool_size(
    peer_id => 'eu-west-1',
    pool_size => 20  -- Increase from 15
);

Conflict Resolution in Practice
Detecting Conflicts
Conflict Detection Queries:
-- Show all detected conflicts in last hour
SELECT
    conflict_id,
    table_name,
    row_id,
    master_1_id,
    master_2_id,
    conflict_type,  -- write_conflict, delete_conflict, reference_conflict
    master_1_value,
    master_2_value,
    resolved_value,
    resolution_strategy,
    detected_at
FROM replicate.conflicts
WHERE detected_at > NOW() - INTERVAL '1 hour'
ORDER BY detected_at DESC;

-- Conflict rate by table
SELECT
    table_name,
    COUNT(*) AS conflict_count,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS percent_of_total
FROM replicate.conflicts
WHERE detected_at > NOW() - INTERVAL '1 day'
GROUP BY table_name
ORDER BY conflict_count DESC;

-- Conflict rate trend
SELECT
    DATE_TRUNC('hour', detected_at) AS hour,
    COUNT(*) AS conflicts_per_hour,
    ROUND(AVG(resolution_latency_ms), 1) AS avg_resolution_ms
FROM replicate.conflicts
WHERE detected_at > NOW() - INTERVAL '7 days'
GROUP BY hour
ORDER BY hour DESC;

Resolving Conflicts Manually
Case 1: Last-Write-Wins Not Appropriate
-- View conflicted row
SELECT * FROM replicate.conflicts WHERE conflict_id = 'c-12345';

-- Result:
-- master_1 wrote: price = 100 at 2025-12-30 10:00:00
-- master_2 wrote: price = 150 at 2025-12-30 10:00:01
-- Automatically resolved to: price = 150 (LWW)

-- But business rule: Keep higher price only if > 5% increase
-- Current: (150-100)/100 = 50%, so 150 is correct

-- For another conflict:
-- master_1 wrote: price = 100 at 10:00:00
-- master_2 wrote: price = 2 at 10:00:01 (data entry error)
-- Automatically resolved to: price = 2 (LWW by timestamp)

-- Manual override:
SELECT replicate.override_conflict_resolution(
    conflict_id => 'c-12346',
    resolved_value => '100',
    reason => 'Data entry error on master_2, price=2 is invalid'
);

-- Update actual row
UPDATE products
SET price = 100
WHERE id = 123;

Case 2: CRDT Merge Failed

-- Check CRDT merge result
SELECT * FROM replicate.conflicts WHERE conflict_id = 'c-12347';

-- Result:
-- Type: CRDT_SET merge
-- Master 1: tags = {'electronics', 'active'}
-- Master 2: tags = {'electronics', 'discontinued'}
-- Merged:   tags = {'electronics', 'active', 'discontinued'}

-- Business rule: Item can't be both active and discontinued
-- Need custom resolution:

SELECT replicate.manual_crdt_merge(
    conflict_id => 'c-12347',
    resolved_value => '["electronics", "discontinued"]'::jsonb,
    reason => 'Item no longer active, discontinued takes precedence'
);

-- Verify
SELECT tags FROM products WHERE id = 456;
-- Result: ['electronics', 'discontinued']

Testing Conflict Scenarios
Test Setup: Create Controlled Conflict
-- Create test table
CREATE TABLE conflict_test (
    id INT PRIMARY KEY,
    value INT,
    _master_ts BIGINT,
    _master_id INT
);

-- Insert initial value
INSERT INTO conflict_test (id, value) VALUES (1, 100);

-- Simulate concurrent writes
-- On Master 1:
UPDATE conflict_test SET value = 150 WHERE id = 1;

-- On Master 2 (quickly after):
UPDATE conflict_test SET value = 200 WHERE id = 1;

-- Wait for replication (100-500ms typically)
SELECT pg_sleep(1);

-- Check conflict detection
SELECT * FROM replicate.conflicts
WHERE table_name = 'conflict_test'
ORDER BY detected_at DESC
LIMIT 1;

-- View resolution
SELECT id, value FROM conflict_test;
-- Value should be 200 (from the later write)

Test: Verify Conflict Resolution Strategy
-- Test 1: Last-Write-Wins
-- Expected: Timestamp-based winner

-- Test 2: CRDT Counter
CREATE TABLE crdt_counter_test (
    id INT PRIMARY KEY,
    count CRDT_COUNTER
);

INSERT INTO crdt_counter_test VALUES (1, CRDT_COUNTER(0));

-- Master 1: Increment by 5
SELECT CRDT_INCREMENT(count, 5) FROM crdt_counter_test WHERE id = 1;

-- Master 2: Increment by 3
SELECT CRDT_INCREMENT(count, 3) FROM crdt_counter_test WHERE id = 1;

-- Wait for replication
SELECT pg_sleep(1);

-- Both should see: count = 8
SELECT count FROM crdt_counter_test WHERE id = 1;
-- Expected: 8 on both masters, no conflicts

Chaos Testing: Network Partitions
#!/bin/bash
# Simulate network partition
tc qdisc add dev eth0 root netem loss 100%

# Create conflicts while partitioned
psql -h master2.example.com -c "UPDATE products SET price = 999 WHERE id = 1;"

# Wait
sleep 5

# Heal partition
tc qdisc del dev eth0 root netem

# Monitor conflict resolution
psql -h master1.example.com -c "
  SELECT * FROM replicate.conflicts
  WHERE detected_at > NOW() - INTERVAL '5 minutes';"

Real-World Topologies and Examples
Example 1: Global E-Commerce Platform
Requirements:
- 3 continents (US, EU, Asia)
- Users write to nearest region
- <100ms write latency
- Last-Write-Wins conflict resolution
Topology:
[US East Master]      [EU West Master]      [AP South Master]
      │                     │                     │
      ├─ Read Replica 1     ├─ Read Replica 2     ├─ Read Replica 3
      │                     │                     │
[Clients US]          [Clients EU]          [Clients Asia]

Replication Path:
  US → EU: 85ms
  US → AP: 180ms
  EU → AP: 140ms

Configuration:

[server]
node_id = "us-east-1"

[replication]
mode = "multi_master"
conflict_resolution = "last_write_wins"
timestamp_source = "server_timestamp"
replication_lag_threshold_ms = 500
batch_size = 2000
enable_compression = true

[[replication.masters]]
master_id = "us-east-1"
endpoint = "us.db.company.com:5432"
priority = 10

[[replication.masters]]
master_id = "eu-west-1"
endpoint = "eu.db.company.com:5432"
priority = 5

[[replication.masters]]
master_id = "ap-south-1"
endpoint = "ap.db.company.com:5432"
priority = 5

Operational SQL:

-- Route write based on user region
PREPARE write_product AS
SELECT CASE
    WHEN $1 IN ('US', 'CA', 'MX') THEN 'us-east-1'
    WHEN $1 IN ('GB', 'DE', 'FR') THEN 'eu-west-1'
    WHEN $1 IN ('JP', 'SG', 'IN') THEN 'ap-south-1'
    ELSE 'us-east-1'
END AS target_master;

-- Monitor global replication
SELECT
    master_id,
    region,
    ROUND(AVG(replication_lag_ms), 0) AS avg_lag_ms,
    active_users,
    writes_per_sec
FROM replicate.master_stats
WHERE recorded_at > NOW() - INTERVAL '5 minutes'
GROUP BY master_id, region
ORDER BY master_id;

Example 2: Financial Trading System
Requirements:
- 2 data centers (NYC and Chicago) on a low-latency link
- 99.999% uptime
- Sub-50ms latency
- Custom conflict resolution (prevents duplicate trades)
Topology:
[NYC-1 Master] ◄──────────► [Chicago-1 Master]
      │                            │
  [Traders]                    [Traders]

Trade Order Flow:
1. Submit to nearest master (< 5ms)
2. Commit locally
3. Replicate to peer (< 10ms)
4. Both confirm trade
5. Send execution to exchange

Configuration:

[replication]
mode = "multi_master"
conflict_resolution = "custom"
custom_resolver = "trade_conflict_resolver"
require_quorum_for_writes = true
quorum_size = 2
heartbeat_interval_ms = 50  # Fast failure detection
heartbeat_timeout_ms = 200
election_timeout_min_ms = 500
election_timeout_max_ms = 1000

Custom Conflict Resolver (Rust):
pub struct TradeConflictResolver;

impl ConflictResolver for TradeConflictResolver {
    fn resolve(&self, local: &Value, remote: &Value) -> Value {
        // Extract trade IDs
        let local_trade_id = local.get("trade_id").and_then(|v| v.as_str());
        let remote_trade_id = remote.get("trade_id").and_then(|v| v.as_str());

        // Different trades: both valid, keep both
        if local_trade_id != remote_trade_id {
            // Return both (merge at application layer)
            return local.clone();
        }

        // Same trade from different masters: reject duplicate
        // Keep the one with valid checksum and earlier timestamp
        let local_ts = local.get("timestamp").and_then(|v| v.as_i64()).unwrap_or(i64::MAX);
        let remote_ts = remote.get("timestamp").and_then(|v| v.as_i64()).unwrap_or(i64::MAX);

        if local_ts < remote_ts {
            local.clone()
        } else {
            remote.clone()
        }
    }
}

Operational Queries:

-- Monitor trade execution lag
SELECT
    symbol,
    COUNT(*) AS trades_executed,
    ROUND(AVG(replication_lag_ms), 2) AS avg_replication_lag_ms,
    ROUND(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY execution_latency_ms), 2) AS p99_execution_ms
FROM replicate.trade_execution_stats
WHERE execution_time > NOW() - INTERVAL '1 hour'
GROUP BY symbol
ORDER BY avg_replication_lag_ms DESC
LIMIT 10;

-- Alert on duplicate trade attempts
SELECT
    trade_id,
    symbol,
    COUNT(*) AS submission_count,
    MAX(submitted_at) - MIN(submitted_at) AS time_span_ms
FROM replicate.trade_submissions
WHERE submitted_at > NOW() - INTERVAL '1 minute'
GROUP BY trade_id, symbol
HAVING COUNT(*) > 1
ORDER BY submission_count DESC;

Example 3: Multi-Tenant SaaS Application
Requirements:
- Customers in multiple regions
- GDPR compliance (data residency)
- CRDT-based conflict resolution
- Custom metadata per tenant
Topology:
US Region:
  Master (US-West-2)
  ├─ Tenant: Company-A (US)
  ├─ Tenant: Company-B (US+CA)
  └─ Replicas for read scaling

EU Region:
  Master (EU-Central-1)
  ├─ Tenant: Company-C (EU)
  ├─ Tenant: Company-D (EU+UK)
  └─ Replicas for read scaling

Replication: US ↔ EU only for shared tenants
Data residency: Company-A stays in US only

Configuration:

[replication]
mode = "multi_master"
conflict_resolution = "crdt"
crdt_types = ["set", "counter", "map", "register"]

# Multi-tenancy settings
enable_tenant_isolation = true
default_tenant_master = "auto"  # Route to tenant's primary region

[replication.tenant_config.Company-A]
primary_region = "us-west-2"
allowed_regions = ["us-west-2"]  # Residency: US only
replication_targets = []  # No replication

[replication.tenant_config.Company-B]
primary_region = "us-west-2"
allowed_regions = ["us-west-2", "eu-central-1"]
replication_targets = ["eu-central-1"]

Operational Queries:

-- Verify data residency compliance
SELECT
    tenant_id,
    table_name,
    storage_region,
    row_count,
    data_size_mb,
    CASE
        WHEN storage_region = primary_region THEN 'COMPLIANT'
        ELSE 'VIOLATION'
    END AS residency_status
FROM replicate.tenant_storage_location;

-- Monitor CRDT merge operations
SELECT
    tenant_id,
    operation_type,  -- set_add, counter_increment, map_put
    COUNT(*) AS operation_count,
    ROUND(AVG(merge_latency_ms), 2) AS avg_merge_ms
FROM replicate.crdt_operations
WHERE executed_at > NOW() - INTERVAL '1 day'
GROUP BY tenant_id, operation_type
ORDER BY tenant_id, operation_count DESC;

-- Check cross-region replication for shared tenants
SELECT
    tenant_id,
    source_region,
    target_region,
    bytes_replicated,
    ROUND(bytes_replicated::NUMERIC / 1024 / 1024, 2) AS mb_replicated
FROM replicate.cross_region_replication
WHERE replicated_at > NOW() - INTERVAL '1 hour'
ORDER BY bytes_replicated DESC;

Troubleshooting Replication Issues
Troubleshooting Decision Tree
SYMPTOM: Replication Lag High (> 1 second)
├─ Check network latency
│  └─ Run: SELECT replicate.check_peer_latency('eu-west-1');
│     ├─ If > 200ms: Network issue, escalate to ops team
│     └─ If normal: Proceed to next check
├─ Check replication queue size
│  └─ Run: SELECT replicate.get_queue_depth();
│     ├─ If > 10,000 ops: Master slow to replicate
│     │  └─ Check CPU/Memory/Disk on remote master
│     └─ If normal: Proceed to next check
├─ Check batch size
│  └─ Run: SELECT replicate.get_batch_metrics();
│     ├─ If too small (< 100): Increase batch_size in config
│     ├─ If too large (> 10,000): Network issue or slow master
│     └─ Otherwise: Normal
└─ Check remote master health
   └─ Restart remote master or escalate

SYMPTOM: Conflicts Not Resolving
├─ Verify conflict resolution strategy
│  └─ Run: SELECT replicate.get_conflict_strategy();
│     ├─ If LWW: Check if timestamps are synchronized
│     ├─ If CRDT: Verify CRDT type support for table
│     └─ If custom: Check if resolver function exists
├─ Check conflict log
│  └─ Run: SELECT * FROM replicate.conflicts LIMIT 10;
│     ├─ If empty: No conflicts detected
│     └─ If rows exist: Check resolution column
└─ Test resolution manually
   └─ See "Testing Conflict Scenarios" section above

SYMPTOM: Master Unreachable / Failed
├─ Check network connectivity
│  └─ ping master2.example.com
│     ├─ If no response: Network down
│     └─ If responds: Process issue
├─ Check master process
│  └─ systemctl status heliosdb
│     ├─ If inactive: Start service (systemctl start heliosdb)
│     └─ If active: Check logs (journalctl -u heliosdb -n 100)
├─ Check master port
│  └─ telnet master2.example.com 5432
│     ├─ If refused: Port blocked or not listening
│     └─ If connected: Protocol issue
└─ Failover if permanent
   └─ Follow "Remove Node" procedure and promote replica

Common Issues and Solutions
Issue 1: “Replication lag continuously increases”
-- Diagnose
SELECT
    lag_trend,
    lag_ms,
    operations_pending,
    network_bandwidth_mbps,
    disk_write_iops
FROM replicate.replication_diagnostics;

-- Solutions:
-- 1. Reduce write rate (application-level throttling)
-- 2. Increase batch_size in configuration
-- 3. Enable compression
-- 4. Add more disk I/O capacity
-- 5. Check for slow queries on remote master

-- Example fix:
ALTER SYSTEM SET replication.batch_size = 5000;
SELECT pg_reload_conf();

Issue 2: “Conflicts spike after network partition”

-- Investigate
SELECT * FROM replicate.conflicts
WHERE detected_at > NOW() - INTERVAL '10 minutes'
ORDER BY conflict_count DESC;

-- Expected: Conflicts should resolve within seconds
SELECT COUNT(*) AS unresolved_conflicts
FROM replicate.conflicts
WHERE resolution_strategy = 'pending';

-- If unresolved conflicts > 100:
-- 1. Check custom resolver is working
-- 2. Manual intervention might be needed

-- Manual resolution:
SELECT replicate.resolve_pending_conflicts(
    strategy => 'last_write_wins',
    timeout_sec => 30
);

Issue 3: “Master won’t promote to writable”

-- Check promotion blockers
SELECT * FROM replicate.promotion_blockers
WHERE master_id = 'ap-south-1';

-- Common blockers:
-- - Replication lag too high: Wait for catch-up
-- - Topology invalid: Verify other masters healthy
-- - Missing quorum: Need majority of masters healthy

-- Force promotion (use carefully):
SELECT replicate.force_promote(
    master_id => 'ap-south-1',
    override_validations => false
);

High-Availability Patterns
Pattern 1: Active-Active with Automatic Failover
Setup: 3 masters, quorum-based writes
[replication]
mode = "multi_master"
require_quorum_for_writes = true
quorum_size = 2
enable_automatic_failover = true
failover_timeout_sec = 10

RPO (Recovery Point Objective): 0 (no data loss)
RTO (Recovery Time Objective): 2-5 seconds
Availability: 99.95%
Validation:
-- Verify quorum requirements
SELECT
    master_count,
    min_quorum,
    healthy_count,
    CASE
        WHEN healthy_count >= min_quorum THEN 'OK'
        ELSE 'QUORUM LOST'
    END AS status
FROM replicate.cluster_status;

-- Simulate failure
-- (Kill one master process)
-- Monitor automatic failover
SELECT * FROM replicate.failover_history
ORDER BY failover_time DESC
LIMIT 1;

Pattern 2: Read-Heavy with Single Writer
Setup: 1 writable master, 3+ read replicas
┌───────────────────────────────┐
│ Writable Master               │
│ All writes accepted here      │
└───────────────────────────────┘
        │
        ├─ Async Replication ──► Replica 1 (Read-only)
        ├─ Async Replication ──► Replica 2 (Read-only)
        └─ Async Replication ──► Replica 3 (Read-only)

Benefits:
- No conflict resolution needed
- Excellent read scaling
- Simple topology

Trade-offs:
- Master is single point of failure
- Failover requires promotion

Configuration:

[replication]
mode = "single_master_with_replicas"
primary_master = "us-east-1"
replica_addresses = [
    "us-east-1-rep1:5432",
    "us-east-1-rep2:5432",
    "us-east-1-rep3:5432"
]
auto_failover = true
min_replicas_for_failover = 1

Pattern 3: Geographic Read-Write Locality
Setup: Write to nearest master, read from nearest replica
Client in US:
  1. Write to US Master (5ms)
  2. Read from US Master (2ms)
  3. Background sync to EU/AP

Client in EU:
  1. Write to EU Master (5ms)
  2. Read from EU Master (2ms)
  3. Background sync to US/AP

Routing Logic:

-- Automatic routing based on client location
CREATE OR REPLACE FUNCTION get_master_for_region(region_code VARCHAR)
RETURNS VARCHAR AS $$
BEGIN
    RETURN CASE region_code
        WHEN 'US' THEN 'us-east-1'
        WHEN 'EU' THEN 'eu-west-1'
        WHEN 'AP' THEN 'ap-south-1'
        ELSE 'us-east-1'
    END;
END;
$$ LANGUAGE plpgsql;

-- Application layer
-- SET application_name = get_master_for_region('US');
-- Then connects to corresponding master

Pattern 4: Disaster Recovery (Cold Standby)
Setup: Active master + warm standby in different region
┌──────────────────────┐
│ Active Master        │
│ Accepts all writes   │
│ Region: US           │
└──────────────────────┘
          │ Async Replication
          │ (Low priority)
          ▼
┌──────────────────────┐
│ Warm Standby         │
│ Read-only mode       │
│ Region: EU           │
│ RTO: 5-10 minutes    │
└──────────────────────┘

Activation Procedure:
#!/bin/bash
# Activate standby master (manual failover)

# Step 1: Verify connectivity to EU standby
ping eu-master.example.com

# Step 2: Stop replication (isolate)
psql -h eu-master.example.com -c "SELECT replicate.pause_replication();"

# Step 3: Promote to writable
psql -h eu-master.example.com -c "SELECT replicate.promote_to_master();"

# Step 4: Point applications to EU master
# (Update connection strings, DNS, etc.)

# Step 5: Verify writes working
psql -h eu-master.example.com -c "INSERT INTO test VALUES (1);"

# Step 6: Failback when original master recovered
# (Reverse steps above)

Conclusion
Multi-master replication gives HeliosDB deployments highly available writes and true geographic distribution. By following the operational procedures and best practices outlined in this guide, you can:
- Deploy and manage multi-master clusters confidently
- Handle failures gracefully with automatic failover
- Optimize performance for your workload
- Detect and resolve conflicts systematically
- Maintain 99.999% write availability
Key Takeaways
- Start simple: Single-master with replicas before multi-master
- Monitor constantly: Replication lag, conflicts, and health
- Test thoroughly: Especially conflict scenarios and failover
- Plan for failure: Network partitions and master crashes
- Document your topology: Keep clear records of master assignments
Additional Resources
- Architecture Documentation: /home/app/HeliosDB/docs/architecture/12-replication-architecture.md
- API Reference: HeliosDB SQL Function Documentation
- Troubleshooting Tools: heliosdb-cli --help for diagnostic commands
- Support: HeliosDB Support Team
Document Maintained By: HeliosDB Operations Team Last Review: 2025-12-30 Next Review: 2026-03-30