
HeliosDB Multi-Region Deployment

Comprehensive multi-region deployment system for HeliosDB v3.0 with cross-datacenter replication, conflict resolution, and global transaction coordination.

Features

  • Cross-Datacenter Replication: Asynchronous WAL streaming between regions
  • Conflict Resolution: Multiple strategies (LWW, FWW, Custom)
  • Global Transactions: Two-Phase Commit (2PC) protocol across regions
  • Region-Aware Routing: Latency-based, load-based, and policy-based query routing
  • Health Monitoring: Continuous health checks across all regions
  • Split-Brain Prevention: Quorum-based decision making
  • Network Partition Detection: Automatic detection and handling
  • Dynamic Topology: Add/remove regions at runtime
  • Compression & Encryption: Optional for WAL streaming

Quick Start

Add to your Cargo.toml:

[dependencies]
heliosdb-multiregion = "3.0.0"

Basic Setup

use heliosdb_multiregion::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define regions
    let regions = vec![
        RegionConfig {
            region_id: "us-east-1".to_string(),
            datacenter: "Virginia".to_string(),
            nodes: vec!["10.0.1.1:5432".to_string()],
            is_primary: true,
        },
        RegionConfig {
            region_id: "eu-west-1".to_string(),
            datacenter: "Ireland".to_string(),
            nodes: vec!["10.0.2.1:5432".to_string()],
            is_primary: false,
        },
    ];

    // Create cluster
    let cluster = MultiRegionCluster::new(regions).await?;

    // Execute a global transaction
    let mut txn = cluster.begin_global_transaction().await?;
    txn.add_operation(Operation::Write {
        key: "user:1".to_string(),
        value: b"John Doe".to_vec(),
    });
    cluster.commit_global(txn).await?;

    // Route a query
    let target = cluster.route_query("SELECT * FROM users", Some("eu-west-1")).await?;
    println!("Query routed to: {}", target);

    cluster.shutdown().await?;
    Ok(())
}

Configuration

Replication Configuration

let config = ReplicationConfig {
    mode: ReplicationMode::ActiveActive,
    conflict_resolution: ConflictStrategy::LastWriteWins,
    consistency_level: ConsistencyLevel::Quorum,
    compression: true,
    encryption: true,
    max_lag_ms: 5000,
};
let cluster = MultiRegionCluster::new_with_config(regions, config).await?;

Replication Modes

  • ActiveActive: All regions accept writes
  • ActivePassive: One primary, others read-only
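For an active-passive deployment, a configuration along these lines routes all writes through the primary while secondaries serve reads (a sketch: the `ActivePassive` variant name is inferred from the mode list above, and the remaining fields mirror the `ReplicationConfig` shown earlier):

```rust
// Active-passive: only the primary region accepts writes;
// secondaries serve reads and stay warm for failover.
let config = ReplicationConfig {
    mode: ReplicationMode::ActivePassive,
    conflict_resolution: ConflictStrategy::LastWriteWins, // rarely triggered here
    consistency_level: ConsistencyLevel::Quorum,
    compression: true,
    encryption: true,
    max_lag_ms: 5000,
};
```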

Conflict Resolution Strategies

  • LastWriteWins: The write with the latest timestamp wins (default)
  • FirstWriteWins: Keep the earliest write
  • Custom: Implement your own ConflictResolver trait

Consistency Levels

  • Eventual: Return immediately; data propagates asynchronously
  • Quorum: Wait for majority of regions
  • Strong: Wait for all regions to confirm

Core Operations

Region Management

// Add region
let new_region = RegionConfig { /* ... */ };
cluster.add_region(new_region).await?;
// Remove region
cluster.remove_region("region-id").await?;
// Promote region to primary
cluster.promote_region("eu-west-1").await?;
// Demote region to secondary
cluster.demote_region("us-east-1").await?;

Global Transactions

// Begin transaction
let mut txn = cluster.begin_global_transaction().await?;
// Add operations
txn.add_operation(Operation::Write {
    key: "key1".to_string(),
    value: b"value1".to_vec(),
});
txn.add_operation(Operation::Delete {
    key: "key2".to_string(),
});
// Commit (uses 2PC protocol)
cluster.commit_global(txn).await?;

Query Routing

// Route to user's local region
let target = cluster.route_query(query, Some("us-east-1")).await?;
// Route based on latency (default)
let target = cluster.route_query(query, None).await?;

Health Monitoring

// Get status for specific region
let status = cluster.get_region_status("us-east-1").await?;
println!("Health: {}", status.is_healthy);
println!("Lag: {}ms", status.lag_ms);
println!("Pending: {}", status.pending_ops);
// Get status for all regions
let statuses = cluster.get_all_region_status().await?;

Conflict Resolution

Last-Write-Wins (LWW)

Uses Hybrid Logical Clock (HLC) for causality:

let config = ReplicationConfig {
    conflict_resolution: ConflictStrategy::LastWriteWins,
    // ...
};
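LWW needs timestamps that respect causality across regions, and wall clocks alone can drift or run backwards. A hybrid logical clock pairs physical time with a logical counter so that a write that causally follows another always carries a larger timestamp. The type below is an illustrative, standalone sketch of the idea, not HeliosDB's internal implementation:

```rust
use std::cmp::max;

/// Minimal hybrid logical clock: (physical_ms, logical counter).
/// Timestamps are totally ordered, so two conflicting writes
/// always have a deterministic last-write-wins winner.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Hlc {
    physical_ms: u64,
    logical: u32,
}

impl Hlc {
    fn new() -> Self {
        Hlc { physical_ms: 0, logical: 0 }
    }

    /// Tick on a local event, given the current wall-clock reading.
    fn tick(&mut self, now_ms: u64) -> Hlc {
        if now_ms > self.physical_ms {
            self.physical_ms = now_ms;
            self.logical = 0;
        } else {
            self.logical += 1;
        }
        *self
    }

    /// Merge a timestamp received from a remote region, preserving
    /// causality: the result exceeds both the local clock and the stamp.
    fn receive(&mut self, remote: Hlc, now_ms: u64) -> Hlc {
        let phys = max(now_ms, max(self.physical_ms, remote.physical_ms));
        self.logical = if phys == self.physical_ms && phys == remote.physical_ms {
            max(self.logical, remote.logical) + 1
        } else if phys == self.physical_ms {
            self.logical + 1
        } else if phys == remote.physical_ms {
            remote.logical + 1
        } else {
            0
        };
        self.physical_ms = phys;
        *self
    }
}

fn main() {
    let mut us_east = Hlc::new();
    let mut eu_west = Hlc::new();

    let w1 = us_east.tick(100); // write in us-east at t=100ms
    // eu-west's wall clock lags (t=90ms) but its write still orders after w1:
    let w2 = eu_west.receive(w1, 90);
    assert!(w2 > w1); // LWW picks w2 even though eu-west's clock is behind
    println!("w1={:?} w2={:?}", w1, w2);
}
```

The key property is in `main`: even when the receiving region's wall clock is behind, the merged timestamp still orders after the incoming write, so LWW never discards a causally newer update.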

Custom Resolver

Implement your own conflict resolution logic:

use async_trait::async_trait;
use heliosdb_multiregion::*;

struct MyResolver;

#[async_trait]
impl ConflictResolver for MyResolver {
    async fn resolve(&self, local: &VersionedRow, remote: &VersionedRow) -> Result<VersionedRow> {
        // Your custom logic here
        Ok(local.clone())
    }
}

// Register the custom resolver
let config = ReplicationConfig {
    conflict_resolution: ConflictStrategy::Custom("my_resolver".to_string()),
    // ...
};

Architecture Components

Topology Manager

  • Manages cluster topology
  • Tracks region membership and roles
  • Handles region promotion/demotion

Replication Engine

  • WAL streaming between regions
  • Compression and encryption support
  • Conflict detection and resolution

Global Coordinator

  • Two-Phase Commit (2PC) protocol
  • Transaction log management
  • Participant coordination
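The commit flow follows classic two-phase commit: the coordinator asks every participant region to prepare, then commits only on a unanimous "yes" vote. A self-contained sketch of that decision rule (illustrative only, not HeliosDB's actual coordinator):

```rust
#[derive(Debug, PartialEq)]
enum Vote { Yes, No }

#[derive(Debug, PartialEq)]
enum Outcome { Commit, Abort }

/// Phase 1 collects a prepare vote from every participant region;
/// phase 2 commits only if all voted yes, otherwise aborts everywhere.
fn decide(votes: &[Vote]) -> Outcome {
    if !votes.is_empty() && votes.iter().all(|v| *v == Vote::Yes) {
        Outcome::Commit
    } else {
        Outcome::Abort
    }
}

fn main() {
    // All three regions prepared successfully -> commit.
    assert_eq!(decide(&[Vote::Yes, Vote::Yes, Vote::Yes]), Outcome::Commit);
    // One region failed to prepare -> the whole global txn aborts.
    assert_eq!(decide(&[Vote::Yes, Vote::No, Vote::Yes]), Outcome::Abort);
    println!("2PC decision logic ok");
}
```

This all-or-nothing rule is what makes a global transaction atomic: a single unhealthy region vetoes the commit rather than leaving regions divergent.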

Query Router

  • Latency-based routing
  • Load-based routing
  • Policy-based routing

Health Monitor

  • Periodic health checks
  • Uptime tracking
  • Failure detection

Partition Handler

  • Network partition detection
  • Split-brain prevention
  • Quorum management
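Split-brain prevention reduces to a quorum rule: a side of a partition may keep accepting writes only if it can still reach a strict majority of the configured regions. A minimal sketch of that check (illustrative, not the crate's API):

```rust
/// A side of a partition holds quorum iff it reaches a strict
/// majority of the full region set: reachable >= total/2 + 1.
fn has_quorum(reachable: usize, total: usize) -> bool {
    total > 0 && reachable >= total / 2 + 1
}

fn main() {
    // 3-region cluster: the side seeing 2 regions keeps quorum,
    // while the isolated region (seeing only itself) must go read-only.
    assert!(has_quorum(2, 3));
    assert!(!has_quorum(1, 3));
    // Even splits never hold quorum, so at most one side can ever win.
    assert!(!has_quorum(2, 4));
    println!("quorum checks ok");
}
```

Because at most one side of any partition can hold a strict majority, two sides can never both accept writes, which is exactly the split-brain guarantee.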

Deployment Guide

3-Region Setup

let regions = vec![
    RegionConfig {
        region_id: "us-east-1".to_string(),
        datacenter: "Virginia".to_string(),
        nodes: vec!["10.0.1.1:5432".to_string(), "10.0.1.2:5432".to_string()],
        is_primary: true,
    },
    RegionConfig {
        region_id: "eu-west-1".to_string(),
        datacenter: "Ireland".to_string(),
        nodes: vec!["10.0.2.1:5432".to_string(), "10.0.2.2:5432".to_string()],
        is_primary: false,
    },
    RegionConfig {
        region_id: "ap-south-1".to_string(),
        datacenter: "Mumbai".to_string(),
        nodes: vec!["10.0.3.1:5432".to_string(), "10.0.3.2:5432".to_string()],
        is_primary: false,
    },
];

let config = ReplicationConfig {
    mode: ReplicationMode::ActiveActive,
    conflict_resolution: ConflictStrategy::LastWriteWins,
    consistency_level: ConsistencyLevel::Quorum,
    compression: true,
    encryption: true,
    max_lag_ms: 5000,
};

let cluster = MultiRegionCluster::new_with_config(regions, config).await?;

Region Failover

// Detect primary failure
let primary_status = cluster.get_region_status("us-east-1").await?;
if !primary_status.is_healthy {
    // Promote a healthy secondary
    cluster.promote_region("eu-west-1").await?;
    println!("Failover complete: eu-west-1 is now primary");
}

Performance Considerations

Replication Lag

Monitor replication lag across regions:

for status in cluster.get_all_region_status().await? {
    if status.lag_ms > 5000 {
        println!("Warning: High lag in {}: {}ms", status.region_id, status.lag_ms);
    }
}

Compression

Enable compression for reduced bandwidth:

let config = ReplicationConfig {
    compression: true, // Uses LZ4
    // ...
};

Consistency vs Performance

  • Eventual: Best performance, eventual consistency
  • Quorum: Balanced performance and consistency
  • Strong: Strongest consistency, higher latency

Testing

Run unit tests:

cargo test --lib

Run integration tests:

cargo test --test integration_test

Run specific test:

cargo test test_multi_region_cluster_setup

Examples

See examples/multiregion_setup.rs for a complete example:

cargo run --example multiregion_setup

API Reference

See the generated crate documentation (`cargo doc --open`) for the full API reference.

Best Practices

  1. Use Quorum Consistency: Balance between performance and consistency
  2. Monitor Replication Lag: Set alerts for high lag
  3. Enable Compression: Reduce bandwidth usage
  4. Plan Failover Strategy: Document and test failover procedures
  5. Use Local Routing: Route queries to user’s nearest region
  6. Regular Health Checks: Monitor region health continuously
  7. Test Network Partitions: Simulate and test partition scenarios

Troubleshooting

High Replication Lag

// Check lag for all regions
for status in cluster.get_all_region_status().await? {
    println!("{}: {}ms lag", status.region_id, status.lag_ms);
}

// Raise the max lag threshold
let mut config = ReplicationConfig::default();
config.max_lag_ms = 10000; // Increase to 10 seconds

Network Partition

// Check partition status
let partition_status = partition_handler.get_partition_status().await?;
if partition_status.has_partition {
    println!("Partition detected!");
    println!("Reachable: {:?}", partition_status.reachable_regions);
    println!("Unreachable: {:?}", partition_status.unreachable_regions);
}

Transaction Failures

// Transactions have a built-in timeout
let mut txn = cluster.begin_global_transaction().await?;
txn.timeout_secs = 60; // Increase the timeout to 60 seconds

// Add operations and commit
// ...

License

MIT OR Apache-2.0

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.