# HeliosDB Multi-Region Deployment
Comprehensive multi-region deployment system for HeliosDB v3.0 with cross-datacenter replication, conflict resolution, and global transaction coordination.
## Features
- Cross-Datacenter Replication: Asynchronous WAL streaming between regions
- Conflict Resolution: Multiple strategies (LWW, FWW, Custom)
- Global Transactions: Two-Phase Commit (2PC) protocol across regions
- Region-Aware Routing: Latency-based, load-based, and policy-based query routing
- Health Monitoring: Continuous health checks across all regions
- Split-Brain Prevention: Quorum-based decision making
- Network Partition Detection: Automatic detection and handling
- Dynamic Topology: Add/remove regions at runtime
- Compression & Encryption: Optional for WAL streaming
## Quick Start
Add this to your `Cargo.toml`:

```toml
[dependencies]
heliosdb-multiregion = "3.0.0"
```

### Basic Setup
```rust
use heliosdb_multiregion::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define regions
    let regions = vec![
        RegionConfig {
            region_id: "us-east-1".to_string(),
            datacenter: "Virginia".to_string(),
            nodes: vec!["10.0.1.1:5432".to_string()],
            is_primary: true,
        },
        RegionConfig {
            region_id: "eu-west-1".to_string(),
            datacenter: "Ireland".to_string(),
            nodes: vec!["10.0.2.1:5432".to_string()],
            is_primary: false,
        },
    ];

    // Create cluster
    let cluster = MultiRegionCluster::new(regions).await?;

    // Execute global transaction
    let mut txn = cluster.begin_global_transaction().await?;
    txn.add_operation(Operation::Write {
        key: "user:1".to_string(),
        value: b"John Doe".to_vec(),
    });
    cluster.commit_global(txn).await?;

    // Route query
    let target = cluster
        .route_query("SELECT * FROM users", Some("eu-west-1"))
        .await?;
    println!("Query routed to: {}", target);

    cluster.shutdown().await?;
    Ok(())
}
```

## Configuration
### Replication Configuration
```rust
let config = ReplicationConfig {
    mode: ReplicationMode::ActiveActive,
    conflict_resolution: ConflictStrategy::LastWriteWins,
    consistency_level: ConsistencyLevel::Quorum,
    compression: true,
    encryption: true,
    max_lag_ms: 5000,
};

let cluster = MultiRegionCluster::new_with_config(regions, config).await?;
```

### Replication Modes
- ActiveActive: All regions accept writes
- ActivePassive: One primary, others read-only
### Conflict Resolution Strategies
- LastWriteWins: Use timestamp to determine winner (default)
- FirstWriteWins: Keep the earliest write
- Custom: Implement your own `ConflictResolver` trait
### Consistency Levels
- Eventual: Return immediately, data propagates async
- Quorum: Wait for majority of regions
- Strong: Wait for all regions to confirm
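For intuition, the number of per-write region acknowledgements each level implies can be sketched as follows. This is a stand-alone illustration: `ConsistencyLevel` and `required_acks` here are local stand-ins, not the crate's types.

```rust
/// Illustrative stand-in for the crate's consistency levels.
#[derive(Clone, Copy)]
enum ConsistencyLevel {
    Eventual,
    Quorum,
    Strong,
}

/// Region acknowledgements a write must collect before the client
/// sees success, for a cluster of `regions` regions.
fn required_acks(level: ConsistencyLevel, regions: usize) -> usize {
    match level {
        ConsistencyLevel::Eventual => 1,              // local region only
        ConsistencyLevel::Quorum => regions / 2 + 1,  // strict majority
        ConsistencyLevel::Strong => regions,          // every region
    }
}

fn main() {
    // With 3 regions: Eventual = 1, Quorum = 2, Strong = 3.
    for (level, name) in [
        (ConsistencyLevel::Eventual, "Eventual"),
        (ConsistencyLevel::Quorum, "Quorum"),
        (ConsistencyLevel::Strong, "Strong"),
    ] {
        println!("{}: {} ack(s)", name, required_acks(level, 3));
    }
}
```

This is why `Quorum` tolerates the failure of a minority of regions while still preventing two disjoint groups from both acknowledging writes.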
## Core Operations
### Region Management
```rust
// Add region
let new_region = RegionConfig { /* ... */ };
cluster.add_region(new_region).await?;

// Remove region
cluster.remove_region("region-id").await?;

// Promote region to primary
cluster.promote_region("eu-west-1").await?;

// Demote region to secondary
cluster.demote_region("us-east-1").await?;
```

### Global Transactions
```rust
// Begin transaction
let mut txn = cluster.begin_global_transaction().await?;

// Add operations
txn.add_operation(Operation::Write {
    key: "key1".to_string(),
    value: b"value1".to_vec(),
});

txn.add_operation(Operation::Delete {
    key: "key2".to_string(),
});

// Commit (uses 2PC protocol)
cluster.commit_global(txn).await?;
```

### Query Routing
```rust
// Route to user's local region
let target = cluster.route_query(query, Some("us-east-1")).await?;

// Route based on latency (default)
let target = cluster.route_query(query, None).await?;
```

### Health Monitoring
```rust
// Get status for specific region
let status = cluster.get_region_status("us-east-1").await?;
println!("Health: {}", status.is_healthy);
println!("Lag: {}ms", status.lag_ms);
println!("Pending: {}", status.pending_ops);

// Get status for all regions
let statuses = cluster.get_all_region_status().await?;
```

## Conflict Resolution
### Last-Write-Wins (LWW)
Last-Write-Wins orders concurrent writes with a Hybrid Logical Clock (HLC), which preserves causality even when regional wall clocks drift.
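As a rough illustration of how an HLC-based LWW comparison works — `Hlc` and `resolve_lww` below are invented for this sketch, not heliosdb_multiregion's internal representation:

```rust
/// Minimal Hybrid Logical Clock sketch (illustrative stand-in).
/// The derived `Ord` compares fields top to bottom, which is exactly
/// the (physical, logical, node) precedence an HLC needs.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Hlc {
    wall_ms: u64, // physical component: milliseconds since epoch
    logical: u32, // logical counter breaks ties within one millisecond
    node_id: u16, // final tiebreaker, making the order total
}

/// Last-write-wins: keep whichever value carries the larger HLC.
fn resolve_lww<'a>(local: (&'a [u8], Hlc), remote: (&'a [u8], Hlc)) -> &'a [u8] {
    if remote.1 > local.1 { remote.0 } else { local.0 }
}

fn main() {
    let a = Hlc { wall_ms: 1_000, logical: 0, node_id: 1 };
    let b = Hlc { wall_ms: 1_000, logical: 1, node_id: 2 }; // causally later
    let winner = resolve_lww((&b"local"[..], a), (&b"remote"[..], b));
    println!("{}", String::from_utf8_lossy(winner)); // prints "remote"
}
```

The logical counter is what lets two writes in the same millisecond still be ordered consistently across regions.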
Select the strategy in the replication config:

```rust
let config = ReplicationConfig {
    conflict_resolution: ConflictStrategy::LastWriteWins,
    // ...
};
```

### Custom Resolver
Implement your own conflict resolution logic:
```rust
use async_trait::async_trait;
use heliosdb_multiregion::*;

struct MyResolver;

#[async_trait]
impl ConflictResolver for MyResolver {
    async fn resolve(
        &self,
        local: &VersionedRow,
        remote: &VersionedRow,
    ) -> Result<VersionedRow> {
        // Your custom logic here
        Ok(local.clone())
    }
}
```

```rust
// Register custom resolver
let config = ReplicationConfig {
    conflict_resolution: ConflictStrategy::Custom("my_resolver".to_string()),
    // ...
};
```

## Architecture Components
### Topology Manager
- Manages cluster topology
- Tracks region membership and roles
- Handles region promotion/demotion
### Replication Engine
- WAL streaming between regions
- Compression and encryption support
- Conflict detection and resolution
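To illustrate why LSN-ordered WAL streaming makes replay safe after a reconnect, here is a hedged sketch; the `Receiver` type is invented for this example and is not part of the crate:

```rust
use std::collections::HashMap;

/// Sketch of a WAL-stream receiver: records are applied in LSN order,
/// and anything at or below the applied watermark is ignored, so a
/// resumed stream can replay records without double-applying them.
struct Receiver {
    applied_lsn: u64,
    store: HashMap<String, Vec<u8>>,
}

impl Receiver {
    /// Returns true if the record was applied, false if skipped.
    fn apply(&mut self, lsn: u64, key: &str, value: &[u8]) -> bool {
        if lsn <= self.applied_lsn {
            return false; // duplicate from a resumed stream
        }
        self.store.insert(key.to_string(), value.to_vec());
        self.applied_lsn = lsn;
        true
    }
}

fn main() {
    let mut rx = Receiver { applied_lsn: 0, store: HashMap::new() };
    assert!(rx.apply(1, "user:1", b"John"));
    assert!(!rx.apply(1, "user:1", b"John")); // replayed record skipped
    println!("applied up to LSN {}", rx.applied_lsn); // prints "applied up to LSN 1"
}
```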
### Global Coordinator
- Two-Phase Commit (2PC) protocol
- Transaction log management
- Participant coordination
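The commit decision at the heart of 2PC fits in a few lines; `decide` below is an illustrative stand-in, not the crate's coordinator API:

```rust
#[derive(PartialEq, Debug)]
enum Outcome {
    Commit,
    Abort,
}

/// Phase 1 gathers one prepare vote per participating region.
/// Phase 2 commits only if there is at least one participant and
/// every participant voted yes; any "no" (or missing) vote aborts.
fn decide(votes: &[bool]) -> Outcome {
    if !votes.is_empty() && votes.iter().all(|&v| v) {
        Outcome::Commit
    } else {
        Outcome::Abort
    }
}

fn main() {
    println!("{:?}", decide(&[true, true, true]));  // prints "Commit"
    println!("{:?}", decide(&[true, false, true])); // prints "Abort"
}
```

The all-or-nothing vote is what gives global transactions atomicity across regions at the cost of an extra round trip.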
### Query Router
- Latency-based routing
- Load-based routing
- Policy-based routing
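Latency-based routing reduces to picking the minimum over recent probes. A stand-alone sketch (`route_by_latency` is invented for illustration, not a crate function):

```rust
/// Pick the region with the lowest measured round-trip time.
/// Input is (region_id, latency_ms) pairs from recent health probes.
fn route_by_latency<'a>(latencies_ms: &'a [(&'a str, u32)]) -> Option<&'a str> {
    latencies_ms
        .iter()
        .min_by_key(|&&(_, ms)| ms)
        .map(|&(id, _)| id)
}

fn main() {
    let probes = [("us-east-1", 12), ("eu-west-1", 88), ("ap-south-1", 190)];
    println!("{:?}", route_by_latency(&probes)); // prints "Some(\"us-east-1\")"
}
```

Load-based routing would swap the latency key for a load metric; policy-based routing filters the candidate list before this selection step.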
### Health Monitor
- Periodic health checks
- Uptime tracking
- Failure detection
### Partition Handler
- Network partition detection
- Split-brain prevention
- Quorum management
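Split-brain prevention boils down to a strict-majority check: a side of a partition may keep accepting writes only if it can see more than half of the configured regions. A minimal sketch (not the crate's API):

```rust
/// True if `reachable` regions out of `total` form a strict majority.
/// Strict majority guarantees at most one side of any partition can
/// continue accepting writes, preventing split-brain.
fn has_quorum(reachable: usize, total: usize) -> bool {
    reachable * 2 > total
}

fn main() {
    // 3-region cluster split 2 / 1: the pair keeps quorum, the loner stops.
    println!("{}", has_quorum(2, 3)); // prints "true"
    println!("{}", has_quorum(1, 3)); // prints "false"
}
```

Note that an even split of a 4-region cluster (2 / 2) leaves neither side with quorum, which is why odd region counts are the common recommendation.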
## Deployment Guide
### 3-Region Setup
```rust
let regions = vec![
    RegionConfig {
        region_id: "us-east-1".to_string(),
        datacenter: "Virginia".to_string(),
        nodes: vec!["10.0.1.1:5432".to_string(), "10.0.1.2:5432".to_string()],
        is_primary: true,
    },
    RegionConfig {
        region_id: "eu-west-1".to_string(),
        datacenter: "Ireland".to_string(),
        nodes: vec!["10.0.2.1:5432".to_string(), "10.0.2.2:5432".to_string()],
        is_primary: false,
    },
    RegionConfig {
        region_id: "ap-south-1".to_string(),
        datacenter: "Mumbai".to_string(),
        nodes: vec!["10.0.3.1:5432".to_string(), "10.0.3.2:5432".to_string()],
        is_primary: false,
    },
];

let config = ReplicationConfig {
    mode: ReplicationMode::ActiveActive,
    conflict_resolution: ConflictStrategy::LastWriteWins,
    consistency_level: ConsistencyLevel::Quorum,
    compression: true,
    encryption: true,
    max_lag_ms: 5000,
};

let cluster = MultiRegionCluster::new_with_config(regions, config).await?;
```

### Region Failover
```rust
// Detect primary failure
let primary_status = cluster.get_region_status("us-east-1").await?;
if !primary_status.is_healthy {
    // Promote a healthy secondary
    cluster.promote_region("eu-west-1").await?;
    println!("Failover complete: eu-west-1 is now primary");
}
```

## Performance Considerations
### Replication Lag
Monitor replication lag across regions:
```rust
for status in cluster.get_all_region_status().await? {
    if status.lag_ms > 5000 {
        println!("Warning: High lag in {}: {}ms", status.region_id, status.lag_ms);
    }
}
```

### Compression
Enable compression for reduced bandwidth:
```rust
let config = ReplicationConfig {
    compression: true, // Uses LZ4
    // ...
};
```

### Consistency vs Performance
- Eventual: Best performance, eventual consistency
- Quorum: Balanced performance and consistency
- Strong: Strongest consistency, higher latency
## Testing
Run unit tests:
```sh
cargo test --lib
```

Run integration tests:

```sh
cargo test --test integration_test
```

Run a specific test:

```sh
cargo test test_multi_region_cluster_setup
```

## Examples
See `examples/multiregion_setup.rs` for a complete example:

```sh
cargo run --example multiregion_setup
```

## API Reference
See the crate's generated API documentation (`cargo doc --open`) for full details.
## Best Practices
- Use Quorum Consistency: Balance between performance and consistency
- Monitor Replication Lag: Set alerts for high lag
- Enable Compression: Reduce bandwidth usage
- Plan Failover Strategy: Document and test failover procedures
- Use Local Routing: Route queries to user’s nearest region
- Regular Health Checks: Monitor region health continuously
- Test Network Partitions: Simulate and test partition scenarios
## Troubleshooting
### High Replication Lag
```rust
// Check lag for all regions
for status in cluster.get_all_region_status().await? {
    println!("{}: {}ms lag", status.region_id, status.lag_ms);
}

// Adjust max lag threshold
let mut config = ReplicationConfig::default();
config.max_lag_ms = 10000; // Increase to 10 seconds
```

### Network Partition
```rust
// Check partition status
let partition_status = partition_handler.get_partition_status().await?;
if partition_status.has_partition {
    println!("Partition detected!");
    println!("Reachable: {:?}", partition_status.reachable_regions);
    println!("Unreachable: {:?}", partition_status.unreachable_regions);
}
```

### Transaction Failures
```rust
// Transactions have a built-in timeout
let mut txn = cluster.begin_global_transaction().await?;
txn.timeout_secs = 60; // Increase timeout to 60 seconds

// Add operations and commit
// ...
```

## License
MIT OR Apache-2.0
## Contributing
Contributions welcome! See `CONTRIBUTING.md` for guidelines.