Multi-Region Architecture
This document describes the technical architecture and design decisions for the HeliosDB multi-region deployment system.
Overview
The multi-region system enables HeliosDB to operate across multiple geographic regions with:
- Automatic replication
- Conflict resolution
- Global transaction coordination
- Split-brain prevention
- Query routing
Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│                      Multi-Region Cluster                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌──────────────┐     ┌──────────────┐     ┌──────────────┐    │
│   │   Region A   │     │   Region B   │     │   Region C   │    │
│   │  (Primary)   │     │ (Secondary)  │     │ (Secondary)  │    │
│   └──────┬───────┘     └──────┬───────┘     └──────┬───────┘    │
│          │                    │                    │            │
│          └────────────────────┼────────────────────┘            │
│                               │                                 │
│                      ┌────────▼───────┐                         │
│                      │  Replication   │                         │
│                      │     Engine     │                         │
│                      └────────┬───────┘                         │
│                               │                                 │
│          ┌────────────────────┼────────────────────┐            │
│          │                    │                    │            │
│   ┌──────▼───────┐     ┌──────▼───────┐     ┌──────▼───────┐    │
│   │   Conflict   │     │    Global    │     │    Query     │    │
│   │  Resolution  │     │ Coordinator  │     │    Router    │    │
│   └──────────────┘     └──────────────┘     └──────────────┘    │
│                                                                 │
│   ┌──────────────┐     ┌──────────────┐     ┌──────────────┐    │
│   │    Health    │     │  Partition   │     │   Topology   │    │
│   │   Monitor    │     │   Handler    │     │   Manager    │    │
│   └──────────────┘     └──────────────┘     └──────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Core Components
1. Topology Manager
Purpose: Manages cluster topology and region membership.
Key Responsibilities:
- Track all regions and their configurations
- Manage region roles (Primary/Secondary/Standby)
- Handle region promotion/demotion
- Maintain topology versioning
Data Structures:
struct RegionTopology {
    regions: HashMap<String, RegionMetadata>,
    primary_region: Option<String>,
    version: u64,
}
struct RegionMetadata {
    config: RegionConfig,
    role: RegionRole,
    epoch: u64,
    version: u64,
}
Design Decisions:
- Use versioning for topology changes
- Cache region metadata for fast lookups
- Thread-safe with RwLock and DashMap
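The design decisions above can be sketched as follows. This is a minimal illustration, not the actual HeliosDB implementation: it uses only the standard library's `RwLock` (no DashMap), and every mutation bumps the topology `version` so peers can detect stale views; a role change also bumps the region's `epoch` to fence writes from a demoted primary.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Clone, Copy, PartialEq, Debug)]
enum RegionRole { Primary, Secondary, Standby }

struct RegionMetadata { role: RegionRole, epoch: u64 }

struct Topology {
    regions: HashMap<String, RegionMetadata>,
    primary_region: Option<String>,
    version: u64,
}

struct TopologyManager {
    inner: RwLock<Topology>,
}

impl TopologyManager {
    fn new() -> Self {
        TopologyManager { inner: RwLock::new(Topology {
            regions: HashMap::new(), primary_region: None, version: 0 }) }
    }

    fn add_region(&self, id: &str, role: RegionRole) -> u64 {
        let mut t = self.inner.write().unwrap();
        t.regions.insert(id.to_string(), RegionMetadata { role, epoch: 0 });
        if role == RegionRole::Primary { t.primary_region = Some(id.to_string()); }
        t.version += 1; // topology changed
        t.version
    }

    // Promote a secondary to primary, demoting the old primary.
    fn promote(&self, region: &str) -> u64 {
        let mut t = self.inner.write().unwrap();
        if let Some(old) = t.primary_region.take() {
            if let Some(m) = t.regions.get_mut(&old) { m.role = RegionRole::Secondary; }
        }
        if let Some(m) = t.regions.get_mut(region) {
            m.role = RegionRole::Primary;
            m.epoch += 1; // new leadership epoch
        }
        t.primary_region = Some(region.to_string());
        t.version += 1;
        t.version
    }
}
```

Taking the write lock for the whole promotion keeps the role swap and version bump atomic with respect to readers.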
2. Replication Engine
Purpose: Handles WAL streaming and replication between regions.
Key Features:
- Asynchronous replication using WAL streaming
- Compression (LZ4) for bandwidth efficiency
- Encryption (AES) for security
- Lag monitoring and tracking
- Conflict detection
Replication Flow:
1. Write occurs in Region A
2. Create WAL entry
3. Compress entry (if enabled)
4. Encrypt entry (if enabled)
5. Stream to Regions B, C, ...
6. Apply to remote regions
7. Detect conflicts
8. Resolve conflicts
9. Update replication stats
Data Structures:
struct WalEntry {
    id: Uuid,
    sequence: u64,
    timestamp: DateTime<Utc>,
    source_region: String,
    operation: WalOperation,
}
struct ReplicationStream {
    region_id: String,
    wal_queue: VecDeque<WalEntry>,
    sequence: AtomicU64,
    lag_ms: AtomicU64,
}
Performance Optimizations:
- Queue-based streaming (bounded queues)
- Batch operations where possible
- Async processing with Tokio
- LZ4 compression (fast and efficient)
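The bounded-queue behavior can be sketched with a simplified, synchronous stand-in for `ReplicationStream` (no Tokio, no compression; `evicted` reporting is illustrative, not the real overflow policy): entries get monotonically increasing sequence numbers, the queue never exceeds its capacity, and lag is measured as the number of sequences the remote region has not yet acknowledged.

```rust
use std::collections::VecDeque;

struct WalEntry { sequence: u64, payload: Vec<u8> }

// Simplified bounded WAL queue: when full, the oldest entry is evicted so
// memory stays bounded even when a remote region falls behind.
struct ReplicationStream {
    wal_queue: VecDeque<WalEntry>,
    capacity: usize,
    next_sequence: u64,
}

impl ReplicationStream {
    fn new(capacity: usize) -> Self {
        ReplicationStream { wal_queue: VecDeque::new(), capacity, next_sequence: 0 }
    }

    // Enqueue an entry; returns the assigned sequence and whether an old
    // entry had to be evicted to stay within the bound.
    fn enqueue(&mut self, payload: Vec<u8>) -> (u64, bool) {
        let seq = self.next_sequence;
        self.next_sequence += 1;
        let evicted = if self.wal_queue.len() == self.capacity {
            self.wal_queue.pop_front();
            true
        } else {
            false
        };
        self.wal_queue.push_back(WalEntry { sequence: seq, payload });
        (seq, evicted)
    }

    // Lag in entries not yet acknowledged by the remote region.
    fn lag(&self, remote_acked_sequence: u64) -> u64 {
        self.next_sequence.saturating_sub(remote_acked_sequence)
    }
}
```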
3. Conflict Resolution
Purpose: Resolve conflicts when the same data is modified in multiple regions.
Strategies:
Last-Write-Wins (LWW)
- Uses Hybrid Logical Clock (HLC) for causality
- Timestamp-based with node_id tiebreaker
- Simple and deterministic
if remote.timestamp > local.timestamp {
    return remote; // Remote is newer
} else if remote.timestamp == local.timestamp {
    return if remote.node_id > local.node_id { remote } else { local };
} else {
    return local; // Local is newer
}
First-Write-Wins (FWW)
- Keep the earliest write
- Useful for immutable data
Version Vectors
- Track causality per node
- Detect concurrent updates
- Fallback to LWW for concurrent conflicts
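A minimal version-vector comparison sketch (the `Causality` enum and string node IDs are illustrative, not HeliosDB's actual types): two updates are concurrent exactly when neither vector dominates the other, which is the case where the document says the system falls back to LWW.

```rust
use std::collections::HashMap;

#[derive(PartialEq, Debug)]
enum Causality { Before, After, Equal, Concurrent }

// Compare two version vectors; a missing node entry counts as counter 0.
fn compare(a: &HashMap<String, u64>, b: &HashMap<String, u64>) -> Causality {
    let mut a_greater = false;
    let mut b_greater = false;
    for node in a.keys().chain(b.keys()) {
        let va = a.get(node).copied().unwrap_or(0);
        let vb = b.get(node).copied().unwrap_or(0);
        if va > vb { a_greater = true; }
        if vb > va { b_greater = true; }
    }
    match (a_greater, b_greater) {
        (true, true) => Causality::Concurrent, // neither dominates: LWW fallback
        (true, false) => Causality::After,
        (false, true) => Causality::Before,
        (false, false) => Causality::Equal,
    }
}
```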
Conflict Detection:
fn detect_conflict(local: &VersionedRow, remote: &VersionedRow) -> bool {
    local.key == remote.key && local.value != remote.value
}
4. Global Coordinator
Purpose: Coordinate transactions across multiple regions using Two-Phase Commit (2PC).
2PC Protocol:
Phase 1: Prepare
┌────────────┐
│ Coordinator│
└─────┬──────┘
      │
      ├─── PREPARE ──→ Region A ──→ PREPARED
      ├─── PREPARE ──→ Region B ──→ PREPARED
      └─── PREPARE ──→ Region C ──→ PREPARED
Phase 2: Commit
┌────────────┐
│ Coordinator│
└─────┬──────┘
      │
      ├─── COMMIT ──→ Region A ──→ COMMITTED
      ├─── COMMIT ──→ Region B ──→ COMMITTED
      └─── COMMIT ──→ Region C ──→ COMMITTED
Transaction States:
- Preparing: Asking participants if ready
- Prepared: All participants ready
- Committing: Telling participants to commit
- Committed: Transaction complete
- Aborting/Aborted: Transaction cancelled
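The states above form a small state machine. A sketch (illustrative names, not the coordinator's actual types): only the listed transitions are legal, which is what lets a recovering coordinator act safely from the last durable state.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum TxnState { Preparing, Prepared, Committing, Committed, Aborting, Aborted }

// Advance the transaction one step. `all_prepared` reflects the outcome of
// phase 1: if any participant refused, the transaction moves to Aborting.
fn next_state(current: TxnState, all_prepared: bool) -> TxnState {
    match current {
        TxnState::Preparing if all_prepared => TxnState::Prepared,
        TxnState::Preparing => TxnState::Aborting,
        TxnState::Prepared => TxnState::Committing,
        TxnState::Committing => TxnState::Committed,
        TxnState::Aborting => TxnState::Aborted,
        terminal => terminal, // Committed and Aborted are final
    }
}
```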
Consistency Levels:
- Eventual: Don’t wait for participants
- Quorum: Wait for majority (N/2 + 1)
- Strong: Wait for all participants
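The three levels map directly to how many prepare acknowledgements the coordinator waits for. A sketch, assuming the coordinator simply counts acks (function names are illustrative):

```rust
#[derive(Clone, Copy)]
enum ConsistencyLevel { Eventual, Quorum, Strong }

// Number of participant acks required before phase 2 may begin.
fn required_acks(level: ConsistencyLevel, participants: usize) -> usize {
    match level {
        ConsistencyLevel::Eventual => 0,                  // don't wait
        ConsistencyLevel::Quorum => participants / 2 + 1, // majority: N/2 + 1
        ConsistencyLevel::Strong => participants,         // all must prepare
    }
}

// Commit proceeds only if enough participants prepared before the timeout;
// otherwise the transaction is aborted.
fn can_commit(level: ConsistencyLevel, participants: usize, prepared: usize) -> bool {
    prepared >= required_acks(level, participants)
}
```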
Timeout Handling:
- Transactions have configurable timeout
- Auto-abort on timeout
- Recovery mechanisms for crashed coordinator
5. Query Router
Purpose: Route queries to the optimal region according to a configurable routing policy.
Routing Policies:
Latency-Based (Default)
1. Measure latency to all healthy regions
2. Select region with minimum latency
3. Prefer user's local region if healthy
Load-Based
1. Track load metrics per region
2. Select least loaded region
3. Balance load across regions
Primary-Only
1. Always route to primary region
2. Fail if primary unhealthy
3. Simple but single point of contention
Local-Preferred
1. Use user's local region if healthy
2. Fallback to latency-based
3. Best user experience
Round-Robin
1. Distribute across all healthy regions
2. Simple load balancing
3. No latency consideration
6. Health Monitor
Purpose: Continuous health monitoring of all regions.
Health Check Flow:
1. Periodic ping (default: 5 seconds)
2. Measure response time
3. Update health status
4. Track consecutive failures
5. Mark unhealthy after threshold (default: 3)
6. Calculate uptime percentage
Health Metrics:
- is_healthy: Current health status
- consecutive_failures: Failure streak
- total_checks: Lifetime checks
- total_failures: Lifetime failures
- uptime_percentage: Availability metric
Recovery:
- Auto-recovery when region responds
- Reset consecutive failure count
- Update health status immediately
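The threshold and recovery rules above can be sketched in a few lines (a simplification using the metric names listed earlier; the real monitor also tracks response times): a region is marked unhealthy after `threshold` consecutive failed checks and recovers on the first success.

```rust
struct RegionHealth {
    is_healthy: bool,
    consecutive_failures: u32,
    total_checks: u64,
    total_failures: u64,
    threshold: u32,
}

impl RegionHealth {
    fn new(threshold: u32) -> Self {
        RegionHealth { is_healthy: true, consecutive_failures: 0,
                       total_checks: 0, total_failures: 0, threshold }
    }

    // Record the outcome of one periodic health check.
    fn record(&mut self, check_ok: bool) {
        self.total_checks += 1;
        if check_ok {
            self.consecutive_failures = 0;
            self.is_healthy = true; // auto-recovery on first success
        } else {
            self.total_failures += 1;
            self.consecutive_failures += 1;
            if self.consecutive_failures >= self.threshold {
                self.is_healthy = false;
            }
        }
    }

    fn uptime_percentage(&self) -> f64 {
        if self.total_checks == 0 { return 100.0; }
        100.0 * (self.total_checks - self.total_failures) as f64
              / self.total_checks as f64
    }
}
```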
7. Partition Handler
Purpose: Detect network partitions and prevent split-brain.
Partition Detection:
1. Monitor region connectivity
2. Build connectivity graph
3. Find connected components
4. Detect split-brain (multiple components)
5. Use quorum to resolve
Quorum Management:
- Minimum regions required for operations
- Default: N/2 + 1 (majority)
- Prevents split-brain scenarios
Partition Events:
- PartitionDetected: Network split detected
- PartitionHealed: Network restored
- SplitBrainDetected: Multiple primaries
- RegionIsolated: Region can’t reach others
- QuorumLost: Not enough healthy regions
Split-Brain Prevention:
if components.len() > 1 {
    // Multiple partitions detected
    for component in components {
        if component.len() < quorum_size {
            // This partition loses, mark read-only
            mark_readonly(component);
        }
    }
}
Data Flow
Write Operation
1. Client writes to local region
2. Create WAL entry
3. Apply locally
4. Replicate to other regions (async)
5. Other regions apply WAL entry
6. Detect conflicts (if any)
7. Resolve conflicts
8. Update replication stats
Read Operation
1. Client sends query
2. Router determines target region
3. Route to optimal region
4. Execute query
5. Return results
Global Transaction
1. Client begins global transaction
2. Coordinator tracks transaction
3. Client adds operations
4. Client commits
5. Coordinator: Phase 1 - Prepare
   - Ask all regions if ready
   - Wait for responses
6. If all prepared: Coordinator: Phase 2 - Commit
   - Tell all regions to commit
   - Transaction complete
7. If any abort:
   - Abort transaction
   - Rollback all regions
Consistency Model
CAP Theorem Trade-offs
The system provides tunable consistency via consistency levels:
Eventual Consistency (AP)
- Availability + Partition tolerance
- Best performance
- Eventual consistency
Quorum Consistency (CP with A)
- Consistency + Partition tolerance + Some availability
- Balanced approach
- Majority quorum required
Strong Consistency (CP)
- Consistency + Partition tolerance
- Highest consistency
- Lower availability
Consistency Guarantees
- Write Consistency: Guaranteed via 2PC (for global transactions)
- Read Consistency: Depends on routing policy
- Conflict Resolution: Eventually consistent with deterministic resolution
Failure Scenarios
Primary Region Failure
1. Health monitor detects failure
2. Mark primary unhealthy
3. If auto-failover enabled:
   a. Select healthy secondary
   b. Promote to primary
   c. Update topology
4. Redirect writes to new primary
Secondary Region Failure
1. Health monitor detects failure
2. Mark secondary unhealthy
3. Stop routing reads to failed region
4. Continue operations with remaining regions
5. Replication continues to healthy regions
Network Partition
1. Partition handler detects split
2. Determine quorum for each partition
3. Partition with quorum continues operations
4. Other partitions enter read-only mode
5. When partition heals:
   a. Reconcile conflicts
   b. Restore normal operations
Transaction Coordinator Failure
1. Detect coordinator failure
2. Recovery process:
   a. Read transaction log
   b. For Preparing state: Abort
   c. For Committing state: Complete commit
3. Resume normal operations
Performance Characteristics
Replication Throughput
- Depends on network bandwidth
- Compression reduces bandwidth by ~70%
- Async replication: non-blocking writes
Transaction Latency
- Eventual: ~1ms (local only)
- Quorum: ~50ms (network RTT)
- Strong: ~100ms (multiple regions)
Query Routing
- Latency overhead: ~1ms
- Routing decision: O(N) where N = regions
- Cached latency measurements
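The O(N) routing decision over cached latency measurements amounts to a single scan. A sketch (struct and field names are illustrative): pick the healthy region with the lowest observed latency, or none if no region is healthy.

```rust
struct RegionStatus { id: String, healthy: bool, latency_ms: u64 }

// One O(N) pass over cached measurements: cheapest healthy region wins.
fn route(regions: &[RegionStatus]) -> Option<&str> {
    regions.iter()
        .filter(|r| r.healthy)
        .min_by_key(|r| r.latency_ms)
        .map(|r| r.id.as_str())
}
```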
Memory Usage
- Per region: ~1KB metadata
- Per WAL entry: ~200 bytes + data
- Bounded queues prevent memory exhaustion
Security Considerations
Encryption
- AES-256 for WAL streaming
- TLS for inter-region communication
- At-rest encryption (optional)
Authentication
- Mutual TLS between regions
- Token-based authentication
- Role-based access control
Network Security
- VPN tunnels recommended
- Firewall rules between regions
- DDoS protection at edge
Operational Guidelines
Monitoring
- Track replication lag per region
- Monitor health check failures
- Alert on partition events
- Track transaction success rate
Scaling
- Add regions dynamically
- Rebalance load after adding region
- Remove regions gracefully
- No downtime for topology changes
Backup & Recovery
- Per-region backups
- Cross-region backup replication
- Point-in-time recovery
- Transaction log archival
Disaster Recovery
- Detect disaster (region unavailable)
- Promote healthy region to primary
- Redirect all traffic
- When disaster region recovers:
- Sync from primary
- Rejoin as secondary
Testing Strategy
Unit Tests
- Component isolation
- Mock dependencies
- Fast execution
- High coverage (>80%)
Integration Tests
- Multi-component interaction
- Real async operations
- Scenario-based testing
Chaos Engineering
- Random region failures
- Network partition simulation
- Transaction abort scenarios
- Load testing
Future Enhancements
- Geo-replication: Optimize for geographic proximity
- Read Replicas: Add read-only replicas per region
- CDC: Change Data Capture for external systems
- Multi-DC Raft: Consensus across datacenters
- Automatic Sharding: Distribute data across regions