Multi-Region Architecture

This document describes the technical architecture and design decisions for the HeliosDB multi-region deployment system.

Overview

The multi-region system enables HeliosDB to operate across multiple geographic regions with:

  • Automatic replication
  • Conflict resolution
  • Global transaction coordination
  • Split-brain prevention
  • Query routing

Architecture Diagram

┌───────────────────────────────────────────────────────────────┐
│                     Multi-Region Cluster                      │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │   Region A   │    │   Region B   │    │   Region C   │     │
│  │  (Primary)   │    │ (Secondary)  │    │ (Secondary)  │     │
│  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘     │
│         │                   │                   │             │
│         └───────────────────┼───────────────────┘             │
│                             │                                 │
│                     ┌───────▼──────┐                          │
│                     │ Replication  │                          │
│                     │    Engine    │                          │
│                     └───────┬──────┘                          │
│                             │                                 │
│         ┌───────────────────┼───────────────────┐             │
│         │                   │                   │             │
│  ┌──────▼───────┐    ┌──────▼───────┐    ┌──────▼───────┐     │
│  │   Conflict   │    │    Global    │    │    Query     │     │
│  │  Resolution  │    │ Coordinator  │    │    Router    │     │
│  └──────────────┘    └──────────────┘    └──────────────┘     │
│                                                               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │    Health    │    │  Partition   │    │   Topology   │     │
│  │   Monitor    │    │   Handler    │    │   Manager    │     │
│  └──────────────┘    └──────────────┘    └──────────────┘     │
│                                                               │
└───────────────────────────────────────────────────────────────┘

Core Components

1. Topology Manager

Purpose: Manages cluster topology and region membership.

Key Responsibilities:

  • Track all regions and their configurations
  • Manage region roles (Primary/Secondary/Standby)
  • Handle region promotion/demotion
  • Maintain topology versioning

Data Structures:

struct RegionTopology {
    regions: HashMap<String, RegionMetadata>,
    primary_region: Option<String>,
    version: u64,
}

struct RegionMetadata {
    config: RegionConfig,
    role: RegionRole,
    epoch: u64,
    version: u64,
}

Design Decisions:

  • Use versioning for topology changes
  • Cache region metadata for fast lookups
  • Thread-safe with RwLock and DashMap
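
A minimal sketch of versioned promotion built on the structures above; promote is a hypothetical method name, not HeliosDB's actual API:

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum RegionRole { Primary, Secondary, Standby }

struct RegionMetadata { role: RegionRole, epoch: u64, version: u64 }

struct RegionTopology {
    regions: HashMap<String, RegionMetadata>,
    primary_region: Option<String>,
    version: u64,
}

impl RegionTopology {
    // Promote `region` to primary, demoting the current primary.
    // Bumping `version` lets peers reject stale topology updates;
    // bumping `epoch` fences writes accepted under the old primary.
    fn promote(&mut self, region: &str) -> Option<u64> {
        if !self.regions.contains_key(region) {
            return None; // unknown region: leave topology untouched
        }
        if let Some(old) = self.primary_region.take() {
            if let Some(meta) = self.regions.get_mut(&old) {
                meta.role = RegionRole::Secondary;
            }
        }
        let meta = self.regions.get_mut(region)?;
        meta.role = RegionRole::Primary;
        meta.epoch += 1;
        meta.version += 1;
        self.primary_region = Some(region.to_string());
        self.version += 1;
        Some(self.version)
    }
}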

2. Replication Engine

Purpose: Handles WAL streaming and replication between regions.

Key Features:

  • Asynchronous replication using WAL streaming
  • Compression (LZ4) for bandwidth efficiency
  • Encryption (AES) for security
  • Lag monitoring and tracking
  • Conflict detection

Replication Flow:

1. Write occurs in Region A
2. Create WAL entry
3. Compress entry (if enabled)
4. Encrypt entry (if enabled)
5. Stream to Regions B, C, ...
6. Apply to remote regions
7. Detect conflicts
8. Resolve conflicts
9. Update replication stats
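
Steps 3-4 can be sketched as a small pipeline. compress and encrypt below are placeholders for the real LZ4 and AES codecs (the names are illustrative); compression runs before encryption because ciphertext is effectively incompressible:

// Prepare a serialized WAL entry for streaming (steps 3-4 above).
fn prepare_for_stream(
    raw: Vec<u8>,
    compress: Option<&dyn Fn(&[u8]) -> Vec<u8>>,
    encrypt: Option<&dyn Fn(&[u8]) -> Vec<u8>>,
) -> Vec<u8> {
    // Compress first: encrypting first would destroy compressibility.
    let compressed = match compress {
        Some(c) => c(&raw),
        None => raw,
    };
    match encrypt {
        Some(e) => e(&compressed),
        None => compressed,
    }
}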

Data Structures:

struct WalEntry {
    id: Uuid,
    sequence: u64,
    timestamp: DateTime<Utc>,
    source_region: String,
    operation: WalOperation,
}

struct ReplicationStream {
    region_id: String,
    wal_queue: VecDeque<WalEntry>,
    sequence: AtomicU64,
    lag_ms: AtomicU64,
}

Performance Optimizations:

  • Queue-based streaming (bounded queues)
  • Batch operations where possible
  • Async processing with Tokio
  • LZ4 compression (fast and efficient)
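
A sketch of the bounded-queue idea; OutboundQueue and MAX_QUEUE_LEN are illustrative names, not the real identifiers. A full queue hands the payload back to the caller so backpressure propagates instead of the queue growing without bound:

use std::collections::VecDeque;

const MAX_QUEUE_LEN: usize = 10_000; // hypothetical bound

struct PendingEntry { sequence: u64, payload: Vec<u8> }

struct OutboundQueue { entries: VecDeque<PendingEntry>, next_seq: u64 }

impl OutboundQueue {
    // Err returns the payload so the caller can retry or shed load.
    fn enqueue(&mut self, payload: Vec<u8>) -> Result<u64, Vec<u8>> {
        if self.entries.len() >= MAX_QUEUE_LEN {
            return Err(payload);
        }
        let sequence = self.next_seq;
        self.next_seq += 1;
        self.entries.push_back(PendingEntry { sequence, payload });
        Ok(sequence)
    }
}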

3. Conflict Resolution

Purpose: Resolve conflicts when the same data is modified in multiple regions.

Strategies:

Last-Write-Wins (LWW)

  • Uses Hybrid Logical Clock (HLC) for causality
  • Timestamp-based with node_id tiebreaker
  • Simple and deterministic

if remote.timestamp > local.timestamp {
    return remote; // Remote is newer
} else if remote.timestamp == local.timestamp {
    // Equal timestamps: deterministic node_id tiebreaker
    return if remote.node_id > local.node_id { remote } else { local };
} else {
    return local; // Local is newer
}

First-Write-Wins (FWW)

  • Keep the earliest write
  • Useful for immutable data

Version Vectors

  • Track causality per node
  • Detect concurrent updates
  • Fall back to LWW for concurrent conflicts (see the sketch below)
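
A minimal concurrency check; VersionVector and is_concurrent are illustrative names:

use std::collections::HashMap;

type VersionVector = HashMap<String, u64>; // node_id -> update counter

// Neither vector dominating the other means the two updates are
// concurrent, so the resolver falls back to LWW.
fn is_concurrent(a: &VersionVector, b: &VersionVector) -> bool {
    let a_ahead = a.iter().any(|(n, &c)| c > *b.get(n).unwrap_or(&0));
    let b_ahead = b.iter().any(|(n, &c)| c > *a.get(n).unwrap_or(&0));
    a_ahead && b_ahead
}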

Conflict Detection:

fn detect_conflict(local: &VersionedRow, remote: &VersionedRow) -> bool {
    local.key == remote.key && local.value != remote.value
}

4. Global Coordinator

Purpose: Coordinate transactions across multiple regions using Two-Phase Commit (2PC).

2PC Protocol:

Phase 1: Prepare
┌────────────┐
│ Coordinator│
└─────┬──────┘
├─── PREPARE ──→ Region A ──→ PREPARED
├─── PREPARE ──→ Region B ──→ PREPARED
└─── PREPARE ──→ Region C ──→ PREPARED
Phase 2: Commit
┌────────────┐
│ Coordinator│
└─────┬──────┘
├─── COMMIT ──→ Region A ──→ COMMITTED
├─── COMMIT ──→ Region B ──→ COMMITTED
└─── COMMIT ──→ Region C ──→ COMMITTED

Transaction States:

  • Preparing: Asking participants if ready
  • Prepared: All participants ready
  • Committing: Telling participants to commit
  • Committed: Transaction complete
  • Aborting/Aborted: Transaction cancelled

Consistency Levels:

  • Eventual: Don’t wait for participants
  • Quorum: Wait for majority (N/2 + 1)
  • Strong: Wait for all participants
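
A sketch of the prepare-phase decision and the quorum rule; the names and signatures are illustrative, not the coordinator's real API:

#[derive(Clone, Copy, PartialEq, Debug)]
enum TxnState { Preparing, Prepared, Committing, Committed, Aborting, Aborted }

// A single PREPARE "no" vote (or a timeout, treated as a "no")
// forces an abort; only a unanimous "yes" moves the transaction on.
fn after_prepare_phase(votes: &[bool]) -> TxnState {
    if !votes.is_empty() && votes.iter().all(|&v| v) {
        TxnState::Committing
    } else {
        TxnState::Aborting
    }
}

// Quorum consistency: proceed once a majority of `n` regions ack.
fn quorum_met(acks: usize, n: usize) -> bool {
    acks >= n / 2 + 1
}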

Timeout Handling:

  • Transactions have configurable timeout
  • Auto-abort on timeout
  • Recovery mechanisms for crashed coordinator

5. Query Router

Purpose: Route queries to the optimal region based on the configured policy.

Routing Policies:

Latency-Based (Default)

1. Measure latency to all healthy regions
2. Select region with minimum latency
3. Prefer user's local region if healthy
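
A minimal sketch of this policy; RegionStatus and route are illustrative names:

struct RegionStatus { id: String, healthy: bool, latency_ms: u64 }

// Prefer the client's local region when healthy (step 3), otherwise
// fall back to the healthy region with the lowest measured latency.
fn route<'a>(regions: &'a [RegionStatus], local: &str) -> Option<&'a RegionStatus> {
    if let Some(r) = regions.iter().find(|r| r.id == local && r.healthy) {
        return Some(r);
    }
    regions.iter().filter(|r| r.healthy).min_by_key(|r| r.latency_ms)
}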

Load-Based

1. Track load metrics per region
2. Select least loaded region
3. Balance load across regions

Primary-Only

1. Always route to primary region
2. Fail if primary unhealthy
3. Simple, but the primary becomes a single point of contention

Local-Preferred

1. Use user's local region if healthy
2. Fallback to latency-based
3. Best user experience

Round-Robin

1. Distribute across all healthy regions
2. Simple load balancing
3. No latency consideration

6. Health Monitor

Purpose: Continuous health monitoring of all regions.

Health Check Flow:

1. Periodic ping (default: 5 seconds)
2. Measure response time
3. Update health status
4. Track consecutive failures
5. Mark unhealthy after threshold (default: 3)
6. Calculate uptime percentage

Health Metrics:

  • is_healthy: Current health status
  • consecutive_failures: Failure streak
  • total_checks: Lifetime checks
  • total_failures: Lifetime failures
  • uptime_percentage: Availability metric

Recovery:

  • Auto-recovery when region responds
  • Reset consecutive failure count
  • Update health status immediately
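
The flow, metrics, and recovery rules above fit in one small state tracker; a sketch with illustrative names:

struct RegionHealth {
    consecutive_failures: u32,
    total_checks: u64,
    total_failures: u64,
    is_healthy: bool,
}

const FAILURE_THRESHOLD: u32 = 3; // default threshold from the flow above

impl RegionHealth {
    // Record one ping result: mark unhealthy after the threshold,
    // recover immediately on the first success.
    fn record(&mut self, ok: bool) {
        self.total_checks += 1;
        if ok {
            self.consecutive_failures = 0;
            self.is_healthy = true;
        } else {
            self.total_failures += 1;
            self.consecutive_failures += 1;
            if self.consecutive_failures >= FAILURE_THRESHOLD {
                self.is_healthy = false;
            }
        }
    }

    fn uptime_percentage(&self) -> f64 {
        if self.total_checks == 0 { return 100.0; }
        100.0 * (self.total_checks - self.total_failures) as f64
            / self.total_checks as f64
    }
}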

7. Partition Handler

Purpose: Detect network partitions and prevent split-brain.

Partition Detection:

1. Monitor region connectivity
2. Build connectivity graph
3. Find connected components
4. Detect split-brain (multiple components)
5. Use quorum to resolve
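
Steps 2-4 reduce to finding connected components of the connectivity graph; a depth-first sketch with illustrative types, where more than one component means the cluster is partitioned:

use std::collections::{HashMap, HashSet};

// `graph` maps each region to the regions it can currently reach.
fn components(graph: &HashMap<String, Vec<String>>) -> Vec<Vec<String>> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut out = Vec::new();
    for start in graph.keys() {
        if seen.contains(start.as_str()) { continue; }
        let mut stack = vec![start.as_str()];
        let mut comp = Vec::new();
        while let Some(node) = stack.pop() {
            if !seen.insert(node) { continue; }
            comp.push(node.to_string());
            if let Some(neighbors) = graph.get(node) {
                stack.extend(neighbors.iter().map(|s| s.as_str()));
            }
        }
        out.push(comp);
    }
    out
}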

Quorum Management:

  • Minimum regions required for operations
  • Default: N/2 + 1 (majority)
  • Prevents split-brain scenarios

Partition Events:

  • PartitionDetected: Network split detected
  • PartitionHealed: Network restored
  • SplitBrainDetected: Multiple primaries
  • RegionIsolated: Region can’t reach others
  • QuorumLost: Not enough healthy regions

Split-Brain Prevention:

if components.len() > 1 {
    // Multiple partitions detected
    for component in components {
        if component.len() < quorum_size {
            // This partition lacks quorum: it loses, mark read-only
            mark_readonly(component);
        }
    }
}

Data Flow

Write Operation

1. Client writes to local region
2. Create WAL entry
3. Apply locally
4. Replicate to other regions (async)
5. Other regions apply WAL entry
6. Detect conflicts (if any)
7. Resolve conflicts
8. Update replication stats

Read Operation

1. Client sends query
2. Router determines target region
3. Route to optimal region
4. Execute query
5. Return results

Global Transaction

1. Client begins global transaction
2. Coordinator tracks transaction
3. Client adds operations
4. Client commits
5. Coordinator: Phase 1 - Prepare
   - Ask all regions if ready
   - Wait for responses
6. If all prepared, Coordinator: Phase 2 - Commit
   - Tell all regions to commit
   - Transaction complete
7. If any participant votes abort:
   - Abort transaction
   - Roll back all regions

Consistency Model

CAP Theorem Trade-offs

The system provides tunable consistency via consistency levels:

Eventual Consistency (AP)

  • Availability + Partition tolerance
  • Best performance
  • Eventual consistency

Quorum Consistency (CP with some availability)

  • Consistency + Partition tolerance + Some availability
  • Balanced approach
  • Majority quorum required

Strong Consistency (CP)

  • Consistency + Partition tolerance
  • Highest consistency
  • Lower availability

Consistency Guarantees

  • Write Consistency: Guaranteed via 2PC (for global transactions)
  • Read Consistency: Depends on routing policy
  • Conflict Resolution: Eventually consistent with deterministic resolution

Failure Scenarios

Primary Region Failure

1. Health monitor detects failure
2. Mark primary unhealthy
3. If auto-failover enabled:
   a. Select healthy secondary
   b. Promote to primary
   c. Update topology
4. Redirect writes to new primary

Secondary Region Failure

1. Health monitor detects failure
2. Mark secondary unhealthy
3. Stop routing reads to failed region
4. Continue operations with remaining regions
5. Replication continues to healthy regions

Network Partition

1. Partition handler detects split
2. Determine quorum for each partition
3. Partition with quorum continues operations
4. Other partitions enter read-only mode
5. When partition heals:
   a. Reconcile conflicts
   b. Restore normal operations

Transaction Coordinator Failure

1. Detect coordinator failure
2. Recovery process:
   a. Read transaction log
   b. For Preparing state: Abort
   c. For Committing state: Complete commit
3. Resume normal operations
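
A sketch of the recovery rule above, assuming a presumed-abort log format; the enum and function are illustrative:

#[derive(Clone, Copy, PartialEq)]
enum RecoveredState { Preparing, Committing, Committed, Aborted }

// An in-doubt transaction is aborted unless the commit decision
// was already logged, in which case the commit is completed.
fn recovery_action(state: RecoveredState) -> RecoveredState {
    match state {
        RecoveredState::Preparing => RecoveredState::Aborted,    // step 2b
        RecoveredState::Committing => RecoveredState::Committed, // step 2c
        terminal => terminal, // already decided; nothing to do
    }
}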

Performance Characteristics

Replication Throughput

  • Depends on network bandwidth
  • Compression reduces bandwidth by ~70%
  • Async replication: non-blocking writes

Transaction Latency

  • Eventual: ~1ms (local only)
  • Quorum: ~50ms (network RTT)
  • Strong: ~100ms (multiple regions)

Query Routing

  • Latency overhead: ~1ms
  • Routing decision: O(N), where N is the number of regions
  • Cached latency measurements

Memory Usage

  • Per region: ~1KB metadata
  • Per WAL entry: ~200 bytes + data
  • Bounded queues prevent memory exhaustion

Security Considerations

Encryption

  • AES-256 for WAL streaming
  • TLS for inter-region communication
  • At-rest encryption (optional)

Authentication

  • Mutual TLS between regions
  • Token-based authentication
  • Role-based access control

Network Security

  • VPN tunnels recommended
  • Firewall rules between regions
  • DDoS protection at edge

Operational Guidelines

Monitoring

  • Track replication lag per region
  • Monitor health check failures
  • Alert on partition events
  • Track transaction success rate

Scaling

  • Add regions dynamically
  • Rebalance load after adding region
  • Remove regions gracefully
  • No downtime for topology changes

Backup & Recovery

  • Per-region backups
  • Cross-region backup replication
  • Point-in-time recovery
  • Transaction log archival

Disaster Recovery

  1. Detect disaster (region unavailable)
  2. Promote healthy region to primary
  3. Redirect all traffic
  4. When disaster region recovers:
    • Sync from primary
    • Rejoin as secondary

Testing Strategy

Unit Tests

  • Component isolation
  • Mock dependencies
  • Fast execution
  • High coverage (>80%)

Integration Tests

  • Multi-component interaction
  • Real async operations
  • Scenario-based testing

Chaos Engineering

  • Random region failures
  • Network partition simulation
  • Transaction abort scenarios
  • Load testing

Future Enhancements

  1. Geo-replication: Optimize for geographic proximity
  2. Read Replicas: Add read-only replicas per region
  3. CDC: Change Data Capture for external systems
  4. Multi-DC Raft: Consensus across datacenters
  5. Automatic Sharding: Distribute data across regions
