Synchronous Replication Overhead Analysis

Executive Summary

This analysis quantifies the latency impact and throughput effects of synchronous mirroring (single mirror per shard) on HeliosDB’s write performance, examining the trade-offs between data durability (RPO=0) and write throughput across different network configurations and workload patterns.

1. Synchronous Replication Architecture

1.1 Write Path with Mirroring

Client Write Request
┌────────────────────────────────────┐
│ Primary Storage Node │
├────────────────────────────────────┤
│ 1. Validate write │ ← 50 μs
│ 2. Append to commit log (local) │ ← 400 μs (NVMe)
│ 3. Insert to memtable (local) │ ← 50 μs
│ 4. Replicate to mirror → → → → → →│─┐
└────────────────────────────────────┘ │
│ Network RTT
┌────────────────────────────────────┐ │
│ Mirror Storage Node │←┘
├────────────────────────────────────┤
│ 5. Receive replication data │
│ 6. Append to commit log (local) │ ← 400 μs
│ 7. Insert to memtable (local) │ ← 50 μs
│ 8. Send ACK → → → → → → → → → → → │─┐
└────────────────────────────────────┘ │
↑ │
└───────────────────────────────┘
Network RTT
┌────────────────────────────────────┐
│ Primary: Acknowledge to client │
└────────────────────────────────────┘

Critical Path Components:

  1. Primary processing: 500 μs (validate + commit log + memtable)
  2. Network to mirror: RTT/2
  3. Mirror processing: 450 μs (commit log + memtable)
  4. Network ACK: RTT/2
  5. Total network: 1 RTT

Total latency = Primary (500 μs) + Mirror (450 μs) + Network (1 RTT)
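The critical-path formula above can be captured as a small latency model. This is an illustrative sketch (the function and constant names are ours); the component costs are the figures from the diagram:

```rust
/// Component costs from the write-path diagram above.
const PRIMARY_PROCESSING_US: u64 = 500; // validate (50) + commit log (400) + memtable (50)
const MIRROR_PROCESSING_US: u64 = 450;  // commit log (400) + memtable (50)

/// Client-visible latency of one synchronous write: primary work + mirror
/// work + one full network RTT (RTT/2 to ship the write, RTT/2 for the ACK).
pub fn total_write_latency_us(rtt_us: u64) -> u64 {
    PRIMARY_PROCESSING_US + MIRROR_PROCESSING_US + rtt_us
}
```

For example, with a 10 μs RDMA round trip this yields the ~960 μs figure used throughout Section 2.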

1.2 Comparison: Asynchronous vs Synchronous

Asynchronous Replication (Not Default):

Write latency seen by client:
- Local commit log: 400 μs
- Local memtable: 50 μs
- Client ACK: 450 μs
Background replication (not blocking):
- Network to mirror: Asynchronous
- Mirror write: Happens in parallel
Client latency: 450 μs
RPO: >0 (potential data loss if primary fails before replication)

Synchronous Replication (HeliosDB Default):

Write latency seen by client:
- Primary write: 500 μs
- Network RTT: Variable (see below)
- Mirror write: 450 μs
Client latency: 500 + RTT + 450 μs
RPO: 0 (no data loss on primary failure)

2. Network RTT Impact Analysis

2.1 RTT by Network Technology

| Network Type | Location | RTT (Round-Trip Time) | Write Latency | Overhead vs Async |
|---|---|---|---|---|
| RDMA (same rack) | <5 meters | 2-5 μs | 952-955 μs | +502-505 μs (+112%) |
| RDMA (same datacenter) | <100 meters | 5-15 μs | 955-965 μs | +505-515 μs (+112-114%) |
| 10 Gbps Ethernet (same rack) | <5 meters | 50-100 μs | 1,000-1,050 μs | +550-600 μs (+122-133%) |
| 10 Gbps Ethernet (same datacenter) | <100 meters | 200-500 μs | 1,150-1,450 μs | +700-1,000 μs (+156-222%) |
| 10 Gbps Ethernet (same region) | <500 km | 5-15 ms | 5,950-15,950 μs | +5,500-15,500 μs (+1,222-3,444%) |

Key Observations:

  1. RDMA (Same Datacenter):

    • RTT: 5-15 μs (negligible)
    • Total write latency: ~960 μs
    • Overhead: +112-114% vs async (but still <1 ms)
    • Recommended configuration for production
  2. 10 Gbps Ethernet (Same Datacenter):

    • RTT: 200-500 μs (significant)
    • Total write latency: 1.2-1.5 ms
    • Overhead: +156-222% vs async
    • Acceptable for moderate write workloads
  3. Cross-Region Replication:

    • RTT: 5-15 ms (prohibitive for synchronous)
    • Total write latency: 6-16 ms
    • Not suitable for synchronous replication
    • Use asynchronous replication for geo-distributed setups

2.2 Quantitative Impact on Write Throughput

Single-Threaded Write Performance:

| Network | Write Latency | Max Throughput (1/latency) | Writes/sec |
|---|---|---|---|
| Async (baseline) | 450 μs | 1 ÷ 0.00045 s | 2,222 |
| RDMA (same DC) | 960 μs | 1 ÷ 0.00096 s | 1,042 |
| 10G Eth (same DC) | 1,200 μs | 1 ÷ 0.0012 s | 833 |
| 10G Eth (same region) | 6,000 μs | 1 ÷ 0.006 s | 167 |

Throughput reduction:

  • RDMA: 53% reduction (2,222 → 1,042 writes/sec)
  • 10G Ethernet: 62% reduction (2,222 → 833 writes/sec)

Multi-Threaded Write Performance (100 Concurrent Writes):

Pipelined writes allow overlapping network RTT
RDMA Configuration:
- Per-write latency: 960 μs
- Concurrency: 100 parallel writes
- Pipeline depth: ~100 in-flight writes
- Effective throughput: 100 × (1 ÷ 0.00096s) = 104,167 writes/sec
vs Async:
- Throughput: 100 × (1 ÷ 0.00045s) = 222,222 writes/sec
- Reduction: 53% (same as single-threaded)
Conclusion: Synchronous mirroring reduces throughput by 50-60% regardless of concurrency
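The single-threaded and pipelined figures above come from the same formula: one writer is bounded by 1/latency, and N concurrent pipelined writes scale that to N/latency. A minimal sketch (function name is ours; integer division rounds down, so results can differ by one from the rounded table values):

```rust
/// Max write throughput given per-write latency and pipeline concurrency.
/// One writer is bounded by 1/latency; N in-flight writes give N/latency.
pub fn max_throughput_wps(write_latency_us: u64, concurrency: u64) -> u64 {
    concurrency * 1_000_000 / write_latency_us
}
```

This reproduces the table: 450 μs async gives 2,222 writes/sec single-threaded, and 960 μs RDMA with 100 in-flight writes gives ~104K writes/sec.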

3. Workload-Specific Impact Analysis

3.1 OLTP Workload (High Write Frequency)

Profile:

  • Write rate: 50,000 transactions/sec (target)
  • Average write size: 2 KB
  • Read:Write ratio: 70:30

Async Replication:

Per-node write capacity:
- Single-threaded: 2,222 writes/sec
- With 100 threads: 222,222 writes/sec
- Nodes needed for 50K writes/sec: 50,000 ÷ 222,222 = 1 node
Total nodes: 1 storage node (plus replicas for reads)

Sync Replication (RDMA):

Per-node write capacity:
- Single-threaded: 1,042 writes/sec
- With 100 threads: 104,167 writes/sec
- Nodes needed for 50K writes/sec: 50,000 ÷ 104,167 = 1 node (still sufficient)
Total nodes: 1 storage node
Latency penalty: +113% (960 μs vs 450 μs)

Analysis:

  • For moderate OLTP workloads (<100K writes/sec), synchronous replication with RDMA is viable
  • Latency increases by ~500 μs, but stays under 1 ms
  • Trade-off: 2x write latency for zero data loss (RPO=0)

Sync Replication (10G Ethernet):

Per-node write capacity:
- With 100 threads: 83,333 writes/sec
- Nodes needed: 50,000 ÷ 83,333 = 1 node (marginal)
At peak load (100K writes/sec):
- Nodes needed: 100,000 ÷ 83,333 = 2 nodes
- With RDMA: 100,000 ÷ 104,167 = 1 node
Cost: 2x nodes needed vs RDMA

Recommendation for OLTP: Use RDMA for synchronous replication to maintain <1ms write latency.

3.2 HTAP Workload (Mixed Read/Write)

Profile:

  • Write rate: 10,000 transactions/sec
  • Read rate: 100,000 queries/sec
  • Analytical queries: 1,000/sec (complex, multi-shard)

Impact of Synchronous Replication:

Write latency impact:
- Async: 450 μs
- Sync (RDMA): 960 μs
- Difference: +510 μs
For transactional inserts (user-facing):
- Async: User sees 450 μs insert latency
- Sync: User sees 960 μs insert latency
- Perceptible but acceptable (<1ms)
For analytical queries:
- Write latency doesn't affect read performance
- Reads can hit either primary or mirror (load balancing)
- **Synchronous replication enables zero-lag mirror reads**

Benefit: Real-Time Analytics on Mirror

Async replication lag: 10-100 ms (typical)
- Analytics queries may see stale data
- Not suitable for real-time dashboards
Sync replication lag: 0 ms
- Mirror is always up-to-date
- Analytics queries can safely read from mirror
- Load balancing: Primary handles writes, mirror handles reads
Read throughput improvement:
- Single node: 100K reads/sec
- With mirror (read replica): 200K reads/sec
- **2x read capacity with zero lag**

Recommendation for HTAP: Synchronous replication is ideal—enables real-time analytics on mirror while maintaining durability.

3.3 Bulk Load Workload (Write-Heavy)

Profile:

  • Initial data load: 1 TB of data
  • Sustained write rate: 500,000 rows/sec
  • Temporary (batch processing)

Async Replication:

Write throughput per node:
- With 200 concurrent threads: 444,444 writes/sec
- Nodes needed for 500K writes/sec: 500,000 ÷ 444,444 = 2 nodes
Load time for 1 TB:
- Row size: 2 KB
- Total rows: 1 TB ÷ 2 KB = 500M rows
- Time: 500M ÷ 500K writes/sec = 1,000 seconds = 16.7 minutes

Sync Replication (RDMA):

Write throughput per node:
- With 200 concurrent threads: 208,333 writes/sec
- Nodes needed for 500K writes/sec: 500,000 ÷ 208,333 = 3 nodes
Load time for 1 TB:
- Time: 500M ÷ 500K writes/sec = 1,000 seconds = 16.7 minutes (same)
Additional resource cost:
- 3 nodes vs 2 nodes = +50% compute
- But mirrors provide redundancy
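The node-count and load-time arithmetic in these estimates is ceiling division plus a rate calculation; a sketch with illustrative function names:

```rust
/// Nodes needed to sustain a target write rate (rounds up: a fractional
/// node requirement still means provisioning a whole node).
pub fn nodes_needed(target_wps: u64, per_node_wps: u64) -> u64 {
    (target_wps + per_node_wps - 1) / per_node_wps
}

/// Seconds to bulk-load `total_rows` at a sustained cluster write rate.
pub fn load_time_secs(total_rows: u64, cluster_wps: u64) -> u64 {
    total_rows / cluster_wps
}
```

With the numbers above: 500K writes/sec needs 2 async nodes (444K each) but 3 sync-RDMA nodes (208K each), and 500M rows at 500K rows/sec takes 1,000 seconds either way.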

Optimization: Disable Synchronous Replication for Bulk Loads

-- HeliosDB DDL extension
ALTER TABLE staging_data SET REPLICATION MODE = 'ASYNC';
-- Bulk load with asynchronous replication
COPY staging_data FROM '/data/bulk.csv';
-- Re-enable synchronous replication
ALTER TABLE staging_data SET REPLICATION MODE = 'SYNC';
-- Force mirror catch-up (blocking)
ALTER TABLE staging_data SYNC REPLICAS;

Result:

  • Bulk load at full async speed (444K writes/sec per node)
  • Post-load sync: 100-500 seconds (depending on mirror lag)
  • Best of both worlds: Fast loads + eventual strong consistency

Recommendation for Bulk Loads: Temporarily disable synchronous replication, re-enable after load completion.

4. Failure Scenarios and Recovery Time

4.1 Primary Node Failure (Synchronous Replication)

Scenario: Primary node crashes mid-transaction

State at failure:
- Primary commit log: Last write at LSN 1,234,567
- Mirror commit log: Last write at LSN 1,234,567 (guaranteed)
- No data loss
Failover process:
1. Witness detects primary failure: 1-3 seconds (heartbeat timeout)
2. Witness grants failover to mirror: 100 ms (quorum decision)
3. Mirror promoted to primary: 50 ms (metadata update)
4. New primary ready for writes: 50 ms (state transition)
Total: 1.2-3.2 seconds
Recovery Point Objective (RPO): 0 (no data loss)
Recovery Time Objective (RTO): 1-3 seconds

Client Impact:

In-flight writes at failure time:
- Writes acknowledged to client: 0 lost (committed to mirror)
- Writes not yet acknowledged: Failed, client retries (idempotent writes)
Downtime: 1-3 seconds
User experience: Brief unavailability, automatic recovery

4.2 Primary Node Failure (Asynchronous Replication)

Scenario: Primary node crashes mid-transaction

State at failure:
- Primary commit log: Last write at LSN 1,234,567
- Mirror commit log: Last write at LSN 1,234,500 (lagging)
- Data loss: 67 transactions
Replication lag at failure:
- Typical async lag: 10-100 ms
- Writes in-flight: 100 ms × 100K writes/sec = 10,000 writes
- Data loss: Up to 10,000 transactions
Recovery Point Objective (RPO): 10-100 ms (data loss window)
Recovery Time Objective (RTO): 1-3 seconds (same as sync)
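The data-loss window above is simply write rate times replication lag; a one-line model (name is ours):

```rust
/// Acknowledged writes at risk if the primary fails with `lag_ms` of
/// replication lag: everything committed locally but not yet mirrored.
pub fn writes_at_risk(write_rate_wps: u64, lag_ms: u64) -> u64 {
    write_rate_wps * lag_ms / 1_000
}
```

At 100K writes/sec with 100 ms of lag, that is the 10,000 potentially lost transactions quoted above; with synchronous replication the lag, and therefore the loss, is zero.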

Client Impact:

Data loss consequences:
- 10,000 acknowledged writes lost (committed to primary, not mirror)
- Client believes writes succeeded, but data is gone
- Application-level inconsistency
Mitigation:
- Application-level write-ahead logging
- Client-side retry with idempotency keys
- **Not suitable for financial transactions or critical data**

4.3 Network Partition (Split-Brain Prevention)

Scenario: Network partition between primary and mirror

Witness-Based Quorum:
Partition Scenario 1: Primary isolated
┌─────────┐ ┌────────┐
│ Primary │ X │ Mirror │───┐
└─────────┘ └────────┘ │
┌──────────┐ │
│ Witness │─┘
└──────────┘
Quorum decision:
- Mirror + Witness = 2 votes (majority)
- Primary = 1 vote (minority)
- Mirror promoted to primary ✓
- Old primary demoted (cannot accept writes) ✓
Partition Scenario 2: Mirror isolated
┌─────────┐ ┌────────┐
│ Primary │───┐ │ Mirror │ X
└─────────┘ │ └────────┘
┌──────────┐ │
│ Witness │──┘
└──────────┘
Quorum decision:
- Primary + Witness = 2 votes (majority)
- Mirror = 1 vote (minority)
- Primary remains primary ✓
- Mirror cannot self-promote ✓
Result: No split-brain, guaranteed single active primary

Synchronous Replication Advantage:

With sync replication:
- Writes block until mirror acknowledges
- If mirror is partitioned, writes fail (but no data loss)
- Client sees write errors, can retry
With async replication:
- Writes succeed on primary (no blocking)
- If mirror is partitioned, replication lag grows
- On failover, large data loss window

Recommendation: Synchronous replication critical for split-brain safety in presence of network partitions.

5. Tuning and Optimization Strategies

5.1 Adaptive Replication Mode

Dynamic Mode Selection:

pub enum ReplicationMode {
    Synchronous,  // RPO = 0, slower writes
    Asynchronous, // RPO > 0, faster writes
    SemiSync,     // Hybrid: sync for critical tables, async for others
}

pub struct TableReplicationPolicy {
    mode: ReplicationMode,
    timeout_ms: u64, // Max time to wait for sync ACK
}

impl TableReplicationPolicy {
    pub fn new_critical() -> Self {
        Self {
            mode: ReplicationMode::Synchronous,
            timeout_ms: 5000, // 5 sec timeout before async fallback
        }
    }

    pub fn new_best_effort() -> Self {
        Self {
            mode: ReplicationMode::Asynchronous,
            timeout_ms: 0,
        }
    }
}

Per-Table Configuration:

-- Critical data: Financial transactions
CREATE TABLE transactions (
    txn_id BIGINT PRIMARY KEY,
    amount DECIMAL(20,2),
    ...
) WITH (replication_mode = 'SYNCHRONOUS');

-- Non-critical data: User sessions
CREATE TABLE sessions (
    session_id UUID PRIMARY KEY,
    user_id BIGINT,
    ...
) WITH (replication_mode = 'ASYNCHRONOUS');

-- Analytics: Logs (can tolerate loss)
CREATE TABLE access_logs (
    timestamp TIMESTAMP,
    user_id BIGINT,
    ...
) WITH (replication_mode = 'ASYNCHRONOUS');

Benefit:

  • Critical tables: RPO = 0, slower writes (acceptable for low-volume txns)
  • Non-critical tables: Fast writes, some data loss risk (acceptable for high-volume logs)
  • Cluster-wide throughput optimized

5.2 Batched Replication

Problem: Synchronous replication adds 1 RTT per write

Solution: Group multiple writes into single replication batch

Traditional (1 write per RTT):
Write 1 → Mirror ACK → 960 μs
Write 2 → Mirror ACK → 960 μs
Write 3 → Mirror ACK → 960 μs
Total: 2,880 μs for 3 writes
Batched (N writes per RTT):
Write 1 ┐
Write 2 ├→ Batch → Mirror ACK → 980 μs
Write 3 ┘
Total: 980 μs for 3 writes
Latency per write: 980 ÷ 3 = 327 μs
Throughput: 3x improvement
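The amortization above is per-batch cost divided by batch size; a one-line helper (name is ours) makes the arithmetic explicit:

```rust
/// Amortized per-write latency when `batch_size` writes share one replicated
/// round trip; `batch_latency_us` is the cost of shipping the whole batch.
pub fn amortized_latency_us(batch_latency_us: f64, batch_size: u64) -> f64 {
    batch_latency_us / batch_size as f64
}
```

With the example numbers, 980 μs shared across 3 writes is ~327 μs per write; at a batch size of 100 the per-write share of the round trip becomes negligible, which is where the ~600 μs average in the trade-off table comes from.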

Implementation:

pub struct ReplicationBatcher {
    pending_writes: Vec<WriteOp>,
    batch_size: usize,
    max_wait_us: u64,
}

impl ReplicationBatcher {
    pub async fn add_write(&mut self, write: WriteOp) -> Result<()> {
        self.pending_writes.push(write);
        if self.pending_writes.len() >= self.batch_size {
            self.flush().await?;
        }
        Ok(())
    }

    async fn flush(&mut self) -> Result<()> {
        // Send entire batch to mirror in one message
        let batch = std::mem::take(&mut self.pending_writes);
        self.send_batch_to_mirror(batch).await?;
        // Single RTT for entire batch
        self.wait_for_mirror_ack().await?;
        Ok(())
    }
}

Configuration:

[replication.batching]
enabled = true
max_batch_size = 100 # Up to 100 writes per batch
max_wait_us = 500 # Flush batch after 500 μs even if not full

Trade-offs:

| Metric | Non-Batched | Batched (size=100) |
|---|---|---|
| Latency (per write) | 960 μs | 600 μs (avg) |
| Throughput | 104K writes/sec | 167K writes/sec |
| Max delay | 960 μs | 1,460 μs (worst case: 500 μs wait + 960 μs RTT) |

Recommendation: Enable batching for high-throughput workloads, with max_wait tuned to latency SLA.

5.3 Parallel Replication Streams

Problem: Single replication stream serializes writes

Solution: Partition replication by shard/table, use multiple parallel streams

Single Stream:
Primary → [Queue: W1, W2, W3, ...] → Mirror
Bottleneck: Network RTT serializes all writes
Parallel Streams (4 streams):
Primary → [Queue 1: W1, W5, ...] → Mirror
Primary → [Queue 2: W2, W6, ...] → Mirror
Primary → [Queue 3: W3, W7, ...] → Mirror
Primary → [Queue 4: W4, W8, ...] → Mirror
Throughput: 4x (each stream pipelined independently)

Configuration:

[replication.parallelism]
num_streams = 4 # 4 parallel TCP/RDMA connections per primary-mirror pair
stream_assignment = "hash" # Hash shard key to assign stream

Performance Impact:

With 4 parallel streams:
- Each stream: 104K writes/sec (RDMA)
- Total: 4 × 104K = 416K writes/sec per node
Improvement: 4x vs single stream

Caveat: Within a single shard, write order must be preserved (use same stream for same shard).
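A sketch of the `stream_assignment = "hash"` policy, using the standard library hasher (the function name is ours): hashing the shard key means every write for a given shard deterministically lands on the same stream, which preserves per-shard ordering while spreading shards across streams.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pick a replication stream for a shard by hashing its key. All writes for
/// one shard map to one stream, so within-shard order is preserved.
pub fn stream_for_shard(shard_key: &str, num_streams: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    shard_key.hash(&mut hasher);
    hasher.finish() % num_streams
}
```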

6. Cost-Benefit Analysis

6.1 Performance vs Durability Trade-off Matrix

| Replication Mode | Write Latency | Throughput | RPO | RTO | Use Case |
|---|---|---|---|---|---|
| None | 450 μs | 222K/sec | ∞ (all data lost on failure) | N/A | Development only |
| Async | 450 μs | 222K/sec | 10-100 ms | 1-3 sec | Logs, caches, non-critical |
| Sync (10G Eth) | 1,200 μs | 83K/sec | 0 | 1-3 sec | Moderate criticality |
| Sync (RDMA) | 960 μs | 104K/sec | 0 | 1-3 sec | Production default |
| Sync (RDMA) + Batching | 600 μs (avg) | 167K/sec | 0 | 1-3 sec | High-throughput OLTP |

6.2 Financial Impact of Data Loss (RPO > 0)

Scenario: E-commerce database with async replication

Assumptions:
- Average transaction value: $100
- Write rate: 1,000 transactions/sec
- Replication lag: 50 ms (typical async)
- Primary failure probability: 0.1% per year (one failure per 1,000 node-years)
Data loss on failure:
- Transactions in lag window: 1,000 writes/sec × 0.05 sec = 50 transactions
- Value at risk: 50 × $100 = $5,000 per failure
Expected annual loss:
- Failure rate: 0.001/year
- Expected loss: $5,000 × 0.001 = $5/year
Seems low, but:
- Reputational damage: Unquantified but significant
- Regulatory compliance: Financial transactions require RPO=0 (PCI-DSS, SOX)
- Legal liability: Class-action lawsuits for data loss
Conclusion: For financial data, synchronous replication is mandatory regardless of cost
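The expected-loss arithmetic above is a straightforward product; a sketch (function name is ours):

```rust
/// Expected annual loss from the async data-loss window:
/// (writes lost per failure) * (value per write) * (annual failure probability).
pub fn expected_annual_loss(
    write_rate_wps: f64,
    lag_secs: f64,
    avg_txn_value: f64,
    annual_failure_prob: f64,
) -> f64 {
    write_rate_wps * lag_secs * avg_txn_value * annual_failure_prob
}
```

Plugging in the assumptions (1,000 writes/sec, 50 ms lag, $100/transaction, 0.1%/year) reproduces the $5/year expected loss; as the text notes, the unquantified compliance and reputational exposure dominates that figure.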

6.3 Infrastructure Cost Comparison

Scenario: 30-node HeliosDB cluster

Async Replication:

Throughput target: 3 million writes/sec
Nodes needed (primary + mirror):
- Per-node capacity: 222K writes/sec
- Primary nodes: 3M ÷ 222K = 14 nodes
- Mirror nodes: 14 nodes (same capacity)
- Total: 28 nodes
Network: 10 Gbps Ethernet
- Cost per node: $800
- Total network cost: 28 × $800 = $22,400
Node cost (compute + storage):
- Cost per node: $5,000
- Total: 28 × $5,000 = $140,000
Total infrastructure: $162,400

Sync Replication (RDMA):

Throughput target: 3 million writes/sec
Nodes needed:
- Per-node capacity: 104K writes/sec (RDMA sync)
- Primary nodes: 3M ÷ 104K = 29 nodes
- Mirror nodes: 29 nodes
- Total: 58 nodes
Network: 100 Gbps RDMA
- Cost per node: $3,200
- Total network cost: 58 × $3,200 = $185,600
Node cost:
- Cost per node: $5,000
- Total: 58 × $5,000 = $290,000
Total infrastructure: $475,600
Additional cost: $313,200 (+193%, i.e. 2.9x the async total)
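Both totals follow the same cost model: every primary needs a mirror, and each node carries its network gear plus compute/storage. A sketch (function name is ours):

```rust
/// Total infrastructure cost for a mirrored cluster. Node count is doubled
/// because each primary has one synchronous or asynchronous mirror.
pub fn cluster_cost_usd(primary_nodes: u64, network_per_node: u64, node_cost: u64) -> u64 {
    let total_nodes = primary_nodes * 2; // primary + mirror
    total_nodes * (network_per_node + node_cost)
}
```

This reproduces the figures above: 14 primaries on $800 Ethernet gives $162,400; 29 primaries on $3,200 RDMA gives $475,600.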

BUT: Durability Benefit

Async: RPO = 50 ms
- Data loss on failure: 50,000 transactions
- Financial risk: Potentially $millions in transaction value
Sync: RPO = 0
- Data loss on failure: 0 transactions
- Financial risk: $0
For mission-critical systems, the 2.9x infrastructure cost is justified

Optimization: Hybrid Strategy

Configure 80% of tables with async (non-critical logs, caches):
- Nodes needed: 14 primary + 14 mirror = 28 nodes (async)
Configure 20% of tables with sync (financial transactions):
- Nodes needed: 6 primary + 6 mirror = 12 nodes (sync, RDMA)
Total nodes: 40 nodes
Network cost: 28 × $800 + 12 × $3,200 = $60,800
Node cost: 40 × $5,000 = $200,000
Total: $260,800
Savings: $214,800 (45% cheaper than full sync)
Durability: RPO=0 for critical data, RPO>0 for non-critical
**Best of both worlds**

7. Monitoring and Alerting

7.1 Key Replication Metrics

Latency Metrics:

replication.primary_write_latency_us: Time for local write on primary
replication.mirror_ack_latency_us: Time waiting for mirror ACK
replication.total_write_latency_us: End-to-end write latency
replication.network_rtt_us: Measured network RTT to mirror
Alert thresholds:
- total_write_latency_us > 2000: Warning (>2ms)
- total_write_latency_us > 5000: Critical (>5ms, likely network issue)

Throughput Metrics:

replication.writes_per_sec: Actual write throughput
replication.bytes_replicated_per_sec: Data replication bandwidth
replication.pending_writes: Queue depth on primary
Alert thresholds:
- pending_writes > 10,000: Replication is lagging
- writes_per_sec < 50,000: Underutilized (cost optimization opportunity)

Failure Metrics:

replication.mirror_failures_total: Count of failed replications
replication.mirror_lag_seconds: How far mirror is behind (async mode)
replication.failovers_total: Count of primary → mirror promotions
Alert thresholds:
- mirror_failures_total increasing: Network or mirror node issues
- mirror_lag_seconds > 1.0: Async lag too high (risk of data loss)

7.2 Automated Remediation

Auto-Fallback to Async:

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::time::Duration;
use tokio::time::timeout;

pub struct ReplicationManager {
    fallback_to_async: AtomicBool, // flipped once sync replication keeps failing
    timeout_ms: u64,
    failure_count: AtomicU64,
}

impl ReplicationManager {
    pub async fn replicate_write(&self, write: WriteOp) -> Result<()> {
        if self.fallback_to_async.load(Ordering::Relaxed) {
            return self.async_replicate(write).await;
        }
        match timeout(Duration::from_millis(self.timeout_ms), self.sync_replicate(&write)).await {
            Ok(Ok(())) => {
                // Success, reset failure counter
                self.failure_count.store(0, Ordering::Relaxed);
                Ok(())
            }
            Ok(Err(_)) | Err(_) => {
                // Sync replication failed or timed out
                let failures = self.failure_count.fetch_add(1, Ordering::Relaxed) + 1;
                if failures > 10 {
                    // Too many consecutive failures: fall back to async.
                    // (Atomic flag, since `self` is shared across writers.)
                    warn!("Sync replication failing, falling back to async");
                    self.fallback_to_async.store(true, Ordering::Relaxed);
                }
                // Write is already durable locally; replicate asynchronously
                self.async_replicate(write).await
            }
        }
    }
}

Benefit: Graceful degradation under network issues (availability > durability temporarily)

8. Conclusion

Key Findings:

  1. Latency Impact:

    • RDMA: +510 μs (113% increase, but <1ms total)
    • 10G Ethernet: +700-1,000 μs (156-222% increase, ~1.5ms total)
    • RDMA critical for sub-millisecond write latency
  2. Throughput Impact:

    • Reduction: 50-60% vs async (104K vs 222K writes/sec per node)
    • Mitigation: Batching recovers roughly half the loss (104K → 167K writes/sec)
    • 2x more nodes needed for same throughput
  3. Durability Benefit:

    • RPO: 0 (zero data loss on primary failure)
    • RTO: 1-3 seconds (automatic failover with witness quorum)
    • Essential for financial transactions, user data, compliance
  4. Cost-Benefit:

    • Infrastructure: ~2.9x total cost (more nodes, plus RDMA networking)
    • Data loss risk: Eliminated ($0 vs potentially $millions)
    • Justified for mission-critical production workloads
  5. Optimization Strategies:

    • Batching: recovers roughly half the lost throughput (167K writes/sec) with <1ms added latency
    • Parallel streams: 4x throughput with 4 streams
    • Hybrid mode: 45% cost savings (async for non-critical, sync for critical)

Recommended Configuration:

[replication]
default_mode = "synchronous" # RPO=0 by default
network = "rdma" # <1ms write latency
witness_quorum = true # Split-brain protection
[replication.batching]
enabled = true
max_batch_size = 50
max_wait_us = 300 # 300μs batching window
[replication.parallelism]
num_streams = 4 # 4x throughput
# Per-table overrides
[replication.table_overrides]
"access_logs" = "asynchronous" # Non-critical
"user_sessions" = "asynchronous" # Non-critical
"transactions" = "synchronous" # Critical
"account_balances" = "synchronous" # Critical

Decision Matrix:

| Workload Type | Recommended Mode | Network | Expected Write Latency | Expected Throughput |
|---|---|---|---|---|
| Financial transactions | Sync | RDMA | <1ms | 100K/sec per node |
| E-commerce orders | Sync | RDMA | <1ms | 100K/sec per node |
| User profile updates | Sync | 10G Eth | 1-2ms | 80K/sec per node |
| Analytics/logs | Async | 10G Eth | <0.5ms | 200K/sec per node |
| Development/testing | Async | 10G Eth | <0.5ms | 200K/sec per node |

Next Steps:

  1. Implement per-table replication mode configuration
  2. Add batching support for high-throughput workloads
  3. Develop automatic fallback mechanism (sync → async on network issues)
  4. Create monitoring dashboard for replication health metrics
  5. Benchmark failover times under various failure scenarios