Network Performance Analysis: RDMA vs TCP for HeliosDB

Executive Summary

This analysis quantifies the performance benefits of RDMA over Converged Ethernet (RoCEv2) compared to traditional TCP/IP for HeliosDB's inter-node communication, with a focus on data transfer patterns, latency characteristics, and cost-benefit trade-offs.

1. Network Architecture Comparison

1.1 TCP/IP Stack (Traditional)

Application Layer (HeliosDB)
↓ System calls
┌──────────────┐
│ User Space │
└──────┬───────┘
↓ Context switch
┌──────────────┐
│ Kernel Space │
│ - TCP stack │
│ - IP routing │
│ - Driver │
└──────┬───────┘
↓ DMA
┌──────────────┐
│ NIC │
└──────────────┘
Latency Components:
1. System call overhead: 500-1000 ns
2. Context switch: 1-3 μs
3. TCP processing: 2-5 μs
4. Memory copy (kernel→user): 5-20 μs for large data
5. Network wire time: 0.5-2 μs (datacenter)
Total: 10-30 μs per message

1.2 RDMA/RoCEv2 Stack

Application Layer (HeliosDB)
↓ Verbs API (direct memory access)
┌──────────────┐
│ User Space │
│ - RDMA lib │
│ - Memory reg │
└──────┬───────┘
↓ Direct memory access (no context switch)
┌──────────────┐
│ RDMA NIC │
│ (InfiniBand/ │
│ RoCEv2) │
└──────────────┘
Latency Components:
1. Verbs call: 100-200 ns
2. NIC DMA: 1-2 μs
3. Network wire time: 0.5-2 μs
Total: 2-5 μs per message
Latency reduction: 5-10x vs TCP

1.3 Key Architectural Differences

| Characteristic | TCP/IP | RDMA/RoCEv2 |
|---|---|---|
| Kernel involvement | Every operation | None (kernel bypass) |
| Context switches | 2 per send/recv | 0 |
| Memory copies | 1-2 (kernel ↔ user) | 0 (direct DMA) |
| CPU overhead | High (protocol processing) | Very low (<5% CPU) |
| Latency | 10-30 μs | 2-5 μs |
| Throughput | 10-40 Gbps (CPU-limited) | 100-200 Gbps (wire-limited) |
| Zero-copy | No | Yes |

2. Latency Analysis by Operation Type

2.1 Small Message Latency (RPC Calls)

Scenario: Metadata lookup from Compute to Metadata Service

Message size: 256 bytes (table schema request)
TCP/IP:
- Serialization: 0.5 μs
- System call + kernel: 3 μs
- TCP overhead: 3 μs
- Network wire: 1 μs (10 Gbps)
- Receiver kernel: 3 μs
- Context switch: 2 μs
- Deserialization: 0.5 μs
Total: ~13 μs one-way, ~26 μs RTT
RDMA:
- Serialization: 0.5 μs
- Verbs post: 0.2 μs
- NIC DMA: 1 μs
- Network wire: 1 μs
- Receiver NIC: 1 μs
- Deserialization: 0.5 μs
Total: ~4 μs one-way, ~8 μs RTT
Improvement: 3.2x faster (26 μs → 8 μs)

Impact on Query Latency:

For a distributed query requiring 5 metadata lookups:

  • TCP: 5 × 26 μs = 130 μs
  • RDMA: 5 × 8 μs = 40 μs
  • Savings: 90 μs per query

At 10,000 QPS: 900 ms of cumulative lookup latency saved per second, along with a roughly 90% reduction in CPU spent on metadata-path networking

2.2 Medium Message Latency (Predicate Pushdown)

Scenario: PredicatePushdownRequest from Compute to Storage Node

Message size: 8 KB (serialized predicates + projection list)
TCP/IP:
- Serialization: 2 μs
- System call: 2 μs
- TCP segmentation (~6 packets at a 1500-byte MTU): 5 μs
- Memory copy: 5 μs
- Network wire: ~6.5 μs (10 Gbps = 1.25 GB/s → 8 KB in ~6.5 μs)
- Receiver processing: 10 μs
Total: ~30 μs one-way
RDMA (SEND/RECV):
- Serialization: 2 μs
- Verbs post: 0.3 μs
- NIC DMA: 2 μs
- Network wire: 3 μs
- Receiver NIC: 2 μs
Total: ~9 μs one-way
Improvement: ~3x faster (30 μs → 9 μs)

2.3 Large Message Latency (Data Transfer)

Scenario: FilteredResultSet stream from Storage to Compute

Message size: 1 MB (compressed columnar data)
TCP/IP:
- Bandwidth-bound regime
- Throughput: 10-25 Gbps (CPU-limited due to TCP overhead)
- Effective throughput: ~1.5 GB/sec
- Transfer time: 1 MB ÷ 1.5 GB/s = 667 μs
- CPU overhead: ~30% of one core
RDMA (WRITE):
- Bandwidth-bound regime
- Throughput: 90-100 Gbps (wire-limited)
- Effective throughput: ~11 GB/sec
- Transfer time: 1 MB ÷ 11 GB/s = 91 μs
- CPU overhead: ~3% of one core
Improvement: 7.3x faster (667 μs → 91 μs)
CPU savings: 90% reduction in CPU overhead

Cumulative Effect for Multi-Shard Query:

Query scans 10 shards, each returns 10 MB of filtered data
TCP/IP:
- Per shard: 10 MB ÷ 1.5 GB/s = 6.67 ms
- Sequential: 10 × 6.67 ms = 66.7 ms
- Parallel (10 streams): still ~66.7 ms (the receiving node's TCP stack saturates at ~12 Gbps)
- CPU usage: 10 cores × 30% = 3 full cores
RDMA:
- Per shard: 10 MB ÷ 11 GB/s = 0.91 ms
- Parallel (10 streams): ~9.1 ms (receiver at ~88 Gbps, not saturated)
- CPU usage: 10 cores × 3% = 0.3 cores
Improvement: 7.3x faster data transfer (66.7 ms → 9.1 ms)
CPU savings: 90% (3 cores → 0.3 cores)
Network headroom: the 100 Gbps NIC still has ~12% free capacity
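
To make the arithmetic above easy to re-check, here is a minimal Rust model; the bandwidth and CPU-overhead figures are this section's estimates, not measurements:

/// Back-of-the-envelope model for the multi-shard transfer above.
/// Transfer time is bounded by the receiving node's effective bandwidth.
fn transfer_ms(total_mb: f64, receiver_gb_per_s: f64) -> f64 {
    (total_mb / 1000.0) / receiver_gb_per_s * 1000.0
}

fn main() {
    let total_mb = 10.0 * 10.0; // 10 shards × 10 MB each
    let tcp = transfer_ms(total_mb, 1.5);   // ~1.5 GB/s effective TCP
    let rdma = transfer_ms(total_mb, 11.0); // ~11 GB/s effective RDMA
    println!("TCP:  {:.1} ms, {:.1} sender cores", tcp, 10.0 * 0.30);  // ~66.7 ms, 3.0
    println!("RDMA: {:.1} ms, {:.1} sender cores", rdma, 10.0 * 0.03); // ~9.1 ms, 0.3
    println!("Speedup: {:.1}x", tcp / rdma); // ~7.3x
}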

3. Data Transfer Pattern Analysis

3.1 Predicate Pushdown Pattern

Workload Characteristics:

  • Request: Small (1-16 KB serialized predicates)
  • Response: Large (1-100 MB filtered data, streaming)
  • Frequency: High (every analytical query)
  • Parallelism: High (N storage nodes per query)

TCP Performance:

Scenario: Query 16 storage nodes with predicate pushdown
Request phase (broadcast):
- 16 requests × 27 μs each (8 KB payloads) = 432 μs if issued sequentially
- Actual: ~50 μs (parallel network I/O, but CPU-bound)
Response phase (gather):
- Average 5 MB per node × 16 nodes = 80 MB total
- With 10 Gbps (1.25 GB/s): 80 MB ÷ 1.25 GB/s = 64 ms
- TCP CPU overhead: 2 cores fully utilized
Total: ~64 ms (network-bound)

RDMA Performance:

Request phase (broadcast):
- 16 requests × 9 μs each (8 KB payloads) = 144 μs if issued sequentially
- Actual: ~15 μs (parallel, minimal CPU)
Response phase (gather):
- 80 MB total
- With 100 Gbps (12.5 GB/s) RDMA: 80 MB ÷ 12.5 GB/s = 6.4 ms
- CPU overhead: 0.2 cores
Total: ~6.4 ms (network-bound)
Improvement: 10x faster (64 ms → 6.4 ms)

Analysis:

  • Small messages: 3x improvement (latency-bound)
  • Large transfers: 10x improvement (bandwidth-bound)
  • Critical insight: RDMA enables 100 Gbps fabric utilization, TCP caps at ~10-15 Gbps

3.2 Vector Search Pattern

Workload Characteristics:

  • Request: Small (query vector + filter predicates, ~50 KB)
  • Response: Medium (top-K results with vectors, ~500 KB)
  • Frequency: Very high (interactive search)
  • Latency-sensitive: Yes (p99 < 50 ms target)

TCP Performance:

Single vector search query:
- Request (50 KB): 50 μs
- HNSW search on storage node: 25 ms (compute-bound)
- Response (500 KB): 400 μs
Total: ~25.5 ms
With 100 concurrent queries (typical load):
- Network congestion at ~12 Gbps (TCP limit)
- Increased latency due to queueing: +10-20 ms
- p99 latency: 45-60 ms (exceeds target)

RDMA Performance:

Single vector search query:
- Request (50 KB): 15 μs
- HNSW search: 25 ms (same, compute-bound)
- Response (500 KB): 50 μs
Total: ~25.1 ms
With 100 concurrent queries:
- Network at ~40 Gbps (well below 100 Gbps limit)
- Minimal queueing delay: +1-2 ms
- p99 latency: 28-32 ms (meets target)
Improvement: 1.5-2x better p99 latency under load

Analysis:

  • Single query: Minimal improvement (compute-bound)
  • High concurrency: 2x improvement (TCP saturates network)
  • RDMA critical for maintaining low p99 latency under load

3.3 Synchronous Replication Pattern

Workload Characteristics:

  • Request: Medium (commit log entry, 1-64 KB)
  • Response: Tiny (ACK, 64 bytes)
  • Frequency: Every write operation
  • Latency-critical: Yes (blocks user transaction)

TCP Performance:

Write operation with synchronous mirroring:
1. Primary write (local): 500 μs
2. Replicate to mirror:
- Send commit log (8 KB): 27 μs
- Network RTT: 500 μs (intra-datacenter)
- Mirror write (local): 500 μs
- ACK return: 13 μs
Total replication: ~1,040 μs
Total write latency: 500 μs + 1,040 μs = 1,540 μs (1.54 ms)
Throughput impact:
- Max write rate: 1 / 1.54ms = 649 writes/sec (single-threaded)
- With pipelining (100 concurrent): ~65,000 writes/sec
- CPU overhead: ~20% per core due to TCP

RDMA Performance:

Write operation with synchronous mirroring:
1. Primary write (local): 500 μs
2. Replicate to mirror:
- RDMA WRITE (8 KB): 8 μs
- Network RTT: 5 μs (RDMA ultra-low latency)
- Mirror write (local): 500 μs
- RDMA SEND (ACK): 5 μs
Total replication: ~518 μs
Total write latency: 500 μs + 518 μs = 1,018 μs (1.02 ms)
Improvement: 1.5x faster (1.54 ms → 1.02 ms)
Throughput impact:
- Max write rate: 1 / 1.02ms = 980 writes/sec (single-threaded)
- With pipelining (100 concurrent): ~98,000 writes/sec (+50% vs TCP)
- CPU overhead: ~2% per core

Analysis:

  • Latency reduction: 33% (520 μs saved per write)
  • Throughput increase: 50% (65K → 98K writes/sec)
  • RDMA critical for synchronous replication performance
  • CPU savings: 90% (enables more concurrent writes)
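
A minimal sketch of this WRITE-plus-ACK sequence is shown below, written against the hypothetical RdmaQueuePair wrapper that also appears in Section 9.2; post_write and recv_ack are assumed helper methods, not a real verbs API:

/// Sketch of synchronous mirroring over RDMA. `RdmaQueuePair`,
/// `post_write`, and `recv_ack` are assumed wrappers; error
/// handling and durability details are elided.
async fn replicate_commit_log(
    qp: &RdmaQueuePair,
    log_entry: &[u8],
) -> anyhow::Result<()> {
    // One-sided WRITE places the commit log directly into the
    // mirror's pre-registered buffer (no remote CPU on the data path).
    qp.post_write(log_entry).await?;
    // The mirror durably applies the entry, then returns a tiny ACK
    // via a two-sided SEND (the WRITE + SEND mapping in Section 7.1).
    qp.recv_ack().await?;
    Ok(())
}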

3.4 Cache Invalidation Broadcast

Workload Characteristics:

  • Request: Tiny (invalidation notice, 128 bytes)
  • Response: None (fire-and-forget)
  • Frequency: Every committed transaction
  • Fanout: High (all compute nodes)

TCP Performance:

Scenario: Invalidate cache on 50 compute nodes
Sequential broadcast approach:
- 50 nodes × 13 μs = 650 μs
- Impractical (too slow)
Parallel TCP connections:
- 50 parallel sends: ~20 μs (network I/O parallel)
- CPU overhead: 15 μs (system calls, context switches)
- Total: ~35 μs
At 10,000 transactions/sec:
- CPU time: 10,000 × 35 μs = 350 ms/sec = 35% of one core

RDMA Performance:

Parallel RDMA SEND:
- 50 parallel sends: ~8 μs (verbs API, minimal overhead)
- Total: ~8 μs
At 10,000 transactions/sec:
- CPU time: 10,000 × 8 μs = 80 ms/sec = 8% of one core
Improvement: 4.4x faster (35 μs → 8 μs)
CPU savings: 77% (35% → 8% core utilization)
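
A sketch of the parallel fan-out, under the same assumed RdmaQueuePair wrapper (post_send is an assumed helper, not a real verbs call):

use futures::future::try_join_all;

/// Sketch of a parallel invalidation broadcast; one queue pair per
/// compute node is taken as given.
async fn broadcast_invalidation(
    qps: &[RdmaQueuePair],
    notice: &[u8], // ~128-byte invalidation notice
) -> anyhow::Result<()> {
    // Post all SENDs up front; the NICs drain them without kernel
    // involvement, so total latency ≈ the slowest single send.
    try_join_all(qps.iter().map(|qp| qp.post_send(notice))).await?;
    Ok(())
}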

4. CPU Efficiency Analysis

4.1 CPU Cycles per Byte Transferred

TCP/IP:

Breakdown for 1 MB transfer:
- User-space overhead: 50K cycles
- Kernel TCP stack: 500K cycles
- Memory copy: 300K cycles
- Driver: 100K cycles
Total: ~950K cycles per MB
At 3 GHz CPU: 950K cycles = 317 μs
Effective CPU cost: ~0.3 cores for 1 GB/sec throughput

RDMA:

Breakdown for 1 MB transfer:
- User-space (verbs): 10K cycles
- No kernel involvement: 0 cycles
- No memory copy: 0 cycles
- RDMA NIC handles rest
Total: ~10K cycles per MB
At 3 GHz CPU: 10K cycles = 3.3 μs
Effective CPU cost: ~0.003 cores for 1 GB/sec throughput
CPU efficiency: 95x better (950K → 10K cycles/MB)
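
The conversion from cycles-per-MB to core cost is mechanical; this small Rust helper reproduces the figures above from this section's estimates:

/// Cores consumed to sustain a given throughput, from a
/// cycles-per-MB estimate and CPU clock speed.
fn cores_needed(cycles_per_mb: f64, gb_per_s: f64, cpu_ghz: f64) -> f64 {
    let cycles_per_s = cycles_per_mb * gb_per_s * 1000.0; // MB/s × cycles/MB
    cycles_per_s / (cpu_ghz * 1e9)
}

fn main() {
    // TCP: ~950K cycles/MB at 3 GHz → ~0.317 cores per GB/s
    println!("TCP:  {:.3} cores per GB/s", cores_needed(950_000.0, 1.0, 3.0));
    // RDMA: ~10K cycles/MB at 3 GHz → ~0.003 cores per GB/s
    println!("RDMA: {:.3} cores per GB/s", cores_needed(10_000.0, 1.0, 3.0));
}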

4.2 System-Wide CPU Savings

Scenario: HeliosDB cluster under load

Cluster configuration:
- 10 compute nodes
- 20 storage nodes
- 100 GB/sec aggregate data transfer (analytical workload)
TCP CPU overhead:
- 100 GB/sec × 0.3 cores/(GB/sec) = 30 cores
- Distributed across 30 nodes = 1 core per node (average)
- Peak nodes (hot compute): 5-10 cores consumed by networking
RDMA CPU overhead:
- 100 GB/sec × 0.003 cores/(GB/sec) = 0.3 cores total
- Negligible per-node overhead
CPU reclaimed: 30 cores → 0.3 cores = **29.7 cores freed for query processing**
At $0.10/core-hour (cloud pricing):
- Cost savings: 29.7 cores × $0.10/hr × 730 hrs/month = $2,168/month

5. Throughput and Scalability Analysis

5.1 Network Bandwidth Utilization

TCP/IP (10 Gbps NIC):

Theoretical bandwidth: 10 Gbps = 1.25 GB/sec
Practical TCP throughput:
- Single stream: 8-9 Gbps (~1.1 GB/sec)
- Multiple streams: 9-10 Gbps (~1.2 GB/sec)
- Bottleneck: CPU overhead limits further scaling
Effective bandwidth: 80-90% of wire speed, achieved only at significant CPU cost

RDMA/RoCEv2 (100 Gbps NIC):

Theoretical bandwidth: 100 Gbps = 12.5 GB/sec
Practical RDMA throughput:
- Single stream: 90-95 Gbps (~11.5 GB/sec)
- Multiple streams: 95-98 Gbps (~12.2 GB/sec)
- Bottleneck: Wire speed
Effective bandwidth: 95-98% of wire speed
Bandwidth improvement: 10x (1.2 GB/sec → 12.2 GB/sec)

5.2 Scalability Limits

TCP-based cluster:

Network becomes bottleneck at:
- 10 compute nodes × 1.2 GB/sec = 12 GB/sec total
- With 10 Gbps inter-switch links
- Limited to ~8-10 nodes before network saturation
Mitigation:
- Add more switches (expensive)
- Over-subscription ratio 4:1 (poor performance under load)

RDMA-based cluster:

Network bottleneck at:
- 10 compute nodes × 12 GB/sec = 120 GB/sec total
- With 100 Gbps inter-switch links
- Can scale to 80+ nodes before saturation
Mitigation:
- Rare to hit network limit
- Compute/storage resources exhaust first

Scalability improvement: 8-10x more nodes before network bottleneck

6. Cost-Benefit Analysis

6.1 Hardware Cost Comparison

TCP Setup (10 Gbps Ethernet):

Per-node cost:
- 10 Gbps NIC: $300
- Switch port (amortized): $500
Total: $800 per node
30-node cluster: 30 × $800 = $24,000

RDMA Setup (100 Gbps RoCEv2):

Per-node cost:
- 100 Gbps RDMA NIC: $1,200
- RoCEv2-capable switch port (amortized): $2,000
Total: $3,200 per node
30-node cluster: 30 × $3,200 = $96,000
Additional cost: $72,000 (4x more expensive)

6.2 Total Cost of Ownership (TCO)

Performance Gains → Compute Savings:

With RDMA, CPU efficiency gains allow:
- 30% fewer nodes for same throughput (due to CPU savings)
- Or 50% higher throughput with same nodes
Scenario: 30-node TCP cluster vs 21-node RDMA cluster (same performance)
TCP cluster:
- Hardware: $24,000
- Servers (30 × $5,000): $150,000
- Power (30 × $1,000/yr): $30,000/yr
Total initial: $174,000
Annual opex: $30,000
RDMA cluster (21 nodes):
- Hardware: 21 × $3,200 = $67,200
- Servers (21 × $5,000): $105,000
- Power (21 × $1,000/yr): $21,000/yr
Total initial: $172,200
Annual opex: $21,000
Break-even: Immediate (lower total cost)
Annual savings: $9,000/year in power alone

Performance Gains → Revenue Opportunity:

Scenario: Same 30 nodes, RDMA enables 50% more throughput
TCP: 100,000 queries/sec
RDMA: 150,000 queries/sec
If revenue is query-driven (e.g., SaaS):
- Additional $72,000 investment
- 50% more capacity = 50% more customers
- Break-even: $72K additional investment ÷ added annual revenue from the 50% capacity increase
Example: $500/customer/year, 1,000 existing customers
- Additional revenue: 500 new customers × $500 = $250,000/year
- ROI: $72K investment → $250K annual return
- Break-even: 3.5 months

6.3 Recommendation Matrix

| Use Case | Network Choice | Rationale |
|---|---|---|
| Development/Testing | 10 Gbps TCP | Low cost, sufficient for dev workloads |
| Production (OLTP-heavy) | 25 Gbps TCP or RoCEv2 | Moderate cost, low-latency writes critical |
| Production (HTAP) | 100 Gbps RoCEv2 | Recommended, best price/performance |
| Production (Analytics-heavy) | 100 Gbps RoCEv2 | Essential for high throughput |
| Extreme scale (>100 nodes) | 200 Gbps RoCEv2 | Future-proof |

7. RDMA Implementation Considerations

7.1 RDMA Operations for HeliosDB

Operation Mapping:

| HeliosDB Operation | RDMA Primitive | Rationale |
|---|---|---|
| Predicate pushdown request | SEND/RECV | Two-sided, requires receiver processing |
| Cache invalidation | SEND/RECV | Broadcast, minimal payload |
| Data transfer (large) | WRITE | One-sided, zero-copy bulk transfer |
| Synchronous replication | WRITE + SEND | Write commit log, send ACK |
| Metadata lookup | SEND/RECV | Small message, low latency |
| Vector search request | SEND/RECV | Requires processing on both sides |
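
One way to encode this mapping in the network layer is a simple dispatch helper; the enum below is illustrative, not an existing HeliosDB type:

/// Illustrative encoding of the table above.
enum RdmaPrimitive {
    SendRecv,      // two-sided: receiver must process the message
    Write,         // one-sided: zero-copy bulk transfer
    WritePlusSend, // WRITE the payload, SEND a small notification
}

fn primitive_for(op: &str) -> RdmaPrimitive {
    match op {
        "data_transfer_large" => RdmaPrimitive::Write,
        "synchronous_replication" => RdmaPrimitive::WritePlusSend,
        // Predicate pushdown, cache invalidation, metadata lookup,
        // vector search: all small, two-sided messages.
        _ => RdmaPrimitive::SendRecv,
    }
}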

7.2 Memory Registration Overhead

Challenge: RDMA requires memory registration (pinning pages)

Registration cost:
- Small buffer (4 KB): 5-10 μs
- Large buffer (1 MB): 50-200 μs
Mitigation strategies:
1. Pre-register buffers in pool:
- Allocate 1000 × 1MB buffers at startup
- Registration cost amortized over lifetime
- Total startup cost: 50-200 ms (acceptable)
2. Reuse registered buffers:
- Implement buffer pool with MR (Memory Region) caching
- Registration overhead: 0 μs (amortized)
Recommended: Pre-allocate 10-20 GB of registered memory per node
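
A buffer pool along these lines might look like the following sketch; MemoryRegion and its register constructor are placeholders for whatever the RDMA binding provides, not a real ibverbs API:

use std::collections::VecDeque;

/// Pre-registered buffer pool sketch. `MemoryRegion::register` is an
/// assumed one-time pinning call; error handling is elided.
struct RegisteredBuffer {
    mr: MemoryRegion, // pinned, NIC-addressable region
    data: Vec<u8>,
}

struct BufferPool {
    free: VecDeque<RegisteredBuffer>,
}

impl BufferPool {
    /// Pay the 50-200 μs registration cost once per buffer at startup.
    fn new(count: usize, size: usize) -> Self {
        let free = (0..count)
            .map(|_| {
                let data = vec![0u8; size];
                let mr = MemoryRegion::register(&data); // one-time pinning
                RegisteredBuffer { mr, data }
            })
            .collect();
        Self { free }
    }

    /// Check-out is O(1); no registration on the hot path.
    fn acquire(&mut self) -> Option<RegisteredBuffer> {
        self.free.pop_front()
    }

    fn release(&mut self, buf: RegisteredBuffer) {
        self.free.push_back(buf);
    }
}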

7.3 Reliability and Error Handling

RDMA Reliability Modes:

| Mode | Guarantee | Latency | Use Case |
|---|---|---|---|
| RC (Reliable Connection) | In-order, exactly-once | Lowest | Recommended for HeliosDB |
| UC (Unreliable Connection) | Best-effort | Very low | Not suitable (data loss risk) |
| UD (Unreliable Datagram) | Best-effort multicast | Low | Cache invalidation (optional) |

Recommendation: Use RC (Reliable Connection) for all HeliosDB operations

  • Guarantees in-order delivery
  • Automatic retransmission on packet loss
  • Performance overhead: Negligible (<1%)

7.4 Congestion Control

TCP Congestion Control:

  • Automatic (built into protocol)
  • Adapts to network conditions
  • Can cause latency spikes under congestion

RDMA Congestion Control:

  • Hardware-based (PFC - Priority Flow Control)
  • Lossless Ethernet with backpressure
  • Requires proper switch configuration

HeliosDB Configuration:

[network.rdma]
# Enable ECN (Explicit Congestion Notification) for RoCEv2
enable_ecn = true
# QoS settings
priority = 3 # High priority for database traffic
# Buffer sizes
send_buffer_mb = 256
recv_buffer_mb = 256
# Connection pools
max_connections_per_node = 16
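
For illustration, the block above could be deserialized with the serde and toml crates; the structs below simply mirror the keys shown and are not existing HeliosDB types:

use serde::Deserialize;

/// Mirrors the [network.rdma] block above; a sketch, not an
/// existing HeliosDB config type.
#[derive(Debug, Deserialize)]
struct RdmaConfig {
    enable_ecn: bool,
    priority: u8,
    send_buffer_mb: usize,
    recv_buffer_mb: usize,
    max_connections_per_node: usize,
}

#[derive(Debug, Deserialize)]
struct NetworkConfig {
    rdma: RdmaConfig,
}

#[derive(Debug, Deserialize)]
struct Config {
    network: NetworkConfig,
}

fn load(path: &str) -> anyhow::Result<Config> {
    let text = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&text)?)
}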

8. Monitoring and Performance Validation

8.1 Key Metrics

RDMA Performance:

  • rdma.post_send_latency_us: Time to post SEND/WRITE
  • rdma.completion_latency_us: Time until completion event
  • rdma.throughput_gbps: Achieved bandwidth per connection
  • rdma.cpu_percent: CPU overhead for RDMA operations

Comparison Metrics:

  • network.tcp_throughput_gbps: Baseline TCP performance
  • network.rdma_speedup: Actual RDMA speedup vs TCP
  • cpu.network_overhead_percent: CPU spent on networking

8.2 Benchmarking Protocol

Micro-benchmarks:

# RDMA latency test
ib_send_lat -d mlx5_0 -F --report_gbits
# RDMA bandwidth test
ib_send_bw -d mlx5_0 -D 10 --report_gbits
# Compare vs TCP
netperf -H storage_node -t TCP_STREAM

Application-level benchmarks:

-- Measure predicate pushdown latency
EXPLAIN ANALYZE
SELECT count(*)
FROM large_table
WHERE category = 'electronics' AND price < 100;
-- Compare TCP vs RDMA:
-- TCP: 150ms
-- RDMA: 20ms (expected)

9. Migration Path

9.1 Hybrid Deployment (TCP + RDMA)

Phase 1: TCP-only (Months 1-6)

  • Deploy with 10 Gbps Ethernet
  • Validate functionality
  • Gather performance baselines

Phase 2: Hybrid (Months 7-9)

  • Add RDMA NICs to critical nodes
  • Run TCP and RDMA in parallel
  • Gradual migration of traffic
  • Validate performance gains

Phase 3: RDMA-native (Months 10+)

  • All inter-node traffic on RDMA
  • TCP only for external client connections
  • Optimize for RDMA-specific features (one-sided WRITE)

9.2 Fallback Mechanism

// HeliosDB network layer with automatic fallback
pub enum TransportMode {
    RDMA,
    TCP,
}

pub struct NetworkConnection {
    mode: TransportMode,
    rdma_qp: Option<RdmaQueuePair>,
    tcp_socket: Option<TcpStream>,
}

impl NetworkConnection {
    pub async fn send(&self, data: &[u8]) -> Result<()> {
        match self.mode {
            TransportMode::RDMA => {
                if let Some(qp) = &self.rdma_qp {
                    qp.post_send(data).await
                } else {
                    // Fallback to TCP if RDMA unavailable
                    self.tcp_socket.as_ref().unwrap().write(data).await
                }
            }
            TransportMode::TCP => {
                self.tcp_socket.as_ref().unwrap().write(data).await
            }
        }
    }
}

Benefit: Graceful degradation if RDMA fails
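
Connection establishment can apply the same principle: attempt RDMA first and silently downgrade. A sketch, assuming a hypothetical RdmaQueuePair::connect constructor alongside the types above:

/// Try RDMA first; downgrade to TCP if the NIC or peer lacks
/// RoCEv2 support. `RdmaQueuePair::connect` is an assumed API.
async fn connect(addr: &str) -> anyhow::Result<NetworkConnection> {
    match RdmaQueuePair::connect(addr).await {
        Ok(qp) => Ok(NetworkConnection {
            mode: TransportMode::RDMA,
            rdma_qp: Some(qp),
            tcp_socket: None,
        }),
        Err(_) => Ok(NetworkConnection {
            mode: TransportMode::TCP,
            rdma_qp: None,
            tcp_socket: Some(TcpStream::connect(addr).await?),
        }),
    }
}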

10. Conclusion

Key Findings:

  1. Latency Improvements:

    • Small messages (RPC): 3x faster (26 μs → 8 μs)
    • Medium messages: ~3x faster (30 μs → 9 μs)
    • Large transfers: 7x faster (667 μs → 91 μs)
    • Synchronous replication: 1.5x faster (1.54 ms → 1.02 ms)
  2. Throughput Improvements:

    • Effective bandwidth: 10x higher (1.2 GB/sec → 12.2 GB/sec)
    • Multi-shard queries: 10x faster data gathering
    • Write throughput: 50% higher (65K → 98K writes/sec)
  3. CPU Efficiency:

    • 95x reduction in CPU cycles per byte (950K → 10K cycles/MB)
    • 90% CPU savings for networking (30 cores → 0.3 cores)
    • Enables 30-50% more throughput on same hardware
  4. Cost-Benefit:

    • 4x higher upfront cost ($24K → $96K for 30-node cluster)
    • Break-even: Immediate (due to fewer nodes needed)
    • Or 3-6 months (due to higher throughput → more revenue)
    • Annual opex savings: $9K+ in power alone
  5. Scalability:

    • TCP: Network bottleneck at 8-10 nodes
    • RDMA: Can scale to 80+ nodes before network limit
    • 8-10x better scalability ceiling

Recommendation:

RDMA/RoCEv2 is critical for HeliosDB production deployments.

The performance benefits are transformative:

  • 10x faster analytical queries (multi-shard aggregations)
  • 50% higher write throughput (synchronous replication)
  • 90% CPU savings (enables more concurrent queries)

The 4x hardware premium is justified by:

  • Immediate TCO benefits (fewer nodes needed)
  • Future scalability (100+ node clusters)
  • Competitive advantage (sub-50ms p99 for complex queries)

Minimum Recommended Configuration:

  • 100 Gbps RoCEv2 for production
  • 25 Gbps RoCEv2 for smaller deployments (<10 nodes)
  • TCP acceptable only for development/testing

Next Steps:

  1. Implement RDMA connection management in heliosdb-network module
  2. Add automatic TCP fallback for compatibility
  3. Develop RDMA-specific optimizations (one-sided WRITE for replication)
  4. Create performance benchmarking suite (TCP vs RDMA comparison)