Network Performance Analysis: RDMA vs TCP for HeliosDB
Executive Summary
This analysis quantifies the performance benefits of RDMA over Converged Ethernet (RoCEv2) compared to traditional TCP/IP for HeliosDB's inter-node communication, with a focus on data transfer patterns, latency characteristics, and cost-benefit trade-offs.
1. Network Architecture Comparison
1.1 TCP/IP Stack (Traditional)
```
Application Layer (HeliosDB)
       ↓ System calls
┌──────────────┐
│  User Space  │
└──────┬───────┘
       ↓ Context switch
┌──────────────┐
│ Kernel Space │
│ - TCP stack  │
│ - IP routing │
│ - Driver     │
└──────┬───────┘
       ↓ DMA
┌──────────────┐
│     NIC      │
└──────────────┘
```
Latency Components:
1. System call overhead: 500-1000 ns
2. Context switch: 1-3 μs
3. TCP processing: 2-5 μs
4. Memory copy (kernel → user): 5-20 μs for large data
5. Network wire time: 0.5-2 μs (datacenter)

Total: 10-30 μs per message

1.2 RDMA/RoCEv2 Stack
```
Application Layer (HeliosDB)
       ↓ Verbs API (direct memory access)
┌──────────────┐
│  User Space  │
│ - RDMA lib   │
│ - Memory reg │
└──────┬───────┘
       ↓ Direct memory access (no context switch)
┌──────────────┐
│   RDMA NIC   │
│ (InfiniBand/ │
│   RoCEv2)    │
└──────────────┘
```
Latency Components:
1. Verbs call: 100-200 ns
2. NIC DMA: 1-2 μs
3. Network wire time: 0.5-2 μs

Total: 2-5 μs per message
Latency reduction: 5-10x vs TCP

1.3 Key Architectural Differences
| Characteristic | TCP/IP | RDMA/RoCEv2 |
|---|---|---|
| Kernel involvement | Every operation | None (kernel bypass) |
| Context switches | 2 per send/recv | 0 |
| Memory copies | 1-2 (kernel ↔ user) | 0 (direct DMA) |
| CPU overhead | High (protocol processing) | Very low (<5% CPU) |
| Latency | 10-30 μs | 2-5 μs |
| Throughput | 10-40 Gbps (CPU-limited) | 100-200 Gbps (wire-limited) |
| Zero-copy | No | Yes |
2. Latency Analysis by Operation Type
2.1 Small Message Latency (RPC Calls)
Scenario: Metadata lookup from Compute to Metadata Service
Message size: 256 bytes (table schema request)
TCP/IP:
- Serialization: 0.5 μs
- System call + kernel: 3 μs
- TCP overhead: 3 μs
- Network wire: 1 μs (10 Gbps)
- Receiver kernel: 3 μs
- Context switch: 2 μs
- Deserialization: 0.5 μs

Total: ~13 μs one-way, ~26 μs RTT
RDMA:
- Serialization: 0.5 μs
- Verbs post: 0.2 μs
- NIC DMA: 1 μs
- Network wire: 1 μs
- Receiver NIC: 1 μs
- Deserialization: 0.5 μs

Total: ~4 μs one-way, ~8 μs RTT
Improvement: 3.2x faster (26 μs → 8 μs)

Impact on Query Latency:
For a distributed query requiring 5 metadata lookups:
- TCP: 5 × 26 μs = 130 μs
- RDMA: 5 × 8 μs = 40 μs
- Savings: 90 μs per query
At 10,000 QPS this adds up to ~900 ms of cumulative lookup wait time eliminated per second of operation, and the kernel-bypass path removes most of the per-lookup networking CPU as well.
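The arithmetic behind that claim is worth keeping explicit; a minimal Rust sketch using the per-RTT figures from this section (the constants are the estimates above, not measurements):

```rust
/// Cumulative metadata-lookup wait time saved per second of wall-clock time,
/// using the round-trip estimates from Section 2.1.
fn metadata_savings_us_per_sec(lookups_per_query: u64, qps: u64) -> u64 {
    const TCP_RTT_US: u64 = 26; // estimated TCP round trip
    const RDMA_RTT_US: u64 = 8; // estimated RDMA round trip
    lookups_per_query * (TCP_RTT_US - RDMA_RTT_US) * qps
}

fn main() {
    // 5 lookups per query at 10,000 QPS -> 900,000 us = 900 ms saved per second.
    let saved_us = metadata_savings_us_per_sec(5, 10_000);
    println!("{} ms of lookup wait time saved per second", saved_us / 1_000);
}
```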
2.2 Medium Message Latency (Predicate Pushdown)
Scenario: PredicatePushdownRequest from Compute to Storage Node
Message size: 8 KB (serialized predicates + projection list)
TCP/IP:
- Serialization: 2 μs
- System call: 2 μs
- TCP segmentation (2 packets): 5 μs
- Memory copy: 5 μs
- Network wire: 3 μs
- Receiver processing: 10 μs

Total: ~27 μs one-way
RDMA (SEND/RECV):
- Serialization: 2 μs
- Verbs post: 0.3 μs
- NIC DMA: 2 μs
- Network wire: 3 μs
- Receiver NIC: 2 μs

Total: ~9 μs one-way
Improvement: 3x faster (27 μs → 9 μs)

2.3 Large Message Latency (Data Transfer)
Scenario: FilteredResultSet stream from Storage to Compute
Message size: 1 MB (compressed columnar data)
TCP/IP:
- Bandwidth-bound regime
- Throughput: 10-25 Gbps (CPU-limited due to TCP overhead)
- Effective throughput: ~1.5 GB/sec
- Transfer time: 1 MB ÷ 1.5 GB/s = 667 μs
- CPU overhead: ~30% of one core
RDMA (WRITE):
- Bandwidth-bound regime
- Throughput: 90-100 Gbps (wire-limited)
- Effective throughput: ~11 GB/sec
- Transfer time: 1 MB ÷ 11 GB/s = 91 μs
- CPU overhead: ~3% of one core
Improvement: 7.3x faster (667 μs → 91 μs)
CPU savings: 90% reduction in CPU overhead

Cumulative Effect for Multi-Shard Query:
Query scans 10 shards, each returns 10 MB of filtered data
TCP/IP:
- Per shard: 10 MB ÷ 1.5 GB/s = 6.67 ms
- Sequential: 10 × 6.67 ms = 66.7 ms
- Parallel (10 streams): still ~67 ms, because the receiving node's effective TCP throughput saturates at ~1.5 GB/s (~12 Gbps), so parallelism barely helps
- CPU usage: 10 cores × 30% = 3 full cores
RDMA:
- Per shard: 10 MB ÷ 11 GB/s = 0.91 ms
- Parallel (10 streams): ~9.1 ms for all 100 MB at ~11 GB/s (~88 Gbps on the receiving NIC, below the 100 Gbps limit)
- CPU usage: 10 cores × 3% = 0.3 cores
Improvement: 7.3x faster data transfer (66.7 ms → 9.1 ms)
CPU savings: 90% (3 cores → 0.3 cores)
Network headroom: the 100 Gbps NIC still has ~12% free capacity
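The multi-shard arithmetic reduces to dividing the gathered data volume by the receiving node's effective throughput, since the parallel streams share one receiver. A small sketch of that model, using the effective-throughput assumptions stated above (~1.5 GB/s TCP, ~11 GB/s RDMA), not measured values:

```rust
/// Time (ms) to gather `shards` result sets of `mb_per_shard` MB each into one
/// compute node whose effective receive throughput is `gb_per_sec` GB/s.
/// Parallel streams share the receiver's NIC/CPU budget, so the gather is
/// bounded by total bytes divided by effective throughput.
fn gather_time_ms(shards: u32, mb_per_shard: f64, gb_per_sec: f64) -> f64 {
    let total_mb = shards as f64 * mb_per_shard;
    total_mb / (gb_per_sec * 1000.0) * 1000.0
}

fn main() {
    let tcp = gather_time_ms(10, 10.0, 1.5);   // ~66.7 ms
    let rdma = gather_time_ms(10, 10.0, 11.0); // ~9.1 ms
    println!("TCP: {tcp:.1} ms, RDMA: {rdma:.1} ms, speedup: {:.1}x", tcp / rdma);
}
```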
3. Data Transfer Pattern Analysis

3.1 Predicate Pushdown Pattern
Workload Characteristics:
- Request: Small (1-16 KB serialized predicates)
- Response: Large (1-100 MB filtered data, streaming)
- Frequency: High (every analytical query)
- Parallelism: High (N storage nodes per query)
TCP Performance:
Scenario: Query 16 storage nodes with predicate pushdown
Request phase (broadcast):
- Worst case, fully sequential: 16 requests (8 KB each) × 27 μs = 432 μs
- Actual: ~50 μs (parallel network I/O, but CPU-bound on the sender)
Response phase (gather):
- Average 5 MB per node × 16 nodes = 80 MB total
- At 10 Gbps (1.25 GB/s): 80 MB ÷ 1.25 GB/s = 64 ms
- TCP CPU overhead: 2 cores fully utilized
Total: ~64 ms (network-bound)

RDMA Performance:
Request phase (broadcast):
- Worst case, fully sequential: 16 requests (8 KB each) × 9 μs = 144 μs
- Actual: ~15 μs (parallel, minimal CPU)
Response phase (gather):
- 80 MB total
- At 100 Gbps RDMA (12.5 GB/s): 80 MB ÷ 12.5 GB/s = 6.4 ms
- CPU overhead: 0.2 cores
Total: ~6.4 ms (network-bound)
Improvement: 10x faster (64 ms → 6.4 ms)

Analysis:
- Small messages: 3x improvement (latency-bound)
- Large transfers: 10x improvement (bandwidth-bound)
- Critical insight: RDMA enables 100 Gbps fabric utilization, TCP caps at ~10-15 Gbps
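At the application layer the fan-out/gather flow described in 3.1 is the same for both transports; only the transport implementation changes. A minimal tokio sketch, assuming a hypothetical `Transport` trait plus the `async-trait` and `anyhow` crates rather than HeliosDB's actual network API:

```rust
use std::sync::Arc;

/// Hypothetical transport abstraction; the real HeliosDB API may differ.
#[async_trait::async_trait]
trait Transport: Send + Sync {
    /// Send a serialized PredicatePushdownRequest to one storage node and
    /// return the filtered result set as raw bytes.
    async fn pushdown(&self, node: u32, request: Vec<u8>) -> anyhow::Result<Vec<u8>>;
}

/// Broadcast one pushdown request to every storage node and gather the
/// filtered results. Switching RDMA vs TCP only changes the Transport impl.
async fn scatter_gather(
    transport: Arc<dyn Transport>,
    nodes: &[u32],
    request: Vec<u8>,
) -> anyhow::Result<Vec<Vec<u8>>> {
    let mut tasks = tokio::task::JoinSet::new();
    for &node in nodes {
        let t = Arc::clone(&transport);
        let req = request.clone();
        tasks.spawn(async move { t.pushdown(node, req).await });
    }
    let mut results = Vec::with_capacity(nodes.len());
    while let Some(joined) = tasks.join_next().await {
        results.push(joined??); // propagate both join and transport errors
    }
    Ok(results)
}
```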
3.2 Vector Search Pattern
Workload Characteristics:
- Request: Small (query vector + filter predicates, ~50 KB)
- Response: Medium (top-K results with vectors, ~500 KB)
- Frequency: Very high (interactive search)
- Latency-sensitive: Yes (p99 < 50 ms target)
TCP Performance:
Single vector search query:
- Request (50 KB): 50 μs
- HNSW search on storage node: 25 ms (compute-bound)
- Response (500 KB): 400 μs

Total: ~25.5 ms
With 100 concurrent queries (typical load):
- Network congestion at ~12 Gbps (effective TCP limit)
- Increased latency due to queueing: +10-20 ms
- p99 latency: 45-60 ms (exceeds target)

RDMA Performance:
Single vector search query:
- Request (50 KB): 15 μs
- HNSW search: 25 ms (same, compute-bound)
- Response (500 KB): 50 μs

Total: ~25.1 ms
With 100 concurrent queries:
- Network at ~40 Gbps (well below the 100 Gbps limit)
- Minimal queueing delay: +1-2 ms
- p99 latency: 28-32 ms (meets target)
Improvement: 1.5-2x better p99 latency under load

Analysis:
- Single query: Minimal improvement (compute-bound)
- High concurrency: 2x improvement (TCP saturates network)
- RDMA critical for maintaining low p99 latency under load
3.3 Synchronous Replication Pattern
Workload Characteristics:
- Request: Medium (commit log entry, 1-64 KB)
- Response: Tiny (ACK, 64 bytes)
- Frequency: Every write operation
- Latency-critical: Yes (blocks user transaction)
TCP Performance:
Write operation with synchronous mirroring:
1. Primary write (local): 500 μs
2. Replicate to mirror:
   - Send commit log (8 KB): 27 μs
   - Network RTT: 500 μs (intra-datacenter)
   - Mirror write (local): 500 μs
   - ACK return: 13 μs
   Total replication: ~1,040 μs
Total write latency: 500 μs + 1,040 μs = 1,540 μs (1.54 ms)
Throughput impact:
- Max write rate: 1 ÷ 1.54 ms = 649 writes/sec (single-threaded)
- With pipelining (100 concurrent): ~65,000 writes/sec
- CPU overhead: ~20% per core due to TCP

RDMA Performance:
Write operation with synchronous mirroring:
1. Primary write (local): 500 μs
2. Replicate to mirror:
   - RDMA WRITE (8 KB): 8 μs
   - Network RTT: 5 μs (RDMA ultra-low latency)
   - Mirror write (local): 500 μs
   - RDMA READ (ACK): 5 μs
   Total replication: ~518 μs
Total write latency: 500 μs + 518 μs = 1,018 μs (1.02 ms)
Improvement: 1.5x faster (1.54 ms → 1.02 ms)
Throughput impact:
- Max write rate: 1 ÷ 1.02 ms = 980 writes/sec (single-threaded)
- With pipelining (100 concurrent): ~98,000 writes/sec (+50% vs TCP)
- CPU overhead: ~2% per core

Analysis:
- Latency reduction: 33% (520 μs saved per write)
- Throughput increase: 50% (65K → 98K writes/sec)
- RDMA critical for synchronous replication performance
- CPU savings: 90% (enables more concurrent writes)
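The throughput figures follow from pipelining: sustained write rate is the number of writes in flight divided by per-write latency. A small sketch using the latencies from this section (illustrative constants only):

```rust
/// Sustained replicated-write throughput (writes/sec) given the end-to-end
/// write latency and the number of writes allowed in flight.
fn writes_per_sec(write_latency_ms: f64, in_flight: u32) -> f64 {
    in_flight as f64 / (write_latency_ms / 1000.0)
}

fn main() {
    // Section 3.3: 1.54 ms per write over TCP, 1.02 ms over RDMA.
    println!("TCP,  depth 1:   {:.0}", writes_per_sec(1.54, 1));   // ~649
    println!("TCP,  depth 100: {:.0}", writes_per_sec(1.54, 100)); // ~65,000
    println!("RDMA, depth 1:   {:.0}", writes_per_sec(1.02, 1));   // ~980
    println!("RDMA, depth 100: {:.0}", writes_per_sec(1.02, 100)); // ~98,000
}
```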
3.4 Cache Invalidation Broadcast
Workload Characteristics:
- Request: Tiny (invalidation notice, 128 bytes)
- Response: None (fire-and-forget)
- Frequency: Every committed transaction
- Fanout: High (all compute nodes)
TCP Performance:
Scenario: Invalidate cache on 50 compute nodes
Sequential broadcast approach:
- 50 nodes × 13 μs = 650 μs
- Impractical (too slow)
Parallel TCP connections:
- 50 parallel sends: ~20 μs (network I/O in parallel)
- CPU overhead: 15 μs (system calls, context switches)
- Total: ~35 μs
At 10,000 transactions/sec:
- CPU time: 10,000 × 35 μs = 350 ms/sec = 35% of one core

RDMA Performance:
Parallel RDMA SEND:
- 50 parallel sends: ~8 μs (verbs API, minimal overhead)
- Total: ~8 μs
At 10,000 transactions/sec:
- CPU time: 10,000 × 8 μs = 80 ms/sec = 8% of one core
Improvement: 4.4x faster (35 μs → 8 μs)
CPU savings: 77% (35% → 8% core utilization)
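Unlike predicate pushdown, invalidation is fire-and-forget: the sender posts N sends and never waits for replies. A sketch over a hypothetical one-way transport trait (again assuming `tokio`, `async-trait`, and `anyhow`; not the actual HeliosDB API):

```rust
use std::sync::Arc;

/// Hypothetical one-way send; with RDMA this would map to posting a SEND work
/// request, with TCP to a socket write.
#[async_trait::async_trait]
trait OneWaySend: Send + Sync {
    async fn send(&self, node: u32, payload: Vec<u8>) -> anyhow::Result<()>;
}

/// Fire-and-forget invalidation broadcast: post all sends in parallel and
/// ignore individual failures (a stale cache is refreshed on the next read).
async fn broadcast_invalidation(transport: Arc<dyn OneWaySend>, nodes: &[u32], notice: Vec<u8>) {
    let mut tasks = tokio::task::JoinSet::new();
    for &node in nodes {
        let t = Arc::clone(&transport);
        let payload = notice.clone();
        tasks.spawn(async move {
            // Best effort: swallow per-node errors rather than propagating them.
            let _ = t.send(node, payload).await;
        });
    }
    while tasks.join_next().await.is_some() {}
}
```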
4. CPU Efficiency Analysis

4.1 CPU Cycles per Byte Transferred
TCP/IP:
Breakdown for 1 MB transfer:
- User-space overhead: 50K cycles
- Kernel TCP stack: 500K cycles
- Memory copy: 300K cycles
- Driver: 100K cycles

Total: ~950K cycles per MB
At a 3 GHz CPU: 950K cycles = 317 μs per MB
Effective CPU cost: ~0.3 cores per 1 GB/sec of throughput

RDMA:
Breakdown for 1 MB transfer:
- User-space (verbs): 10K cycles
- No kernel involvement: 0 cycles
- No memory copy: 0 cycles
- RDMA NIC handles the rest

Total: ~10K cycles per MB
At a 3 GHz CPU: 10K cycles = 3.3 μs per MB
Effective CPU cost: ~0.003 cores per 1 GB/sec of throughput
CPU efficiency: 95x better (950K → 10K cycles/MB)
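The cycles-per-MB figures convert directly into cores consumed per GB/s of sustained transfer; that conversion is what the system-wide estimate in 4.2 relies on. A short sketch with this section's constants:

```rust
/// Fraction of one core consumed to sustain 1 GB/s of transfer,
/// given the CPU cost per MB moved.
fn cores_per_gb_per_sec(cycles_per_mb: f64, cpu_hz: f64) -> f64 {
    let seconds_per_mb = cycles_per_mb / cpu_hz;
    seconds_per_mb * 1000.0 // ~1000 MB processed every second at 1 GB/s
}

fn main() {
    let cpu_hz = 3.0e9; // 3 GHz core, as assumed above
    println!("TCP:  {:.3} cores per GB/s", cores_per_gb_per_sec(950_000.0, cpu_hz)); // ~0.317
    println!("RDMA: {:.3} cores per GB/s", cores_per_gb_per_sec(10_000.0, cpu_hz));  // ~0.003
}
```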
4.2 System-Wide CPU Savings

Scenario: HeliosDB cluster under load
Cluster configuration:
- 10 compute nodes
- 20 storage nodes
- 100 GB/sec aggregate data transfer (analytical workload)
TCP CPU overhead:
- 100 GB/sec × 0.3 cores per GB/sec = 30 cores
- Distributed across 30 nodes = 1 core per node (average)
- Peak nodes (hot compute): 5-10 cores consumed by networking
RDMA CPU overhead:
- 100 GB/sec × 0.003 cores per GB/sec = 0.3 cores total
- Negligible per-node overhead
CPU reclaimed: 30 cores → 0.3 cores = **29.7 cores freed for query processing**
At $0.10/core-hour (cloud pricing):
- Cost savings: 29.7 cores × $0.10/hr × 730 hrs/month = $2,168/month

5. Throughput and Scalability Analysis
5.1 Network Bandwidth Utilization
TCP/IP (10 Gbps NIC):
Theoretical bandwidth: 10 Gbps = 1.25 GB/sec

Practical TCP throughput:
- Single stream: 8-9 Gbps (~1.1 GB/sec)
- Multiple streams: 9-10 Gbps (~1.2 GB/sec)
- Bottleneck: CPU overhead limits further scaling
Effective bandwidth: ~80-95% of wire speed, achieved only at significant CPU cost

RDMA/RoCEv2 (100 Gbps NIC):
Theoretical bandwidth: 100 Gbps = 12.5 GB/sec

Practical RDMA throughput:
- Single stream: 90-95 Gbps (~11.5 GB/sec)
- Multiple streams: 95-98 Gbps (~12.2 GB/sec)
- Bottleneck: wire speed
Effective bandwidth: 95-98% of wire speed
Bandwidth improvement: ~10x (1.2 GB/sec → 12.2 GB/sec), from the 10x faster link combined with near-wire-speed efficiency

5.2 Scalability Limits
TCP-based cluster:
Network becomes the bottleneck at:
- 10 compute nodes × 1.2 GB/sec = 12 GB/sec total
- With 10 Gbps inter-switch links
- Limited to ~8-10 nodes before network saturation
Mitigation:
- Add more switches (expensive)
- Accept a 4:1 over-subscription ratio (poor performance under load)

RDMA-based cluster:
Network bottleneck at:
- 10 compute nodes × 12 GB/sec = 120 GB/sec total
- With 100 Gbps inter-switch links
- Can scale to 80+ nodes before saturation
Mitigation:
- Rarely hits the network limit
- Compute/storage resources are exhausted first

Scalability improvement: 8-10x more nodes before the network becomes the bottleneck
6. Cost-Benefit Analysis
6.1 Hardware Cost Comparison
TCP Setup (10 Gbps Ethernet):
Per-node cost:
- 10 Gbps NIC: $300
- Switch port (amortized): $500

Total: $800 per node
30-node cluster: 30 × $800 = $24,000

RDMA Setup (100 Gbps RoCEv2):
Per-node cost:
- 100 Gbps RDMA NIC: $1,200
- RoCEv2-capable switch port (amortized): $2,000

Total: $3,200 per node
30-node cluster: 30 × $3,200 = $96,000
Additional cost: $72,000 (4x more expensive)

6.2 Total Cost of Ownership (TCO)
Performance Gains → Compute Savings:
With RDMA, the CPU efficiency gains allow:
- 30% fewer nodes for the same throughput (due to CPU savings)
- Or 50% higher throughput with the same nodes
Scenario: 30-node TCP cluster vs 21-node RDMA cluster (same performance)
TCP cluster:
- Network hardware: $24,000
- Servers (30 × $5,000): $150,000
- Power (30 × $1,000/yr): $30,000/yr

Total initial: $174,000
Annual opex: $30,000
RDMA cluster (21 nodes):
- Network hardware: 21 × $3,200 = $67,200
- Servers (21 × $5,000): $105,000
- Power (21 × $1,000/yr): $21,000/yr

Total initial: $172,200
Annual opex: $21,000
Break-even: Immediate (lower total initial cost)
Annual savings: $9,000/year in power alone

Performance Gains → Revenue Opportunity:
Scenario: Same 30 nodes, RDMA enables 50% more throughput
TCP: 100,000 queries/sec
RDMA: 150,000 queries/sec
If revenue is query-driven (e.g., SaaS):
- Additional investment: $72,000
- 50% more capacity = 50% more customers
- Break-even: $72K ÷ (additional annual revenue from the 50% capacity increase)
Example: $500/customer/year, 1,000 existing customers
- Additional revenue: 500 new customers × $500 = $250,000/year
- ROI: $72K investment → $250K annual return
- Break-even: ~3.5 months
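The break-even arithmetic is simple enough to keep executable next to its assumptions; a small Rust sketch (all dollar figures are the estimates above, not quotes):

```rust
/// Months until an upfront network premium is recovered by added annual revenue.
fn break_even_months(upfront_premium: f64, added_annual_revenue: f64) -> f64 {
    upfront_premium / added_annual_revenue * 12.0
}

fn main() {
    // Section 6.2 example: $72K RDMA premium, 500 extra customers at $500/year.
    let added_revenue = 500.0 * 500.0; // $250,000/year
    println!("Break-even: {:.1} months", break_even_months(72_000.0, added_revenue)); // ~3.5
}
```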
6.3 Recommendation Matrix

| Use Case | Network Choice | Rationale |
|---|---|---|
| Development/Testing | 10 Gbps TCP | Low cost, sufficient for dev workloads |
| Production (OLTP-heavy) | 25 Gbps TCP or RoCEv2 | Moderate cost, low-latency writes critical |
| Production (HTAP) | 100 Gbps RoCEv2 | Recommended, best price/performance |
| Production (Analytics-heavy) | 100 Gbps RoCEv2 | Essential for high throughput |
| Extreme scale (>100 nodes) | 200 Gbps RoCEv2 | Future-proof |
7. RDMA Implementation Considerations
7.1 RDMA Operations for HeliosDB
Operation Mapping:
| HeliosDB Operation | RDMA Primitive | Rationale |
|---|---|---|
| Predicate pushdown request | SEND/RECV | Two-sided, requires receiver processing |
| Cache invalidation | SEND/RECV | Broadcast, minimal payload |
| Data transfer (large) | WRITE | One-sided, zero-copy bulk transfer |
| Synchronous replication | WRITE + SEND | Write commit log, send ACK |
| Metadata lookup | SEND/RECV | Small message, low latency |
| Vector search request | SEND/RECV | Requires processing on both sides |
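This mapping can be encoded directly in the network layer so that each message class selects its verb; a sketch with illustrative type names (not HeliosDB's actual types):

```rust
/// RDMA verbs used by the network layer (names are illustrative).
enum RdmaPrimitive {
    SendRecv,      // two-sided: receiver posts a RECV and processes the message
    Write,         // one-sided: zero-copy bulk transfer into a registered buffer
    WriteWithSend, // bulk WRITE followed by a small SEND acting as a commit/ACK signal
}

/// Message classes from the table in Section 7.1.
enum MessageClass {
    PredicatePushdownRequest,
    CacheInvalidation,
    LargeDataTransfer,
    SynchronousReplication,
    MetadataLookup,
    VectorSearchRequest,
}

fn primitive_for(class: MessageClass) -> RdmaPrimitive {
    match class {
        MessageClass::LargeDataTransfer => RdmaPrimitive::Write,
        MessageClass::SynchronousReplication => RdmaPrimitive::WriteWithSend,
        // Everything else is small and needs receiver-side processing.
        _ => RdmaPrimitive::SendRecv,
    }
}
```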
7.2 Memory Registration Overhead
Challenge: RDMA requires memory registration (pinning pages)
Registration cost:
- Small buffer (4 KB): 5-10 μs
- Large buffer (1 MB): 50-200 μs
Mitigation strategies:
1. Pre-register buffers in a pool:
   - Allocate 1,000 × 1 MB buffers at startup
   - Registration cost amortized over the buffer lifetime
   - Total startup cost: 50-200 ms (acceptable)
2. Reuse registered buffers:
   - Implement a buffer pool with MR (Memory Region) caching
   - Steady-state registration overhead: ~0 μs (amortized)
Recommended: Pre-allocate 10-20 GB of registered memory per node
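A pool that registers buffers once at startup and recycles them is the standard way to hide this cost. A minimal sketch; the actual registration call (ibv_reg_mr or the RDMA library's wrapper around it) is elided, and the `lkey` handling is a placeholder:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// A buffer whose pages are pinned and registered with the RDMA NIC.
/// `lkey` stands in for the local key returned by registration; the actual
/// call is library-specific and omitted here.
struct RegisteredBuffer {
    data: Vec<u8>,
    lkey: u32,
}

/// Pool of pre-registered 1 MB buffers. Registration (50-200 us per MB)
/// happens once at startup; steady-state checkout/return is just a queue op.
struct BufferPool {
    free: Mutex<VecDeque<RegisteredBuffer>>,
}

impl BufferPool {
    fn new(count: usize, buf_size: usize) -> Self {
        let mut free = VecDeque::with_capacity(count);
        for i in 0..count {
            let data = vec![0u8; buf_size];
            // Placeholder for the registration call on `data`.
            let lkey = i as u32;
            free.push_back(RegisteredBuffer { data, lkey });
        }
        Self { free: Mutex::new(free) }
    }

    fn checkout(&self) -> Option<RegisteredBuffer> {
        self.free.lock().unwrap().pop_front()
    }

    fn give_back(&self, buf: RegisteredBuffer) {
        self.free.lock().unwrap().push_back(buf);
    }
}
```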
7.3 Reliability and Error Handling

RDMA Reliability Modes:
| Mode | Guarantee | Latency | Use Case |
|---|---|---|---|
| RC (Reliable Connection) | In-order, exactly-once | Lowest | Recommended for HeliosDB |
| UC (Unreliable Connection) | Best-effort | Very low | Not suitable (data loss risk) |
| UD (Unreliable Datagram) | Best-effort multicast | Low | Cache invalidation (optional) |
Recommendation: Use RC (Reliable Connection) for all HeliosDB operations
- Guarantees in-order delivery
- Automatic retransmission on packet loss
- Performance overhead: Negligible (<1%)
7.4 Congestion Control
TCP Congestion Control:
- Automatic (built into protocol)
- Adapts to network conditions
- Can cause latency spikes under congestion
RDMA Congestion Control:
- Hardware-based (PFC - Priority Flow Control)
- Lossless Ethernet with backpressure
- Requires proper switch configuration
HeliosDB Configuration:
```toml
[network.rdma]
# Enable ECN (Explicit Congestion Notification) for RoCEv2
enable_ecn = true

# QoS settings
priority = 3  # High priority for database traffic

# Buffer sizes
send_buffer_mb = 256
recv_buffer_mb = 256

# Connection pools
max_connections_per_node = 16
```

8. Monitoring and Performance Validation
8.1 Key Metrics
RDMA Performance:
- rdma.post_send_latency_us: Time to post a SEND/WRITE work request
- rdma.completion_latency_us: Time until the completion event
- rdma.throughput_gbps: Achieved bandwidth per connection
- rdma.cpu_percent: CPU overhead of RDMA operations
Comparison Metrics:
- network.tcp_throughput_gbps: Baseline TCP performance
- network.rdma_speedup: Actual RDMA speedup vs TCP
- cpu.network_overhead_percent: CPU spent on networking
8.2 Benchmarking Protocol
Micro-benchmarks:
```bash
# RDMA latency test
ib_send_lat -d mlx5_0 -F --report_gbits

# RDMA bandwidth test
ib_send_bw -d mlx5_0 -D 10 --report_gbits

# Compare vs TCP
netperf -H storage_node -t TCP_STREAM
```

Application-level benchmarks:
```sql
-- Measure predicate pushdown latency
EXPLAIN ANALYZE
SELECT count(*)
FROM large_table
WHERE category = 'electronics'
  AND price < 100;

-- Compare TCP vs RDMA:
--   TCP:  150 ms
--   RDMA: 20 ms (expected)
```

9. Migration Path
9.1 Hybrid Deployment (TCP + RDMA)
Phase 1: TCP-only (Months 1-6)
- Deploy with 10 Gbps Ethernet
- Validate functionality
- Gather performance baselines
Phase 2: Hybrid (Months 7-9)
- Add RDMA NICs to critical nodes
- Run TCP and RDMA in parallel
- Gradual migration of traffic
- Validate performance gains
Phase 3: RDMA-native (Months 10+)
- All inter-node traffic on RDMA
- TCP only for external client connections
- Optimize for RDMA-specific features (one-sided WRITE)
9.2 Fallback Mechanism
```rust
// HeliosDB network layer with automatic fallback.
// Assumes the anyhow and tokio crates; RdmaQueuePair is HeliosDB's RDMA
// queue-pair wrapper (defined in heliosdb-network), whose post_send is
// assumed to return anyhow::Result<()>.
use anyhow::{anyhow, Result};
use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;

pub enum TransportMode {
    RDMA,
    TCP,
}

pub struct NetworkConnection {
    mode: TransportMode,
    rdma_qp: Option<RdmaQueuePair>,
    tcp_socket: Option<TcpStream>,
}

impl NetworkConnection {
    pub async fn send(&mut self, data: &[u8]) -> Result<()> {
        match self.mode {
            TransportMode::RDMA => {
                if let Some(qp) = &self.rdma_qp {
                    // Post the payload as an RDMA work request.
                    qp.post_send(data).await
                } else {
                    // Fallback to TCP if RDMA is unavailable.
                    self.send_tcp(data).await
                }
            }
            TransportMode::TCP => self.send_tcp(data).await,
        }
    }

    async fn send_tcp(&mut self, data: &[u8]) -> Result<()> {
        let socket = self
            .tcp_socket
            .as_mut()
            .ok_or_else(|| anyhow!("no usable transport for this connection"))?;
        socket.write_all(data).await?;
        Ok(())
    }
}
```

Benefit: Graceful degradation if RDMA is unavailable or fails
10. Conclusion
Key Findings:
1. Latency Improvements:
- Small messages (RPC): 3x faster (26 μs → 8 μs)
- Medium messages: 3x faster (27 μs → 9 μs)
- Large transfers: 7x faster (667 μs → 91 μs)
- Synchronous replication: 1.5x faster (1.54 ms → 1.02 ms)
2. Throughput Improvements:
- Sustained throughput: ~10x higher (1.2 GB/sec → 12.2 GB/sec)
- Multi-shard queries: 10x faster data gathering
- Write throughput: 50% higher (65K → 98K writes/sec)
3. CPU Efficiency:
- 95x reduction in CPU cycles per byte (950K → 10K cycles/MB)
- Networking CPU in the modeled cluster cut from 30 cores to 0.3 cores (Section 4.2)
- Enables 30-50% more throughput on same hardware
4. Cost-Benefit:
- 4x higher upfront cost ($24K → $96K for 30-node cluster)
- Break-even: Immediate (due to fewer nodes needed)
- Or 3-6 months (due to higher throughput → more revenue)
- Annual opex savings: $9K+ in power alone
5. Scalability:
- TCP: Network bottleneck at 8-10 nodes
- RDMA: Can scale to 80+ nodes before network limit
- 8-10x better scalability ceiling
Recommendation:
RDMA/RoCEv2 is critical for HeliosDB production deployments.
The performance benefits are transformative:
- 10x faster analytical queries (multi-shard aggregations)
- 50% higher write throughput (synchronous replication)
- 90% CPU savings (enables more concurrent queries)
The 4x hardware premium is justified by:
- Immediate TCO benefits (fewer nodes needed)
- Future scalability (100+ node clusters)
- Competitive advantage (sub-50ms p99 for complex queries)
Minimum Recommended Configuration:
- 100 Gbps RoCEv2 for production
- 25 Gbps RoCEv2 for smaller deployments (<10 nodes)
- TCP acceptable only for development/testing
Next Steps:
- Implement RDMA connection management in heliosdb-network module
- Add automatic TCP fallback for compatibility
- Develop RDMA-specific optimizations (one-sided WRITE for replication)
- Create performance benchmarking suite (TCP vs RDMA comparison)