Network Performance Analysis: RDMA vs TCP for HeliosDB
Executive Summary
This analysis quantifies the performance benefits of RDMA over Converged Ethernet (RoCEv2) compared to traditional TCP/IP for HeliosDB's inter-node communication, with a focus on data transfer patterns, latency characteristics, and cost-benefit trade-offs.
1. Network Architecture Comparison
1.1 TCP/IP Stack (Traditional)
```
Application Layer (HeliosDB)
       ↓ System calls
┌──────────────┐
│  User Space  │
└──────┬───────┘
       ↓ Context switch
┌──────────────┐
│ Kernel Space │
│ - TCP stack  │
│ - IP routing │
│ - Driver     │
└──────┬───────┘
       ↓ DMA
┌──────────────┐
│     NIC      │
└──────────────┘
```
Latency Components:
1. System call overhead: 500-1000 ns
2. Context switch: 1-3 μs
3. TCP processing: 2-5 μs
4. Memory copy (kernel → user): 5-20 μs for large data
5. Network wire time: 0.5-2 μs (datacenter)

Total: 10-30 μs per message

1.2 RDMA/RoCEv2 Stack
```
Application Layer (HeliosDB)
       ↓ Verbs API (direct memory access)
┌──────────────┐
│  User Space  │
│ - RDMA lib   │
│ - Memory reg │
└──────┬───────┘
       ↓ Direct memory access (no context switch)
┌──────────────┐
│   RDMA NIC   │
│ (InfiniBand/ │
│   RoCEv2)    │
└──────────────┘
```
Latency Components:
1. Verbs call: 100-200 ns
2. NIC DMA: 1-2 μs
3. Network wire time: 0.5-2 μs

Total: 2-5 μs per message
Latency reduction: 5-10x vs TCP

1.3 Key Architectural Differences
| Characteristic | TCP/IP | RDMA/RoCEv2 |
|---|---|---|
| Kernel involvement | Every operation | None (kernel bypass) |
| Context switches | 2 per send/recv | 0 |
| Memory copies | 1-2 (kernel ↔ user) | 0 (direct DMA) |
| CPU overhead | High (protocol processing) | Very low (<5% CPU) |
| Latency | 10-30 μs | 2-5 μs |
| Throughput | 10-40 Gbps (CPU-limited) | 100-200 Gbps (wire-limited) |
| Zero-copy | No | Yes |
2. Latency Analysis by Operation Type
2.1 Small Message Latency (RPC Calls)
Scenario: Metadata lookup from Compute to Metadata Service
Message size: 256 bytes (table schema request)
TCP/IP:
- Serialization: 0.5 μs
- System call + kernel: 3 μs
- TCP overhead: 3 μs
- Network wire: 1 μs (10 Gbps)
- Receiver kernel: 3 μs
- Context switch: 2 μs
- Deserialization: 0.5 μs

Total: ~13 μs one-way, ~26 μs RTT
RDMA:
- Serialization: 0.5 μs
- Verbs post: 0.2 μs
- NIC DMA: 1 μs
- Network wire: 1 μs
- Receiver NIC: 1 μs
- Deserialization: 0.5 μs

Total: ~4 μs one-way, ~8 μs RTT
Improvement: 3.2x faster (26 μs → 8 μs)

Impact on Query Latency:
For a distributed query requiring 5 metadata lookups:
- TCP: 5 × 26 μs = 130 μs
- RDMA: 5 × 8 μs = 40 μs
- Savings: 90 μs per query
At 10,000 QPS this adds up to ~900 ms of cumulative lookup wait time eliminated per second of operation, and the kernel-bypass path removes most of the per-lookup networking CPU as well.
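The arithmetic behind that claim is worth keeping explicit; a minimal Rust sketch using the per-RTT figures from this section (the constants are the estimates above, not measurements):

```rust
/// Cumulative metadata-lookup wait time saved per second of wall-clock time,
/// using the round-trip estimates from Section 2.1.
fn metadata_savings_us_per_sec(lookups_per_query: u64, qps: u64) -> u64 {
    const TCP_RTT_US: u64 = 26; // estimated TCP round trip
    const RDMA_RTT_US: u64 = 8; // estimated RDMA round trip
    lookups_per_query * (TCP_RTT_US - RDMA_RTT_US) * qps
}

fn main() {
    // 5 lookups per query at 10,000 QPS -> 900,000 us = 900 ms saved per second.
    let saved_us = metadata_savings_us_per_sec(5, 10_000);
    println!("{} ms of lookup wait time saved per second", saved_us / 1_000);
}
```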
2.2 Medium Message Latency (Predicate Pushdown)
Scenario: PredicatePushdownRequest from Compute to Storage Node
Message size: 8 KB (serialized predicates + projection list)
TCP/IP:
- Serialization: 2 μs
- System call: 2 μs
- TCP segmentation (2 packets): 5 μs
- Memory copy: 5 μs
- Network wire: 3 μs
- Receiver processing: 10 μs

Total: ~27 μs one-way
RDMA (SEND/RECV):
- Serialization: 2 μs
- Verbs post: 0.3 μs
- NIC DMA: 2 μs
- Network wire: 3 μs
- Receiver NIC: 2 μs

Total: ~9 μs one-way
Improvement: 3x faster (27 μs → 9 μs)

2.3 Large Message Latency (Data Transfer)
Scenario: FilteredResultSet stream from Storage to Compute
Message size: 1 MB (compressed columnar data)
TCP/IP:
- Bandwidth-bound regime
- Throughput: 10-25 Gbps (CPU-limited due to TCP overhead)
- Effective throughput: ~1.5 GB/sec
- Transfer time: 1 MB ÷ 1.5 GB/s = 667 μs
- CPU overhead: ~30% of one core
RDMA (WRITE):
- Bandwidth-bound regime
- Throughput: 90-100 Gbps (wire-limited)
- Effective throughput: ~11 GB/sec
- Transfer time: 1 MB ÷ 11 GB/s = 91 μs
- CPU overhead: ~3% of one core
Improvement: 7.3x faster (667 μs → 91 μs)
CPU savings: 90% reduction in CPU overhead

Cumulative Effect for Multi-Shard Query:
Query scans 10 shards, each returns 10 MB of filtered data
TCP/IP:
- Per shard: 10 MB ÷ 1.5 GB/s = 6.67 ms
- Sequential: 10 × 6.67 ms = 66.7 ms
- Parallel (10 streams): still ~67 ms, because the receiving node's effective TCP throughput saturates at ~1.5 GB/s (~12 Gbps), so parallelism barely helps
- CPU usage: 10 cores × 30% = 3 full cores
RDMA:
- Per shard: 10 MB ÷ 11 GB/s = 0.91 ms
- Parallel (10 streams): ~9.1 ms for all 100 MB at ~11 GB/s (~88 Gbps on the receiving NIC, below the 100 Gbps limit)
- CPU usage: 10 cores × 3% = 0.3 cores
Improvement: 7.3x faster data transfer (66.7 ms → 9.1 ms)
CPU savings: 90% (3 cores → 0.3 cores)
Network headroom: the 100 Gbps NIC still has ~12% free capacity
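The multi-shard arithmetic reduces to dividing the gathered data volume by the receiving node's effective throughput, since the parallel streams share one receiver. A small sketch of that model, using the effective-throughput assumptions stated above (~1.5 GB/s TCP, ~11 GB/s RDMA), not measured values:

```rust
/// Time (ms) to gather `shards` result sets of `mb_per_shard` MB each into one
/// compute node whose effective receive throughput is `gb_per_sec` GB/s.
/// Parallel streams share the receiver's NIC/CPU budget, so the gather is
/// bounded by total bytes divided by effective throughput.
fn gather_time_ms(shards: u32, mb_per_shard: f64, gb_per_sec: f64) -> f64 {
    let total_mb = shards as f64 * mb_per_shard;
    total_mb / (gb_per_sec * 1000.0) * 1000.0
}

fn main() {
    let tcp = gather_time_ms(10, 10.0, 1.5);   // ~66.7 ms
    let rdma = gather_time_ms(10, 10.0, 11.0); // ~9.1 ms
    println!("TCP: {tcp:.1} ms, RDMA: {rdma:.1} ms, speedup: {:.1}x", tcp / rdma);
}
```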
3. Data Transfer Pattern Analysis

3.1 Predicate Pushdown Pattern
Workload Characteristics:
- Request: Small (1-16 KB serialized predicates)
- Response: Large (1-100 MB filtered data, streaming)
- Frequency: High (every analytical query)
- Parallelism: High (N storage nodes per query)
TCP Performance:
Scenario: Query 16 storage nodes with predicate pushdown
Request phase (broadcast):
- Worst case, fully sequential: 16 requests (8 KB each) × 27 μs = 432 μs
- Actual: ~50 μs (parallel network I/O, but CPU-bound on the sender)
Response phase (gather):
- Average 5 MB per node × 16 nodes = 80 MB total
- At 10 Gbps (1.25 GB/s): 80 MB ÷ 1.25 GB/s = 64 ms
- TCP CPU overhead: 2 cores fully utilized
Total: ~64 ms (network-bound)

RDMA Performance:
Request phase (broadcast):
- Worst case, fully sequential: 16 requests (8 KB each) × 9 μs = 144 μs
- Actual: ~15 μs (parallel, minimal CPU)
Response phase (gather):
- 80 MB total
- At 100 Gbps RDMA (12.5 GB/s): 80 MB ÷ 12.5 GB/s = 6.4 ms
- CPU overhead: 0.2 cores
Total: ~6.4 ms (network-bound)
Improvement: 10x faster (64 ms → 6.4 ms)

Analysis:
- Small messages: 3x improvement (latency-bound)
- Large transfers: 10x improvement (bandwidth-bound)
- Critical insight: RDMA enables 100 Gbps fabric utilization, TCP caps at ~10-15 Gbps
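At the application layer the fan-out/gather flow described in 3.1 is the same for both transports; only the transport implementation changes. A minimal tokio sketch, assuming a hypothetical `Transport` trait plus the `async-trait` and `anyhow` crates rather than HeliosDB's actual network API:

```rust
use std::sync::Arc;

/// Hypothetical transport abstraction; the real HeliosDB API may differ.
#[async_trait::async_trait]
trait Transport: Send + Sync {
    /// Send a serialized PredicatePushdownRequest to one storage node and
    /// return the filtered result set as raw bytes.
    async fn pushdown(&self, node: u32, request: Vec<u8>) -> anyhow::Result<Vec<u8>>;
}

/// Broadcast one pushdown request to every storage node and gather the
/// filtered results. Switching RDMA vs TCP only changes the Transport impl.
async fn scatter_gather(
    transport: Arc<dyn Transport>,
    nodes: &[u32],
    request: Vec<u8>,
) -> anyhow::Result<Vec<Vec<u8>>> {
    let mut tasks = tokio::task::JoinSet::new();
    for &node in nodes {
        let t = Arc::clone(&transport);
        let req = request.clone();
        tasks.spawn(async move { t.pushdown(node, req).await });
    }
    let mut results = Vec::with_capacity(nodes.len());
    while let Some(joined) = tasks.join_next().await {
        results.push(joined??); // propagate both join and transport errors
    }
    Ok(results)
}
```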
3.2 Vector Search Pattern
Workload Characteristics:
- Request: Small (query vector + filter predicates, ~50 KB)
- Response: Medium (top-K results with vectors, ~500 KB)
- Frequency: Very high (interactive search)
- Latency-sensitive: Yes (p99 < 50 ms target)
TCP Performance:
Single vector search query:
- Request (50 KB): 50 μs
- HNSW search on storage node: 25 ms (compute-bound)
- Response (500 KB): 400 μs

Total: ~25.5 ms
With 100 concurrent queries (typical load):
- Network congestion at ~12 Gbps (effective TCP limit)
- Increased latency due to queueing: +10-20 ms
- p99 latency: 45-60 ms (exceeds target)

RDMA Performance:
Single vector search query:
- Request (50 KB): 15 μs
- HNSW search: 25 ms (same, compute-bound)
- Response (500 KB): 50 μs

Total: ~25.1 ms
With 100 concurrent queries:
- Network at ~40 Gbps (well below the 100 Gbps limit)
- Minimal queueing delay: +1-2 ms
- p99 latency: 28-32 ms (meets target)
Improvement: 1.5-2x better p99 latency under load

Analysis:
- Single query: Minimal improvement (compute-bound)
- High concurrency: 2x improvement (TCP saturates network)
- RDMA critical for maintaining low p99 latency under load
3.3 Synchronous Replication Pattern
Workload Characteristics:
- Request: Medium (commit log entry, 1-64 KB)
- Response: Tiny (ACK, 64 bytes)
- Frequency: Every write operation
- Latency-critical: Yes (blocks user transaction)
TCP Performance:
Write operation with synchronous mirroring:
1. Primary write (local): 500 μs
2. Replicate to mirror:
   - Send commit log (8 KB): 27 μs
   - Network RTT: 500 μs (intra-datacenter)
   - Mirror write (local): 500 μs
   - ACK return: 13 μs
   Total replication: ~1,040 μs
Total write latency: 500 μs + 1,040 μs = 1,540 μs (1.54 ms)
Throughput impact:
- Max write rate: 1 ÷ 1.54 ms = 649 writes/sec (single-threaded)
- With pipelining (100 concurrent): ~65,000 writes/sec
- CPU overhead: ~20% per core due to TCP

RDMA Performance:
Write operation with synchronous mirroring:
1. Primary write (local): 500 μs
2. Replicate to mirror:
   - RDMA WRITE (8 KB): 8 μs
   - Network RTT: 5 μs (RDMA ultra-low latency)
   - Mirror write (local): 500 μs
   - RDMA READ (ACK): 5 μs
   Total replication: ~518 μs
Total write latency: 500 μs + 518 μs = 1,018 μs (1.02 ms)
Improvement: 1.5x faster (1.54 ms → 1.02 ms)
Throughput impact:
- Max write rate: 1 ÷ 1.02 ms = 980 writes/sec (single-threaded)
- With pipelining (100 concurrent): ~98,000 writes/sec (+50% vs TCP)
- CPU overhead: ~2% per core

Analysis:
- Latency reduction: 33% (520 μs saved per write)
- Throughput increase: 50% (65K → 98K writes/sec)
- RDMA critical for synchronous replication performance
- CPU savings: 90% (enables more concurrent writes)
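The throughput figures follow from pipelining: sustained write rate is the number of writes in flight divided by per-write latency. A small sketch using the latencies from this section (illustrative constants only):

```rust
/// Sustained replicated-write throughput (writes/sec) given the end-to-end
/// write latency and the number of writes allowed in flight.
fn writes_per_sec(write_latency_ms: f64, in_flight: u32) -> f64 {
    in_flight as f64 / (write_latency_ms / 1000.0)
}

fn main() {
    // Section 3.3: 1.54 ms per write over TCP, 1.02 ms over RDMA.
    println!("TCP,  depth 1:   {:.0}", writes_per_sec(1.54, 1));   // ~649
    println!("TCP,  depth 100: {:.0}", writes_per_sec(1.54, 100)); // ~65,000
    println!("RDMA, depth 1:   {:.0}", writes_per_sec(1.02, 1));   // ~980
    println!("RDMA, depth 100: {:.0}", writes_per_sec(1.02, 100)); // ~98,000
}
```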
3.4 Cache Invalidation Broadcast
Workload Characteristics:
- Request: Tiny (invalidation notice, 128 bytes)
- Response: None (fire-and-forget)
- Frequency: Every committed transaction
- Fanout: High (all compute nodes)
TCP Performance:
Scenario: Invalidate cache on 50 compute nodes
Sequential broadcast approach:
- 50 nodes × 13 μs = 650 μs
- Impractical (too slow)
Parallel TCP connections:
- 50 parallel sends: ~20 μs (network I/O in parallel)
- CPU overhead: 15 μs (system calls, context switches)
- Total: ~35 μs
At 10,000 transactions/sec:
- CPU time: 10,000 × 35 μs = 350 ms/sec = 35% of one core

RDMA Performance:
Parallel RDMA SEND:
- 50 parallel sends: ~8 μs (verbs API, minimal overhead)
- Total: ~8 μs
At 10,000 transactions/sec:
- CPU time: 10,000 × 8 μs = 80 ms/sec = 8% of one core
Improvement: 4.4x faster (35 μs → 8 μs)
CPU savings: 77% (35% → 8% core utilization)
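Unlike predicate pushdown, invalidation is fire-and-forget: the sender posts N sends and never waits for replies. A sketch over a hypothetical one-way transport trait (again assuming `tokio`, `async-trait`, and `anyhow`; not the actual HeliosDB API):

```rust
use std::sync::Arc;

/// Hypothetical one-way send; with RDMA this would map to posting a SEND work
/// request, with TCP to a socket write.
#[async_trait::async_trait]
trait OneWaySend: Send + Sync {
    async fn send(&self, node: u32, payload: Vec<u8>) -> anyhow::Result<()>;
}

/// Fire-and-forget invalidation broadcast: post all sends in parallel and
/// ignore individual failures (a stale cache is refreshed on the next read).
async fn broadcast_invalidation(transport: Arc<dyn OneWaySend>, nodes: &[u32], notice: Vec<u8>) {
    let mut tasks = tokio::task::JoinSet::new();
    for &node in nodes {
        let t = Arc::clone(&transport);
        let payload = notice.clone();
        tasks.spawn(async move {
            // Best effort: swallow per-node errors rather than propagating them.
            let _ = t.send(node, payload).await;
        });
    }
    while tasks.join_next().await.is_some() {}
}
```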
4. CPU Efficiency Analysis

4.1 CPU Cycles per Byte Transferred
TCP/IP:
Breakdown for 1 MB transfer:
- User-space overhead: 50K cycles
- Kernel TCP stack: 500K cycles
- Memory copy: 300K cycles
- Driver: 100K cycles

Total: ~950K cycles per MB
At a 3 GHz CPU: 950K cycles = 317 μs per MB
Effective CPU cost: ~0.3 cores per 1 GB/sec of throughput

RDMA:
Breakdown for 1 MB transfer:
- User-space (verbs): 10K cycles
- No kernel involvement: 0 cycles
- No memory copy: 0 cycles
- RDMA NIC handles the rest

Total: ~10K cycles per MB
At a 3 GHz CPU: 10K cycles = 3.3 μs per MB
Effective CPU cost: ~0.003 cores per 1 GB/sec of throughput
CPU efficiency: 95x better (950K → 10K cycles/MB)
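The cycles-per-MB figures convert directly into cores consumed per GB/s of sustained transfer; that conversion is what the system-wide estimate in 4.2 relies on. A short sketch with this section's constants:

```rust
/// Fraction of one core consumed to sustain 1 GB/s of transfer,
/// given the CPU cost per MB moved.
fn cores_per_gb_per_sec(cycles_per_mb: f64, cpu_hz: f64) -> f64 {
    let seconds_per_mb = cycles_per_mb / cpu_hz;
    seconds_per_mb * 1000.0 // ~1000 MB processed every second at 1 GB/s
}

fn main() {
    let cpu_hz = 3.0e9; // 3 GHz core, as assumed above
    println!("TCP:  {:.3} cores per GB/s", cores_per_gb_per_sec(950_000.0, cpu_hz)); // ~0.317
    println!("RDMA: {:.3} cores per GB/s", cores_per_gb_per_sec(10_000.0, cpu_hz));  // ~0.003
}
```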
4.2 System-Wide CPU Savings

Scenario: HeliosDB cluster under load
Cluster configuration:
- 10 compute nodes
- 20 storage nodes
- 100 GB/sec aggregate data transfer (analytical workload)
TCP CPU overhead:
- 100 GB/sec × 0.3 cores per GB/sec = 30 cores
- Distributed across 30 nodes = 1 core per node (average)
- Peak nodes (hot compute): 5-10 cores consumed by networking
RDMA CPU overhead:
- 100 GB/sec × 0.003 cores per GB/sec = 0.3 cores total
- Negligible per-node overhead
CPU reclaimed: 30 cores → 0.3 cores = **29.7 cores freed for query processing**
At $0.10/core-hour (cloud pricing):
- Cost savings: 29.7 cores × $0.10/hr × 730 hrs/month = $2,168/month

5. Throughput and Scalability Analysis
5.1 Network Bandwidth Utilization
TCP/IP (10 Gbps NIC):
Theoretical bandwidth: 10 Gbps = 1.25 GB/sec

Practical TCP throughput:
- Single stream: 8-9 Gbps (~1.1 GB/sec)
- Multiple streams: 9-10 Gbps (~1.2 GB/sec)
- Bottleneck: CPU overhead limits further scaling
Effective bandwidth: ~80-95% of wire speed, achieved only at significant CPU cost

RDMA/RoCEv2 (100 Gbps NIC):
Theoretical bandwidth: 100 Gbps = 12.5 GB/sec

Practical RDMA throughput:
- Single stream: 90-95 Gbps (~11.5 GB/sec)
- Multiple streams: 95-98 Gbps (~12.2 GB/sec)
- Bottleneck: wire speed
Effective bandwidth: 95-98% of wire speed
Bandwidth improvement: ~10x (1.2 GB/sec → 12.2 GB/sec), from the 10x faster link combined with near-wire-speed efficiency

5.2 Scalability Limits
TCP-based cluster:
Network becomes the bottleneck at:
- 10 compute nodes × 1.2 GB/sec = 12 GB/sec total
- With 10 Gbps inter-switch links
- Limited to ~8-10 nodes before network saturation
Mitigation:
- Add more switches (expensive)
- Accept a 4:1 over-subscription ratio (poor performance under load)

RDMA-based cluster:
Network bottleneck at:
- 10 compute nodes × 12 GB/sec = 120 GB/sec total
- With 100 Gbps inter-switch links
- Can scale to 80+ nodes before saturation
Mitigation:
- Rarely hits the network limit
- Compute/storage resources are exhausted first

Scalability improvement: 8-10x more nodes before the network becomes the bottleneck
6. Cost-Benefit Analysis
6.1 Hardware Cost Comparison
TCP Setup (10 Gbps Ethernet):
Per-node cost:
- 10 Gbps NIC: $300
- Switch port (amortized): $500

Total: $800 per node
30-node cluster: 30 × $800 = $24,000

RDMA Setup (100 Gbps RoCEv2):
Per-node cost:
- 100 Gbps RDMA NIC: $1,200
- RoCEv2-capable switch port (amortized): $2,000

Total: $3,200 per node
30-node cluster: 30 × $3,200 = $96,000
Additional cost: $72,000 (4x more expensive)

6.2 Total Cost of Ownership (TCO)
Performance Gains → Compute Savings:
With RDMA, the CPU efficiency gains allow:
- 30% fewer nodes for the same throughput (due to CPU savings)
- Or 50% higher throughput with the same nodes
Scenario: 30-node TCP cluster vs 21-node RDMA cluster (same performance)
TCP cluster:
- Network hardware: $24,000
- Servers (30 × $5,000): $150,000
- Power (30 × $1,000/yr): $30,000/yr

Total initial: $174,000
Annual opex: $30,000
RDMA cluster (21 nodes):
- Network hardware: 21 × $3,200 = $67,200
- Servers (21 × $5,000): $105,000
- Power (21 × $1,000/yr): $21,000/yr

Total initial: $172,200
Annual opex: $21,000
Break-even: Immediate (lower total initial cost)
Annual savings: $9,000/year in power alone

Performance Gains → Revenue Opportunity:
Scenario: Same 30 nodes, RDMA enables 50% more throughput
TCP: 100,000 queries/sec
RDMA: 150,000 queries/sec
If revenue is query-driven (e.g., SaaS):
- Additional investment: $72,000
- 50% more capacity = 50% more customers
- Break-even: $72K ÷ (additional annual revenue from the 50% capacity increase)
Example: $500/customer/year, 1,000 existing customers
- Additional revenue: 500 new customers × $500 = $250,000/year
- ROI: $72K investment → $250K annual return
- Break-even: ~3.5 months
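The break-even arithmetic is simple enough to keep executable next to its assumptions; a small Rust sketch (all dollar figures are the estimates above, not quotes):

```rust
/// Months until an upfront network premium is recovered by added annual revenue.
fn break_even_months(upfront_premium: f64, added_annual_revenue: f64) -> f64 {
    upfront_premium / added_annual_revenue * 12.0
}

fn main() {
    // Section 6.2 example: $72K RDMA premium, 500 extra customers at $500/year.
    let added_revenue = 500.0 * 500.0; // $250,000/year
    println!("Break-even: {:.1} months", break_even_months(72_000.0, added_revenue)); // ~3.5
}
```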
6.3 Recommendation Matrix

| Use Case | Network Choice | Rationale |
|---|---|---|
| Development/Testing | 10 Gbps TCP | Low cost, sufficient for dev workloads |
| Production (OLTP-heavy) | 25 Gbps TCP or RoCEv2 | Moderate cost, low-latency writes critical |
| Production (HTAP) | 100 Gbps RoCEv2 | Recommended, best price/performance |
| Production (Analytics-heavy) | 100 Gbps RoCEv2 | Essential for high throughput |
| Extreme scale (>100 nodes) | 200 Gbps RoCEv2 | Future-proof |
7. RDMA Implementation Considerations
7.1 RDMA Operations for HeliosDB
Operation Mapping:
| HeliosDB Operation | RDMA Primitive | Rationale |
|---|---|---|
| Predicate pushdown request | SEND/RECV | Two-sided, requires receiver processing |
| Cache invalidation | SEND/RECV | Broadcast, minimal payload |
| Data transfer (large) | WRITE | One-sided, zero-copy bulk transfer |
| Synchronous replication | WRITE + SEND | Write commit log, send ACK |
| Metadata lookup | SEND/RECV | Small message, low latency |
| Vector search request | SEND/RECV | Requires processing on both sides |
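This mapping can be encoded directly in the network layer so that each message class selects its verb; a sketch with illustrative type names (not HeliosDB's actual types):

```rust
/// RDMA verbs used by the network layer (names are illustrative).
enum RdmaPrimitive {
    SendRecv,      // two-sided: receiver posts a RECV and processes the message
    Write,         // one-sided: zero-copy bulk transfer into a registered buffer
    WriteWithSend, // bulk WRITE followed by a small SEND acting as a commit/ACK signal
}

/// Message classes from the table in Section 7.1.
enum MessageClass {
    PredicatePushdownRequest,
    CacheInvalidation,
    LargeDataTransfer,
    SynchronousReplication,
    MetadataLookup,
    VectorSearchRequest,
}

fn primitive_for(class: MessageClass) -> RdmaPrimitive {
    match class {
        MessageClass::LargeDataTransfer => RdmaPrimitive::Write,
        MessageClass::SynchronousReplication => RdmaPrimitive::WriteWithSend,
        // Everything else is small and needs receiver-side processing.
        _ => RdmaPrimitive::SendRecv,
    }
}
```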
7.2 Memory Registration Overhead
Challenge: RDMA requires memory registration (pinning pages)
Registration cost:
- Small buffer (4 KB): 5-10 μs
- Large buffer (1 MB): 50-200 μs
Mitigation strategies:
1. Pre-register buffers in a pool:
   - Allocate 1,000 × 1 MB buffers at startup
   - Registration cost amortized over the buffer lifetime
   - Total startup cost: 50-200 ms (acceptable)
2. Reuse registered buffers:
   - Implement a buffer pool with MR (Memory Region) caching
   - Steady-state registration overhead: ~0 μs (amortized)
Recommended: Pre-allocate 10-20 GB of registered memory per node
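A pool that registers buffers once at startup and recycles them is the standard way to hide this cost. A minimal sketch; the actual registration call (ibv_reg_mr or the RDMA library's wrapper around it) is elided, and the `lkey` handling is a placeholder:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// A buffer whose pages are pinned and registered with the RDMA NIC.
/// `lkey` stands in for the local key returned by registration; the actual
/// call is library-specific and omitted here.
struct RegisteredBuffer {
    data: Vec<u8>,
    lkey: u32,
}

/// Pool of pre-registered 1 MB buffers. Registration (50-200 us per MB)
/// happens once at startup; steady-state checkout/return is just a queue op.
struct BufferPool {
    free: Mutex<VecDeque<RegisteredBuffer>>,
}

impl BufferPool {
    fn new(count: usize, buf_size: usize) -> Self {
        let mut free = VecDeque::with_capacity(count);
        for i in 0..count {
            let data = vec![0u8; buf_size];
            // Placeholder for the registration call on `data`.
            let lkey = i as u32;
            free.push_back(RegisteredBuffer { data, lkey });
        }
        Self { free: Mutex::new(free) }
    }

    fn checkout(&self) -> Option<RegisteredBuffer> {
        self.free.lock().unwrap().pop_front()
    }

    fn give_back(&self, buf: RegisteredBuffer) {
        self.free.lock().unwrap().push_back(buf);
    }
}
```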
7.3 Reliability and Error Handling

RDMA Reliability Modes:
| Mode | Guarantee | Latency | Use Case |
|---|---|---|---|
| RC (Reliable Connection) | In-order, exactly-once | Lowest | Recommended for HeliosDB |
| UC (Unreliable Connection) | Best-effort | Very low | Not suitable (data loss risk) |
| UD (Unreliable Datagram) | Best-effort multicast | Low | Cache invalidation (optional) |
Recommendation: Use RC (Reliable Connection) for all HeliosDB operations
- Guarantees in-order delivery
- Automatic retransmission on packet loss
- Performance overhead: Negligible (<1%)
7.4 Congestion Control
TCP Congestion Control:
- Automatic (built into protocol)
- Adapts to network conditions
- Can cause latency spikes under congestion
RDMA Congestion Control:
- Hardware-based (PFC - Priority Flow Control)
- Lossless Ethernet with backpressure
- Requires proper switch configuration
HeliosDB Configuration:
```toml
[network.rdma]
# Enable ECN (Explicit Congestion Notification) for RoCEv2
enable_ecn = true

# QoS settings
priority = 3  # High priority for database traffic

# Buffer sizes
send_buffer_mb = 256
recv_buffer_mb = 256

# Connection pools
max_connections_per_node = 16
```

8. Monitoring and Performance Validation
8.1 Key Metrics
RDMA Performance:
- rdma.post_send_latency_us: Time to post a SEND/WRITE work request
- rdma.completion_latency_us: Time until the completion event
- rdma.throughput_gbps: Achieved bandwidth per connection
- rdma.cpu_percent: CPU overhead of RDMA operations
Comparison Metrics:
- network.tcp_throughput_gbps: Baseline TCP performance
- network.rdma_speedup: Actual RDMA speedup vs TCP
- cpu.network_overhead_percent: CPU spent on networking
8.2 Benchmarking Protocol
Micro-benchmarks:
```bash
# RDMA latency test
ib_send_lat -d mlx5_0 -F --report_gbits

# RDMA bandwidth test
ib_send_bw -d mlx5_0 -D 10 --report_gbits

# Compare vs TCP
netperf -H storage_node -t TCP_STREAM
```

Application-level benchmarks:
```sql
-- Measure predicate pushdown latency
EXPLAIN ANALYZE
SELECT count(*)
FROM large_table
WHERE category = 'electronics'
  AND price < 100;

-- Compare TCP vs RDMA:
--   TCP:  150 ms
--   RDMA: 20 ms (expected)
```

9. Migration Path
9.1 Hybrid Deployment (TCP + RDMA)
Phase 1: TCP-only (Months 1-6)
- Deploy with 10 Gbps Ethernet
- Validate functionality
- Gather performance baselines
Phase 2: Hybrid (Months 7-9)
- Add RDMA NICs to critical nodes
- Run TCP and RDMA in parallel
- Gradual migration of traffic
- Validate performance gains
Phase 3: RDMA-native (Months 10+)
- All inter-node traffic on RDMA
- TCP only for external client connections
- Optimize for RDMA-specific features (one-sided WRITE)
9.2 Fallback Mechanism
```rust
// HeliosDB network layer with automatic fallback.
// Assumes the anyhow and tokio crates; RdmaQueuePair is HeliosDB's RDMA
// queue-pair wrapper (defined in heliosdb-network), whose post_send is
// assumed to return anyhow::Result<()>.
use anyhow::{anyhow, Result};
use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;

pub enum TransportMode {
    RDMA,
    TCP,
}

pub struct NetworkConnection {
    mode: TransportMode,
    rdma_qp: Option<RdmaQueuePair>,
    tcp_socket: Option<TcpStream>,
}

impl NetworkConnection {
    pub async fn send(&mut self, data: &[u8]) -> Result<()> {
        match self.mode {
            TransportMode::RDMA => {
                if let Some(qp) = &self.rdma_qp {
                    // Post the payload as an RDMA work request.
                    qp.post_send(data).await
                } else {
                    // Fallback to TCP if RDMA is unavailable.
                    self.send_tcp(data).await
                }
            }
            TransportMode::TCP => self.send_tcp(data).await,
        }
    }

    async fn send_tcp(&mut self, data: &[u8]) -> Result<()> {
        let socket = self
            .tcp_socket
            .as_mut()
            .ok_or_else(|| anyhow!("no usable transport for this connection"))?;
        socket.write_all(data).await?;
        Ok(())
    }
}
```

Benefit: Graceful degradation if RDMA is unavailable or fails
10. Conclusion
Key Findings:
1. Latency Improvements:
- Small messages (RPC): 3x faster (26 μs → 8 μs)
- Medium messages: 3x faster (27 μs → 9 μs)
- Large transfers: 7x faster (667 μs → 91 μs)
- Synchronous replication: 1.5x faster (1.54 ms → 1.02 ms)
2. Throughput Improvements:
- Sustained throughput: ~10x higher (1.2 GB/sec → 12.2 GB/sec)
- Multi-shard queries: 10x faster data gathering
- Write throughput: 50% higher (65K → 98K writes/sec)
3. CPU Efficiency:
- 95x reduction in CPU cycles per byte (950K → 10K cycles/MB)
- Networking CPU in the modeled cluster cut from 30 cores to 0.3 cores (Section 4.2)
- Enables 30-50% more throughput on same hardware
4. Cost-Benefit:
- 4x higher upfront cost ($24K → $96K for 30-node cluster)
- Break-even: Immediate (due to fewer nodes needed)
- Or 3-6 months (due to higher throughput → more revenue)
- Annual opex savings: $9K+ in power alone
5. Scalability:
- TCP: Network bottleneck at 8-10 nodes
- RDMA: Can scale to 80+ nodes before network limit
- 8-10x better scalability ceiling
Recommendation:
RDMA/RoCEv2 is critical for HeliosDB production deployments.
The performance benefits are transformative:
- 10x faster analytical queries (multi-shard aggregations)
- 50% higher write throughput (synchronous replication)
- 90% CPU savings (enables more concurrent queries)
The 4x hardware premium is justified by:
- Immediate TCO benefits (fewer nodes needed)
- Future scalability (100+ node clusters)
- Competitive advantage (sub-50ms p99 for complex queries)
Minimum Recommended Configuration:
- 100 Gbps RoCEv2 for production
- 25 Gbps RoCEv2 for smaller deployments (<10 nodes)
- TCP acceptable only for development/testing
Next Steps:
- Implement RDMA connection management in heliosdb-network module
- Add automatic TCP fallback for compatibility
- Develop RDMA-specific optimizations (one-sided WRITE for replication)
- Create performance benchmarking suite (TCP vs RDMA comparison)