LSM-Tree RUM Conjecture Analysis for HeliosDB
Executive Summary
This analysis examines the Read-Update-Memory (RUM) conjecture trade-offs for HeliosDB’s LSM-tree storage engine, providing quantitative estimates and configuration recommendations for different workload patterns.
1. RUM Conjecture Overview
The RUM conjecture states that a storage system cannot simultaneously optimize for:
- Read Amplification (R): Number of disk reads per point query
- Update Amplification (U): Number of times data is rewritten during its lifetime
- Memory Amplification (M): Memory overhead per stored data item
HeliosDB’s LSM-tree prioritizes write throughput (low per-update cost at ingest time) at the expense of higher read amplification and of background write amplification from compaction.
2. Compaction Strategy Analysis
2.1 Size-Tiered Compaction Strategy (STCS)
Characteristics:
- Merges SSTables of similar size
- Optimized for write-heavy workloads
- Higher read amplification
Quantitative Estimates:
| Metric | Value | Explanation |
|---|---|---|
| Write Amplification | 5-10x | Each write may be rewritten 5-10 times through compaction |
| Read Amplification | 10-20 SSTables | Point query may scan 10-20 SSTables |
| Space Amplification | 50-90% | 50-90% additional space for compaction overhead |
| Memory per SSTable | ~10MB | Bloom filter + index metadata |
| Compaction CPU Cost | Medium | Periodic large merges |
Performance Profile:
```
Write Throughput:   ████████████████████  (100K-500K writes/sec per node)
Point Read Latency: ████████░░░░░░░░░░░░  (2-10ms, varies with SSTable count)
Range Scan Latency: ██████████████░░░░░░  (70% of optimal)
Space Efficiency:   ████████░░░░░░░░░░░░  (50-60% efficiency)
```
Best For:
- Write-heavy OLTP workloads (>80% writes)
- Time-series data ingestion
- Log aggregation systems
- Initial data loading phases
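To make "merges SSTables of similar size" concrete, here is a minimal sketch of a size-tiered candidate picker, assuming power-of-two size buckets; the threshold names mirror the Section 6.1 configuration keys, and this is illustrative rather than HeliosDB's actual implementation.

```python
from collections import defaultdict

def pick_stcs_candidates(sstable_sizes_mb, min_merge=4, max_merge=32):
    """Bucket SSTables by size (powers of two) and return the first
    bucket with enough similarly-sized tables to justify a merge."""
    buckets = defaultdict(list)
    for size in sstable_sizes_mb:
        # Tables within ~2x of each other land in the same bucket.
        buckets[int(max(size, 1)).bit_length()].append(size)
    for key in sorted(buckets):
        if len(buckets[key]) >= min_merge:
            return buckets[key][:max_merge]  # cap the merge width
    return []  # nothing worth compacting yet

# Four ~160 MB flushes trigger a 4-way merge; the lone 1.2 GB table waits.
print(pick_stcs_candidates([160, 158, 165, 161, 1200]))
```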
2.2 Leveled Compaction Strategy (LCS)
Characteristics:
- Organizes data into levels of increasing size
- Minimizes overlap between SSTables
- Lower, more predictable read latency
Quantitative Estimates:
| Metric | Value | Explanation |
|---|---|---|
| Write Amplification | 10-30x | Higher due to aggressive compaction |
| Read Amplification | 1-3 SSTables | Query typically scans 1 SSTable per level |
| Space Amplification | 10-20% | Tighter space bounds due to less overlap |
| Memory per Level | ~15MB | More index structures per level |
| Compaction CPU Cost | High | Continuous background compaction |
Performance Profile:
```
Write Throughput:   ████████████░░░░░░░░  (60K-200K writes/sec per node)
Point Read Latency: ███████████████████░  (0.5-2ms, very consistent)
Range Scan Latency: ███████████████████░  (95% of optimal)
Space Efficiency:   ███████████████████░  (85-90% efficiency)
```
Best For:
- Read-heavy OLTP workloads (>70% reads)
- Transactional systems with point queries
- Workloads sensitive to read latency variance
- Space-constrained deployments
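As a back-of-envelope sketch (not engine code), the level count, and hence the worst-case number of SSTables a point read must probe, grows logarithmically with data volume under a fixed fanout:

```python
def lcs_levels(total_data_gb, l1_size_gb=1.0, fanout=10):
    """Count levels until cumulative capacity covers the dataset,
    with each level `fanout` times larger than the previous one."""
    levels, capacity, level_size = 0, 0.0, l1_size_gb
    while capacity < total_data_gb:
        levels += 1
        capacity += level_size
        level_size *= fanout
    return max(levels, 1)

# 1 TB with a 1 GB L1 and 10x fanout fits in 4 levels, so a point read
# probes at most ~4 SSTables (fewer in practice, thanks to Bloom filters).
print(lcs_levels(1024))  # -> 4
```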
2.3 Hybrid Compaction Strategy (Recommended for HTAP)
Proposed Strategy:
- Use STCS for hot/recent data (last 24-48 hours)
- Automatically transition to LCS for warm data
- Apply HCC compression to cold data (>30 days)
Quantitative Estimates:
| Metric | Value | Explanation |
|---|---|---|
| Write Amplification | 6-12x | Balanced approach |
| Read Amplification | 3-8 SSTables | Hot data faster, cold acceptable |
| Space Amplification | 30-40% | Mixed approach |
| Memory Overhead | ~12MB/level | Moderate memory usage |
| Compaction CPU Cost | Medium-High | Transition overhead |
Performance Profile:
```
Write Throughput:   █████████████████░░░  (80K-350K writes/sec per node)
Point Read Latency: █████████████████░░░  (1-5ms, workload dependent)
Range Scan Latency: █████████████████░░░  (85% of optimal)
Space Efficiency:   █████████████░░░░░░░  (70-75% efficiency)
```
Best For:
- HTAP workloads with mixed read/write patterns
- Systems with temporal access patterns
- Multi-tenant environments with diverse workloads
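A minimal sketch of the age-based routing this strategy implies; the thresholds mirror `transition_age_hours` and `cold_data_age_days` from Section 6.3, and the function itself is illustrative, not a HeliosDB API.

```python
from datetime import datetime, timedelta, timezone

def compaction_strategy_for(sstable_created_at,
                            hot_window=timedelta(hours=48),
                            cold_threshold=timedelta(days=30)):
    """Route an SSTable by age: STCS while hot, LCS once warm,
    HCC-compressed storage once cold (>30 days)."""
    age = datetime.now(timezone.utc) - sstable_created_at
    if age <= hot_window:
        return "STCS"    # absorb the write burst cheaply
    if age <= cold_threshold:
        return "LCS"     # tighten read latency for warm data
    return "LCS+HCC"     # recompact with columnar compression

recent = datetime.now(timezone.utc) - timedelta(hours=3)
print(compaction_strategy_for(recent))  # -> STCS
```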
3. Memory Amplification Analysis
3.1 Bloom Filter Memory Overhead
For 1 billion rows across 1000 SSTables (1M rows each):
```
Bloom filter size per SSTable = (rows × bits_per_row) / 8
                              = (1M × 10 bits) / 8
                              = 1.25 MB per SSTable

Total Bloom filter memory     = 1000 × 1.25 MB = 1.25 GB
```
Configuration Trade-offs:
| Bits per Row | False Positive Rate | Memory per 1M Rows | Recommendation |
|---|---|---|---|
| 8 bits | 2.0% | 1.0 MB | Too high FP rate |
| 10 bits | 1.0% | 1.25 MB | Recommended default |
| 12 bits | 0.6% | 1.5 MB | Read-heavy workloads |
| 16 bits | 0.2% | 2.0 MB | Latency-critical systems |
3.2 Index Memory Overhead
Primary Index Structure:
```
Index entry size = key_size + offset (8 bytes) + size (4 bytes)
                 = 32 bytes (avg key) + 12 bytes
                 = 44 bytes
```
For 1M rows with 1-in-100 index sampling:
```
Index memory = 10K entries × 44 bytes = 440 KB per SSTable
```
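Combining the two formulas above into a sizing helper (decimal units, and the document's defaults of 10 bits/row and 1-in-100 sampling; an estimate, not a measurement of the engine):

```python
def per_sstable_memory_kb(rows, bits_per_row=10, avg_key_bytes=32,
                          index_sampling=100):
    """Estimate resident memory per SSTable: Bloom filter + sampled index."""
    bloom_kb = rows * bits_per_row / 8 / 1000
    # Each sampled index entry: key (32 B avg) + 8 B offset + 4 B size.
    index_kb = (rows // index_sampling) * (avg_key_bytes + 12) / 1000
    return bloom_kb + index_kb

# 1M rows -> 1,250 KB Bloom + 440 KB index ≈ 1.7 MB per SSTable,
# or ~1.7 GB across the 1,000-SSTable example in Section 3.1.
print(f"{per_sstable_memory_kb(1_000_000):,.0f} KB")  # -> 1,690 KB
```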
3.3 Memtable Memory Allocation
Recommended Configuration:
| Workload Type | Memtable Size | Flush Frequency | Peak Memory |
|---|---|---|---|
| Write-Heavy | 256 MB | 2-5 minutes | 768 MB (3× for compaction) |
| Balanced | 128 MB | 5-10 minutes | 384 MB |
| Read-Heavy | 64 MB | 10-20 minutes | 192 MB |
Formula:
```
Total memtable memory = memtable_size × (1 + num_immutable_memtables)
```
Recommended: 128 MB × 3 (1 active + 2 immutable) = 384 MB per table
4. Read Amplification Mitigation Strategies
4.1 Partition Pruning Impact
Scenario: Query with partition filter on date-partitioned table
Without Partition Pruning:
- Total partitions: 365 (daily for 1 year)
- SSTables per partition: 10
- Total SSTables scanned: 3,650
- Estimated latency: 150-300ms

With Partition Pruning (1-week query):
- Partitions scanned: 7
- SSTables scanned: 70
- Estimated latency: 5-15ms
- Improvement: 20-60x faster

Recommendation: Always define partitioning on time-series or categorical dimensions.
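The speedup is simple arithmetic; a sketch, with the per-SSTable cost treated as an assumed constant purely for illustration:

```python
def scan_cost(partitions, sstables_per_partition, ms_per_sstable=0.06):
    """Rough point-query model: SSTables touched × per-SSTable latency."""
    sstables = partitions * sstables_per_partition
    return sstables, sstables * ms_per_sstable

full_tables, full_ms = scan_cost(365, 10)  # unpruned: 3,650 SSTables
week_tables, week_ms = scan_cost(7, 10)    # one-week filter: 70 SSTables
print(f"{full_ms / week_ms:.0f}x faster")  # -> 52x, within the 20-60x range
```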
4.2 Bloom Filter Effectiveness
Expected Performance:
| Workload | Lookups Filtered by Bloom | Read Amplification Reduction |
|---|---|---|
| Point queries (existing keys) | 0% (every lookup must pass through) | None |
| Point queries (non-existent keys) | ~99% (filtered) | ~99% reduction |
| Mixed (70% exist, 30% miss) | ~30% filtered | ~30% reduction |
Key Insight: Bloom filters are critical for workloads with high miss rates.
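A quick sanity check of the mixed-workload row, assuming the ~1% false-positive rate of the 10 bits/row default:

```python
def bloom_reduction(miss_rate, false_positive_rate=0.01):
    """Fraction of SSTable probes avoided: only true misses are filterable,
    and a false positive still costs a disk read."""
    return miss_rate * (1 - false_positive_rate)

print(f"{bloom_reduction(0.30):.1%}")  # -> 29.7%, the ~30% in the table
```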
4.3 Cache Hit Rate Analysis
L1 Cache (Memtable):
- Size: 128 MB (default)
- Coverage: Last ~1M rows (assuming 128 bytes/row)
- Expected hit rate for recent writes: 80-95%
L2 Cache (Compute Node Block Cache):
- Size: 8-32 GB (configurable)
- Coverage: Hot working set
- Expected hit rate: 60-80% (steady state)
Combined Cache Effectiveness:
```
Effective read amplification = base × (1 - L1_hit_rate) × (1 - L2_hit_rate)
```
Example (STCS with 15 SSTables):
- Base read amplification: 15
- L1 hit rate: 10%
- L2 hit rate: 70%
- Effective: 15 × 0.9 × 0.3 = 4.05 SSTables
- Latency improvement: ~73% reduction
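The same model as a tiny helper (a direct transcription of the formula above, not a measured quantity):

```python
def effective_read_amp(base_sstables, l1_hit_rate, l2_hit_rate):
    """Only queries missing both the memtable (L1) and the block
    cache (L2) pay the full SSTable scan."""
    return base_sstables * (1 - l1_hit_rate) * (1 - l2_hit_rate)

print(f"{effective_read_amp(15, 0.10, 0.70):.2f}")  # -> 4.05
```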
5. Write Amplification Analysis
5.1 Compaction Write Amplification by Strategy
STCS Write Path:
```
Initial write:      1 write to commit log + 1 write to memtable
Memtable flush:     1 write to L0 SSTable
Compaction L0→L1:   4 writes (4-way merge)
Compaction L1→L2:   no additional (size-tiered)
Total:              ~6-8 writes per initial write
```
LCS Write Path:
```
Initial write:      1 write to commit log + 1 write to memtable
Memtable flush:     1 write to L0
L0→L1 compaction:   1 write
L1→L2 compaction:   10 writes (10x fanout)
L2→L3 compaction:   10 writes
Total:              ~22-30 writes per initial write
```
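A hedged estimator for the LCS tally above; it simply sums the per-stage rewrites, with the level count taken from the fanout model in Section 2.2:

```python
def lcs_write_amplification(levels, fanout=10):
    """Commit log + memtable (2), flush to L0 (1), L0->L1 (1), then
    roughly `fanout` rewrites per deeper level transition."""
    return 2 + 1 + 1 + fanout * max(levels - 2, 0)

# Four levels at 10x fanout: 2 + 1 + 1 + 10*2 = 24 writes per write,
# inside the ~22-30x range tallied above.
print(lcs_write_amplification(levels=4))  # -> 24
```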
5.2 Network Write Amplification (Replication)
Synchronous Mirroring:
```
Application write: 1 logical write
Physical writes:
  - Primary commit log:               1 write
  - Primary memtable:                 1 write
  - Mirror commit log (over network): 1 write
  - Mirror memtable:                  1 write

Network amplification: 1× (one network round-trip)
Storage amplification: 2× (primary + mirror)
```
Latency Impact:
```
Single-node write latency:    0.5-1ms (commit log + memtable)
Mirrored write latency (TCP): 1.5-3ms (single-node latency + 1 RTT)

RDMA RTT (same datacenter): 1-5 microseconds
TCP RTT (same datacenter):  0.5-2 milliseconds

Expected write latency with RDMA: 0.5ms + 5μs ≈ 0.505ms (negligible overhead)
Expected write latency with TCP:  0.5ms + 1ms ≈ 1.5ms (3x slowdown)
```
Recommendation: RDMA is critical for maintaining low write latency with synchronous replication.
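The latency model reduces to one addition; a sketch using the RTT figures above as assumptions rather than measurements:

```python
def mirrored_write_latency_ms(local_ms, rtt_ms):
    """Synchronous mirroring adds one network round-trip to the
    local commit-log + memtable write."""
    total = local_ms + rtt_ms
    return total, total / local_ms

for transport, rtt_ms in [("RDMA", 0.005), ("TCP", 1.0)]:
    total, slowdown = mirrored_write_latency_ms(0.5, rtt_ms)
    print(f"{transport}: {total:.3f} ms ({slowdown:.1f}x single-node)")
# RDMA: 0.505 ms (1.0x)    TCP: 1.500 ms (3.0x)
```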
6. Configuration Recommendations by Workload
6.1 Write-Heavy OLTP (>80% writes)
```toml
[storage.compaction]
strategy = "STCS"
sstable_size_mb = 160
min_merge_threshold = 4
max_merge_threshold = 32

[storage.memtable]
size_mb = 256
flush_threshold_mb = 200
num_memtables = 2

[storage.bloom_filter]
bits_per_row = 10
```
Expected Performance:
- Write throughput: 300K-500K ops/sec per node
- Point read latency (p99): 8-15ms
- Space efficiency: 55-65%
6.2 Read-Heavy OLTP (>70% reads)
```toml
[storage.compaction]
strategy = "LCS"
level_fanout = 10
max_sstable_size_mb = 160

[storage.memtable]
size_mb = 128
flush_threshold_mb = 96
num_memtables = 1

[storage.bloom_filter]
bits_per_row = 12

[storage.cache]
block_cache_mb = 16384  # 16 GB
```
Expected Performance:
- Write throughput: 100K-200K ops/sec per node
- Point read latency (p99): 1-3ms
- Space efficiency: 80-90%
6.3 HTAP (Mixed Workload)
```toml
[storage.compaction]
strategy = "Hybrid"
hot_data_strategy = "STCS"
cold_data_strategy = "LCS"
transition_age_hours = 48

[storage.memtable]
size_mb = 128
flush_threshold_mb = 96
num_memtables = 2

[storage.bloom_filter]
bits_per_row = 10

[storage.hcc]
enabled = true
cold_data_age_days = 30
compression_mode = "WAREHOUSE_OPTIMIZED"
```
Expected Performance:
- Write throughput: 200K-350K ops/sec per node
- Point read latency (p99): 2-6ms (hot), 5-15ms (cold)
- Space efficiency: 70-75% (hot), 90%+ (cold with HCC)
7. Monitoring and Tuning Metrics
Key Performance Indicators
Write Path:
- `lsm.memtable.flush_rate`: target 10-50 flushes/hour
- `lsm.compaction.pending_bytes`: should be < 10 GB
- `lsm.write_amplification`: monitor vs. expected values
- `replication.mirror_lag_ms`: target < 5 ms
Read Path:
- `lsm.read_amplification`: SSTables scanned per query
- `lsm.bloom_filter.false_positive_rate`: monitor vs. configured rate
- `cache.hit_rate.l1`: target > 10% (memtable)
- `cache.hit_rate.l2`: target > 60% (block cache)
Resource Usage:
- `memory.bloom_filters_mb`: should be < 5% of total memory
- `memory.memtable_mb`: should be predictable
- `disk.compaction_io_mb_per_sec`: should not saturate I/O
- `cpu.compaction_percent`: target < 30% average
Auto-Tuning Recommendations
Dynamic Compaction Throttling:
```
if write_throughput > 0.8 × max_throughput:
    increase compaction_threads by 1

if compaction_pending_bytes > 20 GB:
    reduce memtable_size by 25%
    increase flush frequency
```
Adaptive Bloom Filter Sizing:
```
if false_positive_rate > 2 × configured_rate:
    increase bits_per_row by 2

if memory_pressure > 0.9:
    decrease bits_per_row to minimum (8)
```
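Rendered as runnable Python (a deliberately simplified sketch; the metric and configuration names are illustrative, not HeliosDB's actual API):

```python
def autotune_step(metrics, config):
    """Apply the throttling and Bloom-sizing rules above to a
    mutable config dict; returns the (possibly updated) config."""
    if metrics["write_throughput"] > 0.8 * metrics["max_throughput"]:
        config["compaction_threads"] += 1
    if metrics["compaction_pending_gb"] > 20:
        config["memtable_size_mb"] = int(config["memtable_size_mb"] * 0.75)
    if metrics["false_positive_rate"] > 2 * config["configured_fp_rate"]:
        config["bits_per_row"] += 2
    if metrics["memory_pressure"] > 0.9:
        config["bits_per_row"] = 8  # floor from Section 3.1
    return config

config = {"compaction_threads": 4, "memtable_size_mb": 128,
          "bits_per_row": 10, "configured_fp_rate": 0.01}
metrics = {"write_throughput": 420_000, "max_throughput": 500_000,
           "compaction_pending_gb": 25, "false_positive_rate": 0.012,
           "memory_pressure": 0.5}
print(autotune_step(metrics, config))
# -> threads bumped to 5, memtable cut to 96 MB; Bloom settings unchanged
```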
8. Conclusion
Key Findings:
- Write Performance Priority: STCS provides 2-3x better write throughput but with 5-10x worse worst-case read latency
- Read Latency Priority: LCS provides consistent sub-2ms reads but reduces write throughput by 40-60%
- HTAP Sweet Spot: Hybrid strategy with age-based transitions balances both workloads
- RDMA Critical: Synchronous replication with RDMA adds only ~5μs, vs ~1ms with TCP
- Memory Investment: Allocating roughly 1.25-2 GB per billion rows for Bloom filters (10-16 bits/row, per Section 3.1) reduces read amplification by ~99% for non-existent keys
Recommended Starting Configuration:
- Compaction: Hybrid (STCS hot, LCS cold)
- Memtable: 128 MB
- Bloom filters: 10 bits/row
- Block cache: 16-32 GB per compute node
- Monitor and tune based on actual workload patterns
Next Steps:
- Implement workload-adaptive compaction strategy selection
- Develop real-time compaction throttling based on resource utilization
- Build cost model for automatic cache sizing