
LSM-Tree RUM Conjecture Analysis for HeliosDB

Executive Summary

This analysis examines the Read-Update-Memory (RUM) conjecture trade-offs for HeliosDB’s LSM-tree storage engine, providing quantitative estimates and configuration recommendations for different workload patterns.

1. RUM Conjecture Overview

The RUM conjecture states that a storage system cannot simultaneously minimize all three of the following overheads; reducing any two comes at the expense of the third:

  • Read Amplification (R): Number of disk reads per point query
  • Update Amplification (U): Number of times data is rewritten during its lifetime
  • Memory Amplification (M): Memory overhead per stored data item

HeliosDB’s LSM-tree prioritizes update cost (high write throughput, low ingest latency) at the expense of higher read and space amplification, while background compaction introduces additional write amplification.

2. Compaction Strategy Analysis

2.1 Size-Tiered Compaction Strategy (STCS)

Characteristics:

  • Merges SSTables of similar size
  • Optimized for write-heavy workloads
  • Higher read amplification

Quantitative Estimates:

Metric                 Value            Explanation
Write Amplification    5-10x            Each write may be rewritten 5-10 times through compaction
Read Amplification     10-20 SSTables   A point query may scan 10-20 SSTables
Space Amplification    50-90%           50-90% additional space held for compaction overhead
Memory per SSTable     ~10 MB           Bloom filter + index metadata
Compaction CPU Cost    Medium           Periodic large merges
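
As a back-of-the-envelope check on the read-amplification figure above, the sketch below estimates how many sorted runs a point read may have to consult under size-tiered compaction. The dataset size, flush size, and merge threshold here are illustrative assumptions, not HeliosDB defaults.

import math

def stcs_sorted_runs(dataset_bytes, flush_bytes, merge_threshold=4):
    # Each size tier can accumulate up to (T - 1) runs before a merge fires,
    # and the number of tiers grows logarithmically with the data volume.
    tiers = math.ceil(math.log(dataset_bytes / flush_bytes, merge_threshold))
    return tiers * (merge_threshold - 1)

# Example: 1 TB of data, 160 MB memtable flushes, merge threshold T = 4
print(stcs_sorted_runs(1 << 40, 160 << 20))   # ~21 runs, consistent with the 10-20 SSTable estimate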

Performance Profile:

Write Throughput: ████████████████████ (100K-500K writes/sec per node)
Point Read Latency: ████████░░░░░░░░░░░ (2-10ms, varies with SSTable count)
Range Scan Performance: ██████████████░░░░░ (70% of optimal)
Space Efficiency: ████████░░░░░░░░░░░ (50-60% efficiency)

Best For:

  • Write-heavy OLTP workloads (>80% writes)
  • Time-series data ingestion
  • Log aggregation systems
  • Initial data loading phases

2.2 Leveled Compaction Strategy (LCS)

Characteristics:

  • Organizes data into levels of increasing size
  • Minimizes overlap between SSTables
  • Lower, more predictable read latency

Quantitative Estimates:

Metric                 Value           Explanation
Write Amplification    10-30x          Higher due to aggressive compaction
Read Amplification     1-3 SSTables    A query typically scans at most 1 SSTable per level
Space Amplification    10-20%          Tighter space bounds due to less overlap
Memory per Level       ~15 MB          More index structures per level
Compaction CPU Cost    High            Continuous background compaction
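
For intuition on the read-amplification figure, the sketch below estimates how many levels are needed to hold a dataset under leveled compaction; the dataset size, L1 target, and fanout are illustrative assumptions. With non-overlapping SSTables, a point read touches at most one SSTable per level, and Bloom filters skip most of them.

import math

def lcs_levels(dataset_bytes, l1_target_bytes, fanout=10):
    # Level capacity grows by `fanout`; returns levels needed beyond L1.
    return max(1, math.ceil(math.log(dataset_bytes / l1_target_bytes, fanout)))

# Example: 1 TB of data, 1.6 GB L1 target (10 x 160 MB SSTables), fanout 10
print(lcs_levels(1 << 40, 10 * (160 << 20)))   # ~3 levels -> roughly 1-3 SSTables actually read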

Performance Profile:

Write Throughput: ████████████░░░░░░░ (60K-200K writes/sec per node)
Point Read Latency: ███████████████████ (0.5-2ms, very consistent)
Range Scan Performance: ███████████████████ (95% of optimal)
Space Efficiency: ███████████████████ (85-90% efficiency)

Best For:

  • Read-heavy OLTP workloads (>70% reads)
  • Transactional systems with point queries
  • Workloads sensitive to read latency variance
  • Space-constrained deployments

2.3 Hybrid Compaction Strategy

Proposed Strategy (see the selection sketch after this list):

  • Use STCS for hot/recent data (last 24-48 hours)
  • Automatically transition to LCS for warm data
  • Apply HCC compression to cold data (>30 days)
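
A minimal sketch of this age-based selection, assuming hypothetical thresholds of 48 hours and 30 days; the function name and return labels are illustrative, not HeliosDB configuration values.

def pick_strategy(sstable_age_hours):
    # Map an SSTable's age onto the hot/warm/cold treatment described above.
    if sstable_age_hours <= 48:
        return "STCS"        # hot/recent data: optimize for ingest throughput
    if sstable_age_hours <= 30 * 24:
        return "LCS"         # warm data: optimize for read latency
    return "LCS+HCC"         # cold data: leveled layout plus HCC compression

print(pick_strategy(6), pick_strategy(100), pick_strategy(1000))   # STCS LCS LCS+HCC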

Quantitative Estimates:

Metric                 Value           Explanation
Write Amplification    6-12x           Balanced approach
Read Amplification     3-8 SSTables    Hot data faster, cold data still acceptable
Space Amplification    30-40%          Mixed approach
Memory Overhead        ~12 MB/level    Moderate memory usage
Compaction CPU Cost    Medium-High     Transition overhead

Performance Profile:

Write Throughput: █████████████████░░ (80K-350K writes/sec per node)
Point Read Latency: █████████████████░░ (1-5ms, workload dependent)
Range Scan Performance: █████████████████░░ (85% of optimal)
Space Efficiency: █████████████░░░░░░ (70-75% efficiency)

Best For:

  • HTAP workloads with mixed read/write patterns
  • Systems with temporal access patterns
  • Multi-tenant environments with diverse workloads

3. Memory Amplification Analysis

3.1 Bloom Filter Memory Overhead

For 1 billion rows across 1000 SSTables (1M rows each):

Bloom Filter Size per SSTable = (rows × bits_per_row) / 8
                              = (1M × 10 bits) / 8
                              = 1.25 MB per SSTable

Total Bloom Filter Memory = 1000 × 1.25 MB = 1.25 GB
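
The same arithmetic as a small sketch; the false-positive helper assumes an optimal number of hash functions, so it gives a slightly lower bound than the figures quoted in the table below.

import math

def bloom_memory_mb(rows, bits_per_row):
    # Bloom filter size for one SSTable, in MB (10^6 bytes, as in the text).
    return rows * bits_per_row / 8 / 1e6

def bloom_fp_rate(bits_per_row):
    # Idealized false-positive rate with an optimal hash-function count.
    return 0.5 ** (bits_per_row * math.log(2))

per_sstable_mb = bloom_memory_mb(1_000_000, 10)   # 1.25 MB per SSTable
total_gb = per_sstable_mb * 1000 / 1000           # 1000 SSTables -> 1.25 GB
print(per_sstable_mb, total_gb)
print(f"{bloom_fp_rate(10):.2%}")                 # ~0.82%, near the ~1% quoted below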

Configuration Trade-offs:

Bits per Row   False Positive Rate   Memory per 1M Rows   Recommendation
8 bits         2.0%                  1.0 MB               Too high FP rate
10 bits        1.0%                  1.25 MB              Recommended default
12 bits        0.6%                  1.5 MB               Read-heavy workloads
16 bits        0.2%                  2.0 MB               Latency-critical systems

3.2 Index Memory Overhead

Primary Index Structure:

Index Entry Size = key_size + offset (8 bytes) + size (4 bytes)
                 = 32 bytes (avg key) + 12 bytes = 44 bytes

For 1M rows with 1-in-100 sampling:
Index Memory = 10K entries × 44 bytes = 440 KB per SSTable
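
The same estimate as a sketch; the average key size and sampling interval are the assumptions stated above.

def index_memory_kb(rows, avg_key_bytes=32, sample_every=100):
    # Sparse primary index: one sampled entry per `sample_every` rows.
    entry_bytes = avg_key_bytes + 8 + 4        # key + offset + size
    return rows // sample_every * entry_bytes / 1000

print(index_memory_kb(1_000_000))              # 440.0 KB per SSTable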

3.3 Memtable Memory Allocation

Recommended Configuration:

Workload Type   Memtable Size   Flush Frequency   Peak Memory
Write-Heavy     256 MB          2-5 minutes       768 MB (3×: active + immutable memtables awaiting flush)
Balanced        128 MB          5-10 minutes      384 MB
Read-Heavy      64 MB           10-20 minutes     192 MB

Formula:

Total Memtable Memory = memtable_size × (1 + num_immutable_memtables)
Recommended: 128 MB × (1 + 2 immutable) = 384 MB per table
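
A sketch of the formula applied to the three workload sizes from the table above; the immutable-memtable count of 2 is the assumption used throughout this section.

def memtable_memory_mb(memtable_size_mb, num_immutable=2):
    # Peak usage: one active memtable plus immutables waiting to be flushed.
    return memtable_size_mb * (1 + num_immutable)

for size_mb in (256, 128, 64):
    print(size_mb, "->", memtable_memory_mb(size_mb), "MB")   # 768, 384, 192 MB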

4. Read Amplification Mitigation Strategies

4.1 Partition Pruning Impact

Scenario: Query with partition filter on date-partitioned table

Without Partition Pruning:
- Total partitions: 365 (daily for 1 year)
- SSTables per partition: 10
- Total SSTables scanned: 3,650
- Estimated latency: 150-300ms

With Partition Pruning (1-week query):
- Partitions scanned: 7
- SSTables scanned: 70
- Estimated latency: 5-15ms
- Improvement: 20-60x faster
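
The arithmetic behind these estimates, as a sketch; the per-SSTable read cost is an assumed illustrative value, not a measured HeliosDB figure.

def scan_estimate(partitions, sstables_per_partition=10, ms_per_sstable=0.08):
    # SSTables touched times an assumed per-SSTable read cost.
    sstables = partitions * sstables_per_partition
    return sstables, sstables * ms_per_sstable

print(scan_estimate(365))   # (3650, ~292 ms): no pruning, full year
print(scan_estimate(7))     # (70, ~5.6 ms):   one-week partition filter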

Recommendation: Always define partitioning on time-series or categorical dimensions.

4.2 Bloom Filter Effectiveness

Expected Performance:

Workload                            Probes Filtered by Bloom Filter   Read Amplification Reduction
Point queries (existing keys)       0% (all pass through)             None
Point queries (non-existent keys)   99% (filtered)                    ~99% reduction
Mixed (70% exist, 30% miss)         30% filtered                      ~30% reduction

Key Insight: Bloom filters are critical for workloads with high miss rates.

4.3 Cache Hit Rate Analysis

L1 Cache (Memtable):

  • Size: 128 MB (default)
  • Coverage: Last ~1M rows (assuming 128 bytes/row)
  • Expected hit rate for recent writes: 80-95%

L2 Cache (Compute Node Block Cache):

  • Size: 8-32 GB (configurable)
  • Coverage: Hot working set
  • Expected hit rate: 60-80% (steady state)

Combined Cache Effectiveness:

Effective Read Amplification = Base × (1 - L1_hit_rate) × (1 - L2_hit_rate)

Example (STCS with 15 SSTables):
- Base read amp: 15
- L1 hit: 10%
- L2 hit: 70%
- Effective: 15 × 0.9 × 0.3 = 4.05 SSTables
Reduction in SSTables scanned: ~73%, with latency improving roughly in proportion
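
The same formula as a sketch:

def effective_read_amp(base_sstables, l1_hit_rate, l2_hit_rate):
    # Only reads that miss both the memtable (L1) and block cache (L2)
    # pay the full SSTable fan-out.
    return round(base_sstables * (1 - l1_hit_rate) * (1 - l2_hit_rate), 2)

print(effective_read_amp(15, 0.10, 0.70))   # 4.05 SSTables (~73% reduction)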

5. Write Amplification Analysis

5.1 Compaction Write Amplification by Strategy

STCS Write Path:

Initial Write: 1 write to commit log + 1 write to memtable
Memtable Flush: 1 write to L0 SSTable
Compaction L0→L1: 4 writes (4-way merge)
Compaction L1→L2: No additional (size-tiered)
Total: ~6-8 writes per initial write

LCS Write Path:

Initial Write: 1 write to commit log + 1 write to memtable
Memtable Flush: 1 write to L0
L0→L1 Compaction: 1 write
L1→L2 Compaction: 10 writes (10x fanout)
L2→L3 Compaction: 10 writes
Total: ~22-30 writes per initial write
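
Following the same accounting, a sketch that totals per-byte writes under leveled compaction; the level count and fanout are assumptions chosen to match the example above.

def lcs_write_amplification(levels_beyond_l1, fanout=10):
    # commit log + memtable + L0 flush + L0->L1, then roughly `fanout`
    # rewrites each time data moves down one more level.
    return 1 + 1 + 1 + 1 + fanout * levels_beyond_l1

print(lcs_write_amplification(2))   # 24 writes per initial write, inside the ~22-30 range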

5.2 Network Write Amplification (Replication)

Synchronous Mirroring:

Application Write: 1 logical write
Physical Writes:
- Primary commit log: 1 write
- Primary memtable: 1 write
- Mirror commit log (over network): 1 write
- Mirror memtable: 1 write
Network Amplification: 1× (one network round-trip)
Storage Amplification: 2× (primary + mirror)

Latency Impact:

Single-node write latency: 0.5-1ms (commit log + memtable)
RDMA RTT (same datacenter): 1-5 microseconds
TCP RTT (same datacenter): 0.5-2 milliseconds

Expected mirrored write latency with RDMA: 0.5ms + ~5μs ≈ 0.505ms (negligible overhead)
Expected mirrored write latency with TCP: 0.5ms + ~1ms ≈ 1.5ms (up to ~3ms; roughly a 3x slowdown)
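
A sketch of the additive model behind these numbers (local write cost plus one synchronous replication round trip, using the same figures as above):

def mirrored_write_latency_ms(local_write_ms, replication_rtt_ms):
    # Synchronous mirroring adds one network round trip to the local write path.
    return local_write_ms + replication_rtt_ms

print(mirrored_write_latency_ms(0.5, 0.005))   # ~0.505 ms with RDMA
print(mirrored_write_latency_ms(0.5, 1.0))     # ~1.5 ms with TCP (~3x slowdown)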

Recommendation: RDMA is critical for maintaining low write latency with synchronous replication.

6. Configuration Recommendations by Workload

6.1 Write-Heavy OLTP (>80% writes)

[storage.compaction]
strategy = "STCS"
sstable_size_mb = 160
min_merge_threshold = 4
max_merge_threshold = 32
[storage.memtable]
size_mb = 256
flush_threshold_mb = 200
num_memtables = 2
[storage.bloom_filter]
bits_per_row = 10

Expected Performance:

  • Write throughput: 300K-500K ops/sec per node
  • Point read latency (p99): 8-15ms
  • Space efficiency: 55-65%

6.2 Read-Heavy OLTP (>70% reads)

[storage.compaction]
strategy = "LCS"
level_fanout = 10
max_sstable_size_mb = 160
[storage.memtable]
size_mb = 128
flush_threshold_mb = 96
num_memtables = 1
[storage.bloom_filter]
bits_per_row = 12
[storage.cache]
block_cache_mb = 16384 # 16 GB

Expected Performance:

  • Write throughput: 100K-200K ops/sec per node
  • Point read latency (p99): 1-3ms
  • Space efficiency: 80-90%

6.3 HTAP (Mixed Workload)

[storage.compaction]
strategy = "Hybrid"
hot_data_strategy = "STCS"
cold_data_strategy = "LCS"
transition_age_hours = 48
[storage.memtable]
size_mb = 128
flush_threshold_mb = 96
num_memtables = 2
[storage.bloom_filter]
bits_per_row = 10
[storage.hcc]
enabled = true
cold_data_age_days = 30
compression_mode = "WAREHOUSE_OPTIMIZED"

Expected Performance:

  • Write throughput: 200K-350K ops/sec per node
  • Point read latency (p99): 2-6ms (hot), 5-15ms (cold)
  • Space efficiency: 70-75% (hot), 90%+ (cold with HCC)

7. Monitoring and Tuning Metrics

Key Performance Indicators

Write Path:

  • lsm.memtable.flush_rate: Target 10-50 flushes/hour
  • lsm.compaction.pending_bytes: Should be < 10GB
  • lsm.write_amplification: Monitor vs. expected values
  • replication.mirror_lag_ms: Target < 5ms

Read Path:

  • lsm.read_amplification: SSTables scanned per query
  • lsm.bloom_filter.false_positive_rate: Monitor vs. configured
  • cache.hit_rate.l1: Target > 10% (memtable)
  • cache.hit_rate.l2: Target > 60% (block cache)

Resource Usage:

  • memory.bloom_filters_mb: Should be < 5% of total memory
  • memory.memtable_mb: Should be predictable
  • disk.compaction_io_mb_per_sec: Shouldn’t saturate I/O
  • cpu.compaction_percent: Target < 30% average

Auto-Tuning Recommendations

Dynamic Compaction Throttling:

if write_throughput > 0.8 × max_throughput:
    increase compaction_threads by 1

if compaction_pending_bytes > 20GB:
    reduce memtable_size by 25%
    increase flush frequency

Adaptive Bloom Filter Sizing:

if false_positive_rate > 2 × configured_rate:
    increase bits_per_row by 2

if memory_pressure > 0.9:
    decrease bits_per_row to minimum (8)
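
A minimal runnable sketch combining both rules; the metric names, config keys, and thresholds are illustrative assumptions, not actual HeliosDB parameters.

def auto_tune(metrics, config):
    # Add compaction capacity when ingest approaches the node's limit.
    if metrics["write_throughput"] > 0.8 * metrics["max_throughput"]:
        config["compaction_threads"] += 1
    # Shed memtable memory if the compaction backlog grows too large.
    if metrics["compaction_pending_bytes"] > 20 * 2**30:
        config["memtable_size_mb"] = int(config["memtable_size_mb"] * 0.75)
    # Grow Bloom filters when they underperform; shrink them under memory pressure.
    if metrics["bloom_fp_rate"] > 2 * config["configured_fp_rate"]:
        config["bloom_bits_per_row"] += 2
    if metrics["memory_pressure"] > 0.9:
        config["bloom_bits_per_row"] = max(8, config["bloom_bits_per_row"] - 2)
    return config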

8. Conclusion

Key Findings:

  1. Write Performance Priority: STCS provides 2-3x better write throughput but with 5-10x worse worst-case read latency
  2. Read Latency Priority: LCS provides consistent sub-2ms reads but reduces write throughput by 40-60%
  3. HTAP Sweet Spot: Hybrid strategy with age-based transitions balances both workloads
  4. RDMA Critical: Synchronous replication with RDMA adds only ~5μs, vs ~1ms with TCP
  5. Memory Investment: Allocating roughly 1.25-2 GB of Bloom filter memory per billion rows (10-16 bits/row) reduces read amplification by ~99% for lookups of non-existent keys

Recommended Starting Configuration:

  • Compaction: Hybrid (STCS hot, LCS cold)
  • Memtable: 128 MB
  • Bloom filters: 10 bits/row
  • Block cache: 16-32 GB per compute node
  • Monitor and tune based on actual workload patterns

Next Steps:

  • Implement workload-adaptive compaction strategy selection
  • Develop real-time compaction throttling based on resource utilization
  • Build cost model for automatic cache sizing