LSM-Tree RUM Conjecture Analysis for HeliosDB
Executive Summary
This analysis examines the Read-Update-Memory (RUM) conjecture trade-offs for HeliosDB’s LSM-tree storage engine, providing quantitative estimates and configuration recommendations for different workload patterns.
1. RUM Conjecture Overview
The RUM conjecture states that a storage system cannot simultaneously optimize for:
- Read Amplification (R): Number of disk reads per point query
- Update Amplification (U): Number of times data is rewritten during its lifetime
- Memory Amplification (M): Memory overhead per stored data item
HeliosDB’s LSM-tree prioritizes write throughput (low per-update cost at ingest time) at the expense of higher read amplification and of background write amplification from compaction.
2. Compaction Strategy Analysis
2.1 Size-Tiered Compaction Strategy (STCS)
Characteristics:
- Merges SSTables of similar size
- Optimized for write-heavy workloads
- Higher read amplification
Quantitative Estimates:
| Metric | Value | Explanation |
|---|---|---|
| Write Amplification | 5-10x | Each write may be rewritten 5-10 times through compaction |
| Read Amplification | 10-20 SSTables | Point query may scan 10-20 SSTables |
| Space Amplification | 50-90% | 50-90% additional space for compaction overhead |
| Memory per SSTable | ~10MB | Bloom filter + index metadata |
| Compaction CPU Cost | Medium | Periodic large merges |
Performance Profile:
```
Write Throughput:   ████████████████████  (100K-500K writes/sec per node)
Point Read Latency: ████████░░░░░░░░░░░░  (2-10ms, varies with SSTable count)
Range Scan Latency: ██████████████░░░░░░  (70% of optimal)
Space Efficiency:   ████████░░░░░░░░░░░░  (50-60% efficiency)
```
Best For:
- Write-heavy OLTP workloads (>80% writes)
- Time-series data ingestion
- Log aggregation systems
- Initial data loading phases
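To make "merges SSTables of similar size" concrete, here is a minimal sketch of a size-tiered candidate picker, assuming power-of-two size buckets; the threshold names mirror the Section 6.1 configuration keys, and this is illustrative rather than HeliosDB's actual implementation.

```python
from collections import defaultdict

def pick_stcs_candidates(sstable_sizes_mb, min_merge=4, max_merge=32):
    """Bucket SSTables by size (powers of two) and return the first
    bucket with enough similarly-sized tables to justify a merge."""
    buckets = defaultdict(list)
    for size in sstable_sizes_mb:
        # Tables within ~2x of each other land in the same bucket.
        buckets[int(max(size, 1)).bit_length()].append(size)
    for key in sorted(buckets):
        if len(buckets[key]) >= min_merge:
            return buckets[key][:max_merge]  # cap the merge width
    return []  # nothing worth compacting yet

# Four ~160 MB flushes trigger a 4-way merge; the lone 1.2 GB table waits.
print(pick_stcs_candidates([160, 158, 165, 161, 1200]))
```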
2.2 Leveled Compaction Strategy (LCS)
Characteristics:
- Organizes data into levels of increasing size
- Minimizes overlap between SSTables
- Lower, more predictable read latency
Quantitative Estimates:
| Metric | Value | Explanation |
|---|---|---|
| Write Amplification | 10-30x | Higher due to aggressive compaction |
| Read Amplification | 1-3 SSTables | Query typically scans 1 SSTable per level |
| Space Amplification | 10-20% | Tighter space bounds due to less overlap |
| Memory per Level | ~15MB | More index structures per level |
| Compaction CPU Cost | High | Continuous background compaction |
Performance Profile:
```
Write Throughput:   ████████████░░░░░░░░  (60K-200K writes/sec per node)
Point Read Latency: ███████████████████░  (0.5-2ms, very consistent)
Range Scan Latency: ███████████████████░  (95% of optimal)
Space Efficiency:   ███████████████████░  (85-90% efficiency)
```
Best For:
- Read-heavy OLTP workloads (>70% reads)
- Transactional systems with point queries
- Workloads sensitive to read latency variance
- Space-constrained deployments
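As a back-of-envelope sketch (not engine code), the level count, and hence the worst-case number of SSTables a point read must probe, grows logarithmically with data volume under a fixed fanout:

```python
def lcs_levels(total_data_gb, l1_size_gb=1.0, fanout=10):
    """Count levels until cumulative capacity covers the dataset,
    with each level `fanout` times larger than the previous one."""
    levels, capacity, level_size = 0, 0.0, l1_size_gb
    while capacity < total_data_gb:
        levels += 1
        capacity += level_size
        level_size *= fanout
    return max(levels, 1)

# 1 TB with a 1 GB L1 and 10x fanout fits in 4 levels, so a point read
# probes at most ~4 SSTables (fewer in practice, thanks to Bloom filters).
print(lcs_levels(1024))  # -> 4
```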
2.3 Hybrid Compaction Strategy (Recommended for HTAP)
Proposed Strategy:
- Use STCS for hot/recent data (last 24-48 hours)
- Automatically transition to LCS for warm data
- Apply HCC compression to cold data (>30 days)
Quantitative Estimates:
| Metric | Value | Explanation |
|---|---|---|
| Write Amplification | 6-12x | Balanced approach |
| Read Amplification | 3-8 SSTables | Hot data faster, cold acceptable |
| Space Amplification | 30-40% | Mixed approach |
| Memory Overhead | ~12MB/level | Moderate memory usage |
| Compaction CPU Cost | Medium-High | Transition overhead |
Performance Profile:
```
Write Throughput:   █████████████████░░░  (80K-350K writes/sec per node)
Point Read Latency: █████████████████░░░  (1-5ms, workload dependent)
Range Scan Latency: █████████████████░░░  (85% of optimal)
Space Efficiency:   █████████████░░░░░░░  (70-75% efficiency)
```
Best For:
- HTAP workloads with mixed read/write patterns
- Systems with temporal access patterns
- Multi-tenant environments with diverse workloads
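A minimal sketch of the age-based routing this strategy implies; the thresholds mirror `transition_age_hours` and `cold_data_age_days` from Section 6.3, and the function itself is illustrative, not a HeliosDB API.

```python
from datetime import datetime, timedelta, timezone

def compaction_strategy_for(sstable_created_at,
                            hot_window=timedelta(hours=48),
                            cold_threshold=timedelta(days=30)):
    """Route an SSTable by age: STCS while hot, LCS once warm,
    HCC-compressed storage once cold (>30 days)."""
    age = datetime.now(timezone.utc) - sstable_created_at
    if age <= hot_window:
        return "STCS"    # absorb the write burst cheaply
    if age <= cold_threshold:
        return "LCS"     # tighten read latency for warm data
    return "LCS+HCC"     # recompact with columnar compression

recent = datetime.now(timezone.utc) - timedelta(hours=3)
print(compaction_strategy_for(recent))  # -> STCS
```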
3. Memory Amplification Analysis
3.1 Bloom Filter Memory Overhead
For 1 billion rows across 1000 SSTables (1M rows each):
```
Bloom filter size per SSTable = (rows × bits_per_row) / 8
                              = (1M × 10 bits) / 8
                              = 1.25 MB per SSTable

Total Bloom filter memory     = 1000 × 1.25 MB = 1.25 GB
```
Configuration Trade-offs:
| Bits per Row | False Positive Rate | Memory per 1M Rows | Recommendation |
|---|---|---|---|
| 8 bits | 2.0% | 1.0 MB | Too high FP rate |
| 10 bits | 1.0% | 1.25 MB | Recommended default |
| 12 bits | 0.6% | 1.5 MB | Read-heavy workloads |
| 16 bits | 0.2% | 2.0 MB | Latency-critical systems |
3.2 Index Memory Overhead
Primary Index Structure:
```
Index entry size = key_size + offset (8 bytes) + size (4 bytes)
                 = 32 bytes (avg key) + 12 bytes
                 = 44 bytes
```
For 1M rows with 1-in-100 index sampling:
```
Index memory = 10K entries × 44 bytes = 440 KB per SSTable
```
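Combining the two formulas above into a sizing helper (decimal units, and the document's defaults of 10 bits/row and 1-in-100 sampling; an estimate, not a measurement of the engine):

```python
def per_sstable_memory_kb(rows, bits_per_row=10, avg_key_bytes=32,
                          index_sampling=100):
    """Estimate resident memory per SSTable: Bloom filter + sampled index."""
    bloom_kb = rows * bits_per_row / 8 / 1000
    # Each sampled index entry: key (32 B avg) + 8 B offset + 4 B size.
    index_kb = (rows // index_sampling) * (avg_key_bytes + 12) / 1000
    return bloom_kb + index_kb

# 1M rows -> 1,250 KB Bloom + 440 KB index ≈ 1.7 MB per SSTable,
# or ~1.7 GB across the 1,000-SSTable example in Section 3.1.
print(f"{per_sstable_memory_kb(1_000_000):,.0f} KB")  # -> 1,690 KB
```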
3.3 Memtable Memory Allocation
Recommended Configuration:
| Workload Type | Memtable Size | Flush Frequency | Peak Memory |
|---|---|---|---|
| Write-Heavy | 256 MB | 2-5 minutes | 768 MB (3× for compaction) |
| Balanced | 128 MB | 5-10 minutes | 384 MB |
| Read-Heavy | 64 MB | 10-20 minutes | 192 MB |
Formula:
```
Total memtable memory = memtable_size × (1 + num_immutable_memtables)
```
Recommended: 128 MB × 3 (1 active + 2 immutable) = 384 MB per table
4. Read Amplification Mitigation Strategies
4.1 Partition Pruning Impact
Scenario: Query with partition filter on date-partitioned table
Without Partition Pruning:
- Total partitions: 365 (daily for 1 year)
- SSTables per partition: 10
- Total SSTables scanned: 3,650
- Estimated latency: 150-300ms

With Partition Pruning (1-week query):
- Partitions scanned: 7
- SSTables scanned: 70
- Estimated latency: 5-15ms
- Improvement: 20-60x faster

Recommendation: Always define partitioning on time-series or categorical dimensions.
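The speedup is simple arithmetic; a sketch, with the per-SSTable cost treated as an assumed constant purely for illustration:

```python
def scan_cost(partitions, sstables_per_partition, ms_per_sstable=0.06):
    """Rough point-query model: SSTables touched × per-SSTable latency."""
    sstables = partitions * sstables_per_partition
    return sstables, sstables * ms_per_sstable

full_tables, full_ms = scan_cost(365, 10)  # unpruned: 3,650 SSTables
week_tables, week_ms = scan_cost(7, 10)    # one-week filter: 70 SSTables
print(f"{full_ms / week_ms:.0f}x faster")  # -> 52x, within the 20-60x range
```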
4.2 Bloom Filter Effectiveness
Expected Performance:
| Workload | Lookups Filtered by Bloom | Read Amplification Reduction |
|---|---|---|
| Point queries (existing keys) | 0% (every lookup must pass through) | None |
| Point queries (non-existent keys) | ~99% (filtered) | ~99% reduction |
| Mixed (70% exist, 30% miss) | ~30% filtered | ~30% reduction |
Key Insight: Bloom filters are critical for workloads with high miss rates.
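A quick sanity check of the mixed-workload row, assuming the ~1% false-positive rate of the 10 bits/row default:

```python
def bloom_reduction(miss_rate, false_positive_rate=0.01):
    """Fraction of SSTable probes avoided: only true misses are filterable,
    and a false positive still costs a disk read."""
    return miss_rate * (1 - false_positive_rate)

print(f"{bloom_reduction(0.30):.1%}")  # -> 29.7%, the ~30% in the table
```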
4.3 Cache Hit Rate Analysis
L1 Cache (Memtable):
- Size: 128 MB (default)
- Coverage: Last ~1M rows (assuming 128 bytes/row)
- Expected hit rate for recent writes: 80-95%
L2 Cache (Compute Node Block Cache):
- Size: 8-32 GB (configurable)
- Coverage: Hot working set
- Expected hit rate: 60-80% (steady state)
Combined Cache Effectiveness:
```
Effective read amplification = base × (1 - L1_hit_rate) × (1 - L2_hit_rate)
```
Example (STCS with 15 SSTables):
- Base read amplification: 15
- L1 hit rate: 10%
- L2 hit rate: 70%
- Effective: 15 × 0.9 × 0.3 = 4.05 SSTables
- Latency improvement: ~73% reduction
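The same model as a tiny helper (a direct transcription of the formula above, not a measured quantity):

```python
def effective_read_amp(base_sstables, l1_hit_rate, l2_hit_rate):
    """Only queries missing both the memtable (L1) and the block
    cache (L2) pay the full SSTable scan."""
    return base_sstables * (1 - l1_hit_rate) * (1 - l2_hit_rate)

print(f"{effective_read_amp(15, 0.10, 0.70):.2f}")  # -> 4.05
```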
5. Write Amplification Analysis
5.1 Compaction Write Amplification by Strategy
STCS Write Path:
```
Initial write:      1 write to commit log + 1 write to memtable
Memtable flush:     1 write to L0 SSTable
Compaction L0→L1:   4 writes (4-way merge)
Compaction L1→L2:   no additional (size-tiered)
Total:              ~6-8 writes per initial write
```
LCS Write Path:
```
Initial write:      1 write to commit log + 1 write to memtable
Memtable flush:     1 write to L0
L0→L1 compaction:   1 write
L1→L2 compaction:   10 writes (10x fanout)
L2→L3 compaction:   10 writes
Total:              ~22-30 writes per initial write
```
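A hedged estimator for the LCS tally above; it simply sums the per-stage rewrites, with the level count taken from the fanout model in Section 2.2:

```python
def lcs_write_amplification(levels, fanout=10):
    """Commit log + memtable (2), flush to L0 (1), L0->L1 (1), then
    roughly `fanout` rewrites per deeper level transition."""
    return 2 + 1 + 1 + fanout * max(levels - 2, 0)

# Four levels at 10x fanout: 2 + 1 + 1 + 10*2 = 24 writes per write,
# inside the ~22-30x range tallied above.
print(lcs_write_amplification(levels=4))  # -> 24
```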
5.2 Network Write Amplification (Replication)
Synchronous Mirroring:
```
Application write: 1 logical write
Physical writes:
  - Primary commit log:               1 write
  - Primary memtable:                 1 write
  - Mirror commit log (over network): 1 write
  - Mirror memtable:                  1 write

Network amplification: 1× (one network round-trip)
Storage amplification: 2× (primary + mirror)
```
Latency Impact:
```
Single-node write latency:    0.5-1ms (commit log + memtable)
Mirrored write latency (TCP): 1.5-3ms (single-node latency + 1 RTT)

RDMA RTT (same datacenter): 1-5 microseconds
TCP RTT (same datacenter):  0.5-2 milliseconds

Expected write latency with RDMA: 0.5ms + 5μs ≈ 0.505ms (negligible overhead)
Expected write latency with TCP:  0.5ms + 1ms ≈ 1.5ms (3x slowdown)
```
Recommendation: RDMA is critical for maintaining low write latency with synchronous replication.
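The latency model reduces to one addition; a sketch using the RTT figures above as assumptions rather than measurements:

```python
def mirrored_write_latency_ms(local_ms, rtt_ms):
    """Synchronous mirroring adds one network round-trip to the
    local commit-log + memtable write."""
    total = local_ms + rtt_ms
    return total, total / local_ms

for transport, rtt_ms in [("RDMA", 0.005), ("TCP", 1.0)]:
    total, slowdown = mirrored_write_latency_ms(0.5, rtt_ms)
    print(f"{transport}: {total:.3f} ms ({slowdown:.1f}x single-node)")
# RDMA: 0.505 ms (1.0x)    TCP: 1.500 ms (3.0x)
```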
6. Configuration Recommendations by Workload
6.1 Write-Heavy OLTP (>80% writes)
```toml
[storage.compaction]
strategy = "STCS"
sstable_size_mb = 160
min_merge_threshold = 4
max_merge_threshold = 32

[storage.memtable]
size_mb = 256
flush_threshold_mb = 200
num_memtables = 2

[storage.bloom_filter]
bits_per_row = 10
```
Expected Performance:
- Write throughput: 300K-500K ops/sec per node
- Point read latency (p99): 8-15ms
- Space efficiency: 55-65%
6.2 Read-Heavy OLTP (>70% reads)
```toml
[storage.compaction]
strategy = "LCS"
level_fanout = 10
max_sstable_size_mb = 160

[storage.memtable]
size_mb = 128
flush_threshold_mb = 96
num_memtables = 1

[storage.bloom_filter]
bits_per_row = 12

[storage.cache]
block_cache_mb = 16384  # 16 GB
```
Expected Performance:
- Write throughput: 100K-200K ops/sec per node
- Point read latency (p99): 1-3ms
- Space efficiency: 80-90%
6.3 HTAP (Mixed Workload)
```toml
[storage.compaction]
strategy = "Hybrid"
hot_data_strategy = "STCS"
cold_data_strategy = "LCS"
transition_age_hours = 48

[storage.memtable]
size_mb = 128
flush_threshold_mb = 96
num_memtables = 2

[storage.bloom_filter]
bits_per_row = 10

[storage.hcc]
enabled = true
cold_data_age_days = 30
compression_mode = "WAREHOUSE_OPTIMIZED"
```
Expected Performance:
- Write throughput: 200K-350K ops/sec per node
- Point read latency (p99): 2-6ms (hot), 5-15ms (cold)
- Space efficiency: 70-75% (hot), 90%+ (cold with HCC)
7. Monitoring and Tuning Metrics
Key Performance Indicators
Write Path:
- `lsm.memtable.flush_rate`: target 10-50 flushes/hour
- `lsm.compaction.pending_bytes`: should be < 10 GB
- `lsm.write_amplification`: monitor vs. expected values
- `replication.mirror_lag_ms`: target < 5 ms
Read Path:
- `lsm.read_amplification`: SSTables scanned per query
- `lsm.bloom_filter.false_positive_rate`: monitor vs. configured rate
- `cache.hit_rate.l1`: target > 10% (memtable)
- `cache.hit_rate.l2`: target > 60% (block cache)
Resource Usage:
- `memory.bloom_filters_mb`: should be < 5% of total memory
- `memory.memtable_mb`: should be predictable
- `disk.compaction_io_mb_per_sec`: should not saturate I/O
- `cpu.compaction_percent`: target < 30% average
Auto-Tuning Recommendations
Dynamic Compaction Throttling:
```
if write_throughput > 0.8 × max_throughput:
    increase compaction_threads by 1

if compaction_pending_bytes > 20 GB:
    reduce memtable_size by 25%
    increase flush frequency
```
Adaptive Bloom Filter Sizing:
```
if false_positive_rate > 2 × configured_rate:
    increase bits_per_row by 2

if memory_pressure > 0.9:
    decrease bits_per_row to minimum (8)
```
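Rendered as runnable Python (a deliberately simplified sketch; the metric and configuration names are illustrative, not HeliosDB's actual API):

```python
def autotune_step(metrics, config):
    """Apply the throttling and Bloom-sizing rules above to a
    mutable config dict; returns the (possibly updated) config."""
    if metrics["write_throughput"] > 0.8 * metrics["max_throughput"]:
        config["compaction_threads"] += 1
    if metrics["compaction_pending_gb"] > 20:
        config["memtable_size_mb"] = int(config["memtable_size_mb"] * 0.75)
    if metrics["false_positive_rate"] > 2 * config["configured_fp_rate"]:
        config["bits_per_row"] += 2
    if metrics["memory_pressure"] > 0.9:
        config["bits_per_row"] = 8  # floor from Section 3.1
    return config

config = {"compaction_threads": 4, "memtable_size_mb": 128,
          "bits_per_row": 10, "configured_fp_rate": 0.01}
metrics = {"write_throughput": 420_000, "max_throughput": 500_000,
           "compaction_pending_gb": 25, "false_positive_rate": 0.012,
           "memory_pressure": 0.5}
print(autotune_step(metrics, config))
# -> threads bumped to 5, memtable cut to 96 MB; Bloom settings unchanged
```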
8. Conclusion
Key Findings:
- Write Performance Priority: STCS provides 2-3x better write throughput but with 5-10x worse worst-case read latency
- Read Latency Priority: LCS provides consistent sub-2ms reads but reduces write throughput by 40-60%
- HTAP Sweet Spot: Hybrid strategy with age-based transitions balances both workloads
- RDMA Critical: Synchronous replication with RDMA adds only ~5μs, vs ~1ms with TCP
- Memory Investment: Allocating roughly 1.25-2 GB per billion rows for Bloom filters (10-16 bits/row, per Section 3.1) reduces read amplification by ~99% for non-existent keys
Recommended Starting Configuration:
- Compaction: Hybrid (STCS hot, LCS cold)
- Memtable: 128 MB
- Bloom filters: 10 bits/row
- Block cache: 16-32 GB per compute node
- Monitor and tune based on actual workload patterns
Next Steps:
- Implement workload-adaptive compaction strategy selection
- Develop real-time compaction throttling based on resource utilization
- Build cost model for automatic cache sizing