Group Commit WAL - Architecture Diagrams
Document Version: 1.0
Date: 2025-11-10
Related: GROUP_COMMIT_WAL_ARCHITECTURE.md
1. System Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│                          Application Layer                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │
│  │ Transaction  │  │ MVCC Version │  │ Query Engine │               │
│  │ Coordinator  │  │    Store     │  │              │               │
│  └──────┬───────┘  └──────┬───────┘  └──────────────┘               │
│         │                 │                                         │
└─────────┼─────────────────┼─────────────────────────────────────────┘
          │                 │
          │    ┌────────────┘
          │    │
          ▼    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       Group Commit WAL Layer                        │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                       GroupCommitWal                        │    │
│  │                                                             │    │
│  │  append(entry) → LSN                                        │    │
│  │  wait_for_lsn(lsn) → Result<()>                             │    │
│  │                                                             │    │
│  └──────────┬──────────────────────────────────────────────────┘    │
│             │                                                       │
│             │  1. Assign LSN (atomic)                               │
│             │  2. Enqueue to pending queue                          │
│             │  3. Return immediately                                │
│             │                                                       │
│             ▼                                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                   Lock-Free Pending Queue                   │    │
│  │                                                             │    │
│  │  ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐                     │    │
│  │  │ LSN:1│→ │ LSN:2│→ │ LSN:3│→ │ LSN:4│ → ...               │    │
│  │  │Entry │  │Entry │  │Entry │  │Entry │                     │    │
│  │  └──────┘  └──────┘  └──────┘  └──────┘                     │    │
│  │                                                             │    │
│  └──────────┬──────────────────────────────────────────────────┘    │
│             │                                                       │
│             │  Flush Thread polls queue                             │
│             │                                                       │
│             ▼                                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                        Flush Thread                         │    │
│  │                                                             │    │
│  │  Loop:                                                      │    │
│  │    1. Collect batch (max 100 entries OR 10ms timeout)       │    │
│  │    2. Write all entries to file                             │    │
│  │    3. Single fsync() for entire batch                       │    │
│  │    4. Update last_flushed_lsn                               │    │
│  │    5. Notify all waiters                                    │    │
│  │                                                             │    │
│  └──────────┬──────────────────────────────────────────────────┘    │
│             │                                                       │
│             ▼                                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                          WAL File                           │    │
│  │                                                             │    │
│  │  [Entry1][Entry2][Entry3]...[EntryN]                        │    │
│  │                                                             │    │
│  │  Each entry: LSN | Type | TxnID | Data | Checksum           │    │
│  │                                                             │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

2. Write Path - Detailed Flow
Client Thread                     Pending Queue            Flush Thread               Disk
     │                                  │                        │                      │
     │ 1. append(entry)                 │                        │                      │
     ├─────────────────────────────────>│                        │                      │
     │                                  │                        │                      │
     │ 2. Assign LSN (atomic)           │                        │                      │
     │    lsn = counter.fetch_add()     │                        │                      │
     │                                  │                        │                      │
     │ 3. Create waiter                 │                        │                      │
     │    (notify, result)              │                        │                      │
     │                                  │                        │                      │
     │ 4. Enqueue pending entry         │                        │                      │
     ├─────────────────────────────────>│                        │                      │
     │                                  │                        │                      │
     │ 5. Return LSN immediately        │                        │                      │
     │<─────────────────────────────────┤                        │                      │
     │                                  │                        │                      │
     │ 6. wait_for_lsn(lsn) [OPTIONAL]  │                        │                      │
     │    (blocks until flush)          │                        │                      │
     │ ┌──────────────────────────────┐ │                        │                      │
     │ │        Waiter sleeps         │ │                        │                      │
     │ └──────────────────────────────┘ │                        │                      │
     │                                  │                        │                      │
     │                                  │ 7. Collect batch       │                      │
     │                                  │    (every 10ms or 100  │                      │
     │                                  │     entries)           │                      │
     │                                  │<───────────────────────┤                      │
     │                                  │                        │                      │
     │                                  │                        │ 8. Write entries     │
     │                                  │                        │    (buffered)        │
     │                                  │                        ├─────────────────────>│
     │                                  │                        │                      │
     │                                  │                        │ 9. fsync()           │
     │                                  │                        │    (single call      │
     │                                  │                        │     for batch)       │
     │                                  │                        │<─────────────────────┤
     │                                  │                        │                      │
     │                                  │                        │ 10. Update           │
     │                                  │                        │     last_flushed_lsn │
     │                                  │                        │                      │
     │                                  │ 11. Notify waiters     │                      │
     │<─────────────────────────────────┼────────────────────────┤                      │
     │ result = Ok(())                  │                        │                      │
     │                                  │                        │                      │
     │ 12. wait_for_lsn() returns       │                        │                      │
     │                                  │                        │                      │
     ▼                                  ▼                        ▼                      ▼

Key Points:
- Steps 1-5: Fast path (~1μs), no I/O
- Step 6: Optional wait for durability
- Steps 7-10: Batched flush (amortizes fsync cost)
- Steps 11-12: Waiter notification (all waiters in batch)
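
The write path above can be sketched in Rust using only the standard library. This is a minimal illustration under stated assumptions, not the project's implementation: a `Mutex<VecDeque>` stands in for the lock-free queue, a `Vec<u8>` stands in for the WAL file, fsync is simulated by a counter, and `wait_for_lsn` returns `()` instead of `Result<()>`. The names `MAX_BATCH`, `FLUSH_INTERVAL`, `Queue`, `Shared`, `start`, `fsync_count`, `log_len`, and `shutdown` are invented for this sketch. The LSN is assigned while the queue lock is held so that queue order always matches LSN order; a design that assigns the LSN with a separate atomic `fetch_add` before enqueueing would need some other way to ensure `last_flushed_lsn` never passes an entry that is still in flight.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

const MAX_BATCH: usize = 100; // size trigger from the diagram
const FLUSH_INTERVAL: Duration = Duration::from_millis(10); // time trigger

#[derive(Default)]
struct Queue {
    next_lsn: u64,                     // last LSN handed out
    entries: VecDeque<(u64, Vec<u8>)>, // pending (lsn, payload) pairs
    shutdown: bool,
}

#[derive(Default)]
struct Shared {
    queue: Mutex<Queue>, // mutex-protected stand-in for the lock-free queue
    queue_cv: Condvar,   // wakes the flush thread early when a batch fills
    flushed: Mutex<u64>, // last_flushed_lsn
    flushed_cv: Condvar, // wakes wait_for_lsn callers
    log: Mutex<Vec<u8>>, // in-memory stand-in for the WAL file
    fsyncs: AtomicU64,   // number of simulated fsync() calls
}

pub struct GroupCommitWal {
    shared: Arc<Shared>,
    flusher: JoinHandle<()>,
}

impl GroupCommitWal {
    pub fn start() -> Self {
        let shared = Arc::new(Shared::default());
        let for_thread = Arc::clone(&shared);
        let flusher = thread::spawn(move || flush_loop(&for_thread));
        GroupCommitWal { shared, flusher }
    }

    /// Steps 1-5: assign an LSN, enqueue, and return without doing I/O.
    pub fn append(&self, data: Vec<u8>) -> u64 {
        let mut q = self.shared.queue.lock().unwrap();
        q.next_lsn += 1; // assigned under the lock so queue order == LSN order
        let lsn = q.next_lsn;
        q.entries.push_back((lsn, data));
        if q.entries.len() >= MAX_BATCH {
            self.shared.queue_cv.notify_one(); // size trigger reached
        }
        lsn
    }

    /// Step 6: block until the batch containing `lsn` has been flushed.
    pub fn wait_for_lsn(&self, lsn: u64) {
        let mut flushed = self.shared.flushed.lock().unwrap();
        while *flushed < lsn {
            flushed = self.shared.flushed_cv.wait(flushed).unwrap();
        }
    }

    pub fn fsync_count(&self) -> u64 { self.shared.fsyncs.load(Ordering::SeqCst) }
    pub fn log_len(&self) -> usize { self.shared.log.lock().unwrap().len() }

    /// Flush whatever is still queued, then stop the flush thread.
    pub fn shutdown(self) {
        self.shared.queue.lock().unwrap().shutdown = true;
        self.shared.queue_cv.notify_one();
        self.flusher.join().unwrap();
    }
}

/// Steps 7-11: collect a batch, write it, "fsync" once, publish, notify.
fn flush_loop(s: &Shared) {
    loop {
        let deadline = Instant::now() + FLUSH_INTERVAL;
        let mut q = s.queue.lock().unwrap();
        while q.entries.len() < MAX_BATCH && !q.shutdown {
            let now = Instant::now();
            if now >= deadline { break; } // time trigger reached
            q = s.queue_cv.wait_timeout(q, deadline - now).unwrap().0;
        }
        let take = q.entries.len().min(MAX_BATCH);
        let batch: Vec<(u64, Vec<u8>)> = q.entries.drain(..take).collect();
        let stop = q.shutdown && q.entries.is_empty();
        drop(q); // appends may continue while this batch is written
        if let Some(&(last_lsn, _)) = batch.last() {
            let mut log = s.log.lock().unwrap();
            for (_, data) in &batch {
                log.extend_from_slice(data);
            }
            s.fsyncs.fetch_add(1, Ordering::SeqCst); // one fsync per batch
            *s.flushed.lock().unwrap() = last_lsn;
            s.flushed_cv.notify_all();
        }
        if stop { return; }
    }
}
```

A caller would typically `append` several records and call `wait_for_lsn` only on the LSN whose durability matters. Because batches are written in LSN order, a flushed LSN implies that every earlier LSN has been flushed as well.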
3. Batching Strategy Visualization
Time-Based Flush (10ms interval):
─────────────────────────────────────────────────────────────>
          ↓                    ↓                    ↓
       Flush 1              Flush 2              Flush 3
     (5 entries)          (8 entries)          (3 entries)

Size-Based Flush (100 entries):
Entry:  1  2  3  4  5  ...  98  99  100
                                     ↓
                            Flush (100 entries)

Hybrid Approach (WHICHEVER COMES FIRST):

Scenario 1 - Time wins:
─────────────────> 10ms elapsed
                 ↓
        Flush (20 entries collected so far)

Scenario 2 - Size wins:
Entry:  1  2  3  ...  100
                       ↓
        Flush (100 entries, only 5ms elapsed)

4. Durability Modes Comparison
┌─────────────────────────────────────────────────────────────────────┐
│                          SYNCHRONOUS MODE                           │
│                                                                     │
│  append(entry) ──> Write ──> fsync() ──> Return                     │
│                                                                     │
│  Latency: ~10ms per append                                          │
│  Throughput: ~100 commits/sec                                       │
│  Durability: Immediate ✓                                            │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                     GROUP COMMIT MODE (Default)                     │
│                                                                     │
│  append(entry) ──> Enqueue ──> Return (LSN)                         │
│                       │                                             │
│                       │  [Batch collect]                            │
│                       │                                             │
│                       └──> fsync() ──> Notify waiters               │
│                                                                     │
│  Latency: ~15ms (avg wait 5ms + fsync 10ms)                         │
│  Throughput: ~10,000 commits/sec                                    │
│  Durability: On wait_for_lsn() ✓                                    │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                             ASYNC MODE                              │
│                                                                     │
│  append(entry) ──> Enqueue ──> Return (LSN)                         │
│                                                                     │
│  [Client never waits]                                               │
│                                                                     │
│  Latency: <1ms                                                      │
│  Throughput: ~1M commits/sec (memory-bound)                         │
│  Durability: None (data may be lost) ✗                              │
└─────────────────────────────────────────────────────────────────────┘

5. Recovery Protocol State Machine
                    ┌─────────────┐
                    │    START    │
                    └──────┬──────┘
                           │
                           ▼
                    ┌─────────────┐
                    │  Open WAL   │
                    │    File     │
                    └──────┬──────┘
                           │
                           ▼
                    ┌─────────────┐
          ┌────────<│ Read Entry  │>────────┐
          │         └──────┬──────┘         │
          │                │                │
          │ EOF?           │ Success        │ Error?
          │                │                │
          │                ▼                │
          │         ┌─────────────┐         │
          │         │  Validate   │         │
          │         │  Checksum   │─────────┤ Failed?
          │         └──────┬──────┘         │
          │                │                │
          │                │ OK?            │
          │                ▼                │
          │         ┌─────────────┐         │
          │         │   Add to    │         │
          │         │  Recovered  │         │
          │         │    List     │         │
          │         └──────┬──────┘         │
          │                │                │
          │                └─> Read Entry   │
          │                    (next entry) │
          │                                 │
          ▼                                 ▼
   ┌─────────────┐                   ┌─────────────┐
   │  Recovery   │                   │  Truncate   │
   │  Complete   │                   │  File at    │
   │             │                   │ Last Valid  │
   │   Return    │                   │   Offset    │
   │  Entries    │                   │             │
   └─────────────┘                   └──────┬──────┘
                                            │
                                            ▼
                                     ┌─────────────┐
                                     │   Return    │
                                     │  Partial    │
                                     │  Entries    │
                                     └─────────────┘

Recovery Guarantees:
- All entries with valid checksums are recovered
- Recovery stops at the first invalid entry, so any corruption ends up in the discarded tail of the file
- File is truncated to last valid boundary
- No partial entries visible
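
The recovery loop above can be sketched as a scan over the raw WAL bytes. The byte layout used here (little-endian `lsn`, a one-byte type, `txn_id`, a 4-byte payload length, the payload, then a 4-byte checksum) is an assumption made for illustration; the document only states the field order `LSN | Type | TxnID | Data | Checksum`. FNV-1a is used as a small stand-in for a production checksum such as CRC32, and `Entry`, `encode`, and `recover` are hypothetical names.

```rust
/// FNV-1a: a small stand-in for a real checksum such as CRC32.
fn checksum(bytes: &[u8]) -> u32 {
    let mut h: u32 = 0x811c_9dc5;
    for &b in bytes {
        h ^= b as u32;
        h = h.wrapping_mul(0x0100_0193);
    }
    h
}

#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub lsn: u64,
    pub kind: u8,
    pub txn_id: u64,
    pub data: Vec<u8>,
}

// Assumed header: lsn (8) + type (1) + txn_id (8) + data length (4).
const HEADER: usize = 8 + 1 + 8 + 4;

fn read_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

fn read_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

/// Serialize one entry, appending a checksum over header and payload.
pub fn encode(e: &Entry) -> Vec<u8> {
    let mut buf = Vec::with_capacity(HEADER + e.data.len() + 4);
    buf.extend_from_slice(&e.lsn.to_le_bytes());
    buf.push(e.kind);
    buf.extend_from_slice(&e.txn_id.to_le_bytes());
    buf.extend_from_slice(&(e.data.len() as u32).to_le_bytes());
    buf.extend_from_slice(&e.data);
    let sum = checksum(&buf);
    buf.extend_from_slice(&sum.to_le_bytes());
    buf
}

/// Scan the WAL from the start. Returns the entries that validated and the
/// byte offset of the last valid entry boundary (the truncation point).
pub fn recover(wal: &[u8]) -> (Vec<Entry>, usize) {
    let mut entries = Vec::new();
    let mut offset = 0;
    while offset < wal.len() {
        let rest = &wal[offset..];
        if rest.len() < HEADER {
            break; // partial header: crash during write
        }
        let data_len = read_u32(&rest[17..21]) as usize;
        let total = HEADER + data_len + 4;
        if rest.len() < total {
            break; // partial payload or checksum
        }
        let (body, sum) = rest[..total].split_at(total - 4);
        if checksum(body) != read_u32(sum) {
            break; // corruption: stop at the first invalid entry
        }
        entries.push(Entry {
            lsn: read_u64(&body[0..8]),
            kind: body[8],
            txn_id: read_u64(&body[9..17]),
            data: body[HEADER..].to_vec(),
        });
        offset += total;
    }
    (entries, offset)
}
```

With a real file, the caller would then truncate to the returned offset, for example with `File::set_len(offset as u64)`, so that later appends start at a clean entry boundary.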
6. Performance Model
WITHOUT Group Commit:
═══════════════════════════════════════════════════════════════════
Transaction 1: Write ──> fsync() ──────────> Return (10ms)
Transaction 2: Write ──> fsync() ──────────> Return (10ms)
Transaction 3: Write ──> fsync() ──────────> Return (10ms)

Total Time: 30ms for 3 transactions
Throughput: ~100 transactions/sec (1000ms / 10ms)

WITH Group Commit:
═══════════════════════════════════════════════════════════════════
Transaction 1: Write ──> Enqueue ──> Return (1μs)
                            │
Transaction 2: Write ──> Enqueue ──> Return (1μs)
                            │
Transaction 3: Write ──> Enqueue ──> Return (1μs)
                            │
                            ├──> Batch Collect (10ms)
                            │
                            └──> Single fsync() ──> Notify All (10ms)

Total Time: 10ms for 3 transactions (batched)
Throughput: ~10,000 transactions/sec (100 per batch * 100 batches/sec)
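
The two throughput figures above follow from simple arithmetic, sketched here in Rust. The function names are hypothetical, and the model takes the document's numbers at face value: a 10 ms fsync, and a flush thread that completes one full batch per fsync (that is, batch collection overlaps with the previous fsync).

```rust
/// Commits per second when every commit issues its own fsync.
pub fn per_commit_fsync_throughput(fsync_ms: f64) -> f64 {
    1000.0 / fsync_ms
}

/// Commits per second when one fsync covers a whole batch and the flush
/// thread completes one batch per fsync.
pub fn group_commit_throughput(batch_size: f64, fsync_ms: f64) -> f64 {
    batch_size * (1000.0 / fsync_ms)
}

/// Ratio between the two models; under these assumptions it equals batch_size.
pub fn speedup(batch_size: f64, fsync_ms: f64) -> f64 {
    group_commit_throughput(batch_size, fsync_ms) / per_commit_fsync_throughput(fsync_ms)
}
```

Under this model the gain equals the batch size, so it shrinks when batches are only partly full at flush time.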

IMPROVEMENT: 100x throughput increase!

7. Concurrency Model
┌───────────────────────────────────────────────────────────────────┐
│                      Multiple Client Threads                      │
│                                                                   │
│   Thread 1    Thread 2    Thread 3    Thread 4    Thread 5        │
│      │           │           │           │           │            │
│      │ append()  │ append()  │ append()  │ append()  │ append()   │
│      ▼           ▼           ▼           ▼           ▼            │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                 Lock-Free Queue (SegQueue)                  │  │
│  │                                                             │  │
│  │  No locks required!                                         │  │
│  │  - Atomic LSN assignment                                    │  │
│  │  - Lock-free push operations                                │  │
│  │  - No contention on append                                  │  │
│  │                                                             │  │
│  └──────────────────────────────┬──────────────────────────────┘  │
│                                 │                                 │
│                                 │  Single consumer                │
│                                 │                                 │
│                                 ▼                                 │
│                         ┌───────────────┐                         │
│                         │ Flush Thread  │                         │
│                         │               │                         │
│                         │ - Pop batch   │                         │
│                         │ - Write       │                         │
│                         │ - fsync       │                         │
│                         │ - Notify      │                         │
│                         └───────────────┘                         │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘

Key Benefits:
- No lock contention on write path
- Linear scalability with number of client threads
- Single I/O thread optimizes for batching

8. Failure Scenarios
Scenario 1: Crash During Write
──────────────────────────────────────────────────────────────────
WAL File Before Crash:
[Entry 1][Entry 2][Entry 3][Partial Entry 4

Recovery:
- Read Entry 1: Valid checksum ✓
- Read Entry 2: Valid checksum ✓
- Read Entry 3: Valid checksum ✓
- Read Entry 4: UnexpectedEof ✗

Result:
- Truncate file after Entry 3
- Recover entries 1-3
- Entry 4 lost (not flushed)

Scenario 2: Crash During fsync
──────────────────────────────────────────────────────────────────
WAL File Before Crash:
[Entry 1][Entry 2][Entry 3][Entry 4] ← fsync in progress
                               ↑
                     checksum not updated

Recovery:
- Read Entry 1: Valid checksum ✓
- Read Entry 2: Valid checksum ✓
- Read Entry 3: Valid checksum ✓
- Read Entry 4: Invalid checksum ✗

Result:
- Truncate file after Entry 3
- Recover entries 1-3
- Entry 4 rolled back (batch atomicity)

Scenario 3: Disk Corruption
──────────────────────────────────────────────────────────────────
WAL File:
[Entry 1][Entry 2][Corrupted Data][Entry 4][Entry 5]
                         ↑

Recovery:
- Read Entry 1: Valid checksum ✓
- Read Entry 2: Valid checksum ✓
- Read Entry 3: Invalid checksum ✗
- STOP (corruption detected)

Result:
- Truncate file after Entry 2
- Recover entries 1-2
- Entries 3-5 lost (corruption boundary)

9. Integration with Transaction Manager
┌─────────────────────────────────────────────────────────────────┐
│                     Transaction Coordinator                     │
│                                                                 │
│  begin_transaction()                                            │
│    └─> wal.append(WalEntry::Begin)                              │
│        (Don't wait - optimization)                              │
│                                                                 │
│  execute_operations()                                           │
│    └─> wal.append(WalEntry::Data)                               │
│        (Don't wait - accumulate changes)                        │
│                                                                 │
│  commit_transaction()                                           │
│    └─> lsn = wal.append(WalEntry::Commit)                       │
│        wal.wait_for_lsn(lsn)  ← CRITICAL: Wait for durability   │
│        release_locks()                                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                       MVCC Version Store                        │
│                                                                 │
│  create_version(key, value, txn_id)                             │
│    └─> lsn = wal.append(WalEntry::Data)                         │
│        (Don't wait - version not visible until commit)          │
│                                                                 │
│  commit_version(txn_id, commit_lsn)                             │
│    └─> wal.wait_for_lsn(commit_lsn)                             │
│        mark_versions_visible(txn_id)                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
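
The commit flow in the diagram can be sketched against a small trait. Everything here is illustrative: the `Wal` trait, the fields on `WalEntry`, the `RecordingWal` test double, and `run_transaction` are assumptions rather than the project's actual API. The point shown is that only the commit record's LSN is waited on.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum WalEntry {
    Begin { txn_id: u64 },
    Data { txn_id: u64, payload: Vec<u8> },
    Commit { txn_id: u64 },
}

/// Hypothetical interface mirroring append(entry) and wait_for_lsn(lsn).
pub trait Wal {
    fn append(&mut self, entry: WalEntry) -> u64;
    fn wait_for_lsn(&mut self, lsn: u64) -> Result<(), String>;
}

/// Test double that records what was appended and which LSNs were waited on.
#[derive(Default)]
pub struct RecordingWal {
    pub entries: Vec<WalEntry>,
    pub waits: Vec<u64>,
}

impl Wal for RecordingWal {
    fn append(&mut self, entry: WalEntry) -> u64 {
        self.entries.push(entry);
        self.entries.len() as u64 // LSNs start at 1
    }

    fn wait_for_lsn(&mut self, lsn: u64) -> Result<(), String> {
        self.waits.push(lsn);
        Ok(())
    }
}

/// Begin, write, commit: only the commit record blocks on durability.
pub fn run_transaction<W: Wal>(
    wal: &mut W,
    txn_id: u64,
    writes: &[&[u8]],
) -> Result<u64, String> {
    wal.append(WalEntry::Begin { txn_id }); // fire-and-forget
    for w in writes {
        let payload = w.to_vec();
        wal.append(WalEntry::Data { txn_id, payload }); // fire-and-forget
    }
    let commit_lsn = wal.append(WalEntry::Commit { txn_id });
    wal.wait_for_lsn(commit_lsn)?; // the single durability wait
    // release_locks() / mark_versions_visible(txn_id) would follow here
    Ok(commit_lsn)
}
```

Waiting on the commit LSN alone is sufficient only if the WAL flushes entries in LSN order, so that a durable commit record implies the transaction's earlier BEGIN and DATA records are durable too.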

Key Insight:
- Only COMMIT records need to wait for durability
- BEGIN and DATA records can be fire-and-forget
- This maximizes batching efficiency

10. Monitoring Dashboard Layout
┌─────────────────────────────────────────────────────────────────────┐
│                     WAL Group Commit Dashboard                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  THROUGHPUT                                                         │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ Commits/sec: 10,234 ████████████████████████████ (Target: 10K)│  │
│  │ Fsyncs/sec:  102    ██ (90% reduction from 1000)              │  │
│  │ Avg Batch:   100    ████████████████████████████████          │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  LATENCY                                                            │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ P50:  12ms  ███████                                           │  │
│  │ P90:  18ms  ████████████                                      │  │
│  │ P99:  22ms  ██████████████ (Target: <20ms) ⚠                  │  │
│  │ P999: 45ms  ██████████████████████████████                    │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  QUEUE HEALTH                                                       │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ Pending:     23 entries                                       │  │
│  │ Max Depth:   87 entries                                       │  │
│  │ Utilization: 26%  ████                                        │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  ALERTS                                                             │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ ⚠ P99 latency above target (22ms > 20ms)                      │  │
│  │ ✓ All other metrics healthy                                   │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

11. Configuration Tuning Matrix
┌───────────────────────────────────────────────────────────────────┐
│                       Workload-Based Tuning                       │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  OLTP (Low Latency)                                               │
│  ┌────────────────────────────┬─────────────────────────────┐     │
│  │ flush_interval: 5ms        │ batch_size: 50              │     │
│  │ Expected: 50K/sec          │ P99 latency: 8ms            │     │
│  └────────────────────────────┴─────────────────────────────┘     │
│                                                                   │
│  Mixed Workload (Balanced)                                        │
│  ┌────────────────────────────┬─────────────────────────────┐     │
│  │ flush_interval: 10ms       │ batch_size: 100             │     │
│  │ Expected: 100K/sec         │ P99 latency: 15ms           │     │
│  └────────────────────────────┴─────────────────────────────┘     │
│                                                                   │
│  Analytics (High Throughput)                                      │
│  ┌────────────────────────────┬─────────────────────────────┐     │
│  │ flush_interval: 20ms       │ batch_size: 500             │     │
│  │ Expected: 250K/sec         │ P99 latency: 30ms           │     │
│  └────────────────────────────┴─────────────────────────────┘     │
│                                                                   │
│  Batch Processing (Maximum Throughput)                            │
│  ┌────────────────────────────┬─────────────────────────────┐     │
│  │ flush_interval: 50ms       │ batch_size: 1000            │     │
│  │ Expected: 500K/sec         │ P99 latency: 75ms           │     │
│  └────────────────────────────┴─────────────────────────────┘     │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘

Document Status: Complete
Last Updated: 2025-11-10
Related Documents:
- GROUP_COMMIT_WAL_ARCHITECTURE.md (Full specification)
- GROUP_COMMIT_WAL_IMPLEMENTATION_ROADMAP.md (Implementation guide)
- GROUP_COMMIT_WAL_TEST_STRATEGY.md (Testing plan)