Index Durability in WAL - Design Document
Date: November 5, 2025
Status: 🔴 CRITICAL - Data Integrity Risk
Priority: P0 - Must fix before production
Estimated Effort: 2 weeks
Executive Summary
HeliosDB’s global secondary indexes are NOT durable. Index updates are not logged in the Write-Ahead Log (WAL), causing:
- ❌ Index corruption on crash - Stale entries remain after recovery
- ❌ Table-index inconsistency - Tables recover, indexes don’t
- ❌ Data loss - Index must be rebuilt from scratch after crash
- ❌ Long recovery time - Full table scan required to rebuild indexes
Impact: A production crash leaves secondary indexes corrupted until they are manually rebuilt.
Problem Statement
Current Architecture
```sql
UPDATE users SET email='new@email.com' WHERE id=1;
```

```
┌─────────────────────────────────────────────────────────────┐
│ 1. UPDATE row in table                                      │
│    └─→ WAL entry: [UPDATE, key=user:1, value=...]           │
│    └─→ Memtable write                                       │
│    └─→ DURABLE (survives crash)                             │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ 2. Update index entry                                       │
│    └─→ ❌ NO WAL entry                                      │
│    └─→ In-memory B-Tree update                              │
│    └─→ ❌ NOT DURABLE (lost on crash)                       │
└─────────────────────────────────────────────────────────────┘
```
💥 CRASH HERE
```
┌─────────────────────────────────────────────────────────────┐
│ 3. Recovery                                                 │
│    ├─→ Table: Recovered from WAL                            │
│    └─→ Index: ❌ Corrupted (points to old email)            │
└─────────────────────────────────────────────────────────────┘
```
Example Failure Scenario
```sql
-- Initial state
INSERT INTO users (id, email) VALUES (1, 'alice@example.com');
CREATE INDEX idx_email ON users(email);

-- DML operation
UPDATE users SET email='bob@example.com' WHERE id=1;

-- State before crash:
-- Table: id=1, email='bob@example.com'              (written to WAL)
-- Index: email='bob@example.com'   → row_id=1       (NOT in WAL)
-- Index: email='alice@example.com' → [tombstone]    (NOT in WAL)
```
💥 CRASH
```sql
-- State after recovery:
-- Table: id=1, email='bob@example.com'              (recovered)
-- Index: email='alice@example.com' → row_id=1       (❌ stale!)
-- Index: email='bob@example.com'   → [MISSING]

-- Query failure:
SELECT * FROM users WHERE email='bob@example.com';
-- Returns: 0 rows (❌ WRONG! Should return id=1)

SELECT * FROM users WHERE email='alice@example.com';
-- Returns: id=1 with email='bob@example.com' (❌ INCONSISTENT!)
```
Root Cause Analysis
Where Indexes Are Stored
File: heliosdb-indexes/src/global_secondary.rs
Index Storage ❌ (In-Memory, Not Durable)
```rust
// Line 48-60
pub struct GlobalIndexManager {
    /// Index definitions
    indexes: DashMap<String, Arc<IndexState>>,  // ❌ In-memory only

    /// Index storage (B-Trees)
    storage: Arc<dyn IndexStorage>,             // ❌ Separate from LSM WAL
}

// Index writes go to separate storage, NOT LSM WAL
impl GlobalIndexManager {
    pub async fn insert_entry(&self, index_name: &str, entry: IndexEntry) -> Result<()> {
        let index = self.indexes.get(index_name)?;
        index.storage.insert(entry)?;  // ❌ No WAL!
        Ok(())
    }
}
```
Table Storage (Durable) vs Index Storage (Not Durable)
```
┌─────────────────────────────────────────────────────────────┐
│ Table Storage (heliosdb-storage/src/lsm.rs)                 │
├─────────────────────────────────────────────────────────────┤
│ Write Path:                                                 │
│   1. Log to WAL (commitlog)                                 │
│   2. Write to Memtable                                      │
│   3. Flush to SSTable                                       │
│   4. Crash recovery: Replay WAL                             │
│                                                             │
│ Result: FULLY DURABLE                                       │
└─────────────────────────────────────────────────────────────┘
```

```
┌─────────────────────────────────────────────────────────────┐
│ Index Storage (heliosdb-indexes/src/btree_storage.rs)       │
├─────────────────────────────────────────────────────────────┤
│ Write Path:                                                 │
│   1. Log to WAL ❌ MISSING                                  │
│   2. Write to in-memory B-Tree                              │
│   3. Persist to disk (eventually) ⚠ Async                   │
│   4. Crash recovery: ??? ❌ NO REPLAY                       │
│                                                             │
│ Result: NOT DURABLE ❌                                      │
└─────────────────────────────────────────────────────────────┘
```
Design Solution
Option 1: Shared WAL (Recommended) ⭐
Approach: Index operations write to the same WAL as table operations
```
┌─────────────────────────────────────────────────────────────┐
│ SHARED WAL                                                  │
├─────────────────────────────────────────────────────────────┤
│ [Table Entry]  table_id=1, key=user:1, value=...            │
│ [Index Entry]  index_id=5, key=alice@..., value=row_id:1    │
│ [Index Delete] index_id=5, key=alice@... [tombstone]        │
│ [Index Entry]  index_id=5, key=bob@..., value=row_id:1      │
└─────────────────────────────────────────────────────────────┘
                  ↓ Recovery replays BOTH ↓
             ┌────────────────────────────────┐
             │ Tables            Indexes      │
             └────────────────────────────────┘
```
Pros:
- Atomic durability (table + indexes together)
- Single recovery path
- Guaranteed consistency
- Minimal complexity
Cons:
- ⚠ WAL size increases (mitigated by compression)
- ⚠ Index recovery time on startup
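As a concrete illustration of this ordering (a sketch, not existing code), the UPDATE from the Problem Statement would append one table entry plus a delete/insert pair for the affected index into the same commitlog, using the WalEntry variants defined under "WAL Entry Format" below; the timestamps and IDs are made up for the example:

```rust
// Entries appended for: UPDATE users SET email='bob@example.com' WHERE id=1;
// All three land in the same commitlog, so recovery replays them together.
let entries = vec![
    WalEntry::TableWrite {
        table_id: 1,
        key: b"user:1".to_vec(),
        value: b"bob@example.com".to_vec(),
        timestamp: 100,
    },
    // Tombstone for the old index key...
    WalEntry::IndexDelete {
        index_id: 5,
        index_name: "idx_email".to_string(),
        key: IndexKey::String("alice@example.com".to_string()),
        timestamp: 101,
    },
    // ...followed by the new index key pointing at the same row.
    WalEntry::IndexWrite {
        index_id: 5,
        index_name: "idx_email".to_string(),
        key: IndexKey::String("bob@example.com".to_string()),
        value: RowLocation { shard_id: "shard_0".to_string(), row_key: 1 },
        timestamp: 102,
    },
];
```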
Option 2: Separate Index WAL
Approach: Indexes have their own WAL, coordinated with table WAL
Pros:
- Index changes isolated from table WAL
- Can replay indexes independently
Cons:
- ❌ Complex 2PC coordination required
- ❌ Crash in between commits = inconsistency
- ❌ Recovery more complex (must replay both in order)
Verdict: ❌ Too complex, use Option 1
Architecture: Shared WAL (Option 1)
WAL Entry Format
```rust
#[derive(Serialize, Deserialize)]
pub enum WalEntry {
    // Existing: Table operations
    TableWrite { table_id: u64, key: Vec<u8>, value: Vec<u8>, timestamp: u64 },
    TableDelete { table_id: u64, key: Vec<u8>, timestamp: u64 },

    // NEW: Index operations
    IndexWrite {
        index_id: u64,
        index_name: String,
        key: IndexKey,       // Indexed column value
        value: RowLocation,  // Pointer to table row
        timestamp: u64,
    },
    IndexDelete {
        index_id: u64,
        index_name: String,
        key: IndexKey,
        timestamp: u64,
    },
}
```
Write Path Changes
```rust
impl LsmStorageEngine {
    // NEW: Log index operation to WAL
    pub async fn log_index_write(
        &self,
        index_id: u64,
        index_name: &str,
        key: &IndexKey,
        location: &RowLocation,
    ) -> Result<()> {
        let entry = WalEntry::IndexWrite {
            index_id,
            index_name: index_name.to_string(),
            key: key.clone(),
            value: location.clone(),
            timestamp: self.timestamp_counter.fetch_add(1, Ordering::SeqCst),
        };

        // Write to WAL
        self.commitlog.append(&entry).await?;
        Ok(())
    }
}
```
Index Manager Integration
```rust
pub struct GlobalIndexManager {
    indexes: DashMap<String, Arc<IndexState>>,

    // NEW: Reference to storage engine for WAL access
    storage_engine: Option<Arc<LsmStorageEngine>>,
}

impl GlobalIndexManager {
    pub async fn insert_entry(&self, index_name: &str, entry: IndexEntry) -> Result<()> {
        let index = self.indexes.get(index_name)?;

        // NEW: Log to WAL first
        if let Some(storage) = &self.storage_engine {
            storage.log_index_write(
                index.definition.index_id,
                index_name,
                &entry.key,
                &entry.location,
            ).await?;
        }

        // Then update in-memory index
        index.storage.insert(entry)?;

        Ok(())
    }
}
```
Recovery Path
```rust
impl Recovery {
    pub async fn replay_wal(
        &self,
        storage: &LsmStorageEngine,
        index_manager: &GlobalIndexManager,
    ) -> Result<()> {
        for entry in self.commitlog.read_all()? {
            match entry {
                WalEntry::TableWrite { table_id, key, value, timestamp } => {
                    storage.replay_write(table_id, key, value, timestamp)?;
                }

                // NEW: Replay index operations
                WalEntry::IndexWrite { index_id, index_name, key, value, timestamp } => {
                    index_manager.replay_insert(&index_name, key, value, timestamp)?;
                }

                WalEntry::IndexDelete { index_id, index_name, key, timestamp } => {
                    index_manager.replay_delete(&index_name, key, timestamp)?;
                }

                _ => { /* other entry types */ }
            }
        }

        Ok(())
    }
}
```
Implementation Plan
Phase 1: WAL Format Updates (3 days)
Tasks:
- Add IndexWrite/IndexDelete to WalEntry enum
- Update WAL serialization/deserialization (see the encoding sketch after this phase's file list)
- Update WAL replay logic
- Add unit tests
Files:
- heliosdb-storage/src/commitlog.rs (+50 LOC)
- heliosdb-storage/src/wal_entry.rs (+80 LOC)
- heliosdb-storage/tests/wal_index_test.rs (+150 LOC)
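A minimal sketch of the (de)serialization helpers used later by the unit test, given the WalEntry enum from the "WAL Entry Format" section above. The bincode encoding, the anyhow Result alias, and the IndexKey/RowLocation definitions shown here are assumptions for illustration; the real types live in the storage crate:

```rust
use anyhow::Result;
use serde::{Deserialize, Serialize};

// Illustrative stand-ins; the real definitions live in heliosdb-storage.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub enum IndexKey {
    String(String),
    Int(i64),
}

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct RowLocation {
    pub shard_id: String,
    pub row_key: u64,
}

impl WalEntry {
    // Hypothetical helpers, assuming bincode 1.x as the on-disk encoding.
    pub fn serialize(&self) -> Result<Vec<u8>> {
        Ok(bincode::serialize(self)?)
    }

    pub fn deserialize(bytes: &[u8]) -> Result<Self> {
        Ok(bincode::deserialize(bytes)?)
    }
}
```

Note that the serialization unit test below compares entries with assert_eq!, so WalEntry would additionally need #[derive(Debug, PartialEq)].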
Phase 2: Storage Layer Integration (4 days)
Tasks:
- Add log_index_write() to LsmStorageEngine
- Add log_index_delete() to LsmStorageEngine (sketched after this phase's file list)
- Update recovery to replay index operations
- Add integration tests
Files:
- heliosdb-storage/src/lsm.rs (+100 LOC)
- heliosdb-storage/src/recovery.rs (+80 LOC)
- heliosdb-storage/tests/recovery_index_test.rs (+200 LOC)
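log_index_delete() is not spelled out in the design above; a minimal sketch, assuming it simply mirrors log_index_write() and emits the IndexDelete WAL variant:

```rust
impl LsmStorageEngine {
    // Hypothetical counterpart to log_index_write(): records an index
    // tombstone in the shared WAL before the in-memory B-Tree is updated.
    pub async fn log_index_delete(
        &self,
        index_id: u64,
        index_name: &str,
        key: &IndexKey,
    ) -> Result<()> {
        let entry = WalEntry::IndexDelete {
            index_id,
            index_name: index_name.to_string(),
            key: key.clone(),
            timestamp: self.timestamp_counter.fetch_add(1, Ordering::SeqCst),
        };

        self.commitlog.append(&entry).await?;
        Ok(())
    }
}
```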
Phase 3: Index Manager Integration (5 days)
Tasks:
- Add storage_engine reference to GlobalIndexManager
- Call log_index_write() in insert_entry()
- Call log_index_delete() in remove_entry()
- Add replay_insert() and replay_delete() methods (see the sketch after this phase's file list)
- Update DML operations to use new API
Files:
- heliosdb-indexes/src/global_secondary.rs (+120 LOC)
- heliosdb-protocols/src/postgres/index_maintenance.rs (+30 LOC)
- heliosdb-indexes/tests/index_durability_test.rs (+250 LOC)
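The recovery path above calls replay_insert()/replay_delete() on the index manager, but those methods are not defined in this document. A minimal sketch of what they might look like; the IndexStorage method names are assumptions. The key point is that replay applies directly to the in-memory B-Tree and must not call log_index_write()/log_index_delete() again, otherwise recovery would re-append every entry to the WAL:

```rust
impl GlobalIndexManager {
    // Hypothetical replay hooks, used only during WAL recovery.
    pub fn replay_insert(
        &self,
        index_name: &str,
        key: IndexKey,
        location: RowLocation,
        timestamp: u64,
    ) -> Result<()> {
        // The index may have been dropped after the entry was logged; ignore.
        if let Some(index) = self.indexes.get(index_name) {
            index.storage.insert(IndexEntry { key, location })?;
        }
        let _ = timestamp; // could drive last-writer-wins conflict checks
        Ok(())
    }

    pub fn replay_delete(&self, index_name: &str, key: IndexKey, timestamp: u64) -> Result<()> {
        if let Some(index) = self.indexes.get(index_name) {
            index.storage.remove(&key, timestamp)?;
        }
        Ok(())
    }
}
```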
Phase 4: End-to-End Testing (2 days)
Tasks:
- Crash recovery test with indexes
- Concurrent DML + crash test
- Performance benchmarks for WAL overhead (sketch after the file list)
- Stress test (1M index operations + crash)
Files:
- heliosdb-storage/tests/crash_recovery_test.rs (+300 LOC)
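A rough sketch of how the WAL-overhead benchmark could be structured. The write signatures and helper names are assumptions, and a real harness would likely use criterion; the idea is simply to compare throughput with index WAL logging on and off against the "<10% degradation" target:

```rust
use std::time::Instant;

// Hypothetical micro-benchmark: writes per second for table + index writes.
async fn bench_writes(
    storage: &LsmStorageEngine,
    index_mgr: &GlobalIndexManager,
    n: u64,
) -> f64 {
    let start = Instant::now();
    for i in 0..n {
        // Table write followed by the corresponding index write.
        storage
            .write(format!("user:{i}").into_bytes(), format!("user{i}@example.com").into_bytes())
            .await
            .unwrap();
        index_mgr
            .insert_entry("idx_email", IndexEntry {
                key: IndexKey::String(format!("user{i}@example.com")),
                location: RowLocation { shard_id: "shard_0".to_string(), row_key: i },
            })
            .await
            .unwrap();
    }
    n as f64 / start.elapsed().as_secs_f64() // writes per second
}
```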
Testing Strategy
Unit Test: WAL Format
```rust
#[test]
fn test_wal_index_entry_serialization() -> Result<()> {
    let entry = WalEntry::IndexWrite {
        index_id: 5,
        index_name: "idx_email".to_string(),
        key: IndexKey::String("alice@example.com".to_string()),
        value: RowLocation { shard_id: "shard_0".to_string(), row_key: 123 },
        timestamp: 1000,
    };

    let bytes = entry.serialize()?;
    let decoded = WalEntry::deserialize(&bytes)?;

    assert_eq!(entry, decoded);
    Ok(())
}
```
Integration Test: Crash Recovery
```rust
#[tokio::test]
async fn test_index_survives_crash() -> Result<()> {
    // Setup
    let storage = LsmStorageEngine::new("/tmp/test").await?;
    let index_mgr = GlobalIndexManager::new(Some(storage.clone())).await?;
    index_mgr.create_index("idx_email", "users", "email").await?;

    // Insert row + index
    storage.write(key("user:1"), value("alice@example.com")).await?;
    index_mgr.insert_entry("idx_email", IndexEntry {
        key: IndexKey::String("alice@example.com".to_string()),
        location: RowLocation { shard_id: "shard_0".to_string(), row_key: 1 },
    }).await?;

    // Simulate crash (don't flush, just drop)
    drop(storage);
    drop(index_mgr);

    // Recovery
    let storage = LsmStorageEngine::recover("/tmp/test").await?;
    let index_mgr = GlobalIndexManager::new(Some(storage.clone())).await?;
    storage.replay_wal(&index_mgr).await?;

    // Verify index recovered
    let results = index_mgr.query_index("idx_email", &Predicate::Eq("alice@example.com")).await?;
    assert_eq!(results.len(), 1);
    assert_eq!(results[0].row_key, 1);
    Ok(())
}
```
Performance Impact
WAL Overhead Analysis
Assumptions:
- 1 table write = 100 bytes WAL entry
- 1 index write = 120 bytes WAL entry
- Table with 3 indexes
Overhead:
```
Before (table only):        100 bytes/write
After  (table + 3 indexes): 100 + (3 × 120) = 460 bytes/write
Overhead:                   4.6x WAL size
```
Mitigation:
- WAL Compression: Use LZ4 (3-5x compression ratio); see the sketch at the end of this section
- 460 bytes → ~100-150 bytes (similar to before)
- Batch Index Writes: Group multiple index updates
- Async WAL Flush: Don’t block on every write
Expected Impact: <10% write throughput reduction
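To make the compression mitigation concrete, a minimal sketch of compressing a serialized WAL record before it is appended, assuming the lz4_flex crate; the crate choice and the integration point in the commitlog are open questions, and whether compression is applied per record or per WAL segment is a separate design decision:

```rust
use lz4_flex::{compress_prepend_size, decompress_size_prepended};

// Hypothetical compression step in the commitlog append path.
fn compress_wal_record(serialized_entry: &[u8]) -> Vec<u8> {
    // Prepends the uncompressed length so the record can be decompressed
    // during recovery without extra bookkeeping.
    compress_prepend_size(serialized_entry)
}

// Hypothetical decompression step used during WAL replay.
fn decompress_wal_record(record: &[u8]) -> anyhow::Result<Vec<u8>> {
    Ok(decompress_size_prepended(record)?)
}
```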
Backward Compatibility
Migration Path
- Phase 1: Add index WAL entries alongside existing behavior
- Phase 2: Enable WAL logging for new indexes
- Phase 3: Rebuild existing indexes with WAL support
- Phase 4: Remove legacy non-durable index code
Data Migration
```sql
-- Mark indexes as needing rebuild
UPDATE system.indexes SET needs_rebuild=true WHERE created_before='2025-11-05';

-- Background job rebuilds indexes with WAL support
REBUILD INDEX idx_email ON users;
```
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| WAL size explosion | Disk space issues | Enable compression (LZ4), aggressive compaction |
| Recovery time increase | Longer startup | Checkpoint index state, incremental replay |
| Write performance degradation | Slower DML | Batch index updates, async WAL flush |
| Distributed coordination | Complex across shards | Per-shard WAL, eventual consistency mode |
Success Criteria
- Index survives crash (integration test passes)
- Table-index consistency after recovery (no stale entries)
- WAL overhead < 50% with compression
- Write throughput degradation < 10%
- Recovery time < 2× baseline
Alternative: Lazy Index Rebuild
If WAL overhead is unacceptable, consider:
- Don’t log index updates to WAL
- On recovery: Mark indexes as “stale”
- Background job: Rebuild indexes from tables
- Trade-off: Recovery consistency for write performance
Verdict: ❌ Not recommended - consistency is critical
References
- PostgreSQL WAL: https://www.postgresql.org/docs/current/wal-intro.html
- MySQL InnoDB Redo Log: https://dev.mysql.com/doc/refman/8.0/en/innodb-redo-log.html
- RocksDB WAL: https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log
- HeliosDB Storage Analysis: STORAGE_ARCHITECTURE_DETAILED_ANALYSIS.md
Next Steps: Implement Phase 1 (WAL Format Updates)