
Index Durability in WAL - Design Document

Date: November 5, 2025
Status: 🔴 CRITICAL - Data Integrity Risk
Priority: P0 - Must fix before production
Estimated Effort: 2 weeks


Executive Summary

HeliosDB’s global secondary indexes are NOT durable. Index updates are not logged in the Write-Ahead Log (WAL), causing:

  • Index corruption on crash - Stale entries remain after recovery
  • Table-index inconsistency - Tables recover, indexes don’t
  • Data loss - Index must be rebuilt from scratch after crash
  • Long recovery time - Full table scan required to rebuild indexes

Impact: Production database crash causes permanent index corruption until manual rebuild.


Problem Statement

Current Architecture

UPDATE users SET email='new@email.com' WHERE id=1;
┌─────────────────────────────────────────────────────────────┐
│ 1. UPDATE row in table │
│ └─→ WAL entry: [UPDATE, key=user:1, value=...] │
│ └─→ Memtable write │
│ └─→ DURABLE (survives crash) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. Update index entry │
│ └─→ ❌ NO WAL entry │
│ └─→ In-memory B-Tree update │
│ └─→ ❌ NOT DURABLE (lost on crash) │
└─────────────────────────────────────────────────────────────┘
💥 CRASH HERE
┌─────────────────────────────────────────────────────────────┐
│ 3. Recovery │
│ ├─→ Table: Recovered from WAL │
│ └─→ Index: ❌ Corrupted (points to old email) │
└─────────────────────────────────────────────────────────────┘

Example Failure Scenario

-- Initial state
INSERT INTO users (id, email) VALUES (1, 'alice@example.com');
CREATE INDEX idx_email ON users(email);
-- DML operation
UPDATE users SET email='bob@example.com' WHERE id=1;
-- State before crash:
-- Table: id=1, email='bob@example.com' (written to WAL)
-- Index: email='bob@example.com' → row_id=1 (NOT in WAL)
-- Index: email='alice@example.com' → [tombstone] (NOT in WAL)
💥 CRASH
-- State after recovery:
-- Table: id=1, email='bob@example.com' (recovered)
-- Index: email='alice@example.com' → row_id=1 (❌ stale!)
-- Index: email='bob@example.com' → [MISSING]
-- Query failure:
SELECT * FROM users WHERE email='bob@example.com';
-- Returns: 0 rows (❌ WRONG! Should return id=1)
SELECT * FROM users WHERE email='alice@example.com';
-- Returns: id=1 with email='bob@example.com' (❌ INCONSISTENT!)

Root Cause Analysis

Where Indexes Are Stored

File: heliosdb-indexes/src/global_secondary.rs

Index Storage ❌ (In-Memory, Not Durable)

// Line 48-60
pub struct GlobalIndexManager {
    /// Index definitions
    indexes: DashMap<String, Arc<IndexState>>, // ❌ In-memory only
    /// Index storage (B-Trees)
    storage: Arc<dyn IndexStorage>, // ❌ Separate from LSM WAL
}

// Index writes go to separate storage, NOT LSM WAL
impl GlobalIndexManager {
    pub async fn insert_entry(&self, index_name: &str, entry: IndexEntry) -> Result<()> {
        // DashMap::get returns Option, so map the miss to an error explicitly
        let index = self
            .indexes
            .get(index_name)
            .ok_or_else(|| Error::IndexNotFound(index_name.to_string()))?;
        index.storage.insert(entry)?; // ❌ No WAL!
        Ok(())
    }
}

Table Storage (Durable) vs Index Storage (Not Durable)

┌─────────────────────────────────────────────────────────────┐
│ Table Storage (heliosdb-storage/src/lsm.rs) │
├─────────────────────────────────────────────────────────────┤
│ Write Path: │
│ 1. Log to WAL (commitlog) │
│ 2. Write to Memtable │
│ 3. Flush to SSTable │
│ 4. Crash recovery: Replay WAL │
│ │
│ Result: FULLY DURABLE │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Index Storage (heliosdb-indexes/src/btree_storage.rs) │
├─────────────────────────────────────────────────────────────┤
│ Write Path: │
│ 1. Log to WAL ❌ MISSING │
│ 2. Write to in-memory B-Tree │
│ 3. Persist to disk (eventually) ⚠ Async │
│ 4. Crash recovery: ??? ❌ NO REPLAY │
│ │
│ Result: NOT DURABLE ❌ │
└─────────────────────────────────────────────────────────────┘

Design Solution

Option 1: Shared WAL (Recommended)

Approach: Index operations write to the same WAL as table operations

┌─────────────────────────────────────────────────────────────┐
│ SHARED WAL │
├─────────────────────────────────────────────────────────────┤
│ [Table Entry] table_id=1, key=user:1, value=... │
│ [Index Entry] index_id=5, key=alice@..., value=row_id:1 │
│ [Index Delete] index_id=5, key=alice@... [tombstone] │
│ [Index Entry] index_id=5, key=bob@..., value=row_id:1 │
└─────────────────────────────────────────────────────────────┘
Recovery replays BOTH
┌────────────────────────────────┐
│ Tables Indexes │
└────────────────────────────────┘

Pros:

  • Atomic durability (table + indexes together)
  • Single recovery path
  • Guaranteed consistency
  • Minimal complexity

Cons:

  • ⚠ WAL size increases (mitigated by compression)
  • ⚠ Index recovery time on startup

Option 2: Separate Index WAL

Approach: Indexes have their own WAL, coordinated with table WAL

Pros:

  • Index changes isolated from table WAL
  • Can replay indexes independently

Cons:

  • ❌ Complex 2PC coordination required
  • ❌ Crash in between commits = inconsistency
  • ❌ Recovery more complex (must replay both in order)

Verdict: ❌ Too complex, use Option 1


Architecture: Shared WAL (Option 1)

WAL Entry Format

#[derive(Serialize, Deserialize)]
pub enum WalEntry {
    // Existing: Table operations
    TableWrite { table_id: u64, key: Vec<u8>, value: Vec<u8>, timestamp: u64 },
    TableDelete { table_id: u64, key: Vec<u8>, timestamp: u64 },

    // NEW: Index operations
    IndexWrite {
        index_id: u64,
        index_name: String,
        key: IndexKey,      // Indexed column value
        value: RowLocation, // Pointer to table row
        timestamp: u64,
    },
    IndexDelete {
        index_id: u64,
        index_name: String,
        key: IndexKey,
        timestamp: u64,
    },
}
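Whatever codec the existing commitlog uses (serde via the derives above), index entries only need a lossless round-trip. A hedged, hand-rolled sketch of framing one `IndexWrite`-shaped record as length-prefixed little-endian bytes, purely to illustrate the round-trip property (this is not the real on-disk format):

```rust
// Encode (index_id, key, row_key, timestamp) as: id | key_len | key | row | ts.
fn encode_index_write(index_id: u64, key: &str, row_key: u64, ts: u64) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&index_id.to_le_bytes());
    buf.extend_from_slice(&(key.len() as u32).to_le_bytes());
    buf.extend_from_slice(key.as_bytes());
    buf.extend_from_slice(&row_key.to_le_bytes());
    buf.extend_from_slice(&ts.to_le_bytes());
    buf
}

// Decode the same layout back into its fields.
fn decode_index_write(buf: &[u8]) -> (u64, String, u64, u64) {
    let index_id = u64::from_le_bytes(buf[0..8].try_into().unwrap());
    let klen = u32::from_le_bytes(buf[8..12].try_into().unwrap()) as usize;
    let key = String::from_utf8(buf[12..12 + klen].to_vec()).unwrap();
    let row_key = u64::from_le_bytes(buf[12 + klen..20 + klen].try_into().unwrap());
    let ts = u64::from_le_bytes(buf[20 + klen..28 + klen].try_into().unwrap());
    (index_id, key, row_key, ts)
}

fn main() {
    let bytes = encode_index_write(5, "alice@example.com", 123, 1000);
    let (id, key, row, ts) = decode_index_write(&bytes);
    assert_eq!((id, key.as_str(), row, ts), (5, "alice@example.com", 123, 1000));
}
```

The unit test in the Testing Strategy section below asserts exactly this property against the real serializer.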

Write Path Changes

// heliosdb-storage/src/lsm.rs
impl LsmStorageEngine {
    // NEW: Log index operation to WAL
    pub async fn log_index_write(
        &self,
        index_id: u64,
        index_name: &str,
        key: &IndexKey,
        location: &RowLocation,
    ) -> Result<()> {
        let entry = WalEntry::IndexWrite {
            index_id,
            index_name: index_name.to_string(),
            key: key.clone(),
            value: location.clone(),
            timestamp: self.timestamp_counter.fetch_add(1, Ordering::SeqCst),
        };
        // Write to WAL
        self.commitlog.append(&entry).await?;
        Ok(())
    }
}

Index Manager Integration

// heliosdb-indexes/src/global_secondary.rs
pub struct GlobalIndexManager {
    indexes: DashMap<String, Arc<IndexState>>,
    // NEW: Reference to storage engine for WAL access
    storage_engine: Option<Arc<LsmStorageEngine>>,
}

impl GlobalIndexManager {
    pub async fn insert_entry(&self, index_name: &str, entry: IndexEntry) -> Result<()> {
        let index = self
            .indexes
            .get(index_name)
            .ok_or_else(|| Error::IndexNotFound(index_name.to_string()))?;
        // NEW: Log to WAL first, so the in-memory update is recoverable
        if let Some(storage) = &self.storage_engine {
            storage.log_index_write(
                index.definition.index_id,
                index_name,
                &entry.key,
                &entry.location,
            ).await?;
        }
        // Then update in-memory index
        index.storage.insert(entry)?;
        Ok(())
    }
}

Recovery Path

// heliosdb-storage/src/recovery.rs
impl Recovery {
    pub async fn replay_wal(
        &self,
        storage: &LsmStorageEngine,
        index_manager: &GlobalIndexManager,
    ) -> Result<()> {
        for entry in self.commitlog.read_all()? {
            match entry {
                WalEntry::TableWrite { table_id, key, value, timestamp } => {
                    storage.replay_write(table_id, key, value, timestamp)?;
                }
                // NEW: Replay index operations
                WalEntry::IndexWrite { index_name, key, value, timestamp, .. } => {
                    index_manager.replay_insert(&index_name, key, value, timestamp)?;
                }
                WalEntry::IndexDelete { index_name, key, timestamp, .. } => {
                    index_manager.replay_delete(&index_name, key, timestamp)?;
                }
                _ => { /* other entry types */ }
            }
        }
        Ok(())
    }
}

Implementation Plan

Phase 1: WAL Format Updates (3 days)

Tasks:

  1. Add IndexWrite/IndexDelete to WalEntry enum
  2. Update WAL serialization/deserialization
  3. Update WAL replay logic
  4. Add unit tests

Files:

  • heliosdb-storage/src/commitlog.rs (+50 LOC)
  • heliosdb-storage/src/wal_entry.rs (+80 LOC)
  • heliosdb-storage/tests/wal_index_test.rs (+150 LOC)

Phase 2: Storage Layer Integration (4 days)

Tasks:

  1. Add log_index_write() to LsmStorageEngine
  2. Add log_index_delete() to LsmStorageEngine
  3. Update recovery to replay index operations
  4. Add integration tests

Files:

  • heliosdb-storage/src/lsm.rs (+100 LOC)
  • heliosdb-storage/src/recovery.rs (+80 LOC)
  • heliosdb-storage/tests/recovery_index_test.rs (+200 LOC)

Phase 3: Index Manager Integration (5 days)

Tasks:

  1. Add storage_engine reference to GlobalIndexManager
  2. Call log_index_write() in insert_entry()
  3. Call log_index_delete() in remove_entry()
  4. Add replay_insert() and replay_delete() methods
  5. Update DML operations to use new API

Files:

  • heliosdb-indexes/src/global_secondary.rs (+120 LOC)
  • heliosdb-protocols/src/postgres/index_maintenance.rs (+30 LOC)
  • heliosdb-indexes/tests/index_durability_test.rs (+250 LOC)

Phase 4: End-to-End Testing (2 days)

Tasks:

  1. Crash recovery test with indexes
  2. Concurrent DML + crash test
  3. Performance benchmarks (WAL overhead)
  4. Stress test (1M index operations + crash)

Files:

  • heliosdb-storage/tests/crash_recovery_test.rs (+300 LOC)

Testing Strategy

Unit Test: WAL Format

#[test]
fn test_wal_index_entry_serialization() -> Result<()> {
    let entry = WalEntry::IndexWrite {
        index_id: 5,
        index_name: "idx_email".to_string(),
        key: IndexKey::String("alice@example.com".to_string()),
        value: RowLocation { shard_id: "shard_0".to_string(), row_key: 123 },
        timestamp: 1000,
    };
    let bytes = entry.serialize()?;
    let decoded = WalEntry::deserialize(&bytes)?;
    assert_eq!(entry, decoded);
    Ok(())
}

Integration Test: Crash Recovery

#[tokio::test]
async fn test_index_survives_crash() -> Result<()> {
    // Setup
    let storage = LsmStorageEngine::new("/tmp/test").await?;
    let index_mgr = GlobalIndexManager::new(Some(storage.clone())).await?;
    index_mgr.create_index("idx_email", "users", "email").await?;

    // Insert row + index
    storage.write(key("user:1"), value("alice@example.com")).await?;
    index_mgr.insert_entry("idx_email", IndexEntry {
        key: IndexKey::String("alice@example.com".to_string()),
        location: RowLocation { shard_id: "shard_0".to_string(), row_key: 1 },
    }).await?;

    // Simulate crash (don't flush, just drop)
    drop(storage);
    drop(index_mgr);

    // Recovery
    let storage = LsmStorageEngine::recover("/tmp/test").await?;
    let index_mgr = GlobalIndexManager::new(Some(storage.clone())).await?;
    storage.replay_wal(&index_mgr).await?;

    // Verify index recovered
    let results = index_mgr.query_index("idx_email", &Predicate::Eq("alice@example.com")).await?;
    assert_eq!(results.len(), 1);
    assert_eq!(results[0].row_key, 1);
    Ok(())
}

Performance Impact

WAL Overhead Analysis

Assumptions:

  • 1 table write = 100 bytes WAL entry
  • 1 index write = 120 bytes WAL entry
  • Table with 3 indexes

Overhead:

Before (table only): 100 bytes/write
After (table + 3 indexes): 100 + (3 × 120) = 460 bytes/write
Overhead: 4.6x WAL size
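The arithmetic generalizes to any index count. A trivial sanity check of the figures above (pure illustration):

```rust
/// Bytes of WAL written per logical row write, under the stated assumptions:
/// one table entry plus one index entry per secondary index.
fn wal_bytes_per_write(table_entry: u64, index_entry: u64, num_indexes: u64) -> u64 {
    table_entry + num_indexes * index_entry
}

fn main() {
    let before = wal_bytes_per_write(100, 120, 0); // table only
    let after = wal_bytes_per_write(100, 120, 3);  // table + 3 indexes
    assert_eq!(before, 100);
    assert_eq!(after, 460);
    println!("overhead: {:.1}x", after as f64 / before as f64); // 4.6x
}
```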

Mitigation:

  1. WAL Compression: Use LZ4 (3-5x compression ratio)
    • 460 bytes → ~100-150 bytes (similar to before)
  2. Batch Index Writes: Group multiple index updates
  3. Async WAL Flush: Don’t block on every write

Expected Impact: <10% write throughput reduction


Backward Compatibility

Migration Path

  1. Phase 1: Add index WAL entries alongside existing behavior
  2. Phase 2: Enable WAL logging for new indexes
  3. Phase 3: Rebuild existing indexes with WAL support
  4. Phase 4: Remove legacy non-durable index code

Data Migration

-- Mark indexes as needing rebuild
UPDATE system.indexes SET needs_rebuild=true WHERE created_before='2025-11-05';
-- Background job rebuilds indexes with WAL support
REBUILD INDEX idx_email ON users;

Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| WAL size explosion | Disk space issues | Enable compression (LZ4), aggressive compaction |
| Recovery time increase | Longer startup | Checkpoint index state, incremental replay |
| Write performance degradation | Slower DML | Batch index updates, async WAL flush |
| Distributed coordination | Complex across shards | Per-shard WAL, eventual consistency mode |

Success Criteria

  1. Index survives crash (integration test passes)
  2. Table-index consistency after recovery (no stale entries)
  3. WAL overhead < 50% with compression
  4. Write throughput degradation < 10%
  5. Recovery time < 2× baseline

Alternative: Lazy Index Rebuild

If WAL overhead is unacceptable, consider:

  • Don’t log index updates to WAL
  • On recovery: Mark indexes as “stale”
  • Background job: Rebuild indexes from tables
  • Trade-off: Recovery consistency for write performance

Verdict: ❌ Not recommended - consistency is critical


Next Steps: Implement Phase 1 (WAL Format Updates)