
Index Durability in WAL - Design Document

Date: November 5, 2025
Status: 🔴 CRITICAL - Data Integrity Risk
Priority: P0 - Must fix before production
Estimated Effort: 2 weeks


Executive Summary

HeliosDB’s global secondary indexes are NOT durable. Index updates are not logged in the Write-Ahead Log (WAL), causing:

  • Index corruption on crash - Stale entries remain after recovery
  • Table-index inconsistency - Tables recover, indexes don’t
  • Data loss - Index must be rebuilt from scratch after crash
  • Long recovery time - Full table scan required to rebuild indexes

Impact: Production database crash causes permanent index corruption until manual rebuild.


Problem Statement

Current Architecture

UPDATE users SET email='new@email.com' WHERE id=1;
┌─────────────────────────────────────────────────────────────┐
│ 1. UPDATE row in table │
│ └─→ WAL entry: [UPDATE, key=user:1, value=...] │
│ └─→ Memtable write │
│ └─→ DURABLE (survives crash) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. Update index entry │
│ └─→ ❌ NO WAL entry │
│ └─→ In-memory B-Tree update │
│ └─→ ❌ NOT DURABLE (lost on crash) │
└─────────────────────────────────────────────────────────────┘
💥 CRASH HERE
┌─────────────────────────────────────────────────────────────┐
│ 3. Recovery │
│ ├─→ Table: Recovered from WAL │
│ └─→ Index: ❌ Corrupted (points to old email) │
└─────────────────────────────────────────────────────────────┘

Example Failure Scenario

-- Initial state
INSERT INTO users (id, email) VALUES (1, 'alice@example.com');
CREATE INDEX idx_email ON users(email);
-- DML operation
UPDATE users SET email='bob@example.com' WHERE id=1;
-- State before crash:
-- Table: id=1, email='bob@example.com' (written to WAL)
-- Index: email='bob@example.com' → row_id=1 (NOT in WAL)
-- Index: email='alice@example.com' → [tombstone] (NOT in WAL)
💥 CRASH
-- State after recovery:
-- Table: id=1, email='bob@example.com' (recovered)
-- Index: email='alice@example.com' → row_id=1 (❌ stale!)
-- Index: email='bob@example.com' → [MISSING]
-- Query failure:
SELECT * FROM users WHERE email='bob@example.com';
-- Returns: 0 rows (❌ WRONG! Should return id=1)
SELECT * FROM users WHERE email='alice@example.com';
-- Returns: id=1 with email='bob@example.com' (❌ INCONSISTENT!)

Root Cause Analysis

Where Indexes Are Stored

File: heliosdb-indexes/src/global_secondary.rs

Index Storage ❌ (In-Memory, Not Durable)

// Line 48-60
pub struct GlobalIndexManager {
    /// Index definitions
    indexes: DashMap<String, Arc<IndexState>>, // ❌ In-memory only
    /// Index storage (B-Trees)
    storage: Arc<dyn IndexStorage>, // ❌ Separate from LSM WAL
}

// Index writes go to separate storage, NOT LSM WAL
impl GlobalIndexManager {
    pub async fn insert_entry(&self, index_name: &str, entry: IndexEntry) -> Result<()> {
        // DashMap::get returns Option, so map the miss to an error explicitly
        let index = self
            .indexes
            .get(index_name)
            .ok_or_else(|| Error::IndexNotFound(index_name.to_string()))?;
        index.storage.insert(entry)?; // ❌ No WAL!
        Ok(())
    }
}

Table Storage (Durable) vs Index Storage (Not Durable)

┌─────────────────────────────────────────────────────────────┐
│ Table Storage (heliosdb-storage/src/lsm.rs) │
├─────────────────────────────────────────────────────────────┤
│ Write Path: │
│ 1. Log to WAL (commitlog) │
│ 2. Write to Memtable │
│ 3. Flush to SSTable │
│ 4. Crash recovery: Replay WAL │
│ │
│ Result: FULLY DURABLE │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Index Storage (heliosdb-indexes/src/btree_storage.rs) │
├─────────────────────────────────────────────────────────────┤
│ Write Path: │
│ 1. Log to WAL ❌ MISSING │
│ 2. Write to in-memory B-Tree │
│ 3. Persist to disk (eventually) ⚠ Async │
│ 4. Crash recovery: ??? ❌ NO REPLAY │
│ │
│ Result: NOT DURABLE ❌ │
└─────────────────────────────────────────────────────────────┘

Design Solution

Option 1: Shared WAL (Recommended)

Approach: Index operations write to the same WAL as table operations

┌─────────────────────────────────────────────────────────────┐
│ SHARED WAL │
├─────────────────────────────────────────────────────────────┤
│ [Table Entry] table_id=1, key=user:1, value=... │
│ [Index Entry] index_id=5, key=alice@..., value=row_id:1 │
│ [Index Delete] index_id=5, key=alice@... [tombstone] │
│ [Index Entry] index_id=5, key=bob@..., value=row_id:1 │
└─────────────────────────────────────────────────────────────┘
Recovery replays BOTH
┌────────────────────────────────┐
│ Tables Indexes │
└────────────────────────────────┘

Pros:

  • Atomic durability (table + indexes together)
  • Single recovery path
  • Guaranteed consistency
  • Minimal complexity

Cons:

  • ⚠ WAL size increases (mitigated by compression)
  • ⚠ Index recovery time on startup

Option 2: Separate Index WAL

Approach: Indexes have their own WAL, coordinated with table WAL

Pros:

  • Index changes isolated from table WAL
  • Can replay indexes independently

Cons:

  • ❌ Complex 2PC coordination required
  • ❌ Crash in between commits = inconsistency
  • ❌ Recovery more complex (must replay both in order)

Verdict: ❌ Too complex, use Option 1


Architecture: Shared WAL (Option 1)

WAL Entry Format

#[derive(Serialize, Deserialize)]
pub enum WalEntry {
    // Existing: Table operations
    TableWrite { table_id: u64, key: Vec<u8>, value: Vec<u8>, timestamp: u64 },
    TableDelete { table_id: u64, key: Vec<u8>, timestamp: u64 },

    // NEW: Index operations
    IndexWrite {
        index_id: u64,
        index_name: String,
        key: IndexKey,      // Indexed column value
        value: RowLocation, // Pointer to table row
        timestamp: u64,
    },
    IndexDelete {
        index_id: u64,
        index_name: String,
        key: IndexKey,
        timestamp: u64,
    },
}
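Whatever codec the existing commitlog uses (serde via the derives above), index entries only need a lossless round-trip. A hedged, hand-rolled sketch of framing one `IndexWrite`-shaped record as length-prefixed little-endian bytes, purely to illustrate the round-trip property (this is not the real on-disk format):

```rust
// Encode (index_id, key, row_key, timestamp) as: id | key_len | key | row | ts.
fn encode_index_write(index_id: u64, key: &str, row_key: u64, ts: u64) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&index_id.to_le_bytes());
    buf.extend_from_slice(&(key.len() as u32).to_le_bytes());
    buf.extend_from_slice(key.as_bytes());
    buf.extend_from_slice(&row_key.to_le_bytes());
    buf.extend_from_slice(&ts.to_le_bytes());
    buf
}

// Decode the same layout back into its fields.
fn decode_index_write(buf: &[u8]) -> (u64, String, u64, u64) {
    let index_id = u64::from_le_bytes(buf[0..8].try_into().unwrap());
    let klen = u32::from_le_bytes(buf[8..12].try_into().unwrap()) as usize;
    let key = String::from_utf8(buf[12..12 + klen].to_vec()).unwrap();
    let row_key = u64::from_le_bytes(buf[12 + klen..20 + klen].try_into().unwrap());
    let ts = u64::from_le_bytes(buf[20 + klen..28 + klen].try_into().unwrap());
    (index_id, key, row_key, ts)
}

fn main() {
    let bytes = encode_index_write(5, "alice@example.com", 123, 1000);
    let (id, key, row, ts) = decode_index_write(&bytes);
    assert_eq!((id, key.as_str(), row, ts), (5, "alice@example.com", 123, 1000));
}
```

The unit test in the Testing Strategy section below asserts exactly this property against the real serializer.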

Write Path Changes

// heliosdb-storage/src/lsm.rs
impl LsmStorageEngine {
    // NEW: Log index operation to WAL
    pub async fn log_index_write(
        &self,
        index_id: u64,
        index_name: &str,
        key: &IndexKey,
        location: &RowLocation,
    ) -> Result<()> {
        let entry = WalEntry::IndexWrite {
            index_id,
            index_name: index_name.to_string(),
            key: key.clone(),
            value: location.clone(),
            timestamp: self.timestamp_counter.fetch_add(1, Ordering::SeqCst),
        };
        // Write to WAL
        self.commitlog.append(&entry).await?;
        Ok(())
    }
}

Index Manager Integration

// heliosdb-indexes/src/global_secondary.rs
pub struct GlobalIndexManager {
    indexes: DashMap<String, Arc<IndexState>>,
    // NEW: Reference to storage engine for WAL access
    storage_engine: Option<Arc<LsmStorageEngine>>,
}

impl GlobalIndexManager {
    pub async fn insert_entry(&self, index_name: &str, entry: IndexEntry) -> Result<()> {
        let index = self
            .indexes
            .get(index_name)
            .ok_or_else(|| Error::IndexNotFound(index_name.to_string()))?;
        // NEW: Log to WAL first, so the in-memory update is recoverable
        if let Some(storage) = &self.storage_engine {
            storage.log_index_write(
                index.definition.index_id,
                index_name,
                &entry.key,
                &entry.location,
            ).await?;
        }
        // Then update in-memory index
        index.storage.insert(entry)?;
        Ok(())
    }
}

Recovery Path

// heliosdb-storage/src/recovery.rs
impl Recovery {
    pub async fn replay_wal(
        &self,
        storage: &LsmStorageEngine,
        index_manager: &GlobalIndexManager,
    ) -> Result<()> {
        for entry in self.commitlog.read_all()? {
            match entry {
                WalEntry::TableWrite { table_id, key, value, timestamp } => {
                    storage.replay_write(table_id, key, value, timestamp)?;
                }
                // NEW: Replay index operations
                WalEntry::IndexWrite { index_name, key, value, timestamp, .. } => {
                    index_manager.replay_insert(&index_name, key, value, timestamp)?;
                }
                WalEntry::IndexDelete { index_name, key, timestamp, .. } => {
                    index_manager.replay_delete(&index_name, key, timestamp)?;
                }
                _ => { /* other entry types */ }
            }
        }
        Ok(())
    }
}

Implementation Plan

Phase 1: WAL Format Updates (3 days)

Tasks:

  1. Add IndexWrite/IndexDelete to WalEntry enum
  2. Update WAL serialization/deserialization
  3. Update WAL replay logic
  4. Add unit tests

Files:

  • heliosdb-storage/src/commitlog.rs (+50 LOC)
  • heliosdb-storage/src/wal_entry.rs (+80 LOC)
  • heliosdb-storage/tests/wal_index_test.rs (+150 LOC)

Phase 2: Storage Layer Integration (4 days)

Tasks:

  1. Add log_index_write() to LsmStorageEngine
  2. Add log_index_delete() to LsmStorageEngine
  3. Update recovery to replay index operations
  4. Add integration tests

Files:

  • heliosdb-storage/src/lsm.rs (+100 LOC)
  • heliosdb-storage/src/recovery.rs (+80 LOC)
  • heliosdb-storage/tests/recovery_index_test.rs (+200 LOC)

Phase 3: Index Manager Integration (5 days)

Tasks:

  1. Add storage_engine reference to GlobalIndexManager
  2. Call log_index_write() in insert_entry()
  3. Call log_index_delete() in remove_entry()
  4. Add replay_insert() and replay_delete() methods
  5. Update DML operations to use new API

Files:

  • heliosdb-indexes/src/global_secondary.rs (+120 LOC)
  • heliosdb-protocols/src/postgres/index_maintenance.rs (+30 LOC)
  • heliosdb-indexes/tests/index_durability_test.rs (+250 LOC)

Phase 4: End-to-End Testing (2 days)

Tasks:

  1. Crash recovery test with indexes
  2. Concurrent DML + crash test
  3. Performance benchmarks (WAL overhead)
  4. Stress test (1M index operations + crash)

Files:

  • heliosdb-storage/tests/crash_recovery_test.rs (+300 LOC)

Testing Strategy

Unit Test: WAL Format

#[test]
fn test_wal_index_entry_serialization() -> Result<()> {
    let entry = WalEntry::IndexWrite {
        index_id: 5,
        index_name: "idx_email".to_string(),
        key: IndexKey::String("alice@example.com".to_string()),
        value: RowLocation { shard_id: "shard_0".to_string(), row_key: 123 },
        timestamp: 1000,
    };
    let bytes = entry.serialize()?;
    let decoded = WalEntry::deserialize(&bytes)?;
    assert_eq!(entry, decoded);
    Ok(())
}

Integration Test: Crash Recovery

#[tokio::test]
async fn test_index_survives_crash() -> Result<()> {
    // Setup
    let storage = LsmStorageEngine::new("/tmp/test").await?;
    let index_mgr = GlobalIndexManager::new(Some(storage.clone())).await?;
    index_mgr.create_index("idx_email", "users", "email").await?;

    // Insert row + index
    storage.write(key("user:1"), value("alice@example.com")).await?;
    index_mgr.insert_entry("idx_email", IndexEntry {
        key: IndexKey::String("alice@example.com".to_string()),
        location: RowLocation { shard_id: "shard_0".to_string(), row_key: 1 },
    }).await?;

    // Simulate crash (don't flush, just drop)
    drop(storage);
    drop(index_mgr);

    // Recovery
    let storage = LsmStorageEngine::recover("/tmp/test").await?;
    let index_mgr = GlobalIndexManager::new(Some(storage.clone())).await?;
    storage.replay_wal(&index_mgr).await?;

    // Verify index recovered
    let results = index_mgr.query_index("idx_email", &Predicate::Eq("alice@example.com")).await?;
    assert_eq!(results.len(), 1);
    assert_eq!(results[0].row_key, 1);
    Ok(())
}

Performance Impact

WAL Overhead Analysis

Assumptions:

  • 1 table write = 100 bytes WAL entry
  • 1 index write = 120 bytes WAL entry
  • Table with 3 indexes

Overhead:

Before (table only): 100 bytes/write
After (table + 3 indexes): 100 + (3 × 120) = 460 bytes/write
Overhead: 4.6x WAL size
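The arithmetic generalizes to any index count. A trivial sanity check of the figures above (pure illustration):

```rust
/// Bytes of WAL written per logical row write, under the stated assumptions:
/// one table entry plus one index entry per secondary index.
fn wal_bytes_per_write(table_entry: u64, index_entry: u64, num_indexes: u64) -> u64 {
    table_entry + num_indexes * index_entry
}

fn main() {
    let before = wal_bytes_per_write(100, 120, 0); // table only
    let after = wal_bytes_per_write(100, 120, 3);  // table + 3 indexes
    assert_eq!(before, 100);
    assert_eq!(after, 460);
    println!("overhead: {:.1}x", after as f64 / before as f64); // 4.6x
}
```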

Mitigation:

  1. WAL Compression: Use LZ4 (3-5x compression ratio)
    • 460 bytes → ~100-150 bytes (similar to before)
  2. Batch Index Writes: Group multiple index updates
  3. Async WAL Flush: Don’t block on every write

Expected Impact: <10% write throughput reduction


Backward Compatibility

Migration Path

  1. Phase 1: Add index WAL entries alongside existing behavior
  2. Phase 2: Enable WAL logging for new indexes
  3. Phase 3: Rebuild existing indexes with WAL support
  4. Phase 4: Remove legacy non-durable index code

Data Migration

-- Mark indexes as needing rebuild
UPDATE system.indexes SET needs_rebuild=true WHERE created_before='2025-11-05';
-- Background job rebuilds indexes with WAL support
REBUILD INDEX idx_email ON users;

Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| WAL size explosion | Disk space issues | Enable compression (LZ4), aggressive compaction |
| Recovery time increase | Longer startup | Checkpoint index state, incremental replay |
| Write performance degradation | Slower DML | Batch index updates, async WAL flush |
| Distributed coordination | Complex across shards | Per-shard WAL, eventual consistency mode |

Success Criteria

  1. Index survives crash (integration test passes)
  2. Table-index consistency after recovery (no stale entries)
  3. WAL overhead < 50% with compression
  4. Write throughput degradation < 10%
  5. Recovery time < 2× baseline

Alternative: Lazy Index Rebuild

If WAL overhead is unacceptable, consider:

  • Don’t log index updates to WAL
  • On recovery: Mark indexes as “stale”
  • Background job: Rebuild indexes from tables
  • Trade-off: Recovery consistency for write performance

Verdict: ❌ Not recommended - consistency is critical


Next Steps: Implement Phase 1 (WAL Format Updates)