Lock Poison Recovery - Developer Guide

Overview

This guide explains how to properly handle lock poisoning in HeliosDB to prevent cascading failures and ensure system resilience.

What is Lock Poisoning?

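A lock becomes “poisoned” when a thread panics while holding it. The Rust standard library then marks the lock as poisoned to signal that the shared state may be inconsistent, and every later lock(), read(), or write() call returns Err(PoisonError) until the poison is cleared. For an RwLock, only a panic while holding the write lock poisons it; a panicking reader does not.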
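To make the mechanism concrete, the following minimal sketch poisons a Mutex on purpose and inspects what the standard library reports afterwards. The function name demonstrate_poisoning is hypothetical and exists only for this illustration; everything else is std.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Poisons a fresh mutex by panicking on another thread while holding its
/// guard. Returns (is_poisoned, lock_returned_err, data_seen_after_recovery).
fn demonstrate_poisoning() -> (bool, bool, Vec<i32>) {
    let shared = Arc::new(Mutex::new(vec![1, 2, 3]));

    let worker = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            let mut guard = shared.lock().unwrap();
            guard.push(4);
            // The guard is dropped during unwinding, which poisons the mutex.
            panic!("simulated failure while holding the lock");
        })
    };
    // join() returns Err because the worker panicked; the panic stays contained.
    let _ = worker.join();

    let poisoned = shared.is_poisoned();
    let result = shared.lock();
    let lock_failed = result.is_err();
    // The data is still reachable through the PoisonError.
    let data = match result {
        Ok(guard) => guard.to_vec(),
        Err(err) => err.into_inner().to_vec(),
    };
    (poisoned, lock_failed, data)
}
```

The write made before the panic (the pushed 4) is still visible after recovery, which is exactly why recovered state may need validation.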

The Problem

// ❌ BAD: This will panic if the lock is poisoned
let guard = self.cache.lock().unwrap();

When this code encounters a poisoned lock, it panics, potentially causing:

Cascading failures across threads
Cluster coordination breakdown
Data loss or inconsistency
Service outages

The Solution

Standard Pattern

// ✅ GOOD: Recover from poisoned locks
let guard = self.cache.lock().unwrap_or_else(|poisoned| {
    tracing::warn!("cache lock was poisoned, recovering");
    poisoned.into_inner()
});
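Repeating this closure at every call site is easy to get subtly wrong. One possible way to centralize it is a small extension trait. The names LockRecover and lock_or_recover below are hypothetical, not an existing HeliosDB or std API, and eprintln! stands in for tracing::warn! so the sketch stays std-only.

```rust
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

/// Hypothetical helper trait: acquire a Mutex and recover the guard if the
/// lock is poisoned, logging the lock name once per recovery.
trait LockRecover<T> {
    fn lock_or_recover(&self, name: &str) -> MutexGuard<'_, T>;
}

impl<T> LockRecover<T> for Mutex<T> {
    fn lock_or_recover(&self, name: &str) -> MutexGuard<'_, T> {
        self.lock().unwrap_or_else(|poisoned| {
            // Stand-in for tracing::warn! in real code.
            eprintln!("{} lock was poisoned, recovering", name);
            poisoned.into_inner()
        })
    }
}

/// Poisons a cache lock, then shows that the helper still yields a usable guard.
fn recover_with_helper() -> Vec<i32> {
    let cache = Arc::new(Mutex::new(vec![1]));
    let worker = Arc::clone(&cache);
    let _ = thread::spawn(move || {
        let _guard = worker.lock().unwrap();
        panic!("poison the cache lock");
    })
    .join();

    let mut guard = cache.lock_or_recover("cache");
    guard.push(2);
    guard.to_vec()
}
```

The same shape could be repeated for RwLock::read and RwLock::write if a project wants one helper per lock type.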

For Different Lock Types

Mutex (exclusive lock)

use std::sync::{Arc, Mutex};

let data = Arc::new(Mutex::new(Vec::new()));

// Acquire lock with poison recovery
let mut guard = data.lock().unwrap_or_else(|poisoned| {
    tracing::warn!("data lock was poisoned, recovering");
    poisoned.into_inner()
});

guard.push(42);

RwLock (read lock)

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

let data = Arc::new(RwLock::new(HashMap::new()));

// Read lock with poison recovery
let guard = data.read().unwrap_or_else(|poisoned| {
    tracing::warn!("data read lock was poisoned, recovering");
    poisoned.into_inner()
});

let value = guard.get(&key);

RwLock (write lock)

// Write lock with poison recovery
let mut guard = data.write().unwrap_or_else(|poisoned| {
    tracing::warn!("data write lock was poisoned, recovering");
    poisoned.into_inner()
});

guard.insert(key, value);

When to Use Each Approach

1. Recovery (Preferred for most cases)

Use unwrap_or_else when:

The operation can safely continue
Data consistency can be verified
Logs/metrics capture the event

let guard = self.lock.write().unwrap_or_else(|poisoned| {
    tracing::warn!("Lock was poisoned, recovering");
    poisoned.into_inner()
});

2. Error Propagation (When caller needs to know)

Use map_err when:

Caller needs to handle poison differently
Operation should fail on poison
Part of a transaction that needs rollback

let guard = self.lock.write()
    .map_err(|_| MyError::LockPoisoned("lock_name"))?;
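A self-contained sketch of the propagation approach follows. MyError mirrors the name used above, but its definition here, the Ledger type, and the helper poisoned_ledger are assumptions made for illustration only.

```rust
use std::fmt;
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical error type standing in for `MyError` in the snippet above.
#[derive(Debug, PartialEq)]
enum MyError {
    LockPoisoned(&'static str),
}

impl fmt::Display for MyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MyError::LockPoisoned(name) => write!(f, "lock `{}` is poisoned", name),
        }
    }
}

impl std::error::Error for MyError {}

/// Hypothetical state where a half-applied update must not be silently reused.
struct Ledger {
    balance: RwLock<i64>,
}

impl Ledger {
    /// Fails instead of recovering, so the caller can roll back or retry.
    fn deposit(&self, amount: i64) -> Result<i64, MyError> {
        let mut balance = self
            .balance
            .write()
            // Dropping the PoisonError also drops the guard it carries.
            .map_err(|_| MyError::LockPoisoned("balance"))?;
        *balance += amount;
        Ok(*balance)
    }
}

/// Builds a ledger whose lock was poisoned by a panicking writer.
fn poisoned_ledger() -> Arc<Ledger> {
    let ledger = Arc::new(Ledger { balance: RwLock::new(100) });
    let worker = Arc::clone(&ledger);
    let _ = thread::spawn(move || {
        let _guard = worker.balance.write().unwrap();
        panic!("simulated failure mid-transaction");
    })
    .join();
    ledger
}
```

Mapping to an owned error type is also what makes the ? operator practical here: PoisonError carries the guard, which borrows from the lock, so it cannot easily be returned from the function as-is.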

3. Clear Poison (Advanced)

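For manual poison management. Note that into_inner() only hands back the guard; the lock stays poisoned until Mutex::clear_poison() (Rust 1.77+) is called:

use std::sync::{Arc, Mutex};

let mutex = Arc::new(Mutex::new(vec![1, 2, 3]));

match mutex.lock() {
    Ok(guard) => {
        // Lock is not poisoned
        println!("Data: {:?}", *guard);
    }
    Err(poisoned) => {
        // Recover the guard; the lock itself is still marked poisoned
        let guard = poisoned.into_inner();
        tracing::warn!("Recovered from poisoned lock");

        // Verify data consistency before clearing the poison flag.
        // Avoid assert! here: a failed assertion would panic inside the recovery path.
        if !guard.is_empty() {
            mutex.clear_poison();
        }
    }
}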

Best Practices

1. Always Log Poison Events

tracing::warn!(
    lock_name = "cache",
    operation = "get",
    "Lock was poisoned, recovering"
);

2. Provide Context

Include enough information to debug the issue:

let guard = self.entries_cache.write().unwrap_or_else(|poisoned| {
    tracing::warn!(
        lock = "entries_cache",
        method = "append",
        "Lock was poisoned during append operation, recovering"
    );
    poisoned.into_inner()
});

3. Verify Data After Recovery

let guard = self.data.lock().unwrap_or_else(|poisoned| {
    tracing::warn!("data lock was poisoned, recovering");
    let data = poisoned.into_inner();

    // Verify data consistency
    if data.is_empty() {
        tracing::error!("Recovered poisoned lock but data is empty!");
    }

    data
});
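An emptiness check is only one kind of validation. Where the protected state has a real invariant, recovery can check and repair it before handing the guard back. The sketch below assumes a hypothetical Totals type whose total must equal the sum of its items; all names are illustrative, and eprintln! stands in for tracing::error!.

```rust
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

/// Hypothetical state with an invariant: `total` equals the sum of `items`.
struct Totals {
    items: Vec<u32>,
    total: u32,
}

impl Totals {
    fn is_consistent(&self) -> bool {
        self.items.iter().sum::<u32>() == self.total
    }

    /// Restores the invariant by recomputing the derived field.
    fn repair(&mut self) {
        self.total = self.items.iter().sum();
    }
}

/// Acquires the lock; on poison, checks the invariant and repairs it if broken.
fn lock_and_verify(lock: &Mutex<Totals>) -> MutexGuard<'_, Totals> {
    lock.lock().unwrap_or_else(|poisoned| {
        let mut guard = poisoned.into_inner();
        if !guard.is_consistent() {
            eprintln!("totals lock was poisoned and state is inconsistent, repairing");
            guard.repair();
        }
        guard
    })
}

/// A writer panics between the two halves of an update; recovery repairs it.
fn recover_and_repair() -> (usize, u32) {
    let shared = Arc::new(Mutex::new(Totals { items: vec![1, 2], total: 3 }));
    let worker = Arc::clone(&shared);
    let _ = thread::spawn(move || {
        let mut guard = worker.lock().unwrap();
        guard.items.push(10);
        panic!("simulated failure before `total` is updated");
    })
    .join();

    let guard = lock_and_verify(&shared);
    (guard.items.len(), guard.total)
}
```

Because into_inner() does not clear the poison flag, every later acquisition passes through the recovery branch again, so the consistency check should stay cheap.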

4. Consider Adding Metrics

let guard = self.lock.write().unwrap_or_else(|poisoned| {
    tracing::warn!("Lock was poisoned, recovering");

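    // Increment poison counter metric.
    // Macro form for metrics 0.21 and earlier; from 0.22 the macro returns a
    // handle instead: counter!("lock_poison_events", ...).increment(1)
    metrics::counter!("lock_poison_events", 1,
        "lock_name" => "my_lock",
        "operation" => "write"
    );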

    poisoned.into_inner()
});
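When interpreting such a counter, it helps to know what it actually counts. The std-only sketch below uses a hypothetical static AtomicU64 in place of a metrics library to show that the counter advances on every acquisition of a poisoned lock, not once per panic.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

/// Hypothetical process-wide counter; real code would export this through
/// its metrics library instead.
static POISON_EVENTS: AtomicU64 = AtomicU64::new(0);

/// Acquires the lock and counts each recovery from a poisoned state.
fn lock_counted<T>(lock: &Mutex<T>) -> MutexGuard<'_, T> {
    lock.lock().unwrap_or_else(|poisoned| {
        POISON_EVENTS.fetch_add(1, Ordering::Relaxed);
        poisoned.into_inner()
    })
}

/// One panic, two later acquisitions: returns how far the counter advanced.
fn count_recoveries() -> u64 {
    let before = POISON_EVENTS.load(Ordering::Relaxed);

    let shared = Arc::new(Mutex::new(0u32));
    let worker = Arc::clone(&shared);
    let _ = thread::spawn(move || {
        let _guard = worker.lock().unwrap();
        panic!("simulated failure");
    })
    .join();

    // The lock stays poisoned after into_inner(), so both calls are counted.
    *lock_counted(&shared) += 1;
    *lock_counted(&shared) += 1;

    POISON_EVENTS.load(Ordering::Relaxed) - before
}
```

A single panic on a hot lock can therefore produce a high event rate, which is worth keeping in mind when choosing alert thresholds.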

Common Patterns in HeliosDB

Pattern 1: Cache Operations

// Read from cache with poison recovery
let cache = self.cache.read().unwrap_or_else(|poisoned| {
    tracing::warn!("cache lock was poisoned in get, recovering");
    poisoned.into_inner()
});

if let Some(entry) = cache.get(key) {
    return Ok(Some(entry.clone()));
}

Pattern 2: Metric Updates

// Update metrics with poison recovery
let mut guard = self.metrics.write().unwrap_or_else(|poisoned| {
    tracing::warn!("metrics lock was poisoned, recovering");
    poisoned.into_inner()
});

guard.increment_counter("requests");

Pattern 3: State Management

// Update state with poison recovery
let mut state = self.state.write().unwrap_or_else(|poisoned| {
    tracing::warn!("state lock was poisoned in update, recovering");
    poisoned.into_inner()
});

state.last_updated = Instant::now();
state.value = new_value;

Testing Lock Poison Recovery

Unit Test Example

#[test]
fn test_lock_poison_recovery() {
    use std::sync::{Arc, Mutex};
    use std::panic;

    let data = Arc::new(Mutex::new(vec![1, 2, 3]));
    let data_clone = Arc::clone(&data);

    // Deliberately poison the lock
    let _ = panic::catch_unwind(|| {
        let mut guard = data_clone.lock().unwrap();
        guard.push(4);
        panic!("Intentional panic to poison lock");
    });

    // Verify we can recover
    let guard = data.lock().unwrap_or_else(|poisoned| {
        poisoned.into_inner()
    });

    assert_eq!(*guard, vec![1, 2, 3, 4]);
}

Integration Test Example

#[tokio::test]
async fn test_concurrent_poison_recovery() {
    let storage = Arc::new(RaftStorage::new("/tmp/test").unwrap());

    // Spawn tasks; task 5 panics on purpose
    let handles: Vec<_> = (0..10)
        .map(|i| {
            let storage = Arc::clone(&storage);
            tokio::spawn(async move {
                if i == 5 {
                    panic!("Test panic");
                }
                storage.append(&[create_entry(i)]).unwrap();
            })
        })
        .collect();

    // Wait for all tasks
    for handle in handles {
        let _ = handle.await;
    }

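    // Verify storage still works after a task panicked.
    // Note: task 5 panics before calling append, so it never holds a storage
    // lock; this checks survival of a crashed task rather than a poisoned lock.
    let state = storage.initial_state().unwrap();
    // If term is unsigned this comparison is always true; the unwrap above is the real check.
    assert!(state.hard_state.term >= 0);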
}

Automated Checking

Use the provided script to find and fix lock unwraps:

# Find lock acquisitions that call .unwrap() on the same line
# (may also match unrelated .read()/.write() calls, e.g. on files)
grep -r "\.lock()\|\.write()\|\.read()" --include="*.rs" src/ \
    | grep "\.unwrap()"

# Apply automated fixes
python3 scripts/utilities/fix_lock_poison_batch.py src/

Monitoring in Production

1. Log Aggregation

Set up log queries to track poison events:

level:WARN AND message:"lock was poisoned"

2. Metrics Dashboard

Track poison events over time:

# Per-second rate of poison events, averaged over the last minute
rate(lock_poison_events_total[1m])

# Poison events by lock name
sum by (lock_name) (lock_poison_events_total)

3. Alerting

Alert on unusual poison rates:

alert: HighLockPoisonRate
expr: rate(lock_poison_events_total[5m]) > 0.1
for: 5m
annotations:
  summary: "High rate of lock poison events detected"

Troubleshooting

Issue: Frequent Poison Events

Symptoms: Logs show repeated poison warnings for the same lock

Diagnosis:

Check which operations are panicking
Review panic backtraces in logs
Identify root cause of panics

Resolution:

Fix the code causing panics
Add panic guards around risky operations
Return Result from fallible operations instead of panicking; reserve catch_unwind for boundaries where a panic must be contained
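One way to apply these resolutions is to do the fallible work before taking the lock and to report failure through Result, so a bad input cannot poison anything. The function record_reading and its parameters are hypothetical, chosen only to illustrate the ordering.

```rust
use std::collections::HashMap;
use std::num::ParseIntError;
use std::sync::Mutex;

/// Parses outside the critical section and returns an error instead of
/// panicking, so malformed input never unwinds while the lock is held.
fn record_reading(
    store: &Mutex<HashMap<String, i64>>,
    key: &str,
    raw: &str,
) -> Result<(), ParseIntError> {
    // Fallible work happens before the lock is taken.
    let value: i64 = raw.trim().parse()?;

    // The critical section contains only the insertion itself.
    let mut guard = store.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.insert(key.to_string(), value);
    Ok(())
}
```

Keeping the critical section this small also shortens lock hold times, independent of the poisoning concern.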

Issue: Data Inconsistency After Recovery

Symptoms: Data appears corrupted after poison recovery

Diagnosis:

Check if partial writes occurred before panic
Review transaction boundaries
Verify invariants after recovery

Resolution:

Add validation after poison recovery
Use transactional updates
Consider using atomic operations instead of locks
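A transactional update can be sketched as "mutate a private copy, then commit with one assignment". If the update panics, the lock is still poisoned, but the protected value is untouched, so recovery sees consistent data. The Account type and update_atomically helper are hypothetical names for this illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical state; `total` is meant to equal the sum of `items`.
#[derive(Clone, Debug, PartialEq)]
struct Account {
    items: Vec<u32>,
    total: u32,
}

/// Applies `update` to a draft copy and commits only if it returns normally.
fn update_atomically<F>(lock: &Mutex<Account>, update: F)
where
    F: FnOnce(&mut Account),
{
    let mut guard = lock.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    let mut draft = (*guard).clone();
    update(&mut draft); // a panic here unwinds before the commit below
    *guard = draft; // single assignment commits the whole update
}

/// An update panics halfway through; the shared state keeps its old value.
fn failed_update_leaves_state_intact() -> Account {
    let shared = Arc::new(Mutex::new(Account { items: vec![1, 2], total: 3 }));
    let worker = Arc::clone(&shared);
    let _ = thread::spawn(move || {
        update_atomically(&worker, |account| {
            account.items.push(10);
            panic!("simulated failure halfway through the update");
        });
    })
    .join();

    let guard = shared.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    (*guard).clone()
}
```

The trade-off is one clone per update, which may be too expensive for large structures; in that case validating and repairing after recovery is the more practical option.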

Quick Reference

// ❌ AVOID: panics on a poisoned lock
let guard = lock.lock().unwrap();

// ✅ PREFER: recover (or propagate an error with map_err when the caller must know)
let guard = lock.lock().unwrap_or_else(|poisoned| {
    tracing::warn!("lock was poisoned, recovering");
    poisoned.into_inner()
});

Remember: Lock poisoning is a recoverable error. Always handle it gracefully to ensure system resilience!
