Lock Poison Recovery - Developer Guide

Overview

This guide explains how to properly handle lock poisoning in HeliosDB to prevent cascading failures and ensure system resilience.

What is Lock Poisoning?

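A lock becomes “poisoned” when a thread panics while holding it. The Rust standard library then marks the lock as poisoned to signal that the shared state may have been left inconsistent, and every later lock(), read(), or write() call returns an Err that still carries the guard. An RwLock is poisoned only by a panic while the write lock is held.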
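To make the definition concrete, here is a minimal sketch that poisons a std Mutex on purpose and then inspects it from another thread. The function name observe_poisoning is hypothetical and is not part of HeliosDB.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Poisons a mutex by panicking on a worker thread while the guard is alive,
/// then reports what the calling thread observes afterwards.
fn observe_poisoning() -> (bool, Vec<i32>) {
    let shared = Arc::new(Mutex::new(vec![1, 2, 3]));

    let worker = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            let mut guard = shared.lock().unwrap();
            guard.push(4);
            // The guard is dropped during unwinding, which marks the mutex as poisoned.
            panic!("simulated failure while holding the lock");
        })
    };
    // join() returns Err because the worker panicked; the result is ignored here.
    let _ = worker.join();

    let poisoned = shared.is_poisoned();
    // The data stays reachable: the error value still carries the guard.
    let contents = match shared.lock() {
        Ok(guard) => guard.clone(),
        Err(err) => err.into_inner().clone(),
    };
    (poisoned, contents)
}
```

Note that the write made before the panic (the pushed 4) is still visible, which is exactly why the state has to be treated as suspect after recovery.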

The Problem

// ❌ BAD: This will panic if the lock is poisoned
let guard = self.cache.lock().unwrap();

When this code encounters a poisoned lock, it panics, potentially causing:

  • Cascading failures across threads
  • Cluster coordination breakdown
  • Data loss or inconsistency
  • Service outages

The Solution

Standard Pattern

// ✅ GOOD: Recover from poisoned locks
let guard = self.cache.lock().unwrap_or_else(|poisoned| {
    tracing::warn!("cache lock was poisoned, recovering");
    poisoned.into_inner()
});

For Different Lock Types

Mutex (exclusive lock)

use std::sync::{Arc, Mutex};

let data = Arc::new(Mutex::new(Vec::new()));
// Acquire lock with poison recovery
let mut guard = data.lock().unwrap_or_else(|poisoned| {
    tracing::warn!("data lock was poisoned, recovering");
    poisoned.into_inner()
});
guard.push(42);

RwLock (read lock)

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

let data = Arc::new(RwLock::new(HashMap::new()));
// Read lock with poison recovery
let guard = data.read().unwrap_or_else(|poisoned| {
    tracing::warn!("data read lock was poisoned, recovering");
    poisoned.into_inner()
});
let value = guard.get(&key);

RwLock (write lock)

// Write lock with poison recovery
let mut guard = data.write().unwrap_or_else(|poisoned| {
    tracing::warn!("data write lock was poisoned, recovering");
    poisoned.into_inner()
});
guard.insert(key, value);
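Because the same closure is repeated for every lock type, it may be worth factoring it into small helpers. The following is only a sketch: the function names are hypothetical, and eprintln! stands in for tracing::warn! so the example depends only on the standard library.

```rust
use std::sync::{Mutex, MutexGuard, RwLock, RwLockReadGuard, RwLockWriteGuard};

/// Locks a Mutex, recovering the guard if the lock is poisoned.
fn lock_or_recover<'a, T>(lock: &'a Mutex<T>, name: &str) -> MutexGuard<'a, T> {
    lock.lock().unwrap_or_else(|poisoned| {
        eprintln!("{name} lock was poisoned, recovering");
        poisoned.into_inner()
    })
}

/// Takes a read lock on an RwLock, recovering the guard if it is poisoned.
fn read_or_recover<'a, T>(lock: &'a RwLock<T>, name: &str) -> RwLockReadGuard<'a, T> {
    lock.read().unwrap_or_else(|poisoned| {
        eprintln!("{name} read lock was poisoned, recovering");
        poisoned.into_inner()
    })
}

/// Takes a write lock on an RwLock, recovering the guard if it is poisoned.
fn write_or_recover<'a, T>(lock: &'a RwLock<T>, name: &str) -> RwLockWriteGuard<'a, T> {
    lock.write().unwrap_or_else(|poisoned| {
        eprintln!("{name} write lock was poisoned, recovering");
        poisoned.into_inner()
    })
}
```

A call site then shrinks to something like `let guard = lock_or_recover(&self.cache, "cache");`, and the log message stays consistent across the codebase.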

When to Use Each Approach

1. Recovery (Preferred for most cases)

Use unwrap_or_else when:

  • The operation can safely continue
  • Data consistency can be verified
  • Logs/metrics capture the event

let guard = self.lock.write().unwrap_or_else(|poisoned| {
    tracing::warn!("Lock was poisoned, recovering");
    poisoned.into_inner()
});

2. Error Propagation (When caller needs to know)

Use map_err when:

  • Caller needs to handle poison differently
  • Operation should fail on poison
  • Part of a transaction that needs rollback

let guard = self.lock.write()
    .map_err(|_| MyError::LockPoisoned("lock_name"))?;
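The snippet above relies on an error type that the guide does not define. As a sketch of how the pieces might fit together, the MyError definition, the Registry struct, and its register method below are all assumptions made for illustration.

```rust
use std::fmt;
use std::sync::RwLock;

/// Hypothetical error type; only the LockPoisoned variant name comes from the guide.
#[derive(Debug, PartialEq)]
enum MyError {
    LockPoisoned(&'static str),
}

impl fmt::Display for MyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MyError::LockPoisoned(name) => write!(f, "lock `{name}` is poisoned"),
        }
    }
}

impl std::error::Error for MyError {}

/// Hypothetical component that refuses to write through a poisoned lock.
struct Registry {
    entries: RwLock<Vec<String>>,
}

impl Registry {
    /// Appends a name and returns the new entry count, or fails on poison.
    fn register(&self, name: &str) -> Result<usize, MyError> {
        let mut guard = self
            .entries
            .write()
            .map_err(|_| MyError::LockPoisoned("entries"))?;
        guard.push(name.to_owned());
        Ok(guard.len())
    }
}
```

The PoisonError is discarded inside map_err, so the guard it carries is released immediately and the caller decides whether to retry, roll back, or abort.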

3. Clear Poison (Advanced)

For manual poison management, match on the lock result directly. Note that into_inner() returns the guard but leaves the lock marked as poisoned, so later callers still receive Err:

use std::sync::{Arc, Mutex};

let mutex = Arc::new(Mutex::new(vec![1, 2, 3]));
match mutex.lock() {
    Ok(guard) => {
        // Lock is not poisoned
        println!("Data: {:?}", *guard);
    }
    Err(poisoned) => {
        // Recover the data; the poison flag itself stays set
        let guard = poisoned.into_inner();
        tracing::warn!("Recovered from poisoned lock");
        // Verify data consistency here
        assert!(!guard.is_empty());
    }
}
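To actually reset the poison flag, the standard library provides Mutex::clear_poison (stable since Rust 1.77; RwLock has the same method). The sketch below assumes a hypothetical invariant, that the vector must be sorted, and a hypothetical function name; it repairs the data first and only then clears the flag.

```rust
use std::sync::Mutex;

/// Returns true if the mutex was poisoned and has been repaired and reset.
/// The "must be sorted" invariant is an assumption made for this example.
fn repair_if_poisoned(mutex: &Mutex<Vec<i32>>) -> bool {
    match mutex.lock() {
        Ok(_) => false,
        Err(poisoned) => {
            let mut guard = poisoned.into_inner();
            // Restore the invariant before declaring the state healthy again.
            guard.sort_unstable();
            // Reset the flag while still holding the guard, so no other thread
            // can observe a cleared flag alongside unrepaired data.
            mutex.clear_poison();
            true
        }
    }
}
```

After this call, plain lock() returns Ok again, so the recovery warning is logged once instead of on every acquisition.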

Best Practices

1. Always Log Poison Events

tracing::warn!(
    lock_name = "cache",
    operation = "get",
    "Lock was poisoned, recovering"
);

2. Provide Context

Include enough information to debug the issue:

let guard = self.entries_cache.write().unwrap_or_else(|poisoned| {
    tracing::warn!(
        lock = "entries_cache",
        method = "append",
        "Lock was poisoned during append operation, recovering"
    );
    poisoned.into_inner()
});

3. Verify Data After Recovery

let guard = self.data.lock().unwrap_or_else(|poisoned| {
    tracing::warn!("data lock was poisoned, recovering");
    let data = poisoned.into_inner();
    // Verify data consistency
    if data.is_empty() {
        tracing::error!("Recovered poisoned lock but data is empty!");
    }
    data
});
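An emptiness check is only a placeholder; in practice the check should test a real invariant of the protected data. As an illustration, the sketch below uses a hypothetical Ledger whose total must equal the sum of its entries, and repairs the total if a panic interrupted an update halfway. eprintln! stands in for tracing::error!.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical structure with an invariant: total == sum of entries.
struct Ledger {
    entries: Vec<i64>,
    total: i64,
}

impl Ledger {
    fn is_consistent(&self) -> bool {
        self.entries.iter().sum::<i64>() == self.total
    }
}

/// Locks the ledger; after a poison event, re-establishes the invariant.
fn lock_ledger(ledger: &Mutex<Ledger>) -> MutexGuard<'_, Ledger> {
    ledger.lock().unwrap_or_else(|poisoned| {
        let mut guard = poisoned.into_inner();
        if !guard.is_consistent() {
            eprintln!("ledger left inconsistent by a panic, recomputing total");
            let sum: i64 = guard.entries.iter().sum();
            guard.total = sum;
        }
        guard
    })
}
```

Whether a repair like this is acceptable depends on the data: for state that cannot be rebuilt from itself, propagating an error is usually the safer choice.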

4. Consider Adding Metrics

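let guard = self.lock.write().unwrap_or_else(|poisoned| {
    tracing::warn!("Lock was poisoned, recovering");
    // Increment poison counter metric. This is the macro form used by
    // metrics 0.21 and earlier; newer releases use
    // counter!("lock_poison_events", ...).increment(1) instead.
    metrics::counter!("lock_poison_events", 1,
        "lock_name" => "my_lock",
        "operation" => "write"
    );
    poisoned.into_inner()
});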
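One thing to keep in mind when reading such a counter: unless the poison flag is cleared, the recovery branch runs on every acquisition, so the metric counts recoveries rather than panics. The sketch below shows this with a plain AtomicU64 standing in for the metrics crate; the static and both function names are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Mutex, MutexGuard};

/// Stand-in for a real metrics counter.
static POISON_RECOVERIES: AtomicU64 = AtomicU64::new(0);

/// Locks with poison recovery and counts every recovery.
fn lock_counted<T>(lock: &Mutex<T>) -> MutexGuard<'_, T> {
    lock.lock().unwrap_or_else(|poisoned| {
        POISON_RECOVERIES.fetch_add(1, Ordering::Relaxed);
        poisoned.into_inner()
    })
}

/// Reads the current recovery count.
fn poison_recoveries() -> u64 {
    POISON_RECOVERIES.load(Ordering::Relaxed)
}
```

A single panic followed by two acquisitions therefore raises the count by two, which is worth remembering when choosing alert thresholds.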

Common Patterns in HeliosDB

Pattern 1: Cache Operations

// Read from cache with poison recovery
let cache = self.cache.read().unwrap_or_else(|poisoned| {
    tracing::warn!("cache lock was poisoned in get, recovering");
    poisoned.into_inner()
});
if let Some(entry) = cache.get(key) {
    return Ok(Some(entry.clone()));
}

Pattern 2: Metric Updates

// Update metrics with poison recovery
let mut guard = self.metrics.write().unwrap_or_else(|poisoned| {
    tracing::warn!("metrics lock was poisoned, recovering");
    poisoned.into_inner()
});
guard.increment_counter("requests");

Pattern 3: State Management

// Update state with poison recovery
let mut state = self.state.write().unwrap_or_else(|poisoned| {
    tracing::warn!("state lock was poisoned in update, recovering");
    poisoned.into_inner()
});
state.last_updated = Instant::now();
state.value = new_value;

Testing Lock Poison Recovery

Unit Test Example

#[test]
fn test_lock_poison_recovery() {
    use std::sync::{Arc, Mutex};
    use std::panic;

    let data = Arc::new(Mutex::new(vec![1, 2, 3]));
    let data_clone = Arc::clone(&data);

    // Deliberately poison the lock
    let _ = panic::catch_unwind(|| {
        let mut guard = data_clone.lock().unwrap();
        guard.push(4);
        panic!("Intentional panic to poison lock");
    });

    // Verify we can recover
    let guard = data.lock().unwrap_or_else(|poisoned| {
        poisoned.into_inner()
    });
    assert_eq!(*guard, vec![1, 2, 3, 4]);
}

Integration Test Example

#[tokio::test]
async fn test_concurrent_poison_recovery() {
    let storage = Arc::new(RaftStorage::new("/tmp/test").unwrap());

    // Spawn tasks, one of which panics. Note that this panic fires before
    // append() is called, so it only poisons a lock if the panic is moved
    // to a point where a storage lock is held.
    let handles: Vec<_> = (0..10)
        .map(|i| {
            let storage = Arc::clone(&storage);
            tokio::spawn(async move {
                if i == 5 {
                    panic!("Test panic");
                }
                storage.append(&[create_entry(i)]).unwrap();
            })
        })
        .collect();

    // Wait for all tasks; the panicked task yields a JoinError
    for handle in handles {
        let _ = handle.await;
    }

    // Verify storage still works after the panic
    let state = storage.initial_state().unwrap();
    assert!(state.hard_state.term >= 0);
}

Automated Checking

Use the provided script to find and fix lock unwraps:

# Find lock acquisitions that call .unwrap() on the same line
grep -r "\.lock()\|\.write()\|\.read()" --include="*.rs" src/ \
| grep "\.unwrap()"
# Apply automated fixes
python3 scripts/utilities/fix_lock_poison_batch.py src/

Monitoring in Production

1. Log Aggregation

Set up log queries to track poison events:

level:WARN AND message:"lock was poisoned"

2. Metrics Dashboard

Track poison events over time:

# Count of poison events per minute
rate(lock_poison_events_total[1m])
# Poison events by lock name
sum by (lock_name) (lock_poison_events_total)

3. Alerting

Alert on unusual poison rates:

alert: HighLockPoisonRate
expr: rate(lock_poison_events_total[5m]) > 0.1
for: 5m
annotations:
  summary: "High rate of lock poison events detected"

Troubleshooting

Issue: Frequent Poison Events

Symptoms: Logs show repeated poison warnings for the same lock

Diagnosis:

  1. Check which operations are panicking
  2. Review panic backtraces in logs
  3. Identify root cause of panics

Resolution:

  1. Fix the code causing panics
  2. Add panic guards around risky operations
  3. Consider using catch_unwind around code that may panic, and return Result for failures that are expected
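One way to apply these resolutions is to keep fallible work outside the critical section and report failure as a Result, so the lock is held only for an operation that cannot fail. This is a minimal sketch; the function name parse_and_store and the choice of a Vec<u32> store are hypothetical.

```rust
use std::num::ParseIntError;
use std::sync::Mutex;

/// Parses the input first, then takes the lock only for the push.
/// A bad input yields Err and never touches (or poisons) the lock.
fn parse_and_store(store: &Mutex<Vec<u32>>, raw: &str) -> Result<(), ParseIntError> {
    // Fallible step: runs before the lock is acquired.
    let value: u32 = raw.trim().parse()?;

    // Critical section: a single push on already validated data.
    let mut guard = store.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.push(value);
    Ok(())
}
```

Shrinking the critical section this way reduces both the chance of poisoning and the amount of state that could be left half-updated.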

Issue: Data Inconsistency After Recovery

Symptoms: Data appears corrupted after poison recovery

Diagnosis:

  1. Check if partial writes occurred before panic
  2. Review transaction boundaries
  3. Verify invariants after recovery

Resolution:

  1. Add validation after poison recovery
  2. Use transactional updates
  3. Consider using atomic operations instead of locks
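A transactional update can be sketched as "modify a copy, then commit": the shared value is replaced only after the whole update has succeeded, so a panic partway through leaves the previous state intact. The Config struct and the function name below are hypothetical, and the approach assumes the state is cheap enough to clone.

```rust
use std::sync::Mutex;

/// Hypothetical shared state.
#[derive(Clone, Debug, PartialEq)]
struct Config {
    replicas: u32,
    shards: u32,
}

/// Applies `apply` to a private copy and commits it only on success.
fn update_transactionally<F>(state: &Mutex<Config>, apply: F)
where
    F: FnOnce(&mut Config),
{
    let mut guard = state.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    let mut draft = (*guard).clone();
    // If this panics, unwinding starts before the shared value is modified.
    apply(&mut draft);
    // Commit: a single assignment replaces the old state.
    *guard = draft;
}
```

The lock is still poisoned by a panic inside apply, but a recovering caller sees the last committed value rather than a partial write.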

Quick Reference

// ❌ AVOID: panics if the lock is poisoned
let guard = lock.lock().unwrap();

// ✅ PREFER: recover and log (or propagate an error with map_err)
let guard = lock.lock().unwrap_or_else(|poisoned| {
    tracing::warn!("lock was poisoned, recovering");
    poisoned.into_inner()
});

Remember: lock poisoning is usually recoverable. Handle it explicitly, by recovering or by propagating an error, rather than unwrapping, to keep the system resilient.