Lock Poison Recovery - Developer Guide
Overview
This guide explains how to properly handle lock poisoning in HeliosDB to prevent cascading failures and ensure system resilience.
What is Lock Poisoning?
A lock becomes “poisoned” when a thread panics while holding the lock. The Rust standard library marks the lock as poisoned to indicate that shared state may be inconsistent.
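As a minimal sketch of the behaviour described above, the following uses only the standard library to poison a mutex on purpose and then checks the flag. The function name `poison_demo` is hypothetical and exists only for this illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Poisons a mutex by panicking in a worker thread while the guard is held,
/// then reports whether the standard library flagged the lock.
fn poison_demo() -> bool {
    let data = Arc::new(Mutex::new(vec![1, 2, 3]));
    let worker_data = Arc::clone(&data);

    // The worker panics while holding the guard; unwinding drops the guard
    // and marks the mutex as poisoned.
    let result = thread::spawn(move || {
        let mut guard = worker_data.lock().unwrap();
        guard.push(4);
        panic!("simulated failure while holding the lock");
    })
    .join();

    // join() returns Err because the worker panicked.
    assert!(result.is_err());

    data.is_poisoned()
}
```

Calling `poison_demo()` returns `true`: the panic happened while the guard was alive, so the mutex is flagged even though the data itself is still reachable.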
The Problem

    // ❌ BAD: This will panic if the lock is poisoned
    let guard = self.cache.lock().unwrap();

When this code encounters a poisoned lock, it panics, potentially causing:
- Cascading failures across threads
- Cluster coordination breakdown
- Data loss or inconsistency
- Service outages
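The cascading effect listed above can be reproduced in a few lines. This is a sketch under simple assumptions: a shared `u64` stands in for a cache, and `count_cascading_panics` is a hypothetical name chosen for the example.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Returns how many of `readers` threads panicked because they called
/// `.lock().unwrap()` on a mutex that an earlier thread had poisoned.
fn count_cascading_panics(readers: usize) -> usize {
    let cache = Arc::new(Mutex::new(0u64));

    // First failure: one thread panics while holding the guard.
    let writer_cache = Arc::clone(&cache);
    let _ = thread::spawn(move || {
        let _guard = writer_cache.lock().unwrap();
        panic!("original failure");
    })
    .join();

    // Every later thread that unwraps the lock result now panics too.
    let handles: Vec<_> = (0..readers)
        .map(|_| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                let guard = cache.lock().unwrap(); // panics: lock is poisoned
                *guard
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|handle| handle.join())
        .filter(|result| result.is_err())
        .count()
}
```

One original panic turns into one additional panic per thread that touches the lock afterwards, which is why a single bug can take down unrelated request paths.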
The Solution
Standard Pattern

    // ✅ GOOD: Recover from poisoned locks
    let guard = self.cache.lock().unwrap_or_else(|poisoned| {
        tracing::warn!("cache lock was poisoned, recovering");
        poisoned.into_inner()
    });

For Different Lock Types
Mutex (exclusive lock)

    use std::sync::{Arc, Mutex};

    let data = Arc::new(Mutex::new(Vec::new()));

    // Acquire lock with poison recovery
    let mut guard = data.lock().unwrap_or_else(|poisoned| {
        tracing::warn!("data lock was poisoned, recovering");
        poisoned.into_inner()
    });
    guard.push(42);

RwLock (read lock)

    use std::collections::HashMap;
    use std::sync::{Arc, RwLock};

    let data = Arc::new(RwLock::new(HashMap::new()));

    // Read lock with poison recovery
    let guard = data.read().unwrap_or_else(|poisoned| {
        tracing::warn!("data read lock was poisoned, recovering");
        poisoned.into_inner()
    });
    let value = guard.get(&key);

RwLock (write lock)

    // Write lock with poison recovery
    let mut guard = data.write().unwrap_or_else(|poisoned| {
        tracing::warn!("data write lock was poisoned, recovering");
        poisoned.into_inner()
    });
    guard.insert(key, value);

When to Use Each Approach
1. Recovery (Preferred for most cases)
Use unwrap_or_else when:
- The operation can safely continue
- Data consistency can be verified
- Logs/metrics capture the event
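Where these conditions hold at many call sites, the recovery closure can be factored into one function so the log line is never forgotten. The sketch below is an assumption, not an existing HeliosDB API: `lock_recover` is a hypothetical helper, and `eprintln!` stands in for `tracing::warn!` so the example needs no external crates.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical helper (not part of HeliosDB): locks `mutex` and recovers the
/// guard if the lock was poisoned. `name` is only used for the log line.
fn lock_recover<'a, T>(mutex: &'a Mutex<T>, name: &str) -> MutexGuard<'a, T> {
    mutex.lock().unwrap_or_else(|poisoned| {
        // Stand-in for tracing::warn! to keep the sketch dependency-free.
        eprintln!("{name} lock was poisoned, recovering");
        poisoned.into_inner()
    })
}
```

A call site then shrinks to `let guard = lock_recover(&self.cache, "cache");`. The same shape works for `RwLock` with one function each for `read()` and `write()`.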

    let guard = self.lock.write().unwrap_or_else(|poisoned| {
        tracing::warn!("Lock was poisoned, recovering");
        poisoned.into_inner()
    });

2. Error Propagation (When caller needs to know)
Use map_err when:
- Caller needs to handle poison differently
- Operation should fail on poison
- Part of a transaction that needs rollback
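For context, here is a self-contained sketch of the propagation approach. The guide does not define `MyError`, so the enum below is an assumed shape, and `Store`, `balance`, and `deposit` are hypothetical names used only to give the `?` operator something to return from.

```rust
use std::fmt;
use std::sync::RwLock;

/// Assumed shape of the error type; the real `MyError` may differ.
#[derive(Debug, PartialEq)]
enum MyError {
    LockPoisoned(&'static str),
}

impl fmt::Display for MyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MyError::LockPoisoned(name) => write!(f, "lock `{name}` is poisoned"),
        }
    }
}

impl std::error::Error for MyError {}

/// Hypothetical store used only for this sketch.
struct Store {
    balance: RwLock<i64>,
}

impl Store {
    /// Fails instead of recovering, so the caller can roll back or retry.
    fn deposit(&self, amount: i64) -> Result<i64, MyError> {
        let mut guard = self
            .balance
            .write()
            .map_err(|_| MyError::LockPoisoned("balance"))?;
        *guard += amount;
        Ok(*guard)
    }
}
```

Discarding the `PoisonError` in `map_err` also drops the guard it carries, so the lock is released before the error reaches the caller.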

    let guard = self.lock.write()
        .map_err(|_| MyError::LockPoisoned("lock_name"))?;

3. Clear Poison (Advanced)
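For manual poison management, match on the lock result. Note that into_inner() only hands back the guard; the mutex stays flagged as poisoned until Mutex::clear_poison() (Rust 1.77+) is called.

    use std::sync::{Arc, Mutex};

    let mutex = Arc::new(Mutex::new(vec![1, 2, 3]));
    let result = mutex.lock();

    match result {
        Ok(guard) => {
            // Lock is not poisoned
            println!("Data: {:?}", *guard);
        }
        Err(poisoned) => {
            // Recover the guard; the poison flag is still set at this point
            let guard = poisoned.into_inner();
            tracing::warn!("Recovered from poisoned lock");

            // Verify data consistency here
            assert!(!guard.is_empty());
            drop(guard);

            // Reset the flag once the data is known to be good (Rust 1.77+)
            mutex.clear_poison();
        }
    }

Best Practices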
1. Always Log Poison Events

    tracing::warn!(
        lock_name = "cache",
        operation = "get",
        "Lock was poisoned, recovering"
    );

2. Provide Context
Include enough information to debug the issue:

    let guard = self.entries_cache.write().unwrap_or_else(|poisoned| {
        tracing::warn!(
            lock = "entries_cache",
            method = "append",
            "Lock was poisoned during append operation, recovering"
        );
        poisoned.into_inner()
    });

3. Verify Data After Recovery

    let guard = self.data.lock().unwrap_or_else(|poisoned| {
        tracing::warn!("data lock was poisoned, recovering");
        let data = poisoned.into_inner();

        // Verify data consistency
        if data.is_empty() {
            tracing::error!("Recovered poisoned lock but data is empty!");
        }

        data
    });

4. Consider Adding Metrics

    let guard = self.lock.write().unwrap_or_else(|poisoned| {
        tracing::warn!("Lock was poisoned, recovering");

        // Increment poison counter metric
        metrics::counter!("lock_poison_events", 1,
            "lock_name" => "my_lock",
            "operation" => "write"
        );

        poisoned.into_inner()
    });

Common Patterns in HeliosDB
Pattern 1: Cache Operations

    // Read from cache with poison recovery
    let cache = self.cache.read().unwrap_or_else(|poisoned| {
        tracing::warn!("cache lock was poisoned in get, recovering");
        poisoned.into_inner()
    });

    if let Some(entry) = cache.get(key) {
        return Ok(Some(entry.clone()));
    }

Pattern 2: Metric Updates

    // Update metrics with poison recovery
    // (`mut` assumes increment_counter takes `&mut self`)
    let mut guard = self.metrics.write().unwrap_or_else(|poisoned| {
        tracing::warn!("metrics lock was poisoned, recovering");
        poisoned.into_inner()
    });
    guard.increment_counter("requests");

Pattern 3: State Management

    // Update state with poison recovery
    let mut state = self.state.write().unwrap_or_else(|poisoned| {
        tracing::warn!("state lock was poisoned in update, recovering");
        poisoned.into_inner()
    });
    state.last_updated = Instant::now();
    state.value = new_value;

Testing Lock Poison Recovery
Unit Test Example

    #[test]
    fn test_lock_poison_recovery() {
        use std::sync::{Arc, Mutex};
        use std::panic;

        let data = Arc::new(Mutex::new(vec![1, 2, 3]));
        let data_clone = Arc::clone(&data);

        // Deliberately poison the lock
        let _ = panic::catch_unwind(|| {
            let mut guard = data_clone.lock().unwrap();
            guard.push(4);
            panic!("Intentional panic to poison lock");
        });

        // Verify we can recover
        let guard = data.lock().unwrap_or_else(|poisoned| {
            poisoned.into_inner()
        });
        assert_eq!(*guard, vec![1, 2, 3, 4]);
    }

Integration Test Example

    #[tokio::test]
    async fn test_concurrent_poison_recovery() {
        let storage = Arc::new(RaftStorage::new("/tmp/test").unwrap());

        // Spawn tasks that might panic
        let handles: Vec<_> = (0..10)
            .map(|i| {
                let storage = Arc::clone(&storage);
                tokio::spawn(async move {
                    if i == 5 {
                        panic!("Test panic");
                    }
                    storage.append(&[create_entry(i)]).unwrap();
                })
            })
            .collect();

        // Wait for all tasks
        for handle in handles {
            let _ = handle.await;
        }

        // Verify storage still works after poison
        let state = storage.initial_state().unwrap();
        assert!(state.hard_state.term >= 0);
    }

Automated Checking
Use the provided script to find and fix lock unwraps:

    # Find all unsafe lock unwraps
    grep -r "\.lock()\|\.write()\|\.read()" --include="*.rs" src/ \
      | grep "\.unwrap()"

    # Apply automated fixes
    python3 scripts/utilities/fix_lock_poison_batch.py src/

Monitoring in Production
1. Log Aggregation
Set up log queries to track poison events:

    level:WARN AND message:"lock was poisoned"

2. Metrics Dashboard
Track poison events over time:

    # Per-second rate of poison events, averaged over the last minute
    rate(lock_poison_events_total[1m])

    # Poison events by lock name
    sum by (lock_name) (lock_poison_events_total)

3. Alerting
Alert on unusual poison rates:

    alert: HighLockPoisonRate
    expr: rate(lock_poison_events_total[5m]) > 0.1
    for: 5m
    annotations:
      summary: "High rate of lock poison events detected"

Troubleshooting
Issue: Frequent Poison Events
Symptoms: Logs show repeated poison warnings for the same lock
Diagnosis:
- Check which operations are panicking
- Review panic backtraces in logs
- Identify root cause of panics
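One way to make panic backtraces show up in the logs is a process-wide panic hook. The sketch below relies only on the standard library; `install_panic_logging` and `PANIC_COUNT` are hypothetical names, and `eprintln!` stands in for whatever logging macro the service uses.

```rust
use std::backtrace::Backtrace;
use std::panic;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of panics observed by the hook since it was installed.
static PANIC_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Installs a process-wide hook that records each panic's location and a
/// backtrace before the panicking thread unwinds.
fn install_panic_logging() {
    panic::set_hook(Box::new(|info| {
        PANIC_COUNT.fetch_add(1, Ordering::SeqCst);

        let location = info
            .location()
            .map(|l| format!("{}:{}", l.file(), l.line()))
            .unwrap_or_else(|| String::from("unknown location"));

        // Stand-in for tracing::error! to keep the sketch dependency-free.
        eprintln!("panic at {location}: {info}");
        eprintln!("backtrace:\n{}", Backtrace::force_capture());
    }));
}
```

Installing the hook once at startup means the panic that poisoned a lock is logged with its source location, which makes it easier to match against later poison warnings for the same lock.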
Resolution:
- Fix the code causing panics
- Add panic guards around risky operations
- Consider using catch_unwind for fallible operations
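A panic guard of the kind suggested above might look like the following sketch. It assumes the risky work can run before the lock is taken, and that the binary is built with the default `panic = "unwind"` setting (`catch_unwind` cannot intercept aborts). `update_guarded` is a hypothetical name.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::Mutex;

/// Runs `compute` outside the lock and converts a panic into an `Err`,
/// so the mutex is only taken once a value is ready to store.
fn update_guarded<F>(slot: &Mutex<i64>, compute: F) -> Result<i64, String>
where
    F: FnOnce() -> i64,
{
    // AssertUnwindSafe: nothing the closure captured is reused after a panic.
    let value = panic::catch_unwind(AssertUnwindSafe(compute))
        .map_err(|_| String::from("computation panicked; lock left untouched"))?;

    // Only a plain assignment happens while the guard is held.
    let mut guard = slot.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    *guard = value;
    Ok(value)
}
```

Because the panic is caught before `lock()` is called, the mutex never becomes poisoned and the previously stored value stays intact.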
Issue: Data Inconsistency After Recovery
Symptoms: Data appears corrupted after poison recovery
Diagnosis:
- Check if partial writes occurred before panic
- Review transaction boundaries
- Verify invariants after recovery
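Checking invariants after recovery could look like the sketch below. `Ledger` and its rule (`total` equals the sum of `items`) are hypothetical, chosen because a panic between the two writes leaves an easily detectable partial update.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical state with one invariant: `total` equals the sum of `items`.
struct Ledger {
    items: Vec<u64>,
    total: u64,
}

impl Ledger {
    fn is_consistent(&self) -> bool {
        self.items.iter().sum::<u64>() == self.total
    }

    /// Restores the invariant by recomputing the derived field.
    fn repair(&mut self) {
        self.total = self.items.iter().sum();
    }
}

/// Locks the ledger; after a poison event, checks the invariant and repairs
/// the derived field if a partial write is detected.
fn lock_validated(ledger: &Mutex<Ledger>) -> MutexGuard<'_, Ledger> {
    match ledger.lock() {
        Ok(guard) => guard,
        Err(poisoned) => {
            let mut guard = poisoned.into_inner();
            if !guard.is_consistent() {
                // Stand-in for tracing::error!
                eprintln!("ledger inconsistent after poison, recomputing total");
                guard.repair();
            }
            guard
        }
    }
}
```

Repair is only possible here because `total` can be derived from `items`; for state that cannot be rebuilt, returning an error is the safer choice.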
Resolution:
- Add validation after poison recovery
- Use transactional updates
- Consider using atomic operations instead of locks
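For simple counters, the atomic alternative mentioned in the last item avoids poisoning entirely, since no guard exists for a panicking thread to hold. In this sketch, `RequestStats` and `count_with_one_panic` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical request counter backed by an atomic instead of a Mutex<u64>.
struct RequestStats {
    requests: AtomicU64,
}

impl RequestStats {
    fn record(&self) -> u64 {
        // fetch_add returns the previous value, so add 1 for the new total.
        self.requests.fetch_add(1, Ordering::Relaxed) + 1
    }
}

/// Spawns `workers` threads that each record one request; the thread with
/// index 0 panics right afterwards. Returns the final count.
fn count_with_one_panic(workers: u64) -> u64 {
    let stats = Arc::new(RequestStats { requests: AtomicU64::new(0) });

    let handles: Vec<_> = (0..workers)
        .map(|i| {
            let stats = Arc::clone(&stats);
            thread::spawn(move || {
                stats.record();
                if i == 0 {
                    panic!("simulated failure after recording");
                }
            })
        })
        .collect();

    for handle in handles {
        let _ = handle.join(); // ignore the one expected panic
    }

    stats.requests.load(Ordering::Relaxed)
}
```

The counter stays usable after the panic and every increment is kept. This only fits state that can be held in a single atomic value; multi-field updates still need a lock or a transactional scheme.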
See Also
- Rust Book: Shared-State Concurrency
- std::sync::PoisonError Documentation
- LOCK_POISON_RECOVERY_IMPLEMENTATION_REPORT.md - Full implementation details
- scripts/utilities/fix_lock_poison_batch.py - Automated fixing tool
Quick Reference

    // ❌ NEVER DO THIS
    let guard = lock.lock().unwrap();

    // ✅ ALWAYS DO THIS
    let guard = lock.lock().unwrap_or_else(|poisoned| {
        tracing::warn!("lock was poisoned, recovering");
        poisoned.into_inner()
    });

Remember: Lock poisoning is a recoverable error. Always handle it gracefully to ensure system resilience!