HeliosDB Deadlock Prevention Operations Guide
Version: 7.0
Last Updated: December 30, 2025
Status: Production Ready
Target Audience: Database Administrators, Application Developers, Operations Engineers
Table of Contents
- Understanding Deadlocks
- Deadlock Detection and Resolution
- Identifying Deadlocks
- Common Deadlock Patterns
- Prevention Strategies
- Operational Procedures
- Real-World Scenarios
- Troubleshooting
- Performance Considerations
- Integration with Application Frameworks
Understanding Deadlocks
What Is a Deadlock?
A deadlock occurs when two or more transactions are waiting for each other to release locks on resources, creating a circular dependency. None of the transactions can proceed, and they remain indefinitely blocked unless intervention occurs.
Deadlock Conditions (All Four Must Be Present)
- Mutual Exclusion: Resources cannot be shared; only one transaction can hold a lock at a time
- Hold and Wait: Transactions can hold locks while waiting for additional locks
- No Preemption: Locks cannot be forcibly removed from transactions
- Circular Wait: A cycle of transactions exists where each waits for resources held by the next
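The circular-wait condition is the one a detector can test for mechanically: given a map of which transaction waits on which, a cycle in that map is a deadlock. A minimal illustrative sketch in Python (not HeliosDB's implementation):

```python
def find_cycle(wait_for):
    """Return one cycle in a wait-for mapping {waiter: holder}, or None."""
    for start in wait_for:
        seen = []
        node = start
        while node in wait_for:
            if node in seen:
                # Trim the walked path down to the cycle itself
                return seen[seen.index(node):]
            seen.append(node)
            node = wait_for[node]
    return None

# T1 waits on T2, T2 waits on T3, T3 waits on T1: circular wait
print(find_cycle({"T1": "T2", "T2": "T3", "T3": "T1"}))  # -> ['T1', 'T2', 'T3']
print(find_cycle({"T1": "T2", "T2": "T3"}))              # -> None
```

Production detectors do the same walk over a shared graph under a lock, but the core test is exactly this cycle search.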
How Deadlocks Manifest in HeliosDB
HeliosDB uses Multi-Version Concurrency Control (MVCC) with snapshot isolation and row-level locking to minimize the probability of deadlocks, but deadlocks can still occur in these scenarios:
- Concurrent DML Operations: When multiple transactions modify overlapping rows
- Foreign Key Constraints: When cascading updates lock parent/child tables in different orders
- Explicit Locks: When transactions explicitly request locks on multiple resources
- Index Maintenance: When DML operations update both data and index structures
- Distributed Transactions: When transactions span multiple shards or nodes
Example Deadlock Scenario
```
Transaction T1:
  1. Lock Row A in Table users
  2. Attempt to lock Row B in Table orders (waiting...)

Transaction T2:
  1. Lock Row B in Table orders
  2. Attempt to lock Row A in Table users (waiting...)

Result: Both transactions are deadlocked; neither can proceed.
```
Deadlock Detection and Resolution
Automatic Detection Mechanisms
HeliosDB implements multiple deadlock detection strategies:
1. Wait-For Graph Monitoring
HeliosDB maintains an internal wait-for graph tracking which transactions are waiting on which other transactions:
Configuration in heliosdb.toml:

```toml
[transactions]
# Enable deadlock detection
enable_deadlock_detection = true

# Detection interval (milliseconds)
deadlock_check_interval = 100

# Maximum wait time before assuming deadlock (milliseconds)
deadlock_timeout = 5000

# Enable wait-for graph logging (verbose)
log_wait_for_graph = false
```

2. Timeout-Based Detection
When a transaction exceeds the configured lock acquisition timeout, it’s assumed to be deadlocked:
```toml
[transactions]
# Lock acquisition timeout (seconds)
lock_timeout = 30

# Transaction timeout (seconds)
transaction_timeout = 300

# Enable automatic rollback on timeout
auto_rollback_on_timeout = true
```

3. Cycle Detection
The detection system periodically searches for cycles in the wait-for graph:
```rust
// Cycle Detection Algorithm (Conceptual)
// Implemented in HeliosDB's transaction manager
fn detect_deadlock_cycles(wait_for_graph: &WaitForGraph) -> Vec<DeadlockCycle> {
    let mut cycles = Vec::new();

    for node in wait_for_graph.nodes() {
        if let Some(cycle) = find_cycle_from_node(node, wait_for_graph) {
            cycles.push(cycle);
        }
    }

    cycles
}
```

Automatic Resolution Mechanisms
1. Victim Selection Strategy
When a deadlock is detected, HeliosDB automatically selects a victim transaction to rollback:
```toml
[transactions]
# Deadlock victim selection strategy
# Options: "youngest", "fewest_locks", "least_work", "random"
deadlock_victim_strategy = "fewest_locks"

# Enable automatic victim rollback
auto_victim_rollback = true
```

Selection Criteria:
- youngest: Rollback the transaction with the most recent start time (minimal work lost)
- fewest_locks: Rollback the transaction holding the fewest locks (minimize impact)
- least_work: Rollback the transaction with fewest executed statements
- random: Randomly select a victim (not recommended for production)
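The strategies above reduce to a min/max over the transactions in the detected cycle. A small sketch, using illustrative field names (`start_time`, `lock_count`, `stmt_count` are assumptions for this example, not HeliosDB's internal schema):

```python
import random

def pick_victim(cycle, strategy="fewest_locks"):
    """Pick the transaction to roll back from a deadlock cycle.

    Each entry is a dict with 'start_time', 'lock_count', 'stmt_count'
    (illustrative field names only).
    """
    if strategy == "youngest":
        return max(cycle, key=lambda t: t["start_time"])  # most recent start
    if strategy == "fewest_locks":
        return min(cycle, key=lambda t: t["lock_count"])
    if strategy == "least_work":
        return min(cycle, key=lambda t: t["stmt_count"])
    return random.choice(cycle)  # "random": not recommended for production

cycle = [
    {"id": "txn_1", "start_time": 105, "lock_count": 2, "stmt_count": 5},
    {"id": "txn_2", "start_time": 100, "lock_count": 1, "stmt_count": 9},
]
print(pick_victim(cycle)["id"])              # -> txn_2 (fewest locks held)
print(pick_victim(cycle, "youngest")["id"])  # -> txn_1 (most recent start)
```

Note that the strategies can disagree, as here: which one is "right" depends on whether you value minimizing lost work or minimizing lock churn.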
2. Victim Notification and Rollback
```rust
// Application-level handling
match transaction_result {
    Ok(data) => {
        // Transaction succeeded
        println!("Transaction committed successfully");
    }
    Err(TransactionError::Deadlock { victim_id, cycle_size }) => {
        // Application was the deadlock victim
        eprintln!("Transaction deadlocked (victim in {}-txn cycle)", cycle_size);
        eprintln!("Transaction will be automatically rolled back");
        eprintln!("Implement retry logic in application");

        // Retry logic
        thread::sleep(Duration::from_millis(10 * cycle_size as u64));
        // Re-execute transaction
    }
    Err(e) => eprintln!("Transaction failed: {}", e),
}
```

Identifying Deadlocks
Log-Based Identification
Enable Deadlock Logging
```toml
[logging]
# Log all deadlock events
deadlock_log_level = "info"

# Detailed deadlock diagnostics
detailed_deadlock_logs = true

# Log wait-for graph when deadlock detected
log_graph_on_deadlock = true

# Enable transaction history logging
transaction_history_log = true
```

Parsing Deadlock Log Entries
HeliosDB logs deadlock events in this format:
```
[2025-12-30T10:45:23.456Z] DEADLOCK DETECTED
  Deadlock ID: deadlock_20251230_104523_001
  Cycle Size: 3 transactions
  Detection Time: 45ms

  Transaction 1:
    ID: txn_2025_1001
    Start Time: 2025-12-30T10:45:20.123Z
    Duration: 3.333s
    Locks Held: 2 rows (table:users, table:orders)
    Waiting For: 1 row (table:accounts)
    Statements: UPDATE users SET status=?, UPDATE orders SET total=?

  Transaction 2:
    ID: txn_2025_1002
    Start Time: 2025-12-30T10:45:21.456Z
    Duration: 2.000s
    Locks Held: 1 row (table:accounts)
    Waiting For: 1 row (table:users)
    Statements: SELECT * FROM accounts, UPDATE accounts SET balance=?

  Transaction 3:
    ID: txn_2025_1003
    Start Time: 2025-12-30T10:45:22.789Z
    Duration: 0.667s
    Locks Held: 1 row (table:orders)
    Waiting For: 1 row (table:accounts)
    Statements: INSERT INTO orders VALUES (...)

  Victim Selected: txn_2025_1001 (fewest_locks strategy)
  Victim Rolled Back: TRUE
  Victim Retry Available: TRUE
```

Monitoring Queries
1. Monitor Active Transactions and Locks
```sql
-- View all active transactions with lock information
SELECT
    txn_id,
    start_time,
    duration_ms,
    lock_count,
    statement_count,
    wait_for_txn_id,
    priority,
    status
FROM sys.transactions
WHERE status = 'ACTIVE'
ORDER BY duration_ms DESC;
```

2. Identify Blocked Transactions
```sql
-- Find transactions waiting on locks (potential deadlock victims)
SELECT
    t1.txn_id as blocked_txn,
    t1.start_time,
    t1.lock_count as blocked_lock_count,
    t2.txn_id as blocking_txn,
    t2.lock_count as blocking_lock_count,
    COUNT(*) as wait_chain_depth
FROM sys.transactions t1
JOIN sys.transactions t2 ON t1.wait_for_txn_id = t2.txn_id
LEFT JOIN sys.transactions t3 ON t2.wait_for_txn_id = t3.txn_id
WHERE t1.wait_for_txn_id IS NOT NULL
GROUP BY t1.txn_id, t1.start_time, t1.lock_count, t2.txn_id, t2.lock_count
ORDER BY wait_chain_depth DESC;
```

3. Monitor Lock Wait Times
```sql
-- Track average lock wait times by table
SELECT
    table_name,
    COUNT(*) as total_lock_attempts,
    SUM(CASE WHEN wait_time_ms > 0 THEN 1 ELSE 0 END) as waited_attempts,
    AVG(wait_time_ms) as avg_wait_ms,
    MAX(wait_time_ms) as max_wait_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY wait_time_ms) as p95_wait_ms
FROM sys.lock_statistics
GROUP BY table_name
HAVING AVG(wait_time_ms) > 10
ORDER BY avg_wait_ms DESC;
```

4. Detect Deadlock Cycles
```sql
-- View recent deadlock cycles (last 24 hours)
SELECT
    deadlock_id,
    detection_time,
    cycle_size,
    victim_txn_id,
    was_auto_resolved,
    ARRAY_AGG(txn_id) as involved_transactions
FROM sys.deadlock_history
WHERE detection_time > NOW() - INTERVAL '24 hours'
GROUP BY deadlock_id, detection_time, cycle_size, victim_txn_id, was_auto_resolved
ORDER BY detection_time DESC;
```

5. Analyze Deadlock Patterns
```sql
-- Identify tables frequently involved in deadlocks
SELECT
    ARRAY_AGG(DISTINCT table_name) as tables_in_deadlock,
    COUNT(*) as deadlock_frequency,
    AVG(cycle_size) as avg_cycle_size,
    DATE(detection_time) as deadlock_date
FROM sys.deadlock_history dh
JOIN sys.deadlock_involved_tables dit ON dh.deadlock_id = dit.deadlock_id
WHERE detection_time > NOW() - INTERVAL '7 days'
GROUP BY DATE(detection_time)
ORDER BY deadlock_date DESC, deadlock_frequency DESC;
```

Metrics and Dashboards
Key metrics to monitor:
| Metric | Threshold | Alert Level |
|---|---|---|
| Deadlocks/Hour | > 5 | WARNING |
| Deadlocks/Hour | > 20 | CRITICAL |
| Avg Lock Wait Time | > 100ms | WARNING |
| Avg Lock Wait Time | > 500ms | CRITICAL |
| Transaction Rollbacks/Hour | > 10% of total | WARNING |
| Lock Timeout Rate | > 1% | WARNING |
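A monitoring job can translate these thresholds into alert levels mechanically. A sketch with the threshold values taken from the table above (the function names are hypothetical, not part of any HeliosDB tooling):

```python
def deadlock_alert_level(deadlocks_per_hour):
    """Map deadlocks/hour to an alert level, per the thresholds above."""
    if deadlocks_per_hour > 20:
        return "CRITICAL"
    if deadlocks_per_hour > 5:
        return "WARNING"
    return "OK"

def lock_wait_alert_level(avg_wait_ms):
    """Map average lock wait time (ms) to an alert level."""
    if avg_wait_ms > 500:
        return "CRITICAL"
    if avg_wait_ms > 100:
        return "WARNING"
    return "OK"

print(deadlock_alert_level(7))     # -> WARNING
print(lock_wait_alert_level(600))  # -> CRITICAL
```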
Common Deadlock Patterns
Pattern 1: Classic Two-Transaction Deadlock (Most Common)
Scenario: Two transactions lock resources in opposite order
```sql
-- Transaction T1
BEGIN;
UPDATE users SET status = 'active' WHERE id = 100;   -- Lock Row A
-- (T2 locks Row B here)
UPDATE orders SET total = 1000 WHERE user_id = 100;  -- Waiting for Row B
COMMIT;

-- Transaction T2 (concurrent with T1)
BEGIN;
UPDATE orders SET total = 2000 WHERE user_id = 100;  -- Lock Row B
-- (T1 locks Row A here)
UPDATE users SET status = 'pending' WHERE id = 100;  -- Waiting for Row A
COMMIT;

-- DEADLOCK: T1 waits for T2's lock, T2 waits for T1's lock
```

Prevention: Establish consistent lock ordering (see Strategy 1 below)
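The fix generalizes beyond SQL: if every client sorts the locks it needs by one global key before acquiring any of them, the opposite-order interleaving shown above cannot arise. A small demonstration with Python threading locks (illustrative only, not HeliosDB's client API):

```python
import threading

# Global canonical order: locks are always acquired sorted by name
LOCKS = {"orders": threading.Lock(), "users": threading.Lock()}

def run_transaction(needed, work_log, label):
    # Sort by the canonical key so every thread acquires in the same order
    for name in sorted(needed):
        LOCKS[name].acquire()
    try:
        work_log.append(label)  # the "transaction body"
    finally:
        for name in sorted(needed, reverse=True):
            LOCKS[name].release()

log = []
# T1 and T2 declare their locks in opposite order, but both acquire sorted
t1 = threading.Thread(target=run_transaction, args=({"users", "orders"}, log, "T1"))
t2 = threading.Thread(target=run_transaction, args=({"orders", "users"}, log, "T2"))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(log))  # -> ['T1', 'T2'] — both complete, no deadlock
```

Without the `sorted()` calls, the same two threads could each grab one lock and wait forever on the other.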
Pattern 2: Foreign Key Constraint Deadlock
Scenario: Parent-child updates with cascading locks
```sql
-- Transaction T1
BEGIN;
UPDATE parent_table SET status = 'inactive' WHERE id = 50;
-- Cascading lock: all children with parent_id = 50
UPDATE child_table SET status = 'inactive' WHERE parent_id = 50;
COMMIT;

-- Transaction T2 (concurrent)
BEGIN;
UPDATE child_table SET status = 'active' WHERE id = 5000;
-- This child has parent_id = 50 (locked by T1)
UPDATE parent_table SET status = 'active' WHERE id = 50;
-- Waiting for parent lock held by T1
COMMIT;

-- DEADLOCK: T1 locks parent, T2 wants parent
--           T2 locks child,  T1 wants all children
```

Prevention: Disable cascading locks for high-volume child tables
Pattern 3: Multi-Table Deadlock Cycle
Scenario: Three or more transactions in a cycle
```sql
-- Transaction T1
BEGIN;
UPDATE table_a SET value = 1 WHERE id = 1;  -- Lock A
-- Waiting for C...
UPDATE table_c SET value = 1 WHERE id = 1;
COMMIT;

-- Transaction T2
BEGIN;
UPDATE table_b SET value = 1 WHERE id = 1;  -- Lock B
-- Waiting for A...
UPDATE table_a SET value = 1 WHERE id = 1;
COMMIT;

-- Transaction T3
BEGIN;
UPDATE table_c SET value = 1 WHERE id = 1;  -- Lock C
-- Waiting for B...
UPDATE table_b SET value = 1 WHERE id = 1;
COMMIT;

-- DEADLOCK CYCLE: T1->C, T2->A, T3->B (circular)
```

Prevention: Always acquire locks in consistent global order across all transactions
Pattern 4: Index Maintenance Deadlock
Scenario: DML operations and index updates cause deadlocks
```sql
-- When index maintenance is enabled, DML operations acquire:
-- 1. Lock on data row
-- 2. Lock on index entry

-- Transaction T1
BEGIN;
INSERT INTO users (id, email) VALUES (100, 'user1@example.com');
-- Locks: users row 100, index entry for email
-- Waiting for index lock on orders...
UPDATE orders SET user_id = 100 WHERE id = 1;
COMMIT;

-- Transaction T2
BEGIN;
INSERT INTO orders (id, user_id) VALUES (1, 100);
-- Locks: orders row 1, index entry
-- Waiting for users lock...
UPDATE users SET email = 'user2@example.com' WHERE id = 100;
COMMIT;

-- DEADLOCK: Data lock + Index lock combination
```

Prevention: Batch DML operations or separate data/index operations
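The batching advice is easy to illustrate with any SQL driver: one transaction takes and releases its data and index locks once for the whole batch, instead of once per statement. A sketch of the pattern using Python's built-in sqlite3 module (the pattern, not HeliosDB's API):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_log (event TEXT)")

events = [("order_created",), ("payment_captured",), ("order_shipped",)]

# One transaction for the whole batch: locks are acquired and released once,
# shrinking the window in which another writer can interleave
with conn:  # commits on success, rolls back on exception
    conn.executemany("INSERT INTO audit_log (event) VALUES (?)", events)

count = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
print(count)  # -> 3
```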
Pattern 5: Lock Timeout Cascade
Scenario: One deadlock triggers cascading timeouts
```sql
-- Initial deadlock between T1 and T2
-- System selects T1 as victim, rolls back
-- T1 reconnects and retries
-- But now T3, T4, T5 are also waiting for resources
-- All get timeout errors
-- Cascading effect: Multiple "deadlock-like" failures
```

Prevention: Implement exponential backoff in retry logic
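The cascade happens when every rolled-back client retries on the same fixed schedule. Exponential backoff with full jitter spreads retries out so victims don't collide again; a sketch of the delay schedule (the base and cap values are illustrative):

```python
import random

def backoff_schedule(max_retries, base_ms=10, cap_ms=5000):
    """Exponential backoff delays with full jitter, in milliseconds."""
    delays = []
    for attempt in range(max_retries):
        ceiling = min(base_ms * (2 ** attempt), cap_ms)
        # Full jitter: randomize within the window so retries from many
        # victims don't synchronize into a second wave of contention
        delays.append(random.uniform(0, ceiling))
    return delays

schedule = backoff_schedule(5)
print(len(schedule))  # -> 5 delays, each bounded by base_ms * 2^attempt
```

The Rust retry loop in Strategy 3 below applies the same doubling-with-cap idea; the jitter term here is what breaks retry synchronization across many clients.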
Prevention Strategies
Strategy 1: Consistent Lock Ordering (MOST IMPORTANT)
Principle: Always acquire locks in the same order across all transactions
Implementation Example
```sql
-- Define canonical lock order globally
-- Order: users -> orders -> accounts -> products -> inventory

-- Good: Transaction T1 (follows canonical order)
BEGIN;
UPDATE users SET status = 'active' WHERE id = 100;        -- Lock 1st
UPDATE orders SET total = 1000 WHERE user_id = 100;       -- Lock 2nd
UPDATE accounts SET balance = 500 WHERE user_id = 100;    -- Lock 3rd
COMMIT;

-- Good: Transaction T2 (follows same canonical order)
BEGIN;
UPDATE users SET points = 100 WHERE id = 200;             -- Lock 1st
UPDATE orders SET status = 'shipped' WHERE user_id = 200; -- Lock 2nd
UPDATE accounts SET balance = 600 WHERE user_id = 200;    -- Lock 3rd
COMMIT;
-- No deadlock: Both follow identical lock order

-- Bad: Transaction T3 (violates canonical order)
BEGIN;
UPDATE orders SET total = 2000 WHERE user_id = 150;       -- Lock 2nd (wrong!)
UPDATE users SET status = 'pending' WHERE id = 150;       -- Lock 1st (wrong!)
COMMIT;
-- Potential deadlock with T1 or T2
```

Application-Level Implementation
```rust
// Rust application using HeliosDB
struct Transaction {
    id: String,
    tables_to_lock: Vec<String>, // Sorted in canonical order
}

impl Transaction {
    // CRITICAL: Define global lock order
    const LOCK_ORDER: &'static [&'static str] = &[
        "users", "orders", "accounts", "products", "inventory",
    ];

    fn acquire_locks_in_order(&mut self, connection: &mut DbConnection) -> Result<()> {
        // Sort tables according to canonical order
        self.tables_to_lock.sort_by_key(|table| {
            Self::LOCK_ORDER.iter().position(|&t| t == table)
                .unwrap_or(usize::MAX)
        });

        // Acquire locks in sorted order
        for table in &self.tables_to_lock {
            connection.lock_table(table)?;
        }

        Ok(())
    }

    fn execute(mut self, connection: &mut DbConnection) -> Result<()> {
        // Always acquire locks first, in canonical order
        self.acquire_locks_in_order(connection)?;

        // Then perform DML operations
        connection.execute_statement(&self.get_sql())?;

        Ok(())
    }
}
```

Strategy 2: Transaction Design Patterns
Pattern A: Serialized Operations
Reduce concurrency by executing related operations sequentially:
```sql
-- Instead of concurrent transactions competing for same rows:
-- Bad approach: Multiple concurrent transactions updating same user
INSERT INTO audit_log VALUES (...);
INSERT INTO audit_log VALUES (...);
INSERT INTO audit_log VALUES (...);

-- Good approach: Batch operations in single transaction
BEGIN;
INSERT INTO audit_log VALUES (...);
INSERT INTO audit_log VALUES (...);
INSERT INTO audit_log VALUES (...);
COMMIT;
```

Pattern B: Minimal Transaction Scope
Keep transactions as short as possible:
```sql
-- Bad: Long transaction with network I/O
BEGIN;
UPDATE users SET balance = balance - 100 WHERE id = 1;
-- Long delay: Application calls external API for verification
-- During this time, row is locked!
CALL external_api_verification();
UPDATE audit_log SET last_verified = NOW() WHERE user_id = 1;
COMMIT;

-- Good: Separate concerns
-- Part 1: Minimal transaction for data change
BEGIN;
UPDATE users SET balance = balance - 100 WHERE id = 1;
COMMIT;

-- Part 2: External operations without lock
CALL external_api_verification();

-- Part 3: Minimal transaction for audit
BEGIN;
UPDATE audit_log SET last_verified = NOW() WHERE user_id = 1;
COMMIT;
```

Pattern C: Read-Only Transactions
Use read-only transactions to avoid write locks:
```sql
-- Specify transaction as read-only
BEGIN READ ONLY;
SELECT u.*, o.*
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.id = 100;
COMMIT;

-- Read-only transactions never block writes
-- Lower isolation level can be used
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
```

Strategy 3: Using Timeouts Effectively
Global Timeout Configuration
```toml
[transactions]
# Transaction-level timeout (entire transaction must complete)
transaction_timeout = 300  # 5 minutes

# Lock acquisition timeout (specific lock acquisition)
lock_timeout = 30  # 30 seconds

# Idle transaction timeout (auto-cleanup)
idle_transaction_timeout = 900  # 15 minutes

# Enable automatic rollback
auto_rollback_on_timeout = true
```

Application-Level Timeout Handling
```rust
use std::time::Duration;
use std::thread;

fn execute_with_timeout_handling(
    transaction: Transaction,
    max_retries: u32,
) -> Result<TransactionResult> {
    let mut retry_count = 0;
    let mut backoff_ms = 10;

    loop {
        match execute_transaction_with_timeout(
            transaction.clone(),
            Duration::from_secs(30),
        ) {
            Ok(result) => return Ok(result),
            Err(TransactionError::Timeout | TransactionError::Deadlock { .. }) => {
                retry_count += 1;
                if retry_count > max_retries {
                    return Err(TransactionError::MaxRetriesExceeded);
                }

                // Exponential backoff: 10ms, 20ms, 40ms, 80ms, etc.
                println!("Retry {} after {}ms", retry_count, backoff_ms);
                thread::sleep(Duration::from_millis(backoff_ms));
                backoff_ms = (backoff_ms * 2).min(5000); // Cap at 5 seconds
            }
            Err(e) => return Err(e),
        }
    }
}
```

Strategy 4: Isolation Level Considerations
Isolation Levels and Deadlock Risk
| Isolation Level | Deadlock Risk | Performance | Use Case |
|---|---|---|---|
| READ UNCOMMITTED | Very Low | Highest | Non-critical reads |
| READ COMMITTED | Low | High | Most general use |
| REPEATABLE READ | Medium | Medium | Historical consistency |
| SERIALIZABLE | High | Lowest | Financial/critical data |
Selecting Appropriate Isolation Level
```sql
-- For read-heavy workloads: Use lower isolation
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN;
SELECT * FROM users WHERE status = 'active';
-- Non-blocking, fast
COMMIT;

-- For update-heavy workloads: Use appropriate isolation
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
-- Provides consistent view of data during transaction
UPDATE accounts SET balance = balance + 100 WHERE id = 1;
UPDATE audit_log SET last_update = NOW();
COMMIT;

-- For critical operations: Enforce serialization
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN;
-- Guaranteed no concurrent transactions on these tables
UPDATE financial_records SET reconciled = TRUE WHERE month = CURRENT_MONTH;
COMMIT;
```

Operational Procedures
Procedure 1: Enable Deadlock Detection and Logging
Step 1: Update Configuration
```toml
[transactions]
# Core deadlock detection
enable_deadlock_detection = true
deadlock_check_interval = 100              # ms between checks
deadlock_timeout = 5000                    # ms to assume deadlock
lock_timeout = 30                          # seconds for lock acquisition
transaction_timeout = 300                  # seconds for full transaction
auto_rollback_on_timeout = true            # auto-recover from timeouts
deadlock_victim_strategy = "fewest_locks"  # selection strategy

[logging]
# Deadlock logging
deadlock_log_level = "info"
detailed_deadlock_logs = true
log_wait_for_graph = true
transaction_history_log = true
log_file = "/var/log/heliosdb/deadlock.log"
```

Step 2: Restart HeliosDB Service
```bash
# Restart with new configuration
systemctl restart heliosdb

# Verify configuration applied
heliosdb-cli config show --section transactions
```

Procedure 2: Monitor for Deadlocks
Step 2a: Real-Time Monitoring
```bash
# Monitor deadlock events in real-time
tail -f /var/log/heliosdb/deadlock.log | grep "DEADLOCK DETECTED"

# Monitor specific deadlock details
tail -f /var/log/heliosdb/deadlock.log | grep -A 20 "DEADLOCK DETECTED"
```

Step 2b: Automated Alerting
```bash
#!/bin/bash
DEADLOCK_LOG="/var/log/heliosdb/deadlock.log"
ALERT_THRESHOLD=5    # Alert if 5+ deadlocks in 1 hour
CHECK_INTERVAL=300   # Check every 5 minutes

while true; do
    # Count deadlocks logged during the previous hour
    HOUR_PREFIX=$(date -d '-1 hour' '+%Y-%m-%dT%H')
    DEADLOCK_COUNT=$(grep "$HOUR_PREFIX" "$DEADLOCK_LOG" | grep -c "DEADLOCK DETECTED")

    if [ "$DEADLOCK_COUNT" -gt "$ALERT_THRESHOLD" ]; then
        # Send alert to monitoring system
        curl -X POST https://monitoring.example.com/alerts \
            -H "Content-Type: application/json" \
            -d "{\"severity\": \"warning\", \"message\": \"$DEADLOCK_COUNT deadlocks detected in HeliosDB\"}"
    fi

    sleep "$CHECK_INTERVAL"
done
```

Procedure 3: Analyze Deadlock Root Causes
Step 3a: Query Deadlock History
```sql
-- Detailed deadlock analysis query
WITH deadlock_details AS (
    SELECT
        dh.deadlock_id,
        dh.detection_time,
        dh.cycle_size,
        dh.victim_txn_id,
        array_agg(DISTINCT dit.table_name) as affected_tables,
        array_agg(DISTINCT dit.operation_type) as operations
    FROM sys.deadlock_history dh
    LEFT JOIN sys.deadlock_involved_tables dit
        ON dh.deadlock_id = dit.deadlock_id
    WHERE dh.detection_time > NOW() - INTERVAL '1 day'
    GROUP BY dh.deadlock_id, dh.detection_time, dh.cycle_size, dh.victim_txn_id
)
SELECT
    deadlock_id,
    detection_time,
    cycle_size,
    affected_tables,
    operations,
    victim_txn_id,
    (NOW() - detection_time)::interval as time_ago
FROM deadlock_details
ORDER BY detection_time DESC
LIMIT 20;
```

Step 3b: Identify Problematic Transaction Patterns
```sql
-- Find transactions that frequently participate in deadlocks
SELECT
    t.txn_id,
    COUNT(DISTINCT dh.deadlock_id) as deadlock_count,
    array_agg(DISTINCT SUBSTR(t.statement, 1, 50)) as typical_statements,
    COUNT(*) as total_executions,
    ROUND(100.0 * COUNT(DISTINCT dh.deadlock_id) / COUNT(*), 2) as deadlock_rate_pct
FROM sys.transaction_history t
LEFT JOIN sys.deadlock_history dh ON t.txn_id = ANY(dh.involved_txn_ids)
WHERE t.execution_time > NOW() - INTERVAL '7 days'
GROUP BY t.txn_id
HAVING COUNT(DISTINCT dh.deadlock_id) > 0
ORDER BY deadlock_count DESC;
```

Procedure 4: Configure Automated Recovery
Step 4a: Enable Automatic Victim Selection and Rollback
```toml
[transactions]
# Automatic recovery settings
auto_rollback_on_timeout = true
deadlock_victim_strategy = "fewest_locks"
auto_victim_rollback = true

# Retry assistance
provide_retry_guidance = true
max_victim_retry_attempts = 3
victim_retry_backoff_base_ms = 10
```

Step 4b: Application-Level Retry Logic
```python
# Python application example using psycopg2 (PostgreSQL-compatible driver)
import psycopg2
import random
import time

def execute_transaction_with_deadlock_recovery(connection, query, max_retries=3):
    """Execute a transaction with automatic deadlock recovery."""
    for attempt in range(max_retries):
        try:
            with connection.cursor() as cursor:
                cursor.execute(query)
            connection.commit()
            return
        except psycopg2.OperationalError as e:
            connection.rollback()
            if "deadlock" in str(e).lower() and attempt < max_retries - 1:
                # Exponential backoff with jitter
                backoff_ms = (10 * (2 ** attempt)) + random.randint(0, 100)
                wait_seconds = backoff_ms / 1000.0
                print(f"Deadlock detected, retrying in {wait_seconds:.2f}s "
                      f"(attempt {attempt + 1})")
                time.sleep(wait_seconds)
            else:
                raise

    raise Exception(f"Failed to execute transaction after {max_retries} attempts")

# Usage
try:
    execute_transaction_with_deadlock_recovery(
        conn,
        "UPDATE users SET status='active' WHERE id=1; "
        "UPDATE orders SET total=1000 WHERE user_id=1;",
    )
except Exception as e:
    print(f"Transaction ultimately failed: {e}")
```

Procedure 5: Manual Intervention
Step 5a: Identify Stuck Transactions
```sql
-- Find transactions stuck (waiting >30 seconds)
SELECT
    txn_id,
    start_time,
    EXTRACT(EPOCH FROM (NOW() - start_time)) as duration_sec,
    lock_count,
    wait_for_txn_id,
    status
FROM sys.transactions
WHERE status = 'WAITING'
  AND (NOW() - start_time) > INTERVAL '30 seconds'
ORDER BY duration_sec DESC;
```

Step 5b: Force Rollback of Stuck Transaction
```sql
-- CAUTION: Use only when transaction is truly stuck
-- This will lose any uncommitted work
SELECT heliosdb.force_rollback_transaction('txn_2025_1001');

-- Verify rollback
SELECT status FROM sys.transactions WHERE txn_id = 'txn_2025_1001';
-- Result should be: 'ROLLED_BACK'
```

Step 5c: Investigate and Prevent Recurrence
```bash
# Step 1: Capture full details of failed transaction
heliosdb-cli diagnostics export --format json \
    --output deadlock_investigation_$(date +%s).json

# Step 2: Analyze transaction SQL and lock access patterns
jq '.transactions[] | select(.txn_id=="txn_2025_1001")' deadlock_investigation_*.json

# Step 3: Review application code for the failing query
# (Implement consistent lock ordering per Strategy 1)

# Step 4: Test fix in staging environment
# (Run concurrency load test from benchmarks/)

# Step 5: Deploy fix to production
```

Real-World Scenarios
Scenario 1: E-Commerce Order Processing Deadlock
Context: High-volume e-commerce platform with order and inventory management
Problem:
- 10% of orders fail with deadlocks during peak traffic (Black Friday)
- Error: "Transaction deadlocked with 2 other transactions"
- Impact: 5,000+ failed orders/hour, customer complaints

Root Cause Analysis:

```
Order Processing Flow (Multiple Services):
1. Order Service:       Lock order row, lock inventory row
2. Inventory Service:   Lock inventory row, lock order row
3. Fulfillment Service: Lock order row, lock fulfillment row

Lock Ordering Issue:
- Order Service locks:       orders -> inventory (correct)
- Inventory Service locks:   inventory -> orders (WRONG! Reversed)
- Fulfillment Service locks: orders -> fulfillment (correct)

Result: Order Service (lock A->B) deadlocks with Inventory Service (lock B->A)
```

Solution Implemented:
1. Define Global Lock Order:

   ```
   users -> orders -> inventory -> fulfillment -> shipments -> returns
   ```

2. Update All Services:

   ```rust
   // Inventory Service BEFORE (WRONG)
   fn process_inventory() {
       lock_inventory();   // Lock B
       lock_orders();      // Lock A (WRONG ORDER!)
       update_inventory();
   }

   // Inventory Service AFTER (CORRECT)
   fn process_inventory() {
       lock_orders();      // Lock A
       lock_inventory();   // Lock B (CORRECT ORDER!)
       update_inventory();
   }
   ```

3. Add Comprehensive Logging:

   ```sql
   -- Monitor for any reversal of lock order
   CREATE TRIGGER validate_lock_order
   BEFORE UPDATE ON inventory
   BEGIN
       ASSERT (transaction_has_lock_on(orders)),
           "Orders must be locked before inventory!";
   END;
   ```
Results:
- Deadlocks reduced from 5,000/hour to 0
- Peak order throughput increased 15%
- Customer satisfaction score improved
Scenario 2: Financial Transaction Processing in Multi-Region Setup
Context: Banking system with multi-region deployment (US-East, US-West, EU)
Problem:
- Cross-region transfers occasionally deadlock
- Error rate: 0.5% of cross-region transactions
- Involves account tables in different regions

Deadlock Pattern:

```
Transfer: Account A (US-East) -> Account B (EU)

Transaction T1 (US-East Processor):
  1. Lock Account A in US-East [SUCCESS]
  2. Initiate transfer, lock Account B in EU [WAITING...]

Transaction T2 (EU Processor):
  1. Lock Account B in EU [SUCCESS]
  2. Update Account A in US-East (reverse transfer) [WAITING...]

DEADLOCK: T1 waits for T2's EU lock, T2 waits for T1's US-East lock
```

Solution Implemented:
1. Coordinate Lock Ordering Across Regions:

   ```sql
   -- Global transaction coordinator ensures lock order
   -- Lock order: US-East accounts BEFORE EU accounts
   BEGIN;
   -- Step 1: Lock in primary region first
   SELECT pg_advisory_xact_lock(hash('account_us_east_123'));
   UPDATE accounts_us_east SET balance = balance - 100 WHERE id = 123;
   -- Step 2: Lock in secondary region (always after primary)
   SELECT pg_advisory_xact_lock(hash('account_eu_456'));
   UPDATE accounts_eu SET balance = balance + 100 WHERE id = 456;
   COMMIT;
   ```

2. Implement Distributed Transaction Coordinator:

   ```rust
   // Coordinator ensures all transactions follow the same lock order
   struct DistributedTransaction {
       regions: Vec<Region>, // Ordered: [US-East, EU]
   }

   impl DistributedTransaction {
       fn execute(&self) -> Result<()> {
           // Lock regions in order: always US-East FIRST
           for region in &self.regions {
               region.acquire_locks()?;
           }
           for region in &self.regions {
               region.execute_operations()?;
           }
           Ok(())
       }
   }
   ```

3. Add Detection and Metrics:

   ```sql
   -- Monitor cross-region deadlock rate
   SELECT
       source_region,
       target_region,
       COUNT(*) AS total_transfers,
       COUNT(CASE WHEN deadlock THEN 1 END) AS deadlock_count,
       100.0 * COUNT(CASE WHEN deadlock THEN 1 END) / COUNT(*) AS deadlock_rate_pct
   FROM cross_region_transfers
   WHERE transfer_time > NOW() - INTERVAL '1 hour'
   GROUP BY source_region, target_region
   ORDER BY deadlock_rate_pct DESC;
   ```
Results:
- Cross-region deadlock rate reduced to 0.01% (from 0.5%)
- Improved transaction reliability from 99.5% to 99.99%
- Negligible performance impact (<1% latency increase from lock coordination)
Troubleshooting
Issue 1: Frequent Deadlocks (>10/hour)
Diagnostic Steps
```bash
# Step 1: Collect deadlock statistics
psql heliosdb -c "
SELECT
    COUNT(*) as deadlock_count,
    DATE(detection_time) as date,
    EXTRACT(HOUR FROM detection_time) as hour
FROM sys.deadlock_history
WHERE detection_time > NOW() - INTERVAL '3 days'
GROUP BY DATE(detection_time), EXTRACT(HOUR FROM detection_time)
ORDER BY deadlock_count DESC;"

# Step 2: Identify problematic tables
psql heliosdb -c "
SELECT
    table_name,
    COUNT(*) as deadlock_involvement,
    array_agg(DISTINCT operation_type) as operations
FROM sys.deadlock_involved_tables
WHERE detection_time > NOW() - INTERVAL '1 day'
GROUP BY table_name
ORDER BY deadlock_involvement DESC;"

# Step 3: Check for lock ordering issues
psql heliosdb -c "
SELECT
    txn_id,
    statement,
    SUBSTRING(statement FROM 1 FOR 100) as first_statement,
    lock_order
FROM sys.transaction_history
WHERE deadlock_involved = TRUE
ORDER BY txn_id;"
```

Resolution
```toml
# Increase deadlock detection sensitivity
[transactions]
deadlock_check_interval = 50   # Reduce from 100ms to 50ms
deadlock_timeout = 3000        # Reduce from 5000ms to 3000ms
lock_timeout = 20              # Reduce from 30s to 20s

# Adjust victim selection
deadlock_victim_strategy = "fewest_locks"  # Minimizes rollback impact
```

Issue 2: Transaction Timeouts Without Deadlock Messages
Diagnostic
```sql
-- Distinguish between deadlock and timeout
SELECT
    txn_id,
    error_type,
    error_message,
    wait_duration_ms,
    lock_count,
    CASE
        WHEN wait_duration_ms > 30000 THEN 'Likely Timeout'
        WHEN cycle_detected THEN 'Deadlock'
        ELSE 'Other'
    END as diagnosis
FROM sys.transaction_errors
WHERE error_time > NOW() - INTERVAL '1 hour'
ORDER BY error_time DESC;
```

Resolution
```toml
[transactions]
# Increase timeouts if timeouts are premature
lock_timeout = 60          # Increase from 30s
transaction_timeout = 600  # Increase from 300s

# Or decrease if timeouts are intentional (circuit breaker)
lock_timeout = 15
transaction_timeout = 120
```

Issue 3: High Lock Wait Times But No Deadlocks
Diagnostic
```sql
-- Identify bottleneck tables/rows
SELECT
    table_name,
    row_id,
    COUNT(*) as wait_count,
    AVG(wait_time_ms) as avg_wait_ms,
    MAX(wait_time_ms) as max_wait_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY wait_time_ms) as p99_wait
FROM sys.lock_waits
WHERE wait_time_ms > 100
GROUP BY table_name, row_id
ORDER BY wait_count DESC
LIMIT 20;
```

Resolution
```sql
-- 1. Consider partitioning hot table
CREATE TABLE users_new (LIKE users INCLUDING ALL) PARTITION BY RANGE (id);
CREATE TABLE users_new_p1 PARTITION OF users_new FOR VALUES FROM (0) TO (1000000);
CREATE TABLE users_new_p2 PARTITION OF users_new FOR VALUES FROM (1000000) TO (2000000);

-- 2. Or reduce lock scope
-- Instead of: UPDATE users SET ... WHERE status = 'active'
-- Use: UPDATE users SET ... WHERE id IN
--      (SELECT id FROM users WHERE status = 'active' LIMIT 100)

-- 3. Use row-level locking hints (fail fast instead of queueing)
SELECT * FROM users WHERE id = 1 FOR UPDATE NOWAIT;
UPDATE users SET balance = balance + 100 WHERE id = 1;
```

Performance Considerations
Impact of Deadlock Detection
Deadlock detection mechanisms consume resources:
| Detection Mechanism | CPU Impact | Memory Impact | Latency Impact |
|---|---|---|---|
| Wait-for graph tracking | 1-2% | +5MB | <1ms |
| Cycle detection | 2-5% | +10MB | 1-5ms |
| Lock timeout checking | <1% | <1MB | <1ms |
| Automatic logging | 2-3% | +20MB | <5ms (log write) |

Total in "always on" config: ~5-10% CPU, ~35MB memory, <10ms latency

Optimization Strategies
Strategy A: Disable Detailed Logging in High-Performance Paths
```toml
[logging]
# Reduce logging overhead
detailed_deadlock_logs = false   # Only log essential info
log_wait_for_graph = false       # Skip detailed graph logging
transaction_history_log = false  # Only log summary stats

# Keep alert-level logging active
deadlock_log_level = "warning"   # Only log actual deadlocks, not all waits
```

Strategy B: Tune Detection Intervals
```toml
[transactions]
# Balance between responsiveness and overhead
deadlock_check_interval = 200  # Check every 200ms instead of the default 100ms
# Halving the check frequency roughly halves detection CPU overhead,
# at the cost of up to ~100ms of added detection latency

# For high-throughput systems, can increase to 500ms
# For financial systems requiring quick response, keep at 50-100ms
```

Strategy C: Use Sampling for Monitoring
```sql
-- Sample transaction history instead of logging all
CREATE TABLE sys.transaction_history_sampled AS
SELECT * FROM sys.transaction_history
WHERE random() < 0.01           -- Log 1% of transactions
   OR deadlock_involved = TRUE; -- Always log deadlock-involved
```

Performance Metrics by Configuration

| Configuration | Deadlock Detection Latency | CPU Overhead | False Positive Rate |
|---|---|---|---|
| Aggressive (50ms, detailed) | 50-100ms | 8-10% | <0.1% |
| Balanced (200ms, normal) | 200-300ms | 3-5% | <0.1% |
| Conservative (500ms, minimal) | 500-1000ms | 1-2% | 1-2% |
| Disabled | None (manual only) | <0.1% | N/A |

Integration with Application Frameworks
Java/Spring Framework Integration
```java
// Spring Data JPA configuration with deadlock handling
@Configuration
public class TransactionConfig {

    @Bean
    public PlatformTransactionManager transactionManager(EntityManagerFactory emf) {
        JpaTransactionManager tm = new JpaTransactionManager(emf);

        // Configure deadlock retry
        tm.setDefaultTimeout(30); // 30 second timeout

        return tm;
    }

    @Bean
    public TransactionTemplate transactionTemplate(PlatformTransactionManager tm) {
        TransactionTemplate template = new TransactionTemplate(tm);
        template.setIsolationLevel(TransactionDefinition.ISOLATION_READ_COMMITTED);

        return template;
    }
}

// Usage with AOP retry (Spring Retry)
@Service
public class OrderService {

    @Transactional
    @Retryable(
        value = {DeadlockException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 100, multiplier = 2)
    )
    public void processOrder(Order order) {
        // Implementation uses consistent lock ordering
        userRepository.save(order.getUser());          // Lock 1
        orderRepository.save(order);                   // Lock 2
        inventoryRepository.update(order.getItems());  // Lock 3
    }

    @Recover
    public void recover(DeadlockException ex, Order order) {
        logger.error("Order processing failed after retries: {}", order.getId());
        // Fallback logic
    }
}
```

Python/SQLAlchemy Integration
```python
from sqlalchemy import create_engine, event
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import OperationalError
import time
import random


class HeliosDBSessionFactory:
    """SQLAlchemy session factory with deadlock handling"""

    def __init__(self, connection_string, max_retries=3):
        self.engine = create_engine(connection_string)
        self.Session = sessionmaker(bind=self.engine)
        self.max_retries = max_retries

        # Register deadlock event listener
        @event.listens_for(self.engine, "handle_error")
        def deadlock_listener(exception_context):
            if "deadlock" in str(exception_context.original_exception).lower():
                exception_context.invalidate_pool_on_disconnect = True

    def run_with_deadlock_handling(self, work):
        """Run `work(session)` in a transaction, retrying on deadlock.

        A @contextmanager generator may only yield once, so it cannot
        retry a `with` block; instead we re-invoke the callable with a
        fresh session on each attempt.
        """
        for attempt in range(self.max_retries):
            session = self.Session()
            try:
                result = work(session)
                session.commit()
                return result
            except OperationalError as e:
                session.rollback()
                if "deadlock" in str(e).lower() and attempt < self.max_retries - 1:
                    # Exponential backoff with jitter, in milliseconds
                    backoff = (100 * (2 ** attempt)) + random.randint(0, 50)
                    time.sleep(backoff / 1000.0)
                else:
                    raise
            finally:
                session.close()


# Usage
db = HeliosDBSessionFactory("postgresql://user:pass@localhost/heliosdb")


def process_transfer(from_user_id, to_user_id, amount):
    def work(session):
        # Lock in canonical (ascending id) order to avoid deadlocks,
        # regardless of which account is the sender
        low_id, high_id = sorted((from_user_id, to_user_id))
        locked = {
            u.id: u
            for u in (
                session.query(User).filter_by(id=low_id).with_for_update().one(),
                session.query(User).filter_by(id=high_id).with_for_update().one(),
            )
        }
        locked[from_user_id].balance -= amount
        locked[to_user_id].balance += amount

    db.run_with_deadlock_handling(work)
```

Two fixes worth noting: the retry loop takes a callable rather than acting as a context manager (a generator-based context manager cannot yield more than once), and the transfer keys the locked rows by id so that lock-acquisition order (always ascending id) stays independent of transfer direction.

Node.js/Sequelize Integration
```javascript
// Sequelize models with deadlock handling
const Sequelize = require('sequelize');
const sequelize = new Sequelize('heliosdb://user:pass@localhost/heliosdb');

// Custom transaction wrapper with deadlock retry
class DeadlockRetryTransaction {
  constructor(maxRetries = 3) {
    this.maxRetries = maxRetries;
  }

  async execute(asyncFn) {
    for (let attempt = 0; attempt < this.maxRetries; attempt++) {
      try {
        return await sequelize.transaction(
          { isolationLevel: Sequelize.Transaction.ISOLATION_LEVELS.READ_COMMITTED },
          asyncFn
        );
      } catch (error) {
        if (error.message.includes('deadlock') && attempt < this.maxRetries - 1) {
          // Exponential backoff with jitter, in milliseconds
          const backoff = (100 * Math.pow(2, attempt)) + Math.random() * 50;
          await new Promise(resolve => setTimeout(resolve, backoff));
        } else {
          throw error;
        }
      }
    }
  }
}

// Usage
const retryHandler = new DeadlockRetryTransaction(3);

async function processOrder(userId) {
  return retryHandler.execute(async (t) => {
    // Ensure consistent lock ordering: user row first, then order, then inventory
    const user = await User.findByPk(userId, { lock: t.LOCK.UPDATE, transaction: t });
    const order = await Order.create({ userId: user.id }, { transaction: t });
    // Atomic decrement avoids a read-modify-write race on the quantity column
    await Inventory.decrement('quantity', { where: { id: order.itemId }, transaction: t });
    return order;
  });
}
```

Best Practices Summary
Critical Best Practices
1. Always Implement Consistent Lock Ordering (non-negotiable)
   - Define a canonical table order globally
   - Enforce it in all transaction paths
   - Document the lock order in code comments
2. Keep Transactions Short and Focused
   - Lock resources immediately before use
   - Release locks immediately after use
   - Move non-database operations outside transactions
3. Implement Application-Level Retry Logic
   - Use exponential backoff for retries
   - Limit retry attempts (usually 3-5)
   - Log and alert on persistent failures
4. Monitor and Alert on Deadlock Indicators
   - Track the deadlock rate (alert on >5/hour)
   - Monitor lock wait times (alert on >100ms average)
   - Monitor the transaction timeout rate
5. Use Appropriate Isolation Levels
   - Start with READ COMMITTED for most use cases
   - Only use SERIALIZABLE when strictly necessary
   - Document the isolation level choice
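Practice 1 can be enforced mechanically rather than by convention: sort the locks a transaction intends to take by a canonical key before acquiring any of them. A minimal sketch (the table names and the `(table, row_id)` request shape are illustrative, not a HeliosDB API):

```python
# Canonical global table order; every transaction locks in this order.
CANONICAL_TABLE_ORDER = ["users", "orders", "inventory"]  # illustrative


def ordered_lock_plan(requests):
    """Sort (table, row_id) lock requests into canonical order.

    If every transaction acquires overlapping locks in the same
    sequence, no cycle of waits can form, which removes the
    circular-wait condition required for a deadlock.
    """
    rank = {t: i for i, t in enumerate(CANONICAL_TABLE_ORDER)}
    return sorted(requests, key=lambda r: (rank[r[0]], r[1]))
```

A transaction that needs rows from `orders` and `users` in any order would pass both requests through `ordered_lock_plan` and acquire them in the returned sequence.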
Configuration Summary
```toml
[transactions]
# Detection
enable_deadlock_detection = true
deadlock_check_interval = 100
deadlock_timeout = 5000
lock_timeout = 30
transaction_timeout = 300

# Recovery
auto_rollback_on_timeout = true
deadlock_victim_strategy = "fewest_locks"
auto_victim_rollback = true

[logging]
deadlock_log_level = "info"
detailed_deadlock_logs = true
log_wait_for_graph = true
transaction_history_log = true
```

Conclusion
Deadlock prevention in HeliosDB relies on:
- Consistent lock ordering - The single most important factor
- Transaction design - Keeping transactions short and focused
- Monitoring - Early detection of deadlock conditions
- Application resilience - Implementing retry logic at application level
- Appropriate configuration - Tuning timeouts and detection parameters
By following the strategies and procedures outlined in this guide, you can eliminate the vast majority of deadlocks and ensure reliable, high-throughput transaction processing in HeliosDB.
Additional Resources
- HeliosDB Transaction Documentation: docs/features/transactions/
- MVCC Implementation: docs/features/mvcc/
- Lock Management: docs/architecture/locking.md
- Monitoring and Diagnostics: docs/guides/user/AUTONOMOUS_DATABASE_OPERATIONS_GUIDE.md
- Performance Tuning: docs/guides/user/CONFIGURATION_GUIDE.md
Questions or Issues? Contact the HeliosDB support team or consult the project’s GitHub repository.