HeliosDB Deadlock Prevention Operations Guide
Version: 7.0
Last Updated: December 30, 2025
Status: Production Ready
Target Audience: Database Administrators, Application Developers, Operations Engineers
Table of Contents
- Understanding Deadlocks
- Deadlock Detection and Resolution
- Identifying Deadlocks
- Common Deadlock Patterns
- Prevention Strategies
- Operational Procedures
- Real-World Scenarios
- Troubleshooting
- Performance Considerations
- Integration with Application Frameworks
Understanding Deadlocks
What Is a Deadlock?
A deadlock occurs when two or more transactions are waiting for each other to release locks on resources, creating a circular dependency. None of the transactions can proceed, and they remain indefinitely blocked unless intervention occurs.
Deadlock Conditions (All Four Must Be Present)
- Mutual Exclusion: Resources cannot be shared; only one transaction can hold a lock at a time
- Hold and Wait: Transactions can hold locks while waiting for additional locks
- No Preemption: Locks cannot be forcibly removed from transactions
- Circular Wait: A cycle of transactions exists where each waits for resources held by the next
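The circular-wait condition is the one a detector can test for mechanically: given a map of which transaction waits on which, a cycle in that map is a deadlock. A minimal illustrative sketch in Python (not HeliosDB's implementation):

```python
def find_cycle(wait_for):
    """Return one cycle in a wait-for mapping {waiter: holder}, or None."""
    for start in wait_for:
        seen = []
        node = start
        while node in wait_for:
            if node in seen:
                # Trim the walked path down to the cycle itself
                return seen[seen.index(node):]
            seen.append(node)
            node = wait_for[node]
    return None

# T1 waits on T2, T2 waits on T3, T3 waits on T1: circular wait
print(find_cycle({"T1": "T2", "T2": "T3", "T3": "T1"}))  # -> ['T1', 'T2', 'T3']
print(find_cycle({"T1": "T2", "T2": "T3"}))              # -> None
```

Production detectors do the same walk over a shared graph under a lock, but the core test is exactly this cycle search.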
How Deadlocks Manifest in HeliosDB
HeliosDB uses Multi-Version Concurrency Control (MVCC) with snapshot isolation and row-level locking to minimize the probability of deadlocks, but deadlocks can still occur in these scenarios:
- Concurrent DML Operations: When multiple transactions modify overlapping rows
- Foreign Key Constraints: When cascading updates lock parent/child tables in different orders
- Explicit Locks: When transactions explicitly request locks on multiple resources
- Index Maintenance: When DML operations update both data and index structures
- Distributed Transactions: When transactions span multiple shards or nodes
Example Deadlock Scenario
```
Transaction T1:
  1. Lock Row A in Table users
  2. Attempt to lock Row B in Table orders (waiting...)

Transaction T2:
  1. Lock Row B in Table orders
  2. Attempt to lock Row A in Table users (waiting...)

Result: Both transactions are deadlocked; neither can proceed.
```
Deadlock Detection and Resolution
Automatic Detection Mechanisms
HeliosDB implements multiple deadlock detection strategies:
1. Wait-For Graph Monitoring
HeliosDB maintains an internal wait-for graph tracking which transactions are waiting on which other transactions:
Configuration in heliosdb.toml:

```toml
[transactions]
# Enable deadlock detection
enable_deadlock_detection = true

# Detection interval (milliseconds)
deadlock_check_interval = 100

# Maximum wait time before assuming deadlock (milliseconds)
deadlock_timeout = 5000

# Enable wait-for graph logging (verbose)
log_wait_for_graph = false
```

2. Timeout-Based Detection
When a transaction exceeds the configured lock acquisition timeout, it’s assumed to be deadlocked:
```toml
[transactions]
# Lock acquisition timeout (seconds)
lock_timeout = 30

# Transaction timeout (seconds)
transaction_timeout = 300

# Enable automatic rollback on timeout
auto_rollback_on_timeout = true
```

3. Cycle Detection
The detection system periodically searches for cycles in the wait-for graph:
```rust
// Cycle Detection Algorithm (Conceptual)
// Implemented in HeliosDB's transaction manager
fn detect_deadlock_cycles(wait_for_graph: &WaitForGraph) -> Vec<DeadlockCycle> {
    let mut cycles = Vec::new();

    for node in wait_for_graph.nodes() {
        if let Some(cycle) = find_cycle_from_node(node, wait_for_graph) {
            cycles.push(cycle);
        }
    }

    cycles
}
```

Automatic Resolution Mechanisms
1. Victim Selection Strategy
When a deadlock is detected, HeliosDB automatically selects a victim transaction to rollback:
```toml
[transactions]
# Deadlock victim selection strategy
# Options: "youngest", "fewest_locks", "least_work", "random"
deadlock_victim_strategy = "fewest_locks"

# Enable automatic victim rollback
auto_victim_rollback = true
```

Selection Criteria:
- youngest: Rollback the transaction with the most recent start time (minimal work lost)
- fewest_locks: Rollback the transaction holding the fewest locks (minimize impact)
- least_work: Rollback the transaction with fewest executed statements
- random: Randomly select a victim (not recommended for production)
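The strategies above reduce to a min/max over the transactions in the detected cycle. A small sketch, using illustrative field names (`start_time`, `lock_count`, `stmt_count` are assumptions for this example, not HeliosDB's internal schema):

```python
import random

def pick_victim(cycle, strategy="fewest_locks"):
    """Pick the transaction to roll back from a deadlock cycle.

    Each entry is a dict with 'start_time', 'lock_count', 'stmt_count'
    (illustrative field names only).
    """
    if strategy == "youngest":
        return max(cycle, key=lambda t: t["start_time"])  # most recent start
    if strategy == "fewest_locks":
        return min(cycle, key=lambda t: t["lock_count"])
    if strategy == "least_work":
        return min(cycle, key=lambda t: t["stmt_count"])
    return random.choice(cycle)  # "random": not recommended for production

cycle = [
    {"id": "txn_1", "start_time": 105, "lock_count": 2, "stmt_count": 5},
    {"id": "txn_2", "start_time": 100, "lock_count": 1, "stmt_count": 9},
]
print(pick_victim(cycle)["id"])              # -> txn_2 (fewest locks held)
print(pick_victim(cycle, "youngest")["id"])  # -> txn_1 (most recent start)
```

Note that the strategies can disagree, as here: which one is "right" depends on whether you value minimizing lost work or minimizing lock churn.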
2. Victim Notification and Rollback
```rust
// Application-level handling
match transaction_result {
    Ok(data) => {
        // Transaction succeeded
        println!("Transaction committed successfully");
    }
    Err(TransactionError::Deadlock { victim_id, cycle_size }) => {
        // Application was the deadlock victim
        eprintln!("Transaction deadlocked (victim in {}-txn cycle)", cycle_size);
        eprintln!("Transaction will be automatically rolled back");
        eprintln!("Implement retry logic in application");

        // Retry logic
        thread::sleep(Duration::from_millis(10 * cycle_size as u64));
        // Re-execute transaction
    }
    Err(e) => eprintln!("Transaction failed: {}", e),
}
```

Identifying Deadlocks
Log-Based Identification
Enable Deadlock Logging
```toml
[logging]
# Log all deadlock events
deadlock_log_level = "info"

# Detailed deadlock diagnostics
detailed_deadlock_logs = true

# Log wait-for graph when deadlock detected
log_graph_on_deadlock = true

# Enable transaction history logging
transaction_history_log = true
```

Parsing Deadlock Log Entries
HeliosDB logs deadlock events in this format:
```
[2025-12-30T10:45:23.456Z] DEADLOCK DETECTED
  Deadlock ID: deadlock_20251230_104523_001
  Cycle Size: 3 transactions
  Detection Time: 45ms

  Transaction 1:
    ID: txn_2025_1001
    Start Time: 2025-12-30T10:45:20.123Z
    Duration: 3.333s
    Locks Held: 2 rows (table:users, table:orders)
    Waiting For: 1 row (table:accounts)
    Statements: UPDATE users SET status=?, UPDATE orders SET total=?

  Transaction 2:
    ID: txn_2025_1002
    Start Time: 2025-12-30T10:45:21.456Z
    Duration: 2.000s
    Locks Held: 1 row (table:accounts)
    Waiting For: 1 row (table:users)
    Statements: SELECT * FROM accounts, UPDATE accounts SET balance=?

  Transaction 3:
    ID: txn_2025_1003
    Start Time: 2025-12-30T10:45:22.789Z
    Duration: 0.667s
    Locks Held: 1 row (table:orders)
    Waiting For: 1 row (table:accounts)
    Statements: INSERT INTO orders VALUES (...)

  Victim Selected: txn_2025_1001 (fewest_locks strategy)
  Victim Rolled Back: TRUE
  Victim Retry Available: TRUE
```

Monitoring Queries
1. Monitor Active Transactions and Locks
```sql
-- View all active transactions with lock information
SELECT
    txn_id,
    start_time,
    duration_ms,
    lock_count,
    statement_count,
    wait_for_txn_id,
    priority,
    status
FROM sys.transactions
WHERE status = 'ACTIVE'
ORDER BY duration_ms DESC;
```

2. Identify Blocked Transactions
```sql
-- Find transactions waiting on locks (potential deadlock victims)
SELECT
    t1.txn_id as blocked_txn,
    t1.start_time,
    t1.lock_count as blocked_lock_count,
    t2.txn_id as blocking_txn,
    t2.lock_count as blocking_lock_count,
    COUNT(*) as wait_chain_depth
FROM sys.transactions t1
JOIN sys.transactions t2 ON t1.wait_for_txn_id = t2.txn_id
LEFT JOIN sys.transactions t3 ON t2.wait_for_txn_id = t3.txn_id
WHERE t1.wait_for_txn_id IS NOT NULL
GROUP BY t1.txn_id, t1.start_time, t1.lock_count, t2.txn_id, t2.lock_count
ORDER BY wait_chain_depth DESC;
```

3. Monitor Lock Wait Times
```sql
-- Track average lock wait times by table
SELECT
    table_name,
    COUNT(*) as total_lock_attempts,
    SUM(CASE WHEN wait_time_ms > 0 THEN 1 ELSE 0 END) as waited_attempts,
    AVG(wait_time_ms) as avg_wait_ms,
    MAX(wait_time_ms) as max_wait_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY wait_time_ms) as p95_wait_ms
FROM sys.lock_statistics
GROUP BY table_name
HAVING AVG(wait_time_ms) > 10
ORDER BY avg_wait_ms DESC;
```

4. Detect Deadlock Cycles
```sql
-- View recent deadlock cycles (last 24 hours)
SELECT
    deadlock_id,
    detection_time,
    cycle_size,
    victim_txn_id,
    was_auto_resolved,
    ARRAY_AGG(txn_id) as involved_transactions
FROM sys.deadlock_history
WHERE detection_time > NOW() - INTERVAL '24 hours'
GROUP BY deadlock_id, detection_time, cycle_size, victim_txn_id, was_auto_resolved
ORDER BY detection_time DESC;
```

5. Analyze Deadlock Patterns
```sql
-- Identify tables frequently involved in deadlocks
SELECT
    ARRAY_AGG(DISTINCT table_name) as tables_in_deadlock,
    COUNT(*) as deadlock_frequency,
    AVG(cycle_size) as avg_cycle_size,
    DATE(detection_time) as deadlock_date
FROM sys.deadlock_history dh
JOIN sys.deadlock_involved_tables dit ON dh.deadlock_id = dit.deadlock_id
WHERE detection_time > NOW() - INTERVAL '7 days'
GROUP BY DATE(detection_time)
ORDER BY deadlock_date DESC, deadlock_frequency DESC;
```

Metrics and Dashboards
Key metrics to monitor:
| Metric | Threshold | Alert Level |
|---|---|---|
| Deadlocks/Hour | > 5 | WARNING |
| Deadlocks/Hour | > 20 | CRITICAL |
| Avg Lock Wait Time | > 100ms | WARNING |
| Avg Lock Wait Time | > 500ms | CRITICAL |
| Transaction Rollbacks/Hour | > 10% of total | WARNING |
| Lock Timeout Rate | > 1% | WARNING |
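A monitoring job can translate these thresholds into alert levels mechanically. A sketch with the threshold values taken from the table above (the function names are hypothetical, not part of any HeliosDB tooling):

```python
def deadlock_alert_level(deadlocks_per_hour):
    """Map deadlocks/hour to an alert level, per the thresholds above."""
    if deadlocks_per_hour > 20:
        return "CRITICAL"
    if deadlocks_per_hour > 5:
        return "WARNING"
    return "OK"

def lock_wait_alert_level(avg_wait_ms):
    """Map average lock wait time (ms) to an alert level."""
    if avg_wait_ms > 500:
        return "CRITICAL"
    if avg_wait_ms > 100:
        return "WARNING"
    return "OK"

print(deadlock_alert_level(7))     # -> WARNING
print(lock_wait_alert_level(600))  # -> CRITICAL
```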
Common Deadlock Patterns
Pattern 1: Classic Two-Transaction Deadlock (Most Common)
Scenario: Two transactions lock resources in opposite order
```sql
-- Transaction T1
BEGIN;
UPDATE users SET status = 'active' WHERE id = 100;   -- Lock Row A
-- (T2 locks Row B here)
UPDATE orders SET total = 1000 WHERE user_id = 100;  -- Waiting for Row B
COMMIT;

-- Transaction T2 (concurrent with T1)
BEGIN;
UPDATE orders SET total = 2000 WHERE user_id = 100;  -- Lock Row B
-- (T1 locks Row A here)
UPDATE users SET status = 'pending' WHERE id = 100;  -- Waiting for Row A
COMMIT;

-- DEADLOCK: T1 waits for T2's lock, T2 waits for T1's lock
```

Prevention: Establish consistent lock ordering (see Strategy 1 below)
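The fix generalizes beyond SQL: if every client sorts the locks it needs by one global key before acquiring any of them, the opposite-order interleaving shown above cannot arise. A small demonstration with Python threading locks (illustrative only, not HeliosDB's client API):

```python
import threading

# Global canonical order: locks are always acquired sorted by name
LOCKS = {"orders": threading.Lock(), "users": threading.Lock()}

def run_transaction(needed, work_log, label):
    # Sort by the canonical key so every thread acquires in the same order
    for name in sorted(needed):
        LOCKS[name].acquire()
    try:
        work_log.append(label)  # the "transaction body"
    finally:
        for name in sorted(needed, reverse=True):
            LOCKS[name].release()

log = []
# T1 and T2 declare their locks in opposite order, but both acquire sorted
t1 = threading.Thread(target=run_transaction, args=({"users", "orders"}, log, "T1"))
t2 = threading.Thread(target=run_transaction, args=({"orders", "users"}, log, "T2"))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(log))  # -> ['T1', 'T2'] — both complete, no deadlock
```

Without the `sorted()` calls, the same two threads could each grab one lock and wait forever on the other.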
Pattern 2: Foreign Key Constraint Deadlock
Scenario: Parent-child updates with cascading locks
```sql
-- Transaction T1
BEGIN;
UPDATE parent_table SET status = 'inactive' WHERE id = 50;
-- Cascading lock: all children with parent_id = 50
UPDATE child_table SET status = 'inactive' WHERE parent_id = 50;
COMMIT;

-- Transaction T2 (concurrent)
BEGIN;
UPDATE child_table SET status = 'active' WHERE id = 5000;
-- This child has parent_id = 50 (locked by T1)
UPDATE parent_table SET status = 'active' WHERE id = 50;
-- Waiting for parent lock held by T1
COMMIT;

-- DEADLOCK: T1 locks parent, T2 wants parent
--           T2 locks child,  T1 wants all children
```

Prevention: Disable cascading locks for high-volume child tables
Pattern 3: Multi-Table Deadlock Cycle
Scenario: Three or more transactions in a cycle
```sql
-- Transaction T1
BEGIN;
UPDATE table_a SET value = 1 WHERE id = 1;  -- Lock A
-- Waiting for C...
UPDATE table_c SET value = 1 WHERE id = 1;
COMMIT;

-- Transaction T2
BEGIN;
UPDATE table_b SET value = 1 WHERE id = 1;  -- Lock B
-- Waiting for A...
UPDATE table_a SET value = 1 WHERE id = 1;
COMMIT;

-- Transaction T3
BEGIN;
UPDATE table_c SET value = 1 WHERE id = 1;  -- Lock C
-- Waiting for B...
UPDATE table_b SET value = 1 WHERE id = 1;
COMMIT;

-- DEADLOCK CYCLE: T1->C, T2->A, T3->B (circular)
```

Prevention: Always acquire locks in consistent global order across all transactions
Pattern 4: Index Maintenance Deadlock
Scenario: DML operations and index updates cause deadlocks
```sql
-- When index maintenance is enabled, DML operations acquire:
-- 1. Lock on data row
-- 2. Lock on index entry

-- Transaction T1
BEGIN;
INSERT INTO users (id, email) VALUES (100, 'user1@example.com');
-- Locks: users row 100, index entry for email
-- Waiting for index lock on orders...
UPDATE orders SET user_id = 100 WHERE id = 1;
COMMIT;

-- Transaction T2
BEGIN;
INSERT INTO orders (id, user_id) VALUES (1, 100);
-- Locks: orders row 1, index entry
-- Waiting for users lock...
UPDATE users SET email = 'user2@example.com' WHERE id = 100;
COMMIT;

-- DEADLOCK: Data lock + Index lock combination
```

Prevention: Batch DML operations or separate data/index operations
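The batching advice is easy to illustrate with any SQL driver: one transaction takes and releases its data and index locks once for the whole batch, instead of once per statement. A sketch of the pattern using Python's built-in sqlite3 module (the pattern, not HeliosDB's API):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_log (event TEXT)")

events = [("order_created",), ("payment_captured",), ("order_shipped",)]

# One transaction for the whole batch: locks are acquired and released once,
# shrinking the window in which another writer can interleave
with conn:  # commits on success, rolls back on exception
    conn.executemany("INSERT INTO audit_log (event) VALUES (?)", events)

count = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
print(count)  # -> 3
```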
Pattern 5: Lock Timeout Cascade
Scenario: One deadlock triggers cascading timeouts
```sql
-- Initial deadlock between T1 and T2
-- System selects T1 as victim, rolls back
-- T1 reconnects and retries
-- But now T3, T4, T5 are also waiting for resources
-- All get timeout errors
-- Cascading effect: Multiple "deadlock-like" failures
```

Prevention: Implement exponential backoff in retry logic
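The cascade happens when every rolled-back client retries on the same fixed schedule. Exponential backoff with full jitter spreads retries out so victims don't collide again; a sketch of the delay schedule (the base and cap values are illustrative):

```python
import random

def backoff_schedule(max_retries, base_ms=10, cap_ms=5000):
    """Exponential backoff delays with full jitter, in milliseconds."""
    delays = []
    for attempt in range(max_retries):
        ceiling = min(base_ms * (2 ** attempt), cap_ms)
        # Full jitter: randomize within the window so retries from many
        # victims don't synchronize into a second wave of contention
        delays.append(random.uniform(0, ceiling))
    return delays

schedule = backoff_schedule(5)
print(len(schedule))  # -> 5 delays, each bounded by base_ms * 2^attempt
```

The Rust retry loop in Strategy 3 below applies the same doubling-with-cap idea; the jitter term here is what breaks retry synchronization across many clients.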
Prevention Strategies
Strategy 1: Consistent Lock Ordering (MOST IMPORTANT)
Principle: Always acquire locks in the same order across all transactions
Implementation Example
```sql
-- Define canonical lock order globally
-- Order: users -> orders -> accounts -> products -> inventory

-- Good: Transaction T1 (follows canonical order)
BEGIN;
UPDATE users SET status = 'active' WHERE id = 100;        -- Lock 1st
UPDATE orders SET total = 1000 WHERE user_id = 100;       -- Lock 2nd
UPDATE accounts SET balance = 500 WHERE user_id = 100;    -- Lock 3rd
COMMIT;

-- Good: Transaction T2 (follows same canonical order)
BEGIN;
UPDATE users SET points = 100 WHERE id = 200;             -- Lock 1st
UPDATE orders SET status = 'shipped' WHERE user_id = 200; -- Lock 2nd
UPDATE accounts SET balance = 600 WHERE user_id = 200;    -- Lock 3rd
COMMIT;
-- No deadlock: Both follow identical lock order

-- Bad: Transaction T3 (violates canonical order)
BEGIN;
UPDATE orders SET total = 2000 WHERE user_id = 150;       -- Lock 2nd (wrong!)
UPDATE users SET status = 'pending' WHERE id = 150;       -- Lock 1st (wrong!)
COMMIT;
-- Potential deadlock with T1 or T2
```

Application-Level Implementation
```rust
// Rust application using HeliosDB
struct Transaction {
    id: String,
    tables_to_lock: Vec<String>, // Sorted in canonical order
}

impl Transaction {
    // CRITICAL: Define global lock order
    const LOCK_ORDER: &'static [&'static str] = &[
        "users", "orders", "accounts", "products", "inventory",
    ];

    fn acquire_locks_in_order(&mut self, connection: &mut DbConnection) -> Result<()> {
        // Sort tables according to canonical order
        self.tables_to_lock.sort_by_key(|table| {
            Self::LOCK_ORDER.iter().position(|&t| t == table)
                .unwrap_or(usize::MAX)
        });

        // Acquire locks in sorted order
        for table in &self.tables_to_lock {
            connection.lock_table(table)?;
        }

        Ok(())
    }

    fn execute(mut self, connection: &mut DbConnection) -> Result<()> {
        // Always acquire locks first, in canonical order
        self.acquire_locks_in_order(connection)?;

        // Then perform DML operations
        connection.execute_statement(&self.get_sql())?;

        Ok(())
    }
}
```

Strategy 2: Transaction Design Patterns
Pattern A: Serialized Operations
Reduce concurrency by executing related operations sequentially:
```sql
-- Instead of concurrent transactions competing for same rows:
-- Bad approach: Multiple concurrent transactions updating same user
INSERT INTO audit_log VALUES (...);
INSERT INTO audit_log VALUES (...);
INSERT INTO audit_log VALUES (...);

-- Good approach: Batch operations in single transaction
BEGIN;
INSERT INTO audit_log VALUES (...);
INSERT INTO audit_log VALUES (...);
INSERT INTO audit_log VALUES (...);
COMMIT;
```

Pattern B: Minimal Transaction Scope
Keep transactions as short as possible:
```sql
-- Bad: Long transaction with network I/O
BEGIN;
UPDATE users SET balance = balance - 100 WHERE id = 1;
-- Long delay: Application calls external API for verification
-- During this time, row is locked!
CALL external_api_verification();
UPDATE audit_log SET last_verified = NOW() WHERE user_id = 1;
COMMIT;

-- Good: Separate concerns
-- Part 1: Minimal transaction for data change
BEGIN;
UPDATE users SET balance = balance - 100 WHERE id = 1;
COMMIT;

-- Part 2: External operations without lock
CALL external_api_verification();

-- Part 3: Minimal transaction for audit
BEGIN;
UPDATE audit_log SET last_verified = NOW() WHERE user_id = 1;
COMMIT;
```

Pattern C: Read-Only Transactions
Use read-only transactions to avoid write locks:
```sql
-- Specify transaction as read-only
BEGIN READ ONLY;
SELECT u.*, o.*
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.id = 100;
COMMIT;

-- Read-only transactions never block writes
-- Lower isolation level can be used
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
```

Strategy 3: Using Timeouts Effectively
Global Timeout Configuration
```toml
[transactions]
# Transaction-level timeout (entire transaction must complete)
transaction_timeout = 300  # 5 minutes

# Lock acquisition timeout (specific lock acquisition)
lock_timeout = 30  # 30 seconds

# Idle transaction timeout (auto-cleanup)
idle_transaction_timeout = 900  # 15 minutes

# Enable automatic rollback
auto_rollback_on_timeout = true
```

Application-Level Timeout Handling
```rust
use std::time::Duration;
use std::thread;

fn execute_with_timeout_handling(
    transaction: Transaction,
    max_retries: u32,
) -> Result<TransactionResult> {
    let mut retry_count = 0;
    let mut backoff_ms = 10;

    loop {
        match execute_transaction_with_timeout(
            transaction.clone(),
            Duration::from_secs(30),
        ) {
            Ok(result) => return Ok(result),
            Err(TransactionError::Timeout | TransactionError::Deadlock { .. }) => {
                retry_count += 1;
                if retry_count > max_retries {
                    return Err(TransactionError::MaxRetriesExceeded);
                }

                // Exponential backoff: 10ms, 20ms, 40ms, 80ms, etc.
                println!("Retry {} after {}ms", retry_count, backoff_ms);
                thread::sleep(Duration::from_millis(backoff_ms));
                backoff_ms = (backoff_ms * 2).min(5000); // Cap at 5 seconds
            }
            Err(e) => return Err(e),
        }
    }
}
```

Strategy 4: Isolation Level Considerations
Isolation Levels and Deadlock Risk
| Isolation Level | Deadlock Risk | Performance | Use Case |
|---|---|---|---|
| READ UNCOMMITTED | Very Low | Highest | Non-critical reads |
| READ COMMITTED | Low | High | Most general use |
| REPEATABLE READ | Medium | Medium | Historical consistency |
| SERIALIZABLE | High | Lowest | Financial/critical data |
Selecting Appropriate Isolation Level
```sql
-- For read-heavy workloads: Use lower isolation
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN;
SELECT * FROM users WHERE status = 'active';
-- Non-blocking, fast
COMMIT;

-- For update-heavy workloads: Use appropriate isolation
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
-- Provides consistent view of data during transaction
UPDATE accounts SET balance = balance + 100 WHERE id = 1;
UPDATE audit_log SET last_update = NOW();
COMMIT;

-- For critical operations: Enforce serialization
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN;
-- Guaranteed no concurrent transactions on these tables
UPDATE financial_records SET reconciled = TRUE WHERE month = CURRENT_MONTH;
COMMIT;
```

Operational Procedures
Procedure 1: Enable Deadlock Detection and Logging
Step 1: Update Configuration
```toml
[transactions]
# Core deadlock detection
enable_deadlock_detection = true
deadlock_check_interval = 100              # ms between checks
deadlock_timeout = 5000                    # ms to assume deadlock
lock_timeout = 30                          # seconds for lock acquisition
transaction_timeout = 300                  # seconds for full transaction
auto_rollback_on_timeout = true            # auto-recover from timeouts
deadlock_victim_strategy = "fewest_locks"  # selection strategy

[logging]
# Deadlock logging
deadlock_log_level = "info"
detailed_deadlock_logs = true
log_wait_for_graph = true
transaction_history_log = true
log_file = "/var/log/heliosdb/deadlock.log"
```

Step 2: Restart HeliosDB Service
```bash
# Restart with new configuration
systemctl restart heliosdb

# Verify configuration applied
heliosdb-cli config show --section transactions
```

Procedure 2: Monitor for Deadlocks
Step 2a: Real-Time Monitoring
```bash
# Monitor deadlock events in real-time
tail -f /var/log/heliosdb/deadlock.log | grep "DEADLOCK DETECTED"

# Monitor specific deadlock details
tail -f /var/log/heliosdb/deadlock.log | grep -A 20 "DEADLOCK DETECTED"
```

Step 2b: Automated Alerting
```bash
#!/bin/bash
DEADLOCK_LOG="/var/log/heliosdb/deadlock.log"
ALERT_THRESHOLD=5    # Alert if 5+ deadlocks in 1 hour
CHECK_INTERVAL=300   # Check every 5 minutes

while true; do
    # Count deadlocks logged during the previous hour
    HOUR_PREFIX=$(date -d '-1 hour' '+%Y-%m-%dT%H')
    DEADLOCK_COUNT=$(grep "$HOUR_PREFIX" "$DEADLOCK_LOG" | grep -c "DEADLOCK DETECTED")

    if [ "$DEADLOCK_COUNT" -gt "$ALERT_THRESHOLD" ]; then
        # Send alert to monitoring system
        curl -X POST https://monitoring.example.com/alerts \
            -H "Content-Type: application/json" \
            -d "{\"severity\": \"warning\", \"message\": \"$DEADLOCK_COUNT deadlocks detected in HeliosDB\"}"
    fi

    sleep "$CHECK_INTERVAL"
done
```

Procedure 3: Analyze Deadlock Root Causes
Step 3a: Query Deadlock History
```sql
-- Detailed deadlock analysis query
WITH deadlock_details AS (
    SELECT
        dh.deadlock_id,
        dh.detection_time,
        dh.cycle_size,
        dh.victim_txn_id,
        array_agg(DISTINCT dit.table_name) as affected_tables,
        array_agg(DISTINCT dit.operation_type) as operations
    FROM sys.deadlock_history dh
    LEFT JOIN sys.deadlock_involved_tables dit
        ON dh.deadlock_id = dit.deadlock_id
    WHERE dh.detection_time > NOW() - INTERVAL '1 day'
    GROUP BY dh.deadlock_id, dh.detection_time, dh.cycle_size, dh.victim_txn_id
)
SELECT
    deadlock_id,
    detection_time,
    cycle_size,
    affected_tables,
    operations,
    victim_txn_id,
    (NOW() - detection_time)::interval as time_ago
FROM deadlock_details
ORDER BY detection_time DESC
LIMIT 20;
```

Step 3b: Identify Problematic Transaction Patterns
```sql
-- Find transactions that frequently participate in deadlocks
SELECT
    t.txn_id,
    COUNT(DISTINCT dh.deadlock_id) as deadlock_count,
    array_agg(DISTINCT SUBSTR(t.statement, 1, 50)) as typical_statements,
    COUNT(*) as total_executions,
    ROUND(100.0 * COUNT(DISTINCT dh.deadlock_id) / COUNT(*), 2) as deadlock_rate_pct
FROM sys.transaction_history t
LEFT JOIN sys.deadlock_history dh ON t.txn_id = ANY(dh.involved_txn_ids)
WHERE t.execution_time > NOW() - INTERVAL '7 days'
GROUP BY t.txn_id
HAVING COUNT(DISTINCT dh.deadlock_id) > 0
ORDER BY deadlock_count DESC;
```

Procedure 4: Configure Automated Recovery
Step 4a: Enable Automatic Victim Selection and Rollback
```toml
[transactions]
# Automatic recovery settings
auto_rollback_on_timeout = true
deadlock_victim_strategy = "fewest_locks"
auto_victim_rollback = true

# Retry assistance
provide_retry_guidance = true
max_victim_retry_attempts = 3
victim_retry_backoff_base_ms = 10
```

Step 4b: Application-Level Retry Logic
```python
# Python application example using psycopg2 (PostgreSQL-compatible driver)
import psycopg2
import random
import time

def execute_transaction_with_deadlock_recovery(connection, query, max_retries=3):
    """Execute a transaction with automatic deadlock recovery."""
    for attempt in range(max_retries):
        try:
            with connection.cursor() as cursor:
                cursor.execute(query)
            connection.commit()
            return
        except psycopg2.OperationalError as e:
            connection.rollback()
            if "deadlock" in str(e).lower() and attempt < max_retries - 1:
                # Exponential backoff with jitter
                backoff_ms = (10 * (2 ** attempt)) + random.randint(0, 100)
                wait_seconds = backoff_ms / 1000.0
                print(f"Deadlock detected, retrying in {wait_seconds:.2f}s "
                      f"(attempt {attempt + 1})")
                time.sleep(wait_seconds)
            else:
                raise

    raise Exception(f"Failed to execute transaction after {max_retries} attempts")

# Usage
try:
    execute_transaction_with_deadlock_recovery(
        conn,
        "UPDATE users SET status='active' WHERE id=1; "
        "UPDATE orders SET total=1000 WHERE user_id=1;",
    )
except Exception as e:
    print(f"Transaction ultimately failed: {e}")
```

Procedure 5: Manual Intervention
Step 5a: Identify Stuck Transactions
```sql
-- Find transactions stuck (waiting >30 seconds)
SELECT
    txn_id,
    start_time,
    EXTRACT(EPOCH FROM (NOW() - start_time)) as duration_sec,
    lock_count,
    wait_for_txn_id,
    status
FROM sys.transactions
WHERE status = 'WAITING'
  AND (NOW() - start_time) > INTERVAL '30 seconds'
ORDER BY duration_sec DESC;
```

Step 5b: Force Rollback of Stuck Transaction
```sql
-- CAUTION: Use only when transaction is truly stuck
-- This will lose any uncommitted work
SELECT heliosdb.force_rollback_transaction('txn_2025_1001');

-- Verify rollback
SELECT status FROM sys.transactions WHERE txn_id = 'txn_2025_1001';
-- Result should be: 'ROLLED_BACK'
```

Step 5c: Investigate and Prevent Recurrence
```bash
# Step 1: Capture full details of failed transaction
heliosdb-cli diagnostics export --format json \
    --output deadlock_investigation_$(date +%s).json

# Step 2: Analyze transaction SQL and lock access patterns
jq '.transactions[] | select(.txn_id=="txn_2025_1001")' deadlock_investigation_*.json

# Step 3: Review application code for the failing query
# (Implement consistent lock ordering per Strategy 1)

# Step 4: Test fix in staging environment
# (Run concurrency load test from benchmarks/)

# Step 5: Deploy fix to production
```

Real-World Scenarios
Scenario 1: E-Commerce Order Processing Deadlock
Context: High-volume e-commerce platform with order and inventory management
Problem:
- 10% of orders fail with deadlocks during peak traffic (Black Friday)
- Error: "Transaction deadlocked with 2 other transactions"
- Impact: 5,000+ failed orders/hour, customer complaints

Root Cause Analysis:

```
Order Processing Flow (Multiple Services):
1. Order Service:       Lock order row, lock inventory row
2. Inventory Service:   Lock inventory row, lock order row
3. Fulfillment Service: Lock order row, lock fulfillment row

Lock Ordering Issue:
- Order Service locks:       orders -> inventory (correct)
- Inventory Service locks:   inventory -> orders (WRONG! Reversed)
- Fulfillment Service locks: orders -> fulfillment (correct)

Result: Order Service (lock A->B) deadlocks with Inventory Service (lock B->A)
```

Solution Implemented:
1. Define Global Lock Order:

   ```
   users -> orders -> inventory -> fulfillment -> shipments -> returns
   ```

2. Update All Services:

   ```rust
   // Inventory Service BEFORE (WRONG)
   fn process_inventory() {
       lock_inventory();   // Lock B
       lock_orders();      // Lock A (WRONG ORDER!)
       update_inventory();
   }

   // Inventory Service AFTER (CORRECT)
   fn process_inventory() {
       lock_orders();      // Lock A
       lock_inventory();   // Lock B (CORRECT ORDER!)
       update_inventory();
   }
   ```

3. Add Comprehensive Logging:

   ```sql
   -- Monitor for any reversal of lock order
   CREATE TRIGGER validate_lock_order
   BEFORE UPDATE ON inventory
   BEGIN
       ASSERT (transaction_has_lock_on(orders)),
           "Orders must be locked before inventory!";
   END;
   ```
Results:
- Deadlocks reduced from 5,000/hour to 0
- Peak order throughput increased 15%
- Customer satisfaction score improved
Scenario 2: Financial Transaction Processing in Multi-Region Setup
Context: Banking system with multi-region deployment (US-East, US-West, EU)
Problem:
- Cross-region transfers occasionally deadlock
- Error rate: 0.5% of cross-region transactions
- Involves account tables in different regions

Deadlock Pattern:

```
Transfer: Account A (US-East) -> Account B (EU)

Transaction T1 (US-East Processor):
  1. Lock Account A in US-East [SUCCESS]
  2. Initiate transfer, lock Account B in EU [WAITING...]

Transaction T2 (EU Processor):
  1. Lock Account B in EU [SUCCESS]
  2. Update Account A in US-East (reverse transfer) [WAITING...]

DEADLOCK: T1 waits for T2's EU lock, T2 waits for T1's US-East lock
```

Solution Implemented:
1. Coordinate Lock Ordering Across Regions:

   ```sql
   -- Global transaction coordinator ensures lock order
   -- Lock order: US-East accounts BEFORE EU accounts
   BEGIN;
   -- Step 1: Lock in primary region first
   SELECT pg_advisory_xact_lock(hash('account_us_east_123'));
   UPDATE accounts_us_east SET balance = balance - 100 WHERE id = 123;
   -- Step 2: Lock in secondary region (always after primary)
   SELECT pg_advisory_xact_lock(hash('account_eu_456'));
   UPDATE accounts_eu SET balance = balance + 100 WHERE id = 456;
   COMMIT;
   ```

2. Implement Distributed Transaction Coordinator:

   ```rust
   // Coordinator ensures all transactions follow the same lock order
   struct DistributedTransaction {
       regions: Vec<Region>, // Ordered: [US-East, EU]
   }

   impl DistributedTransaction {
       fn execute(&self) -> Result<()> {
           // Lock regions in order: always US-East FIRST
           for region in &self.regions {
               region.acquire_locks()?;
           }
           for region in &self.regions {
               region.execute_operations()?;
           }
           Ok(())
       }
   }
   ```

3. Add Detection and Metrics:

   ```sql
   -- Monitor cross-region deadlock rate
   SELECT
       source_region,
       target_region,
       COUNT(*) AS total_transfers,
       COUNT(CASE WHEN deadlock THEN 1 END) AS deadlock_count,
       100.0 * COUNT(CASE WHEN deadlock THEN 1 END) / COUNT(*) AS deadlock_rate_pct
   FROM cross_region_transfers
   WHERE transfer_time > NOW() - INTERVAL '1 hour'
   GROUP BY source_region, target_region
   ORDER BY deadlock_rate_pct DESC;
   ```
Results:
- Cross-region deadlock rate reduced to 0.01% (from 0.5%)
- Improved transaction reliability from 99.5% to 99.99%
- Negligible performance impact (<1% latency increase from lock coordination)
Troubleshooting
Issue 1: Frequent Deadlocks (>10/hour)
Diagnostic Steps
```bash
# Step 1: Collect deadlock statistics
psql heliosdb -c "
SELECT
    COUNT(*) as deadlock_count,
    DATE(detection_time) as date,
    EXTRACT(HOUR FROM detection_time) as hour
FROM sys.deadlock_history
WHERE detection_time > NOW() - INTERVAL '3 days'
GROUP BY DATE(detection_time), EXTRACT(HOUR FROM detection_time)
ORDER BY deadlock_count DESC;"

# Step 2: Identify problematic tables
psql heliosdb -c "
SELECT
    table_name,
    COUNT(*) as deadlock_involvement,
    array_agg(DISTINCT operation_type) as operations
FROM sys.deadlock_involved_tables
WHERE detection_time > NOW() - INTERVAL '1 day'
GROUP BY table_name
ORDER BY deadlock_involvement DESC;"

# Step 3: Check for lock ordering issues
psql heliosdb -c "
SELECT
    txn_id,
    statement,
    SUBSTRING(statement FROM 1 FOR 100) as first_statement,
    lock_order
FROM sys.transaction_history
WHERE deadlock_involved = TRUE
ORDER BY txn_id;"
```

Resolution
```toml
# Increase deadlock detection sensitivity
[transactions]
deadlock_check_interval = 50   # Reduce from 100ms to 50ms
deadlock_timeout = 3000        # Reduce from 5000ms to 3000ms
lock_timeout = 20              # Reduce from 30s to 20s

# Adjust victim selection
deadlock_victim_strategy = "fewest_locks"  # Minimizes rollback impact
```

Issue 2: Transaction Timeouts Without Deadlock Messages
Diagnostic
```sql
-- Distinguish between deadlock and timeout
SELECT
    txn_id,
    error_type,
    error_message,
    wait_duration_ms,
    lock_count,
    CASE
        WHEN wait_duration_ms > 30000 THEN 'Likely Timeout'
        WHEN cycle_detected THEN 'Deadlock'
        ELSE 'Other'
    END as diagnosis
FROM sys.transaction_errors
WHERE error_time > NOW() - INTERVAL '1 hour'
ORDER BY error_time DESC;
```

Resolution
```toml
[transactions]
# Increase timeouts if timeouts are premature
lock_timeout = 60          # Increase from 30s
transaction_timeout = 600  # Increase from 300s

# Or decrease if timeouts are intentional (circuit breaker)
lock_timeout = 15
transaction_timeout = 120
```

Issue 3: High Lock Wait Times But No Deadlocks
Diagnostic
```sql
-- Identify bottleneck tables/rows
SELECT
    table_name,
    row_id,
    COUNT(*) as wait_count,
    AVG(wait_time_ms) as avg_wait_ms,
    MAX(wait_time_ms) as max_wait_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY wait_time_ms) as p99_wait
FROM sys.lock_waits
WHERE wait_time_ms > 100
GROUP BY table_name, row_id
ORDER BY wait_count DESC
LIMIT 20;
```

Resolution
```sql
-- 1. Consider partitioning hot table
CREATE TABLE users_new (LIKE users INCLUDING ALL) PARTITION BY RANGE (id);
CREATE TABLE users_new_p1 PARTITION OF users_new FOR VALUES FROM (0) TO (1000000);
CREATE TABLE users_new_p2 PARTITION OF users_new FOR VALUES FROM (1000000) TO (2000000);

-- 2. Or reduce lock scope
-- Instead of: UPDATE users SET ... WHERE status = 'active'
-- Use: UPDATE users SET ... WHERE id IN
--      (SELECT id FROM users WHERE status = 'active' LIMIT 100)

-- 3. Use row-level locking hints (fail fast instead of queueing)
SELECT * FROM users WHERE id = 1 FOR UPDATE NOWAIT;
UPDATE users SET balance = balance + 100 WHERE id = 1;
```

Performance Considerations
Impact of Deadlock Detection
Deadlock detection mechanisms consume resources:
| Detection Mechanism | CPU Impact | Memory Impact | Latency Impact |
|---|---|---|---|
| Wait-for graph tracking | 1-2% | +5MB | <1ms |
| Cycle detection | 2-5% | +10MB | 1-5ms |
| Lock timeout checking | <1% | <1MB | <1ms |
| Automatic logging | 2-3% | +20MB | <5ms (log write) |

Total in "always on" config: ~5-10% CPU, ~35MB memory, <10ms latency

Optimization Strategies
Strategy A: Disable Detailed Logging in High-Performance Paths
```toml
[logging]
# Reduce logging overhead
detailed_deadlock_logs = false   # Only log essential info
log_wait_for_graph = false       # Skip detailed graph logging
transaction_history_log = false  # Only log summary stats

# Keep alert-level logging active
deadlock_log_level = "warning"   # Only log actual deadlocks, not all waits
```

Strategy B: Tune Detection Intervals
```toml
[transactions]
# Balance between responsiveness and overhead
deadlock_check_interval = 200  # Check every 200ms instead of the default 100ms
# Halving the check frequency roughly halves detection CPU overhead,
# at the cost of up to ~100ms of added detection latency

# For high-throughput systems, can increase to 500ms
# For financial systems requiring quick response, keep at 50-100ms
```

Strategy C: Use Sampling for Monitoring
```sql
-- Sample transaction history instead of logging all
CREATE TABLE sys.transaction_history_sampled AS
SELECT * FROM sys.transaction_history
WHERE random() < 0.01           -- Log 1% of transactions
   OR deadlock_involved = TRUE; -- Always log deadlock-involved
```

Performance Metrics by Configuration

| Configuration | Deadlock Detection Latency | CPU Overhead | False Positive Rate |
|---|---|---|---|
| Aggressive (50ms, detailed) | 50-100ms | 8-10% | <0.1% |
| Balanced (200ms, normal) | 200-300ms | 3-5% | <0.1% |
| Conservative (500ms, minimal) | 500-1000ms | 1-2% | 1-2% |
| Disabled | None (manual only) | <0.1% | N/A |

Integration with Application Frameworks
Java/Spring Framework Integration
```java
// Spring Data JPA configuration with deadlock handling
@Configuration
public class TransactionConfig {

    @Bean
    public PlatformTransactionManager transactionManager(EntityManagerFactory emf) {
        JpaTransactionManager tm = new JpaTransactionManager(emf);

        // Configure deadlock retry
        tm.setDefaultTimeout(30); // 30 second timeout

        return tm;
    }

    @Bean
    public TransactionTemplate transactionTemplate(PlatformTransactionManager tm) {
        TransactionTemplate template = new TransactionTemplate(tm);
        template.setIsolationLevel(TransactionDefinition.ISOLATION_READ_COMMITTED);

        return template;
    }
}

// Usage with AOP retry (Spring Retry)
@Service
public class OrderService {

    @Transactional
    @Retryable(
        value = {DeadlockException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 100, multiplier = 2)
    )
    public void processOrder(Order order) {
        // Implementation uses consistent lock ordering
        userRepository.save(order.getUser());          // Lock 1
        orderRepository.save(order);                   // Lock 2
        inventoryRepository.update(order.getItems());  // Lock 3
    }

    @Recover
    public void recover(DeadlockException ex, Order order) {
        logger.error("Order processing failed after retries: {}", order.getId());
        // Fallback logic
    }
}
```

Python/SQLAlchemy Integration
```python
from sqlalchemy import create_engine, event
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import OperationalError
import time
import random


class HeliosDBSessionFactory:
    """SQLAlchemy session factory with deadlock handling"""

    def __init__(self, connection_string, max_retries=3):
        self.engine = create_engine(connection_string)
        self.Session = sessionmaker(bind=self.engine)
        self.max_retries = max_retries

        # Register deadlock event listener
        @event.listens_for(self.engine, "handle_error")
        def deadlock_listener(exception_context):
            if "deadlock" in str(exception_context.original_exception).lower():
                exception_context.invalidate_pool_on_disconnect = True

    def run_with_deadlock_handling(self, work):
        """Run `work(session)` in a transaction, retrying on deadlock.

        A @contextmanager generator may only yield once, so it cannot
        retry a `with` block; instead we re-invoke the callable with a
        fresh session on each attempt.
        """
        for attempt in range(self.max_retries):
            session = self.Session()
            try:
                result = work(session)
                session.commit()
                return result
            except OperationalError as e:
                session.rollback()
                if "deadlock" in str(e).lower() and attempt < self.max_retries - 1:
                    # Exponential backoff with jitter, in milliseconds
                    backoff = (100 * (2 ** attempt)) + random.randint(0, 50)
                    time.sleep(backoff / 1000.0)
                else:
                    raise
            finally:
                session.close()


# Usage
db = HeliosDBSessionFactory("postgresql://user:pass@localhost/heliosdb")


def process_transfer(from_user_id, to_user_id, amount):
    def work(session):
        # Lock in canonical (ascending id) order to avoid deadlocks,
        # regardless of which account is the sender
        low_id, high_id = sorted((from_user_id, to_user_id))
        locked = {
            u.id: u
            for u in (
                session.query(User).filter_by(id=low_id).with_for_update().one(),
                session.query(User).filter_by(id=high_id).with_for_update().one(),
            )
        }
        locked[from_user_id].balance -= amount
        locked[to_user_id].balance += amount

    db.run_with_deadlock_handling(work)
```

Two fixes worth noting: the retry loop takes a callable rather than acting as a context manager (a generator-based context manager cannot yield more than once), and the transfer keys the locked rows by id so that lock-acquisition order (always ascending id) stays independent of transfer direction.

Node.js/Sequelize Integration
```javascript
// Sequelize models with deadlock handling
const Sequelize = require('sequelize');
const sequelize = new Sequelize('heliosdb://user:pass@localhost/heliosdb');

// Custom transaction wrapper with deadlock retry
class DeadlockRetryTransaction {
  constructor(maxRetries = 3) {
    this.maxRetries = maxRetries;
  }

  async execute(asyncFn) {
    for (let attempt = 0; attempt < this.maxRetries; attempt++) {
      try {
        return await sequelize.transaction(
          { isolationLevel: Sequelize.Transaction.ISOLATION_LEVELS.READ_COMMITTED },
          asyncFn
        );
      } catch (error) {
        if (error.message.includes('deadlock') && attempt < this.maxRetries - 1) {
          // Exponential backoff with jitter, in milliseconds
          const backoff = (100 * Math.pow(2, attempt)) + Math.random() * 50;
          await new Promise(resolve => setTimeout(resolve, backoff));
        } else {
          throw error;
        }
      }
    }
  }
}

// Usage
const retryHandler = new DeadlockRetryTransaction(3);

async function processOrder(userId) {
  return retryHandler.execute(async (t) => {
    // Ensure consistent lock ordering: user row first, then order, then inventory
    const user = await User.findByPk(userId, { lock: t.LOCK.UPDATE, transaction: t });
    const order = await Order.create({ userId: user.id }, { transaction: t });
    // Atomic decrement avoids a read-modify-write race on the quantity column
    await Inventory.decrement('quantity', { where: { id: order.itemId }, transaction: t });
    return order;
  });
}
```

Best Practices Summary
Critical Best Practices
1. Always Implement Consistent Lock Ordering (non-negotiable)
   - Define a canonical table order globally
   - Enforce it in all transaction paths
   - Document the lock order in code comments
2. Keep Transactions Short and Focused
   - Lock resources immediately before use
   - Release locks immediately after use
   - Move non-database operations outside transactions
3. Implement Application-Level Retry Logic
   - Use exponential backoff for retries
   - Limit retry attempts (usually 3-5)
   - Log and alert on persistent failures
4. Monitor and Alert on Deadlock Indicators
   - Track the deadlock rate (alert on >5/hour)
   - Monitor lock wait times (alert on >100ms average)
   - Monitor the transaction timeout rate
5. Use Appropriate Isolation Levels
   - Start with READ COMMITTED for most use cases
   - Only use SERIALIZABLE when strictly necessary
   - Document the isolation level choice
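Practice 1 can be enforced mechanically rather than by convention: sort the locks a transaction intends to take by a canonical key before acquiring any of them. A minimal sketch (the table names and the `(table, row_id)` request shape are illustrative, not a HeliosDB API):

```python
# Canonical global table order; every transaction locks in this order.
CANONICAL_TABLE_ORDER = ["users", "orders", "inventory"]  # illustrative


def ordered_lock_plan(requests):
    """Sort (table, row_id) lock requests into canonical order.

    If every transaction acquires overlapping locks in the same
    sequence, no cycle of waits can form, which removes the
    circular-wait condition required for a deadlock.
    """
    rank = {t: i for i, t in enumerate(CANONICAL_TABLE_ORDER)}
    return sorted(requests, key=lambda r: (rank[r[0]], r[1]))
```

A transaction that needs rows from `orders` and `users` in any order would pass both requests through `ordered_lock_plan` and acquire them in the returned sequence.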
Configuration Summary
```toml
[transactions]
# Detection
enable_deadlock_detection = true
deadlock_check_interval = 100
deadlock_timeout = 5000
lock_timeout = 30
transaction_timeout = 300

# Recovery
auto_rollback_on_timeout = true
deadlock_victim_strategy = "fewest_locks"
auto_victim_rollback = true

[logging]
deadlock_log_level = "info"
detailed_deadlock_logs = true
log_wait_for_graph = true
transaction_history_log = true
```

Conclusion
Deadlock prevention in HeliosDB relies on:
- Consistent lock ordering - The single most important factor
- Transaction design - Keeping transactions short and focused
- Monitoring - Early detection of deadlock conditions
- Application resilience - Implementing retry logic at application level
- Appropriate configuration - Tuning timeouts and detection parameters
By following the strategies and procedures outlined in this guide, you can eliminate the vast majority of deadlocks and ensure reliable, high-throughput transaction processing in HeliosDB.
Additional Resources
- HeliosDB Transaction Documentation: docs/features/transactions/
- MVCC Implementation: docs/features/mvcc/
- Lock Management: docs/architecture/locking.md
- Monitoring and Diagnostics: docs/guides/user/AUTONOMOUS_DATABASE_OPERATIONS_GUIDE.md
- Performance Tuning: docs/guides/user/CONFIGURATION_GUIDE.md
Questions or Issues? Contact the HeliosDB support team or consult the project’s GitHub repository.