2PC Testing Infrastructure - Quick Reference

Version: 1.0
Last Updated: November 28, 2025
Status: Weeks 8-19 Implementation Guide


Overview

Purpose: Test HeliosDB's Two-Phase Commit (2PC) implementation for production readiness
Scope: 50+ failure scenarios, 1000+ node scale testing
Duration: 12 weeks (Weeks 8-19)
Budget: $200K development, $0 execution (automated)


📋 Quick Start (5 Minutes)

1. Run All Tests

cd heliosdb-storage/tests/distributed
cargo test --release

2. Run Specific Category

# Node crash scenarios
cargo test --release node_crashes_
# Network partition scenarios
cargo test --release network_partitions_
# Timing anomaly scenarios
cargo test --release timing_anomalies_

3. Run Single Scenario

cargo test --release scenario_coordinator_crash_after_prepare

4. Generate Report

cargo test --release -- --test-threads=1 > test-results.txt

πŸ— Architecture Overview

                      Test Infrastructure

┌───────────────┐     ┌──────────────┐     ┌────────────────┐
│  Test Runner  │  →  │ Coordinator  │  →  │   Validator    │
│   (main.rs)   │     │  Simulator   │     │ (ACID checks)  │
└───────────────┘     └──────────────┘     └────────────────┘
        ↓                     ↓
┌────────────────┐    ┌────────────────┐
│    Virtual     │    │    Metrics     │
│  Participants  │    │   Collector    │
│    (1000+)     │    │                │
└────────────────┘    └────────────────┘
        ↓
┌────────────────┐
│    Failure     │
│    Injector    │
└────────────────┘

Key Components (see the sketch after this list):

  • Virtual Participants: Lightweight in-memory participants (100MB for 1000 nodes)
  • Failure Injector: Crash, network partition, delay injection
  • Validator: Atomicity, Consistency, Isolation, Durability checks
  • Metrics: Latency (p50/p99/p999), throughput, recovery time
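
To make these roles concrete, here is a minimal, self-contained Rust sketch of a virtual participant with failure injection. All names (VirtualParticipant, FailureInjector, Failure, Vote) and the behavior they model are illustrative assumptions, not the actual heliosdb-storage test API.

```rust
// Illustrative sketch only: these types are hypothetical, not the real harness.
use std::collections::HashMap;

/// Votes a participant can return during PREPARE.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Vote { Yes, No }

/// Failures the injector can impose on a participant.
#[derive(Debug, Clone, Copy)]
enum Failure {
    Crash,        // participant stops responding entirely
    Partitioned,  // messages to/from this node are dropped
    DelayMs(u64), // responses are delayed by this many milliseconds
}

/// A lightweight in-memory participant: no disk, no sockets, just state.
struct VirtualParticipant {
    id: u32,
    prepared: HashMap<u64, Vec<(String, String)>>, // txn id -> staged writes
    committed: HashMap<String, String>,            // "durable" key/value state
    failure: Option<Failure>,
}

impl VirtualParticipant {
    fn new(id: u32) -> Self {
        Self { id, prepared: HashMap::new(), committed: HashMap::new(), failure: None }
    }

    /// Phase 1: stage the writes and vote, unless a failure is injected.
    fn prepare(&mut self, txn: u64, writes: Vec<(String, String)>) -> Option<Vote> {
        match self.failure {
            Some(Failure::Crash) | Some(Failure::Partitioned) => None, // no reply
            _ => {
                self.prepared.insert(txn, writes);
                Some(Vote::Yes)
            }
        }
    }

    /// Phase 2: apply the staged writes.
    fn commit(&mut self, txn: u64) -> bool {
        if matches!(self.failure, Some(Failure::Crash)) {
            return false; // crashed node never acknowledges COMMIT
        }
        if let Some(writes) = self.prepared.remove(&txn) {
            self.committed.extend(writes);
        }
        true
    }
}

/// The failure injector just flips per-participant failure flags.
struct FailureInjector;

impl FailureInjector {
    fn inject(target: &mut VirtualParticipant, failure: Failure) {
        target.failure = Some(failure);
    }
}

fn main() {
    let mut p = VirtualParticipant::new(1);
    println!("participant {} started", p.id);
    assert_eq!(p.prepare(42, vec![("k".into(), "v".into())]), Some(Vote::Yes));
    FailureInjector::inject(&mut p, Failure::Crash);
    assert!(!p.commit(42));
}
```

Keeping participants purely in memory is what makes the scale target above plausible (roughly 100MB for 1000 virtual nodes).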

50+ Test Scenarios

Category 1: Node Crashes (6 scenarios)

Scenario | Description | Expected Outcome
1.1 Single Participant Crash | One participant crashes during PREPARE | Transaction ABORTS cleanly
1.2 Coordinator Crash Before Prepare | Coordinator crashes before sending PREPARE | No participant affected
1.3 Coordinator Crash After Prepare | Coordinator crashes after PREPARE, before COMMIT | Recovery completes COMMIT
1.4 Coordinator Crash During Commit | Coordinator crashes mid-COMMIT | Recovery completes COMMIT
1.5 Multiple Participant Crashes | 40% of participants crash | Transaction ABORTS (no quorum)
1.6 Cascading Crashes | Sequential crashes during phases | Clean abort, no data loss
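
For a flavor of how a scenario in this table might be expressed, below is a minimal, self-contained sketch of scenario 1.3. The types (Wal, Participant, Decision) are hypothetical stand-ins; the point is the 2PC rule that makes the expected outcome possible: the coordinator persists its COMMIT decision before sending any COMMIT message, so recovery can replay the WAL and finish the commit.

```rust
// Minimal sketch of scenario 1.3 (coordinator crash after PREPARE).
// All names are illustrative, not the real test harness.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Decision { Commit, Abort }

/// Stand-in for the coordinator's write-ahead log: the decision must be made
/// durable *before* any COMMIT message goes out, or recovery cannot finish it.
struct Wal { decision: Option<Decision> }

struct Participant { prepared: bool, committed: bool }

fn recover(wal: &Wal, participants: &mut [Participant]) {
    // On restart the coordinator replays its log: a logged COMMIT decision
    // is re-driven to every prepared participant.
    if wal.decision == Some(Decision::Commit) {
        for p in participants.iter_mut().filter(|p| p.prepared) {
            p.committed = true;
        }
    }
}

fn main() {
    let mut participants = vec![
        Participant { prepared: true, committed: false },
        Participant { prepared: true, committed: false },
        Participant { prepared: true, committed: false },
    ];

    // Phase 1: all participants voted YES; the coordinator persists COMMIT...
    let wal = Wal { decision: Some(Decision::Commit) };
    // ...and then crashes before sending a single COMMIT message.

    // Recovery replays the WAL and completes the commit on every participant.
    recover(&wal, &mut participants);
    assert!(participants.iter().all(|p| p.committed)); // expected outcome of 1.3
    println!("recovery completed COMMIT on all participants");
}
```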

Category 2: Network Partitions (8 scenarios)

Scenario | Description | Expected Outcome
2.1 Minority Partition | 20% of nodes isolated | Transaction ABORTS
2.2 Split-Brain | Exactly 50/50 split | No quorum, both abort
2.3 Asymmetric Partition | Coordinator reachable, participants isolated | Depends on implementation
2.4 Partition During Prepare | Partition mid-voting | ABORT, timeout protection
2.5 Partition During Commit | Partition mid-COMMIT | Recovery completes COMMIT
2.6 Partition Heal During Recovery | Network heals during recovery | Recovery completes successfully
2.7 Multiple Overlapping Partitions | Multiple partitions simultaneously | ABORT, no partial commits
2.8 Flapping Network | Repeated connect/disconnect | Eventually commits or aborts
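
The partition scenarios that abort (2.1, 2.2, 2.4, 2.7) all come back to the same vote rule: 2PC can only commit when a YES vote from every participant arrives before the prepare timeout, so even a small isolated group blocks the commit. A minimal illustration, with names that are assumptions rather than the harness API:

```rust
// Sketch of the vote rule behind the ABORT outcomes above.
#[derive(Clone, Copy, PartialEq)]
enum Vote { Yes, No, Unreachable }

fn decide(votes: &[Vote]) -> &'static str {
    // A single NO, or a single missing vote (timeout), is enough to abort.
    if votes.iter().all(|v| *v == Vote::Yes) { "COMMIT" } else { "ABORT" }
}

fn main() {
    let mut votes = vec![Vote::Yes; 100];
    assert_eq!(decide(&votes), "COMMIT");

    // Scenario 2.1: 20% of nodes isolated -> their votes never arrive -> ABORT.
    for v in votes.iter_mut().take(20) {
        *v = Vote::Unreachable;
    }
    assert_eq!(decide(&votes), "ABORT");
}
```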

Category 3: Timing Anomalies (10 scenarios)

Scenario | Description | Expected Outcome
3.1 Slow Participant Prepare (1s) | One participant delays vote | Commits if timeout > 1s
3.2 Slow Participant Commit (2s) | One participant slow to commit | Commits eventually
3.3 Timeout During Prepare | Coordinator timeout before all votes | Transaction ABORTS
3.4 Timeout During Commit | Participant doesn't ACK commit | Retries until success
3.5 Variable Latency (50-500ms) | Random latency per participant | Commits, latency = max(latencies)
3.6 Packet Loss 5% | 5% packet drop rate | Commits with retries
3.7 Packet Loss 20% | 20% packet drop rate | Commits or aborts within timeout
3.8 High Latency (200ms) | Cross-region simulation | Commits, high latency expected
3.9 Bandwidth Constraint (1 Mbps) | Low bandwidth | Commits, high latency
3.10 Clock Skew (10s) | Participant clocks differ | No impact (logical timestamps)

Category 4: Coordinator Failures (8 scenarios)

Scenario | Description | Expected Outcome
4.1 Coordinator Failover | Primary fails, backup takes over | Backup completes transactions
4.2 Restart with Recovery | Coordinator restarts, recovers from WAL | All prepared txns completed
4.3 Out of Memory | OOM during transaction | Clean failure, no corruption
4.4 Thread Exhaustion | Thread pool saturated | Queuing, no failures
4.5 Disk Full | No disk space for WAL | Rejects new txns gracefully
4.6 Multiple Crashes | Repeated crashes during recovery | Eventually recovers
4.7 Byzantine Coordinator | Conflicting COMMIT/ABORT messages | Detection, alerting
4.8 Stuck in Prepare | Coordinator hangs | Participant timeout protection

Category 5: Lock Timeouts (6 scenarios)

Scenario | Description | Expected Outcome
5.1 Single Key Timeout | Lock timeout on one key | Abort after timeout
5.2 Multiple Key Timeout | Timeout on multiple keys | Clean abort, release all locks
5.3 Deadlock Detection (2 txns) | Classic 2-way deadlock | Abort youngest, other proceeds
5.4 Deadlock Resolution (3+ txns) | 3-way deadlock cycle | Abort one, others proceed
5.5 Livelock Prevention | Repeated abort-retry loop | Exponential backoff, eventual progress
5.6 Long-Running Transactions | Transaction runs 60s | Configurable timeout, no premature abort
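
Scenario 5.5 expects exponential backoff with jitter to break abort-retry loops: aborted transactions wait progressively longer, and the jitter keeps competitors from colliding again on the same schedule. A minimal sketch of such a delay schedule; the constants and jitter scheme are assumptions, not HeliosDB's actual retry policy.

```rust
use std::time::Duration;

// Sketch of a capped exponential backoff with cheap deterministic "jitter".
// A real implementation would use a proper RNG for the jitter term.
fn retry_delay(attempt: u32) -> Duration {
    let base_ms: u64 = 10;
    let cap_ms: u64 = 5_000;
    // Exponential growth, capped so long abort/retry chains don't stall forever.
    let exp_ms = base_ms.saturating_mul(1u64 << attempt.min(16)).min(cap_ms);
    // Jitter in [0, exp_ms/2] so two transactions don't stay in lock-step.
    let jitter_ms = (attempt as u64 * 7919) % (exp_ms / 2 + 1);
    Duration::from_millis(exp_ms / 2 + jitter_ms)
}

fn main() {
    for attempt in 0..6 {
        println!("retry {attempt} after {:?}", retry_delay(attempt));
    }
}
```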

Category 6: Combined Failures (6 scenarios)

Scenario | Description | Expected Outcome
6.1 Partition + Crash | Network partition + node crash | Clean handling, no data loss
6.2 Slow Network + Timeout | High latency triggers timeout | Correct timeout behavior
6.3 Cascading Failures | One failure triggers others | Self-stabilization
6.4 Recovery During New Failures | New failures during recovery | Robust recovery
6.5 Multiple Failure Modes | All failure types simultaneously | ABORT, zero corruption
6.6 Worst-Case Chaos | 100 nodes, 100 txns, random failures | FINAL BOSS TEST

Success Criteria

Correctness (100% Required)

  • Atomicity: All-or-nothing commits (0 violations)
  • Consistency: Invariants maintained (e.g., balance unchanged)
  • Isolation: No dirty reads (0 violations)
  • Durability: All committed transactions survive crash
  • Data Integrity: Zero data corruption across all scenarios
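
As an illustration of the first two checks, here is a self-contained sketch of how a post-run validator might verify atomicity and the balance invariant. Function and type names are assumptions, not the real Validator module.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum Outcome { Committed, Aborted }

/// Atomicity: for every transaction, all participants must report the same outcome.
fn check_atomicity(outcomes: &HashMap<u64, Vec<Outcome>>) -> bool {
    outcomes.values().all(|per_participant| {
        per_participant.windows(2).all(|w| w[0] == w[1])
    })
}

/// Consistency: money transfers must leave the total balance unchanged.
fn check_balance_invariant(before: &HashMap<String, i64>, after: &HashMap<String, i64>) -> bool {
    before.values().sum::<i64>() == after.values().sum::<i64>()
}

fn main() {
    let mut outcomes = HashMap::new();
    outcomes.insert(1, vec![Outcome::Committed; 3]); // all participants agree
    outcomes.insert(2, vec![Outcome::Aborted; 3]);
    assert!(check_atomicity(&outcomes));

    let before = HashMap::from([("a".to_string(), 100), ("b".to_string(), 50)]);
    let after  = HashMap::from([("a".to_string(), 70),  ("b".to_string(), 80)]);
    assert!(check_balance_invariant(&before, &after)); // 150 == 150
}
```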

Performance (<5% Degradation)

  • Prepare Latency (p99): <100ms (baseline: 8ms)
  • Commit Latency (p99): <100ms (baseline: 10ms)
  • Recovery Time (p99): <10s (all scenarios)
  • Throughput: >10,000 distributed TXN/sec
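
The latency targets above are percentiles over per-transaction samples. Below is a naive, self-contained sketch of the nearest-rank computation; a production metrics collector would more likely use histograms.

```rust
// Nearest-rank percentile over raw latency samples (illustrative only).
fn percentile(samples: &mut Vec<f64>, p: f64) -> f64 {
    assert!(!samples.is_empty());
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Index of the smallest sample covering fraction p of the data.
    let rank = ((p * samples.len() as f64).ceil() as usize).clamp(1, samples.len());
    samples[rank - 1]
}

fn main() {
    // e.g. commit latencies in milliseconds collected during a run
    let mut latencies: Vec<f64> = (1..=1000).map(|i| i as f64 / 100.0).collect();
    println!("p50  = {:.2} ms", percentile(&mut latencies, 0.50));
    println!("p99  = {:.2} ms", percentile(&mut latencies, 0.99));
    println!("p999 = {:.2} ms", percentile(&mut latencies, 0.999));
}
```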

Scale

  • Participants: 1000+ nodes in single test
  • Concurrent Transactions: 100+ simultaneous
  • Test Duration: 50+ scenarios in <48 hours
  • Resource Usage: <100MB per 1000 virtual participants

Implementation Timeline

Weeks 8-11: Development ($200K)

Week 8: Foundation

  • Day 1-2: Module structure, build setup
  • Day 3-5: Virtual participant implementation
  • Day 6-7: Basic test harness

Week 9: Failure Injection

  • Day 1-3: Failure injector
  • Day 4-5: Network simulator
  • Day 6-7: Integration

Week 10: Validation

  • Day 1-3: Correctness validator
  • Day 4-5: Metrics collector
  • Day 6-7: WAL recovery testing

Week 11: Scenarios

  • Day 1-5: Implement all 50+ scenarios
  • Day 6-7: Documentation, CI/CD integration

Deliverables (End of Week 11):

  • 7 modules, 3,100 LOC
  • All 50+ scenarios defined
  • CI/CD integrated
  • Developer docs complete

Weeks 12-15: Basic Testing ($0 - automated)

Categories:

  • Node Crashes (6 scenarios)
  • Network Partitions (8 scenarios)
  • Timing Anomalies (10 scenarios)

Activities:

  • Automated CI/CD runs
  • Bug filing and fixes
  • Iterative improvements

Weeks 16-18: Advanced Testing ($0 - automated)

Categories:

  • Coordinator Failures (8 scenarios)
  • Lock Timeouts (6 scenarios)
  • Combined Failures (6 scenarios)

Activities:

  • Scale testing (100, 500, 1000 nodes)
  • Performance regression testing
  • Chaos engineering

Week 19: Final Validation ($0 - automated)

Activities:

  • Full suite execution (50+ scenarios × 3 reps)
  • Aggregate results
  • Production certification report
  • Sign-off

Expected Results

After Week 11 (Development Complete)

Test framework ready
All scenarios implemented
CI/CD integrated
⏸ Waiting for test execution

After Week 15 (Basic Testing)

24/24 basic scenarios PASS
Node crashes: 100% PASS
Network partitions: 100% PASS
Timing anomalies: 100% PASS
⚠ 2-3 bugs found and fixed

After Week 18 (Advanced Testing)

44/50 scenarios PASS
Coordinator failures: 95% PASS
Lock timeouts: 100% PASS
Combined failures: 90% PASS
⚠ Scale testing reveals 1-2 edge cases

After Week 19 (Final Validation)

50/50 scenarios PASS (100%)
Zero data corruption
Performance <5% degradation
Recovery <10s (all scenarios)
PRODUCTION CERTIFIED

Common Commands

Run Tests

# All tests
cargo test --release
# Specific category
cargo test --release node_crashes_
cargo test --release network_partitions_
cargo test --release timing_anomalies_
cargo test --release coordinator_failures_
cargo test --release lock_timeouts_
cargo test --release combined_failures_
# Single scenario
cargo test --release scenario_coordinator_crash_after_prepare
# With verbose output
RUST_LOG=debug cargo test --release -- --nocapture
# Single-threaded (for deterministic failures)
cargo test --release -- --test-threads=1

CLI Usage

# Run all scenarios
./2pc-test run-all --participants 100 --output-dir results/
# Run category
./2pc-test run-category "Node Crashes" --participants 100
# Run single scenario
./2pc-test run-scenario "Coordinator Crash After Prepare" --verbose
# List scenarios
./2pc-test list
./2pc-test list --category "Network Partitions"
# Generate random scenario
./2pc-test random --participants 500 --transactions 50

Generate Reports

# Text report
cargo test --release > test-results.txt
# JSON report (libtest's --format json is unstable and needs a nightly toolchain)
cargo +nightly test --release -- -Z unstable-options --format json > test-results.json
# HTML coverage report
cargo tarpaulin --out Html --output-dir target/coverage

πŸ› Troubleshooting

Test Failures

Issue: Test fails with "Atomicity violation"
Solution: Check the coordinator's WAL recovery logic and ensure the COMMIT decision is persisted before COMMIT messages are sent

Issue: Timeout in a network partition scenario
Solution: Increase the timeout in the NetworkConditions config
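
For reference, a hypothetical shape of that config; the actual field names and defaults in the harness may differ.

```rust
use std::time::Duration;

// Illustrative sketch of a NetworkConditions-style config (field names assumed).
struct NetworkConditions {
    latency: Duration,          // one-way delay added to every message
    packet_loss: f64,           // fraction of messages silently dropped (0.0..=1.0)
    prepare_timeout: Duration,  // how long the coordinator waits for votes
    commit_timeout: Duration,   // how long it waits for COMMIT acknowledgements
}

fn main() {
    // For partition scenarios, widen the timeouts so the partition itself,
    // not the coordinator giving up early, is what the test exercises.
    let conditions = NetworkConditions {
        latency: Duration::from_millis(200),
        packet_loss: 0.05,
        prepare_timeout: Duration::from_secs(10),
        commit_timeout: Duration::from_secs(30),
    };
    println!("latency: {:?}", conditions.latency);
    println!("packet loss: {}", conditions.packet_loss);
    println!("prepare timeout: {:?}", conditions.prepare_timeout);
    println!("commit timeout: {:?}", conditions.commit_timeout);
}
```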

Issue: "Cannot allocate memory" during a 1000-node test
Solution: Reduce num_participants or increase VM memory

Performance Issues

Issue: Tests run very slowly
Solution: Use the --release flag and ensure debug logging is disabled

Issue: High CPU usage
Solution: Limit test concurrency with --test-threads=1



📞 Support

Questions? Check the main architecture document for detailed specifications.

Found a bug? File an issue with:

  • Scenario name
  • Test output
  • Expected vs. actual result

Version: 1.0
Status: Ready for Week 8 Implementation
Next Review: End of Week 11 (Development Complete)