2PC Testing Infrastructure - Quick Reference
Version: 1.0 | Last Updated: November 28, 2025 | Status: Week 8-19 Implementation Guide
Overview
- Purpose: Test HeliosDB's Two-Phase Commit (2PC) implementation for production readiness
- Scope: 50+ failure scenarios, 1000+ node scale testing
- Duration: 12 weeks (Weeks 8-19)
- Budget: $200K development, $0 execution (automated)
Quick Start (5 Minutes)
1. Run All Tests

```bash
cd heliosdb-storage/tests/distributed
cargo test --release
```

2. Run Specific Category

```bash
# Node crash scenarios
cargo test --release node_crashes_

# Network partition scenarios
cargo test --release network_partitions_

# Timing anomaly scenarios
cargo test --release timing_anomalies_
```

3. Run Single Scenario

```bash
cargo test --release scenario_coordinator_crash_after_prepare
```

4. Generate Report

```bash
cargo test --release -- --test-threads=1 > test-results.txt
```

Architecture Overview
```
┌────────────────────────────────────────────────────────────────┐
│                      Test Infrastructure                       │
│                                                                │
│  ┌──────────────┐    ┌──────────────┐    ┌─────────────────┐   │
│  │ Test Runner  │───▶│ Coordinator  │───▶│   Validator     │   │
│  │  (main.rs)   │    │  Simulator   │    │  (ACID checks)  │   │
│  └──────────────┘    └──────────────┘    └─────────────────┘   │
│                             │                    │             │
│                             ▼                    ▼             │
│                      ┌──────────────┐    ┌─────────────────┐   │
│                      │   Virtual    │    │     Metrics     │   │
│                      │ Participants │    │    Collector    │   │
│                      │   (1000+)    │    │                 │   │
│                      └──────────────┘    └─────────────────┘   │
│                             │                                  │
│                             ▼                                  │
│                      ┌──────────────┐                          │
│                      │   Failure    │                          │
│                      │   Injector   │                          │
│                      └──────────────┘                          │
└────────────────────────────────────────────────────────────────┘
```

Key Components:
- Virtual Participants: Lightweight in-memory participants (100MB for 1000 nodes)
- Failure Injector: Crash, network partition, delay injection
- Validator: Atomicity, Consistency, Isolation, Durability checks
- Metrics: Latency (p50/p99/p999), throughput, recovery time
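The virtual-participant idea above can be sketched in a few lines of Rust. This is a minimal, hypothetical model (the real types in heliosdb-storage will differ): writes are staged on PREPARE and only merged into committed state on COMMIT, which is what makes a clean ABORT cheap.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Vote { Yes, No }

#[derive(Debug, Default)]
struct VirtualParticipant {
    // Committed key-value state; prepared writes are staged separately
    // so an ABORT leaves committed state untouched.
    committed: HashMap<String, String>,
    staged: HashMap<u64, Vec<(String, String)>>,
    crashed: bool,
}

impl VirtualParticipant {
    fn prepare(&mut self, txn: u64, writes: Vec<(String, String)>) -> Vote {
        if self.crashed { return Vote::No; } // a crashed node never votes YES
        self.staged.insert(txn, writes);
        Vote::Yes
    }

    fn commit(&mut self, txn: u64) {
        // Merge staged writes into committed state.
        if let Some(writes) = self.staged.remove(&txn) {
            for (k, v) in writes {
                self.committed.insert(k, v);
            }
        }
    }

    fn abort(&mut self, txn: u64) {
        // Discard staged writes; committed state is untouched.
        self.staged.remove(&txn);
    }
}

fn main() {
    let mut p = VirtualParticipant::default();
    assert_eq!(p.prepare(1, vec![("a".into(), "1".into())]), Vote::Yes);
    p.commit(1);
    assert_eq!(p.committed.get("a").map(|s| s.as_str()), Some("1"));
}
```

Keeping state entirely in memory is what allows 1000 of these to fit in ~100MB.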
50+ Test Scenarios
Category 1: Node Crashes (6 scenarios)
| Scenario | Description | Expected Outcome |
|---|---|---|
| 1.1 Single Participant Crash | One participant crashes during PREPARE | Transaction ABORTS cleanly |
| 1.2 Coordinator Crash Before Prepare | Coordinator crashes before sending PREPARE | No participant affected |
| 1.3 Coordinator Crash After Prepare | Coordinator crashes after PREPARE, before COMMIT | Recovery completes COMMIT |
| 1.4 Coordinator Crash During Commit | Coordinator crashes mid-COMMIT | Recovery completes COMMIT |
| 1.5 Multiple Participant Crashes | 40% of participants crash | Transaction ABORTS (no quorum) |
| 1.6 Cascading Crashes | Sequential crashes during phases | Clean abort, no data loss |
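Scenarios 1.3 and 1.4 rest on one invariant: the coordinator persists its decision to the WAL before announcing it, so recovery can re-drive COMMIT after a crash. A hedged sketch with illustrative types, not the real coordinator API:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Decision { Commit, Abort }

struct Coordinator {
    wal: Vec<(u64, Decision)>, // persisted decisions survive a crash
}

impl Coordinator {
    fn decide(&mut self, txn: u64, all_voted_yes: bool) -> Decision {
        let d = if all_voted_yes { Decision::Commit } else { Decision::Abort };
        self.wal.push((txn, d)); // persist BEFORE telling participants
        d
    }

    // After a restart, recovery replays every persisted decision.
    fn recover(&self) -> Vec<(u64, Decision)> {
        self.wal.clone()
    }
}

fn main() {
    let mut c = Coordinator { wal: Vec::new() };
    c.decide(7, true); // all participants voted YES
    // -- simulated crash here: in-memory state lost, WAL retained --
    assert_eq!(c.recover(), vec![(7, Decision::Commit)]);
}
```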
Category 2: Network Partitions (8 scenarios)
| Scenario | Description | Expected Outcome |
|---|---|---|
| 2.1 Minority Partition | 20% of nodes isolated | Transaction ABORTS |
| 2.2 Split-Brain | Exactly 50/50 split | No quorum, both abort |
| 2.3 Asymmetric Partition | Coordinator reachable, participants isolated | Depends on implementation |
| 2.4 Partition During Prepare | Partition mid-voting | ABORT, timeout protection |
| 2.5 Partition During Commit | Partition mid-COMMIT | Recovery completes COMMIT |
| 2.6 Partition Heal During Recovery | Network heals during recovery | Recovery completes successfully |
| 2.7 Multiple Overlapping Partitions | Multiple partitions simultaneously | ABORT, no partial commits |
| 2.8 Flapping Network | Repeated connect/disconnect | Eventually commits or aborts |
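Most of the partition rows above reduce to one rule of classic 2PC: a transaction commits only if every participant's YES vote is received; any unreachable node forces an ABORT. A minimal sketch of that decision rule (names are illustrative):

```rust
// Classic 2PC: COMMIT requires a YES vote from every participant.
fn decide(total_participants: usize, reachable_yes_votes: usize) -> &'static str {
    if reachable_yes_votes == total_participants { "COMMIT" } else { "ABORT" }
}

fn main() {
    // Scenario 2.1: 20% of 100 nodes isolated by a partition -> ABORT.
    assert_eq!(decide(100, 80), "ABORT");
    // Scenario 2.2: 50/50 split -> neither side has all votes -> ABORT.
    assert_eq!(decide(100, 50), "ABORT");
    // Healed network, all votes arrive -> COMMIT.
    assert_eq!(decide(100, 100), "COMMIT");
}
```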
Category 3: Timing Anomalies (10 scenarios)
| Scenario | Description | Expected Outcome |
|---|---|---|
| 3.1 Slow Participant Prepare (1s) | One participant delays vote | Commits if timeout > 1s |
| 3.2 Slow Participant Commit (2s) | One participant slow to commit | Commits eventually |
| 3.3 Timeout During Prepare | Coordinator timeout before all votes | Transaction ABORTS |
| 3.4 Timeout During Commit | Participant doesn't ACK commit | Retries until success |
| 3.5 Variable Latency (50-500ms) | Random latency per participant | Commits, latency = max(latencies) |
| 3.6 Packet Loss 5% | 5% packet drop rate | Commits with retries |
| 3.7 Packet Loss 20% | 20% packet drop rate | Commits or aborts within timeout |
| 3.8 High Latency (200ms) | Cross-region simulation | Commits, high latency expected |
| 3.9 Bandwidth Constraint (1 Mbps) | Low bandwidth | Commits, high latency |
| 3.10 Clock Skew (10s) | Participant clocks differ | No impact (logical timestamps) |
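Scenarios 3.1-3.4 all hinge on how the slowest vote compares to the coordinator's prepare timeout. The rule can be checked deterministically, with no real clock; this is a sketch, and the names are assumptions:

```rust
use std::time::Duration;

// The prepare phase succeeds only if every vote lands before the deadline,
// so the effective latency is the max of all participant delays.
fn prepare_outcome(vote_delays: &[Duration], timeout: Duration) -> &'static str {
    let slowest = vote_delays.iter().max().copied().unwrap_or_default();
    if slowest <= timeout { "COMMIT" } else { "ABORT" }
}

fn main() {
    let timeout = Duration::from_millis(1500);
    // 3.1: one participant delays its vote by 1s; commits since timeout > 1s.
    let delays = [Duration::from_millis(10), Duration::from_secs(1)];
    assert_eq!(prepare_outcome(&delays, timeout), "COMMIT");
    // 3.3: a 2s straggler misses the deadline; the transaction aborts.
    assert_eq!(prepare_outcome(&[Duration::from_secs(2)], timeout), "ABORT");
}
```

The same max-of-delays shape explains scenario 3.5's expected "latency = max(latencies)".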
Category 4: Coordinator Failures (8 scenarios)
| Scenario | Description | Expected Outcome |
|---|---|---|
| 4.1 Coordinator Failover | Primary fails, backup takes over | Backup completes transactions |
| 4.2 Restart with Recovery | Coordinator restarts, recovers from WAL | All prepared txns completed |
| 4.3 Out of Memory | OOM during transaction | Clean failure, no corruption |
| 4.4 Thread Exhaustion | Thread pool saturated | Queuing, no failures |
| 4.5 Disk Full | No disk space for WAL | Rejects new txns gracefully |
| 4.6 Multiple Crashes | Repeated crashes during recovery | Eventually recovers |
| 4.7 Byzantine Coordinator | Conflicting COMMIT/ABORT messages | Detection, alerting |
| 4.8 Stuck in Prepare | Coordinator hangs | Participant timeout protection |
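Scenario 4.2's expected outcome follows standard coordinator recovery: scan the WAL on restart, re-drive COMMIT for transactions with a logged decision, and abort prepared-but-undecided ones (presumed abort). An illustrative sketch with assumed record types, not the real WAL format:

```rust
#[derive(Debug, PartialEq)]
enum WalRecord { Prepared(u64), Committed(u64) }

// Returns (txns to re-drive COMMIT for, txns to send ABORT to).
fn recover(wal: &[WalRecord]) -> (Vec<u64>, Vec<u64>) {
    let committed: Vec<u64> = wal.iter().filter_map(|r| match r {
        WalRecord::Committed(t) => Some(*t),
        _ => None,
    }).collect();
    let to_abort: Vec<u64> = wal.iter().filter_map(|r| match r {
        // Prepared but no persisted decision: presumed abort.
        WalRecord::Prepared(t) if !committed.contains(t) => Some(*t),
        _ => None,
    }).collect();
    (committed, to_abort)
}

fn main() {
    use WalRecord::*;
    let wal = [Prepared(1), Committed(1), Prepared(2)];
    let (redo, abort) = recover(&wal);
    assert_eq!(redo, vec![1]);  // decision persisted: complete the COMMIT
    assert_eq!(abort, vec![2]); // in-doubt: abort
}
```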
Category 5: Lock Timeouts (6 scenarios)
| Scenario | Description | Expected Outcome |
|---|---|---|
| 5.1 Single Key Timeout | Lock timeout on one key | Abort after timeout |
| 5.2 Multiple Key Timeout | Timeout on multiple keys | Clean abort, release all locks |
| 5.3 Deadlock Detection (2 txns) | Classic 2-way deadlock | Abort youngest, other proceeds |
| 5.4 Deadlock Resolution (3+ txns) | 3-way deadlock cycle | Abort one, others proceed |
| 5.5 Livelock Prevention | Repeated abort-retry loop | Exponential backoff, eventual progress |
| 5.6 Long-Running Transactions | Transaction runs 60s | Configurable timeout, no premature abort |
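Scenarios 5.3 and 5.4 assume wait-for-graph deadlock detection with an abort-youngest victim policy. A small illustrative sketch (assuming each transaction waits on at most one other, and that higher transaction ids are younger; the real lock manager will differ):

```rust
use std::collections::{HashMap, HashSet};

// Walk the wait-for graph from each node; revisiting a node means a cycle.
fn find_cycle(waits_for: &HashMap<u64, u64>) -> Option<Vec<u64>> {
    for &start in waits_for.keys() {
        let mut seen = HashSet::new();
        let mut cur = start;
        while let Some(&next) = waits_for.get(&cur) {
            if !seen.insert(cur) {
                return Some(seen.into_iter().collect());
            }
            cur = next;
        }
    }
    None
}

// "Abort youngest": here, the transaction with the highest id.
fn victim(cycle: &[u64]) -> u64 {
    *cycle.iter().max().unwrap()
}

fn main() {
    // T1 waits on T2, T2 waits on T1: the classic 2-way deadlock (5.3).
    let mut g = HashMap::new();
    g.insert(1u64, 2u64);
    g.insert(2u64, 1u64);
    let cycle = find_cycle(&g).expect("deadlock expected");
    assert_eq!(victim(&cycle), 2); // youngest aborted, T1 proceeds
}
```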
Category 6: Combined Failures (6 scenarios)
| Scenario | Description | Expected Outcome |
|---|---|---|
| 6.1 Partition + Crash | Network partition + node crash | Clean handling, no data loss |
| 6.2 Slow Network + Timeout | High latency triggers timeout | Correct timeout behavior |
| 6.3 Cascading Failures | One failure triggers others | Self-stabilization |
| 6.4 Recovery During New Failures | New failures during recovery | Robust recovery |
| 6.5 Multiple Failure Modes | All failure types simultaneously | ABORT, zero corruption |
| 6.6 Worst-Case Chaos | 100 nodes, 100 txns, random failures | FINAL BOSS TEST |
Success Criteria
Correctness (100% Required)
- Atomicity: All-or-nothing commits (0 violations)
- Consistency: Invariants maintained (e.g., balance unchanged)
- Isolation: No dirty reads (0 violations)
- Durability: All committed transactions survive crash
- Data Integrity: Zero data corruption across all scenarios
Performance (<5% Degradation)
- Prepare Latency (p99): <100ms under failure injection (baseline: 8ms)
- Commit Latency (p99): <100ms under failure injection (baseline: 10ms)
- Recovery Time (p99): <10s across all scenarios
- Throughput: >10,000 distributed TXN/sec
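The p50/p99/p999 targets above imply the metrics collector computes percentiles over recorded latencies. A generic nearest-rank sketch (not the actual collector):

```rust
// Nearest-rank percentile: sort, then take the ceil(p/100 * n)-th sample.
fn percentile(latencies_ms: &mut Vec<u64>, p: f64) -> u64 {
    latencies_ms.sort_unstable();
    let n = latencies_ms.len();
    let rank = ((p / 100.0) * n as f64).ceil() as usize;
    latencies_ms[rank.saturating_sub(1).min(n - 1)]
}

fn main() {
    let mut samples: Vec<u64> = (1..=100).collect(); // 1..100 ms
    assert_eq!(percentile(&mut samples, 50.0), 50);
    assert_eq!(percentile(&mut samples, 99.0), 99);
    assert_eq!(percentile(&mut samples, 99.9), 100); // p999 rounds up to the max
}
```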
Scale
- Participants: 1000+ nodes in single test
- Concurrent Transactions: 100+ simultaneous
- Test Duration: 50+ scenarios in <48 hours
- Resource Usage: <100MB per 1000 virtual participants
Implementation Timeline
Week 8-11: Development ($200K)
Week 8: Foundation
- Day 1-2: Module structure, build setup
- Day 3-5: Virtual participant implementation
- Day 6-7: Basic test harness
Week 9: Failure Injection
- Day 1-3: Failure injector
- Day 4-5: Network simulator
- Day 6-7: Integration
Week 10: Validation
- Day 1-3: Correctness validator
- Day 4-5: Metrics collector
- Day 6-7: WAL recovery testing
Week 11: Scenarios
- Day 1-5: Implement all 50+ scenarios
- Day 6-7: Documentation, CI/CD integration
Deliverables (End of Week 11):
- 7 modules, 3,100 LOC
- All 50+ scenarios defined
- CI/CD integrated
- Developer docs complete
Week 12-15: Basic Testing ($0 - automated)
Categories:
- Node Crashes (6 scenarios)
- Network Partitions (8 scenarios)
- Timing Anomalies (10 scenarios)
Activities:
- Automated CI/CD runs
- Bug filing and fixes
- Iterative improvements
Week 16-18: Advanced Testing ($0 - automated)
Categories:
- Coordinator Failures (8 scenarios)
- Lock Timeouts (6 scenarios)
- Combined Failures (6 scenarios)
Activities:
- Scale testing (100, 500, 1000 nodes)
- Performance regression testing
- Chaos engineering
Week 19: Final Validation ($0 - automated)
Activities:
- Full suite execution (50+ scenarios × 3 reps)
- Aggregate results
- Production certification report
- Sign-off
Expected Results
After Week 11 (Development Complete)
- Test framework ready
- All scenarios implemented
- CI/CD integrated
- Waiting for test execution

After Week 15 (Basic Testing)
- 24/24 basic scenarios PASS
- Node crashes: 100% PASS
- Network partitions: 100% PASS
- Timing anomalies: 100% PASS
- 2-3 bugs found and fixed

After Week 18 (Advanced Testing)
- 44/50 scenarios PASS
- Coordinator failures: 95% PASS
- Lock timeouts: 100% PASS
- Combined failures: 90% PASS
- Scale testing reveals 1-2 edge cases

After Week 19 (Final Validation)
- 50/50 scenarios PASS (100%)
- Zero data corruption
- Performance: <5% degradation
- Recovery: <10s (all scenarios)
- PRODUCTION CERTIFIED

Common Commands
Run Tests
```bash
# All tests
cargo test --release

# Specific category
cargo test --release node_crashes_
cargo test --release network_partitions_
cargo test --release timing_anomalies_
cargo test --release coordinator_failures_
cargo test --release lock_timeouts_
cargo test --release combined_failures_

# Single scenario
cargo test --release scenario_coordinator_crash_after_prepare

# With verbose output
RUST_LOG=debug cargo test --release -- --nocapture

# Single-threaded (for deterministic failures)
cargo test --release -- --test-threads=1
```

CLI Usage

```bash
# Run all scenarios
./2pc-test run-all --participants 100 --output-dir results/

# Run category
./2pc-test run-category "Node Crashes" --participants 100

# Run single scenario
./2pc-test run-scenario "Coordinator Crash After Prepare" --verbose

# List scenarios
./2pc-test list
./2pc-test list --category "Network Partitions"

# Generate random scenario
./2pc-test random --participants 500 --transactions 50
```

Generate Reports

```bash
# Text report
cargo test --release > test-results.txt

# JSON report
cargo test --release --format json > test-results.json

# HTML coverage report
cargo tarpaulin --out Html --output-dir target/coverage
```

Troubleshooting
Test Failures
Issue: Test fails with "Atomicity violation"
Solution: Check the coordinator's WAL recovery logic; ensure the COMMIT decision is persisted

Issue: Timeout in network partition scenario
Solution: Increase the timeout in the NetworkConditions config

Issue: "Cannot allocate memory" during 1000-node test
Solution: Reduce num_participants or increase VM memory
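For reference, the NetworkConditions config mentioned above might look like the following. Every field name here is an assumption for illustration, not the real struct:

```rust
// Hypothetical shape of the failure injector's network config.
#[derive(Debug, Clone)]
struct NetworkConditions {
    latency_ms: u64,         // added per-message latency
    packet_loss: f64,        // drop probability in [0.0, 1.0]
    prepare_timeout_ms: u64, // raise this for partition scenarios
}

fn main() {
    // Widen the prepare timeout well past the injected latency when
    // simulating a flaky partition, so timeouts reflect real stalls.
    let cfg = NetworkConditions { latency_ms: 200, packet_loss: 0.05, prepare_timeout_ms: 5_000 };
    assert!(cfg.prepare_timeout_ms > 2 * cfg.latency_ms);
    assert!(cfg.packet_loss < 1.0);
}
```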
Performance Issues
Issue: Tests run very slowly
Solution: Use --release flag, ensure no debug logging
Issue: High CPU usage
Solution: Reduce concurrent transactions with --test-threads=1
Related Documentation
- Main Architecture: 2PC_TESTING_INFRASTRUCTURE_ARCHITECTURE.md
- Rust Templates: 2PC_TESTING_RUST_TEMPLATES.md
- 2PC User Guide: 21_distributed_transactions_2pc.md
- Phase 1 Roadmap: PHASE1_FOUNDATION_8WEEK_ROADMAP.md
Support
Questions? Check the main architecture document for detailed specifications.
Found a bug? File an issue with:
- Scenario name
- Test output
- Expected vs. actual result
Version: 1.0 | Status: Ready for Week 8 Implementation | Next Review: End of Week 11 (Development Complete)