Phase 2 Reliability Testing - Quick Reference Guide
Phase 2 Reliability Testing - Quick Reference Guide
Last Updated: November 9, 2025
Quick Command Reference
Run All Critical Tests
cd /home/claude/HeliosDBcargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignoredRun Individual Tests
Backup/Restore Tests
# TC-BR-001: Incremental Backup Chain (100+ backups)cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_incremental_backup_chain_validation --nocapture
# TC-BR-002: Point-In-Time Recovery (sub-second accuracy)cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_point_in_time_recovery_accuracy --nocapture
# TC-BR-003: Cross-Region Replication (3 regions)cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_cross_region_backup_replication --nocapture
# TC-BR-004: Corruption Detection (100% accuracy)cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_backup_corruption_detection --nocapture
# TC-BR-005: Concurrent Backups (10 simultaneous)cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_concurrent_backup_operations --nocaptureFailover Tests
# TC-AF-001: Leader Election (network partitions)cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_leader_election_under_partition --nocapture
# TC-AF-002: Failover Timing (RTO <60s)cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_failover_orchestration_timing --nocapture
# TC-AF-003: Cascading Failures (7-node cluster)cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_cascading_failure_handling --nocapture
# TC-AF-004: Split-Brain Prevention (all partitions)cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_split_brain_prevention --nocapture
# TC-AF-005: Failback & Reintegration (<5 min)cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_failback_and_reintegration --nocaptureTest Coverage Summary
Implemented Tests (10/1000+)
| ID | Test Name | Category | Status |
|---|---|---|---|
| TC-BR-001 | Incremental Backup Chain | Backup | Implemented |
| TC-BR-002 | Point-In-Time Recovery | Backup | Implemented |
| TC-BR-003 | Cross-Region Replication | Backup | Implemented |
| TC-BR-004 | Corruption Detection | Integrity | Implemented |
| TC-BR-005 | Concurrent Backups | Performance | Implemented |
| TC-AF-001 | Leader Election | Failover | Implemented |
| TC-AF-002 | Failover Timing | Failover | Implemented |
| TC-AF-003 | Cascading Failures | Chaos | Implemented |
| TC-AF-004 | Split-Brain Prevention | Chaos | Implemented |
| TC-AF-005 | Failback & Reintegration | Failover | Implemented |
Planned Tests (990+)
| Category | Unit | Integration | Chaos | Perf | Total |
|---|---|---|---|---|---|
| Backup/Restore | 200 | 150 | 50 | 30 | 430 |
| Schema Migration | 150 | 100 | 40 | 25 | 315 |
| Failover | 100 | 80 | 60 | 30 | 270 |
| Data Integrity | 120 | 90 | 50 | 35 | 295 |
| TOTAL | 570 | 420 | 200 | 120 | 1310 |
Performance SLAs
Backup/Restore
| Metric | SLA | Test | Status |
|---|---|---|---|
| Full backup speed | >5 GB/sec | TC-BR-005 | |
| Incremental speed | >10 GB/sec | TC-BR-001 | |
| Incremental overhead | <5% | TC-BR-001 | |
| PITR RTO | <60s | TC-BR-002 | |
| PITR RPO | 0s | TC-BR-002 | |
| Cross-region lag | <5 min | TC-BR-003 | |
| Verification speed | >10 GB/sec | TC-BR-004 | |
| Repair speed | >1 GB/sec | TC-BR-004 |
Failover
| Metric | SLA | Test | Status |
|---|---|---|---|
| Failure detection | <10s | TC-AF-002 | |
| Leader election | <5s | TC-AF-001 | |
| State transfer | <30s | TC-AF-002 | |
| Client redirect | <10s | TC-AF-002 | |
| Total RTO | <60s | TC-AF-002 | |
| RPO | 0s | TC-AF-002 | |
| Cascade recovery | <10s/node | TC-AF-003 | |
| Split-brain detect | <5s | TC-AF-004 |
Test Categories
P0 (Critical) - Block Production Release
- All backup/restore core functionality
- All failover core functionality
- All data integrity core functionality
- Zero data loss guarantees
- RTO/RPO SLA compliance
Tests: TC-BR-001, TC-BR-002, TC-AF-001, TC-AF-002, TC-AF-003, TC-AF-004
P1 (High) - Block Beta Release
- Performance benchmarks
- Chaos engineering scenarios
- Edge cases and corner cases
- Resource efficiency
Tests: TC-BR-003, TC-BR-004, TC-BR-005, TC-AF-005
P2 (Medium) - Nice to Have
- Extended soak tests
- Extreme scale tests
- Documentation tests
- UI/UX tests
Tests: (Planned in full test suite)
Interpreting Test Results
Success Indicators
=== TC-BR-001: PASSED ===✓ All WAL positions monotonically increase✓ All 100 backups validated✓ PASS: Meets SLA requirementFailure Indicators
✗ FAIL: Exceeds SLA requirementassertion failed: recovery_duration < 60sthread 'test_name' panicked at ...Performance Metrics
Step 6: Performance SLA validation... - Average backup time: 85ms - SLA requirement: < 1s ✓ PASS: Meets SLA requirementTroubleshooting
Test Hangs or Timeouts
Issue: Test runs indefinitely without completing
Solutions:
# Check for deadlocks in logscargo test ... --nocapture 2>&1 | grep -i deadlock
# Run with timeouttimeout 300 cargo test ...
# Check for resource exhaustionhtop # Look for 100% CPU or memory usageTest Failures
Issue: Test fails with assertion error
Solutions:
- Check error message for specific failure
- Review test logs with
--nocapture - Verify test data setup
- Check for race conditions (run multiple times)
- Validate environment (disk space, permissions)
Compilation Errors
Issue: Test code doesn’t compile
Solutions:
# Check dependenciescargo check --package heliosdb-ha-dr
# Update dependenciescargo update
# Clean and rebuildcargo clean && cargo build --package heliosdb-ha-drPerformance SLA Violations
Issue: Test passes but exceeds performance SLA
Solutions:
- Check system load (other processes)
- Run on dedicated test hardware
- Warm up caches (run test twice)
- Profile performance bottlenecks
- Compare to baseline metrics
CI/CD Integration
GitHub Actions Example
name: Phase 2 Reliability Tests
on: push: branches: [ main, develop ] pull_request: branches: [ main ]
jobs: critical-tests: runs-on: ubuntu-latest timeout-minutes: 30
steps: - uses: actions/checkout@v3
- name: Setup Rust uses: actions-rust-lang/setup-rust-toolchain@v1
- name: Run Critical Tests run: | cargo test --package heliosdb-ha-dr \ --test phase2_reliability_tests \ --ignored \ -- --test-threads=4 --nocapture
- name: Check Performance SLAs run: | # Parse test output for SLA violations # Fail build if any SLA exceeded
- name: Upload Test Results uses: actions/upload-artifact@v3 with: name: test-results path: target/test-results/Quality Gates
# Required checks before mergerequired-checks: - All critical (P0) tests passing - No performance regressions >5% - Code coverage >95% - No critical security issues
# Required checks before releaserelease-checks: - All critical + high (P0, P1) tests passing - All performance SLAs met - Chaos tests passing >95% - Load tests completed - Soak tests (72h) completedTest Data & Fixtures
Backup Test Data
- Full backup size: ~10KB (simulated)
- Incremental backup size: ~2KB (simulated)
- WAL segment size: ~512B (simulated)
- Scale factor: 1/1000 of production
Failover Test Data
- Cluster sizes: 3, 5, 7 nodes
- Workload TPS: 5,000-10,000 (simulated)
- Network latency: 1-100ms (simulated)
- Failure injection: Controlled chaos
Performance Baselines
- Backup throughput: >5 GB/sec
- Failover time: <60s RTO, 0s RPO
- Recovery time: <60s for 24h WAL
- Corruption detection: 100% accuracy
Metrics Collection
Test Execution Metrics
// Automatically collected during test runs- Test duration (per test)- Resource usage (CPU, memory, disk I/O)- Network throughput- Operation counts- Error ratesPerformance Metrics
// Collected via PerformanceMetrics utilitylet mut metrics = PerformanceMetrics::new("backup_operation");
for sample in samples { let start = Instant::now(); // ... operation ... metrics.record_sample(start.elapsed());}
metrics.report(); // Prints p50, p95, p99, avg, min, maxChaos Metrics
// Failure injection tracking- Failures injected: 10- Failures handled: 10- Recovery time: avg 5s, max 12s- Data loss: 0 bytesMaintenance
Weekly Tasks
- Run full critical test suite (10 tests)
- Review performance trends
- Update baselines if needed
- Check for flaky tests
Monthly Tasks
- Run extended test suite (100+ tests)
- Chaos engineering exercises
- Load/soak testing
- Performance profiling
Quarterly Tasks
- Review and update SLAs
- Add new test cases
- Refactor test infrastructure
- Update documentation
Before Release
- Run ALL tests (1000+ when complete)
- Jepsen-style correctness tests
- Multi-region deployment tests
- Customer acceptance testing
Support & Resources
Documentation
- Test Plan:
/home/claude/HeliosDB/docs/PHASE2_RELIABILITY_TEST_PLAN.md - Summary:
/home/claude/HeliosDB/docs/TESTER_DELIVERABLE_SUMMARY.md - Quick Ref:
/home/claude/HeliosDB/docs/PHASE2_TESTING_QUICK_REFERENCE.md(this file)
Test Code
- Implementation:
/home/claude/HeliosDB/heliosdb-ha-dr/tests/phase2_reliability_tests.rs - Utilities: Test utilities and helpers included in test file
External Resources
- Jepsen Testing: https://jepsen.io/
- Chaos Engineering: https://principlesofchaos.org/
- Rust Testing Guide: https://doc.rust-lang.org/book/ch11-00-testing.html
- TigerBeetle Testing: https://tigerbeetle.com/blog/three-hour-transaction-test/
FAQ
Q: Why are tests marked #[ignore]?
A: These are long-running integration tests that should not run on every cargo test. Run explicitly with --ignored flag.
Q: How long do the tests take?
A:
- Individual test: 1-5 seconds
- All critical tests (10): ~30 seconds
- Full suite (1000+): ~2-4 hours (when complete)
Q: Can I run tests in parallel?
A: Yes, but some tests may conflict due to resource usage. Use --test-threads=4 to limit parallelism.
Q: What if a test fails?
A:
- Re-run to check if flaky
- Check logs with
--nocapture - Verify environment (disk space, etc.)
- Review recent code changes
- File a bug report with full output
Q: How do I add new tests?
A:
- Add test function to
phase2_reliability_tests.rs - Follow naming convention:
test_<feature>_<scenario> - Add
#[tokio::test]and#[ignore]attributes - Use test utilities from
test_utilsmodule - Document test case in test plan
- Update this quick reference
Q: How do I measure performance?
A: Use PerformanceMetrics utility:
let mut metrics = PerformanceMetrics::new("operation_name");for _ in 0..100 { let start = Instant::now(); // ... your operation ... metrics.record_sample(start.elapsed());}metrics.report(); // Prints statisticsQ: How do I inject failures?
A: Use chaos injection utilities (to be implemented):
let chaos = ChaosInjector::new();chaos.inject(FailureMode::NodeCrash { node_id });chaos.inject(FailureMode::NetworkPartition { groups });Version History
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-11-09 | Initial release with 10 critical tests |
| - | - | Future: Add schema migration tests |
| - | - | Future: Add data integrity tests |
| - | - | Future: Add Jepsen framework |
For questions or issues, refer to main test plan documentation.