
Phase 2 Reliability Testing - Quick Reference Guide

Last Updated: November 9, 2025


Quick Command Reference

Run All Critical Tests

cd /home/claude/HeliosDB
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored

Run Individual Tests

Backup/Restore Tests

# TC-BR-001: Incremental Backup Chain (100+ backups)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_incremental_backup_chain_validation --nocapture
# TC-BR-002: Point-In-Time Recovery (sub-second accuracy)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_point_in_time_recovery_accuracy --nocapture
# TC-BR-003: Cross-Region Replication (3 regions)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_cross_region_backup_replication --nocapture
# TC-BR-004: Corruption Detection (100% accuracy)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_backup_corruption_detection --nocapture
# TC-BR-005: Concurrent Backups (10 simultaneous)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_concurrent_backup_operations --nocapture

Failover Tests

# TC-AF-001: Leader Election (network partitions)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_leader_election_under_partition --nocapture
# TC-AF-002: Failover Timing (RTO <60s)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_failover_orchestration_timing --nocapture
# TC-AF-003: Cascading Failures (7-node cluster)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_cascading_failure_handling --nocapture
# TC-AF-004: Split-Brain Prevention (all partitions)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_split_brain_prevention --nocapture
# TC-AF-005: Failback & Reintegration (<5 min)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_failback_and_reintegration --nocapture

Test Coverage Summary

Implemented Tests (10/1000+)

| ID | Test Name | Category | Status |
|----|-----------|----------|--------|
| TC-BR-001 | Incremental Backup Chain | Backup | Implemented |
| TC-BR-002 | Point-In-Time Recovery | Backup | Implemented |
| TC-BR-003 | Cross-Region Replication | Backup | Implemented |
| TC-BR-004 | Corruption Detection | Integrity | Implemented |
| TC-BR-005 | Concurrent Backups | Performance | Implemented |
| TC-AF-001 | Leader Election | Failover | Implemented |
| TC-AF-002 | Failover Timing | Failover | Implemented |
| TC-AF-003 | Cascading Failures | Chaos | Implemented |
| TC-AF-004 | Split-Brain Prevention | Chaos | Implemented |
| TC-AF-005 | Failback & Reintegration | Failover | Implemented |

Planned Tests (990+)

| Category | Unit | Integration | Chaos | Perf | Total |
|----------|------|-------------|-------|------|-------|
| Backup/Restore | 200 | 150 | 50 | 30 | 430 |
| Schema Migration | 150 | 100 | 40 | 25 | 315 |
| Failover | 100 | 80 | 60 | 30 | 270 |
| Data Integrity | 120 | 90 | 50 | 35 | 295 |
| TOTAL | 570 | 420 | 200 | 120 | 1310 |

Performance SLAs

Backup/Restore

| Metric | SLA | Test | Status |
|--------|-----|------|--------|
| Full backup speed | >5 GB/sec | TC-BR-005 | |
| Incremental speed | >10 GB/sec | TC-BR-001 | |
| Incremental overhead | <5% | TC-BR-001 | |
| PITR RTO | <60s | TC-BR-002 | |
| PITR RPO | 0s | TC-BR-002 | |
| Cross-region lag | <5 min | TC-BR-003 | |
| Verification speed | >10 GB/sec | TC-BR-004 | |
| Repair speed | >1 GB/sec | TC-BR-004 | |

Failover

| Metric | SLA | Test | Status |
|--------|-----|------|--------|
| Failure detection | <10s | TC-AF-002 | |
| Leader election | <5s | TC-AF-001 | |
| State transfer | <30s | TC-AF-002 | |
| Client redirect | <10s | TC-AF-002 | |
| Total RTO | <60s | TC-AF-002 | |
| RPO | 0s | TC-AF-002 | |
| Cascade recovery | <10s/node | TC-AF-003 | |
| Split-brain detect | <5s | TC-AF-004 | |
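Each timing SLA above ultimately reduces to a duration assertion inside a test. A minimal sketch, with hypothetical measured values (real tests capture these with `Instant::now()`; the `meets_sla` helper is illustrative, not part of the actual test utilities):

```rust
use std::time::Duration;

// True when the measured duration is within the SLA bound.
fn meets_sla(measured: Duration, sla: Duration) -> bool {
    measured < sla
}

fn main() {
    // Hypothetical measured value for illustration only.
    let recovery_duration = Duration::from_secs(42);
    let rto_sla = Duration::from_secs(60); // Total RTO SLA from the table above

    // The test fails (panics) if the SLA is exceeded.
    assert!(
        meets_sla(recovery_duration, rto_sla),
        "FAIL: Exceeds SLA requirement"
    );
    println!("PASS: Meets SLA requirement ({:?} < {:?})", recovery_duration, rto_sla);
}
```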

Test Categories

P0 (Critical) - Block Production Release

  • All backup/restore core functionality
  • All failover core functionality
  • All data integrity core functionality
  • Zero data loss guarantees
  • RTO/RPO SLA compliance

Tests: TC-BR-001, TC-BR-002, TC-AF-001, TC-AF-002, TC-AF-003, TC-AF-004

P1 (High) - Block Beta Release

  • Performance benchmarks
  • Chaos engineering scenarios
  • Edge cases and corner cases
  • Resource efficiency

Tests: TC-BR-003, TC-BR-004, TC-BR-005, TC-AF-005

P2 (Medium) - Nice to Have

  • Extended soak tests
  • Extreme scale tests
  • Documentation tests
  • UI/UX tests

Tests: (Planned in full test suite)


Interpreting Test Results

Success Indicators

=== TC-BR-001: PASSED ===
✓ All WAL positions monotonically increase
✓ All 100 backups validated
✓ PASS: Meets SLA requirement
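The "monotonically increase" check behind the first success line can be sketched as a pairwise comparison over the recorded WAL positions (the function name and data source are illustrative, not taken from the test code):

```rust
// Returns true if every WAL position is strictly greater than its predecessor.
// In the real test, `positions` would come from the backup chain's metadata.
fn wal_positions_monotonic(positions: &[u64]) -> bool {
    positions.windows(2).all(|w| w[0] < w[1])
}

fn main() {
    // 100 simulated backups with strictly increasing WAL positions.
    let positions: Vec<u64> = (1..=100).map(|i| i * 512).collect();
    assert!(wal_positions_monotonic(&positions));
    println!("✓ All WAL positions monotonically increase");
}
```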

Failure Indicators

✗ FAIL: Exceeds SLA requirement
assertion failed: recovery_duration < 60s
thread 'test_name' panicked at ...

Performance Metrics

Step 6: Performance SLA validation...
- Average backup time: 85ms
- SLA requirement: < 1s
✓ PASS: Meets SLA requirement

Troubleshooting

Test Hangs or Timeouts

Issue: Test runs indefinitely without completing

Solutions:

# Check for deadlocks in logs
cargo test ... --nocapture 2>&1 | grep -i deadlock
# Run with timeout
timeout 300 cargo test ...
# Check for resource exhaustion
htop # Look for 100% CPU or memory usage

Test Failures

Issue: Test fails with assertion error

Solutions:

  1. Check error message for specific failure
  2. Review test logs with --nocapture
  3. Verify test data setup
  4. Check for race conditions (run multiple times)
  5. Validate environment (disk space, permissions)

Compilation Errors

Issue: Test code doesn't compile

Solutions:

# Check dependencies
cargo check --package heliosdb-ha-dr
# Update dependencies
cargo update
# Clean and rebuild
cargo clean && cargo build --package heliosdb-ha-dr

Performance SLA Violations

Issue: Test passes but exceeds performance SLA

Solutions:

  1. Check system load (other processes)
  2. Run on dedicated test hardware
  3. Warm up caches (run test twice)
  4. Profile performance bottlenecks
  5. Compare to baseline metrics

CI/CD Integration

GitHub Actions Example

name: Phase 2 Reliability Tests

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  critical-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v3
      - name: Setup Rust
        uses: actions-rust-lang/setup-rust-toolchain@v1
      - name: Run Critical Tests
        run: |
          cargo test --package heliosdb-ha-dr \
            --test phase2_reliability_tests \
            --ignored \
            -- --test-threads=4 --nocapture
      - name: Check Performance SLAs
        run: |
          # Parse test output for SLA violations
          # Fail build if any SLA exceeded
      - name: Upload Test Results
        uses: actions/upload-artifact@v3
        with:
          name: test-results
          path: target/test-results/

Quality Gates

# Required checks before merge
required-checks:
  - All critical (P0) tests passing
  - No performance regressions >5%
  - Code coverage >95%
  - No critical security issues

# Required checks before release
release-checks:
  - All critical + high (P0, P1) tests passing
  - All performance SLAs met
  - Chaos tests passing >95%
  - Load tests completed
  - Soak tests (72h) completed

Test Data & Fixtures

Backup Test Data

  • Full backup size: ~10KB (simulated)
  • Incremental backup size: ~2KB (simulated)
  • WAL segment size: ~512B (simulated)
  • Scale factor: 1/1000 of production

Failover Test Data

  • Cluster sizes: 3, 5, 7 nodes
  • Workload TPS: 5,000-10,000 (simulated)
  • Network latency: 1-100ms (simulated)
  • Failure injection: Controlled chaos

Performance Baselines

  • Backup throughput: >5 GB/sec
  • Failover time: <60s RTO, 0s RPO
  • Recovery time: <60s for 24h WAL
  • Corruption detection: 100% accuracy

Metrics Collection

Test Execution Metrics

// Automatically collected during test runs
- Test duration (per test)
- Resource usage (CPU, memory, disk I/O)
- Network throughput
- Operation counts
- Error rates

Performance Metrics

// Collected via PerformanceMetrics utility
let mut metrics = PerformanceMetrics::new("backup_operation");
for sample in samples {
    let start = Instant::now();
    // ... operation ...
    metrics.record_sample(start.elapsed());
}
metrics.report(); // Prints p50, p95, p99, avg, min, max
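The `PerformanceMetrics` utility itself lives in the test file; a minimal version consistent with the API shown above might look like the following sketch (the nearest-rank percentile method and field names are assumptions, not the actual implementation):

```rust
use std::time::Duration;

struct PerformanceMetrics {
    name: String,
    samples: Vec<Duration>,
}

impl PerformanceMetrics {
    fn new(name: &str) -> Self {
        Self { name: name.to_string(), samples: Vec::new() }
    }

    fn record_sample(&mut self, d: Duration) {
        self.samples.push(d);
    }

    // Nearest-rank percentile over an already-sorted sample slice.
    fn percentile(&self, sorted: &[Duration], p: f64) -> Duration {
        let idx = ((p / 100.0) * (sorted.len() as f64 - 1.0)).round() as usize;
        sorted[idx]
    }

    // Panics on an empty sample set; real code would guard against that.
    fn report(&self) {
        let mut sorted = self.samples.clone();
        sorted.sort();
        let avg = sorted.iter().sum::<Duration>() / sorted.len() as u32;
        println!(
            "{}: p50={:?} p95={:?} p99={:?} avg={:?} min={:?} max={:?}",
            self.name,
            self.percentile(&sorted, 50.0),
            self.percentile(&sorted, 95.0),
            self.percentile(&sorted, 99.0),
            avg,
            sorted[0],
            sorted[sorted.len() - 1],
        );
    }
}

fn main() {
    let mut metrics = PerformanceMetrics::new("backup_operation");
    for ms in [10u64, 20, 30, 40, 50] {
        metrics.record_sample(Duration::from_millis(ms));
    }
    metrics.report();
}
```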

Chaos Metrics

// Failure injection tracking
- Failures injected: 10
- Failures handled: 10
- Recovery time: avg 5s, max 12s
- Data loss: 0 bytes

Maintenance

Weekly Tasks

  • Run full critical test suite (10 tests)
  • Review performance trends
  • Update baselines if needed
  • Check for flaky tests

Monthly Tasks

  • Run extended test suite (100+ tests)
  • Chaos engineering exercises
  • Load/soak testing
  • Performance profiling

Quarterly Tasks

  • Review and update SLAs
  • Add new test cases
  • Refactor test infrastructure
  • Update documentation

Before Release

  • Run ALL tests (1000+ when complete)
  • Jepsen-style correctness tests
  • Multi-region deployment tests
  • Customer acceptance testing

Support & Resources

Documentation

  • Test Plan: /home/claude/HeliosDB/docs/PHASE2_RELIABILITY_TEST_PLAN.md
  • Summary: /home/claude/HeliosDB/docs/TESTER_DELIVERABLE_SUMMARY.md
  • Quick Ref: /home/claude/HeliosDB/docs/PHASE2_TESTING_QUICK_REFERENCE.md (this file)

Test Code

  • Implementation: /home/claude/HeliosDB/heliosdb-ha-dr/tests/phase2_reliability_tests.rs
  • Utilities: test utilities and helpers are included in the test file


FAQ

Q: Why are tests marked #[ignore]?

A: These are long-running integration tests that should not run on every cargo test. Run explicitly with --ignored flag.

Q: How long do the tests take?

A:

  • Individual test: 1-5 seconds
  • All critical tests (10): ~30 seconds
  • Full suite (1000+): ~2-4 hours (when complete)

Q: Can I run tests in parallel?

A: Yes, but some tests may conflict due to resource usage. Use --test-threads=4 to limit parallelism.

Q: What if a test fails?

A:

  1. Re-run to check if flaky
  2. Check logs with --nocapture
  3. Verify environment (disk space, etc.)
  4. Review recent code changes
  5. File a bug report with full output

Q: How do I add new tests?

A:

  1. Add test function to phase2_reliability_tests.rs
  2. Follow naming convention: test_<feature>_<scenario>
  3. Add #[tokio::test] and #[ignore] attributes
  4. Use test utilities from test_utils module
  5. Document test case in test plan
  6. Update this quick reference

Q: How do I measure performance?

A: Use PerformanceMetrics utility:

let mut metrics = PerformanceMetrics::new("operation_name");
for _ in 0..100 {
    let start = Instant::now();
    // ... your operation ...
    metrics.record_sample(start.elapsed());
}
metrics.report(); // Prints statistics

Q: How do I inject failures?

A: Use chaos injection utilities (to be implemented):

let chaos = ChaosInjector::new();
chaos.inject(FailureMode::NodeCrash { node_id });
chaos.inject(FailureMode::NetworkPartition { groups });
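Since the chaos utilities are not implemented yet, the API above is aspirational. One possible minimal shape, sketched here purely for illustration, would simply record injected failures so the chaos-metrics report can tally them:

```rust
// Failure modes the injector can simulate. Variants mirror the planned API
// shown above; the actual set is not yet implemented.
#[derive(Debug)]
enum FailureMode {
    NodeCrash { node_id: u32 },
    NetworkPartition { groups: Vec<Vec<u32>> },
}

#[derive(Default)]
struct ChaosInjector {
    injected: Vec<FailureMode>,
}

impl ChaosInjector {
    fn new() -> Self {
        Self::default()
    }

    // A real implementation would signal the cluster test harness;
    // this sketch only records the failure for later reporting.
    fn inject(&mut self, mode: FailureMode) {
        println!("injecting: {:?}", mode);
        self.injected.push(mode);
    }

    fn failures_injected(&self) -> usize {
        self.injected.len()
    }
}

fn main() {
    let mut chaos = ChaosInjector::new();
    chaos.inject(FailureMode::NodeCrash { node_id: 3 });
    chaos.inject(FailureMode::NetworkPartition { groups: vec![vec![1, 2], vec![3, 4, 5]] });
    println!("Failures injected: {}", chaos.failures_injected());
}
```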

Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2025-11-09 | Initial release with 10 critical tests |
| - | - | Future: Add schema migration tests |
| - | - | Future: Add data integrity tests |
| - | - | Future: Add Jepsen framework |

For questions or issues, refer to main test plan documentation.