Phase 2 Reliability Testing - Quick Reference Guide

Last Updated: November 9, 2025

Quick Command Reference

Run All Critical Tests

cd /home/claude/HeliosDB
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored

Run Individual Tests

Backup/Restore Tests

# TC-BR-001: Incremental Backup Chain (100+ backups)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_incremental_backup_chain_validation --nocapture

# TC-BR-002: Point-In-Time Recovery (sub-second accuracy)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_point_in_time_recovery_accuracy --nocapture

# TC-BR-003: Cross-Region Replication (3 regions)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_cross_region_backup_replication --nocapture

# TC-BR-004: Corruption Detection (100% accuracy)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_backup_corruption_detection --nocapture

# TC-BR-005: Concurrent Backups (10 simultaneous)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_concurrent_backup_operations --nocapture

Failover Tests

# TC-AF-001: Leader Election (network partitions)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_leader_election_under_partition --nocapture

# TC-AF-002: Failover Timing (RTO <60s)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_failover_orchestration_timing --nocapture

# TC-AF-003: Cascading Failures (7-node cluster)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_cascading_failure_handling --nocapture

# TC-AF-004: Split-Brain Prevention (all partitions)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_split_brain_prevention --nocapture

# TC-AF-005: Failback & Reintegration (<5 min)
cargo test --package heliosdb-ha-dr --test phase2_reliability_tests --ignored -- test_failback_and_reintegration --nocapture

Test Coverage Summary

Implemented Tests (10/1000+)

ID	Test Name	Category	Status
TC-BR-001	Incremental Backup Chain	Backup	Implemented
TC-BR-002	Point-In-Time Recovery	Backup	Implemented
TC-BR-003	Cross-Region Replication	Backup	Implemented
TC-BR-004	Corruption Detection	Integrity	Implemented
TC-BR-005	Concurrent Backups	Performance	Implemented
TC-AF-001	Leader Election	Failover	Implemented
TC-AF-002	Failover Timing	Failover	Implemented
TC-AF-003	Cascading Failures	Chaos	Implemented
TC-AF-004	Split-Brain Prevention	Chaos	Implemented
TC-AF-005	Failback & Reintegration	Failover	Implemented

Planned Tests (990+)

Category	Unit	Integration	Chaos	Perf	Total
Backup/Restore	200	150	50	30	430
Schema Migration	150	100	40	25	315
Failover	100	80	60	30	270
Data Integrity	120	90	50	35	295
TOTAL	570	420	200	120	1310

Performance SLAs

Backup/Restore

Metric	SLA	Test
Full backup speed	>5 GB/sec	TC-BR-005
Incremental speed	>10 GB/sec	TC-BR-001
Incremental overhead	<5%	TC-BR-001
PITR RTO	<60s	TC-BR-002
PITR RPO	0s	TC-BR-002
Cross-region lag	<5 min	TC-BR-003
Verification speed	>10 GB/sec	TC-BR-004
Repair speed	>1 GB/sec	TC-BR-004

Failover

Metric	SLA	Test
Failure detection	<10s	TC-AF-002
Leader election	<5s	TC-AF-001
State transfer	<30s	TC-AF-002
Client redirect	<10s	TC-AF-002
Total RTO	<60s	TC-AF-002
RPO	0s	TC-AF-002
Cascade recovery	<10s/node	TC-AF-003
Split-brain detect	<5s	TC-AF-004

Test Categories

P0 (Critical) - Block Production Release

All backup/restore core functionality
All failover core functionality
All data integrity core functionality
Zero data loss guarantees
RTO/RPO SLA compliance

Tests: TC-BR-001, TC-BR-002, TC-AF-001, TC-AF-002, TC-AF-003, TC-AF-004

P1 (High) - Block Beta Release

Performance benchmarks
Chaos engineering scenarios
Edge cases and corner cases
Resource efficiency

Tests: TC-BR-003, TC-BR-004, TC-BR-005, TC-AF-005

P2 (Medium) - Nice to Have

Extended soak tests
Extreme scale tests
Documentation tests
UI/UX tests

Tests: (Planned in full test suite)

Interpreting Test Results

Success Indicators

=== TC-BR-001: PASSED ===
✓ All WAL positions monotonically increase
✓ All 100 backups validated
✓ PASS: Meets SLA requirement

Failure Indicators

✗ FAIL: Exceeds SLA requirement
assertion failed: recovery_duration < 60s
thread 'test_name' panicked at ...

Performance Metrics

Step 6: Performance SLA validation...
  - Average backup time: 85ms
  - SLA requirement: < 1s
  ✓ PASS: Meets SLA requirement

Troubleshooting

Test Hangs or Timeouts

Issue: Test runs indefinitely without completing

Solutions:

# Check for deadlocks in logs
cargo test ... --nocapture 2>&1 | grep -i deadlock

# Run with timeout
timeout 300 cargo test ...

# Check for resource exhaustion
htop  # Look for 100% CPU or memory usage

Test Failures

Issue: Test fails with assertion error

Solutions:

Check error message for specific failure
Review test logs with --nocapture
Verify test data setup
Check for race conditions (run multiple times)
Validate environment (disk space, permissions)

Compilation Errors

Issue: Test code doesn’t compile

Solutions:

# Check dependencies
cargo check --package heliosdb-ha-dr

# Update dependencies
cargo update

# Clean and rebuild
cargo clean && cargo build --package heliosdb-ha-dr

Performance SLA Violations

Issue: Test passes but exceeds performance SLA

Solutions:

Check system load (other processes)
Run on dedicated test hardware
Warm up caches (run test twice)
Profile performance bottlenecks
Compare to baseline metrics

CI/CD Integration

GitHub Actions Example

name: Phase 2 Reliability Tests

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  critical-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 30

    steps:
      - uses: actions/checkout@v3

      - name: Setup Rust
        uses: actions-rust-lang/setup-rust-toolchain@v1

      - name: Run Critical Tests
        run: |
          cargo test --package heliosdb-ha-dr \
            --test phase2_reliability_tests \
            --ignored \
            -- --test-threads=4 --nocapture

      - name: Check Performance SLAs
        run: |
          # Parse test output for SLA violations
          # Fail build if any SLA exceeded

      - name: Upload Test Results
        uses: actions/upload-artifact@v3
        with:
          name: test-results
          path: target/test-results/

Quality Gates

# Required checks before merge
required-checks:
  - All critical (P0) tests passing
  - No performance regressions >5%
  - Code coverage >95%
  - No critical security issues

# Required checks before release
release-checks:
  - All critical + high (P0, P1) tests passing
  - All performance SLAs met
  - Chaos tests passing >95%
  - Load tests completed
  - Soak tests (72h) completed

Test Data & Fixtures

Backup Test Data

Full backup size: ~10KB (simulated)
Incremental backup size: ~2KB (simulated)
WAL segment size: ~512B (simulated)
Scale factor: 1/1000 of production

Failover Test Data

Cluster sizes: 3, 5, 7 nodes
Workload TPS: 5,000-10,000 (simulated)
Network latency: 1-100ms (simulated)
Failure injection: Controlled chaos

Performance Baselines

Backup throughput: >5 GB/sec
Failover time: <60s RTO, 0s RPO
Recovery time: <60s for 24h WAL
Corruption detection: 100% accuracy

Metrics Collection

Test Execution Metrics

// Automatically collected during test runs
- Test duration (per test)
- Resource usage (CPU, memory, disk I/O)
- Network throughput
- Operation counts
- Error rates

Performance Metrics

// Collected via PerformanceMetrics utility
let mut metrics = PerformanceMetrics::new("backup_operation");

for sample in samples {
    let start = Instant::now();
    // ... operation ...
    metrics.record_sample(start.elapsed());
}

metrics.report(); // Prints p50, p95, p99, avg, min, max

Chaos Metrics

// Failure injection tracking
- Failures injected: 10
- Failures handled: 10
- Recovery time: avg 5s, max 12s
- Data loss: 0 bytes

Maintenance

Weekly Tasks

Run full critical test suite (10 tests)
Review performance trends
Update baselines if needed
Check for flaky tests

Monthly Tasks

Run extended test suite (100+ tests)
Chaos engineering exercises
Load/soak testing
Performance profiling

Quarterly Tasks

Review and update SLAs
Add new test cases
Refactor test infrastructure
Update documentation

Before Release

Run ALL tests (1000+ when complete)
Jepsen-style correctness tests
Multi-region deployment tests
Customer acceptance testing

Support & Resources

Documentation

Test Plan: /home/claude/HeliosDB/docs/PHASE2_RELIABILITY_TEST_PLAN.md
Summary: /home/claude/HeliosDB/docs/TESTER_DELIVERABLE_SUMMARY.md
Quick Ref: /home/claude/HeliosDB/docs/PHASE2_TESTING_QUICK_REFERENCE.md (this file)

Test Code

Implementation: /home/claude/HeliosDB/heliosdb-ha-dr/tests/phase2_reliability_tests.rs
Utilities: Test utilities and helpers included in test file

External Resources

Jepsen Testing: https://jepsen.io/
Chaos Engineering: https://principlesofchaos.org/
Rust Testing Guide: https://doc.rust-lang.org/book/ch11-00-testing.html
TigerBeetle Testing: https://tigerbeetle.com/blog/three-hour-transaction-test/

FAQ

Q: Why are tests marked `#[ignore]`?

A: These are long-running integration tests that should not run on every cargo test. Run explicitly with --ignored flag.

Q: How long do the tests take?

Individual test: 1-5 seconds
All critical tests (10): ~30 seconds
Full suite (1000+): ~2-4 hours (when complete)

Q: Can I run tests in parallel?

A: Yes, but some tests may conflict due to resource usage. Use --test-threads=4 to limit parallelism.

Q: What if a test fails?

Re-run to check if flaky
Check logs with --nocapture
Verify environment (disk space, etc.)
Review recent code changes
File a bug report with full output

Q: How do I add new tests?

Add test function to phase2_reliability_tests.rs
Follow naming convention: test_<feature>_<scenario>
Add #[tokio::test] and #[ignore] attributes
Use test utilities from test_utils module
Document test case in test plan
Update this quick reference

Q: How do I measure performance?

A: Use PerformanceMetrics utility:

let mut metrics = PerformanceMetrics::new("operation_name");
for _ in 0..100 {
    let start = Instant::now();
    // ... your operation ...
    metrics.record_sample(start.elapsed());
}
metrics.report(); // Prints statistics

Q: How do I inject failures?

A: Use chaos injection utilities (to be implemented):

let chaos = ChaosInjector::new();
chaos.inject(FailureMode::NodeCrash { node_id });
chaos.inject(FailureMode::NetworkPartition { groups });

Version History

Version	Date	Changes
1.0	2025-11-09	Initial release with 10 critical tests
-	-	Future: Add schema migration tests
-	-	Future: Add data integrity tests
-	-	Future: Add Jepsen framework

For questions or issues, refer to main test plan documentation.

Phase 2 Reliability Testing - Quick Reference Guide

Phase 2 Reliability Testing - Quick Reference Guide

Quick Command Reference

Run All Critical Tests

Run Individual Tests

Backup/Restore Tests

Failover Tests

Test Coverage Summary

Implemented Tests (10/1000+)

Planned Tests (990+)

Performance SLAs

Backup/Restore

Failover

Test Categories

P0 (Critical) - Block Production Release

P1 (High) - Block Beta Release

P2 (Medium) - Nice to Have

Interpreting Test Results

Success Indicators

Failure Indicators

Performance Metrics

Troubleshooting

Test Hangs or Timeouts

Test Failures

Compilation Errors

Performance SLA Violations

CI/CD Integration

GitHub Actions Example

Quality Gates

Test Data & Fixtures

Backup Test Data

Failover Test Data

Performance Baselines

Metrics Collection

Test Execution Metrics

Performance Metrics

Chaos Metrics

Maintenance

Weekly Tasks

Monthly Tasks

Quarterly Tasks

Before Release

Support & Resources

Documentation

Test Code

External Resources

FAQ

Q: Why are tests marked #[ignore]?

Q: How long do the tests take?

Q: Can I run tests in parallel?

Q: What if a test fails?

Q: How do I add new tests?

Q: How do I measure performance?

Q: How do I inject failures?

Version History

Q: Why are tests marked `#[ignore]`?