Backup/Restore Architecture - Executive Summary


Agent: Analyst (Agent 3) | Date: 2025-11-09 | Status: Architecture Complete - Ready for Implementation


Overview

This document summarizes the advanced backup and restore architecture designed to meet the following requirements:

  1. Incremental Backups with <5% storage overhead
  2. Point-in-Time Recovery (PITR) in <15 minutes for 100GB databases
  3. Cross-Region Backup Replication with async multi-region support
  4. Automated Backup Verification with continuous testing

Current State Analysis

Existing Infrastructure

HeliosDB has four backup implementations with complementary strengths:

Component       | Location        | Strengths                              | Limitations
----------------|-----------------|----------------------------------------|-----------------------------
Basic Backup    | backup.rs       | Full/incremental, compression          | File-based, local only
Enhanced Backup | backup_v2.rs    | WAL-based, cloud support, LSN tracking | Placeholder implementations
HA/DR Backup    | ha-dr/backup.rs | Continuous backup, event-driven        | Simulated operations
PITR Module     | pitr.rs         | Complete workflow, cloud-native        | Simulated WAL replay

WAL Infrastructure

Strong foundation with:

  • WAL Manager: LSN-based tracking, checksum verification, buffered writes
  • CommitLog: Crash recovery, checkpointing, index operation support

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         Backup Orchestrator                         │
│    ┌──────────────┐     ┌──────────────┐     ┌──────────────┐       │
│    │  Scheduler   │     │ Coordinator  │     │   Monitor    │       │
│    └──────────────┘     └──────────────┘     └──────────────┘       │
└────────┬────────────────┬─────────────────┬────────────────┬───────┘
         │                │                 │                │
         v                v                 v                v
┌─────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ Incremental │   │     PITR     │   │ Cross-Region │   │ Verification │
│   Backup    │   │   Manager    │   │ Replication  │   │    Engine    │
└──────┬──────┘   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                 │                  │                  │
       v                 v                  v                  v
┌──────────────────────────────────────────────────────────────────────┐
│                       Storage Layer (WAL + LSM)                      │
└──────────────────────────────────────────────────────────────────────┘

Feature 1: Advanced Incremental Backups

Key Components

  1. Block Change Tracker: Bitmap-based tracking of 4KB blocks
  2. WAL Delta Extractor: Extracts changes from LSN range
  3. Content Deduplicator: SHA-256 based deduplication
  4. Parallel Block Processor: Multi-threaded compression
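The bitmap idea behind the Block Change Tracker can be sketched in a few lines of Rust. The design calls for the bit-vec crate; this self-contained sketch uses a plain `Vec<u64>` instead, and the type and method names are illustrative, not the real API:

```rust
/// One bit per 4 KB block; a set bit means the block changed since the
/// last full backup. Illustrative sketch, not the real tracker API.
pub struct BlockChangeTracker {
    bits: Vec<u64>,
    block_size: usize,
}

impl BlockChangeTracker {
    pub fn new(total_bytes: usize) -> Self {
        let blocks = (total_bytes + 4095) / 4096;
        Self { bits: vec![0u64; (blocks + 63) / 64], block_size: 4096 }
    }

    /// Mark every 4 KB block touched by a write of `len` bytes at `offset`.
    pub fn mark_write(&mut self, offset: usize, len: usize) {
        if len == 0 { return; }
        let first = offset / self.block_size;
        let last = (offset + len - 1) / self.block_size;
        for b in first..=last {
            self.bits[b / 64] |= 1u64 << (b % 64);
        }
    }

    /// Indices of all changed blocks, for the delta extractor to read.
    pub fn changed_blocks(&self) -> Vec<usize> {
        (0..self.bits.len() * 64)
            .filter(|&b| (self.bits[b / 64] >> (b % 64)) & 1 == 1)
            .collect()
    }
}
```

A 1 GB database needs only 32 KB of bitmap, which is why the tracker can be updated on the WAL write path at negligible cost.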

Performance Targets

  • Storage Overhead: <5% of full backup size
  • Delta Extraction: >1 GB/sec
  • Deduplication Ratio: >40%
  • Compression Ratio: 3:1 (ZSTD level 3)

Example: Incremental Backup Flow

1. WAL records written → Block Change Tracker marks changed blocks
2. On backup trigger → Delta Extractor reads LSN range
3. Changed blocks extracted → Deduplicator removes redundant content
4. Delta compressed → Encrypted → Uploaded to cloud
5. Metadata stored → Verification scheduled
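Step 3's deduplication can be sketched as a content-addressed chunk store: each chunk is hashed, and only unseen content is kept. The design specifies SHA-256; this sketch substitutes std's `DefaultHasher` so it runs without external crates, and all names are hypothetical:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Content-addressed store: each unique chunk is kept once, keyed by hash.
pub struct Deduplicator {
    store: HashMap<u64, Vec<u8>>,
}

impl Deduplicator {
    pub fn new() -> Self {
        Self { store: HashMap::new() }
    }

    /// Hash the chunk; store it only if unseen. Returns (key, was_duplicate).
    pub fn add_chunk(&mut self, chunk: &[u8]) -> (u64, bool) {
        let mut h = DefaultHasher::new();
        chunk.hash(&mut h);
        let key = h.finish();
        let dup = self.store.contains_key(&key);
        if !dup {
            self.store.insert(key, chunk.to_vec());
        }
        (key, dup)
    }

    /// Bytes actually stored after deduplication.
    pub fn unique_bytes(&self) -> usize {
        self.store.values().map(|c| c.len()).sum()
    }
}
```

Duplicate chunks contribute only a hash reference to the backup, which is where the >40% deduplication target comes from.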

Feature 2: Point-in-Time Recovery (PITR)

Recovery Workflow

Target Time: 2025-11-09 10:30:00 UTC

Phase 1: Locate Checkpoint (10 s)
  → Find checkpoint before 10:30:00
  → Checkpoint at 10:15:00 (LSN 1,234,567)

Phase 2: Restore Base Backup (3 min)
  → Download full backup from 10:00:00
  → Extract to restore path

Phase 3: Download WAL Segments (1 min)
  → Segments 1234567-1256789 (parallel prefetch)

Phase 4: Replay WAL (8 min)
  → Parallel replay by table partition
  → 125,000 entries/sec throughput

Phase 5: Validate Consistency (1 min)
  → Check table catalog
  → Verify foreign keys
  → Validate indexes

Total: <15 minutes for a 100 GB database
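Phase 1 amounts to a search over checkpoint records for the latest one at or before the recovery target. A minimal sketch, assuming checkpoints are (unix_secs, lsn) pairs sorted by time (names are illustrative, not the real API):

```rust
/// Find the latest checkpoint at or before `target_secs`.
/// `checkpoints` must be sorted ascending by timestamp.
pub fn locate_checkpoint(
    checkpoints: &[(u64, u64)],
    target_secs: u64,
) -> Option<(u64, u64)> {
    // partition_point gives the index of the first checkpoint *after*
    // the target; the one just before it is the recovery base.
    let idx = checkpoints.partition_point(|&(t, _)| t <= target_secs);
    if idx == 0 { None } else { Some(checkpoints[idx - 1]) }
}
```

Because this is a binary search over an ordered checkpoint index, Phase 1 stays at seconds even with years of retained checkpoints.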

Parallel Replay Optimization

  • Table Partitioning: Independent tables replayed in parallel
  • Prefetching: Next segments downloaded while replaying current
  • Batch Commits: Group entries before flushing to disk
  • Direct I/O: Aligned writes for SSD optimization
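The table-partitioning and batch-commit ideas can be sketched with std threads: group WAL entries by table, replay each partition on its own worker, and commit once per partition. The entry layout and apply step are stand-ins, not the real replay engine:

```rust
use std::collections::HashMap;
use std::thread;

/// Simplified WAL entry: owning table plus an opaque payload.
pub struct WalEntry {
    pub table_id: u32,
    pub payload: Vec<u8>,
}

/// Partition entries by table, then replay each partition on its own
/// thread. Returns the total number of entries replayed.
pub fn parallel_replay(entries: Vec<WalEntry>) -> usize {
    let mut parts: HashMap<u32, Vec<WalEntry>> = HashMap::new();
    for e in entries {
        parts.entry(e.table_id).or_default().push(e);
    }

    let handles: Vec<_> = parts
        .into_values()
        .map(|part| {
            thread::spawn(move || {
                // Stand-in for real replay: apply every entry in the
                // partition, then flush once (batch commit).
                part.len()
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

Independent tables have no ordering dependencies between them, so this partitioning preserves correctness while letting throughput scale with cores.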

Feature 3: Cross-Region Backup Replication

Replication Architecture

Primary Region (us-east-1)
├─ Chunk 1 (10 MB) ──────────→ Secondary Region 1 (eu-west-1)
├─ Chunk 2 (10 MB) ──────────→ Secondary Region 2 (ap-southeast-1)
├─ Chunk 3 (10 MB) ──────────→ Secondary Region 3 (us-west-2)
├─ Bandwidth Manager (rate limiting)
└─ Integrity Verifier (checksums)

Replication Modes

  1. Synchronous: Wait for all regions (highest durability)
  2. Asynchronous: Fire-and-forget (lowest latency)
  3. Quorum: Wait for N regions (balanced)
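Quorum mode can be sketched with channels: dispatch one upload per region and return as soon as N acknowledgements arrive, letting stragglers finish in the background. Region latencies are simulated here; this is not the real replication API:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Start one simulated upload per region and block only until `quorum`
/// regions have acknowledged. Returns the number of acks waited for.
pub fn replicate_quorum(latencies_ms: Vec<u64>, quorum: usize) -> usize {
    let (tx, rx) = mpsc::channel();
    for ms in latencies_ms {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(ms)); // simulated transfer
            let _ = tx.send(ms);
        });
    }
    drop(tx); // so rx ends even if quorum exceeds the region count

    let mut acks = 0;
    for _ in rx.iter() {
        acks += 1;
        if acks >= quorum {
            break;
        }
    }
    acks
}
```

With quorum = 2 of 3 regions, the caller's latency is the second-fastest region rather than the slowest, which is the "balanced" trade-off described above.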

Bandwidth Optimization

  • Token Bucket Rate Limiting: Prevent saturation
  • Chunk-based Transfer: 10 MB chunks with compression
  • Resumable Transfers: Continue from failure point
  • Sample-based Verification: 10-sample checksum for speed
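The token-bucket limiter can be sketched as follows: a bucket of `capacity` bytes of burst allowance refills at `rate` bytes/sec, and a transfer proceeds only if it can withdraw enough tokens. Parameters are illustrative:

```rust
use std::time::Instant;

/// Token bucket: `capacity` bytes of burst, refilled at `rate` bytes/sec.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64,
    last: Instant,
}

impl TokenBucket {
    pub fn new(capacity: f64, rate: f64) -> Self {
        Self { capacity, tokens: capacity, rate, last: Instant::now() }
    }

    /// Try to send `bytes` now; returns false if the bucket lacks tokens
    /// (the caller should back off and retry).
    pub fn try_consume(&mut self, bytes: f64) -> bool {
        // Refill based on elapsed time, capped at bucket capacity.
        let now = Instant::now();
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.tokens = (self.tokens + self.rate * elapsed).min(self.capacity);
        self.last = now;

        if self.tokens >= bytes {
            self.tokens -= bytes;
            true
        } else {
            false
        }
    }
}
```

Capping refill at `capacity` is what prevents a long idle period from turning into an unbounded burst that saturates the inter-region link.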

Feature 4: Backup Verification & Testing

Verification Types

  1. Checksum Only: Quick integrity check (30 sec)
  2. Metadata Restore: Verify structure (2 min)
  3. Full Restore: Complete restore in sandbox (10 min)
  4. Functional Test: Restore + execute queries (15 min)

Continuous Validation

Validation Loop (hourly):
1. Select backups based on distribution strategy
   - Oldest first
   - Random sampling
   - Priority-based (full backups prioritized)
2. Run verification tests concurrently
   - Max 5 concurrent validations
   - Isolated sandbox per test
3. Report results
   - Alert on failures
   - Store verification history
   - Update backup health score
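The priority-based selection strategy reduces to a two-key sort: full backups before incrementals, and oldest first within each group. A sketch with hypothetical types:

```rust
use std::cmp::Reverse;

/// Minimal view of a backup for selection purposes (hypothetical type).
#[derive(Clone, Debug)]
pub struct BackupRef {
    pub id: String,
    pub age_hours: u64,
    pub is_full: bool,
}

/// Priority-based strategy: full backups first, then oldest first.
/// Returns at most `max` backups to validate this cycle.
pub fn select_for_validation(mut backups: Vec<BackupRef>, max: usize) -> Vec<BackupRef> {
    // Sort key: fulls first (!is_full == false sorts before true),
    // then descending age so the longest-unverified backups win.
    backups.sort_by_key(|b| (!b.is_full, Reverse(b.age_hours)));
    backups.truncate(max);
    backups
}
```

Swapping the sort key for a random shuffle or a plain age sort yields the other two strategies, so the loop can treat strategy as a pluggable function.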

Consistency Checks

  • Table catalog completeness
  • Foreign key integrity
  • Index consistency
  • Row count validation
  • Custom SQL queries

Performance Benchmarks

Target Metrics

Metric                       | Target            | Status
-----------------------------|-------------------|--------------------
Incremental storage overhead | <5%               | Design complete
PITR recovery time (100GB)   | <15 min           | Architecture ready
Cross-region replication lag | <5 min            | Design complete
Backup verification coverage | 100%/week         | Framework designed
WAL replay rate              | >100k entries/sec | Parallel replay
Deduplication ratio          | >40%              | SHA-256 dedup

100GB PITR Recovery Breakdown

Target: 15 minutes total
├─ Checkpoint Location: 10 seconds
├─ Base Backup Download: 3 minutes (350 Mbps sustained)
├─ Base Backup Restore: 2 minutes (SSD I/O)
├─ WAL Segment Download: 1 minute (parallel prefetch)
├─ WAL Replay: 8 minutes (125,000 entries/sec)
└─ Consistency Validation: 1 minute
Total: ≈15 minutes (phases sum to 15 min 10 s)
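As an arithmetic check: the phase budget above sums to 910 seconds (15 min 10 s), and 8 minutes of replay at 125,000 entries/sec covers 60 million WAL entries:

```rust
/// Phase durations from the breakdown above, in seconds.
pub fn total_recovery_secs() -> u64 {
    10 + 3 * 60 + 2 * 60 + 60 + 8 * 60 + 60
}

/// WAL entries covered by 8 minutes of replay at 125,000 entries/sec.
pub fn replayed_entries() -> u64 {
    125_000 * 8 * 60
}
```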

Key Data Structures

Backup Metadata

pub struct BackupMetadata {
    pub backup_id: String,
    pub backup_type: BackupType,             // Full, Incremental, Differential
    pub lsn_range: LsnRange,
    pub size_info: SizeInfo,                 // Original, compressed, encrypted
    pub storage_info: StorageInfo,           // Primary + replicas
    pub verification_info: VerificationInfo,
    pub replication_info: ReplicationInfo,
}

Recovery Target

pub enum RecoveryTargetType {
    Time(DateTime<Utc>), // Specific timestamp
    Lsn(Lsn),            // Specific LSN
    Transaction(u64),    // Specific transaction
    Latest,              // Latest available
}

Integration Points

New Files (4 files):

  • heliosdb-storage/src/backup_advanced.rs - Advanced backup engine
  • heliosdb-storage/src/pitr_advanced.rs - Enhanced PITR
  • heliosdb-storage/src/replication_advanced.rs - Cross-region
  • heliosdb-storage/src/verification_engine.rs - Verification system

Modified Files (3 files):

  • heliosdb-storage/src/lib.rs - Export new modules
  • heliosdb-storage/src/wal.rs - Add block tracking hooks
  • heliosdb-storage/src/commitlog.rs - Add LSN tracking API

Dependencies

[dependencies]
bit-vec = "0.6" # Block bitmap tracking
sha2 = "0.10" # Content deduplication
rayon = "1.7" # Parallel processing
tokio = "1.35" # Async runtime
aws-sdk-s3 = "1.10" # Cloud storage

Risk Analysis

Top 5 Risks & Mitigations

  1. WAL Replay Data Loss

    • Risk: Medium probability, Critical impact
    • Mitigation: Double checksums (CRC32 + SHA256), replay validation
  2. Cross-Region Transfer Failure

    • Risk: Medium probability, High impact
    • Mitigation: Resumable transfers, exponential backoff retry
  3. Backup Corruption

    • Risk: Low probability, Critical impact
    • Mitigation: Continuous verification, redundant checksums
  4. PITR Performance Degradation

    • Risk: Medium probability, High impact
    • Mitigation: Parallel replay, prefetching, SSD optimization
  5. Incremental Chain Breaks

    • Risk: Medium probability, High impact
    • Mitigation: Automatic chain validation, fallback to full backup

Implementation Roadmap

12-Week Plan

Weeks 1-2: Incremental Backup Engine

  • Block change tracking
  • WAL delta extraction
  • Content deduplication
  • Parallel processing

Weeks 3-4: PITR Implementation

  • Recovery workflow
  • Parallel WAL replay
  • Checkpoint management
  • Consistency validation

Weeks 5-6: Cross-Region Replication

  • Multi-cloud support
  • Bandwidth management
  • Integrity verification
  • Resumable transfers

Weeks 7-8: Backup Verification

  • Verification engine
  • Sandbox testing
  • Consistency checks
  • Continuous validation

Weeks 9-10: Integration & Testing

  • End-to-end integration
  • Performance testing
  • Stress testing
  • Chaos engineering

Weeks 11-12: Production Hardening

  • Monitoring & observability
  • Documentation
  • Runbooks
  • Release

Success Criteria

Acceptance Tests

Incremental Backup Test

  • Create full backup (10 GB)
  • Simulate 500 MB of changes
  • Create incremental backup
  • Verify: incremental size <500 MB (≤5% of the 10 GB full backup)

PITR Recovery Test

  • Populate database to 100 GB
  • Record timestamp T1
  • Continue writes for 1 hour
  • Restore to T1
  • Verify: recovery time <15 minutes

Cross-Region Replication Test

  • Create backup in us-east-1
  • Replicate to 3 regions
  • Verify: all checksums match
  • Verify: replication lag <5 minutes

Verification Test

  • Create 100 backups
  • Run continuous validation for 7 days
  • Verify: >95% of backups tested
  • Verify: no false positives

API Example

use heliosdb_storage::backup_advanced::*;

// Create incremental backup
let orchestrator = BackupOrchestrator::new(config).await?;
let backup = orchestrator.create_incremental_backup(
    "base_backup_123",
    IncrementalBackupConfig {
        target_delta_size: 100_000_000, // 100 MB
        compression_level: 3,
        encryption_enabled: true,
        parallel_workers: 8,
        ..Default::default()
    },
).await?;

println!("Backup created: {}", backup.backup_id);
println!("Compression: {:.1}%",
    (1.0 - backup.size_info.compression_ratio) * 100.0);

// Point-in-time recovery
let result = orchestrator.restore_to_time(
    Utc::now() - chrono::Duration::hours(2),
    Path::new("/var/lib/heliosdb/restore"),
).await?;

println!("Recovery completed in {:.2}s",
    result.duration().as_secs_f64());

Conclusion

This architecture provides:

  1. Efficient Storage: <5% incremental overhead with deduplication
  2. Fast Recovery: <15 minute PITR for 100GB databases
  3. High Durability: Multi-region replication with 11 nines
  4. Automated Validation: Continuous testing ensures backup integrity

The design builds on HeliosDB’s existing WAL infrastructure while adding advanced capabilities for enterprise-grade backup and disaster recovery.

Status: Ready for Implementation | Next Steps: Coder Agent Implementation → Testing → Production Deployment


For detailed specifications, see: docs/architecture/BACKUP_RESTORE_ARCHITECTURE_PHASE2.md