Disaster Recovery Plan
Overview
This document defines the disaster recovery (DR) procedures for HeliosDB Nano, ensuring rapid recovery from catastrophic failures while minimizing data loss.
Recovery Objectives
| Metric | Target | Description |
|---|---|---|
| RTO | < 5 minutes | Time to restore service |
| RPO | < 1 minute | Maximum acceptable data loss |
| MTTR | < 15 minutes | Mean time to recovery |
Disaster Scenarios
Tier 1: Component Failure
| Scenario | RTO | RPO | Recovery Method |
|---|---|---|---|
| Single disk failure | 0 | 0 | RAID rebuild |
| Node failure | < 1 min | 0 | Automatic failover |
| Network partition | < 30 sec | 0 | Re-routing |
Tier 2: Service Disruption
| Scenario | RTO | RPO | Recovery Method |
|---|---|---|---|
| Database corruption | < 5 min | < 1 min | Branch restore |
| Ransomware | < 30 min | < 1 hour | Clean rebuild + backup |
| Accidental deletion | < 2 min | 0 | Time-travel recovery |
Tier 3: Site Failure
| Scenario | RTO | RPO | Recovery Method |
|---|---|---|---|
| Data center outage | < 15 min | < 5 min | Cross-region failover |
| Regional disaster | < 1 hour | < 30 min | DR site activation |
| Complete data loss | < 4 hours | < 24 hours | Backup restoration |
HeliosDB-Specific Recovery Features
Branch-Based Recovery
HeliosDB’s unique branching model enables instant recovery:
```sql
-- Create recovery point (automatic or manual)
CREATE BRANCH recovery_point_20260124;

-- Restore to recovery point
CHECKOUT BRANCH 'recovery_point_20260124';

-- Or use time-travel for precise recovery
SELECT * FROM orders AS OF '2026-01-24T12:00:00Z';

-- Create new branch from point-in-time
CREATE BRANCH recovered_data AS OF '2026-01-24T12:00:00Z';
```

Time-Travel Recovery
```sql
-- View data at specific point in time
SELECT * FROM users AS OF '2026-01-24T11:59:00Z';

-- Compare before/after corruption
SELECT before.id, before.balance AS old_balance, after.balance AS new_balance
FROM users AS OF '2026-01-24T11:59:00Z' before
JOIN users after ON before.id = after.id
WHERE before.balance != after.balance;

-- Restore specific rows
INSERT INTO users
SELECT * FROM users AS OF '2026-01-24T11:59:00Z'
WHERE id IN (SELECT id FROM corrupted_ids);
```

WAL-Based Recovery
```bash
# Continuous WAL archiving
heliosdb-wal archive \
  --source /var/lib/heliosdb/wal \
  --destination s3://backup-bucket/wal

# Point-in-time recovery using WAL
heliosdb-restore \
  --base-backup /backups/base_20260123 \
  --wal-archive s3://backup-bucket/wal \
  --target-time "2026-01-24T12:00:00Z"
```

Backup Strategy
Backup Types
| Type | Frequency | Retention | Method |
|---|---|---|---|
| Continuous WAL | Real-time | 7 days | Streaming to S3 |
| Incremental | Hourly | 7 days | Branch snapshot |
| Full backup | Daily | 30 days | Complete export |
| Archive | Monthly | 1 year | Compressed archive |
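One way to realize this cadence is a cron schedule driving the backup tooling. A sketch, assuming a `heliosdb-backup create` subcommand with `--type` and `--retention` flags (these are not shown elsewhere in this plan); continuous WAL streaming runs as a service, so only the snapshot tiers are scheduled here:

```cron
# Hypothetical crontab implementing the cadence above.
# min hour dom mon dow  command
0 * * * *   heliosdb-backup create --type incremental --retention 7d   # hourly
0 2 * * *   heliosdb-backup create --type full --retention 30d         # daily at 02:00
0 3 1 * *   heliosdb-backup create --type archive --retention 1y       # monthly on the 1st
```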
Backup Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                       Backup Architecture                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Primary ──WAL──▶ WAL Archive (S3) ──▶ DR Region                 │
│     │                                                            │
│     ├──hourly──▶ Incremental Backup ──▶ Hot Storage              │
│     │                                                            │
│     ├──daily───▶ Full Backup ────────▶ Warm Storage              │
│     │                                                            │
│     └──monthly─▶ Archive ────────────▶ Cold Storage (Glacier)    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

Backup Verification
```bash
# Automated backup verification (daily)
heliosdb-backup verify \
  --backup /backups/latest \
  --checksum \
  --test-restore

# Monthly restore drill
heliosdb-restore \
  --backup /backups/monthly_20260101 \
  --target /tmp/restore_test \
  --verify-queries /tests/verification_queries.sql
```

Recovery Procedures
Procedure 1: Single Node Recovery
Trigger: Node unresponsive, health check failing
Steps:
1. Verify the node has truly failed (not a network issue)
2. Automatic failover to a replica occurs (< 30 seconds)
3. Promote the replica to primary
4. Spin up a new replica from backup
5. Verify replication is synchronized
```bash
# Manual failover if automatic fails
heliosdb-ha failover --force --promote replica-2

# Verify new primary
heliosdb-cli status --cluster

# Rebuild failed node
heliosdb-node rebuild \
  --from-backup latest \
  --join-cluster production
```

Procedure 2: Database Corruption Recovery
Trigger: Data integrity check failure, application errors
Steps:
1. Stop writes to prevent further damage
2. Identify the corruption scope
3. Use time-travel to find a clean point
4. Restore from the clean point
5. Replay valid transactions
```sql
-- Step 1: Stop writes
ALTER SYSTEM SET default_transaction_read_only = on;

-- Step 2: Identify corruption
SELECT heliosdb_check_integrity('public.orders');

-- Step 3: Find clean point
SELECT timestamp
FROM heliosdb_branch_history
WHERE integrity_check = 'passed'
ORDER BY timestamp DESC LIMIT 1;

-- Step 4: Restore (creates new branch)
CREATE BRANCH recovery FROM 'main' AS OF '2026-01-24T11:00:00Z';
CHECKOUT BRANCH 'recovery';

-- Step 5: Merge clean branch back
-- (After verification)
MERGE BRANCH 'recovery' INTO 'main';
```

Procedure 3: Complete Site Failover
Trigger: Primary data center unavailable
Steps:
1. Confirm primary site failure
2. Activate the DR site
3. Update DNS/load balancer
4. Verify DR site functionality
5. Notify stakeholders
```bash
# Step 1: Verify primary failure
heliosdb-monitor check-site primary --timeout 60

# Step 2: Activate DR
heliosdb-dr activate --site dr-west --confirm

# Step 3: Update routing
heliosdb-dns update --record db.example.com --target dr-west.example.com

# Step 4: Verify
heliosdb-cli --host dr-west.example.com status

# Step 5: Send notifications
heliosdb-notify send --template dr-activation --recipients ops-team
```

Procedure 4: Ransomware Recovery
Trigger: Ransomware detection, encrypted files
Steps:
1. Isolate affected systems immediately
2. Preserve evidence for investigation
3. Identify a clean backup (from before the infection)
4. Rebuild from the clean backup
5. Restore data and verify integrity
6. Strengthen security controls
```bash
# Step 1: Network isolation
iptables -A INPUT -j DROP
iptables -A OUTPUT -j DROP

# Step 2: Evidence preservation
heliosdb-forensics capture --full-system --output /secure/evidence

# Step 3: Identify clean backup
heliosdb-backup list --before "2026-01-20" --verify-clean

# Step 4: Clean rebuild
# (On isolated network)
heliosdb-restore --backup /secure/clean_backup_20260119 --new-server

# Step 5: Verify and reconnect
# (After security review)
```

DR Site Configuration
Active-Passive Setup
```toml
# Primary site configuration
[replication]
mode = "streaming"
primary = true
archive_command = "heliosdb-wal archive %p s3://backup/wal/%f"

[replication.standby]
host = "dr-west.internal"
port = 5432
sync_mode = "async"  # or "sync" for zero RPO
```

Active-Active Setup (Multi-Region)
```toml
# Multi-region configuration
[cluster]
name = "production"
mode = "multi-primary"

[[cluster.nodes]]
name = "us-east"
host = "db-east.example.com"
region = "us-east-1"
priority = 1

[[cluster.nodes]]
name = "us-west"
host = "db-west.example.com"
region = "us-west-2"
priority = 2

[cluster.conflict_resolution]
strategy = "last-write-wins"
vector_clock = true
```

Testing Schedule
| Test Type | Frequency | Duration | Scope |
|---|---|---|---|
| Backup verification | Daily | Automated | All backups |
| Component failover | Weekly | 15 min | Individual nodes |
| Site failover | Monthly | 2 hours | Full DR drill |
| Full DR simulation | Quarterly | 4 hours | Complete scenario |
Test Checklist
- Backup integrity verified
- Recovery scripts executed successfully
- RTO/RPO targets met
- All applications reconnected
- Data integrity confirmed
- Performance acceptable
- Documentation updated
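Several of these items can be scripted so each drill leaves a pass/fail record; the manual items (stakeholder notification, documentation updates) still need human sign-off. A minimal sketch, reusing the verification commands from the Backup Strategy section, with the 300-second bound mirroring the 5-minute RTO target; treat this as a starting point, not the drill itself:

```bash
#!/usr/bin/env bash
# Sketch of a scripted drill check: backup integrity, restore verification,
# and a coarse RTO bound. Commands mirror the Backup Strategy section.
set -euo pipefail

start=$(date +%s)

# Backup integrity verified
heliosdb-backup verify --backup /backups/latest --checksum --test-restore

# Recovery scripts executed successfully; data integrity confirmed
heliosdb-restore \
  --backup /backups/latest \
  --target /tmp/drill_restore \
  --verify-queries /tests/verification_queries.sql

# RTO target met (restore portion only; < 5 minutes per the objectives table)
elapsed=$(( $(date +%s) - start ))
if [ "$elapsed" -ge 300 ]; then
  echo "RTO exceeded: restore took ${elapsed}s" >&2
  exit 1
fi
echo "Drill checks passed in ${elapsed}s"
```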
Monitoring & Alerting
DR Health Checks
```yaml
# Monitoring configuration
alerts:
  - name: replication_lag
    condition: lag > 60s
    severity: warning

  - name: replication_lag_critical
    condition: lag > 300s
    severity: critical

  - name: backup_age
    condition: last_backup > 24h
    severity: critical

  - name: dr_site_health
    condition: dr_site_unreachable
    severity: critical
```

Contacts
Escalation Path
| Level | Contact | When |
|---|---|---|
| L1 | On-call engineer | Initial response |
| L2 | Database team lead | Escalation > 15 min |
| L3 | VP Engineering | Site-wide failure |
| L4 | Executive team | Customer impact > 1 hour |
Emergency Contacts
- Primary On-Call: [PagerDuty rotation]
- Database Team: db-team@heliosdb.io
- Emergency Line: +1-xxx-xxx-xxxx