Disaster Recovery Plan

Overview

This document defines the disaster recovery (DR) procedures for HeliosDB Nano, ensuring rapid recovery from catastrophic failures while minimizing data loss.

Recovery Objectives

| Metric | Target       | Description                  |
|--------|--------------|------------------------------|
| RTO    | < 5 minutes  | Time to restore service      |
| RPO    | < 1 minute   | Maximum acceptable data loss |
| MTTR   | < 15 minutes | Mean time to recovery        |

These are the headline targets for common failures; site-level disasters carry longer per-scenario targets, listed under Disaster Scenarios below.
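
As a quick operational check against these targets, replication lag can be compared to the RPO directly. A minimal sketch, assuming `heliosdb-cli` can emit JSON status (the `--format json` flag and the `lag_seconds` field are illustrative, not documented flags):

# Hypothetical check: alert if replication lag exceeds the 1-minute RPO
lag=$(heliosdb-cli status --cluster --format json | jq -r '.replication.lag_seconds')
if [ "${lag%.*}" -gt 60 ]; then
  echo "RPO at risk: replication lag is ${lag}s (target < 60s)" >&2
  exit 1
fi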

Disaster Scenarios

Tier 1: Component Failure

| Scenario            | RTO      | RPO | Recovery Method    |
|---------------------|----------|-----|--------------------|
| Single disk failure | 0        | 0   | RAID rebuild       |
| Node failure        | < 1 min  | 0   | Automatic failover |
| Network partition   | < 30 sec | 0   | Re-routing         |

Tier 2: Service Disruption

| Scenario            | RTO      | RPO      | Recovery Method        |
|---------------------|----------|----------|------------------------|
| Database corruption | < 5 min  | < 1 min  | Branch restore         |
| Ransomware          | < 30 min | < 1 hour | Clean rebuild + backup |
| Accidental deletion | < 2 min  | 0        | Time-travel recovery   |

Tier 3: Site Failure

| Scenario           | RTO       | RPO        | Recovery Method       |
|--------------------|-----------|------------|-----------------------|
| Data center outage | < 15 min  | < 5 min    | Cross-region failover |
| Regional disaster  | < 1 hour  | < 30 min   | DR site activation    |
| Complete data loss | < 4 hours | < 24 hours | Backup restoration    |

HeliosDB-Specific Recovery Features

Branch-Based Recovery

HeliosDB’s unique branching model enables instant recovery:

-- Create recovery point (automatic or manual)
CREATE BRANCH recovery_point_20260124;

-- Restore to recovery point
CHECKOUT BRANCH 'recovery_point_20260124';

-- Or use time-travel for precise recovery
SELECT * FROM orders AS OF '2026-01-24T12:00:00Z';

-- Create new branch from point-in-time
CREATE BRANCH recovered_data AS OF '2026-01-24T12:00:00Z';

Time-Travel Recovery

-- View data at a specific point in time
SELECT * FROM users AS OF '2026-01-24T11:59:00Z';

-- Compare before/after corruption
SELECT
  before.id,
  before.balance AS old_balance,
  after.balance AS new_balance
FROM users AS OF '2026-01-24T11:59:00Z' before
JOIN users after ON before.id = after.id
WHERE before.balance != after.balance;

-- Restore specific rows (assumes the corrupted rows were deleted first,
-- so the re-insert does not collide with existing primary keys)
INSERT INTO users
SELECT * FROM users AS OF '2026-01-24T11:59:00Z'
WHERE id IN (SELECT id FROM corrupted_ids);

WAL-Based Recovery

# Continuous WAL archiving
heliosdb-wal archive \
  --source /var/lib/heliosdb/wal \
  --destination s3://backup-bucket/wal

# Point-in-time recovery using WAL
heliosdb-restore \
  --base-backup /backups/base_20260123 \
  --wal-archive s3://backup-bucket/wal \
  --target-time "2026-01-24T12:00:00Z"

Backup Strategy

Backup Types

| Type           | Frequency | Retention | Method             |
|----------------|-----------|-----------|--------------------|
| Continuous WAL | Real-time | 7 days    | Streaming to S3    |
| Incremental    | Hourly    | 7 days    | Branch snapshot    |
| Full backup    | Daily     | 30 days   | Complete export    |
| Archive        | Monthly   | 1 year    | Compressed archive |
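
One way to implement this cadence is a crontab on the backup coordinator, with WAL streaming configured separately as a continuous process. A sketch only: the `heliosdb-backup create` subcommand and its flags are assumptions modeled on the CLI used elsewhere in this plan:

# Hypothetical crontab: hourly incremental, daily full, monthly archive
0 * * * *  heliosdb-backup create --type incremental --retention 7d
0 2 * * *  heliosdb-backup create --type full --retention 30d
0 3 1 * *  heliosdb-backup create --type archive --retention 1y --destination s3://backup-bucket/archive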

Backup Architecture

┌─────────────────────────────────────────────────────────────────┐
│                       Backup Architecture                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Primary ──WAL──▶ WAL Archive (S3) ──▶ DR Region                │
│     │                                                           │
│     ├──hourly──▶ Incremental Backup ──▶ Hot Storage             │
│     │                                                           │
│     ├──daily───▶ Full Backup ────────▶ Warm Storage             │
│     │                                                           │
│     └──monthly─▶ Archive ────────────▶ Cold Storage (Glacier)   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Backup Verification

# Automated backup verification (daily)
heliosdb-backup verify \
  --backup /backups/latest \
  --checksum \
  --test-restore

# Monthly restore drill
heliosdb-restore \
  --backup /backups/monthly_20260101 \
  --target /tmp/restore_test \
  --verify-queries /tests/verification_queries.sql

Recovery Procedures

Procedure 1: Single Node Recovery

Trigger: Node unresponsive, health check failing

Steps:

  1. Verify the node has truly failed (not a transient network issue)
  2. Allow automatic failover to a replica (< 30 seconds)
  3. Promote the replica to primary
  4. Spin up a new replica from backup
  5. Verify replication is synchronized
# Manual failover if automatic fails
heliosdb-ha failover --force --promote replica-2

# Verify new primary
heliosdb-cli status --cluster

# Rebuild failed node
heliosdb-node rebuild \
  --from-backup latest \
  --join-cluster production

Procedure 2: Database Corruption Recovery

Trigger: Data integrity check failure, application errors

Steps:

  1. Stop writes to prevent further damage
  2. Identify corruption scope
  3. Use time-travel to find clean point
  4. Restore from clean point
  5. Replay valid transactions
-- Step 1: Stop writes
ALTER SYSTEM SET default_transaction_read_only = on;

-- Step 2: Identify corruption
SELECT heliosdb_check_integrity('public.orders');

-- Step 3: Find the most recent clean point
SELECT timestamp
FROM heliosdb_branch_history
WHERE integrity_check = 'passed'
ORDER BY timestamp DESC
LIMIT 1;

-- Step 4: Restore (creates a new branch)
CREATE BRANCH recovery FROM 'main' AS OF '2026-01-24T11:00:00Z';
CHECKOUT BRANCH 'recovery';

-- Step 5: Merge the clean branch back into main (after verification)
MERGE BRANCH 'recovery' INTO 'main';
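
The list's Step 5 (replaying valid transactions) can be driven from the WAL archive once the clean branch checks out. A hedged sketch: the `heliosdb-wal replay` subcommand and its flags are assumptions by analogy with the `archive` subcommand shown earlier, and both timestamps are illustrative:

# Hypothetical: replay archived WAL from the clean point up to just before
# the corruption appeared, then re-run the integrity check
heliosdb-wal replay \
  --archive s3://backup-bucket/wal \
  --from "2026-01-24T11:00:00Z" \
  --until "2026-01-24T11:58:00Z"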

Procedure 3: Complete Site Failover

Trigger: Primary data center unavailable

Steps:

  1. Confirm primary site failure
  2. Activate DR site
  3. Update DNS/load balancer
  4. Verify DR site functionality
  5. Notify stakeholders
# Step 1: Verify primary failure
heliosdb-monitor check-site primary --timeout 60

# Step 2: Activate DR
heliosdb-dr activate --site dr-west --confirm

# Step 3: Update routing
heliosdb-dns update --record db.example.com --target dr-west.example.com

# Step 4: Verify
heliosdb-cli --host dr-west.example.com status

# Step 5: Send notifications
heliosdb-notify send --template dr-activation --recipients ops-team

Procedure 4: Ransomware Recovery

Trigger: Ransomware detection, encrypted files

Steps:

  1. Isolate affected systems immediately
  2. Preserve evidence for investigation
  3. Identify clean backup (before infection)
  4. Rebuild from clean backup
  5. Restore data, verify integrity
  6. Strengthen security controls
# Step 1: Network isolation (run from a console session; this also severs
# your own remote connections)
iptables -A INPUT -j DROP
iptables -A OUTPUT -j DROP

# Step 2: Evidence preservation
heliosdb-forensics capture --full-system --output /secure/evidence

# Step 3: Identify the last clean backup (before infection)
heliosdb-backup list --before "2026-01-20" --verify-clean

# Step 4: Clean rebuild (on an isolated network)
heliosdb-restore --backup /secure/clean_backup_20260119 --new-server

# Step 5: Verify and reconnect (after security review; see the sketch below)
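
Step 5, expanded as a hedged sketch: the verification reuses the commands from Backup Verification above, and the final `iptables -F` assumes the Step 1 isolation rules are the only firewall rules on the host:

# Verify the rebuilt server while still isolated
heliosdb-backup verify --backup /secure/clean_backup_20260119 --checksum --test-restore
heliosdb-cli status --cluster

# Reconnect only after the security review signs off
iptables -F  # flushes the isolation rules added in Step 1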

DR Site Configuration

Active-Passive Setup

# Primary site configuration
[replication]
mode = "streaming"
primary = true
archive_command = "heliosdb-wal archive %p s3://backup-bucket/wal/%f"

[replication.standby]
host = "dr-west.internal"
port = 5432
sync_mode = "async"  # or "sync" for zero RPO
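
After enabling streaming replication, confirm the standby is actually receiving WAL before relying on it. A short sketch reusing commands from Procedure 3 (the `dr-west` site name matches the standby above):

# Confirm the standby site is reachable and replication is flowing
heliosdb-monitor check-site dr-west --timeout 60
heliosdb-cli status --cluster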

Active-Active Setup (Multi-Region)

# Multi-region configuration
[cluster]
name = "production"
mode = "multi-primary"

[[cluster.nodes]]
name = "us-east"
host = "db-east.example.com"
region = "us-east-1"
priority = 1

[[cluster.nodes]]
name = "us-west"
host = "db-west.example.com"
region = "us-west-2"
priority = 2

# Vector clocks identify which writes were truly concurrent; only those
# conflicts fall back to last-write-wins (the newest timestamp is kept)
[cluster.conflict_resolution]
strategy = "last-write-wins"
vector_clock = true

Testing Schedule

| Test Type           | Frequency | Duration  | Scope             |
|---------------------|-----------|-----------|-------------------|
| Backup verification | Daily     | Automated | All backups       |
| Component failover  | Weekly    | 15 min    | Individual nodes  |
| Site failover       | Monthly   | 2 hours   | Full DR drill     |
| Full DR simulation  | Quarterly | 4 hours   | Complete scenario |
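
The weekly component-failover drill can reuse Procedure 1's commands as a planned exercise. A sketch: the replica names are illustrative, and dropping `--force` for a drill is an assumption:

# Planned failover to a replica, verify, then fail back
heliosdb-ha failover --promote replica-2
heliosdb-cli status --cluster
heliosdb-ha failover --promote replica-1  # fail back (node name illustrative)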

Test Checklist

  • Backup integrity verified
  • Recovery scripts executed successfully
  • RTO/RPO targets met
  • All applications reconnected
  • Data integrity confirmed
  • Performance acceptable
  • Documentation updated

Monitoring & Alerting

DR Health Checks

# Monitoring configuration
alerts:
  - name: replication_lag
    condition: lag > 60s
    severity: warning
  - name: replication_lag_critical
    condition: lag > 300s
    severity: critical
  - name: backup_age
    condition: last_backup > 24h
    severity: critical
  - name: dr_site_health
    condition: dr_site_unreachable
    severity: critical

Contacts

Escalation Path

| Level | Contact            | When                     |
|-------|--------------------|--------------------------|
| L1    | On-call engineer   | Initial response         |
| L2    | Database team lead | Escalation > 15 min      |
| L3    | VP Engineering     | Site-wide failure        |
| L4    | Executive team     | Customer impact > 1 hour |

Emergency Contacts

  • Primary On-Call: [PagerDuty rotation]
  • Database Team: db-team@heliosdb.io
  • Emergency Line: +1-xxx-xxx-xxxx