
Safekeeper WAL Service User Guide

Overview

Safekeeper provides dedicated, fault-tolerant WAL storage separate from compute nodes, enabling instant compute recovery and better resource utilization.

Benefits

  • Instant compute recovery
  • WAL durability independent of compute
  • Reduced compute storage requirements
  • 3+ WAL replicas for safety

Architecture

                ┌──────────────┐
                │ Compute Node │
                └──────┬───────┘
                       │ WAL
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
┌──────────────┐┌──────────────┐┌──────────────┐
│ Safekeeper 1 ││ Safekeeper 2 ││ Safekeeper 3 │
└──────────────┘└──────────────┘└──────────────┘
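
The fan-out above relies on quorum acknowledgment: a WAL record counts as committed once a majority of safekeepers have durably stored it. A minimal sketch in Python, assuming three replicas; the names (`commit_lsn`, `ack_positions`) are illustrative, not part of HeliosDB's API:

```python
def commit_lsn(ack_positions, quorum=2):
    """Return the highest LSN acknowledged by at least `quorum` safekeepers.

    The quorum-th highest ack is durable on a majority, so WAL up to that
    position can be treated as committed even if one replica lags.
    """
    return sorted(ack_positions, reverse=True)[quorum - 1]

# Example: safekeeper-3 is behind, but 0x190 is safe on 2 of 3 replicas.
acks = {"safekeeper-1": 0x1A0, "safekeeper-2": 0x190, "safekeeper-3": 0x150}
print(hex(commit_lsn(list(acks.values()))))  # 0x190
```

With 3 replicas and a quorum of 2, one safekeeper can fail or lag without blocking commits, which is why the Best Practices below call for 3+ replicas.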

Configuration

safekeeper:
  enabled: true
  replicas:
    - host: safekeeper-1:5433
      priority: 1
    - host: safekeeper-2:5433
      priority: 2
    - host: safekeeper-3:5433
      priority: 3
  wal_keep_segments: 100
  max_wal_size_gb: 10
  sync_timeout_ms: 1000
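
One plausible reading of the `priority` field is that lower numbers are tried first when a client chooses which safekeeper to connect to. A hedged sketch of that selection logic (an assumption for illustration, not HeliosDB's documented behavior):

```python
# Replica list mirroring the YAML config above.
replicas = [
    {"host": "safekeeper-1:5433", "priority": 1},
    {"host": "safekeeper-3:5433", "priority": 3},
    {"host": "safekeeper-2:5433", "priority": 2},
]

def connection_order(replicas):
    """Sort replicas so the lowest priority number is tried first."""
    return [r["host"] for r in sorted(replicas, key=lambda r: r["priority"])]

print(connection_order(replicas))
# ['safekeeper-1:5433', 'safekeeper-2:5433', 'safekeeper-3:5433']
```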

SQL Examples

-- Check safekeeper status
SELECT
    name,
    state,         -- active, syncing, recovering
    wal_position,
    lag_mb
FROM heliosdb.safekeeper_status;

-- Force WAL sync
SELECT heliosdb.sync_wal();
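
Rows returned by the status query above can also be checked programmatically. An illustrative helper, assuming the columns shown; the 64 MB alert threshold is arbitrary:

```python
def unhealthy(rows, max_lag_mb=64):
    """Return names of safekeepers that are not active or lag too far behind."""
    return [r["name"] for r in rows
            if r["state"] != "active" or r["lag_mb"] > max_lag_mb]

# Sample rows shaped like heliosdb.safekeeper_status output.
rows = [
    {"name": "safekeeper-1", "state": "active", "lag_mb": 0},
    {"name": "safekeeper-2", "state": "active", "lag_mb": 12},
    {"name": "safekeeper-3", "state": "recovering", "lag_mb": 512},
]
print(unhealthy(rows))  # ['safekeeper-3']
```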

Use Cases

Instant Compute Recovery

1. Compute node fails
2. New compute node starts
3. Connects to safekeeper cluster
4. Replays WAL from safekeepers
5. Ready in <10 seconds
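
The steps above can be sketched as: a fresh compute picks the safekeeper holding the most WAL and replays the gap since its last checkpoint. All names here are illustrative; the real protocol is internal to HeliosDB:

```python
def pick_recovery_source(safekeepers):
    """Choose the safekeeper holding the furthest WAL position."""
    return max(safekeepers, key=lambda s: s["wal_position"])

def replay(source, from_lsn):
    """Replay WAL from `from_lsn` up to the source's position (returns bytes)."""
    return source["wal_position"] - from_lsn

sks = [{"host": "safekeeper-1:5433", "wal_position": 1000},
       {"host": "safekeeper-2:5433", "wal_position": 1200},
       {"host": "safekeeper-3:5433", "wal_position": 1200}]
src = pick_recovery_source(sks)
print(src["host"], replay(src, from_lsn=800))  # safekeeper-2:5433 400
```

Because WAL durability lives in the safekeeper quorum rather than on compute-local disk, the replacement node only needs to replay this gap, which is what keeps recovery under 10 seconds.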

Compute Scale-to-Zero

-- Compute can scale to zero safely
-- WAL preserved in safekeepers
-- Resume from exact position
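
Scale-to-zero works because the compute's only durable state is its last applied WAL position; everything after it is preserved in the safekeepers. A toy sketch with illustrative names:

```python
def suspend(applied_lsn):
    """On scale-to-zero, persist only the last applied LSN as a resume token."""
    return {"resume_lsn": applied_lsn}

def resume(token, safekeeper_lsn):
    """On wake, replay the gap between the resume point and the safekeeper tip.

    Returns the number of WAL bytes to replay (illustrative).
    """
    return safekeeper_lsn - token["resume_lsn"]

token = suspend(applied_lsn=5000)
print(resume(token, safekeeper_lsn=5600))  # 600
```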

Best Practices

  1. Run 3+ safekeeper replicas so a majority quorum survives a single failure
  2. Monitor WAL lag (lag_mb in heliosdb.safekeeper_status) and alert on sustained growth
  3. Use fast storage (SSD/NVMe) for safekeepers, since WAL writes are latency-sensitive
  4. Set WAL retention (wal_keep_segments) high enough to cover your longest expected compute outage
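
For retention sizing, a rough rule: retained WAL per safekeeper is about `wal_keep_segments` times the segment size (16 MB, PostgreSQL's default segment size, is assumed here):

```python
def retention_mb(wal_keep_segments, segment_mb=16):
    """Approximate WAL retained per safekeeper, in MB."""
    return wal_keep_segments * segment_mb

# With wal_keep_segments: 100 from the config above:
print(retention_mb(100))  # 1600
```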

For more: /docs/architecture/safekeeper.md