
Safekeeper WAL Service User Guide

Overview

Safekeeper provides dedicated, fault-tolerant WAL storage separate from compute nodes, enabling instant compute recovery and better resource utilization.

Benefits

  • Instant compute recovery
  • WAL durability independent of compute
  • Reduced compute storage requirements
  • 3+ WAL replicas for safety

Architecture

                ┌──────────────┐
                │ Compute Node │
                └──────┬───────┘
                       │ WAL
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
┌──────────────┐┌──────────────┐┌──────────────┐
│ Safekeeper 1 ││ Safekeeper 2 ││ Safekeeper 3 │
└──────────────┘└──────────────┘└──────────────┘
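
The fan-out above relies on quorum acknowledgment: a WAL record counts as committed once a majority of safekeepers have durably stored it. A minimal sketch in Python, assuming three replicas; the names (`commit_lsn`, `ack_positions`) are illustrative, not part of HeliosDB's API:

```python
def commit_lsn(ack_positions, quorum=2):
    """Return the highest LSN acknowledged by at least `quorum` safekeepers.

    The quorum-th highest ack is durable on a majority, so WAL up to that
    position can be treated as committed even if one replica lags.
    """
    return sorted(ack_positions, reverse=True)[quorum - 1]

# Example: safekeeper-3 is behind, but 0x190 is safe on 2 of 3 replicas.
acks = {"safekeeper-1": 0x1A0, "safekeeper-2": 0x190, "safekeeper-3": 0x150}
print(hex(commit_lsn(list(acks.values()))))  # 0x190
```

With 3 replicas and a quorum of 2, one safekeeper can fail or lag without blocking commits, which is why the Best Practices below call for 3+ replicas.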

Configuration

safekeeper:
  enabled: true
  replicas:
    - host: safekeeper-1:5433
      priority: 1
    - host: safekeeper-2:5433
      priority: 2
    - host: safekeeper-3:5433
      priority: 3
  wal_keep_segments: 100
  max_wal_size_gb: 10
  sync_timeout_ms: 1000
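
One plausible reading of the `priority` field is that lower numbers are tried first when a client chooses which safekeeper to connect to. A hedged sketch of that selection logic (an assumption for illustration, not HeliosDB's documented behavior):

```python
# Replica list mirroring the YAML config above.
replicas = [
    {"host": "safekeeper-1:5433", "priority": 1},
    {"host": "safekeeper-3:5433", "priority": 3},
    {"host": "safekeeper-2:5433", "priority": 2},
]

def connection_order(replicas):
    """Sort replicas so the lowest priority number is tried first."""
    return [r["host"] for r in sorted(replicas, key=lambda r: r["priority"])]

print(connection_order(replicas))
# ['safekeeper-1:5433', 'safekeeper-2:5433', 'safekeeper-3:5433']
```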

SQL Examples

-- Check safekeeper status
SELECT
    name,
    state,         -- active, syncing, recovering
    wal_position,
    lag_mb
FROM heliosdb.safekeeper_status;

-- Force WAL sync
SELECT heliosdb.sync_wal();
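
Rows returned by the status query above can also be checked programmatically. An illustrative helper, assuming the columns shown; the 64 MB alert threshold is arbitrary:

```python
def unhealthy(rows, max_lag_mb=64):
    """Return names of safekeepers that are not active or lag too far behind."""
    return [r["name"] for r in rows
            if r["state"] != "active" or r["lag_mb"] > max_lag_mb]

# Sample rows shaped like heliosdb.safekeeper_status output.
rows = [
    {"name": "safekeeper-1", "state": "active", "lag_mb": 0},
    {"name": "safekeeper-2", "state": "active", "lag_mb": 12},
    {"name": "safekeeper-3", "state": "recovering", "lag_mb": 512},
]
print(unhealthy(rows))  # ['safekeeper-3']
```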

Use Cases

Instant Compute Recovery

1. Compute node fails
2. New compute node starts
3. Connects to safekeeper cluster
4. Replays WAL from safekeepers
5. Ready in <10 seconds
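
The steps above can be sketched as: a fresh compute picks the safekeeper holding the most WAL and replays the gap since its last checkpoint. All names here are illustrative; the real protocol is internal to HeliosDB:

```python
def pick_recovery_source(safekeepers):
    """Choose the safekeeper holding the furthest WAL position."""
    return max(safekeepers, key=lambda s: s["wal_position"])

def replay(source, from_lsn):
    """Replay WAL from `from_lsn` up to the source's position (returns bytes)."""
    return source["wal_position"] - from_lsn

sks = [{"host": "safekeeper-1:5433", "wal_position": 1000},
       {"host": "safekeeper-2:5433", "wal_position": 1200},
       {"host": "safekeeper-3:5433", "wal_position": 1200}]
src = pick_recovery_source(sks)
print(src["host"], replay(src, from_lsn=800))  # safekeeper-2:5433 400
```

Because WAL durability lives in the safekeeper quorum rather than on compute-local disk, the replacement node only needs to replay this gap, which is what keeps recovery under 10 seconds.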

Compute Scale-to-Zero

-- Compute can scale to zero safely
-- WAL preserved in safekeepers
-- Resume from exact position
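
Scale-to-zero works because the compute's only durable state is its last applied WAL position; everything after it is preserved in the safekeepers. A toy sketch with illustrative names:

```python
def suspend(applied_lsn):
    """On scale-to-zero, persist only the last applied LSN as a resume token."""
    return {"resume_lsn": applied_lsn}

def resume(token, safekeeper_lsn):
    """On wake, replay the gap between the resume point and the safekeeper tip.

    Returns the number of WAL bytes to replay (illustrative).
    """
    return safekeeper_lsn - token["resume_lsn"]

token = suspend(applied_lsn=5000)
print(resume(token, safekeeper_lsn=5600))  # 600
```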

Best Practices

  1. Run 3+ safekeeper replicas so a majority quorum survives a single failure
  2. Monitor WAL lag (lag_mb in heliosdb.safekeeper_status) and alert on sustained growth
  3. Use fast storage (SSD/NVMe) for safekeepers, since WAL writes are latency-sensitive
  4. Set WAL retention (wal_keep_segments) high enough to cover your longest expected compute outage
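
For retention sizing, a rough rule: retained WAL per safekeeper is about `wal_keep_segments` times the segment size (16 MB, PostgreSQL's default segment size, is assumed here):

```python
def retention_mb(wal_keep_segments, segment_mb=16):
    """Approximate WAL retained per safekeeper, in MB."""
    return wal_keep_segments * segment_mb

# With wal_keep_segments: 100 from the config above:
print(retention_mb(100))  # 1600
```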

For more: /docs/architecture/safekeeper.md