Safekeeper WAL Service User Guide
Overview
Safekeeper provides dedicated, fault-tolerant WAL storage separate from compute nodes, enabling instant compute recovery and better resource utilization.
Benefits
- Instant compute recovery
- WAL durability independent of compute
- Reduced compute storage requirements
- 3+ WAL replicas for safety
Architecture
┌─────────────┐      WAL      ┌──────────────┐
│ Compute Node├──────────────►│ Safekeeper 1 │
└─────────────┘               └──────────────┘
                                     │
                              ┌──────┴───────┐
                              ▼              ▼
                      ┌──────────────┐┌──────────────┐
                      │ Safekeeper 2 ││ Safekeeper 3 │
                      └──────────────┘└──────────────┘
Configuration
safekeeper:
  enabled: true
  replicas:
    - host: safekeeper-1:5433
      priority: 1
    - host: safekeeper-2:5433
      priority: 2
    - host: safekeeper-3:5433
      priority: 3

  wal_keep_segments: 100
  max_wal_size_gb: 10
  sync_timeout_ms: 1000
SQL Examples
-- Check safekeeper status
SELECT
  name,
  state,  -- active, syncing, recovering
  wal_position,
  lag_mb
FROM heliosdb.safekeeper_status;
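
To act on the status query above from a monitoring script, the returned rows can be classified by state and lag. The following is a minimal Python sketch under stated assumptions: the row fields mirror the query's columns, but the 64 MB threshold, the `needs_attention` helper, and the sample rows (including the format of `wal_position`) are hypothetical and not defined by Safekeeper.

```python
from dataclasses import dataclass

# Hypothetical warning threshold; tune it to your workload.
LAG_WARN_MB = 64.0


@dataclass
class SafekeeperStatus:
    """One row of heliosdb.safekeeper_status, as selected by the query above."""
    name: str
    state: str          # 'active', 'syncing', or 'recovering'
    wal_position: str   # format assumed for illustration only
    lag_mb: float


def needs_attention(rows, lag_warn_mb=LAG_WARN_MB):
    """Return the names of safekeepers that are not active or exceed the lag threshold."""
    return [r.name for r in rows if r.state != "active" or r.lag_mb > lag_warn_mb]


# Made-up sample rows standing in for a real query result.
rows = [
    SafekeeperStatus("safekeeper-1", "active", "0/3000060", 0.5),
    SafekeeperStatus("safekeeper-2", "syncing", "0/2F00000", 12.0),
    SafekeeperStatus("safekeeper-3", "active", "0/2000000", 128.0),
]

print(needs_attention(rows))
```

With these sample rows, safekeeper-2 is flagged because it is still syncing and safekeeper-3 because its lag exceeds the assumed threshold.
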
-- Force WAL sync
SELECT heliosdb.sync_wal();
Use Cases
Instant Compute Recovery
1. Compute node fails
2. New compute node starts
3. Connects to safekeeper cluster
4. Replays WAL from safekeepers
5. Ready in <10 seconds
Compute Scale-to-Zero
-- Compute can scale to zero safely
-- WAL preserved in safekeepers
-- Resume from exact position
Best Practices
- Run 3+ safekeeper replicas
- Monitor WAL lag
- Use fast storage for safekeepers
- Set appropriate retention
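
Some of these practices can be checked mechanically before deployment. The sketch below is one possible approach in Python, not part of Safekeeper itself: it mirrors the values from the Configuration section as a plain dict (assuming the WAL settings are nested under `safekeeper`) and reports obvious problems. The `check_config` helper and the `MIN_REPLICAS` constant are hypothetical names introduced for illustration.

```python
# Configuration mirrored as a dict; values copied from the Configuration section.
# Assumption: the WAL settings live under the "safekeeper" key.
config = {
    "safekeeper": {
        "enabled": True,
        "replicas": [
            {"host": "safekeeper-1:5433", "priority": 1},
            {"host": "safekeeper-2:5433", "priority": 2},
            {"host": "safekeeper-3:5433", "priority": 3},
        ],
        "wal_keep_segments": 100,
        "max_wal_size_gb": 10,
        "sync_timeout_ms": 1000,
    }
}

MIN_REPLICAS = 3  # from "Run 3+ safekeeper replicas"


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    sk = cfg.get("safekeeper", {})
    problems = []
    if not sk.get("enabled", False):
        problems.append("safekeeper is disabled")
    replicas = sk.get("replicas", [])
    if len(replicas) < MIN_REPLICAS:
        problems.append(f"only {len(replicas)} replicas; want at least {MIN_REPLICAS}")
    hosts = [r["host"] for r in replicas]
    if len(set(hosts)) != len(hosts):
        problems.append("duplicate replica hosts")
    # Retention and timeout settings must be present and positive.
    for key in ("wal_keep_segments", "max_wal_size_gb", "sync_timeout_ms"):
        if sk.get(key, 0) <= 0:
            problems.append(f"{key} must be positive")
    return problems


print(check_config(config))
```

What counts as "appropriate" retention depends on the workload, so the check only confirms that the retention values are set and positive.
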
For more details, see /docs/architecture/safekeeper.md