Scale-to-Zero Serverless Compute User Guide
Overview
HeliosDB’s Scale-to-Zero feature automatically suspends idle compute nodes and resumes them in under 300ms when queries arrive, dramatically reducing costs for intermittent workloads while maintaining instant availability.
Benefits
- Pay only for active compute time (potential 50-90% cost savings)
- Sub-300ms resume time (imperceptible to users)
- Automatic idle detection and suspension
- No cold-start penalties
- Seamless integration with existing applications
Key Features
- Suspend time: <5 seconds
- Resume time: <300ms (average ~170ms)
- State persistence: <10MB snapshot
- Resume success rate: >99.9%
- Accurate per-second billing
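Per-second billing means a short burst of activity costs only its active seconds. As a rough back-of-the-envelope helper (a hypothetical sketch, not the metering implementation; it assumes the $0.10/CU-hour price used elsewhere in this guide):

```python
def cost_usd(active_seconds: float, cus: float, price_per_cu_hour: float = 0.10) -> float:
    """Estimate the cost of a burst of activity billed per second."""
    cu_hours = cus * active_seconds / 3600.0
    return cu_hours * price_per_cu_hour

# A 90-second burst on a 2-CU node costs half a cent:
print(round(cost_usd(90, 2.0), 4))  # 0.005
```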
Prerequisites
System Requirements
- HeliosDB v3.2 or later with autoscale package
- At least 2GB RAM for state persistence
- S3-compatible storage for snapshots (optional)
Required Configuration
Minimum configuration in heliosdb.conf:
```yaml
autoscale:
  enabled: true
  scale_to_zero:
    enabled: true
    idle_timeout_seconds: 300  # 5 minutes
  min_cu: 0.0                  # Allow scale to zero
```

Step-by-Step Configuration
1. Enable Scale-to-Zero
Edit /etc/heliosdb/heliosdb.conf:
```yaml
compute:
  lifecycle:
    suspend_timeout_seconds: 5
    resume_timeout_ms: 300
    resume_target_ms: 250
    checkpoint_transactions: true
    persist_buffer_cache: true
    max_cache_persist_mb: 10
```
```yaml
autoscale:
  enabled: true
  min_cu: 0.0  # IMPORTANT: Must be 0.0 for scale-to-zero
  scale_to_zero:
    enabled: true
    idle_timeout_seconds: 300
    idle_check_interval_seconds: 30
```
```yaml
state_persistence:
  enabled: true
  snapshot_path: /var/lib/heliosdb/snapshots
  max_snapshot_size_mb: 10
  compression: zstd
```

2. Configure Activity Monitoring
```yaml
activity_monitor:
  idle_timeout_seconds: 300
  activity_threshold_seconds: 1
  activity_window_size: 1000
  sample_interval_seconds: 10

  # Customize what counts as "activity"
  track_readonly_queries: true
  track_system_queries: false
```

3. Set Up Billing Tracking
```yaml
billing:
  enabled: true
  cu_config:
    cpu_cores_per_cu: 1.0
    memory_mb_per_cu: 2048
  price_per_cu_hour: 0.10
  export:
    format: prometheus
    endpoint: http://prometheus:9090/metrics
```

4. Restart and Verify
```bash
# Restart HeliosDB
sudo systemctl restart heliosdb

# Check status
heliosdb-cli status

# Verify scale-to-zero is active
heliosdb-cli autoscale status
```

Expected output:

```
Autoscale Status: ENABLED
Scale-to-Zero: ENABLED
Current State: ACTIVE
Current CUs: 2.0
Idle Timeout: 300s
Time Since Last Activity: 45s
```

SQL Examples with Explanations
Monitoring Node State
```sql
-- Check current compute node state
SELECT
    node_id,
    state,  -- Active, Suspended, Resuming
    current_cu,
    last_activity,
    idle_duration_seconds
FROM heliosdb.compute_nodes;
```

Example Output:

```
 node_id | state  | current_cu | last_activity | idle_duration_seconds
---------+--------+------------+---------------+-----------------------
 node-1  | Active | 2.0        | 10s ago       | 0
```

Manual Suspend/Resume
```sql
-- Manually suspend a node (useful for maintenance)
SELECT heliosdb.suspend_node('node-1');

-- Check suspend result
SELECT * FROM heliosdb.last_suspend_result();
```

Output:

```json
{
  "node_id": "node-1",
  "duration_ms": 820,
  "snapshot_size_mb": 8.2,
  "phase_breakdown": {
    "checkpoint_transactions_ms": 102,
    "flush_buffer_cache_ms": 498,
    "persist_snapshot_ms": 195,
    "release_resources_ms": 25
  },
  "success": true
}
```

```sql
-- Manually resume a node
SELECT heliosdb.resume_node('node-1');

-- Check resume result
SELECT * FROM heliosdb.last_resume_result();
```

Output:

```json
{
  "node_id": "node-1",
  "total_duration_ms": 168,
  "phase_breakdown": {
    "allocate_resources_ms": 38,
    "restore_buffer_cache_ms": 58,
    "restore_session_state_ms": 29,
    "mark_ready_ms": 1,
    "overhead_ms": 42
  },
  "cache_hit_rate": 0.85,
  "success": true
}
```

Viewing Activity History
```sql
-- View recent query activity
SELECT
    timestamp,
    query_type,
    duration_ms,
    rows_affected
FROM heliosdb.query_activity
WHERE timestamp > now() - interval '1 hour'
ORDER BY timestamp DESC
LIMIT 20;
```

Checking Billing Information

```sql
-- View CU-hour consumption
SELECT
    node_id,
    session_start,
    session_end,
    avg_cus,
    duration_hours,
    cu_hours,
    cost_usd
FROM heliosdb.billing_sessions
WHERE session_start > now() - interval '24 hours'
ORDER BY session_start DESC;
```

Example Output:

```
 node_id | session_start | session_end | avg_cus | duration_hours | cu_hours | cost_usd
---------+---------------+-------------+---------+----------------+----------+----------
 node-1  | 10:00:00      | 10:15:00    | 2.0     | 0.25           | 0.50     | $0.05
 node-1  | 14:30:00      | 15:00:00    | 4.0     | 0.50           | 2.00     | $0.20
```

Common Use Cases
Use Case 1: Development/Staging Environment
Scenario: Dev database used only during business hours (9 AM - 5 PM)
```yaml
# heliosdb.conf
autoscale:
  scale_to_zero:
    enabled: true
    idle_timeout_seconds: 600  # 10 minutes idle time
    # Automatically suspends after hours
```

Cost Savings:

- Active time: 8 hours/day = ~33% of the day
- Idle time scaled to zero: 16 hours/day = ~67% savings
- Monthly savings: ~$96 on a 2 CU instance (16 h/day × 30 days × 2 CU × $0.10/CU-hour)
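The monthly figure can be sanity-checked with the guide's own numbers (a sketch; assumes the $0.10/CU-hour price and 30 billable days per month):

```python
# Monthly savings for a dev database idle 16 h/day on a 2-CU instance
idle_hours_per_day = 16
cus = 2.0
price_per_cu_hour = 0.10
days = 30

monthly_savings = idle_hours_per_day * days * cus * price_per_cu_hour
print(monthly_savings)  # 96.0
```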
Use Case 2: Batch Processing Workload
Scenario: Database receives batches every 4 hours
```sql
-- Application workflow
BEGIN;

-- Batch arrives, database auto-resumes (<300ms)
SELECT heliosdb.current_node_state();  -- Returns: Resuming → Active

INSERT INTO events SELECT * FROM batch_staging;  -- Process 1M records

COMMIT;

-- After completion, the idle timer starts
-- After 5 minutes idle → auto-suspend
```

Cost Savings:
- Active time: ~15 min/4 hours = 6.25% of time
- Savings: ~94% compared to always-on
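The duty-cycle arithmetic above can be checked directly (plain Python, using the guide's figures):

```python
# Batch workload: ~15 active minutes out of every 4 hours
active_min = 15
period_min = 4 * 60

duty_cycle = active_min / period_min
savings_pct = 100 * (1 - duty_cycle)

print(round(duty_cycle * 100, 2), round(savings_pct, 2))  # 6.25 93.75
```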
Use Case 3: Low-Traffic Web Application
Scenario: Internal dashboard with sporadic usage
```yaml
autoscale:
  scale_to_zero:
    enabled: true
    idle_timeout_seconds: 180  # 3 minutes
    # Quick suspend for quick savings
```

User Experience:

```
User opens dashboard
  ↓
Query arrives at suspended node
  ↓
Auto-resume in ~170ms
  ↓
Dashboard loads (feels instant to user)
```

Use Case 4: Testing/CI Pipeline
Scenario: Database for automated tests
```bash
#!/bin/bash
# Tests trigger resume automatically
npm run test:database

# No manual shutdown needed
# Database auto-suspends after 5 min idle
```

Cost Savings:
- Test runs: ~10 min/day
- Always-on cost (1 CU): 24 hours × $0.10/CU-hour = $2.40/day
- Scale-to-zero cost: ~10 min of activity × $0.10/CU-hour ≈ $0.017/day
- Savings: ~99%
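The same comparison in code (a sketch; assumes a 1-CU node at the $0.10/CU-hour price):

```python
# Daily cost comparison for a CI database (1 CU at $0.10/CU-hour)
price = 0.10
always_on = 24 * price             # $2.40/day
scale_to_zero = (10 / 60) * price  # ~10 min of tests per day

savings_pct = 100 * (1 - scale_to_zero / always_on)
print(round(scale_to_zero, 3), round(savings_pct, 1))  # 0.017 99.3
```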
Troubleshooting
Issue: Resume Taking Too Long
Symptom: Resume time >500ms, users notice latency
Diagnosis:
```sql
-- Check recent resume times
SELECT
    timestamp,
    total_duration_ms,
    phase_breakdown
FROM heliosdb.resume_history
ORDER BY timestamp DESC
LIMIT 10;
```

Solution 1: Reduce Snapshot Size

```yaml
# heliosdb.conf
state_persistence:
  max_cache_persist_mb: 5  # Reduce from 10MB
  compress_sessions: true
```

Solution 2: Increase Resume Budget

```yaml
compute:
  lifecycle:
    resume_timeout_ms: 500  # Allow more time
```

Solution 3: Keep Warm

```yaml
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 900  # 15 min (longer wait before suspend)
```

Issue: Frequent Suspend/Resume Cycles
Symptom: Node oscillating between suspended/active
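Why a longer idle timeout stops the oscillation can be illustrated with a tiny model (hypothetical, plain Python): queries arriving every 7 minutes against a 5-minute versus a 10-minute timeout.

```python
def suspend_cycles_per_day(query_gap_min: float, idle_timeout_min: float) -> int:
    """Model: evenly spaced queries; a suspend happens in every
    gap that is longer than the idle timeout."""
    if query_gap_min <= idle_timeout_min:
        return 0  # node never stays idle long enough to suspend
    return int(24 * 60 // query_gap_min)

print(suspend_cycles_per_day(7, 5))   # 205 suspend/resume cycles per day
print(suspend_cycles_per_day(7, 10))  # 0 — the timeout outlasts the gap
```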
Diagnosis:
```sql
-- Count suspend/resume events
SELECT
    DATE_TRUNC('hour', timestamp) AS hour,
    COUNT(*) AS cycle_count
FROM heliosdb.lifecycle_events
WHERE event_type IN ('suspend', 'resume')
GROUP BY hour
ORDER BY hour DESC;
```

Solution: Increase Idle Timeout

```yaml
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 600  # Increase from 300
    # Adds hysteresis to prevent flapping
```

Issue: Snapshot Size Too Large
Symptom:
```
ERROR: Snapshot size 12.3 MB exceeds limit of 10 MB
```

Diagnosis:

```sql
SELECT
    node_id,
    snapshot_size_mb,
    session_count,
    buffer_cache_size_mb
FROM heliosdb.node_snapshots
ORDER BY snapshot_size_mb DESC;
```

Solution 1: Reduce Buffer Cache

```yaml
compute:
  lifecycle:
    max_cache_persist_mb: 5  # Reduce persisted cache
```

Solution 2: Clean Up Sessions

```sql
-- Close idle sessions before suspend
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '5 minutes';
```

Issue: Resume Failures
Symptom: Resume success rate <99%
Diagnosis:
```sql
SELECT
    error_message,
    COUNT(*) AS occurrence_count
FROM heliosdb.resume_errors
WHERE timestamp > now() - interval '24 hours'
GROUP BY error_message
ORDER BY occurrence_count DESC;
```

Common Errors and Solutions:

- Resource Allocation Failed

```sql
-- Check available resources
SELECT * FROM heliosdb.resource_availability;

-- Solution: Increase resource pool
```

- Snapshot Corruption

```sql
-- Check snapshot integrity
SELECT heliosdb.verify_snapshot('node-1');

-- Solution: Force fresh snapshot
SELECT heliosdb.recreate_snapshot('node-1');
```

- Timeout Exceeded

```yaml
# Increase resume timeout
compute:
  lifecycle:
    resume_timeout_ms: 500
```

Performance Tuning Tips
1. Optimize Snapshot Creation
```yaml
state_persistence:
  # Use fast compression
  compression: lz4  # Faster than zstd

  # Reduce snapshot frequency during suspend
  incremental_snapshots: true

  # Skip non-critical state
  persist_temp_tables: false
  persist_prepared_statements_cache: true  # Keep important state
```

2. Tune Idle Detection
```yaml
activity_monitor:
  # Ignore system queries for idle detection
  track_system_queries: false
  track_readonly_queries: true

  # Adjust sensitivity
  activity_threshold_seconds: 5  # Lower = more sensitive
```

3. Pre-warm Cache on Resume
```yaml
compute:
  lifecycle:
    # Load hot pages first
    prioritize_hot_pages: true
    hot_page_threshold: 100  # Most accessed pages

    # Parallel cache loading
    parallel_cache_restore: true
    cache_restore_threads: 4
```

4. Connection Pooling
```python
# Application code
import psycopg2.pool

# Use connection pooling to reduce resume overhead
pool = psycopg2.pool.SimpleConnectionPool(
    minconn=1,
    maxconn=10,
    host="heliosdb.example.com",
    database="mydb",
    # Connections stay open, so repeated queries don't trigger repeated resumes
)
```

5. Batch Queries
```python
# Instead of multiple queries (each may trigger a separate resume):
for user in users:
    db.query("SELECT * FROM orders WHERE user_id = %s", (user.id,))

# Batch into a single query (one round trip, at most one resume;
# parameter binding also avoids SQL injection):
user_ids = [user.id for user in users]
db.query("SELECT * FROM orders WHERE user_id = ANY(%s)", (user_ids,))
```

Best Practices
1. Set Appropriate Idle Timeouts
```yaml
# Development: Quick suspend
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 180  # 3 minutes

# Staging: Moderate
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 600  # 10 minutes

# Production: Conservative (if using scale-to-zero)
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 1800  # 30 minutes
```

2. Monitor Resume Latency
```sql
-- Create alert for slow resumes
CREATE OR REPLACE FUNCTION check_resume_performance()
RETURNS void AS $$
DECLARE
    avg_resume_ms float;
BEGIN
    SELECT AVG(total_duration_ms)
    INTO avg_resume_ms
    FROM heliosdb.resume_history
    WHERE timestamp > now() - interval '1 hour';

    IF avg_resume_ms > 300 THEN
        RAISE WARNING 'Average resume time % ms exceeds 300ms target', round(avg_resume_ms);
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Schedule check
SELECT cron.schedule('check-resume-perf', '*/15 * * * *', 'SELECT check_resume_performance()');
```

3. Cost Monitoring
```sql
-- Daily cost report
CREATE VIEW daily_cost_summary AS
SELECT
    DATE(session_start) AS date,
    SUM(cu_hours) AS total_cu_hours,
    SUM(cost_usd) AS total_cost_usd,
    COUNT(*) AS session_count,
    SUM(CASE WHEN avg_cus = 0 THEN duration_hours ELSE 0 END) AS idle_hours_saved
FROM heliosdb.billing_sessions
GROUP BY DATE(session_start)
ORDER BY date DESC;

-- Check savings
SELECT
    date,
    total_cost_usd,
    idle_hours_saved * 2.0 * 0.10 AS savings_usd,  -- Assume 2 CU baseline
    ROUND(100.0 * idle_hours_saved / 24.0, 1) AS idle_percent
FROM daily_cost_summary
WHERE date = CURRENT_DATE;
```

4. Application-Level Handling
```python
# Python application example
import time

import psycopg2

def query_with_retry(conn, query, max_retries=2):
    """Handle potential resume delays."""
    for attempt in range(max_retries):
        try:
            cursor = conn.cursor()
            cursor.execute(query)
            return cursor.fetchall()
        except psycopg2.OperationalError as e:
            if "resuming" in str(e).lower() and attempt < max_retries - 1:
                time.sleep(0.5)  # Wait for resume
                continue
            raise

# Usage
results = query_with_retry(conn, "SELECT * FROM users")
```

5. Testing Scale-to-Zero
```bash
#!/bin/bash
echo "Testing scale-to-zero functionality..."

# Check initial state
echo "1. Initial state:"
psql -c "SELECT node_id, state, current_cu FROM heliosdb.compute_nodes;"

# Wait for idle timeout + buffer
echo "2. Waiting for auto-suspend (5 minutes)..."
sleep 360

# Verify suspended
echo "3. Verifying suspended state:"
psql -c "SELECT node_id, state FROM heliosdb.compute_nodes;"

# Send query to trigger resume
echo "4. Triggering resume with query..."
time psql -c "SELECT COUNT(*) FROM users;"

# Check resume time
echo "5. Resume metrics:"
psql -c "SELECT total_duration_ms, phase_breakdown FROM heliosdb.last_resume_result();"
```

Integration with Monitoring
Prometheus Metrics
```yaml
scrape_configs:
  - job_name: 'heliosdb_autoscale'
    static_configs:
      - targets: ['heliosdb:9090']
    metrics_path: '/metrics'
```

Key Metrics:

- heliosdb_autoscale_suspend_total: Total suspends
- heliosdb_autoscale_resume_total: Total resumes
- heliosdb_autoscale_resume_duration_seconds: Resume latency histogram
- heliosdb_autoscale_active_nodes: Currently active nodes
- heliosdb_autoscale_suspended_nodes: Currently suspended nodes
- heliosdb_autoscale_cu_hours_total: Total CU-hours consumed
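Dashboard P95 values come from `histogram_quantile` over the resume-latency buckets. Its linear interpolation can be sketched in plain Python (illustrative only; Prometheus's real implementation handles more edge cases, such as empty or infinite buckets):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (le, count) histogram buckets,
    interpolating linearly inside the bucket that contains the rank."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# Resume latencies: 60 resumes <= 0.1s, 90 <= 0.2s, 100 <= 0.3s (cumulative)
buckets = [(0.1, 60), (0.2, 90), (0.3, 100)]
print(round(histogram_quantile(0.95, buckets), 3))  # 0.25 — P95 falls in the (0.2, 0.3] bucket
```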
Grafana Dashboard
```json
{
  "dashboard": {
    "title": "HeliosDB Scale-to-Zero",
    "panels": [
      {
        "title": "Node States",
        "targets": [{ "expr": "heliosdb_autoscale_active_nodes + heliosdb_autoscale_suspended_nodes" }]
      },
      {
        "title": "Resume Latency (P95)",
        "targets": [{ "expr": "histogram_quantile(0.95, heliosdb_autoscale_resume_duration_seconds_bucket)" }]
      },
      {
        "title": "Cost Savings",
        "targets": [{ "expr": "rate(heliosdb_autoscale_cu_hours_total[1h]) * 0.10" }]
      }
    ]
  }
}
```

Advanced Topics
Custom Idle Detection Logic
```sql
-- Create custom activity function
CREATE OR REPLACE FUNCTION is_truly_idle()
RETURNS boolean AS $$
BEGIN
    -- Consider node idle only if:
    -- 1. No active connections
    -- 2. No queries in last 5 minutes
    -- 3. No scheduled jobs
    RETURN (
        (SELECT COUNT(*) FROM pg_stat_activity WHERE state = 'active') = 0
        AND (SELECT COUNT(*) FROM heliosdb.query_activity
             WHERE timestamp > now() - interval '5 minutes') = 0
        AND (SELECT COUNT(*) FROM heliosdb.scheduled_jobs
             WHERE next_run < now() + interval '10 minutes') = 0
    );
END;
$$ LANGUAGE plpgsql;

-- Configure to use custom logic
ALTER SYSTEM SET heliosdb.custom_idle_check = 'is_truly_idle()';
```

Multi-Region Scale-to-Zero
```yaml
# Primary region config
autoscale:
  scale_to_zero:
    enabled: true
    coordinate_with_replicas: true  # Don't suspend if replicas are active
    suspend_only_if_all_idle: true
```

Conclusion
Scale-to-Zero in HeliosDB provides significant cost savings for variable workloads while maintaining sub-300ms resume times. By following the configuration and best practices in this guide, you can optimize costs without compromising user experience.
Key Takeaways:
- Resume time <300ms is imperceptible to users
- Potential 50-90% cost savings for intermittent workloads
- Requires minimal configuration
- Works transparently with existing applications
For more information:
- API Reference: /docs/api/autoscale.md
- Monitoring Guide: /docs/operations/monitoring-autoscale.md
- Cost Optimization: /docs/guides/cost-optimization.md