Scale-to-Zero Serverless Compute User Guide

Overview

HeliosDB’s Scale-to-Zero feature automatically suspends idle compute nodes and resumes them in under 300ms when a query arrives. For intermittent workloads this can cut compute costs substantially (potentially 50-90%, depending on how much of the time the node is idle) while keeping the database available on demand.

Benefits

  • Pay only for active compute time (potential 50-90% cost savings)
  • Sub-300ms resume time (imperceptible to users)
  • Automatic idle detection and suspension
  • No full cold start: buffer cache and session state are restored from a snapshot on resume
  • Seamless integration with existing applications

Key Features

  • Suspend time: <5 seconds
  • Resume time: <300ms (average ~170ms)
  • State persistence: <10MB snapshot
  • Resume success rate: >99.9%
  • Accurate per-second billing

Prerequisites

System Requirements

  • HeliosDB v3.2 or later with autoscale package
  • At least 2GB RAM for state persistence
  • S3-compatible storage for snapshots (optional)

Required Configuration

Minimum configuration in heliosdb.conf:

autoscale:
  enabled: true
  scale_to_zero:
    enabled: true
    idle_timeout_seconds: 300 # 5 minutes
  min_cu: 0.0 # Allow scale to zero
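The constraints in this minimal configuration can be checked before a restart. The following is a rough sketch, not part of HeliosDB: it assumes the configuration has already been loaded into a Python dict (for example with a YAML parser), that `min_cu` sits directly under `autoscale`, and the function name `check_scale_to_zero` is hypothetical.

```python
def check_scale_to_zero(config):
    """Return a list of problems that would prevent scale-to-zero (empty if none)."""
    problems = []
    autoscale = config.get("autoscale", {})
    s2z = autoscale.get("scale_to_zero", {})
    if not autoscale.get("enabled", False):
        problems.append("autoscale.enabled must be true")
    if not s2z.get("enabled", False):
        problems.append("autoscale.scale_to_zero.enabled must be true")
    # min_cu must be exactly zero, otherwise the node never fully suspends
    if autoscale.get("min_cu") != 0.0:
        problems.append("autoscale.min_cu must be 0.0")
    timeout = s2z.get("idle_timeout_seconds", 0)
    if not isinstance(timeout, (int, float)) or timeout <= 0:
        problems.append("autoscale.scale_to_zero.idle_timeout_seconds must be positive")
    return problems


# The minimal configuration from this section, expressed as a dict
minimal = {
    "autoscale": {
        "enabled": True,
        "min_cu": 0.0,
        "scale_to_zero": {"enabled": True, "idle_timeout_seconds": 300},
    }
}
print(check_scale_to_zero(minimal))  # prints []
```

Returning a list of messages rather than raising on the first problem makes it easier to report every misconfiguration at once.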

Step-by-Step Configuration

1. Enable Scale-to-Zero

Edit /etc/heliosdb/heliosdb.conf:

compute:
  lifecycle:
    suspend_timeout_seconds: 5
    resume_timeout_ms: 300
    resume_target_ms: 250
    checkpoint_transactions: true
    persist_buffer_cache: true
    max_cache_persist_mb: 10
autoscale:
  enabled: true
  min_cu: 0.0 # IMPORTANT: Must be 0.0 for scale-to-zero
  scale_to_zero:
    enabled: true
    idle_timeout_seconds: 300
    idle_check_interval_seconds: 30
state_persistence:
  enabled: true
  snapshot_path: /var/lib/heliosdb/snapshots
  max_snapshot_size_mb: 10
  compression: zstd

2. Configure Activity Monitoring

activity_monitor:
  idle_timeout_seconds: 300
  activity_threshold_seconds: 1
  activity_window_size: 1000
  sample_interval_seconds: 10
  # Customize what counts as "activity"
  track_readonly_queries: true
  track_system_queries: false
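To illustrate how these settings interact, here is a simplified model of idle detection. It is only a sketch of the idea, not HeliosDB's implementation: the class `IdleTracker` and its methods are hypothetical, timestamps are plain seconds supplied by the caller, and it assumes a query that is not tracked simply does not reset the idle timer.

```python
class IdleTracker:
    """Tracks the time of the last counted query and reports idleness."""

    def __init__(self, start, idle_timeout_seconds=300,
                 track_readonly_queries=True, track_system_queries=False):
        self.idle_timeout_seconds = idle_timeout_seconds
        self.track_readonly_queries = track_readonly_queries
        self.track_system_queries = track_system_queries
        self.last_activity = start  # the idle timer starts when the node becomes active

    def record_query(self, now, readonly=False, system=False):
        """Record a query at time `now`; return True if it counted as activity."""
        if system and not self.track_system_queries:
            return False
        if readonly and not self.track_readonly_queries:
            return False
        self.last_activity = now
        return True

    def is_idle(self, now):
        """True once idle_timeout_seconds have passed since the last counted query."""
        return now - self.last_activity >= self.idle_timeout_seconds


tracker = IdleTracker(start=0.0)
tracker.record_query(now=10.0)                # user query: resets the timer
tracker.record_query(now=200.0, system=True)  # system query: ignored
print(tracker.is_idle(now=300.0))  # False (290s since the last counted query)
print(tracker.is_idle(now=310.0))  # True  (300s since the last counted query)
```

With `track_system_queries: false`, background monitoring queries do not keep the node awake, which is usually what you want for scale-to-zero.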

3. Set Up Billing Tracking

billing:
  enabled: true
  cu_config:
    cpu_cores_per_cu: 1.0
    memory_mb_per_cu: 2048
    price_per_cu_hour: 0.10
  export:
    format: prometheus
    endpoint: http://prometheus:9090/metrics
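The arithmetic behind per-second CU billing can be sketched as follows. This assumes that cost is simply CU-hours multiplied by `price_per_cu_hour` and that a suspended node accrues no charge; the function name `session_cost` is hypothetical.

```python
def session_cost(avg_cus, duration_seconds, price_per_cu_hour=0.10):
    """Return (cu_hours, cost_usd) for one active session billed per second."""
    cu_hours = avg_cus * duration_seconds / 3600.0
    return cu_hours, cu_hours * price_per_cu_hour


# A 15-minute session at 2 CUs
cu_hours, cost = session_cost(avg_cus=2.0, duration_seconds=15 * 60)
print(f"{cu_hours:.2f} CU-hours, ${cost:.2f}")  # prints "0.50 CU-hours, $0.05"
```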

4. Restart and Verify

# Restart HeliosDB
sudo systemctl restart heliosdb
# Check status
heliosdb-cli status
# Verify scale-to-zero is active
heliosdb-cli autoscale status

Expected output:

Autoscale Status: ENABLED
Scale-to-Zero: ENABLED
Current State: ACTIVE
Current CUs: 2.0
Idle Timeout: 300s
Time Since Last Activity: 45s

SQL Examples with Explanations

Monitoring Node State

-- Check current compute node state
SELECT
node_id,
state, -- Active, Suspended, Resuming
current_cu,
last_activity,
idle_duration_seconds
FROM heliosdb.compute_nodes;

Example Output:

node_id | state | current_cu | last_activity | idle_duration_seconds
---------+--------+------------+---------------+----------------------
node-1 | Active | 2.0 | 10s ago | 0

Manual Suspend/Resume

-- Manually suspend a node (useful for maintenance)
SELECT heliosdb.suspend_node('node-1');
-- Check suspend result
SELECT * FROM heliosdb.last_suspend_result();

Output:

{
"node_id": "node-1",
"duration_ms": 820,
"snapshot_size_mb": 8.2,
"phase_breakdown": {
"checkpoint_transactions_ms": 102,
"flush_buffer_cache_ms": 498,
"persist_snapshot_ms": 195,
"release_resources_ms": 25
},
"success": true
}

-- Manually resume a node
SELECT heliosdb.resume_node('node-1');
-- Check resume result
SELECT * FROM heliosdb.last_resume_result();

Output:

{
"node_id": "node-1",
"total_duration_ms": 168,
"phase_breakdown": {
"allocate_resources_ms": 38,
"restore_buffer_cache_ms": 58,
"restore_session_state_ms": 29,
"mark_ready_ms": 1,
"overhead_ms": 42
},
"cache_hit_rate": 0.85,
"success": true
}
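When a resume is slower than expected, the phase breakdown shows where the time went. The snippet below is a small sketch that parses the sample output above and reports the dominant phase; the helper name `slowest_phase` is hypothetical, and it assumes the result is available to the application as a JSON string.

```python
import json

resume_result = json.loads("""
{
  "node_id": "node-1",
  "total_duration_ms": 168,
  "phase_breakdown": {
    "allocate_resources_ms": 38,
    "restore_buffer_cache_ms": 58,
    "restore_session_state_ms": 29,
    "mark_ready_ms": 1,
    "overhead_ms": 42
  },
  "cache_hit_rate": 0.85,
  "success": true
}
""")


def slowest_phase(result):
    """Return (phase_name, share_of_total) for the longest resume phase."""
    phases = result["phase_breakdown"]
    name = max(phases, key=phases.get)
    return name, phases[name] / result["total_duration_ms"]


name, share = slowest_phase(resume_result)
print(f"{name}: {share:.0%} of resume time")  # restore_buffer_cache_ms: 35% of resume time
```

In this sample, restoring the buffer cache dominates, which is why reducing the persisted cache size is one of the first tuning options for slow resumes.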

Viewing Activity History

-- View recent query activity
SELECT
timestamp,
query_type,
duration_ms,
rows_affected
FROM heliosdb.query_activity
WHERE timestamp > now() - interval '1 hour'
ORDER BY timestamp DESC
LIMIT 20;

Checking Billing Information

-- View CU-hour consumption
SELECT
node_id,
session_start,
session_end,
avg_cus,
duration_hours,
cu_hours,
cost_usd
FROM heliosdb.billing_sessions
WHERE session_start > now() - interval '24 hours'
ORDER BY session_start DESC;

Example Output:

node_id | session_start | session_end | avg_cus | duration_hours | cu_hours | cost_usd
---------+---------------+-------------+---------+----------------+----------+----------
node-1 | 10:00:00 | 10:15:00 | 2.0 | 0.25 | 0.50 | $0.05
node-1 | 14:30:00 | 15:00:00 | 4.0 | 0.50 | 2.00 | $0.20

Common Use Cases

Use Case 1: Development/Staging Environment

Scenario: Dev database used only during business hours (9 AM - 5 PM)

# heliosdb.conf
autoscale:
  scale_to_zero:
    enabled: true
    idle_timeout_seconds: 600 # 10 minutes idle time
# Automatically suspends after hours

Cost Savings:

  • Active time: 8 hours/day = ~33% of day
  • Idle time scaled to zero: 16 hours/day = ~67% savings
  • Monthly savings: ~$96 on a 2 CU instance at $0.10/CU-hour (16 idle hours × 30 days × 2 CUs × $0.10)

Use Case 2: Batch Processing Workload

Scenario: Database receives batches every 4 hours

-- Application workflow
BEGIN;
-- Batch arrives, database auto-resumes (<300ms)
SELECT heliosdb.current_node_state(); -- Returns: Resuming → Active
INSERT INTO events SELECT * FROM batch_staging;
-- Process 1M records
COMMIT;
-- After completion, idle timer starts
-- After 5 minutes idle → auto-suspend

Cost Savings:

  • Active time: ~15 min/4 hours = 6.25% of time
  • Savings: ~94% compared to always-on

Use Case 3: Low-Traffic Web Application

Scenario: Internal dashboard with sporadic usage

autoscale:
  scale_to_zero:
    enabled: true
    idle_timeout_seconds: 180 # 3 minutes
# Quick suspend for quick savings

User Experience:

  1. User opens dashboard
  2. Query arrives at suspended node
  3. Auto-resume in ~170ms
  4. Dashboard loads (feels instant to user)

Use Case 4: Testing/CI Pipeline

Scenario: Database for automated tests

ci-test.sh
#!/bin/bash
# Tests trigger resume automatically
npm run test:database
# No manual shutdown needed
# Database auto-suspends after 5 min idle

Cost Savings:

  • Test runs: ~10 min/day
  • Always-on cost: 24 hours × $0.10/CU-hour = $2.40/day
  • Scale-to-zero cost: 10 min × $0.10/CU-hour = $0.017/day
  • Savings: ~99%

Troubleshooting

Issue: Resume Taking Too Long

Symptom: Resume time >500ms, users notice latency

Diagnosis:

-- Check recent resume times
SELECT
timestamp,
total_duration_ms,
phase_breakdown
FROM heliosdb.resume_history
ORDER BY timestamp DESC
LIMIT 10;

Solution 1: Reduce Snapshot Size

# heliosdb.conf
compute:
  lifecycle:
    max_cache_persist_mb: 5 # Reduce from 10MB
state_persistence:
  compress_sessions: true

Solution 2: Increase Resume Budget

compute:
  lifecycle:
    resume_timeout_ms: 500 # Allow more time

Solution 3: Keep Warm

autoscale:
  scale_to_zero:
    idle_timeout_seconds: 900 # 15 min (longer wait before suspend)

Issue: Frequent Suspend/Resume Cycles

Symptom: Node oscillating between suspended/active

Diagnosis:

-- Count suspend/resume events
SELECT
DATE_TRUNC('hour', timestamp) as hour,
COUNT(*) as cycle_count
FROM heliosdb.lifecycle_events
WHERE event_type IN ('suspend', 'resume')
GROUP BY hour
ORDER BY hour DESC;

Solution: Increase Idle Timeout

autoscale:
  scale_to_zero:
    idle_timeout_seconds: 600 # Increase from 300
# Adds hysteresis to prevent flapping
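If lifecycle events are exported to an application or script, flapping can also be detected programmatically. This is a minimal sketch under stated assumptions: events arrive as `(timestamp, event_type)` pairs mirroring the `heliosdb.lifecycle_events` query above, and the threshold of four suspends per hour is an arbitrary illustration rather than a HeliosDB recommendation. Both function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def suspends_per_hour(events):
    """Count suspend events per clock hour. events: iterable of (datetime, event_type)."""
    counts = Counter()
    for ts, event_type in events:
        if event_type == "suspend":
            counts[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return counts


def flapping_hours(events, max_suspends_per_hour=4):
    """Return the hours in which the node suspended more often than the threshold."""
    counts = suspends_per_hour(events)
    return sorted(hour for hour, n in counts.items() if n > max_suspends_per_hour)


events = []
for minute in range(0, 60, 10):  # six suspend/resume cycles between 09:00 and 10:00
    events.append((datetime(2024, 1, 1, 9, minute), "suspend"))
    events.append((datetime(2024, 1, 1, 9, minute + 5), "resume"))
events.append((datetime(2024, 1, 1, 10, 30), "suspend"))  # a single suspend in the next hour

print(flapping_hours(events))  # only the 09:00 hour exceeds the threshold
```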

Issue: Snapshot Size Too Large

Symptom:

ERROR: Snapshot size 12.3 MB exceeds limit of 10 MB

Diagnosis:

SELECT
node_id,
snapshot_size_mb,
session_count,
buffer_cache_size_mb
FROM heliosdb.node_snapshots
ORDER BY snapshot_size_mb DESC;

Solution 1: Reduce Buffer Cache

compute:
  lifecycle:
    max_cache_persist_mb: 5 # Reduce persisted cache

Solution 2: Clean Up Sessions

-- Close idle sessions before suspend
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '5 minutes';

Issue: Resume Failures

Symptom: Resume success rate <99%

Diagnosis:

SELECT
error_message,
COUNT(*) as occurrence_count
FROM heliosdb.resume_errors
WHERE timestamp > now() - interval '24 hours'
GROUP BY error_message
ORDER BY occurrence_count DESC;

Common Errors and Solutions:

  1. Resource Allocation Failed
     -- Check available resources
     SELECT * FROM heliosdb.resource_availability;
     -- Solution: Increase resource pool
  2. Snapshot Corruption
     -- Check snapshot integrity
     SELECT heliosdb.verify_snapshot('node-1');
     -- Solution: Force fresh snapshot
     SELECT heliosdb.recreate_snapshot('node-1');
  3. Timeout Exceeded
     # Increase resume timeout
     compute:
       lifecycle:
         resume_timeout_ms: 500

Performance Tuning Tips

1. Optimize Snapshot Creation

state_persistence:
  # Use fast compression
  compression: lz4 # Faster than zstd
  # Reduce snapshot frequency during suspend
  incremental_snapshots: true
  # Skip non-critical state
  persist_temp_tables: false
  persist_prepared_statements_cache: true # Keep important state

2. Tune Idle Detection

activity_monitor:
  # Ignore system queries for idle detection
  track_system_queries: false
  track_readonly_queries: true
  # Adjust sensitivity
  activity_threshold_seconds: 5 # Lower = more sensitive

3. Pre-warm Cache on Resume

compute:
  lifecycle:
    # Load hot pages first
    prioritize_hot_pages: true
    hot_page_threshold: 100 # Most accessed pages
    # Parallel cache loading
    parallel_cache_restore: true
    cache_restore_threads: 4

4. Connection Pooling

# Application code
import psycopg2.pool

# Use connection pooling to reduce resume overhead
pool = psycopg2.pool.SimpleConnectionPool(
    minconn=1,
    maxconn=10,
    host="heliosdb.example.com",
    database="mydb",
    # Connection stays open, no repeated resumes
)

5. Batch Queries

# `db` stands for your database client; parameters are passed separately
# from the SQL text to avoid SQL injection.
# Instead of multiple queries (triggers multiple resumes):
for user in users:
    db.query("SELECT * FROM orders WHERE user_id = %s", (user.id,))
# Batch into single query:
user_ids = [user.id for user in users]
db.query("SELECT * FROM orders WHERE user_id = ANY(%s)", (user_ids,))

Best Practices

1. Set Appropriate Idle Timeouts

# Development: Quick suspend
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 180 # 3 minutes
# Staging: Moderate
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 600 # 10 minutes
# Production: Conservative (if using scale-to-zero)
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 1800 # 30 minutes

2. Monitor Resume Latency

-- Create alert for slow resumes
CREATE OR REPLACE FUNCTION check_resume_performance()
RETURNS void AS $$
DECLARE
  avg_resume_ms float;
BEGIN
  SELECT AVG(total_duration_ms) INTO avg_resume_ms
  FROM heliosdb.resume_history
  WHERE timestamp > now() - interval '1 hour';
  IF avg_resume_ms > 300 THEN
    -- RAISE supports only the bare % placeholder, so round explicitly
    RAISE WARNING 'Average resume time % ms exceeds 300ms target',
      round(avg_resume_ms);
  END IF;
END;
$$ LANGUAGE plpgsql;
-- Schedule check every 15 minutes (cron.schedule as provided by the pg_cron extension)
SELECT cron.schedule('check-resume-perf', '*/15 * * * *',
  'SELECT check_resume_performance()');

3. Cost Monitoring

-- Daily cost report
CREATE VIEW daily_cost_summary AS
SELECT
DATE(session_start) as date,
SUM(cu_hours) as total_cu_hours,
SUM(cost_usd) as total_cost_usd,
COUNT(*) as session_count,
SUM(CASE WHEN avg_cus = 0 THEN duration_hours ELSE 0 END) as idle_hours_saved
FROM heliosdb.billing_sessions
GROUP BY DATE(session_start)
ORDER BY date DESC;
-- Check savings
SELECT
date,
total_cost_usd,
idle_hours_saved * 2.0 * 0.10 as savings_usd, -- Assume 2 CU baseline
ROUND(100.0 * idle_hours_saved / 24.0, 1) as idle_percent
FROM daily_cost_summary
WHERE date = CURRENT_DATE;

4. Application-Level Handling

# Python application example
import psycopg2
import time

def query_with_retry(conn, query, max_retries=2):
    """Handle potential resume delays"""
    for attempt in range(max_retries):
        try:
            cursor = conn.cursor()
            cursor.execute(query)
            return cursor.fetchall()
        except psycopg2.OperationalError as e:
            if "resuming" in str(e).lower() and attempt < max_retries - 1:
                time.sleep(0.5)  # Wait for resume
                continue
            raise

# Usage
results = query_with_retry(conn, "SELECT * FROM users")

5. Testing Scale-to-Zero

test-scale-to-zero.sh
#!/bin/bash
echo "Testing scale-to-zero functionality..."
# Check initial state
echo "1. Initial state:"
psql -c "SELECT node_id, state, current_cu FROM heliosdb.compute_nodes;"
# Wait for idle timeout + buffer
echo "2. Waiting for auto-suspend (5 minutes)..."
sleep 360
# Verify suspended
echo "3. Verifying suspended state:"
psql -c "SELECT node_id, state FROM heliosdb.compute_nodes;"
# Send query to trigger resume
echo "4. Triggering resume with query..."
time psql -c "SELECT COUNT(*) FROM users;"
# Check resume time
echo "5. Resume metrics:"
psql -c "SELECT total_duration_ms, phase_breakdown FROM heliosdb.last_resume_result();"

Integration with Monitoring

Prometheus Metrics

# prometheus.yml
scrape_configs:
  - job_name: 'heliosdb_autoscale'
    static_configs:
      - targets: ['heliosdb:9090']
    metrics_path: '/metrics'

Key Metrics:

  • heliosdb_autoscale_suspend_total: Total suspends
  • heliosdb_autoscale_resume_total: Total resumes
  • heliosdb_autoscale_resume_duration_seconds: Resume latency histogram
  • heliosdb_autoscale_active_nodes: Currently active nodes
  • heliosdb_autoscale_suspended_nodes: Currently suspended nodes
  • heliosdb_autoscale_cu_hours_total: Total CU-hours consumed

Grafana Dashboard

{
  "dashboard": {
    "title": "HeliosDB Scale-to-Zero",
    "panels": [
      {
        "title": "Node States",
        "targets": [
          {"expr": "heliosdb_autoscale_active_nodes"},
          {"expr": "heliosdb_autoscale_suspended_nodes"}
        ]
      },
      {
        "title": "Resume Latency (P95)",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum by (le) (rate(heliosdb_autoscale_resume_duration_seconds_bucket[5m])))"
        }]
      },
      {
        "title": "Compute Cost (USD, last hour)",
        "targets": [{
          "expr": "increase(heliosdb_autoscale_cu_hours_total[1h]) * 0.10"
        }]
      }
    ]
  }
}

Advanced Topics

Custom Idle Detection Logic

-- Create custom activity function
CREATE OR REPLACE FUNCTION is_truly_idle()
RETURNS boolean AS $$
BEGIN
  -- Consider node idle only if:
  -- 1. No other session is running a query
  -- 2. No queries in last 5 minutes
  -- 3. No scheduled jobs due in the next 10 minutes
  RETURN (
    -- Exclude this backend, which is always 'active' while the check runs
    (SELECT COUNT(*) FROM pg_stat_activity
     WHERE state = 'active' AND pid <> pg_backend_pid()) = 0
    AND
    (SELECT COUNT(*) FROM heliosdb.query_activity
     WHERE timestamp > now() - interval '5 minutes') = 0
    AND
    (SELECT COUNT(*) FROM heliosdb.scheduled_jobs
     WHERE next_run < now() + interval '10 minutes') = 0
  );
END;
$$ LANGUAGE plpgsql;
-- Configure to use custom logic
ALTER SYSTEM SET heliosdb.custom_idle_check = 'is_truly_idle()';

Multi-Region Scale-to-Zero

# Primary region config
autoscale:
  scale_to_zero:
    enabled: true
    coordinate_with_replicas: true
    # Don't suspend if replicas are active
    suspend_only_if_all_idle: true

Conclusion

Scale-to-Zero in HeliosDB provides significant cost savings for variable workloads while maintaining sub-300ms resume times. By following the configuration and best practices in this guide, you can optimize costs without compromising user experience.

Key Takeaways:

  • Resume time <300ms is imperceptible to users
  • Potential 50-90% cost savings for intermittent workloads
  • Requires minimal configuration
  • Works transparently with existing applications

For more information:

  • API Reference: /docs/api/autoscale.md
  • Monitoring Guide: /docs/operations/monitoring-autoscale.md
  • Cost Optimization: /docs/guides/cost-optimization.md