Scale-to-Zero Serverless Compute User Guide
Overview
HeliosDB’s Scale-to-Zero feature automatically suspends idle compute nodes and resumes them in under 300ms when queries arrive, dramatically reducing costs for intermittent workloads while maintaining instant availability.
Benefits
- Pay only for active compute time (potential 50-90% cost savings)
- Sub-300ms resume time (imperceptible to users)
- Automatic idle detection and suspension
- No cold-start penalties
- Seamless integration with existing applications
Key Features
- Suspend time: <5 seconds
- Resume time: <300ms (average ~170ms)
- State persistence: <10MB snapshot
- Resume success rate: >99.9%
- Accurate per-second billing
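Per-second billing means a short burst of activity costs only its active seconds. As a rough back-of-the-envelope helper (a hypothetical sketch, not the metering implementation; it assumes the $0.10/CU-hour price used elsewhere in this guide):

```python
def cost_usd(active_seconds: float, cus: float, price_per_cu_hour: float = 0.10) -> float:
    """Estimate the cost of a burst of activity billed per second."""
    cu_hours = cus * active_seconds / 3600.0
    return cu_hours * price_per_cu_hour

# A 90-second burst on a 2-CU node costs half a cent:
print(round(cost_usd(90, 2.0), 4))  # 0.005
```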
Prerequisites
System Requirements
- HeliosDB v3.2 or later with autoscale package
- At least 2GB RAM for state persistence
- S3-compatible storage for snapshots (optional)
Required Configuration
Minimum configuration in heliosdb.conf:
```yaml
autoscale:
  enabled: true
  scale_to_zero:
    enabled: true
    idle_timeout_seconds: 300  # 5 minutes
  min_cu: 0.0                  # Allow scale to zero
```

Step-by-Step Configuration
1. Enable Scale-to-Zero
Edit /etc/heliosdb/heliosdb.conf:
```yaml
compute:
  lifecycle:
    suspend_timeout_seconds: 5
    resume_timeout_ms: 300
    resume_target_ms: 250
    checkpoint_transactions: true
    persist_buffer_cache: true
    max_cache_persist_mb: 10
```
```yaml
autoscale:
  enabled: true
  min_cu: 0.0  # IMPORTANT: Must be 0.0 for scale-to-zero
  scale_to_zero:
    enabled: true
    idle_timeout_seconds: 300
    idle_check_interval_seconds: 30
```
```yaml
state_persistence:
  enabled: true
  snapshot_path: /var/lib/heliosdb/snapshots
  max_snapshot_size_mb: 10
  compression: zstd
```

2. Configure Activity Monitoring
```yaml
activity_monitor:
  idle_timeout_seconds: 300
  activity_threshold_seconds: 1
  activity_window_size: 1000
  sample_interval_seconds: 10

  # Customize what counts as "activity"
  track_readonly_queries: true
  track_system_queries: false
```

3. Set Up Billing Tracking
```yaml
billing:
  enabled: true
  cu_config:
    cpu_cores_per_cu: 1.0
    memory_mb_per_cu: 2048
  price_per_cu_hour: 0.10
  export:
    format: prometheus
    endpoint: http://prometheus:9090/metrics
```

4. Restart and Verify
```bash
# Restart HeliosDB
sudo systemctl restart heliosdb

# Check status
heliosdb-cli status

# Verify scale-to-zero is active
heliosdb-cli autoscale status
```

Expected output:

```
Autoscale Status: ENABLED
Scale-to-Zero: ENABLED
Current State: ACTIVE
Current CUs: 2.0
Idle Timeout: 300s
Time Since Last Activity: 45s
```

SQL Examples with Explanations
Monitoring Node State
```sql
-- Check current compute node state
SELECT
    node_id,
    state,  -- Active, Suspended, Resuming
    current_cu,
    last_activity,
    idle_duration_seconds
FROM heliosdb.compute_nodes;
```

Example Output:

```
 node_id | state  | current_cu | last_activity | idle_duration_seconds
---------+--------+------------+---------------+-----------------------
 node-1  | Active | 2.0        | 10s ago       | 0
```

Manual Suspend/Resume
```sql
-- Manually suspend a node (useful for maintenance)
SELECT heliosdb.suspend_node('node-1');

-- Check suspend result
SELECT * FROM heliosdb.last_suspend_result();
```

Output:

```json
{
  "node_id": "node-1",
  "duration_ms": 820,
  "snapshot_size_mb": 8.2,
  "phase_breakdown": {
    "checkpoint_transactions_ms": 102,
    "flush_buffer_cache_ms": 498,
    "persist_snapshot_ms": 195,
    "release_resources_ms": 25
  },
  "success": true
}
```

```sql
-- Manually resume a node
SELECT heliosdb.resume_node('node-1');

-- Check resume result
SELECT * FROM heliosdb.last_resume_result();
```

Output:

```json
{
  "node_id": "node-1",
  "total_duration_ms": 168,
  "phase_breakdown": {
    "allocate_resources_ms": 38,
    "restore_buffer_cache_ms": 58,
    "restore_session_state_ms": 29,
    "mark_ready_ms": 1,
    "overhead_ms": 42
  },
  "cache_hit_rate": 0.85,
  "success": true
}
```

Viewing Activity History
```sql
-- View recent query activity
SELECT
    timestamp,
    query_type,
    duration_ms,
    rows_affected
FROM heliosdb.query_activity
WHERE timestamp > now() - interval '1 hour'
ORDER BY timestamp DESC
LIMIT 20;
```

Checking Billing Information

```sql
-- View CU-hour consumption
SELECT
    node_id,
    session_start,
    session_end,
    avg_cus,
    duration_hours,
    cu_hours,
    cost_usd
FROM heliosdb.billing_sessions
WHERE session_start > now() - interval '24 hours'
ORDER BY session_start DESC;
```

Example Output:

```
 node_id | session_start | session_end | avg_cus | duration_hours | cu_hours | cost_usd
---------+---------------+-------------+---------+----------------+----------+----------
 node-1  | 10:00:00      | 10:15:00    | 2.0     | 0.25           | 0.50     | $0.05
 node-1  | 14:30:00      | 15:00:00    | 4.0     | 0.50           | 2.00     | $0.20
```

Common Use Cases
Use Case 1: Development/Staging Environment
Scenario: Dev database used only during business hours (9 AM - 5 PM)
```yaml
# heliosdb.conf
autoscale:
  scale_to_zero:
    enabled: true
    idle_timeout_seconds: 600  # 10 minutes idle time
    # Automatically suspends after hours
```

Cost Savings:

- Active time: 8 hours/day = ~33% of the day
- Idle time scaled to zero: 16 hours/day = ~67% savings
- Monthly savings: ~$96 on a 2 CU instance (16 h/day × 30 days × 2 CU × $0.10/CU-hour)
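The monthly figure can be sanity-checked with the guide's own numbers (a sketch; assumes the $0.10/CU-hour price and 30 billable days per month):

```python
# Monthly savings for a dev database idle 16 h/day on a 2-CU instance
idle_hours_per_day = 16
cus = 2.0
price_per_cu_hour = 0.10
days = 30

monthly_savings = idle_hours_per_day * days * cus * price_per_cu_hour
print(monthly_savings)  # 96.0
```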
Use Case 2: Batch Processing Workload
Scenario: Database receives batches every 4 hours
```sql
-- Application workflow
BEGIN;

-- Batch arrives, database auto-resumes (<300ms)
SELECT heliosdb.current_node_state();  -- Returns: Resuming → Active

INSERT INTO events SELECT * FROM batch_staging;  -- Process 1M records

COMMIT;

-- After completion, the idle timer starts
-- After 5 minutes idle → auto-suspend
```

Cost Savings:
- Active time: ~15 min/4 hours = 6.25% of time
- Savings: ~94% compared to always-on
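The duty-cycle arithmetic above can be checked directly (plain Python, using the guide's figures):

```python
# Batch workload: ~15 active minutes out of every 4 hours
active_min = 15
period_min = 4 * 60

duty_cycle = active_min / period_min
savings_pct = 100 * (1 - duty_cycle)

print(round(duty_cycle * 100, 2), round(savings_pct, 2))  # 6.25 93.75
```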
Use Case 3: Low-Traffic Web Application
Scenario: Internal dashboard with sporadic usage
```yaml
autoscale:
  scale_to_zero:
    enabled: true
    idle_timeout_seconds: 180  # 3 minutes
    # Quick suspend for quick savings
```

User Experience:

```
User opens dashboard
  ↓
Query arrives at suspended node
  ↓
Auto-resume in ~170ms
  ↓
Dashboard loads (feels instant to user)
```

Use Case 4: Testing/CI Pipeline
Scenario: Database for automated tests
```bash
#!/bin/bash
# Tests trigger resume automatically
npm run test:database

# No manual shutdown needed
# Database auto-suspends after 5 min idle
```

Cost Savings:
- Test runs: ~10 min/day
- Always-on cost (1 CU): 24 hours × $0.10/CU-hour = $2.40/day
- Scale-to-zero cost: ~10 min of activity × $0.10/CU-hour ≈ $0.017/day
- Savings: ~99%
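The same comparison in code (a sketch; assumes a 1-CU node at the $0.10/CU-hour price):

```python
# Daily cost comparison for a CI database (1 CU at $0.10/CU-hour)
price = 0.10
always_on = 24 * price             # $2.40/day
scale_to_zero = (10 / 60) * price  # ~10 min of tests per day

savings_pct = 100 * (1 - scale_to_zero / always_on)
print(round(scale_to_zero, 3), round(savings_pct, 1))  # 0.017 99.3
```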
Troubleshooting
Issue: Resume Taking Too Long
Symptom: Resume time >500ms, users notice latency
Diagnosis:
```sql
-- Check recent resume times
SELECT
    timestamp,
    total_duration_ms,
    phase_breakdown
FROM heliosdb.resume_history
ORDER BY timestamp DESC
LIMIT 10;
```

Solution 1: Reduce Snapshot Size

```yaml
# heliosdb.conf
state_persistence:
  max_cache_persist_mb: 5  # Reduce from 10MB
  compress_sessions: true
```

Solution 2: Increase Resume Budget

```yaml
compute:
  lifecycle:
    resume_timeout_ms: 500  # Allow more time
```

Solution 3: Keep Warm

```yaml
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 900  # 15 min (longer wait before suspend)
```

Issue: Frequent Suspend/Resume Cycles
Symptom: Node oscillating between suspended/active
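Why a longer idle timeout stops the oscillation can be illustrated with a tiny model (hypothetical, plain Python): queries arriving every 7 minutes against a 5-minute versus a 10-minute timeout.

```python
def suspend_cycles_per_day(query_gap_min: float, idle_timeout_min: float) -> int:
    """Model: evenly spaced queries; a suspend happens in every
    gap that is longer than the idle timeout."""
    if query_gap_min <= idle_timeout_min:
        return 0  # node never stays idle long enough to suspend
    return int(24 * 60 // query_gap_min)

print(suspend_cycles_per_day(7, 5))   # 205 suspend/resume cycles per day
print(suspend_cycles_per_day(7, 10))  # 0 — the timeout outlasts the gap
```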
Diagnosis:
```sql
-- Count suspend/resume events
SELECT
    DATE_TRUNC('hour', timestamp) AS hour,
    COUNT(*) AS cycle_count
FROM heliosdb.lifecycle_events
WHERE event_type IN ('suspend', 'resume')
GROUP BY hour
ORDER BY hour DESC;
```

Solution: Increase Idle Timeout

```yaml
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 600  # Increase from 300
    # Adds hysteresis to prevent flapping
```

Issue: Snapshot Size Too Large
Symptom:
```
ERROR: Snapshot size 12.3 MB exceeds limit of 10 MB
```

Diagnosis:

```sql
SELECT
    node_id,
    snapshot_size_mb,
    session_count,
    buffer_cache_size_mb
FROM heliosdb.node_snapshots
ORDER BY snapshot_size_mb DESC;
```

Solution 1: Reduce Buffer Cache

```yaml
compute:
  lifecycle:
    max_cache_persist_mb: 5  # Reduce persisted cache
```

Solution 2: Clean Up Sessions

```sql
-- Close idle sessions before suspend
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '5 minutes';
```

Issue: Resume Failures
Symptom: Resume success rate <99%
Diagnosis:
```sql
SELECT
    error_message,
    COUNT(*) AS occurrence_count
FROM heliosdb.resume_errors
WHERE timestamp > now() - interval '24 hours'
GROUP BY error_message
ORDER BY occurrence_count DESC;
```

Common Errors and Solutions:

- Resource Allocation Failed

```sql
-- Check available resources
SELECT * FROM heliosdb.resource_availability;

-- Solution: Increase resource pool
```

- Snapshot Corruption

```sql
-- Check snapshot integrity
SELECT heliosdb.verify_snapshot('node-1');

-- Solution: Force fresh snapshot
SELECT heliosdb.recreate_snapshot('node-1');
```

- Timeout Exceeded

```yaml
# Increase resume timeout
compute:
  lifecycle:
    resume_timeout_ms: 500
```

Performance Tuning Tips
1. Optimize Snapshot Creation
```yaml
state_persistence:
  # Use fast compression
  compression: lz4  # Faster than zstd

  # Reduce snapshot frequency during suspend
  incremental_snapshots: true

  # Skip non-critical state
  persist_temp_tables: false
  persist_prepared_statements_cache: true  # Keep important state
```

2. Tune Idle Detection
```yaml
activity_monitor:
  # Ignore system queries for idle detection
  track_system_queries: false
  track_readonly_queries: true

  # Adjust sensitivity
  activity_threshold_seconds: 5  # Lower = more sensitive
```

3. Pre-warm Cache on Resume
```yaml
compute:
  lifecycle:
    # Load hot pages first
    prioritize_hot_pages: true
    hot_page_threshold: 100  # Most accessed pages

    # Parallel cache loading
    parallel_cache_restore: true
    cache_restore_threads: 4
```

4. Connection Pooling
```python
# Application code
import psycopg2.pool

# Use connection pooling to reduce resume overhead
pool = psycopg2.pool.SimpleConnectionPool(
    minconn=1,
    maxconn=10,
    host="heliosdb.example.com",
    database="mydb",
    # Connections stay open, so repeated queries don't trigger repeated resumes
)
```

5. Batch Queries
```python
# Instead of multiple queries (each may trigger a separate resume):
for user in users:
    db.query("SELECT * FROM orders WHERE user_id = %s", (user.id,))

# Batch into a single query (one round trip, at most one resume;
# parameter binding also avoids SQL injection):
user_ids = [user.id for user in users]
db.query("SELECT * FROM orders WHERE user_id = ANY(%s)", (user_ids,))
```

Best Practices
1. Set Appropriate Idle Timeouts
```yaml
# Development: Quick suspend
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 180  # 3 minutes

# Staging: Moderate
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 600  # 10 minutes

# Production: Conservative (if using scale-to-zero)
autoscale:
  scale_to_zero:
    idle_timeout_seconds: 1800  # 30 minutes
```

2. Monitor Resume Latency
```sql
-- Create alert for slow resumes
CREATE OR REPLACE FUNCTION check_resume_performance()
RETURNS void AS $$
DECLARE
    avg_resume_ms float;
BEGIN
    SELECT AVG(total_duration_ms)
    INTO avg_resume_ms
    FROM heliosdb.resume_history
    WHERE timestamp > now() - interval '1 hour';

    IF avg_resume_ms > 300 THEN
        RAISE WARNING 'Average resume time % ms exceeds 300ms target', round(avg_resume_ms);
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Schedule check
SELECT cron.schedule('check-resume-perf', '*/15 * * * *', 'SELECT check_resume_performance()');
```

3. Cost Monitoring
```sql
-- Daily cost report
CREATE VIEW daily_cost_summary AS
SELECT
    DATE(session_start) AS date,
    SUM(cu_hours) AS total_cu_hours,
    SUM(cost_usd) AS total_cost_usd,
    COUNT(*) AS session_count,
    SUM(CASE WHEN avg_cus = 0 THEN duration_hours ELSE 0 END) AS idle_hours_saved
FROM heliosdb.billing_sessions
GROUP BY DATE(session_start)
ORDER BY date DESC;

-- Check savings
SELECT
    date,
    total_cost_usd,
    idle_hours_saved * 2.0 * 0.10 AS savings_usd,  -- Assume 2 CU baseline
    ROUND(100.0 * idle_hours_saved / 24.0, 1) AS idle_percent
FROM daily_cost_summary
WHERE date = CURRENT_DATE;
```

4. Application-Level Handling
```python
# Python application example
import time

import psycopg2

def query_with_retry(conn, query, max_retries=2):
    """Handle potential resume delays."""
    for attempt in range(max_retries):
        try:
            cursor = conn.cursor()
            cursor.execute(query)
            return cursor.fetchall()
        except psycopg2.OperationalError as e:
            if "resuming" in str(e).lower() and attempt < max_retries - 1:
                time.sleep(0.5)  # Wait for resume
                continue
            raise

# Usage
results = query_with_retry(conn, "SELECT * FROM users")
```

5. Testing Scale-to-Zero
```bash
#!/bin/bash
echo "Testing scale-to-zero functionality..."

# Check initial state
echo "1. Initial state:"
psql -c "SELECT node_id, state, current_cu FROM heliosdb.compute_nodes;"

# Wait for idle timeout + buffer
echo "2. Waiting for auto-suspend (5 minutes)..."
sleep 360

# Verify suspended
echo "3. Verifying suspended state:"
psql -c "SELECT node_id, state FROM heliosdb.compute_nodes;"

# Send query to trigger resume
echo "4. Triggering resume with query..."
time psql -c "SELECT COUNT(*) FROM users;"

# Check resume time
echo "5. Resume metrics:"
psql -c "SELECT total_duration_ms, phase_breakdown FROM heliosdb.last_resume_result();"
```

Integration with Monitoring
Prometheus Metrics
```yaml
scrape_configs:
  - job_name: 'heliosdb_autoscale'
    static_configs:
      - targets: ['heliosdb:9090']
    metrics_path: '/metrics'
```

Key Metrics:

- heliosdb_autoscale_suspend_total: Total suspends
- heliosdb_autoscale_resume_total: Total resumes
- heliosdb_autoscale_resume_duration_seconds: Resume latency histogram
- heliosdb_autoscale_active_nodes: Currently active nodes
- heliosdb_autoscale_suspended_nodes: Currently suspended nodes
- heliosdb_autoscale_cu_hours_total: Total CU-hours consumed
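Dashboard P95 values come from `histogram_quantile` over the resume-latency buckets. Its linear interpolation can be sketched in plain Python (illustrative only; Prometheus's real implementation handles more edge cases, such as empty or infinite buckets):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (le, count) histogram buckets,
    interpolating linearly inside the bucket that contains the rank."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# Resume latencies: 60 resumes <= 0.1s, 90 <= 0.2s, 100 <= 0.3s (cumulative)
buckets = [(0.1, 60), (0.2, 90), (0.3, 100)]
print(round(histogram_quantile(0.95, buckets), 3))  # 0.25 — P95 falls in the (0.2, 0.3] bucket
```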
Grafana Dashboard
```json
{
  "dashboard": {
    "title": "HeliosDB Scale-to-Zero",
    "panels": [
      {
        "title": "Node States",
        "targets": [{ "expr": "heliosdb_autoscale_active_nodes + heliosdb_autoscale_suspended_nodes" }]
      },
      {
        "title": "Resume Latency (P95)",
        "targets": [{ "expr": "histogram_quantile(0.95, heliosdb_autoscale_resume_duration_seconds_bucket)" }]
      },
      {
        "title": "Cost Savings",
        "targets": [{ "expr": "rate(heliosdb_autoscale_cu_hours_total[1h]) * 0.10" }]
      }
    ]
  }
}
```

Advanced Topics
Custom Idle Detection Logic
```sql
-- Create custom activity function
CREATE OR REPLACE FUNCTION is_truly_idle()
RETURNS boolean AS $$
BEGIN
    -- Consider node idle only if:
    -- 1. No active connections
    -- 2. No queries in last 5 minutes
    -- 3. No scheduled jobs
    RETURN (
        (SELECT COUNT(*) FROM pg_stat_activity WHERE state = 'active') = 0
        AND (SELECT COUNT(*) FROM heliosdb.query_activity
             WHERE timestamp > now() - interval '5 minutes') = 0
        AND (SELECT COUNT(*) FROM heliosdb.scheduled_jobs
             WHERE next_run < now() + interval '10 minutes') = 0
    );
END;
$$ LANGUAGE plpgsql;

-- Configure to use custom logic
ALTER SYSTEM SET heliosdb.custom_idle_check = 'is_truly_idle()';
```

Multi-Region Scale-to-Zero
```yaml
# Primary region config
autoscale:
  scale_to_zero:
    enabled: true
    coordinate_with_replicas: true  # Don't suspend if replicas are active
    suspend_only_if_all_idle: true
```

Conclusion
Scale-to-Zero in HeliosDB provides significant cost savings for variable workloads while maintaining sub-300ms resume times. By following the configuration and best practices in this guide, you can optimize costs without compromising user experience.
Key Takeaways:
- Resume time <300ms is imperceptible to users
- Potential 50-90% cost savings for intermittent workloads
- Requires minimal configuration
- Works transparently with existing applications
For more information:
- API Reference: /docs/api/autoscale.md
- Monitoring Guide: /docs/operations/monitoring-autoscale.md
- Cost Optimization: /docs/guides/cost-optimization.md