Dynamic Autoscaling User Guide
Overview
HeliosDB’s autoscaling framework automatically adjusts compute units (CUs) from zero up to a configured maximum based on workload demand, delivering sub-10-second scale-up and cost savings through intelligent resource management.
Benefits
- Dynamic 0-to-max CU scaling based on real-time metrics
- Sub-10s scale-up, sub-60s scale-down
- 28%+ cost savings through idle time optimization
- No oscillation with damping mechanism
- Vertical and horizontal scaling support
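The damping mechanism mentioned above can be pictured as a weighted average between the current CU level and the raw target: a higher damping factor keeps more weight on the current level, so a noisy target cannot whipsaw the node. This is an illustrative sketch under that assumption, not HeliosDB's actual implementation; the function name and formula are invented for the example:

```python
def damped_target(current_cu, raw_target_cu, damping_factor=0.8):
    """Blend the current CU level with the raw target.

    Higher damping_factor keeps more weight on the current level,
    so the CU level moves more slowly toward a noisy target.
    """
    return damping_factor * current_cu + (1 - damping_factor) * raw_target_cu

# A target oscillating between 2 and 8 CU barely moves the damped level:
cu = 4.0
for raw in [8.0, 2.0, 8.0, 2.0]:
    cu = damped_target(cu, raw)
    print(round(cu, 2))
```

With `damping_factor: 0.8` the level stays close to its starting point instead of bouncing between the extremes, which is why the troubleshooting section below suggests raising the factor when oscillation appears.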
Key Features
- CPU, memory, and queue depth monitoring
- Configurable scaling policies
- Automatic scale-to-zero for idle workloads
- Real-time metrics and Prometheus integration
- Per-second billing accuracy
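Scale-to-zero hinges on idle detection: once no activity has been observed for the configured timeout, the node becomes a candidate for scaling to zero. A minimal sketch of that check, with an invented function whose parameter mirrors the `scale_to_zero_idle_seconds` setting shown later (the actual detection logic in HeliosDB may differ):

```python
import time

def should_scale_to_zero(last_activity_ts, idle_seconds=300, now=None):
    """Return True when the node has been idle for at least idle_seconds."""
    if now is None:
        now = time.time()
    return (now - last_activity_ts) >= idle_seconds

# A node idle for 6 minutes qualifies under the default 5-minute timeout:
print(should_scale_to_zero(last_activity_ts=1000.0, now=1360.0))
```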
Prerequisites
- HeliosDB v3.2+ with autoscale module
- 4GB+ RAM for resource manager
- Monitoring tools (Prometheus recommended)
Step-by-Step Configuration
1. Enable Autoscaling
/etc/heliosdb/autoscaling.yaml:
```yaml
autoscaling:
  enabled: true
  min_cu: 0.0       # Scale to zero
  max_cu: 16.0      # Maximum per node
  target_cpu: 70    # Target CPU utilization
  target_queue: 10  # Target queue depth

  # Thresholds
  scale_up_threshold: 80    # Scale up at 80% CPU
  scale_down_threshold: 30  # Scale down at 30% CPU

  # Timing
  cooldown_period: 60       # Seconds between scaling actions
  scale_step_percent: 50.0  # Scale by 50% each step

  # Idle handling
  scale_to_zero_idle_seconds: 300  # 5 min idle timeout

  # Damping (prevents oscillation)
  damping_factor: 0.8
```

2. Configure Monitoring
```yaml
activity_monitor:
  idle_timeout_seconds: 300
  activity_threshold_seconds: 1
  activity_window_size: 1000
  sample_interval_seconds: 10
  track_readonly_queries: true
  track_system_queries: false
```

3. Resource Limits
```yaml
resources:
  total_cpu_cores: 64
  total_memory_mb: 262144  # 256 GB
  min_cpu_cores: 1
  min_memory_mb: 2048      # 2 GB
  max_cpu_cores: 16
  max_memory_mb: 65536     # 64 GB
```

SQL Examples
Monitoring Autoscaling
```sql
-- Check current autoscaling status
SELECT
  node_id,
  current_cu,
  target_cpu_percent,
  actual_cpu_percent,
  queue_depth,
  last_scaling_action,
  last_scaling_time
FROM heliosdb.autoscale_status;
```

Viewing Scaling History
```sql
-- Recent scaling events
SELECT
  timestamp,
  event_type,  -- ScaleUp, ScaleDown, NoChange
  from_cu,
  to_cu,
  reason,
  confidence,
  duration_ms
FROM heliosdb.autoscaling_events
WHERE timestamp > now() - interval '24 hours'
ORDER BY timestamp DESC;
```

Manual Scaling
```sql
-- Temporarily disable autoscaling
SELECT heliosdb.disable_autoscaling('node-1');

-- Manually set CU level
SELECT heliosdb.set_cu('node-1', 4.0);

-- Re-enable autoscaling
SELECT heliosdb.enable_autoscaling('node-1');
```

Performance Metrics
```sql
-- Autoscaling effectiveness
SELECT
  DATE_TRUNC('hour', timestamp) AS hour,
  AVG(cpu_utilization) AS avg_cpu,
  AVG(current_cu) AS avg_cu,
  COUNT(CASE WHEN event_type = 'ScaleUp' THEN 1 END) AS scale_ups,
  COUNT(CASE WHEN event_type = 'ScaleDown' THEN 1 END) AS scale_downs
FROM heliosdb.autoscale_metrics
WHERE timestamp > now() - interval '24 hours'
GROUP BY hour
ORDER BY hour;
```

Common Use Cases
Use Case 1: E-commerce Website
Traffic Pattern: High during business hours, low overnight
```yaml
autoscaling:
  min_cu: 0.5  # Minimum 0.5 CU (never zero for web)
  max_cu: 8.0
  target_cpu: 65
  scale_up_threshold: 75
  scale_down_threshold: 35
  cooldown_period: 120              # 2 min (more conservative)
  scale_to_zero_idle_seconds: 3600  # 1 hour
```

Expected Behavior:
- 8 AM - 5 PM: 2-6 CU (busy hours)
- 5 PM - 11 PM: 1-3 CU (evening traffic)
- 11 PM - 8 AM: 0.5 CU (maintenance only)
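The CU-hour arithmetic behind this pattern can be sketched as follows. The per-window CU averages and the $0.10/CU-hour rate are illustrative assumptions for the example, not measured values:

```python
# Rough daily cost estimate for the e-commerce pattern above,
# compared with an always-on deployment pinned at max_cu.
windows = [
    (9, 4.0),  # 8 AM - 5 PM: assume ~4 CU on average
    (6, 2.0),  # 5 PM - 11 PM: assume ~2 CU on average
    (9, 0.5),  # 11 PM - 8 AM: the 0.5 CU floor
]
cu_hours = sum(hours * cu for hours, cu in windows)
fixed_cu_hours = 24 * 8.0  # always-on at max_cu: 8.0

rate_usd = 0.10  # assumed $/CU-hour
print(cu_hours, fixed_cu_hours)                        # 52.5 192.0
print(f"daily cost: ${cu_hours * rate_usd:.2f} vs ${fixed_cu_hours * rate_usd:.2f}")
print(f"saving: {1 - cu_hours / fixed_cu_hours:.0%}")  # saving: 73%
```

Real savings depend on the actual load curve; the point is that the idle and overnight windows dominate the difference.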
Use Case 2: Batch Processing
Pattern: Periodic batch jobs every 4 hours
```yaml
autoscaling:
  min_cu: 0.0   # Scale to zero between batches
  max_cu: 16.0  # Full power during batch
  scale_up_threshold: 70    # Quick scale-up
  scale_down_threshold: 20  # Aggressive scale-down
  cooldown_period: 30       # Fast response
  scale_to_zero_idle_seconds: 180  # 3 min
```

Use Case 3: Development Environment
Pattern: Used only during work hours
```yaml
autoscaling:
  min_cu: 0.0
  max_cu: 4.0  # Lower max for dev
  scale_to_zero_idle_seconds: 600  # 10 min
  scale_step_percent: 100.0        # Double/halve each time
```

Troubleshooting
Issue: Frequent Scaling (Oscillation)
Symptom: Node scales up/down repeatedly
Diagnosis:
```sql
SELECT
  DATE_TRUNC('minute', timestamp) AS minute,
  COUNT(*) AS scaling_events
FROM heliosdb.autoscaling_events
WHERE timestamp > now() - interval '1 hour'
GROUP BY minute
HAVING COUNT(*) > 2;  -- More than 2 events/minute is problematic
```

Solution 1: Increase Cooldown
```yaml
autoscaling:
  cooldown_period: 180  # Increase from 60
```

Solution 2: Increase Damping
```yaml
autoscaling:
  damping_factor: 0.9  # Increase from 0.8
```

Solution 3: Widen Threshold Gap
```yaml
autoscaling:
  scale_up_threshold: 85    # Increase from 80
  scale_down_threshold: 25  # Decrease from 30
  # Wider gap prevents oscillation
```

Issue: Slow Scale-Up
Symptom: Users experience latency during traffic spikes
Solution: More Aggressive Scale-Up
```yaml
autoscaling:
  scale_up_threshold: 70     # Lower threshold
  scale_step_percent: 100.0  # Double instead of 50% increase
  cooldown_period: 30        # Faster response
```

Issue: High Costs
Symptom: Autoscaling not reducing costs as expected
Diagnosis:
```sql
-- Analyze CU utilization
SELECT
  DATE(timestamp) AS date,
  AVG(current_cu) AS avg_cu,
  MIN(current_cu) AS min_cu,
  MAX(current_cu) AS max_cu,
  AVG(cpu_utilization) AS avg_cpu
FROM heliosdb.autoscale_metrics
WHERE timestamp > now() - interval '7 days'
GROUP BY date;
```

Solution: More Aggressive Scale-Down
```yaml
autoscaling:
  scale_down_threshold: 40         # Increase from 30
  scale_to_zero_idle_seconds: 180  # Reduce from 300
```

Performance Tuning
1. Optimize for Workload Type
OLTP (Transactional):
```yaml
autoscaling:
  target_cpu: 60   # Lower target for better responsiveness
  target_queue: 5  # Low queue tolerance
  scale_up_threshold: 70
```

OLAP (Analytical):
```yaml
autoscaling:
  target_cpu: 80            # Higher utilization OK
  target_queue: 20          # Can tolerate queuing
  scale_down_threshold: 40  # Keep resources longer
```

Mixed Workload:
```yaml
autoscaling:
  target_cpu: 70
  target_queue: 10
  scale_up_threshold: 80
  scale_down_threshold: 30
```

2. Time-Based Policies
```sql
-- Create scheduled scaling policy
-- (RETURNS void, since the function is invoked via SELECT from cron,
-- not as a trigger)
CREATE FUNCTION business_hours_policy()
RETURNS void AS $$
BEGIN
  IF EXTRACT(HOUR FROM now()) BETWEEN 8 AND 17 THEN
    -- Business hours: maintain minimum 2 CU
    ALTER SYSTEM SET heliosdb.autoscale_min_cu = 2.0;
  ELSE
    -- Off-hours: allow scale to zero
    ALTER SYSTEM SET heliosdb.autoscale_min_cu = 0.0;
  END IF;
END;
$$ LANGUAGE plpgsql;

-- Schedule hourly check
SELECT cron.schedule('autoscale-policy', '0 * * * *', 'SELECT business_hours_policy()');
```

3. Predictive Scaling
```sql
-- Analyze historical patterns
CREATE MATERIALIZED VIEW scaling_patterns AS
SELECT
  EXTRACT(DOW FROM timestamp) AS day_of_week,
  EXTRACT(HOUR FROM timestamp) AS hour_of_day,
  AVG(current_cu) AS typical_cu,
  AVG(cpu_utilization) AS typical_cpu
FROM heliosdb.autoscale_metrics
WHERE timestamp > now() - interval '30 days'
GROUP BY day_of_week, hour_of_day;

-- Pre-scale based on pattern
SELECT heliosdb.set_target_cu(
  (SELECT typical_cu
   FROM scaling_patterns
   WHERE day_of_week = EXTRACT(DOW FROM now())
     AND hour_of_day = EXTRACT(HOUR FROM now() + interval '10 minutes'))
);
```

Best Practices
1. Start Conservative
```yaml
# Initial configuration
autoscaling:
  min_cu: 1.0  # Don't start with 0
  max_cu: 8.0  # Don't start with max
  cooldown_period: 120             # Longer cooldown
  scale_to_zero_idle_seconds: 900  # 15 min
```

After monitoring for a week, gradually optimize:
```yaml
# Optimized after observation
autoscaling:
  min_cu: 0.0   # Now confident to scale to zero
  max_cu: 16.0  # Increase max after testing
  cooldown_period: 60
  scale_to_zero_idle_seconds: 300
```

2. Monitor Before Optimizing
```sql
-- Weekly performance review
SELECT
  'Scale-ups' AS metric,
  COUNT(*) AS count,
  AVG(duration_ms) AS avg_duration_ms
FROM heliosdb.autoscaling_events
WHERE event_type = 'ScaleUp'
  AND timestamp > now() - interval '7 days'
UNION ALL
SELECT
  'Scale-downs',
  COUNT(*),
  AVG(duration_ms)
FROM heliosdb.autoscaling_events
WHERE event_type = 'ScaleDown'
  AND timestamp > now() - interval '7 days';
```

3. Alert on Anomalies
```sql
-- Create alert for excessive scaling
CREATE FUNCTION check_scaling_frequency()
RETURNS void AS $$
DECLARE
  events_per_hour integer;
BEGIN
  SELECT COUNT(*) INTO events_per_hour
  FROM heliosdb.autoscaling_events
  WHERE timestamp > now() - interval '1 hour';

  IF events_per_hour > 10 THEN
    RAISE WARNING 'Excessive scaling: % events in last hour', events_per_hour;
  END IF;
END;
$$ LANGUAGE plpgsql;

SELECT cron.schedule('check-scaling', '*/15 * * * *', 'SELECT check_scaling_frequency()');
```

4. Cost Tracking
```sql
-- Daily cost analysis
-- (the window function is computed in a subquery, since window
-- functions cannot appear inside aggregates)
CREATE VIEW daily_autoscale_costs AS
SELECT
  DATE(timestamp) AS date,
  AVG(current_cu) AS avg_cu,
  SUM(current_cu * interval_seconds / 3600.0) AS cu_hours,
  SUM(current_cu * interval_seconds / 3600.0) * 0.10 AS cost_usd  -- $0.10/CU-hour
FROM (
  SELECT
    timestamp,
    current_cu,
    EXTRACT(EPOCH FROM
      LEAD(timestamp, 1, now()) OVER (ORDER BY timestamp) - timestamp
    ) AS interval_seconds
  FROM heliosdb.autoscale_metrics
) samples
GROUP BY date
ORDER BY date DESC;
```

Advanced Topics
Horizontal Scaling
Enable multi-node scaling:
```yaml
autoscaling:
  horizontal:
    enabled: true
    min_nodes: 1
    max_nodes: 10
    node_max_cu: 16.0
    scale_out_cpu_threshold: 85.0
    scale_in_cpu_threshold: 20.0
    drain_timeout_seconds: 120
```

Custom Metrics
```sql
-- Register custom scaling metric
CREATE FUNCTION custom_scaling_metric()
RETURNS float AS $$
BEGIN
  -- Custom logic: e.g., connection count weighted by query complexity
  RETURN (
    SELECT COUNT(*) * AVG(query_complexity_score)
    FROM pg_stat_activity
    WHERE state = 'active'
  );
END;
$$ LANGUAGE plpgsql;

-- Use in scaling policy
ALTER SYSTEM SET heliosdb.autoscale_custom_metric = 'custom_scaling_metric()';
```

Integration with Load Balancer
```
# haproxy.cfg
backend heliosdb_servers
    balance leastconn
    option httpchk GET /autoscale/status

    # Dynamic server addition based on autoscaling
    server-template node 1-10 heliosdb-{1..10}:5432 check
```

Monitoring
Prometheus Metrics
```yaml
scrape_configs:
  - job_name: 'heliosdb_autoscale'
    static_configs:
      - targets: ['heliosdb:9090']
    metrics_path: '/metrics/autoscale'
```

Key metrics:
- heliosdb_autoscale_current_cu
- heliosdb_autoscale_target_cpu_percent
- heliosdb_autoscale_actual_cpu_percent
- heliosdb_autoscale_queue_depth
- heliosdb_autoscale_scaling_operations_total
- heliosdb_autoscale_scaling_duration_seconds
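These metrics are served in the standard Prometheus text exposition format, so a scrape can also be read directly with a few lines of code. The sample payload below is invented for illustration; only the metric names come from the list above:

```python
# Parse a gauge value out of a Prometheus text-format scrape.
sample_scrape = """\
# HELP heliosdb_autoscale_current_cu Current compute units
# TYPE heliosdb_autoscale_current_cu gauge
heliosdb_autoscale_current_cu 4.0
heliosdb_autoscale_queue_depth 7
"""

def metric_value(scrape: str, name: str) -> float:
    """Return the value of the first sample whose name matches exactly."""
    for line in scrape.splitlines():
        if line.startswith(name + " "):
            return float(line.split()[1])
    raise KeyError(name)

print(metric_value(sample_scrape, "heliosdb_autoscale_current_cu"))  # 4.0
```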
Grafana Dashboard
Create dashboard with:
- Current CU level (gauge)
- CPU/Memory utilization (time series)
- Queue depth (time series)
- Scaling events (annotations)
- Cost trend (time series)
Conclusion
HeliosDB’s autoscaling provides intelligent resource management with significant cost savings. Configure policies based on your workload, monitor performance, and iterate for optimal results.
Key Takeaways:
- Start conservative, optimize over time
- Monitor scaling patterns for a week before tuning
- Use damping to prevent oscillation
- Set appropriate cooldown periods
- Track costs and adjust policies
For more information:
- Architecture: /docs/architecture/autoscaling.md
- API Reference: /docs/api/autoscale.md
- Cost Optimization: /docs/guides/cost-optimization.md