Graceful Degradation System
Overview
The Graceful Degradation system provides intelligent load shedding for HeliosDB under overload conditions. Instead of crashing or rejecting all queries when resources are constrained, the system gracefully degrades performance by automatically adjusting operational parameters.
Architecture
Degradation Modes
The system operates in four distinct modes, automatically transitioning based on system metrics:
- Normal Mode - Full capacity operation
  - All query priorities accepted
  - 100% parallelism
  - Standard timeouts (1x)
  - Full resource allocation (100%)
- Cautious Mode - Slightly restricted
  - All query priorities still accepted
  - 80% parallelism (20% reduction)
  - Extended timeouts (1.5x)
  - 80% resource allocation
- Restricted Mode - Significantly limited
  - Only Normal+ priority queries accepted
  - 50% parallelism (50% reduction)
  - Extended timeouts (2x)
  - 50% resource allocation
- Survival Mode - Minimal operations only
  - Only Critical priority queries accepted
  - 20% parallelism (80% reduction)
  - Maximum timeouts (3x)
  - 30% resource allocation
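The mode table above maps directly to a lookup. The following is a minimal sketch of that mapping, assuming a `DegradationMode` enum like the one this document's API exposes; the `factors` helper is illustrative, not part of the real interface:
```rust
// Illustrative sketch only: encodes the mode table above as a lookup.
// Variant order (Normal < Cautious < Restricted < Survival) gives a
// severity ordering via derived Ord.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum DegradationMode {
    Normal,
    Cautious,
    Restricted,
    Survival,
}

impl DegradationMode {
    /// Returns (parallelism factor, timeout multiplier, resource allocation factor).
    fn factors(self) -> (f64, f64, f64) {
        match self {
            DegradationMode::Normal => (1.0, 1.0, 1.0),
            DegradationMode::Cautious => (0.8, 1.5, 0.8),
            DegradationMode::Restricted => (0.5, 2.0, 0.5),
            DegradationMode::Survival => (0.2, 3.0, 0.3),
        }
    }
}
```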
Mode Transition Thresholds
Transitions occur based on three key metrics:
| Metric | Cautious | Restricted | Survival |
|---|---|---|---|
| CPU Utilization | 80% | 90% | 95% |
| Memory Utilization | 85% | 92% | 97% |
| Queue Depth | 100 | 500 | 1000 |
Transition Rule: Each metric is classified against the table independently, and the system enters the most severe mode indicated by any single metric.
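As a sketch, the worst-metric rule can be implemented by classifying each metric independently and taking the most severe result. Thresholds come from the table above; the `target_mode` function is illustrative and reuses the `DegradationMode` sketch from earlier:
```rust
// Sketch of the worst-metric rule: classify each metric on its own,
// then take the most severe classification.
fn target_mode(cpu: f64, memory: f64, queue_depth: usize) -> DegradationMode {
    let cpu_mode = match cpu {
        c if c >= 0.95 => DegradationMode::Survival,
        c if c >= 0.90 => DegradationMode::Restricted,
        c if c >= 0.80 => DegradationMode::Cautious,
        _ => DegradationMode::Normal,
    };
    let mem_mode = match memory {
        m if m >= 0.97 => DegradationMode::Survival,
        m if m >= 0.92 => DegradationMode::Restricted,
        m if m >= 0.85 => DegradationMode::Cautious,
        _ => DegradationMode::Normal,
    };
    let queue_mode = match queue_depth {
        q if q >= 1000 => DegradationMode::Survival,
        q if q >= 500 => DegradationMode::Restricted,
        q if q >= 100 => DegradationMode::Cautious,
        _ => DegradationMode::Normal,
    };
    // The most severe classification wins.
    cpu_mode.max(mem_mode).max(queue_mode)
}
```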
Hysteresis
To prevent mode thrashing (rapid transitions between modes), the system implements hysteresis:
- Hysteresis Factor: 0.85 (configurable)
- Effect: a metric must fall below 85% of the threshold that triggered the current mode before the system steps back to a less severe mode
Example:
- Enter Cautious mode at CPU = 80%
- Can only exit Cautious when CPU < 80% × 0.85 = 68%
Minimum Mode Duration
Prevents rapid mode changes:
- Default: 30 seconds
- System must stay in a mode for at least this duration before transitioning
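Both guards combine into a single downgrade check. A minimal sketch, assuming the entry threshold and entry time of the current mode are tracked (names here are hypothetical):
```rust
use std::time::{Duration, Instant};

// Sketch of the downgrade gate combining hysteresis and minimum mode
// duration. `entry_threshold` is the threshold that triggered the
// current mode.
fn may_step_down(
    metric: f64,
    entry_threshold: f64,
    hysteresis_factor: f64,      // e.g. 0.85
    entered_at: Instant,
    min_mode_duration: Duration, // e.g. 30 seconds
) -> bool {
    let improved_enough = metric < entry_threshold * hysteresis_factor;
    let stayed_long_enough = entered_at.elapsed() >= min_mode_duration;
    improved_enough && stayed_long_enough
}
```
With the defaults, Cautious mode entered at CPU = 80% is exited only once CPU falls below 68% and the mode has been held for at least 30 seconds.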
Integration Points
1. Admission Control
The degradation manager integrates with admission control:
```rust
// In AdvancedWorkloadManager::submit_query()
if self.degradation_manager.should_reject_query(query.priority) {
    let mode = self.degradation_manager.get_mode();
    return Err(format!(
        "System in {} mode, requires {:?}+ priority",
        mode,
        mode.min_priority()
    ));
}
```
2. Query Scheduler
The scheduler adjusts parallelism based on degradation mode:
```rust
let parallelism_factor = degradation_manager.get_parallelism_factor();
let adjusted_parallelism = base_parallelism * parallelism_factor;
```
3. Resource Manager
Resource allocation is scaled based on the current mode:
```rust
let resource_factor = degradation_manager.get_resource_allocation_factor();
let adjusted_allocation = requested_resources * resource_factor;
```
4. Query Execution
Query timeouts are adjusted based on system load:
```rust
let timeout_multiplier = degradation_manager.get_timeout_multiplier();
let adjusted_timeout = base_timeout * timeout_multiplier;
```
Configuration
Basic Configuration
```rust
use heliosdb_workload::degradation::{DegradationConfig, DegradationThresholds};

let config = DegradationConfig {
    enabled: true,
    thresholds: DegradationThresholds {
        // CPU thresholds
        cpu_cautious: 0.80,
        cpu_restricted: 0.90,
        cpu_survival: 0.95,

        // Memory thresholds
        memory_cautious: 0.85,
        memory_restricted: 0.92,
        memory_survival: 0.97,

        // Queue depth thresholds
        queue_cautious: 100,
        queue_restricted: 500,
        queue_survival: 1000,

        // Hysteresis and timing
        hysteresis_factor: 0.85,
        min_mode_duration_secs: 30,
    },
    monitor_interval_secs: 10,
    enable_auto_recovery: true,
};

let manager = DegradationManager::new(config);
```
Advanced Configuration
Conservative Thresholds
For systems requiring maximum stability:
```rust
DegradationThresholds {
    cpu_cautious: 0.70,         // More conservative
    cpu_restricted: 0.85,
    cpu_survival: 0.93,
    memory_cautious: 0.75,
    memory_restricted: 0.88,
    memory_survival: 0.95,
    hysteresis_factor: 0.90,    // Wider hysteresis band
    min_mode_duration_secs: 60, // Longer stability period
}
```
Aggressive Thresholds
For systems that can tolerate higher load:
```rust
DegradationThresholds {
    cpu_cautious: 0.85,         // Less conservative
    cpu_restricted: 0.93,
    cpu_survival: 0.97,
    memory_cautious: 0.90,
    memory_restricted: 0.95,
    memory_survival: 0.98,
    hysteresis_factor: 0.80,    // Narrower hysteresis band
    min_mode_duration_secs: 15, // Faster recovery
}
```
Usage Examples
Updating System Load
The degradation manager should be updated with current system metrics periodically:
```rust
use std::time::Instant;

use heliosdb_workload::admission_control::SystemLoad;

let load = SystemLoad {
    cpu_utilization: get_cpu_utilization(),
    memory_utilization: get_memory_utilization(),
    queue_depth: get_queue_depth(),
    running_queries: get_running_query_count(),
    last_update: Instant::now(),
};

// Update both the admission controller and the degradation manager
workload_manager.update_system_load(load)?;
```
Checking Query Admission
```rust
use heliosdb_workload::scheduler::QueryPriority;

// Check if the query should be admitted
let should_reject = degradation_manager.should_reject_query(query.priority);

if should_reject {
    return Err("Query rejected due to system overload");
}
```
Adjusting Query Execution
```rust
// Get current adjustments
let parallelism_factor = degradation_manager.get_parallelism_factor();
let timeout_multiplier = degradation_manager.get_timeout_multiplier();
let resource_factor = degradation_manager.get_resource_allocation_factor();

// Apply adjustments
let worker_threads = (base_threads as f64 * parallelism_factor) as usize;
let query_timeout = base_timeout * timeout_multiplier;
let memory_limit = base_memory_limit * resource_factor;
```
Monitoring Status
```rust
// Get comprehensive status report
let report = degradation_manager.get_status_report();

println!("Current mode: {:?}", report.current_mode);
println!("Parallelism: {:.0}%", report.parallelism_factor * 100.0);
println!("Time in mode: {}s", report.time_in_mode);
println!("Total transitions: {}", report.total_transitions);
println!("Queries rejected: {}", report.queries_rejected);
println!("Queries accepted (degraded): {}", report.queries_accepted_degraded);

// Get metrics
let metrics = degradation_manager.get_metrics();
let (normal_pct, cautious_pct, restricted_pct, survival_pct) = metrics.mode_percentages();

println!("Time distribution:");
println!("  Normal: {:.1}%", normal_pct);
println!("  Cautious: {:.1}%", cautious_pct);
println!("  Restricted: {:.1}%", restricted_pct);
println!("  Survival: {:.1}%", survival_pct);
```
Best Practices
1. Set Appropriate Thresholds
- Conservative: Set lower thresholds for critical production systems
- Aggressive: Set higher thresholds for development or test environments
- Calibrate: Monitor system behavior and adjust based on actual load patterns
2. Monitor Mode Transitions
- Log all mode transitions for analysis
- Alert operators when entering Restricted or Survival modes
- Track time spent in each mode
3. Tune Hysteresis
- Too low: Risk of mode thrashing
- Too high: Slow recovery from degraded modes
- Recommended: 0.80 - 0.90 range
4. Query Priority Strategy
- Critical: Health checks, monitoring, essential operations
- High: User-facing queries, important batch jobs
- Normal: Standard workload
- Background: Analytics, reports, maintenance
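One way to apply this strategy is a single mapping at query-submission time. In this sketch, `QueryKind` and `assign_priority` are hypothetical, while `QueryPriority` matches the scheduler type used elsewhere in this document:
```rust
// Illustrative mapping from query origin to priority, following the
// strategy above.
enum QueryKind {
    HealthCheck,
    UserFacing,
    ImportantBatch,
    Standard,
    Analytics,
    Maintenance,
}

fn assign_priority(kind: QueryKind) -> QueryPriority {
    match kind {
        QueryKind::HealthCheck => QueryPriority::Critical,
        QueryKind::UserFacing | QueryKind::ImportantBatch => QueryPriority::High,
        QueryKind::Standard => QueryPriority::Normal,
        QueryKind::Analytics | QueryKind::Maintenance => QueryPriority::Background,
    }
}
```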
5. Recovery Monitoring
- Track auto-recovery success rate
- Monitor time to recovery
- Ensure auto-recovery is enabled in production
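A simple stalled-recovery alert can be built on the status report shown under Monitoring Status. This is a sketch; `alert_operators` and the 300-second budget are assumptions, not part of the API:
```rust
// Alert if the system has been stuck in a degraded mode too long.
let report = degradation_manager.get_status_report();
if report.current_mode != DegradationMode::Normal && report.time_in_mode > 300 {
    alert_operators(format!(
        "Degraded for {}s in {:?} mode; check auto-recovery and load metrics",
        report.time_in_mode, report.current_mode
    ));
}
```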
Metrics and Observability
Key Metrics to Monitor
- Current Degradation Mode
  - Gauge: Current mode (0=Normal, 1=Cautious, 2=Restricted, 3=Survival)
- Mode Transition Rate
  - Counter: Total mode transitions
  - Rate: Transitions per hour
- Time in Mode
  - Histogram: Duration in each mode
  - Percentage: Time distribution across modes
- Query Rejection Rate
  - Counter: Queries rejected due to degradation
  - Rate: Rejections per second
- Recovery Time
  - Histogram: Time spent in degraded modes before recovery
Example Prometheus Metrics
```rust
use lazy_static::lazy_static;
use prometheus::{register_int_counter, register_int_gauge, IntCounter, IntGauge};

// Register metrics
lazy_static! {
    static ref DEGRADATION_MODE: IntGauge = register_int_gauge!(
        "heliosdb_degradation_mode",
        "Current degradation mode (0=Normal, 1=Cautious, 2=Restricted, 3=Survival)"
    )
    .unwrap();

    static ref DEGRADATION_TRANSITIONS: IntCounter = register_int_counter!(
        "heliosdb_degradation_transitions_total",
        "Total number of degradation mode transitions"
    )
    .unwrap();

    static ref DEGRADATION_REJECTIONS: IntCounter = register_int_counter!(
        "heliosdb_degradation_rejections_total",
        "Total queries rejected due to degradation"
    )
    .unwrap();
}

// Update metrics
DEGRADATION_MODE.set(manager.get_mode() as i64);
let metrics = manager.get_metrics();
DEGRADATION_TRANSITIONS.inc_by(metrics.total_transitions);
DEGRADATION_REJECTIONS.inc_by(metrics.queries_rejected_degradation);
```
Testing
Unit Tests
```rust
#[test]
fn test_mode_transition() {
    let manager = DegradationManager::new(DegradationConfig::default());

    // CPU at 85% crosses the 80% Cautious threshold; memory and queue
    // depth remain below their thresholds.
    let load = create_load(0.85, 0.5, 50);
    manager.update_load(load).unwrap();
    assert_eq!(manager.get_mode(), DegradationMode::Cautious);
}
```
Integration Tests
```rust
#[tokio::test]
async fn test_query_rejection_under_load() {
    let workload_manager = AdvancedWorkloadManager::new();

    // Simulate extreme load: all three metrics exceed their Survival thresholds
    let load = SystemLoad {
        cpu_utilization: 0.96,
        memory_utilization: 0.98,
        queue_depth: 1200,
        running_queries: 100,
        last_update: Instant::now(),
    };
    workload_manager.update_system_load(load).unwrap();

    // A Background query should be rejected
    let query = create_query(QueryPriority::Background);
    let result = workload_manager.submit_query(query).await;
    assert!(result.is_err());

    // A Critical query should be accepted
    let query = create_query(QueryPriority::Critical);
    let result = workload_manager.submit_query(query).await;
    assert!(result.is_ok());
}
```
Troubleshooting
System Stuck in Degraded Mode
Symptoms: System remains in Cautious/Restricted mode despite low load
Possible Causes:
- Hysteresis factor too high
- Minimum mode duration too long
- Auto-recovery disabled
- Load metrics not updating
Solutions:
- Reduce `hysteresis_factor` (e.g., from 0.90 to 0.85)
- Reduce `min_mode_duration_secs`
- Enable `enable_auto_recovery`
- Verify system load updates are happening
Frequent Mode Thrashing
Symptoms: Rapid transitions between modes
Possible Causes:
- Hysteresis factor too low
- Minimum mode duration too short
- Thresholds too close together
- Bursty workload patterns
Solutions:
- Increase `hysteresis_factor` (e.g., from 0.80 to 0.90)
- Increase `min_mode_duration_secs`
- Spread threshold values further apart
- Consider workload smoothing
Queries Rejected Too Aggressively
Symptoms: Many queries rejected, but system not actually overloaded
Possible Causes:
- Thresholds set too conservatively
- Priority levels not aligned with workload
Solutions:
- Raise degradation thresholds
- Review and adjust query priorities
- Consider using more gradual degradation levels
Performance Impact
Overhead
- CPU: < 0.1% (periodic metric checks)
- Memory: ~1KB per manager instance
- Latency: < 10μs per query admission check
Benefits
- Availability: Prevents system crashes under overload
- Fairness: Ensures critical queries are processed
- Recovery: Automatic recovery when load decreases
- Observability: Comprehensive metrics for analysis
Future Enhancements
- Machine Learning: Predict load spikes and pre-emptively adjust
- Per-Tenant Degradation: Different thresholds per tenant
- Custom Degradation Policies: User-defined degradation strategies
- Graceful Query Cancellation: Cancel low-priority queries to free resources
- Load Shedding Strategies: Different strategies for different workload patterns