Graceful Degradation System

Overview

The Graceful Degradation system provides intelligent load shedding for HeliosDB under overload conditions. Instead of crashing or rejecting all queries when resources are constrained, the system gracefully degrades performance by automatically adjusting operational parameters.

Architecture

Degradation Modes

The system operates in four distinct modes, automatically transitioning based on system metrics:

  1. Normal Mode - Full capacity operation

    • All query priorities accepted
    • 100% parallelism
    • Standard timeouts (1x)
    • Full resource allocation (100%)
  2. Cautious Mode - Slightly restricted

    • All query priorities still accepted
    • 80% parallelism (20% reduction)
    • Extended timeouts (1.5x)
    • 80% resource allocation
  3. Restricted Mode - Significantly limited

    • Only Normal+ priority queries accepted
    • 50% parallelism (50% reduction)
    • Extended timeouts (2x)
    • 50% resource allocation
  4. Survival Mode - Minimal operations only

    • Only Critical priority queries accepted
    • 20% parallelism (80% reduction)
    • Maximum timeouts (3x)
    • 30% resource allocation
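The four modes and their per-mode factors from the list above can be modeled as an enum. This is an illustrative sketch only; the enum and method names are assumptions, not the actual HeliosDB definitions:

```rust
// Sketch of the four degradation modes; values mirror the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DegradationMode {
    Normal,
    Cautious,
    Restricted,
    Survival,
}

impl DegradationMode {
    /// Fraction of base parallelism allowed in this mode.
    fn parallelism_factor(self) -> f64 {
        match self {
            DegradationMode::Normal => 1.0,
            DegradationMode::Cautious => 0.8,
            DegradationMode::Restricted => 0.5,
            DegradationMode::Survival => 0.2,
        }
    }

    /// Multiplier applied to base query timeouts.
    fn timeout_multiplier(self) -> f64 {
        match self {
            DegradationMode::Normal => 1.0,
            DegradationMode::Cautious => 1.5,
            DegradationMode::Restricted => 2.0,
            DegradationMode::Survival => 3.0,
        }
    }

    /// Fraction of requested resources granted in this mode.
    fn resource_allocation_factor(self) -> f64 {
        match self {
            DegradationMode::Normal => 1.0,
            DegradationMode::Cautious => 0.8,
            DegradationMode::Restricted => 0.5,
            DegradationMode::Survival => 0.3,
        }
    }
}

fn main() {
    let mode = DegradationMode::Restricted;
    println!(
        "{:?}: parallelism {:.0}%, timeout {}x, resources {:.0}%",
        mode,
        mode.parallelism_factor() * 100.0,
        mode.timeout_multiplier(),
        mode.resource_allocation_factor() * 100.0
    );
}
```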

Mode Transition Thresholds

Transitions occur based on three key metrics:

| Metric             | Cautious | Restricted | Survival |
|--------------------|----------|------------|----------|
| CPU Utilization    | 80%      | 90%        | 95%      |
| Memory Utilization | 85%      | 92%        | 97%      |
| Queue Depth        | 100      | 500        | 1000     |

Transition Rule: The system enters the mode suggested by the worst (highest) metric.
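The "worst metric wins" rule can be sketched as follows, using the default thresholds from the table above. The function names are illustrative, not the actual API:

```rust
// Modes ordered from least to most degraded, so Ord picks the worst.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Mode {
    Normal,
    Cautious,
    Restricted,
    Survival,
}

/// Map a single metric to the mode it suggests, given its three thresholds.
fn mode_for(value: f64, cautious: f64, restricted: f64, survival: f64) -> Mode {
    if value >= survival {
        Mode::Survival
    } else if value >= restricted {
        Mode::Restricted
    } else if value >= cautious {
        Mode::Cautious
    } else {
        Mode::Normal
    }
}

/// The system enters the mode suggested by the worst metric.
fn select_mode(cpu: f64, memory: f64, queue_depth: u64) -> Mode {
    let by_cpu = mode_for(cpu, 0.80, 0.90, 0.95);
    let by_mem = mode_for(memory, 0.85, 0.92, 0.97);
    let by_queue = mode_for(queue_depth as f64, 100.0, 500.0, 1000.0);
    by_cpu.max(by_mem).max(by_queue)
}

fn main() {
    // CPU alone suggests Cautious, but queue depth pushes to Restricted.
    println!("{:?}", select_mode(0.85, 0.50, 600)); // Restricted
}
```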

Hysteresis

To prevent mode thrashing (rapid transitions between modes), the system implements hysteresis:

  • Hysteresis Factor: 0.85 (configurable)
  • Effect: A metric must fall below 85% of the threshold that triggered the current mode before the system transitions back to a less degraded mode

Example:

  • Enter Cautious mode at CPU = 80%
  • Can only exit Cautious when CPU < 80% × 0.85 = 68%
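The downgrade check in the example above reduces to a single comparison. A minimal sketch, with illustrative names:

```rust
/// A mode may only be exited once the metric falls below
/// the entry threshold scaled by the hysteresis factor.
fn can_exit_mode(metric: f64, entry_threshold: f64, hysteresis_factor: f64) -> bool {
    metric < entry_threshold * hysteresis_factor
}

fn main() {
    // Entered Cautious at CPU >= 80%; exit requires CPU < 80% * 0.85 = 68%.
    println!("{}", can_exit_mode(0.75, 0.80, 0.85)); // false: still inside the band
    println!("{}", can_exit_mode(0.65, 0.80, 0.85)); // true: safe to downgrade
}
```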

Minimum Mode Duration

Prevents rapid mode changes:

  • Default: 30 seconds
  • System must stay in a mode for at least this duration before transitioning
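The dwell-time check pairs naturally with the hysteresis rule. A sketch, assuming the manager records when the current mode was entered (names are illustrative):

```rust
use std::time::{Duration, Instant};

/// A transition is allowed only after the mode has been
/// held for at least `min_mode_duration`.
fn may_transition(entered_at: Instant, min_mode_duration: Duration) -> bool {
    entered_at.elapsed() >= min_mode_duration
}

fn main() {
    let entered_at = Instant::now();
    // Just entered the mode: a transition must wait out the 30 s dwell time.
    println!("{}", may_transition(entered_at, Duration::from_secs(30))); // false
}
```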

Integration Points

1. Admission Control

The degradation manager integrates with admission control:

```rust
// In AdvancedWorkloadManager::submit_query()
if self.degradation_manager.should_reject_query(query.priority) {
    let mode = self.degradation_manager.get_mode();
    return Err(format!(
        "System in {} mode, requires {:?}+ priority",
        mode,
        mode.min_priority()
    ));
}
```
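The `min_priority()` floor used by admission control could look like the sketch below. The enums and ordering are assumptions based on the priority levels named elsewhere in this document, not the actual HeliosDB definitions:

```rust
// Priorities ordered from least to most important, so Ord compares them.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum QueryPriority {
    Background,
    Normal,
    High,
    Critical,
}

#[derive(Debug, Clone, Copy)]
enum DegradationMode {
    Normal,
    Cautious,
    Restricted,
    Survival,
}

impl DegradationMode {
    /// Lowest priority still admitted in this mode.
    fn min_priority(self) -> QueryPriority {
        match self {
            // Normal and Cautious admit everything, including Background.
            DegradationMode::Normal | DegradationMode::Cautious => QueryPriority::Background,
            DegradationMode::Restricted => QueryPriority::Normal,
            DegradationMode::Survival => QueryPriority::Critical,
        }
    }

    /// Reject any query below the mode's priority floor.
    fn should_reject_query(self, priority: QueryPriority) -> bool {
        priority < self.min_priority()
    }
}

fn main() {
    let mode = DegradationMode::Restricted;
    println!("{}", mode.should_reject_query(QueryPriority::Background)); // true
    println!("{}", mode.should_reject_query(QueryPriority::High)); // false
}
```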

2. Query Scheduler

The scheduler adjusts parallelism based on degradation mode:

```rust
let parallelism_factor = degradation_manager.get_parallelism_factor();
let adjusted_parallelism = base_parallelism * parallelism_factor;
```

3. Resource Manager

Resource allocation is scaled based on the current mode:

```rust
let resource_factor = degradation_manager.get_resource_allocation_factor();
let adjusted_allocation = requested_resources * resource_factor;
```

4. Query Execution

Query timeouts are adjusted based on system load:

```rust
let timeout_multiplier = degradation_manager.get_timeout_multiplier();
let adjusted_timeout = base_timeout * timeout_multiplier;
```

Configuration

Basic Configuration

```rust
use heliosdb_workload::degradation::{
    DegradationConfig, DegradationManager, DegradationThresholds,
};

let config = DegradationConfig {
    enabled: true,
    thresholds: DegradationThresholds {
        // CPU thresholds
        cpu_cautious: 0.80,
        cpu_restricted: 0.90,
        cpu_survival: 0.95,
        // Memory thresholds
        memory_cautious: 0.85,
        memory_restricted: 0.92,
        memory_survival: 0.97,
        // Queue depth thresholds
        queue_cautious: 100,
        queue_restricted: 500,
        queue_survival: 1000,
        // Hysteresis and timing
        hysteresis_factor: 0.85,
        min_mode_duration_secs: 30,
    },
    monitor_interval_secs: 10,
    enable_auto_recovery: true,
};

let manager = DegradationManager::new(config);
```

Advanced Configuration

Conservative Thresholds

For systems requiring maximum stability:

```rust
DegradationThresholds {
    cpu_cautious: 0.70,  // More conservative
    cpu_restricted: 0.85,
    cpu_survival: 0.93,
    memory_cautious: 0.75,
    memory_restricted: 0.88,
    memory_survival: 0.95,
    // Queue depth thresholds kept at the defaults
    queue_cautious: 100,
    queue_restricted: 500,
    queue_survival: 1000,
    hysteresis_factor: 0.90,  // Wider hysteresis band
    min_mode_duration_secs: 60,  // Longer stability period
}
```

Aggressive Thresholds

For systems that can tolerate higher load:

```rust
DegradationThresholds {
    cpu_cautious: 0.85,  // Less conservative
    cpu_restricted: 0.93,
    cpu_survival: 0.97,
    memory_cautious: 0.90,
    memory_restricted: 0.95,
    memory_survival: 0.98,
    // Queue depth thresholds kept at the defaults
    queue_cautious: 100,
    queue_restricted: 500,
    queue_survival: 1000,
    hysteresis_factor: 0.80,  // Narrower hysteresis band
    min_mode_duration_secs: 15,  // Faster recovery
}
```

Usage Examples

Updating System Load

The degradation manager should be updated with current system metrics periodically:

```rust
use std::time::Instant;

use heliosdb_workload::admission_control::SystemLoad;

let load = SystemLoad {
    cpu_utilization: get_cpu_utilization(),
    memory_utilization: get_memory_utilization(),
    queue_depth: get_queue_depth(),
    running_queries: get_running_query_count(),
    last_update: Instant::now(),
};

// Update both the admission controller and the degradation manager
workload_manager.update_system_load(load)?;
```

Checking Query Admission

```rust
use heliosdb_workload::scheduler::QueryPriority;

// Check whether the query should be admitted
let should_reject = degradation_manager.should_reject_query(query.priority);
if should_reject {
    return Err("Query rejected due to system overload".to_string());
}
```

Adjusting Query Execution

```rust
// Get current adjustments
let parallelism_factor = degradation_manager.get_parallelism_factor();
let timeout_multiplier = degradation_manager.get_timeout_multiplier();
let resource_factor = degradation_manager.get_resource_allocation_factor();

// Apply adjustments
let worker_threads = (base_threads as f64 * parallelism_factor) as usize;
let query_timeout = base_timeout * timeout_multiplier;
let memory_limit = base_memory_limit * resource_factor;
```

Monitoring Status

```rust
// Get a comprehensive status report
let report = degradation_manager.get_status_report();
println!("Current mode: {:?}", report.current_mode);
println!("Parallelism: {:.0}%", report.parallelism_factor * 100.0);
println!("Time in mode: {}s", report.time_in_mode);
println!("Total transitions: {}", report.total_transitions);
println!("Queries rejected: {}", report.queries_rejected);
println!("Queries accepted (degraded): {}", report.queries_accepted_degraded);

// Get metrics
let metrics = degradation_manager.get_metrics();
let (normal_pct, cautious_pct, restricted_pct, survival_pct) =
    metrics.mode_percentages();
println!("Time distribution:");
println!("  Normal:     {:.1}%", normal_pct);
println!("  Cautious:   {:.1}%", cautious_pct);
println!("  Restricted: {:.1}%", restricted_pct);
println!("  Survival:   {:.1}%", survival_pct);
```

Best Practices

1. Set Appropriate Thresholds

  • Conservative: Set lower thresholds for critical production systems
  • Aggressive: Set higher thresholds for development or test environments
  • Calibrate: Monitor system behavior and adjust based on actual load patterns

2. Monitor Mode Transitions

  • Log all mode transitions for analysis
  • Alert operators when entering Restricted or Survival modes
  • Track time spent in each mode

3. Tune Hysteresis

  • Too low: Risk of mode thrashing
  • Too high: Slow recovery from degraded modes
  • Recommended: 0.80 - 0.90 range

4. Query Priority Strategy

  • Critical: Health checks, monitoring, essential operations
  • High: User-facing queries, important batch jobs
  • Normal: Standard workload
  • Background: Analytics, reports, maintenance

5. Recovery Monitoring

  • Track auto-recovery success rate
  • Monitor time to recovery
  • Ensure auto-recovery is enabled in production

Metrics and Observability

Key Metrics to Monitor

  1. Current Degradation Mode

    • Gauge: Current mode (0=Normal, 1=Cautious, 2=Restricted, 3=Survival)
  2. Mode Transition Rate

    • Counter: Total mode transitions
    • Rate: Transitions per hour
  3. Time in Mode

    • Histogram: Duration in each mode
    • Percentage: Time distribution across modes
  4. Query Rejection Rate

    • Counter: Queries rejected due to degradation
    • Rate: Rejections per second
  5. Recovery Time

    • Histogram: Time spent in degraded modes before recovery

Example Prometheus Metrics

```rust
// Register metrics
lazy_static! {
    static ref DEGRADATION_MODE: IntGauge = register_int_gauge!(
        "heliosdb_degradation_mode",
        "Current degradation mode (0=Normal, 1=Cautious, 2=Restricted, 3=Survival)"
    )
    .unwrap();
    static ref DEGRADATION_TRANSITIONS: IntCounter = register_int_counter!(
        "heliosdb_degradation_transitions_total",
        "Total number of degradation mode transitions"
    )
    .unwrap();
    static ref DEGRADATION_REJECTIONS: IntCounter = register_int_counter!(
        "heliosdb_degradation_rejections_total",
        "Total queries rejected due to degradation"
    )
    .unwrap();
}

// Update metrics. Note: Prometheus counters are monotonic, so if these
// metrics fields are cumulative totals, pass the delta since the last
// update to inc_by rather than the running total, or they will over-count.
DEGRADATION_MODE.set(manager.get_mode() as i64);
let metrics = manager.get_metrics();
DEGRADATION_TRANSITIONS.inc_by(metrics.total_transitions);
DEGRADATION_REJECTIONS.inc_by(metrics.queries_rejected_degradation);
```

Testing

Unit Tests

```rust
#[test]
fn test_mode_transition() {
    let manager = DegradationManager::new(DegradationConfig::default());

    // High load: CPU at 85% suggests Cautious
    let load = create_load(0.85, 0.5, 50);
    manager.update_load(load).unwrap();

    assert_eq!(manager.get_mode(), DegradationMode::Cautious);
}
```

Integration Tests

```rust
#[tokio::test]
async fn test_query_rejection_under_load() {
    let workload_manager = AdvancedWorkloadManager::new();

    // Simulate high load
    let load = SystemLoad {
        cpu_utilization: 0.96,
        memory_utilization: 0.98,
        queue_depth: 1200,
        running_queries: 100,
        last_update: Instant::now(),
    };
    workload_manager.update_system_load(load).unwrap();

    // Background query should be rejected
    let query = create_query(QueryPriority::Background);
    let result = workload_manager.submit_query(query).await;
    assert!(result.is_err());

    // Critical query should be accepted
    let query = create_query(QueryPriority::Critical);
    let result = workload_manager.submit_query(query).await;
    assert!(result.is_ok());
}
```
}

Troubleshooting

System Stuck in Degraded Mode

Symptoms: System remains in Cautious/Restricted mode despite low load

Possible Causes:

  1. Hysteresis factor too high
  2. Minimum mode duration too long
  3. Auto-recovery disabled
  4. Load metrics not updating

Solutions:

  1. Reduce hysteresis_factor (e.g., from 0.90 to 0.85)
  2. Reduce min_mode_duration_secs
  3. Enable enable_auto_recovery
  4. Verify system load updates are happening

Frequent Mode Thrashing

Symptoms: Rapid transitions between modes

Possible Causes:

  1. Hysteresis factor too low
  2. Minimum mode duration too short
  3. Thresholds too close together
  4. Bursty workload patterns

Solutions:

  1. Increase hysteresis_factor (e.g., from 0.80 to 0.90)
  2. Increase min_mode_duration_secs
  3. Spread threshold values further apart
  4. Consider workload smoothing

Queries Rejected Too Aggressively

Symptoms: Many queries rejected, but system not actually overloaded

Possible Causes:

  1. Thresholds set too conservatively
  2. Priority levels not aligned with workload

Solutions:

  1. Raise degradation thresholds
  2. Review and adjust query priorities
  3. Consider using more gradual degradation levels

Performance Impact

Overhead

  • CPU: < 0.1% (periodic metric checks)
  • Memory: ~1KB per manager instance
  • Latency: < 10μs per query admission check

Benefits

  • Availability: Prevents system crashes under overload
  • Fairness: Ensures critical queries are processed
  • Recovery: Automatic recovery when load decreases
  • Observability: Comprehensive metrics for analysis

Future Enhancements

  1. Machine Learning: Predict load spikes and pre-emptively adjust
  2. Per-Tenant Degradation: Different thresholds per tenant
  3. Custom Degradation Policies: User-defined degradation strategies
  4. Graceful Query Cancellation: Cancel low-priority queries to free resources
  5. Load Shedding Strategies: Different strategies for different workload patterns

References