Resource Leak Prevention - Quick Reference Guide
For: DevOps, SRE, Database Administrators
Version: 1.0
Date: 2025-11-10
Quick Configuration
Essential Settings (Production)
    # Minimal safe configuration for production
    [connection_pool]
    min_connections = 10
    max_connections = 500
    acquire_timeout = "30s"
    leak_detection_enabled = true
    leak_detection_timeout = "10m"

    [timeouts]
    default_query_timeout = "30s"
    transaction_timeout = "10m"

    [resource_limits]
    max_client_connections = 10000
    max_memory_per_query_mb = 1024
    max_total_memory_mb = 32768
    memory_pressure_threshold = 0.85

Common Issues & Solutions
Issue 1: Connection Leak Alert
Symptom: Alert “Connection leaked after 30 minutes”
Immediate Actions:
    # 1. Check leaked connections
    heliosdb-cli metrics | grep "connections_leaked"

    # 2. Review recent queries from user
    heliosdb-cli query "SELECT * FROM pg_stat_activity WHERE usename='<user>'"

    # 3. Force reclaim if needed
    heliosdb-cli admin force-reclaim-connection <connection_id>

Root Cause Analysis:
- Check application code for missing connection.close() calls
- Review stack traces in leak alerts
- Check for transaction rollback failures
Prevention:
- Use connection pooling properly
- Always use try-with-resources / defer / RAII patterns
- Enable leak detection in development
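The leak alerts in this guide depend on how long a connection has been checked out. The following is a minimal sketch of that classification, assuming the 10-minute "held" warning and 30-minute "leaked" thresholds listed under the alert severity levels. `HoldStatus`, `classify_hold`, and `count_leaked` are hypothetical names for illustration, not part of heliosdb-cli or any HeliosDB client.

```rust
use std::time::Duration;

/// Hypothetical classification of how long a connection has been checked out.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HoldStatus {
    Healthy,
    HeldWarning,
    Leaked,
}

// Thresholds copied from the alert levels in this guide; they are assumptions
// about how a detector might be configured, not confirmed defaults.
const HELD_WARNING_AFTER: Duration = Duration::from_secs(10 * 60);
const LEAKED_AFTER: Duration = Duration::from_secs(30 * 60);

/// Maps a hold duration to a status using strict "greater than" comparisons.
pub fn classify_hold(held_for: Duration) -> HoldStatus {
    if held_for > LEAKED_AFTER {
        HoldStatus::Leaked
    } else if held_for > HELD_WARNING_AFTER {
        HoldStatus::HeldWarning
    } else {
        HoldStatus::Healthy
    }
}

/// Counts how many of the given hold durations classify as leaked.
pub fn count_leaked(held: &[Duration]) -> usize {
    held.iter()
        .filter(|d| classify_hold(**d) == HoldStatus::Leaked)
        .count()
}
```

In a development build, a check like this could run periodically over the checkout timestamps kept by the pool, so that leaks surface long before they reach production.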
Issue 2: High Memory Pressure
Symptom: Alert “Memory pressure: HIGH (88%)”
Immediate Actions:
    # 1. Check memory usage
    heliosdb-cli metrics | grep "memory_usage_mb"

    # 2. Check top memory-consuming queries
    heliosdb-cli query "SELECT query_id, memory_mb FROM active_queries ORDER BY memory_mb DESC LIMIT 10"

    # 3. Force garbage collection
    heliosdb-cli admin force-gc

    # 4. If critical, enable degradation
    heliosdb-cli admin enable-degradation --type reduce-memory-limits

Root Cause Analysis:
- Large result sets being buffered
- Memory-intensive aggregations
- Cache size too large
- Memory leak in application
Prevention:
- Set appropriate query memory limits
- Use streaming for large results
- Monitor cache hit rates
- Regular memory profiling
Issue 3: Connection Pool Exhausted
Symptom: Error “Connection pool exhausted, timeout after 30s”
Immediate Actions:
    # 1. Check pool utilization
    heliosdb-cli metrics | grep "pool_utilization"

    # 2. Check active connections
    heliosdb-cli query "SELECT COUNT(*) FROM pg_stat_activity WHERE state='active'"

    # 3. Identify long-running queries
    heliosdb-cli query "SELECT pid, now() - query_start as duration, query FROM pg_stat_activity WHERE state='active' ORDER BY duration DESC"

    # 4. Kill long-running queries if needed
    heliosdb-cli admin kill-query <query_id>

    # 5. Temporarily increase pool size
    heliosdb-cli admin resize-pool --size 1000

Root Cause Analysis:
- Too many concurrent connections
- Connections not being released
- Long-running queries holding connections
- Pool size too small for workload
Prevention:
- Right-size connection pool
- Implement connection timeouts
- Use connection pooling middleware
- Monitor pool utilization
Issue 4: Circuit Breaker Open
Symptom: Error “Circuit breaker is open for connection pool”
Immediate Actions:
    # 1. Check circuit breaker state
    heliosdb-cli metrics | grep "circuit_breaker_state"

    # 2. Check recent failures
    heliosdb-cli metrics | grep "circuit_breaker_failures"

    # 3. Check backend health
    heliosdb-cli health-check

    # 4. Manually reset circuit breaker if safe
    heliosdb-cli admin reset-circuit-breaker --name connection_pool

Root Cause Analysis:
- Backend database failures
- Network connectivity issues
- Resource exhaustion on backend
- Configuration errors
Prevention:
- Monitor backend health
- Implement proper error handling
- Configure appropriate thresholds
- Have redundant backends
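For readers unfamiliar with the Closed / Open / Half-Open states reported by `circuit_breaker_state`, here is a minimal sketch of the usual state machine. The failure threshold and open timeout are illustrative parameters, and the type and method names are hypothetical. How HeliosDB actually implements its breaker is not documented here.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Closed,
    Open,
    HalfOpen,
}

/// Illustrative breaker: opens after `failure_threshold` consecutive failures,
/// then allows a probe once `open_timeout` has elapsed.
pub struct CircuitBreaker {
    state: State,
    failures: u32,
    failure_threshold: u32,
    open_timeout: Duration,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, open_timeout: Duration) -> Self {
        CircuitBreaker {
            state: State::Closed,
            failures: 0,
            failure_threshold,
            open_timeout,
            opened_at: None,
        }
    }

    pub fn state(&self) -> State {
        self.state
    }

    /// Returns whether a call may proceed at `now`.
    /// Moves Open -> HalfOpen once the open timeout has elapsed.
    pub fn allow(&mut self, now: Instant) -> bool {
        if self.state == State::Open {
            if let Some(opened) = self.opened_at {
                if now.duration_since(opened) >= self.open_timeout {
                    self.state = State::HalfOpen;
                }
            }
        }
        self.state != State::Open
    }

    /// A success closes the circuit and clears the failure count.
    pub fn record_success(&mut self) {
        self.failures = 0;
        self.state = State::Closed;
        self.opened_at = None;
    }

    /// A failure while half-open, or reaching the threshold, opens the circuit.
    pub fn record_failure(&mut self, now: Instant) {
        self.failures += 1;
        if self.state == State::HalfOpen || self.failures >= self.failure_threshold {
            self.state = State::Open;
            self.opened_at = Some(now);
        }
    }
}
```

Passing `now` in explicitly keeps the logic testable without sleeping. A manual reset, as in step 4 above, corresponds to calling `record_success` in this sketch.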
Issue 5: Query Timeout
Symptom: Error “Query timeout exceeded: 30000ms”
Immediate Actions:
    # 1. Check if query is still running
    heliosdb-cli query "SELECT * FROM pg_stat_activity WHERE query_id='<query_id>'"

    # 2. Review query plan
    heliosdb-cli explain "SELECT ..."

    # 3. Check for locks
    heliosdb-cli query "SELECT * FROM pg_locks WHERE NOT granted"

    # 4. Increase timeout for this query type if appropriate
    heliosdb-cli config set timeouts.long_query_timeout 600s

Root Cause Analysis:
- Inefficient query
- Missing indexes
- Lock contention
- Large data volume
- Timeout too aggressive
Prevention:
- Optimize queries
- Add appropriate indexes
- Use query hints for complex queries
- Set per-query-type timeouts
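On the application side, per-query-type timeouts can be approximated with a small wrapper. This sketch uses only the Rust standard library. `QueryKind`, `timeout_for`, and `run_with_timeout` are hypothetical names, and the 30s and 600s limits are taken from the configuration values shown in this guide. A real client would more likely use its driver's own timeout support.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical query categories with different time budgets.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueryKind {
    Default,
    Long,
}

/// 30s mirrors default_query_timeout; 600s mirrors the long_query_timeout example.
pub fn timeout_for(kind: QueryKind) -> Duration {
    match kind {
        QueryKind::Default => Duration::from_secs(30),
        QueryKind::Long => Duration::from_secs(600),
    }
}

/// Runs `work` on a helper thread and waits at most `limit` for its result.
/// Note: the helper thread is not cancelled on timeout; it keeps running in
/// the background, so the server-side timeout still has to be configured.
pub fn run_with_timeout<T, F>(limit: Duration, work: F) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails only if the caller already gave up; ignore that case.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(limit) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => {
            Err(format!("Query timeout exceeded: {}ms", limit.as_millis()))
        }
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            Err("worker exited without a result".to_string())
        }
    }
}
```

A caller would pick the limit by type, for example `run_with_timeout(timeout_for(QueryKind::Long), || run_report())`, where `run_report` stands in for the actual query call.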
Monitoring Checklist
Daily Checks
- Connection pool utilization < 80%
- Memory pressure level: Normal or Elevated
- No connection leak alerts in last 24h
- Circuit breaker state: Closed
- Query timeout rate < 1%
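The daily checks are simple threshold comparisons, so they lend themselves to automation. The sketch below evaluates a metrics snapshot against the five conditions above. The `DailySnapshot` struct and its field names are assumptions made for illustration; how the values are collected from heliosdb-cli is left open.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pressure {
    Normal,
    Elevated,
    High,
    Critical,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Breaker {
    Closed,
    Open,
    HalfOpen,
}

/// Hypothetical snapshot of the values the daily checks look at.
pub struct DailySnapshot {
    pub connections_active: u64,
    pub connections_total: u64,
    pub memory_pressure: Pressure,
    pub leak_alerts_24h: u64,
    pub breaker: Breaker,
    pub queries_total: u64,
    pub queries_timed_out: u64,
}

/// Returns part / whole, or 0.0 when there is nothing to divide by.
fn ratio(part: u64, whole: u64) -> f64 {
    if whole == 0 {
        0.0
    } else {
        part as f64 / whole as f64
    }
}

/// Returns the names of the daily checks that fail for this snapshot.
pub fn failed_daily_checks(s: &DailySnapshot) -> Vec<&'static str> {
    let mut failed = Vec::new();
    if ratio(s.connections_active, s.connections_total) >= 0.80 {
        failed.push("pool utilization");
    }
    if matches!(s.memory_pressure, Pressure::High | Pressure::Critical) {
        failed.push("memory pressure");
    }
    if s.leak_alerts_24h > 0 {
        failed.push("connection leaks");
    }
    if s.breaker != Breaker::Closed {
        failed.push("circuit breaker");
    }
    if ratio(s.queries_timed_out, s.queries_total) >= 0.01 {
        failed.push("query timeout rate");
    }
    failed
}
```

An empty result means all five daily checks pass; anything else can be turned into a ticket or alert according to the severity levels later in this guide.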
Weekly Checks
- Review resource limit violations
- Check for degradation activations
- Analyze slow query log
- Review memory growth trends
- Validate backup/restore procedures
Monthly Checks
- Connection pool sizing review
- Timeout configuration review
- Resource limit adjustment
- Circuit breaker threshold tuning
- Leak detection effectiveness
Key Metrics
Connection Pool Metrics
    connections_total            # Total connections in pool
    connections_active           # Connections currently in use
    connections_idle             # Connections available
    connections_leaked           # Connections that leaked
    pool_utilization             # Active / Total (should be < 0.8)
    connection_lifetime_ms       # Average connection lifetime
    acquire_timeout_rate         # % of acquire attempts that time out

Resource Limit Metrics
    memory_usage_mb              # Current memory usage
    memory_limit_mb              # Configured memory limit
    memory_pressure_level        # Normal/Elevated/High/Critical
    query_memory_violations      # Queries rejected for memory
    connection_limit_violations  # Connections rejected
    file_descriptor_usage        # Open file count

Timeout Metrics
    operation_timeouts           # Total timeout events
    timeout_by_type              # Breakdown by operation type
    average_query_time_ms        # Average query execution time
    p95_query_time_ms            # 95th percentile query time
    p99_query_time_ms            # 99th percentile query time

Circuit Breaker Metrics
    circuit_breaker_state        # Closed/Open/Half-Open
    circuit_breaker_failures     # Failure count
    circuit_breaker_successes    # Success count in half-open
    circuit_open_events          # Times circuit opened
    circuit_close_events         # Times circuit recovered

Alert Severity Levels
P1 (Critical - Immediate Action)
- Memory pressure: CRITICAL (> 95%)
- Massive leak detected (> 50 leaked connections)
- Circuit breaker open for > 5 minutes
- Connection pool exhausted for > 1 minute
- Database unavailable
Response Time: 15 minutes
Escalation: Page on-call immediately
P2 (High - Urgent Action)
- Memory pressure: HIGH (> 85%)
- Connection leak detected (> 30 min)
- Query timeout rate > 5%
- Resource limit violations increasing
- Degradation activated
Response Time: 1 hour
Escalation: Alert on-call during business hours
P3 (Medium - Standard Action)
- Memory pressure: ELEVATED (> 70%)
- Connection held warning (> 10 min)
- Pool utilization > 80%
- Slow query detected
- Resource usage trending up
Response Time: 4 hours
Escalation: Create ticket for investigation
P4 (Low - Informational)
- Memory pressure: NORMAL
- Connection recycled (age limit)
- Configuration change applied
- Health check passed
Response Time: Best effort
Escalation: Log for review
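The memory-pressure thresholds used across the four severity levels can be expressed as one small mapping. This is a sketch under the thresholds stated above (> 70%, > 85%, > 95%). The function names are hypothetical, and treating a zero limit as "no limit configured" is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PressureLevel {
    Normal,
    Elevated,
    High,
    Critical,
}

/// Classifies memory pressure from usage and limit, both in megabytes.
pub fn pressure_level(usage_mb: u64, limit_mb: u64) -> PressureLevel {
    // Assumption: a limit of 0 means no limit is configured.
    if limit_mb == 0 {
        return PressureLevel::Normal;
    }
    let fraction = usage_mb as f64 / limit_mb as f64;
    if fraction > 0.95 {
        PressureLevel::Critical
    } else if fraction > 0.85 {
        PressureLevel::High
    } else if fraction > 0.70 {
        PressureLevel::Elevated
    } else {
        PressureLevel::Normal
    }
}

/// Returns (severity, response time) for a pressure level, per this guide.
pub fn severity(level: PressureLevel) -> (&'static str, &'static str) {
    match level {
        PressureLevel::Critical => ("P1", "15 minutes"),
        PressureLevel::High => ("P2", "1 hour"),
        PressureLevel::Elevated => ("P3", "4 hours"),
        PressureLevel::Normal => ("P4", "best effort"),
    }
}
```

For example, 28836 MB used against the 32768 MB limit from the production configuration is about 88%, which this mapping classifies as High and therefore P2.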
Emergency Procedures
Emergency Memory Recovery
    # 1. Check current state
    heliosdb-cli metrics | grep memory

    # 2. Force garbage collection
    heliosdb-cli admin force-gc

    # 3. Clear caches
    heliosdb-cli admin clear-cache --type query
    heliosdb-cli admin clear-cache --type result

    # 4. Close idle connections
    heliosdb-cli admin close-idle-connections --age 1m

    # 5. Kill low-priority queries
    heliosdb-cli admin kill-queries --priority low

    # 6. Enable degradation
    heliosdb-cli admin enable-degradation --type reduce-memory-limits

    # 7. If still critical, enable read-only mode
    heliosdb-cli admin enable-read-only-mode

Emergency Connection Recovery
    # 1. Check connection state
    heliosdb-cli metrics | grep connections

    # 2. Close idle connections
    heliosdb-cli admin close-idle-connections --age 30s

    # 3. Force reclaim leaked connections
    heliosdb-cli admin force-reclaim-all-leaked

    # 4. Kill long-running queries
    heliosdb-cli query "SELECT pid FROM pg_stat_activity WHERE state='active' AND now() - query_start > interval '5 minutes'" | xargs -I {} heliosdb-cli admin kill-query {}

    # 5. Reject new connections temporarily
    heliosdb-cli admin enable-degradation --type reject-new-connections

    # 6. Restart connection pool (last resort)
    heliosdb-cli admin restart-pool --graceful

Emergency Shutdown
    # 1. Enable read-only mode
    heliosdb-cli admin enable-read-only-mode

    # 2. Stop accepting new connections
    heliosdb-cli admin stop-accepting-connections

    # 3. Wait for active queries to complete (30s timeout)
    heliosdb-cli admin wait-for-queries --timeout 30s

    # 4. Kill remaining queries
    heliosdb-cli admin kill-all-queries

    # 5. Graceful shutdown (30s timeout)
    heliosdb-cli admin shutdown --graceful --timeout 30s

    # 6. Force shutdown if needed
    heliosdb-cli admin shutdown --force

Configuration Tuning Guide
Small Deployment (< 100 users)
    [connection_pool]
    min_connections = 5
    max_connections = 50

    [resource_limits]
    max_client_connections = 1000
    max_total_memory_mb = 4096

Medium Deployment (100-1000 users)
    [connection_pool]
    min_connections = 10
    max_connections = 200

    [resource_limits]
    max_client_connections = 5000
    max_total_memory_mb = 16384

Large Deployment (> 1000 users)
    [connection_pool]
    min_connections = 20
    max_connections = 500

    [resource_limits]
    max_client_connections = 10000
    max_total_memory_mb = 32768

High-Performance (Low Latency)
    [connection_pool]
    min_connections = 50
    max_connections = 1000
    acquire_timeout = "5s"

    [timeouts]
    default_query_timeout = "10s"
    lock_timeout = "500ms"

High-Throughput (Batch Processing)
    [connection_pool]
    min_connections = 10
    max_connections = 100

    [timeouts]
    default_query_timeout = "5m"
    transaction_timeout = "30m"

    [resource_limits]
    max_memory_per_query_mb = 4096

Troubleshooting Commands
    # Connection pool status
    heliosdb-cli pool status

    # Active connections
    heliosdb-cli pool connections --active

    # Leaked connections
    heliosdb-cli pool connections --leaked

    # Resource usage
    heliosdb-cli resources usage

    # Resource pressure
    heliosdb-cli resources pressure

    # Circuit breaker status
    heliosdb-cli circuit-breaker status --all

    # Recent timeouts
    heliosdb-cli timeouts recent --limit 10

    # Query resource usage
    heliosdb-cli query-resources --top 10

    # Health check
    heliosdb-cli health-check --verbose

    # Configuration dump
    heliosdb-cli config dump --section resource_limits

Best Practices
Application Development
- Always close connections
    let conn = pool.acquire().await?;
    defer! { pool.release(conn).await; }
- Set query timeouts
    query.with_timeout(Duration::from_secs(30))
- Stream large results
    let stream = query.execute_streaming().await?;
- Check resource availability
    if !pool.can_acquire() {
        return Err("Pool exhausted");
    }
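The first practice only prevents leaks if the release runs on every exit path, including early returns on error. In Rust this is commonly done with a guard type whose `Drop` implementation returns the connection, which also avoids having to await inside a deferred block. The sketch below is synchronous and standard-library only. `Pool`, `PooledConnection`, and `Connection` are hypothetical stand-ins, not the HeliosDB client API.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a database connection.
#[derive(Debug)]
pub struct Connection {
    pub id: u32,
}

/// Hypothetical fixed-size pool holding the idle connections.
#[derive(Clone)]
pub struct Pool {
    idle: Arc<Mutex<Vec<Connection>>>,
}

/// Guard that hands its connection back to the pool when dropped.
pub struct PooledConnection {
    conn: Option<Connection>,
    idle: Arc<Mutex<Vec<Connection>>>,
}

impl Pool {
    pub fn new(size: u32) -> Self {
        let conns: Vec<Connection> = (0..size).map(|id| Connection { id }).collect();
        Pool { idle: Arc::new(Mutex::new(conns)) }
    }

    /// Returns None when no idle connection is available.
    pub fn acquire(&self) -> Option<PooledConnection> {
        let conn = self.idle.lock().unwrap().pop()?;
        Some(PooledConnection { conn: Some(conn), idle: Arc::clone(&self.idle) })
    }

    pub fn idle_count(&self) -> usize {
        self.idle.lock().unwrap().len()
    }
}

impl PooledConnection {
    pub fn id(&self) -> u32 {
        self.conn.as_ref().expect("connection is present until drop").id
    }
}

impl Drop for PooledConnection {
    fn drop(&mut self) {
        // Runs on normal return, on early `?` return, and during panic unwinding.
        // A poisoned lock is skipped rather than unwrapped to avoid a double panic.
        if let (Some(conn), Ok(mut idle)) = (self.conn.take(), self.idle.lock()) {
            idle.push(conn);
        }
    }
}

/// Simulated work that fails early; the guard still releases the connection.
pub fn failing_work(pool: &Pool) -> Result<(), String> {
    let _conn = pool.acquire().ok_or("pool exhausted")?;
    Err("query failed".to_string())
}
```

With this shape, application code cannot forget the release call: once the guard goes out of scope the connection is back in the pool, as `failing_work` illustrates for the error path.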
Operations
- Monitor continuously
  - Set up alerts for all P1/P2 conditions
  - Dashboard with key metrics
  - Regular log review
- Test failure scenarios
  - Connection leak injection
  - Memory pressure simulation
  - Circuit breaker testing
- Document incidents
  - Root cause analysis
  - Remediation steps
  - Prevention measures
- Regular maintenance
  - Weekly metric review
  - Monthly configuration tuning
  - Quarterly load testing
Support Contacts
- P1 Issues: page-oncall@example.com
- P2 Issues: sre-team@example.com
- P3/P4 Issues: support@example.com
- Documentation: https://heliosdb.dev/docs/resource-management
Last Updated: 2025-11-10
Next Review: 2025-12-10