HeliosDB Monitoring and Alerting Operations Guide
Version: 1.0
Last Updated: November 24, 2025
Target Audience: SRE Team, Operations Engineers, Database Administrators
Table of Contents
- Executive Summary
- Architecture Overview
- Metrics Collection
- Dashboards
- Alerting
- Distributed Tracing
- Log Aggregation
- SLO Monitoring
- Anomaly Detection
- Runbooks
- Troubleshooting
- Best Practices
Executive Summary
What This Document Covers
This guide provides comprehensive documentation for HeliosDB’s production-grade monitoring and alerting infrastructure, covering:
- Metrics Collection: 150+ metrics across all 184 crates
- Dashboards: 50+ Grafana dashboards for all features
- Alerting: 200+ alert rules with P0-P3 severity levels
- Distributed Tracing: OpenTelemetry integration for full request tracing
- Log Aggregation: ELK Stack and Splunk integration
- SLO Monitoring: Service Level Objectives with error budgets
- Anomaly Detection: ML-powered anomaly detection and intelligent alerting
Quick Links
- Grafana Dashboards: https://grafana.heliosdb.io
- Prometheus: https://prometheus.heliosdb.io
- Alertmanager: https://alertmanager.heliosdb.io
- Jaeger Tracing: https://jaeger.heliosdb.io
- Kibana: https://kibana.heliosdb.io
- Splunk: https://splunk.heliosdb.io
- Runbooks: https://docs.heliosdb.io/runbooks/
Key Metrics at a Glance
| Metric | Target | Critical Threshold |
|---|---|---|
| Availability | 99.9% | <99.5% |
| P99 Latency | <500ms | >1000ms |
| P95 Latency | <100ms | >500ms |
| Error Rate | <0.1% | >1% |
| Replication Lag | <10s | >60s |
| Cache Hit Rate | >80% | <60% |
Architecture Overview
Monitoring Stack Components
```
┌─────────────────────────────────────────────────────────────┐
│                      HeliosDB Cluster                       │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐            │
│   │  Node 1  │     │  Node 2  │     │  Node 3  │            │
│   │  :9090   │     │  :9090   │     │  :9090   │            │
│   └────┬─────┘     └────┬─────┘     └────┬─────┘            │
│        │                │                │                  │
│        │  Metrics       │                │                  │
│        │  Traces        │                │                  │
│        │  Logs          │                │                  │
└────────┼────────────────┼────────────────┼──────────────────┘
         ▼                ▼                ▼
┌─────────────────────────────────────────────────────────────┐
│                  OpenTelemetry Collector                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  Receivers   │  │  Processors  │  │  Exporters   │       │
│  │ OTLP/Jaeger  │  │   Sampling   │  │ Prom/Jaeger  │       │
│  │  Prometheus  │  │  Enrichment  │  │   ES/Loki    │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────┘
         │                │                │
         ▼                ▼                ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ Prometheus  │    │   Jaeger    │    │  ELK Stack  │
│   Metrics   │    │   Traces    │    │    Logs     │
│   Storage   │    │   Storage   │    │   Storage   │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       ▼                  ▼                  ▼
┌─────────────────────────────────────────────────────────────┐
│                          Grafana                            │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐        │
│  │  System  │ │ Feature  │ │   SLO    │ │ Anomaly  │        │
│  │ Overview │ │ Specific │ │Monitoring│ │Detection │        │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘        │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
                ┌──────────────────────┐
                │     Alertmanager     │
                │  - Routing           │
                │  - Deduplication     │
                │  - Silencing         │
                └──────────┬───────────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
 ┌────────────┐     ┌────────────┐     ┌────────────┐
 │ PagerDuty  │     │   Slack    │     │   Email    │
 └────────────┘     └────────────┘     └────────────┘
```
Data Flow
- Metrics Export: HeliosDB exports Prometheus metrics on the /metrics endpoint
- Trace Generation: OpenTelemetry SDK generates traces for all operations
- Log Emission: Structured JSON logs written to files
- Collection: OpenTelemetry Collector scrapes/receives data
- Processing: Sampling, enrichment, filtering, correlation
- Storage: Prometheus (metrics), Jaeger/Tempo (traces), Elasticsearch/Loki (logs)
- Visualization: Grafana queries all data sources
- Alerting: Prometheus evaluates alert rules, Alertmanager handles notifications
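The collection step above assumes each node serves /metrics on port 9090. A minimal Prometheus scrape job for that topology might look like the following sketch (the job name, target addresses, and scrape interval are illustrative assumptions, not the production config):

```yaml
# Illustrative scrape config for a three-node HeliosDB cluster.
# Adjust targets and labels to match your actual deployment.
scrape_configs:
  - job_name: "heliosdb"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - "node-01:9090"
          - "node-02:9090"
          - "node-03:9090"
        labels:
          cluster: "prod-cluster-01"
```

In practice the Collector (or Prometheus itself) would use service discovery rather than static targets; the static form is shown only to make the data flow concrete.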
Metrics Collection
Available Metrics
Core Database Metrics (P0 - Critical)
```
# Query performance
heliosdb_queries_total{query_type, status, protocol}
heliosdb_query_duration_seconds_bucket{query_type, protocol}
heliosdb_query_rows_scanned{query_type}
heliosdb_query_rows_returned{query_type}

# Transactions
heliosdb_transactions_total{type, status}
heliosdb_transaction_duration_seconds{type}
heliosdb_active_transactions{type}
heliosdb_transaction_conflicts_total{conflict_type}

# Connections
heliosdb_active_connections{protocol}
heliosdb_connection_wait_seconds{protocol}
heliosdb_connection_errors_total{protocol, error_type}

# Storage
heliosdb_storage_operations_total{operation, status}
heliosdb_storage_operation_duration_seconds{operation}
heliosdb_storage_bytes_read_total{tier}
heliosdb_storage_bytes_written_total{tier}
heliosdb_memtable_size_bytes{table}
heliosdb_compaction_queue_depth{level}
```
Multi-Model Database Metrics
```
# Vector database
heliosdb_vector_search_operations_total{index_type, status}
heliosdb_vector_search_duration_seconds{index_type}
heliosdb_vector_index_size{index_name, dimensions}

# Graph database
heliosdb_graph_traversal_operations_total{traversal_type, status}
heliosdb_graph_traversal_duration_seconds{traversal_type}
heliosdb_graph_nodes{graph_name, node_type}
heliosdb_graph_edges{graph_name, edge_type}

# Document store
heliosdb_document_operations_total{operation, collection, status}
heliosdb_document_size_bytes{collection}

# Time-series
heliosdb_timeseries_ingest_rate{metric}
heliosdb_timeseries_compression_ratio{metric}
```
Distributed Systems Metrics
```
# Cluster health
heliosdb_cluster_nodes{status}
heliosdb_cluster_leader_elections_total
heliosdb_cluster_split_brain_detected_total

# Replication
heliosdb_replication_lag_seconds{source, target, region}
heliosdb_replication_synced_bytes_total{source, target}
heliosdb_replication_conflicts_total{conflict_type, resolution}
heliosdb_replication_queue_depth{source, target}

# Sharding
heliosdb_shard_operations_total{operation, status}
heliosdb_shard_rebalancing_active
heliosdb_shard_data_size_bytes{shard_id}

# Multi-region
heliosdb_cross_region_latency_seconds{source_region, target_region}
heliosdb_region_failover_operations_total{from_region, to_region, status}
```
Querying Metrics
Common PromQL Queries
Calculate error rate:
```
sum(rate(heliosdb_queries_total{status="error"}[5m]))
  / sum(rate(heliosdb_queries_total[5m])) * 100
```
P95 latency:
```
histogram_quantile(0.95,
  sum(rate(heliosdb_query_duration_seconds_bucket[5m])) by (le, query_type)) * 1000
```
Cache hit rate:
```
sum(rate(heliosdb_cache_hits_total[5m])) by (cache_type)
  / (
      sum(rate(heliosdb_cache_hits_total[5m])) by (cache_type)
    + sum(rate(heliosdb_cache_misses_total[5m])) by (cache_type)
  ) * 100
```
Top queries by duration:
```
topk(10,
  sum(rate(heliosdb_query_duration_seconds_sum[5m])) by (query_type)
    / sum(rate(heliosdb_query_duration_seconds_count[5m])) by (query_type))
```
Dashboards
Available Dashboards
1. System Overview Dashboard
URL: https://grafana.heliosdb.io/d/system-overview
Purpose: High-level system health and performance
Panels:
- Cluster health (healthy nodes)
- Query rate (QPS)
- Error rate
- P99 latency
- Active connections
- Replication lag
- Query rate by type (graph)
- Query latency percentiles (graph)
- CPU usage by node
- Memory usage by node
- Disk usage by node
- Network I/O
- Storage operations
- Compaction queue depth
- Cache performance
- Active alerts table
2. Vector Database Dashboard
URL: https://grafana.heliosdb.io/d/vector-database
Panels:
- Search operations/sec
- P95 search latency
- Total vectors indexed
- Insert rate
- Search latency distribution
- Index size growth
- HNSW performance metrics
- Vector dimension distribution
3. Graph Database Dashboard
URL: https://grafana.heliosdb.io/d/graph-database
Panels:
- Traversal operations/sec
- P95 traversal latency
- Total nodes/edges
- Graph density
- Traversal patterns
- Cypher query performance
- PageRank/centrality metrics
4. Multi-Tenancy Dashboard
URL: https://grafana.heliosdb.io/d/multi-tenancy
Panels:
- Active tenants
- Data size by tenant (top 10)
- Query rate by tenant (top 10)
- Tenant quotas vs usage
- Noisy neighbor detection
- Cross-tenant isolation metrics
5. Replication Dashboard
URL: https://grafana.heliosdb.io/d/replication
Panels:
- Replication lag heatmap
- Throughput by replication stream
- Conflict rate
- Queue depth
- Sync status
- Cross-region latency matrix
6. Security Dashboard
URL: https://grafana.heliosdb.io/d/security
Panels:
- Authentication attempts/failures
- Authorization checks
- Security violations
- Audit events
- Suspicious activity detection
- Failed login attempts by IP
7. Cache Performance Dashboard
URL: https://grafana.heliosdb.io/d/cache-performance
Panels:
- Hit rate by cache tier
- Eviction rate
- Cache size
- Operation latency
- Cache efficiency trends
8. ML Features Dashboard
URL: https://grafana.heliosdb.io/d/ml-features
Panels:
- Model inference rate
- Model latency
- Auto-index creation rate
- Optimizer invocations
- Improvement ratio
- NL2SQL confidence scores
9. Edge Computing Dashboard
URL: https://grafana.heliosdb.io/d/edge-computing
Panels:
- Active edge nodes
- Sync lag by edge node
- Edge function invocations
- Total edge data
- Edge network latency
10. Lakehouse Integration Dashboard
URL: https://grafana.heliosdb.io/d/lakehouse
Panels:
- Tables by format (Iceberg, Delta, Hudi)
- Query operations by format
- Metadata operations
- Compaction activity
- Table size distribution
Additional Dashboards
- OLTP/OLAP Workloads: Transaction performance and analytical query metrics
- Sharding & Rebalancing: Shard distribution and rebalancing activity
- Multi-Region: Cross-region performance and failover metrics
- Cost Management: Query costs and resource optimization
- NL2SQL: Natural language query performance
- Document Store: MongoDB-compatible document operations
- Time-Series: Time-series ingestion and compression
- Distributed Tracing: Trace visualization and analysis
- Workload Optimizer: Query optimization metrics
- Compression: Compression ratios and performance
Dashboard Navigation
Best Practices:
- Start with System Overview dashboard for overall health
- Drill down to feature-specific dashboards for detailed analysis
- Use time range selector to focus on incident timeframes
- Use template variables to filter by cluster, instance, tenant
- Enable auto-refresh (30s recommended) for real-time monitoring
Alerting
Alert Severity Levels
| Severity | Priority | Response Time | Notification | Action |
|---|---|---|---|---|
| Critical | P0 | Immediate (5 min) | PagerDuty + Phone | Page on-call, start incident |
| High | P1 | 15 minutes | Slack + Email | Notify team, investigate |
| Warning | P2 | 1 hour | Slack | Monitor, plan action |
| Info | P3 | Next business day | Slack (monitoring) | Log for review |
Critical Alerts (P0)
DatabaseDown
Description: HeliosDB node is completely down
Query: up{job="heliosdb"} == 0
Duration: 30s
Impact: Data unavailability, potential data loss
Runbook: https://docs.heliosdb.io/runbooks/database-down
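Expressed as a Prometheus alerting rule, the definition above might look like this sketch (the `expr`, `for` duration, severity, and runbook URL come from this guide; the group name and annotation wording are illustrative assumptions):

```yaml
# Illustrative rule for the DatabaseDown alert described above.
groups:
  - name: heliosdb-critical
    rules:
      - alert: DatabaseDown
        expr: up{job="heliosdb"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "HeliosDB node {{ $labels.instance }} is down"
          runbook_url: "https://docs.heliosdb.io/runbooks/database-down"
```

The other P0/P1 alerts in this section follow the same shape, differing only in `expr`, `for`, and severity; the canonical definitions live in alert-rules-comprehensive.yml (Appendix B).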
ClusterQuorumLost
Description: Less than majority of nodes are healthy
Query: count(up{job="heliosdb"} == 1) < (count(up{job="heliosdb"}) / 2 + 1)
Duration: 30s
Impact: Write operations blocked, data inconsistency risk
Runbook: https://docs.heliosdb.io/runbooks/quorum-lost
CriticalQueryErrorRate
Description: Query error rate exceeds 5%
Query: sum(rate(heliosdb_queries_total{status="error"}[1m])) / sum(rate(heliosdb_queries_total[1m])) > 0.05
Duration: 2m
Impact: Widespread query failures affecting users
Runbook: https://docs.heliosdb.io/runbooks/high-error-rate
CriticalReplicationFailure
Description: Replication lag exceeds 5 minutes
Query: heliosdb_replication_lag_seconds > 300
Duration: 5m
Impact: Data consistency at risk, failover risk
Runbook: https://docs.heliosdb.io/runbooks/replication-failure
OutOfMemory
Description: Memory usage exceeds 98%
Query: heliosdb_memory_usage_bytes{type="total"} / ignoring(type) heliosdb_memory_usage_bytes{type="limit"} > 0.98
Duration: 1m
Impact: Imminent OOM kills, database crash risk
Runbook: https://docs.heliosdb.io/runbooks/out-of-memory
DiskFull
Description: Disk usage exceeds 95%
Query: heliosdb_disk_usage_bytes{type="used"} / ignoring(type) heliosdb_disk_usage_bytes{type="capacity"} > 0.95
Duration: 2m
Impact: Write operations will fail, database may crash
Runbook: https://docs.heliosdb.io/runbooks/disk-full
High Priority Alerts (P1)
HighQueryLatencyP99
Description: P99 query latency exceeds 1 second
Query: histogram_quantile(0.99, sum(rate(heliosdb_query_duration_seconds_bucket[5m])) by (le)) > 1.0
Duration: 5m
Impact: Poor user experience, SLA at risk
Action: Analyze slow queries, check for missing indexes
HighConnectionPoolUtilization
Description: Connection pool at 85%+ utilization
Query: heliosdb_active_connections / heliosdb_active_connections{type="max"} > 0.85
Duration: 10m
Impact: Connection timeouts imminent
Action: Increase pool size or scale horizontally
CacheHitRateDegraded
Description: Cache hit rate below 60%
Query: sum(rate(heliosdb_cache_hits_total[10m])) / (sum(rate(heliosdb_cache_hits_total[10m])) + sum(rate(heliosdb_cache_misses_total[10m]))) < 0.6
Duration: 15m
Impact: Increased storage I/O, query latency
Action: Increase cache size, analyze eviction patterns
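The severity-to-channel mapping from the table above is implemented in Alertmanager routing. A minimal sketch of that route tree follows (receiver names, grouping keys, and timers are illustrative assumptions; the production config is maintained separately):

```yaml
# Illustrative Alertmanager routing: critical pages, high notifies, rest to Slack.
route:
  group_by: ["alertname", "cluster"]
  receiver: slack-ops              # default for warning/info (P2/P3)
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall   # P0: page immediately
      group_wait: 0s
      repeat_interval: 5m
    - matchers: ['severity="high"']
      receiver: slack-email-team   # P1: Slack + Email

receivers:
  - name: pagerduty-oncall
  - name: slack-email-team
  - name: slack-ops
```

Receiver definitions (PagerDuty service keys, Slack webhooks, SMTP settings) are omitted here; see the Alertmanager deployment for the real values.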
Alert Response Workflow
```
                   Alert Fires
                        │
                        ▼
               ┌─────────────────┐
               │  Alertmanager   │
               │ - Deduplication │
               │ - Grouping      │
               │ - Routing       │
               └────────┬────────┘
                        │
        ┌───────────────┴──────────────┐
        ▼                              ▼
   [Critical?]                 [High/Warning?]
        │                              │
        ▼                              ▼
 ┌──────────────┐              ┌──────────────┐
 │  PagerDuty   │              │    Slack     │
 │  + Phone     │              │   + Email    │
 └──────┬───────┘              └──────┬───────┘
        │                             │
        ▼                             ▼
 ┌──────────────┐              ┌──────────────┐
 │ On-call Eng  │              │  Team Lead   │
 │ Acknowledges │              │   Reviews    │
 └──────┬───────┘              └──────┬───────┘
        │                             │
        ▼                             ▼
 ┌──────────────┐              ┌──────────────┐
 │   Follow     │              │   Monitor    │
 │   Runbook    │              │  & Plan Fix  │
 └──────┬───────┘              └──────────────┘
        │
        ▼
 ┌──────────────┐
 │   Mitigate   │
 │    Issue     │
 └──────┬───────┘
        │
        ▼
 ┌──────────────┐
 │   Resolve    │
 │    Alert     │
 └──────┬───────┘
        │
        ▼
 ┌──────────────┐
 │ Post-Mortem  │
 │ (if P0/P1)   │
 └──────────────┘
```
Distributed Tracing
Trace Instrumentation
HeliosDB automatically instruments all operations with OpenTelemetry:
Trace Hierarchy:
```
Query Request (Root Span)
│
├─ Authentication (Span)
│
├─ Authorization (Span)
│
├─ Query Parsing (Span)
│
├─ Query Planning (Span)
│  ├─ Cost Estimation
│  └─ Index Selection
│
├─ Query Execution (Span)
│  ├─ Storage Read (Span)
│  │  ├─ Cache Lookup
│  │  └─ Disk Read (if cache miss)
│  ├─ Index Scan (Span)
│  └─ Filter/Join Operations (Span)
│
└─ Result Serialization (Span)
```
Trace Attributes
Every span includes:
- trace_id: Unique trace identifier
- span_id: Unique span identifier
- parent_span_id: Parent span (for hierarchy)
- service.name: "heliosdb"
- service.version: "7.0.0"
- deployment.environment: "production"
- db.system: "heliosdb"
- db.operation: Query type (SELECT, INSERT, etc.)
- db.statement: Hashed SQL statement
- db.table: Table name
- db.rows_affected: Number of rows
- duration_ms: Span duration
- tenant_id: Tenant identifier
- query_complexity: low/medium/high
- slo.met: true/false
Viewing Traces
Jaeger UI: https://jaeger.heliosdb.io
Common Searches:
```
# Find slow queries
duration > 1s

# Find errors
error=true

# Find specific query type
db.operation=SELECT

# Find distributed transactions
transaction.distributed=true

# Find queries for specific tenant
tenant_id=abc123

# Find queries missing SLO
slo.met=false
```
Trace Sampling
- Always sampled: Errors, slow queries (>1s), distributed transactions
- High sample rate (50%): P99+ latency queries
- Medium sample rate (10%): Normal queries
- Low sample rate (1%): Health checks
- Never sampled: Internal monitoring
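These tiers can be approximated with the OpenTelemetry Collector's tail_sampling processor. The sketch below mirrors the list above (always keep errors and slow traces, keep 10% of the rest); the policy names and decision window are assumptions, and the production policy set may differ:

```yaml
# Illustrative tail-based sampling matching the tiers above.
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans before deciding
    policies:
      - name: keep-errors       # "Always sampled: errors"
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow         # "Always sampled: slow queries (>1s)"
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline          # "Medium sample rate (10%): normal queries"
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Policies are OR-ed: a trace is kept if any policy matches, so errors and slow traces survive regardless of the probabilistic baseline. The full production config is in opentelemetry-config.yaml (Appendix D).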
Log Aggregation
Log Structure
All logs are emitted in structured JSON format:
```json
{
  "timestamp": "2025-11-24T10:30:45.123Z",
  "level": "INFO",
  "category": "query",
  "message": "Query executed successfully",
  "service": "heliosdb",
  "version": "7.0.0",
  "environment": "production",
  "cluster": "prod-cluster-01",
  "node_id": "node-01",
  "region": "us-east-1",
  "query_id": "q-12345",
  "query_type": "SELECT",
  "duration_ms": 45.2,
  "rows_returned": 100,
  "cache_hit": true,
  "tenant_id": "tenant-uuid",
  "trace_id": "abc123",
  "span_id": "xyz789"
}
```
Log Categories
- query: Query execution and optimization
- transaction: Transaction lifecycle
- storage: Storage engine operations
- replication: Replication and sync
- security: Authentication, authorization, audit
- network: Network communication
- cache: Cache operations
- ml: ML model inference
- system: System-level events
Searching Logs
Kibana (ELK): https://kibana.heliosdb.io
Common Queries:
```
# Find errors in last hour
level:ERROR AND @timestamp:[now-1h TO now]

# Find slow queries
category:query AND duration_ms:>1000

# Find security violations
category:security AND event_type:unauthorized_access

# Find queries for specific tenant
tenant_id:"abc123"

# Find queries that missed SLO
slo.met:false

# Correlate logs with traces
trace_id:"abc123"
```
Splunk: https://splunk.heliosdb.io
```
# Error rate trend
index=heliosdb_logs level=ERROR
| timechart count by category

# Slow query analysis
index=heliosdb_queries duration_ms>1000
| stats avg(duration_ms) p95(duration_ms) p99(duration_ms) count by query_type

# Security audit
index=heliosdb_audit category=security
| stats count by event_type, user_id
| sort -count
```
SLO Monitoring
Defined SLOs
Availability SLO
- Target: 99.9% (three nines)
- Window: 30 days
- Error Budget: 43.2 minutes/month
- Current Status: Check dashboard at https://grafana.heliosdb.io/d/slo-overview
Latency SLOs
- P95 Latency: <100ms (all queries)
- P99 Latency: <500ms (OLTP queries)
- Window: 30 days
Data Freshness SLO
- Replication Lag: <10 seconds
- Window: 30 days
Error Budget
Formula:
```
Error Budget Remaining (%) = (1 − (1 − Actual Availability) / (1 − Target Availability)) × 100
```
Burn Rate:
```
Burn Rate = (Error Budget Consumed / Time Elapsed) / (Total Error Budget / SLO Window)
```
Alert Thresholds:
- Fast Burn (14.4x): 2% budget consumed in 1 hour → Page immediately
- Moderate Burn (6x): 5% budget consumed in 6 hours → Alert team
- Slow Burn (1x): 10% budget consumed in 3 days → Notify for review
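To make the thresholds concrete: a 30-day window is 720 hours, so a 14.4x burn rate consumes 14.4/720 ≈ 2% of the budget per hour, 6x over 6 hours consumes 36/720 = 5%, and 1x over 3 days consumes 72/720 = 10%. A fast-burn rule can be sketched as a single-window Prometheus alert (the rule name and `for` duration are illustrative; production rules typically pair a long and a short window to reduce flapping):

```yaml
# Illustrative fast-burn alert for a 99.9% availability SLO.
# 14.4x burn means the error ratio exceeds 14.4 × (1 − 0.999) = 1.44%.
groups:
  - name: heliosdb-slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(heliosdb_queries_total{status="error"}[1h]))
              / sum(rate(heliosdb_queries_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
```

The canonical multiwindow definitions live in slo-monitoring.yaml (Appendix F).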
SLO Dashboard
Access the SLO dashboard at: https://grafana.heliosdb.io/d/slo-overview
Panels:
- SLO compliance status (green/yellow/red)
- Error budget remaining (%)
- Current burn rate
- SLI trends (7-day)
- Top SLO violations
- Projected budget exhaustion
Anomaly Detection
Detection Methods
1. Statistical Detection
- Z-Score: Detects values >3 standard deviations from mean
- ARIMA: Forecasts expected values, detects deviations
- Exponential Smoothing: Detects trend changes
2. ML-Based Detection
- Isolation Forest: Detects multivariate anomalies
- LSTM: Predicts time-series, detects deviations
- Autoencoders: Learns normal patterns, detects unusual patterns
3. Rule-Based Detection
- Sudden Spike: Value >3× rolling average
- Sudden Drop: Value <0.3× rolling average
- Unusual Pattern: Deviation from historical patterns
Anomaly Scoring
Anomalies are scored 0.0-1.0 based on confidence:
- >0.9: Critical anomaly (investigate immediately)
- >0.75: High confidence anomaly (investigate soon)
- >0.5: Medium confidence anomaly (monitor)
- >0.3: Low confidence anomaly (log only)
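The z-score method from the statistical detectors above can be expressed directly as Prometheus recording rules, without any ML infrastructure. A sketch (the rule names, the QPS example metric, and the 1-day baseline window are assumptions):

```yaml
# Illustrative z-score anomaly detection on query rate.
# A |z| > 3 reading corresponds to the ">3 standard deviations" rule above.
groups:
  - name: heliosdb-anomaly-zscore
    rules:
      - record: heliosdb:qps:rate5m
        expr: sum(rate(heliosdb_queries_total[5m]))
      - record: heliosdb:qps:zscore
        expr: |
          (heliosdb:qps:rate5m - avg_over_time(heliosdb:qps:rate5m[1d]))
            / stddev_over_time(heliosdb:qps:rate5m[1d])
      - alert: QueryRateAnomaly
        expr: abs(heliosdb:qps:zscore) > 3
        for: 10m
        labels:
          severity: warning
```

This catches spikes and drops alike; the ML-based detectors (Isolation Forest, LSTM, autoencoders) run outside Prometheus and feed their scores back as metrics for the anomaly dashboard.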
Anomaly Dashboard
Access at: https://grafana.heliosdb.io/d/anomaly-detection
Panels:
- Detected anomalies (last 24h)
- Anomaly score timeline
- Baseline vs actual metrics
- Anomaly heatmap
- Top anomalous metrics
Runbooks
Critical Runbooks
Database Down
Trigger: Node completely unreachable

Steps:
- Check node status: systemctl status heliosdb
- Check logs: journalctl -u heliosdb -n 100
- If OOM killed: Increase memory, restart
- If crash: Check core dump, restart
- If disk full: Clean up space, restart
- Verify cluster health after restart

Escalation: If restart fails, page DBA lead
Quorum Lost
Trigger: Less than majority of nodes healthy

Steps:
- Identify down nodes
- Check network connectivity between nodes
- Check for split-brain scenario
- Restore down nodes ASAP
- Verify cluster reformed correctly

Escalation: If network partition, engage network team
High Error Rate
Trigger: Query error rate >1%

Steps:
- Check error logs for common patterns
- Identify affected query types
- Check for recent schema/code changes
- Check resource utilization (CPU, memory, disk)
- Roll back recent changes if applicable
- Scale up resources if needed

Escalation: If errors continue, page on-call engineer
Out of Memory
Trigger: Memory usage >98%

Steps:
- Identify memory consumers: Check cache size, active queries
- Kill long-running queries if necessary
- Reduce cache sizes temporarily
- Check for memory leaks in recent changes
- Scale up memory or scale horizontally
- Investigate root cause after mitigation

Prevention: Tune cache sizes, set query timeouts
Troubleshooting
Common Issues
High Latency
Symptoms: P95/P99 latency elevated

Investigation:
- Check CPU/memory/disk utilization
- Check cache hit rate
- Analyze slow query log
- Check for missing indexes
- Check replication lag
- Check compaction backlog

Solutions:
- Add missing indexes
- Optimize queries
- Increase cache size
- Scale resources
- Rebalance shards
High Error Rate
Symptoms: Elevated query failures

Investigation:
- Check error logs for patterns
- Check affected query types
- Check recent deployments
- Check resource limits
- Check authentication issues

Solutions:
- Roll back bad deployment
- Fix query bugs
- Increase resource limits
- Fix authentication configuration
Replication Lag
Symptoms: Lag >10 seconds

Investigation:
- Check network bandwidth
- Check target node load
- Check replication queue size
- Check for conflicts
- Check write rate

Solutions:
- Increase network bandwidth
- Scale target node
- Increase replication threads
- Reduce write rate temporarily
- Resolve conflicts
Cache Miss Rate High
Symptoms: Hit rate <60%

Investigation:
- Check cache size
- Check eviction rate
- Check query patterns
- Check data hot spots

Solutions:
- Increase cache size
- Adjust eviction policy
- Optimize query patterns
- Partition hot data
Diagnostic Queries
Check Node Health
```
curl http://localhost:9090/health
```
Check Cluster Status
```
curl http://localhost:9090/api/v1/cluster/status | jq
```
Check Active Queries
```
curl http://localhost:9090/api/v1/queries/active | jq
```
Check Replication Status
```
curl http://localhost:9090/api/v1/replication/status | jq
```
Check Cache Statistics
```
curl http://localhost:9090/api/v1/cache/stats | jq
```
Best Practices
Monitoring
- Set Realistic SLOs: Base on actual user requirements, not arbitrary targets
- Monitor What Matters: Focus on user-facing metrics first
- Use Error Budgets: Allow controlled risk-taking and innovation
- Automate Responses: Auto-remediate safe, well-understood issues
- Correlate Data: Link metrics, traces, and logs for full context
- Review Regularly: Conduct weekly SLO reviews, monthly retrospectives
Alerting
- Reduce Noise: Every alert should be actionable
- Use Severity Levels: P0 = page, P1 = email, P2 = ticket
- Write Runbooks: Every alert should have a runbook
- Test Alerts: Regularly test alert routing and escalation
- Suppress Known Issues: Silence alerts during maintenance
- Dedup and Group: Correlate related alerts
Dashboards
- Start High-Level: System overview first, drill down later
- Use Templates: Filter by cluster, region, tenant
- Show Context: Include annotations for deployments
- Enable Sharing: Make dashboards easily shareable
- Document Panels: Add descriptions to complex panels
- Keep Updated: Remove obsolete dashboards
On-Call
- Rotation: 1-week rotations, no more than 3 weeks/quarter
- Handoff: Document active incidents during handoff
- Response Time: <5 min acknowledgment for P0
- Post-Mortems: Write blameless post-mortems for P0/P1
- Continuous Improvement: Update runbooks after incidents
- Work-Life Balance: Limit out-of-hours pages
Appendices
A. Metrics Reference
Complete metrics reference: metrics-comprehensive.rs
B. Alert Rules
Complete alert rules: alert-rules-comprehensive.yml
C. Dashboard Generator
Dashboard generator script: generate-dashboards.py
D. OpenTelemetry Configuration
Complete OTel config: opentelemetry-config.yaml
E. Logging Configuration
Complete logging config: logging-config.yaml
F. SLO Configuration
Complete SLO config: slo-monitoring.yaml
Support
- Documentation: https://docs.heliosdb.io
- Runbooks: https://docs.heliosdb.io/runbooks/
- Slack: #heliosdb-operations
- On-Call: PagerDuty escalation policy “HeliosDB Production”
- Email: sre-team@heliosdb.io
Document Control
- Version: 1.0
- Last Updated: November 24, 2025
- Next Review: February 24, 2026
- Owner: SRE Team
- Approvers: VP Engineering, Director of Operations