HeliosDB Monitoring and Alerting Operations Guide


Version: 1.0
Last Updated: November 24, 2025
Target Audience: SRE Team, Operations Engineers, Database Administrators


Table of Contents

  1. Executive Summary
  2. Architecture Overview
  3. Metrics Collection
  4. Dashboards
  5. Alerting
  6. Distributed Tracing
  7. Log Aggregation
  8. SLO Monitoring
  9. Anomaly Detection
  10. Runbooks
  11. Troubleshooting
  12. Best Practices

Executive Summary

What This Document Covers

This guide provides comprehensive documentation for HeliosDB’s production-grade monitoring and alerting infrastructure, covering:

  • Metrics Collection: 150+ metrics across all 184 crates
  • Dashboards: 50+ Grafana dashboards for all features
  • Alerting: 200+ alert rules with P0-P3 severity levels
  • Distributed Tracing: OpenTelemetry integration for full request tracing
  • Log Aggregation: ELK Stack and Splunk integration
  • SLO Monitoring: Service Level Objectives with error budgets
  • Anomaly Detection: ML-powered anomaly detection and intelligent alerting

Key Metrics at a Glance

| Metric          | Target | Critical Threshold |
|-----------------|--------|--------------------|
| Availability    | 99.9%  | <99.5%             |
| P99 Latency     | <500ms | >1000ms            |
| P95 Latency     | <100ms | >500ms             |
| Error Rate      | <0.1%  | >1%                |
| Replication Lag | <10s   | >60s               |
| Cache Hit Rate  | >80%   | <60%               |

Architecture Overview

Monitoring Stack Components

┌─────────────────────────────────────────────────────────────┐
│ HeliosDB Cluster │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ :9090 │ │ :9090 │ │ :9090 │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ │ Metrics │ │ │
│ │ Traces │ │ │
│ │ Logs │ │ │
└───────┼─────────────┼─────────────┼──────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ OpenTelemetry Collector │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Receivers │ │ Processors │ │ Exporters │ │
│ │ OTLP/Jaeger │ │ Sampling │ │ Prom/Jaeger │ │
│ │ Prometheus │ │ Enrichment │ │ ES/Loki │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Prometheus │ │ Jaeger │ │ ELK Stack │
│ Metrics │ │ Traces │ │ Logs │
│ Storage │ │ Storage │ │ Storage │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Grafana │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ System │ │ Feature │ │ SLO │ │ Anomaly │ │
│ │ Overview │ │ Specific │ │Monitoring│ │Detection │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────┘
┌──────────────────────┐
│ Alertmanager │
│ - Routing │
│ - Deduplication │
│ - Silencing │
└──────────────────────┘
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ PagerDuty │ │ Slack │ │ Email │
└────────────┘ └────────────┘ └────────────┘

Data Flow

  1. Metrics Export: HeliosDB exports Prometheus metrics on /metrics endpoint
  2. Trace Generation: OpenTelemetry SDK generates traces for all operations
  3. Log Emission: Structured JSON logs written to files
  4. Collection: OpenTelemetry Collector scrapes/receives data
  5. Processing: Sampling, enrichment, filtering, correlation
  6. Storage: Prometheus (metrics), Jaeger/Tempo (traces), Elasticsearch/Loki (logs)
  7. Visualization: Grafana queries all data sources
  8. Alerting: Prometheus evaluates alert rules, Alertmanager handles notifications
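Steps 4-6 of this flow are implemented by the OpenTelemetry Collector. A minimal configuration sketch of that pipeline, assuming illustrative hostnames and the stock otlp, prometheus, batch, and prometheusremotewrite components (the production config lives in opentelemetry-config.yaml and may differ):

```yaml
receivers:
  otlp:                      # traces and logs pushed by HeliosDB nodes
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  prometheus:                # scrape /metrics from each node
    config:
      scrape_configs:
        - job_name: heliosdb
          static_configs:
            - targets: ["node-01:9090", "node-02:9090", "node-03:9090"]

processors:
  batch: {}                  # batch before export

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  otlp/jaeger:
    endpoint: jaeger:4317

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```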

Metrics Collection

Available Metrics

Core Database Metrics (P0 - Critical)

# Query performance
heliosdb_queries_total{query_type, status, protocol}
heliosdb_query_duration_seconds_bucket{query_type, protocol}
heliosdb_query_rows_scanned{query_type}
heliosdb_query_rows_returned{query_type}
# Transactions
heliosdb_transactions_total{type, status}
heliosdb_transaction_duration_seconds{type}
heliosdb_active_transactions{type}
heliosdb_transaction_conflicts_total{conflict_type}
# Connections
heliosdb_active_connections{protocol}
heliosdb_connection_wait_seconds{protocol}
heliosdb_connection_errors_total{protocol, error_type}
# Storage
heliosdb_storage_operations_total{operation, status}
heliosdb_storage_operation_duration_seconds{operation}
heliosdb_storage_bytes_read_total{tier}
heliosdb_storage_bytes_written_total{tier}
heliosdb_memtable_size_bytes{table}
heliosdb_compaction_queue_depth{level}
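All of the metrics above are served on each node's /metrics endpoint in the Prometheus text exposition format. As a stdlib-only sketch of what one such series looks like on the wire (render_counter is an illustrative helper, not HeliosDB code, and the sample value is made up):

```python
def render_counter(name, help_text, samples):
    """Render a counter in Prometheus text exposition format.

    samples: list of (label_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

# Illustrative value only -- real numbers come from the running node.
text = render_counter(
    "heliosdb_queries_total",
    "Total queries processed.",
    [({"query_type": "select", "status": "ok", "protocol": "grpc"}, 12345)],
)
print(text)
```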

Multi-Model Database Metrics

# Vector database
heliosdb_vector_search_operations_total{index_type, status}
heliosdb_vector_search_duration_seconds{index_type}
heliosdb_vector_index_size{index_name, dimensions}
# Graph database
heliosdb_graph_traversal_operations_total{traversal_type, status}
heliosdb_graph_traversal_duration_seconds{traversal_type}
heliosdb_graph_nodes{graph_name, node_type}
heliosdb_graph_edges{graph_name, edge_type}
# Document store
heliosdb_document_operations_total{operation, collection, status}
heliosdb_document_size_bytes{collection}
# Time-series
heliosdb_timeseries_ingest_rate{metric}
heliosdb_timeseries_compression_ratio{metric}

Distributed Systems Metrics

# Cluster health
heliosdb_cluster_nodes{status}
heliosdb_cluster_leader_elections_total
heliosdb_cluster_split_brain_detected_total
# Replication
heliosdb_replication_lag_seconds{source, target, region}
heliosdb_replication_synced_bytes_total{source, target}
heliosdb_replication_conflicts_total{conflict_type, resolution}
heliosdb_replication_queue_depth{source, target}
# Sharding
heliosdb_shard_operations_total{operation, status}
heliosdb_shard_rebalancing_active
heliosdb_shard_data_size_bytes{shard_id}
# Multi-region
heliosdb_cross_region_latency_seconds{source_region, target_region}
heliosdb_region_failover_operations_total{from_region, to_region, status}

Querying Metrics

Common PromQL Queries

Calculate error rate:

sum(rate(heliosdb_queries_total{status="error"}[5m]))
/
sum(rate(heliosdb_queries_total[5m]))
* 100

P95 latency:

histogram_quantile(0.95,
  sum(rate(heliosdb_query_duration_seconds_bucket[5m])) by (le, query_type)
) * 1000

Cache hit rate:

sum(rate(heliosdb_cache_hits_total[5m])) by (cache_type)
/
(
sum(rate(heliosdb_cache_hits_total[5m])) by (cache_type)
+
sum(rate(heliosdb_cache_misses_total[5m])) by (cache_type)
) * 100

Top queries by duration:

topk(10,
  sum(rate(heliosdb_query_duration_seconds_sum[5m])) by (query_type)
  /
  sum(rate(heliosdb_query_duration_seconds_count[5m])) by (query_type)
)
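These expressions can also be run programmatically against the Prometheus HTTP API (/api/v1/query). A stdlib-only sketch; the Prometheus hostname is illustrative:

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"  # illustrative hostname

def instant_query_url(base, promql):
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def instant_query(base, promql):
    """Run a PromQL instant query and return the result vector."""
    with urllib.request.urlopen(instant_query_url(base, promql)) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

error_rate_expr = (
    'sum(rate(heliosdb_queries_total{status="error"}[5m]))'
    " / sum(rate(heliosdb_queries_total[5m])) * 100"
)
print(instant_query_url(PROM_URL, error_rate_expr))
```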

Dashboards

Available Dashboards

1. System Overview Dashboard

URL: https://grafana.heliosdb.io/d/system-overview Purpose: High-level system health and performance Panels:

  • Cluster health (healthy nodes)
  • Query rate (QPS)
  • Error rate
  • P99 latency
  • Active connections
  • Replication lag
  • Query rate by type (graph)
  • Query latency percentiles (graph)
  • CPU usage by node
  • Memory usage by node
  • Disk usage by node
  • Network I/O
  • Storage operations
  • Compaction queue depth
  • Cache performance
  • Active alerts table

2. Vector Database Dashboard

URL: https://grafana.heliosdb.io/d/vector-database Panels:

  • Search operations/sec
  • P95 search latency
  • Total vectors indexed
  • Insert rate
  • Search latency distribution
  • Index size growth
  • HNSW performance metrics
  • Vector dimension distribution

3. Graph Database Dashboard

URL: https://grafana.heliosdb.io/d/graph-database Panels:

  • Traversal operations/sec
  • P95 traversal latency
  • Total nodes/edges
  • Graph density
  • Traversal patterns
  • Cypher query performance
  • PageRank/centrality metrics

4. Multi-Tenancy Dashboard

URL: https://grafana.heliosdb.io/d/multi-tenancy Panels:

  • Active tenants
  • Data size by tenant (top 10)
  • Query rate by tenant (top 10)
  • Tenant quotas vs usage
  • Noisy neighbor detection
  • Cross-tenant isolation metrics

5. Replication Dashboard

URL: https://grafana.heliosdb.io/d/replication Panels:

  • Replication lag heatmap
  • Throughput by replication stream
  • Conflict rate
  • Queue depth
  • Sync status
  • Cross-region latency matrix

6. Security Dashboard

URL: https://grafana.heliosdb.io/d/security Panels:

  • Authentication attempts/failures
  • Authorization checks
  • Security violations
  • Audit events
  • Suspicious activity detection
  • Failed login attempts by IP

7. Cache Performance Dashboard

URL: https://grafana.heliosdb.io/d/cache-performance Panels:

  • Hit rate by cache tier
  • Eviction rate
  • Cache size
  • Operation latency
  • Cache efficiency trends

8. ML Features Dashboard

URL: https://grafana.heliosdb.io/d/ml-features Panels:

  • Model inference rate
  • Model latency
  • Auto-index creation rate
  • Optimizer invocations
  • Improvement ratio
  • NL2SQL confidence scores

9. Edge Computing Dashboard

URL: https://grafana.heliosdb.io/d/edge-computing Panels:

  • Active edge nodes
  • Sync lag by edge node
  • Edge function invocations
  • Total edge data
  • Edge network latency

10. Lakehouse Integration Dashboard

URL: https://grafana.heliosdb.io/d/lakehouse Panels:

  • Tables by format (Iceberg, Delta, Hudi)
  • Query operations by format
  • Metadata operations
  • Compaction activity
  • Table size distribution

Additional Dashboards

  1. OLTP/OLAP Workloads: Transaction performance and analytical query metrics
  2. Sharding & Rebalancing: Shard distribution and rebalancing activity
  3. Multi-Region: Cross-region performance and failover metrics
  4. Cost Management: Query costs and resource optimization
  5. NL2SQL: Natural language query performance
  6. Document Store: MongoDB-compatible document operations
  7. Time-Series: Time-series ingestion and compression
  8. Distributed Tracing: Trace visualization and analysis
  9. Workload Optimizer: Query optimization metrics
  10. Compression: Compression ratios and performance

Dashboard Navigation

Best Practices:

  1. Start with System Overview dashboard for overall health
  2. Drill down to feature-specific dashboards for detailed analysis
  3. Use time range selector to focus on incident timeframes
  4. Use template variables to filter by cluster, instance, tenant
  5. Enable auto-refresh (30s recommended) for real-time monitoring

Alerting

Alert Severity Levels

| Severity | Priority | Response Time     | Notification       | Action                       |
|----------|----------|-------------------|--------------------|------------------------------|
| Critical | P0       | Immediate (5 min) | PagerDuty + Phone  | Page on-call, start incident |
| High     | P1       | 15 minutes        | Slack + Email      | Notify team, investigate     |
| Warning  | P2       | 1 hour            | Slack              | Monitor, plan action         |
| Info     | P3       | Next business day | Slack (monitoring) | Log for review               |
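In Alertmanager terms, this table maps onto a routing tree keyed on a severity label. A sketch, assuming receiver names and severity label values that are illustrative rather than taken from the production config:

```yaml
route:
  receiver: slack-monitoring            # default: P3 / info
  group_by: [alertname, cluster]
  routes:
    - matchers: ['severity="critical"'] # P0: page on-call
      receiver: pagerduty-oncall
      repeat_interval: 5m
    - matchers: ['severity="high"']     # P1: notify team
      receiver: slack-and-email
    - matchers: ['severity="warning"']  # P2: monitor, plan action
      receiver: slack-team

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-routing-key>"
  - name: slack-and-email
    slack_configs:
      - channel: "#heliosdb-alerts"
    email_configs:
      - to: "team@example.com"
  - name: slack-team
    slack_configs:
      - channel: "#heliosdb-alerts"
  - name: slack-monitoring
    slack_configs:
      - channel: "#heliosdb-monitoring"
```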

Critical Alerts (P0)

DatabaseDown

Description: HeliosDB node is completely down Query: up{job="heliosdb"} == 0 Duration: 30s Impact: Data unavailability, potential data loss Runbook: https://docs.heliosdb.io/runbooks/database-down

ClusterQuorumLost

Description: Less than majority of nodes are healthy Query: count(up{job="heliosdb"} == 1) < (count(up{job="heliosdb"}) / 2 + 1) Duration: 30s Impact: Write operations blocked, data inconsistency risk Runbook: https://docs.heliosdb.io/runbooks/quorum-lost

CriticalQueryErrorRate

Description: Query error rate exceeds 5% Query: sum(rate(heliosdb_queries_total{status="error"}[1m])) / sum(rate(heliosdb_queries_total[1m])) > 0.05 Duration: 2m Impact: Widespread query failures affecting users Runbook: https://docs.heliosdb.io/runbooks/high-error-rate

CriticalReplicationFailure

Description: Replication lag exceeds 5 minutes Query: heliosdb_replication_lag_seconds > 300 Duration: 5m Impact: Data consistency at risk, failover risk Runbook: https://docs.heliosdb.io/runbooks/replication-failure

OutOfMemory

Description: Memory usage exceeds 98% Query: (heliosdb_memory_usage_bytes{type="total"} / ignoring(type) heliosdb_memory_usage_bytes{type="limit"}) > 0.98 Duration: 1m Impact: Imminent OOM kills, database crash risk Runbook: https://docs.heliosdb.io/runbooks/out-of-memory

DiskFull

Description: Disk usage exceeds 95% Query: (heliosdb_disk_usage_bytes{type="used"} / ignoring(type) heliosdb_disk_usage_bytes{type="capacity"}) > 0.95 Duration: 2m Impact: Write operations will fail, database may crash Runbook: https://docs.heliosdb.io/runbooks/disk-full
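The complete rule set lives in alert-rules-comprehensive.yml. As a sketch of the Prometheus alerting-rule shape, here are the first two P0 alerts above (the annotation and label names are illustrative, not necessarily those in the production file):

```yaml
groups:
  - name: heliosdb-critical
    rules:
      - alert: DatabaseDown
        expr: up{job="heliosdb"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "HeliosDB node {{ $labels.instance }} is down"
          runbook_url: https://docs.heliosdb.io/runbooks/database-down
      - alert: ClusterQuorumLost
        expr: count(up{job="heliosdb"} == 1) < (count(up{job="heliosdb"}) / 2 + 1)
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "HeliosDB cluster has lost quorum"
          runbook_url: https://docs.heliosdb.io/runbooks/quorum-lost
```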

High Priority Alerts (P1)

HighQueryLatencyP99

Description: P99 query latency exceeds 1 second Query: histogram_quantile(0.99, sum(rate(heliosdb_query_duration_seconds_bucket[5m])) by (le)) > 1.0 Duration: 5m Impact: Poor user experience, SLA at risk Action: Analyze slow queries, check for missing indexes

HighConnectionPoolUtilization

Description: Connection pool at 85%+ utilization Query: heliosdb_active_connections / heliosdb_active_connections{type="max"} > 0.85 Duration: 10m Impact: Connection timeouts imminent Action: Increase pool size or scale horizontally

CacheHitRateDegraded

Description: Cache hit rate below 60% Query: sum(rate(heliosdb_cache_hits_total[10m])) / (sum(rate(heliosdb_cache_hits_total[10m])) + sum(rate(heliosdb_cache_misses_total[10m]))) < 0.6 Duration: 15m Impact: Increased storage I/O, query latency Action: Increase cache size, analyze eviction patterns

Alert Response Workflow

Alert Fires
┌─────────────────┐
│ Alertmanager │
│ - Deduplication │
│ - Grouping │
│ - Routing │
└────────┬────────┘
├──────────────────────────────┐
▼ ▼
[Critical?] [High/Warning?]
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ PagerDuty │ │ Slack │
│ + Phone │ │ + Email │
└──────┬───────┘ └──────┬───────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ On-call Eng │ │ Team Lead │
│ Acknowledges │ │ Reviews │
└──────┬───────┘ └──────┬───────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Follow │ │ Monitor │
│ Runbook │ │ & Plan Fix │
└──────┬───────┘ └──────────────┘
┌──────────────┐
│ Mitigate │
│ Issue │
└──────┬───────┘
┌──────────────┐
│ Resolve │
│ Alert │
└──────┬───────┘
┌──────────────┐
│ Post-Mortem │
│ (if P0/P1) │
└──────────────┘

Distributed Tracing

Trace Instrumentation

HeliosDB automatically instruments all operations with OpenTelemetry:

Trace Hierarchy:

Query Request (Root Span)
├─ Authentication (Span)
├─ Authorization (Span)
├─ Query Parsing (Span)
├─ Query Planning (Span)
│ ├─ Cost Estimation
│ └─ Index Selection
├─ Query Execution (Span)
│ ├─ Storage Read (Span)
│ │ ├─ Cache Lookup
│ │ └─ Disk Read (if cache miss)
│ ├─ Index Scan (Span)
│ └─ Filter/Join Operations (Span)
└─ Result Serialization (Span)

Trace Attributes

Every span includes:

  • trace_id: Unique trace identifier
  • span_id: Unique span identifier
  • parent_span_id: Parent span (for hierarchy)
  • service.name: “heliosdb”
  • service.version: “7.0.0”
  • deployment.environment: “production”
  • db.system: “heliosdb”
  • db.operation: Query type (SELECT, INSERT, etc.)
  • db.statement: Hashed SQL statement
  • db.table: Table name
  • db.rows_affected: Number of rows
  • duration_ms: Span duration
  • tenant_id: Tenant identifier
  • query_complexity: low/medium/high
  • slo.met: true/false

Viewing Traces

Jaeger UI: https://jaeger.heliosdb.io

Common Searches:

# Find slow queries
duration > 1s
# Find errors
error=true
# Find specific query type
db.operation=SELECT
# Find distributed transactions
transaction.distributed=true
# Find queries for specific tenant
tenant_id=abc123
# Find queries missing SLO
slo.met=false

Trace Sampling

  • Always sampled: Errors, slow queries (>1s), distributed transactions
  • High sample rate (50%): P99+ latency queries
  • Medium sample rate (10%): Normal queries
  • Low sample rate (1%): Health checks
  • Never sampled: Internal monitoring
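These policies map naturally onto the OpenTelemetry Collector's tail_sampling processor (contrib distribution). A sketch covering the error, slow-query, and baseline tiers; the exact policies in opentelemetry-config.yaml may differ:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: always-sample-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: always-sample-slow       # queries slower than 1s
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline                 # 10% of normal traffic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```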

Log Aggregation

Log Structure

All logs are emitted in structured JSON format:

{
  "timestamp": "2025-11-24T10:30:45.123Z",
  "level": "INFO",
  "category": "query",
  "message": "Query executed successfully",
  "service": "heliosdb",
  "version": "7.0.0",
  "environment": "production",
  "cluster": "prod-cluster-01",
  "node_id": "node-01",
  "region": "us-east-1",
  "query_id": "q-12345",
  "query_type": "SELECT",
  "duration_ms": 45.2,
  "rows_returned": 100,
  "cache_hit": true,
  "tenant_id": "tenant-uuid",
  "trace_id": "abc123",
  "span_id": "xyz789"
}

Log Categories

  • query: Query execution and optimization
  • transaction: Transaction lifecycle
  • storage: Storage engine operations
  • replication: Replication and sync
  • security: Authentication, authorization, audit
  • network: Network communication
  • cache: Cache operations
  • ml: ML model inference
  • system: System-level events
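A stdlib-only sketch of emitting logs in this shape from an ops script (field names follow the example above; this is not a claim about HeliosDB's internal logger):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Format log records as structured JSON, one object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "category": getattr(record, "category", "system"),
            "message": record.getMessage(),
            "service": "heliosdb",
        }
        # Merge any extra structured fields (query_id, tenant_id, ...).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("heliosdb")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Query executed successfully",
    extra={"category": "query", "fields": {"query_id": "q-12345", "duration_ms": 45.2}},
)
```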

Searching Logs

Kibana (ELK): https://kibana.heliosdb.io

Common Queries:

# Find errors in last hour
level:ERROR AND @timestamp:[now-1h TO now]
# Find slow queries
category:query AND duration_ms:>1000
# Find security violations
category:security AND event_type:unauthorized_access
# Find queries for specific tenant
tenant_id:"abc123"
# Find queries that missed SLO
slo.met:false
# Correlate logs with traces
trace_id:"abc123"

Splunk: https://splunk.heliosdb.io

# Error rate trend
index=heliosdb_logs level=ERROR
| timechart count by category
# Slow query analysis
index=heliosdb_queries duration_ms>1000
| stats avg(duration_ms) p95(duration_ms) p99(duration_ms) count by query_type
# Security audit
index=heliosdb_audit category=security
| stats count by event_type, user_id
| sort -count

SLO Monitoring

Defined SLOs

Availability SLO

  • Availability: 99.9% of requests succeed (target from the key metrics table)
  • Window: 30 days

Latency SLOs

  • P95 Latency: <100ms (all queries)
  • P99 Latency: <500ms (OLTP queries)
  • Window: 30 days

Data Freshness SLO

  • Replication Lag: <10 seconds
  • Window: 30 days

Error Budget

Formula:

Error Budget Remaining (%) = (1 - (1 - Actual Availability) / (1 - Target Availability)) × 100

Example: with a 99.9% target the allowed error is 0.1%; an actual availability of 99.95% has spent half of that, leaving 50% of the budget.

Burn Rate:

Burn Rate = (Error Budget Consumed / Time Elapsed) / (Total Error Budget / SLO Window)

Alert Thresholds:

  • Fast Burn (14.4x): 2% budget consumed in 1 hour → Page immediately
  • Moderate Burn (6x): 5% budget consumed in 6 hours → Alert team
  • Slow Burn (1x): 10% budget consumed in 3 days → Notify for review
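Worked examples of the error-budget and burn-rate arithmetic, as a short stdlib-only sketch (values are illustrative):

```python
def error_budget_remaining(actual, target):
    """Fraction of error budget left: 1 - (observed error / allowed error)."""
    allowed_error = 1.0 - target
    observed_error = 1.0 - actual
    return 1.0 - observed_error / allowed_error

def burn_rate(budget_consumed, hours_elapsed, window_hours=30 * 24):
    """How fast the budget burns relative to even spend over the SLO window."""
    return (budget_consumed / hours_elapsed) / (1.0 / window_hours)

# 99.9% target, 99.95% actual: about half the error budget is left.
print(error_budget_remaining(0.9995, 0.999))   # ~0.5
# 2% of the 30-day budget gone in 1 hour: the 14.4x fast-burn threshold.
print(burn_rate(0.02, 1))                      # ~14.4
```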

SLO Dashboard

Access the SLO dashboard at: https://grafana.heliosdb.io/d/slo-overview

Panels:

  • SLO compliance status (green/yellow/red)
  • Error budget remaining (%)
  • Current burn rate
  • SLI trends (7-day)
  • Top SLO violations
  • Projected budget exhaustion

Anomaly Detection

Detection Methods

1. Statistical Detection

  • Z-Score: Detects values >3 standard deviations from mean
  • ARIMA: Forecasts expected values, detects deviations
  • Exponential Smoothing: Detects trend changes
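The z-score method can be sketched in a few lines (stdlib only; the real detector consumes Prometheus data rather than an in-memory list):

```python
import statistics

def zscore_anomalies(series, threshold=3.0):
    """Return indices of points more than `threshold` std devs from the mean."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mean) / std > threshold]

# Steady QPS with one spike: only the spike (index 20) is flagged.
qps = [10.0] * 20 + [100.0]
print(zscore_anomalies(qps))   # [20]
```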

2. ML-Based Detection

  • Isolation Forest: Detects multivariate anomalies
  • LSTM: Predicts time-series, detects deviations
  • Autoencoders: Learns normal patterns, detects unusual patterns

3. Rule-Based Detection

  • Sudden Spike: Value >3× rolling average
  • Sudden Drop: Value <0.3× rolling average
  • Unusual Pattern: Deviation from historical patterns

Anomaly Scoring

Anomalies are scored 0.0-1.0 based on confidence:

  • 0.9–1.0: Critical anomaly (investigate immediately)
  • 0.75–0.9: High confidence anomaly (investigate soon)
  • 0.5–0.75: Medium confidence anomaly (monitor)
  • 0.3–0.5: Low confidence anomaly (log only)

Anomaly Dashboard

Access at: https://grafana.heliosdb.io/d/anomaly-detection

Panels:

  • Detected anomalies (last 24h)
  • Anomaly score timeline
  • Baseline vs actual metrics
  • Anomaly heatmap
  • Top anomalous metrics

Runbooks

Critical Runbooks

Database Down

Trigger: Node completely unreachable Steps:

  1. Check node status: systemctl status heliosdb
  2. Check logs: journalctl -u heliosdb -n 100
  3. If OOM killed: Increase memory, restart
  4. If crash: Check core dump, restart
  5. If disk full: Clean up space, restart
  6. Verify cluster health after restart

Escalation: If restart fails, page DBA lead

Quorum Lost

Trigger: Less than majority of nodes healthy Steps:

  1. Identify down nodes
  2. Check network connectivity between nodes
  3. Check for split-brain scenario
  4. Restore down nodes ASAP
  5. Verify cluster reformed correctly

Escalation: If network partition, engage network team

High Error Rate

Trigger: Query error rate >1% Steps:

  1. Check error logs for common patterns
  2. Identify affected query types
  3. Check for recent schema/code changes
  4. Check resource utilization (CPU, memory, disk)
  5. Roll back recent changes if applicable
  6. Scale up resources if needed

Escalation: If errors continue, page on-call engineer

Out of Memory

Trigger: Memory usage >98% Steps:

  1. Identify memory consumers: Check cache size, active queries
  2. Kill long-running queries if necessary
  3. Reduce cache sizes temporarily
  4. Check for memory leaks in recent changes
  5. Scale up memory or scale horizontally
  6. Investigate root cause after mitigation

Prevention: Tune cache sizes, set query timeouts

Troubleshooting

Common Issues

High Latency

Symptoms: P95/P99 latency elevated Investigation:

  1. Check CPU/memory/disk utilization
  2. Check cache hit rate
  3. Analyze slow query log
  4. Check for missing indexes
  5. Check replication lag
  6. Check compaction backlog

Solutions:

  • Add missing indexes
  • Optimize queries
  • Increase cache size
  • Scale resources
  • Rebalance shards

High Error Rate

Symptoms: Elevated query failures Investigation:

  1. Check error logs for patterns
  2. Check affected query types
  3. Check recent deployments
  4. Check resource limits
  5. Check authentication issues

Solutions:

  • Roll back bad deployment
  • Fix query bugs
  • Increase resource limits
  • Fix authentication configuration

Replication Lag

Symptoms: Lag >10 seconds Investigation:

  1. Check network bandwidth
  2. Check target node load
  3. Check replication queue size
  4. Check for conflicts
  5. Check write rate

Solutions:

  • Increase network bandwidth
  • Scale target node
  • Increase replication threads
  • Reduce write rate temporarily
  • Resolve conflicts

Cache Miss Rate High

Symptoms: Hit rate <60% Investigation:

  1. Check cache size
  2. Check eviction rate
  3. Check query patterns
  4. Check data hot spots

Solutions:

  • Increase cache size
  • Adjust eviction policy
  • Optimize query patterns
  • Partition hot data

Diagnostic Queries

Check Node Health

curl http://localhost:9090/health

Check Cluster Status

curl http://localhost:9090/api/v1/cluster/status | jq

Check Active Queries

curl http://localhost:9090/api/v1/queries/active | jq

Check Replication Status

curl http://localhost:9090/api/v1/replication/status | jq

Check Cache Statistics

curl http://localhost:9090/api/v1/cache/stats | jq

Best Practices

Monitoring

  1. Set Realistic SLOs: Base on actual user requirements, not arbitrary targets
  2. Monitor What Matters: Focus on user-facing metrics first
  3. Use Error Budgets: Allow controlled risk-taking and innovation
  4. Automate Responses: Auto-remediate safe, well-understood issues
  5. Correlate Data: Link metrics, traces, and logs for full context
  6. Review Regularly: Conduct weekly SLO reviews, monthly retrospectives

Alerting

  1. Reduce Noise: Every alert should be actionable
  2. Use Severity Levels: P0 = page, P1 = email, P2 = ticket
  3. Write Runbooks: Every alert should have a runbook
  4. Test Alerts: Regularly test alert routing and escalation
  5. Suppress Known Issues: Silence alerts during maintenance
  6. Dedup and Group: Correlate related alerts

Dashboards

  1. Start High-Level: System overview first, drill down later
  2. Use Templates: Filter by cluster, region, tenant
  3. Show Context: Include annotations for deployments
  4. Enable Sharing: Make dashboards easily shareable
  5. Document Panels: Add descriptions to complex panels
  6. Keep Updated: Remove obsolete dashboards

On-Call

  1. Rotation: 1-week rotations, no more than 3 weeks/quarter
  2. Handoff: Document active incidents during handoff
  3. Response Time: <5 min acknowledgment for P0
  4. Post-Mortems: Write blameless post-mortems for P0/P1
  5. Continuous Improvement: Update runbooks after incidents
  6. Work-Life Balance: Limit out-of-hours pages

Appendices

A. Metrics Reference

Complete metrics reference: metrics-comprehensive.rs

B. Alert Rules

Complete alert rules: alert-rules-comprehensive.yml

C. Dashboard Generator

Dashboard generator script: generate-dashboards.py

D. OpenTelemetry Configuration

Complete OTel config: opentelemetry-config.yaml

E. Logging Configuration

Complete logging config: logging-config.yaml

F. SLO Configuration

Complete SLO config: slo-monitoring.yaml



Document Control

  • Version: 1.0
  • Last Updated: November 24, 2025
  • Next Review: February 24, 2026
  • Owner: SRE Team
  • Approvers: VP Engineering, Director of Operations