HeliosDB Advanced Observability Dashboard - User Guide
Version: 1.0 Date: November 9, 2025 Feature ID: F1.9 ARR Impact: $12M
Table of Contents
- Overview
- Architecture
- Getting Started
- REST API Reference
- WebSocket Protocol
- Metrics System
- Alert Engine
- Query Performance Dashboard
- Distributed Tracing
- Configuration
- Performance Tuning
- Troubleshooting
Overview
The HeliosDB Advanced Observability Dashboard provides comprehensive monitoring and alerting for your HeliosDB clusters with:
- Real-time metrics visualization (<1s latency)
- Query performance analytics (P50/P95/P99 latencies)
- Distributed tracing flamegraphs (integrated with OpenTelemetry)
- Intelligent alert engine (ML-based anomaly detection)
- Multi-channel notifications (email, Slack, PagerDuty, webhooks)
- Prometheus-compatible metrics export
- 100K+ metrics/sec ingestion capacity
Key Features
| Feature | Description | Performance Target |
|---|---|---|
| Metrics Ingestion | Time-series metric collection | 100K+ metrics/sec |
| Real-time Updates | WebSocket streaming | <1s latency |
| Dashboard Load Time | React SPA loading | <2s |
| API Response Time | REST endpoint latency | <50ms |
| Alert Evaluation | Rule-based monitoring | <100ms |
| Data Retention | Multi-resolution storage | 24 hours (configurable) |
Architecture
System Components
```text
┌─────────────────────────────────────────────────────────┐
│                   React Dashboard UI                    │
│               (Real-time Charts & Graphs)               │
└───────────────┬─────────────────────┬───────────────────┘
                │                     │
                │ HTTP/REST           │ WebSocket
                │                     │
┌───────────────▼─────────────────────▼───────────────────┐
│                 actix-web Server (Rust)                 │
│  ┌─────────────┬──────────────┬───────────────────┐     │
│  │ Metrics API │ Alert Engine │ WebSocket Server  │     │
│  └──────┬──────┴──────┬───────┴─────┬─────────────┘     │
│         │             │             │                   │
│  ┌──────▼──────┐  ┌───▼────┐   ┌────▼──────┐            │
│  │  Metrics    │  │ Alert  │   │ Real-time │            │
│  │ Aggregator  │  │ Rules  │   │  Broker   │            │
│  └─────────────┘  └────────┘   └───────────┘            │
└─────────────────────────────────────────────────────────┘
                │
                │ Metrics Collection
                │
┌───────────────▼─────────────────────────────────────────┐
│         HeliosDB Nodes (Query Engine, Storage)          │
└─────────────────────────────────────────────────────────┘
```
Data Flow
- Metrics Collection: HeliosDB nodes emit metrics during query execution
- Ingestion: Metrics API receives data via REST POST or pub/sub
- Aggregation: Time-series data is aggregated into multiple resolutions
- Alert Evaluation: Alert engine checks rules against incoming metrics
- Real-time Streaming: WebSocket server broadcasts to connected clients
- Visualization: React UI renders charts and graphs in real-time
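The first stages of this flow can be sketched with standard-library channels. This is a minimal illustration, not the dashboard's implementation: `Point`, `run_pipeline`, and the mean-based threshold check are hypothetical stand-ins, and the WebSocket broadcast and UI stages are left out.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical, simplified metric sample (not the crate's MetricPoint).
#[derive(Debug, Clone)]
struct Point {
    name: String,
    value: f64,
}

/// Emits `points` from a producer thread, aggregates a per-metric mean,
/// and returns the means plus the sorted names whose mean exceeds `threshold`.
fn run_pipeline(points: Vec<Point>, threshold: f64) -> (HashMap<String, f64>, Vec<String>) {
    let (tx, rx) = mpsc::channel::<Point>();

    // Collection: a "node" thread emits metrics, then drops `tx` to close the channel.
    let producer = thread::spawn(move || {
        for p in points {
            tx.send(p).expect("receiver should be alive");
        }
    });

    // Ingestion + aggregation: accumulate (sum, count) per metric name.
    let mut sums: HashMap<String, (f64, u64)> = HashMap::new();
    for p in rx {
        let entry = sums.entry(p.name).or_insert((0.0, 0));
        entry.0 += p.value;
        entry.1 += 1;
    }
    producer.join().expect("producer thread panicked");

    let means: HashMap<String, f64> = sums
        .into_iter()
        .map(|(name, (sum, n))| (name, sum / n as f64))
        .collect();

    // Alert evaluation: a single greater-than check per metric.
    let mut firing: Vec<String> = means
        .iter()
        .filter(|(_, mean)| **mean > threshold)
        .map(|(name, _)| name.clone())
        .collect();
    firing.sort();

    (means, firing)
}
```

In a real deployment each stage would run continuously rather than draining a finite batch; the batch form is used here only to keep the sketch short.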
Getting Started
Installation
Add the observability package to your HeliosDB deployment:
```toml
[dependencies]
heliosdb-observability = { path = "../heliosdb-observability", features = ["dashboard"] }
```
Basic Setup
```rust
use heliosdb_observability::{DashboardConfig, DashboardServer};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create configuration
    let config = DashboardConfig {
        bind_address: "0.0.0.0".to_string(),
        port: 8080,
        enable_cors: true,
        ws_ping_interval: 30,
        metrics_retention_hours: 24,
        alert_check_interval: 10,
        enable_auth: false,
        static_files_dir: Some("./dashboard/build".to_string()),
    };

    // Create and start server
    let server = DashboardServer::new(config);

    // Start background tasks
    server.run_background_tasks().await;

    // Run server
    server.run().await?;

    Ok(())
}
```
Quick Start
Run the example:
```bash
cargo run --example dashboard_quickstart
```
Access the dashboard:
- Web UI: http://localhost:8080/
- API: http://localhost:8080/api/
- WebSocket: ws://localhost:8080/ws
- Prometheus: http://localhost:8080/api/metrics/prometheus
REST API Reference
Metrics Endpoints
POST /api/metrics/query
Query time-series metrics.
Request:
```json
{
  "metric": "query_latency_ms",
  "start": "2025-11-09T00:00:00Z",
  "end": "2025-11-09T23:59:59Z",
  "step": 60,
  "labels": {
    "database": "main"
  }
}
```
Response:
```json
{
  "success": true,
  "data": [
    {
      "metric": "query_latency_ms",
      "data": [
        {"timestamp": 1762646400, "value": 45.2},
        {"timestamp": 1762646460, "value": 42.8}
      ]
    }
  ]
}
```
GET /api/metrics/prometheus
Export metrics in Prometheus text format.
Response:
```text
# TYPE heliosdb_query_latency_ms histogram
heliosdb_query_latency_ms_bucket{le="10"} 1234
heliosdb_query_latency_ms_bucket{le="50"} 5678
heliosdb_query_latency_ms_sum 123456.0
heliosdb_query_latency_ms_count 10000
```
Alert Endpoints
GET /api/alerts
Get active alerts.
Response:
```json
{
  "success": true,
  "data": [
    {
      "id": "alert_abc123",
      "rule_id": "rule_latency",
      "name": "High Query Latency",
      "severity": "warning",
      "value": 1250.5,
      "threshold": 1000.0,
      "triggered_at": "2025-11-09T12:34:56Z",
      "state": "firing"
    }
  ]
}
```
POST /api/alerts/rules
Create a new alert rule.
Request:
```json
{
  "id": "rule_cpu",
  "name": "High CPU Usage",
  "description": "CPU usage exceeds 80%",
  "metric": "cpu_usage_percent",
  "condition": "greater_than",
  "threshold": 80.0,
  "window_secs": 300,
  "min_violations": 3,
  "severity": "critical",
  "channels": ["slack_alerts"],
  "enabled": true
}
```
GET /api/alerts/history?limit=100
Get alert history (up to 1000 alerts).
Stats Endpoint
GET /api/stats
Get dashboard statistics summary.
Response:
```json
{
  "success": true,
  "data": {
    "total_queries": 123456,
    "avg_query_latency_ms": 45.2,
    "p95_query_latency_ms": 125.8,
    "p99_query_latency_ms": 342.1,
    "active_connections": 128,
    "cache_hit_rate": 0.87,
    "active_alerts": 2,
    "cluster_status": "healthy",
    "timestamp": "2025-11-09T12:34:56Z"
  }
}
```
Cluster Health Endpoint
GET /api/cluster/health
Get cluster health status.
Response:
```json
{
  "success": true,
  "data": {
    "status": "healthy",
    "total_nodes": 3,
    "healthy_nodes": 3,
    "unhealthy_nodes": 0,
    "nodes": [
      {
        "node_id": "node-1",
        "status": "online",
        "cpu_usage": 45.2,
        "memory_usage": 67.8,
        "disk_usage": 42.1,
        "active_queries": 12,
        "uptime_seconds": 123456
      }
    ]
  }
}
```
WebSocket Protocol
Connection
Connect to the WebSocket endpoint:
```javascript
const ws = new WebSocket('ws://localhost:8080/ws');

ws.onopen = () => {
  console.log('Connected to HeliosDB Dashboard');

  // Subscribe to metrics
  ws.send(JSON.stringify({
    type: 'subscribe',
    metrics: ['query_latency_ms', 'cpu_usage_percent']
  }));
};
```
Message Format
Client → Server (Subscribe)
```json
{
  "type": "subscribe",
  "metrics": ["query_latency_ms", "throughput"]
}
```
Client → Server (Unsubscribe)
```json
{
  "type": "unsubscribe",
  "metrics": ["throughput"]
}
```
Server → Client (Metrics Update)
```json
{
  "type": "metrics",
  "data": [
    {
      "name": "query_latency_ms",
      "value": 45.2,
      "timestamp": "2025-11-09T12:34:56Z",
      "labels": {"database": "main"},
      "metric_type": "histogram"
    }
  ]
}
```
Metrics System
Supported Metric Types
| Type | Description | Use Case |
|---|---|---|
| Counter | Monotonically increasing value | Request counts, bytes sent |
| Gauge | Current value that can go up/down | CPU usage, memory usage |
| Histogram | Distribution of values | Query latency, response times |
| Summary | Similar to histogram, pre-calculated quantiles | Custom percentiles |
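To make the P50/P95/P99 figures reported for histogram and summary metrics concrete, here is a minimal sketch of a nearest-rank percentile over raw samples. The guide does not say which quantile method HeliosDB uses, so treat `percentile` as a hypothetical helper showing one common definition.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `p` percent
/// of the samples are less than or equal to it.
/// Returns None for empty input or when `p` is outside (0, 100].
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp is a total order on f64, so the sort is well defined even with NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    // 1-based rank; multiply before dividing to limit floating-point rounding.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

Sorting every sample is fine for a sketch, but at high ingestion rates a bucketed histogram or a streaming quantile estimator is the more usual choice.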
Metric Labels
All metrics support labels for multi-dimensional data:
```rust
let point = MetricPoint {
    name: "query_latency_ms".to_string(),
    value: 42.5,
    timestamp: Utc::now(),
    labels: HashMap::from([
        ("database".to_string(), "main".to_string()),
        ("operation".to_string(), "SELECT".to_string()),
        ("table".to_string(), "users".to_string()),
    ]),
    metric_type: MetricType::Histogram,
};
```
Aggregation Windows
Metrics are automatically aggregated into multiple resolutions for efficient querying:
| Window | Duration | Retention | Use Case |
|---|---|---|---|
| Raw | 1s | 1 hour | Real-time monitoring |
| 1-minute | 60s | 6 hours | Recent trends |
| 5-minute | 300s | 24 hours | Daily patterns |
| 15-minute | 900s | 3 days | Weekly trends |
| 1-hour | 3600s | 7 days | Long-term analysis |
| 1-day | 86400s | 30 days | Historical data |
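As a rough worked example of what this table implies for storage, the sketch below computes an upper bound on stored points per series by dividing each window's retention by its duration. The `WINDOWS` constant and both helper functions are hypothetical; the figures are copied from the table, and the sketch assumes every window is fully populated.

```rust
/// (window name, bucket duration in seconds, retention in seconds),
/// copied from the aggregation table.
const WINDOWS: [(&str, u64, u64); 6] = [
    ("raw", 1, 3_600),
    ("1-minute", 60, 6 * 3_600),
    ("5-minute", 300, 24 * 3_600),
    ("15-minute", 900, 3 * 86_400),
    ("1-hour", 3_600, 7 * 86_400),
    ("1-day", 86_400, 30 * 86_400),
];

/// Upper bound on stored points for one series in a single window.
fn points_in_window(duration_secs: u64, retention_secs: u64) -> u64 {
    retention_secs / duration_secs
}

/// Upper bound on stored points for one series across all windows.
fn points_per_series() -> u64 {
    WINDOWS
        .iter()
        .map(|&(_, duration, retention)| points_in_window(duration, retention))
        .sum()
}
```

Under these assumptions the raw window dominates: 3,600 of the 4,734 points per series come from the one hour of 1-second data.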
Alert Engine
Rule Configuration
Alert rules support:
- Threshold-based alerts (>, <, ==, !=, >=, <=)
- Windowed evaluation (check multiple violations)
- Label filtering (alert on specific metric labels)
- Multi-channel notifications
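Threshold checks and windowed evaluation combine roughly as in the sketch below: count the samples inside the window that violate the condition, and fire once the count reaches `min_violations`. `Condition`, `Rule`, and `should_fire` are simplified, hypothetical types for illustration; they are not the crate's `RuleCondition` or `AlertRule`, and the engine's real evaluation logic may differ.

```rust
/// Hypothetical comparison operators mirroring >, <, ==, !=, >=, <=.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Condition {
    GreaterThan,
    LessThan,
    Equal,
    NotEqual,
    GreaterOrEqual,
    LessOrEqual,
}

impl Condition {
    /// True when `value` violates the rule relative to `threshold`.
    fn violated(self, value: f64, threshold: f64) -> bool {
        match self {
            Condition::GreaterThan => value > threshold,
            Condition::LessThan => value < threshold,
            Condition::Equal => value == threshold,
            Condition::NotEqual => value != threshold,
            Condition::GreaterOrEqual => value >= threshold,
            Condition::LessOrEqual => value <= threshold,
        }
    }
}

/// Simplified rule carrying only the fields needed for evaluation.
struct Rule {
    condition: Condition,
    threshold: f64,
    window_secs: u64,
    min_violations: usize,
}

/// `samples` are (unix_seconds, value) pairs; `now` is the evaluation time.
/// Only samples in the half-open window (now - window_secs, now] are counted.
fn should_fire(rule: &Rule, samples: &[(u64, f64)], now: u64) -> bool {
    let window_start = now.saturating_sub(rule.window_secs);
    let violations = samples
        .iter()
        .filter(|(ts, _)| *ts > window_start && *ts <= now)
        .filter(|(_, value)| rule.condition.violated(*value, rule.threshold))
        .count();
    violations >= rule.min_violations
}
```

Requiring several violations inside the window keeps a single outlier sample from triggering a notification.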
Example Rules
High Query Latency
```rust
AlertRule {
    id: "rule_latency".to_string(),
    name: "High Query Latency".to_string(),
    metric: "query_latency_ms".to_string(),
    condition: RuleCondition::GreaterThan,
    threshold: 1000.0,
    window_secs: 60,
    min_violations: 3, // 3 violations in 60s
    severity: AlertSeverity::Warning,
    channels: vec!["slack_alerts".to_string()],
    enabled: true,
}
```
Low Cache Hit Rate
```rust
AlertRule {
    id: "rule_cache".to_string(),
    name: "Low Cache Hit Rate".to_string(),
    metric: "cache_hit_rate".to_string(),
    condition: RuleCondition::LessThan,
    threshold: 0.7, // < 70%
    window_secs: 300,
    min_violations: 5,
    severity: AlertSeverity::Info,
    channels: vec!["email_ops".to_string()],
    enabled: true,
}
```
Notification Channels
Slack
```rust
NotificationChannel::Slack {
    webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL".to_string(),
    channel: "#database-alerts".to_string(),
}
```
Email (SMTP)
```rust
NotificationChannel::Email {
    smtp_host: "smtp.gmail.com".to_string(),
    smtp_port: 587,
    username: "alerts@example.com".to_string(),
    password: "your-app-password".to_string(),
    from: "alerts@example.com".to_string(),
    to: vec!["ops-team@example.com".to_string()],
}
```
PagerDuty
```rust
NotificationChannel::PagerDuty {
    integration_key: "YOUR_PAGERDUTY_INTEGRATION_KEY".to_string(),
}
```
Performance Tuning
Metrics Retention
Adjust retention based on your needs:
```rust
DashboardConfig {
    metrics_retention_hours: 72, // 3 days
    ...
}
```
Longer retention uses more storage and slows queries.
Alert Check Interval
```rust
DashboardConfig {
    alert_check_interval: 5, // Check every 5 seconds
    ...
}
```
A lower interval detects alerts sooner but uses more CPU.
WebSocket Ping Interval
```rust
DashboardConfig {
    ws_ping_interval: 15, // Ping every 15 seconds
    ...
}
```
Troubleshooting
High Memory Usage
Symptom: Dashboard consuming excessive memory
Solution: Reduce metrics_retention_hours or increase aggregation pruning frequency
Slow Queries
Symptom: API queries taking >1s
Solution: Use appropriate step parameter for downsampling, enable aggregation
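To show why a larger `step` lightens a query, the sketch below averages raw points into buckets `step_secs` wide, so the response has one point per bucket instead of one per sample. It assumes `step` is measured in seconds and that buckets are averaged; the guide does not state the server's exact downsampling rule, and `downsample` is a hypothetical helper.

```rust
use std::collections::BTreeMap;

/// Averages (unix_seconds, value) points into buckets `step_secs` wide,
/// keyed by bucket start time and returned in ascending time order.
/// Returns an empty Vec when `step_secs` is 0.
fn downsample(points: &[(u64, f64)], step_secs: u64) -> Vec<(u64, f64)> {
    if step_secs == 0 {
        return Vec::new();
    }
    // bucket start -> (sum of values, number of samples)
    let mut buckets: BTreeMap<u64, (f64, u32)> = BTreeMap::new();
    for &(ts, value) in points {
        let start = ts - ts % step_secs;
        let entry = buckets.entry(start).or_insert((0.0, 0));
        entry.0 += value;
        entry.1 += 1;
    }
    buckets
        .into_iter()
        .map(|(start, (sum, n))| (start, sum / f64::from(n)))
        .collect()
}
```

With 1-second raw data, a 24-hour query at `step = 60` returns at most 1,440 points per series rather than 86,400.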
Missed Alerts
Symptom: Alerts not triggering
Solution: Check min_violations setting, verify metric labels match rule filters
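A frequent cause of label mismatches is a filter value that differs from the emitted label in case or spelling. The sketch below assumes exact, case-sensitive matching in which every label in the filter must appear on the metric. That is an assumption about the engine's behavior, and `labels_match` is a hypothetical helper for checking a rule by hand.

```rust
use std::collections::HashMap;

/// True when every label in `filter` is present on the metric with an
/// identical value. An empty filter matches every metric.
fn labels_match(
    filter: &HashMap<String, String>,
    metric_labels: &HashMap<String, String>,
) -> bool {
    filter
        .iter()
        .all(|(key, value)| metric_labels.get(key) == Some(value))
}
```

Under this model a filter of `database = "Main"` never matches a metric labeled `database = "main"`, so the rule would silently never fire.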
WebSocket Disconnections
Symptom: Real-time updates stop
Solution: Check ws_ping_interval, verify network connectivity, review firewall rules
End of User Guide
For additional support, see: