
HeliosDB Advanced Observability Dashboard - User Guide


Version: 1.0 | Date: November 9, 2025 | Feature ID: F1.9 | ARR Impact: $12M


Table of Contents

  1. Overview
  2. Architecture
  3. Getting Started
  4. REST API Reference
  5. WebSocket Protocol
  6. Metrics System
  7. Alert Engine
  8. Query Performance Dashboard
  9. Distributed Tracing
  10. Configuration
  11. Performance Tuning
  12. Troubleshooting

Overview

The HeliosDB Advanced Observability Dashboard provides comprehensive monitoring and alerting for your HeliosDB clusters with:

  • Real-time metrics visualization (<1s latency)
  • Query performance analytics (P50/P95/P99 latencies)
  • Distributed tracing flamegraphs (integrated with OpenTelemetry)
  • Intelligent alert engine (ML-based anomaly detection)
  • Multi-channel notifications (email, Slack, PagerDuty, webhooks)
  • Prometheus-compatible metrics export
  • 100K+ metrics/sec ingestion capacity

Key Features

Feature              Description                    Performance Target
-------------------  -----------------------------  -----------------------
Metrics Ingestion    Time-series metric collection  100K+ metrics/sec
Real-time Updates    WebSocket streaming            <1s latency
Dashboard Load Time  React SPA loading              <2s
API Response Time    REST endpoint latency          <50ms
Alert Evaluation     Rule-based monitoring          <100ms
Data Retention       Multi-resolution storage       24 hours (configurable)

Architecture

System Components

┌─────────────────────────────────────────────────────────┐
│                   React Dashboard UI                    │
│               (Real-time Charts & Graphs)               │
└───────────────┬─────────────────────┬───────────────────┘
                │                     │
                │ HTTP/REST           │ WebSocket
                │                     │
┌───────────────▼─────────────────────▼───────────────────┐
│                 actix-web Server (Rust)                 │
│  ┌─────────────┬──────────────┬───────────────────┐     │
│  │ Metrics API │ Alert Engine │ WebSocket Server  │     │
│  └──────┬──────┴──────┬───────┴─────┬─────────────┘     │
│         │             │             │                   │
│  ┌──────▼──────┐  ┌───▼────┐   ┌────▼──────┐            │
│  │   Metrics   │  │ Alert  │   │ Real-time │            │
│  │ Aggregator  │  │ Rules  │   │  Broker   │            │
│  └─────────────┘  └────────┘   └───────────┘            │
└─────────────────────────────────────────────────────────┘
                │ Metrics Collection
┌───────────────▼─────────────────────────────────────────┐
│         HeliosDB Nodes (Query Engine, Storage)          │
└─────────────────────────────────────────────────────────┘

Data Flow

  1. Metrics Collection: HeliosDB nodes emit metrics during query execution
  2. Ingestion: Metrics API receives data via REST POST or pub/sub
  3. Aggregation: Time-series data is aggregated into multiple resolutions
  4. Alert Evaluation: Alert engine checks rules against incoming metrics
  5. Real-time Streaming: WebSocket server broadcasts to connected clients
  6. Visualization: React UI renders charts and graphs in real-time
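
The flow above can be sketched end to end with plain Rust channels standing in for the transport layers. This is a minimal illustration only: `MetricPoint` and `violates` are simplified stand-ins for the crate's actual types, not its API.

```rust
use std::sync::mpsc;

// Simplified stand-in for a metric emitted by a HeliosDB node.
#[derive(Clone, Debug)]
struct MetricPoint {
    name: String,
    value: f64,
}

// Step 4 stand-in: a trivial threshold check in place of the alert engine.
fn violates(point: &MetricPoint, threshold: f64) -> bool {
    point.value > threshold
}

fn main() {
    // Steps 1-2: a node "emits" a metric into the ingestion channel.
    let (ingest_tx, ingest_rx) = mpsc::channel::<MetricPoint>();
    // Step 5: the broker fans points out to connected clients.
    let (ws_tx, ws_rx) = mpsc::channel::<MetricPoint>();

    ingest_tx
        .send(MetricPoint { name: "query_latency_ms".into(), value: 1250.0 })
        .unwrap();

    // Steps 3-5: the server drains ingestion, evaluates rules, broadcasts.
    for point in ingest_rx.try_iter() {
        if violates(&point, 1000.0) {
            println!("ALERT: {} = {}", point.name, point.value);
        }
        ws_tx.send(point).unwrap();
    }

    // Step 6: a client receives the update for rendering.
    let update = ws_rx.recv().unwrap();
    println!("client got {} = {}", update.name, update.value);
}
```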

Getting Started

Installation

Add the observability crate to your deployment's Cargo.toml:

[dependencies]
heliosdb-observability = { path = "../heliosdb-observability", features = ["dashboard"] }

Basic Setup

use heliosdb_observability::{DashboardConfig, DashboardServer};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create configuration
    let config = DashboardConfig {
        bind_address: "0.0.0.0".to_string(),
        port: 8080,
        enable_cors: true,
        ws_ping_interval: 30,        // seconds
        metrics_retention_hours: 24,
        alert_check_interval: 10,    // seconds
        enable_auth: false,
        static_files_dir: Some("./dashboard/build".to_string()),
    };

    // Create and start server
    let server = DashboardServer::new(config);

    // Start background tasks
    server.run_background_tasks().await;

    // Run server
    server.run().await?;
    Ok(())
}

Quick Start

Run the example:

Terminal window
cargo run --example dashboard_quickstart

Access the dashboard by opening http://localhost:8080 in your browser (the port set in DashboardConfig).

REST API Reference

Metrics Endpoints

POST /api/metrics/query

Query time-series metrics.

Request:

{
  "metric": "query_latency_ms",
  "start": "2025-11-09T00:00:00Z",
  "end": "2025-11-09T23:59:59Z",
  "step": 60,
  "labels": {
    "database": "main"
  }
}

Response:

{
  "success": true,
  "data": [
    {
      "metric": "query_latency_ms",
      "data": [
        {"timestamp": 1699488000, "value": 45.2},
        {"timestamp": 1699488060, "value": 42.8}
      ]
    }
  ]
}

GET /api/metrics/prometheus

Export metrics in Prometheus text format.

Response:

# TYPE heliosdb_query_latency_ms histogram
heliosdb_query_latency_ms_bucket{le="10"} 1234
heliosdb_query_latency_ms_bucket{le="50"} 5678
heliosdb_query_latency_ms_sum 123456.0
heliosdb_query_latency_ms_count 10000
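
The cumulative-bucket layout of that output (each `le` bucket counts all observations at or below its bound) can be illustrated with a small formatter. `prometheus_histogram` is a hypothetical helper written for this guide, not part of the crate:

```rust
// Render cumulative histogram buckets in Prometheus text exposition format.
// Bucket counts are cumulative: each `le` bucket includes all smaller ones.
fn prometheus_histogram(name: &str, buckets: &[(f64, u64)], sum: f64, count: u64) -> String {
    let mut out = format!("# TYPE {name} histogram\n");
    for (le, cumulative) in buckets {
        out.push_str(&format!("{name}_bucket{{le=\"{le}\"}} {cumulative}\n"));
    }
    // _sum and _count summarize the whole distribution.
    out.push_str(&format!("{name}_sum {sum}\n"));
    out.push_str(&format!("{name}_count {count}\n"));
    out
}

fn main() {
    let text = prometheus_histogram(
        "heliosdb_query_latency_ms",
        &[(10.0, 1234), (50.0, 5678)],
        123456.0,
        10000,
    );
    print!("{text}");
}
```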

Alert Endpoints

GET /api/alerts

Get active alerts.

Response:

{
  "success": true,
  "data": [
    {
      "id": "alert_abc123",
      "rule_id": "rule_latency",
      "name": "High Query Latency",
      "severity": "warning",
      "value": 1250.5,
      "threshold": 1000.0,
      "triggered_at": "2025-11-09T12:34:56Z",
      "state": "firing"
    }
  ]
}

POST /api/alerts/rules

Create a new alert rule.

Request:

{
  "id": "rule_cpu",
  "name": "High CPU Usage",
  "description": "CPU usage exceeds 80%",
  "metric": "cpu_usage_percent",
  "condition": "greater_than",
  "threshold": 80.0,
  "window_secs": 300,
  "min_violations": 3,
  "severity": "critical",
  "channels": ["slack_alerts"],
  "enabled": true
}

GET /api/alerts/history?limit=100

Get alert history (up to 1000 alerts).

Stats Endpoint

GET /api/stats

Get dashboard statistics summary.

Response:

{
  "success": true,
  "data": {
    "total_queries": 123456,
    "avg_query_latency_ms": 45.2,
    "p95_query_latency_ms": 125.8,
    "p99_query_latency_ms": 342.1,
    "active_connections": 128,
    "cache_hit_rate": 0.87,
    "active_alerts": 2,
    "cluster_status": "healthy",
    "timestamp": "2025-11-09T12:34:56Z"
  }
}
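
The P50/P95/P99 fields can be understood as nearest-rank percentiles over latency samples. The `percentile` helper below is illustrative only; the actual server may estimate percentiles from histogram buckets rather than sorting raw data:

```rust
// Nearest-rank percentile over raw samples. Production systems usually
// estimate percentiles from histogram buckets instead of sorting raw data.
fn percentile(samples: &[f64], p: f64) -> f64 {
    assert!(!samples.is_empty() && (0.0..=100.0).contains(&p));
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank: smallest value that covers at least p percent of samples.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.saturating_sub(1)]
}

fn main() {
    // Synthetic latencies 1..=100 ms make the ranks easy to check.
    let latencies: Vec<f64> = (1..=100).map(|i| i as f64).collect();
    println!("p50 = {}", percentile(&latencies, 50.0));
    println!("p95 = {}", percentile(&latencies, 95.0));
    println!("p99 = {}", percentile(&latencies, 99.0));
}
```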

Cluster Health Endpoint

GET /api/cluster/health

Get cluster health status.

Response:

{
  "success": true,
  "data": {
    "status": "healthy",
    "total_nodes": 3,
    "healthy_nodes": 3,
    "unhealthy_nodes": 0,
    "nodes": [
      {
        "node_id": "node-1",
        "status": "online",
        "cpu_usage": 45.2,
        "memory_usage": 67.8,
        "disk_usage": 42.1,
        "active_queries": 12,
        "uptime_seconds": 123456
      }
    ]
  }
}

WebSocket Protocol

Connection

Connect to the WebSocket endpoint:

const ws = new WebSocket('ws://localhost:8080/ws');

ws.onopen = () => {
  console.log('Connected to HeliosDB Dashboard');

  // Subscribe to metrics
  ws.send(JSON.stringify({
    type: 'subscribe',
    metrics: ['query_latency_ms', 'cpu_usage_percent']
  }));
};

Message Format

Client → Server (Subscribe)

{
  "type": "subscribe",
  "metrics": ["query_latency_ms", "throughput"]
}

Client → Server (Unsubscribe)

{
  "type": "unsubscribe",
  "metrics": ["throughput"]
}

Server → Client (Metrics Update)

{
  "type": "metrics",
  "data": [
    {
      "name": "query_latency_ms",
      "value": 45.2,
      "timestamp": "2025-11-09T12:34:56Z",
      "labels": {"database": "main"},
      "metric_type": "histogram"
    }
  ]
}
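
Server-side, subscribe/unsubscribe messages can be modeled as a per-connection filter: the broker pushes only the metric names a client has asked for. This `Subscription` type is a sketch for this guide, not the server's actual implementation:

```rust
use std::collections::HashSet;

// Per-connection subscription state: only subscribed metric names are pushed.
struct Subscription {
    metrics: HashSet<String>,
}

impl Subscription {
    fn new() -> Self {
        Subscription { metrics: HashSet::new() }
    }

    // Handle a {"type": "subscribe"} message.
    fn subscribe(&mut self, names: &[&str]) {
        self.metrics.extend(names.iter().map(|s| s.to_string()));
    }

    // Handle a {"type": "unsubscribe"} message.
    fn unsubscribe(&mut self, names: &[&str]) {
        for n in names {
            self.metrics.remove(*n);
        }
    }

    // Keep only the updates this client asked for.
    fn filter<'a>(&self, updates: &'a [(&'a str, f64)]) -> Vec<(&'a str, f64)> {
        updates.iter().filter(|u| self.metrics.contains(u.0)).copied().collect()
    }
}

fn main() {
    let mut sub = Subscription::new();
    sub.subscribe(&["query_latency_ms", "throughput"]);
    sub.unsubscribe(&["throughput"]);
    let updates = [("query_latency_ms", 45.2), ("throughput", 1000.0)];
    // Only query_latency_ms survives the filter.
    println!("{:?}", sub.filter(&updates));
}
```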

Metrics System

Supported Metric Types

Type       Description                                      Use Case
---------  -----------------------------------------------  -----------------------------
Counter    Monotonically increasing value                   Request counts, bytes sent
Gauge      Current value that can go up or down             CPU usage, memory usage
Histogram  Distribution of values                           Query latency, response times
Summary    Like a histogram, with pre-calculated quantiles  Custom percentiles

Metric Labels

All metrics support labels for multi-dimensional data:

use std::collections::HashMap;

use chrono::Utc;

let point = MetricPoint {
    name: "query_latency_ms".to_string(),
    value: 42.5,
    timestamp: Utc::now(),
    labels: HashMap::from([
        ("database".to_string(), "main".to_string()),
        ("operation".to_string(), "SELECT".to_string()),
        ("table".to_string(), "users".to_string()),
    ]),
    metric_type: MetricType::Histogram,
};

Aggregation Windows

Metrics are automatically aggregated into multiple resolutions for efficient querying:

Window     Duration  Retention  Use Case
---------  --------  ---------  --------------------
Raw        1s        1 hour     Real-time monitoring
1-minute   60s       6 hours    Recent trends
5-minute   300s      24 hours   Daily patterns
15-minute  900s      3 days     Weekly trends
1-hour     3600s     7 days     Long-term analysis
1-day      86400s    30 days    Historical data
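
Rolling raw points into a coarser window can be sketched as bucketed averaging. The `downsample` helper below is hypothetical; the real aggregator likely also tracks min/max/count per window rather than just the mean:

```rust
use std::collections::BTreeMap;

// Roll raw (unix_ts, value) points up into fixed windows, averaging values.
fn downsample(points: &[(u64, f64)], window_secs: u64) -> Vec<(u64, f64)> {
    let mut windows: BTreeMap<u64, (f64, u64)> = BTreeMap::new();
    for &(ts, value) in points {
        // Snap each timestamp to the start of its window.
        let bucket = ts - (ts % window_secs);
        let entry = windows.entry(bucket).or_insert((0.0, 0));
        entry.0 += value;
        entry.1 += 1;
    }
    // Emit (window_start, mean) in timestamp order.
    windows
        .into_iter()
        .map(|(start, (sum, n))| (start, sum / n as f64))
        .collect()
}

fn main() {
    let raw = [(0, 10.0), (30, 20.0), (60, 40.0)];
    // Two 60s windows: [0,60) averages to 15.0, [60,120) to 40.0.
    println!("{:?}", downsample(&raw, 60));
}
```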

Alert Engine

Rule Configuration

Alert rules support:

  • Threshold-based alerts (>, <, ==, !=, >=, <=)
  • Windowed evaluation (check multiple violations)
  • Label filtering (alert on specific metric labels)
  • Multi-channel notifications
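
The windowed evaluation above can be sketched as counting threshold breaches inside the lookback window and firing only at `min_violations`. `should_fire` is an illustrative stand-in for the rule engine, not its real API:

```rust
// Count threshold violations inside the evaluation window and fire only
// when at least `min_violations` samples breached the threshold.
fn should_fire(
    samples: &[(u64, f64)], // (unix_ts, value)
    now: u64,
    window_secs: u64,
    threshold: f64,
    min_violations: usize,
) -> bool {
    let violations = samples
        .iter()
        .filter(|&&(ts, v)| now.saturating_sub(ts) <= window_secs && v > threshold)
        .count();
    violations >= min_violations
}

fn main() {
    let samples = [(100, 1200.0), (110, 900.0), (130, 1500.0), (150, 1100.0)];
    // 3 of the 4 samples in the last 60s exceed 1000.0, so the alert fires.
    println!("{}", should_fire(&samples, 160, 60, 1000.0, 3));
}
```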

Example Rules

High Query Latency

AlertRule {
    id: "rule_latency".to_string(),
    name: "High Query Latency".to_string(),
    metric: "query_latency_ms".to_string(),
    condition: RuleCondition::GreaterThan,
    threshold: 1000.0,
    window_secs: 60,
    min_violations: 3, // 3 violations in 60s
    severity: AlertSeverity::Warning,
    channels: vec!["slack_alerts".to_string()],
    enabled: true,
}

Low Cache Hit Rate

AlertRule {
    id: "rule_cache".to_string(),
    name: "Low Cache Hit Rate".to_string(),
    metric: "cache_hit_rate".to_string(),
    condition: RuleCondition::LessThan,
    threshold: 0.7, // < 70%
    window_secs: 300,
    min_violations: 5,
    severity: AlertSeverity::Info,
    channels: vec!["email_ops".to_string()],
    enabled: true,
}

Notification Channels

Slack

NotificationChannel::Slack {
    webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL".to_string(),
    channel: "#database-alerts".to_string(),
}

Email (SMTP)

NotificationChannel::Email {
    smtp_host: "smtp.gmail.com".to_string(),
    smtp_port: 587,
    username: "alerts@example.com".to_string(),
    password: "your-app-password".to_string(),
    from: "alerts@example.com".to_string(),
    to: vec!["ops-team@example.com".to_string()],
}

PagerDuty

NotificationChannel::PagerDuty {
    integration_key: "YOUR_PAGERDUTY_INTEGRATION_KEY".to_string(),
}

Performance Tuning

Metrics Retention

Adjust retention based on your needs:

DashboardConfig {
    metrics_retention_hours: 72, // 3 days
    ...
}

Longer retention means more storage and slower queries.
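
The trade-off follows from how expiry works: on each pass, points older than the retention cutoff are dropped. The `prune` helper below is a sketch of that idea, not the crate's actual pruning logic:

```rust
// Drop (unix_ts, value) points older than the retention cutoff.
fn prune(points: &mut Vec<(u64, f64)>, now: u64, retention_secs: u64) {
    points.retain(|&(ts, _)| now.saturating_sub(ts) <= retention_secs);
}

fn main() {
    let mut points = vec![(0u64, 1.0), (100, 2.0)];
    // With now = 120 and 60s retention, only the point at ts = 100 survives.
    prune(&mut points, 120, 60);
    println!("{:?}", points);
}
```

A longer `metrics_retention_hours` simply moves this cutoff back, so more points survive each pass and queries scan more data.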

Alert Check Interval

DashboardConfig {
    alert_check_interval: 5, // Check every 5 seconds
    ...
}

A lower interval means faster alerting at the cost of higher CPU usage.

WebSocket Ping Interval

DashboardConfig {
    ws_ping_interval: 15, // Ping every 15 seconds
    ...
}

Troubleshooting

High Memory Usage

Symptom: Dashboard consuming excessive memory
Solution: Reduce metrics_retention_hours or increase aggregation pruning frequency

Slow Queries

Symptom: API queries taking >1s
Solution: Use an appropriate step parameter for downsampling, enable aggregation

Missed Alerts

Symptom: Alerts not triggering
Solution: Check the min_violations setting, verify metric labels match rule filters

WebSocket Disconnections

Symptom: Real-time updates stop
Solution: Check ws_ping_interval, verify network connectivity, review firewall rules


End of User Guide

For additional support, see: