HeliosDB Advanced Observability Dashboard - User Guide

Version: 1.0
Date: November 9, 2025
Feature ID: F1.9
ARR Impact: $12M

Table of Contents

Overview
Architecture
Getting Started
REST API Reference
WebSocket Protocol
Metrics System
Alert Engine
Query Performance Dashboard
Distributed Tracing
Configuration
Performance Tuning
Troubleshooting

Overview

The HeliosDB Advanced Observability Dashboard provides comprehensive monitoring and alerting for your HeliosDB clusters with:

Real-time metrics visualization (<1s latency)
Query performance analytics (P50/P95/P99 latencies)
Distributed tracing flamegraphs (integrated with OpenTelemetry)
Intelligent alert engine (ML-based anomaly detection)
Multi-channel notifications (email, Slack, PagerDuty, webhooks)
Prometheus-compatible metrics export
100K+ metrics/sec ingestion capacity

Key Features

Feature	Description	Performance Target
Metrics Ingestion	Time-series metric collection	100K+ metrics/sec
Real-time Updates	WebSocket streaming	<1s latency
Dashboard Load Time	React SPA loading	<2s
API Response Time	REST endpoint latency	<50ms
Alert Evaluation	Rule-based monitoring	<100ms
Data Retention	Multi-resolution storage	24 hours (configurable)

Architecture

System Components

┌─────────────────────────────────────────────────────────┐
│                   React Dashboard UI                     │
│              (Real-time Charts & Graphs)                 │
└───────────────┬─────────────────────┬───────────────────┘
                │                     │
                │ HTTP/REST           │ WebSocket
                │                     │
┌───────────────▼─────────────────────▼───────────────────┐
│              actix-web Server (Rust)                     │
│  ┌─────────────┬──────────────┬───────────────────┐    │
│  │ Metrics API │ Alert Engine │ WebSocket Server  │    │
│  └──────┬──────┴──────┬───────┴─────┬─────────────┘    │
│         │             │             │                    │
│  ┌──────▼──────┐ ┌───▼────┐   ┌────▼──────┐           │
│  │  Metrics    │ │ Alert  │   │ Real-time │           │
│  │ Aggregator  │ │ Rules  │   │  Broker   │           │
│  └─────────────┘ └────────┘   └───────────┘           │
└─────────────────────────────────────────────────────────┘
                │
                │ Metrics Collection
                │
┌───────────────▼─────────────────────────────────────────┐
│         HeliosDB Nodes (Query Engine, Storage)          │
└─────────────────────────────────────────────────────────┘

Data Flow

Metrics Collection: HeliosDB nodes emit metrics during query execution
Ingestion: Metrics API receives data via REST POST or pub/sub
Aggregation: Time-series data is aggregated into multiple resolutions
Alert Evaluation: Alert engine checks rules against incoming metrics
Real-time Streaming: WebSocket server broadcasts to connected clients
Visualization: React UI renders charts and graphs in real-time

Getting Started

Installation

Add the observability package to your HeliosDB deployment:

[dependencies]
heliosdb-observability = { path = "../heliosdb-observability", features = ["dashboard"] }

Basic Setup

use heliosdb_observability::{DashboardConfig, DashboardServer};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create configuration
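    let config = DashboardConfig {
        bind_address: "0.0.0.0".to_string(), // listen on all interfaces
        port: 8080,
        enable_cors: true,
        ws_ping_interval: 30,        // seconds between WebSocket pings
        metrics_retention_hours: 24,
        alert_check_interval: 10,    // seconds between alert rule evaluations
        enable_auth: false,          // no authentication; restrict network access accordingly
        static_files_dir: Some("./dashboard/build".to_string()),
    };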

    // Create and start server
    let server = DashboardServer::new(config);

    // Start background tasks
    server.run_background_tasks().await;

    // Run server
    server.run().await?;

    Ok(())
}

Quick Start

Run the example:

cargo run --example dashboard_quickstart

Access the dashboard:

Web UI: http://localhost:8080/
API: http://localhost:8080/api/
WebSocket: ws://localhost:8080/ws
Prometheus: http://localhost:8080/api/metrics/prometheus

REST API Reference

Metrics Endpoints

POST /api/metrics/query

Query time-series metrics. The step field is the interval between returned data points, in seconds (60 in this example).

Request:

{
  "metric": "query_latency_ms",
  "start": "2025-11-09T00:00:00Z",
  "end": "2025-11-09T23:59:59Z",
  "step": 60,
  "labels": {
    "database": "main"
  }
}

Response:

{
  "success": true,
  "data": [
    {
      "metric": "query_latency_ms",
      "data": [
        {"timestamp": 1762646400, "value": 45.2},
        {"timestamp": 1762646460, "value": 42.8}
      ]
    }
  ]
}
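
This guide does not specify how the server applies step. As a minimal sketch, assume each step-second bucket is represented by the mean of the raw points that fall into it, with buckets aligned to multiples of step. The names Point and downsample are hypothetical and not part of the HeliosDB API.

```rust
use std::collections::BTreeMap;

/// One sample: Unix timestamp in seconds and a value,
/// mirroring the shape of the response above.
#[derive(Debug, Clone, PartialEq)]
pub struct Point {
    pub timestamp: i64,
    pub value: f64,
}

/// Hypothetical helper: average all points that fall into the same
/// `step_secs`-wide bucket. Buckets are aligned to multiples of `step_secs`.
pub fn downsample(points: &[Point], step_secs: i64) -> Vec<Point> {
    assert!(step_secs > 0, "step must be positive");
    // bucket start -> (sum, count); BTreeMap keeps buckets in time order.
    let mut buckets: BTreeMap<i64, (f64, u32)> = BTreeMap::new();
    for p in points {
        let start = p.timestamp.div_euclid(step_secs) * step_secs;
        let entry = buckets.entry(start).or_insert((0.0, 0));
        entry.0 += p.value;
        entry.1 += 1;
    }
    buckets
        .into_iter()
        .map(|(start, (sum, count))| Point {
            timestamp: start,
            value: sum / f64::from(count),
        })
        .collect()
}
```

Under this assumption, a larger step returns fewer points for the same time range, which is why it helps with slow queries.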

GET /api/metrics/prometheus

Export metrics in Prometheus text format.

Response:

# TYPE heliosdb_query_latency_ms histogram
heliosdb_query_latency_ms_bucket{le="10"} 1234
heliosdb_query_latency_ms_bucket{le="50"} 5678
heliosdb_query_latency_ms_sum 123456.0
heliosdb_query_latency_ms_count 10000
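
For readers writing their own exporter or parser, the sketch below renders one histogram in the Prometheus text format. It is an illustration only: render_histogram is a hypothetical function, not the dashboard's exporter. In the Prometheus format, bucket counts are cumulative and a histogram ends with a le="+Inf" bucket equal to the total count, so the sample response above appears to be abbreviated.

```rust
/// Render one histogram in Prometheus text exposition format.
/// `bounds` are upper bucket bounds in ascending order;
/// `observations` are the raw samples.
pub fn render_histogram(name: &str, bounds: &[f64], observations: &[f64]) -> String {
    let mut out = format!("# TYPE {name} histogram\n");
    for &le in bounds {
        // Buckets are cumulative: each counts every observation <= its bound.
        let count = observations.iter().filter(|&&v| v <= le).count();
        out.push_str(&format!("{name}_bucket{{le=\"{le}\"}} {count}\n"));
    }
    // The +Inf bucket always equals the total number of observations.
    out.push_str(&format!(
        "{name}_bucket{{le=\"+Inf\"}} {}\n",
        observations.len()
    ));
    let sum: f64 = observations.iter().sum();
    out.push_str(&format!("{name}_sum {sum}\n"));
    out.push_str(&format!("{name}_count {}\n", observations.len()));
    out
}
```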

Alert Endpoints

GET /api/alerts

Get active alerts.

Response:

{
  "success": true,
  "data": [
    {
      "id": "alert_abc123",
      "rule_id": "rule_latency",
      "name": "High Query Latency",
      "severity": "warning",
      "value": 1250.5,
      "threshold": 1000.0,
      "triggered_at": "2025-11-09T12:34:56Z",
      "state": "firing"
    }
  ]
}

POST /api/alerts/rules

Create a new alert rule.

Request:

{
  "id": "rule_cpu",
  "name": "High CPU Usage",
  "description": "CPU usage exceeds 80%",
  "metric": "cpu_usage_percent",
  "condition": "greater_than",
  "threshold": 80.0,
  "window_secs": 300,
  "min_violations": 3,
  "severity": "critical",
  "channels": ["slack_alerts"],
  "enabled": true
}

GET /api/alerts/history?limit=100

Get alert history. The limit query parameter sets how many alerts to return; a response contains at most 1000 alerts.

Stats Endpoint

GET /api/stats

Get dashboard statistics summary.

Response:

{
  "success": true,
  "data": {
    "total_queries": 123456,
    "avg_query_latency_ms": 45.2,
    "p95_query_latency_ms": 125.8,
    "p99_query_latency_ms": 342.1,
    "active_connections": 128,
    "cache_hit_rate": 0.87,
    "active_alerts": 2,
    "cluster_status": "healthy",
    "timestamp": "2025-11-09T12:34:56Z"
  }
}
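
To make the p95 and p99 fields concrete, here is a minimal sketch of the nearest-rank percentile over a set of latency samples. The guide does not say which percentile method the dashboard uses, so treat this as one common definition rather than the actual implementation; percentile is a hypothetical helper.

```rust
/// Nearest-rank percentile: the smallest sample such that at least
/// `p` percent of all samples are less than or equal to it.
/// Returns None for empty input or when `p` is outside (0, 100].
pub fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // 1-based rank = ceil(p / 100 * n)
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

With only a handful of samples, P95 and P99 both collapse to the largest value, so tail percentiles are informative only when the sample count is high.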

Cluster Health Endpoint

GET /api/cluster/health

Get cluster health status.

Response:

{
  "success": true,
  "data": {
    "status": "healthy",
    "total_nodes": 3,
    "healthy_nodes": 3,
    "unhealthy_nodes": 0,
    "nodes": [
      {
        "node_id": "node-1",
        "status": "online",
        "cpu_usage": 45.2,
        "memory_usage": 67.8,
        "disk_usage": 42.1,
        "active_queries": 12,
        "uptime_seconds": 123456
      }
    ]
  }
}

WebSocket Protocol

Connection

Connect to the WebSocket endpoint:

const ws = new WebSocket('ws://localhost:8080/ws');

ws.onopen = () => {
  console.log('Connected to HeliosDB Dashboard');

  // Subscribe to metrics
  ws.send(JSON.stringify({
    type: 'subscribe',
    metrics: ['query_latency_ms', 'cpu_usage_percent']
  }));
};

Message Format

Client → Server (Subscribe)

{
  "type": "subscribe",
  "metrics": ["query_latency_ms", "throughput"]
}

Client → Server (Unsubscribe)

{
  "type": "unsubscribe",
  "metrics": ["throughput"]
}

Server → Client (Metrics Update)

{
  "type": "metrics",
  "data": [
    {
      "name": "query_latency_ms",
      "value": 45.2,
      "timestamp": "2025-11-09T12:34:56Z",
      "labels": {"database": "main"},
      "metric_type": "histogram"
    }
  ]
}
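
The subscribe and unsubscribe messages imply that each connection keeps a set of metric names it wants to receive. The sketch below shows one way to track that state. Subscriptions and its methods are hypothetical names; the server's real broker is not shown in this guide.

```rust
use std::collections::HashSet;

/// Per-connection subscription state (hypothetical).
#[derive(Default)]
pub struct Subscriptions {
    metrics: HashSet<String>,
}

impl Subscriptions {
    /// Handle a `subscribe` message: add each named metric.
    pub fn subscribe(&mut self, names: &[&str]) {
        self.metrics.extend(names.iter().map(|n| n.to_string()));
    }

    /// Handle an `unsubscribe` message: drop each named metric.
    pub fn unsubscribe(&mut self, names: &[&str]) {
        for n in names {
            self.metrics.remove(*n);
        }
    }

    /// Should a metrics update with this name be sent to the client?
    pub fn wants(&self, metric_name: &str) -> bool {
        self.metrics.contains(metric_name)
    }
}
```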

Metrics System

Supported Metric Types

Type	Description	Use Case
Counter	Monotonically increasing value	Request counts, bytes sent
Gauge	Current value that can go up/down	CPU usage, memory usage
Histogram	Distribution of values	Query latency, response times
Summary	Similar to histogram, pre-calculated quantiles	Custom percentiles
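
The practical difference between a counter and a gauge is how the value is read: a gauge is used as is, while a counter is usually turned into a rate. The sketch below computes a per-second rate from two counter samples. The reset handling (a drop in value is treated as a restart from zero) is an assumption borrowed from common practice, and counter_rate is a hypothetical helper.

```rust
/// One counter sample: (Unix seconds, cumulative value).
pub type Sample = (i64, f64);

/// Per-second rate between two counter samples.
/// Assumption: a drop in value means the counter restarted from zero,
/// so the current value is taken as the increase since the reset.
pub fn counter_rate(prev: Sample, curr: Sample) -> Option<f64> {
    let elapsed = curr.0 - prev.0;
    if elapsed <= 0 {
        return None; // duplicate or out-of-order timestamps
    }
    let increase = if curr.1 >= prev.1 {
        curr.1 - prev.1
    } else {
        curr.1
    };
    Some(increase / elapsed as f64)
}
```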

Metric Labels

All metrics support labels for multi-dimensional data:

use std::collections::HashMap;
use chrono::Utc; // assumed: Utc::now() suggests the chrono crate

let point = MetricPoint {
    name: "query_latency_ms".to_string(),
    value: 42.5,
    timestamp: Utc::now(),
    labels: HashMap::from([
        ("database".to_string(), "main".to_string()),
        ("operation".to_string(), "SELECT".to_string()),
        ("table".to_string(), "users".to_string()),
    ]),
    metric_type: MetricType::Histogram,
};
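
Label filters, as used in metric queries and alert rules, can be read as a subset match: a series matches when it carries every label in the filter with the same value. The guide does not spell out the matching rule, so the sketch below is an assumption, and labels_match is a hypothetical helper.

```rust
use std::collections::HashMap;

/// True when every (key, value) pair in `filter` is present in `labels`.
/// An empty filter matches everything; extra labels on the metric are ignored.
pub fn labels_match(
    labels: &HashMap<String, String>,
    filter: &HashMap<String, String>,
) -> bool {
    filter.iter().all(|(k, v)| labels.get(k) == Some(v))
}
```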

Aggregation Windows

Metrics are automatically aggregated into multiple resolutions for efficient querying:

Window	Duration	Retention	Use Case
Raw	1s	1 hour	Real-time monitoring
1-minute	60s	6 hours	Recent trends
5-minute	300s	24 hours	Daily patterns
15-minute	900s	3 days	Weekly trends
1-hour	3600s	7 days	Long-term analysis
1-day	86400s	30 days	Historical data
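
Two consequences of the table can be worked out directly: which resolution is the finest one still available for data of a given age, and how many points each series can hold per window (retention divided by duration). The sketch below encodes the table as data. The names WINDOWS, finest_window_for, and max_points are hypothetical, and the selection rule is an assumption about how a query planner might use the table.

```rust
/// (label, bucket duration in seconds, retention in seconds),
/// copied from the table above.
pub static WINDOWS: [(&str, u64, u64); 6] = [
    ("raw", 1, 3_600),
    ("1-minute", 60, 6 * 3_600),
    ("5-minute", 300, 24 * 3_600),
    ("15-minute", 900, 3 * 86_400),
    ("1-hour", 3_600, 7 * 86_400),
    ("1-day", 86_400, 30 * 86_400),
];

/// Finest window that still retains data `age_secs` old.
/// Returns None when the data is older than every retention period.
pub fn finest_window_for(age_secs: u64) -> Option<&'static str> {
    WINDOWS
        .iter()
        .find(|(_, _, retention)| age_secs <= *retention)
        .map(|(label, _, _)| *label)
}

/// Maximum points stored per series in one window: retention / duration.
pub fn max_points(duration_secs: u64, retention_secs: u64) -> u64 {
    retention_secs / duration_secs
}
```

Summed over all six windows this gives 3600 + 360 + 288 + 288 + 168 + 30 = 4734 points per series at most, with the raw window accounting for most of them.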

Alert Engine

Rule Configuration

Alert rules support:

Threshold-based alerts (>, <, ==, !=, >=, <=)
Windowed evaluation (check multiple violations)
Label filtering (alert on specific metric labels)
Multi-channel notifications
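
Windowed evaluation can be sketched as follows, assuming that a rule fires when at least min_violations samples inside the trailing window_secs violate the threshold, and that violations need not be consecutive. The engine's exact semantics are not documented here. Condition and should_fire are hypothetical names, and the variant names beyond GreaterThan and LessThan are guesses.

```rust
/// Comparison operators supported by alert rules (variant names assumed).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Condition {
    GreaterThan,
    LessThan,
    Equal,
    NotEqual,
    GreaterOrEqual,
    LessOrEqual,
}

impl Condition {
    /// Does `value` violate the rule for the given `threshold`?
    pub fn violated(self, value: f64, threshold: f64) -> bool {
        match self {
            Condition::GreaterThan => value > threshold,
            Condition::LessThan => value < threshold,
            Condition::Equal => value == threshold,
            Condition::NotEqual => value != threshold,
            Condition::GreaterOrEqual => value >= threshold,
            Condition::LessOrEqual => value <= threshold,
        }
    }
}

/// True when at least `min_violations` samples in the trailing window
/// (`now - window_secs`, `now`] violate the condition.
/// `samples` are (Unix seconds, value) pairs.
pub fn should_fire(
    samples: &[(i64, f64)],
    now: i64,
    window_secs: i64,
    min_violations: usize,
    condition: Condition,
    threshold: f64,
) -> bool {
    let window_start = now - window_secs;
    let violations = samples
        .iter()
        .filter(|(ts, value)| {
            *ts > window_start && *ts <= now && condition.violated(*value, threshold)
        })
        .count();
    violations >= min_violations
}
```

Requiring several violations per window trades a short alerting delay for fewer alerts caused by single spikes.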

Example Rules

High Query Latency

AlertRule {
    id: "rule_latency".to_string(),
    name: "High Query Latency".to_string(),
    metric: "query_latency_ms".to_string(),
    condition: RuleCondition::GreaterThan,
    threshold: 1000.0,
    window_secs: 60,
    min_violations: 3, // 3 violations in 60s
    severity: AlertSeverity::Warning,
    channels: vec!["slack_alerts".to_string()],
    enabled: true,
}

Low Cache Hit Rate

AlertRule {
    id: "rule_cache".to_string(),
    name: "Low Cache Hit Rate".to_string(),
    metric: "cache_hit_rate".to_string(),
    condition: RuleCondition::LessThan,
    threshold: 0.7, // < 70%
    window_secs: 300,
    min_violations: 5,
    severity: AlertSeverity::Info,
    channels: vec!["email_ops".to_string()],
    enabled: true,
}

Notification Channels

Slack

NotificationChannel::Slack {
    webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL".to_string(),
    channel: "#database-alerts".to_string(),
}

Email (SMTP)

NotificationChannel::Email {
    smtp_host: "smtp.gmail.com".to_string(),
    smtp_port: 587,
    username: "alerts@example.com".to_string(),
    password: "your-app-password".to_string(),
    from: "alerts@example.com".to_string(),
    to: vec!["ops-team@example.com".to_string()],
}

PagerDuty

NotificationChannel::PagerDuty {
    integration_key: "YOUR_PAGERDUTY_INTEGRATION_KEY".to_string(),
}

Performance Tuning

Metrics Retention

Adjust retention based on your needs:

DashboardConfig {
    metrics_retention_hours: 72, // 3 days
    ...
}

Longer retention uses more storage and can slow queries.

Alert Check Interval

DashboardConfig {
    alert_check_interval: 5, // Check every 5 seconds
    ...
}

A shorter interval detects alert conditions sooner but uses more CPU.

WebSocket Ping Interval

DashboardConfig {
    ws_ping_interval: 15, // Ping every 15 seconds
    ...
}

Troubleshooting

High Memory Usage

Symptom: Dashboard consuming excessive memory
Solution: Reduce metrics_retention_hours or increase aggregation pruning frequency

Slow Queries

Symptom: API queries taking >1s
Solution: Use appropriate step parameter for downsampling, enable aggregation

Missed Alerts

Symptom: Alerts not triggering
Solution: Check min_violations setting, verify metric labels match rule filters

WebSocket Disconnections

Symptom: Real-time updates stop
Solution: Check ws_ping_interval, verify network connectivity, review firewall rules
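
Clients that reconnect after a dropped WebSocket usually space out their attempts so a recovering server is not flooded. The guide does not prescribe a reconnect policy; the sketch below shows capped exponential backoff as one reasonable choice. backoff_delay and its parameters are hypothetical.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based):
/// `base * 2^attempt`, never more than `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt numbers.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}
```

With a 1 second base and a 30 second cap, the delays run 1s, 2s, 4s, 8s, 16s, and then stay at 30s.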

End of User Guide
