HeliosDB Graph Architecture

This document describes the architecture and design decisions for the HeliosDB Graph Query Support module.

Overview

The HeliosDB Graph module provides a comprehensive graph query engine built on top of the petgraph library, with specialized algorithms for recursive queries, traversal, and path-finding.

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                      GraphEngine                             │
│  - Graph management                                          │
│  - Query coordination                                        │
│  - Configuration                                             │
└────────────┬────────────────────────────────────────────────┘
             │
    ┌────────┴────────┬──────────────┬──────────────┬─────────┐
    │                 │              │              │         │
┌───▼───┐      ┌──────▼──────┐  ┌───▼────┐  ┌──────▼──────┐  │
│  CTE  │      │  Traversal  │  │ Cycle  │  │ Path-finding│  │
│Engine │      │  Algorithms │  │Detection│  │ Algorithms  │  │
└───────┘      └─────────────┘  └────────┘  └─────────────┘  │
                                                               │
    ┌──────────────────────────────────────────────────────────┘
    │
┌───▼────────┐      ┌──────────────┐      ┌────────────────┐
│   Index    │      │  Algorithms  │      │   petgraph     │
│  Manager   │      │  (PageRank,  │      │  (core graph   │
│            │      │   centrality)│      │   structure)   │
└────────────┘      └──────────────┘      └────────────────┘

Core Components

1. GraphEngine

The main entry point for all graph operations.

Responsibilities:

Graph lifecycle management (create, register, destroy)
Node and edge operations (add, remove, update)
Query routing to specialized engines
Configuration management

Design Decisions:

Uses DashMap for thread-safe concurrent graph access
Maintains separate indexes per graph for isolation
Delegates complex operations to specialized modules

Key APIs:

pub async fn register_graph(&self, name: String) -> Result<()>
pub async fn add_node(&self, graph: &str, node: Node) -> Result<()>
pub async fn add_edge(&self, graph: &str, edge: Edge) -> Result<()>
pub async fn traverse_graph(&self, graph: &str, start: NodeId,
                           mode: TraversalMode, max_depth: usize) -> Result<Vec<NodeId>>

2. CTE Engine (src/cte.rs)

Implements recursive Common Table Expressions with SQL-like semantics.

Algorithm:

Execute anchor query (base case)
Maintain work queue of current results
Execute recursive query with current batch
Apply union semantics (UNION or UNION ALL)
Detect cycles using row hashing
Terminate on max depth or empty results

Optimizations:

Hash-based deduplication for UNION
Early cycle detection
Configurable iteration limits
Work queue for memory efficiency

Complexity:

Time: O(iterations × query_cost)
Space: O(result_size)

3. Traversal Module (src/traversal.rs)

Implements graph traversal algorithms with cycle detection.

Breadth-First Search (BFS)

Algorithm:

1. Initialize queue with start node
2. While queue not empty:
   a. Dequeue node
   b. Visit node
   c. Enqueue unvisited neighbors
3. Return visited nodes in level order

Complexity:

Time: O(V + E)
Space: O(V)

Use Cases:

Finding shortest path (unweighted)
Level-order traversal
Closest neighbors

Depth-First Search (DFS)

Algorithm:

1. Initialize stack with start node
2. While stack not empty:
   a. Pop node
   b. Visit node
   c. Push unvisited neighbors
3. Return visited nodes in depth-first order

Complexity:

Time: O(V + E)
Space: O(V)

Use Cases:

Topological sorting
Cycle detection
Path enumeration

4. Path-Finding Module (src/pathfinding.rs)

Implements shortest path and all-paths algorithms.

Dijkstra’s Algorithm

Algorithm:

1. Initialize distances to infinity, source to 0
2. Use min-heap priority queue
3. While queue not empty:
   a. Extract min distance node
   b. For each neighbor:
      - Calculate tentative distance
      - Update if shorter
      - Add to queue
4. Reconstruct path from source to target

Complexity:

Time: O((V + E) log V) with binary heap
Space: O(V)

Optimizations:

Early termination when target found
Path reconstruction using predecessor map
Edge weight caching

A* Algorithm

Extension of Dijkstra with heuristic function.

Algorithm:

f(n) = g(n) + h(n)
where:
  g(n) = actual cost from start
  h(n) = estimated cost to goal (heuristic)

Benefits:

Faster than Dijkstra when good heuristic available
Guaranteed optimal if heuristic is admissible
Useful for geographical graphs

All Paths Enumeration

Algorithm:

DFS-based with backtracking:
1. Maintain current path
2. Mark nodes as visited
3. If target reached, save path
4. Recurse on unvisited neighbors
5. Backtrack (unmark and remove from path)

Complexity:

Time: O(V!) worst case (exponential)
Space: O(V × max_paths)

Safeguards:

Maximum path length limit
Maximum paths limit
Cycle detection

5. Cycle Detection Module (src/cycle.rs)

Implements multiple cycle detection algorithms.

DFS-Based Cycle Detection

Algorithm:

1. For each unvisited node:
   a. Start DFS
   b. Track recursion stack
   c. If node in stack is revisited, cycle found
   d. Extract cycle from path

Complexity:

Time: O(V + E)
Space: O(V)

Tarjan’s Algorithm (Strongly Connected Components)

Algorithm:

1. Assign discovery time and low-link value
2. Use stack to track current path
3. When low-link equals discovery time:
   a. Found SCC root
   b. Pop stack until root
   c. Create component

Complexity:

Time: O(V + E)
Space: O(V)

Benefits:

Finds all strongly connected components
Single pass algorithm
Efficient for large graphs

6. Index Manager (src/index.rs)

Provides specialized indexes for efficient graph queries.

Adjacency List Index

Structure:

outgoing: HashMap<NodeId, HashSet<NodeId>>
incoming: HashMap<NodeId, HashSet<NodeId>>
degrees: HashMap<NodeId, (in_degree, out_degree)>

Operations:

O(1) neighbor lookup
O(1) predecessor lookup
O(1) degree lookup

Property Index

Structure:

index: HashMap<PropertyName, HashMap<Value, HashSet<EntityId>>>

Operations:

O(1) query by property value
O(k) for k matching entities
Supports node and edge properties

7. Algorithms Module (src/algorithms.rs)

Implements advanced graph algorithms.

Connected Components

Uses union-find or BFS to identify connected components in undirected view of graph.

Complexity: O(V + E)

Transitive Closure

Floyd-Warshall-like approach to compute all reachable pairs.

Complexity: O(V³)

Topological Sort

Kahn’s algorithm using in-degree counting.

Algorithm:

1. Calculate in-degrees
2. Queue nodes with in-degree 0
3. Process queue:
   a. Remove node
   b. Decrease neighbor in-degrees
   c. Queue if in-degree becomes 0
4. If all nodes processed, return sorted order

Complexity: O(V + E)

PageRank

Iterative algorithm for computing node importance.

Algorithm:

PR(v) = (1-d)/N + d × Σ(PR(u)/out_degree(u))

where:
  d = damping factor (typically 0.85)
  N = total nodes
  u = incoming neighbors

Complexity: O(iterations × E)

Betweenness Centrality

Measures importance based on shortest paths passing through node.

Algorithm:

1. For each node as source:
   a. Run BFS to find shortest paths
   b. Count paths through each node
2. Accumulate contributions
3. Normalize by graph size

Complexity: O(V × E)

Data Structures

Graph Representation

Uses petgraph::DiGraph for underlying storage:

Directed graph with weighted edges
Efficient neighbor iteration
Stable node/edge indices

Node Structure

pub struct Node {
    pub id: NodeId,           // Application-level ID
    pub label: String,        // Human-readable label
    pub properties: HashMap<String, Value>,  // Arbitrary properties
}

Edge Structure

pub struct Edge {
    pub id: EdgeId,
    pub source: NodeId,
    pub target: NodeId,
    pub label: String,
    pub weight: Weight,       // For weighted algorithms
    pub properties: HashMap<String, Value>,
}

Path Structure

pub struct Path {
    pub nodes: Vec<NodeId>,   // Ordered node sequence
    pub edges: Vec<EdgeId>,   // Ordered edge sequence
    pub cost: f64,            // Total path cost
    pub metadata: HashMap<String, String>,
}

Performance Considerations

Memory Management

Graph Storage: O(V + E) using adjacency lists
Index Overhead: O(V) per graph for adjacency index
Property Index: O(P × V) where P = unique property values
Path Cache: Configurable, default 10,000 entries

Concurrency

DashMap for thread-safe graph access
Read-optimized for query-heavy workloads
Write locks only during graph modifications
Per-graph locking for isolation

Optimizations

Early Termination
- Stop BFS when target found
- Stop DFS when max depth reached
- Limit path enumeration
Caching
- Path query results
- Degree calculations
- Property lookups
Lazy Evaluation
- Iterator-based traversal
- On-demand index building
- Streaming results

Configuration Tuning

For Large Graphs (>100K nodes)

GraphConfig {
    max_depth: 10,              // Limit depth
    max_paths: 100,             // Limit path enumeration
    enable_cycle_detection: false,  // Disable if known acyclic
    cache_size: 100000,         // Larger cache
    enable_optimization: true,
    ..Default::default()
}

For Deep Hierarchies

GraphConfig {
    max_depth: 1000,            // Allow deep traversal
    max_iterations: 100000,     // More CTE iterations
    enable_cycle_detection: true,  // Important for safety
    ..Default::default()
}

For Real-time Queries

GraphConfig {
    max_depth: 5,               // Shallow traversal
    max_paths: 10,              // Few paths
    enable_optimization: true,  // Query optimization
    cache_size: 50000,          // Moderate cache
    ..Default::default()
}

Testing Strategy

Unit Tests

Individual algorithm correctness
Edge cases (empty graphs, single nodes)
Cycle detection accuracy
Index consistency

Integration Tests

End-to-end workflows
Multi-graph operations
Concurrent access
Large graph performance

Test Coverage

Target: >80% code coverage

All public APIs tested
Error paths verified
Performance benchmarks

Future Enhancements

Parallel Algorithms
- Parallel BFS/DFS
- Distributed PageRank
- Multi-threaded path finding
Query Optimization
- Query plan optimization
- Index selection
- Join order optimization
Persistent Storage
- Graph serialization
- Incremental updates
- Transaction support
Advanced Algorithms
- Community detection (Louvain)
- Graph coloring
- Maximum flow
- Minimum cut
Visualization
- Graph export (GraphML, DOT)
- Interactive visualization
- Query plan visualization

Dependencies

petgraph: Core graph data structure
tokio: Async runtime
dashmap: Concurrent hash map
ordered-float: Ordered floats for priority queue
serde: Serialization

HeliosDB Graph Architecture

HeliosDB Graph Architecture

Overview

Architecture Diagram

Core Components

1. GraphEngine

2. CTE Engine (src/cte.rs)

3. Traversal Module (src/traversal.rs)

Breadth-First Search (BFS)

Depth-First Search (DFS)

4. Path-Finding Module (src/pathfinding.rs)

Dijkstra’s Algorithm

A* Algorithm

All Paths Enumeration

5. Cycle Detection Module (src/cycle.rs)

DFS-Based Cycle Detection

Tarjan’s Algorithm (Strongly Connected Components)

6. Index Manager (src/index.rs)

Adjacency List Index

Property Index

7. Algorithms Module (src/algorithms.rs)

Connected Components

Transitive Closure

Topological Sort

PageRank

Betweenness Centrality

Data Structures

Graph Representation

Node Structure

Edge Structure

Path Structure

Performance Considerations

Memory Management

Concurrency

Optimizations

Configuration Tuning

For Large Graphs (>100K nodes)

For Deep Hierarchies

For Real-time Queries

Testing Strategy

Unit Tests

Integration Tests

Test Coverage

Future Enhancements

Dependencies

References