Skip to content

HeliosDB Graph Architecture

HeliosDB Graph Architecture

This document describes the architecture and design decisions for the HeliosDB Graph Query Support module.

Overview

The HeliosDB Graph module provides a comprehensive graph query engine built on top of the petgraph library, with specialized algorithms for recursive queries, traversal, and path-finding.

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│ GraphEngine │
│ - Graph management │
│ - Query coordination │
│ - Configuration │
└────────────┬────────────────────────────────────────────────┘
┌────────┴────────┬──────────────┬──────────────┬─────────┐
│ │ │ │ │
┌───▼───┐ ┌──────▼──────┐ ┌───▼────┐ ┌──────▼──────┐ │
│ CTE │ │ Traversal │ │ Cycle │ │ Path-finding│ │
│Engine │ │ Algorithms │ │Detection│ │ Algorithms │ │
└───────┘ └─────────────┘ └────────┘ └─────────────┘ │
┌──────────────────────────────────────────────────────────┘
┌───▼────────┐ ┌──────────────┐ ┌────────────────┐
│ Index │ │ Algorithms │ │ petgraph │
│ Manager │ │ (PageRank, │ │ (core graph │
│ │ │ centrality)│ │ structure) │
└────────────┘ └──────────────┘ └────────────────┘

Core Components

1. GraphEngine

The main entry point for all graph operations.

Responsibilities:

  • Graph lifecycle management (create, register, destroy)
  • Node and edge operations (add, remove, update)
  • Query routing to specialized engines
  • Configuration management

Design Decisions:

  • Uses DashMap for thread-safe concurrent graph access
  • Maintains separate indexes per graph for isolation
  • Delegates complex operations to specialized modules

Key APIs:

pub async fn register_graph(&self, name: String) -> Result<()>
pub async fn add_node(&self, graph: &str, node: Node) -> Result<()>
pub async fn add_edge(&self, graph: &str, edge: Edge) -> Result<()>
pub async fn traverse_graph(&self, graph: &str, start: NodeId,
mode: TraversalMode, max_depth: usize) -> Result<Vec<NodeId>>

2. CTE Engine (src/cte.rs)

Implements recursive Common Table Expressions with SQL-like semantics.

Algorithm:

  1. Execute anchor query (base case)
  2. Maintain work queue of current results
  3. Execute recursive query with current batch
  4. Apply union semantics (UNION or UNION ALL)
  5. Detect cycles using row hashing
  6. Terminate on max depth or empty results

Optimizations:

  • Hash-based deduplication for UNION
  • Early cycle detection
  • Configurable iteration limits
  • Work queue for memory efficiency

Complexity:

  • Time: O(iterations × query_cost)
  • Space: O(result_size)

3. Traversal Module (src/traversal.rs)

Implements graph traversal algorithms with cycle detection.

Breadth-First Search (BFS)

Algorithm:

1. Initialize queue with start node
2. While queue not empty:
a. Dequeue node
b. Visit node
c. Enqueue unvisited neighbors
3. Return visited nodes in level order

Complexity:

  • Time: O(V + E)
  • Space: O(V)

Use Cases:

  • Finding shortest path (unweighted)
  • Level-order traversal
  • Closest neighbors

Depth-First Search (DFS)

Algorithm:

1. Initialize stack with start node
2. While stack not empty:
a. Pop node
b. Visit node
c. Push unvisited neighbors
3. Return visited nodes in depth-first order

Complexity:

  • Time: O(V + E)
  • Space: O(V)

Use Cases:

  • Topological sorting
  • Cycle detection
  • Path enumeration

4. Path-Finding Module (src/pathfinding.rs)

Implements shortest path and all-paths algorithms.

Dijkstra’s Algorithm

Algorithm:

1. Initialize distances to infinity, source to 0
2. Use min-heap priority queue
3. While queue not empty:
a. Extract min distance node
b. For each neighbor:
- Calculate tentative distance
- Update if shorter
- Add to queue
4. Reconstruct path from source to target

Complexity:

  • Time: O((V + E) log V) with binary heap
  • Space: O(V)

Optimizations:

  • Early termination when target found
  • Path reconstruction using predecessor map
  • Edge weight caching

A* Algorithm

Extension of Dijkstra with heuristic function.

Algorithm:

f(n) = g(n) + h(n)
where:
g(n) = actual cost from start
h(n) = estimated cost to goal (heuristic)

Benefits:

  • Faster than Dijkstra when good heuristic available
  • Guaranteed optimal if heuristic is admissible
  • Useful for geographical graphs

All Paths Enumeration

Algorithm:

DFS-based with backtracking:
1. Maintain current path
2. Mark nodes as visited
3. If target reached, save path
4. Recurse on unvisited neighbors
5. Backtrack (unmark and remove from path)

Complexity:

  • Time: O(V!) worst case (exponential)
  • Space: O(V × max_paths)

Safeguards:

  • Maximum path length limit
  • Maximum paths limit
  • Cycle detection

5. Cycle Detection Module (src/cycle.rs)

Implements multiple cycle detection algorithms.

DFS-Based Cycle Detection

Algorithm:

1. For each unvisited node:
a. Start DFS
b. Track recursion stack
c. If node in stack is revisited, cycle found
d. Extract cycle from path

Complexity:

  • Time: O(V + E)
  • Space: O(V)

Tarjan’s Algorithm (Strongly Connected Components)

Algorithm:

1. Assign discovery time and low-link value
2. Use stack to track current path
3. When low-link equals discovery time:
a. Found SCC root
b. Pop stack until root
c. Create component

Complexity:

  • Time: O(V + E)
  • Space: O(V)

Benefits:

  • Finds all strongly connected components
  • Single pass algorithm
  • Efficient for large graphs

6. Index Manager (src/index.rs)

Provides specialized indexes for efficient graph queries.

Adjacency List Index

Structure:

outgoing: HashMap<NodeId, HashSet<NodeId>>
incoming: HashMap<NodeId, HashSet<NodeId>>
degrees: HashMap<NodeId, (in_degree, out_degree)>

Operations:

  • O(1) neighbor lookup
  • O(1) predecessor lookup
  • O(1) degree lookup

Property Index

Structure:

index: HashMap<PropertyName, HashMap<Value, HashSet<EntityId>>>

Operations:

  • O(1) query by property value
  • O(k) for k matching entities
  • Supports node and edge properties

7. Algorithms Module (src/algorithms.rs)

Implements advanced graph algorithms.

Connected Components

Uses union-find or BFS to identify connected components in undirected view of graph.

Complexity: O(V + E)

Transitive Closure

Floyd-Warshall-like approach to compute all reachable pairs.

Complexity: O(V³)

Topological Sort

Kahn’s algorithm using in-degree counting.

Algorithm:

1. Calculate in-degrees
2. Queue nodes with in-degree 0
3. Process queue:
a. Remove node
b. Decrease neighbor in-degrees
c. Queue if in-degree becomes 0
4. If all nodes processed, return sorted order

Complexity: O(V + E)

PageRank

Iterative algorithm for computing node importance.

Algorithm:

PR(v) = (1-d)/N + d × Σ(PR(u)/out_degree(u))
where:
d = damping factor (typically 0.85)
N = total nodes
u = incoming neighbors

Complexity: O(iterations × E)

Betweenness Centrality

Measures importance based on shortest paths passing through node.

Algorithm:

1. For each node as source:
a. Run BFS to find shortest paths
b. Count paths through each node
2. Accumulate contributions
3. Normalize by graph size

Complexity: O(V × E)

Data Structures

Graph Representation

Uses petgraph::DiGraph for underlying storage:

  • Directed graph with weighted edges
  • Efficient neighbor iteration
  • Stable node/edge indices

Node Structure

pub struct Node {
pub id: NodeId, // Application-level ID
pub label: String, // Human-readable label
pub properties: HashMap<String, Value>, // Arbitrary properties
}

Edge Structure

pub struct Edge {
pub id: EdgeId,
pub source: NodeId,
pub target: NodeId,
pub label: String,
pub weight: Weight, // For weighted algorithms
pub properties: HashMap<String, Value>,
}

Path Structure

pub struct Path {
pub nodes: Vec<NodeId>, // Ordered node sequence
pub edges: Vec<EdgeId>, // Ordered edge sequence
pub cost: f64, // Total path cost
pub metadata: HashMap<String, String>,
}

Performance Considerations

Memory Management

  1. Graph Storage: O(V + E) using adjacency lists
  2. Index Overhead: O(V) per graph for adjacency index
  3. Property Index: O(P × V) where P = unique property values
  4. Path Cache: Configurable, default 10,000 entries

Concurrency

  1. DashMap for thread-safe graph access
  2. Read-optimized for query-heavy workloads
  3. Write locks only during graph modifications
  4. Per-graph locking for isolation

Optimizations

  1. Early Termination

    • Stop BFS when target found
    • Stop DFS when max depth reached
    • Limit path enumeration
  2. Caching

    • Path query results
    • Degree calculations
    • Property lookups
  3. Lazy Evaluation

    • Iterator-based traversal
    • On-demand index building
    • Streaming results

Configuration Tuning

For Large Graphs (>100K nodes)

GraphConfig {
max_depth: 10, // Limit depth
max_paths: 100, // Limit path enumeration
enable_cycle_detection: false, // Disable if known acyclic
cache_size: 100000, // Larger cache
enable_optimization: true,
..Default::default()
}

For Deep Hierarchies

GraphConfig {
max_depth: 1000, // Allow deep traversal
max_iterations: 100000, // More CTE iterations
enable_cycle_detection: true, // Important for safety
..Default::default()
}

For Real-time Queries

GraphConfig {
max_depth: 5, // Shallow traversal
max_paths: 10, // Few paths
enable_optimization: true, // Query optimization
cache_size: 50000, // Moderate cache
..Default::default()
}

Testing Strategy

Unit Tests

  • Individual algorithm correctness
  • Edge cases (empty graphs, single nodes)
  • Cycle detection accuracy
  • Index consistency

Integration Tests

  • End-to-end workflows
  • Multi-graph operations
  • Concurrent access
  • Large graph performance

Test Coverage

Target: >80% code coverage

  • All public APIs tested
  • Error paths verified
  • Performance benchmarks

Future Enhancements

  1. Parallel Algorithms

    • Parallel BFS/DFS
    • Distributed PageRank
    • Multi-threaded path finding
  2. Query Optimization

    • Query plan optimization
    • Index selection
    • Join order optimization
  3. Persistent Storage

    • Graph serialization
    • Incremental updates
    • Transaction support
  4. Advanced Algorithms

    • Community detection (Louvain)
    • Graph coloring
    • Maximum flow
    • Minimum cut
  5. Visualization

    • Graph export (GraphML, DOT)
    • Interactive visualization
    • Query plan visualization

Dependencies

  • petgraph: Core graph data structure
  • tokio: Async runtime
  • dashmap: Concurrent hash map
  • ordered-float: Ordered floats for priority queue
  • serde: Serialization

References