HeliosDB Graph Architecture
HeliosDB Graph Architecture
This document describes the architecture and design decisions for the HeliosDB Graph Query Support module.
Overview
The HeliosDB Graph module provides a comprehensive graph query engine built on top of the petgraph library, with specialized algorithms for recursive queries, traversal, and path-finding.
Architecture Diagram
┌─────────────────────────────────────────────────────────────┐│ GraphEngine ││ - Graph management ││ - Query coordination ││ - Configuration │└────────────┬────────────────────────────────────────────────┘ │ ┌────────┴────────┬──────────────┬──────────────┬─────────┐ │ │ │ │ │┌───▼───┐ ┌──────▼──────┐ ┌───▼────┐ ┌──────▼──────┐ ││ CTE │ │ Traversal │ │ Cycle │ │ Path-finding│ ││Engine │ │ Algorithms │ │Detection│ │ Algorithms │ │└───────┘ └─────────────┘ └────────┘ └─────────────┘ │ │ ┌──────────────────────────────────────────────────────────┘ │┌───▼────────┐ ┌──────────────┐ ┌────────────────┐│ Index │ │ Algorithms │ │ petgraph ││ Manager │ │ (PageRank, │ │ (core graph ││ │ │ centrality)│ │ structure) │└────────────┘ └──────────────┘ └────────────────┘Core Components
1. GraphEngine
The main entry point for all graph operations.
Responsibilities:
- Graph lifecycle management (create, register, destroy)
- Node and edge operations (add, remove, update)
- Query routing to specialized engines
- Configuration management
Design Decisions:
- Uses
DashMapfor thread-safe concurrent graph access - Maintains separate indexes per graph for isolation
- Delegates complex operations to specialized modules
Key APIs:
pub async fn register_graph(&self, name: String) -> Result<()>pub async fn add_node(&self, graph: &str, node: Node) -> Result<()>pub async fn add_edge(&self, graph: &str, edge: Edge) -> Result<()>pub async fn traverse_graph(&self, graph: &str, start: NodeId, mode: TraversalMode, max_depth: usize) -> Result<Vec<NodeId>>2. CTE Engine (src/cte.rs)
Implements recursive Common Table Expressions with SQL-like semantics.
Algorithm:
- Execute anchor query (base case)
- Maintain work queue of current results
- Execute recursive query with current batch
- Apply union semantics (UNION or UNION ALL)
- Detect cycles using row hashing
- Terminate on max depth or empty results
Optimizations:
- Hash-based deduplication for UNION
- Early cycle detection
- Configurable iteration limits
- Work queue for memory efficiency
Complexity:
- Time: O(iterations × query_cost)
- Space: O(result_size)
3. Traversal Module (src/traversal.rs)
Implements graph traversal algorithms with cycle detection.
Breadth-First Search (BFS)
Algorithm:
1. Initialize queue with start node2. While queue not empty: a. Dequeue node b. Visit node c. Enqueue unvisited neighbors3. Return visited nodes in level orderComplexity:
- Time: O(V + E)
- Space: O(V)
Use Cases:
- Finding shortest path (unweighted)
- Level-order traversal
- Closest neighbors
Depth-First Search (DFS)
Algorithm:
1. Initialize stack with start node2. While stack not empty: a. Pop node b. Visit node c. Push unvisited neighbors3. Return visited nodes in depth-first orderComplexity:
- Time: O(V + E)
- Space: O(V)
Use Cases:
- Topological sorting
- Cycle detection
- Path enumeration
4. Path-Finding Module (src/pathfinding.rs)
Implements shortest path and all-paths algorithms.
Dijkstra’s Algorithm
Algorithm:
1. Initialize distances to infinity, source to 02. Use min-heap priority queue3. While queue not empty: a. Extract min distance node b. For each neighbor: - Calculate tentative distance - Update if shorter - Add to queue4. Reconstruct path from source to targetComplexity:
- Time: O((V + E) log V) with binary heap
- Space: O(V)
Optimizations:
- Early termination when target found
- Path reconstruction using predecessor map
- Edge weight caching
A* Algorithm
Extension of Dijkstra with heuristic function.
Algorithm:
f(n) = g(n) + h(n)where: g(n) = actual cost from start h(n) = estimated cost to goal (heuristic)Benefits:
- Faster than Dijkstra when good heuristic available
- Guaranteed optimal if heuristic is admissible
- Useful for geographical graphs
All Paths Enumeration
Algorithm:
DFS-based with backtracking:1. Maintain current path2. Mark nodes as visited3. If target reached, save path4. Recurse on unvisited neighbors5. Backtrack (unmark and remove from path)Complexity:
- Time: O(V!) worst case (exponential)
- Space: O(V × max_paths)
Safeguards:
- Maximum path length limit
- Maximum paths limit
- Cycle detection
5. Cycle Detection Module (src/cycle.rs)
Implements multiple cycle detection algorithms.
DFS-Based Cycle Detection
Algorithm:
1. For each unvisited node: a. Start DFS b. Track recursion stack c. If node in stack is revisited, cycle found d. Extract cycle from pathComplexity:
- Time: O(V + E)
- Space: O(V)
Tarjan’s Algorithm (Strongly Connected Components)
Algorithm:
1. Assign discovery time and low-link value2. Use stack to track current path3. When low-link equals discovery time: a. Found SCC root b. Pop stack until root c. Create componentComplexity:
- Time: O(V + E)
- Space: O(V)
Benefits:
- Finds all strongly connected components
- Single pass algorithm
- Efficient for large graphs
6. Index Manager (src/index.rs)
Provides specialized indexes for efficient graph queries.
Adjacency List Index
Structure:
outgoing: HashMap<NodeId, HashSet<NodeId>>incoming: HashMap<NodeId, HashSet<NodeId>>degrees: HashMap<NodeId, (in_degree, out_degree)>Operations:
- O(1) neighbor lookup
- O(1) predecessor lookup
- O(1) degree lookup
Property Index
Structure:
index: HashMap<PropertyName, HashMap<Value, HashSet<EntityId>>>Operations:
- O(1) query by property value
- O(k) for k matching entities
- Supports node and edge properties
7. Algorithms Module (src/algorithms.rs)
Implements advanced graph algorithms.
Connected Components
Uses union-find or BFS to identify connected components in undirected view of graph.
Complexity: O(V + E)
Transitive Closure
Floyd-Warshall-like approach to compute all reachable pairs.
Complexity: O(V³)
Topological Sort
Kahn’s algorithm using in-degree counting.
Algorithm:
1. Calculate in-degrees2. Queue nodes with in-degree 03. Process queue: a. Remove node b. Decrease neighbor in-degrees c. Queue if in-degree becomes 04. If all nodes processed, return sorted orderComplexity: O(V + E)
PageRank
Iterative algorithm for computing node importance.
Algorithm:
PR(v) = (1-d)/N + d × Σ(PR(u)/out_degree(u))
where: d = damping factor (typically 0.85) N = total nodes u = incoming neighborsComplexity: O(iterations × E)
Betweenness Centrality
Measures importance based on shortest paths passing through node.
Algorithm:
1. For each node as source: a. Run BFS to find shortest paths b. Count paths through each node2. Accumulate contributions3. Normalize by graph sizeComplexity: O(V × E)
Data Structures
Graph Representation
Uses petgraph::DiGraph for underlying storage:
- Directed graph with weighted edges
- Efficient neighbor iteration
- Stable node/edge indices
Node Structure
pub struct Node { pub id: NodeId, // Application-level ID pub label: String, // Human-readable label pub properties: HashMap<String, Value>, // Arbitrary properties}Edge Structure
pub struct Edge { pub id: EdgeId, pub source: NodeId, pub target: NodeId, pub label: String, pub weight: Weight, // For weighted algorithms pub properties: HashMap<String, Value>,}Path Structure
pub struct Path { pub nodes: Vec<NodeId>, // Ordered node sequence pub edges: Vec<EdgeId>, // Ordered edge sequence pub cost: f64, // Total path cost pub metadata: HashMap<String, String>,}Performance Considerations
Memory Management
- Graph Storage: O(V + E) using adjacency lists
- Index Overhead: O(V) per graph for adjacency index
- Property Index: O(P × V) where P = unique property values
- Path Cache: Configurable, default 10,000 entries
Concurrency
- DashMap for thread-safe graph access
- Read-optimized for query-heavy workloads
- Write locks only during graph modifications
- Per-graph locking for isolation
Optimizations
-
Early Termination
- Stop BFS when target found
- Stop DFS when max depth reached
- Limit path enumeration
-
Caching
- Path query results
- Degree calculations
- Property lookups
-
Lazy Evaluation
- Iterator-based traversal
- On-demand index building
- Streaming results
Configuration Tuning
For Large Graphs (>100K nodes)
GraphConfig { max_depth: 10, // Limit depth max_paths: 100, // Limit path enumeration enable_cycle_detection: false, // Disable if known acyclic cache_size: 100000, // Larger cache enable_optimization: true, ..Default::default()}For Deep Hierarchies
GraphConfig { max_depth: 1000, // Allow deep traversal max_iterations: 100000, // More CTE iterations enable_cycle_detection: true, // Important for safety ..Default::default()}For Real-time Queries
GraphConfig { max_depth: 5, // Shallow traversal max_paths: 10, // Few paths enable_optimization: true, // Query optimization cache_size: 50000, // Moderate cache ..Default::default()}Testing Strategy
Unit Tests
- Individual algorithm correctness
- Edge cases (empty graphs, single nodes)
- Cycle detection accuracy
- Index consistency
Integration Tests
- End-to-end workflows
- Multi-graph operations
- Concurrent access
- Large graph performance
Test Coverage
Target: >80% code coverage
- All public APIs tested
- Error paths verified
- Performance benchmarks
Future Enhancements
-
Parallel Algorithms
- Parallel BFS/DFS
- Distributed PageRank
- Multi-threaded path finding
-
Query Optimization
- Query plan optimization
- Index selection
- Join order optimization
-
Persistent Storage
- Graph serialization
- Incremental updates
- Transaction support
-
Advanced Algorithms
- Community detection (Louvain)
- Graph coloring
- Maximum flow
- Minimum cut
-
Visualization
- Graph export (GraphML, DOT)
- Interactive visualization
- Query plan visualization
Dependencies
petgraph: Core graph data structuretokio: Async runtimedashmap: Concurrent hash mapordered-float: Ordered floats for priority queueserde: Serialization