Graph-RAG Tutorial
Available since: v3.17.0 (2026-04-24, phase 3 MVP) — entity linker, docling ingestion, centrality rerank in v3.19.0
Build: cargo build --release --features graph-rag (implies code-graph)
APIs: EmbeddedDatabase::graph_rag_project_symbols, graph_rag_search, graph_rag_link_exact, graph_rag_link_vector, graph_rag_ingest_*
Tables: _hdb_graph_nodes, _hdb_graph_edges — both plain user tables, joinable, branch-aware, queryable
UVP
A “RAG” stack typically means a vector DB, a graph DB, a relational store, and four glue services keeping them in sync. HeliosDB Nano collapses all four into one embedded process. Every entity — code symbol, doc chunk, email, support ticket, investor question — lives in _hdb_graph_nodes with typed edges in _hdb_graph_edges. The flagship query, seed → BFS expand → centrality rerank, runs as a single EmbeddedDatabase::graph_rag_search() call (or one MCP tools/call) and pushes every WHERE through the bloom-filter / zone-map / SIMD storage path. One graph, one schema, one binary, real SQL semantics.
Prerequisites
- HeliosDB Nano v3.19+ source tree
- Rust 1.85+
- The Code-Graph Tutorial — graph-rag’s primary node population is the projection from `_hdb_code_symbols`
- About 25 minutes
1. Build with --features graph-rag
```bash
cargo build --release --features graph-rag
```

`graph-rag` implies `code-graph` (the symbol projection is meaningless without a code graph). All the `code_index` / `lsp_*` APIs from the Code-Graph Tutorial are available alongside.
2. The Universal Schema
The two tables are bootstrapped on first call to any graph_rag_* API:
```sql
-- Created automatically — shown here for reference.
CREATE TABLE IF NOT EXISTS _hdb_graph_nodes (
    node_id    BIGSERIAL PRIMARY KEY,
    node_kind  TEXT NOT NULL,
    source_ref TEXT,
    title      TEXT,
    text       TEXT,
    extra      TEXT
);

CREATE TABLE IF NOT EXISTS _hdb_graph_edges (
    edge_id   BIGSERIAL PRIMARY KEY,
    from_node BIGINT NOT NULL REFERENCES _hdb_graph_nodes(node_id),
    to_node   BIGINT NOT NULL REFERENCES _hdb_graph_nodes(node_id),
    edge_kind TEXT NOT NULL,
    weight    REAL,
    extra     TEXT
);
```

Convention:
| Field | What it carries |
|---|---|
| `node_kind` | Function, Class, Struct, DocChunk, DocSection, Email, Issue, InvestorQuestion, Answer, Person, Pdf, Office, Audio, Image |
| `source_ref` | Stable cross-source identifier — code_symbol:42, doc:readme, email:<message-id>, docling:document:<name> |
| `edge_kind` | CALLS, IMPORTS, REFERENCES, MENTIONS, CITES, REPLIES_TO, ASKS_ABOUT, AUTHORED_BY, CONTAINS, PART_OF |
| `weight` | Edge-specific scalar — call frequency, similarity, sender hops |
These are plain Nano tables. You can JOIN, WHERE, INSERT, UPDATE, BRANCH, AS OF against them like any user table.
3. Project Code Symbols Into the Graph
```rust
use heliosdb_nano::{code_graph::CodeIndexOptions, EmbeddedDatabase, Result};

fn bootstrap(db: &EmbeddedDatabase) -> Result<()> {
    db.execute("CREATE TABLE src (path TEXT PRIMARY KEY, lang TEXT, content TEXT)")?;
    db.execute(
        "INSERT INTO src VALUES \
         ('lib.rs', 'rust', 'pub fn answer() -> i32 { 42 } pub fn caller() { answer(); }')",
    )?;
    db.code_index(CodeIndexOptions::for_table("src"))?;

    // Mirror every _hdb_code_symbols row + every resolved
    // _hdb_code_symbol_refs row into the universal schema.
    let stats = db.graph_rag_project_symbols()?;
    println!("projected {} code symbols", stats.code_symbols_projected);
    Ok(())
}
```

`graph_rag_project_symbols` is idempotent. It uses `_hdb_graph_nodes.source_ref = "code_symbol:<id>"` as the dedupe key. Edges from `_hdb_code_symbol_refs` with a non-NULL `to_symbol` become matching `_hdb_graph_edges` rows; the kind name is lifted verbatim (CALLS, REFERENCES, …).
If you’re calling code_index from --features graph-rag, the projection runs automatically as a final step inside code_index itself — you don’t need to call graph_rag_project_symbols explicitly unless you want the stats back.
4. The Flagship Query — graph_rag_search
```rust
use heliosdb_nano::graph_rag::{Direction, GraphRagOptions};

let opts = GraphRagOptions {
    seed_text: "answer".into(),
    seed_kinds: vec!["Function".into()], // optional — empty = all kinds
    hops: 2,
    edge_kinds: vec!["CALLS".into()],    // empty = all kinds
    direction: Direction::Both,
    limit: 50,
};
let hits = db.graph_rag_search(&opts)?;

for h in hits {
    println!(
        "[hop={}] {} ({}) — {:?}",
        h.hop_distance,
        h.title.unwrap_or_default(),
        h.node_kind,
        h.source_ref
    );
}
```

Mechanics:
- Seed selection. Substring match (case-insensitive) of `seed_text` against `title || text`, optionally filtered by `seed_kinds`. The kind filter pushes through `FilteredScan`, so seed selection on a million-node corpus stays in milliseconds.
- BFS expansion. Up to `hops` deep, capped at `limit` total nodes. `direction` is `Out` (use `from_node` = seed), `In` (use `to_node` = seed), or `Both`.
- Deterministic ordering. Hop-distance ascending, then `node_id` ascending — stable across runs.
Edge-kind filtering pushes through the same `FilteredScan` path; bloom filters skip blocks cheaply on `edge_kind IN ('CALLS')`.
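The seed-then-expand loop can be sketched in plain Rust. This is illustrative only — the shipped implementation runs inside the storage engine; the adjacency map here is a hypothetical in-memory mirror of `_hdb_graph_edges`:

```rust
use std::collections::{BTreeMap, HashSet, VecDeque};

/// Expand `seeds` up to `hops` deep, capped at `limit` total nodes,
/// returning (node_id, hop_distance) pairs in the documented order:
/// hop-distance ascending, then node_id ascending.
fn bfs_expand(
    adj: &BTreeMap<u64, Vec<u64>>,
    seeds: &[u64],
    hops: u32,
    limit: usize,
) -> Vec<(u64, u32)> {
    let mut seen: HashSet<u64> = seeds.iter().copied().collect();
    let mut queue: VecDeque<(u64, u32)> = seeds.iter().map(|&s| (s, 0)).collect();
    let mut out = Vec::new();
    while let Some((node, hop)) = queue.pop_front() {
        if out.len() >= limit {
            break; // cap on total nodes, seeds included
        }
        out.push((node, hop));
        if hop == hops {
            continue; // depth limit reached on this frontier
        }
        for &next in adj.get(&node).into_iter().flatten() {
            if seen.insert(next) {
                queue.push_back((next, hop + 1));
            }
        }
    }
    // Deterministic ordering, stable across runs.
    out.sort_by_key(|&(id, hop)| (hop, id));
    out
}
```

With edges 1→2, 1→3, 2→4 and seed 1, a 2-hop expansion visits all four nodes, while a 1-hop expansion stops at {1, 2, 3}.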
5. Same Query from SQL — WITH CONTEXT
A pre-parser SQL clause exposes the same expansion to ad-hoc SQL:
```sql
SELECT n.node_id, n.node_kind, n.title
FROM _hdb_graph_nodes n
WHERE n.title LIKE '%answer%'
WITH CONTEXT (
    HOPS 2,
    EDGES CALLS|REFERENCES,
    DIRECTION both,
    LIMIT 50
);
```

The pre-parser strips the `WITH CONTEXT (...)` clause, runs the inner SELECT to collect seeds, then BFS-expands per the directives. `RERANK BY <expr>` is parsed but unused in the MVP — reranking ships under §7.
6. Cross-Modal Ingestion — Beyond Code
Code symbols project through graph_rag_project_symbols. For other modalities use the matching ingester:
| Adapter | What it ingests | Source table contract |
|---|---|---|
| `graph_rag_ingest_docs` | Markdown / plain text | (id_col, text_col [, title_col]) |
| `graph_rag_ingest_email` | Mail messages + senders | structured email columns |
| `graph_rag_ingest_issues` | Issues + comments | issue tracker JSON |
| `graph_rag_ingest_qa` | Investor questions + answers | Q&A schema |
| `graph_rag_ingest_pdf` / `_office` / `_audio` / `_image` | docling-converted documents | external docling-serve POST |
For the docling family, see DOCLING_INGESTION_TUTORIAL. A trimmed Markdown example:
```rust
use heliosdb_nano::graph_rag::{ChunkStrategy, IngestDocsOptions};

db.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT, title TEXT)")?;
db.execute(
    "INSERT INTO docs VALUES \
     ('intro', '# Welcome\\nHelloDB does X. ## Setup\\nRun cargo build.', 'Getting started')",
)?;

let opts = IngestDocsOptions {
    source_table: "docs".into(),
    id_col: "id".into(),
    text_col: "body".into(),
    title_col: Some("title".into()),
    chunk_by: ChunkStrategy::Headings,
};
let stats = db.graph_rag_ingest_docs(&opts)?;
// → IngestStats { nodes_added: 3, edges_added: 2, ... }
```

`ChunkStrategy::Headings` splits on Markdown heading boundaries; each `##` heading becomes a DocSection and the prose under it becomes a DocChunk with a PART_OF edge into the section. Idempotent via `source_ref = "doc:<id>:section:<i>"`.
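The splitting step can be approximated in a few lines of plain Rust. A minimal sketch, not the shipped `ChunkStrategy::Headings` code — it only recognises `##` ATX headings and keeps any preamble before the first heading as an untitled chunk:

```rust
/// Split a Markdown body into (section_title, chunk_text) pairs at `##`
/// boundaries. Text before the first heading gets an empty title.
fn chunk_by_headings(body: &str) -> Vec<(String, String)> {
    let mut sections: Vec<(String, String)> = Vec::new();
    let mut current = (String::new(), String::new());
    for line in body.lines() {
        if let Some(h) = line.strip_prefix("## ") {
            // Close the previous section if it carried a title or any prose.
            if !current.0.is_empty() || !current.1.trim().is_empty() {
                sections.push(current);
            }
            current = (h.trim().to_string(), String::new());
        } else {
            current.1.push_str(line);
            current.1.push('\n');
        }
    }
    if !current.0.is_empty() || !current.1.trim().is_empty() {
        sections.push(current);
    }
    sections
}
```

Each emitted pair would map to one DocSection node (the title) plus one DocChunk node (the text) with a PART_OF edge between them.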
7. Linking — Cross-Modal MENTIONS Edges
Two strategies.
Exact qualified-name match
```rust
let stats = db.graph_rag_link_exact(&[])?; // empty = default kinds
// → LinkerStats { nodes_scanned: …, mentions_added: …, candidates_seen: … }
```

Scans every text-bearing node (DocChunk, DocSection, Email, Issue, Comment, InvestorQuestion, Answer) and emits a MENTIONS edge from the node to any code symbol whose qualified name appears as a whole word in the haystack. Matching is case-sensitive on `qualified` because lower-casing would conflate `Foo`/`foo` types. Pass extra kinds to extend the scan: `db.graph_rag_link_exact(&["BlogPost", "Wiki"])`.
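The whole-word, case-sensitive test can be illustrated with a hypothetical helper (the shipped linker scans node text in bulk; this only shows the boundary rule, treating alphanumerics and `_` as word characters):

```rust
/// True if `qualified` occurs in `haystack` as a whole word —
/// not embedded inside a longer identifier. Case-sensitive.
fn mentions_whole_word(haystack: &str, qualified: &str) -> bool {
    let is_word = |c: char| c.is_alphanumeric() || c == '_';
    let mut start = 0;
    while let Some(pos) = haystack[start..].find(qualified) {
        let abs = start + pos;
        let end = abs + qualified.len();
        // Check the characters just before and just after the match.
        let before_ok = !haystack[..abs].chars().next_back().map_or(false, is_word);
        let after_ok = !haystack[end..].chars().next().map_or(false, is_word);
        if before_ok && after_ok {
            return true;
        }
        start = end; // keep scanning past this occurrence
    }
    false
}
```

So `"call parse() here"` mentions `parse`, but `"reparse it"` and `"parses"` do not, which is why doc prose naming a function links cleanly without false positives from longer identifiers.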
Vector-similarity match (v3.19.0)
When you have embeddings for both sides:
```rust
use heliosdb_nano::graph_rag::{SymbolEmbedding, TextEmbedding};

let texts: Vec<TextEmbedding> = vec![
    TextEmbedding { node_id: 7,  vector: vec![0.1, 0.2, 0.3, 0.4] },
    TextEmbedding { node_id: 12, vector: vec![0.0, 0.1, 0.0, 0.5] },
];
let symbols: Vec<SymbolEmbedding> = vec![
    SymbolEmbedding { node_id: 42, vector: vec![0.1, 0.2, 0.3, 0.4] },
    SymbolEmbedding { node_id: 43, vector: vec![0.9, 0.0, 0.0, 0.1] },
];

let stats = db.graph_rag_link_vector(
    &texts,
    &symbols,
    /* top_k */ 3,
    /* threshold */ 0.6,
)?;
```

Cosine top-k per text node, gated by threshold ∈ [-1, 1], emits MENTIONS edges with `weight = similarity`. Mismatched-dimension pairs are skipped silently (so a multi-embedder corpus doesn’t crash). Idempotent via `(from_node, to_node)` dedupe.
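The link pass's semantics, as described above, can be sketched in plain Rust. This is an illustrative model, not the shipped code — `(u64, Vec<f32>)` pairs stand in for the embedding structs:

```rust
/// Cosine similarity, or None when dimensions mismatch (skipped silently).
fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { None } else { Some(dot / (na * nb)) }
}

/// For each text node keep up to `top_k` symbol matches at or above
/// `threshold`, returning (from_node, to_node, weight) MENTIONS triples.
fn link_top_k(
    texts: &[(u64, Vec<f32>)],
    symbols: &[(u64, Vec<f32>)],
    top_k: usize,
    threshold: f32,
) -> Vec<(u64, u64, f32)> {
    let mut edges = Vec::new();
    for (tid, tv) in texts {
        let mut cands: Vec<(u64, f32)> = symbols
            .iter()
            .filter_map(|(sid, sv)| cosine(tv, sv).map(|s| (*sid, s)))
            .filter(|(_, s)| *s >= threshold)
            .collect();
        cands.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        for (sid, s) in cands.into_iter().take(top_k) {
            edges.push((*tid, sid, s)); // weight = similarity
        }
    }
    edges
}
```

A text node with a 2-dimensional embedding silently produces no edge against a 1-dimensional symbol embedding, matching the multi-embedder behaviour noted above.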
Embeddings are caller-supplied: Nano carries no inference runtime. Pair with the code-embed feature (Code-Graph Tutorial §10) or any external service.
8. Centrality-Biased Rerank (v3.19.0)
Default ordering is hop-distance ascending. To pull hot-path symbols up, post-rerank with centrality_rerank:
```rust
use heliosdb_nano::graph_rag::{centrality_rerank, Centrality};

let hits = db.graph_rag_search(&opts)?;
// Weighted in-degree, normalised to [0, 1].
let centrality = Centrality::from_edges(&db, &["CALLS"])?;
let reranked = centrality_rerank(hits, &centrality, /* α */ 0.7);

for h in &reranked {
    println!("[hop={}] {}", h.hop_distance, h.title.clone().unwrap_or_default());
}
```

Score formula (from `src/graph_rag/centrality.rs`):

```
score(hit) = (1.0 / (1.0 + hop_distance)) * (1.0 + α * centrality(node_id))
```

Stable tie-break by `node_id`. The centrality weight is clamped to [0.0, 4.0]. This is the FR’s “Option B (post-rerank)” lift; an in-descent HNSW navigation bias is a tracked phase-3.1 follow-up.
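A minimal sketch of that formula over (node_id, hop_distance) pairs — assuming the clamp applies to the centrality value and that ties break by ascending `node_id`; the real `centrality_rerank` operates on full hit structs:

```rust
use std::collections::HashMap;

/// Rerank hits descending by score; missing centrality defaults to 0.0.
fn rerank(
    mut hits: Vec<(u64, u32)>,
    centrality: &HashMap<u64, f32>,
    alpha: f32,
) -> Vec<(u64, u32)> {
    let score = |id: u64, hop: u32| -> f32 {
        let c = centrality.get(&id).copied().unwrap_or(0.0).clamp(0.0, 4.0);
        (1.0 / (1.0 + hop as f32)) * (1.0 + alpha * c)
    };
    // Pre-sort by node_id; the stable score sort then breaks ties deterministically.
    hits.sort_by_key(|&(id, _)| id);
    hits.sort_by(|&(ia, ha), &(ib, hb)| score(ib, hb).partial_cmp(&score(ia, ha)).unwrap());
    hits
}
```

With α = 0.7, a hop-1 node with centrality 1.0 scores 0.5 × 1.7 = 0.85 and overtakes an uncentral hop-1 sibling (0.5) but not the hop-0 seed (1.0) — hot-path symbols rise without displacing the seed.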
9. Worked End-to-End Example
```rust
use heliosdb_nano::{
    code_graph::CodeIndexOptions,
    graph_rag::{
        centrality_rerank, Centrality, ChunkStrategy, Direction, GraphRagOptions,
        IngestDocsOptions,
    },
    EmbeddedDatabase, Result,
};

fn main() -> Result<()> {
    let db = EmbeddedDatabase::new_in_memory()?;

    // 1. Code corpus
    db.execute("CREATE TABLE src (path TEXT PRIMARY KEY, lang TEXT, content TEXT)")?;
    db.execute(
        "INSERT INTO src VALUES \
         ('parser.rs', 'rust', 'pub fn parse(s: &str) -> i32 { 0 }'), \
         ('lexer.rs', 'rust', 'use crate::parse; pub fn tokenize(s: &str) { parse(s); }')",
    )?;
    db.code_index(CodeIndexOptions::for_table("src"))?;

    // 2. Doc corpus
    db.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT, title TEXT)")?;
    db.execute(
        "INSERT INTO docs VALUES \
         ('readme', 'parse() turns a string into an AST', 'README')",
    )?;
    db.graph_rag_ingest_docs(&IngestDocsOptions {
        source_table: "docs".into(),
        id_col: "id".into(),
        text_col: "body".into(),
        title_col: Some("title".into()),
        chunk_by: ChunkStrategy::Row,
    })?;

    // 3. Cross-modal linking — DocChunk → Function via MENTIONS
    db.graph_rag_link_exact(&[])?;

    // 4. Search
    let opts = GraphRagOptions {
        seed_text: "parse".into(),
        hops: 2,
        direction: Direction::Both,
        limit: 25,
        ..Default::default()
    };
    let hits = db.graph_rag_search(&opts)?;
    let centrality = Centrality::from_edges(&db, &["CALLS"])?;
    let reranked = centrality_rerank(hits, &centrality, 0.7);

    for h in reranked {
        println!(
            "{:>6} hop={} {} — {}",
            h.node_id,
            h.hop_distance,
            h.node_kind,
            h.title.unwrap_or_default()
        );
    }
    Ok(())
}
```

You’ll see a code symbol (Function, `parse`), its caller (`tokenize`, hop 1), and the DocChunk linked via MENTIONS to `parse` (hop 1 the other direction). All three came out of one BFS over `_hdb_graph_edges`.
10. SQL Verification
Inspect the graph at any point:
```sql
SELECT node_kind, COUNT(*) FROM _hdb_graph_nodes GROUP BY node_kind;
SELECT edge_kind, COUNT(*) FROM _hdb_graph_edges GROUP BY edge_kind;

-- All MENTIONS edges with weights
SELECT n1.title     AS from_title,
       n1.node_kind AS from_kind,
       n2.title     AS to_title,
       n2.node_kind AS to_kind,
       e.weight
FROM _hdb_graph_edges e
JOIN _hdb_graph_nodes n1 ON n1.node_id = e.from_node
JOIN _hdb_graph_nodes n2 ON n2.node_id = e.to_node
WHERE e.edge_kind = 'MENTIONS'
ORDER BY e.weight DESC;
```

11. Where Next
- DOCLING_INGESTION_TUTORIAL — pull PDFs, DOCX, audio, images into the graph via docling-serve.
- MCP_ENDPOINT_TUTORIAL — drive `helios_graphrag_search` from Claude Code / Cursor / Continue with streaming progress notifications.
- CODE_GRAPH_TUTORIAL — populate `_hdb_code_symbols` (the projection’s source).
- SEMANTIC_HASH_INDEX_QUICKREF — Merkle-tree subtree invalidation so re-indexing is incremental.
- VECTOR_SEARCH_TUTORIAL — the HNSW substrate that backs vector linking.