Graph-RAG Tutorial

Available since: v3.17.0 (2026-04-24, phase 3 MVP); entity linker, docling ingestion, and centrality rerank in v3.19.0
Build: cargo build --release --features graph-rag (implies code-graph)
APIs: EmbeddedDatabase::graph_rag_project_symbols, graph_rag_search, graph_rag_link_exact, graph_rag_link_vector, graph_rag_ingest_*
Tables: _hdb_graph_nodes, _hdb_graph_edges (both plain user tables: joinable, branch-aware, queryable)


UVP

A “RAG” stack typically means a vector DB, a graph DB, a relational store, and four glue services keeping them in sync. HeliosDB Nano collapses all four into one embedded process. Every entity — code symbol, doc chunk, email, support ticket, investor question — lives in _hdb_graph_nodes with typed edges in _hdb_graph_edges. The flagship query, seed → BFS expand → centrality rerank, runs as a single EmbeddedDatabase::graph_rag_search() call (or one MCP tools/call) and pushes every WHERE through the bloom-filter / zone-map / SIMD storage path. One graph, one schema, one binary, real SQL semantics.


Prerequisites

  • HeliosDB Nano v3.19+ source tree
  • Rust 1.85+
  • The Code-Graph Tutorial — graph-rag’s primary node population is the projection from _hdb_code_symbols
  • About 25 minutes

1. Build with --features graph-rag

cargo build --release --features graph-rag

graph-rag implies code-graph (the symbol projection is meaningless without a code graph). All the code_index / lsp_* APIs from the Code-Graph Tutorial are available alongside.


2. The Universal Schema

The two tables are bootstrapped on first call to any graph_rag_* API:

-- Created automatically — shown here for reference.
CREATE TABLE IF NOT EXISTS _hdb_graph_nodes (
    node_id   BIGSERIAL PRIMARY KEY,
    node_kind TEXT NOT NULL,
    source_ref TEXT,
    title     TEXT,
    text      TEXT,
    extra     TEXT
);

CREATE TABLE IF NOT EXISTS _hdb_graph_edges (
    edge_id   BIGSERIAL PRIMARY KEY,
    from_node BIGINT NOT NULL REFERENCES _hdb_graph_nodes(node_id),
    to_node   BIGINT NOT NULL REFERENCES _hdb_graph_nodes(node_id),
    edge_kind TEXT NOT NULL,
    weight    REAL,
    extra     TEXT
);

Convention:

Field | What it carries
node_kind | Function, Class, Struct, DocChunk, DocSection, Email, Issue, InvestorQuestion, Answer, Person, Pdf, Office, Audio, Image
source_ref | Stable cross-source identifier: code_symbol:42, doc:readme, email:<message-id>, docling:document:<name>
edge_kind | CALLS, IMPORTS, REFERENCES, MENTIONS, CITES, REPLIES_TO, ASKS_ABOUT, AUTHORED_BY, CONTAINS, PART_OF
weight | Edge-specific scalar: call frequency, similarity, sender hops

These are plain Nano tables. You can JOIN, WHERE, INSERT, UPDATE, BRANCH, AS OF against them like any user table.


3. Project Code Symbols Into the Graph

use heliosdb_nano::{
    code_graph::CodeIndexOptions,
    EmbeddedDatabase, Result,
};

fn bootstrap(db: &EmbeddedDatabase) -> Result<()> {
    db.execute("CREATE TABLE src (path TEXT PRIMARY KEY, lang TEXT, content TEXT)")?;
    db.execute("INSERT INTO src VALUES \
        ('lib.rs', 'rust', 'pub fn answer() -> i32 { 42 } pub fn caller() { answer(); }')")?;
    db.code_index(CodeIndexOptions::for_table("src"))?;

    // Mirror every _hdb_code_symbols row + every resolved
    // _hdb_code_symbol_refs row into the universal schema.
    let stats = db.graph_rag_project_symbols()?;
    println!("projected {} code symbols", stats.code_symbols_projected);
    Ok(())
}

graph_rag_project_symbols is idempotent. It uses _hdb_graph_nodes.source_ref = "code_symbol:<id>" as the dedupe key. Edges from _hdb_code_symbol_refs with a non-NULL to_symbol become matching _hdb_graph_edges rows; the kind name is lifted verbatim (CALLS, REFERENCES, …).
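The dedupe-key idea can be illustrated outside the database. The following is a minimal in-memory sketch, not the crate's implementation: NodeTable, upsert, and project_symbols are hypothetical names, and a HashMap stands in for the lookup on _hdb_graph_nodes.source_ref.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for _hdb_graph_nodes, keyed by source_ref.
#[derive(Default)]
pub struct NodeTable {
    next_id: i64,
    by_source_ref: HashMap<String, i64>,
}

impl NodeTable {
    /// Insert a node unless its source_ref already exists.
    /// Returns (node_id, inserted).
    pub fn upsert(&mut self, source_ref: &str) -> (i64, bool) {
        if let Some(&id) = self.by_source_ref.get(source_ref) {
            return (id, false); // already projected: reuse the existing node
        }
        self.next_id += 1;
        self.by_source_ref.insert(source_ref.to_string(), self.next_id);
        (self.next_id, true)
    }

    /// Number of distinct nodes currently stored.
    pub fn len(&self) -> usize {
        self.by_source_ref.len()
    }
}

/// Project symbol ids into the table using "code_symbol:<id>" as the dedupe key.
/// Returns how many nodes were newly added by this call.
pub fn project_symbols(table: &mut NodeTable, symbol_ids: &[i64]) -> usize {
    symbol_ids
        .iter()
        .filter(|id| table.upsert(&format!("code_symbol:{}", id)).1)
        .count()
}
```

Running the projection twice over the same ids adds nothing the second time, which is the property that makes re-running it after an incremental re-index safe.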

If you call code_index from a build with --features graph-rag, the projection runs automatically as the final step inside code_index itself. You need to call graph_rag_project_symbols explicitly only if you want the stats back.


4. Graph Search with graph_rag_search

use heliosdb_nano::graph_rag::{Direction, GraphRagOptions};

let opts = GraphRagOptions {
    seed_text: "answer".into(),
    seed_kinds: vec!["Function".into()], // optional — empty = all kinds
    hops: 2,
    edge_kinds: vec!["CALLS".into()], // empty = all kinds
    direction: Direction::Both,
    limit: 50,
};
let hits = db.graph_rag_search(&opts)?;
for h in hits {
    println!("[hop={}] {} ({}) — {:?}",
        h.hop_distance, h.title.unwrap_or_default(),
        h.node_kind, h.source_ref);
}

Mechanics:

  1. Seed selection. Substring match (case-insensitive) of seed_text against title || text, optionally filtered by seed_kinds. The kind filter pushes through FilteredScan so seed selection on a million-node corpus stays in milliseconds.
  2. BFS expansion. Up to hops deep, capped at limit total nodes. direction is Out (use from_node = seed), In (use to_node = seed), or Both.
  3. Deterministic ordering. Hop-distance ascending, then node_id ascending — stable across runs.

Edge-kind filtering pushes through the same FilteredScan path; bloom filters skip blocks cheaply on edge_kind IN ('CALLS').
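The three mechanics above can be sketched over plain in-memory slices. This is an illustration under assumptions, not the storage-backed implementation: Node, Edge, Dir, select_seeds, and bfs_expand are hypothetical names, and linear scans replace the FilteredScan path.

```rust
use std::collections::{HashSet, VecDeque};

/// In-memory stand-ins for rows of _hdb_graph_nodes and _hdb_graph_edges.
#[derive(Clone, Debug)]
pub struct Node { pub id: i64, pub kind: String, pub title: String, pub text: String }
#[derive(Clone, Debug)]
pub struct Edge { pub from: i64, pub to: i64, pub kind: String }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Dir { Out, In, Both }

/// Step 1: case-insensitive substring match of seed_text against title || text,
/// optionally restricted to seed_kinds (empty = all kinds).
pub fn select_seeds(nodes: &[Node], seed_text: &str, seed_kinds: &[&str]) -> Vec<i64> {
    let needle = seed_text.to_lowercase();
    nodes
        .iter()
        .filter(|n| seed_kinds.is_empty() || seed_kinds.contains(&n.kind.as_str()))
        .filter(|n| format!("{}{}", n.title, n.text).to_lowercase().contains(needle.as_str()))
        .map(|n| n.id)
        .collect()
}

/// Steps 2 and 3: BFS up to `hops` levels, capped at `limit` nodes in total,
/// returned as (node_id, hop_distance) sorted by hop distance, then node_id.
pub fn bfs_expand(
    edges: &[Edge],
    seeds: &[i64],
    edge_kinds: &[&str], // empty = follow every kind
    dir: Dir,
    hops: u32,
    limit: usize,
) -> Vec<(i64, u32)> {
    let mut seen: HashSet<i64> = HashSet::new();
    let mut out: Vec<(i64, u32)> = Vec::new();
    let mut queue: VecDeque<(i64, u32)> = VecDeque::new();

    let mut sorted_seeds = seeds.to_vec();
    sorted_seeds.sort_unstable();
    for s in sorted_seeds {
        if out.len() >= limit { break; }
        if seen.insert(s) {
            out.push((s, 0));
            queue.push_back((s, 0));
        }
    }

    while let Some((node, dist)) = queue.pop_front() {
        if dist >= hops || out.len() >= limit { continue; }
        // Neighbours reachable over one edge in the requested direction.
        let mut next: Vec<i64> = edges
            .iter()
            .filter(|e| edge_kinds.is_empty() || edge_kinds.contains(&e.kind.as_str()))
            .filter_map(|e| {
                if dir != Dir::In && e.from == node { Some(e.to) }
                else if dir != Dir::Out && e.to == node { Some(e.from) }
                else { None }
            })
            .collect();
        next.sort_unstable(); // visit in node_id order so the limit cut is deterministic
        for n in next {
            if out.len() >= limit { break; }
            if seen.insert(n) {
                out.push((n, dist + 1));
                queue.push_back((n, dist + 1));
            }
        }
    }
    out.sort_by_key(|&(id, d)| (d, id));
    out
}
```

Sorting neighbours before visiting them is a choice made for this sketch so that hitting the limit always drops the same nodes; the source only states that the final ordering is stable.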


5. Same Query from SQL — WITH CONTEXT

A pre-parser SQL clause exposes the same expansion to ad-hoc SQL:

SELECT n.node_id, n.node_kind, n.title
FROM _hdb_graph_nodes n
WHERE n.title LIKE '%answer%'
WITH CONTEXT (
    HOPS 2,
    EDGES CALLS|REFERENCES,
    DIRECTION both,
    LIMIT 50
);

The pre-parser strips the WITH CONTEXT (...) clause, runs the inner SELECT to collect seeds, then BFS-expands per the directives. RERANK BY <expr> is parsed but unused in the MVP; reranking is covered in §8.
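To make the strip-then-expand flow concrete, here is a rough sketch of how such a clause could be separated from the inner SELECT. It is not the shipped pre-parser: ContextSpec and split_with_context are hypothetical names, the defaults are invented for illustration, and the grammar is assumed to be exactly the comma-separated KEY value form shown above (no nested parentheses, no string-literal awareness).

```rust
/// Hypothetical parsed form of a WITH CONTEXT (...) clause.
#[derive(Debug, PartialEq)]
pub struct ContextSpec {
    pub hops: u32,
    pub edges: Vec<String>,
    pub direction: String,
    pub limit: usize,
}

/// Split a statement into (inner SELECT, directives).
/// Returns None when there is no WITH CONTEXT clause or a directive is malformed.
pub fn split_with_context(sql: &str) -> Option<(String, ContextSpec)> {
    // ASCII upper-casing keeps byte offsets identical to `sql`.
    let upper = sql.to_ascii_uppercase();
    let start = upper.find("WITH CONTEXT")?;
    let open = start + upper[start..].find('(')?;
    let close = open + upper[open..].find(')')?;

    // Defaults are assumptions made for this sketch only.
    let mut spec = ContextSpec {
        hops: 1,
        edges: Vec::new(),
        direction: "both".to_string(),
        limit: 50,
    };

    for directive in sql[open + 1..close].split(',') {
        let mut parts = directive.split_whitespace();
        let key = match parts.next() {
            Some(k) => k.to_ascii_uppercase(),
            None => continue, // tolerate an empty directive such as a trailing comma
        };
        let value = parts.next()?;
        match key.as_str() {
            "HOPS" => spec.hops = value.parse().ok()?,
            "EDGES" => spec.edges = value.split('|').map(str::to_string).collect(),
            "DIRECTION" => spec.direction = value.to_ascii_lowercase(),
            "LIMIT" => spec.limit = value.parse().ok()?,
            "RERANK" => {} // accepted but ignored, mirroring "parsed but unused"
            _ => return None,
        }
    }

    // Everything before the clause is the seed-collecting SELECT.
    Some((sql[..start].trim_end().to_string(), spec))
}
```

The returned inner statement would then be executed to collect seed node_ids, and the ContextSpec would drive the expansion.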


6. Cross-Modal Ingestion — Beyond Code

Code symbols project through graph_rag_project_symbols. For other modalities use the matching ingester:

Adapter | What it ingests | Source table contract
graph_rag_ingest_docs | Markdown / plain text | (id_col, text_col [, title_col])
graph_rag_ingest_email | Mail messages + senders | structured email columns
graph_rag_ingest_issues | Issues + comments | issue tracker JSON
graph_rag_ingest_qa | Investor questions + answers | Q&A schema
graph_rag_ingest_pdf / _office / _audio / _image | docling-converted documents | external docling-serve POST

For the docling family, see DOCLING_INGESTION_TUTORIAL. A trimmed Markdown example:

use heliosdb_nano::graph_rag::{ChunkStrategy, IngestDocsOptions};

db.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT, title TEXT)")?;
db.execute("INSERT INTO docs VALUES \
    ('intro', '# Welcome\\nHeliosDB does X. ## Setup\\nRun cargo build.', 'Getting started')")?;

let opts = IngestDocsOptions {
    source_table: "docs".into(),
    id_col: "id".into(),
    text_col: "body".into(),
    title_col: Some("title".into()),
    chunk_by: ChunkStrategy::Headings,
};
let stats = db.graph_rag_ingest_docs(&opts)?;
// → IngestStats { nodes_added: 3, edges_added: 2, ... }

ChunkStrategy::Headings splits on Markdown heading boundaries; each ## heading becomes a DocSection and the prose under it becomes a DocChunk with a PART_OF edge into the section. Idempotent via source_ref = "doc:<id>:section:<i>".
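Heading-based chunking can be approximated with a line scan. The sketch below is illustrative only: Section and split_on_headings are hypothetical names, any ATX heading (one to six # characters followed by a space) is treated as a boundary, the section index is assumed to start at 0, and prose before the first heading is kept as an untitled section. The real ingester may differ on each of these points.

```rust
/// Hypothetical result of splitting one document on heading boundaries.
#[derive(Debug, PartialEq)]
pub struct Section {
    pub source_ref: String, // "doc:<id>:section:<i>"
    pub heading: String,    // empty for prose that precedes the first heading
    pub body: String,
}

/// Returns the heading text when `line` is an ATX heading ("# Title" .. "###### Title").
fn heading_text(line: &str) -> Option<&str> {
    let trimmed = line.trim_start();
    let hashes = trimmed.chars().take_while(|&c| c == '#').count();
    if (1..=6).contains(&hashes) && trimmed[hashes..].starts_with(' ') {
        Some(trimmed[hashes..].trim())
    } else {
        None
    }
}

fn new_section(doc_id: &str, index: usize, heading: &str) -> Section {
    Section {
        source_ref: format!("doc:{}:section:{}", doc_id, index),
        heading: heading.to_string(),
        body: String::new(),
    }
}

/// Split `text` into sections; each heading starts a new one.
pub fn split_on_headings(doc_id: &str, text: &str) -> Vec<Section> {
    let mut sections: Vec<Section> = Vec::new();
    for line in text.lines() {
        if let Some(h) = heading_text(line) {
            let i = sections.len();
            sections.push(new_section(doc_id, i, h));
            continue;
        }
        if sections.is_empty() {
            if line.trim().is_empty() {
                continue; // skip blank lines before any content
            }
            sections.push(new_section(doc_id, 0, ""));
        }
        let cur = sections.last_mut().expect("at least one section exists here");
        if !cur.body.is_empty() {
            cur.body.push('\n');
        }
        cur.body.push_str(line);
    }
    for s in &mut sections {
        s.body = s.body.trim().to_string();
    }
    sections
}
```

Because the source_ref is derived from the document id and the section position, re-ingesting an unchanged document produces the same keys, which is what lets a dedupe on source_ref keep the operation idempotent.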


7. Linking — Cross-Modal MENTIONS Edges

Two strategies.

Exact qualified-name match

let stats = db.graph_rag_link_exact(&[])?; // empty = default kinds
// → LinkerStats { nodes_scanned: …, mentions_added: …, candidates_seen: … }

Scans every text-bearing node (DocChunk, DocSection, Email, Issue, Comment, InvestorQuestion, Answer) and emits a MENTIONS edge from the node to any code symbol whose qualified name appears as a whole word in the node's text. Matching on the qualified name is case-sensitive, because lower-casing would conflate distinct symbols such as Foo and foo. Pass extra kinds to extend the scan: db.graph_rag_link_exact(&["BlogPost", "Wiki"]).
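A whole-word, case-sensitive test of the kind described here might look like the following. This is a sketch under assumptions: mentions_whole_word is a hypothetical helper, and a "word boundary" is taken to mean that the neighbouring character is neither alphanumeric nor an underscore; the linker's actual boundary rule is not stated in this tutorial.

```rust
/// Returns true when `needle` occurs in `haystack` as a whole word.
/// Comparison is case-sensitive; a match counts only if the characters
/// immediately before and after it (when present) are not identifier characters.
pub fn mentions_whole_word(haystack: &str, needle: &str) -> bool {
    if needle.is_empty() {
        return false;
    }
    let is_ident = |c: char| c.is_alphanumeric() || c == '_';

    let mut from = 0;
    while let Some(pos) = haystack[from..].find(needle) {
        let start = from + pos;
        let end = start + needle.len();
        let before_ok = haystack[..start]
            .chars()
            .next_back()
            .map_or(true, |c| !is_ident(c));
        let after_ok = haystack[end..]
            .chars()
            .next()
            .map_or(true, |c| !is_ident(c));
        if before_ok && after_ok {
            return true;
        }
        // Advance by one character so overlapping candidates are still checked.
        from = start + haystack[start..].chars().next().map_or(1, |c| c.len_utf8());
    }
    false
}
```

With this rule, "parse() turns a string into an AST" mentions parse, while "the parser module" and "call Parse now" do not.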

Vector-similarity match (v3.19.0)

When you have embeddings for both sides:

use heliosdb_nano::graph_rag::{SymbolEmbedding, TextEmbedding};

let texts: Vec<TextEmbedding> = vec![
    TextEmbedding { node_id: 7, vector: vec![0.1, 0.2, 0.3, 0.4] },
    TextEmbedding { node_id: 12, vector: vec![0.0, 0.1, 0.0, 0.5] },
];
let symbols: Vec<SymbolEmbedding> = vec![
    SymbolEmbedding { node_id: 42, vector: vec![0.1, 0.2, 0.3, 0.4] },
    SymbolEmbedding { node_id: 43, vector: vec![0.9, 0.0, 0.0, 0.1] },
];
let stats = db.graph_rag_link_vector(
    &texts,
    &symbols,
    /* top_k */ 3,
    /* threshold */ 0.6,
)?;

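For each text node, the linker computes cosine similarity against every symbol embedding, keeps the top_k matches that pass threshold (a value in [-1, 1]), and emits MENTIONS edges with weight = similarity. Pairs whose vectors differ in dimension are skipped silently, so a corpus embedded with several models does not crash the run. The call is idempotent via (from_node, to_node) dedupe.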

Embeddings are caller-supplied: Nano carries no inference runtime. Pair with the code-embed feature (Code-Graph Tutorial §10) or any external service.
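The selection rule can be reproduced in a few lines of plain Rust. This is a sketch, not the linker's code: cosine and top_k_links are hypothetical names, (node_id, vector) tuples stand in for TextEmbedding and SymbolEmbedding, the threshold is assumed to be inclusive, and zero-norm vectors are assumed to be skipped like mismatched dimensions.

```rust
/// Cosine similarity; None when the vectors differ in length, are empty,
/// or either has zero norm (all treated as "skip this pair").
pub fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na * nb))
}

/// For each text embedding, keep the `top_k` most similar symbols whose
/// similarity is at least `threshold`.
/// Returns (text_node_id, symbol_node_id, similarity) triples.
pub fn top_k_links(
    texts: &[(i64, Vec<f32>)],
    symbols: &[(i64, Vec<f32>)],
    top_k: usize,
    threshold: f32,
) -> Vec<(i64, i64, f32)> {
    let mut out = Vec::new();
    for (t_id, t_vec) in texts {
        let mut scored: Vec<(i64, f32)> = symbols
            .iter()
            .filter_map(|(s_id, s_vec)| cosine(t_vec, s_vec).map(|sim| (*s_id, sim)))
            .filter(|&(_, sim)| sim >= threshold)
            .collect();
        // Highest similarity first; node_id breaks ties so the output is stable.
        scored.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
        scored.truncate(top_k);
        out.extend(scored.into_iter().map(|(s_id, sim)| (*t_id, s_id, sim)));
    }
    out
}
```

Fed the four vectors from the example above with top_k = 3 and threshold = 0.6, this sketch links both text nodes to symbol 42 only: node 7 has an identical vector, node 12 scores roughly 0.79, and symbol 43 falls below the threshold for both.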


8. Centrality-Biased Rerank (v3.19.0)

Default ordering is hop-distance ascending. To pull hot-path symbols up, post-rerank with centrality_rerank:

use heliosdb_nano::graph_rag::{centrality_rerank, Centrality};

let hits = db.graph_rag_search(&opts)?;
let centrality = Centrality::from_edges(&db, &["CALLS"])?; // weighted in-degree, normalised to [0, 1]
let reranked = centrality_rerank(hits, &centrality, /* α */ 0.7);
for h in &reranked {
    println!("[hop={}] {}", h.hop_distance, h.title.clone().unwrap_or_default());
}

Score formula (from src/graph_rag/centrality.rs):

score(hit) = (1.0 / (1.0 + hop_distance)) * (1.0 + α * centrality(node_id))

Ties are broken stably by node_id. The centrality weight α is clamped to [0.0, 4.0]. This is the FR’s “Option B (post-rerank)” lift; an in-descent HNSW navigation bias is a tracked phase-3.1 follow-up.
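A small standalone sketch shows how the formula trades hop distance against centrality. Hit, score, and rerank are hypothetical stand-ins for the crate's types; the sketch assumes the clamped "centrality weight" is α and that a node missing from the centrality map scores as 0.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a search hit.
#[derive(Clone, Debug, PartialEq)]
pub struct Hit {
    pub node_id: i64,
    pub hop_distance: u32,
}

/// score = (1 / (1 + hop_distance)) * (1 + α * centrality), with α clamped to [0, 4].
pub fn score(hop_distance: u32, centrality: f64, alpha: f64) -> f64 {
    let alpha = alpha.clamp(0.0, 4.0);
    (1.0 / (1.0 + hop_distance as f64)) * (1.0 + alpha * centrality)
}

/// Sort hits by descending score, breaking ties by ascending node_id.
pub fn rerank(mut hits: Vec<Hit>, centrality: &HashMap<i64, f64>, alpha: f64) -> Vec<Hit> {
    let s = |h: &Hit| {
        let c = centrality.get(&h.node_id).copied().unwrap_or(0.0);
        score(h.hop_distance, c, alpha)
    };
    hits.sort_by(|a, b| s(b).total_cmp(&s(a)).then(a.node_id.cmp(&b.node_id)));
    hits
}
```

With α = 0.7, a hop-1 node with centrality 1.0 scores 0.5 × 1.7 = 0.85 and a hop-2 node with centrality 1.0 scores about 0.57, so both outrank a hop-1 node with centrality 0 (0.5) while staying below the hop-0 seed (1.0).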


9. Worked End-to-End Example

use heliosdb_nano::{
    code_graph::CodeIndexOptions,
    graph_rag::{
        centrality_rerank, Centrality, ChunkStrategy, Direction,
        GraphRagOptions, IngestDocsOptions,
    },
    EmbeddedDatabase, Result,
};

fn main() -> Result<()> {
    let db = EmbeddedDatabase::new_in_memory()?;

    // 1. Code corpus
    db.execute("CREATE TABLE src (path TEXT PRIMARY KEY, lang TEXT, content TEXT)")?;
    db.execute("INSERT INTO src VALUES \
        ('parser.rs', 'rust', 'pub fn parse(s: &str) -> i32 { 0 }'), \
        ('lexer.rs', 'rust', 'use crate::parse; pub fn tokenize(s: &str) { parse(s); }')")?;
    db.code_index(CodeIndexOptions::for_table("src"))?;

    // 2. Doc corpus
    db.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT, title TEXT)")?;
    db.execute("INSERT INTO docs VALUES \
        ('readme', 'parse() turns a string into an AST', 'README')")?;
    db.graph_rag_ingest_docs(&IngestDocsOptions {
        source_table: "docs".into(),
        id_col: "id".into(),
        text_col: "body".into(),
        title_col: Some("title".into()),
        chunk_by: ChunkStrategy::Row,
    })?;

    // 3. Cross-modal linking — DocChunk → Function via MENTIONS
    db.graph_rag_link_exact(&[])?;

    // 4. Search
    let opts = GraphRagOptions {
        seed_text: "parse".into(),
        hops: 2,
        direction: Direction::Both,
        limit: 25,
        ..Default::default()
    };
    let hits = db.graph_rag_search(&opts)?;
    let centrality = Centrality::from_edges(&db, &["CALLS"])?;
    let reranked = centrality_rerank(hits, &centrality, 0.7);
    for h in reranked {
        println!("{:>6} hop={} {} — {}",
            h.node_id, h.hop_distance, h.node_kind,
            h.title.unwrap_or_default());
    }
    Ok(())
}

You’ll see a code symbol (Function, parse), its caller (tokenize, hop 1), and the DocChunk linked via MENTIONS to parse (hop 1 the other direction). All three came out of one BFS over _hdb_graph_edges.


10. SQL Verification

Inspect the graph at any point:

SELECT node_kind, COUNT(*) FROM _hdb_graph_nodes GROUP BY node_kind;
SELECT edge_kind, COUNT(*) FROM _hdb_graph_edges GROUP BY edge_kind;

-- All MENTIONS edges with weights
SELECT n1.title AS from_title, n1.node_kind AS from_kind,
       n2.title AS to_title, n2.node_kind AS to_kind,
       e.weight
FROM _hdb_graph_edges e
JOIN _hdb_graph_nodes n1 ON n1.node_id = e.from_node
JOIN _hdb_graph_nodes n2 ON n2.node_id = e.to_node
WHERE e.edge_kind = 'MENTIONS'
ORDER BY e.weight DESC;

11. Where Next