Graph-RAG Tutorial
Available since: v3.17.0 (2026-04-24, phase 3 MVP) — entity linker, docling ingestion, centrality rerank in v3.19.0
Build: cargo build --release --features graph-rag (implies code-graph)
APIs: EmbeddedDatabase::graph_rag_project_symbols, graph_rag_search, graph_rag_link_exact, graph_rag_link_vector, graph_rag_ingest_*
Tables: _hdb_graph_nodes, _hdb_graph_edges — both plain user tables, joinable, branch-aware, queryable
UVP
A “RAG” stack typically means a vector DB, a graph DB, a relational store, and four glue services keeping them in sync. HeliosDB Nano collapses all four into one embedded process. Every entity — code symbol, doc chunk, email, support ticket, investor question — lives in _hdb_graph_nodes with typed edges in _hdb_graph_edges. The flagship query, seed → BFS expand → centrality rerank, runs as a single EmbeddedDatabase::graph_rag_search() call (or one MCP tools/call) and pushes every WHERE through the bloom-filter / zone-map / SIMD storage path. One graph, one schema, one binary, real SQL semantics.
Prerequisites
- HeliosDB Nano v3.19+ source tree
- Rust 1.85+
- The Code-Graph Tutorial — graph-rag’s primary node population is the projection from `_hdb_code_symbols`
- About 25 minutes
1. Build with --features graph-rag
```bash
cargo build --release --features graph-rag
```

`graph-rag` implies `code-graph` (the symbol projection is meaningless without a code graph). All the `code_index` / `lsp_*` APIs from the Code-Graph Tutorial are available alongside.
2. The Universal Schema
The two tables are bootstrapped on first call to any graph_rag_* API:
```sql
-- Created automatically — shown here for reference.
CREATE TABLE IF NOT EXISTS _hdb_graph_nodes (
    node_id    BIGSERIAL PRIMARY KEY,
    node_kind  TEXT NOT NULL,
    source_ref TEXT,
    title      TEXT,
    text       TEXT,
    extra      TEXT
);

CREATE TABLE IF NOT EXISTS _hdb_graph_edges (
    edge_id   BIGSERIAL PRIMARY KEY,
    from_node BIGINT NOT NULL REFERENCES _hdb_graph_nodes(node_id),
    to_node   BIGINT NOT NULL REFERENCES _hdb_graph_nodes(node_id),
    edge_kind TEXT NOT NULL,
    weight    REAL,
    extra     TEXT
);
```

Convention:
| Field | What it carries |
|---|---|
| `node_kind` | Function, Class, Struct, DocChunk, DocSection, Email, Issue, InvestorQuestion, Answer, Person, Pdf, Office, Audio, Image |
| `source_ref` | Stable cross-source identifier — code_symbol:42, doc:readme, email:<message-id>, docling:document:<name> |
| `edge_kind` | CALLS, IMPORTS, REFERENCES, MENTIONS, CITES, REPLIES_TO, ASKS_ABOUT, AUTHORED_BY, CONTAINS, PART_OF |
| `weight` | Edge-specific scalar — call frequency, similarity, sender hops |
These are plain Nano tables. You can JOIN, WHERE, INSERT, UPDATE, BRANCH, AS OF against them like any user table.
3. Project Code Symbols Into the Graph
```rust
use heliosdb_nano::{code_graph::CodeIndexOptions, EmbeddedDatabase, Result};

fn bootstrap(db: &EmbeddedDatabase) -> Result<()> {
    db.execute("CREATE TABLE src (path TEXT PRIMARY KEY, lang TEXT, content TEXT)")?;
    db.execute(
        "INSERT INTO src VALUES \
         ('lib.rs', 'rust', 'pub fn answer() -> i32 { 42 } pub fn caller() { answer(); }')",
    )?;
    db.code_index(CodeIndexOptions::for_table("src"))?;

    // Mirror every _hdb_code_symbols row + every resolved
    // _hdb_code_symbol_refs row into the universal schema.
    let stats = db.graph_rag_project_symbols()?;
    println!("projected {} code symbols", stats.code_symbols_projected);
    Ok(())
}
```

`graph_rag_project_symbols` is idempotent. It uses `_hdb_graph_nodes.source_ref = "code_symbol:<id>"` as the dedupe key. Edges from `_hdb_code_symbol_refs` with a non-NULL `to_symbol` become matching `_hdb_graph_edges` rows; the kind name is lifted verbatim (CALLS, REFERENCES, …).
If you’re calling code_index from --features graph-rag, the projection runs automatically as a final step inside code_index itself — you don’t need to call graph_rag_project_symbols explicitly unless you want the stats back.
4. The Flagship Query — graph_rag_search
```rust
use heliosdb_nano::graph_rag::{Direction, GraphRagOptions};

let opts = GraphRagOptions {
    seed_text: "answer".into(),
    seed_kinds: vec!["Function".into()], // optional — empty = all kinds
    hops: 2,
    edge_kinds: vec!["CALLS".into()],    // empty = all kinds
    direction: Direction::Both,
    limit: 50,
};
let hits = db.graph_rag_search(&opts)?;

for h in hits {
    println!(
        "[hop={}] {} ({}) — {:?}",
        h.hop_distance,
        h.title.unwrap_or_default(),
        h.node_kind,
        h.source_ref
    );
}
```

Mechanics:
- Seed selection. Substring match (case-insensitive) of `seed_text` against `title || text`, optionally filtered by `seed_kinds`. The kind filter pushes through `FilteredScan`, so seed selection on a million-node corpus stays in milliseconds.
- BFS expansion. Up to `hops` deep, capped at `limit` total nodes. `direction` is `Out` (use `from_node` = seed), `In` (use `to_node` = seed), or `Both`.
- Deterministic ordering. Hop-distance ascending, then `node_id` ascending — stable across runs.
Edge-kind filtering pushes through the same `FilteredScan` path; bloom filters skip blocks cheaply on `edge_kind IN ('CALLS')`.
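The seed-then-expand loop can be sketched in plain Rust. This is illustrative only — the shipped implementation runs inside the storage engine; the adjacency map here is a hypothetical in-memory mirror of `_hdb_graph_edges`:

```rust
use std::collections::{BTreeMap, HashSet, VecDeque};

/// Expand `seeds` up to `hops` deep, capped at `limit` total nodes,
/// returning (node_id, hop_distance) pairs in the documented order:
/// hop-distance ascending, then node_id ascending.
fn bfs_expand(
    adj: &BTreeMap<u64, Vec<u64>>,
    seeds: &[u64],
    hops: u32,
    limit: usize,
) -> Vec<(u64, u32)> {
    let mut seen: HashSet<u64> = seeds.iter().copied().collect();
    let mut queue: VecDeque<(u64, u32)> = seeds.iter().map(|&s| (s, 0)).collect();
    let mut out = Vec::new();
    while let Some((node, hop)) = queue.pop_front() {
        if out.len() >= limit {
            break; // cap on total nodes, seeds included
        }
        out.push((node, hop));
        if hop == hops {
            continue; // depth limit reached on this frontier
        }
        for &next in adj.get(&node).into_iter().flatten() {
            if seen.insert(next) {
                queue.push_back((next, hop + 1));
            }
        }
    }
    // Deterministic ordering, stable across runs.
    out.sort_by_key(|&(id, hop)| (hop, id));
    out
}
```

With edges 1→2, 1→3, 2→4 and seed 1, a 2-hop expansion visits all four nodes, while a 1-hop expansion stops at {1, 2, 3}.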
5. Same Query from SQL — WITH CONTEXT
A pre-parser SQL clause exposes the same expansion to ad-hoc SQL:
```sql
SELECT n.node_id, n.node_kind, n.title
FROM _hdb_graph_nodes n
WHERE n.title LIKE '%answer%'
WITH CONTEXT (
    HOPS 2,
    EDGES CALLS|REFERENCES,
    DIRECTION both,
    LIMIT 50
);
```

The pre-parser strips the `WITH CONTEXT (...)` clause, runs the inner SELECT to collect seeds, then BFS-expands per the directives. `RERANK BY <expr>` is parsed but unused in the MVP — reranking ships under §7.
6. Cross-Modal Ingestion — Beyond Code
Code symbols project through graph_rag_project_symbols. For other modalities use the matching ingester:
| Adapter | What it ingests | Source table contract |
|---|---|---|
| `graph_rag_ingest_docs` | Markdown / plain text | (id_col, text_col [, title_col]) |
| `graph_rag_ingest_email` | Mail messages + senders | structured email columns |
| `graph_rag_ingest_issues` | Issues + comments | issue tracker JSON |
| `graph_rag_ingest_qa` | Investor questions + answers | Q&A schema |
| `graph_rag_ingest_pdf` / `_office` / `_audio` / `_image` | docling-converted documents | external docling-serve POST |
For the docling family, see DOCLING_INGESTION_TUTORIAL. A trimmed Markdown example:
```rust
use heliosdb_nano::graph_rag::{ChunkStrategy, IngestDocsOptions};

db.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT, title TEXT)")?;
db.execute(
    "INSERT INTO docs VALUES \
     ('intro', '# Welcome\\nHelloDB does X. ## Setup\\nRun cargo build.', 'Getting started')",
)?;

let opts = IngestDocsOptions {
    source_table: "docs".into(),
    id_col: "id".into(),
    text_col: "body".into(),
    title_col: Some("title".into()),
    chunk_by: ChunkStrategy::Headings,
};
let stats = db.graph_rag_ingest_docs(&opts)?;
// → IngestStats { nodes_added: 3, edges_added: 2, ... }
```

`ChunkStrategy::Headings` splits on Markdown heading boundaries; each `##` heading becomes a DocSection and the prose under it becomes a DocChunk with a PART_OF edge into the section. Idempotent via `source_ref = "doc:<id>:section:<i>"`.
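The splitting step can be approximated in a few lines of plain Rust. A minimal sketch, not the shipped `ChunkStrategy::Headings` code — it only recognises `##` ATX headings and keeps any preamble before the first heading as an untitled chunk:

```rust
/// Split a Markdown body into (section_title, chunk_text) pairs at `##`
/// boundaries. Text before the first heading gets an empty title.
fn chunk_by_headings(body: &str) -> Vec<(String, String)> {
    let mut sections: Vec<(String, String)> = Vec::new();
    let mut current = (String::new(), String::new());
    for line in body.lines() {
        if let Some(h) = line.strip_prefix("## ") {
            // Close the previous section if it carried a title or any prose.
            if !current.0.is_empty() || !current.1.trim().is_empty() {
                sections.push(current);
            }
            current = (h.trim().to_string(), String::new());
        } else {
            current.1.push_str(line);
            current.1.push('\n');
        }
    }
    if !current.0.is_empty() || !current.1.trim().is_empty() {
        sections.push(current);
    }
    sections
}
```

Each emitted pair would map to one DocSection node (the title) plus one DocChunk node (the text) with a PART_OF edge between them.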
7. Linking — Cross-Modal MENTIONS Edges
Two strategies.
Exact qualified-name match
```rust
let stats = db.graph_rag_link_exact(&[])?; // empty = default kinds
// → LinkerStats { nodes_scanned: …, mentions_added: …, candidates_seen: … }
```

Scans every text-bearing node (DocChunk, DocSection, Email, Issue, Comment, InvestorQuestion, Answer) and emits a MENTIONS edge from the node to any code symbol whose qualified name appears as a whole word in the haystack. Matching is case-sensitive on `qualified` because lower-casing would conflate `Foo`/`foo` types. Pass extra kinds to extend the scan: `db.graph_rag_link_exact(&["BlogPost", "Wiki"])`.
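The whole-word, case-sensitive test can be illustrated with a hypothetical helper (the shipped linker scans node text in bulk; this only shows the boundary rule, treating alphanumerics and `_` as word characters):

```rust
/// True if `qualified` occurs in `haystack` as a whole word —
/// not embedded inside a longer identifier. Case-sensitive.
fn mentions_whole_word(haystack: &str, qualified: &str) -> bool {
    let is_word = |c: char| c.is_alphanumeric() || c == '_';
    let mut start = 0;
    while let Some(pos) = haystack[start..].find(qualified) {
        let abs = start + pos;
        let end = abs + qualified.len();
        // Check the characters just before and just after the match.
        let before_ok = !haystack[..abs].chars().next_back().map_or(false, is_word);
        let after_ok = !haystack[end..].chars().next().map_or(false, is_word);
        if before_ok && after_ok {
            return true;
        }
        start = end; // keep scanning past this occurrence
    }
    false
}
```

So `"call parse() here"` mentions `parse`, but `"reparse it"` and `"parses"` do not, which is why doc prose naming a function links cleanly without false positives from longer identifiers.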
Vector-similarity match (v3.19.0)
When you have embeddings for both sides:
```rust
use heliosdb_nano::graph_rag::{SymbolEmbedding, TextEmbedding};

let texts: Vec<TextEmbedding> = vec![
    TextEmbedding { node_id: 7,  vector: vec![0.1, 0.2, 0.3, 0.4] },
    TextEmbedding { node_id: 12, vector: vec![0.0, 0.1, 0.0, 0.5] },
];
let symbols: Vec<SymbolEmbedding> = vec![
    SymbolEmbedding { node_id: 42, vector: vec![0.1, 0.2, 0.3, 0.4] },
    SymbolEmbedding { node_id: 43, vector: vec![0.9, 0.0, 0.0, 0.1] },
];

let stats = db.graph_rag_link_vector(
    &texts,
    &symbols,
    /* top_k */ 3,
    /* threshold */ 0.6,
)?;
```

Cosine top-k per text node, gated by threshold ∈ [-1, 1], emits MENTIONS edges with `weight = similarity`. Mismatched-dimension pairs are skipped silently (so a multi-embedder corpus doesn’t crash). Idempotent via `(from_node, to_node)` dedupe.
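The link pass's semantics, as described above, can be sketched in plain Rust. This is an illustrative model, not the shipped code — `(u64, Vec<f32>)` pairs stand in for the embedding structs:

```rust
/// Cosine similarity, or None when dimensions mismatch (skipped silently).
fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { None } else { Some(dot / (na * nb)) }
}

/// For each text node keep up to `top_k` symbol matches at or above
/// `threshold`, returning (from_node, to_node, weight) MENTIONS triples.
fn link_top_k(
    texts: &[(u64, Vec<f32>)],
    symbols: &[(u64, Vec<f32>)],
    top_k: usize,
    threshold: f32,
) -> Vec<(u64, u64, f32)> {
    let mut edges = Vec::new();
    for (tid, tv) in texts {
        let mut cands: Vec<(u64, f32)> = symbols
            .iter()
            .filter_map(|(sid, sv)| cosine(tv, sv).map(|s| (*sid, s)))
            .filter(|(_, s)| *s >= threshold)
            .collect();
        cands.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        for (sid, s) in cands.into_iter().take(top_k) {
            edges.push((*tid, sid, s)); // weight = similarity
        }
    }
    edges
}
```

A text node with a 2-dimensional embedding silently produces no edge against a 1-dimensional symbol embedding, matching the multi-embedder behaviour noted above.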
Embeddings are caller-supplied: Nano carries no inference runtime. Pair with the code-embed feature (Code-Graph Tutorial §10) or any external service.
8. Centrality-Biased Rerank (v3.19.0)
Default ordering is hop-distance ascending. To pull hot-path symbols up, post-rerank with centrality_rerank:
```rust
use heliosdb_nano::graph_rag::{centrality_rerank, Centrality};

let hits = db.graph_rag_search(&opts)?;
// Weighted in-degree, normalised to [0, 1].
let centrality = Centrality::from_edges(&db, &["CALLS"])?;
let reranked = centrality_rerank(hits, &centrality, /* α */ 0.7);

for h in &reranked {
    println!("[hop={}] {}", h.hop_distance, h.title.clone().unwrap_or_default());
}
```

Score formula (from `src/graph_rag/centrality.rs`):

```
score(hit) = (1.0 / (1.0 + hop_distance)) * (1.0 + α * centrality(node_id))
```

Stable tie-break by `node_id`. The centrality weight is clamped to [0.0, 4.0]. This is the FR’s “Option B (post-rerank)” lift; an in-descent HNSW navigation bias is a tracked phase-3.1 follow-up.
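A minimal sketch of that formula over (node_id, hop_distance) pairs — assuming the clamp applies to the centrality value and that ties break by ascending `node_id`; the real `centrality_rerank` operates on full hit structs:

```rust
use std::collections::HashMap;

/// Rerank hits descending by score; missing centrality defaults to 0.0.
fn rerank(
    mut hits: Vec<(u64, u32)>,
    centrality: &HashMap<u64, f32>,
    alpha: f32,
) -> Vec<(u64, u32)> {
    let score = |id: u64, hop: u32| -> f32 {
        let c = centrality.get(&id).copied().unwrap_or(0.0).clamp(0.0, 4.0);
        (1.0 / (1.0 + hop as f32)) * (1.0 + alpha * c)
    };
    // Pre-sort by node_id; the stable score sort then breaks ties deterministically.
    hits.sort_by_key(|&(id, _)| id);
    hits.sort_by(|&(ia, ha), &(ib, hb)| score(ib, hb).partial_cmp(&score(ia, ha)).unwrap());
    hits
}
```

With α = 0.7, a hop-1 node with centrality 1.0 scores 0.5 × 1.7 = 0.85 and overtakes an uncentral hop-1 sibling (0.5) but not the hop-0 seed (1.0) — hot-path symbols rise without displacing the seed.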
9. Worked End-to-End Example
```rust
use heliosdb_nano::{
    code_graph::CodeIndexOptions,
    graph_rag::{
        centrality_rerank, Centrality, ChunkStrategy, Direction, GraphRagOptions,
        IngestDocsOptions,
    },
    EmbeddedDatabase, Result,
};

fn main() -> Result<()> {
    let db = EmbeddedDatabase::new_in_memory()?;

    // 1. Code corpus
    db.execute("CREATE TABLE src (path TEXT PRIMARY KEY, lang TEXT, content TEXT)")?;
    db.execute(
        "INSERT INTO src VALUES \
         ('parser.rs', 'rust', 'pub fn parse(s: &str) -> i32 { 0 }'), \
         ('lexer.rs', 'rust', 'use crate::parse; pub fn tokenize(s: &str) { parse(s); }')",
    )?;
    db.code_index(CodeIndexOptions::for_table("src"))?;

    // 2. Doc corpus
    db.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT, title TEXT)")?;
    db.execute(
        "INSERT INTO docs VALUES \
         ('readme', 'parse() turns a string into an AST', 'README')",
    )?;
    db.graph_rag_ingest_docs(&IngestDocsOptions {
        source_table: "docs".into(),
        id_col: "id".into(),
        text_col: "body".into(),
        title_col: Some("title".into()),
        chunk_by: ChunkStrategy::Row,
    })?;

    // 3. Cross-modal linking — DocChunk → Function via MENTIONS
    db.graph_rag_link_exact(&[])?;

    // 4. Search
    let opts = GraphRagOptions {
        seed_text: "parse".into(),
        hops: 2,
        direction: Direction::Both,
        limit: 25,
        ..Default::default()
    };
    let hits = db.graph_rag_search(&opts)?;
    let centrality = Centrality::from_edges(&db, &["CALLS"])?;
    let reranked = centrality_rerank(hits, &centrality, 0.7);

    for h in reranked {
        println!(
            "{:>6} hop={} {} — {}",
            h.node_id,
            h.hop_distance,
            h.node_kind,
            h.title.unwrap_or_default()
        );
    }
    Ok(())
}
```

You’ll see a code symbol (Function, `parse`), its caller (`tokenize`, hop 1), and the DocChunk linked via MENTIONS to `parse` (hop 1 the other direction). All three came out of one BFS over `_hdb_graph_edges`.
10. SQL Verification
Inspect the graph at any point:
```sql
SELECT node_kind, COUNT(*) FROM _hdb_graph_nodes GROUP BY node_kind;
SELECT edge_kind, COUNT(*) FROM _hdb_graph_edges GROUP BY edge_kind;

-- All MENTIONS edges with weights
SELECT n1.title     AS from_title,
       n1.node_kind AS from_kind,
       n2.title     AS to_title,
       n2.node_kind AS to_kind,
       e.weight
FROM _hdb_graph_edges e
JOIN _hdb_graph_nodes n1 ON n1.node_id = e.from_node
JOIN _hdb_graph_nodes n2 ON n2.node_id = e.to_node
WHERE e.edge_kind = 'MENTIONS'
ORDER BY e.weight DESC;
```

11. Where Next
- DOCLING_INGESTION_TUTORIAL — pull PDFs, DOCX, audio, images into the graph via docling-serve.
- MCP_ENDPOINT_TUTORIAL — drive `helios_graphrag_search` from Claude Code / Cursor / Continue with streaming progress notifications.
- CODE_GRAPH_TUTORIAL — populate `_hdb_code_symbols` (the projection’s source).
- SEMANTIC_HASH_INDEX_QUICKREF — Merkle-tree subtree invalidation so re-indexing is incremental.
- VECTOR_SEARCH_TUTORIAL — the HNSW substrate that backs vector linking.