
Code-Graph Tutorial

Available since: v3.15.0 (2026-04-24, phase 1) — TypeScript/JS/Go/SQL/Markdown grammars in v3.16.0
Build: cargo build --release --features code-graph
APIs: EmbeddedDatabase::code_index, lsp_definition, lsp_references, lsp_call_hierarchy, lsp_hover (plus the matching FROM lsp_*(...) table-function shapes in SQL)
Tables created on first index: _hdb_code_files, _hdb_code_symbols, _hdb_code_symbol_refs


UVP

Most “AI code search” stacks bolt a vector DB onto a separate AST service and a third LSP daemon, then keep three consistency stories in their head. HeliosDB Nano collapses the three into one embedded process. Tree-sitter parses your corpus into the same engine that already serves PostgreSQL queries; LSP-shaped lookups (definition, references, call hierarchy, hover) push their WHEREs through the bloom-filter / zone-map / SIMD storage path you already paid for. One binary, plain SQL tables, branch-aware, time-travel-aware, no microservices.


Prerequisites

  • HeliosDB Nano v3.19+ source tree
  • Rust 1.85+ (rustc --version)
  • About 15 minutes
  • (Optional) Node, Python, or Go to compare extracted symbols against your own reading

The tutorial works against an in-memory database; swap to EmbeddedDatabase::open(path) if you want persistence.


1. Build with --features code-graph

Terminal window
git clone https://github.com/Dimensigon/HDB-HeliosDB-Nano
cd HDB-HeliosDB-Nano
cargo build --release --features code-graph

Default builds skip every tree-sitter dependency. With the flag on, src/code_graph/ compiles, the eight static grammars (Rust, Python, TypeScript, TSX, JavaScript, Go, Markdown, SQL) ship inside the binary, and the methods below appear on EmbeddedDatabase. Without the flag, code_index does not exist and the SQL lsp_* table-functions fall through to the base parser as plain table references, which error.


2. The Source Table Contract

The indexer reads from a user table you control. The contract is exactly three columns:

CREATE TABLE src (
  path    TEXT PRIMARY KEY,
  lang    TEXT,
  content TEXT
);

Other columns (e.g. commit_sha, last_modified) are fine — they’re ignored by the indexer. The PK is the file path; re-inserting the same path replaces the previous row.

Accepted lang values (case-insensitive, alias-aware): rust/rs, python/py, typescript/ts, tsx, javascript/js/mjs/cjs, go, markdown/md, sql. Anything else routes through the runtime grammar registry (see §10) or, if no match, gets counted in files_skipped.
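The alias handling above can be pictured as a plain lookup table. A minimal sketch (the table mirrors the list in the text; the function name and dict are illustrative, not Nano internals):

```python
# Canonical grammar name per accepted alias, per the list above.
ALIASES = {
    "rust": "rust", "rs": "rust",
    "python": "python", "py": "python",
    "typescript": "typescript", "ts": "typescript",
    "tsx": "tsx",
    "javascript": "javascript", "js": "javascript",
    "mjs": "javascript", "cjs": "javascript",
    "go": "go",
    "markdown": "markdown", "md": "markdown",
    "sql": "sql",
}

def resolve_lang(lang: str):
    """Case-insensitive alias lookup; None means 'check the runtime
    grammar registry, otherwise count in files_skipped'."""
    return ALIASES.get(lang.strip().lower())

print(resolve_lang("PY"))     # → python
print(resolve_lang("mjs"))    # → javascript
print(resolve_lang("cobol"))  # → None
```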


3. First Index — Rust

use heliosdb_nano::{
    code_graph::{CodeIndexOptions, DefinitionHint},
    EmbeddedDatabase, Result,
};

fn main() -> Result<()> {
    let db = EmbeddedDatabase::new_in_memory()?;
    db.execute(
        "CREATE TABLE src (path TEXT PRIMARY KEY, lang TEXT, content TEXT)",
    )?;
    db.execute(
        "INSERT INTO src VALUES \
         ('lib.rs', 'rust', 'pub fn answer() -> i32 { 42 } \
          pub fn caller() -> i32 { answer() }')",
    )?;

    // First call bootstraps _hdb_code_files / _hdb_code_symbols /
    // _hdb_code_symbol_refs and parses every row of `src`.
    let stats = db.code_index(CodeIndexOptions::for_table("src"))?;
    println!("{:?}", stats);
    // → CodeIndexStats { files_seen: 1, files_parsed: 1,
    //      files_skipped: 0, files_unchanged: 0,
    //      symbols_written: 2, refs_written: 1, embed_calls: 0,
    //      languages_seen: ["rust"] }

    let defs = db.lsp_definition("answer", &DefinitionHint::default())?;
    for d in &defs {
        println!("{} :: {} @ {}:{}", d.qualified, d.signature, d.path, d.line);
    }
    Ok(())
}

code_index is idempotent. Running it twice without changes gives files_unchanged: 1, symbols_written: 0 — the indexer SHA-256s each row’s content against the stored hash.
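The skip decision can be sketched with stdlib hashing. In this sketch, `stored` stands in for the hash recorded per file (the column name in _hdb_code_files is not documented here), and the function name is illustrative:

```python
import hashlib

stored = {}  # path -> last indexed content hash

def content_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def needs_reparse(path: str, content: str) -> bool:
    """True when the row's content hash differs from the stored one."""
    h = content_hash(content)
    if stored.get(path) == h:
        return False          # counted as files_unchanged
    stored[path] = h          # counted as files_parsed
    return True

print(needs_reparse("lib.rs", "pub fn answer() -> i32 { 42 }"))  # → True
print(needs_reparse("lib.rs", "pub fn answer() -> i32 { 42 }"))  # → False
```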


4. The Four LSP-Shaped APIs

Method                                       Question it answers
lsp_definition(name, hint)                   Where is name defined?
lsp_references(symbol_id)                    Who calls / references this symbol?
lsp_call_hierarchy(symbol_id, dir, depth)    What does the call tree rooted at this symbol look like?
lsp_hover(symbol_id)                         Show me the signature.

Every method goes through the regular SQL planner: each WHERE is an Eq-pushdown the existing FilteredScan path serves cheaply.

use heliosdb_nano::code_graph::lsp::CallDirection;
let defs = db.lsp_definition("answer", &DefinitionHint::default())?;
let target = defs.first().unwrap();
let refs = db.lsp_references(target.symbol_id)?;
let tree = db.lsp_call_hierarchy(target.symbol_id, CallDirection::Incoming, 3)?;
let hover = db.lsp_hover(target.symbol_id)?;

DefinitionHint carries optional disambiguators:

let hint = DefinitionHint {
    hint_file: Some("src/lib.rs".into()), // restrict to a specific file
    hint_kind: Some("function".into()),   // function / struct / class / method / trait …
};

5. Same Queries from SQL

If you’d rather drive everything from psql, an MCP transport, or a Drizzle client, the four functions are also exposed as table functions:

SELECT * FROM lsp_definition('answer');
SELECT * FROM lsp_definition('answer', 'src/lib.rs');
SELECT * FROM lsp_references(42);
SELECT * FROM lsp_call_hierarchy(42, 'incoming', 3);
SELECT * FROM lsp_hover(42);

The pre-parser rewrites these to SELECT … FROM _hdb_code_* subqueries before the planner sees the SQL, so the result columns and filtering behaviour are identical to the Rust API.
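The rewrite step can be sketched as a pattern match over the incoming SQL. This is illustrative only: the real pre-parser and the exact subquery shapes are internal to HeliosDB Nano, and `symbol_id` is a hypothetical column name:

```python
import re

# Hypothetical rewrite target; the real subquery shape is internal.
REWRITES = {
    "lsp_hover": "SELECT signature FROM _hdb_code_symbols WHERE symbol_id = {0}",
}

def rewrite(sql: str) -> str:
    """Rewrite `SELECT * FROM lsp_*(args)` before the planner sees it;
    anything else passes through untouched."""
    m = re.match(r"SELECT \* FROM (\w+)\(([^)]*)\)", sql)
    if m and m.group(1) in REWRITES:
        args = [a.strip() for a in m.group(2).split(",")]
        return REWRITES[m.group(1)].format(*args)
    return sql

print(rewrite("SELECT * FROM lsp_hover(42)"))
# → SELECT signature FROM _hdb_code_symbols WHERE symbol_id = 42
```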

You can also bypass the table functions and write the joins by hand — the _hdb_code_* tables are plain user tables:

SELECT s.qualified, f.path, s.line_start, s.signature
FROM _hdb_code_symbols s
JOIN _hdb_code_files f ON f.node_id = s.file_id
WHERE s.kind = 'function'
AND s.name LIKE 'parse_%'
ORDER BY f.path, s.line_start;

6. CREATE EXTENSION hdb_code (v3.16.0+)

For users who prefer DDL ergonomics, v3.16.0 added a no-op extension marker:

CREATE EXTENSION IF NOT EXISTS hdb_code;

This runs the same bootstrap as code_index, marks the extension as installed in process-local state, and treats IF NOT EXISTS for unknown extensions as a silent no-op (matching PostgreSQL’s behaviour in migrations). It does not enable the feature at runtime — the flag is build-time, so the DDL only has an effect in code-graph-enabled binaries.


7. Indexing a Multi-Language Corpus

db.execute(
    "INSERT INTO src VALUES \
     ('app.ts', 'typescript', 'export function hello() { return 1; }'), \
     ('main.go', 'go', 'package main\nfunc main() { println(hello()) }'), \
     ('mod.py', 'python', 'def hello():\n    return 1')",
)?;
let stats = db.code_index(CodeIndexOptions::for_table("src"))?;
// languages_seen contains ["go", "python", "rust", "typescript"]

The TypeScript extractor handles function_declaration, method_definition, class_declaration, abstract_class_declaration, interface_declaration, type_alias_declaration, enum_declaration. The Go and Python extractors handle their respective canonical declaration shapes. Cross-file symbol resolution runs at the end of every code_index call and rebinds each resolution = 'unresolved' edge against a corpus-wide name index — single match → exact, multiple → first-seen with heuristic resolution.
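The end-of-index pass can be sketched as a corpus-wide name index. The data shapes and resolution labels below are illustrative assumptions, not Nano's actual edge schema:

```python
from collections import defaultdict

# (symbol_id, name) pairs gathered across the whole corpus,
# in first-seen order.
symbols = [(1, "hello"), (2, "hello"), (3, "answer")]

name_index = defaultdict(list)
for sym_id, name in symbols:
    name_index[name].append(sym_id)

def resolve(name: str):
    """Single match → exact; multiple → first-seen, flagged heuristic."""
    candidates = name_index.get(name, [])
    if len(candidates) == 1:
        return candidates[0], "exact"
    if candidates:
        return candidates[0], "heuristic"
    return None, "unresolved"

print(resolve("answer"))  # → (3, 'exact')
print(resolve("hello"))   # → (1, 'heuristic')
```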

Scope-chain resolver via IMPORTS (v3.19.0)

When a CALLS / REFERENCES edge stays unresolved after the cross-file pass, the v3.19 resolver looks for an unambiguous IMPORTS edge in the same file ending in the bare name and upgrades to Exact. Handles Rust use foo::bar, Python from foo import bar, TypeScript import { bar } from './foo', and Go imports.
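The upgrade rule can be sketched as a second pass over still-unresolved edges. The data shapes here are illustrative, not Nano's actual IMPORTS edge schema:

```python
# Per-file IMPORTS edges, keyed by the bare imported name.
imports = {
    "main.go": {"hello": "app/hello"},  # unambiguous: one import of `hello`
}

def upgrade(file: str, name: str, resolution: str) -> str:
    """Upgrade a still-unresolved CALLS/REFERENCES edge to Exact when
    the same file has an unambiguous IMPORTS edge ending in the name."""
    if resolution != "unresolved":
        return resolution
    if name in imports.get(file, {}):
        return "exact"
    return "unresolved"

print(upgrade("main.go", "hello", "unresolved"))  # → exact
print(upgrade("main.go", "bye", "unresolved"))    # → unresolved
```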


8. TypeScript Client (over MCP)

The same APIs surface as MCP tools — see MCP_ENDPOINT. A trimmed Node example:

import { spawn } from "node:child_process";
import { createInterface } from "node:readline";

const child = spawn("heliosdb-nano", ["mcp-server", "--data-dir", "./data"]);
const rl = createInterface({ input: child.stdout });

function send(req: any) { child.stdin.write(JSON.stringify(req) + "\n"); }

send({ jsonrpc: "2.0", id: 1, method: "initialize", params: {} });
send({
  jsonrpc: "2.0", id: 2,
  method: "tools/call",
  params: {
    name: "helios_lsp_definition",
    arguments: { name: "answer" },
  },
});

rl.on("line", line => console.log(line));

9. Python Client (direct over psycopg)

With the SQL surface, Python doesn’t even need a code-graph SDK:

import psycopg

with psycopg.connect("postgresql://postgres:s3cret@127.0.0.1:5432/postgres") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM lsp_definition(%s)", ("answer",))
        for sym_id, path, line, signature, qualified, kind in cur:
            print(f"{qualified} @ {path}:{line} -- {signature}")
        # Walk callers two hops up from the last definition.
        cur.execute("SELECT * FROM lsp_call_hierarchy(%s, 'incoming', 2)", (sym_id,))
        for depth, node_id, qualified, path, line in cur:
            print(f"{' ' * depth}{qualified} @ {path}:{line}")

10. Pluggable Embedders — body_vec

Phase 2 (v3.16) adds a body_vec VECTOR(n) column to _hdb_code_symbols, materialised lazily on first non-null embedding. The indexer is agnostic about who computes the vector — Nano ships no inference runtime.

External HTTP embedder (default)

Any service that accepts POST { "input": "<text>" } → 200 { "embedding": [f32, …] } works:

let opts = CodeIndexOptions {
    source_table: "src".into(),
    embed_bodies: true,
    embed_endpoint: Some("http://localhost:8080/embed".into()),
    embed_bearer: Some(std::env::var("EMBED_TOKEN").unwrap()),
    force_reparse: false,
};
db.code_index(opts)?;

In-process via code-embed (fastembed-rs)

If you’d rather not run a sidecar:

Terminal window
cargo build --release --features code-graph,code-embed

use heliosdb_nano::code_graph::{
    code_index_with_embedder, CodeIndexOptions, FastEmbedder,
};

let embedder = Box::new(FastEmbedder::try_default()?); // BGESmallENV15, 384-dim
code_index_with_embedder(&db, opts, embedder)?;

The model cache (~30 MB on first run) lives under $XDG_CACHE_HOME/.fastembed_cache. The binary itself isn’t changed in size by the feature — fastembed pulls its ORT runtime as a dynamic dep at runtime.

Runtime grammar + extractor registration

For languages outside the static set:

use heliosdb_nano::code_graph::{
    parse::register_grammar,
    register_extractor,
    SymbolExtractor,
};
use std::sync::Arc;

// Side-load a tree_sitter::Language. The consumer chooses the
// loader — dynamically-linked native grammar via libloading, a
// WASM grammar via tree_sitter::WasmStore, etc. See
// src/code_graph/parse.rs for the canonical loader patterns.
let cobol_lang: tree_sitter::Language = load_cobol_grammar()?;
register_grammar("cobol", cobol_lang);

// Pair it with your SymbolExtractor so the indexer emits real
// symbols instead of an empty parse tree.
struct CobolExtractor;
impl SymbolExtractor for CobolExtractor { /* ... */ }
register_extractor("cobol", Arc::new(CobolExtractor));

Without a paired extractor the indexer skips the file outright (an empty parse tree is worse than a clean skip). Inspect code_graph::registered_grammars() and code_graph::registered_extractors() to see what’s live, or query the system view:

SELECT * FROM hdb_code_languages;
-- name | source
-- rust | static
-- python | static
-- ...
-- cobol | runtime

11. Branch-Aware Indexing

Every _hdb_code_* table is just a Nano table, so branches and time travel work for free:

CREATE BRANCH refactor FROM main;
\branch refactor
-- Mutate src on this branch
UPDATE src SET content = '<new lib.rs>' WHERE path = 'lib.rs';
-- Reindex on the branch only — main is untouched.
SELECT * FROM lsp_definition('answer');

You can also pin a single lsp_* call to a branch without switching session-wide (v3.19):

SELECT * FROM lsp_definition('answer') ON BRANCH 'refactor';
SELECT * FROM lsp_references(42) AS OF COMMIT 'abc123' ON BRANCH 'main';

The pre-parser strips ON BRANCH '<name>', scopes a RAII guard around execution, and restores the prior branch on every early-return path.
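The restore-on-every-exit property is what the RAII guard buys. Here is a Python analogue (not Nano code) using a context manager, with `current_branch` standing in for the session branch:

```python
from contextlib import contextmanager

current_branch = "main"

@contextmanager
def on_branch(name: str):
    """Scope the branch to one query and always restore the prior
    branch, even on an early return or a raised error."""
    global current_branch
    prev = current_branch
    current_branch = name
    try:
        yield
    finally:
        current_branch = prev

with on_branch("refactor"):
    print(current_branch)  # → refactor
print(current_branch)      # → main
```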


12. Incremental Re-Index — merkle_refresh

Big corpora amortise re-indexing through the semantic-Merkle roll-up:

CREATE SEMANTIC HASH INDEX IF NOT EXISTS code_merkle;

After this DDL fires (or after the first call to db.code_graph_merkle_refresh()), every file gets a _hdb_code_merkle.rollup_hash covering its symbols. Files whose member symbols haven’t changed get files_unchanged += 1 instead of being re-embedded. See SEMANTIC_HASH_INDEX_QUICKREF.
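The roll-up can be sketched as a fold over per-symbol content hashes. Sorting first is an assumption for illustration (it makes the digest independent of extraction order), not necessarily how Nano orders members:

```python
import hashlib

def rollup_hash(symbol_hashes) -> str:
    """Fold one file's member-symbol hashes into a single digest."""
    h = hashlib.sha256()
    for sh in sorted(symbol_hashes):
        h.update(bytes.fromhex(sh))
    return h.hexdigest()

a = hashlib.sha256(b"pub fn answer() -> i32 { 42 }").hexdigest()
b = hashlib.sha256(b"pub fn caller() -> i32 { answer() }").hexdigest()
print(rollup_hash([a, b]) == rollup_hash([b, a]))  # → True
```

If no member hash changed, the file's roll-up hash is unchanged and the whole file is skipped without re-embedding.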


13. Where Next