Semantic Hash Index — Quick Reference

Available since: v3.19.0
Build: cargo build --release --features code-graph
DDL: CREATE SEMANTIC HASH INDEX [IF NOT EXISTS] <name> [ON <table>]
Rust API: EmbeddedDatabase::code_graph_merkle_refresh() (wraps code_graph::merkle_refresh)
Roll-up table: _hdb_code_merkle (file_id BIGINT PK, rollup_hash TEXT, last_updated TIMESTAMP)

UVP

Re-indexing a 100 KLOC corpus from scratch on every commit is tractable but wasteful — most files don’t change. The semantic hash index materialises a per-file BLAKE3 roll-up over (qualified, kind, signature, line_start, line_end) for every symbol in the file, so the next reparse can skip files whose member symbols haven’t shifted. Idempotent. One DDL line. Re-indexing turns into O(changed) instead of O(corpus).

DDL Syntax

CREATE SEMANTIC HASH INDEX [IF NOT EXISTS] <index_name> [ ON <table> ];

<index_name> — administrator-chosen name (free-form identifier).
IF NOT EXISTS — silently no-ops if the named index has been declared in this process before.
ON <table> — currently informational. Phase 3 has a single Merkle target (_hdb_code_symbols); the optional clause is reserved for forward compatibility.

The DDL is process-local: declaring an index doesn’t persist outside the running database. Re-issue after restart.
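The process-local behaviour can be pictured as an in-memory set of declared names. The following is a minimal sketch, not the HeliosDB implementation: `IndexRegistry`, `Declare`, and `declare` are hypothetical names, and the assumption that the bare form rejects a duplicate name is not stated in this reference.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory registry; it lives only as long as the process.
#[derive(Default)]
pub struct IndexRegistry {
    declared: HashSet<String>,
}

/// Outcome of a successful declaration.
#[derive(Debug, PartialEq, Eq)]
pub enum Declare {
    /// The name was new and has been recorded.
    Created,
    /// The name was already declared and IF NOT EXISTS turned this into a no-op.
    AlreadyExists,
}

impl IndexRegistry {
    /// Models CREATE SEMANTIC HASH INDEX [IF NOT EXISTS] <name>.
    /// Assumption: without IF NOT EXISTS, a duplicate name is an error.
    pub fn declare(&mut self, name: &str, if_not_exists: bool) -> Result<Declare, String> {
        if self.declared.contains(name) {
            if if_not_exists {
                Ok(Declare::AlreadyExists)
            } else {
                Err(format!("semantic hash index '{name}' already exists"))
            }
        } else {
            self.declared.insert(name.to_string());
            Ok(Declare::Created)
        }
    }
}
```

Constructing a fresh `IndexRegistry` stands in for a restart: the set starts empty, so the same DDL has to be issued again before the name is known.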

Examples

-- 1. Bare form
CREATE SEMANTIC HASH INDEX code_merkle;

-- 2. Idempotent under re-runs / migrations
CREATE SEMANTIC HASH INDEX IF NOT EXISTS code_merkle;

-- 3. With the optional ON clause (accepted, ignored)
CREATE SEMANTIC HASH INDEX IF NOT EXISTS code_merkle ON _hdb_code_symbols;

The DDL fires code_graph::merkle_refresh once when the statement runs, materialising the roll-up table on the spot.

What the Index Stores

_hdb_code_merkle holds one row per file:

Column	Type	Meaning
`file_id`	`BIGINT` (PK)	FK to `_hdb_code_files.node_id`
`rollup_hash`	`TEXT`	Hex BLAKE3 over every symbol’s `(qualified, kind, signature, line_start, line_end)`
`last_updated`	`TIMESTAMP`	When this row was last written

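The hash domain is per-symbol descriptors, not raw bytes, so cosmetic edits that keep every symbol on the same lines (intra-line whitespace, rewording a comment) leave the roll-up unchanged. Because line_start and line_end are part of the hashed tuple, an edit that adds or removes lines above or inside a symbol shifts its range and therefore changes the hash.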
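To make the hash domain concrete, here is a minimal sketch of a per-file roll-up over symbol descriptors. It is an illustration under stated assumptions, not the engine's code: `SymbolDescriptor` and `rollup_hash` are hypothetical names, the standard library's `DefaultHasher` stands in for BLAKE3 so the example needs no external crate, and sorting the descriptors before hashing is an assumed way of making the result independent of row order.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One symbol descriptor, mirroring the hashed tuple
/// (qualified, kind, signature, line_start, line_end).
#[derive(Clone, Debug, Hash, PartialEq, Eq, PartialOrd, Ord)]
pub struct SymbolDescriptor {
    pub qualified: String,
    pub kind: String,
    pub signature: String,
    pub line_start: u32,
    pub line_end: u32,
}

impl SymbolDescriptor {
    pub fn new(qualified: &str, kind: &str, signature: &str, line_start: u32, line_end: u32) -> Self {
        Self {
            qualified: qualified.to_string(),
            kind: kind.to_string(),
            signature: signature.to_string(),
            line_start,
            line_end,
        }
    }
}

/// Hex roll-up for one file's symbols.
/// Stand-in hasher: DefaultHasher (64-bit), NOT BLAKE3. Its output is not
/// guaranteed stable across Rust releases, so do not persist it.
pub fn rollup_hash(symbols: &[SymbolDescriptor]) -> String {
    // Assumption: sort first so the roll-up does not depend on read order.
    let mut sorted: Vec<&SymbolDescriptor> = symbols.iter().collect();
    sorted.sort();

    let mut hasher = DefaultHasher::new();
    for symbol in sorted {
        // Hashes all five fields; file bytes never enter the hash.
        symbol.hash(&mut hasher);
    }
    format!("{:016x}", hasher.finish())
}
```

Feeding the same descriptors in a different order yields the same roll-up, while moving a symbol from lines 10 to 42 to lines 11 to 43 yields a different one, since the line range is part of the hashed tuple.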

When To Refresh

The roll-up is not updated automatically by code_index. Refresh after re-indexing:

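use heliosdb_nano::{code_graph::CodeIndexOptions, EmbeddedDatabase};

// `db` is an already-open EmbeddedDatabase handle.
// 1. Re-index the source table; this does not touch _hdb_code_merkle.
let _index_stats = db.code_index(CodeIndexOptions::for_table("src"))?;

// 2. Recompute the per-file roll-ups and report what changed.
let merkle_stats = db.code_graph_merkle_refresh()?;
println!(
    "files_hashed={}, files_unchanged={}, symbols_hashed={}",
    merkle_stats.files_hashed,
    merkle_stats.files_unchanged,
    merkle_stats.symbols_hashed,
);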

Or from SQL:

CREATE SEMANTIC HASH INDEX IF NOT EXISTS code_merkle;   -- idempotent

MerkleStats reports files_hashed (newly written or changed), files_unchanged (roll-up matched the prior hash), and symbols_hashed (cumulative count for this call).

Use It as a Skip-Predicate

Once the index is built, retrieval pipelines can gate expensive work on a hash mismatch:

-- Files whose symbols changed since the last embed pass
SELECT f.path
FROM _hdb_code_files f
JOIN _hdb_code_merkle m ON m.file_id = f.node_id
WHERE m.rollup_hash <> :previous_rollup_hash;

Pair with the Code-Graph Tutorial: a body_vec re-embedding pass can iterate only the rows above instead of the whole corpus. The same predicate works for cross-file re-linking, BM25 invalidation, and materialized-view (MV) refresh.
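On the application side, the skip-predicate amounts to comparing the roll-ups recorded at the last pass with the ones currently in _hdb_code_merkle, file by file. The sketch below assumes the caller has already loaded both sets of (file_id, rollup_hash) pairs into maps; `RollupSnapshot` and `changed_files` are hypothetical names, and how the previous snapshot is stored is left to the pipeline.

```rust
use std::collections::HashMap;

/// file_id -> rollup_hash, as read from _hdb_code_merkle at some point in time.
pub type RollupSnapshot = HashMap<i64, String>;

/// Returns the ids of files that need expensive work again: files whose
/// roll-up differs from the previous snapshot, plus files not seen before.
/// Files that exist only in `previous` (deleted since) are not reported here.
pub fn changed_files(previous: &RollupSnapshot, current: &RollupSnapshot) -> Vec<i64> {
    let mut changed: Vec<i64> = current
        .iter()
        .filter(|(file_id, hash)| previous.get(*file_id) != Some(*hash))
        .map(|(file_id, _)| *file_id)
        .collect();
    // Sort for a deterministic processing order.
    changed.sort_unstable();
    changed
}
```

A re-embedding pass would call `changed_files` with the snapshot saved after the previous pass, process only the returned ids, then save `current` as the new snapshot. Keeping one stored hash per file is a design choice of this sketch; it lets a single comparison cover the whole corpus.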

Where Next

CODE_GRAPH_TUTORIAL — populate _hdb_code_symbols (the input the Merkle roll-up sums over).
GRAPH_RAG_TUTORIAL — graph-RAG retrieval also benefits from skipping unchanged files.
MCP_ENDPOINT_TUTORIAL — incremental reparse keeps MCP tool latency stable as the corpus grows.