Semantic Hash Index — Quick Reference

Available since: v3.19.0 (2026-04-25)
Build: cargo build --release --features code-graph
DDL: CREATE SEMANTIC HASH INDEX [IF NOT EXISTS] <name> [ON <table>]
Rust API: EmbeddedDatabase::code_graph_merkle_refresh(), which wraps code_graph::merkle_refresh
Roll-up table: _hdb_code_merkle (file_id BIGINT PK, rollup_hash TEXT, last_updated TIMESTAMP)


UVP

Re-indexing a 100 KLOC corpus from scratch on every commit is tractable but wasteful, because most files don’t change. The semantic hash index materialises a per-file BLAKE3 roll-up over (qualified, kind, signature, line_start, line_end) for every symbol in the file, so the next reparse can skip files whose member symbols haven’t shifted. The refresh is idempotent and takes one DDL line, and re-indexing cost becomes O(changed files) instead of O(corpus).


DDL Syntax

CREATE SEMANTIC HASH INDEX [IF NOT EXISTS] <index_name> [ ON <table> ];
  • <index_name> — administrator-chosen name (free-form identifier).
  • IF NOT EXISTS — silently no-ops if the named index has been declared in this process before.
  • ON <table> — currently informational. Phase 3 has a single Merkle target (_hdb_code_symbols); the optional clause is reserved for forward compatibility.

The DDL is process-local: the declaration is not persisted and is lost when the database process exits. Re-issue it after every restart.
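Because the declaration has to be re-issued on every start, a startup routine may want to assemble the statement from configuration. The following is a minimal sketch of that, not part of the database API: `semantic_hash_index_ddl` is a hypothetical helper that only builds the SQL text from the grammar above. How the string is then executed depends on the client interface in use and is deliberately left out.

```rust
/// Builds a `CREATE SEMANTIC HASH INDEX` statement from its parts.
/// Hypothetical helper for illustration; it only formats text and
/// performs no identifier validation or quoting.
fn semantic_hash_index_ddl(name: &str, if_not_exists: bool, on_table: Option<&str>) -> String {
    let mut sql = String::from("CREATE SEMANTIC HASH INDEX ");
    if if_not_exists {
        // Makes the statement safe to repeat within one process.
        sql.push_str("IF NOT EXISTS ");
    }
    sql.push_str(name);
    if let Some(table) = on_table {
        // The ON clause is accepted but currently informational.
        sql.push_str(" ON ");
        sql.push_str(table);
    }
    sql.push(';');
    sql
}
```

Calling it with `("code_merkle", true, None)` yields the idempotent form, which is the one that suits a startup hook or migration script.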


Examples

-- 1. Bare form
CREATE SEMANTIC HASH INDEX code_merkle;
-- 2. Idempotent under re-runs / migrations
CREATE SEMANTIC HASH INDEX IF NOT EXISTS code_merkle;
-- 3. With the optional ON clause (accepted, ignored)
CREATE SEMANTIC HASH INDEX IF NOT EXISTS code_merkle ON _hdb_code_symbols;

The DDL fires code_graph::merkle_refresh once when the statement runs, materialising the roll-up table on the spot.


What the Index Stores

_hdb_code_merkle holds one row per file:

| Column | Type | Meaning |
|---|---|---|
| file_id | BIGINT (PK) | FK to _hdb_code_files.node_id |
| rollup_hash | TEXT | Hex BLAKE3 over every symbol’s (qualified, kind, signature, line_start, line_end) |
| last_updated | TIMESTAMP | When this row was last written |

The hash domain is per-symbol descriptors — not raw bytes — so cosmetic edits (whitespace, comments) leave the roll-up unchanged.
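To make the hash domain concrete, here is a rough sketch of a per-file roll-up over symbol descriptors. It is an illustration under stated assumptions, not the engine's implementation: the standard library's `DefaultHasher` (64-bit) stands in for BLAKE3 so the example needs no external crate, the `SymbolDescriptor` struct is a hypothetical mirror of the descriptor tuple, and sorting symbols before hashing is an assumed way of making the result independent of input order.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical mirror of the per-symbol descriptor tuple.
#[derive(Hash, Clone)]
struct SymbolDescriptor {
    qualified: String,
    kind: String,
    signature: String,
    line_start: u32,
    line_end: u32,
}

/// Rolls all descriptors of one file into a single hex digest.
/// ASSUMPTION: DefaultHasher stands in for BLAKE3, and symbols are
/// sorted by (qualified, line_start) so input order does not matter.
fn rollup_hash(symbols: &[SymbolDescriptor]) -> String {
    let mut ordered: Vec<&SymbolDescriptor> = symbols.iter().collect();
    ordered.sort_by(|a, b| {
        a.qualified
            .cmp(&b.qualified)
            .then(a.line_start.cmp(&b.line_start))
    });
    let mut hasher = DefaultHasher::new();
    for sym in ordered {
        // Feeds every descriptor field into the running hash.
        sym.hash(&mut hasher);
    }
    format!("{:016x}", hasher.finish())
}
```

Only descriptor fields enter the digest, so two parses that produce identical descriptors produce the same roll-up, while a change to any field, such as a signature, produces a different one.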


When To Refresh

The roll-up is not updated automatically by code_index. Refresh after re-indexing:

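use heliosdb_nano::{code_graph::CodeIndexOptions, EmbeddedDatabase};

// `db` is an already-open EmbeddedDatabase; the `?` operators assume an
// enclosing function that returns a compatible Result.
// Re-index first, then refresh the roll-up so it reflects the new symbols.
let stats = db.code_index(CodeIndexOptions::for_table("src"))?;
let merkle_stats = db.code_graph_merkle_refresh()?;
println!(
    "files_hashed={}, files_unchanged={}, symbols_hashed={}",
    merkle_stats.files_hashed,
    merkle_stats.files_unchanged,
    merkle_stats.symbols_hashed,
);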

Or from SQL:

CREATE SEMANTIC HASH INDEX IF NOT EXISTS code_merkle; -- idempotent

MerkleStats reports files_hashed (newly written or changed), files_unchanged (roll-up matched the prior hash), and symbols_hashed (cumulative count for this call).
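The three counters can be read as the result of comparing freshly computed roll-ups against the stored ones. The sketch below models that comparison in plain Rust. It is an assumption about the bookkeeping, not the actual refresh code: `RefreshCounts` is a hypothetical stand-in for `MerkleStats` (whose real field types are not given here), `tally_refresh` is an invented name, and counting the symbols of unchanged files in `symbols_hashed` is a guess at what "cumulative" means.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the counters described above.
#[derive(Debug, Default, PartialEq)]
struct RefreshCounts {
    files_hashed: usize,
    files_unchanged: usize,
    symbols_hashed: usize,
}

/// `stored` maps file_id -> roll-up hash from the previous refresh and is
/// updated in place. `fresh` maps file_id -> (new roll-up, symbol count).
fn tally_refresh(
    stored: &mut HashMap<i64, String>,
    fresh: &HashMap<i64, (String, usize)>,
) -> RefreshCounts {
    let mut counts = RefreshCounts::default();
    for (file_id, (hash, symbol_count)) in fresh {
        // ASSUMPTION: every symbol hashed in this call is counted,
        // including those in files that turn out to be unchanged.
        counts.symbols_hashed += symbol_count;
        if stored.get(file_id) == Some(hash) {
            counts.files_unchanged += 1;
        } else {
            // New file or changed roll-up: write the row.
            stored.insert(*file_id, hash.clone());
            counts.files_hashed += 1;
        }
    }
    counts
}
```

Under this model a second call with the same input reports every file as unchanged, which matches the idempotence described earlier.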


Use It as a Skip-Predicate

Once the index is built, retrieval pipelines can gate expensive work on a hash mismatch:

-- Files whose symbols changed since the last embed pass
SELECT f.path
FROM _hdb_code_files f
JOIN _hdb_code_merkle m ON m.file_id = f.node_id
WHERE m.rollup_hash <> :previous_rollup_hash;

Pair this with the Code-Graph Tutorial: a body_vec re-embedding pass can iterate only the rows above instead of the whole corpus. The same predicate works for cross-file re-linking, BM25 invalidation, and materialised-view (MV) refresh.
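On the application side, the same gate can be applied per file once the joined rows have been fetched. The following is a minimal sketch under assumptions: `RollupRow` is a hypothetical struct for one row of the join above, the snapshot map of path to previously seen roll-up is something the pipeline would have to keep itself, and treating files missing from the snapshot as changed is a design choice made for this example.

```rust
use std::collections::HashMap;

/// Hypothetical row of the _hdb_code_files / _hdb_code_merkle join.
struct RollupRow {
    path: String,
    rollup_hash: String,
}

/// Returns the paths whose current roll-up differs from the snapshot
/// taken at the last embed pass. Files absent from the snapshot are
/// treated as changed.
fn paths_to_reembed<'a>(
    rows: &'a [RollupRow],
    snapshot: &HashMap<String, String>,
) -> Vec<&'a str> {
    rows.iter()
        .filter(|row| snapshot.get(&row.path) != Some(&row.rollup_hash))
        .map(|row| row.path.as_str())
        .collect()
}
```

After the expensive pass completes, the pipeline would store the current roll-ups as the new snapshot so the next run starts from them.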


Where Next