Semantic Hash Index — Quick Reference
Available since: v3.19.0 (2026-04-25)
Build: cargo build --release --features code-graph
DDL: CREATE SEMANTIC HASH INDEX [IF NOT EXISTS] <name> [ON <table>]
Rust API: EmbeddedDatabase::code_graph_merkle_refresh() — wraps code_graph::merkle_refresh
Roll-up table: _hdb_code_merkle (file_id BIGINT PK, rollup_hash TEXT, last_updated TIMESTAMP)
UVP
Re-indexing a 100 KLOC corpus from scratch on every commit is tractable but wasteful — most files don’t change. The semantic hash index materialises a per-file BLAKE3 roll-up over (qualified, kind, signature, line_start, line_end) for every symbol in the file, so the next reparse can skip files whose member symbols haven’t shifted. Idempotent. One DDL line. Re-indexing turns into O(changed) instead of O(corpus).
DDL Syntax
    CREATE SEMANTIC HASH INDEX [IF NOT EXISTS] <index_name> [ ON <table> ];

- <index_name>: administrator-chosen name (free-form identifier).
- IF NOT EXISTS: silently no-ops if the named index has been declared in this process before.
- ON <table>: currently informational. Phase 3 has a single Merkle target (_hdb_code_symbols); the optional clause is reserved for forward compatibility.
The DDL is process-local: declaring an index doesn’t persist outside the running database. Re-issue after restart.
Examples
    -- 1. Bare form
    CREATE SEMANTIC HASH INDEX code_merkle;

    -- 2. Idempotent under re-runs / migrations
    CREATE SEMANTIC HASH INDEX IF NOT EXISTS code_merkle;

    -- 3. With the optional ON clause (accepted, ignored)
    CREATE SEMANTIC HASH INDEX IF NOT EXISTS code_merkle ON _hdb_code_symbols;

The DDL fires code_graph::merkle_refresh once when the statement runs, materialising the roll-up table on the spot.
What the Index Stores
_hdb_code_merkle holds one row per file:
| Column | Type | Meaning |
|---|---|---|
| file_id | BIGINT (PK) | FK to _hdb_code_files.node_id |
| rollup_hash | TEXT | Hex BLAKE3 over every symbol’s (qualified, kind, signature, line_start, line_end) |
| last_updated | TIMESTAMP | When this row was last written |
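The hash domain is per-symbol descriptors, not raw bytes, so cosmetic edits (whitespace, comments) leave the roll-up unchanged as long as they do not move any symbol’s line_start or line_end, since both are part of the hashed tuple.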
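To make the hash domain concrete, here is a minimal sketch of a per-file roll-up over symbol descriptors. It is not the HeliosDB implementation: SymbolDescriptor, rollup_hash, and sym are hypothetical names, the standard library's DefaultHasher stands in for BLAKE3 so the block needs no external crate, and sorting the descriptors before hashing is an assumption made here to keep the result order-independent.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical per-symbol descriptor: the five fields the roll-up covers.
#[derive(Hash, Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct SymbolDescriptor {
    qualified: String,
    kind: String,
    signature: String,
    line_start: u32,
    line_end: u32,
}

/// Folds every descriptor of one file into a single hex digest.
/// DefaultHasher is a std-only stand-in for BLAKE3 (an assumption of this
/// sketch); its output is 64 bits and is not stable across Rust releases.
fn rollup_hash(symbols: &[SymbolDescriptor]) -> String {
    let mut sorted: Vec<&SymbolDescriptor> = symbols.iter().collect();
    sorted.sort(); // assumption: make the digest independent of input order
    let mut hasher = DefaultHasher::new();
    for symbol in sorted {
        symbol.hash(&mut hasher);
    }
    format!("{:016x}", hasher.finish())
}

/// Convenience constructor for a function symbol (illustration only).
fn sym(qualified: &str, signature: &str, line_start: u32, line_end: u32) -> SymbolDescriptor {
    SymbolDescriptor {
        qualified: qualified.to_string(),
        kind: "function".to_string(),
        signature: signature.to_string(),
        line_start,
        line_end,
    }
}
```

Because only descriptors are hashed, rewording a comment inside a function body leaves rollup_hash unchanged, while an edit that pushes a later symbol down by one line changes that symbol's line_start and line_end and therefore the digest.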
When To Refresh
The roll-up is not updated automatically by code_index. Refresh after re-indexing:
    use heliosdb_nano::{code_graph::CodeIndexOptions, EmbeddedDatabase};

    let stats = db.code_index(CodeIndexOptions::for_table("src"))?;
    let merkle_stats = db.code_graph_merkle_refresh()?;
    println!(
        "files_hashed={}, files_unchanged={}, symbols_hashed={}",
        merkle_stats.files_hashed,
        merkle_stats.files_unchanged,
        merkle_stats.symbols_hashed,
    );

Or from SQL:

    CREATE SEMANTIC HASH INDEX IF NOT EXISTS code_merkle; -- idempotent

MerkleStats reports files_hashed (newly written or changed), files_unchanged (roll-up matched the prior hash), and symbols_hashed (cumulative count for this call).
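The three counters can be read as the outcome of comparing each freshly computed roll-up with the stored one. The following is a sketch of that bookkeeping under stated assumptions, not the library's code: the MerkleStats struct here is a local mirror of the field names above, MerkleTable is an in-memory stand-in for _hdb_code_merkle, refresh is a hypothetical function, and counting symbols for unchanged files as well is an assumption.

```rust
use std::collections::HashMap;

/// Local mirror of the three counters (hypothetical; not the crate's type).
#[derive(Debug, Default, PartialEq)]
struct MerkleStats {
    files_hashed: usize,
    files_unchanged: usize,
    symbols_hashed: usize,
}

/// In-memory stand-in for _hdb_code_merkle: file_id -> rollup_hash.
type MerkleTable = HashMap<i64, String>;

/// `fresh` holds (file_id, freshly computed roll-up, symbols in that file).
/// Rows are written only when the roll-up is new or differs from the stored one.
fn refresh(table: &mut MerkleTable, fresh: &[(i64, String, usize)]) -> MerkleStats {
    let mut stats = MerkleStats::default();
    for (file_id, hash, symbol_count) in fresh {
        // Assumption: every file is hashed in order to be compared, so its
        // symbols count towards the total even when nothing changed.
        stats.symbols_hashed += *symbol_count;
        if table.get(file_id) == Some(hash) {
            stats.files_unchanged += 1;
        } else {
            table.insert(*file_id, hash.clone());
            stats.files_hashed += 1;
        }
    }
    stats
}
```

Running refresh twice with identical input reports every file as files_unchanged on the second call, which is the behaviour the idempotency claim describes.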
Use It as a Skip-Predicate
Once the index is built, retrieval pipelines can gate expensive work on a hash mismatch:
    -- Files whose symbols changed since the last embed pass
    SELECT f.path
    FROM _hdb_code_files f
    JOIN _hdb_code_merkle m ON m.file_id = f.node_id
    WHERE m.rollup_hash <> :previous_rollup_hash;

Pair with the Code-Graph Tutorial: a body_vec re-embedding pass can iterate only the rows above instead of the whole corpus. The same predicate works for cross-file re-linking, BM25 invalidation, and MV refresh.
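The query binds a single :previous_rollup_hash, so a consumer that tracks many files presumably has to remember one hash per file between passes. One way to sketch that consumer-side state, with Snapshot, files_to_reprocess, and commit all being hypothetical names and the (file_id, rollup_hash) pairs assumed to come from reading _hdb_code_merkle:

```rust
use std::collections::HashMap;

/// Roll-ups as of the consumer's last completed pass: file_id -> rollup_hash.
type Snapshot = HashMap<i64, String>;

/// Returns the file_ids whose current roll-up is new or differs from the
/// snapshot. `current` is assumed to be the rows read from _hdb_code_merkle.
fn files_to_reprocess(previous: &Snapshot, current: &[(i64, String)]) -> Vec<i64> {
    let mut changed: Vec<i64> = current
        .iter()
        .filter(|(file_id, hash)| previous.get(file_id) != Some(hash))
        .map(|(file_id, _)| *file_id)
        .collect();
    changed.sort_unstable(); // deterministic order for the work queue
    changed
}

/// Records the roll-ups once the expensive pass has finished, so the next
/// pass skips these files unless their symbols change again.
fn commit(previous: &mut Snapshot, current: &[(i64, String)]) {
    for (file_id, hash) in current {
        previous.insert(*file_id, hash.clone());
    }
}
```

Calling commit only after the pass succeeds means a failed pass leaves the snapshot untouched, so the same files are picked up again on the next attempt.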
Where Next
- CODE_GRAPH_TUTORIAL: populate _hdb_code_symbols (the input the Merkle roll-up sums over).
- GRAPH_RAG_TUTORIAL: graph-RAG retrieval also benefits from skipping unchanged files.
- MCP_ENDPOINT_TUTORIAL: incremental reparse keeps MCP tool latency stable as the corpus grows.