Skip to content

Full-Text Search Tutorial

Full-Text Search Tutorial

Available since: v3.13.0 (2026-04-19) Build: default — no feature flag required Surface: PostgreSQL-compatible: to_tsvector, to_tsquery, plainto_tsquery, phraseto_tsquery, ts_rank, ts_rank_cd, the @@ operator, TSVECTOR / TSQUERY column types, CREATE INDEX … USING gin | gist.


UVP

If your stack speaks PostgreSQL FTS, point it at HeliosDB Nano and it works. The same to_tsvector(body) @@ to_tsquery('helios') query runs unchanged. Under the hood it’s not a Postgres GIN index — it’s an in-process BM25 engine (src/search/bm25.rs) shared with the hybrid-search path. That means one in-binary scoring engine for FTS, vector, and hybrid retrieval, with ts_rank / ts_rank_cd returning real BM25 scores you can blend with cosine distance in a single SQL statement. No external search service. No client-side reranking. No drift between your DB schema and your search index.

The EasyRAG team had a Python adapter doing client-side reranking with rank_bm25.BM25Okapi to compensate for missing scalar FTS — v3.13.0 deletes that workaround.


Prerequisites

  • HeliosDB Nano v3.13+ binary (heliosdb-nano --version)
  • A PostgreSQL client: psql, psycopg, pgx, SQLAlchemy, anything wire-compatible
  • ~15 minutes

1. Start the Server

Terminal window
heliosdb-nano start --memory --auth scram-sha-256 --password s3cret

Connect with psql:

Terminal window
psql "postgresql://postgres:s3cret@127.0.0.1:5432/postgres"

2. The Surface in 60 Seconds

FeatureSyntaxWhat it does
Tokenize documentto_tsvector(body)Returns Unicode-normalised token list as tsvector
With configto_tsvector('english', body)Config arg accepted, ignored at runtime
Build queryto_tsquery('helios')Same encoding as tsvector
Plain queryplainto_tsquery('hello helios')Alias of to_tsquery
Phrase queryphraseto_tsquery('embedded db')Alias — does NOT do phrase matching
Matchvec @@ queryTrue iff any query term appears in doc
BM25 scorets_rank(vec, query)Real BM25 score against ephemeral 1-doc index
Same as _cdts_rank_cd(vec, query)Alias — no cover-density distinction
Persistent colbody_tsv TSVECTORStored as JSON token array
Index DDLCREATE INDEX … USING gin (col)Accepted (ORM-compat); see “Limitations”

tsvector and tsquery round-trip as Value::Json arrays — they cross the PostgreSQL wire protocol unchanged and render as JSON to introspection tools.


3. Basic Match — @@ operator

CREATE TABLE articles (
id SERIAL PRIMARY KEY,
title TEXT,
body TEXT
);
INSERT INTO articles (title, body) VALUES
('HeliosDB Nano', 'embedded database with native vector search'),
('PostgreSQL guide', 'tuning the postgres optimiser for olap'),
('Hybrid retrieval', 'combine BM25 lexical scores with cosine vector distance'),
('Time-travel queries', 'snapshot isolation with AS OF SYSTEM TIME'),
('SQL FTS overview', 'tsvector tsquery and the @@ match operator');
SELECT id, title
FROM articles
WHERE to_tsvector(body) @@ to_tsquery('vector');
id | title
----+----------------------
1 | HeliosDB Nano
3 | Hybrid retrieval

4. Ranked Retrieval — ts_rank / ts_rank_cd

SELECT id, title,
ts_rank_cd(to_tsvector(body), to_tsquery('vector')) AS score
FROM articles
WHERE to_tsvector(body) @@ to_tsquery('vector')
ORDER BY score DESC
LIMIT 10;

Both ts_rank and ts_rank_cd are powered by the same BM25 engine (search::Bm25Index, landed in v3.11.0 and surfaced through SQL in v3.13.0). The cover-density variant _cd exists for signature compatibility but produces the same score — positional information isn’t in our tsvector encoding.

The optional weights / normalisation arguments are accepted and ignored:

-- All four signatures parse and run; the weights/norm are no-ops.
SELECT ts_rank(to_tsvector(body), to_tsquery('vector'));
SELECT ts_rank('{0.1, 0.2, 0.4, 1.0}'::float[],
to_tsvector(body), to_tsquery('vector'));
SELECT ts_rank(to_tsvector(body), to_tsquery('vector'), 32);
SELECT ts_rank('{0.1, 0.2, 0.4, 1.0}'::float[],
to_tsvector(body), to_tsquery('vector'), 32);

5. Persistent tsvector Columns

Pre-tokenise on write to keep query-time CPU minimal:

CREATE TABLE docs (
id SERIAL PRIMARY KEY,
body TEXT,
body_tsv TSVECTOR
);
CREATE INDEX docs_body_fts ON docs USING gin (body_tsv);
INSERT INTO docs (body, body_tsv) VALUES
('quick brown fox', to_tsvector('quick brown fox')),
('lazy dog sleeps', to_tsvector('lazy dog sleeps')),
('the fox jumps', to_tsvector('the fox jumps'));
SELECT id, body, ts_rank_cd(body_tsv, to_tsquery('fox')) AS rank
FROM docs
WHERE body_tsv @@ to_tsquery('fox')
ORDER BY rank DESC;

The GIN DDL is accepted and persisted but does not build a runtime inverted index — @@ walks matching rows and matches in the evaluator. Rationale and migration path: see docs/compatibility/fts.md in the source tree.


6. Hybrid Search (BM25 + Vector via Linear Blend)

The same query can compose lexical and semantic signals. This pattern is what most RAG pipelines actually want:

CREATE TABLE chunks (
id SERIAL PRIMARY KEY,
tenant_id INTEGER,
text TEXT,
embedding VECTOR(384)
);
CREATE INDEX chunks_emb_hnsw ON chunks USING hnsw (embedding vector_cosine_ops);
SELECT id, text,
1.0 - (embedding <=> $1::vector) AS vec_score,
ts_rank_cd(to_tsvector(text),
plainto_tsquery($2)) AS bm25_score,
0.7 * (1.0 - (embedding <=> $1::vector))
+ 0.3 * ts_rank_cd(to_tsvector(text),
plainto_tsquery($2)) AS hybrid_score
FROM chunks
WHERE tenant_id = $3
AND (embedding <=> $1::vector) < 0.8
ORDER BY hybrid_score DESC
LIMIT 10;

For RRF (Reciprocal Rank Fusion) and MMR (Maximal Marginal Relevance), call the in-process search::hybrid_search orchestrator from a Rust embedder, or use the MCP tool heliosdb_hybrid_search (added v3.11.0).


7. The EasyRAG Simplification (Real-World Adoption)

Pre-3.13, the EasyRAG adapter (backend/app/services/vectordb/adapters/heliosdb_nano_adapter.py) had to:

# OLD: client-side reranking
from rank_bm25 import BM25Okapi
bm25 = BM25Okapi([doc.split() for doc in docs])
scores = bm25.get_scores(query.split())

After v3.13.0 the entire BM25Okapi block deletes — replaced with one SQL query that uses ts_rank_cd server-side. Migration guide: easyrag/docs/heliosdb_nano_adapter_simplification.md in that project.


8. Limitations — Read This Before You Migrate

These are the cases where Postgres FTS and HeliosDB FTS differ. All come from the same source: we ship one Unicode-word tokenizer and a token-set BM25 — no positions, no stemmers, no persistent inverted index.

Postgres hasWe haveWorkaround
Language stemmers ('english', 'spanish', …)Tokenizer normalises but does NOT stemStem at ingest time, store the stemmed tokens in a TSVECTOR column
Phrase queries ('quick <-> fox')Operator parses, proximity discardedPost-filter in app code, or use phraseto_tsquery for “any term match”
setweight(tsv, 'A')Accepted at the API surface, ignored at runtimeBias scores manually with multiplied terms in ts_rank
Persistent GIN index walksUSING gin DDL accepted, runtime is sequential @@ walkPrefilter (tenant_id, time range, vector cut) before @@ to bound the walk
Multi-column `setweight(…)setweight(…)`

In practice, ~100k rows of moderate-length text scan in <50 ms unfiltered. Beyond that, prefilter — typical RAG queries already filter by tenant_id or a vector proximity cut, which keeps the walk bounded.

The full scope-of-support contract is docs/compatibility/fts.md in the source repo.


9. Verification Checklist

-- Sanity: scalar functions
SELECT to_tsvector('hello heliosdb'); -- ["hello","heliosdb"]
SELECT to_tsquery('helios'); -- ["helios"]
-- Sanity: match operator
SELECT 'hello world'::TEXT::TSVECTOR @@ 'world'::TEXT::TSQUERY;
-- (Use the function form to avoid the cast surprise.)
-- Sanity: rank ordering
SELECT ts_rank_cd(to_tsvector('helios is fast'),
to_tsquery('fast')); -- > 0.0
-- DDL acceptance (no-op at runtime)
CREATE INDEX IF NOT EXISTS t_fts ON articles USING gin (body);

Troubleshooting

SymptomCauseFix
to_tsvector('foxes') ≠ to_tsvector('fox')No stemmerStem at ingest, store the stemmed TSVECTOR
ts_rank_cd returns same score as ts_rankAliased to BM25 — no cover-density encodingBoth are correct BM25; the _cd is API-compat only
USING gin “works” but query is slowDDL accepted, runtime walks rowsAdd a prefilter predicate; index is no-op
@@ returns NULLThree-valued logic — one operand is NULLCOALESCE(body_tsv, to_tsvector(''))
phraseto_tsquery('embedded db') returns rows missing the phraseWe treat phrase queries as bag-of-termsPost-filter in app code if exact phrase matters

Where Next


CHANGELOG references: v3.13.0 (FTS surface), v3.11.0 (BM25 engine, hybrid orchestrator), v3.19.1 (current verified release).