Vector Search & HNSW Indexes

HeliosDB-Lite provides first-class support for vector data with a dedicated VECTOR type, HNSW indexing, product quantization, and three distance operators for similarity search.

Prerequisites

HeliosDB-Lite v3.5 or later
Access to the SQL shell (REPL, PostgreSQL wire protocol, or REST API)

Step 1 — Create a Table with a Vector Column

Declare a column with the VECTOR(n) type, where n is the number of dimensions.

CREATE TABLE documents (
    id        INT PRIMARY KEY,
    title     TEXT,
    embedding VECTOR(3)
);

This creates a table where each row stores a 3-dimensional embedding alongside its metadata.

Step 2 — Insert Vector Data

Vector literals are JSON-style arrays of floating-point numbers enclosed in single quotes.

INSERT INTO documents VALUES (1, 'Introduction to AI',   '[0.1, 0.8, 0.3]');
INSERT INTO documents VALUES (2, 'Database Internals',   '[0.9, 0.1, 0.2]');
INSERT INTO documents VALUES (3, 'Machine Learning 101', '[0.2, 0.9, 0.4]');
INSERT INTO documents VALUES (4, 'Query Optimization',   '[0.8, 0.2, 0.1]');
INSERT INTO documents VALUES (5, 'Neural Networks',      '[0.15, 0.85, 0.35]');

HeliosDB validates that the number of dimensions matches the column definition. Inserting a vector with the wrong number of dimensions produces an error:

-- This fails: VECTOR(3) column but only 2 values provided
INSERT INTO documents VALUES (6, 'Bad Vector', '[1.0, 2.0]');
-- ERROR: Vector dimension mismatch

Step 3 — Distance Operators

HeliosDB supports three distance operators for comparing vectors:

Operator	Distance Metric	Use Case
`<->`	L2 (Euclidean)	General-purpose similarity
`<=>`	Cosine	Text embeddings, normalized
`<#>`	Inner Product	Maximum inner product search

L2 (Euclidean) Distance

SELECT id, title, embedding <-> '[0.1, 0.8, 0.3]' AS distance
FROM documents
ORDER BY distance
LIMIT 3;

Expected output:

 id | title                 | distance
----+-----------------------+----------
  1 | Introduction to AI    |     0.00
  5 | Neural Networks       |     0.09
  3 | Machine Learning 101  |     0.13

Cosine Distance

Cosine distance measures the angle between vectors, ignoring magnitude. It is ideal for text embeddings that are normalized to unit length.

SELECT id, title, embedding <=> '[0.1, 0.8, 0.3]' AS cos_distance
FROM documents
ORDER BY cos_distance
LIMIT 3;

Inner Product

The inner product operator returns the negative inner product (so that smaller values indicate higher similarity, consistent with ORDER BY ASC).

SELECT id, title, embedding <#> '[0.1, 0.8, 0.3]' AS neg_ip
FROM documents
ORDER BY neg_ip
LIMIT 3;

Step 4 — Create an HNSW Index

For tables with many rows, a brute-force scan is slow. HNSW (Hierarchical Navigable Small World) indexes provide fast approximate nearest-neighbor lookups.

CREATE INDEX doc_embedding_idx ON documents
USING hnsw (embedding);

Custom HNSW Parameters

Tune index build quality and search speed with parameters:

CREATE INDEX doc_embedding_idx ON documents
USING hnsw (embedding)
WITH (m = 32, ef_construction = 400, ef_search = 200);

Parameter	Default	Description
`m`	16	Max connections per node per layer
`ef_construction`	200	Beam width during index build (higher = better)
`ef_search`	100	Beam width during query (higher = more accurate)

Higher m and ef_construction improve recall at the cost of build time and memory. Higher ef_search improves query accuracy at the cost of latency.

Step 5 — k-NN Query with Index

Once an HNSW index exists, k-NN queries automatically use it:

SELECT id, title, embedding <-> '[0.15, 0.85, 0.35]' AS distance
FROM documents
ORDER BY distance
LIMIT 5;

The query planner detects the ORDER BY <distance_operator> LIMIT k pattern and routes the search through the HNSW index instead of scanning the full table.

Step 6 — Product Quantization

For very high-dimensional vectors or large datasets, product quantization (PQ) compresses vectors to reduce memory usage while maintaining search quality.

CREATE INDEX doc_pq_idx ON documents
USING hnsw (embedding)
WITH (quantization = 'product', pq_subquantizers = 4, pq_centroids = 256);

PQ divides each vector into subvectors, quantizes each independently, and stores compact codes instead of full floating-point vectors.

Step 7 — Multiple Vector Columns

A table can have multiple vector columns with different dimensions:

CREATE TABLE products (
    id              INT PRIMARY KEY,
    name            TEXT,
    text_embedding  VECTOR(768),
    image_embedding VECTOR(512)
);

CREATE INDEX prod_text_idx  ON products USING hnsw (text_embedding);
CREATE INDEX prod_image_idx ON products USING hnsw (image_embedding);

Search each independently:

-- Find products similar by text
SELECT name FROM products
ORDER BY text_embedding <-> $text_query_vec
LIMIT 10;

-- Find products similar by image
SELECT name FROM products
ORDER BY image_embedding <=> $image_query_vec
LIMIT 10;

Tips and Troubleshooting

Dimension mismatch: Query vectors must have the same number of dimensions as the column. A VECTOR(384) column requires query vectors with exactly 384 elements.
Normalize for cosine: The <=> operator computes cosine distance. For best results, normalize your embeddings to unit length before inserting.
Index build time: Building an HNSW index scans all existing rows. For large tables, this can take time. The index is used immediately after creation.
Approximate results: HNSW provides approximate nearest neighbors. Increase ef_search for better recall at the cost of higher latency.
Index size: HNSW indexes store graph edges in addition to vectors. Expect the index to use roughly 1.5-2x the memory of the raw vector data with default m = 16.
Configuration defaults: Global HNSW defaults can be set in heliosdb.toml under [vector]:

[vector]
default_index_type = "hnsw"
hnsw_ef_construction = 200
hnsw_m = 16
enable_pq = true
pq_subvectors = 8
pq_bits = 8