# Distance Metrics Selection Guide
## Overview
Choosing the right distance metric is crucial for optimal vector search performance. This guide helps you select and configure distance metrics for your use case.
## Available Metrics
### 1. Cosine Distance
Formula: 1 - (a·b) / (||a|| × ||b||)
Range: [0, 2]
- 0 = identical vectors
- 1 = orthogonal vectors
- 2 = opposite vectors
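To make these anchor points concrete, here is a quick check using the `cosine_distance` helper shown later in this guide (assuming, per the formula above, that it returns 1 minus the cosine similarity):

```rust
use heliosdb_vector::cosine_distance;

// Orthogonal vectors: a·b = 0, so distance = 1 - 0 = 1
let a = vec![1.0, 0.0];
let b = vec![0.0, 1.0];
assert!((cosine_distance(&a, &b) - 1.0).abs() < 1e-6);

// Identical vectors give 0; opposite vectors give 2
let neg_a = vec![-1.0, 0.0];
assert!(cosine_distance(&a, &a).abs() < 1e-6);
assert!((cosine_distance(&a, &neg_a) - 2.0).abs() < 1e-6);
```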
When to use:
- Text embeddings (BERT, GPT, Sentence Transformers)
- Normalized vectors
- Semantic similarity search
- Document retrieval
- Recommendation systems
Requirements:
- Vectors should be normalized (unit length)
- Works best with high-dimensional embeddings (>64D)
Example:

```rust
use heliosdb_vector::{HnswIndex, DistanceMetric, VectorData, normalize};
use bytes::Bytes; // assuming the `bytes` crate for Bytes keys

// Create index with cosine distance
let mut index = HnswIndex::new(16, 200, DistanceMetric::Cosine);

// IMPORTANT: Normalize vectors first
let mut embedding = vec![0.5, 0.3, 0.8, 0.2];
normalize(&mut embedding);

let vdata = VectorData::new(4, embedding);
index.insert(Bytes::from("doc1"), vdata)?;
```

Performance:
- Scalar: ~600ns per calculation (128D)
- AVX2: ~100ns per calculation (6x speedup)
- AVX-512: ~55ns per calculation (11x speedup)
### 2. Euclidean Distance (L2)
Formula: sqrt(Σ(ai - bi)²)
Range: [0, ∞)
- 0 = identical vectors
- larger values = more different
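For example, using the `euclidean_distance` helper shown later in this guide:

```rust
use heliosdb_vector::euclidean_distance;

// sqrt((1-4)² + (2-6)²) = sqrt(9 + 16) = 5
let a = vec![1.0, 2.0];
let b = vec![4.0, 6.0];
assert!((euclidean_distance(&a, &b) - 5.0).abs() < 1e-6);
```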
When to use:
- Image embeddings (ResNet, VGG, CLIP)
- General-purpose vector search
- Clustering (K-means, DBSCAN)
- Anomaly detection
- Non-normalized vectors
Requirements:
- None (works with any vectors)
- Consider normalizing for scale-invariance
Example:

```rust
use heliosdb_vector::{HnswIndex, DistanceMetric, VectorData};
use bytes::Bytes;

let mut index = HnswIndex::new(16, 200, DistanceMetric::L2);

// No normalization needed
let embedding = vec![10.5, 3.2, 8.7, 15.1];
let vdata = VectorData::new(4, embedding);
index.insert(Bytes::from("img1"), vdata)?;
```

Performance:
- Scalar: ~500ns per calculation (128D)
- AVX2: ~80ns per calculation (6x speedup)
- AVX-512: ~45ns per calculation (11x speedup)
### 3. Manhattan Distance (L1)
Formula: Σ|ai - bi|
Range: [0, ∞)
When to use:
- Sparse vectors
- High-dimensional data (>1000D)
- Grid-based features
- Taxi-cab geometry
- When robustness to outliers is needed
Example:

```rust
use heliosdb_vector::manhattan_distance;

let a = vec![1.0, 5.0, 3.0];
let b = vec![4.0, 2.0, 7.0];

// |1-4| + |5-2| + |3-7| = 3 + 3 + 4 = 10
let dist = manhattan_distance(&a, &b);
```

Performance:
- Scalar: ~450ns per calculation (128D)
- AVX2: ~75ns per calculation
- AVX-512: ~40ns per calculation
### 4. Dot Product (Inner Product)
Formula: Σ(ai × bi)
Range: (-∞, ∞)
- larger positive = more similar
- negative = opposite
When to use:
- Already normalized vectors
- Maximum Inner Product Search (MIPS)
- Collaborative filtering
- Matrix factorization models
- Neural network activations
Example:

```rust
use heliosdb_vector::dot_product;

let a = vec![1.0, 2.0, 3.0];
let b = vec![4.0, 5.0, 6.0];

// 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32
let dot = dot_product(&a, &b);
```

Performance:
- Scalar: ~400ns per calculation (128D)
- AVX2: ~65ns per calculation
- AVX-512: ~35ns per calculation
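Because the cosine formula divides by both norms, the dot product and cosine similarity coincide on unit-length vectors. A quick check with the crate's helpers (again assuming `cosine_distance` returns 1 minus the similarity, per the formula in section 1):

```rust
use heliosdb_vector::{normalize, dot_product, cosine_distance};

let mut a = vec![3.0, 4.0];
let mut b = vec![5.0, 12.0];
normalize(&mut a);
normalize(&mut b);

// For unit vectors: cosine distance = 1 - dot product
let dot = dot_product(&a, &b);
let cos = cosine_distance(&a, &b);
assert!((cos - (1.0 - dot)).abs() < 1e-6);
```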
### 5. Hamming Distance
Formula: Count of differing bits
Range: [0, n] where n = number of bits
When to use:
- Binary vectors/hashes
- LSH (Locality Sensitive Hashing)
- SimHash for deduplication
- Fingerprint matching
- Genomic sequences
Example:

```rust
use heliosdb_vector::hamming_distance;

let a = vec![0b10101010u8, 0b11110000u8];
let b = vec![0b10100010u8, 0b11110001u8];

let dist = hamming_distance(&a, &b); // = 2 (2 bits differ)
```

Performance:
- Extremely fast (hardware instruction)
- ~50ns per calculation (512 bits)
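The hardware instruction in question is population count (POPCNT on x86-64). If you ever need a standalone implementation, XOR plus Rust's built-in `count_ones` produces the same result; this is a minimal sketch, not the crate's actual code:

```rust
/// Hamming distance between two equal-length byte slices.
/// `count_ones()` typically compiles to a single POPCNT
/// instruction when the target CPU supports it.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}

assert_eq!(hamming(&[0b10101010, 0b11110000], &[0b10100010, 0b11110001]), 2);
```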
### 6. Jaccard Distance
Formula: 1 - (|A ∩ B| / |A ∪ B|)
Range: [0, 1]
When to use:
- Set similarity
- Tag/category matching
- Sparse binary vectors
- Collaborative filtering
- Content similarity
Example:

```rust
use heliosdb_vector::jaccard_distance;

// Binary vectors (0/1)
let a = vec![1.0, 1.0, 0.0, 0.0, 1.0];
let b = vec![1.0, 0.0, 1.0, 0.0, 0.0];

// Intersection: 1, Union: 4, Jaccard: 1/4 = 0.25, Distance: 0.75
let dist = jaccard_distance(&a, &b);
```

## Decision Tree
Use Cosine Distance if:
- Working with text embeddings (BERT, GPT, etc.)
- Vectors are normalized or can be normalized
- Need scale-invariant similarity
- Doing semantic search or document retrieval
Use Euclidean (L2) Distance if:
- Working with image embeddings
- Vectors have meaningful magnitudes
- Need geometric distance
- General-purpose similarity search
Use Manhattan (L1) Distance if:
- Vectors are sparse
- Very high dimensionality (>1000D)
- Need robustness to outliers
- Working with grid-based data
Use Dot Product if:
- Implementing MIPS (Maximum Inner Product Search)
- Vectors are already normalized
- Working with collaborative filtering
- Using matrix factorization models
Use Hamming Distance if:
- Working with binary vectors/hashes
- Implementing LSH or SimHash
- Need ultra-fast comparison
- Working with bit signatures
Use Jaccard Distance if:
- Comparing sets or tags
- Sparse binary vectors
- Content-based similarity
- Category overlap
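The same tree can be expressed as code. This is a hypothetical helper for illustration only: the `UseCase` enum is not part of the crate, and only the `DistanceMetric` variants shown elsewhere in this guide are used:

```rust
use heliosdb_vector::DistanceMetric;

/// Hypothetical enum summarizing the scenarios above.
enum UseCase {
    TextEmbeddings,  // semantic search, document retrieval
    ImageEmbeddings, // meaningful magnitudes, geometric distance
    Mips,            // collaborative filtering, matrix factorization
}

fn pick_metric(use_case: UseCase) -> DistanceMetric {
    match use_case {
        UseCase::TextEmbeddings => DistanceMetric::Cosine,
        UseCase::ImageEmbeddings => DistanceMetric::L2,
        UseCase::Mips => DistanceMetric::DotProduct,
    }
}
```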
## Common Use Cases
### Text/Document Search
```rust
use heliosdb_vector::{HnswIndex, DistanceMetric, VectorData, normalize};
use bytes::Bytes;

// Use Cosine for text embeddings
let mut index = HnswIndex::new(16, 200, DistanceMetric::Cosine);

// Always normalize text embeddings
let mut embedding = get_bert_embedding("Hello world");
normalize(&mut embedding);

let vdata = VectorData::new(768, embedding); // BERT base = 768D
index.insert(Bytes::from("doc1"), vdata)?;
```

Best metric: Cosine Distance
Why: Text embeddings are normalized, and cosine measures angular similarity.
### Image Search
```rust
// Use L2 for image embeddings
let mut index = HnswIndex::new(16, 200, DistanceMetric::L2);

// No normalization needed
let embedding = get_resnet_embedding(image);
let vdata = VectorData::new(2048, embedding); // ResNet-50 = 2048D
index.insert(Bytes::from("img1"), vdata)?;
```

Best metric: Euclidean (L2)
Why: Image embeddings have meaningful magnitudes, and L2 captures geometric distance.
### Product Recommendations
```rust
// Use Dot Product for collaborative filtering
let embedding = user_item_matrix[user_id]; // Already normalized
let vdata = VectorData::new(dim, embedding);

// Queries use dot product internally
let metric = DistanceMetric::DotProduct;
```

Best metric: Dot Product or Cosine
Why: MIPS is ideal for collaborative filtering and matrix factorization.
### Near-Duplicate Detection
```rust
use heliosdb_vector::hamming_distance;

// Use Hamming for SimHash
let hash1 = simhash("Document about machine learning");
let hash2 = simhash("Article about machine learning");

let dist = hamming_distance(&hash1, &hash2);
if dist < 3 {
    // Very similar
    println!("Near-duplicate detected!");
}
```

Best metric: Hamming Distance
Why: SimHash produces bit signatures, and Hamming is the fastest bit-level comparison.
## Performance Comparison
### Throughput (calculations per second, 128D vectors)
| Metric | Scalar | AVX2 | AVX-512 | Speedup |
|---|---|---|---|---|
| Cosine | 1.7M/s | 10M/s | 18M/s | 10.6x |
| Euclidean | 2.0M/s | 12.5M/s | 22M/s | 11x |
| Manhattan | 2.2M/s | 13M/s | 25M/s | 11.4x |
| Dot Product | 2.5M/s | 15M/s | 28M/s | 11.2x |
| Hamming | 20M/s | N/A | N/A | 1x (already fast) |
Tested on: Intel Xeon (3.5 GHz), 128-dimensional vectors
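To reproduce numbers like these on your own hardware, here is a minimal timing sketch using only the standard library (throughput will vary with CPU, SIMD support, and compiler flags; use a proper harness such as criterion for serious benchmarks):

```rust
use heliosdb_vector::euclidean_distance;
use std::time::Instant;

let a: Vec<f32> = (0..128).map(|i| i as f32).collect();
let b: Vec<f32> = (0..128).map(|i| (i * 2) as f32).collect();

let iterations = 1_000_000;
let start = Instant::now();
let mut acc = 0.0f32;
for _ in 0..iterations {
    // Accumulate results so the compiler cannot optimize the calls away
    acc += euclidean_distance(&a, &b);
}
let elapsed = start.elapsed();
println!(
    "{} calls in {:?} (~{:.0} ns/call, checksum {})",
    iterations,
    elapsed,
    elapsed.as_nanos() as f64 / iterations as f64,
    acc
);
```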
## Normalization
### When to Normalize
Always normalize for:
- Cosine distance
- Dot product (for similarity)
- Scale-invariant comparison
Don’t normalize for:
- Euclidean distance (if magnitude matters)
- Manhattan distance
- Hamming distance
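The reasoning behind these rules: cosine distance ignores vector magnitude, while L2 does not. A quick demonstration with the crate's helpers:

```rust
use heliosdb_vector::{cosine_distance, euclidean_distance};

let a = vec![1.0, 2.0, 3.0];
let b = vec![4.0, 5.0, 6.0];
let a_scaled: Vec<f32> = a.iter().map(|x| x * 10.0).collect();

// Scaling a vector leaves cosine distance unchanged...
assert!((cosine_distance(&a, &b) - cosine_distance(&a_scaled, &b)).abs() < 1e-5);

// ...but changes Euclidean distance, where magnitude matters
assert!(euclidean_distance(&a, &b) != euclidean_distance(&a_scaled, &b));
```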
### How to Normalize
```rust
use heliosdb_vector::normalize;

let mut vector = vec![3.0, 4.0, 0.0];
normalize(&mut vector);
// Result: [0.6, 0.8, 0.0]

// Verify: norm should be 1.0
let norm: f32 = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
assert!((norm - 1.0).abs() < 1e-6);
```

### Batch Normalization
```rust
use heliosdb_vector::normalize_batch;

let mut vectors = vec![
    vec![1.0, 2.0, 3.0],
    vec![4.0, 5.0, 6.0],
    vec![7.0, 8.0, 9.0],
];

normalize_batch(&mut vectors); // Normalizes all vectors in place
```

## Switching Metrics
### Rebuilding Index
To change metrics, rebuild the index:
```rust
// Original index with L2
let mut old_index = HnswIndex::new(16, 200, DistanceMetric::L2);

// New index with Cosine
let mut new_index = HnswIndex::new(16, 200, DistanceMetric::Cosine);

// Transfer and normalize vectors
for (key, vector) in old_index.iter() {
    let mut normalized = vector.clone();
    normalize(&mut normalized);
    new_index.insert(key, VectorData::new(dim, normalized))?;
}
```

### Testing Different Metrics
```rust
use heliosdb_vector::{euclidean_distance, cosine_distance, dot_product};

let a = vec![1.0, 2.0, 3.0];
let b = vec![4.0, 5.0, 6.0];

let l2 = euclidean_distance(&a, &b);
let cos = cosine_distance(&a, &b);
let dot = dot_product(&a, &b);

println!("L2: {}, Cosine: {}, Dot: {}", l2, cos, dot);
// Choose the metric with the best separation
```

## FAQ
Q: Can I use multiple metrics on the same index?
A: No, each HNSW index uses one metric. Create separate indices for different metrics.
Q: Which metric is fastest?
A: Dot product is slightly faster, but all metrics have similar SIMD performance. Choose based on accuracy, not speed.
Q: Should I normalize image embeddings?
A: Depends. For ResNet/VGG, L2 works well without normalization. For CLIP, normalize and use Cosine.
Q: Can I mix normalized and non-normalized vectors?
A: Not recommended. Normalize all vectors or none for consistent results.
Q: How do I know if my vectors are normalized?
A: Check if ||v|| = 1:
```rust
let norm: f32 = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
println!("Norm: {}", norm); // Should be ~1.0 if normalized
```

Q: Which metric for multilingual embeddings?
A: Cosine distance with normalized embeddings (most multilingual models output normalized vectors).
Q: Can I implement custom distance functions?
A: Yes, but they must satisfy metric properties (non-negative, symmetric, triangle inequality).
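For example, a weighted Euclidean distance with strictly positive weights satisfies all three properties (it is plain L2 after rescaling each coordinate by √w). This is an illustrative sketch only; how a custom function is registered with an index is not covered here:

```rust
/// Weighted L2 distance. With all weights > 0 this is a true metric:
/// non-negative, symmetric, and obeying the triangle inequality.
fn weighted_l2(a: &[f32], b: &[f32], weights: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .zip(weights)
        .map(|((x, y), w)| w * (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

let d = weighted_l2(&[1.0, 2.0], &[4.0, 6.0], &[1.0, 1.0]);
assert!((d - 5.0).abs() < 1e-6); // Unit weights reduce to plain L2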
## Best Practices
- Match your model: Use the metric your embedding model was trained with
- Normalize when needed: Always normalize for Cosine, usually for Dot Product
- Benchmark both: Test L2 vs Cosine on your data to see which works better
- Use SIMD: Enable AVX2/AVX-512 for 5-10x speedup
- Consider sparsity: Use Manhattan for very sparse vectors
- Test recall: Verify metric choice gives good recall on validation set
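For the last point, here is a sketch of a recall check: compare the approximate index's results against brute-force ground truth on a held-out query set. The `u64` ID lists are placeholders for whatever keys your evaluation harness produces (the guide's examples use Bytes keys):

```rust
use std::collections::HashSet;

/// Recall@k: the fraction of the true top-k neighbors that the
/// approximate search actually returned.
fn recall_at_k(approx: &[u64], exact: &[u64]) -> f64 {
    let truth: HashSet<_> = exact.iter().collect();
    let hits = approx.iter().filter(|id| truth.contains(id)).count();
    hits as f64 / exact.len() as f64
}

// 4 of the true top-5 retrieved => recall@5 = 0.8
assert_eq!(recall_at_k(&[1, 2, 3, 4, 9], &[1, 2, 3, 4, 5]), 0.8);
```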