Distance Metrics Selection Guide

Overview

Choosing the right distance metric is crucial for optimal vector search performance. This guide helps you select and configure distance metrics for your use case.

Available Metrics

1. Cosine Distance

Formula: 1 - (a·b) / (||a|| × ||b||)

Range: [0, 2]

  • 0 = identical vectors
  • 1 = orthogonal vectors
  • 2 = opposite vectors

When to use:

  • Text embeddings (BERT, GPT, Sentence Transformers)
  • Normalized vectors
  • Semantic similarity search
  • Document retrieval
  • Recommendation systems

Requirements:

  • Vectors should be normalized (unit length)
  • Works best with high-dimensional embeddings (>64D)
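
For reference, the formula maps directly onto plain Rust. A minimal sketch (illustrative only; prefer the library's built-in DistanceMetric::Cosine, which is SIMD-accelerated):

// Sketch of cosine distance: 1 - (a·b) / (||a|| × ||b||)
fn cosine_distance_sketch(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}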

Example:

use bytes::Bytes;
use heliosdb_vector::{DistanceMetric, HnswIndex, VectorData, normalize};

// Create an index with cosine distance
let mut index = HnswIndex::new(16, 200, DistanceMetric::Cosine);

// IMPORTANT: Normalize vectors first
let mut embedding = vec![0.5, 0.3, 0.8, 0.2];
normalize(&mut embedding);

let vdata = VectorData::new(4, embedding);
index.insert(Bytes::from("doc1"), vdata)?;

Performance:

  • Scalar: ~600ns per calculation (128D)
  • AVX2: ~100ns per calculation (6x speedup)
  • AVX-512: ~55ns per calculation (11x speedup)

2. Euclidean Distance (L2)

Formula: sqrt(Σ(ai - bi)²)

Range: [0, ∞)

  • 0 = identical vectors
  • larger values = more different

When to use:

  • Image embeddings (ResNet, VGG, CLIP)
  • General-purpose vector search
  • Clustering (K-means, DBSCAN)
  • Anomaly detection
  • Non-normalized vectors

Requirements:

  • None (works with any vectors)
  • Consider normalizing for scale-invariance
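
As with cosine above, the formula in plain Rust (a sketch; the built-in DistanceMetric::L2 is the SIMD-accelerated production path):

// Sketch of Euclidean (L2) distance: sqrt(Σ(ai - bi)²)
fn euclidean_distance_sketch(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}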

Example:

let mut index = HnswIndex::new(16, 200, DistanceMetric::L2);
// No normalization needed
let embedding = vec![10.5, 3.2, 8.7, 15.1];
let vdata = VectorData::new(4, embedding);
index.insert(Bytes::from("img1"), vdata)?;

Performance:

  • Scalar: ~500ns per calculation (128D)
  • AVX2: ~80ns per calculation (6x speedup)
  • AVX-512: ~45ns per calculation (11x speedup)

3. Manhattan Distance (L1)

Formula: Σ|ai - bi|

Range: [0, ∞)

When to use:

  • Sparse vectors
  • High-dimensional data (>1000D)
  • Grid-based features (taxi-cab geometry)
  • When robustness to outliers matters (demonstrated below)

Example:

use heliosdb_vector::manhattan_distance;

let a = vec![1.0, 5.0, 3.0];
let b = vec![4.0, 2.0, 7.0];
let dist = manhattan_distance(&a, &b);
// = |1-4| + |5-2| + |3-7| = 3 + 3 + 4 = 10
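
The outlier robustness is easy to see with the library's manhattan_distance and euclidean_distance (illustrative vectors with the same total error, distributed differently):

use heliosdb_vector::{euclidean_distance, manhattan_distance};

let origin = vec![0.0, 0.0, 0.0, 0.0];
let spread = vec![1.0, 1.0, 1.0, 1.0];  // error spread evenly
let outlier = vec![4.0, 0.0, 0.0, 0.0]; // same total error in one dimension

// L1 treats both the same: 4.0 vs 4.0
println!("{} vs {}", manhattan_distance(&origin, &spread), manhattan_distance(&origin, &outlier));
// L2 penalizes the concentrated outlier: 2.0 vs 4.0
println!("{} vs {}", euclidean_distance(&origin, &spread), euclidean_distance(&origin, &outlier));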

Performance:

  • Scalar: ~450ns per calculation (128D)
  • AVX2: ~75ns per calculation
  • AVX-512: ~40ns per calculation

4. Dot Product (Inner Product)

Formula: Σ(ai × bi)

Range: (-∞, ∞)

  • larger positive = more similar
  • negative = opposite

When to use:

  • Already normalized vectors
  • Maximum Inner Product Search (MIPS)
  • Collaborative filtering
  • Matrix factorization models
  • Neural network activations

Example:

use heliosdb_vector::dot_product;
let a = vec![1.0, 2.0, 3.0];
let b = vec![4.0, 5.0, 6.0];
let dot = dot_product(&a, &b);
// = 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32
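
Note that the dot product is a similarity, not a distance: larger means more similar. When a nearest-neighbor search expects a distance (smaller = closer), one common convention (assumed here; not confirmed as what DistanceMetric::DotProduct does internally) is to negate the score:

use heliosdb_vector::dot_product;

// Hypothetical wrapper: negate so that smaller = more similar
fn dot_product_distance(a: &[f32], b: &[f32]) -> f32 {
    -dot_product(a, b)
}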

Performance:

  • Scalar: ~400ns per calculation (128D)
  • AVX2: ~65ns per calculation
  • AVX-512: ~35ns per calculation

5. Hamming Distance

Formula: Count of differing bits

Range: [0, n] where n = number of bits

When to use:

  • Binary vectors/hashes
  • LSH (Locality Sensitive Hashing)
  • SimHash for deduplication
  • Fingerprint matching
  • Genomic sequences

Example:

use heliosdb_vector::hamming_distance;
let a = vec![0b10101010u8, 0b11110000u8];
let b = vec![0b10100010u8, 0b11110001u8];
let dist = hamming_distance(&a, &b);
// = 2 (2 bits different)

Performance:

  • Extremely fast (hardware instruction)
  • ~50ns per calculation (512 bits)
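
The hardware instruction in question is population count (POPCNT on x86), which Rust exposes as count_ones. A minimal sketch over byte slices:

// Sketch: XOR each byte pair, then count the set bits
fn hamming_distance_sketch(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}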

6. Jaccard Distance

Formula: 1 - (|A ∩ B| / |A ∪ B|)

Range: [0, 1]

When to use:

  • Set similarity
  • Tag/category matching
  • Sparse binary vectors
  • Collaborative filtering
  • Content similarity

Example:

use heliosdb_vector::jaccard_distance;
// Binary vectors (0/1)
let a = vec![1.0, 1.0, 0.0, 0.0, 1.0];
let b = vec![1.0, 0.0, 1.0, 0.0, 0.0];
let dist = jaccard_distance(&a, &b);
// Intersection: 1, Union: 4, Jaccard: 1/4 = 0.25, Distance: 0.75
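
For the tag/category use case, the same measure can be computed directly over sets. A self-contained sketch using only the standard library (the tags are illustrative):

use std::collections::HashSet;

// Jaccard distance over sets: 1 - |A ∩ B| / |A ∪ B|
fn jaccard_distance_sets(a: &HashSet<&str>, b: &HashSet<&str>) -> f64 {
    let intersection = a.intersection(b).count() as f64;
    let union = a.union(b).count() as f64;
    1.0 - intersection / union
}

let a: HashSet<&str> = ["rust", "database", "vectors"].into_iter().collect();
let b: HashSet<&str> = ["rust", "search", "vectors"].into_iter().collect();
// Intersection: 2, Union: 4, Distance: 1 - 2/4 = 0.5
assert_eq!(jaccard_distance_sets(&a, &b), 0.5);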

Decision Tree

Use Cosine Distance if:

  • Working with text embeddings (BERT, GPT, etc.)
  • Vectors are normalized or can be normalized
  • Need scale-invariant similarity
  • Doing semantic search or document retrieval

Use Euclidean (L2) Distance if:

  • Working with image embeddings
  • Vectors have meaningful magnitudes
  • Need geometric distance
  • General-purpose similarity search

Use Manhattan (L1) Distance if:

  • Vectors are sparse
  • Very high dimensionality (>1000D)
  • Need robustness to outliers
  • Working with grid-based data

Use Dot Product if:

  • Implementing MIPS (Maximum Inner Product Search)
  • Vectors are already normalized
  • Working with collaborative filtering
  • Using matrix factorization models

Use Hamming Distance if:

  • Working with binary vectors/hashes
  • Implementing LSH or SimHash
  • Need ultra-fast comparison
  • Working with bit signatures

Use Jaccard Distance if:

  • Comparing sets or tags
  • Sparse binary vectors
  • Content-based similarity
  • Category overlap

Common Use Cases

Semantic Text Search

use bytes::Bytes;
use heliosdb_vector::{DistanceMetric, HnswIndex, VectorData, normalize};

// Use Cosine for text embeddings
let mut index = HnswIndex::new(16, 200, DistanceMetric::Cosine);

// Always normalize text embeddings
let mut embedding = get_bert_embedding("Hello world");
normalize(&mut embedding);

let vdata = VectorData::new(768, embedding); // BERT base = 768D
index.insert(Bytes::from("doc1"), vdata)?;

Best metric: Cosine Distance
Why: Text embeddings are normalized; cosine measures angular similarity.


Image Similarity Search

// Use L2 for image embeddings
let mut index = HnswIndex::new(16, 200, DistanceMetric::L2);
// No normalization needed
let embedding = get_resnet_embedding(image);
let vdata = VectorData::new(2048, embedding); // ResNet-50 = 2048D
index.insert(Bytes::from("img1"), vdata)?;

Best metric: Euclidean (L2)
Why: Image embeddings have meaningful magnitudes; L2 captures geometric distance.


Product Recommendations

// Use Dot Product for collaborative filtering
let mut index = HnswIndex::new(16, 200, DistanceMetric::DotProduct);

// User factors from matrix factorization, already normalized
let embedding = user_item_matrix[user_id].clone();
let vdata = VectorData::new(embedding.len(), embedding);
index.insert(Bytes::from("user1"), vdata)?;

Best metric: Dot Product or Cosine
Why: MIPS is ideal for collaborative filtering and matrix factorization.


Near-Duplicate Detection

use heliosdb_vector::hamming_distance;

// Use Hamming for SimHash
let hash1 = simhash("Document about machine learning");
let hash2 = simhash("Article about machine learning");

let dist = hamming_distance(&hash1, &hash2);
if dist < 3 {
    // Very similar
    println!("Near-duplicate detected!");
}

Best metric: Hamming Distance
Why: SimHash produces bit signatures; Hamming is the fastest way to compare them.


Performance Comparison

Throughput (calculations per second, 128D vectors)

Metric        Scalar    AVX2      AVX-512   Speedup
Cosine        1.7M/s    10M/s     18M/s     10.6x
Euclidean     2.0M/s    12.5M/s   22M/s     11x
Manhattan     2.2M/s    13M/s     25M/s     11.4x
Dot Product   2.5M/s    15M/s     28M/s     11.2x
Hamming       20M/s     N/A       N/A       1x (already fast)

Tested on: Intel Xeon (3.5 GHz), 128-dimensional vectors
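
To get comparable numbers on your own hardware, a rough micro-benchmark sketch (iteration count and dimensions are illustrative; a benchmarking harness such as criterion would give more reliable figures):

use std::time::Instant;
use heliosdb_vector::euclidean_distance;

let a = vec![0.5f32; 128];
let b = vec![0.25f32; 128];

let iterations = 1_000_000;
let start = Instant::now();
let mut acc = 0.0;
for _ in 0..iterations {
    acc += euclidean_distance(&a, &b); // accumulate so the calls aren't optimized away
}
let rate = iterations as f64 / start.elapsed().as_secs_f64();
println!("{rate:.0} calcs/s (acc = {acc})");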


Normalization

When to Normalize

Always normalize for:

  • Cosine distance
  • Dot product (for similarity)
  • Scale-invariant comparison

Don’t normalize for:

  • Euclidean distance (if magnitude matters)
  • Manhattan distance
  • Hamming distance

How to Normalize

use heliosdb_vector::normalize;

let mut vector = vec![3.0, 4.0, 0.0];
normalize(&mut vector);
// Result: [0.6, 0.8, 0.0]

// Verify: the norm should now be 1.0
let norm: f32 = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
assert!((norm - 1.0).abs() < 1e-6);

Batch Normalization

use heliosdb_vector::normalize_batch;

let mut vectors = vec![
    vec![1.0, 2.0, 3.0],
    vec![4.0, 5.0, 6.0],
    vec![7.0, 8.0, 9.0],
];
normalize_batch(&mut vectors); // Normalizes all vectors in place

Switching Metrics

Rebuilding Index

To change metrics, rebuild the index:

// Original index with L2
let old_index = HnswIndex::new(16, 200, DistanceMetric::L2);

// New index with Cosine
let mut new_index = HnswIndex::new(16, 200, DistanceMetric::Cosine);

// Transfer and normalize vectors
for (key, vector) in old_index.iter() {
    let mut normalized = vector.clone();
    normalize(&mut normalized);
    let dim = normalized.len();
    new_index.insert(key, VectorData::new(dim, normalized))?;
}

Testing Different Metrics

use heliosdb_vector::{cosine_distance, dot_product, euclidean_distance};

let a = vec![1.0, 2.0, 3.0];
let b = vec![4.0, 5.0, 6.0];

let l2 = euclidean_distance(&a, &b);
let cos = cosine_distance(&a, &b);
let dot = dot_product(&a, &b);
println!("L2: {}, Cosine: {}, Dot: {}", l2, cos, dot);
// Choose the metric that best separates similar from dissimilar pairs on your data
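
One way to make "best separation" concrete: compare the mean distance over pairs you know to be related against pairs you know to be unrelated. A sketch (similar_pairs and dissimilar_pairs are hypothetical labeled samples from your own data):

use heliosdb_vector::cosine_distance;

fn mean_distance(pairs: &[(Vec<f32>, Vec<f32>)]) -> f32 {
    pairs.iter().map(|(a, b)| cosine_distance(a, b)).sum::<f32>() / pairs.len() as f32
}

// A larger gap means the metric separates your data better
let gap = mean_distance(&dissimilar_pairs) - mean_distance(&similar_pairs);
println!("Separation gap: {gap}");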

FAQ

Q: Can I use multiple metrics on the same index?

A: No, each HNSW index uses one metric. Create separate indices for different metrics.

Q: Which metric is fastest?

A: Among the floating-point metrics, dot product is slightly faster, but all have similar SIMD performance (Hamming on binary vectors is fastest overall). Choose based on accuracy, not speed.

Q: Should I normalize image embeddings?

A: Depends. For ResNet/VGG, L2 works well without normalization. For CLIP, normalize and use Cosine.

Q: Can I mix normalized and non-normalized vectors?

A: Not recommended. Normalize all vectors or none for consistent results.

Q: How do I know if my vectors are normalized?

A: Check if ||v|| = 1:

let norm: f32 = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
println!("Norm: {}", norm); // Should be ~1.0 if normalized

Q: Which metric for multilingual embeddings?

A: Cosine distance with normalized embeddings (most multilingual models output normalized vectors).

Q: Can I implement custom distance functions?

A: Yes, but they must satisfy metric properties (non-negative, symmetric, triangle inequality).
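
For example, Chebyshev (L∞) distance satisfies all three properties. A sketch of such a custom function (how it would be registered with heliosdb_vector is not covered here):

// Chebyshev (L∞) distance: max |ai - bi|
// Non-negative, symmetric, and satisfies the triangle inequality
fn chebyshev_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0, f32::max)
}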


Best Practices

  1. Match your model: Use the metric your embedding model was trained with
  2. Normalize when needed: Always normalize for Cosine, usually for Dot Product
  3. Benchmark both: Test L2 vs Cosine on your data to see which works better
  4. Use SIMD: Enable AVX2/AVX-512 for the 6-11x speedups shown above
  5. Consider sparsity: Use Manhattan for very sparse vectors
  6. Test recall: Verify that your metric choice gives good recall on a validation set (see the sketch below)
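
A minimal recall@k check (a sketch: ground_truth comes from an exact brute-force search, retrieved from the HNSW index; how you obtain each list depends on your query code):

use bytes::Bytes;

// Recall@k: fraction of true nearest neighbors that the index returned
fn recall_at_k(ground_truth: &[Bytes], retrieved: &[Bytes]) -> f32 {
    let hits = retrieved.iter().filter(|id| ground_truth.contains(*id)).count();
    hits as f32 / ground_truth.len() as f32
}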
