# Distance Metrics Selection Guide
## Overview
Choosing the right distance metric is crucial for optimal vector search performance. This guide helps you select and configure distance metrics for your use case.
## Available Metrics
### 1. Cosine Distance
Formula: 1 - (a·b) / (||a|| × ||b||)
Range: [0, 2]
- 0 = identical vectors
- 1 = orthogonal vectors
- 2 = opposite vectors
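To make these anchor points concrete, here is a quick check using the `cosine_distance` helper shown later in this guide (assuming, per the formula above, that it returns 1 minus the cosine similarity):

```rust
use heliosdb_vector::cosine_distance;

// Orthogonal vectors: a·b = 0, so distance = 1 - 0 = 1
let a = vec![1.0, 0.0];
let b = vec![0.0, 1.0];
assert!((cosine_distance(&a, &b) - 1.0).abs() < 1e-6);

// Identical vectors give 0; opposite vectors give 2
let neg_a = vec![-1.0, 0.0];
assert!(cosine_distance(&a, &a).abs() < 1e-6);
assert!((cosine_distance(&a, &neg_a) - 2.0).abs() < 1e-6);
```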
When to use:
- Text embeddings (BERT, GPT, Sentence Transformers)
- Normalized vectors
- Semantic similarity search
- Document retrieval
- Recommendation systems
Requirements:
- Vectors should be normalized (unit length)
- Works best with high-dimensional embeddings (>64D)
Example:

```rust
use heliosdb_vector::{HnswIndex, DistanceMetric, VectorData, normalize};
use bytes::Bytes; // assuming the `bytes` crate for Bytes keys

// Create index with cosine distance
let mut index = HnswIndex::new(16, 200, DistanceMetric::Cosine);

// IMPORTANT: Normalize vectors first
let mut embedding = vec![0.5, 0.3, 0.8, 0.2];
normalize(&mut embedding);

let vdata = VectorData::new(4, embedding);
index.insert(Bytes::from("doc1"), vdata)?;
```

Performance:
- Scalar: ~600ns per calculation (128D)
- AVX2: ~100ns per calculation (6x speedup)
- AVX-512: ~55ns per calculation (11x speedup)
### 2. Euclidean Distance (L2)
Formula: sqrt(Σ(ai - bi)²)
Range: [0, ∞)
- 0 = identical vectors
- larger values = more different
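For example, using the `euclidean_distance` helper shown later in this guide:

```rust
use heliosdb_vector::euclidean_distance;

// sqrt((1-4)² + (2-6)²) = sqrt(9 + 16) = 5
let a = vec![1.0, 2.0];
let b = vec![4.0, 6.0];
assert!((euclidean_distance(&a, &b) - 5.0).abs() < 1e-6);
```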
When to use:
- Image embeddings (ResNet, VGG, CLIP)
- General-purpose vector search
- Clustering (K-means, DBSCAN)
- Anomaly detection
- Non-normalized vectors
Requirements:
- None (works with any vectors)
- Consider normalizing for scale-invariance
Example:

```rust
use heliosdb_vector::{HnswIndex, DistanceMetric, VectorData};
use bytes::Bytes;

let mut index = HnswIndex::new(16, 200, DistanceMetric::L2);

// No normalization needed
let embedding = vec![10.5, 3.2, 8.7, 15.1];
let vdata = VectorData::new(4, embedding);
index.insert(Bytes::from("img1"), vdata)?;
```

Performance:
- Scalar: ~500ns per calculation (128D)
- AVX2: ~80ns per calculation (6x speedup)
- AVX-512: ~45ns per calculation (11x speedup)
### 3. Manhattan Distance (L1)
Formula: Σ|ai - bi|
Range: [0, ∞)
When to use:
- Sparse vectors
- High-dimensional data (>1000D)
- Grid-based features
- Taxi-cab geometry
- When robustness to outliers is needed
Example:

```rust
use heliosdb_vector::manhattan_distance;

let a = vec![1.0, 5.0, 3.0];
let b = vec![4.0, 2.0, 7.0];

// |1-4| + |5-2| + |3-7| = 3 + 3 + 4 = 10
let dist = manhattan_distance(&a, &b);
```

Performance:
- Scalar: ~450ns per calculation (128D)
- AVX2: ~75ns per calculation
- AVX-512: ~40ns per calculation
### 4. Dot Product (Inner Product)
Formula: Σ(ai × bi)
Range: (-∞, ∞)
- larger positive = more similar
- negative = opposite
When to use:
- Already normalized vectors
- Maximum Inner Product Search (MIPS)
- Collaborative filtering
- Matrix factorization models
- Neural network activations
Example:

```rust
use heliosdb_vector::dot_product;

let a = vec![1.0, 2.0, 3.0];
let b = vec![4.0, 5.0, 6.0];

// 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32
let dot = dot_product(&a, &b);
```

Performance:
- Scalar: ~400ns per calculation (128D)
- AVX2: ~65ns per calculation
- AVX-512: ~35ns per calculation
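Because the cosine formula divides by both norms, the dot product and cosine similarity coincide on unit-length vectors. A quick check with the crate's helpers (again assuming `cosine_distance` returns 1 minus the similarity, per the formula in section 1):

```rust
use heliosdb_vector::{normalize, dot_product, cosine_distance};

let mut a = vec![3.0, 4.0];
let mut b = vec![5.0, 12.0];
normalize(&mut a);
normalize(&mut b);

// For unit vectors: cosine distance = 1 - dot product
let dot = dot_product(&a, &b);
let cos = cosine_distance(&a, &b);
assert!((cos - (1.0 - dot)).abs() < 1e-6);
```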
### 5. Hamming Distance
Formula: Count of differing bits
Range: [0, n] where n = number of bits
When to use:
- Binary vectors/hashes
- LSH (Locality Sensitive Hashing)
- SimHash for deduplication
- Fingerprint matching
- Genomic sequences
Example:

```rust
use heliosdb_vector::hamming_distance;

let a = vec![0b10101010u8, 0b11110000u8];
let b = vec![0b10100010u8, 0b11110001u8];

let dist = hamming_distance(&a, &b); // = 2 (2 bits differ)
```

Performance:
- Extremely fast (hardware instruction)
- ~50ns per calculation (512 bits)
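The hardware instruction in question is population count (POPCNT on x86-64). If you ever need a standalone implementation, XOR plus Rust's built-in `count_ones` produces the same result; this is a minimal sketch, not the crate's actual code:

```rust
/// Hamming distance between two equal-length byte slices.
/// `count_ones()` typically compiles to a single POPCNT
/// instruction when the target CPU supports it.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}

assert_eq!(hamming(&[0b10101010, 0b11110000], &[0b10100010, 0b11110001]), 2);
```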
### 6. Jaccard Distance
Formula: 1 - (|A ∩ B| / |A ∪ B|)
Range: [0, 1]
When to use:
- Set similarity
- Tag/category matching
- Sparse binary vectors
- Collaborative filtering
- Content similarity
Example:

```rust
use heliosdb_vector::jaccard_distance;

// Binary vectors (0/1)
let a = vec![1.0, 1.0, 0.0, 0.0, 1.0];
let b = vec![1.0, 0.0, 1.0, 0.0, 0.0];

// Intersection: 1, Union: 4, Jaccard: 1/4 = 0.25, Distance: 0.75
let dist = jaccard_distance(&a, &b);
```

## Decision Tree
Use Cosine Distance if:
- Working with text embeddings (BERT, GPT, etc.)
- Vectors are normalized or can be normalized
- Need scale-invariant similarity
- Doing semantic search or document retrieval
Use Euclidean (L2) Distance if:
- Working with image embeddings
- Vectors have meaningful magnitudes
- Need geometric distance
- General-purpose similarity search
Use Manhattan (L1) Distance if:
- Vectors are sparse
- Very high dimensionality (>1000D)
- Need robustness to outliers
- Working with grid-based data
Use Dot Product if:
- Implementing MIPS (Maximum Inner Product Search)
- Vectors are already normalized
- Working with collaborative filtering
- Using matrix factorization models
Use Hamming Distance if:
- Working with binary vectors/hashes
- Implementing LSH or SimHash
- Need ultra-fast comparison
- Working with bit signatures
Use Jaccard Distance if:
- Comparing sets or tags
- Sparse binary vectors
- Content-based similarity
- Category overlap
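The same tree can be expressed as code. This is a hypothetical helper for illustration only: the `UseCase` enum is not part of the crate, and only the `DistanceMetric` variants shown elsewhere in this guide are used:

```rust
use heliosdb_vector::DistanceMetric;

/// Hypothetical enum summarizing the scenarios above.
enum UseCase {
    TextEmbeddings,  // semantic search, document retrieval
    ImageEmbeddings, // meaningful magnitudes, geometric distance
    Mips,            // collaborative filtering, matrix factorization
}

fn pick_metric(use_case: UseCase) -> DistanceMetric {
    match use_case {
        UseCase::TextEmbeddings => DistanceMetric::Cosine,
        UseCase::ImageEmbeddings => DistanceMetric::L2,
        UseCase::Mips => DistanceMetric::DotProduct,
    }
}
```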
## Common Use Cases
### Text/Document Search
```rust
use heliosdb_vector::{HnswIndex, DistanceMetric, VectorData, normalize};
use bytes::Bytes;

// Use Cosine for text embeddings
let mut index = HnswIndex::new(16, 200, DistanceMetric::Cosine);

// Always normalize text embeddings
let mut embedding = get_bert_embedding("Hello world");
normalize(&mut embedding);

let vdata = VectorData::new(768, embedding); // BERT base = 768D
index.insert(Bytes::from("doc1"), vdata)?;
```

Best metric: Cosine Distance
Why: Text embeddings are normalized, and cosine measures angular similarity.
### Image Search
```rust
// Use L2 for image embeddings
let mut index = HnswIndex::new(16, 200, DistanceMetric::L2);

// No normalization needed
let embedding = get_resnet_embedding(image);
let vdata = VectorData::new(2048, embedding); // ResNet-50 = 2048D
index.insert(Bytes::from("img1"), vdata)?;
```

Best metric: Euclidean (L2)
Why: Image embeddings have meaningful magnitudes, and L2 captures geometric distance.
### Product Recommendations
```rust
// Use Dot Product for collaborative filtering
let embedding = user_item_matrix[user_id]; // Already normalized
let vdata = VectorData::new(dim, embedding);

// Queries use dot product internally
let metric = DistanceMetric::DotProduct;
```

Best metric: Dot Product or Cosine
Why: MIPS is ideal for collaborative filtering and matrix factorization.
### Near-Duplicate Detection
```rust
use heliosdb_vector::hamming_distance;

// Use Hamming for SimHash
let hash1 = simhash("Document about machine learning");
let hash2 = simhash("Article about machine learning");

let dist = hamming_distance(&hash1, &hash2);
if dist < 3 {
    // Very similar
    println!("Near-duplicate detected!");
}
```

Best metric: Hamming Distance
Why: SimHash produces bit signatures, and Hamming is the fastest bit-level comparison.
## Performance Comparison
### Throughput (calculations per second, 128D vectors)
| Metric | Scalar | AVX2 | AVX-512 | Speedup |
|---|---|---|---|---|
| Cosine | 1.7M/s | 10M/s | 18M/s | 10.6x |
| Euclidean | 2.0M/s | 12.5M/s | 22M/s | 11x |
| Manhattan | 2.2M/s | 13M/s | 25M/s | 11.4x |
| Dot Product | 2.5M/s | 15M/s | 28M/s | 11.2x |
| Hamming | 20M/s | N/A | N/A | 1x (already fast) |
Tested on: Intel Xeon (3.5 GHz), 128-dimensional vectors
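To reproduce numbers like these on your own hardware, here is a minimal timing sketch using only the standard library (throughput will vary with CPU, SIMD support, and compiler flags; use a proper harness such as criterion for serious benchmarks):

```rust
use heliosdb_vector::euclidean_distance;
use std::time::Instant;

let a: Vec<f32> = (0..128).map(|i| i as f32).collect();
let b: Vec<f32> = (0..128).map(|i| (i * 2) as f32).collect();

let iterations = 1_000_000;
let start = Instant::now();
let mut acc = 0.0f32;
for _ in 0..iterations {
    // Accumulate results so the compiler cannot optimize the calls away
    acc += euclidean_distance(&a, &b);
}
let elapsed = start.elapsed();
println!(
    "{} calls in {:?} (~{:.0} ns/call, checksum {})",
    iterations,
    elapsed,
    elapsed.as_nanos() as f64 / iterations as f64,
    acc
);
```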
## Normalization
### When to Normalize
Always normalize for:
- Cosine distance
- Dot product (for similarity)
- Scale-invariant comparison
Don’t normalize for:
- Euclidean distance (if magnitude matters)
- Manhattan distance
- Hamming distance
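The reasoning behind these rules: cosine distance ignores vector magnitude, while L2 does not. A quick demonstration with the crate's helpers:

```rust
use heliosdb_vector::{cosine_distance, euclidean_distance};

let a = vec![1.0, 2.0, 3.0];
let b = vec![4.0, 5.0, 6.0];
let a_scaled: Vec<f32> = a.iter().map(|x| x * 10.0).collect();

// Scaling a vector leaves cosine distance unchanged...
assert!((cosine_distance(&a, &b) - cosine_distance(&a_scaled, &b)).abs() < 1e-5);

// ...but changes Euclidean distance, where magnitude matters
assert!(euclidean_distance(&a, &b) != euclidean_distance(&a_scaled, &b));
```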
### How to Normalize
```rust
use heliosdb_vector::normalize;

let mut vector = vec![3.0, 4.0, 0.0];
normalize(&mut vector);
// Result: [0.6, 0.8, 0.0]

// Verify: norm should be 1.0
let norm: f32 = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
assert!((norm - 1.0).abs() < 1e-6);
```

### Batch Normalization
```rust
use heliosdb_vector::normalize_batch;

let mut vectors = vec![
    vec![1.0, 2.0, 3.0],
    vec![4.0, 5.0, 6.0],
    vec![7.0, 8.0, 9.0],
];

normalize_batch(&mut vectors); // Normalizes all vectors in place
```

## Switching Metrics
### Rebuilding Index
To change metrics, rebuild the index:
```rust
// Original index with L2
let mut old_index = HnswIndex::new(16, 200, DistanceMetric::L2);

// New index with Cosine
let mut new_index = HnswIndex::new(16, 200, DistanceMetric::Cosine);

// Transfer and normalize vectors
for (key, vector) in old_index.iter() {
    let mut normalized = vector.clone();
    normalize(&mut normalized);
    new_index.insert(key, VectorData::new(dim, normalized))?;
}
```

### Testing Different Metrics
```rust
use heliosdb_vector::{euclidean_distance, cosine_distance, dot_product};

let a = vec![1.0, 2.0, 3.0];
let b = vec![4.0, 5.0, 6.0];

let l2 = euclidean_distance(&a, &b);
let cos = cosine_distance(&a, &b);
let dot = dot_product(&a, &b);

println!("L2: {}, Cosine: {}, Dot: {}", l2, cos, dot);
// Choose the metric with the best separation
```

## FAQ
Q: Can I use multiple metrics on the same index?
A: No, each HNSW index uses one metric. Create separate indices for different metrics.
Q: Which metric is fastest?
A: Dot product is slightly faster, but all metrics have similar SIMD performance. Choose based on accuracy, not speed.
Q: Should I normalize image embeddings?
A: Depends. For ResNet/VGG, L2 works well without normalization. For CLIP, normalize and use Cosine.
Q: Can I mix normalized and non-normalized vectors?
A: Not recommended. Normalize all vectors or none for consistent results.
Q: How do I know if my vectors are normalized?
A: Check if ||v|| = 1:
```rust
let norm: f32 = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
println!("Norm: {}", norm); // Should be ~1.0 if normalized
```

Q: Which metric for multilingual embeddings?
A: Cosine distance with normalized embeddings (most multilingual models output normalized vectors).
Q: Can I implement custom distance functions?
A: Yes, but they must satisfy metric properties (non-negative, symmetric, triangle inequality).
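For example, a weighted Euclidean distance with strictly positive weights satisfies all three properties (it is plain L2 after rescaling each coordinate by √w). This is an illustrative sketch only; how a custom function is registered with an index is not covered here:

```rust
/// Weighted L2 distance. With all weights > 0 this is a true metric:
/// non-negative, symmetric, and obeying the triangle inequality.
fn weighted_l2(a: &[f32], b: &[f32], weights: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .zip(weights)
        .map(|((x, y), w)| w * (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

let d = weighted_l2(&[1.0, 2.0], &[4.0, 6.0], &[1.0, 1.0]);
assert!((d - 5.0).abs() < 1e-6); // Unit weights reduce to plain L2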
## Best Practices
- Match your model: Use the metric your embedding model was trained with
- Normalize when needed: Always normalize for Cosine, usually for Dot Product
- Benchmark both: Test L2 vs Cosine on your data to see which works better
- Use SIMD: Enable AVX2/AVX-512 for 5-10x speedup
- Consider sparsity: Use Manhattan for very sparse vectors
- Test recall: Verify metric choice gives good recall on validation set
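For the last point, here is a sketch of a recall check: compare the approximate index's results against brute-force ground truth on a held-out query set. The `u64` ID lists are placeholders for whatever keys your evaluation harness produces (the guide's examples use Bytes keys):

```rust
use std::collections::HashSet;

/// Recall@k: the fraction of the true top-k neighbors that the
/// approximate search actually returned.
fn recall_at_k(approx: &[u64], exact: &[u64]) -> f64 {
    let truth: HashSet<_> = exact.iter().collect();
    let hits = approx.iter().filter(|id| truth.contains(id)).count();
    hits as f64 / exact.len() as f64
}

// 4 of the true top-5 retrieved => recall@5 = 0.8
assert_eq!(recall_at_k(&[1, 2, 3, 4, 9], &[1, 2, 3, 4, 5]), 0.8);
```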