# HNSW Performance Tuning Guide
## Overview
This guide provides detailed performance tuning strategies for achieving production-grade performance with HeliosDB vector search.
## Target Metrics
- QPS: 10,000+ queries per second (1M vectors, ef=50)
- Recall: 95%+ at Recall@10
- Latency: p50 < 5ms, p95 < 20ms, p99 < 50ms
- Build Speed: 1,000+ vectors/sec
- Memory: ~1.6 GB per 1M vectors (128D, M=16)
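The latency targets are percentile-based, so they should be checked with a nearest-rank computation over sampled query times rather than an average. A minimal, library-free sketch (plain Rust, not a HeliosDB API):

```rust
/// Nearest-rank percentile: sorts a copy of the samples and returns
/// the value at rank ceil(p/100 * n), clamped to the valid range.
fn percentile_ms(samples: &[f64], p: f64) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = sorted.len();
    let rank = ((p / 100.0) * n as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(n - 1)]
}
```

Feed it per-query `elapsed()` measurements and alert when, say, `percentile_ms(&samples, 95.0)` exceeds the 20 ms target.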
## Parameter Optimization
### M (Max Connections)
Controls graph connectivity and search quality.
| M | Recall@10 | QPS | Memory | Build Speed | Use Case |
|---|---|---|---|---|---|
| 4 | 85% | 30k | 0.7 GB | 2,500/s | Memory-constrained |
| 8 | 90% | 20k | 1.0 GB | 2,000/s | Fast queries |
| 16 | 95% | 12k | 1.6 GB | 1,200/s | Production (balanced) |
| 32 | 98% | 8k | 2.8 GB | 800/s | High accuracy |
| 64 | 99% | 5k | 5.2 GB | 500/s | Maximum recall |
Recommendation: Start with M=16, increase to 32 if recall < 95%
### ef_construction (Build Quality)
Controls graph quality during index construction.
| ef_construction | Recall@10 | Build Time | Use Case |
|---|---|---|---|
| 50 | 88% | 1.0× | Fast development |
| 100 | 92% | 1.5× | Rapid prototyping |
| 200 | 95% | 2.0× | Production (balanced) |
| 400 | 97% | 3.0× | High accuracy |
| 800 | 98% | 5.0× | Maximum quality |
Recommendation: Use 200 for production, 400 for mission-critical applications
### ef (Search Quality)
Runtime parameter controlling recall vs speed tradeoff.
| ef | Recall@10 | Latency | QPS | Use Case |
|---|---|---|---|---|
| 10 | 75% | 0.3 ms | 35k | Ultra-fast |
| 20 | 85% | 0.5 ms | 25k | Fast search |
| 50 | 95% | 0.8 ms | 15k | Production (balanced) |
| 100 | 97% | 1.2 ms | 10k | High recall |
| 200 | 98% | 2.0 ms | 6k | Maximum recall |
| 500 | 99% | 5.0 ms | 3k | Research/evaluation |
Recommendation: Start with ef=50, adjust based on recall requirements
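A practical way to follow this recommendation is to sweep ef offline on a held-out query set and keep the smallest value that reaches the recall target. A sketch, with the (ef, recall) pairs taken from the table above (the helper is illustrative, not part of the HeliosDB API):

```rust
/// Given (ef, recall) pairs measured offline, return the smallest ef
/// whose recall meets the target, or None if no setting qualifies.
fn pick_ef(measured: &[(usize, f64)], target_recall: f64) -> Option<usize> {
    measured
        .iter()
        .filter(|(_, recall)| *recall >= target_recall)
        .map(|(ef, _)| *ef)
        .min()
}
```

With a 95% target this selects ef=50, matching the production recommendation.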
## SIMD Optimization

### CPU Feature Detection
HeliosDB automatically selects the best SIMD implementation:
```rust
// Automatic selection (recommended)
let dist = euclidean_distance(&a, &b);

// AVX-512: 11x speedup over scalar
// AVX2:    6x speedup over scalar
// Scalar:  fallback for all CPUs
```

### Performance Comparison
512-dimensional vectors, 10,000 distance calculations:
| Implementation | Time | Speedup | Instructions |
|---|---|---|---|
| Scalar | 5.0 ms | 1.0× | ~500 per calc |
| AVX2 | 0.8 ms | 6.2× | ~80 per calc |
| AVX-512 | 0.45 ms | 11.1× | ~45 per calc |
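For orientation, the scalar baseline in the first row is just a sum-of-squared-differences loop; AVX2 and AVX-512 process 8 or 16 lanes of it per instruction. A minimal scalar sketch (not HeliosDB's actual `euclidean_distance`):

```rust
/// Scalar Euclidean (L2) distance. This is the loop that the
/// auto-vectorizer or explicit SIMD intrinsics accelerate.
fn euclidean_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}
```

Building with `-C target-cpu=native` can let LLVM auto-vectorize even this scalar form on CPUs with the relevant features.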
### Enabling SIMD
Linux:
```bash
# Check CPU features
cat /proc/cpuinfo | grep flags | grep avx512

# Build with native CPU features
RUSTFLAGS="-C target-cpu=native" cargo build --release
```

Docker:

```bash
# Use host CPU features
docker run --cpus=4 --cpu-shares=1024 \
  -e RUSTFLAGS="-C target-cpu=native" \
  heliosdb/vector-search
```

## Concurrent Query Optimization
### Thread Scaling
HNSW provides excellent read concurrency:
| Threads | QPS | Latency (p50) | Latency (p95) | CPU Usage |
|---|---|---|---|---|
| 1 | 12,000 | 0.8 ms | 1.5 ms | 100% |
| 2 | 23,000 | 0.9 ms | 1.8 ms | 200% |
| 4 | 44,000 | 1.0 ms | 2.2 ms | 400% |
| 8 | 80,000 | 1.2 ms | 3.0 ms | 800% |
| 16 | 140,000 | 1.5 ms | 4.5 ms | 1600% |
Scaling efficiency: ~96% at 2 threads, ~92% at 4, ~83% at 8, ~73% at 16
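Scaling efficiency here means measured throughput divided by ideal linear scaling from the single-thread baseline. A quick sketch reproducing the figures implied by the table:

```rust
/// Parallel scaling efficiency: throughput at n threads relative to
/// perfect linear scaling of the single-thread baseline.
fn scaling_efficiency(threads: u32, qps: f64, single_thread_qps: f64) -> f64 {
    qps / (threads as f64 * single_thread_qps)
}
```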
### Thread Pool Configuration
```rust
use std::sync::Arc;

use rayon::prelude::*; // for par_iter
use rayon::ThreadPoolBuilder;

// Create a custom thread pool
let pool = ThreadPoolBuilder::new()
    .num_threads(8)
    .build()
    .unwrap();

let index = Arc::new(index);

pool.install(|| {
    // Parallel query processing
    queries.par_iter().for_each(|query| {
        let results = index.search(query, 10, Some(50), None).unwrap();
        // Process results
    });
});
```

## Memory Optimization
### Memory Usage Formula
```
Total Memory = Base + (N × Vector Size) + (N × Graph Memory)
```

Where:
- Base: ~100 MB (runtime overhead)
- Vector Size: D × 4 bytes (f32)
- Graph Memory: M × 8 bytes per node (one 8-byte entry per connection)
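The formula translates directly into a sizing helper. A sketch using decimal megabytes, consistent with the worked example in this section (the function is illustrative, not a HeliosDB API):

```rust
/// Estimated index memory in decimal megabytes:
/// fixed base overhead + f32 vector storage + M graph links of
/// 8 bytes per node.
fn estimate_memory_mb(n: u64, dim: u64, m: u64) -> u64 {
    let base_mb = 100;
    let vectors_mb = n * dim * 4 / 1_000_000;
    let graph_mb = n * m * 8 / 1_000_000;
    base_mb + vectors_mb + graph_mb
}
```

For 1M 128-dimensional vectors with M=16 this returns the 740 MB estimate; actual usage runs closer to ~800 MB with allocator overhead.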
Example (1M vectors, 128D, M=16):

```
= 100 MB + (1M × 128 × 4) + (1M × 16 × 8)
= 100 MB + 512 MB + 128 MB
= 740 MB (actual: ~800 MB with overhead)
```

### Memory-Constrained Strategies
1. Reduce M parameter:
```rust
let index = HnswIndex::new(8, 200, DistanceMetric::L2); // Halves graph memory vs. M=16
```

2. Use lower precision (future):
```rust
// f16 instead of f32 (50% reduction in vector storage)
let vector = VectorData::new_f16(dim, data);
```

3. Disk-based storage (memory-mapped):
```rust
use heliosdb_vector::MmapHnswReader;

let reader = MmapHnswReader::open("index.bin")?;
// Vectors loaded on demand from disk
```

4. Distributed sharding:
```rust
// Shard across 4 nodes, each handling 25% of the data
let coordinator = DistributedCoordinator::new(4, ShardingStrategy::Hash)?;
```

## Filtered Search Optimization
### Pre-filtering vs Post-filtering
Pre-filtering (apply filter before vector search):
- Pros: Fewer distance calculations, faster
- Cons: May miss results if filter too restrictive
- Use when: Filter selectivity > 10%
```rust
let mut query = HybridQuery::new(10)
    .with_vector(embedding)
    .with_filter(filter);

query.pre_filter = true; // Default
```

Post-filtering (apply filter after vector search):
- Pros: Better recall, no missed results
- Cons: More distance calculations
- Use when: Filter selectivity < 10%
```rust
query.pre_filter = false; // Oversample, then filter
```

### Performance Impact
| Filter Selectivity | Pre-filter Latency | Post-filter Latency | Recommendation |
|---|---|---|---|
| 90% (10% filtered out) | 0.9 ms | 1.2 ms | Pre-filter |
| 50% (50% filtered out) | 1.5 ms | 2.0 ms | Pre-filter |
| 10% (90% filtered out) | 8.0 ms | 3.5 ms | Post-filter |
| 1% (99% filtered out) | 50 ms | 4.0 ms | Post-filter |
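The crossover in this table can be captured as a simple rule of thumb: pre-filter while at least ~10% of rows pass the filter, post-filter below that. A sketch of the decision (the enum and threshold are illustrative, not a HeliosDB API):

```rust
#[derive(Debug, PartialEq)]
enum FilterStrategy {
    PreFilter,
    PostFilter,
}

/// Pick a filtering strategy from selectivity (fraction of rows that
/// pass the filter); the 10% threshold comes from the table above.
fn pick_filter_strategy(selectivity: f64) -> FilterStrategy {
    if selectivity > 0.10 {
        FilterStrategy::PreFilter
    } else {
        FilterStrategy::PostFilter
    }
}
```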
## Build Optimization

### Batch Insertion
Insert vectors in batches for better cache locality:
```rust
let batch_size = 1000;

for (chunk_idx, chunk) in vectors.chunks(batch_size).enumerate() {
    for (i, vec) in chunk.iter().enumerate() {
        // Use the global index so keys stay unique across chunks
        let id = chunk_idx * batch_size + i;
        index.insert(Bytes::from(format!("v{}", id)), vec.clone())?;
    }
}
```

### Parallel Building (Future)
```rust
// Build multiple indices in parallel, then merge
use rayon::prelude::*;

let shards: Vec<HnswIndex> = data_shards
    .par_iter()
    .map(|shard| {
        let mut index = HnswIndex::new(16, 200, DistanceMetric::L2);
        for (key, vec) in shard {
            index.insert(key.clone(), vec.clone()).unwrap();
        }
        index
    })
    .collect();

// Merge shards (implementation needed)
let merged = merge_indices(shards)?;
```

### Build Performance
1M vectors, 128D, M=16, ef_construction=200:
| Strategy | Build Time | Throughput |
|---|---|---|
| Sequential | 850s | ~1,175 vectors/s |
| Batch (1k) | 820s | ~1,220 vectors/s |
| Parallel (4 shards) | 250s | ~4,000 vectors/s |
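Throughput in this table is simply vectors inserted divided by wall-clock build time. A quick check of the sequential row:

```rust
/// Build throughput in vectors per second.
fn build_throughput(num_vectors: u64, seconds: f64) -> f64 {
    num_vectors as f64 / seconds
}
```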
## Query Latency Optimization

### Latency Breakdown
For 1M vectors, 128D, M=16, ef=50:
| Phase | Latency | % Total |
|---|---|---|
| Distance calculations | 0.5 ms | 62% |
| Graph traversal | 0.2 ms | 25% |
| Result sorting | 0.1 ms | 13% |
| Total | 0.8 ms | 100% |
### Optimization Strategies
1. Reduce distance calculations:
```rust
// Lower ef parameter: ~3x faster, about -10% recall
let results = index.search(&query, k, Some(20), None)?;
```

2. Use SIMD:
```bash
# Ensure AVX2/AVX-512 code paths are compiled in
RUSTFLAGS="-C target-cpu=native" cargo build --release
```

3. Warm up cache:
```rust
// Pre-warm caches with representative queries
for _ in 0..100 {
    let _ = index.search(&warmup_query, 10, Some(50), None)?;
}
```

4. Reduce k (top-k results):
```rust
// Return fewer results
let results = index.search(&query, 5, Some(50), None)?; // Faster than k=100
```

## Recall Optimization
### Recall@10 by Configuration
| M | ef_construction | ef | Recall@10 | QPS |
|---|---|---|---|---|
| 8 | 100 | 20 | 82% | 25k |
| 16 | 200 | 50 | 95% | 12k |
| 32 | 400 | 100 | 98% | 7k |
| 64 | 800 | 200 | 99% | 4k |
### Strategies for >99% Recall
1. Increase build quality:
```rust
let index = HnswIndex::new(64, 800, DistanceMetric::L2);
```

2. Increase search width:
```rust
let results = index.search(&query, k, Some(500), None)?;
```

3. Use exact search for small datasets (<10k vectors):
```rust
let flat_index = FlatVectorIndex::new(); // 100% recall, slower per query
```

4. Verify vector normalization (for Cosine):
```rust
normalize(&mut vector);
assert!((vector.iter().map(|x| x * x).sum::<f32>().sqrt() - 1.0).abs() < 1e-6);
```

## Hybrid Search Optimization
### Score Fusion Tuning
```rust
// Text-heavy queries (e-commerce, documents)
let fusion = FusionStrategy::Weighted {
    vector_weight: 0.4,
    text_weight: 0.5,
    metadata_weight: 0.1,
};

// Vector-heavy queries (image search, embeddings)
let fusion = FusionStrategy::Weighted {
    vector_weight: 0.8,
    text_weight: 0.1,
    metadata_weight: 0.1,
};

// Balanced (recommendation: A/B test)
let fusion = FusionStrategy::Weighted {
    vector_weight: 0.6,
    text_weight: 0.3,
    metadata_weight: 0.1,
};
```

### Reranking Strategy
```rust
// Two-stage retrieval: fast approximate search, then accurate reranking
let query = HybridQuery::new(10)
    .with_vector(embedding)
    .with_rerank(true); // Fetch 100, rerank, return top 10

// Performance: ~1.5x slower, +2-3% recall
```

## Production Monitoring
### Key Metrics
```rust
use std::time::{Duration, Instant};

// QPS tracking
let start = Instant::now();
let results = index.search(&query, k, Some(50), None)?;
let latency = start.elapsed();

// Log metrics
metrics.record_latency(latency);
metrics.increment_queries();

// Alert thresholds
if latency > Duration::from_millis(50) {
    log::warn!("High latency: {:?}", latency);
}
```

### Index Statistics
```rust
let stats = index.statistics()?;

// Monitor these metrics
assert!(stats.avg_degree_layer0 >= 10.0);     // Graph well connected
assert!(stats.total_edges > stats.num_nodes); // Sufficient edges
assert!(stats.num_layers >= 4);               // Multi-layer structure

// Memory alert
if stats.memory_usage_bytes > 10 * 1024 * 1024 * 1024 { // 10 GB
    log::warn!(
        "High memory usage: {} GB",
        stats.memory_usage_bytes / 1_073_741_824
    );
}
```

## Benchmarking
### Run Comprehensive Benchmarks
```bash
# All benchmarks
cargo bench --bench vector_benchmarks

# Specific benchmark groups
cargo bench --bench vector_benchmarks -- distance_metrics
cargo bench --bench vector_benchmarks -- hnsw_query
cargo bench --bench vector_benchmarks -- concurrent_queries

# With visualization
cargo install cargo-criterion
cargo criterion
```

### Custom Benchmark
```rust
use criterion::{black_box, Criterion};

fn benchmark_custom(c: &mut Criterion) {
    let index = build_test_index();

    c.bench_function("custom_workload", |b| {
        b.iter(|| {
            let results = index
                .search(
                    black_box(&query),
                    black_box(10),
                    black_box(Some(50)),
                    black_box(None),
                )
                .unwrap();
            black_box(results);
        });
    });
}
```

## Troubleshooting Performance Issues
### Issue: Low QPS
Symptoms: <5,000 QPS on modern CPU
Checklist:
- Verify SIMD enabled (`RUSTFLAGS="-C target-cpu=native"`)
- Check CPU frequency (not throttled)
- Reduce `ef` parameter (200 → 50)
- Reduce `M` parameter (32 → 16)
- Verify no disk I/O (index in memory)
- Check for lock contention (profiling)
Profiling:
```bash
cargo install flamegraph
sudo flamegraph -- cargo bench --bench vector_benchmarks -- hnsw_query
# Open flamegraph.svg
```

### Issue: High Latency
Symptoms: p95 > 100ms
Checklist:
- Reduce `ef` (100 → 50 → 20)
- Check index size (>1M vectors may need sharding)
- Verify no memory swapping
- Check GC pauses (if using a managed-language wrapper)
- Profile hot paths
### Issue: Low Recall
Symptoms: Recall@10 < 90%
Checklist:
- Increase `ef` (50 → 100 → 200)
- Increase `ef_construction` (200 → 400)
- Increase `M` (16 → 32)
- Verify the correct distance metric is configured
- Check vector normalization (for Cosine)
- Validate test data quality
### Issue: High Memory Usage
Symptoms: >3 GB for 1M vectors (128D)
Checklist:
- Check the `M` parameter (should be ≤32)
- Verify no memory leaks
- Check for duplicate indices
- Consider memory-mapped storage
- Profile memory allocations
## Advanced Tuning

### Layer Distribution
Optimal layer distribution for exponential decay:
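With the standard HNSW level multiplier mL = 1/ln(M), a node lands at layer l or above with probability (1/M)^l, which is where the percentages in this section come from. A quick check for M=16 (illustrative helper, not a HeliosDB API):

```rust
/// Expected fraction of nodes whose maximum layer is >= `layer` when
/// each node is promoted one layer with probability 1/m.
fn expected_layer_fraction(m: u32, layer: u32) -> f64 {
    (1.0 / m as f64).powi(layer as i32)
}
```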
```rust
let stats = index.statistics()?;

// Expected distribution (M=16):
// Layer 1: ~6.25%
// Layer 2: ~0.39%
// Layer 3: ~0.024%
// Layer 4: ~0.0015%

// Verify exponential decay
for i in 1..stats.layer_distribution.len() {
    let ratio = stats.layer_distribution[i] as f64
        / stats.layer_distribution[i - 1] as f64;
    assert!(ratio > 0.01 && ratio < 0.2, "Unexpected layer distribution");
}
```

### Query-Specific ef
Adjust ef based on query difficulty:
```rust
// Easy query (high confidence)
let results = index.search(&query, k, Some(20), None)?;

// Hard query (ambiguous)
let results = index.search(&query, k, Some(200), None)?;

// Adaptive (based on initial results)
let initial = index.search(&query, k, Some(20), None)?;
let ef = if initial[0].1 > 0.9 { 100 } else { 20 }; // High distance = harder query, so widen the search
let final_results = index.search(&query, k, Some(ef), None)?;
```

## Summary
Production Configuration (1M vectors, 128D):
Production Configuration (1M vectors, 128D):

```rust
let index = HnswIndex::new(
    16,                     // M: balanced
    200,                    // ef_construction: good quality
    DistanceMetric::Cosine, // normalized embeddings
);

let results = index.search(
    &query,
    10,       // k
    Some(50), // ef: 95%+ recall
    None,     // filter
)?;

// Expected: 10-15k QPS, 95%+ recall, <5ms p95 latency
```

Key Takeaways:
- Start with defaults (M=16, ef_construction=200, ef=50)
- Enable SIMD for 5-10x speedup
- Use pre-filtering for high selectivity (>10%)
- Monitor recall, QPS, and latency continuously
- A/B test parameter changes in production