# HNSW Performance Tuning Guide
## Overview
This guide provides detailed performance tuning strategies for achieving production-grade performance with HeliosDB vector search.
## Target Metrics
- QPS: 10,000+ queries per second (1M vectors, ef=50)
- Recall: 95%+ at Recall@10
- Latency: p50 < 5ms, p95 < 20ms, p99 < 50ms
- Build Speed: 1,000+ vectors/sec
- Memory: ~1.6 GB per 1M vectors (128D, M=16)
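The latency targets are percentile-based, so they should be checked with a nearest-rank computation over sampled query times rather than an average. A minimal, library-free sketch (plain Rust, not a HeliosDB API):

```rust
/// Nearest-rank percentile: sorts a copy of the samples and returns
/// the value at rank ceil(p/100 * n), clamped to the valid range.
fn percentile_ms(samples: &[f64], p: f64) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = sorted.len();
    let rank = ((p / 100.0) * n as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(n - 1)]
}
```

Feed it per-query `elapsed()` measurements and alert when, say, `percentile_ms(&samples, 95.0)` exceeds the 20 ms target.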
## Parameter Optimization
### M (Max Connections)
Controls graph connectivity and search quality.
| M | Recall@10 | QPS | Memory | Build Speed | Use Case |
|---|---|---|---|---|---|
| 4 | 85% | 30k | 0.7 GB | 2,500/s | Memory-constrained |
| 8 | 90% | 20k | 1.0 GB | 2,000/s | Fast queries |
| 16 | 95% | 12k | 1.6 GB | 1,200/s | Production (balanced) |
| 32 | 98% | 8k | 2.8 GB | 800/s | High accuracy |
| 64 | 99% | 5k | 5.2 GB | 500/s | Maximum recall |
Recommendation: Start with M=16, increase to 32 if recall < 95%
### ef_construction (Build Quality)
Controls graph quality during index construction.
| ef_construction | Recall@10 | Build Time | Use Case |
|---|---|---|---|
| 50 | 88% | 1.0× | Fast development |
| 100 | 92% | 1.5× | Rapid prototyping |
| 200 | 95% | 2.0× | Production (balanced) |
| 400 | 97% | 3.0× | High accuracy |
| 800 | 98% | 5.0× | Maximum quality |
Recommendation: Use 200 for production, 400 for mission-critical applications
### ef (Search Quality)
Runtime parameter controlling recall vs speed tradeoff.
| ef | Recall@10 | Latency | QPS | Use Case |
|---|---|---|---|---|
| 10 | 75% | 0.3 ms | 35k | Ultra-fast |
| 20 | 85% | 0.5 ms | 25k | Fast search |
| 50 | 95% | 0.8 ms | 15k | Production (balanced) |
| 100 | 97% | 1.2 ms | 10k | High recall |
| 200 | 98% | 2.0 ms | 6k | Maximum recall |
| 500 | 99% | 5.0 ms | 3k | Research/evaluation |
Recommendation: Start with ef=50, adjust based on recall requirements
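A practical way to follow this recommendation is to sweep ef offline on a held-out query set and keep the smallest value that reaches the recall target. A sketch, with the (ef, recall) pairs taken from the table above (the helper is illustrative, not part of the HeliosDB API):

```rust
/// Given (ef, recall) pairs measured offline, return the smallest ef
/// whose recall meets the target, or None if no setting qualifies.
fn pick_ef(measured: &[(usize, f64)], target_recall: f64) -> Option<usize> {
    measured
        .iter()
        .filter(|(_, recall)| *recall >= target_recall)
        .map(|(ef, _)| *ef)
        .min()
}
```

With a 95% target this selects ef=50, matching the production recommendation.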
## SIMD Optimization

### CPU Feature Detection
HeliosDB automatically selects the best SIMD implementation:
```rust
// Automatic selection (recommended)
let dist = euclidean_distance(&a, &b);

// AVX-512: 11x speedup over scalar
// AVX2:    6x speedup over scalar
// Scalar:  fallback for all CPUs
```

### Performance Comparison
512-dimensional vectors, 10,000 distance calculations:
| Implementation | Time | Speedup | Instructions |
|---|---|---|---|
| Scalar | 5.0 ms | 1.0× | ~500 per calc |
| AVX2 | 0.8 ms | 6.2× | ~80 per calc |
| AVX-512 | 0.45 ms | 11.1× | ~45 per calc |
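For orientation, the scalar baseline in the first row is just a sum-of-squared-differences loop; AVX2 and AVX-512 process 8 or 16 lanes of it per instruction. A minimal scalar sketch (not HeliosDB's actual `euclidean_distance`):

```rust
/// Scalar Euclidean (L2) distance. This is the loop that the
/// auto-vectorizer or explicit SIMD intrinsics accelerate.
fn euclidean_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}
```

Building with `-C target-cpu=native` can let LLVM auto-vectorize even this scalar form on CPUs with the relevant features.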
### Enabling SIMD
Linux:
```bash
# Check CPU features
cat /proc/cpuinfo | grep flags | grep avx512

# Build with native CPU features
RUSTFLAGS="-C target-cpu=native" cargo build --release
```

Docker:

```bash
# Use host CPU features
docker run --cpus=4 --cpu-shares=1024 \
  -e RUSTFLAGS="-C target-cpu=native" \
  heliosdb/vector-search
```

## Concurrent Query Optimization
### Thread Scaling
HNSW provides excellent read concurrency:
| Threads | QPS | Latency (p50) | Latency (p95) | CPU Usage |
|---|---|---|---|---|
| 1 | 12,000 | 0.8 ms | 1.5 ms | 100% |
| 2 | 23,000 | 0.9 ms | 1.8 ms | 200% |
| 4 | 44,000 | 1.0 ms | 2.2 ms | 400% |
| 8 | 80,000 | 1.2 ms | 3.0 ms | 800% |
| 16 | 140,000 | 1.5 ms | 4.5 ms | 1600% |
Scaling efficiency: ~96% at 2 threads, ~92% at 4, ~83% at 8, ~73% at 16
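Scaling efficiency here means measured throughput divided by ideal linear scaling from the single-thread baseline. A quick sketch reproducing the figures implied by the table:

```rust
/// Parallel scaling efficiency: throughput at n threads relative to
/// perfect linear scaling of the single-thread baseline.
fn scaling_efficiency(threads: u32, qps: f64, single_thread_qps: f64) -> f64 {
    qps / (threads as f64 * single_thread_qps)
}
```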
### Thread Pool Configuration
```rust
use std::sync::Arc;

use rayon::prelude::*; // for par_iter
use rayon::ThreadPoolBuilder;

// Create a custom thread pool
let pool = ThreadPoolBuilder::new()
    .num_threads(8)
    .build()
    .unwrap();

let index = Arc::new(index);

pool.install(|| {
    // Parallel query processing
    queries.par_iter().for_each(|query| {
        let results = index.search(query, 10, Some(50), None).unwrap();
        // Process results
    });
});
```

## Memory Optimization
### Memory Usage Formula
```
Total Memory = Base + (N × Vector Size) + (N × Graph Memory)
```

Where:
- Base: ~100 MB (runtime overhead)
- Vector Size: D × 4 bytes (f32)
- Graph Memory: M × 8 bytes per node (one 8-byte entry per connection)
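The formula translates directly into a sizing helper. A sketch using decimal megabytes, consistent with the worked example in this section (the function is illustrative, not a HeliosDB API):

```rust
/// Estimated index memory in decimal megabytes:
/// fixed base overhead + f32 vector storage + M graph links of
/// 8 bytes per node.
fn estimate_memory_mb(n: u64, dim: u64, m: u64) -> u64 {
    let base_mb = 100;
    let vectors_mb = n * dim * 4 / 1_000_000;
    let graph_mb = n * m * 8 / 1_000_000;
    base_mb + vectors_mb + graph_mb
}
```

For 1M 128-dimensional vectors with M=16 this returns the 740 MB estimate; actual usage runs closer to ~800 MB with allocator overhead.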
Example (1M vectors, 128D, M=16):

```
= 100 MB + (1M × 128 × 4) + (1M × 16 × 8)
= 100 MB + 512 MB + 128 MB
= 740 MB (actual: ~800 MB with overhead)
```

### Memory-Constrained Strategies
1. Reduce M parameter:
```rust
let index = HnswIndex::new(8, 200, DistanceMetric::L2); // Halves graph memory vs. M=16
```

2. Use lower precision (future):
```rust
// f16 instead of f32 (50% reduction in vector storage)
let vector = VectorData::new_f16(dim, data);
```

3. Disk-based storage (memory-mapped):
```rust
use heliosdb_vector::MmapHnswReader;

let reader = MmapHnswReader::open("index.bin")?;
// Vectors loaded on demand from disk
```

4. Distributed sharding:
```rust
// Shard across 4 nodes, each handling 25% of the data
let coordinator = DistributedCoordinator::new(4, ShardingStrategy::Hash)?;
```

## Filtered Search Optimization
### Pre-filtering vs Post-filtering
Pre-filtering (apply filter before vector search):
- Pros: Fewer distance calculations, faster
- Cons: May miss results if filter too restrictive
- Use when: Filter selectivity > 10%
```rust
let mut query = HybridQuery::new(10)
    .with_vector(embedding)
    .with_filter(filter);

query.pre_filter = true; // Default
```

Post-filtering (apply filter after vector search):
- Pros: Better recall, no missed results
- Cons: More distance calculations
- Use when: Filter selectivity < 10%
```rust
query.pre_filter = false; // Oversample, then filter
```

### Performance Impact
| Filter Selectivity | Pre-filter Latency | Post-filter Latency | Recommendation |
|---|---|---|---|
| 90% (10% filtered out) | 0.9 ms | 1.2 ms | Pre-filter |
| 50% (50% filtered out) | 1.5 ms | 2.0 ms | Pre-filter |
| 10% (90% filtered out) | 8.0 ms | 3.5 ms | Post-filter |
| 1% (99% filtered out) | 50 ms | 4.0 ms | Post-filter |
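The crossover in this table can be captured as a simple rule of thumb: pre-filter while at least ~10% of rows pass the filter, post-filter below that. A sketch of the decision (the enum and threshold are illustrative, not a HeliosDB API):

```rust
#[derive(Debug, PartialEq)]
enum FilterStrategy {
    PreFilter,
    PostFilter,
}

/// Pick a filtering strategy from selectivity (fraction of rows that
/// pass the filter); the 10% threshold comes from the table above.
fn pick_filter_strategy(selectivity: f64) -> FilterStrategy {
    if selectivity > 0.10 {
        FilterStrategy::PreFilter
    } else {
        FilterStrategy::PostFilter
    }
}
```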
## Build Optimization

### Batch Insertion
Insert vectors in batches for better cache locality:
```rust
let batch_size = 1000;

for (chunk_idx, chunk) in vectors.chunks(batch_size).enumerate() {
    for (i, vec) in chunk.iter().enumerate() {
        // Use the global index so keys stay unique across chunks
        let id = chunk_idx * batch_size + i;
        index.insert(Bytes::from(format!("v{}", id)), vec.clone())?;
    }
}
```

### Parallel Building (Future)
```rust
// Build multiple indices in parallel, then merge
use rayon::prelude::*;

let shards: Vec<HnswIndex> = data_shards
    .par_iter()
    .map(|shard| {
        let mut index = HnswIndex::new(16, 200, DistanceMetric::L2);
        for (key, vec) in shard {
            index.insert(key.clone(), vec.clone()).unwrap();
        }
        index
    })
    .collect();

// Merge shards (implementation needed)
let merged = merge_indices(shards)?;
```

### Build Performance
1M vectors, 128D, M=16, ef_construction=200:
| Strategy | Build Time | Throughput |
|---|---|---|
| Sequential | 850s | ~1,175 vectors/s |
| Batch (1k) | 820s | ~1,220 vectors/s |
| Parallel (4 shards) | 250s | ~4,000 vectors/s |
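Throughput in this table is simply vectors inserted divided by wall-clock build time. A quick check of the sequential row:

```rust
/// Build throughput in vectors per second.
fn build_throughput(num_vectors: u64, seconds: f64) -> f64 {
    num_vectors as f64 / seconds
}
```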
## Query Latency Optimization

### Latency Breakdown
For 1M vectors, 128D, M=16, ef=50:
| Phase | Latency | % Total |
|---|---|---|
| Distance calculations | 0.5 ms | 62% |
| Graph traversal | 0.2 ms | 25% |
| Result sorting | 0.1 ms | 13% |
| Total | 0.8 ms | 100% |
### Optimization Strategies
1. Reduce distance calculations:
```rust
// Lower ef parameter: ~3x faster, about -10% recall
let results = index.search(&query, k, Some(20), None)?;
```

2. Use SIMD:
```bash
# Ensure AVX2/AVX-512 code paths are compiled in
RUSTFLAGS="-C target-cpu=native" cargo build --release
```

3. Warm up cache:
```rust
// Pre-warm caches with representative queries
for _ in 0..100 {
    let _ = index.search(&warmup_query, 10, Some(50), None)?;
}
```

4. Reduce k (top-k results):
```rust
// Return fewer results
let results = index.search(&query, 5, Some(50), None)?; // Faster than k=100
```

## Recall Optimization
### Recall@10 by Configuration
| M | ef_construction | ef | Recall@10 | QPS |
|---|---|---|---|---|
| 8 | 100 | 20 | 82% | 25k |
| 16 | 200 | 50 | 95% | 12k |
| 32 | 400 | 100 | 98% | 7k |
| 64 | 800 | 200 | 99% | 4k |
### Strategies for >99% Recall
1. Increase build quality:
```rust
let index = HnswIndex::new(64, 800, DistanceMetric::L2);
```

2. Increase search width:
```rust
let results = index.search(&query, k, Some(500), None)?;
```

3. Use exact search for small datasets (<10k vectors):
```rust
let flat_index = FlatVectorIndex::new(); // 100% recall, slower per query
```

4. Verify vector normalization (for Cosine):
```rust
normalize(&mut vector);
assert!((vector.iter().map(|x| x * x).sum::<f32>().sqrt() - 1.0).abs() < 1e-6);
```

## Hybrid Search Optimization
### Score Fusion Tuning
```rust
// Text-heavy queries (e-commerce, documents)
let fusion = FusionStrategy::Weighted {
    vector_weight: 0.4,
    text_weight: 0.5,
    metadata_weight: 0.1,
};

// Vector-heavy queries (image search, embeddings)
let fusion = FusionStrategy::Weighted {
    vector_weight: 0.8,
    text_weight: 0.1,
    metadata_weight: 0.1,
};

// Balanced (recommendation: A/B test)
let fusion = FusionStrategy::Weighted {
    vector_weight: 0.6,
    text_weight: 0.3,
    metadata_weight: 0.1,
};
```

### Reranking Strategy
```rust
// Two-stage retrieval: fast approximate search, then accurate reranking
let query = HybridQuery::new(10)
    .with_vector(embedding)
    .with_rerank(true); // Fetch 100, rerank, return top 10

// Performance: ~1.5x slower, +2-3% recall
```

## Production Monitoring
### Key Metrics
```rust
use std::time::{Duration, Instant};

// QPS tracking
let start = Instant::now();
let results = index.search(&query, k, Some(50), None)?;
let latency = start.elapsed();

// Log metrics
metrics.record_latency(latency);
metrics.increment_queries();

// Alert thresholds
if latency > Duration::from_millis(50) {
    log::warn!("High latency: {:?}", latency);
}
```

### Index Statistics
```rust
let stats = index.statistics()?;

// Monitor these metrics
assert!(stats.avg_degree_layer0 >= 10.0);     // Graph well connected
assert!(stats.total_edges > stats.num_nodes); // Sufficient edges
assert!(stats.num_layers >= 4);               // Multi-layer structure

// Memory alert
if stats.memory_usage_bytes > 10 * 1024 * 1024 * 1024 { // 10 GB
    log::warn!(
        "High memory usage: {} GB",
        stats.memory_usage_bytes / 1_073_741_824
    );
}
```

## Benchmarking
### Run Comprehensive Benchmarks
```bash
# All benchmarks
cargo bench --bench vector_benchmarks

# Specific benchmark groups
cargo bench --bench vector_benchmarks -- distance_metrics
cargo bench --bench vector_benchmarks -- hnsw_query
cargo bench --bench vector_benchmarks -- concurrent_queries

# With visualization
cargo install cargo-criterion
cargo criterion
```

### Custom Benchmark
```rust
use criterion::{black_box, Criterion};

fn benchmark_custom(c: &mut Criterion) {
    let index = build_test_index();

    c.bench_function("custom_workload", |b| {
        b.iter(|| {
            let results = index
                .search(
                    black_box(&query),
                    black_box(10),
                    black_box(Some(50)),
                    black_box(None),
                )
                .unwrap();
            black_box(results);
        });
    });
}
```

## Troubleshooting Performance Issues
### Issue: Low QPS
Symptoms: <5,000 QPS on modern CPU
Checklist:
- Verify SIMD enabled (`RUSTFLAGS="-C target-cpu=native"`)
- Check CPU frequency (not throttled)
- Reduce `ef` parameter (200 → 50)
- Reduce `M` parameter (32 → 16)
- Verify no disk I/O (index in memory)
- Check for lock contention (profiling)
Profiling:
```bash
cargo install flamegraph
sudo flamegraph -- cargo bench --bench vector_benchmarks -- hnsw_query
# Open flamegraph.svg
```

### Issue: High Latency
Symptoms: p95 > 100ms
Checklist:
- Reduce `ef` (100 → 50 → 20)
- Check index size (>1M vectors may need sharding)
- Verify no memory swapping
- Check GC pauses (if using a managed-language wrapper)
- Profile hot paths
### Issue: Low Recall
Symptoms: Recall@10 < 90%
Checklist:
- Increase `ef` (50 → 100 → 200)
- Increase `ef_construction` (200 → 400)
- Increase `M` (16 → 32)
- Verify the correct distance metric is configured
- Check vector normalization (for Cosine)
- Validate test data quality
### Issue: High Memory Usage
Symptoms: >3 GB for 1M vectors (128D)
Checklist:
- Check the `M` parameter (should be ≤32)
- Verify no memory leaks
- Check for duplicate indices
- Consider memory-mapped storage
- Profile memory allocations
## Advanced Tuning

### Layer Distribution
Optimal layer distribution for exponential decay:
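With the standard HNSW level multiplier mL = 1/ln(M), a node lands at layer l or above with probability (1/M)^l, which is where the percentages in this section come from. A quick check for M=16 (illustrative helper, not a HeliosDB API):

```rust
/// Expected fraction of nodes whose maximum layer is >= `layer` when
/// each node is promoted one layer with probability 1/m.
fn expected_layer_fraction(m: u32, layer: u32) -> f64 {
    (1.0 / m as f64).powi(layer as i32)
}
```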
```rust
let stats = index.statistics()?;

// Expected distribution (M=16):
// Layer 1: ~6.25%
// Layer 2: ~0.39%
// Layer 3: ~0.024%
// Layer 4: ~0.0015%

// Verify exponential decay
for i in 1..stats.layer_distribution.len() {
    let ratio = stats.layer_distribution[i] as f64
        / stats.layer_distribution[i - 1] as f64;
    assert!(ratio > 0.01 && ratio < 0.2, "Unexpected layer distribution");
}
```

### Query-Specific ef
Adjust ef based on query difficulty:
```rust
// Easy query (high confidence)
let results = index.search(&query, k, Some(20), None)?;

// Hard query (ambiguous)
let results = index.search(&query, k, Some(200), None)?;

// Adaptive (based on initial results)
let initial = index.search(&query, k, Some(20), None)?;
let ef = if initial[0].1 > 0.9 { 100 } else { 20 }; // High distance = harder query, so widen the search
let final_results = index.search(&query, k, Some(ef), None)?;
```

## Summary
Production Configuration (1M vectors, 128D):
Production Configuration (1M vectors, 128D):

```rust
let index = HnswIndex::new(
    16,                     // M: balanced
    200,                    // ef_construction: good quality
    DistanceMetric::Cosine, // normalized embeddings
);

let results = index.search(
    &query,
    10,       // k
    Some(50), // ef: 95%+ recall
    None,     // filter
)?;

// Expected: 10-15k QPS, 95%+ recall, <5ms p95 latency
```

Key Takeaways:
- Start with defaults (M=16, ef_construction=200, ef=50)
- Enable SIMD for 5-10x speedup
- Use pre-filtering for high selectivity (>10%)
- Monitor recall, QPS, and latency continuously
- A/B test parameter changes in production