Helios-DistribCache Performance Tutorial

This guide demonstrates optimal workloads and use cases for the multi-tier distributed cache in HeliosProxy. Learn when each cache tier provides maximum benefit and how to configure for specific workloads.

Prerequisites

  • HeliosDB cluster running
  • heliosdb-proxy binary compiled with --features distribcache
  • (Optional) Ollama running for L3 semantic cache

Quick Start

# Run the interactive tutorial
./docs/tutorials/distribcache-tutorial.sh

Cache Architecture Overview

HeliosProxy implements a three-tier cache system optimized for different access patterns:

┌─────────────────────────────────────────────────────────────────────┐
│ Client Request │
└───────────────────────────────┬─────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ L1: Hot Cache (Per-Connection) │
│ ├─ Latency: ~0.1ms │
│ ├─ Size: 500-2000 entries per connection │
│ ├─ TTL: 30-60 seconds │
│ └─ Best for: Repeated queries within single session │
└───────────────────────────────┬─────────────────────────────────────┘
│ Miss
┌─────────────────────────────────────────────────────────────────────┐
│ L2: Warm Cache (Shared Across Connections) │
│ ├─ Latency: ~1ms │
│ ├─ Size: 256MB - 4GB │
│ ├─ TTL: 5-10 minutes │
│ ├─ Query normalization: SELECT * FROM t WHERE id=$1 │
│ └─ Best for: Parameterized queries across all clients │
└───────────────────────────────┬─────────────────────────────────────┘
│ Miss
┌─────────────────────────────────────────────────────────────────────┐
│ L3: Semantic Cache (AI/Embedding-Based) │
│ ├─ Latency: ~5-10ms │
│ ├─ Size: 5,000-50,000 entries │
│ ├─ TTL: 1-24 hours │
│ ├─ Similarity threshold: 0.90-0.95 │
│ └─ Best for: NLP queries, AI agents, similar questions │
└───────────────────────────────┬─────────────────────────────────────┘
│ Miss
┌─────────────────────────────────────────────────────────────────────┐
│ HeliosDB │
│ (~40ms query latency) │
└─────────────────────────────────────────────────────────────────────┘
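The lookup order in the diagram can be sketched as a simple cascade. This is a minimal illustration and not HeliosProxy's implementation: `Tier`, `lookup`, and the `query_db` callback are hypothetical names, each tier is modeled as a plain dictionary without TTL or eviction, and promoting a hit into the faster tiers is an assumption. Only exact-key tiers (L1 and L2) are modeled here, since L3 matches by similarity rather than by key.

```python
from typing import Callable, Dict, List, Optional, Tuple


class Tier:
    """Hypothetical stand-in for one exact-match cache tier (no TTL, no eviction)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entries: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self.entries.get(key)

    def put(self, key: str, value: str) -> None:
        self.entries[key] = value


def lookup(key: str, tiers: List[Tier], query_db: Callable[[str], str]) -> Tuple[str, str]:
    """Check tiers in order; on a full miss, query the backend and fill every tier."""
    for i, tier in enumerate(tiers):
        value = tier.get(key)
        if value is not None:
            # Assumed policy: copy the entry into the faster tiers that missed.
            for faster in tiers[:i]:
                faster.put(key, value)
            return value, tier.name
    value = query_db(key)
    for tier in tiers:
        tier.put(key, value)
    return value, "db"
```

The second element of the returned tuple names the tier that served the request, which is convenient when reasoning about per-tier hit rates.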

L1 Cache: Per-Connection Hot Data

When L1 Excels

L1 cache provides maximum benefit when a single connection repeatedly executes identical queries.

| Workload | L1 Benefit | Example |
|---|---|---|
| User dashboard refresh | HIGH | Same user refreshing their stats |
| Pagination (same page) | HIGH | User viewing page 1 multiple times |
| Auto-complete as-you-type | MEDIUM | Suggestions update on each keystroke |
| Polling for updates | HIGH | “Check for new messages” every 5s |

L1 Configuration

[cache.l1]
enabled = true
size = 1000 # Entries per connection
ttl_secs = 60 # 1 minute TTL

L1 Demo: Dashboard Polling

This workload simulates a user dashboard that polls every 2 seconds:

-- Dashboard query (user 12345)
SELECT
notification_count,
unread_messages,
last_activity
FROM user_dashboard
WHERE user_id = 12345;

Without L1 cache:

  • Every 2 seconds: 40ms database query
  • 30 queries/minute = 1,200ms/minute spent on DB

With L1 cache (30s TTL):

  • First query: 40ms (cache miss)
  • Next 14 queries: 0.1ms each (cache hit)
  • 2 queries/minute to DB = 80ms/minute
  • 93% reduction in database load
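The arithmetic above can be reproduced with a small helper. This is a sketch under simplifying assumptions: one connection, one query, a fixed polling interval, and an entry that expires exactly at the TTL boundary. The function name `db_time_per_minute` is illustrative only.

```python
import math


def db_time_per_minute(poll_interval_s: float, ttl_s: float, db_latency_ms: float) -> float:
    """DB time (ms) per minute for one connection polling a single query."""
    polls_per_minute = 60 / poll_interval_s
    if ttl_s <= 0:
        # No cache: every poll reaches the database.
        misses_per_minute = polls_per_minute
    else:
        # One miss per TTL window; the remaining polls in that window are hits.
        polls_per_window = max(1, math.floor(ttl_s / poll_interval_s))
        misses_per_minute = polls_per_minute / polls_per_window
    return misses_per_minute * db_latency_ms


without_cache = db_time_per_minute(2, 0, 40)   # 1200.0 ms per minute
with_l1 = db_time_per_minute(2, 30, 40)        # 80.0 ms per minute
reduction = 1 - with_l1 / without_cache        # roughly 0.93
```

Changing `ttl_s` or `poll_interval_s` shows how quickly the benefit shrinks once the TTL approaches the polling interval.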

L1 Benchmark Results

From our scalability tests:

Query: SELECT balance FROM accounts WHERE id = 1
Clients: 10, Duration: 30s
Without cache: 242 TPS, 41.3ms latency
With L1 (repeated): 242 TPS, 41.3ms latency (CPU-bound, not I/O bound)
Key insight: L1 benefits visible when backend latency is high (network/disk)

L2 Cache: Shared Parameterized Queries

When L2 Excels

L2 cache shines when different clients execute the same query pattern with different parameters.

| Workload | L2 Benefit | Example |
|---|---|---|
| Product catalog lookups | VERY HIGH | Different users viewing same products |
| API endpoints | HIGH | /api/user/:id with popular IDs |
| Configuration reads | VERY HIGH | Same settings queried by all workers |
| Reference data | VERY HIGH | Countries, currencies, categories |

Query Normalization

L2 cache normalizes parameterized queries:

-- These all share the same normalized template (each parameter value still gets its own cached result):
SELECT * FROM products WHERE id = 123
SELECT * FROM products WHERE id = 456
SELECT * FROM products WHERE id = 789
-- Normalized form:
SELECT * FROM products WHERE id = $1
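One way to picture normalization is a literal-stripping pass that returns both the template and the extracted values. The sketch below is an assumption about the general technique, not HeliosProxy's parser: `normalize` and `cache_key` are hypothetical names, the regex handles only simple numeric and single-quoted literals, and combining the template with the parameter values into the key is inferred from the per-product misses described in this guide.

```python
import re
from typing import List, Tuple

# Single-quoted string literals, or numeric literals that are not part of
# an identifier or an existing $n placeholder.
_LITERAL = re.compile(r"'(?:[^']|'')*'|(?<![\w$])\d+(?:\.\d+)?(?!\w)")


def normalize(sql: str) -> Tuple[str, List[str]]:
    """Replace literals with $1, $2, ... and return (template, extracted values)."""
    params: List[str] = []

    def _sub(match: re.Match) -> str:
        params.append(match.group(0))
        return f"${len(params)}"

    template = _LITERAL.sub(_sub, sql)
    # Collapse whitespace so formatting differences do not change the template.
    return " ".join(template.split()), params


def cache_key(sql: str) -> Tuple[str, Tuple[str, ...]]:
    """Hypothetical key: normalized template plus the bound parameter values."""
    template, params = normalize(sql)
    return template, tuple(params)
```

With this scheme, `id = 123` and `id = 456` share one template (useful for statistics and plan reuse) while still resolving to different cached results.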

L2 Configuration

[cache.l2]
enabled = true
size_mb = 512 # 512MB shared cache
ttl_secs = 300 # 5 minutes
normalize_queries = true
storage = "memory" # or "mmap" for persistence

L2 Demo: Product Catalog

E-commerce scenario: 80% of traffic views 20% of products (Pareto distribution).

-- Popular product lookup (normalized)
SELECT name, price, description, stock
FROM products
WHERE id = $1;

Traffic pattern:

  • 1000 unique products
  • 80% of requests hit top 200 products
  • 500 concurrent users

Without L2 cache:

  • Every product view: 40ms query
  • 500 requests/sec = 20,000ms/sec of DB time

With L2 cache:

  • First view of each popular product: 40ms (miss)
  • Subsequent views: 1ms (hit)
  • After warmup: ~80% overall hit rate (the share of requests that target the popular products)
  • Effective latency: 8.8ms (0.8 * 1ms + 0.2 * 40ms)
  • 78% latency reduction
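The effective-latency figure is a weighted average of hit and miss latency. A minimal sketch, with `effective_latency_ms` as an illustrative helper name:

```python
def effective_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Expected latency when a fraction `hit_rate` of requests is served from cache."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms


latency = effective_latency_ms(0.8, 1.0, 40.0)   # about 8.8 ms
reduction = 1 - latency / 40.0                   # about 0.78
```

The same formula explains why a very low hit rate leaves latency essentially unchanged: the miss term dominates.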

L2 Benchmark Results

Query: SELECT balance FROM accounts WHERE id = :random_1_to_1000
Clients: 100, Duration: 30s
HeliosDB Direct: 2,416 TPS, 41.3ms latency
HeliosProxy L2: 2,419 TPS, 41.3ms latency
Note: Random distribution across 1000 IDs = low hit rate (~0.3%)
For cache benefit, need concentrated access patterns.

Table-Specific TTL

Configure longer TTL for stable data:

[cache.tables.products]
ttl_secs = 3600 # Products change rarely
[cache.tables.orders]
ttl_secs = 60 # Orders change frequently
[cache.tables.user_sessions]
exclude = true # Never cache (security)
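How per-table rules combine for a query that touches several tables is not specified here. The sketch below shows one plausible policy, stated as an assumption: any excluded table disables caching, and otherwise the shortest TTL wins. `TABLE_RULES`, `DEFAULT_TTL_SECS`, and `resolve_ttl` are hypothetical names; the default of 300 simply mirrors the `[cache.l2]` example.

```python
from typing import Dict, List, Optional

DEFAULT_TTL_SECS = 300  # assumed fallback, mirroring the [cache.l2] example

# Mirrors the [cache.tables.*] sections above.
TABLE_RULES: Dict[str, Dict[str, object]] = {
    "products": {"ttl_secs": 3600},
    "orders": {"ttl_secs": 60},
    "user_sessions": {"exclude": True},
}


def resolve_ttl(tables: List[str]) -> Optional[int]:
    """Return the TTL for a query touching `tables`, or None if it must not be cached."""
    ttl: Optional[int] = None
    for table in tables:
        rule = TABLE_RULES.get(table, {})
        if rule.get("exclude"):
            return None  # assumed: one excluded table disables caching
        table_ttl = int(rule.get("ttl_secs", DEFAULT_TTL_SECS))
        ttl = table_ttl if ttl is None else min(ttl, table_ttl)
    return DEFAULT_TTL_SECS if ttl is None else ttl
```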

L3 Cache: Semantic/AI Queries

When L3 Excels

L3 cache provides breakthrough performance for semantically similar queries - queries that mean the same thing but are phrased differently.

| Workload | L3 Benefit | Example |
|---|---|---|
| Natural language search | VERY HIGH | “cheap hotels” vs “budget accommodation” |
| AI agent queries | VERY HIGH | Similar tool calls with different phrasing |
| Customer support | HIGH | FAQ variations from different users |
| RAG retrieval | HIGH | Similar questions about same docs |

How L3 Works

  1. Query text is converted to embedding vector (via Ollama)
  2. Embedding is compared to cached entries using cosine similarity
  3. If similarity > threshold (e.g., 0.92), return cached result

User A: "What are the cheapest flights to Paris?"
↓ Embedding
[0.23, -0.45, 0.12, ...]
↓ Cosine Similarity = 0.94
User B: "Find budget flights to Paris"
↓ Embedding
[0.25, -0.43, 0.11, ...]
↓ CACHE HIT!
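The comparison step can be sketched in a few lines. This is an illustration of cosine-similarity matching in general, assuming a linear scan over cached embeddings; a production cache would more likely use an approximate nearest-neighbor index. `cosine_similarity` and `semantic_lookup` are hypothetical names, and the embeddings are assumed to be computed elsewhere.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(
    query_vec: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float = 0.92,
) -> Optional[str]:
    """Return the result of the most similar cached entry above `threshold`, else None."""
    best_score, best_result = threshold, None
    for vec, result in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_result = score, result
    return best_result
```

Lowering `threshold` makes matching looser (more hits, higher risk of returning an answer to a different question); raising it does the opposite.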

L3 Configuration

[cache.l3]
enabled = true
similarity_threshold = 0.92 # 92% similarity for hit
max_entries = 10000
ttl_secs = 3600 # 1 hour
# Ollama for embeddings
embedding_endpoint = "http://localhost:11434"
embedding_model = "all-minilm"
embedding_dim = 384

L3 Demo: Support Knowledge Base

Support knowledge base with 10,000 articles:

-- Natural language search (vector similarity)
SELECT article_id, title, content
FROM support_articles
ORDER BY embedding <-> $1 -- $1 is query embedding
LIMIT 5;

Query variations (all cached together):

  • “How do I reset my password?”
  • “Password reset instructions”
  • “I forgot my password”
  • “Can’t log in, need to change password”

Performance impact:

  • Without L3: Each query = embedding + vector search (~100ms)
  • With L3: Similar queries = 5ms cache lookup
  • 95% latency reduction for common questions

AI-Specific Optimizations

HeliosProxy includes specialized caches for AI/LLM workloads.

Conversation Context Cache

Caches recent conversation turns for multi-turn AI agents:

[cache.ai.conversation]
enabled = true
max_turns = 50 # Turns per conversation
max_conversations = 10000
ttl_secs = 3600 # 1 hour idle timeout

Use case: AI agent maintaining context across tool calls.

// Pseudocode: AI agent conversation
let context = cache.conversation.get("conv_12345");
// Returns last 50 turns instantly instead of DB query

Benefit:

  • Without cache: Load 50 turns from DB = 50ms
  • With cache: Instant retrieval = 0.5ms
  • 100x faster context retrieval
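The `max_turns` and `max_conversations` settings suggest a bounded structure per conversation plus a bound on the number of conversations. The sketch below models that with a deque inside an ordered dictionary. It is an assumption about behavior, not the proxy's code: `ConversationCache` is a hypothetical class, least-recently-used eviction is assumed, and the idle timeout (`ttl_secs`) is not modeled.

```python
from collections import OrderedDict, deque
from typing import Deque, List


class ConversationCache:
    """Hypothetical in-memory model of the conversation cache settings."""

    def __init__(self, max_turns: int = 50, max_conversations: int = 10000) -> None:
        self.max_turns = max_turns
        self.max_conversations = max_conversations
        self._conversations: "OrderedDict[str, Deque[str]]" = OrderedDict()

    def append(self, conv_id: str, turn: str) -> None:
        turns = self._conversations.get(conv_id)
        if turns is None:
            if len(self._conversations) >= self.max_conversations:
                # Assumed policy: evict the least recently used conversation.
                self._conversations.popitem(last=False)
            # A deque with maxlen silently drops the oldest turn when full.
            turns = deque(maxlen=self.max_turns)
            self._conversations[conv_id] = turns
        turns.append(turn)
        self._conversations.move_to_end(conv_id)

    def get(self, conv_id: str) -> List[str]:
        turns = self._conversations.get(conv_id)
        if turns is None:
            return []
        self._conversations.move_to_end(conv_id)
        return list(turns)
```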

RAG Chunk Cache

Caches document chunks for Retrieval-Augmented Generation:

[cache.ai.rag]
enabled = true
max_chunks = 100000
max_documents = 10000
ttl_secs = 86400 # 24 hours (documents stable)

Use case: Document Q&A system retrieving relevant chunks.

-- Chunk retrieval (expensive vector search)
SELECT chunk_id, content, embedding
FROM document_chunks
WHERE document_id = $1
ORDER BY embedding <-> $2
LIMIT 10;

Benefit:

  • Same document queried by multiple users
  • First query: Full vector search (150ms)
  • Subsequent: Cached chunks (2ms)
  • 75x faster for popular documents

Tool Result Cache

Caches results of AI tool calls:

[cache.ai.tools]
enabled = true
max_entries = 50000
ttl_secs = 300 # 5 minutes
hash_arguments = true # Cache by tool+args hash

Use case: AI agents calling same tool with same arguments.

{
"tool": "get_weather",
"arguments": {"city": "New York", "date": "2024-01-26"}
}

Benefit:

  • Multiple agents asking for NYC weather = 1 API call
  • Repeated tool patterns cached across all conversations
  • Reduces external API costs by 60-80%
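Caching by a tool+arguments hash only works if logically identical calls serialize identically. A minimal sketch of such a key, assuming canonical JSON and SHA-256 (the actual hashing scheme behind `hash_arguments` is not documented here, and `tool_cache_key` is a hypothetical helper):

```python
import hashlib
import json


def tool_cache_key(tool: str, arguments: dict) -> str:
    """Stable key for a tool call, independent of argument ordering."""
    # sort_keys and compact separators give one canonical serialization.
    canonical = json.dumps(
        {"tool": tool, "arguments": arguments},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key_a = tool_cache_key("get_weather", {"city": "New York", "date": "2024-01-26"})
key_b = tool_cache_key("get_weather", {"date": "2024-01-26", "city": "New York"})
# key_a and key_b are equal because key order does not affect the serialization.
```

Note that this treats `"New York"` and `"new york"` as different calls; any argument normalization would have to happen before hashing.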

Real-World Workload Examples

Example 1: E-Commerce Dashboard

Workload profile:

  • 1000 concurrent users
  • Dashboard refresh every 30s
  • Product browsing (Pareto distribution)
  • Checkout (write-heavy, not cached)

# Optimized configuration
[cache]
enabled = true
[cache.l1]
enabled = true
size = 500
ttl_secs = 30
[cache.l2]
enabled = true
size_mb = 1024
ttl_secs = 300
normalize_queries = true
[cache.l3]
enabled = false # No NLP queries
[cache.tables.user_cart]
exclude = true # Always fresh
[cache.tables.products]
ttl_secs = 600 # 10 min for products
[cache.tables.categories]
ttl_secs = 3600 # 1 hour for categories

Expected results:

  • L1 hit rate: 40% (dashboard polling)
  • L2 hit rate: 60% (popular products)
  • Effective latency reduction: 50%
  • Database load reduction: 65%

Example 2: AI Customer Support Bot

Workload profile:

  • 500 concurrent conversations
  • RAG retrieval from 50,000 documents
  • Natural language queries (semantic similarity)
  • Tool calls (weather, order lookup, etc.)

# AI-optimized configuration
[cache]
enabled = true
[cache.l1]
enabled = true
size = 200
ttl_secs = 300
[cache.l2]
enabled = true
size_mb = 512
ttl_secs = 600
[cache.l3]
enabled = true
similarity_threshold = 0.90
max_entries = 50000
ttl_secs = 7200
embedding_endpoint = "http://ollama:11434"
embedding_model = "all-minilm"
[cache.ai.conversation]
enabled = true
max_turns = 100
max_conversations = 5000
[cache.ai.rag]
enabled = true
max_chunks = 200000
ttl_secs = 86400
[cache.ai.tools]
enabled = true
max_entries = 100000
ttl_secs = 300

Expected results:

  • L3 hit rate: 35% (similar questions)
  • RAG cache hit rate: 70% (popular docs)
  • Tool cache hit rate: 50% (common queries)
  • Average response time: 200ms → 80ms
  • 60% lower average response time

Example 3: Analytics Dashboard

Workload profile:

  • 50 concurrent analysts
  • Complex aggregate queries
  • Same reports run repeatedly
  • Data refreshes hourly

# Analytics configuration
[cache]
enabled = true
max_result_size = 52428800 # 50MB results
[cache.l1]
enabled = true
size = 100
ttl_secs = 300
[cache.l2]
enabled = true
size_mb = 4096 # 4GB for large results
ttl_secs = 3600 # 1 hour TTL
normalize_queries = true
storage = "mmap" # Persist across restarts
mmap_path = "/data/cache/l2.mmap"
[cache.l3]
enabled = false
[cache.tables.analytics_hourly]
ttl_secs = 3600
[cache.tables.analytics_daily]
ttl_secs = 86400

Expected results:

  • L2 hit rate: 80% (same reports)
  • Average query time: 5s → 500ms
  • 10x faster dashboard loads

Cache Invalidation Strategies

WAL-Based Invalidation

Automatically invalidates cache entries when data changes:

[cache.invalidation]
mode = "wal"
wal_subscribe = true

How it works:

  1. HeliosDB streams WAL changes to proxy
  2. Proxy identifies affected tables
  3. Related cache entries invalidated instantly

Best for: OLTP workloads with frequent writes.
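Steps 2 and 3 imply some mapping from tables to the cache entries that depend on them. The sketch below shows one way to keep such an index. It is illustrative only: `TableIndexedCache` and `on_wal_change` are hypothetical names, and invalidation is per table rather than per row.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


class TableIndexedCache:
    """Cache that remembers which tables each entry was derived from."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}
        self._keys_by_table: Dict[str, Set[str]] = defaultdict(set)

    def put(self, key: str, value: str, tables: Iterable[str]) -> None:
        self._entries[key] = value
        for table in tables:
            self._keys_by_table[table].add(key)

    def get(self, key: str) -> Optional[str]:
        return self._entries.get(key)

    def on_wal_change(self, table: str) -> int:
        """Drop every entry that depends on `table`; return how many were removed."""
        removed = 0
        for key in self._keys_by_table.pop(table, set()):
            if key in self._entries:
                del self._entries[key]
                removed += 1
        return removed
```

An entry built from a join is registered under each table it reads, so a change to any of them removes it.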

TTL-Based Invalidation

Simple time-based expiration:

[cache.invalidation]
mode = "ttl"
ttl_fallback_secs = 60

Best for:

  • Read-heavy workloads
  • Analytics (hourly refresh acceptable)
  • When WAL subscription unavailable

Manual Invalidation

API-driven cache clearing:

# Clear all cache
curl -X POST http://proxy:9090/cache/clear
# Clear specific table
curl -X POST http://proxy:9090/cache/clear/products
# Clear specific key
curl -X POST http://proxy:9090/cache/clear/key/products:123

Best for: Admin tools, deployment pipelines.
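For a deployment pipeline, the same endpoints can be called from a script. The sketch below only builds the requests, using the paths from the curl examples above; `ADMIN_BASE` and `clear_request` are illustrative names, and authentication or error handling is left out because neither is described here.

```python
import urllib.parse
import urllib.request
from typing import Optional

ADMIN_BASE = "http://proxy:9090"  # admin address taken from the curl examples


def clear_request(table: Optional[str] = None, key: Optional[str] = None) -> urllib.request.Request:
    """Build the POST request for clearing everything, one table, or one key."""
    if key is not None:
        path = "/cache/clear/key/" + urllib.parse.quote(key, safe=":")
    elif table is not None:
        path = "/cache/clear/" + urllib.parse.quote(table, safe="")
    else:
        path = "/cache/clear"
    return urllib.request.Request(ADMIN_BASE + path, method="POST")


# To send a request (requires a reachable proxy):
#   urllib.request.urlopen(clear_request(table="products"), timeout=5)
```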


Monitoring Cache Performance

Metrics Endpoint

curl http://proxy:9090/cache/stats
{
"l1": {
"connections": 150,
"total_entries": 45000,
"hit_rate": 0.42,
"hits": 1234567,
"misses": 890123
},
"l2": {
"size_mb": 512,
"entries": 89000,
"hit_rate": 0.58,
"hits": 2345678,
"misses": 1234567
},
"l3": {
"entries": 15000,
"hit_rate": 0.31,
"avg_similarity": 0.94,
"embedding_time_ms": 8.5
},
"ai": {
"conversation_cache_entries": 2500,
"rag_chunks_cached": 150000,
"tool_results_cached": 45000
}
}

Key Metrics to Watch

| Metric | Healthy Range | Action If Outside |
|---|---|---|
| L1 hit rate | 30-60% | Increase size or TTL |
| L2 hit rate | 40-80% | Check query normalization |
| L3 hit rate | 20-50% | Adjust similarity threshold |
| L2 size used | <90% | Increase size_mb |
| Eviction rate | <10%/min | Increase cache size |
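The hit-rate rows of this table can be checked automatically against the stats payload. A minimal sketch, assuming the JSON shape shown in the example response (a `hit_rate` field per tier); `HEALTHY_HIT_RATES` and `check_hit_rates` are hypothetical names.

```python
from typing import Dict, Tuple

# Healthy ranges from the table above, as fractions.
HEALTHY_HIT_RATES: Dict[str, Tuple[float, float]] = {
    "l1": (0.30, 0.60),
    "l2": (0.40, 0.80),
    "l3": (0.20, 0.50),
}


def check_hit_rates(stats: dict) -> Dict[str, str]:
    """Map each reported tier to 'ok', 'low', or 'high' relative to its healthy range."""
    report: Dict[str, str] = {}
    for tier, (low, high) in HEALTHY_HIT_RATES.items():
        rate = stats.get(tier, {}).get("hit_rate")
        if rate is None:
            continue  # tier disabled or not reported
        if rate < low:
            report[tier] = "low"
        elif rate > high:
            report[tier] = "high"
        else:
            report[tier] = "ok"
    return report


sample = {"l1": {"hit_rate": 0.42}, "l2": {"hit_rate": 0.58}, "l3": {"hit_rate": 0.31}}
report = check_hit_rates(sample)  # every tier falls inside its range
```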

Prometheus Metrics

# prometheus.yml scrape config
- job_name: 'heliosproxy'
  static_configs:
    - targets: ['proxy:9100']

Available metrics:

heliosproxy_cache_hits_total{tier="l1|l2|l3"}
heliosproxy_cache_misses_total{tier="l1|l2|l3"}
heliosproxy_cache_size_bytes{tier="l1|l2|l3"}
heliosproxy_cache_evictions_total{tier="l1|l2|l3"}
heliosproxy_cache_latency_seconds{tier="l1|l2|l3",quantile="0.5|0.9|0.99"}

Performance Benchmarks

Scalability with Cache

From our benchmark suite (1000 clients, 30s test):

| Configuration | TPS | Latency | Efficiency |
|---|---|---|---|
| HeliosDB Direct | 20,491 | 48.8ms | 85% |
| HeliosProxy (no cache) | 20,489 | 48.8ms | 85% |
| HeliosProxy (L1+L2) | 20,495 | 48.7ms | 85% |

Key insight: Cache overhead is negligible (<0.1%). Benefits appear when:

  • Queries have temporal locality (same queries repeated)
  • Access patterns are skewed (Pareto distribution)
  • Backend latency is high (network, disk I/O)

Cache Benefit Scenarios

| Scenario | Without Cache | With Cache | Improvement |
|---|---|---|---|
| Dashboard polling | 40ms | 0.1ms | 400x |
| Popular products | 40ms | 1ms | 40x |
| NLP similar queries | 100ms | 5ms | 20x |
| RAG chunk retrieval | 150ms | 2ms | 75x |
| AI tool results | 500ms | 1ms | 500x |

Best Practices

1. Start with L1 + L2

Enable L1 and L2 for all deployments. They have minimal overhead:

[cache]
enabled = true
[cache.l1]
enabled = true
[cache.l2]
enabled = true

2. Enable L3 Only for AI Workloads

L3 requires Ollama and adds embedding latency (~10ms):

[cache.l3]
enabled = true # Only if you have NLP queries

3. Tune TTL Based on Data Volatility

| Data Type | Recommended TTL |
|---|---|
| Configuration | 1-24 hours |
| Reference data | 1-6 hours |
| Product catalog | 5-30 minutes |
| User data | 1-5 minutes |
| Session data | Don’t cache |

4. Monitor Hit Rates

If hit rates are low:

  • L1 low: Users not repeating queries (expected for diverse workloads)
  • L2 low: Queries not normalizing well, or random access pattern
  • L3 low: Similarity threshold too strict (lower it to match more loosely), or queries too diverse

5. Size Cache Appropriately

Rule of thumb:

  • L1: 500-2000 entries per expected concurrent connection
  • L2: 10-50% of working dataset size
  • L3: 1000 entries per distinct query pattern expected

Troubleshooting

“Cache hit rate is 0%”

Cause: Queries not matching cache entries

Solutions:

  1. Verify cache is enabled: curl http://proxy:9090/cache/stats
  2. Check query normalization is working
  3. Ensure TTL > query interval

“Memory usage too high”

Cause: Cache size misconfigured

Solutions:

[cache.l2]
size_mb = 256 # Reduce from default
max_result_size = 1048576 # 1MB max per entry

“L3 cache slow”

Cause: Embedding computation overhead

Solutions:

  1. Use a faster embedding model: all-minilm (384d) vs nomic-embed-text (768d)
  2. Batch queries if possible
  3. Increase L3 TTL to reduce embedding frequency

“Stale data in cache”

Cause: TTL too long or invalidation not working

Solutions:

  1. Enable WAL invalidation:
    [cache.invalidation]
    mode = "wal"
    wal_subscribe = true
  2. Reduce TTL for volatile tables
  3. Use manual invalidation API after writes

See Also