Helios-DistribCache Performance Tutorial
This guide demonstrates optimal workloads and use cases for the multi-tier distributed cache in HeliosProxy. Learn when each cache tier provides maximum benefit and how to configure for specific workloads.
Prerequisites
- HeliosDB cluster running
- heliosdb-proxy binary compiled with --features distribcache
- (Optional) Ollama running for L3 semantic cache
Quick Start

    # Run the interactive tutorial
    ./docs/tutorials/distribcache-tutorial.sh

Cache Architecture Overview
HeliosProxy implements a three-tier cache system optimized for different access patterns:

    ┌─────────────────────────────────────────────────────────────────────┐
    │                           Client Request                            │
    └───────────────────────────────┬─────────────────────────────────────┘
                                    │
                                    ▼
    ┌─────────────────────────────────────────────────────────────────────┐
    │  L1: Hot Cache (Per-Connection)                                     │
    │  ├─ Latency: ~0.1ms                                                 │
    │  ├─ Size: 500-2000 entries per connection                           │
    │  ├─ TTL: 30-60 seconds                                              │
    │  └─ Best for: Repeated queries within single session                │
    └───────────────────────────────┬─────────────────────────────────────┘
                                    │ Miss
                                    ▼
    ┌─────────────────────────────────────────────────────────────────────┐
    │  L2: Warm Cache (Shared Across Connections)                         │
    │  ├─ Latency: ~1ms                                                   │
    │  ├─ Size: 256MB - 4GB                                               │
    │  ├─ TTL: 5-10 minutes                                               │
    │  ├─ Query normalization: SELECT * FROM t WHERE id=$1                │
    │  └─ Best for: Parameterized queries across all clients              │
    └───────────────────────────────┬─────────────────────────────────────┘
                                    │ Miss
                                    ▼
    ┌─────────────────────────────────────────────────────────────────────┐
    │  L3: Semantic Cache (AI/Embedding-Based)                            │
    │  ├─ Latency: ~5-10ms                                                │
    │  ├─ Size: 5,000-50,000 entries                                      │
    │  ├─ TTL: 1-24 hours                                                 │
    │  ├─ Similarity threshold: 0.90-0.95                                 │
    │  └─ Best for: NLP queries, AI agents, similar questions             │
    └───────────────────────────────┬─────────────────────────────────────┘
                                    │ Miss
                                    ▼
    ┌─────────────────────────────────────────────────────────────────────┐
    │                              HeliosDB                               │
    │                        (~40ms query latency)                        │
    └─────────────────────────────────────────────────────────────────────┘

L1 Cache: Per-Connection Hot Data
When L1 Excels
L1 cache provides maximum benefit when a single connection repeatedly executes identical queries.
| Workload | L1 Benefit | Example |
|---|---|---|
| User dashboard refresh | HIGH | Same user refreshing their stats |
| Pagination (same page) | HIGH | User viewing page 1 multiple times |
| Auto-complete as-you-type | MEDIUM | Suggestions update on each keystroke |
| Polling for updates | HIGH | “Check for new messages” every 5s |
L1 Configuration

    [cache.l1]
    enabled = true
    size = 1000          # Entries per connection
    ttl_secs = 60        # 1 minute TTL

L1 Demo: Dashboard Polling
This workload simulates a user dashboard that polls every 2 seconds:

    -- Dashboard query (user 12345)
    SELECT notification_count, unread_messages, last_activity
    FROM user_dashboard
    WHERE user_id = 12345;

Without L1 cache:
- Every 2 seconds: 40ms database query
- 30 queries/minute = 1,200ms/minute spent on DB
With L1 cache (30s TTL):
- First query: 40ms (cache miss)
- Next 14 queries: 0.1ms each (cache hit)
- 2 queries/minute to DB = 80ms/minute
- 93% reduction in database load
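The arithmetic above can be reproduced in a few lines of Python. This is a back-of-the-envelope sketch, not a HeliosProxy API: the function name and parameters are illustrative, and it assumes exactly one miss per TTL window with every other poll served from L1 at negligible backend cost.

```python
from typing import Optional

def db_time_per_minute(poll_interval_s: float, query_ms: float,
                       ttl_s: Optional[float] = None) -> float:
    """Milliseconds of backend query time per minute for one polling connection.

    With ttl_s=None every poll reaches the database. Otherwise only the
    first poll in each TTL window does (assumed: no invalidation in between).
    """
    window_s = poll_interval_s if ttl_s is None else max(ttl_s, poll_interval_s)
    db_queries_per_minute = 60 / window_s
    return db_queries_per_minute * query_ms

uncached = db_time_per_minute(2, 40)            # 30 polls, all hit the DB
cached = db_time_per_minute(2, 40, ttl_s=30)    # 2 misses per minute
reduction = 1 - cached / uncached
print(f"{uncached:.0f} ms -> {cached:.0f} ms per minute ({reduction:.0%} less DB time)")
```

Changing `ttl_s` or `poll_interval_s` shows how quickly the benefit shrinks once the TTL approaches the polling interval.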
L1 Benchmark Results
From our scalability tests:

    Query: SELECT balance FROM accounts WHERE id = 1
    Clients: 10, Duration: 30s

    Without cache:       242 TPS, 41.3ms latency
    With L1 (repeated):  242 TPS, 41.3ms latency (CPU-bound, not I/O bound)

    Key insight: L1 benefits visible when backend latency is high (network/disk)

L2 Cache: Shared Parameterized Queries
When L2 Excels
L2 cache shines when different clients execute the same query pattern with different parameters.
| Workload | L2 Benefit | Example |
|---|---|---|
| Product catalog lookups | VERY HIGH | Different users viewing same products |
| API endpoints | HIGH | /api/user/:id with popular IDs |
| Configuration reads | VERY HIGH | Same settings queried by all workers |
| Reference data | VERY HIGH | Countries, currencies, categories |
Query Normalization
L2 cache normalizes parameterized queries:

    -- These all map to the same cache key:
    SELECT * FROM products WHERE id = 123
    SELECT * FROM products WHERE id = 456
    SELECT * FROM products WHERE id = 789

    -- Normalized form:
    SELECT * FROM products WHERE id = $1

L2 Configuration

    [cache.l2]
    enabled = true
    size_mb = 512              # 512MB shared cache
    ttl_secs = 300             # 5 minutes
    normalize_queries = true
    storage = "memory"         # or "mmap" for persistence

L2 Demo: Product Catalog
E-commerce scenario: 80% of traffic views 20% of products (Pareto distribution).

    -- Popular product lookup (normalized)
    SELECT name, price, description, stock
    FROM products
    WHERE id = $1;

Traffic pattern:
- 1000 unique products
- 80% of requests hit top 200 products
- 500 concurrent users
Without L2 cache:
- Every product view: 40ms query
- 500 requests/sec = 20,000ms/sec of DB time
With L2 cache:
- First view of each popular product: 40ms (miss)
- Subsequent views: 1ms (hit)
- After warmup: 80% hit rate on popular items
- Effective latency: 8.8ms (0.8 * 1ms + 0.2 * 40ms)
- 78% latency reduction
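The effective-latency figure is a weighted average of hit and miss latency. The sketch below (an illustrative helper, not part of HeliosProxy) makes the formula explicit so it can be re-run with the hit rate observed in your own deployment.

```python
def effective_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Expected per-request latency for a given cache hit rate."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

# Values from the product catalog demo: 80% hits at 1 ms, misses at 40 ms.
latency = effective_latency_ms(0.8, 1.0, 40.0)
reduction = 1 - latency / 40.0
print(f"effective latency: {latency:.1f} ms ({reduction:.0%} reduction)")

# How sensitive the result is to the hit rate:
for rate in (0.5, 0.8, 0.95):
    print(f"hit rate {rate:.0%}: {effective_latency_ms(rate, 1.0, 40.0):.2f} ms")
```

Because the miss latency dominates, each additional point of hit rate matters more than shaving the hit latency itself.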
L2 Benchmark Results

    Query: SELECT balance FROM accounts WHERE id = :random_1_to_1000
    Clients: 100, Duration: 30s

    HeliosDB Direct:  2,416 TPS, 41.3ms latency
    HeliosProxy L2:   2,419 TPS, 41.3ms latency

    Note: Random distribution across 1000 IDs = low hit rate (~0.3%)
    For cache benefit, need concentrated access patterns.

Table-Specific TTL
Configure longer TTL for stable data:

    [cache.tables.products]
    ttl_secs = 3600    # Products change rarely

    [cache.tables.orders]
    ttl_secs = 60      # Orders change frequently

    [cache.tables.user_sessions]
    exclude = true     # Never cache (security)

L3 Cache: Semantic/AI Queries
When L3 Excels
L3 cache provides breakthrough performance for semantically similar queries - queries that mean the same thing but are phrased differently.
| Workload | L3 Benefit | Example |
|---|---|---|
| Natural language search | VERY HIGH | “cheap hotels” vs “budget accommodation” |
| AI agent queries | VERY HIGH | Similar tool calls with different phrasing |
| Customer support | HIGH | FAQ variations from different users |
| RAG retrieval | HIGH | Similar questions about same docs |
How L3 Works
- Query text is converted to embedding vector (via Ollama)
- Embedding is compared to cached entries using cosine similarity
- If similarity > threshold (e.g., 0.92), return cached result
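The lookup steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: embeddings are already computed (no Ollama call is made here), the cache is a simple list scanned linearly, and the names `cosine_similarity` and `semantic_lookup` are hypothetical rather than HeliosProxy functions. A production implementation would more likely use an approximate nearest-neighbour index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query_vec, entries, threshold=0.92):
    """Return the cached result most similar to query_vec.

    entries is a list of (embedding, result) pairs. Returns None when no
    entry has similarity strictly above the threshold.
    """
    best_result, best_sim = None, threshold
    for vec, result in entries:
        sim = cosine_similarity(query_vec, vec)
        if sim > best_sim:
            best_result, best_sim = result, sim
    return best_result

# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions,
# so the similarity values here are not representative.
entries = [([0.23, -0.45, 0.12], "cached rows for 'cheapest flights to Paris'")]
hit = semantic_lookup([0.25, -0.43, 0.11], entries)
miss = semantic_lookup([-0.40, 0.10, 0.30], entries)
print(hit, miss)
```

Returning the single best match above the threshold, rather than the first match, keeps the result stable when several cached entries are close to the query.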

    User A: "What are the cheapest flights to Paris?"
        ↓ Embedding [0.23, -0.45, 0.12, ...]
        ↓ Cosine Similarity = 0.94
    User B: "Find budget flights to Paris"
        ↓ Embedding [0.25, -0.43, 0.11, ...]
        ↓ CACHE HIT!

L3 Configuration

    [cache.l3]
    enabled = true
    similarity_threshold = 0.92    # 92% similarity for hit
    max_entries = 10000
    ttl_secs = 3600                # 1 hour

    # Ollama for embeddings
    embedding_endpoint = "http://localhost:11434"
    embedding_model = "all-minilm"
    embedding_dim = 384

L3 Demo: Customer Support Search
Support knowledge base with 10,000 articles:

    -- Natural language search (vector similarity)
    SELECT article_id, title, content
    FROM support_articles
    ORDER BY embedding <-> $1  -- $1 is query embedding
    LIMIT 5;

Query variations (all cached together):
- “How do I reset my password?”
- “Password reset instructions”
- “I forgot my password”
- “Can’t log in, need to change password”
Performance impact:
- Without L3: Each query = embedding + vector search (~100ms)
- With L3: Similar queries = 5ms cache lookup
- 95% latency reduction for common questions
AI-Specific Optimizations
HeliosProxy includes specialized caches for AI/LLM workloads.
Conversation Context Cache
Caches recent conversation turns for multi-turn AI agents:

    [cache.ai.conversation]
    enabled = true
    max_turns = 50             # Turns per conversation
    max_conversations = 10000
    ttl_secs = 3600            # 1 hour idle timeout

Use case: AI agent maintaining context across tool calls.

    // Pseudocode: AI agent conversation
    let context = cache.conversation.get("conv_12345");
    // Returns last 50 turns instantly instead of DB query

Benefit:
- Without cache: Load 50 turns from DB = 50ms
- With cache: Instant retrieval = 0.5ms
- 100x faster context retrieval
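To make the `max_turns` and `max_conversations` limits concrete, here is a minimal in-memory sketch of such a structure. It is an assumption about how a bounded conversation cache could behave, not HeliosProxy's implementation: the class name is hypothetical, eviction is least-recently-used, and the idle timeout (`ttl_secs`) is left out for brevity.

```python
from collections import OrderedDict, deque

class ConversationCache:
    """Keeps the last max_turns turns for up to max_conversations conversations."""

    def __init__(self, max_turns=50, max_conversations=10000):
        self.max_turns = max_turns
        self.max_conversations = max_conversations
        self._convs = OrderedDict()  # conversation id -> deque of turns

    def append(self, conv_id, turn):
        turns = self._convs.get(conv_id)
        if turns is None:
            # deque(maxlen=...) silently drops the oldest turn when full
            turns = deque(maxlen=self.max_turns)
            self._convs[conv_id] = turns
        turns.append(turn)
        self._convs.move_to_end(conv_id)
        while len(self._convs) > self.max_conversations:
            self._convs.popitem(last=False)  # evict least recently used

    def get(self, conv_id):
        turns = self._convs.get(conv_id)
        if turns is None:
            return None
        self._convs.move_to_end(conv_id)
        return list(turns)

cache = ConversationCache(max_turns=3, max_conversations=2)
for i in range(5):
    cache.append("conv_12345", f"turn {i}")
print(cache.get("conv_12345"))
```

Only the three most recent turns survive in this example because `max_turns` is 3.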
RAG Chunk Cache
Caches document chunks for Retrieval-Augmented Generation:

    [cache.ai.rag]
    enabled = true
    max_chunks = 100000
    max_documents = 10000
    ttl_secs = 86400           # 24 hours (documents stable)

Use case: Document Q&A system retrieving relevant chunks.

    -- Chunk retrieval (expensive vector search)
    SELECT chunk_id, content, embedding
    FROM document_chunks
    WHERE document_id = $1
    ORDER BY embedding <-> $2
    LIMIT 10;

Benefit:
- Same document queried by multiple users
- First query: Full vector search (150ms)
- Subsequent: Cached chunks (2ms)
- 75x faster for popular documents
Tool Result Cache
Caches results of AI tool calls:

    [cache.ai.tools]
    enabled = true
    max_entries = 50000
    ttl_secs = 300             # 5 minutes
    hash_arguments = true      # Cache by tool+args hash

Use case: AI agents calling same tool with same arguments.

    {
      "tool": "get_weather",
      "arguments": {"city": "New York", "date": "2024-01-26"}
    }

Benefit:
- Multiple agents asking for NYC weather = 1 API call
- Repeated tool patterns cached across all conversations
- Reduces external API costs by 60-80%
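Caching by a tool+arguments hash only works if logically identical calls produce identical keys. The sketch below shows one way to derive such a key; the canonical-JSON-plus-SHA-256 scheme and the function name `tool_cache_key` are assumptions for illustration, since the hashing scheme behind `hash_arguments` is not specified here.

```python
import hashlib
import json

def tool_cache_key(tool: str, arguments: dict) -> str:
    """Stable cache key for a tool call, independent of argument order."""
    canonical = json.dumps(
        {"tool": tool, "arguments": arguments},
        sort_keys=True,              # same key regardless of dict ordering
        separators=(",", ":"),       # no whitespace differences
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

k1 = tool_cache_key("get_weather", {"city": "New York", "date": "2024-01-26"})
k2 = tool_cache_key("get_weather", {"date": "2024-01-26", "city": "New York"})
k3 = tool_cache_key("get_weather", {"city": "Boston", "date": "2024-01-26"})
print(k1 == k2, k1 == k3)
```

Note that this treats "New York" and "new york" as different calls; normalizing argument values before hashing is a separate, tool-specific decision.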
Real-World Workload Examples
Example 1: E-Commerce Dashboard
Workload profile:
- 1000 concurrent users
- Dashboard refresh every 30s
- Product browsing (Pareto distribution)
- Checkout (write-heavy, not cached)

    # Optimized configuration
    [cache]
    enabled = true

    [cache.l1]
    enabled = true
    size = 500
    ttl_secs = 30

    [cache.l2]
    enabled = true
    size_mb = 1024
    ttl_secs = 300
    normalize_queries = true

    [cache.l3]
    enabled = false    # No NLP queries

    [cache.tables.user_cart]
    exclude = true     # Always fresh

    [cache.tables.products]
    ttl_secs = 600     # 10 min for products

    [cache.tables.categories]
    ttl_secs = 3600    # 1 hour for categories

Expected results:
- L1 hit rate: 40% (dashboard polling)
- L2 hit rate: 60% (popular products)
- Effective latency reduction: 50%
- Database load reduction: 65%
Example 2: AI Customer Support Bot
Workload profile:
- 500 concurrent conversations
- RAG retrieval from 50,000 documents
- Natural language queries (semantic similarity)
- Tool calls (weather, order lookup, etc.)

    # AI-optimized configuration
    [cache]
    enabled = true

    [cache.l1]
    enabled = true
    size = 200
    ttl_secs = 300

    [cache.l2]
    enabled = true
    size_mb = 512
    ttl_secs = 600

    [cache.l3]
    enabled = true
    similarity_threshold = 0.90
    max_entries = 50000
    ttl_secs = 7200
    embedding_endpoint = "http://ollama:11434"
    embedding_model = "all-minilm"

    [cache.ai.conversation]
    enabled = true
    max_turns = 100
    max_conversations = 5000

    [cache.ai.rag]
    enabled = true
    max_chunks = 200000
    ttl_secs = 86400

    [cache.ai.tools]
    enabled = true
    max_entries = 100000
    ttl_secs = 300

Expected results:
- L3 hit rate: 35% (similar questions)
- RAG cache hit rate: 70% (popular docs)
- Tool cache hit rate: 50% (common queries)
- Average response time: 200ms → 80ms
- 60% faster AI responses
Example 3: Analytics Dashboard
Workload profile:
- 50 concurrent analysts
- Complex aggregate queries
- Same reports run repeatedly
- Data refreshes hourly

    # Analytics configuration
    [cache]
    enabled = true
    max_result_size = 52428800    # 50MB results

    [cache.l1]
    enabled = true
    size = 100
    ttl_secs = 300

    [cache.l2]
    enabled = true
    size_mb = 4096                # 4GB for large results
    ttl_secs = 3600               # 1 hour TTL
    normalize_queries = true
    storage = "mmap"              # Persist across restarts
    mmap_path = "/data/cache/l2.mmap"

    [cache.l3]
    enabled = false

    [cache.tables.analytics_hourly]
    ttl_secs = 3600

    [cache.tables.analytics_daily]
    ttl_secs = 86400

Expected results:
- L2 hit rate: 80% (same reports)
- Average query time: 5s → 500ms
- 10x faster dashboard loads
Cache Invalidation Strategies
WAL-Based Invalidation (Recommended)
Automatically invalidates cache when data changes:

    [cache.invalidation]
    mode = "wal"
    wal_subscribe = true

How it works:
- HeliosDB streams WAL changes to proxy
- Proxy identifies affected tables
- Related cache entries invalidated instantly
Best for: OLTP workloads with frequent writes.
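Table-level invalidation needs a reverse index from table name to the cache entries that read from it. The following sketch shows that bookkeeping in isolation. It is an illustrative model, not the proxy's code: the class and method names are hypothetical, and the WAL stream is reduced to a call that passes the name of the changed table.

```python
from collections import defaultdict

class TableIndexedCache:
    """Result cache that can drop all entries depending on a given table."""

    def __init__(self):
        self._entries = {}                  # cache key -> cached result
        self._by_table = defaultdict(set)   # table name -> cache keys

    def put(self, key, result, tables):
        self._entries[key] = result
        for table in tables:
            self._by_table[table].add(key)

    def get(self, key):
        return self._entries.get(key)

    def on_wal_change(self, table):
        """Invalidate entries that read from `table`. Returns how many were dropped."""
        dropped = 0
        for key in self._by_table.pop(table, set()):
            if key in self._entries:
                del self._entries[key]
                dropped += 1
        return dropped

cache = TableIndexedCache()
cache.put("products:123", {"name": "Widget"}, tables=["products"])
cache.put("order_items:9", [{"sku": "W-1"}], tables=["orders", "products"])
cache.put("orders:count", 42, tables=["orders"])
print(cache.on_wal_change("products"))   # entries touching products are dropped
```

An entry that joins several tables is registered under each of them, so a change to any one of those tables invalidates it.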
TTL-Based Invalidation
Simple time-based expiration:

    [cache.invalidation]
    mode = "ttl"
    ttl_fallback_secs = 60

Best for:
- Read-heavy workloads
- Analytics (hourly refresh acceptable)
- When WAL subscription unavailable
Manual Invalidation
API-driven cache clearing:

    # Clear all cache
    curl -X POST http://proxy:9090/cache/clear

    # Clear specific table
    curl -X POST http://proxy:9090/cache/clear/products

    # Clear specific key
    curl -X POST http://proxy:9090/cache/clear/key/products:123

Best for: Admin tools, deployment pipelines.
Monitoring Cache Performance
Metrics Endpoint

    curl http://proxy:9090/cache/stats

    {
      "l1": {
        "connections": 150,
        "total_entries": 45000,
        "hit_rate": 0.42,
        "hits": 1234567,
        "misses": 890123
      },
      "l2": {
        "size_mb": 512,
        "entries": 89000,
        "hit_rate": 0.58,
        "hits": 2345678,
        "misses": 1234567
      },
      "l3": {
        "entries": 15000,
        "hit_rate": 0.31,
        "avg_similarity": 0.94,
        "embedding_time_ms": 8.5
      },
      "ai": {
        "conversation_cache_entries": 2500,
        "rag_chunks_cached": 150000,
        "tool_results_cached": 45000
      }
    }

Key Metrics to Watch
| Metric | Healthy Range | Action If Outside |
|---|---|---|
| L1 hit rate | 30-60% | Increase size or TTL |
| L2 hit rate | 40-80% | Check query normalization |
| L3 hit rate | 20-50% | Adjust similarity threshold |
| L2 size used | <90% | Increase size_mb |
| Eviction rate | <10%/min | Increase cache size |
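The hit-rate ranges in the table can be turned into a simple automated check. The sketch below is illustrative: the helper names are hypothetical, and it assumes you have per-tier hit and miss counters available (for example from the stats endpoint or the Prometheus counters), computing the rate as hits / (hits + misses).

```python
# Healthy hit-rate ranges per tier, taken from the table above.
HEALTHY_HIT_RATE = {"l1": (0.30, 0.60), "l2": (0.40, 0.80), "l3": (0.20, 0.50)}

def hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0

def classify(tier: str, hits: int, misses: int) -> str:
    """Return 'low', 'ok', or 'high' relative to the tier's healthy range."""
    low, high = HEALTHY_HIT_RATE[tier]
    rate = hit_rate(hits, misses)
    if rate < low:
        return "low"
    if rate > high:
        return "high"
    return "ok"

# Hypothetical counter values for illustration only.
sample = {"l1": (420, 580), "l2": (100, 900), "l3": (310, 690)}
for tier, (hits, misses) in sample.items():
    print(tier, f"{hit_rate(hits, misses):.0%}", classify(tier, hits, misses))
```

A "high" result is rarely a problem in itself, but it can indicate a TTL that is longer than the data's freshness requirements allow.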
Prometheus Metrics

    # prometheus.yml scrape config
    - job_name: 'heliosproxy'
      static_configs:
        - targets: ['proxy:9100']

Available metrics:

    heliosproxy_cache_hits_total{tier="l1|l2|l3"}
    heliosproxy_cache_misses_total{tier="l1|l2|l3"}
    heliosproxy_cache_size_bytes{tier="l1|l2|l3"}
    heliosproxy_cache_evictions_total{tier="l1|l2|l3"}
    heliosproxy_cache_latency_seconds{tier="l1|l2|l3",quantile="0.5|0.9|0.99"}

Performance Benchmarks
Scalability with Cache
From our benchmark suite (1000 clients, 30s test):
| Configuration | TPS | Latency | Efficiency |
|---|---|---|---|
| HeliosDB Direct | 20,491 | 48.8ms | 85% |
| HeliosProxy (no cache) | 20,489 | 48.8ms | 85% |
| HeliosProxy (L1+L2) | 20,495 | 48.7ms | 85% |
Key insight: Cache overhead is negligible (<0.1%). Benefits appear when:
- Queries have temporal locality (same queries repeated)
- Access patterns are skewed (Pareto distribution)
- Backend latency is high (network, disk I/O)
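The effect of access skew can be demonstrated with a small simulation. The sketch below compares a uniform workload with a skewed one against the same fixed-size cache. It assumes LRU eviction and an entry-count capacity purely for illustration (the proxy's actual eviction policy is not described here), and the workload parameters mirror the 1000-product, 80/20 example used earlier.

```python
import random
from collections import OrderedDict

def lru_hit_rate(keys, capacity):
    """Replay a key sequence through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for key in keys:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(keys)

rng = random.Random(42)  # fixed seed so runs are repeatable
n_keys, n_requests, capacity = 1000, 20000, 200

# Uniform: every key equally likely.
uniform = [rng.randrange(n_keys) for _ in range(n_requests)]
# Skewed: 80% of requests go to the 200 popular keys, 20% to the other 800.
skewed = [
    rng.randrange(200) if rng.random() < 0.8 else rng.randrange(200, n_keys)
    for _ in range(n_requests)
]

uniform_rate = lru_hit_rate(uniform, capacity)
skewed_rate = lru_hit_rate(skewed, capacity)
print(f"uniform: {uniform_rate:.1%}  skewed: {skewed_rate:.1%}")
```

With the same cache size, the skewed workload achieves a markedly higher hit rate, which is why benchmarks with uniformly random keys show little benefit from caching.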
Cache Benefit Scenarios
| Scenario | Without Cache | With Cache | Improvement |
|---|---|---|---|
| Dashboard polling | 40ms | 0.1ms | 400x |
| Popular products | 40ms | 1ms | 40x |
| NLP similar queries | 100ms | 5ms | 20x |
| RAG chunk retrieval | 150ms | 2ms | 75x |
| AI tool results | 500ms | 1ms | 500x |
Best Practices
1. Start with L1 + L2
Enable L1 and L2 for all deployments. They have minimal overhead:

    [cache]
    enabled = true

    [cache.l1]
    enabled = true

    [cache.l2]
    enabled = true

2. Enable L3 Only for AI Workloads
L3 requires Ollama and adds embedding latency (~10ms):

    [cache.l3]
    enabled = true    # Only if you have NLP queries

3. Tune TTL Based on Data Volatility
| Data Type | Recommended TTL |
|---|---|
| Configuration | 1-24 hours |
| Reference data | 1-6 hours |
| Product catalog | 5-30 minutes |
| User data | 1-5 minutes |
| Session data | Don’t cache |
4. Monitor Hit Rates
If hit rates are low:
- L1 low: Users not repeating queries (expected for diverse workloads)
- L2 low: Queries not normalizing well, or random access pattern
- L3 low: Similarity threshold too strict (try lowering it within the 0.90-0.95 range), or queries too diverse
5. Size Cache Appropriately
Rule of thumb:
- L1: 500-2000 entries per expected concurrent connection
- L2: 10-50% of working dataset size
- L3: 1000 entries per distinct query pattern expected
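These rules of thumb reduce to simple arithmetic. The helper below is a hypothetical convenience for turning deployment numbers into starting values; the function name and the returned keys are illustrative and do not correspond to configuration options.

```python
def suggest_cache_sizes(connections: int, working_set_mb: int, query_patterns: int) -> dict:
    """Apply the sizing rules of thumb. Ranges are (low, high) tuples."""
    return {
        # Per-connection setting, plus the resulting total for memory planning.
        "l1_entries_per_connection": (500, 2000),
        "l1_entries_total": (500 * connections, 2000 * connections),
        # 10-50% of the working dataset, in whole megabytes.
        "l2_size_mb": (working_set_mb * 10 // 100, working_set_mb * 50 // 100),
        # 1000 entries per distinct query pattern.
        "l3_max_entries": 1000 * query_patterns,
    }

sizes = suggest_cache_sizes(connections=150, working_set_mb=4096, query_patterns=20)
for name, value in sizes.items():
    print(f"{name}: {value}")
```

Treat the output as a starting point and adjust it against the observed hit and eviction rates.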
Troubleshooting
“Cache hit rate is 0%”
Cause: Queries not matching cache entries
Solutions:
- Verify cache is enabled: curl http://proxy:9090/cache/stats
- Check query normalization is working
- Ensure TTL > query interval
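One way to sanity-check normalization offline is to run your application's query strings through a rough approximation and see whether they collapse to the same template. The sketch below is such an approximation using regular expressions. It is not HeliosProxy's normalizer, the function name is hypothetical, and it handles only numeric and single-quoted string literals. It also returns the extracted parameters, on the assumption that a result cache needs them together with the template to tell `id = 123` apart from `id = 456`.

```python
import re

# Single-quoted strings (with '' escapes) or standalone numbers. Numbers that
# are part of an identifier (t1) or an existing placeholder ($1) are skipped.
_LITERAL = re.compile(r"'(?:[^']|'')*'|(?<![\w$])\d+(?:\.\d+)?\b")

def normalize(sql: str):
    """Return (template, params) with literals replaced by $1, $2, ..."""
    params = []

    def replace(match):
        params.append(match.group(0))
        return f"${len(params)}"

    template = _LITERAL.sub(replace, sql)
    template = " ".join(template.split())   # collapse whitespace
    return template, params

a = normalize("SELECT * FROM products WHERE id = 123")
b = normalize("SELECT *  FROM products\nWHERE id = 456")
c = normalize("SELECT name FROM users WHERE email = 'a1@example.com'")
print(a, b, c, sep="\n")
```

If two queries that you expect to share a cache pattern produce different templates here (for example because of differing keyword case or comment text), that is a hint to check how the real normalizer treats them.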
“Memory usage too high”
Cause: Cache size misconfigured
Solutions:

    [cache.l2]
    size_mb = 256                # Reduce from default
    max_result_size = 1048576    # 1MB max per entry

“L3 cache slow”
Cause: Embedding computation overhead
Solutions:
- Use faster embedding model: all-minilm (384d) vs nomic-embed (768d)
- Batch queries if possible
- Increase L3 TTL to reduce embedding frequency
“Stale data in cache”
Cause: TTL too long or invalidation not working
Solutions:
- Enable WAL invalidation:

      [cache.invalidation]
      mode = "wal"
      wal_subscribe = true

- Reduce TTL for volatile tables
- Use manual invalidation API after writes