Helios-DistribCache Performance Tutorial

This guide demonstrates the workloads and use cases best suited to the multi-tier distributed cache in HeliosProxy. Learn when each cache tier provides the most benefit and how to configure it for specific workloads.

Prerequisites

HeliosDB cluster running
heliosdb-proxy binary compiled with --features distribcache
(Optional) Ollama running for L3 semantic cache

Quick Start

# Run the interactive tutorial
./docs/tutorials/distribcache-tutorial.sh

Cache Architecture Overview

HeliosProxy implements a three-tier cache system optimized for different access patterns:

┌─────────────────────────────────────────────────────────────────────┐
│                         Client Request                               │
└───────────────────────────────┬─────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│  L1: Hot Cache (Per-Connection)                                      │
│  ├─ Latency: ~0.1ms                                                  │
│  ├─ Size: 500-2000 entries per connection                           │
│  ├─ TTL: 30-60 seconds                                               │
│  └─ Best for: Repeated queries within single session                 │
└───────────────────────────────┬─────────────────────────────────────┘
                                │ Miss
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│  L2: Warm Cache (Shared Across Connections)                          │
│  ├─ Latency: ~1ms                                                    │
│  ├─ Size: 256MB - 4GB                                                │
│  ├─ TTL: 5-10 minutes                                                │
│  ├─ Query normalization: SELECT * FROM t WHERE id=$1                 │
│  └─ Best for: Parameterized queries across all clients               │
└───────────────────────────────┬─────────────────────────────────────┘
                                │ Miss
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│  L3: Semantic Cache (AI/Embedding-Based)                             │
│  ├─ Latency: ~5-10ms                                                 │
│  ├─ Size: 5,000-50,000 entries                                       │
│  ├─ TTL: 1-24 hours                                                  │
│  ├─ Similarity threshold: 0.90-0.95                                  │
│  └─ Best for: NLP queries, AI agents, similar questions              │
└───────────────────────────────┬─────────────────────────────────────┘
                                │ Miss
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           HeliosDB                                   │
│                      (~40ms query latency)                           │
└─────────────────────────────────────────────────────────────────────┘

L1 Cache: Per-Connection Hot Data

When L1 Excels

L1 cache provides maximum benefit when a single connection repeatedly executes identical queries.

Workload	L1 Benefit	Example
User dashboard refresh	HIGH	Same user refreshing their stats
Pagination (same page)	HIGH	User viewing page 1 multiple times
Auto-complete as-you-type	MEDIUM	Suggestions update on each keystroke
Polling for updates	HIGH	“Check for new messages” every 5s

L1 Configuration

[cache.l1]
enabled = true
size = 1000      # Entries per connection
ttl_secs = 60    # 1 minute TTL

L1 Demo: Dashboard Polling

This workload simulates a user dashboard that polls every 2 seconds:

-- Dashboard query (user 12345)
SELECT
  notification_count,
  unread_messages,
  last_activity
FROM user_dashboard
WHERE user_id = 12345;

Without L1 cache:

Every 2 seconds: 40ms database query
30 queries/minute = 1,200ms/minute spent on DB

With L1 cache (30s TTL):

First query: 40ms (cache miss)
Next 14 queries: 0.1ms each (cache hit)
2 queries/minute to DB = 80ms/minute
93% reduction in database load
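
The arithmetic above can be reproduced with a short calculation. This is a minimal sketch, assuming a fixed polling interval, a fixed TTL, and exactly one miss per TTL window; the function name and parameters are illustrative and not part of HeliosProxy.

```python
def polling_db_load(poll_interval_s, ttl_s, db_latency_ms, window_s=60):
    """Estimate DB time per window for one polling connection.

    Assumes one miss per TTL window and a hit for every other poll.
    """
    polls = window_s // poll_interval_s                 # polls issued in the window
    misses = min(polls, max(1, window_s // ttl_s))      # one refill per TTL expiry
    uncached_ms = polls * db_latency_ms                 # every poll reaches the DB
    cached_ms = misses * db_latency_ms                  # only misses reach the DB
    reduction = 1 - cached_ms / uncached_ms
    return polls, misses, uncached_ms, cached_ms, reduction


polls, misses, uncached_ms, cached_ms, reduction = polling_db_load(2, 30, 40)
print(f"{polls} polls, {misses} misses, {uncached_ms}ms -> {cached_ms}ms, "
      f"{reduction:.0%} less DB time")
```

With a 2 second poll, a 30 second TTL, and a 40ms query, this gives 30 polls, 2 misses, and 80ms of database time per minute instead of 1,200ms.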

L1 Benchmark Results

From our scalability tests:

Query: SELECT balance FROM accounts WHERE id = 1
Clients: 10, Duration: 30s

Without cache: 242 TPS, 41.3ms latency
With L1 (repeated): 242 TPS, 41.3ms latency (CPU-bound, not I/O bound)

Key insight: L1 benefits become visible when backend latency is high (network/disk). In this test the workload was CPU-bound, so throughput and latency did not change.

L2 Cache: Shared Parameterized Queries

When L2 Excels

L2 cache shines when many clients execute the same query pattern, especially when a small set of parameter values accounts for most requests.

Workload	L2 Benefit	Example
Product catalog lookups	VERY HIGH	Different users viewing same products
API endpoints	HIGH	`/api/user/:id` with popular IDs
Configuration reads	VERY HIGH	Same settings queried by all workers
Reference data	VERY HIGH	Countries, currencies, categories

Query Normalization

L2 cache normalizes parameterized queries:

-- These all normalize to the same query pattern:
SELECT * FROM products WHERE id = 123
SELECT * FROM products WHERE id = 456
SELECT * FROM products WHERE id = 789

-- Normalized form:
SELECT * FROM products WHERE id = $1
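
To make the idea concrete, here is a minimal sketch of literal-to-placeholder normalization. It is not HeliosProxy's implementation: the regex, the `normalize` function, and the choice to return the extracted values alongside the template (so that a lookup key could combine both) are assumptions for illustration. A production normalizer would use a real SQL parser.

```python
import re

# Matches single-quoted strings, or numeric literals that are not part of an
# identifier (t1) or an existing placeholder ($1).
_LITERAL = re.compile(r"'(?:[^']|'')*'|(?<![\w$])\d+(?:\.\d+)?\b")


def normalize(sql):
    """Return (template, params) with literals replaced by $1, $2, ..."""
    params = []

    def repl(match):
        params.append(match.group(0))
        return f"${len(params)}"

    template = _LITERAL.sub(repl, sql)
    # Collapse whitespace so formatting differences do not change the template.
    template = " ".join(template.split())
    return template, tuple(params)


for query in (
    "SELECT * FROM products WHERE id = 123",
    "SELECT * FROM products WHERE id = 456",
    "SELECT * FROM products  WHERE id = 789",
):
    print(normalize(query))
```

All three queries produce the same template and differ only in the extracted parameter tuple.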

L2 Configuration

[cache.l2]
enabled = true
size_mb = 512           # 512MB shared cache
ttl_secs = 300          # 5 minutes
normalize_queries = true
storage = "memory"      # or "mmap" for persistence

L2 Demo: Product Catalog

E-commerce scenario: 80% of traffic views 20% of products (Pareto distribution).

-- Popular product lookup (normalized)
SELECT name, price, description, stock
FROM products
WHERE id = $1;

Traffic pattern:

1000 unique products
80% of requests hit top 200 products
500 concurrent users

Without L2 cache:

Every product view: 40ms query
500 requests/sec = 20,000ms/sec of DB time

With L2 cache:

First view of each popular product: 40ms (miss)
Subsequent views: 1ms (hit)
After warmup: ~80% overall hit rate (the requests for popular items)
Effective latency: 8.8ms (0.8 * 1ms + 0.2 * 40ms)
78% latency reduction
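
The effective-latency figure is a weighted average of hit and miss latency. The sketch below shows how it moves with hit rate; the function name is illustrative, and the 1ms and 40ms values are the ones used in this demo.

```python
def effective_latency(hit_rate, hit_ms, miss_ms):
    """Expected latency for a given cache hit rate."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms


baseline_ms = 40.0
for hit_rate in (0.0, 0.5, 0.8, 0.95):
    latency = effective_latency(hit_rate, hit_ms=1.0, miss_ms=baseline_ms)
    reduction = 1 - latency / baseline_ms
    print(f"hit rate {hit_rate:.0%}: {latency:.1f}ms ({reduction:.0%} lower)")
```

At an 80% hit rate this reproduces the 8.8ms and 78% figures above.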

L2 Benchmark Results

Query: SELECT balance FROM accounts WHERE id = :random_1_to_1000
Clients: 100, Duration: 30s

HeliosDB Direct: 2,416 TPS, 41.3ms latency
HeliosProxy L2:  2,419 TPS, 41.3ms latency

Note: Random distribution across 1000 IDs = low hit rate (~0.3%)
Cache benefit requires concentrated access patterns.

Table-Specific TTL

Configure longer TTL for stable data:

[cache.tables.products]
ttl_secs = 3600  # Products change rarely

[cache.tables.orders]
ttl_secs = 60    # Orders change frequently

[cache.tables.user_sessions]
exclude = true   # Never cache (security)

L3 Cache: Semantic/AI Queries

When L3 Excels

L3 cache helps most with semantically similar queries: queries that mean the same thing but are phrased differently.

Workload	L3 Benefit	Example
Natural language search	VERY HIGH	“cheap hotels” vs “budget accommodation”
AI agent queries	VERY HIGH	Similar tool calls with different phrasing
Customer support	HIGH	FAQ variations from different users
RAG retrieval	HIGH	Similar questions about same docs

How L3 Works

Query text is converted to an embedding vector (via Ollama)
The embedding is compared to cached entries using cosine similarity
If the similarity exceeds the threshold (e.g., 0.92), the cached result is returned

User A: "What are the cheapest flights to Paris?"
         ↓ Embedding
         [0.23, -0.45, 0.12, ...]
         ↓ Cosine Similarity = 0.94
User B: "Find budget flights to Paris"
         ↓ Embedding
         [0.25, -0.43, 0.11, ...]
         ↓ CACHE HIT!
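
The lookup step can be sketched in a few lines. This is an illustration only: it uses toy 3-dimensional vectors rather than real 384-dimensional embeddings, a linear scan rather than an index, and a hypothetical `SemanticCache` class that is not HeliosProxy's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.entries = []  # list of (embedding, result)

    def put(self, embedding, result):
        self.entries.append((embedding, result))

    def get(self, embedding):
        """Return the best cached result above the threshold, else None."""
        best_score, best_result = 0.0, None
        for cached_embedding, result in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_result = score, result
        return best_result if best_score > self.threshold else None


cache = SemanticCache(threshold=0.92)
cache.put([0.9, 0.1, 0.3], "flight results for Paris")  # hypothetical embedding
print(cache.get([0.85, 0.15, 0.35]))  # points in nearly the same direction
print(cache.get([-0.2, 0.9, -0.1]))   # points in an unrelated direction
```

The first lookup clears the 0.92 threshold and returns the cached result; the second does not and falls through to the database.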

L3 Configuration

[cache.l3]
enabled = true
similarity_threshold = 0.92  # 92% similarity for hit
max_entries = 10000
ttl_secs = 3600              # 1 hour

# Ollama for embeddings
embedding_endpoint = "http://localhost:11434"
embedding_model = "all-minilm"
embedding_dim = 384

L3 Demo: Customer Support Search

Support knowledge base with 10,000 articles:

-- Natural language search (vector similarity)
SELECT article_id, title, content
FROM support_articles
ORDER BY embedding <-> $1  -- $1 is query embedding
LIMIT 5;

Query variations (all cached together):

“How do I reset my password?”
“Password reset instructions”
“I forgot my password”
“Can’t log in, need to change password”

Performance impact:

Without L3: Each query = embedding + vector search (~100ms)
With L3: Similar queries = 5ms cache lookup
95% latency reduction for common questions

AI-Specific Optimizations

HeliosProxy includes specialized caches for AI/LLM workloads.

Conversation Context Cache

Caches recent conversation turns for multi-turn AI agents:

[cache.ai.conversation]
enabled = true
max_turns = 50           # Turns per conversation
max_conversations = 10000
ttl_secs = 3600          # 1 hour idle timeout

Use case: AI agent maintaining context across tool calls.

// Pseudocode: AI agent conversation
let context = cache.conversation.get("conv_12345");
// Returns last 50 turns instantly instead of DB query

Benefit:

Without cache: Load 50 turns from DB = 50ms
With cache: Instant retrieval = 0.5ms
100x faster context retrieval
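
The `max_turns` and `max_conversations` limits suggest a bounded structure. A minimal sketch of that behavior follows, assuming per-conversation truncation to the most recent turns and least-recently-used eviction across conversations; the `ConversationCache` class is hypothetical, and the idle TTL is omitted for brevity.

```python
from collections import OrderedDict, deque


class ConversationCache:
    """Keeps recent turns per conversation; evicts idle conversations LRU-first."""

    def __init__(self, max_turns=50, max_conversations=10000):
        self.max_turns = max_turns
        self.max_conversations = max_conversations
        self._conversations = OrderedDict()  # conv_id -> deque of turns

    def append(self, conv_id, turn):
        turns = self._conversations.get(conv_id)
        if turns is None:
            turns = deque(maxlen=self.max_turns)  # oldest turns drop off automatically
            self._conversations[conv_id] = turns
        turns.append(turn)
        self._conversations.move_to_end(conv_id)  # mark as most recently used
        while len(self._conversations) > self.max_conversations:
            self._conversations.popitem(last=False)  # evict least recently used

    def get(self, conv_id):
        turns = self._conversations.get(conv_id)
        if turns is None:
            return None  # caller falls back to the database
        self._conversations.move_to_end(conv_id)
        return list(turns)


cache = ConversationCache(max_turns=2, max_conversations=2)
for turn in ("hello", "order status?", "it shipped"):
    cache.append("conv_12345", turn)
print(cache.get("conv_12345"))  # only the 2 most recent turns remain
```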

RAG Chunk Cache

Caches document chunks for Retrieval-Augmented Generation:

[cache.ai.rag]
enabled = true
max_chunks = 100000
max_documents = 10000
ttl_secs = 86400         # 24 hours (documents stable)

Use case: Document Q&A system retrieving relevant chunks.

-- Chunk retrieval (expensive vector search)
SELECT chunk_id, content, embedding
FROM document_chunks
WHERE document_id = $1
ORDER BY embedding <-> $2
LIMIT 10;

Benefit:

Same document queried by multiple users
First query: Full vector search (150ms)
Subsequent: Cached chunks (2ms)
75x faster for popular documents

Tool Result Cache

Caches results of AI tool calls:

[cache.ai.tools]
enabled = true
max_entries = 50000
ttl_secs = 300           # 5 minutes
hash_arguments = true    # Cache by tool+args hash

Use case: AI agents calling same tool with same arguments.

{
  "tool": "get_weather",
  "arguments": {"city": "New York", "date": "2024-01-26"}
}

Benefit:

Multiple agents asking for NYC weather = 1 API call
Repeated tool patterns cached across all conversations
Reduces external API costs by 60-80%
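
Caching by a tool-plus-arguments hash only works if equivalent calls produce the same key. One way to get that, sketched below, is to serialize the call canonically before hashing. The `tool_cache_key` function and the choice of SHA-256 are assumptions; the actual hash used when `hash_arguments = true` is not documented here.

```python
import hashlib
import json


def tool_cache_key(tool, arguments):
    """Stable key for a tool call: same tool and arguments give the same hash."""
    canonical = json.dumps(
        {"tool": tool, "arguments": arguments},
        sort_keys=True,         # argument order must not change the key
        separators=(",", ":"),  # no insignificant whitespace
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = tool_cache_key("get_weather", {"city": "New York", "date": "2024-01-26"})
b = tool_cache_key("get_weather", {"date": "2024-01-26", "city": "New York"})
c = tool_cache_key("get_weather", {"city": "Boston", "date": "2024-01-26"})
print(a == b)  # same call, different argument order
print(a == c)  # different city
```

The first comparison is true and the second is false, so reordered arguments share one cache entry while different arguments do not.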

Real-World Workload Examples

Example 1: E-Commerce Dashboard

Workload profile:

1000 concurrent users
Dashboard refresh every 30s
Product browsing (Pareto distribution)
Checkout (write-heavy, not cached)

# Optimized configuration
[cache]
enabled = true

[cache.l1]
enabled = true
size = 500
ttl_secs = 60    # Must exceed the 30s refresh interval

[cache.l2]
enabled = true
size_mb = 1024
ttl_secs = 300
normalize_queries = true

[cache.l3]
enabled = false  # No NLP queries

[cache.tables.user_cart]
exclude = true   # Always fresh

[cache.tables.products]
ttl_secs = 600   # 10 min for products

[cache.tables.categories]
ttl_secs = 3600  # 1 hour for categories

Expected results:

L1 hit rate: 40% (dashboard polling)
L2 hit rate: 60% (popular products)
Effective latency reduction: 50%
Database load reduction: 65%

Example 2: AI Customer Support Bot

Workload profile:

500 concurrent conversations
RAG retrieval from 50,000 documents
Natural language queries (semantic similarity)
Tool calls (weather, order lookup, etc.)

# AI-optimized configuration
[cache]
enabled = true

[cache.l1]
enabled = true
size = 200
ttl_secs = 300

[cache.l2]
enabled = true
size_mb = 512
ttl_secs = 600

[cache.l3]
enabled = true
similarity_threshold = 0.90
max_entries = 50000
ttl_secs = 7200
embedding_endpoint = "http://ollama:11434"
embedding_model = "all-minilm"

[cache.ai.conversation]
enabled = true
max_turns = 100
max_conversations = 5000

[cache.ai.rag]
enabled = true
max_chunks = 200000
ttl_secs = 86400

[cache.ai.tools]
enabled = true
max_entries = 100000
ttl_secs = 300

Expected results:

L3 hit rate: 35% (similar questions)
RAG cache hit rate: 70% (popular docs)
Tool cache hit rate: 50% (common queries)
Average response time: 200ms → 80ms
60% lower average response time

Example 3: Analytics Dashboard

Workload profile:

50 concurrent analysts
Complex aggregate queries
Same reports run repeatedly
Data refreshes hourly

# Analytics configuration
[cache]
enabled = true
max_result_size = 52428800  # 50MB results

[cache.l1]
enabled = true
size = 100
ttl_secs = 300

[cache.l2]
enabled = true
size_mb = 4096        # 4GB for large results
ttl_secs = 3600       # 1 hour TTL
normalize_queries = true
storage = "mmap"      # Persist across restarts
mmap_path = "/data/cache/l2.mmap"

[cache.l3]
enabled = false

[cache.tables.analytics_hourly]
ttl_secs = 3600

[cache.tables.analytics_daily]
ttl_secs = 86400

Expected results:

L2 hit rate: 80% (same reports)
Average query time: 5s → 500ms
10x faster dashboard loads

Cache Invalidation Strategies

WAL-Based Invalidation (Recommended)

Automatically invalidates cache when data changes:

[cache.invalidation]
mode = "wal"
wal_subscribe = true

How it works:

HeliosDB streams WAL changes to proxy
Proxy identifies affected tables
Related cache entries invalidated instantly

Best for: OLTP workloads with frequent writes.
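
Table-level invalidation needs a reverse index from tables to cache entries. The sketch below illustrates that bookkeeping under the assumption that each cached entry records the tables it read from; `TableIndexedCache` and `on_wal_change` are hypothetical names, and the WAL subscription itself is not shown.

```python
from collections import defaultdict


class TableIndexedCache:
    """Cache that tracks which entries depend on which tables."""

    def __init__(self):
        self._entries = {}                      # key -> cached result
        self._keys_by_table = defaultdict(set)  # table -> keys that read from it

    def put(self, key, result, tables):
        self._entries[key] = result
        for table in tables:
            self._keys_by_table[table].add(key)

    def get(self, key):
        return self._entries.get(key)

    def on_wal_change(self, table):
        """Drop every entry that read from the changed table; return the count."""
        removed = 0
        for key in self._keys_by_table.pop(table, set()):
            if key in self._entries:  # may already be gone via another table
                del self._entries[key]
                removed += 1
        return removed


cache = TableIndexedCache()
cache.put("q1", ["widget"], tables=["products"])
cache.put("q2", ["widget", "order 7"], tables=["products", "orders"])
cache.put("q3", ["order 9"], tables=["orders"])
print(cache.on_wal_change("products"))  # q1 and q2 are dropped
print(cache.get("q3"))                  # unaffected by the products change
```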

TTL-Based Invalidation

Simple time-based expiration:

[cache.invalidation]
mode = "ttl"
ttl_fallback_secs = 60

Best for:

Read-heavy workloads
Analytics (hourly refresh acceptable)
When WAL subscription unavailable

Manual Invalidation

API-driven cache clearing:

# Clear all cache
curl -X POST http://proxy:9090/cache/clear

# Clear specific table
curl -X POST http://proxy:9090/cache/clear/products

# Clear specific key
curl -X POST http://proxy:9090/cache/clear/key/products:123

Best for: Admin tools, deployment pipelines.

Monitoring Cache Performance

Metrics Endpoint

curl http://proxy:9090/cache/stats

{
  "l1": {
    "connections": 150,
    "total_entries": 45000,
    "hit_rate": 0.42,
    "hits": 1234567,
    "misses": 890123
  },
  "l2": {
    "size_mb": 512,
    "entries": 89000,
    "hit_rate": 0.58,
    "hits": 2345678,
    "misses": 1234567
  },
  "l3": {
    "entries": 15000,
    "hit_rate": 0.31,
    "avg_similarity": 0.94,
    "embedding_time_ms": 8.5
  },
  "ai": {
    "conversation_cache_entries": 2500,
    "rag_chunks_cached": 150000,
    "tool_results_cached": 45000
  }
}

Key Metrics to Watch

Metric	Healthy Range	Action If Outside
L1 hit rate	30-60%	Increase size or TTL
L2 hit rate	40-80%	Check query normalization
L3 hit rate	20-50%	Adjust similarity threshold
L2 size used	<90%	Increase size_mb
Eviction rate	<10%/min	Increase cache size
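
The hit-rate rows of this table can be turned into a simple check against the stats payload. This is a sketch, assuming the JSON shape shown in the Metrics Endpoint section; `check_hit_rates` is an illustrative helper, and it only flags rates below the lower bound on the assumption that an unusually high hit rate needs no action.

```python
HEALTHY_HIT_RATES = {  # lower and upper bounds taken from the table above
    "l1": (0.30, 0.60),
    "l2": (0.40, 0.80),
    "l3": (0.20, 0.50),
}


def check_hit_rates(stats):
    """Return (tier, hit_rate, status) for each tier present in stats."""
    report = []
    for tier, (low, _high) in HEALTHY_HIT_RATES.items():
        if tier not in stats:
            continue
        rate = stats[tier]["hit_rate"]
        status = "ok" if rate >= low else "low"
        report.append((tier, rate, status))
    return report


# Hit rates copied from the sample /cache/stats response.
sample = {"l1": {"hit_rate": 0.42}, "l2": {"hit_rate": 0.58}, "l3": {"hit_rate": 0.31}}
for tier, rate, status in check_hit_rates(sample):
    print(f"{tier}: {rate:.0%} {status}")
```

In practice the `sample` dictionary would come from parsing the endpoint's JSON response.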

Prometheus Metrics

# prometheus.yml scrape config
scrape_configs:
  - job_name: 'heliosproxy'
    static_configs:
      - targets: ['proxy:9100']

Available metrics:

heliosproxy_cache_hits_total{tier="l1|l2|l3"}
heliosproxy_cache_misses_total{tier="l1|l2|l3"}
heliosproxy_cache_size_bytes{tier="l1|l2|l3"}
heliosproxy_cache_evictions_total{tier="l1|l2|l3"}
heliosproxy_cache_latency_seconds{tier="l1|l2|l3",quantile="0.5|0.9|0.99"}
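
Because the hit and miss metrics are cumulative counters, a hit rate for a given period comes from the difference between two scrapes. The sketch below shows that calculation with made-up sample values; `windowed_hit_rate` is an illustrative helper, not part of any exporter.

```python
def windowed_hit_rate(prev, curr):
    """Hit rate over the interval between two counter samples.

    prev and curr are (hits_total, misses_total) tuples for one tier.
    """
    hits = curr[0] - prev[0]
    misses = curr[1] - prev[1]
    if hits < 0 or misses < 0:
        raise ValueError("counter went backwards (process restart?)")
    total = hits + misses
    return hits / total if total else 0.0


# Hypothetical scrapes of the l2 counters taken one minute apart.
earlier = (1000, 500)
later = (1600, 700)
print(f"{windowed_hit_rate(earlier, later):.0%}")
```

Here 600 hits and 200 misses occurred between the scrapes, so the windowed hit rate is 75%.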

Performance Benchmarks

Scalability with Cache

From our benchmark suite (1000 clients, 30s test):

Configuration	TPS	Latency	Efficiency
HeliosDB Direct	20,491	48.8ms	85%
HeliosProxy (no cache)	20,489	48.8ms	85%
HeliosProxy (L1+L2)	20,495	48.7ms	85%

Key insight: Cache overhead is negligible (<0.1%). Benefits appear when:

Queries have temporal locality (same queries repeated)
Access patterns are skewed (Pareto distribution)
Backend latency is high (network, disk I/O)

Cache Benefit Scenarios

Scenario	Without Cache	With Cache	Improvement
Dashboard polling	40ms	0.1ms	400x
Popular products	40ms	1ms	40x
NLP similar queries	100ms	5ms	20x
RAG chunk retrieval	150ms	2ms	75x
AI tool results	500ms	1ms	500x

Best Practices

1. Start with L1 + L2

Enable L1 and L2 for all deployments. They have minimal overhead:

[cache]
enabled = true

[cache.l1]
enabled = true

[cache.l2]
enabled = true

2. Enable L3 Only for AI Workloads

L3 requires Ollama and adds embedding latency (~10ms):

[cache.l3]
enabled = true  # Only if you have NLP queries

3. Tune TTL Based on Data Volatility

Data Type	Recommended TTL
Configuration	1-24 hours
Reference data	1-6 hours
Product catalog	5-30 minutes
User data	1-5 minutes
Session data	Don’t cache

4. Monitor Hit Rates

If hit rates are low:

L1 low: Users not repeating queries (expected for diverse workloads)
L2 low: Queries not normalizing well, or random access pattern
L3 low: Lower the similarity threshold (a higher threshold makes matching stricter), or queries are too diverse

5. Size Cache Appropriately

Rule of thumb:

L1: 500-2000 entries per expected concurrent connection
L2: 10-50% of working dataset size
L3: 1000 entries per distinct query pattern expected
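
These rules of thumb translate directly into a starting configuration. The sketch below applies them; `size_caches` is an illustrative helper, and the default per-connection entry count and working-set fraction are mid-range picks, not recommendations from HeliosProxy.

```python
def size_caches(connections, working_set_mb, query_patterns,
                l1_entries_per_conn=1000, l2_fraction=0.25):
    """Apply the rules of thumb above to produce starting sizes."""
    if not 500 <= l1_entries_per_conn <= 2000:
        raise ValueError("L1 rule of thumb is 500-2000 entries per connection")
    if not 0.10 <= l2_fraction <= 0.50:
        raise ValueError("L2 rule of thumb is 10-50% of the working set")
    return {
        "l1_total_entries": connections * l1_entries_per_conn,
        "l2_size_mb": round(working_set_mb * l2_fraction),
        "l3_max_entries": query_patterns * 1000,
    }


print(size_caches(connections=150, working_set_mb=4096, query_patterns=10))
```

For 150 connections, a 4GB working set, and 10 distinct query patterns, this suggests 150,000 total L1 entries, a 1,024MB L2, and 10,000 L3 entries as a starting point to refine against observed hit rates.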

Troubleshooting

“Cache hit rate is 0%”

Cause: Queries not matching cache entries

Solutions:

Verify cache is enabled: curl http://proxy:9090/cache/stats
Check query normalization is working
Ensure TTL > query interval

“Memory usage too high”

Cause: Cache size misconfigured

Solutions:

[cache]
max_result_size = 1048576  # 1MB max per entry

[cache.l2]
size_mb = 256        # Reduce from default

“L3 cache slow”

Cause: Embedding computation overhead

Solutions:

Use a faster embedding model: all-minilm (384d) instead of nomic-embed-text (768d)
Batch queries if possible
Increase L3 TTL to reduce embedding frequency

“Stale data in cache”

Cause: TTL too long or invalidation not working

Solutions:

Enable WAL invalidation:

[cache.invalidation]
mode = "wal"
wal_subscribe = true

Reduce TTL for volatile tables
Use manual invalidation API after writes
