
CUDA Kernel Compilation Cache - Developer Guide


Speed up CUDA kernel compilation by 90%+


Overview

The kernel compilation cache dramatically reduces CUDA kernel compilation time from 30-60 seconds to 3-6 seconds by caching compiled PTX code.

Key Benefits

  • 10-20x faster kernel compilation (cached)
  • >95% cache hit rate in development
  • 💾 Persistent cache across sessions
  • 🔄 Automatic eviction with LRU policy
  • 🔒 Thread-safe concurrent access
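The LRU eviction mentioned above can be illustrated with a toy model. This is a sketch of the policy only, not the actual `KernelCache` internals (the struct, fields, and sizes here are invented for illustration):

```rust
use std::collections::VecDeque;

// Toy LRU model of the cache's eviction policy: entries are
// (name, size_mb) pairs, most recently used at the back.
struct LruSketch {
    entries: VecDeque<(String, f64)>,
    max_mb: f64,
}

impl LruSketch {
    fn new(max_mb: f64) -> Self {
        Self { entries: VecDeque::new(), max_mb }
    }

    fn total_mb(&self) -> f64 {
        self.entries.iter().map(|(_, s)| s).sum()
    }

    // Insert an entry, evicting least-recently-used entries until it fits.
    fn insert(&mut self, name: &str, size_mb: f64) {
        self.entries.push_back((name.to_string(), size_mb));
        while self.total_mb() > self.max_mb {
            self.entries.pop_front(); // evict oldest
        }
    }

    // A cache hit moves the entry to the most-recently-used position.
    fn touch(&mut self, name: &str) {
        if let Some(pos) = self.entries.iter().position(|(n, _)| n == name) {
            let e = self.entries.remove(pos).unwrap();
            self.entries.push_back(e);
        }
    }
}

fn main() {
    let mut cache = LruSketch::new(10.0);
    cache.insert("sum_kernel", 4.0);
    cache.insert("avg_kernel", 4.0);
    cache.touch("sum_kernel");           // sum_kernel is now most recent
    cache.insert("min_max_kernel", 4.0); // exceeds 10 MB: avg_kernel evicted
    assert!(cache.entries.iter().any(|(n, _)| n == "sum_kernel"));
    assert!(!cache.entries.iter().any(|(n, _)| n == "avg_kernel"));
}
```

The key property: recently-touched kernels survive eviction, so the hot set of kernels in a development loop stays cached even when the size limit is hit.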

Quick Start

Basic Usage

use heliosdb_gpu::kernel_cache::{KernelCache, CompileOptions};
// Create cache (default 1GB limit)
let mut cache = KernelCache::new()?;
// Define CUDA kernel
let kernel_source = r#"
extern "C" __global__ void my_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= 2.0f;
    }
}
"#;
// Compile with default options
let options = CompileOptions::default();
let kernel = cache.get_or_compile(kernel_source, "my_kernel", &options)?;
// First call: compiles (30-60s)
// Subsequent calls: cached (3-6s)

Integration with GPU Aggregator

use heliosdb_gpu::GpuAggregator;
// KernelCache is automatically integrated
let aggregator = GpuAggregator::new()?;
// Kernels are automatically cached
let data = vec![1.0; 1_000_000];
let sum = aggregator.sum_f64(&data)?; // Uses cached kernels!

Configuration

Custom Cache Size

// Create cache with 512 MB limit
let cache = KernelCache::with_max_size_mb(512)?;
// Create cache with 2 GB limit
let cache = KernelCache::with_max_size_mb(2048)?;

Compile Options

let options = CompileOptions::default()
    .with_optimization(3)               // Optimization level 0-3
    .with_fast_math(true)               // Enable fast math
    .with_gpu_arch("sm_80".to_string()) // A100
    .with_flag("-DDEBUG".to_string());  // Custom flag

Common GPU Architectures

GPU      | Architecture | Code
A100     | Ampere       | sm_80
RTX 3090 | Ampere       | sm_86
V100     | Volta        | sm_70
RTX 2080 | Turing       | sm_75
P100     | Pascal       | sm_60
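For convenience, the table above could be encoded as a small lookup helper. Note that `gpu_arch_for` is a hypothetical function written for this guide, not part of the `heliosdb_gpu` API:

```rust
// Hypothetical helper mapping a GPU model from the table above to the
// compute-capability string expected by CompileOptions::with_gpu_arch.
// Not part of the heliosdb_gpu crate.
fn gpu_arch_for(model: &str) -> Option<&'static str> {
    match model {
        "A100" => Some("sm_80"),
        "RTX 3090" => Some("sm_86"),
        "V100" => Some("sm_70"),
        "RTX 2080" => Some("sm_75"),
        "P100" => Some("sm_60"),
        _ => None, // unknown GPU: fall back to the crate's default arch
    }
}

fn main() {
    assert_eq!(gpu_arch_for("A100"), Some("sm_80"));
    assert_eq!(gpu_arch_for("Unknown GPU"), None);
}
```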

Cache Management

View Cache Statistics

let cache = KernelCache::new()?;
cache.print_info();

Output:

Kernel Cache Statistics
Cache directory: /home/user/.cache/heliosdb/cuda
Entries: 12
Size: 4.8 MB / 1024 MB
Hits: 95
Misses: 5
Hit rate: 95.0%
Avg compile time: 42,000 ms
Avg cache time: 3,200 ms
Evictions: 0

Get Statistics Programmatically

let stats = cache.stats();
println!("Hit rate: {:.1}%", stats.hit_rate() * 100.0);
println!("Avg compile time: {:.0} ms", stats.avg_compile_time_ms());
println!("Avg cache time: {:.0} ms", stats.avg_cache_time_ms());
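The arithmetic behind these statistics can be modelled as follows. The field and method names below are assumptions for illustration, not the crate's actual stats type; the numbers match the sample output above (95 hits, 5 misses):

```rust
// Toy model of the cache statistics arithmetic.
struct StatsSketch {
    hits: u64,
    misses: u64,
    total_compile_ms: f64, // time spent on cold compiles (misses)
    total_cache_ms: f64,   // time spent loading cached PTX (hits)
}

impl StatsSketch {
    fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 { 0.0 } else { self.hits as f64 / total as f64 }
    }

    fn avg_compile_time_ms(&self) -> f64 {
        if self.misses == 0 { 0.0 } else { self.total_compile_ms / self.misses as f64 }
    }

    fn avg_cache_time_ms(&self) -> f64 {
        if self.hits == 0 { 0.0 } else { self.total_cache_ms / self.hits as f64 }
    }
}

fn main() {
    // Totals chosen to reproduce the sample output above.
    let stats = StatsSketch {
        hits: 95,
        misses: 5,
        total_compile_ms: 210_000.0,
        total_cache_ms: 304_000.0,
    };
    assert!((stats.hit_rate() - 0.95).abs() < 1e-9);              // 95 / 100
    assert!((stats.avg_compile_time_ms() - 42_000.0).abs() < 1e-9); // 210,000 / 5
    assert!((stats.avg_cache_time_ms() - 3_200.0).abs() < 1e-9);    // 304,000 / 95
}
```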

Clear Cache

// Clear all cached kernels
cache.clear()?;
// Useful for:
// - Forcing recompilation
// - Testing cold compilation
// - Freeing disk space

Check Cache Size

let size_mb = cache.size_mb();
let count = cache.entry_count();
println!("Cache: {} entries, {:.1} MB", count, size_mb);

Advanced Usage

Multiple Kernels

let mut cache = KernelCache::new()?;
let options = CompileOptions::default();
// Compile multiple kernels
let kernel1 = cache.get_or_compile(SUM_KERNEL, "sum_kernel", &options)?;
let kernel2 = cache.get_or_compile(MIN_MAX_KERNEL, "min_max_kernel", &options)?;
let kernel3 = cache.get_or_compile(AVG_KERNEL, "avg_kernel", &options)?;
// All cached independently
// Subsequent access is fast for all

Different Optimization Levels

let mut cache = KernelCache::new()?;
// Debug build (no optimization)
let debug_opts = CompileOptions::default().with_optimization(0);
let debug_kernel = cache.get_or_compile(source, "kernel", &debug_opts)?;
// Release build (full optimization)
let release_opts = CompileOptions::default().with_optimization(3);
let release_kernel = cache.get_or_compile(source, "kernel", &release_opts)?;
// Both cached separately

Custom Compiler Flags

let options = CompileOptions::default()
    .with_flag("--use_fast_math".to_string())
    .with_flag("--maxrregcount=64".to_string())
    .with_flag("-DDEBUG_MODE".to_string())
    .with_flag("--ptxas-options=-v".to_string());
let kernel = cache.get_or_compile(source, "kernel", &options)?;

Performance Tips

1. Warm Up Cache on Startup

fn warm_up_cache() -> Result<()> {
    let mut cache = KernelCache::new()?;
    let options = CompileOptions::default();
    // Pre-compile frequently used kernels
    cache.get_or_compile(SUM_KERNEL, "sum_kernel", &options)?;
    cache.get_or_compile(MIN_MAX_KERNEL, "min_max_kernel", &options)?;
    cache.get_or_compile(AVG_KERNEL, "avg_kernel", &options)?;
    Ok(())
}

2. Reuse Cache Instance

// ✅ Good: Reuse cache
let cache = Arc::new(Mutex::new(KernelCache::new()?));
for _ in 0..100 {
    let mut cache = cache.lock();
    cache.get_or_compile(kernel_source, "kernel", &options)?;
}
// ❌ Bad: Create new cache each time
for _ in 0..100 {
    let mut cache = KernelCache::new()?; // Loses cache!
    cache.get_or_compile(kernel_source, "kernel", &options)?;
}

3. Use Appropriate Optimization Levels

// Development: Fast compilation
let dev_opts = CompileOptions::default().with_optimization(0);
// Production: Maximum performance
let prod_opts = CompileOptions::default()
    .with_optimization(3)
    .with_fast_math(true);

Troubleshooting

Cache Misses

Problem: Low cache hit rate

Solutions:

// 1. Check cache size
let size = cache.size_mb();
if size < 10.0 {
    println!("Cache might be evicting frequently");
}
// 2. Increase cache size
let cache = KernelCache::with_max_size_mb(2048)?;
// 3. Check for changing compile options
// Ensure consistent options across compilations
let options = CompileOptions::default();

Slow First Compilation

Problem: First kernel compilation is slow

Expected: This is normal!

  • First compilation: 30-60s (compiling PTX)
  • Subsequent: 3-6s (loading from cache)

Solution: Pre-warm cache at startup

Cache Directory Permissions

Problem: Cannot create cache directory

Solution:

// Check cache directory
use dirs::cache_dir;
if let Some(dir) = cache_dir() {
    let helios_cache = dir.join("heliosdb/cuda");
    println!("Cache directory: {}", helios_cache.display());
    // Ensure writable
    std::fs::create_dir_all(&helios_cache)?;
}

Cache Size Growing Too Large

Problem: Cache exceeds disk space

Solutions:

// 1. Reduce cache size limit
let cache = KernelCache::with_max_size_mb(512)?;
// 2. Manually clear cache
cache.clear()?;
// 3. Monitor cache size
if cache.size_mb() > 900.0 {
    cache.clear()?;
}

Best Practices

✅ Do

  1. Reuse KernelCache instances

    let cache = Arc::new(Mutex::new(KernelCache::new()?));
  2. Use consistent compile options

    let options = CompileOptions::default();
    // Reuse 'options' for all compilations
  3. Pre-warm cache for production

    fn init() {
        warm_up_cache().ok();
    }
  4. Monitor cache statistics

    if cache.stats().hit_rate() < 0.5 {
        log::warn!("Low cache hit rate");
    }

❌ Don’t

  1. Don’t create new cache for each compilation

    // ❌ Bad
    for _ in 0..100 {
        let cache = KernelCache::new()?;
    }
  2. Don’t use different options for same kernel

    // ❌ Bad: Creates separate cache entries
    let opts1 = CompileOptions::default().with_optimization(2);
    let opts2 = CompileOptions::default().with_optimization(3);
  3. Don’t ignore cache errors

    // ❌ Bad
    cache.get_or_compile(source, "kernel", &options).ok();
    // ✅ Good
    cache.get_or_compile(source, "kernel", &options)?;

Examples

Example 1: GPU Aggregation

use heliosdb_gpu::{GpuAggregator, AggregationType};
let mut aggregator = GpuAggregator::new()?;
// Generate large dataset
let data: Vec<f64> = (0..10_000_000)
    .map(|i| i as f64)
    .collect();
// First run: compiles kernels (30-60s)
let sum1 = aggregator.sum_f64(&data)?;
// Second run: uses cached kernels (3-6s)
let sum2 = aggregator.sum_f64(&data)?;
// Verify results match
assert_eq!(sum1, sum2);

Example 2: Custom Kernel

let mut cache = KernelCache::new()?;
let kernel_source = r#"
extern "C" __global__ void double_values(
    double* data,
    int n
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= 2.0;
    }
}
"#;
let options = CompileOptions::default()
    .with_optimization(3)
    .with_fast_math(true);
// Compile and cache
let kernel = cache.get_or_compile(
    kernel_source,
    "double_values",
    &options,
)?;
// Use compiled kernel
let ptx = kernel.to_ptx();
// Load and launch kernel...

Example 3: Multi-threaded Access

use std::sync::Arc;
use parking_lot::Mutex;
use std::thread;
let cache = Arc::new(Mutex::new(KernelCache::new()?));
let mut handles = vec![];
for _ in 0..4 {
    let cache_clone = Arc::clone(&cache);
    let handle = thread::spawn(move || {
        let mut cache = cache_clone.lock();
        let options = CompileOptions::default();
        cache.get_or_compile(KERNEL_SOURCE, "my_kernel", &options)
    });
    handles.push(handle);
}
for handle in handles {
    handle.join().unwrap()?;
}
// All threads share the same cache

Performance Benchmarks

Compilation Time

Kernel Complexity | Cold (ms) | Cached (ms) | Speedup
Simple            | 30,000    | 3,000       | 10x
Medium            | 45,000    | 4,500       | 10x
Complex           | 60,000    | 5,000       | 12x

Cache Hit Rate

Scenario                | Hit Rate
Development (same code) | 95-99%
Development (iterating) | 80-90%
CI/CD (clean build)     | 0% (first run), 95%+ (subsequent)

Developer Productivity

  • Before: 1-2 hours daily on kernel compilation
  • After: 5-10 minutes daily on kernel compilation
  • Time saved: 90%+

FAQ

Q: Where is the cache stored? A: ~/.cache/heliosdb/cuda/ on Linux/macOS, %LOCALAPPDATA%\heliosdb\cuda\ on Windows

Q: Is the cache shared between projects? A: Yes, the cache is system-wide and shared across all HeliosDB projects

Q: What happens if source code changes? A: The cache key includes source code hash, so changes trigger recompilation
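This content-addressed behavior can be sketched with the standard library's hasher. Deriving the key from both the source and the options means changing either one produces a new cache entry; the real `KernelCache` may well use a different (e.g. cryptographic) hash, so treat this only as a model:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of a content-addressed cache key: hash the kernel source
// together with the compile options, so any change to either yields
// a new key and therefore a recompilation.
fn cache_key(source: &str, options: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    source.hash(&mut hasher);
    options.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let k1 = cache_key("__global__ void k() {}", "-O3");
    let k2 = cache_key("__global__ void k() {}", "-O3");
    let k3 = cache_key("__global__ void k() { /* edited */ }", "-O3");
    let k4 = cache_key("__global__ void k() {}", "-O0");
    assert_eq!(k1, k2); // identical input  -> same key (cache hit)
    assert_ne!(k1, k3); // source changed   -> new key (recompile)
    assert_ne!(k1, k4); // options changed  -> new key (separate entry)
}
```

This also explains the "different options for the same kernel" anti-pattern in Best Practices: each option set maps to its own entry.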

Q: Can I share the cache with my team? A: Currently no, but future versions may support distributed cache

Q: Does cache work in Docker containers? A: Yes, mount cache directory as volume: -v ~/.cache/heliosdb:/root/.cache/heliosdb

Q: How much disk space does cache use? A: Default limit is 1GB, configurable via with_max_size_mb()



Last Updated: November 14, 2025 Version: 1.0.0