HeliosDB GPU API Contract Design v2.0
Document Type: Architecture Specification
Version: 2.0
Date: 2025-11-18
Author: Agent 4 (GPU Acceleration Specialist)
Status: DESIGN PHASE - Week 5 Day 1
Target Audience: Agent 2 (Multimodal Vector Features), Query Optimizer, Core Engine
Table of Contents
- Executive Summary
- Design Principles
- Memory Allocation API
- Data Transfer API
- Kernel Execution API
- Cost Model Hooks
- Type System
- Error Handling
- Lifecycle Management
- Performance Guarantees
- Integration Patterns
- Example Workflows
1. Executive Summary
This document defines the low-level API contracts for GPU acceleration in HeliosDB. These contracts establish clear boundaries between:
- Memory Management: Allocation, pooling, and lifecycle
- Data Transfer: Host-device communication patterns
- Kernel Execution: GPU compute invocation
- Cost Modeling: Intelligent CPU/GPU routing decisions
1.1 Design Goals
- Type Safety: Leverage Rust’s type system to prevent GPU programming errors
- Zero-Cost Abstractions: Minimal overhead over raw CUDA/ROCm
- Automatic Fallback: Graceful degradation when GPU unavailable
- Vendor Neutrality: Support NVIDIA, AMD, and future vendors
- Memory Safety: RAII-based lifecycle management
1.2 Target Performance
- Memory Transfer Overhead: <5ms for 100MB transfers
- Kernel Launch Overhead: <0.1ms
- Memory Allocation Overhead: <0.01ms (pooled)
- CPU Fallback Decision: <0.001ms
2. Design Principles
2.1 Zero-Copy Philosophy
Minimize data movement between CPU and GPU:
```rust
// ❌ BAD: Copy to temporary buffer
let temp = data.to_vec();
gpu.upload(&temp)?;

// GOOD: Direct transfer from source
gpu.upload_direct(&data)?;

// BEST: Zero-copy with pinned memory
let pinned = gpu.pin_memory(&data)?;
gpu.upload_pinned(&pinned)?;
```
2.2 RAII Lifecycle Management
All GPU resources use RAII:
```rust
{
    let buffer = gpu.allocate(size)?;
    // Use buffer
    // Automatic cleanup on drop
}
```
2.3 Explicit vs Implicit Synchronization
```rust
// Explicit sync (user control)
gpu.launch_kernel_async(&kernel, &params)?;
gpu.synchronize()?;

// Implicit sync (convenience)
let result = gpu.launch_kernel_sync(&kernel, &params)?;
```
2.4 Type-Safe Device Pointers
```rust
// Type-safe device pointer
pub struct DevicePtr<T> {
    ptr: usize,
    len: usize,
    _phantom: PhantomData<T>,
}

// Prevents wrong-type access at compile time
let f32_ptr: DevicePtr<f32> = gpu.allocate_typed(1024)?;
// ❌ This won't compile:
// let i32_ptr: DevicePtr<i32> = f32_ptr;
```
3. Memory Allocation API
3.1 Core Memory Types
```rust
/// GPU memory buffer with type safety
pub struct GpuBuffer<T> {
    device_ptr: DevicePtr<T>,
    capacity: usize,
    device_id: DeviceId,
    allocator: Weak<dyn GpuAllocator>, // trait object: allocator is a trait
}

/// Pinned host memory for fast transfers
pub struct PinnedBuffer<T> {
    host_ptr: *mut T,
    capacity: usize,
    is_registered: bool,
}

/// Unified memory (CPU and GPU accessible)
pub struct UnifiedBuffer<T> {
    ptr: *mut T,
    capacity: usize,
    hint: AccessHint,
}

#[derive(Copy, Clone, Debug)]
pub enum AccessHint {
    HostMostly,
    DeviceMostly,
    Balanced,
}
```
3.2 Allocator Interface
```rust
/// GPU memory allocator
pub trait GpuAllocator: Send + Sync {
    /// Allocate typed buffer
    fn allocate<T>(&self, count: usize) -> Result<GpuBuffer<T>>
    where
        T: DeviceRepr;

    /// Allocate with alignment
    fn allocate_aligned<T>(
        &self,
        count: usize,
        alignment: usize,
    ) -> Result<GpuBuffer<T>>
    where
        T: DeviceRepr;

    /// Allocate pinned host memory
    fn allocate_pinned<T>(&self, count: usize) -> Result<PinnedBuffer<T>>
    where
        T: DeviceRepr;

    /// Allocate unified memory
    fn allocate_unified<T>(
        &self,
        count: usize,
        hint: AccessHint,
    ) -> Result<UnifiedBuffer<T>>
    where
        T: DeviceRepr;

    /// Get allocation statistics
    fn stats(&self) -> AllocationStats;

    /// Defragment memory (blocking operation)
    fn defragment(&self) -> Result<DefragmentationReport>;
}

/// Allocation statistics
#[derive(Debug, Clone)]
pub struct AllocationStats {
    pub total_allocated: usize,
    pub total_freed: usize,
    pub current_usage: usize,
    pub peak_usage: usize,
    pub fragmentation_ratio: f32,
    pub allocations: u64,
    pub cache_hits: u64,
    pub cache_misses: u64,
}

/// Defragmentation report
#[derive(Debug, Clone)]
pub struct DefragmentationReport {
    pub bytes_moved: usize,
    pub time_ms: f64,
    pub fragmentation_before: f32,
    pub fragmentation_after: f32,
}
```
3.3 Memory Pool Configuration
```rust
/// Memory pool configuration
#[derive(Debug, Clone)]
pub struct PoolConfig {
    /// Maximum pool size in bytes
    pub max_size: usize,

    /// Minimum allocation size (bytes)
    pub min_allocation: usize,

    /// Maximum allocation size (bytes)
    pub max_allocation: usize,

    /// Bucket growth strategy
    pub growth_strategy: GrowthStrategy,

    /// Eviction policy
    pub eviction_policy: EvictionPolicy,

    /// Enable defragmentation
    pub auto_defragment: bool,

    /// Defragmentation threshold
    pub defrag_threshold: f32,
}

#[derive(Debug, Clone, Copy)]
pub enum GrowthStrategy {
    PowerOfTwo,
    Fibonacci,
    Linear(usize),
}

#[derive(Debug, Clone, Copy)]
pub enum EvictionPolicy {
    LRU,
    LFU,
    ARC,
    FIFO,
}

impl Default for PoolConfig {
    fn default() -> Self {
        Self {
            max_size: 1 << 30,         // 1GB
            min_allocation: 4096,      // 4KB
            max_allocation: 256 << 20, // 256MB
            growth_strategy: GrowthStrategy::PowerOfTwo,
            eviction_policy: EvictionPolicy::LRU,
            auto_defragment: true,
            defrag_threshold: 0.3,
        }
    }
}
```
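Callers are expected to start from `PoolConfig::default()` and override only what differs. The sketch below illustrates this with struct-update syntax; the specific values are invented for the example, not recommendations:

```rust
// Hypothetical pool tuned for a workload of large vector buffers.
let pool = PoolConfig {
    max_size: 4 << 30,                 // raise the 1GB default cap to 4GB
    max_allocation: 512 << 20,         // allow single 512MB buffers
    eviction_policy: EvictionPolicy::ARC,
    auto_defragment: false,            // defragment manually in quiet periods
    ..PoolConfig::default()
};
```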
3.4 Memory Buffer Operations
```rust
impl<T: DeviceRepr> GpuBuffer<T> {
    /// Get buffer capacity (elements)
    pub fn capacity(&self) -> usize;

    /// Get buffer size (bytes)
    pub fn size_bytes(&self) -> usize;

    /// Get device ID
    pub fn device(&self) -> DeviceId;

    /// Check if buffer is valid
    pub fn is_valid(&self) -> bool;

    /// Get raw device pointer (unsafe)
    pub unsafe fn as_device_ptr(&self) -> *mut T;

    /// Fill buffer with value
    pub fn fill(&mut self, value: T) -> Result<()>
    where
        T: Copy;

    /// Copy from another GPU buffer (device-to-device)
    pub fn copy_from_device(&mut self, src: &GpuBuffer<T>) -> Result<()>;

    /// Copy to another GPU buffer
    pub fn copy_to_device(&self, dst: &mut GpuBuffer<T>) -> Result<()>;

    /// Create view into buffer
    pub fn view(&self, offset: usize, len: usize) -> Result<GpuBufferView<T>>;

    /// Create mutable view
    pub fn view_mut(
        &mut self,
        offset: usize,
        len: usize,
    ) -> Result<GpuBufferViewMut<T>>;
}
```
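Together these operations support the common initialize-then-copy pattern without a host round-trip. A minimal sketch, assuming a `ctx: GpuContext` as defined in Section 9:

```rust
// Allocate two buffers, zero-fill one, and mirror it device-to-device.
let mut src: GpuBuffer<f32> = ctx.allocator().allocate(1024)?;
let mut dst: GpuBuffer<f32> = ctx.allocator().allocate(1024)?;

src.fill(0.0)?;                // device-side memset
dst.copy_from_device(&src)?;   // stays on the GPU, no host copy

// Operate on the second half only, without a new allocation.
let tail = src.view(512, 512)?;
assert_eq!(tail.len(), 512);
```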
3.5 Advanced Memory Features
```rust
/// Memory-mapped buffer for zero-copy access
pub struct MappedBuffer<T> {
    buffer: GpuBuffer<T>,
    host_ptr: *mut T,
    mapping: MemoryMapping,
}

impl<T: DeviceRepr> MappedBuffer<T> {
    /// Map GPU buffer to host address space
    pub fn map(buffer: GpuBuffer<T>) -> Result<Self>;

    /// Access as host slice (read-only)
    pub fn as_slice(&self) -> &[T];

    /// Access as mutable host slice
    pub fn as_mut_slice(&mut self) -> &mut [T];

    /// Unmap and return underlying buffer
    pub fn unmap(self) -> GpuBuffer<T>;

    /// Synchronize host changes to device
    pub fn sync_to_device(&mut self) -> Result<()>;

    /// Synchronize device changes to host
    pub fn sync_to_host(&mut self) -> Result<()>;
}
```
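The intended round-trip, sketched with a `buffer: GpuBuffer<f32>` already allocated:

```rust
// Map, mutate on the host, push the change, and recover the buffer.
let mut mapped = MappedBuffer::map(buffer)?;
mapped.as_mut_slice()[0] = 42.0;   // ordinary host write
mapped.sync_to_device()?;          // make the write visible to kernels
let buffer = mapped.unmap();       // ownership returns to the caller
```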
4. Data Transfer API
4.1 Transfer Interface
```rust
/// Data transfer manager
pub trait DataTransfer: Send + Sync {
    /// Synchronous host-to-device transfer
    fn upload<T>(&self, src: &[T], dst: &mut GpuBuffer<T>) -> Result<()>
    where
        T: DeviceRepr;

    /// Synchronous device-to-host transfer
    fn download<T>(&self, src: &GpuBuffer<T>, dst: &mut [T]) -> Result<()>
    where
        T: DeviceRepr;

    /// Asynchronous host-to-device transfer
    fn upload_async<T>(
        &self,
        src: &[T],
        dst: &mut GpuBuffer<T>,
        stream: &Stream,
    ) -> Result<TransferHandle>
    where
        T: DeviceRepr;

    /// Asynchronous device-to-host transfer
    fn download_async<T>(
        &self,
        src: &GpuBuffer<T>,
        dst: &mut [T],
        stream: &Stream,
    ) -> Result<TransferHandle>
    where
        T: DeviceRepr;

    /// Device-to-device transfer (same GPU)
    fn copy_device<T>(
        &self,
        src: &GpuBuffer<T>,
        dst: &mut GpuBuffer<T>,
    ) -> Result<()>
    where
        T: DeviceRepr;

    /// Peer-to-peer transfer (different GPUs)
    fn copy_peer<T>(
        &self,
        src: &GpuBuffer<T>,
        dst: &mut GpuBuffer<T>,
        src_device: DeviceId,
        dst_device: DeviceId,
    ) -> Result<()>
    where
        T: DeviceRepr;
}
```
4.2 Transfer Handle
```rust
/// Handle for asynchronous transfer
pub struct TransferHandle {
    id: TransferId,
    stream: StreamId,
    size_bytes: usize,
    direction: TransferDirection,
    start_time: Instant,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransferDirection {
    HostToDevice,
    DeviceToHost,
    DeviceToDevice,
    PeerToPeer,
}

impl TransferHandle {
    /// Wait for transfer to complete
    pub fn wait(self) -> Result<TransferStats>;

    /// Check if transfer is complete (non-blocking)
    pub fn is_complete(&self) -> bool;

    /// Get transfer progress (0.0 to 1.0)
    pub fn progress(&self) -> f32;

    /// Get transfer ID
    pub fn id(&self) -> TransferId;

    /// Get estimated time remaining
    pub fn time_remaining(&self) -> Duration;
}

/// Transfer statistics
#[derive(Debug, Clone)]
pub struct TransferStats {
    pub bytes_transferred: usize,
    pub duration: Duration,
    pub bandwidth_gbps: f64,
    pub direction: TransferDirection,
}
```
4.3 Optimized Transfer Patterns
```rust
/// Batch transfer for multiple buffers
pub struct BatchTransfer {
    transfers: Vec<TransferSpec>,
    strategy: BatchStrategy,
}

pub enum TransferSpec {
    HostToDevice {
        src: Vec<u8>,
        dst: GpuBuffer<u8>,
    },
    DeviceToHost {
        src: GpuBuffer<u8>,
        dst: Vec<u8>,
    },
}

#[derive(Debug, Clone, Copy)]
pub enum BatchStrategy {
    Sequential,
    Parallel,
    Pipelined,
}

impl BatchTransfer {
    /// Create new batch transfer
    pub fn new(strategy: BatchStrategy) -> Self;

    /// Add transfer to batch
    pub fn add(&mut self, spec: TransferSpec);

    /// Execute all transfers
    pub fn execute(self, stream: &Stream) -> Result<Vec<TransferHandle>>;

    /// Execute and wait for all transfers
    pub fn execute_sync(self) -> Result<Vec<TransferStats>>;
}
```
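A short usage sketch; `payload_a`/`payload_b` (byte vectors) and `buf_a`/`buf_b` (pre-allocated `GpuBuffer<u8>` values) are assumed to exist for the example:

```rust
// Queue two uploads and run them with the pipelined strategy.
let mut batch = BatchTransfer::new(BatchStrategy::Pipelined);
batch.add(TransferSpec::HostToDevice { src: payload_a, dst: buf_a });
batch.add(TransferSpec::HostToDevice { src: payload_b, dst: buf_b });

for stats in batch.execute_sync()? {
    println!("{} bytes at {:.1} GB/s", stats.bytes_transferred, stats.bandwidth_gbps);
}
```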
4.4 Pinned Memory Transfers
```rust
/// Pinned memory transfer (faster than pageable)
pub trait PinnedTransfer {
    /// Transfer from pinned host memory
    fn upload_pinned<T>(
        &self,
        src: &PinnedBuffer<T>,
        dst: &mut GpuBuffer<T>,
    ) -> Result<()>
    where
        T: DeviceRepr;

    /// Transfer to pinned host memory
    fn download_pinned<T>(
        &self,
        src: &GpuBuffer<T>,
        dst: &mut PinnedBuffer<T>,
    ) -> Result<()>
    where
        T: DeviceRepr;

    /// Asynchronous pinned transfer
    fn upload_pinned_async<T>(
        &self,
        src: &PinnedBuffer<T>,
        dst: &mut GpuBuffer<T>,
        stream: &Stream,
    ) -> Result<TransferHandle>
    where
        T: DeviceRepr;
}
```
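The expected flow is to stage host data in a pinned buffer once and reuse it across uploads. A sketch, assuming the allocator from Section 3.2 and a `copy_from_slice` accessor on `PinnedBuffer` (that accessor is an assumption for illustration, not part of the contract above):

```rust
// Stage host data in pinned memory, then upload without pageable overhead.
let mut staging: PinnedBuffer<f32> = ctx.allocator().allocate_pinned(data.len())?;
staging.copy_from_slice(&data); // assumed helper; fills the pinned region
ctx.transfer().upload_pinned(&staging, &mut gpu_buf)?;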
```
5. Kernel Execution API
5.1 Kernel Definition
```rust
/// GPU kernel representation
pub struct Kernel {
    name: String,
    source: KernelSource,
    entry_point: String,
    compiled: Option<CompiledKernel>,
}

pub enum KernelSource {
    /// PTX code (NVIDIA)
    Ptx(String),

    /// CUDA source code
    CudaSource(String),

    /// HIP source code (AMD)
    HipSource(String),

    /// Pre-compiled binary
    Binary(Vec<u8>),
}

/// Compiled kernel
pub struct CompiledKernel {
    binary: Vec<u8>,
    metadata: KernelMetadata,
    device_id: DeviceId,
}

/// Kernel metadata
#[derive(Debug, Clone)]
pub struct KernelMetadata {
    pub name: String,
    pub register_count: u32,
    pub shared_memory_bytes: usize,
    pub constant_memory_bytes: usize,
    pub max_threads_per_block: u32,
    pub compute_capability: (u32, u32),
}
```
5.2 Launch Configuration
```rust
/// Kernel launch configuration
#[derive(Debug, Clone)]
pub struct LaunchConfig {
    /// Grid dimensions (blocks)
    pub grid_dim: Dim3,

    /// Block dimensions (threads per block)
    pub block_dim: Dim3,

    /// Shared memory per block (bytes)
    pub shared_mem_bytes: usize,

    /// CUDA stream (optional)
    pub stream: Option<StreamId>,
}

/// 3D dimensions
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Dim3 {
    pub x: u32,
    pub y: u32,
    pub z: u32,
}

impl Dim3 {
    pub fn new(x: u32, y: u32, z: u32) -> Self {
        Self { x, y, z }
    }

    pub fn linear(size: u32) -> Self {
        Self { x: size, y: 1, z: 1 }
    }

    pub fn total(&self) -> u32 {
        self.x * self.y * self.z
    }
}

impl LaunchConfig {
    /// Create configuration for 1D workload
    pub fn linear(total_threads: usize, threads_per_block: u32) -> Self;

    /// Create configuration for 2D workload
    pub fn grid_2d(
        width: u32,
        height: u32,
        block_width: u32,
        block_height: u32,
    ) -> Self;

    /// Optimize configuration for device
    pub fn optimize_for_device(
        workload: usize,
        device: &GpuDevice,
    ) -> Result<Self>;
}
```
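`linear` is expected to round the grid up so that every element is covered even when the workload is not a multiple of the block size. One plausible implementation, shown as a sketch rather than the normative definition:

```rust
impl LaunchConfig {
    // Sketch: cover `total_threads` elements using ceiling division on the grid.
    pub fn linear(total_threads: usize, threads_per_block: u32) -> Self {
        let blocks =
            (total_threads as u32 + threads_per_block - 1) / threads_per_block;
        Self {
            grid_dim: Dim3::linear(blocks.max(1)), // at least one block
            block_dim: Dim3::linear(threads_per_block),
            shared_mem_bytes: 0,
            stream: None,
        }
    }
}
```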
5.3 Kernel Launcher
```rust
/// Kernel launcher interface
pub trait KernelLauncher: Send + Sync {
    /// Launch kernel synchronously
    fn launch_sync(
        &self,
        kernel: &Kernel,
        config: &LaunchConfig,
        args: &[KernelArg],
    ) -> Result<KernelStats>;

    /// Launch kernel asynchronously
    fn launch_async(
        &self,
        kernel: &Kernel,
        config: &LaunchConfig,
        args: &[KernelArg],
        stream: &Stream,
    ) -> Result<LaunchHandle>;

    /// Compile kernel
    fn compile(&self, kernel: &Kernel) -> Result<CompiledKernel>;

    /// Get optimal launch configuration
    fn auto_config(
        &self,
        kernel: &Kernel,
        workload_size: usize,
    ) -> Result<LaunchConfig>;
}

/// Kernel argument
pub enum KernelArg {
    Buffer(GpuBufferRef),
    Scalar(ScalarValue),
}

pub enum ScalarValue {
    I32(i32),
    U32(u32),
    I64(i64),
    U64(u64),
    F32(f32),
    F64(f64),
}

/// Reference to GPU buffer
pub struct GpuBufferRef {
    ptr: usize,
    size: usize,
}
```
5.4 Launch Handle
```rust
/// Handle for asynchronous kernel launch
pub struct LaunchHandle {
    id: LaunchId,
    stream: StreamId,
    start_time: Instant,
    config: LaunchConfig,
}

impl LaunchHandle {
    /// Wait for kernel to complete
    pub fn wait(self) -> Result<KernelStats>;

    /// Check if kernel is complete
    pub fn is_complete(&self) -> bool;

    /// Get launch ID
    pub fn id(&self) -> LaunchId;
}

/// Kernel execution statistics
#[derive(Debug, Clone)]
pub struct KernelStats {
    pub duration: Duration,
    pub grid_dim: Dim3,
    pub block_dim: Dim3,
    pub shared_mem_bytes: usize,
    pub register_count: u32,
    pub occupancy: f32,
}
```
5.5 Stream Management
```rust
/// CUDA/HIP stream for asynchronous execution
pub struct Stream {
    id: StreamId,
    device_id: DeviceId,
    priority: StreamPriority,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum StreamPriority {
    Low,
    Normal,
    High,
}

impl Stream {
    /// Create new stream
    pub fn new(device: &GpuDevice, priority: StreamPriority) -> Result<Self>;

    /// Synchronize stream (wait for all operations)
    pub fn synchronize(&self) -> Result<()>;

    /// Query if stream is idle
    pub fn is_idle(&self) -> bool;

    /// Add callback to stream
    pub fn add_callback<F>(&self, callback: F) -> Result<()>
    where
        F: FnOnce() + Send + 'static;
}
```
6. Cost Model Hooks
6.1 Cost Model Interface
```rust
/// GPU cost model for CPU/GPU routing decisions
pub trait GpuCostModel: Send + Sync {
    /// Estimate GPU execution cost
    fn estimate_gpu_cost(&self, op: &Operation) -> Result<ExecutionCost>;

    /// Estimate CPU execution cost
    fn estimate_cpu_cost(&self, op: &Operation) -> Result<ExecutionCost>;

    /// Determine optimal execution target
    fn choose_target(&self, op: &Operation) -> Result<ExecutionTarget>;

    /// Update cost model with observed execution
    fn learn_from_execution(&mut self, stats: &ExecutionStats);
}

/// Execution cost estimate
#[derive(Debug, Clone)]
pub struct ExecutionCost {
    /// Estimated execution time
    pub time_ms: f64,

    /// Estimated memory usage (bytes)
    pub memory_bytes: usize,

    /// Estimated energy consumption (joules)
    pub energy_joules: f64,

    /// Confidence in estimate (0.0 to 1.0)
    pub confidence: f32,
}

/// Execution target
// Eq omitted: `cpu_ratio` is an f32, which is not Eq.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ExecutionTarget {
    CPU,
    GPU(DeviceId),
    Hybrid { cpu_ratio: f32 },
}
```
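As an illustration of the hook, a deliberately naive model that routes by input size alone might look like the following. This is a sketch with made-up constants, not the shipped model; it uses the `Operation` descriptor defined in the next subsection:

```rust
/// Illustrative-only cost model: route to GPU past a fixed size threshold.
pub struct ThresholdCostModel {
    gpu_threshold_bytes: usize, // e.g. 1 << 20; tuning is workload-specific
    device: DeviceId,
}

impl GpuCostModel for ThresholdCostModel {
    fn estimate_gpu_cost(&self, op: &Operation) -> Result<ExecutionCost> {
        // Assume ~10 GB/s effective transfer plus a flat 0.1ms launch cost.
        Ok(ExecutionCost {
            time_ms: op.input_size as f64 / 1e7 + 0.1,
            memory_bytes: op.input_size + op.output_size,
            energy_joules: 0.0, // not modeled in this sketch
            confidence: 0.5,
        })
    }

    fn estimate_cpu_cost(&self, op: &Operation) -> Result<ExecutionCost> {
        // Assume a ~1 GB/s single-threaded scan rate.
        Ok(ExecutionCost {
            time_ms: op.input_size as f64 / 1e6,
            memory_bytes: op.input_size,
            energy_joules: 0.0,
            confidence: 0.5,
        })
    }

    fn choose_target(&self, op: &Operation) -> Result<ExecutionTarget> {
        if op.input_size >= self.gpu_threshold_bytes {
            Ok(ExecutionTarget::GPU(self.device))
        } else {
            Ok(ExecutionTarget::CPU)
        }
    }

    fn learn_from_execution(&mut self, _stats: &ExecutionStats) {
        // A static model learns nothing; see AdaptiveCostModel below.
    }
}
```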
6.2 Operation Descriptor
```rust
/// Operation to be executed
#[derive(Debug, Clone)]
pub struct Operation {
    pub op_type: OperationType,
    pub input_size: usize,
    pub output_size: usize,
    pub complexity: ComplexityClass,
    pub data_layout: DataLayout,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperationType {
    Aggregation(AggregateOp),
    Scan(ScanOp),
    Join(JoinOp),
    VectorOp(VectorOp),
    MatrixOp(MatrixOp),
    Custom(u32),
}

// PartialEq/Eq required because OperationType derives them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AggregateOp {
    Sum,
    Avg,
    Min,
    Max,
    Count,
}

#[derive(Debug, Clone, Copy)]
pub enum ComplexityClass {
    Linear,
    LogLinear,
    Quadratic,
    Cubic,
}

#[derive(Debug, Clone, Copy)]
pub enum DataLayout {
    RowMajor,
    ColumnMajor,
    Sparse,
}
```
6.3 Adaptive Cost Learning
```rust
/// Adaptive cost model that learns from execution
pub struct AdaptiveCostModel {
    history: ExecutionHistory,
    predictor: CostPredictor,
    config: AdaptiveConfig,
}

#[derive(Debug, Clone)]
pub struct AdaptiveConfig {
    /// Minimum samples before switching target
    pub min_samples: usize,

    /// Confidence threshold for GPU execution
    pub confidence_threshold: f32,

    /// Memory pressure threshold (0.0 to 1.0)
    pub memory_pressure_threshold: f32,

    /// Enable online learning
    pub enable_learning: bool,
}

impl AdaptiveCostModel {
    /// Create new adaptive model
    pub fn new(config: AdaptiveConfig) -> Self;

    /// Predict cost with confidence interval
    pub fn predict_with_confidence(
        &self,
        op: &Operation,
        target: ExecutionTarget,
    ) -> Result<(ExecutionCost, ConfidenceInterval)>;

    /// Get historical statistics
    pub fn get_stats(&self, op_type: OperationType) -> Option<OpStats>;
}

#[derive(Debug, Clone)]
pub struct ConfidenceInterval {
    pub lower_ms: f64,
    pub upper_ms: f64,
    pub confidence_level: f32,
}
```
6.4 Execution Statistics
```rust
/// Execution statistics for learning
#[derive(Debug, Clone)]
pub struct ExecutionStats {
    pub operation: Operation,
    pub target: ExecutionTarget,
    pub actual_time_ms: f64,
    pub memory_used_bytes: usize,
    pub cache_hits: u64,
    pub cache_misses: u64,
    pub transfer_time_ms: f64,
    pub compute_time_ms: f64,
    pub success: bool,
    pub timestamp: Instant,
}

/// Operation statistics
#[derive(Debug, Clone)]
pub struct OpStats {
    pub op_type: OperationType,
    pub execution_count: u64,
    pub cpu_successes: u64,
    pub gpu_successes: u64,
    pub avg_cpu_time_ms: f64,
    pub avg_gpu_time_ms: f64,
    pub speedup_factor: f64,
}
```
7. Type System
7.1 Device-Compatible Types
```rust
/// Marker trait for types that can be transferred to GPU
pub unsafe trait DeviceRepr: Copy + Send + Sync + 'static {
    /// Check if type is properly aligned
    fn is_aligned(ptr: *const Self) -> bool;

    /// Get alignment requirement
    fn alignment() -> usize;
}

// Implement for primitive types
unsafe impl DeviceRepr for f32 {
    fn is_aligned(ptr: *const Self) -> bool {
        (ptr as usize) % 4 == 0
    }

    fn alignment() -> usize {
        4
    }
}

unsafe impl DeviceRepr for f64 {
    fn is_aligned(ptr: *const Self) -> bool {
        (ptr as usize) % 8 == 0
    }

    fn alignment() -> usize {
        8
    }
}

// ... similar for i8, i16, i32, i64, u8, u16, u32, u64
```
7.2 Vector Types
```rust
/// GPU-compatible vector types
#[repr(C, align(16))]
#[derive(Copy, Clone, Debug)]
pub struct Float4 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
    pub w: f32,
}

unsafe impl DeviceRepr for Float4 {
    fn is_aligned(ptr: *const Self) -> bool {
        (ptr as usize) % 16 == 0
    }

    fn alignment() -> usize {
        16
    }
}

// Similar for Float2, Float3, Int4, etc.
```
7.3 Type-Safe Buffer Views
```rust
/// Immutable view into GPU buffer
pub struct GpuBufferView<'a, T> {
    buffer: &'a GpuBuffer<T>,
    offset: usize,
    len: usize,
}

impl<'a, T: DeviceRepr> GpuBufferView<'a, T> {
    /// Get view length
    pub fn len(&self) -> usize;

    /// Check if view is empty
    pub fn is_empty(&self) -> bool;

    /// Split view at index
    pub fn split_at(&self, mid: usize) -> (Self, Self);

    /// Create sub-view
    pub fn subview(&self, offset: usize, len: usize) -> Result<Self>;
}

/// Mutable view into GPU buffer
pub struct GpuBufferViewMut<'a, T> {
    buffer: &'a mut GpuBuffer<T>,
    offset: usize,
    len: usize,
}
```
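Views compose like host slices, which makes partitioning a buffer for parallel work cheap. A short sketch, assuming a `buffer: GpuBuffer<f32>`:

```rust
// Partition one buffer into two non-overlapping halves without copying.
let view = buffer.view(0, buffer.capacity())?;
let (left, right) = view.split_at(view.len() / 2);
assert_eq!(left.len() + right.len(), buffer.capacity());
```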
8. Error Handling
8.1 Error Types
```rust
/// GPU API errors
#[derive(Debug, thiserror::Error)]
pub enum GpuError {
    #[error("GPU device not available")]
    DeviceNotAvailable,

    #[error("CUDA error: {0}")]
    CudaError(String),

    #[error("ROCm error: {0}")]
    RocmError(String),

    #[error("Out of GPU memory: requested {requested}MB, available {available}MB")]
    OutOfMemory { requested: usize, available: usize },

    #[error("Invalid launch configuration: {0}")]
    InvalidLaunchConfig(String),

    #[error("Kernel compilation failed: {0}")]
    CompilationError(String),

    #[error("Transfer failed: {0}")]
    TransferError(String),

    #[error("Unsupported operation: {0}")]
    Unsupported(String),

    #[error("Synchronization failed: {0}")]
    SyncError(String),

    #[error("Invalid buffer access: offset={offset}, len={len}, capacity={capacity}")]
    InvalidAccess {
        offset: usize,
        len: usize,
        capacity: usize,
    },
}
```
8.2 Result Types
```rust
/// GPU operation result
pub type GpuResult<T> = Result<T, GpuError>;

/// Result with optional CPU fallback
pub enum FallbackResult<T> {
    Gpu(T),
    Cpu(T),
    Error(GpuError),
}

impl<T> FallbackResult<T> {
    pub fn unwrap(self) -> T {
        match self {
            Self::Gpu(v) | Self::Cpu(v) => v,
            Self::Error(e) => panic!("GPU operation failed: {}", e),
        }
    }

    pub fn is_gpu(&self) -> bool {
        matches!(self, Self::Gpu(_))
    }
}
```
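Callers that care which path served a result can match instead of unwrapping. For example, anticipating the `GpuAggregateExecutor` from Section 11.1:

```rust
// Record which path served the query before using the value.
match executor.execute(&data, AggregateOp::Sum) {
    FallbackResult::Gpu(sum) => println!("GPU sum: {sum}"),
    FallbackResult::Cpu(sum) => println!("CPU fallback sum: {sum}"),
    FallbackResult::Error(e) => eprintln!("both paths failed: {e}"),
}
```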
8.3 Error Recovery
```rust
/// Error recovery strategy
pub trait ErrorRecovery {
    /// Attempt to recover from error
    fn try_recover(&self, error: &GpuError) -> RecoveryAction;
}

#[derive(Debug, Clone, Copy)]
pub enum RecoveryAction {
    /// Retry operation
    Retry,

    /// Fall back to CPU
    FallbackToCpu,

    /// Abort with error
    Abort,

    /// Reduce memory usage and retry
    ReduceMemory,
}
```
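One plausible default policy, sketched here rather than mandated: shrink on out-of-memory, retry transient transfer and sync failures, and fall back to CPU when the device is missing or the operation is unsupported.

```rust
/// Sketch of one reasonable recovery policy; not normative.
struct DefaultRecovery;

impl ErrorRecovery for DefaultRecovery {
    fn try_recover(&self, error: &GpuError) -> RecoveryAction {
        match error {
            GpuError::OutOfMemory { .. } => RecoveryAction::ReduceMemory,
            GpuError::TransferError(_) | GpuError::SyncError(_) => RecoveryAction::Retry,
            GpuError::DeviceNotAvailable | GpuError::Unsupported(_) => {
                RecoveryAction::FallbackToCpu
            }
            _ => RecoveryAction::Abort,
        }
    }
}
```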
9. Lifecycle Management
9.1 Resource Initialization
```rust
/// GPU context initialization
pub struct GpuContext {
    devices: Vec<GpuDevice>,
    allocator: Arc<dyn GpuAllocator>,
    transfer: Arc<dyn DataTransfer>,
    launcher: Arc<dyn KernelLauncher>,
    cost_model: Arc<Mutex<dyn GpuCostModel>>,
}

impl GpuContext {
    /// Initialize GPU context
    pub fn new(config: GpuConfig) -> GpuResult<Self>;

    /// Get device by ID
    pub fn device(&self, id: DeviceId) -> Option<&GpuDevice>;

    /// Get default device
    pub fn default_device(&self) -> &GpuDevice;

    /// Get allocator
    pub fn allocator(&self) -> &Arc<dyn GpuAllocator>;

    /// Get data transfer interface
    pub fn transfer(&self) -> &Arc<dyn DataTransfer>;

    /// Get kernel launcher
    pub fn launcher(&self) -> &Arc<dyn KernelLauncher>;

    /// Get cost model
    pub fn cost_model(&self) -> &Arc<Mutex<dyn GpuCostModel>>;

    /// Shutdown context
    pub fn shutdown(self) -> GpuResult<()>;
}

/// GPU configuration
#[derive(Debug, Clone)]
pub struct GpuConfig {
    pub device_ids: Vec<DeviceId>,
    pub memory_pool_config: PoolConfig,
    pub enable_peer_access: bool,
    pub cost_model_config: AdaptiveConfig,
}
```
9.2 Resource Cleanup
```rust
impl Drop for GpuContext {
    fn drop(&mut self) {
        // Automatic cleanup:
        // 1. Wait for all pending operations
        // 2. Free all allocations
        // 3. Destroy streams
        // 4. Release devices
    }
}

impl<T> Drop for GpuBuffer<T> {
    fn drop(&mut self) {
        // Return buffer to pool or free
    }
}
```
10. Performance Guarantees
10.1 Latency Guarantees
| Operation | Maximum Latency | Typical Latency |
|---|---|---|
| Memory allocation (pooled) | 0.1ms | 0.01ms |
| Memory allocation (new) | 5ms | 1ms |
| Kernel launch | 0.5ms | 0.05ms |
| Host-to-device transfer (1MB) | 2ms | 0.5ms |
| Device-to-host transfer (1MB) | 2ms | 0.5ms |
| Synchronization | 1ms | 0.1ms |
10.2 Throughput Guarantees
| Operation | Minimum Throughput | Target Throughput |
|---|---|---|
| Memory transfer (pinned) | 5 GB/s | 10 GB/s |
| Memory transfer (pageable) | 2 GB/s | 4 GB/s |
| Kernel execution (FLOPS) | 1 TFLOPS | 10 TFLOPS |
10.3 Memory Guarantees
| Metric | Guarantee |
|---|---|
| Memory leak | Zero leaks (RAII) |
| Fragmentation | <30% after defrag |
| Pool overhead | <5% of allocated |
| Allocation failure | Graceful degradation |
11. Integration Patterns
11.1 Query Execution Integration
```rust
/// Example: GPU-accelerated aggregation
pub struct GpuAggregateExecutor {
    ctx: Arc<GpuContext>,
}

impl GpuAggregateExecutor {
    pub fn execute(&self, data: &[f32], op: AggregateOp) -> FallbackResult<f32> {
        // Check if GPU execution is beneficial
        let operation = Operation {
            op_type: OperationType::Aggregation(op),
            input_size: data.len() * 4,
            output_size: 4,
            complexity: ComplexityClass::Linear,
            data_layout: DataLayout::RowMajor,
        };

        let target = self.ctx.cost_model()
            .lock()
            .unwrap()
            .choose_target(&operation)
            .unwrap();

        match target {
            ExecutionTarget::GPU(device_id) => {
                match self.execute_gpu(data, op, device_id) {
                    Ok(result) => FallbackResult::Gpu(result),
                    Err(_) => {
                        // Fallback to CPU
                        FallbackResult::Cpu(self.execute_cpu(data, op))
                    }
                }
            }
            ExecutionTarget::CPU => {
                FallbackResult::Cpu(self.execute_cpu(data, op))
            }
            ExecutionTarget::Hybrid { .. } => {
                // Hybrid execution not implemented yet
                FallbackResult::Cpu(self.execute_cpu(data, op))
            }
        }
    }

    fn execute_gpu(
        &self,
        data: &[f32],
        op: AggregateOp,
        device_id: DeviceId,
    ) -> GpuResult<f32> {
        // 1. Allocate GPU buffer
        let mut buffer = self.ctx.allocator().allocate::<f32>(data.len())?;

        // 2. Transfer data
        self.ctx.transfer().upload(data, &mut buffer)?;

        // 3. Launch kernel
        let kernel = self.get_aggregate_kernel(op)?;
        let config = LaunchConfig::linear(data.len(), 256);
        let result_buffer = self.ctx.allocator().allocate::<f32>(1)?;

        let args = vec![
            KernelArg::Buffer(buffer.as_ref()),
            KernelArg::Buffer(result_buffer.as_ref()),
            KernelArg::Scalar(ScalarValue::U32(data.len() as u32)),
        ];

        self.ctx.launcher().launch_sync(&kernel, &config, &args)?;

        // 4. Download result
        let mut result = vec![0.0f32; 1];
        self.ctx.transfer().download(&result_buffer, &mut result)?;

        Ok(result[0])
    }

    fn execute_cpu(&self, data: &[f32], op: AggregateOp) -> f32 {
        // CPU fallback implementation
        match op {
            AggregateOp::Sum => data.iter().sum(),
            AggregateOp::Avg => data.iter().sum::<f32>() / data.len() as f32,
            AggregateOp::Min => {
                *data.iter().min_by(|a, b| a.partial_cmp(b).unwrap()).unwrap()
            }
            AggregateOp::Max => {
                *data.iter().max_by(|a, b| a.partial_cmp(b).unwrap()).unwrap()
            }
            AggregateOp::Count => data.len() as f32,
        }
    }
}
```
11.2 Vector Search Integration
```rust
/// Example: GPU-accelerated similarity search
pub struct GpuSimilaritySearch {
    ctx: Arc<GpuContext>,
}

impl GpuSimilaritySearch {
    pub fn search(
        &self,
        query: &[f32],
        database: &[Vec<f32>],
        top_k: usize,
    ) -> FallbackResult<Vec<(usize, f32)>> {
        // Determine execution target
        let operation = Operation {
            op_type: OperationType::VectorOp(VectorOp::Similarity),
            input_size: (query.len() + database.len() * query.len()) * 4,
            output_size: top_k * 12,
            complexity: ComplexityClass::Linear,
            data_layout: DataLayout::RowMajor,
        };

        let target = self.ctx.cost_model()
            .lock()
            .unwrap()
            .choose_target(&operation)
            .unwrap();

        match target {
            ExecutionTarget::GPU(_) => {
                match self.search_gpu(query, database, top_k) {
                    Ok(results) => FallbackResult::Gpu(results),
                    Err(_) => FallbackResult::Cpu(self.search_cpu(query, database, top_k)),
                }
            }
            _ => FallbackResult::Cpu(self.search_cpu(query, database, top_k)),
        }
    }
}
```
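The `search_gpu` and `search_cpu` helpers are elided above. For reference, a straightforward CPU fallback might brute-force a dot-product score and keep the top-k, as in this sketch (the dot-product metric is an assumption; the contract does not fix the similarity function):

```rust
impl GpuSimilaritySearch {
    // Sketch of the elided CPU fallback: brute-force dot-product top-k.
    fn search_cpu(
        &self,
        query: &[f32],
        database: &[Vec<f32>],
        top_k: usize,
    ) -> Vec<(usize, f32)> {
        let mut scored: Vec<(usize, f32)> = database
            .iter()
            .enumerate()
            .map(|(i, v)| (i, v.iter().zip(query).map(|(a, b)| a * b).sum()))
            .collect();
        // Highest similarity first; NaN-free input is assumed.
        scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        scored.truncate(top_k);
        scored
    }
}
```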
12. Example Workflows
12.1 Simple Aggregation
```rust
// Initialize GPU context
let config = GpuConfig::default();
let ctx = GpuContext::new(config)?;

// Prepare data
let data = vec![1.0f32; 1_000_000];

// Allocate GPU buffer
let mut buffer = ctx.allocator().allocate::<f32>(data.len())?;

// Transfer data
ctx.transfer().upload(&data, &mut buffer)?;

// Launch aggregation kernel
let kernel = load_sum_kernel()?;
let config = LaunchConfig::linear(data.len(), 256);
let result_buffer = ctx.allocator().allocate::<f32>(1)?;

ctx.launcher().launch_sync(
    &kernel,
    &config,
    &[
        KernelArg::Buffer(buffer.as_ref()),
        KernelArg::Buffer(result_buffer.as_ref()),
        KernelArg::Scalar(ScalarValue::U32(data.len() as u32)),
    ],
)?;

// Download result
let mut result = vec![0.0f32];
ctx.transfer().download(&result_buffer, &mut result)?;

println!("Sum: {}", result[0]);
```
12.2 Async Pipeline
```rust
// Create stream for async operations
let stream = Stream::new(ctx.default_device(), StreamPriority::Normal)?;

// Async upload
let upload_handle = ctx.transfer()
    .upload_async(&data, &mut buffer, &stream)?;

// Async kernel launch
let kernel_handle = ctx.launcher()
    .launch_async(&kernel, &config, &args, &stream)?;

// Async download
let download_handle = ctx.transfer()
    .download_async(&result_buffer, &mut result, &stream)?;

// Wait for completion
upload_handle.wait()?;
kernel_handle.wait()?;
let stats = download_handle.wait()?;

println!("Bandwidth: {:.2} GB/s", stats.bandwidth_gbps);
```
12.3 Multi-GPU Workload
```rust
// Initialize multi-GPU context (Arc so both worker threads can share it)
let config = GpuConfig {
    device_ids: vec![DeviceId(0), DeviceId(1)],
    ..Default::default()
};
let ctx = Arc::new(GpuContext::new(config)?);

// Partition data across GPUs (owned halves so they can move into threads)
let chunk_size = data.len() / 2;
let (data_0, data_1) = data.split_at(chunk_size);
let (data_0, data_1) = (data_0.to_vec(), data_1.to_vec());

// Execute on both GPUs in parallel
let handle_0 = std::thread::spawn({
    let ctx = Arc::clone(&ctx);
    move || execute_on_gpu(&ctx, DeviceId(0), &data_0)
});

let handle_1 = std::thread::spawn({
    let ctx = Arc::clone(&ctx);
    move || execute_on_gpu(&ctx, DeviceId(1), &data_1)
});

// Collect results
let result_0 = handle_0.join().unwrap()?;
let result_1 = handle_1.join().unwrap()?;

// Combine results
let final_result = combine_results(result_0, result_1);
```
Conclusion
This API contract provides a comprehensive foundation for GPU acceleration in HeliosDB. Key design decisions:
- Type Safety: Leveraging Rust’s type system for compile-time correctness
- RAII: Automatic resource management prevents leaks
- Abstraction: Vendor-neutral API supports CUDA and ROCm
- Performance: Zero-cost abstractions; <5ms transfer overhead for 100MB payloads (Section 1.2)
- Fallback: Automatic CPU fallback ensures reliability
- Adaptivity: Cost model learns optimal CPU/GPU routing
Next Steps:
- Review and iterate on API design (Day 2)
- Implement prototype (Day 2-3)
- Create comprehensive examples (Day 3)
- Integration testing (Day 4)
- Performance validation (Day 5)
Document Status: DRAFT FOR REVIEW
Review Date: Phase 2 Week 5 Day 3
Approval: Pending Agent 2, Query Optimizer review