HeliosDB GPU API Contract Design v2.0

Document Type: Architecture Specification
Version: 2.0
Date: 2025-11-18
Author: Agent 4 (GPU Acceleration Specialist)
Status: DESIGN PHASE - Week 5 Day 1
Target Audience: Agent 2 (Multimodal Vector Features), Query Optimizer, Core Engine


Table of Contents

  1. Executive Summary
  2. Design Principles
  3. Memory Allocation API
  4. Data Transfer API
  5. Kernel Execution API
  6. Cost Model Hooks
  7. Type System
  8. Error Handling
  9. Lifecycle Management
  10. Performance Guarantees
  11. Integration Patterns
  12. Example Workflows

1. Executive Summary

This document defines the low-level API contracts for GPU acceleration in HeliosDB. These contracts establish clear boundaries between:

  • Memory Management: Allocation, pooling, and lifecycle
  • Data Transfer: Host-device communication patterns
  • Kernel Execution: GPU compute invocation
  • Cost Modeling: Intelligent CPU/GPU routing decisions

1.1 Design Goals

  1. Type Safety: Leverage Rust’s type system to prevent GPU programming errors
  2. Zero-Cost Abstractions: Minimal overhead over raw CUDA/ROCm
  3. Automatic Fallback: Graceful degradation when GPU unavailable
  4. Vendor Neutrality: Support NVIDIA, AMD, and future vendors
  5. Memory Safety: RAII-based lifecycle management

1.2 Target Performance

  • Memory Transfer Overhead: <5ms for 100MB transfers
  • Kernel Launch Overhead: <0.1ms
  • Memory Allocation Overhead: <0.01ms (pooled)
  • CPU Fallback Decision: <0.001ms

2. Design Principles

2.1 Zero-Copy Philosophy

Minimize data movement between CPU and GPU:

// ❌ BAD: Copy to temporary buffer
let temp = data.to_vec();
gpu.upload(&temp)?;

// ✅ GOOD: Direct transfer from source
gpu.upload_direct(&data)?;

// ✅ BEST: Zero-copy with pinned memory
let pinned = gpu.pin_memory(&data)?;
gpu.upload_pinned(&pinned)?;

2.2 RAII Lifecycle Management

All GPU resources use RAII:

{
    let buffer = gpu.allocate(size)?;
    // Use buffer
    // Automatic cleanup on drop
}

2.3 Explicit vs Implicit Synchronization

// Explicit sync (user control)
gpu.launch_kernel_async(&kernel, &params)?;
gpu.synchronize()?;

// Implicit sync (convenience)
let result = gpu.launch_kernel_sync(&kernel, &params)?;

2.4 Type-Safe Device Pointers

// Type-safe device pointer
pub struct DevicePtr<T> {
    ptr: usize,
    len: usize,
    _phantom: PhantomData<T>,
}

// Prevents wrong-type access at compile time
let f32_ptr: DevicePtr<f32> = gpu.allocate_typed(1024)?;
// ❌ This won't compile:
// let i32_ptr: DevicePtr<i32> = f32_ptr;

3. Memory Allocation API

3.1 Core Memory Types

/// GPU memory buffer with type safety
pub struct GpuBuffer<T> {
    device_ptr: DevicePtr<T>,
    capacity: usize,
    device_id: DeviceId,
    allocator: Weak<dyn GpuAllocator>,
}

/// Pinned host memory for fast transfers
pub struct PinnedBuffer<T> {
    host_ptr: *mut T,
    capacity: usize,
    is_registered: bool,
}

/// Unified memory (CPU and GPU accessible)
pub struct UnifiedBuffer<T> {
    ptr: *mut T,
    capacity: usize,
    hint: AccessHint,
}

#[derive(Copy, Clone, Debug)]
pub enum AccessHint {
    HostMostly,
    DeviceMostly,
    Balanced,
}

3.2 Allocator Interface

/// GPU memory allocator
pub trait GpuAllocator: Send + Sync {
    /// Allocate typed buffer
    fn allocate<T>(&self, count: usize) -> Result<GpuBuffer<T>>
    where
        T: DeviceRepr;

    /// Allocate with alignment
    fn allocate_aligned<T>(&self, count: usize, alignment: usize) -> Result<GpuBuffer<T>>
    where
        T: DeviceRepr;

    /// Allocate pinned host memory
    fn allocate_pinned<T>(&self, count: usize) -> Result<PinnedBuffer<T>>
    where
        T: DeviceRepr;

    /// Allocate unified memory
    fn allocate_unified<T>(&self, count: usize, hint: AccessHint) -> Result<UnifiedBuffer<T>>
    where
        T: DeviceRepr;

    /// Get allocation statistics
    fn stats(&self) -> AllocationStats;

    /// Defragment memory (blocking operation)
    fn defragment(&self) -> Result<DefragmentationReport>;
}

/// Allocation statistics
#[derive(Debug, Clone)]
pub struct AllocationStats {
    pub total_allocated: usize,
    pub total_freed: usize,
    pub current_usage: usize,
    pub peak_usage: usize,
    pub fragmentation_ratio: f32,
    pub allocations: u64,
    pub cache_hits: u64,
    pub cache_misses: u64,
}

/// Defragmentation report
#[derive(Debug, Clone)]
pub struct DefragmentationReport {
    pub bytes_moved: usize,
    pub time_ms: f64,
    pub fragmentation_before: f32,
    pub fragmentation_after: f32,
}

3.3 Memory Pool Configuration

/// Memory pool configuration
#[derive(Debug, Clone)]
pub struct PoolConfig {
    /// Maximum pool size in bytes
    pub max_size: usize,
    /// Minimum allocation size (bytes)
    pub min_allocation: usize,
    /// Maximum allocation size (bytes)
    pub max_allocation: usize,
    /// Bucket growth strategy
    pub growth_strategy: GrowthStrategy,
    /// Eviction policy
    pub eviction_policy: EvictionPolicy,
    /// Enable defragmentation
    pub auto_defragment: bool,
    /// Defragmentation threshold
    pub defrag_threshold: f32,
}

#[derive(Debug, Clone, Copy)]
pub enum GrowthStrategy {
    PowerOfTwo,
    Fibonacci,
    Linear(usize),
}

#[derive(Debug, Clone, Copy)]
pub enum EvictionPolicy {
    LRU,
    LFU,
    ARC,
    FIFO,
}

impl Default for PoolConfig {
    fn default() -> Self {
        Self {
            max_size: 1 << 30,         // 1GB
            min_allocation: 4096,      // 4KB
            max_allocation: 256 << 20, // 256MB
            growth_strategy: GrowthStrategy::PowerOfTwo,
            eviction_policy: EvictionPolicy::LRU,
            auto_defragment: true,
            defrag_threshold: 0.3,
        }
    }
}
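
As a usage sketch: a deployment dominated by many small, short-lived vector buffers might override the defaults as below. The specific values are illustrative assumptions, not tuning recommendations.

// Hypothetical pool tuning for small, short-lived allocations.
let pool = PoolConfig {
    max_size: 4 << 30,                                // 4GB pool on a larger device
    min_allocation: 1024,                             // 1KB buckets for small vectors
    growth_strategy: GrowthStrategy::Linear(1 << 20), // grow buckets in 1MB steps
    eviction_policy: EvictionPolicy::ARC,
    ..PoolConfig::default()
};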

3.4 Memory Buffer Operations

impl<T: DeviceRepr> GpuBuffer<T> {
    /// Get buffer capacity (elements)
    pub fn capacity(&self) -> usize;

    /// Get buffer size (bytes)
    pub fn size_bytes(&self) -> usize;

    /// Get device ID
    pub fn device(&self) -> DeviceId;

    /// Check if buffer is valid
    pub fn is_valid(&self) -> bool;

    /// Get raw device pointer (unsafe)
    pub unsafe fn as_device_ptr(&self) -> *mut T;

    /// Fill buffer with value
    pub fn fill(&mut self, value: T) -> Result<()>
    where
        T: Copy;

    /// Copy from another GPU buffer (device-to-device)
    pub fn copy_from_device(&mut self, src: &GpuBuffer<T>) -> Result<()>;

    /// Copy to another GPU buffer
    pub fn copy_to_device(&self, dst: &mut GpuBuffer<T>) -> Result<()>;

    /// Create view into buffer
    pub fn view(&self, offset: usize, len: usize) -> Result<GpuBufferView<T>>;

    /// Create mutable view
    pub fn view_mut(&mut self, offset: usize, len: usize) -> Result<GpuBufferViewMut<T>>;
}
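
A brief sketch of how these operations compose, assuming a `GpuContext` named `ctx` as defined in Section 9:

// Allocate, zero-initialize, and carve a read-only view of the first half.
let mut buf: GpuBuffer<f32> = ctx.allocator().allocate(2048)?;
buf.fill(0.0)?;
let front = buf.view(0, 1024)?;
assert_eq!(front.len(), 1024);

// Out-of-range views are rejected (e.g. with GpuError::InvalidAccess).
assert!(buf.view(1536, 1024).is_err());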

3.5 Advanced Memory Features

/// Memory-mapped buffer for zero-copy access
pub struct MappedBuffer<T> {
    buffer: GpuBuffer<T>,
    host_ptr: *mut T,
    mapping: MemoryMapping,
}

impl<T: DeviceRepr> MappedBuffer<T> {
    /// Map GPU buffer to host address space
    pub fn map(buffer: GpuBuffer<T>) -> Result<Self>;

    /// Access as host slice (read-only)
    pub fn as_slice(&self) -> &[T];

    /// Access as mutable host slice
    pub fn as_mut_slice(&mut self) -> &mut [T];

    /// Unmap and return underlying buffer
    pub fn unmap(self) -> GpuBuffer<T>;

    /// Synchronize host changes to device
    pub fn sync_to_device(&mut self) -> Result<()>;

    /// Synchronize device changes to host
    pub fn sync_to_host(&mut self) -> Result<()>;
}

4. Data Transfer API

4.1 Transfer Interface

/// Data transfer manager
pub trait DataTransfer: Send + Sync {
    /// Synchronous host-to-device transfer
    fn upload<T>(&self, src: &[T], dst: &mut GpuBuffer<T>) -> Result<()>
    where
        T: DeviceRepr;

    /// Synchronous device-to-host transfer
    fn download<T>(&self, src: &GpuBuffer<T>, dst: &mut [T]) -> Result<()>
    where
        T: DeviceRepr;

    /// Asynchronous host-to-device transfer
    fn upload_async<T>(
        &self,
        src: &[T],
        dst: &mut GpuBuffer<T>,
        stream: &Stream,
    ) -> Result<TransferHandle>
    where
        T: DeviceRepr;

    /// Asynchronous device-to-host transfer
    fn download_async<T>(
        &self,
        src: &GpuBuffer<T>,
        dst: &mut [T],
        stream: &Stream,
    ) -> Result<TransferHandle>
    where
        T: DeviceRepr;

    /// Device-to-device transfer (same GPU)
    fn copy_device<T>(&self, src: &GpuBuffer<T>, dst: &mut GpuBuffer<T>) -> Result<()>
    where
        T: DeviceRepr;

    /// Peer-to-peer transfer (different GPUs)
    fn copy_peer<T>(
        &self,
        src: &GpuBuffer<T>,
        dst: &mut GpuBuffer<T>,
        src_device: DeviceId,
        dst_device: DeviceId,
    ) -> Result<()>
    where
        T: DeviceRepr;
}

4.2 Transfer Handle

/// Handle for asynchronous transfer
pub struct TransferHandle {
    id: TransferId,
    stream: StreamId,
    size_bytes: usize,
    direction: TransferDirection,
    start_time: Instant,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransferDirection {
    HostToDevice,
    DeviceToHost,
    DeviceToDevice,
    PeerToPeer,
}

impl TransferHandle {
    /// Wait for transfer to complete
    pub fn wait(self) -> Result<TransferStats>;

    /// Check if transfer is complete (non-blocking)
    pub fn is_complete(&self) -> bool;

    /// Get transfer progress (0.0 to 1.0)
    pub fn progress(&self) -> f32;

    /// Get transfer ID
    pub fn id(&self) -> TransferId;

    /// Get estimated time remaining
    pub fn time_remaining(&self) -> Duration;
}

/// Transfer statistics
#[derive(Debug, Clone)]
pub struct TransferStats {
    pub bytes_transferred: usize,
    pub duration: Duration,
    pub bandwidth_gbps: f64,
    pub direction: TransferDirection,
}
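
A polling sketch showing the intended overlap of transfer and CPU work; `host_data`, `dev_buf`, `stream`, and `prepare_next_batch` are hypothetical caller-side names:

// Kick off an async upload, overlap CPU work, then report bandwidth.
let handle = ctx.transfer().upload_async(&host_data, &mut dev_buf, &stream)?;
while !handle.is_complete() {
    prepare_next_batch(); // hypothetical CPU-side work overlapped with the DMA
}
let stats = handle.wait()?;
println!("{:?}: {:.2} GB/s", stats.direction, stats.bandwidth_gbps);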

4.3 Optimized Transfer Patterns

/// Batch transfer for multiple buffers
pub struct BatchTransfer {
    transfers: Vec<TransferSpec>,
    strategy: BatchStrategy,
}

pub enum TransferSpec {
    HostToDevice { src: Vec<u8>, dst: GpuBuffer<u8> },
    DeviceToHost { src: GpuBuffer<u8>, dst: Vec<u8> },
}

#[derive(Debug, Clone, Copy)]
pub enum BatchStrategy {
    Sequential,
    Parallel,
    Pipelined,
}

impl BatchTransfer {
    /// Create new batch transfer
    pub fn new(strategy: BatchStrategy) -> Self;

    /// Add transfer to batch
    pub fn add(&mut self, spec: TransferSpec);

    /// Execute all transfers
    pub fn execute(self, stream: &Stream) -> Result<Vec<TransferHandle>>;

    /// Execute and wait for all transfers
    pub fn execute_sync(self) -> Result<Vec<TransferStats>>;
}
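
A usage sketch, assuming `columns: Vec<Vec<u8>>` and matching pre-allocated `dev_buffers: Vec<GpuBuffer<u8>>`:

// Upload several column buffers as one pipelined batch.
let mut batch = BatchTransfer::new(BatchStrategy::Pipelined);
for (src, dst) in columns.into_iter().zip(dev_buffers.into_iter()) {
    batch.add(TransferSpec::HostToDevice { src, dst });
}
let stats = batch.execute_sync()?;
let total_bytes: usize = stats.iter().map(|s| s.bytes_transferred).sum();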

4.4 Pinned Memory Transfers

/// Pinned memory transfer (faster than pageable)
pub trait PinnedTransfer {
    /// Transfer from pinned host memory
    fn upload_pinned<T>(&self, src: &PinnedBuffer<T>, dst: &mut GpuBuffer<T>) -> Result<()>
    where
        T: DeviceRepr;

    /// Transfer to pinned host memory
    fn download_pinned<T>(&self, src: &GpuBuffer<T>, dst: &mut PinnedBuffer<T>) -> Result<()>
    where
        T: DeviceRepr;

    /// Asynchronous pinned transfer
    fn upload_pinned_async<T>(
        &self,
        src: &PinnedBuffer<T>,
        dst: &mut GpuBuffer<T>,
        stream: &Stream,
    ) -> Result<TransferHandle>
    where
        T: DeviceRepr;
}

5. Kernel Execution API

5.1 Kernel Definition

/// GPU kernel representation
pub struct Kernel {
    name: String,
    source: KernelSource,
    entry_point: String,
    compiled: Option<CompiledKernel>,
}

pub enum KernelSource {
    /// PTX code (NVIDIA)
    Ptx(String),
    /// CUDA source code
    CudaSource(String),
    /// HIP source code (AMD)
    HipSource(String),
    /// Pre-compiled binary
    Binary(Vec<u8>),
}

/// Compiled kernel
pub struct CompiledKernel {
    binary: Vec<u8>,
    metadata: KernelMetadata,
    device_id: DeviceId,
}

/// Kernel metadata
#[derive(Debug, Clone)]
pub struct KernelMetadata {
    pub name: String,
    pub register_count: u32,
    pub shared_memory_bytes: usize,
    pub constant_memory_bytes: usize,
    pub max_threads_per_block: u32,
    pub compute_capability: (u32, u32),
}

5.2 Launch Configuration

/// Kernel launch configuration
#[derive(Debug, Clone)]
pub struct LaunchConfig {
    /// Grid dimensions (blocks)
    pub grid_dim: Dim3,
    /// Block dimensions (threads per block)
    pub block_dim: Dim3,
    /// Shared memory per block (bytes)
    pub shared_mem_bytes: usize,
    /// CUDA stream (optional)
    pub stream: Option<StreamId>,
}

/// 3D dimensions
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Dim3 {
    pub x: u32,
    pub y: u32,
    pub z: u32,
}

impl Dim3 {
    pub fn new(x: u32, y: u32, z: u32) -> Self {
        Self { x, y, z }
    }

    pub fn linear(size: u32) -> Self {
        Self { x: size, y: 1, z: 1 }
    }

    pub fn total(&self) -> u32 {
        self.x * self.y * self.z
    }
}

impl LaunchConfig {
    /// Create configuration for 1D workload
    pub fn linear(total_threads: usize, threads_per_block: u32) -> Self;

    /// Create configuration for 2D workload
    pub fn grid_2d(width: u32, height: u32, block_width: u32, block_height: u32) -> Self;

    /// Optimize configuration for device
    pub fn optimize_for_device(workload: usize, device: &GpuDevice) -> Result<Self>;
}
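
For illustration, a plausible implementation of `LaunchConfig::linear` rounds the grid size up so every element is covered; this is a sketch of the intent, not the mandated implementation:

impl LaunchConfig {
    pub fn linear(total_threads: usize, threads_per_block: u32) -> Self {
        // Ceiling division in u64 to avoid overflow on very large workloads.
        // The last block may be partially full, so kernels must bounds-check
        // their global index against the workload size.
        let blocks =
            ((total_threads as u64 + threads_per_block as u64 - 1) / threads_per_block as u64) as u32;
        Self {
            grid_dim: Dim3::linear(blocks),
            block_dim: Dim3::linear(threads_per_block),
            shared_mem_bytes: 0,
            stream: None,
        }
    }
}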

5.3 Kernel Launcher

/// Kernel launcher interface
pub trait KernelLauncher: Send + Sync {
    /// Launch kernel synchronously
    fn launch_sync(
        &self,
        kernel: &Kernel,
        config: &LaunchConfig,
        args: &[KernelArg],
    ) -> Result<KernelStats>;

    /// Launch kernel asynchronously
    fn launch_async(
        &self,
        kernel: &Kernel,
        config: &LaunchConfig,
        args: &[KernelArg],
        stream: &Stream,
    ) -> Result<LaunchHandle>;

    /// Compile kernel
    fn compile(&self, kernel: &Kernel) -> Result<CompiledKernel>;

    /// Get optimal launch configuration
    fn auto_config(&self, kernel: &Kernel, workload_size: usize) -> Result<LaunchConfig>;
}

/// Kernel argument
pub enum KernelArg {
    Buffer(GpuBufferRef),
    Scalar(ScalarValue),
}

pub enum ScalarValue {
    I32(i32),
    U32(u32),
    I64(i64),
    U64(u64),
    F32(f32),
    F64(f64),
}

/// Reference to GPU buffer
pub struct GpuBufferRef {
    ptr: usize,
    size: usize,
}

5.4 Launch Handle

/// Handle for asynchronous kernel launch
pub struct LaunchHandle {
    id: LaunchId,
    stream: StreamId,
    start_time: Instant,
    config: LaunchConfig,
}

impl LaunchHandle {
    /// Wait for kernel to complete
    pub fn wait(self) -> Result<KernelStats>;

    /// Check if kernel is complete
    pub fn is_complete(&self) -> bool;

    /// Get launch ID
    pub fn id(&self) -> LaunchId;
}

/// Kernel execution statistics
#[derive(Debug, Clone)]
pub struct KernelStats {
    pub duration: Duration,
    pub grid_dim: Dim3,
    pub block_dim: Dim3,
    pub shared_mem_bytes: usize,
    pub register_count: u32,
    pub occupancy: f32,
}

5.5 Stream Management

/// CUDA/HIP stream for asynchronous execution
pub struct Stream {
    id: StreamId,
    device_id: DeviceId,
    priority: StreamPriority,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum StreamPriority {
    Low,
    Normal,
    High,
}

impl Stream {
    /// Create new stream
    pub fn new(device: &GpuDevice, priority: StreamPriority) -> Result<Self>;

    /// Synchronize stream (wait for all operations)
    pub fn synchronize(&self) -> Result<()>;

    /// Query if stream is idle
    pub fn is_idle(&self) -> bool;

    /// Add callback to stream
    pub fn add_callback<F>(&self, callback: F) -> Result<()>
    where
        F: FnOnce() + Send + 'static;
}
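
A short sketch combining streams and callbacks, assuming `ctx`, `kernel`, `config`, and `args` from the surrounding sections:

// Launch on a high-priority stream and get notified on completion.
let stream = Stream::new(ctx.default_device(), StreamPriority::High)?;
let _handle = ctx.launcher().launch_async(&kernel, &config, &args, &stream)?;
stream.add_callback(|| println!("kernel finished"))?;
stream.synchronize()?; // blocks until the kernel and the callback have run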

6. Cost Model Hooks

6.1 Cost Model Interface

/// GPU cost model for CPU/GPU routing decisions
pub trait GpuCostModel: Send + Sync {
    /// Estimate GPU execution cost
    fn estimate_gpu_cost(&self, op: &Operation) -> Result<ExecutionCost>;

    /// Estimate CPU execution cost
    fn estimate_cpu_cost(&self, op: &Operation) -> Result<ExecutionCost>;

    /// Determine optimal execution target
    fn choose_target(&self, op: &Operation) -> Result<ExecutionTarget>;

    /// Update cost model with observed execution
    fn learn_from_execution(&mut self, stats: &ExecutionStats);
}

/// Execution cost estimate
#[derive(Debug, Clone)]
pub struct ExecutionCost {
    /// Estimated execution time
    pub time_ms: f64,
    /// Estimated memory usage (bytes)
    pub memory_bytes: usize,
    /// Estimated energy consumption (joules)
    pub energy_joules: f64,
    /// Confidence in estimate (0.0 to 1.0)
    pub confidence: f32,
}

/// Execution target
/// (No Eq derive: the Hybrid variant carries an f32, which is not Eq.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ExecutionTarget {
    CPU,
    GPU(DeviceId),
    Hybrid { cpu_ratio: f32 },
}
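
To make the decision flow concrete, here is a deliberately naive `GpuCostModel` sketch: it charges the GPU for transfer plus a flat launch overhead, assumes the CPU streams input linearly, and routes to the GPU only when the estimate wins with enough confidence. The bandwidth constants are placeholder assumptions; `Operation` and `ExecutionStats` are defined later in this section.

/// Naive sketch, not a production model.
struct NaiveCostModel;

impl GpuCostModel for NaiveCostModel {
    fn estimate_gpu_cost(&self, op: &Operation) -> Result<ExecutionCost> {
        // Assume ~10 GB/s transfer (10e6 bytes/ms) plus 0.05ms launch overhead.
        let transfer_ms = (op.input_size + op.output_size) as f64 / 10e6;
        Ok(ExecutionCost {
            time_ms: transfer_ms + 0.05,
            memory_bytes: op.input_size + op.output_size,
            energy_joules: 0.0,
            confidence: 0.5,
        })
    }

    fn estimate_cpu_cost(&self, op: &Operation) -> Result<ExecutionCost> {
        // Assume the CPU streams input at ~5 GB/s for linear operations.
        Ok(ExecutionCost {
            time_ms: op.input_size as f64 / 5e6,
            memory_bytes: op.input_size,
            energy_joules: 0.0,
            confidence: 0.8,
        })
    }

    fn choose_target(&self, op: &Operation) -> Result<ExecutionTarget> {
        let gpu = self.estimate_gpu_cost(op)?;
        let cpu = self.estimate_cpu_cost(op)?;
        if gpu.time_ms < cpu.time_ms && gpu.confidence >= 0.5 {
            Ok(ExecutionTarget::GPU(DeviceId(0)))
        } else {
            Ok(ExecutionTarget::CPU)
        }
    }

    fn learn_from_execution(&mut self, _stats: &ExecutionStats) {
        // A real model would update its estimates here; see AdaptiveCostModel.
    }
}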

6.2 Operation Descriptor

/// Operation to be executed
#[derive(Debug, Clone)]
pub struct Operation {
    pub op_type: OperationType,
    pub input_size: usize,
    pub output_size: usize,
    pub complexity: ComplexityClass,
    pub data_layout: DataLayout,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperationType {
    Aggregation(AggregateOp),
    Scan(ScanOp),
    Join(JoinOp),
    VectorOp(VectorOp),
    MatrixOp(MatrixOp),
    Custom(u32),
}

// Payload enums must also derive PartialEq/Eq for OperationType's derives to hold.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AggregateOp {
    Sum,
    Avg,
    Min,
    Max,
    Count,
}

#[derive(Debug, Clone, Copy)]
pub enum ComplexityClass {
    Linear,
    LogLinear,
    Quadratic,
    Cubic,
}

#[derive(Debug, Clone, Copy)]
pub enum DataLayout {
    RowMajor,
    ColumnMajor,
    Sparse,
}

6.3 Adaptive Cost Learning

/// Adaptive cost model that learns from execution
pub struct AdaptiveCostModel {
    history: ExecutionHistory,
    predictor: CostPredictor,
    config: AdaptiveConfig,
}

#[derive(Debug, Clone)]
pub struct AdaptiveConfig {
    /// Minimum samples before switching target
    pub min_samples: usize,
    /// Confidence threshold for GPU execution
    pub confidence_threshold: f32,
    /// Memory pressure threshold (0.0 to 1.0)
    pub memory_pressure_threshold: f32,
    /// Enable online learning
    pub enable_learning: bool,
}

impl AdaptiveCostModel {
    /// Create new adaptive model
    pub fn new(config: AdaptiveConfig) -> Self;

    /// Predict cost with confidence interval
    pub fn predict_with_confidence(
        &self,
        op: &Operation,
        target: ExecutionTarget,
    ) -> Result<(ExecutionCost, ConfidenceInterval)>;

    /// Get historical statistics
    pub fn get_stats(&self, op_type: OperationType) -> Option<OpStats>;
}

#[derive(Debug, Clone)]
pub struct ConfidenceInterval {
    pub lower_ms: f64,
    pub upper_ms: f64,
    pub confidence_level: f32,
}

6.4 Execution Statistics

/// Execution statistics for learning
#[derive(Debug, Clone)]
pub struct ExecutionStats {
    pub operation: Operation,
    pub target: ExecutionTarget,
    pub actual_time_ms: f64,
    pub memory_used_bytes: usize,
    pub cache_hits: u64,
    pub cache_misses: u64,
    pub transfer_time_ms: f64,
    pub compute_time_ms: f64,
    pub success: bool,
    pub timestamp: Instant,
}

/// Operation statistics
#[derive(Debug, Clone)]
pub struct OpStats {
    pub op_type: OperationType,
    pub execution_count: u64,
    pub cpu_successes: u64,
    pub gpu_successes: u64,
    pub avg_cpu_time_ms: f64,
    pub avg_gpu_time_ms: f64,
    pub speedup_factor: f64,
}

7. Type System

7.1 Device-Compatible Types

/// Marker trait for types that can be transferred to GPU
pub unsafe trait DeviceRepr: Copy + Send + Sync + 'static {
    /// Check if type is properly aligned
    fn is_aligned(ptr: *const Self) -> bool;

    /// Get alignment requirement
    fn alignment() -> usize;
}

// Implement for primitive types
unsafe impl DeviceRepr for f32 {
    fn is_aligned(ptr: *const Self) -> bool {
        (ptr as usize) % 4 == 0
    }

    fn alignment() -> usize {
        4
    }
}

unsafe impl DeviceRepr for f64 {
    fn is_aligned(ptr: *const Self) -> bool {
        (ptr as usize) % 8 == 0
    }

    fn alignment() -> usize {
        8
    }
}

// ... similar for i8, i16, i32, i64, u8, u16, u32, u64
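
Those remaining impls are mechanical; a macro along these lines could generate them (a sketch of one way to avoid the repetition, not part of the contract):

// Hypothetical helper: derive DeviceRepr from the type's natural alignment.
macro_rules! impl_device_repr {
    ($($t:ty),*) => {$(
        unsafe impl DeviceRepr for $t {
            fn is_aligned(ptr: *const Self) -> bool {
                (ptr as usize) % std::mem::align_of::<$t>() == 0
            }
            fn alignment() -> usize {
                std::mem::align_of::<$t>()
            }
        }
    )*};
}

impl_device_repr!(i8, i16, i32, i64, u8, u16, u32, u64);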

7.2 Vector Types

/// GPU-compatible vector types
#[repr(C, align(16))]
#[derive(Copy, Clone, Debug)]
pub struct Float4 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
    pub w: f32,
}

unsafe impl DeviceRepr for Float4 {
    fn is_aligned(ptr: *const Self) -> bool {
        (ptr as usize) % 16 == 0
    }

    fn alignment() -> usize {
        16
    }
}

// Similar for Float2, Float3, Int4, etc.

7.3 Type-Safe Buffer Views

/// Immutable view into GPU buffer
pub struct GpuBufferView<'a, T> {
    buffer: &'a GpuBuffer<T>,
    offset: usize,
    len: usize,
}

impl<'a, T: DeviceRepr> GpuBufferView<'a, T> {
    /// Get view length
    pub fn len(&self) -> usize;

    /// Check if view is empty
    pub fn is_empty(&self) -> bool;

    /// Split view at index
    pub fn split_at(&self, mid: usize) -> (Self, Self);

    /// Create sub-view
    pub fn subview(&self, offset: usize, len: usize) -> Result<Self>;
}

/// Mutable view into GPU buffer
pub struct GpuBufferViewMut<'a, T> {
    buffer: &'a mut GpuBuffer<T>,
    offset: usize,
    len: usize,
}

8. Error Handling

8.1 Error Types

/// GPU API errors
#[derive(Debug, thiserror::Error)]
pub enum GpuError {
    #[error("GPU device not available")]
    DeviceNotAvailable,

    #[error("CUDA error: {0}")]
    CudaError(String),

    #[error("ROCm error: {0}")]
    RocmError(String),

    #[error("Out of GPU memory: requested {requested} bytes, available {available} bytes")]
    OutOfMemory { requested: usize, available: usize },

    #[error("Invalid launch configuration: {0}")]
    InvalidLaunchConfig(String),

    #[error("Kernel compilation failed: {0}")]
    CompilationError(String),

    #[error("Transfer failed: {0}")]
    TransferError(String),

    #[error("Unsupported operation: {0}")]
    Unsupported(String),

    #[error("Synchronization failed: {0}")]
    SyncError(String),

    #[error("Invalid buffer access: offset={offset}, len={len}, capacity={capacity}")]
    InvalidAccess {
        offset: usize,
        len: usize,
        capacity: usize,
    },
}

8.2 Result Types

/// GPU operation result
pub type GpuResult<T> = Result<T, GpuError>;

/// Result with optional CPU fallback
pub enum FallbackResult<T> {
    Gpu(T),
    Cpu(T),
    Error(GpuError),
}

impl<T> FallbackResult<T> {
    pub fn unwrap(self) -> T {
        match self {
            Self::Gpu(v) | Self::Cpu(v) => v,
            Self::Error(e) => panic!("GPU operation failed: {}", e),
        }
    }

    pub fn is_gpu(&self) -> bool {
        matches!(self, Self::Gpu(_))
    }
}
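
Typical caller-side usage, with `executor` and `metrics` as hypothetical surrounding state:

// Record which path served the request without branching the result handling.
let result = executor.execute(&data, AggregateOp::Sum);
if result.is_gpu() {
    metrics.gpu_hits += 1; // hypothetical counter
}
let sum = result.unwrap(); // panics only if both GPU and CPU paths failed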

8.3 Error Recovery

/// Error recovery strategy
pub trait ErrorRecovery {
    /// Attempt to recover from error
    fn try_recover(&self, error: &GpuError) -> RecoveryAction;
}

#[derive(Debug, Clone, Copy)]
pub enum RecoveryAction {
    /// Retry operation
    Retry,
    /// Fall back to CPU
    FallbackToCpu,
    /// Abort with error
    Abort,
    /// Reduce memory usage and retry
    ReduceMemory,
}
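
A minimal policy sketch mapping the error taxonomy above onto recovery actions; the specific mapping is illustrative:

/// Illustrative policy: shrink on OOM, fall back when the device cannot help,
/// retry transient transfer/sync faults, abort on everything else.
struct DefaultRecovery;

impl ErrorRecovery for DefaultRecovery {
    fn try_recover(&self, error: &GpuError) -> RecoveryAction {
        match error {
            GpuError::OutOfMemory { .. } => RecoveryAction::ReduceMemory,
            GpuError::DeviceNotAvailable | GpuError::Unsupported(_) => {
                RecoveryAction::FallbackToCpu
            }
            GpuError::TransferError(_) | GpuError::SyncError(_) => RecoveryAction::Retry,
            _ => RecoveryAction::Abort,
        }
    }
}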

9. Lifecycle Management

9.1 Resource Initialization

/// GPU context initialization
pub struct GpuContext {
    devices: Vec<GpuDevice>,
    allocator: Arc<dyn GpuAllocator>,
    transfer: Arc<dyn DataTransfer>,
    launcher: Arc<dyn KernelLauncher>,
    cost_model: Arc<Mutex<dyn GpuCostModel>>,
}

impl GpuContext {
    /// Initialize GPU context
    pub fn new(config: GpuConfig) -> GpuResult<Self>;

    /// Get device by ID
    pub fn device(&self, id: DeviceId) -> Option<&GpuDevice>;

    /// Get default device
    pub fn default_device(&self) -> &GpuDevice;

    /// Get allocator
    pub fn allocator(&self) -> &Arc<dyn GpuAllocator>;

    /// Get data transfer interface
    pub fn transfer(&self) -> &Arc<dyn DataTransfer>;

    /// Get kernel launcher
    pub fn launcher(&self) -> &Arc<dyn KernelLauncher>;

    /// Get cost model
    pub fn cost_model(&self) -> &Arc<Mutex<dyn GpuCostModel>>;

    /// Shutdown context
    pub fn shutdown(self) -> GpuResult<()>;
}

/// GPU configuration
#[derive(Debug, Clone)]
pub struct GpuConfig {
    pub device_ids: Vec<DeviceId>,
    pub memory_pool_config: PoolConfig,
    pub enable_peer_access: bool,
    pub cost_model_config: AdaptiveConfig,
}
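
The workflows in Section 12 call `GpuConfig::default()`; a plausible `Default` impl (a sketch, with concrete values as assumptions) would be:

impl Default for GpuConfig {
    fn default() -> Self {
        Self {
            device_ids: vec![DeviceId(0)], // assumes at least one device present
            memory_pool_config: PoolConfig::default(),
            enable_peer_access: false,
            cost_model_config: AdaptiveConfig {
                min_samples: 32,
                confidence_threshold: 0.7,
                memory_pressure_threshold: 0.9,
                enable_learning: true,
            },
        }
    }
}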

9.2 Resource Cleanup

impl Drop for GpuContext {
    fn drop(&mut self) {
        // Automatic cleanup:
        // 1. Wait for all pending operations
        // 2. Free all allocations
        // 3. Destroy streams
        // 4. Release devices
    }
}

impl<T> Drop for GpuBuffer<T> {
    fn drop(&mut self) {
        // Return buffer to pool or free
    }
}

10. Performance Guarantees

10.1 Latency Guarantees

| Operation | Maximum Latency | Typical Latency |
| --- | --- | --- |
| Memory allocation (pooled) | 0.1ms | 0.01ms |
| Memory allocation (new) | 5ms | 1ms |
| Kernel launch | 0.5ms | 0.05ms |
| Host-to-device transfer (1MB) | 2ms | 0.5ms |
| Device-to-host transfer (1MB) | 2ms | 0.5ms |
| Synchronization | 1ms | 0.1ms |

10.2 Throughput Guarantees

| Operation | Minimum Throughput | Target Throughput |
| --- | --- | --- |
| Memory transfer (pinned) | 5 GB/s | 10 GB/s |
| Memory transfer (pageable) | 2 GB/s | 4 GB/s |
| Kernel execution (FLOPS) | 1 TFLOPS | 10 TFLOPS |

10.3 Memory Guarantees

| Metric | Guarantee |
| --- | --- |
| Memory leak | Zero leaks (RAII) |
| Fragmentation | <30% after defrag |
| Pool overhead | <5% of allocated |
| Allocation failure | Graceful degradation |

11. Integration Patterns

11.1 Query Execution Integration

/// Example: GPU-accelerated aggregation
pub struct GpuAggregateExecutor {
    ctx: Arc<GpuContext>,
}

impl GpuAggregateExecutor {
    pub fn execute(&self, data: &[f32], op: AggregateOp) -> FallbackResult<f32> {
        // Check if GPU execution is beneficial
        let operation = Operation {
            op_type: OperationType::Aggregation(op),
            input_size: data.len() * 4, // f32 = 4 bytes
            output_size: 4,
            complexity: ComplexityClass::Linear,
            data_layout: DataLayout::RowMajor,
        };
        let target = self.ctx.cost_model()
            .lock()
            .unwrap()
            .choose_target(&operation)
            .unwrap_or(ExecutionTarget::CPU); // on estimation failure, stay on CPU

        match target {
            ExecutionTarget::GPU(device_id) => {
                match self.execute_gpu(data, op, device_id) {
                    Ok(result) => FallbackResult::Gpu(result),
                    Err(_) => {
                        // Fallback to CPU
                        FallbackResult::Cpu(self.execute_cpu(data, op))
                    }
                }
            }
            ExecutionTarget::CPU => FallbackResult::Cpu(self.execute_cpu(data, op)),
            ExecutionTarget::Hybrid { .. } => {
                // Hybrid execution not implemented yet
                FallbackResult::Cpu(self.execute_cpu(data, op))
            }
        }
    }

    fn execute_gpu(&self, data: &[f32], op: AggregateOp, device_id: DeviceId) -> GpuResult<f32> {
        // 1. Allocate GPU buffer
        let mut buffer = self.ctx.allocator().allocate::<f32>(data.len())?;
        // 2. Transfer data
        self.ctx.transfer().upload(data, &mut buffer)?;
        // 3. Launch kernel
        let kernel = self.get_aggregate_kernel(op)?;
        let config = LaunchConfig::linear(data.len(), 256);
        let result_buffer = self.ctx.allocator().allocate::<f32>(1)?;
        let args = vec![
            KernelArg::Buffer(buffer.as_ref()),
            KernelArg::Buffer(result_buffer.as_ref()),
            KernelArg::Scalar(ScalarValue::U32(data.len() as u32)),
        ];
        self.ctx.launcher().launch_sync(&kernel, &config, &args)?;
        // 4. Download result
        let mut result = vec![0.0f32; 1];
        self.ctx.transfer().download(&result_buffer, &mut result)?;
        Ok(result[0])
    }

    fn execute_cpu(&self, data: &[f32], op: AggregateOp) -> f32 {
        // CPU fallback implementation (fold avoids panicking on empty input)
        match op {
            AggregateOp::Sum => data.iter().sum(),
            AggregateOp::Avg => data.iter().sum::<f32>() / data.len() as f32,
            AggregateOp::Min => data.iter().copied().fold(f32::INFINITY, f32::min),
            AggregateOp::Max => data.iter().copied().fold(f32::NEG_INFINITY, f32::max),
            AggregateOp::Count => data.len() as f32,
        }
    }
}

11.2 Vector Search Integration

/// Example: GPU-accelerated similarity search
pub struct GpuSimilaritySearch {
    ctx: Arc<GpuContext>,
}

impl GpuSimilaritySearch {
    pub fn search(
        &self,
        query: &[f32],
        database: &[Vec<f32>],
        top_k: usize,
    ) -> FallbackResult<Vec<(usize, f32)>> {
        // Determine execution target
        let operation = Operation {
            op_type: OperationType::VectorOp(VectorOp::Similarity),
            input_size: (query.len() + database.len() * query.len()) * 4,
            output_size: top_k * 12, // ~12 bytes per (index, score) pair
            complexity: ComplexityClass::Linear,
            data_layout: DataLayout::RowMajor,
        };
        let target = self.ctx.cost_model()
            .lock()
            .unwrap()
            .choose_target(&operation)
            .unwrap_or(ExecutionTarget::CPU);

        match target {
            ExecutionTarget::GPU(_) => match self.search_gpu(query, database, top_k) {
                Ok(results) => FallbackResult::Gpu(results),
                Err(_) => FallbackResult::Cpu(self.search_cpu(query, database, top_k)),
            },
            _ => FallbackResult::Cpu(self.search_cpu(query, database, top_k)),
        }
    }
}
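
`search_gpu` and `search_cpu` are elided above. For completeness, a straightforward CPU fallback sketch (brute-force dot-product scoring plus a top-k cut; the scoring metric is illustrative):

impl GpuSimilaritySearch {
    fn search_cpu(
        &self,
        query: &[f32],
        database: &[Vec<f32>],
        top_k: usize,
    ) -> Vec<(usize, f32)> {
        // Score every database vector against the query.
        let mut scored: Vec<(usize, f32)> = database
            .iter()
            .enumerate()
            .map(|(i, v)| (i, v.iter().zip(query).map(|(a, b)| a * b).sum::<f32>()))
            .collect();
        // Sort by descending score (NaN comparisons resolve arbitrarily).
        scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Less));
        scored.truncate(top_k);
        scored
    }
}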

12. Example Workflows

12.1 Simple Aggregation

// Initialize GPU context
let config = GpuConfig::default();
let ctx = GpuContext::new(config)?;

// Prepare data
let data = vec![1.0f32; 1_000_000];

// Allocate GPU buffer
let mut buffer = ctx.allocator().allocate::<f32>(data.len())?;

// Transfer data
ctx.transfer().upload(&data, &mut buffer)?;

// Launch aggregation kernel
let kernel = load_sum_kernel()?;
let config = LaunchConfig::linear(data.len(), 256);
let result_buffer = ctx.allocator().allocate::<f32>(1)?;
ctx.launcher().launch_sync(
    &kernel,
    &config,
    &[
        KernelArg::Buffer(buffer.as_ref()),
        KernelArg::Buffer(result_buffer.as_ref()),
        KernelArg::Scalar(ScalarValue::U32(data.len() as u32)),
    ],
)?;

// Download result
let mut result = vec![0.0f32];
ctx.transfer().download(&result_buffer, &mut result)?;
println!("Sum: {}", result[0]);

12.2 Async Pipeline

// Create stream for async operations
let stream = Stream::new(ctx.default_device(), StreamPriority::Normal)?;

// Async upload
let upload_handle = ctx.transfer().upload_async(&data, &mut buffer, &stream)?;

// Async kernel launch
let kernel_handle = ctx.launcher().launch_async(&kernel, &config, &args, &stream)?;

// Async download
let download_handle = ctx.transfer().download_async(&result_buffer, &mut result, &stream)?;

// Wait for completion
upload_handle.wait()?;
kernel_handle.wait()?;
let stats = download_handle.wait()?;
println!("Bandwidth: {:.2} GB/s", stats.bandwidth_gbps);

12.3 Multi-GPU Workload

// Initialize multi-GPU context (shared via Arc so threads can hold it)
let config = GpuConfig {
    device_ids: vec![DeviceId(0), DeviceId(1)],
    ..Default::default()
};
let ctx = Arc::new(GpuContext::new(config)?);

// Partition data across GPUs (owned halves so each thread is 'static)
let chunk_size = data.len() / 2;
let (data_0, data_1) = data.split_at(chunk_size);
let (data_0, data_1) = (data_0.to_vec(), data_1.to_vec());

// Execute on both GPUs in parallel
let handle_0 = std::thread::spawn({
    let ctx = Arc::clone(&ctx);
    move || execute_on_gpu(&ctx, DeviceId(0), &data_0)
});
let handle_1 = std::thread::spawn({
    let ctx = Arc::clone(&ctx);
    move || execute_on_gpu(&ctx, DeviceId(1), &data_1)
});

// Collect results
let result_0 = handle_0.join().unwrap()?;
let result_1 = handle_1.join().unwrap()?;

// Combine results
let final_result = combine_results(result_0, result_1);

Conclusion

This API contract provides a comprehensive foundation for GPU acceleration in HeliosDB. Key design decisions:

  1. Type Safety: Leveraging Rust’s type system for compile-time correctness
  2. RAII: Automatic resource management prevents leaks
  3. Abstraction: Vendor-neutral API supports CUDA and ROCm
  4. Performance: Zero-cost abstractions; <5ms transfer overhead for 100MB payloads
  5. Fallback: Automatic CPU fallback ensures reliability
  6. Adaptivity: Cost model learns optimal CPU/GPU routing

Next Steps:

  • Review and iterate on API design (Day 2)
  • Implement prototype (Day 2-3)
  • Create comprehensive examples (Day 3)
  • Integration testing (Day 4)
  • Performance validation (Day 5)

Document Status: DRAFT FOR REVIEW
Review Date: Phase 2 Week 5 Day 3
Approval: Pending Agent 2, Query Optimizer review