HeliosDB GPU API Contract Design v2.0
Document Type: Architecture Specification
Version: 2.0
Date: 2025-11-18
Author: Agent 4 (GPU Acceleration Specialist)
Status: DESIGN PHASE - Week 5 Day 1
Target Audience: Agent 2 (Multimodal Vector Features), Query Optimizer, Core Engine
Table of Contents
- Executive Summary
- Design Principles
- Memory Allocation API
- Data Transfer API
- Kernel Execution API
- Cost Model Hooks
- Type System
- Error Handling
- Lifecycle Management
- Performance Guarantees
- Integration Patterns
- Example Workflows
1. Executive Summary
This document defines the low-level API contracts for GPU acceleration in HeliosDB. These contracts establish clear boundaries between:
- Memory Management: Allocation, pooling, and lifecycle
- Data Transfer: Host-device communication patterns
- Kernel Execution: GPU compute invocation
- Cost Modeling: Intelligent CPU/GPU routing decisions
1.1 Design Goals
- Type Safety: Leverage Rust’s type system to prevent GPU programming errors
- Zero-Cost Abstractions: Minimal overhead over raw CUDA/ROCm
- Automatic Fallback: Graceful degradation when GPU unavailable
- Vendor Neutrality: Support NVIDIA, AMD, and future vendors
- Memory Safety: RAII-based lifecycle management
1.2 Target Performance
- Memory Transfer Overhead: <5ms for 100MB transfers
- Kernel Launch Overhead: <0.1ms
- Memory Allocation Overhead: <0.01ms (pooled)
- CPU Fallback Decision: <0.001ms
2. Design Principles
2.1 Zero-Copy Philosophy
Minimize data movement between CPU and GPU:
```rust
// ❌ BAD: Copy to temporary buffer
let temp = data.to_vec();
gpu.upload(&temp)?;

// GOOD: Direct transfer from source
gpu.upload_direct(&data)?;

// BEST: Zero-copy with pinned memory
let pinned = gpu.pin_memory(&data)?;
gpu.upload_pinned(&pinned)?;
```
2.2 RAII Lifecycle Management
All GPU resources use RAII:
```rust
{
    let buffer = gpu.allocate(size)?;
    // Use buffer
    // Automatic cleanup on drop
}
```
2.3 Explicit vs Implicit Synchronization
```rust
// Explicit sync (user control)
gpu.launch_kernel_async(&kernel, &params)?;
gpu.synchronize()?;

// Implicit sync (convenience)
let result = gpu.launch_kernel_sync(&kernel, &params)?;
```
2.4 Type-Safe Device Pointers
```rust
// Type-safe device pointer
pub struct DevicePtr<T> {
    ptr: usize,
    len: usize,
    _phantom: PhantomData<T>,
}

// Prevents wrong-type access at compile time
let f32_ptr: DevicePtr<f32> = gpu.allocate_typed(1024)?;
// ❌ This won't compile:
// let i32_ptr: DevicePtr<i32> = f32_ptr;
```
3. Memory Allocation API
3.1 Core Memory Types
```rust
/// GPU memory buffer with type safety
pub struct GpuBuffer<T> {
    device_ptr: DevicePtr<T>,
    capacity: usize,
    device_id: DeviceId,
    allocator: Weak<dyn GpuAllocator>, // trait object: allocator is a trait
}

/// Pinned host memory for fast transfers
pub struct PinnedBuffer<T> {
    host_ptr: *mut T,
    capacity: usize,
    is_registered: bool,
}

/// Unified memory (CPU and GPU accessible)
pub struct UnifiedBuffer<T> {
    ptr: *mut T,
    capacity: usize,
    hint: AccessHint,
}

#[derive(Copy, Clone, Debug)]
pub enum AccessHint {
    HostMostly,
    DeviceMostly,
    Balanced,
}
```
3.2 Allocator Interface
```rust
/// GPU memory allocator
pub trait GpuAllocator: Send + Sync {
    /// Allocate typed buffer
    fn allocate<T>(&self, count: usize) -> Result<GpuBuffer<T>>
    where
        T: DeviceRepr;

    /// Allocate with alignment
    fn allocate_aligned<T>(
        &self,
        count: usize,
        alignment: usize,
    ) -> Result<GpuBuffer<T>>
    where
        T: DeviceRepr;

    /// Allocate pinned host memory
    fn allocate_pinned<T>(&self, count: usize) -> Result<PinnedBuffer<T>>
    where
        T: DeviceRepr;

    /// Allocate unified memory
    fn allocate_unified<T>(
        &self,
        count: usize,
        hint: AccessHint,
    ) -> Result<UnifiedBuffer<T>>
    where
        T: DeviceRepr;

    /// Get allocation statistics
    fn stats(&self) -> AllocationStats;

    /// Defragment memory (blocking operation)
    fn defragment(&self) -> Result<DefragmentationReport>;
}

/// Allocation statistics
#[derive(Debug, Clone)]
pub struct AllocationStats {
    pub total_allocated: usize,
    pub total_freed: usize,
    pub current_usage: usize,
    pub peak_usage: usize,
    pub fragmentation_ratio: f32,
    pub allocations: u64,
    pub cache_hits: u64,
    pub cache_misses: u64,
}

/// Defragmentation report
#[derive(Debug, Clone)]
pub struct DefragmentationReport {
    pub bytes_moved: usize,
    pub time_ms: f64,
    pub fragmentation_before: f32,
    pub fragmentation_after: f32,
}
```
3.3 Memory Pool Configuration
```rust
/// Memory pool configuration
#[derive(Debug, Clone)]
pub struct PoolConfig {
    /// Maximum pool size in bytes
    pub max_size: usize,

    /// Minimum allocation size (bytes)
    pub min_allocation: usize,

    /// Maximum allocation size (bytes)
    pub max_allocation: usize,

    /// Bucket growth strategy
    pub growth_strategy: GrowthStrategy,

    /// Eviction policy
    pub eviction_policy: EvictionPolicy,

    /// Enable defragmentation
    pub auto_defragment: bool,

    /// Defragmentation threshold
    pub defrag_threshold: f32,
}

#[derive(Debug, Clone, Copy)]
pub enum GrowthStrategy {
    PowerOfTwo,
    Fibonacci,
    Linear(usize),
}

#[derive(Debug, Clone, Copy)]
pub enum EvictionPolicy {
    LRU,
    LFU,
    ARC,
    FIFO,
}

impl Default for PoolConfig {
    fn default() -> Self {
        Self {
            max_size: 1 << 30,         // 1GB
            min_allocation: 4096,      // 4KB
            max_allocation: 256 << 20, // 256MB
            growth_strategy: GrowthStrategy::PowerOfTwo,
            eviction_policy: EvictionPolicy::LRU,
            auto_defragment: true,
            defrag_threshold: 0.3,
        }
    }
}
```
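Callers are expected to start from `PoolConfig::default()` and override only what differs. The sketch below illustrates this with struct-update syntax; the specific values are invented for the example, not recommendations:

```rust
// Hypothetical pool tuned for a workload of large vector buffers.
let pool = PoolConfig {
    max_size: 4 << 30,                 // raise the 1GB default cap to 4GB
    max_allocation: 512 << 20,         // allow single 512MB buffers
    eviction_policy: EvictionPolicy::ARC,
    auto_defragment: false,            // defragment manually in quiet periods
    ..PoolConfig::default()
};
```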
3.4 Memory Buffer Operations
```rust
impl<T: DeviceRepr> GpuBuffer<T> {
    /// Get buffer capacity (elements)
    pub fn capacity(&self) -> usize;

    /// Get buffer size (bytes)
    pub fn size_bytes(&self) -> usize;

    /// Get device ID
    pub fn device(&self) -> DeviceId;

    /// Check if buffer is valid
    pub fn is_valid(&self) -> bool;

    /// Get raw device pointer (unsafe)
    pub unsafe fn as_device_ptr(&self) -> *mut T;

    /// Fill buffer with value
    pub fn fill(&mut self, value: T) -> Result<()>
    where
        T: Copy;

    /// Copy from another GPU buffer (device-to-device)
    pub fn copy_from_device(&mut self, src: &GpuBuffer<T>) -> Result<()>;

    /// Copy to another GPU buffer
    pub fn copy_to_device(&self, dst: &mut GpuBuffer<T>) -> Result<()>;

    /// Create view into buffer
    pub fn view(&self, offset: usize, len: usize) -> Result<GpuBufferView<T>>;

    /// Create mutable view
    pub fn view_mut(
        &mut self,
        offset: usize,
        len: usize,
    ) -> Result<GpuBufferViewMut<T>>;
}
```
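Together these operations support the common initialize-then-copy pattern without a host round-trip. A minimal sketch, assuming a `ctx: GpuContext` as defined in Section 9:

```rust
// Allocate two buffers, zero-fill one, and mirror it device-to-device.
let mut src: GpuBuffer<f32> = ctx.allocator().allocate(1024)?;
let mut dst: GpuBuffer<f32> = ctx.allocator().allocate(1024)?;

src.fill(0.0)?;                // device-side memset
dst.copy_from_device(&src)?;   // stays on the GPU, no host copy

// Operate on the second half only, without a new allocation.
let tail = src.view(512, 512)?;
assert_eq!(tail.len(), 512);
```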
3.5 Advanced Memory Features
```rust
/// Memory-mapped buffer for zero-copy access
pub struct MappedBuffer<T> {
    buffer: GpuBuffer<T>,
    host_ptr: *mut T,
    mapping: MemoryMapping,
}

impl<T: DeviceRepr> MappedBuffer<T> {
    /// Map GPU buffer to host address space
    pub fn map(buffer: GpuBuffer<T>) -> Result<Self>;

    /// Access as host slice (read-only)
    pub fn as_slice(&self) -> &[T];

    /// Access as mutable host slice
    pub fn as_mut_slice(&mut self) -> &mut [T];

    /// Unmap and return underlying buffer
    pub fn unmap(self) -> GpuBuffer<T>;

    /// Synchronize host changes to device
    pub fn sync_to_device(&mut self) -> Result<()>;

    /// Synchronize device changes to host
    pub fn sync_to_host(&mut self) -> Result<()>;
}
```
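The intended round-trip, sketched with a `buffer: GpuBuffer<f32>` already allocated:

```rust
// Map, mutate on the host, push the change, and recover the buffer.
let mut mapped = MappedBuffer::map(buffer)?;
mapped.as_mut_slice()[0] = 42.0;   // ordinary host write
mapped.sync_to_device()?;          // make the write visible to kernels
let buffer = mapped.unmap();       // ownership returns to the caller
```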
4. Data Transfer API
4.1 Transfer Interface
```rust
/// Data transfer manager
pub trait DataTransfer: Send + Sync {
    /// Synchronous host-to-device transfer
    fn upload<T>(&self, src: &[T], dst: &mut GpuBuffer<T>) -> Result<()>
    where
        T: DeviceRepr;

    /// Synchronous device-to-host transfer
    fn download<T>(&self, src: &GpuBuffer<T>, dst: &mut [T]) -> Result<()>
    where
        T: DeviceRepr;

    /// Asynchronous host-to-device transfer
    fn upload_async<T>(
        &self,
        src: &[T],
        dst: &mut GpuBuffer<T>,
        stream: &Stream,
    ) -> Result<TransferHandle>
    where
        T: DeviceRepr;

    /// Asynchronous device-to-host transfer
    fn download_async<T>(
        &self,
        src: &GpuBuffer<T>,
        dst: &mut [T],
        stream: &Stream,
    ) -> Result<TransferHandle>
    where
        T: DeviceRepr;

    /// Device-to-device transfer (same GPU)
    fn copy_device<T>(
        &self,
        src: &GpuBuffer<T>,
        dst: &mut GpuBuffer<T>,
    ) -> Result<()>
    where
        T: DeviceRepr;

    /// Peer-to-peer transfer (different GPUs)
    fn copy_peer<T>(
        &self,
        src: &GpuBuffer<T>,
        dst: &mut GpuBuffer<T>,
        src_device: DeviceId,
        dst_device: DeviceId,
    ) -> Result<()>
    where
        T: DeviceRepr;
}
```
4.2 Transfer Handle
```rust
/// Handle for asynchronous transfer
pub struct TransferHandle {
    id: TransferId,
    stream: StreamId,
    size_bytes: usize,
    direction: TransferDirection,
    start_time: Instant,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransferDirection {
    HostToDevice,
    DeviceToHost,
    DeviceToDevice,
    PeerToPeer,
}

impl TransferHandle {
    /// Wait for transfer to complete
    pub fn wait(self) -> Result<TransferStats>;

    /// Check if transfer is complete (non-blocking)
    pub fn is_complete(&self) -> bool;

    /// Get transfer progress (0.0 to 1.0)
    pub fn progress(&self) -> f32;

    /// Get transfer ID
    pub fn id(&self) -> TransferId;

    /// Get estimated time remaining
    pub fn time_remaining(&self) -> Duration;
}

/// Transfer statistics
#[derive(Debug, Clone)]
pub struct TransferStats {
    pub bytes_transferred: usize,
    pub duration: Duration,
    pub bandwidth_gbps: f64,
    pub direction: TransferDirection,
}
```
4.3 Optimized Transfer Patterns
```rust
/// Batch transfer for multiple buffers
pub struct BatchTransfer {
    transfers: Vec<TransferSpec>,
    strategy: BatchStrategy,
}

pub enum TransferSpec {
    HostToDevice {
        src: Vec<u8>,
        dst: GpuBuffer<u8>,
    },
    DeviceToHost {
        src: GpuBuffer<u8>,
        dst: Vec<u8>,
    },
}

#[derive(Debug, Clone, Copy)]
pub enum BatchStrategy {
    Sequential,
    Parallel,
    Pipelined,
}

impl BatchTransfer {
    /// Create new batch transfer
    pub fn new(strategy: BatchStrategy) -> Self;

    /// Add transfer to batch
    pub fn add(&mut self, spec: TransferSpec);

    /// Execute all transfers
    pub fn execute(self, stream: &Stream) -> Result<Vec<TransferHandle>>;

    /// Execute and wait for all transfers
    pub fn execute_sync(self) -> Result<Vec<TransferStats>>;
}
```
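A short usage sketch; `payload_a`/`payload_b` (byte vectors) and `buf_a`/`buf_b` (pre-allocated `GpuBuffer<u8>` values) are assumed to exist for the example:

```rust
// Queue two uploads and run them with the pipelined strategy.
let mut batch = BatchTransfer::new(BatchStrategy::Pipelined);
batch.add(TransferSpec::HostToDevice { src: payload_a, dst: buf_a });
batch.add(TransferSpec::HostToDevice { src: payload_b, dst: buf_b });

for stats in batch.execute_sync()? {
    println!("{} bytes at {:.1} GB/s", stats.bytes_transferred, stats.bandwidth_gbps);
}
```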
4.4 Pinned Memory Transfers
```rust
/// Pinned memory transfer (faster than pageable)
pub trait PinnedTransfer {
    /// Transfer from pinned host memory
    fn upload_pinned<T>(
        &self,
        src: &PinnedBuffer<T>,
        dst: &mut GpuBuffer<T>,
    ) -> Result<()>
    where
        T: DeviceRepr;

    /// Transfer to pinned host memory
    fn download_pinned<T>(
        &self,
        src: &GpuBuffer<T>,
        dst: &mut PinnedBuffer<T>,
    ) -> Result<()>
    where
        T: DeviceRepr;

    /// Asynchronous pinned transfer
    fn upload_pinned_async<T>(
        &self,
        src: &PinnedBuffer<T>,
        dst: &mut GpuBuffer<T>,
        stream: &Stream,
    ) -> Result<TransferHandle>
    where
        T: DeviceRepr;
}
```
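The expected flow is to stage host data in a pinned buffer once and reuse it across uploads. A sketch, assuming the allocator from Section 3.2 and a `copy_from_slice` accessor on `PinnedBuffer` (that accessor is an assumption for illustration, not part of the contract above):

```rust
// Stage host data in pinned memory, then upload without pageable overhead.
let mut staging: PinnedBuffer<f32> = ctx.allocator().allocate_pinned(data.len())?;
staging.copy_from_slice(&data); // assumed helper; fills the pinned region
ctx.transfer().upload_pinned(&staging, &mut gpu_buf)?;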
```
5. Kernel Execution API
5.1 Kernel Definition
```rust
/// GPU kernel representation
pub struct Kernel {
    name: String,
    source: KernelSource,
    entry_point: String,
    compiled: Option<CompiledKernel>,
}

pub enum KernelSource {
    /// PTX code (NVIDIA)
    Ptx(String),

    /// CUDA source code
    CudaSource(String),

    /// HIP source code (AMD)
    HipSource(String),

    /// Pre-compiled binary
    Binary(Vec<u8>),
}

/// Compiled kernel
pub struct CompiledKernel {
    binary: Vec<u8>,
    metadata: KernelMetadata,
    device_id: DeviceId,
}

/// Kernel metadata
#[derive(Debug, Clone)]
pub struct KernelMetadata {
    pub name: String,
    pub register_count: u32,
    pub shared_memory_bytes: usize,
    pub constant_memory_bytes: usize,
    pub max_threads_per_block: u32,
    pub compute_capability: (u32, u32),
}
```
5.2 Launch Configuration
```rust
/// Kernel launch configuration
#[derive(Debug, Clone)]
pub struct LaunchConfig {
    /// Grid dimensions (blocks)
    pub grid_dim: Dim3,

    /// Block dimensions (threads per block)
    pub block_dim: Dim3,

    /// Shared memory per block (bytes)
    pub shared_mem_bytes: usize,

    /// CUDA stream (optional)
    pub stream: Option<StreamId>,
}

/// 3D dimensions
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Dim3 {
    pub x: u32,
    pub y: u32,
    pub z: u32,
}

impl Dim3 {
    pub fn new(x: u32, y: u32, z: u32) -> Self {
        Self { x, y, z }
    }

    pub fn linear(size: u32) -> Self {
        Self { x: size, y: 1, z: 1 }
    }

    pub fn total(&self) -> u32 {
        self.x * self.y * self.z
    }
}

impl LaunchConfig {
    /// Create configuration for 1D workload
    pub fn linear(total_threads: usize, threads_per_block: u32) -> Self;

    /// Create configuration for 2D workload
    pub fn grid_2d(
        width: u32,
        height: u32,
        block_width: u32,
        block_height: u32,
    ) -> Self;

    /// Optimize configuration for device
    pub fn optimize_for_device(
        workload: usize,
        device: &GpuDevice,
    ) -> Result<Self>;
}
```
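`linear` is expected to round the grid up so that every element is covered even when the workload is not a multiple of the block size. One plausible implementation, shown as a sketch rather than the normative definition:

```rust
impl LaunchConfig {
    // Sketch: cover `total_threads` elements using ceiling division on the grid.
    pub fn linear(total_threads: usize, threads_per_block: u32) -> Self {
        let blocks =
            (total_threads as u32 + threads_per_block - 1) / threads_per_block;
        Self {
            grid_dim: Dim3::linear(blocks.max(1)), // at least one block
            block_dim: Dim3::linear(threads_per_block),
            shared_mem_bytes: 0,
            stream: None,
        }
    }
}
```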
5.3 Kernel Launcher
```rust
/// Kernel launcher interface
pub trait KernelLauncher: Send + Sync {
    /// Launch kernel synchronously
    fn launch_sync(
        &self,
        kernel: &Kernel,
        config: &LaunchConfig,
        args: &[KernelArg],
    ) -> Result<KernelStats>;

    /// Launch kernel asynchronously
    fn launch_async(
        &self,
        kernel: &Kernel,
        config: &LaunchConfig,
        args: &[KernelArg],
        stream: &Stream,
    ) -> Result<LaunchHandle>;

    /// Compile kernel
    fn compile(&self, kernel: &Kernel) -> Result<CompiledKernel>;

    /// Get optimal launch configuration
    fn auto_config(
        &self,
        kernel: &Kernel,
        workload_size: usize,
    ) -> Result<LaunchConfig>;
}

/// Kernel argument
pub enum KernelArg {
    Buffer(GpuBufferRef),
    Scalar(ScalarValue),
}

pub enum ScalarValue {
    I32(i32),
    U32(u32),
    I64(i64),
    U64(u64),
    F32(f32),
    F64(f64),
}

/// Reference to GPU buffer
pub struct GpuBufferRef {
    ptr: usize,
    size: usize,
}
```
5.4 Launch Handle
```rust
/// Handle for asynchronous kernel launch
pub struct LaunchHandle {
    id: LaunchId,
    stream: StreamId,
    start_time: Instant,
    config: LaunchConfig,
}

impl LaunchHandle {
    /// Wait for kernel to complete
    pub fn wait(self) -> Result<KernelStats>;

    /// Check if kernel is complete
    pub fn is_complete(&self) -> bool;

    /// Get launch ID
    pub fn id(&self) -> LaunchId;
}

/// Kernel execution statistics
#[derive(Debug, Clone)]
pub struct KernelStats {
    pub duration: Duration,
    pub grid_dim: Dim3,
    pub block_dim: Dim3,
    pub shared_mem_bytes: usize,
    pub register_count: u32,
    pub occupancy: f32,
}
```
5.5 Stream Management
```rust
/// CUDA/HIP stream for asynchronous execution
pub struct Stream {
    id: StreamId,
    device_id: DeviceId,
    priority: StreamPriority,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum StreamPriority {
    Low,
    Normal,
    High,
}

impl Stream {
    /// Create new stream
    pub fn new(device: &GpuDevice, priority: StreamPriority) -> Result<Self>;

    /// Synchronize stream (wait for all operations)
    pub fn synchronize(&self) -> Result<()>;

    /// Query if stream is idle
    pub fn is_idle(&self) -> bool;

    /// Add callback to stream
    pub fn add_callback<F>(&self, callback: F) -> Result<()>
    where
        F: FnOnce() + Send + 'static;
}
```
6. Cost Model Hooks
6.1 Cost Model Interface
```rust
/// GPU cost model for CPU/GPU routing decisions
pub trait GpuCostModel: Send + Sync {
    /// Estimate GPU execution cost
    fn estimate_gpu_cost(&self, op: &Operation) -> Result<ExecutionCost>;

    /// Estimate CPU execution cost
    fn estimate_cpu_cost(&self, op: &Operation) -> Result<ExecutionCost>;

    /// Determine optimal execution target
    fn choose_target(&self, op: &Operation) -> Result<ExecutionTarget>;

    /// Update cost model with observed execution
    fn learn_from_execution(&mut self, stats: &ExecutionStats);
}

/// Execution cost estimate
#[derive(Debug, Clone)]
pub struct ExecutionCost {
    /// Estimated execution time
    pub time_ms: f64,

    /// Estimated memory usage (bytes)
    pub memory_bytes: usize,

    /// Estimated energy consumption (joules)
    pub energy_joules: f64,

    /// Confidence in estimate (0.0 to 1.0)
    pub confidence: f32,
}

/// Execution target
// Eq omitted: `cpu_ratio` is an f32, which is not Eq.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ExecutionTarget {
    CPU,
    GPU(DeviceId),
    Hybrid { cpu_ratio: f32 },
}
```
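As an illustration of the hook, a deliberately naive model that routes by input size alone might look like the following. This is a sketch with made-up constants, not the shipped model; it uses the `Operation` descriptor defined in the next subsection:

```rust
/// Illustrative-only cost model: route to GPU past a fixed size threshold.
pub struct ThresholdCostModel {
    gpu_threshold_bytes: usize, // e.g. 1 << 20; tuning is workload-specific
    device: DeviceId,
}

impl GpuCostModel for ThresholdCostModel {
    fn estimate_gpu_cost(&self, op: &Operation) -> Result<ExecutionCost> {
        // Assume ~10 GB/s effective transfer plus a flat 0.1ms launch cost.
        Ok(ExecutionCost {
            time_ms: op.input_size as f64 / 1e7 + 0.1,
            memory_bytes: op.input_size + op.output_size,
            energy_joules: 0.0, // not modeled in this sketch
            confidence: 0.5,
        })
    }

    fn estimate_cpu_cost(&self, op: &Operation) -> Result<ExecutionCost> {
        // Assume a ~1 GB/s single-threaded scan rate.
        Ok(ExecutionCost {
            time_ms: op.input_size as f64 / 1e6,
            memory_bytes: op.input_size,
            energy_joules: 0.0,
            confidence: 0.5,
        })
    }

    fn choose_target(&self, op: &Operation) -> Result<ExecutionTarget> {
        if op.input_size >= self.gpu_threshold_bytes {
            Ok(ExecutionTarget::GPU(self.device))
        } else {
            Ok(ExecutionTarget::CPU)
        }
    }

    fn learn_from_execution(&mut self, _stats: &ExecutionStats) {
        // A static model learns nothing; see AdaptiveCostModel below.
    }
}
```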
6.2 Operation Descriptor
```rust
/// Operation to be executed
#[derive(Debug, Clone)]
pub struct Operation {
    pub op_type: OperationType,
    pub input_size: usize,
    pub output_size: usize,
    pub complexity: ComplexityClass,
    pub data_layout: DataLayout,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperationType {
    Aggregation(AggregateOp),
    Scan(ScanOp),
    Join(JoinOp),
    VectorOp(VectorOp),
    MatrixOp(MatrixOp),
    Custom(u32),
}

// PartialEq/Eq required because OperationType derives them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AggregateOp {
    Sum,
    Avg,
    Min,
    Max,
    Count,
}

#[derive(Debug, Clone, Copy)]
pub enum ComplexityClass {
    Linear,
    LogLinear,
    Quadratic,
    Cubic,
}

#[derive(Debug, Clone, Copy)]
pub enum DataLayout {
    RowMajor,
    ColumnMajor,
    Sparse,
}
```
6.3 Adaptive Cost Learning
```rust
/// Adaptive cost model that learns from execution
pub struct AdaptiveCostModel {
    history: ExecutionHistory,
    predictor: CostPredictor,
    config: AdaptiveConfig,
}

#[derive(Debug, Clone)]
pub struct AdaptiveConfig {
    /// Minimum samples before switching target
    pub min_samples: usize,

    /// Confidence threshold for GPU execution
    pub confidence_threshold: f32,

    /// Memory pressure threshold (0.0 to 1.0)
    pub memory_pressure_threshold: f32,

    /// Enable online learning
    pub enable_learning: bool,
}

impl AdaptiveCostModel {
    /// Create new adaptive model
    pub fn new(config: AdaptiveConfig) -> Self;

    /// Predict cost with confidence interval
    pub fn predict_with_confidence(
        &self,
        op: &Operation,
        target: ExecutionTarget,
    ) -> Result<(ExecutionCost, ConfidenceInterval)>;

    /// Get historical statistics
    pub fn get_stats(&self, op_type: OperationType) -> Option<OpStats>;
}

#[derive(Debug, Clone)]
pub struct ConfidenceInterval {
    pub lower_ms: f64,
    pub upper_ms: f64,
    pub confidence_level: f32,
}
```
6.4 Execution Statistics
```rust
/// Execution statistics for learning
#[derive(Debug, Clone)]
pub struct ExecutionStats {
    pub operation: Operation,
    pub target: ExecutionTarget,
    pub actual_time_ms: f64,
    pub memory_used_bytes: usize,
    pub cache_hits: u64,
    pub cache_misses: u64,
    pub transfer_time_ms: f64,
    pub compute_time_ms: f64,
    pub success: bool,
    pub timestamp: Instant,
}

/// Operation statistics
#[derive(Debug, Clone)]
pub struct OpStats {
    pub op_type: OperationType,
    pub execution_count: u64,
    pub cpu_successes: u64,
    pub gpu_successes: u64,
    pub avg_cpu_time_ms: f64,
    pub avg_gpu_time_ms: f64,
    pub speedup_factor: f64,
}
```
7. Type System
7.1 Device-Compatible Types
```rust
/// Marker trait for types that can be transferred to GPU
pub unsafe trait DeviceRepr: Copy + Send + Sync + 'static {
    /// Check if type is properly aligned
    fn is_aligned(ptr: *const Self) -> bool;

    /// Get alignment requirement
    fn alignment() -> usize;
}

// Implement for primitive types
unsafe impl DeviceRepr for f32 {
    fn is_aligned(ptr: *const Self) -> bool {
        (ptr as usize) % 4 == 0
    }

    fn alignment() -> usize {
        4
    }
}

unsafe impl DeviceRepr for f64 {
    fn is_aligned(ptr: *const Self) -> bool {
        (ptr as usize) % 8 == 0
    }

    fn alignment() -> usize {
        8
    }
}

// ... similar for i8, i16, i32, i64, u8, u16, u32, u64
```
7.2 Vector Types
```rust
/// GPU-compatible vector types
#[repr(C, align(16))]
#[derive(Copy, Clone, Debug)]
pub struct Float4 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
    pub w: f32,
}

unsafe impl DeviceRepr for Float4 {
    fn is_aligned(ptr: *const Self) -> bool {
        (ptr as usize) % 16 == 0
    }

    fn alignment() -> usize {
        16
    }
}

// Similar for Float2, Float3, Int4, etc.
```
7.3 Type-Safe Buffer Views
```rust
/// Immutable view into GPU buffer
pub struct GpuBufferView<'a, T> {
    buffer: &'a GpuBuffer<T>,
    offset: usize,
    len: usize,
}

impl<'a, T: DeviceRepr> GpuBufferView<'a, T> {
    /// Get view length
    pub fn len(&self) -> usize;

    /// Check if view is empty
    pub fn is_empty(&self) -> bool;

    /// Split view at index
    pub fn split_at(&self, mid: usize) -> (Self, Self);

    /// Create sub-view
    pub fn subview(&self, offset: usize, len: usize) -> Result<Self>;
}

/// Mutable view into GPU buffer
pub struct GpuBufferViewMut<'a, T> {
    buffer: &'a mut GpuBuffer<T>,
    offset: usize,
    len: usize,
}
```
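Views compose like host slices, which makes partitioning a buffer for parallel work cheap. A short sketch, assuming a `buffer: GpuBuffer<f32>`:

```rust
// Partition one buffer into two non-overlapping halves without copying.
let view = buffer.view(0, buffer.capacity())?;
let (left, right) = view.split_at(view.len() / 2);
assert_eq!(left.len() + right.len(), buffer.capacity());
```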
8. Error Handling
8.1 Error Types
```rust
/// GPU API errors
#[derive(Debug, thiserror::Error)]
pub enum GpuError {
    #[error("GPU device not available")]
    DeviceNotAvailable,

    #[error("CUDA error: {0}")]
    CudaError(String),

    #[error("ROCm error: {0}")]
    RocmError(String),

    #[error("Out of GPU memory: requested {requested}MB, available {available}MB")]
    OutOfMemory { requested: usize, available: usize },

    #[error("Invalid launch configuration: {0}")]
    InvalidLaunchConfig(String),

    #[error("Kernel compilation failed: {0}")]
    CompilationError(String),

    #[error("Transfer failed: {0}")]
    TransferError(String),

    #[error("Unsupported operation: {0}")]
    Unsupported(String),

    #[error("Synchronization failed: {0}")]
    SyncError(String),

    #[error("Invalid buffer access: offset={offset}, len={len}, capacity={capacity}")]
    InvalidAccess {
        offset: usize,
        len: usize,
        capacity: usize,
    },
}
```
8.2 Result Types
```rust
/// GPU operation result
pub type GpuResult<T> = Result<T, GpuError>;

/// Result with optional CPU fallback
pub enum FallbackResult<T> {
    Gpu(T),
    Cpu(T),
    Error(GpuError),
}

impl<T> FallbackResult<T> {
    pub fn unwrap(self) -> T {
        match self {
            Self::Gpu(v) | Self::Cpu(v) => v,
            Self::Error(e) => panic!("GPU operation failed: {}", e),
        }
    }

    pub fn is_gpu(&self) -> bool {
        matches!(self, Self::Gpu(_))
    }
}
```
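Callers that care which path served a result can match instead of unwrapping. For example, anticipating the `GpuAggregateExecutor` from Section 11.1:

```rust
// Record which path served the query before using the value.
match executor.execute(&data, AggregateOp::Sum) {
    FallbackResult::Gpu(sum) => println!("GPU sum: {sum}"),
    FallbackResult::Cpu(sum) => println!("CPU fallback sum: {sum}"),
    FallbackResult::Error(e) => eprintln!("both paths failed: {e}"),
}
```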
8.3 Error Recovery
```rust
/// Error recovery strategy
pub trait ErrorRecovery {
    /// Attempt to recover from error
    fn try_recover(&self, error: &GpuError) -> RecoveryAction;
}

#[derive(Debug, Clone, Copy)]
pub enum RecoveryAction {
    /// Retry operation
    Retry,

    /// Fall back to CPU
    FallbackToCpu,

    /// Abort with error
    Abort,

    /// Reduce memory usage and retry
    ReduceMemory,
}
```
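One plausible default policy, sketched here rather than mandated: shrink on out-of-memory, retry transient transfer and sync failures, and fall back to CPU when the device is missing or the operation is unsupported.

```rust
/// Sketch of one reasonable recovery policy; not normative.
struct DefaultRecovery;

impl ErrorRecovery for DefaultRecovery {
    fn try_recover(&self, error: &GpuError) -> RecoveryAction {
        match error {
            GpuError::OutOfMemory { .. } => RecoveryAction::ReduceMemory,
            GpuError::TransferError(_) | GpuError::SyncError(_) => RecoveryAction::Retry,
            GpuError::DeviceNotAvailable | GpuError::Unsupported(_) => {
                RecoveryAction::FallbackToCpu
            }
            _ => RecoveryAction::Abort,
        }
    }
}
```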
9. Lifecycle Management
9.1 Resource Initialization
```rust
/// GPU context initialization
pub struct GpuContext {
    devices: Vec<GpuDevice>,
    allocator: Arc<dyn GpuAllocator>,
    transfer: Arc<dyn DataTransfer>,
    launcher: Arc<dyn KernelLauncher>,
    cost_model: Arc<Mutex<dyn GpuCostModel>>,
}

impl GpuContext {
    /// Initialize GPU context
    pub fn new(config: GpuConfig) -> GpuResult<Self>;

    /// Get device by ID
    pub fn device(&self, id: DeviceId) -> Option<&GpuDevice>;

    /// Get default device
    pub fn default_device(&self) -> &GpuDevice;

    /// Get allocator
    pub fn allocator(&self) -> &Arc<dyn GpuAllocator>;

    /// Get data transfer interface
    pub fn transfer(&self) -> &Arc<dyn DataTransfer>;

    /// Get kernel launcher
    pub fn launcher(&self) -> &Arc<dyn KernelLauncher>;

    /// Get cost model
    pub fn cost_model(&self) -> &Arc<Mutex<dyn GpuCostModel>>;

    /// Shutdown context
    pub fn shutdown(self) -> GpuResult<()>;
}

/// GPU configuration
#[derive(Debug, Clone)]
pub struct GpuConfig {
    pub device_ids: Vec<DeviceId>,
    pub memory_pool_config: PoolConfig,
    pub enable_peer_access: bool,
    pub cost_model_config: AdaptiveConfig,
}
```
9.2 Resource Cleanup
```rust
impl Drop for GpuContext {
    fn drop(&mut self) {
        // Automatic cleanup:
        // 1. Wait for all pending operations
        // 2. Free all allocations
        // 3. Destroy streams
        // 4. Release devices
    }
}

impl<T> Drop for GpuBuffer<T> {
    fn drop(&mut self) {
        // Return buffer to pool or free
    }
}
```
10. Performance Guarantees
10.1 Latency Guarantees
| Operation | Maximum Latency | Typical Latency |
|---|---|---|
| Memory allocation (pooled) | 0.1ms | 0.01ms |
| Memory allocation (new) | 5ms | 1ms |
| Kernel launch | 0.5ms | 0.05ms |
| Host-to-device transfer (1MB) | 2ms | 0.5ms |
| Device-to-host transfer (1MB) | 2ms | 0.5ms |
| Synchronization | 1ms | 0.1ms |
10.2 Throughput Guarantees
| Operation | Minimum Throughput | Target Throughput |
|---|---|---|
| Memory transfer (pinned) | 5 GB/s | 10 GB/s |
| Memory transfer (pageable) | 2 GB/s | 4 GB/s |
| Kernel execution (FLOPS) | 1 TFLOPS | 10 TFLOPS |
10.3 Memory Guarantees
| Metric | Guarantee |
|---|---|
| Memory leak | Zero leaks (RAII) |
| Fragmentation | <30% after defrag |
| Pool overhead | <5% of allocated |
| Allocation failure | Graceful degradation |
11. Integration Patterns
11.1 Query Execution Integration
```rust
/// Example: GPU-accelerated aggregation
pub struct GpuAggregateExecutor {
    ctx: Arc<GpuContext>,
}

impl GpuAggregateExecutor {
    pub fn execute(&self, data: &[f32], op: AggregateOp) -> FallbackResult<f32> {
        // Check if GPU execution is beneficial
        let operation = Operation {
            op_type: OperationType::Aggregation(op),
            input_size: data.len() * 4,
            output_size: 4,
            complexity: ComplexityClass::Linear,
            data_layout: DataLayout::RowMajor,
        };

        let target = self.ctx.cost_model()
            .lock()
            .unwrap()
            .choose_target(&operation)
            .unwrap();

        match target {
            ExecutionTarget::GPU(device_id) => {
                match self.execute_gpu(data, op, device_id) {
                    Ok(result) => FallbackResult::Gpu(result),
                    Err(_) => {
                        // Fallback to CPU
                        FallbackResult::Cpu(self.execute_cpu(data, op))
                    }
                }
            }
            ExecutionTarget::CPU => {
                FallbackResult::Cpu(self.execute_cpu(data, op))
            }
            ExecutionTarget::Hybrid { .. } => {
                // Hybrid execution not implemented yet
                FallbackResult::Cpu(self.execute_cpu(data, op))
            }
        }
    }

    fn execute_gpu(
        &self,
        data: &[f32],
        op: AggregateOp,
        device_id: DeviceId,
    ) -> GpuResult<f32> {
        // 1. Allocate GPU buffer
        let mut buffer = self.ctx.allocator().allocate::<f32>(data.len())?;

        // 2. Transfer data
        self.ctx.transfer().upload(data, &mut buffer)?;

        // 3. Launch kernel
        let kernel = self.get_aggregate_kernel(op)?;
        let config = LaunchConfig::linear(data.len(), 256);
        let result_buffer = self.ctx.allocator().allocate::<f32>(1)?;

        let args = vec![
            KernelArg::Buffer(buffer.as_ref()),
            KernelArg::Buffer(result_buffer.as_ref()),
            KernelArg::Scalar(ScalarValue::U32(data.len() as u32)),
        ];

        self.ctx.launcher().launch_sync(&kernel, &config, &args)?;

        // 4. Download result
        let mut result = vec![0.0f32; 1];
        self.ctx.transfer().download(&result_buffer, &mut result)?;

        Ok(result[0])
    }

    fn execute_cpu(&self, data: &[f32], op: AggregateOp) -> f32 {
        // CPU fallback implementation
        match op {
            AggregateOp::Sum => data.iter().sum(),
            AggregateOp::Avg => data.iter().sum::<f32>() / data.len() as f32,
            AggregateOp::Min => {
                *data.iter().min_by(|a, b| a.partial_cmp(b).unwrap()).unwrap()
            }
            AggregateOp::Max => {
                *data.iter().max_by(|a, b| a.partial_cmp(b).unwrap()).unwrap()
            }
            AggregateOp::Count => data.len() as f32,
        }
    }
}
```
11.2 Vector Search Integration
```rust
/// Example: GPU-accelerated similarity search
pub struct GpuSimilaritySearch {
    ctx: Arc<GpuContext>,
}

impl GpuSimilaritySearch {
    pub fn search(
        &self,
        query: &[f32],
        database: &[Vec<f32>],
        top_k: usize,
    ) -> FallbackResult<Vec<(usize, f32)>> {
        // Determine execution target
        let operation = Operation {
            op_type: OperationType::VectorOp(VectorOp::Similarity),
            input_size: (query.len() + database.len() * query.len()) * 4,
            output_size: top_k * 12,
            complexity: ComplexityClass::Linear,
            data_layout: DataLayout::RowMajor,
        };

        let target = self.ctx.cost_model()
            .lock()
            .unwrap()
            .choose_target(&operation)
            .unwrap();

        match target {
            ExecutionTarget::GPU(_) => {
                match self.search_gpu(query, database, top_k) {
                    Ok(results) => FallbackResult::Gpu(results),
                    Err(_) => FallbackResult::Cpu(self.search_cpu(query, database, top_k)),
                }
            }
            _ => FallbackResult::Cpu(self.search_cpu(query, database, top_k)),
        }
    }
}
```
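The `search_gpu` and `search_cpu` helpers are elided above. For reference, a straightforward CPU fallback might brute-force a dot-product score and keep the top-k, as in this sketch (the dot-product metric is an assumption; the contract does not fix the similarity function):

```rust
impl GpuSimilaritySearch {
    // Sketch of the elided CPU fallback: brute-force dot-product top-k.
    fn search_cpu(
        &self,
        query: &[f32],
        database: &[Vec<f32>],
        top_k: usize,
    ) -> Vec<(usize, f32)> {
        let mut scored: Vec<(usize, f32)> = database
            .iter()
            .enumerate()
            .map(|(i, v)| (i, v.iter().zip(query).map(|(a, b)| a * b).sum()))
            .collect();
        // Highest similarity first; NaN-free input is assumed.
        scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        scored.truncate(top_k);
        scored
    }
}
```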
12. Example Workflows
12.1 Simple Aggregation
```rust
// Initialize GPU context
let config = GpuConfig::default();
let ctx = GpuContext::new(config)?;

// Prepare data
let data = vec![1.0f32; 1_000_000];

// Allocate GPU buffer
let mut buffer = ctx.allocator().allocate::<f32>(data.len())?;

// Transfer data
ctx.transfer().upload(&data, &mut buffer)?;

// Launch aggregation kernel
let kernel = load_sum_kernel()?;
let config = LaunchConfig::linear(data.len(), 256);
let result_buffer = ctx.allocator().allocate::<f32>(1)?;

ctx.launcher().launch_sync(
    &kernel,
    &config,
    &[
        KernelArg::Buffer(buffer.as_ref()),
        KernelArg::Buffer(result_buffer.as_ref()),
        KernelArg::Scalar(ScalarValue::U32(data.len() as u32)),
    ],
)?;

// Download result
let mut result = vec![0.0f32];
ctx.transfer().download(&result_buffer, &mut result)?;

println!("Sum: {}", result[0]);
```
12.2 Async Pipeline
```rust
// Create stream for async operations
let stream = Stream::new(ctx.default_device(), StreamPriority::Normal)?;

// Async upload
let upload_handle = ctx.transfer()
    .upload_async(&data, &mut buffer, &stream)?;

// Async kernel launch
let kernel_handle = ctx.launcher()
    .launch_async(&kernel, &config, &args, &stream)?;

// Async download
let download_handle = ctx.transfer()
    .download_async(&result_buffer, &mut result, &stream)?;

// Wait for completion
upload_handle.wait()?;
kernel_handle.wait()?;
let stats = download_handle.wait()?;

println!("Bandwidth: {:.2} GB/s", stats.bandwidth_gbps);
```
12.3 Multi-GPU Workload
```rust
// Initialize multi-GPU context (Arc so both worker threads can share it)
let config = GpuConfig {
    device_ids: vec![DeviceId(0), DeviceId(1)],
    ..Default::default()
};
let ctx = Arc::new(GpuContext::new(config)?);

// Partition data across GPUs (owned halves so they can move into threads)
let chunk_size = data.len() / 2;
let (data_0, data_1) = data.split_at(chunk_size);
let (data_0, data_1) = (data_0.to_vec(), data_1.to_vec());

// Execute on both GPUs in parallel
let handle_0 = std::thread::spawn({
    let ctx = Arc::clone(&ctx);
    move || execute_on_gpu(&ctx, DeviceId(0), &data_0)
});

let handle_1 = std::thread::spawn({
    let ctx = Arc::clone(&ctx);
    move || execute_on_gpu(&ctx, DeviceId(1), &data_1)
});

// Collect results
let result_0 = handle_0.join().unwrap()?;
let result_1 = handle_1.join().unwrap()?;

// Combine results
let final_result = combine_results(result_0, result_1);
```
Conclusion
This API contract provides a comprehensive foundation for GPU acceleration in HeliosDB. Key design decisions:
- Type Safety: Leveraging Rust’s type system for compile-time correctness
- RAII: Automatic resource management prevents leaks
- Abstraction: Vendor-neutral API supports CUDA and ROCm
- Performance: Zero-cost abstractions; <5ms transfer overhead for 100MB payloads (Section 1.2)
- Fallback: Automatic CPU fallback ensures reliability
- Adaptivity: Cost model learns optimal CPU/GPU routing
Next Steps:
- Review and iterate on API design (Day 2)
- Implement prototype (Day 2-3)
- Create comprehensive examples (Day 3)
- Integration testing (Day 4)
- Performance validation (Day 5)
Document Status: DRAFT FOR REVIEW
Review Date: Phase 2 Week 5 Day 3
Approval: Pending Agent 2, Query Optimizer review