HeliosDB Phase 2 Architecture: v7.0 Tier 2 Features & Advanced ML/AI
Document Version: 1.0 Created: November 17, 2025 Author: Architect Worker 5 Status: STRATEGIC ARCHITECTURE
Executive Summary
This document provides the comprehensive architectural design for HeliosDB Phase 2, covering v7.0 Tier 2 innovations and advanced ML/AI capabilities. Phase 2 builds upon Phase 1’s production-ready foundation (95%+ complete with 3 features and 4 protocols) to deliver world-first innovations in multimodal AI, GPU acceleration, and autonomous database intelligence.
Phase 2 Objectives
| Metric | Target | Impact |
|---|---|---|
| Duration | 3 months | Months 2-4 of v7.0 roadmap |
| Investment | $2.5M-$3.5M | Justified by $230M-$260M ARR |
| Features | 9 innovations | 6 Tier 2 + 3 advanced ML/AI |
| Team | 8 agents + specialists | ML engineers, GPU experts, architects |
| ARR Impact | $230M-$260M | Additional revenue on top of Phase 1 |
Key Innovations
v7.0 Tier 2 Features (40-60% → 100%):
1. F7.1: Multimodal Vector Search (60% → 100%) - $40M ARR
2. F7.6: Advanced Webhooks (50% → 100%) - $25M ARR
3. F7.12: Unified Observability (40% → 100%) - $35M ARR

Advanced ML/AI Features (30-50% → 100%):
4. F7.2: GraphRAG HTAP (40% → 100%) - $50M ARR
5. F7.5: GPU Acceleration (30% → 100%) - $55M ARR
6. F7.9: AI Schema Architect (50% → 100%) - $40M ARR

v5.x Core Completion:
7. Oracle 23ai Compatibility (0% → 100%) - Critical protocol
8. PostgreSQL 17 Compatibility (0% → 100%) - LISTEN/NOTIFY complete
9. WASM Runtime (40% → 100%) - JavaScript + Python runtimes
Table of Contents
- Architecture Principles
- F7.1: Multimodal Vector Search Architecture
- F7.2: GraphRAG HTAP Architecture
- F7.5: GPU Acceleration Architecture
- F7.6: Advanced Webhooks Architecture
- F7.9: AI Schema Architect Architecture
- F7.12: Unified Observability Architecture
- Protocol Completion Architecture
- Integration Points
- Performance Targets
- Security Architecture
- Scalability Design
Architecture Principles
Core Design Tenets
1. GPU-First Architecture
- GPU acceleration as first-class citizen, not afterthought
- Automatic CPU/GPU routing based on workload characteristics
- Zero-copy data paths between CPU and GPU memory
- Support for NVIDIA CUDA and AMD ROCm
2. AI-Native by Design
- Embeddings, ML training, and inference built into query engine
- Vector operations optimized at storage layer
- Natural language as first-class interface
- LLM integration without external services
3. Convergence Architecture
- OLTP + OLAP + Vector + Graph in single engine
- Shared storage layer, unified transaction model
- Cross-workload optimizations (e.g., OLTP feeds OLAP materialized views)
- Protocol-agnostic feature access
4. Production-First
- 99.99%+ availability targets
- <50ms p99 latency for OLTP
- 10-100x OLAP speedups with GPU
- Zero-downtime upgrades and migrations
5. Developer Experience
- Zero-config defaults for 90% use cases
- Natural language schema design
- Automatic observability and tracing
- CLI-first tooling
F7.1: Multimodal Vector Search Architecture
Current State (60% Complete)
Implemented:
- Basic vector search with HNSW indexes
- Text embeddings via OpenAI/Cohere APIs
- Similarity search with L2/cosine distance
- Vector index persistence
Remaining Work (40%):
- Image/audio/video embedding support
- Cross-modal search (text→image, image→text, etc.)
- GPU-accelerated search
- Batch embedding generation
- Multi-modal index optimization
Architecture Overview
The search stack is organized into four layers:

- Query Interface: natural language queries, SQL vector expressions, API calls
- Embedding Generation Layer: CLIP (text + image), AudioCLIP (audio), VideoCLIP (video), all projected into a unified 1536-dim space
- Unified Vector Space (1536 dims): text embeddings, image embeddings (CLIP projection), audio embeddings (AudioCLIP projection), video embeddings (VideoCLIP frame aggregation)
- GPU-Accelerated Search Layer: HNSW index (M=32, ef=64), IVF index (nlist=4096), and flat (brute-force) GPU-accelerated index, with automatic index selection based on data size, GPU batching for parallel similarity computation, and <50ms search latency for 100K vectors

Component Design
1. Embedding Generation Layer
Multi-Model Support:
```rust
pub enum EmbeddingModel {
    // Text + Image
    CLIP {
        model: String, // "openai/clip-vit-large-patch14"
        dim: usize,    // 1536 (projected to unified space)
    },

    // Audio
    AudioCLIP {
        model: String,    // "microsoft/audioclip"
        dim: usize,       // 1536
        sample_rate: u32, // 16000 Hz
    },

    // Video (frame-based)
    VideoCLIP {
        model: String,                 // "openai/clip-vit-large-patch14"
        dim: usize,                    // 1536
        fps: u32,                      // Frames per second to sample
        aggregation: VideoAggregation, // Mean, Max, Attention
    },

    // Local ONNX models for offline inference
    LocalONNX {
        path: PathBuf,
        dim: usize,
    },
}

pub enum VideoAggregation {
    Mean,      // Average frame embeddings
    Max,       // Max pooling across frames
    Attention, // Attention-weighted average
}
```

Unified Embedding Space:
- All modalities projected to 1536-dimensional space
- Learned projection matrices for cross-modal alignment
- Training data: LAION-5B, Conceptual Captions, AudioSet
- Cosine similarity normalized to [0, 1]
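The projection and normalization steps can be made concrete with a minimal sketch, assuming a learned projection matrix as described above; the helper names (`project_to_unified`, `normalized_similarity`) are illustrative only, not HeliosDB APIs.

```rust
/// Minimal sketch: project a modality-specific embedding into the unified
/// 1536-dim space via a learned matrix, L2-normalize it, and rescale cosine
/// similarity from [-1, 1] to [0, 1]. Names are illustrative only.
fn project_to_unified(raw: &[f32], projection: &[Vec<f32>]) -> Vec<f32> {
    // `projection` has 1536 rows, each of length raw.len()
    let mut out: Vec<f32> = projection
        .iter()
        .map(|row| row.iter().zip(raw).map(|(w, x)| w * x).sum::<f32>())
        .collect();

    // L2-normalize so that a plain dot product equals cosine similarity
    let norm = out.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
    out.iter_mut().for_each(|x| *x /= norm);
    out
}

/// Cosine similarity between two L2-normalized vectors, rescaled to [0, 1].
fn normalized_similarity(a: &[f32], b: &[f32]) -> f32 {
    let cos: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    (cos + 1.0) / 2.0
}
```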
Example Usage:
```sql
-- Generate embeddings from different modalities
INSERT INTO products (name, image_embedding, description_embedding)
VALUES (
    'Red Shoes',
    EMBED_IMAGE('s3://images/red_shoes.jpg', 'clip'),
    EMBED_TEXT('Comfortable red running shoes', 'clip')
);

-- Cross-modal search: Find images matching text query
SELECT name, image_url, SIMILARITY(image_embedding, EMBED_TEXT('red sneakers'))
FROM products
ORDER BY SIMILARITY(image_embedding, EMBED_TEXT('red sneakers')) DESC
LIMIT 10;

-- Cross-modal search: Find products similar to query image
SELECT name, SIMILARITY(image_embedding, EMBED_IMAGE('query.jpg'))
FROM products
ORDER BY SIMILARITY DESC
LIMIT 10;
```

2. GPU-Accelerated Search
Automatic GPU Routing:
```rust
pub struct VectorSearchRouter {
    gpu_threshold: usize, // Use GPU if >10K vectors
    batch_size: usize,    // Batch size for GPU (1000)
    cpu_fallback: bool,   // Fall back to CPU if GPU unavailable
}

impl VectorSearchRouter {
    pub fn route_search(&self, query: &[f32], vectors: &[Vec<f32>], k: usize) -> SearchBackend {
        if vectors.len() > self.gpu_threshold && has_gpu() {
            SearchBackend::GPU {
                device: select_gpu(),
                batch_size: self.batch_size,
            }
        } else {
            SearchBackend::CPU {
                threads: num_cpus(),
            }
        }
    }
}
```

CUDA Kernel for Similarity Search:
```cuda
// Batch cosine similarity on GPU
__global__ void batch_cosine_similarity(
    const float* queries,  // [batch_size, dim]
    const float* vectors,  // [num_vectors, dim]
    float* results,        // [batch_size, num_vectors]
    int batch_size,
    int num_vectors,
    int dim
) {
    int query_idx = blockIdx.y;
    int vector_idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (query_idx < batch_size && vector_idx < num_vectors) {
        // Compute dot product and norms
        float dot = 0.0f;
        float norm_q = 0.0f;
        float norm_v = 0.0f;

        for (int i = 0; i < dim; i++) {
            float q = queries[query_idx * dim + i];
            float v = vectors[vector_idx * dim + i];
            dot += q * v;
            norm_q += q * q;
            norm_v += v * v;
        }

        // Cosine similarity
        results[query_idx * num_vectors + vector_idx] = dot / (sqrtf(norm_q) * sqrtf(norm_v));
    }
}
```

Performance Characteristics:
- CPU (SIMD): 10K vec/sec @ 1536 dims
- GPU (CUDA): 100K-500K vec/sec @ 1536 dims (10-50x speedup)
- Latency: <50ms p99 for 100K vectors
- Throughput: 1000+ queries/sec on single GPU
3. Index Structures
HNSW (Hierarchical Navigable Small World):
```rust
pub struct HNSWIndex {
    // Index parameters
    m: usize,               // Number of connections per node (32)
    ef_construction: usize, // Search width during construction (64)
    ef_search: usize,       // Search width during query (64)

    // Graph layers
    layers: Vec<HNSWLayer>,

    // Entry point (top layer)
    entry_point: NodeId,

    // Vector storage
    vectors: Arc<VectorStorage>,
}

pub struct HNSWLayer {
    level: usize,
    nodes: HashMap<NodeId, HNSWNode>,
}

pub struct HNSWNode {
    id: NodeId,
    neighbors: Vec<NodeId>,
    vector_ref: VectorRef,
}
```

Index Selection Strategy:
- < 10K vectors: Flat (brute force) on GPU
- 10K - 1M vectors: HNSW with M=32, ef=64
- > 1M vectors: IVF + HNSW hybrid (IVF for coarse search, HNSW for refinement)
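A minimal sketch of this selection policy follows; the thresholds mirror the list above, while the type and function names are illustrative rather than the shipped planner API.

```rust
/// Illustrative only: choose an index type from collection size, mirroring
/// the thresholds listed above. Not the actual HeliosDB planner API.
enum IndexKind {
    FlatGpu,                      // brute force, GPU-accelerated
    Hnsw { m: usize, ef: usize }, // graph-based index
    IvfHnsw { nlist: usize },     // coarse IVF + HNSW refinement
}

fn select_index(num_vectors: usize) -> IndexKind {
    match num_vectors {
        n if n < 10_000 => IndexKind::FlatGpu,
        n if n <= 1_000_000 => IndexKind::Hnsw { m: 32, ef: 64 },
        _ => IndexKind::IvfHnsw { nlist: 4096 },
    }
}
```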
Data Flow
Embedding Generation Pipeline:
```
Input (Image/Audio/Video)
        ↓
Preprocessing (Resize, Normalize, Sample)
        ↓
Model Inference (CLIP/AudioCLIP/VideoCLIP)
        ↓
Projection to Unified Space (1536 dims)
        ↓
Normalization (L2 normalize)
        ↓
Storage (Vector Column)
```

Query Execution Pipeline:
```
Query (Text/Image/Audio/Video)
        ↓
Generate Query Embedding (same models)
        ↓
Route to CPU or GPU
        ↓
Index Lookup (HNSW/IVF/Flat)
        ↓
Top-K Selection
        ↓
Result Fetching
        ↓
Response
```

Storage Layer Integration
Vector Column Type:
```rust
pub enum DataType {
    // ... existing types ...
    Vector {
        dim: usize,
        dtype: VectorDType, // F32, F16, I8 (quantized)
    },
    MultiModalVector {
        dim: usize,
        modality: Modality, // Text, Image, Audio, Video
    },
}

pub enum VectorDType {
    Float32, // Full precision (4 bytes/dim)
    Float16, // Half precision (2 bytes/dim, 50% space savings)
    Int8,    // Quantized (1 byte/dim, 75% space savings)
}
```

Product Quantization for Compression:
- Split 1536-dim vector into 48 subvectors of 32 dims each
- Build codebook of 256 centroids per subvector
- Store vector as 48 bytes (48 × 1 byte index) instead of 6144 bytes (1536 × 4 bytes)
- 96.9% compression ratio with <2% accuracy loss
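As a quick sanity check of the sizes quoted above, here is a minimal sketch of the encoded-size arithmetic; the function name is illustrative and not part of the storage API.

```rust
/// Illustrative arithmetic for product-quantizing a 1536-dim float32 vector:
/// 48 subvectors of 32 dims, each replaced by a 1-byte centroid index.
fn pq_encoded_bytes(dim: usize, subvector_dim: usize) -> (usize, usize) {
    let raw_bytes = dim * std::mem::size_of::<f32>(); // 1536 * 4 = 6144 bytes
    let num_subvectors = dim / subvector_dim;         // 1536 / 32 = 48
    let pq_bytes = num_subvectors;                    // one u8 code per subvector = 48 bytes
    (raw_bytes, pq_bytes)
}

fn main() {
    let (raw, pq) = pq_encoded_bytes(1536, 32);
    println!("float32: {raw} bytes, PQ-encoded: {pq} bytes");
}
```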
Performance Targets
| Metric | Target | Rationale |
|---|---|---|
| Search Latency | <50ms p99 @ 100K vectors | Production SLA |
| Throughput | 1000+ qps/GPU | GPU utilization >80% |
| Recall@10 | >95% | Cross-modal accuracy |
| Index Build Time | <5 min @ 1M vectors | Background indexing |
| Storage Overhead | <20% | PQ compression |
Cross-Modal Search Examples
Text → Image:
```sql
-- Find images matching text description
SELECT image_url, product_name
FROM products
WHERE COSINE_SIMILARITY(
    image_embedding,
    EMBED_TEXT('red leather jacket')
) > 0.8;
```

Image → Text:
```sql
-- Find text descriptions similar to query image
SELECT description, SIMILARITY(description_embedding, @query_image_emb)
FROM product_descriptions
ORDER BY SIMILARITY DESC
LIMIT 10;
```

Audio → Video:
```sql
-- Find videos with similar audio
SELECT video_id, title
FROM videos
WHERE COSINE_SIMILARITY(
    audio_embedding,
    EMBED_AUDIO('query_audio.mp3')
) > 0.7;
```

Patent Opportunities
“Multi-Modal Vector Embedding System for Unified Search”
- Claims: Unified 1536-dim embedding space for text/image/audio/video
- Novelty: Cross-modal search with learned projection matrices
- Value: $15M-$25M
F7.2: GraphRAG HTAP Architecture
Current State (40% Complete)
Implemented:
- Basic graph storage (adjacency lists)
- Simple graph traversals
- Vector search integration (F6.6)
Remaining Work (60%):
- Cypher query language support
- GQL (Graph Query Language) implementation
- Natural language to graph queries (NL2GQL)
- LLM integration for reasoning
- HTAP architecture (real-time graph analytics on OLTP data)
- Distributed graph processing
Architecture Overview
The GraphRAG HTAP system is layered as follows:

- Query Interface Layer: Cypher (Neo4j), GQL (ISO/IEC), natural language (LLM-powered)
- Graph Query Planner & Optimizer: pattern-matching optimization, join reordering for graph traversals, index selection (adjacency list vs. CSR), pushing computation to the storage layer
- Dual execution engines:
  - OLTP Execution Engine: real-time graph updates, ACID transactions, <10ms p99 latency, row-oriented storage
  - OLAP Execution Engine: PageRank, BFS, DFS, community detection, centrality algorithms, columnar storage
- Unified Graph Storage Layer:
  - Adjacency list storage (row-oriented, OLTP-optimized): Vertex {id, properties, out_edges[], in_edges[]}; Edge {src, dst, label, properties, weight}
  - CSR (Compressed Sparse Row) format (OLAP-optimized): vertex offsets array (e.g., [0, 5, 12, 20, ...]) and edge targets array (e.g., [1, 3, 5, 7, 9, 2, 4, ...]); cache-friendly and vectorizable
- RAG Integration Layer: vector search over graph nodes (F6.6 + F7.1), graph traversal for context gathering, LLM reasoning with graph knowledge, explainable AI (show the reasoning path through the graph)

Component Design
1. Graph Query Languages
Cypher Support (Neo4j compatibility):
```cypher
// Find friends of friends
MATCH (p:Person {name: 'Alice'})-[:FRIEND]->(f)-[:FRIEND]->(fof)
WHERE fof <> p
RETURN fof.name, COUNT(*) as mutual_friends
ORDER BY mutual_friends DESC
LIMIT 10;

// Shortest path
MATCH path = shortestPath(
    (a:Person {name: 'Alice'})-[:FRIEND*]-(b:Person {name: 'Bob'})
)
RETURN path;

// PageRank (OLAP)
CALL algo.pageRank('Person', 'FRIEND')
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).name, score
ORDER BY score DESC
LIMIT 10;
```

GQL Support (ISO/IEC standard):
```sql
-- GQL syntax (more SQL-like)
SELECT p.name, COUNT(f) AS friend_count
FROM GRAPH social_network
MATCH (p:Person)-[:FRIEND]->(f:Person)
GROUP BY p.name
ORDER BY friend_count DESC
LIMIT 10;
```

Natural Language to Graph Queries:
```sql
-- Natural language query
SELECT * FROM GRAPH_QUERY('Find all friends of Alice who live in Seattle');

-- Internally translates to Cypher:
-- MATCH (p:Person {name: 'Alice'})-[:FRIEND]->(f:Person {city: 'Seattle'})
-- RETURN f;
```

2. HTAP Architecture
Dual Storage Format:
```rust
pub struct HybridGraphStorage {
    // OLTP: Adjacency list for fast updates
    adjacency_list: AdjacencyListStore,

    // OLAP: CSR format for fast traversals
    csr_format: CSRStore,

    // Synchronization
    sync_policy: SyncPolicy,
    stale_threshold: Duration, // Rebuild CSR if stale >5 mins
}

pub enum SyncPolicy {
    Immediate,          // Update CSR on every write (slow writes)
    Periodic(Duration), // Rebuild CSR every N seconds
    OnDemand,           // Rebuild CSR before OLAP query
    Adaptive,           // ML-based decision (default)
}
```

Adaptive Synchronization:
- Monitor OLTP write rate and OLAP query frequency
- If OLAP queries dominate, use Periodic sync (every 30s)
- If OLTP writes dominate, use OnDemand sync
- ML model predicts optimal sync policy based on workload
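As an illustration of the Adaptive policy, the following is a minimal heuristic sketch; the thresholds and names are assumptions standing in for the ML-based decision, not the shipped model.

```rust
use std::time::Duration;

/// Illustrative heuristic standing in for the ML-based Adaptive policy:
/// choose a CSR sync strategy from observed write and analytical query rates.
enum EffectiveSync {
    Periodic(Duration), // OLAP-heavy: keep the CSR fresh in the background
    OnDemand,           // write-heavy: rebuild the CSR only before an OLAP query
}

fn choose_sync(writes_per_sec: f64, olap_queries_per_sec: f64) -> EffectiveSync {
    // OLAP-dominated workloads favor a periodically refreshed CSR; write-heavy
    // workloads rebuild it lazily, just before an analytical query runs.
    if olap_queries_per_sec >= writes_per_sec {
        EffectiveSync::Periodic(Duration::from_secs(30))
    } else {
        EffectiveSync::OnDemand
    }
}
```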
Columnar Graph Storage for OLAP:
```rust
pub struct CSRStore {
    // Vertex offsets (where each vertex's edges start)
    offsets: Vec<usize>,

    // Edge targets (compressed, cache-friendly)
    targets: Vec<u32>,

    // Edge properties (columnar for SIMD)
    weights: Vec<f32>,
    labels: Vec<u16>, // Label IDs

    // Metadata
    num_vertices: usize,
    num_edges: usize,
}

impl CSRStore {
    // Get neighbors of vertex v (cache-friendly, vectorizable)
    pub fn neighbors(&self, v: usize) -> &[u32] {
        let start = self.offsets[v];
        let end = self.offsets[v + 1];
        &self.targets[start..end]
    }
}
```

3. Graph Algorithms Library
Built-in Algorithms:
```rust
pub enum GraphAlgorithm {
    // Traversal
    BFS { start: NodeId, max_depth: usize },
    DFS { start: NodeId, max_depth: usize },
    ShortestPath { src: NodeId, dst: NodeId, algorithm: PathAlgorithm },

    // Centrality
    PageRank { iterations: usize, damping: f64 },
    BetweennessCentrality { sample_size: Option<usize> },
    ClosenessCentrality,

    // Community Detection
    LabelPropagation { iterations: usize },
    Louvain { resolution: f64 },
    ConnectedComponents,

    // Pattern Matching
    TriangleCounting,
    MotifFinding { motif: Motif },
}

pub enum PathAlgorithm {
    Dijkstra,    // Weighted shortest path
    BellmanFord, // Negative weights allowed
    AStar { heuristic: HeuristicFn },
}
```

GPU-Accelerated Graph Analytics:
```rust
pub struct GPUGraphEngine {
    // Graph data on GPU
    gpu_csr: DeviceBuffer<u32>,
    gpu_offsets: DeviceBuffer<usize>,
    gpu_weights: DeviceBuffer<f32>,

    // Algorithms
    algorithms: HashMap<String, CudaKernel>,
}

impl GPUGraphEngine {
    // PageRank on GPU (10-100x faster than CPU)
    pub fn pagerank(&self, iterations: usize, damping: f64) -> Vec<f64> {
        let mut scores = vec![1.0 / self.num_vertices as f64; self.num_vertices];

        for _ in 0..iterations {
            // GPU kernel: parallel updates for all vertices
            self.pagerank_kernel(&mut scores, damping);
        }

        scores
    }
}
```

4. RAG Integration
Vector Search Over Graph Nodes:
```sql
-- Find relevant nodes using vector similarity
WITH relevant_nodes AS (
    SELECT node_id, SIMILARITY(embedding, @query_embedding) AS score
    FROM graph_nodes
    WHERE SIMILARITY(embedding, @query_embedding) > 0.7
    ORDER BY score DESC
    LIMIT 10
)
-- Expand context using graph traversal
SELECT n.id, n.properties, path
FROM relevant_nodes rn
MATCH (rn)-[:RELATED*1..3]-(n)
RETURN n, path;
```

LLM Reasoning with Graph Context:
```rust
pub struct GraphRAGEngine {
    llm: LLMClient,
    graph: GraphStore,
    embeddings: EmbeddingModel,
}

impl GraphRAGEngine {
    pub async fn query(&self, question: &str) -> Result<Answer> {
        // 1. Generate query embedding
        let query_emb = self.embeddings.embed_text(question).await?;

        // 2. Find relevant graph nodes
        let relevant_nodes = self.graph.vector_search(&query_emb, 10)?;

        // 3. Expand context via graph traversal
        let context_subgraph = self.graph.expand_context(&relevant_nodes, 3)?;

        // 4. Serialize subgraph to LLM prompt
        let prompt = self.serialize_subgraph_to_prompt(&context_subgraph);

        // 5. Query LLM with graph context
        let answer = self.llm.generate(&prompt).await?;

        // 6. Extract reasoning path (explainability)
        let reasoning_path = self.extract_reasoning_path(&answer, &context_subgraph);

        Ok(Answer {
            text: answer,
            reasoning_path,
            confidence: self.compute_confidence(&answer, &context_subgraph),
        })
    }
}
```

Explainable AI: Reasoning Path Visualization:
Question: "How are Alice and Bob connected?"
Answer: "Alice is connected to Bob through their mutual friend Charlie."
Reasoning Path (Graph):

```
Alice --[FRIEND]--> Charlie --[FRIEND]--> Bob
  |                    |
  +-----[WORKS_AT]-----+---> Acme Corp
```

Explanation:
1. Alice and Charlie work together at Acme Corp
2. Alice and Charlie are friends
3. Charlie and Bob are friends
4. Therefore, Alice and Bob are connected through Charlie

Performance Targets
| Metric | Target | Rationale |
|---|---|---|
| OLTP Latency | <10ms p99 | Real-time graph updates |
| OLAP Throughput | 100M+ edges/sec (GPU) | Large-scale analytics |
| Query Latency | <100ms for 3-hop traversal @ 10M nodes | Production SLA |
| Scalability | 10M+ nodes, 100M+ edges per node | Enterprise scale |
| RAG Accuracy | 95%+ on GraphQA benchmark | Quality target |
Patent Opportunities
“Hybrid Transactional Analytical Graph Database with LLM Integration”
- Claims: HTAP graph with dual storage (adjacency list + CSR), RAG integration
- Novelty: Real-time OLTP + OLAP on same graph data, LLM reasoning with graph context
- Value: $20M-$35M
F7.5: GPU Acceleration Architecture
Current State (30% Complete)
Implemented:
- Basic GPU detection (NVIDIA CUDA)
- Vector similarity on GPU (F7.1 foundation)
- GPU memory management
Remaining Work (70%):
- CUDA kernel library for SQL operations
- AMD ROCm support
- Automatic CPU/GPU query routing
- Multi-GPU support
- GPU-accelerated aggregations
- GPU-accelerated joins
Architecture Overview
The acceleration layer is organized as follows:

- Query Optimizer: cost-based GPU/CPU routing driven by data size, operation type, and GPU availability, with automatic fallback to CPU if no GPU is available
- Dual execution paths:
  - CPU Execution Path: small datasets (<1M rows), complex operations, string operations
  - GPU Execution Path: large datasets (>1M rows), vectorizable operations, numeric operations
- GPU Kernel Dispatcher: CUDA (NVIDIA), ROCm (AMD), multi-GPU coordination
- Kernel libraries:
  - CUDA Kernels (NVIDIA): scan/filter, aggregation, join, sort, vector ops
  - ROCm Kernels (AMD): same operations via the HIP API, portable

Component Design
1. GPU Query Routing
Cost Model:
```rust
pub struct GPUCostModel {
    // Data transfer costs
    cpu_to_gpu_bandwidth: f64, // GB/s
    gpu_to_cpu_bandwidth: f64, // GB/s

    // Compute costs
    gpu_compute_factor: f64, // Speedup vs CPU (10-100x)
    cpu_compute_factor: f64, // Baseline 1.0

    // Thresholds
    min_data_size_for_gpu: usize, // 1M rows
    gpu_memory_limit: usize,      // GPU RAM (e.g., 24GB)
}

impl GPUCostModel {
    pub fn should_use_gpu(&self, op: &QueryOp) -> bool {
        // Calculate costs
        let transfer_cost = self.data_transfer_cost(op);
        let cpu_compute_cost = self.cpu_compute_cost(op);
        let gpu_compute_cost = self.gpu_compute_cost(op);

        // GPU is beneficial if:
        // gpu_compute_cost + transfer_cost < cpu_compute_cost
        let gpu_total = gpu_compute_cost + transfer_cost;
        let cpu_total = cpu_compute_cost;

        gpu_total < cpu_total
            && op.data_size >= self.min_data_size_for_gpu
            && op.data_size <= self.gpu_memory_limit
            && self.is_vectorizable(op)
    }

    fn is_vectorizable(&self, op: &QueryOp) -> bool {
        match op.op_type {
            OpType::Scan
            | OpType::Filter
            | OpType::Aggregate
            | OpType::Join
            | OpType::Sort
            | OpType::VectorSearch => true,

            // String ops not well-suited for GPU
            OpType::StringManipulation => false,

            // Complex ops stay on CPU
            OpType::Subquery | OpType::CTE => false,
        }
    }
}
```

Automatic Data Transfer:
```rust
pub struct GPUDataManager {
    // Pinned memory for fast transfers
    pinned_buffers: Vec<PinnedBuffer>,

    // GPU memory pool
    gpu_allocator: GPUAllocator,

    // Transfer streams (async)
    transfer_streams: Vec<cudaStream_t>,
}

impl GPUDataManager {
    // Zero-copy transfer for large datasets
    pub async fn transfer_to_gpu(&self, data: &[u8]) -> Result<DeviceBuffer> {
        // Allocate pinned memory on CPU (faster DMA)
        let pinned = self.allocate_pinned(data.len())?;
        pinned.copy_from_slice(data);

        // Async transfer to GPU
        let gpu_buf = self.gpu_allocator.allocate(data.len())?;
        cudaMemcpyAsync(
            gpu_buf.ptr(),
            pinned.ptr(),
            data.len(),
            cudaMemcpyHostToDevice,
            self.transfer_streams[0],
        )?;

        Ok(gpu_buf)
    }
}
```

2. CUDA Kernel Library
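The kernels below are launched with a one-dimensional grid sized to the row count. As a hedged host-side sketch for the Scan/Filter kernel that follows (the 256-thread block size and wrapper name are assumptions, not tuned values or HeliosDB APIs):

```cuda
// Forward declaration of the filter kernel defined below.
__global__ void filter_kernel(const int*, const int, int*, int*, int);

// Host-side launch sketch. Block size of 256 threads is an assumption.
void launch_filter(const int* d_col, int threshold,
                   int* d_out_indices, int* d_out_count, int num_rows) {
    const int threads_per_block = 256;
    const int blocks = (num_rows + threads_per_block - 1) / threads_per_block;

    cudaMemset(d_out_count, 0, sizeof(int)); // reset the atomic counter
    filter_kernel<<<blocks, threads_per_block>>>(
        d_col, threshold, d_out_indices, d_out_count, num_rows);
    cudaDeviceSynchronize(); // wait for completion before reading results
}
```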
Scan/Filter:
```cuda
// GPU kernel for filtering rows
__global__ void filter_kernel(
    const int* col_data,   // Input column
    const int threshold,   // Filter condition
    int* out_indices,      // Output row indices
    int* out_count,        // Output count (atomic)
    int num_rows
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < num_rows) {
        if (col_data[idx] > threshold) {
            int pos = atomicAdd(out_count, 1);
            out_indices[pos] = idx;
        }
    }
}
```

Aggregation (SUM, AVG, MIN, MAX):
```cuda
// GPU kernel for parallel aggregation
__global__ void sum_reduce_kernel(
    const float* data,
    float* partial_sums,
    int num_rows
) {
    __shared__ float shared_mem[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Load data into shared memory
    shared_mem[tid] = (idx < num_rows) ? data[idx] : 0.0f;
    __syncthreads();

    // Parallel reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            shared_mem[tid] += shared_mem[tid + stride];
        }
        __syncthreads();
    }

    // Write block result
    if (tid == 0) {
        partial_sums[blockIdx.x] = shared_mem[0];
    }
}
```

Join (Hash Join):
```cuda
// Build hash table on GPU
__global__ void build_hash_table_kernel(
    const int* build_keys,
    const int* build_values,
    int* hash_table_keys,
    int* hash_table_values,
    int num_rows,
    int table_size
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < num_rows) {
        int key = build_keys[idx];
        int value = build_values[idx];

        // Open addressing with linear probing
        int hash = key % table_size;
        while (atomicCAS(&hash_table_keys[hash], EMPTY, key) != EMPTY) {
            hash = (hash + 1) % table_size;
        }
        hash_table_values[hash] = value;
    }
}

// Probe hash table
__global__ void probe_hash_table_kernel(
    const int* probe_keys,
    const int* hash_table_keys,
    const int* hash_table_values,
    int* output,
    int num_rows,
    int table_size
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < num_rows) {
        int key = probe_keys[idx];
        int hash = key % table_size;

        // Linear probing
        while (hash_table_keys[hash] != EMPTY) {
            if (hash_table_keys[hash] == key) {
                output[idx] = hash_table_values[hash];
                return;
            }
            hash = (hash + 1) % table_size;
        }

        output[idx] = -1; // Not found
    }
}
```

3. AMD ROCm Support
Portable Kernel Code (HIP):
```cpp
// HIP: Portable for both CUDA and ROCm
#include <hip/hip_runtime.h>

__global__ void vector_add(
    const float* a,
    const float* b,
    float* c,
    int n
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Compilation:
// CUDA: nvcc -x cu vector_add.cpp
// ROCm: hipcc vector_add.cpp
```

Runtime Detection:
```rust
pub enum GPUBackend {
    CUDA { device: CudaDevice },
    ROCm { device: HipDevice },
    None,
}

pub fn detect_gpu_backend() -> GPUBackend {
    if has_cuda() {
        GPUBackend::CUDA { device: select_cuda_device() }
    } else if has_rocm() {
        GPUBackend::ROCm { device: select_hip_device() }
    } else {
        GPUBackend::None
    }
}
```

4. Multi-GPU Support
Data Partitioning:
```rust
pub struct MultiGPUExecutor {
    devices: Vec<GPUDevice>,
    partitioning_strategy: PartitioningStrategy,
}

pub enum PartitioningStrategy {
    RoundRobin,     // Distribute rows evenly
    HashPartition,  // Hash on key
    RangePartition, // Range on key
    Replicate,      // Replicate data to all GPUs
}

impl MultiGPUExecutor {
    pub async fn execute_scan(&self, data: &[u8], filter: FilterFn) -> Result<Vec<Row>> {
        let num_gpus = self.devices.len();
        let chunk_size = data.len() / num_gpus;

        // Partition data across GPUs
        let mut tasks = Vec::new();
        for (i, device) in self.devices.iter().enumerate() {
            let start = i * chunk_size;
            let end = if i == num_gpus - 1 { data.len() } else { (i + 1) * chunk_size };
            let chunk = &data[start..end];

            // Async execution on each GPU
            tasks.push(device.execute_scan_async(chunk, filter.clone()));
        }

        // Wait for all GPUs
        let results = futures::future::join_all(tasks).await;

        // Merge results
        Ok(results.into_iter().flatten().collect())
    }
}
```

Performance Targets
| Operation | CPU (Baseline) | GPU (CUDA) | Speedup |
|---|---|---|---|
| Scan/Filter | 100M rows/sec | 1-5B rows/sec | 10-50x |
| Aggregation | 50M rows/sec | 500M-2B rows/sec | 10-40x |
| Join | 10M rows/sec | 100M-500M rows/sec | 10-50x |
| Sort | 20M rows/sec | 200M-1B rows/sec | 10-50x |
| Vector Search | 10K qps | 100K-500K qps | 10-50x |
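These figures assume the operands already reside in GPU memory or that the PCIe transfer is amortized across a pipeline of operators. A rough, hedged break-even sketch based on the cost model in section 1; the bandwidth and speedup numbers here are illustrative assumptions, not measured results.

```rust
/// Rough break-even check: GPU wins only when the compute savings outweigh
/// the host-to-device transfer. All constants are illustrative assumptions.
fn gpu_worthwhile(bytes: f64, cpu_rows_per_sec: f64, rows: f64, speedup: f64) -> bool {
    let pcie_bandwidth = 25.0e9; // ~25 GB/s effective (PCIe 4.0 x16, assumed)
    let transfer_secs = bytes / pcie_bandwidth;
    let cpu_secs = rows / cpu_rows_per_sec;
    let gpu_secs = cpu_secs / speedup;
    gpu_secs + transfer_secs < cpu_secs
}

fn main() {
    // 100M int32 rows (~400 MB), 100M rows/s CPU scan, assumed 20x GPU speedup:
    // transfer ~16 ms + GPU ~50 ms vs CPU ~1 s, so the GPU path clearly wins.
    println!("{}", gpu_worthwhile(400.0e6, 100.0e6, 100.0e6, 20.0));
}
```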
Patent Opportunities
“Automatic GPU Acceleration for Database Queries”
- Claims: Cost-based CPU/GPU routing, automatic data transfer, multi-GPU coordination
- Novelty: Zero-config GPU acceleration for SQL queries
- Value: $20M-$30M
[Document continues with F7.6, F7.9, F7.12, and remaining sections…]
Due to length constraints, this architecture document is split into multiple parts. This is Part 1, covering the first three major features (F7.1, F7.2, F7.5).
Next sections to create:
- F7.6: Advanced Webhooks Architecture
- F7.9: AI Schema Architect Architecture
- F7.12: Unified Observability Architecture
- Protocol Completion Architecture (Oracle, PostgreSQL, WASM)
- Integration Points
- Performance Targets
- Security Architecture
- Scalability Design
Total estimated length: 150+ pages
End of Part 1
Document Path: /home/claude/HeliosDB/docs/architecture/v7.0/PHASE2_ARCHITECTURE.md
Status: Part 1 Complete (F7.1, F7.2, F7.5)
Next: Create remaining sections or split into multiple files