HeliosDB Phase 2 Architecture: v7.0 Tier 2 Features & Advanced ML/AI


Document Version: 1.0
Created: November 17, 2025
Author: Architect Worker 5
Status: STRATEGIC ARCHITECTURE


Executive Summary

This document provides the comprehensive architectural design for HeliosDB Phase 2, covering v7.0 Tier 2 innovations and advanced ML/AI capabilities. Phase 2 builds upon Phase 1’s production-ready foundation (95%+ complete with 3 features and 4 protocols) to deliver world-first innovations in multimodal AI, GPU acceleration, and autonomous database intelligence.

Phase 2 Objectives

| Metric | Target | Impact |
|---|---|---|
| Duration | 3 months | Months 2-4 of v7.0 roadmap |
| Investment | $2.5M-$3.5M | Justified by $230M-$260M ARR |
| Features | 9 innovations | 6 Tier 2 + 3 advanced ML/AI |
| Team | 8 agents + specialists | ML engineers, GPU experts, architects |
| ARR Impact | $230M-$260M | Additional revenue on top of Phase 1 |

Key Innovations

v7.0 Tier 2 Features (40-60% → 100%):

  1. F7.1: Multimodal Vector Search (60% → 100%) - $40M ARR
  2. F7.6: Advanced Webhooks (50% → 100%) - $25M ARR
  3. F7.12: Unified Observability (40% → 100%) - $35M ARR

Advanced ML/AI Features (30-50% → 100%):

  4. F7.2: GraphRAG HTAP (40% → 100%) - $50M ARR
  5. F7.5: GPU Acceleration (30% → 100%) - $55M ARR
  6. F7.9: AI Schema Architect (50% → 100%) - $40M ARR

v5.x Core Completion:

  7. Oracle 23ai Compatibility (0% → 100%) - Critical protocol
  8. PostgreSQL 17 Compatibility (0% → 100%) - LISTEN/NOTIFY complete
  9. WASM Runtime (40% → 100%) - JavaScript + Python runtimes


Table of Contents

  1. Architecture Principles
  2. F7.1: Multimodal Vector Search Architecture
  3. F7.2: GraphRAG HTAP Architecture
  4. F7.5: GPU Acceleration Architecture
  5. F7.6: Advanced Webhooks Architecture
  6. F7.9: AI Schema Architect Architecture
  7. F7.12: Unified Observability Architecture
  8. Protocol Completion Architecture
  9. Integration Points
  10. Performance Targets
  11. Security Architecture
  12. Scalability Design

Architecture Principles

Core Design Tenets

  1. GPU-First Architecture

    • GPU acceleration as first-class citizen, not afterthought
    • Automatic CPU/GPU routing based on workload characteristics
    • Zero-copy data paths between CPU and GPU memory
    • Support for NVIDIA CUDA and AMD ROCm
  2. AI-Native by Design

    • Embeddings, ML training, and inference built into query engine
    • Vector operations optimized at storage layer
    • Natural language as first-class interface
    • LLM integration without external services
  3. Convergence Architecture

    • OLTP + OLAP + Vector + Graph in single engine
    • Shared storage layer, unified transaction model
    • Cross-workload optimizations (e.g., OLTP feeds OLAP materialized views)
    • Protocol-agnostic feature access
  4. Production-First

    • 99.99%+ availability targets
    • <50ms p99 latency for OLTP
    • 10-100x OLAP speedups with GPU
    • Zero-downtime upgrades and migrations
  5. Developer Experience

    • Zero-config defaults for 90% use cases
    • Natural language schema design
    • Automatic observability and tracing
    • CLI-first tooling

F7.1: Multimodal Vector Search Architecture

Current State (60% Complete)

Implemented:

  • Basic vector search with HNSW indexes
  • Text embeddings via OpenAI/Cohere APIs
  • Similarity search with L2/cosine distance
  • Vector index persistence

Remaining Work (40%):

  • Image/audio/video embedding support
  • Cross-modal search (text→image, image→text, etc.)
  • GPU-accelerated search
  • Batch embedding generation
  • Multi-modal index optimization

Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│ Multimodal Vector Search │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌──────────────────────────────────┐ │
│ │ Query Interface│ │ Embedding Generation Layer │ │
│ │ - NL queries │──────▶│ - CLIP (text + image) │ │
│ │ - SQL vectors │ │ - AudioCLIP (audio) │ │
│ │ - API calls │ │ - VideoCLIP (video) │ │
│ └─────────────────┘ │ - Unified 1536-dim space │ │
│ └──────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────┤
│ │ Unified Vector Space (1536 dims) │
│ │ - Text embeddings │
│ │ - Image embeddings (CLIP projection) │
│ │ - Audio embeddings (AudioCLIP projection) │
│ │ - Video embeddings (VideoCLIP frame aggregation) │
│ └──────────────────────────────────────────────────────────────┤
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ GPU-Accelerated Search Layer ││
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ ││
│ │ │ HNSW Index │ │ IVF Index │ │ Flat (brute) │ ││
│ │ │ (M=32, ef=64)│ │ (nlist=4096) │ │ GPU-accelerated │ ││
│ │ └──────────────┘ └──────────────┘ └──────────────────┘ ││
│ │ - Automatic index selection based on data size ││
│ │ - GPU batching for parallel similarity computation ││
│ │ - <50ms search latency for 100K vectors ││
│ └──────────────────────────────────────────────────────────────┘│
│ │
└───────────────────────────────────────────────────────────────────┘
```

Component Design

1. Embedding Generation Layer

Multi-Model Support:

```rust
pub enum EmbeddingModel {
    // Text + Image
    CLIP {
        model: String,    // "openai/clip-vit-large-patch14"
        dim: usize,       // 1536 (projected to unified space)
    },
    // Audio
    AudioCLIP {
        model: String,    // "microsoft/audioclip"
        dim: usize,       // 1536
        sample_rate: u32, // 16000 Hz
    },
    // Video (frame-based)
    VideoCLIP {
        model: String,                 // "openai/clip-vit-large-patch14"
        dim: usize,                    // 1536
        fps: u32,                      // Frames per second to sample
        aggregation: VideoAggregation, // Mean, Max, Attention
    },
    // Local ONNX models for offline inference
    LocalONNX {
        path: PathBuf,
        dim: usize,
    },
}

pub enum VideoAggregation {
    Mean,      // Average frame embeddings
    Max,       // Max pooling across frames
    Attention, // Attention-weighted average
}
```

Unified Embedding Space:

  • All modalities projected to 1536-dimensional space
  • Learned projection matrices for cross-modal alignment
  • Training data: LAION-5B, Conceptual Captions, AudioSet
  • Cosine similarity normalized to [0, 1]
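To make the normalization concrete, here is a minimal sketch (illustrative, not the shipped implementation) of L2-normalizing an embedding and scoring two normalized embeddings; mapping cosine similarity from [-1, 1] to [0, 1] via (cos + 1) / 2 is an assumed choice consistent with the bullets above:

```rust
/// L2-normalize an embedding in place.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}

/// For unit vectors, the dot product equals cosine similarity;
/// shift/scale maps [-1, 1] to the documented [0, 1] range.
fn unified_similarity(a: &[f32], b: &[f32]) -> f32 {
    let cos: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    (cos + 1.0) / 2.0
}
```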

Example Usage:

```sql
-- Generate embeddings from different modalities
INSERT INTO products (name, image_embedding, description_embedding)
VALUES (
    'Red Shoes',
    EMBED_IMAGE('s3://images/red_shoes.jpg', 'clip'),
    EMBED_TEXT('Comfortable red running shoes', 'clip')
);

-- Cross-modal search: find images matching a text query
SELECT name, image_url,
       SIMILARITY(image_embedding, EMBED_TEXT('red sneakers')) AS score
FROM products
ORDER BY score DESC
LIMIT 10;

-- Cross-modal search: find products similar to a query image
SELECT name, SIMILARITY(image_embedding, EMBED_IMAGE('query.jpg')) AS score
FROM products
ORDER BY score DESC
LIMIT 10;
```

2. GPU-Accelerated Search Layer

Automatic GPU Routing:

```rust
pub struct VectorSearchRouter {
    gpu_threshold: usize, // Use GPU if >10K vectors
    batch_size: usize,    // Batch size for GPU (1000)
    cpu_fallback: bool,   // Fall back to CPU if GPU unavailable
}

impl VectorSearchRouter {
    pub fn route_search(
        &self,
        query: &[f32],
        vectors: &[Vec<f32>],
        k: usize,
    ) -> SearchBackend {
        if vectors.len() > self.gpu_threshold && has_gpu() {
            SearchBackend::GPU {
                device: select_gpu(),
                batch_size: self.batch_size,
            }
        } else {
            SearchBackend::CPU {
                threads: num_cpus(),
            }
        }
    }
}
```

CUDA Kernel for Similarity Search:

```cuda
// Batch cosine similarity on GPU
__global__ void batch_cosine_similarity(
    const float* queries,  // [batch_size, dim]
    const float* vectors,  // [num_vectors, dim]
    float* results,        // [batch_size, num_vectors]
    int batch_size,
    int num_vectors,
    int dim
) {
    int query_idx = blockIdx.y;
    int vector_idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (query_idx < batch_size && vector_idx < num_vectors) {
        // Compute dot product and norms in a single pass
        float dot = 0.0f;
        float norm_q = 0.0f;
        float norm_v = 0.0f;
        for (int i = 0; i < dim; i++) {
            float q = queries[query_idx * dim + i];
            float v = vectors[vector_idx * dim + i];
            dot += q * v;
            norm_q += q * q;
            norm_v += v * v;
        }
        // Cosine similarity
        results[query_idx * num_vectors + vector_idx] =
            dot / (sqrtf(norm_q) * sqrtf(norm_v));
    }
}
```

Performance Characteristics:

  • CPU (SIMD): 10K vec/sec @ 1536 dims
  • GPU (CUDA): 100K-500K vec/sec @ 1536 dims (10-50x speedup)
  • Latency: <50ms p99 for 100K vectors
  • Throughput: 1000+ queries/sec on single GPU

3. Index Structures

HNSW (Hierarchical Navigable Small World):

```rust
pub struct HNSWIndex {
    // Index parameters
    m: usize,               // Number of connections per node (32)
    ef_construction: usize, // Search width during construction (64)
    ef_search: usize,       // Search width during query (64)
    // Graph layers
    layers: Vec<HNSWLayer>,
    // Entry point (top layer)
    entry_point: NodeId,
    // Vector storage
    vectors: Arc<VectorStorage>,
}

pub struct HNSWLayer {
    level: usize,
    nodes: HashMap<NodeId, HNSWNode>,
}

pub struct HNSWNode {
    id: NodeId,
    neighbors: Vec<NodeId>,
    vector_ref: VectorRef,
}
```

Index Selection Strategy:

  • < 10K vectors: Flat (brute force) on GPU
  • 10K - 1M vectors: HNSW with M=32, ef=64
  • > 1M vectors: IVF + HNSW hybrid (IVF for coarse search, HNSW for refinement)
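As a sketch of this selection logic (the enum name and thresholds mirror the list above and are illustrative, not the exact API):

```rust
enum IndexChoice {
    FlatGpu,                            // brute force on GPU
    Hnsw { m: usize, ef: usize },       // graph-based index
    IvfHnsw { nlist: usize, m: usize }, // IVF coarse search + HNSW refinement
}

fn select_index(num_vectors: usize) -> IndexChoice {
    if num_vectors < 10_000 {
        IndexChoice::FlatGpu
    } else if num_vectors <= 1_000_000 {
        IndexChoice::Hnsw { m: 32, ef: 64 }
    } else {
        IndexChoice::IvfHnsw { nlist: 4096, m: 32 }
    }
}
```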

Data Flow

Embedding Generation Pipeline:

Input (Image/Audio/Video) → Preprocessing (resize, normalize, sample) → Model Inference (CLIP/AudioCLIP/VideoCLIP) → Projection to Unified Space (1536 dims) → L2 Normalization → Storage (vector column)

Query Execution Pipeline:

Query (Text/Image/Audio/Video) → Generate Query Embedding (same models) → Route to CPU or GPU → Index Lookup (HNSW/IVF/Flat) → Top-K Selection → Result Fetching → Response

Storage Layer Integration

Vector Column Type:

```rust
pub enum DataType {
    // ... existing types ...
    Vector {
        dim: usize,
        dtype: VectorDType, // F32, F16, I8 (quantized)
    },
    MultiModalVector {
        dim: usize,
        modality: Modality, // Text, Image, Audio, Video
    },
}

pub enum VectorDType {
    Float32, // Full precision (4 bytes/dim)
    Float16, // Half precision (2 bytes/dim, 50% space savings)
    Int8,    // Quantized (1 byte/dim, 75% space savings)
}
```

Product Quantization for Compression:

  • Split 1536-dim vector into 48 subvectors of 32 dims each
  • Build codebook of 256 centroids per subvector
  • Store vector as 48 bytes (48 × 1 byte index) instead of 6144 bytes (1536 × 4 bytes)
  • 96.9% compression ratio with <2% accuracy loss
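A minimal encoder sketch under these parameters (codebook training, e.g. k-means, is out of scope; the function and types are illustrative, not the shipped API):

```rust
const SUBVECTORS: usize = 48;
const SUB_DIM: usize = 32; // 48 * 32 = 1536

/// Encode one vector as 48 centroid indices: 48 bytes instead of 6144.
/// `codebooks[s]` holds the 256 trained centroids for subvector s.
fn pq_encode(
    vector: &[f32; 1536],
    codebooks: &[Vec<[f32; SUB_DIM]>; SUBVECTORS],
) -> [u8; SUBVECTORS] {
    let mut code = [0u8; SUBVECTORS];
    for s in 0..SUBVECTORS {
        let sub = &vector[s * SUB_DIM..(s + 1) * SUB_DIM];
        // Pick the nearest centroid by squared L2 distance
        let mut best_dist = f32::MAX;
        for (c, centroid) in codebooks[s].iter().enumerate() {
            let d: f32 = sub
                .iter()
                .zip(centroid.iter())
                .map(|(a, b)| (a - b) * (a - b))
                .sum();
            if d < best_dist {
                best_dist = d;
                code[s] = c as u8;
            }
        }
    }
    code
}
```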

Performance Targets

| Metric | Target | Rationale |
|---|---|---|
| Search Latency | <50ms p99 @ 100K vectors | Production SLA |
| Throughput | 1000+ qps/GPU | GPU utilization >80% |
| Recall@10 | >95% | Cross-modal accuracy |
| Index Build Time | <5 min @ 1M vectors | Background indexing |
| Storage Overhead | <20% | PQ compression |

Cross-Modal Search Examples

Text → Image:

```sql
-- Find images matching a text description
SELECT image_url, product_name
FROM products
WHERE COSINE_SIMILARITY(
    image_embedding,
    EMBED_TEXT('red leather jacket')
) > 0.8;
```

Image → Text:

```sql
-- Find text descriptions similar to a query image
SELECT description,
       SIMILARITY(description_embedding, @query_image_emb) AS score
FROM product_descriptions
ORDER BY score DESC
LIMIT 10;
```

Audio → Video:

```sql
-- Find videos with similar audio
SELECT video_id, title
FROM videos
WHERE COSINE_SIMILARITY(
    audio_embedding,
    EMBED_AUDIO('query_audio.mp3')
) > 0.7;
```

Patent Opportunities

“Multi-Modal Vector Embedding System for Unified Search”

  • Claims: Unified 1536-dim embedding space for text/image/audio/video
  • Novelty: Cross-modal search with learned projection matrices
  • Value: $15M-$25M

F7.2: GraphRAG HTAP Architecture

Current State (40% Complete)

Implemented:

  • Basic graph storage (adjacency lists)
  • Simple graph traversals
  • Vector search integration (F6.6)

Remaining Work (60%):

  • Cypher query language support
  • GQL (Graph Query Language) implementation
  • Natural language to graph queries (NL2GQL)
  • LLM integration for reasoning
  • HTAP architecture (real-time graph analytics on OLTP data)
  • Distributed graph processing

Architecture Overview

```
┌───────────────────────────────────────────────────────────────────────┐
│ GraphRAG HTAP System │
├───────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Query Interface Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌─────────────────────┐ │ │
│ │ │ Cypher │ │ GQL │ │ Natural Language │ │ │
│ │ │ (Neo4j) │ │ (ISO/IEC) │ │ (LLM-powered) │ │ │
│ │ └──────────────┘ └──────────────┘ └─────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Graph Query Planner & Optimizer │ │
│ │ - Pattern matching optimization │ │
│ │ - Join reordering for graph traversals │ │
│ │ - Index selection (adjacency list vs. CSR) │ │
│ │ - Push computation to storage layer │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┴───────────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────────────────┐ ┌──────────────────────────────┐│
│ │ OLTP Execution Engine │ │ OLAP Execution Engine ││
│ │ - Real-time graph updates │ │ - PageRank, BFS, DFS ││
│ │ - ACID transactions │ │ - Community detection ││
│ │ - <10ms p99 latency │ │ - Centrality algorithms ││
│ │ - Row-oriented storage │ │ - Columnar storage ││
│ └─────────────────────────────┘ └──────────────────────────────┘│
│ │ │ │
│ └──────────────┬──────────────────────────┘ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Unified Graph Storage Layer │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ Adjacency List Storage (Row-oriented, OLTP-optimized) │ │ │
│ │ │ - Vertex: {id, properties, out_edges[], in_edges[]} │ │ │
│ │ │ - Edge: {src, dst, label, properties, weight} │ │ │
│ │ └──────────────────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ CSR (Compressed Sparse Row) Format (OLAP-optimized) │ │ │
│ │ │ - Vertex offsets array: [0, 5, 12, 20, ...] │ │ │
│ │ │ - Edge targets array: [1, 3, 5, 7, 9, 2, 4, ...] │ │ │
│ │ │ - Cache-friendly, vectorizable │ │ │
│ │ └──────────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ RAG Integration Layer │ │
│ │ - Vector search over graph nodes (F6.6 + F7.1) │ │
│ │ - Graph traversal for context gathering │ │
│ │ - LLM reasoning with graph knowledge │ │
│ │ - Explainable AI: Show reasoning path through graph │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```

Component Design

1. Graph Query Languages

Cypher Support (Neo4j compatibility):

-- Find friends of friends
MATCH (p:Person {name: 'Alice'})-[:FRIEND]->(f)-[:FRIEND]->(fof)
WHERE fof <> p
RETURN fof.name, COUNT(*) as mutual_friends
ORDER BY mutual_friends DESC
LIMIT 10;
-- Shortest path
MATCH path = shortestPath(
(a:Person {name: 'Alice'})-[:FRIEND*]-(b:Person {name: 'Bob'})
)
RETURN path;
-- PageRank (OLAP)
CALL algo.pageRank('Person', 'FRIEND')
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).name, score
ORDER BY score DESC
LIMIT 10;

GQL Support (ISO/IEC standard):

```sql
-- GQL syntax (more SQL-like)
SELECT p.name, COUNT(f) AS friend_count
FROM GRAPH social_network
MATCH (p:Person)-[:FRIEND]->(f:Person)
GROUP BY p.name
ORDER BY friend_count DESC
LIMIT 10;
```

Natural Language to Graph Queries:

```sql
-- Natural language query
SELECT * FROM GRAPH_QUERY('Find all friends of Alice who live in Seattle');

-- Internally translates to Cypher:
--   MATCH (p:Person {name: 'Alice'})-[:FRIEND]->(f:Person {city: 'Seattle'})
--   RETURN f;
```

2. HTAP Architecture

Dual Storage Format:

```rust
pub struct HybridGraphStorage {
    // OLTP: adjacency list for fast updates
    adjacency_list: AdjacencyListStore,
    // OLAP: CSR format for fast traversals
    csr_format: CSRStore,
    // Synchronization
    sync_policy: SyncPolicy,
    stale_threshold: Duration, // Rebuild CSR if stale >5 min
}

pub enum SyncPolicy {
    Immediate,          // Update CSR on every write (slow writes)
    Periodic(Duration), // Rebuild CSR every N seconds
    OnDemand,           // Rebuild CSR before OLAP query
    Adaptive,           // ML-based decision (default)
}
```

Adaptive Synchronization:

  • Monitor OLTP write rate and OLAP query frequency
  • If OLAP queries dominate, use Periodic sync (every 30s)
  • If OLTP writes dominate, use OnDemand sync
  • ML model predicts optimal sync policy based on workload
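A heuristic sketch of that decision, reusing the `SyncPolicy` enum above (the production path uses an ML model; the rate comparison and the 30s period here are assumptions for illustration):

```rust
use std::time::Duration;

fn choose_sync_policy(oltp_writes_per_sec: f64, olap_queries_per_sec: f64) -> SyncPolicy {
    if olap_queries_per_sec >= oltp_writes_per_sec {
        // Read-heavy workload: keep the CSR fresh on a schedule
        SyncPolicy::Periodic(Duration::from_secs(30))
    } else {
        // Write-heavy workload: rebuild only when an OLAP query arrives
        SyncPolicy::OnDemand
    }
}
```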

Columnar Graph Storage for OLAP:

```rust
pub struct CSRStore {
    // Vertex offsets (where each vertex's edges start)
    offsets: Vec<usize>,
    // Edge targets (compressed, cache-friendly)
    targets: Vec<u32>,
    // Edge properties (columnar for SIMD)
    weights: Vec<f32>,
    labels: Vec<u16>, // Label IDs
    // Metadata
    num_vertices: usize,
    num_edges: usize,
}

impl CSRStore {
    // Get neighbors of vertex v (cache-friendly, vectorizable)
    pub fn neighbors(&self, v: usize) -> &[u32] {
        let start = self.offsets[v];
        let end = self.offsets[v + 1];
        &self.targets[start..end]
    }
}
```
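Why this layout pays off is easiest to see in a traversal: each frontier expansion is a contiguous slice scan. A BFS sketch over `CSRStore` (assuming `num_vertices` is accessible from the caller; bounds checks and error handling omitted):

```rust
use std::collections::VecDeque;

fn bfs(csr: &CSRStore, start: usize) -> Vec<bool> {
    let mut visited = vec![false; csr.num_vertices];
    let mut queue = VecDeque::from([start]);
    visited[start] = true;
    while let Some(v) = queue.pop_front() {
        // neighbors() returns one contiguous, cache-friendly slice
        for &n in csr.neighbors(v) {
            let n = n as usize;
            if !visited[n] {
                visited[n] = true;
                queue.push_back(n);
            }
        }
    }
    visited
}
```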

3. Graph Algorithms Library

Built-in Algorithms:

```rust
pub enum GraphAlgorithm {
    // Traversal
    BFS { start: NodeId, max_depth: usize },
    DFS { start: NodeId, max_depth: usize },
    ShortestPath { src: NodeId, dst: NodeId, algorithm: PathAlgorithm },
    // Centrality
    PageRank { iterations: usize, damping: f64 },
    BetweennessCentrality { sample_size: Option<usize> },
    ClosenessCentrality,
    // Community detection
    LabelPropagation { iterations: usize },
    Louvain { resolution: f64 },
    ConnectedComponents,
    // Pattern matching
    TriangleCounting,
    MotifFinding { motif: Motif },
}

pub enum PathAlgorithm {
    Dijkstra,    // Weighted shortest path
    BellmanFord, // Negative weights allowed
    AStar { heuristic: HeuristicFn },
}
```

GPU-Accelerated Graph Analytics:

```rust
pub struct GPUGraphEngine {
    // Graph data on GPU
    gpu_csr: DeviceBuffer<u32>,
    gpu_offsets: DeviceBuffer<usize>,
    gpu_weights: DeviceBuffer<f32>,
    // Algorithms
    algorithms: HashMap<String, CudaKernel>,
}

impl GPUGraphEngine {
    // PageRank on GPU (10-100x faster than CPU)
    pub fn pagerank(&self, iterations: usize, damping: f64) -> Vec<f64> {
        let mut scores = vec![1.0 / self.num_vertices as f64; self.num_vertices];
        for _ in 0..iterations {
            // GPU kernel: parallel updates for all vertices
            self.pagerank_kernel(&mut scores, damping);
        }
        scores
    }
}
```
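The document does not show `pagerank_kernel` itself; the following CUDA sketch reconstructs one plausible push-based iteration over the CSR arrays (atomicAdd on double requires compute capability 6.0+; dangling-vertex and convergence handling omitted):

```cuda
__global__ void pagerank_push_kernel(
    const unsigned int* targets, // CSR edge targets
    const size_t* offsets,       // CSR vertex offsets (num_vertices + 1 entries)
    const double* scores_in,     // scores from the previous iteration
    double* scores_out,          // pre-initialized to (1 - damping) / V
    double damping,
    int num_vertices
) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < num_vertices) {
        size_t start = offsets[v], end = offsets[v + 1];
        size_t out_degree = end - start;
        if (out_degree == 0) return;
        // Scatter this vertex's rank to its out-neighbors
        double contrib = damping * scores_in[v] / (double)out_degree;
        for (size_t e = start; e < end; e++) {
            atomicAdd(&scores_out[targets[e]], contrib);
        }
    }
}
```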

4. RAG Integration

Vector Search Over Graph Nodes:

```sql
-- Find relevant nodes using vector similarity
WITH relevant_nodes AS (
    SELECT node_id, SIMILARITY(embedding, @query_embedding) AS score
    FROM graph_nodes
    WHERE SIMILARITY(embedding, @query_embedding) > 0.7
    ORDER BY score DESC
    LIMIT 10
)
-- Expand context using graph traversal
SELECT n.id, n.properties, path
FROM relevant_nodes rn
MATCH (rn)-[:RELATED*1..3]-(n)
RETURN n, path;
```

LLM Reasoning with Graph Context:

```rust
pub struct GraphRAGEngine {
    llm: LLMClient,
    graph: GraphStore,
    embeddings: EmbeddingModel,
}

impl GraphRAGEngine {
    pub async fn query(&self, question: &str) -> Result<Answer> {
        // 1. Generate query embedding
        let query_emb = self.embeddings.embed_text(question).await?;
        // 2. Find relevant graph nodes
        let relevant_nodes = self.graph.vector_search(&query_emb, 10)?;
        // 3. Expand context via graph traversal
        let context_subgraph = self.graph.expand_context(&relevant_nodes, 3)?;
        // 4. Serialize subgraph to LLM prompt
        let prompt = self.serialize_subgraph_to_prompt(&context_subgraph);
        // 5. Query LLM with graph context
        let answer = self.llm.generate(&prompt).await?;
        // 6. Extract reasoning path (explainability)
        let reasoning_path = self.extract_reasoning_path(&answer, &context_subgraph);
        Ok(Answer {
            text: answer,
            reasoning_path,
            confidence: self.compute_confidence(&answer, &context_subgraph),
        })
    }
}
```
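`serialize_subgraph_to_prompt` is not defined in this document; here is a minimal sketch (the `Edge` fields are assumptions) that renders the subgraph as one triple per line so the LLM sees it as plain text:

```rust
struct Edge {
    src: String,
    label: String,
    dst: String,
}

fn serialize_subgraph_to_prompt(edges: &[Edge]) -> String {
    let mut prompt = String::from("Graph context:\n");
    for e in edges {
        // e.g. "Alice -[FRIEND]-> Charlie"
        prompt.push_str(&format!("{} -[{}]-> {}\n", e.src, e.label, e.dst));
    }
    prompt.push_str("Answer using only the relationships above.\n");
    prompt
}
```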

Explainable AI: Reasoning Path Visualization:

```text
Question: "How are Alice and Bob connected?"
Answer: "Alice is connected to Bob through their mutual friend Charlie."
Reasoning Path (Graph):
Alice --[FRIEND]--> Charlie --[FRIEND]--> Bob
| |
+------[WORKS_AT]---+---> Acme Corp
Explanation:
1. Alice and Charlie work together at Acme Corp
2. Alice and Charlie are friends
3. Charlie and Bob are friends
4. Therefore, Alice and Bob are connected through Charlie
```

Performance Targets

| Metric | Target | Rationale |
|---|---|---|
| OLTP Latency | <10ms p99 | Real-time graph updates |
| OLAP Throughput | 100M+ edges/sec (GPU) | Large-scale analytics |
| Query Latency | <100ms for 3-hop traversal @ 10M nodes | Production SLA |
| Scalability | 10M+ nodes, 100M+ edges per node | Enterprise scale |
| RAG Accuracy | 95%+ on GraphQA benchmark | Quality target |

Patent Opportunities

“Hybrid Transactional Analytical Graph Database with LLM Integration”

  • Claims: HTAP graph with dual storage (adjacency list + CSR), RAG integration
  • Novelty: Real-time OLTP + OLAP on same graph data, LLM reasoning with graph context
  • Value: $20M-$35M

F7.5: GPU Acceleration Architecture

Current State (30% Complete)

Implemented:

  • Basic GPU detection (NVIDIA CUDA)
  • Vector similarity on GPU (F7.1 foundation)
  • GPU memory management

Remaining Work (70%):

  • CUDA kernel library for SQL operations
  • AMD ROCm support
  • Automatic CPU/GPU query routing
  • Multi-GPU support
  • GPU-accelerated aggregations
  • GPU-accelerated joins

Architecture Overview

```
┌───────────────────────────────────────────────────────────────────────┐
│ GPU Acceleration Layer │
├───────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Query Optimizer │ │
│ │ - Cost-based GPU/CPU routing │ │
│ │ - Data size, operation type, GPU availability │ │
│ │ - Automatic fallback to CPU if GPU unavailable │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┴───────────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────────────────┐ ┌──────────────────────────────┐│
│ │ CPU Execution Path │ │ GPU Execution Path ││
│ │ - Small datasets (<1M rows)│ │ - Large datasets (>1M rows) ││
│ │ - Complex operations │ │ - Vectorizable operations ││
│ │ - String operations │ │ - Numeric operations ││
│ └─────────────────────────────┘ └──────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌──────────────────────────────────┐ │
│ │ GPU Kernel Dispatcher │ │
│ │ - CUDA (NVIDIA) │ │
│ │ - ROCm (AMD) │ │
│ │ - Multi-GPU coordination │ │
│ └──────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────┴────────────┐ │
│ ▼ ▼ │
│ ┌───────────────────────┐ ┌──────────────────┐│
│ │ CUDA Kernels (NVIDIA)│ │ ROCm Kernels(AMD)││
│ │ - Scan/Filter │ │ - Same ops ││
│ │ - Aggregation │ │ - HIP API ││
│ │ - Join │ │ - Portable ││
│ │ - Sort │ └──────────────────┘│
│ │ - Vector ops │ │
│ └───────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```

Component Design

1. GPU Query Routing

Cost Model:

```rust
pub struct GPUCostModel {
    // Data transfer costs
    cpu_to_gpu_bandwidth: f64, // GB/s
    gpu_to_cpu_bandwidth: f64, // GB/s
    // Compute costs
    gpu_compute_factor: f64, // Speedup vs CPU (10-100x)
    cpu_compute_factor: f64, // Baseline 1.0
    // Thresholds
    min_data_size_for_gpu: usize, // 1M rows
    gpu_memory_limit: usize,      // GPU RAM (e.g., 24GB)
}

impl GPUCostModel {
    pub fn should_use_gpu(&self, op: &QueryOp) -> bool {
        // Calculate costs
        let transfer_cost = self.data_transfer_cost(op);
        let cpu_compute_cost = self.cpu_compute_cost(op);
        let gpu_compute_cost = self.gpu_compute_cost(op);
        // GPU is beneficial if:
        //   gpu_compute_cost + transfer_cost < cpu_compute_cost
        let gpu_total = gpu_compute_cost + transfer_cost;
        let cpu_total = cpu_compute_cost;
        gpu_total < cpu_total
            && op.data_size >= self.min_data_size_for_gpu
            && op.data_size <= self.gpu_memory_limit
            && self.is_vectorizable(op)
    }

    fn is_vectorizable(&self, op: &QueryOp) -> bool {
        match op.op_type {
            OpType::Scan | OpType::Filter | OpType::Aggregate |
            OpType::Join | OpType::Sort | OpType::VectorSearch => true,
            // String ops not well-suited for GPU
            OpType::StringManipulation => false,
            // Complex ops stay on CPU
            OpType::Subquery | OpType::CTE => false,
        }
    }
}
```
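To make the break-even concrete, a back-of-envelope check with assumed numbers (16 GB/s effective PCIe bandwidth, 20x GPU speedup):

```rust
/// Offloading wins only if GPU compute time plus transfer time
/// beats CPU compute time.
fn gpu_wins(data_gb: f64, cpu_ms: f64, pcie_gb_per_s: f64, gpu_speedup: f64) -> bool {
    let transfer_ms = data_gb / pcie_gb_per_s * 1000.0;
    let gpu_ms = cpu_ms / gpu_speedup;
    gpu_ms + transfer_ms < cpu_ms
}

// Example: a 1 GB scan taking 500 ms on CPU, with a 20x GPU speedup:
// transfer ≈ 62.5 ms, GPU compute ≈ 25 ms, total ≈ 87.5 ms < 500 ms → route to GPU.
```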

Automatic Data Transfer:

```rust
pub struct GPUDataManager {
    // Pinned memory for fast transfers
    pinned_buffers: Vec<PinnedBuffer>,
    // GPU memory pool
    gpu_allocator: GPUAllocator,
    // Transfer streams (async)
    transfer_streams: Vec<cudaStream_t>,
}

impl GPUDataManager {
    // Zero-copy transfer for large datasets
    pub async fn transfer_to_gpu(&self, data: &[u8]) -> Result<DeviceBuffer> {
        // Allocate pinned memory on CPU (faster DMA)
        let pinned = self.allocate_pinned(data.len())?;
        pinned.copy_from_slice(data);
        // Async transfer to GPU
        let gpu_buf = self.gpu_allocator.allocate(data.len())?;
        cudaMemcpyAsync(
            gpu_buf.ptr(),
            pinned.ptr(),
            data.len(),
            cudaMemcpyHostToDevice,
            self.transfer_streams[0],
        )?;
        Ok(gpu_buf)
    }
}
```

2. CUDA Kernel Library

Scan/Filter:

// GPU kernel for filtering rows
__global__ void filter_kernel(
const int* col_data, // Input column
const int threshold, // Filter condition
int* out_indices, // Output row indices
int* out_count, // Output count (atomic)
int num_rows
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < num_rows) {
if (col_data[idx] > threshold) {
int pos = atomicAdd(out_count, 1);
out_indices[pos] = idx;
}
}
}

Aggregation (SUM, AVG, MIN, MAX):

// GPU kernel for parallel aggregation
__global__ void sum_reduce_kernel(
const float* data,
float* partial_sums,
int num_rows
) {
__shared__ float shared_mem[256];
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Load data into shared memory
shared_mem[tid] = (idx < num_rows) ? data[idx] : 0.0f;
__syncthreads();
// Parallel reduction in shared memory
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
if (tid < stride) {
shared_mem[tid] += shared_mem[tid + stride];
}
__syncthreads();
}
// Write block result
if (tid == 0) {
partial_sums[blockIdx.x] = shared_mem[0];
}
}
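The kernel leaves one partial sum per block, so a second pass finishes the reduction. A host-side sketch of that finish (fine for a few thousand blocks; error handling omitted; illustrative, not the shipped code):

```cuda
#include <vector>

float gpu_sum(const float* d_data, int num_rows) {
    int threads = 256; // must match the kernel's shared_mem size
    int blocks = (num_rows + threads - 1) / threads;

    float* d_partials;
    cudaMalloc(&d_partials, blocks * sizeof(float));
    sum_reduce_kernel<<<blocks, threads>>>(d_data, d_partials, num_rows);

    // Copy per-block partials back and finish on the CPU
    std::vector<float> partials(blocks);
    cudaMemcpy(partials.data(), d_partials, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_partials);

    float total = 0.0f;
    for (float p : partials) total += p;
    return total;
}
```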

Join (Hash Join):

// Build hash table on GPU
__global__ void build_hash_table_kernel(
const int* build_keys,
const int* build_values,
int* hash_table_keys,
int* hash_table_values,
int num_rows,
int table_size
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < num_rows) {
int key = build_keys[idx];
int value = build_values[idx];
// Open addressing with linear probing
int hash = key % table_size;
while (atomicCAS(&hash_table_keys[hash], EMPTY, key) != EMPTY) {
hash = (hash + 1) % table_size;
}
hash_table_values[hash] = value;
}
}
// Probe hash table
__global__ void probe_hash_table_kernel(
const int* probe_keys,
const int* hash_table_keys,
const int* hash_table_values,
int* output,
int num_rows,
int table_size
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < num_rows) {
int key = probe_keys[idx];
int hash = key % table_size;
// Linear probing
while (hash_table_keys[hash] != EMPTY) {
if (hash_table_keys[hash] == key) {
output[idx] = hash_table_values[hash];
return;
}
hash = (hash + 1) % table_size;
}
output[idx] = -1; // Not found
}
}

3. AMD ROCm Support

Portable Kernel Code (HIP):

// HIP: Portable for both CUDA and ROCm
#include <hip/hip_runtime.h>
__global__ void vector_add(
const float* a,
const float* b,
float* c,
int n
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
// Compilation:
// CUDA: nvcc -x cu vector_add.cpp
// ROCm: hipcc vector_add.cpp
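For completeness, a host-side launch sketch using the standard HIP runtime API (the same source compiles for CUDA or ROCm; error checks omitted):

```cpp
#include <hip/hip_runtime.h>

void launch_vector_add(const float* h_a, const float* h_b, float* h_c, int n) {
    float *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(float);
    hipMalloc(&d_a, bytes);
    hipMalloc(&d_b, bytes);
    hipMalloc(&d_c, bytes);
    hipMemcpy(d_a, h_a, bytes, hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b, bytes, hipMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    hipLaunchKernelGGL(vector_add, dim3(blocks), dim3(threads), 0, 0,
                       d_a, d_b, d_c, n);

    hipMemcpy(h_c, d_c, bytes, hipMemcpyDeviceToHost);
    hipFree(d_a); hipFree(d_b); hipFree(d_c);
}
```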

Runtime Detection:

```rust
pub enum GPUBackend {
    CUDA { device: CudaDevice },
    ROCm { device: HipDevice },
    None,
}

pub fn detect_gpu_backend() -> GPUBackend {
    if has_cuda() {
        GPUBackend::CUDA { device: select_cuda_device() }
    } else if has_rocm() {
        GPUBackend::ROCm { device: select_hip_device() }
    } else {
        GPUBackend::None
    }
}
```

4. Multi-GPU Support

Data Partitioning:

```rust
pub struct MultiGPUExecutor {
    devices: Vec<GPUDevice>,
    partitioning_strategy: PartitioningStrategy,
}

pub enum PartitioningStrategy {
    RoundRobin,     // Distribute rows evenly
    HashPartition,  // Hash on key
    RangePartition, // Range on key
    Replicate,      // Replicate data to all GPUs
}

impl MultiGPUExecutor {
    pub async fn execute_scan(&self, data: &[u8], filter: FilterFn) -> Result<Vec<Row>> {
        let num_gpus = self.devices.len();
        let chunk_size = data.len() / num_gpus;
        // Partition data across GPUs
        let mut tasks = Vec::new();
        for (i, device) in self.devices.iter().enumerate() {
            let start = i * chunk_size;
            let end = if i == num_gpus - 1 { data.len() } else { (i + 1) * chunk_size };
            let chunk = &data[start..end];
            // Async execution on each GPU
            tasks.push(device.execute_scan_async(chunk, filter.clone()));
        }
        // Wait for all GPUs
        let results = futures::future::join_all(tasks).await;
        // Merge results
        Ok(results.into_iter().flatten().collect())
    }
}
```

Performance Targets

| Operation | CPU (Baseline) | GPU (CUDA) | Speedup |
|---|---|---|---|
| Scan/Filter | 100M rows/sec | 1-5B rows/sec | 10-50x |
| Aggregation | 50M rows/sec | 500M-2B rows/sec | 10-40x |
| Join | 10M rows/sec | 100M-500M rows/sec | 10-50x |
| Sort | 20M rows/sec | 200M-1B rows/sec | 10-50x |
| Vector Search | 10K qps | 100K-500K qps | 10-50x |

Patent Opportunities

“Automatic GPU Acceleration for Database Queries”

  • Claims: Cost-based CPU/GPU routing, automatic data transfer, multi-GPU coordination
  • Novelty: Zero-config GPU acceleration for SQL queries
  • Value: $20M-$30M

[Document continues with F7.6, F7.9, F7.12, and remaining sections…]

This document is split into multiple parts; Part 1 covers the first three major features (F7.1, F7.2, F7.5).

Sections covered in subsequent parts:

  • F7.6: Advanced Webhooks Architecture
  • F7.9: AI Schema Architect Architecture
  • F7.12: Unified Observability Architecture
  • Protocol Completion Architecture (Oracle, PostgreSQL, WASM)
  • Integration Points
  • Performance Targets
  • Security Architecture
  • Scalability Design

Total estimated length: 150+ pages


End of Part 1

Document Path: /home/claude/HeliosDB/docs/architecture/v7.0/PHASE2_ARCHITECTURE.md
Status: Part 1 Complete (F7.1, F7.2, F7.5)
Next: Create remaining sections or split into multiple files