HeliosDB Phase 2 Architecture: v7.0 Tier 2 Features & Advanced ML/AI
Document Version: 1.0 Created: November 17, 2025 Author: Architect Worker 5 Status: STRATEGIC ARCHITECTURE
Executive Summary
This document provides the comprehensive architectural design for HeliosDB Phase 2, covering v7.0 Tier 2 innovations and advanced ML/AI capabilities. Phase 2 builds upon Phase 1’s production-ready foundation (95%+ complete with 3 features and 4 protocols) to deliver world-first innovations in multimodal AI, GPU acceleration, and autonomous database intelligence.
Phase 2 Objectives
| Metric | Target | Impact |
|---|---|---|
| Duration | 3 months | Months 2-4 of v7.0 roadmap |
| Investment | $2.5M-$3.5M | Justified by $230M-$260M ARR |
| Features | 9 innovations | 6 Tier 2 + 3 advanced ML/AI |
| Team | 8 agents + specialists | ML engineers, GPU experts, architects |
| ARR Impact | $230M-$260M | Additional revenue on top of Phase 1 |
Key Innovations
v7.0 Tier 2 Features (40-60% → 100%):
1. F7.1: Multimodal Vector Search (60% → 100%) - $40M ARR
2. F7.6: Advanced Webhooks (50% → 100%) - $25M ARR
3. F7.12: Unified Observability (40% → 100%) - $35M ARR

Advanced ML/AI Features (30-50% → 100%):
4. F7.2: GraphRAG HTAP (40% → 100%) - $50M ARR
5. F7.5: GPU Acceleration (30% → 100%) - $55M ARR
6. F7.9: AI Schema Architect (50% → 100%) - $40M ARR

v5.x Core Completion:
7. Oracle 23ai Compatibility (0% → 100%) - Critical protocol
8. PostgreSQL 17 Compatibility (0% → 100%) - LISTEN/NOTIFY complete
9. WASM Runtime (40% → 100%) - JavaScript + Python runtimes
Table of Contents
- Architecture Principles
- F7.1: Multimodal Vector Search Architecture
- F7.2: GraphRAG HTAP Architecture
- F7.5: GPU Acceleration Architecture
- F7.6: Advanced Webhooks Architecture
- F7.9: AI Schema Architect Architecture
- F7.12: Unified Observability Architecture
- Protocol Completion Architecture
- Integration Points
- Performance Targets
- Security Architecture
- Scalability Design
Architecture Principles
Core Design Tenets
1. GPU-First Architecture
- GPU acceleration as first-class citizen, not afterthought
- Automatic CPU/GPU routing based on workload characteristics
- Zero-copy data paths between CPU and GPU memory
- Support for NVIDIA CUDA and AMD ROCm
2. AI-Native by Design
- Embeddings, ML training, and inference built into query engine
- Vector operations optimized at storage layer
- Natural language as first-class interface
- LLM integration without external services
3. Convergence Architecture
- OLTP + OLAP + Vector + Graph in single engine
- Shared storage layer, unified transaction model
- Cross-workload optimizations (e.g., OLTP feeds OLAP materialized views)
- Protocol-agnostic feature access
4. Production-First
- 99.99%+ availability targets
- <50ms p99 latency for OLTP
- 10-100x OLAP speedups with GPU
- Zero-downtime upgrades and migrations
5. Developer Experience
- Zero-config defaults for 90% use cases
- Natural language schema design
- Automatic observability and tracing
- CLI-first tooling
F7.1: Multimodal Vector Search Architecture
Current State (60% Complete)
Implemented:
- Basic vector search with HNSW indexes
- Text embeddings via OpenAI/Cohere APIs
- Similarity search with L2/cosine distance
- Vector index persistence
Remaining Work (40%):
- Image/audio/video embedding support
- Cross-modal search (text→image, image→text, etc.)
- GPU-accelerated search
- Batch embedding generation
- Multi-modal index optimization
Architecture Overview
The search stack is organized into four layers:

- Query Interface: natural language queries, SQL vector expressions, API calls
- Embedding Generation Layer: CLIP (text + image), AudioCLIP (audio), VideoCLIP (video), all projected into a unified 1536-dim space
- Unified Vector Space (1536 dims): text embeddings, image embeddings (CLIP projection), audio embeddings (AudioCLIP projection), video embeddings (VideoCLIP frame aggregation)
- GPU-Accelerated Search Layer: HNSW index (M=32, ef=64), IVF index (nlist=4096), and flat (brute-force) GPU-accelerated index, with automatic index selection based on data size, GPU batching for parallel similarity computation, and <50ms search latency for 100K vectors

Component Design
1. Embedding Generation Layer
Multi-Model Support:
```rust
pub enum EmbeddingModel {
    // Text + Image
    CLIP {
        model: String, // "openai/clip-vit-large-patch14"
        dim: usize,    // 1536 (projected to unified space)
    },

    // Audio
    AudioCLIP {
        model: String,    // "microsoft/audioclip"
        dim: usize,       // 1536
        sample_rate: u32, // 16000 Hz
    },

    // Video (frame-based)
    VideoCLIP {
        model: String,                 // "openai/clip-vit-large-patch14"
        dim: usize,                    // 1536
        fps: u32,                      // Frames per second to sample
        aggregation: VideoAggregation, // Mean, Max, Attention
    },

    // Local ONNX models for offline inference
    LocalONNX {
        path: PathBuf,
        dim: usize,
    },
}

pub enum VideoAggregation {
    Mean,      // Average frame embeddings
    Max,       // Max pooling across frames
    Attention, // Attention-weighted average
}
```

Unified Embedding Space:
- All modalities projected to 1536-dimensional space
- Learned projection matrices for cross-modal alignment
- Training data: LAION-5B, Conceptual Captions, AudioSet
- Cosine similarity normalized to [0, 1]
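The projection and normalization steps can be made concrete with a minimal sketch, assuming a learned projection matrix as described above; the helper names (`project_to_unified`, `normalized_similarity`) are illustrative only, not HeliosDB APIs.

```rust
/// Minimal sketch: project a modality-specific embedding into the unified
/// 1536-dim space via a learned matrix, L2-normalize it, and rescale cosine
/// similarity from [-1, 1] to [0, 1]. Names are illustrative only.
fn project_to_unified(raw: &[f32], projection: &[Vec<f32>]) -> Vec<f32> {
    // `projection` has 1536 rows, each of length raw.len()
    let mut out: Vec<f32> = projection
        .iter()
        .map(|row| row.iter().zip(raw).map(|(w, x)| w * x).sum::<f32>())
        .collect();

    // L2-normalize so that a plain dot product equals cosine similarity
    let norm = out.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
    out.iter_mut().for_each(|x| *x /= norm);
    out
}

/// Cosine similarity between two L2-normalized vectors, rescaled to [0, 1].
fn normalized_similarity(a: &[f32], b: &[f32]) -> f32 {
    let cos: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    (cos + 1.0) / 2.0
}
```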
Example Usage:
```sql
-- Generate embeddings from different modalities
INSERT INTO products (name, image_embedding, description_embedding)
VALUES (
    'Red Shoes',
    EMBED_IMAGE('s3://images/red_shoes.jpg', 'clip'),
    EMBED_TEXT('Comfortable red running shoes', 'clip')
);

-- Cross-modal search: Find images matching text query
SELECT name, image_url, SIMILARITY(image_embedding, EMBED_TEXT('red sneakers'))
FROM products
ORDER BY SIMILARITY(image_embedding, EMBED_TEXT('red sneakers')) DESC
LIMIT 10;

-- Cross-modal search: Find products similar to query image
SELECT name, SIMILARITY(image_embedding, EMBED_IMAGE('query.jpg'))
FROM products
ORDER BY SIMILARITY DESC
LIMIT 10;
```

2. GPU-Accelerated Search
Automatic GPU Routing:
```rust
pub struct VectorSearchRouter {
    gpu_threshold: usize, // Use GPU if >10K vectors
    batch_size: usize,    // Batch size for GPU (1000)
    cpu_fallback: bool,   // Fall back to CPU if GPU unavailable
}

impl VectorSearchRouter {
    pub fn route_search(&self, query: &[f32], vectors: &[Vec<f32>], k: usize) -> SearchBackend {
        if vectors.len() > self.gpu_threshold && has_gpu() {
            SearchBackend::GPU {
                device: select_gpu(),
                batch_size: self.batch_size,
            }
        } else {
            SearchBackend::CPU {
                threads: num_cpus(),
            }
        }
    }
}
```

CUDA Kernel for Similarity Search:
```cuda
// Batch cosine similarity on GPU
__global__ void batch_cosine_similarity(
    const float* queries,  // [batch_size, dim]
    const float* vectors,  // [num_vectors, dim]
    float* results,        // [batch_size, num_vectors]
    int batch_size,
    int num_vectors,
    int dim
) {
    int query_idx = blockIdx.y;
    int vector_idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (query_idx < batch_size && vector_idx < num_vectors) {
        // Compute dot product and norms
        float dot = 0.0f;
        float norm_q = 0.0f;
        float norm_v = 0.0f;

        for (int i = 0; i < dim; i++) {
            float q = queries[query_idx * dim + i];
            float v = vectors[vector_idx * dim + i];
            dot += q * v;
            norm_q += q * q;
            norm_v += v * v;
        }

        // Cosine similarity
        results[query_idx * num_vectors + vector_idx] = dot / (sqrtf(norm_q) * sqrtf(norm_v));
    }
}
```

Performance Characteristics:
- CPU (SIMD): 10K vec/sec @ 1536 dims
- GPU (CUDA): 100K-500K vec/sec @ 1536 dims (10-50x speedup)
- Latency: <50ms p99 for 100K vectors
- Throughput: 1000+ queries/sec on single GPU
3. Index Structures
HNSW (Hierarchical Navigable Small World):
```rust
pub struct HNSWIndex {
    // Index parameters
    m: usize,               // Number of connections per node (32)
    ef_construction: usize, // Search width during construction (64)
    ef_search: usize,       // Search width during query (64)

    // Graph layers
    layers: Vec<HNSWLayer>,

    // Entry point (top layer)
    entry_point: NodeId,

    // Vector storage
    vectors: Arc<VectorStorage>,
}

pub struct HNSWLayer {
    level: usize,
    nodes: HashMap<NodeId, HNSWNode>,
}

pub struct HNSWNode {
    id: NodeId,
    neighbors: Vec<NodeId>,
    vector_ref: VectorRef,
}
```

Index Selection Strategy:
- < 10K vectors: Flat (brute force) on GPU
- 10K - 1M vectors: HNSW with M=32, ef=64
- > 1M vectors: IVF + HNSW hybrid (IVF for coarse search, HNSW for refinement)
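A minimal sketch of this selection policy follows; the thresholds mirror the list above, while the type and function names are illustrative rather than the shipped planner API.

```rust
/// Illustrative only: choose an index type from collection size, mirroring
/// the thresholds listed above. Not the actual HeliosDB planner API.
enum IndexKind {
    FlatGpu,                      // brute force, GPU-accelerated
    Hnsw { m: usize, ef: usize }, // graph-based index
    IvfHnsw { nlist: usize },     // coarse IVF + HNSW refinement
}

fn select_index(num_vectors: usize) -> IndexKind {
    match num_vectors {
        n if n < 10_000 => IndexKind::FlatGpu,
        n if n <= 1_000_000 => IndexKind::Hnsw { m: 32, ef: 64 },
        _ => IndexKind::IvfHnsw { nlist: 4096 },
    }
}
```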
Data Flow
Embedding Generation Pipeline:
```
Input (Image/Audio/Video)
        ↓
Preprocessing (Resize, Normalize, Sample)
        ↓
Model Inference (CLIP/AudioCLIP/VideoCLIP)
        ↓
Projection to Unified Space (1536 dims)
        ↓
Normalization (L2 normalize)
        ↓
Storage (Vector Column)
```

Query Execution Pipeline:
```
Query (Text/Image/Audio/Video)
        ↓
Generate Query Embedding (same models)
        ↓
Route to CPU or GPU
        ↓
Index Lookup (HNSW/IVF/Flat)
        ↓
Top-K Selection
        ↓
Result Fetching
        ↓
Response
```

Storage Layer Integration
Vector Column Type:
```rust
pub enum DataType {
    // ... existing types ...
    Vector {
        dim: usize,
        dtype: VectorDType, // F32, F16, I8 (quantized)
    },
    MultiModalVector {
        dim: usize,
        modality: Modality, // Text, Image, Audio, Video
    },
}

pub enum VectorDType {
    Float32, // Full precision (4 bytes/dim)
    Float16, // Half precision (2 bytes/dim, 50% space savings)
    Int8,    // Quantized (1 byte/dim, 75% space savings)
}
```

Product Quantization for Compression:
- Split 1536-dim vector into 48 subvectors of 32 dims each
- Build codebook of 256 centroids per subvector
- Store vector as 48 bytes (48 × 1 byte index) instead of 6144 bytes (1536 × 4 bytes)
- 96.9% compression ratio with <2% accuracy loss
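As a quick sanity check of the sizes quoted above, here is a minimal sketch of the encoded-size arithmetic; the function name is illustrative and not part of the storage API.

```rust
/// Illustrative arithmetic for product-quantizing a 1536-dim float32 vector:
/// 48 subvectors of 32 dims, each replaced by a 1-byte centroid index.
fn pq_encoded_bytes(dim: usize, subvector_dim: usize) -> (usize, usize) {
    let raw_bytes = dim * std::mem::size_of::<f32>(); // 1536 * 4 = 6144 bytes
    let num_subvectors = dim / subvector_dim;         // 1536 / 32 = 48
    let pq_bytes = num_subvectors;                    // one u8 code per subvector = 48 bytes
    (raw_bytes, pq_bytes)
}

fn main() {
    let (raw, pq) = pq_encoded_bytes(1536, 32);
    println!("float32: {raw} bytes, PQ-encoded: {pq} bytes");
}
```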
Performance Targets
| Metric | Target | Rationale |
|---|---|---|
| Search Latency | <50ms p99 @ 100K vectors | Production SLA |
| Throughput | 1000+ qps/GPU | GPU utilization >80% |
| Recall@10 | >95% | Cross-modal accuracy |
| Index Build Time | <5 min @ 1M vectors | Background indexing |
| Storage Overhead | <20% | PQ compression |
Cross-Modal Search Examples
Text → Image:
```sql
-- Find images matching text description
SELECT image_url, product_name
FROM products
WHERE COSINE_SIMILARITY(
    image_embedding,
    EMBED_TEXT('red leather jacket')
) > 0.8;
```

Image → Text:
```sql
-- Find text descriptions similar to query image
SELECT description, SIMILARITY(description_embedding, @query_image_emb)
FROM product_descriptions
ORDER BY SIMILARITY DESC
LIMIT 10;
```

Audio → Video:
```sql
-- Find videos with similar audio
SELECT video_id, title
FROM videos
WHERE COSINE_SIMILARITY(
    audio_embedding,
    EMBED_AUDIO('query_audio.mp3')
) > 0.7;
```

Patent Opportunities
“Multi-Modal Vector Embedding System for Unified Search”
- Claims: Unified 1536-dim embedding space for text/image/audio/video
- Novelty: Cross-modal search with learned projection matrices
- Value: $15M-$25M
F7.2: GraphRAG HTAP Architecture
Current State (40% Complete)
Implemented:
- Basic graph storage (adjacency lists)
- Simple graph traversals
- Vector search integration (F6.6)
Remaining Work (60%):
- Cypher query language support
- GQL (Graph Query Language) implementation
- Natural language to graph queries (NL2GQL)
- LLM integration for reasoning
- HTAP architecture (real-time graph analytics on OLTP data)
- Distributed graph processing
Architecture Overview
The GraphRAG HTAP system is layered as follows:

- Query Interface Layer: Cypher (Neo4j), GQL (ISO/IEC), natural language (LLM-powered)
- Graph Query Planner & Optimizer: pattern-matching optimization, join reordering for graph traversals, index selection (adjacency list vs. CSR), pushing computation to the storage layer
- Dual execution engines:
  - OLTP Execution Engine: real-time graph updates, ACID transactions, <10ms p99 latency, row-oriented storage
  - OLAP Execution Engine: PageRank, BFS, DFS, community detection, centrality algorithms, columnar storage
- Unified Graph Storage Layer:
  - Adjacency list storage (row-oriented, OLTP-optimized): Vertex {id, properties, out_edges[], in_edges[]}; Edge {src, dst, label, properties, weight}
  - CSR (Compressed Sparse Row) format (OLAP-optimized): vertex offsets array (e.g., [0, 5, 12, 20, ...]) and edge targets array (e.g., [1, 3, 5, 7, 9, 2, 4, ...]); cache-friendly and vectorizable
- RAG Integration Layer: vector search over graph nodes (F6.6 + F7.1), graph traversal for context gathering, LLM reasoning with graph knowledge, explainable AI (show the reasoning path through the graph)

Component Design
1. Graph Query Languages
Cypher Support (Neo4j compatibility):
```cypher
// Find friends of friends
MATCH (p:Person {name: 'Alice'})-[:FRIEND]->(f)-[:FRIEND]->(fof)
WHERE fof <> p
RETURN fof.name, COUNT(*) as mutual_friends
ORDER BY mutual_friends DESC
LIMIT 10;

// Shortest path
MATCH path = shortestPath(
    (a:Person {name: 'Alice'})-[:FRIEND*]-(b:Person {name: 'Bob'})
)
RETURN path;

// PageRank (OLAP)
CALL algo.pageRank('Person', 'FRIEND')
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).name, score
ORDER BY score DESC
LIMIT 10;
```

GQL Support (ISO/IEC standard):
```sql
-- GQL syntax (more SQL-like)
SELECT p.name, COUNT(f) AS friend_count
FROM GRAPH social_network
MATCH (p:Person)-[:FRIEND]->(f:Person)
GROUP BY p.name
ORDER BY friend_count DESC
LIMIT 10;
```

Natural Language to Graph Queries:
```sql
-- Natural language query
SELECT * FROM GRAPH_QUERY('Find all friends of Alice who live in Seattle');

-- Internally translates to Cypher:
-- MATCH (p:Person {name: 'Alice'})-[:FRIEND]->(f:Person {city: 'Seattle'})
-- RETURN f;
```

2. HTAP Architecture
Dual Storage Format:
```rust
pub struct HybridGraphStorage {
    // OLTP: Adjacency list for fast updates
    adjacency_list: AdjacencyListStore,

    // OLAP: CSR format for fast traversals
    csr_format: CSRStore,

    // Synchronization
    sync_policy: SyncPolicy,
    stale_threshold: Duration, // Rebuild CSR if stale >5 mins
}

pub enum SyncPolicy {
    Immediate,          // Update CSR on every write (slow writes)
    Periodic(Duration), // Rebuild CSR every N seconds
    OnDemand,           // Rebuild CSR before OLAP query
    Adaptive,           // ML-based decision (default)
}
```

Adaptive Synchronization:
- Monitor OLTP write rate and OLAP query frequency
- If OLAP queries dominate, use Periodic sync (every 30s)
- If OLTP writes dominate, use OnDemand sync
- ML model predicts optimal sync policy based on workload
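As an illustration of the Adaptive policy, the following is a minimal heuristic sketch; the thresholds and names are assumptions standing in for the ML-based decision, not the shipped model.

```rust
use std::time::Duration;

/// Illustrative heuristic standing in for the ML-based Adaptive policy:
/// choose a CSR sync strategy from observed write and analytical query rates.
enum EffectiveSync {
    Periodic(Duration), // OLAP-heavy: keep the CSR fresh in the background
    OnDemand,           // write-heavy: rebuild the CSR only before an OLAP query
}

fn choose_sync(writes_per_sec: f64, olap_queries_per_sec: f64) -> EffectiveSync {
    // OLAP-dominated workloads favor a periodically refreshed CSR; write-heavy
    // workloads rebuild it lazily, just before an analytical query runs.
    if olap_queries_per_sec >= writes_per_sec {
        EffectiveSync::Periodic(Duration::from_secs(30))
    } else {
        EffectiveSync::OnDemand
    }
}
```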
Columnar Graph Storage for OLAP:
```rust
pub struct CSRStore {
    // Vertex offsets (where each vertex's edges start)
    offsets: Vec<usize>,

    // Edge targets (compressed, cache-friendly)
    targets: Vec<u32>,

    // Edge properties (columnar for SIMD)
    weights: Vec<f32>,
    labels: Vec<u16>, // Label IDs

    // Metadata
    num_vertices: usize,
    num_edges: usize,
}

impl CSRStore {
    // Get neighbors of vertex v (cache-friendly, vectorizable)
    pub fn neighbors(&self, v: usize) -> &[u32] {
        let start = self.offsets[v];
        let end = self.offsets[v + 1];
        &self.targets[start..end]
    }
}
```

3. Graph Algorithms Library
Built-in Algorithms:
```rust
pub enum GraphAlgorithm {
    // Traversal
    BFS { start: NodeId, max_depth: usize },
    DFS { start: NodeId, max_depth: usize },
    ShortestPath { src: NodeId, dst: NodeId, algorithm: PathAlgorithm },

    // Centrality
    PageRank { iterations: usize, damping: f64 },
    BetweennessCentrality { sample_size: Option<usize> },
    ClosenessCentrality,

    // Community Detection
    LabelPropagation { iterations: usize },
    Louvain { resolution: f64 },
    ConnectedComponents,

    // Pattern Matching
    TriangleCounting,
    MotifFinding { motif: Motif },
}

pub enum PathAlgorithm {
    Dijkstra,    // Weighted shortest path
    BellmanFord, // Negative weights allowed
    AStar { heuristic: HeuristicFn },
}
```

GPU-Accelerated Graph Analytics:
```rust
pub struct GPUGraphEngine {
    // Graph data on GPU
    gpu_csr: DeviceBuffer<u32>,
    gpu_offsets: DeviceBuffer<usize>,
    gpu_weights: DeviceBuffer<f32>,

    // Algorithms
    algorithms: HashMap<String, CudaKernel>,
}

impl GPUGraphEngine {
    // PageRank on GPU (10-100x faster than CPU)
    pub fn pagerank(&self, iterations: usize, damping: f64) -> Vec<f64> {
        let mut scores = vec![1.0 / self.num_vertices as f64; self.num_vertices];

        for _ in 0..iterations {
            // GPU kernel: parallel updates for all vertices
            self.pagerank_kernel(&mut scores, damping);
        }

        scores
    }
}
```

4. RAG Integration
Vector Search Over Graph Nodes:
```sql
-- Find relevant nodes using vector similarity
WITH relevant_nodes AS (
    SELECT node_id, SIMILARITY(embedding, @query_embedding) AS score
    FROM graph_nodes
    WHERE SIMILARITY(embedding, @query_embedding) > 0.7
    ORDER BY score DESC
    LIMIT 10
)
-- Expand context using graph traversal
SELECT n.id, n.properties, path
FROM relevant_nodes rn
MATCH (rn)-[:RELATED*1..3]-(n)
RETURN n, path;
```

LLM Reasoning with Graph Context:
```rust
pub struct GraphRAGEngine {
    llm: LLMClient,
    graph: GraphStore,
    embeddings: EmbeddingModel,
}

impl GraphRAGEngine {
    pub async fn query(&self, question: &str) -> Result<Answer> {
        // 1. Generate query embedding
        let query_emb = self.embeddings.embed_text(question).await?;

        // 2. Find relevant graph nodes
        let relevant_nodes = self.graph.vector_search(&query_emb, 10)?;

        // 3. Expand context via graph traversal
        let context_subgraph = self.graph.expand_context(&relevant_nodes, 3)?;

        // 4. Serialize subgraph to LLM prompt
        let prompt = self.serialize_subgraph_to_prompt(&context_subgraph);

        // 5. Query LLM with graph context
        let answer = self.llm.generate(&prompt).await?;

        // 6. Extract reasoning path (explainability)
        let reasoning_path = self.extract_reasoning_path(&answer, &context_subgraph);

        Ok(Answer {
            text: answer,
            reasoning_path,
            confidence: self.compute_confidence(&answer, &context_subgraph),
        })
    }
}
```

Explainable AI: Reasoning Path Visualization:
Question: "How are Alice and Bob connected?"
Answer: "Alice is connected to Bob through their mutual friend Charlie."
Reasoning Path (Graph):

```
Alice --[FRIEND]--> Charlie --[FRIEND]--> Bob
  |                    |
  +-----[WORKS_AT]-----+---> Acme Corp
```

Explanation:
1. Alice and Charlie work together at Acme Corp
2. Alice and Charlie are friends
3. Charlie and Bob are friends
4. Therefore, Alice and Bob are connected through Charlie

Performance Targets
| Metric | Target | Rationale |
|---|---|---|
| OLTP Latency | <10ms p99 | Real-time graph updates |
| OLAP Throughput | 100M+ edges/sec (GPU) | Large-scale analytics |
| Query Latency | <100ms for 3-hop traversal @ 10M nodes | Production SLA |
| Scalability | 10M+ nodes, 100M+ edges per node | Enterprise scale |
| RAG Accuracy | 95%+ on GraphQA benchmark | Quality target |
Patent Opportunities
“Hybrid Transactional Analytical Graph Database with LLM Integration”
- Claims: HTAP graph with dual storage (adjacency list + CSR), RAG integration
- Novelty: Real-time OLTP + OLAP on same graph data, LLM reasoning with graph context
- Value: $20M-$35M
F7.5: GPU Acceleration Architecture
Current State (30% Complete)
Implemented:
- Basic GPU detection (NVIDIA CUDA)
- Vector similarity on GPU (F7.1 foundation)
- GPU memory management
Remaining Work (70%):
- CUDA kernel library for SQL operations
- AMD ROCm support
- Automatic CPU/GPU query routing
- Multi-GPU support
- GPU-accelerated aggregations
- GPU-accelerated joins
Architecture Overview
The acceleration layer is organized as follows:

- Query Optimizer: cost-based GPU/CPU routing driven by data size, operation type, and GPU availability, with automatic fallback to CPU if no GPU is available
- Dual execution paths:
  - CPU Execution Path: small datasets (<1M rows), complex operations, string operations
  - GPU Execution Path: large datasets (>1M rows), vectorizable operations, numeric operations
- GPU Kernel Dispatcher: CUDA (NVIDIA), ROCm (AMD), multi-GPU coordination
- Kernel libraries:
  - CUDA Kernels (NVIDIA): scan/filter, aggregation, join, sort, vector ops
  - ROCm Kernels (AMD): same operations via the HIP API, portable

Component Design
1. GPU Query Routing
Cost Model:
```rust
pub struct GPUCostModel {
    // Data transfer costs
    cpu_to_gpu_bandwidth: f64, // GB/s
    gpu_to_cpu_bandwidth: f64, // GB/s

    // Compute costs
    gpu_compute_factor: f64, // Speedup vs CPU (10-100x)
    cpu_compute_factor: f64, // Baseline 1.0

    // Thresholds
    min_data_size_for_gpu: usize, // 1M rows
    gpu_memory_limit: usize,      // GPU RAM (e.g., 24GB)
}

impl GPUCostModel {
    pub fn should_use_gpu(&self, op: &QueryOp) -> bool {
        // Calculate costs
        let transfer_cost = self.data_transfer_cost(op);
        let cpu_compute_cost = self.cpu_compute_cost(op);
        let gpu_compute_cost = self.gpu_compute_cost(op);

        // GPU is beneficial if:
        // gpu_compute_cost + transfer_cost < cpu_compute_cost
        let gpu_total = gpu_compute_cost + transfer_cost;
        let cpu_total = cpu_compute_cost;

        gpu_total < cpu_total
            && op.data_size >= self.min_data_size_for_gpu
            && op.data_size <= self.gpu_memory_limit
            && self.is_vectorizable(op)
    }

    fn is_vectorizable(&self, op: &QueryOp) -> bool {
        match op.op_type {
            OpType::Scan
            | OpType::Filter
            | OpType::Aggregate
            | OpType::Join
            | OpType::Sort
            | OpType::VectorSearch => true,

            // String ops not well-suited for GPU
            OpType::StringManipulation => false,

            // Complex ops stay on CPU
            OpType::Subquery | OpType::CTE => false,
        }
    }
}
```

Automatic Data Transfer:
```rust
pub struct GPUDataManager {
    // Pinned memory for fast transfers
    pinned_buffers: Vec<PinnedBuffer>,

    // GPU memory pool
    gpu_allocator: GPUAllocator,

    // Transfer streams (async)
    transfer_streams: Vec<cudaStream_t>,
}

impl GPUDataManager {
    // Zero-copy transfer for large datasets
    pub async fn transfer_to_gpu(&self, data: &[u8]) -> Result<DeviceBuffer> {
        // Allocate pinned memory on CPU (faster DMA)
        let pinned = self.allocate_pinned(data.len())?;
        pinned.copy_from_slice(data);

        // Async transfer to GPU
        let gpu_buf = self.gpu_allocator.allocate(data.len())?;
        cudaMemcpyAsync(
            gpu_buf.ptr(),
            pinned.ptr(),
            data.len(),
            cudaMemcpyHostToDevice,
            self.transfer_streams[0],
        )?;

        Ok(gpu_buf)
    }
}
```

2. CUDA Kernel Library
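The kernels below are launched with a one-dimensional grid sized to the row count. As a hedged host-side sketch for the Scan/Filter kernel that follows (the 256-thread block size and wrapper name are assumptions, not tuned values or HeliosDB APIs):

```cuda
// Forward declaration of the filter kernel defined below.
__global__ void filter_kernel(const int*, const int, int*, int*, int);

// Host-side launch sketch. Block size of 256 threads is an assumption.
void launch_filter(const int* d_col, int threshold,
                   int* d_out_indices, int* d_out_count, int num_rows) {
    const int threads_per_block = 256;
    const int blocks = (num_rows + threads_per_block - 1) / threads_per_block;

    cudaMemset(d_out_count, 0, sizeof(int)); // reset the atomic counter
    filter_kernel<<<blocks, threads_per_block>>>(
        d_col, threshold, d_out_indices, d_out_count, num_rows);
    cudaDeviceSynchronize(); // wait for completion before reading results
}
```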
Scan/Filter:
```cuda
// GPU kernel for filtering rows
__global__ void filter_kernel(
    const int* col_data,   // Input column
    const int threshold,   // Filter condition
    int* out_indices,      // Output row indices
    int* out_count,        // Output count (atomic)
    int num_rows
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < num_rows) {
        if (col_data[idx] > threshold) {
            int pos = atomicAdd(out_count, 1);
            out_indices[pos] = idx;
        }
    }
}
```

Aggregation (SUM, AVG, MIN, MAX):
```cuda
// GPU kernel for parallel aggregation
__global__ void sum_reduce_kernel(
    const float* data,
    float* partial_sums,
    int num_rows
) {
    __shared__ float shared_mem[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Load data into shared memory
    shared_mem[tid] = (idx < num_rows) ? data[idx] : 0.0f;
    __syncthreads();

    // Parallel reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            shared_mem[tid] += shared_mem[tid + stride];
        }
        __syncthreads();
    }

    // Write block result
    if (tid == 0) {
        partial_sums[blockIdx.x] = shared_mem[0];
    }
}
```

Join (Hash Join):
```cuda
// Build hash table on GPU
__global__ void build_hash_table_kernel(
    const int* build_keys,
    const int* build_values,
    int* hash_table_keys,
    int* hash_table_values,
    int num_rows,
    int table_size
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < num_rows) {
        int key = build_keys[idx];
        int value = build_values[idx];

        // Open addressing with linear probing
        int hash = key % table_size;
        while (atomicCAS(&hash_table_keys[hash], EMPTY, key) != EMPTY) {
            hash = (hash + 1) % table_size;
        }
        hash_table_values[hash] = value;
    }
}

// Probe hash table
__global__ void probe_hash_table_kernel(
    const int* probe_keys,
    const int* hash_table_keys,
    const int* hash_table_values,
    int* output,
    int num_rows,
    int table_size
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < num_rows) {
        int key = probe_keys[idx];
        int hash = key % table_size;

        // Linear probing
        while (hash_table_keys[hash] != EMPTY) {
            if (hash_table_keys[hash] == key) {
                output[idx] = hash_table_values[hash];
                return;
            }
            hash = (hash + 1) % table_size;
        }

        output[idx] = -1; // Not found
    }
}
```

3. AMD ROCm Support
Portable Kernel Code (HIP):
```cpp
// HIP: Portable for both CUDA and ROCm
#include <hip/hip_runtime.h>

__global__ void vector_add(
    const float* a,
    const float* b,
    float* c,
    int n
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Compilation:
// CUDA: nvcc -x cu vector_add.cpp
// ROCm: hipcc vector_add.cpp
```

Runtime Detection:
```rust
pub enum GPUBackend {
    CUDA { device: CudaDevice },
    ROCm { device: HipDevice },
    None,
}

pub fn detect_gpu_backend() -> GPUBackend {
    if has_cuda() {
        GPUBackend::CUDA { device: select_cuda_device() }
    } else if has_rocm() {
        GPUBackend::ROCm { device: select_hip_device() }
    } else {
        GPUBackend::None
    }
}
```

4. Multi-GPU Support
Data Partitioning:
```rust
pub struct MultiGPUExecutor {
    devices: Vec<GPUDevice>,
    partitioning_strategy: PartitioningStrategy,
}

pub enum PartitioningStrategy {
    RoundRobin,     // Distribute rows evenly
    HashPartition,  // Hash on key
    RangePartition, // Range on key
    Replicate,      // Replicate data to all GPUs
}

impl MultiGPUExecutor {
    pub async fn execute_scan(&self, data: &[u8], filter: FilterFn) -> Result<Vec<Row>> {
        let num_gpus = self.devices.len();
        let chunk_size = data.len() / num_gpus;

        // Partition data across GPUs
        let mut tasks = Vec::new();
        for (i, device) in self.devices.iter().enumerate() {
            let start = i * chunk_size;
            let end = if i == num_gpus - 1 { data.len() } else { (i + 1) * chunk_size };
            let chunk = &data[start..end];

            // Async execution on each GPU
            tasks.push(device.execute_scan_async(chunk, filter.clone()));
        }

        // Wait for all GPUs
        let results = futures::future::join_all(tasks).await;

        // Merge results
        Ok(results.into_iter().flatten().collect())
    }
}
```

Performance Targets
| Operation | CPU (Baseline) | GPU (CUDA) | Speedup |
|---|---|---|---|
| Scan/Filter | 100M rows/sec | 1-5B rows/sec | 10-50x |
| Aggregation | 50M rows/sec | 500M-2B rows/sec | 10-40x |
| Join | 10M rows/sec | 100M-500M rows/sec | 10-50x |
| Sort | 20M rows/sec | 200M-1B rows/sec | 10-50x |
| Vector Search | 10K qps | 100K-500K qps | 10-50x |
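These figures assume the operands already reside in GPU memory or that the PCIe transfer is amortized across a pipeline of operators. A rough, hedged break-even sketch based on the cost model in section 1; the bandwidth and speedup numbers here are illustrative assumptions, not measured results.

```rust
/// Rough break-even check: GPU wins only when the compute savings outweigh
/// the host-to-device transfer. All constants are illustrative assumptions.
fn gpu_worthwhile(bytes: f64, cpu_rows_per_sec: f64, rows: f64, speedup: f64) -> bool {
    let pcie_bandwidth = 25.0e9; // ~25 GB/s effective (PCIe 4.0 x16, assumed)
    let transfer_secs = bytes / pcie_bandwidth;
    let cpu_secs = rows / cpu_rows_per_sec;
    let gpu_secs = cpu_secs / speedup;
    gpu_secs + transfer_secs < cpu_secs
}

fn main() {
    // 100M int32 rows (~400 MB), 100M rows/s CPU scan, assumed 20x GPU speedup:
    // transfer ~16 ms + GPU ~50 ms vs CPU ~1 s, so the GPU path clearly wins.
    println!("{}", gpu_worthwhile(400.0e6, 100.0e6, 100.0e6, 20.0));
}
```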
Patent Opportunities
“Automatic GPU Acceleration for Database Queries”
- Claims: Cost-based CPU/GPU routing, automatic data transfer, multi-GPU coordination
- Novelty: Zero-config GPU acceleration for SQL queries
- Value: $20M-$30M
[Document continues with F7.6, F7.9, F7.12, and remaining sections…]
Due to length constraints, this architecture document is split into multiple parts. This is Part 1, covering the first three major features (F7.1, F7.2, F7.5).
Next sections to create:
- F7.6: Advanced Webhooks Architecture
- F7.9: AI Schema Architect Architecture
- F7.12: Unified Observability Architecture
- Protocol Completion Architecture (Oracle, PostgreSQL, WASM)
- Integration Points
- Performance Targets
- Security Architecture
- Scalability Design
Total estimated length: 150+ pages
End of Part 1
Document Path: /home/claude/HeliosDB/docs/architecture/v7.0/PHASE2_ARCHITECTURE.md
Status: Part 1 Complete (F7.1, F7.2, F7.5)
Next: Create remaining sections or split into multiple files