**Technical Implementation Guideline for Project 'HeliosDB': A Next-Generation Distributed HTAP Database**

architecture.md

Section 1: Architectural Blueprint: A Decoupled, Shared-Nothing Model

1.1 Core Philosophy: Synthesizing Exadata’s Intelligence with Snowflake’s Scalability

The foundational design of HeliosDB is a two-tier, shared-nothing architecture that strategically combines the independent scaling of decoupled compute and storage with the performance benefits of intelligent, data-aware storage nodes.1 This model is inspired by two leading but distinct architectures: Snowflake’s separation of compute and storage, which allows for unparalleled elasticity and resource isolation 3, and Oracle Exadata’s hardware-software co-design, which pushes processing closer to the data to minimize data movement and accelerate queries.2

Unlike architectures such as Snowflake, which leverage general-purpose cloud object storage (e.g., AWS S3, Google Cloud Storage) as a passive persistence layer 6, HeliosDB designates its Storage Nodes as active, intelligent participants in the query execution process. This approach mirrors the philosophy of Exadata’s storage cells, which possess their own processing capabilities to offload database operations.2 The primary motivation for this hybrid design is to achieve the best of both worlds: the flexibility to scale compute resources (for handling query concurrency and complexity) and storage resources (for handling data volume) independently, while simultaneously minimizing the network bottleneck that can arise in purely decoupled systems where all raw data must be transferred to the compute layer for processing.9 By making the storage layer intelligent, HeliosDB aims to deliver superior performance for data-intensive analytical queries, which is a cornerstone of a true Hybrid Transactional/Analytical Processing (HTAP) system.

This fundamental architectural choice—to build a system with intelligent, custom storage services rather than relying on commodity object storage—is the most critical decision in the design of HeliosDB. It is a deliberate trade-off that prioritizes raw query performance and efficiency over the architectural simplicity and potential cost advantages of using standard cloud infrastructure. This decision directly enables core features like predicate pushdown (Smart Scan), where filtering occurs at the storage layer, thereby reducing the volume of data that must traverse the network. This reduction in data movement is paramount for achieving high performance in analytical workloads that scan large datasets.8

1.2 Compute Node Responsibilities

The Compute Nodes in the HeliosDB architecture function as the stateless “brains” of the query processing layer.7 Their primary responsibilities encompass the full lifecycle of query execution and client interaction, without being burdened by the complexities of durable data management.

Key roles of a Compute Node include:

  • Client Connection and Session Management: Acting as the entry point for all client connections, handling authentication, and managing user sessions.
  • SQL Parsing, Planning, and Optimization: Receiving SQL queries, parsing them into an internal representation, and generating an optimized, distributed execution plan. The optimizer is responsible for determining which operations can be pushed down to the Storage Nodes.
  • Distributed Query Coordination: For any given query, the receiving Compute Node acts as the coordinator. It orchestrates the execution of the distributed plan by dispatching tasks to other Compute Nodes (for complex joins or aggregations) and, crucially, to the relevant Storage Nodes.
  • Intermediate and Final Result Processing: Performing operations that cannot be offloaded to the storage layer. This includes cross-shard joins, applying window functions, final sorting (ORDER BY), and aggregating the partial results received from multiple Storage Nodes into a final result set to be returned to the client.

Compute Nodes are fundamentally stateless; they do not persist any user data durably. To optimize performance and reduce redundant network traffic, each Compute Node maintains a significant local data cache in its main memory. This cache holds recently accessed data blocks fetched from the Storage Nodes, operating on a least-recently-used (LRU) or similar eviction policy. This is conceptually similar to the local caching performed by Snowflake’s Virtual Warehouses.12 Cache coherence is maintained through an invalidation mechanism managed by the central Metadata Service, which broadcasts notifications to all Compute Nodes upon the commit of transactions that modify data.
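
As a concrete illustration of this scheme, the following is a minimal sketch (not the actual HeliosDB implementation) of an LRU block cache with the invalidation hook described above; the block-ID scheme and method names are hypothetical.

Python

from collections import OrderedDict

class BlockCache:
    """Minimal LRU cache for data blocks fetched from Storage Nodes."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._blocks: OrderedDict[str, bytes] = OrderedDict()

    def get(self, block_id: str) -> bytes | None:
        if block_id not in self._blocks:
            return None                          # cache miss: fetch from Storage Node
        self._blocks.move_to_end(block_id)       # mark as most recently used
        return self._blocks[block_id]

    def put(self, block_id: str, data: bytes) -> None:
        self._blocks[block_id] = data
        self._blocks.move_to_end(block_id)
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)     # evict the least recently used block

    def invalidate(self, block_ids: list[str]) -> None:
        """Called when a CacheInvalidationNotice arrives from the Metadata Service."""
        for block_id in block_ids:
            self._blocks.pop(block_id, None)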

1.3 Storage Node Responsibilities

Storage Nodes form the stateful, persistent foundation of the HeliosDB architecture. They are responsible not only for the durable storage of data but also for executing low-level data processing tasks that are offloaded from the Compute Nodes. This “intelligence” is what distinguishes the architecture from simpler decoupled models.8

Each Storage Node is a self-contained service that manages a subset of the database’s total data (a shard). Its responsibilities include:

  • Durable Data Storage: Managing the physical storage of data on local devices (NVMe SSDs, persistent memory). This is handled by a write-optimized storage engine based on a Log-Structured Merge-tree (LSM-tree), designed for high-throughput ingestion (detailed in ingestion_and_crud.md).
  • Data Retrieval: Serving data blocks to Compute Nodes upon request.
  • Execution of Pushed-Down Operations: Running a suite of specialized processing engines that operate directly on local data:
    • Predicate Filtering Engine: Executes WHERE clause conditions sent from Compute Nodes, filtering rows at the source.
    • Columnar Processing Engine: Operates on data compressed with Hybrid Columnar Compression (HCC), capable of decompressing only the specific columns required for a query.
    • Filtered Vector Search Engine: Executes Approximate Nearest Neighbor (ANN) searches on vector data, including the application of scalar predicates directly within the search algorithm.

This distribution of labor ensures that data-intensive filtering and processing happen in parallel across all relevant Storage Nodes, dramatically reducing the amount of data that needs to be sent to the compute layer for final processing.

Section 2: Inter-Node Communication Fabric

The performance of a distributed database, particularly one with a decoupled architecture, is fundamentally defined by the efficiency and capability of its interconnect. In HeliosDB, the network is not merely a data transport layer but the primary system bus that integrates the compute and storage tiers into a cohesive, high-performance unit.

2.1 Protocol Specification: RDMA over Converged Ethernet (RoCEv2)

To achieve the lowest possible latency for inter-node communication, the physical and transport layers of the HeliosDB fabric will be built upon RDMA over Converged Ethernet (RoCEv2). RDMA is a kernel-bypass technology that allows the network interface card (NIC) of one server to directly access the main memory of another server without involving the remote server’s CPU, operating system, or kernel.5 This eliminates significant software overhead and is critical for achieving the microsecond-level latencies required for advanced database optimizations.

The use of RDMA is directly inspired by its application in Oracle Exadata, where it facilitates ultra-fast operations such as reading commit cache information from a remote instance or fetching undo blocks for consistent reads, reducing latency for these operations by an order of magnitude compared to traditional network protocols.8 By adopting RoCEv2, HeliosDB positions itself to implement similar low-latency optimizations in the future, providing a significant performance advantage for transactional workloads.

2.2 The HeliosDB Intelligent Database (HIDB) Protocol

Layered on top of the RoCEv2 fabric will be a custom, high-performance Remote Procedure Call (RPC) protocol named the HeliosDB Intelligent Database (HIDB) Protocol. This protocol is analogous to Oracle’s proprietary Intelligent Database (iDB) protocol, which enables the communication of SQL-level semantics between database and storage servers.13 The HIDB protocol will be designed to serialize and transmit complex query fragments and control messages, transforming the network from a simple data pipe into a true command bus. It will be implemented using a framework like gRPC, but with a custom transport layer optimized to leverage RDMA where possible.

The protocol will define a rich set of message types, including:

  • PredicatePushdownRequest: A message sent from a Compute Node to one or more Storage Nodes. It contains a serialized representation of the query’s WHERE clause predicates, a list of columns to be projected (SELECT clause), and identifiers for the target tables, partitions, or data blocks.
  • FilteredResultSet: A streaming response from a Storage Node to a Compute Node. It contains the rows and columns that have passed the predicate filter, formatted in an efficient, possibly compressed, columnar layout to minimize network bandwidth and deserialization overhead on the Compute Node.
  • VectorSearchRequest: A specialized message for vector similarity searches. It includes the query vector, the number of nearest neighbors to find (k), and a set of serialized scalar filter predicates to be applied during the search.
  • CacheInvalidationNotice: A broadcast message sent from the Metadata Service to all Compute Nodes, instructing them to invalidate specific data blocks in their local caches due to a recent transaction commit.
  • ReplicationDataStream: A continuous stream of committed transaction logs sent between primary and mirror Storage Nodes for synchronous replication.

The design of the HIDB protocol is a core architectural task, not a minor implementation detail. Its extensibility and efficiency will directly dictate the future capabilities and performance limits of the entire database system. A failure to engineer this protocol for low latency and high throughput would create a bottleneck that could not be overcome by simply adding more powerful hardware to the compute or storage tiers.
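
To make the message catalogue above concrete, here is a hedged sketch of how a few HIDB messages might be modelled in a client or test harness; the field names are illustrative assumptions, and a production implementation would define these in an IDL such as protobuf for the gRPC layer.

Python

from dataclasses import dataclass, field

@dataclass
class PredicatePushdownRequest:
    table_id: int
    partition_ids: list[int]          # target partitions after pruning
    projected_columns: list[str]      # SELECT-list columns to return
    predicate: bytes                  # serialized WHERE-clause expression tree

@dataclass
class VectorSearchRequest:
    table_id: int
    query_vector: list[float]
    k: int                            # number of nearest neighbors to find
    scalar_predicates: bytes = b""    # serialized filters applied during the search

@dataclass
class CacheInvalidationNotice:
    commit_timestamp: int
    invalidated_block_ids: list[str] = field(default_factory=list)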

Section 3: Metadata and Consensus Service

While the data and query processing pathways in HeliosDB are massively distributed and parallel, the management of the cluster’s global state and schema requires a centralized, consistent, and highly available source of truth. This function is fulfilled by the Metadata and Consensus Service.

3.1 Architecture: Centralized Logic, Distributed State

The Metadata Service will be implemented as a small, dedicated cluster of three or five Metadata Nodes. This logical centralization of cluster management is analogous to the role of the Shard Catalog in Oracle Sharding 14 or the Placement Driver (PD) in TiDB.15 These nodes do not store any user data; their sole responsibility is to maintain the metadata that describes the state and structure of the entire HeliosDB cluster.

To ensure high availability and fault tolerance, the state of the Metadata Service will be replicated across all its nodes using the Raft consensus protocol.17 Raft is a consensus algorithm designed to manage a replicated log, ensuring that all nodes in the metadata cluster agree on a single, consistent state, even in the presence of node failures or network partitions.19 This makes the “brain” of the cluster resilient to failures.

3.2 Managed Metadata

The Raft-replicated state machine within the Metadata Service will be the canonical source of truth for all cluster-wide information, including:

  • Sharding Topology: The complete mapping of data ranges (known as chunks or shards) to the specific set of Storage Nodes that are responsible for storing them, including the identification of primary and mirror replicas.16
  • Schema Information: All Data Definition Language (DDL) artifacts, such as the definitions of tables, columns, data types, indexes (both standard and vector), partitioning schemes, and compression settings.
  • Node and Replica Status: The health and availability of all Compute and Storage Nodes in the cluster, including the current leader for each replicated shard.
  • Cluster Configuration: Global system settings, security credentials, and access control policies.

3.3 Implementation with etcd/raft

To accelerate development and ensure correctness, the consensus layer will be built using a battle-tested, open-source Raft library, such as the one extracted from etcd.21 The etcd/raft library is intentionally minimalistic; it implements the core Raft algorithm but delegates the responsibilities of network transport and persistent log storage to the application.21

The implementation on the Metadata Nodes will therefore consist of:

  1. The core etcd/raft state machine.
  2. A persistent storage layer for the Raft log and state machine snapshots, likely implemented using an embedded key-value store like RocksDB.23
  3. A network transport layer, built with gRPC, to handle the passing of Raft messages (e.g., AppendEntries, RequestVote) between peer Metadata Nodes.24

This design separates the complex logic of the consensus algorithm from the platform-specific implementation of storage and networking, leading to a more robust and maintainable system.

The data path of the database (i.e., DML operations like SELECT, INSERT, UPDATE) is designed to be massively parallel and distributed. In contrast, the control path (i.e., DDL operations like CREATE TABLE or cluster management tasks like adding a new node) is inherently serialized. All such changes must pass through the single Raft leader of the Metadata Service to be committed.25 This is a standard and acceptable trade-off in distributed database design, as control-plane operations are orders of magnitude less frequent than data-plane operations. However, it imposes a critical constraint on the overall architecture: data-plane operations must not require synchronous communication with the Metadata Service. To prevent the Metadata Service from becoming a global bottleneck for every query, Compute Nodes and client drivers must aggressively cache metadata (such as sharding topology and table schemas) and rely on an asynchronous, push-based invalidation mechanism to stay informed of changes.


data_organization.md

Section 1: Sharding and High Availability

Data distribution across the cluster is fundamental to achieving horizontal scalability and fault tolerance. HeliosDB employs a system-managed sharding strategy coupled with synchronous replication to provide both linear scaling of capacity and strong data durability guarantees.

1.1 Sharding Strategy: System-Managed Consistent Hashing

HeliosDB will horizontally partition data across the cluster of Storage Nodes, which act as shards in a shared-nothing architecture.14 The distribution of data will be governed by a system-managed consistent hashing algorithm applied to a user-defined sharding key.27

When a sharded table is created, the user specifies one or more columns as the sharding key (e.g., customer_id, region_code). For each row inserted, the database computes a hash of the sharding key’s value. This hash value is then mapped to a specific shard on the consistent hash ring, determining which Storage Node is responsible for storing that row.
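
The mapping step can be sketched as follows; this is a simplified ring with a fixed node set and no virtual nodes, and the hash choice (SHA-1, truncated to 64 bits) is an assumption for illustration only.

Python

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, node_ids: list[str]):
        # Place each Storage Node at a deterministic point on the ring.
        self._ring = sorted((self._hash(n), n) for n in node_ids)
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def node_for(self, sharding_key: str) -> str:
        """Return the node owning the first ring point at or after the key's hash."""
        point = self._hash(sharding_key)
        idx = bisect.bisect_left(self._points, point) % len(self._points)
        return self._ring[idx][1]

ring = ConsistentHashRing(["storage-1", "storage-2", "storage-3"])
print(ring.node_for("customer:123"))  # deterministic shard choice for this key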

This approach offers several key advantages:

  • Even Data Distribution: A well-chosen hash function ensures that data is spread evenly across all available shards, preventing “hot spots” where one node becomes overloaded.10
  • Elastic Scalability: When a new Storage Node is added to the cluster, it takes ownership of a portion of the hash ring. Only the data corresponding to that specific range needs to be moved from its existing nodes, minimizing the amount of data rebalancing required. This is significantly more efficient than re-sharding an entire dataset.10
  • Transparency: The entire process of data placement and rebalancing is managed automatically by the Metadata Service, making the distributed nature of the data transparent to the application layer, similar to Oracle’s system-managed sharding.27

1.2 Replication Model: Single Synchronous Mirror per Shard

To ensure high availability and data durability, each shard in HeliosDB will be configured as a high-availability pair, consisting of a single primary node and one synchronous mirror.29 This configuration means that every write operation (e.g., INSERT, UPDATE, DELETE) directed to a shard’s primary node must be successfully replicated and committed on its mirror node before a success acknowledgment is returned to the client application.

This synchronous replication model provides a Recovery Point Objective (RPO) of zero. In the event of an unrecoverable failure of the primary node, no committed data is lost, as the mirror is guaranteed to have an identical, up-to-date copy of the data.29 The mirror can then be promoted to become the new primary, allowing the system to resume operations with minimal interruption.

While this choice provides the strongest guarantee of data durability, it comes at the cost of increased write latency. Every write transaction must incur at least one network round-trip time (RTT) to the mirror node before it can be considered committed. This trade-off prioritizes data safety and availability over raw write throughput, which is a critical consideration for mission-critical transactional workloads. To provide greater flexibility for diverse use cases, future versions of HeliosDB should allow this replication mode to be configurable on a per-table or per-keyspace basis, enabling asynchronous replication for workloads where some data loss in a disaster scenario is acceptable in exchange for lower write latency.

1.3 Quorum and Split-Brain Avoidance

In a two-node primary/mirror setup, determining which node should be active if they lose communication with each other is impossible without a third party, a scenario known as “split-brain.” To resolve this and enable automatic failover, HeliosDB will employ a witness-based quorum mechanism for each shard pair.29

A witness is a lightweight process or service that does not store user data but participates in the leader election protocol for a specific shard. Its sole purpose is to provide a third vote to establish a majority (quorum). The failover logic is as follows:

  1. The primary and mirror nodes continuously heartbeat with each other and with their designated witness.
  2. If the mirror node loses contact with the primary but can still communicate with the witness, it will request a failover.
  3. The witness will grant the failover request only if it has also lost contact with the primary node.
  4. With two out of three members (mirror and witness) agreeing that the primary is down, a quorum is established, and the mirror is safely promoted to the new primary.

This mechanism, inspired by its use in systems like SQL Server Database Mirroring, prevents a split-brain scenario where both the original primary and the mirror believe they are the active node, which would lead to data divergence and corruption.29 The state and location of witnesses can be managed and monitored by the central Metadata Service.
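
The quorum rule reduces to a simple majority check, sketched below with hypothetical function names; heartbeat transport and timeouts are omitted.

Python

def witness_grants_failover(witness_sees_primary: bool) -> bool:
    # The witness votes for failover only if it has also lost the primary.
    return not witness_sees_primary

def mirror_may_promote(mirror_sees_primary: bool, witness_vote: bool) -> bool:
    # Mirror + witness = two of three members agreeing the primary is down.
    return (not mirror_sees_primary) and witness_vote

assert mirror_may_promote(False, witness_grants_failover(False))     # safe failover
assert not mirror_may_promote(False, witness_grants_failover(True))  # split-brain avoided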

Section 2: Intra-Shard Partitioning

While sharding distributes data across different nodes in the cluster, partitioning is a technique for subdividing data within a single shard (Storage Node). HeliosDB will leverage a rich set of partitioning strategies to further optimize data management and query performance at a granular level.30

2.1 Local Data Organization

Within each Storage Node, a large table can be decomposed into smaller, independent segments called partitions.32 Each partition is a separate object with its own storage characteristics but shares the same logical schema as the parent table. This is distinct from sharding, which is an inter-node distribution strategy.31

The primary benefit of local partitioning is enabling partition pruning (also known as partition elimination). When a query contains predicates on the partitioning key, the query optimizer on the Compute Node can intelligently determine that only a subset of the partitions on a given Storage Node needs to be accessed to satisfy the query. The request sent to the Storage Node will specify only the relevant partitions, allowing the node to completely ignore the I/O for all other partitions.33 This can reduce the amount of data scanned by orders of magnitude, dramatically improving query performance.

This combination of inter-node sharding and intra-node partitioning creates a powerful, two-level hierarchy for data pruning. For example, consider a sales table sharded by store_id and locally range-partitioned by sale_date. A query like SELECT * FROM sales WHERE store_id = 101 AND sale_date BETWEEN '2025-01-01' AND '2025-01-07'; would first be routed only to the specific shard responsible for store_id = 101 (shard pruning). Then, within that shard, the Storage Node would only scan the seven daily partitions corresponding to the first week of January (partition pruning), ignoring all other data. This hierarchical filtering is a cornerstone of the system’s performance for analytical queries.

2.2 Supported Partitioning Strategies

To address a wide variety of business requirements, HeliosDB will support the most common and powerful partitioning methods found in mature relational databases like Oracle.30 The CREATE TABLE DDL will allow users to specify both a SHARD BY key and a PARTITION BY key, as sketched in the example following the list below.

Supported strategies will include:

  • Range Partitioning: Data is partitioned based on a continuous range of values in the partition key column(s). This is most commonly used for date or timestamp columns (e.g., partitioning a table by month or day).
  • List Partitioning: Data is partitioned based on a discrete list of literal values. This is ideal for categorical data (e.g., partitioning a customer table by country_code with partitions for 'US', 'DE', 'JP', etc.).
  • Hash Partitioning: A hash function is applied to the partition key to determine the partition where a row should be stored. This is used to evenly distribute data across a fixed number of partitions when there is no logical range or list-based key, helping to avoid hot spots within a single shard.33
  • Composite Partitioning: This allows for a combination of two partitioning methods. For example, a table could be range-partitioned by month, and each monthly partition could then be sub-partitioned by hash on customer_id. This allows for pruning on both dimensions simultaneously.
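
Assuming the PEP 249 driver described in implementation_plan.md and a hypothetical SHARD BY / PARTITION BY grammar (the exact DDL syntax is not finalized here), a combined sharded and partitioned table might be declared like this:

Python

import heliosdb  # hypothetical HeliosDB driver module (PEP 249 compliant)

conn = heliosdb.connect(host="helios-cluster.example.com")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE sales (
        sale_id   BIGINT PRIMARY KEY,
        store_id  INT,
        sale_date DATE,
        amount    DECIMAL(10, 2)
    )
    SHARD BY HASH (store_id)            -- inter-node distribution
    PARTITION BY RANGE (sale_date)      -- intra-shard pruning
""")
conn.commit()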

Section 3: Hybrid Columnar Compression (HCC)

For analytical and data warehousing workloads, which are characterized by large scans over historical data, storage efficiency and I/O performance are critical. HeliosDB will incorporate an advanced storage format inspired by Oracle’s Hybrid Columnar Compression (HCC) to address these needs.35

3.1 Data Structure: The Compression Unit (CU)

Data in HCC-enabled tables will be organized into a logical block called a Compression Unit (CU).37 A CU contains a group of several thousand rows. The internal layout of a CU is a hybrid of row and columnar storage formats.

Within a single CU, all values for a given column across all rows in the unit are grouped together and stored contiguously. This columnar organization places similar data types and values next to each other, which dramatically increases the effectiveness of modern compression algorithms.38 However, unlike a pure columnar store, all the data for a single logical row remains within the boundaries of the same CU.40 This self-contained nature preserves data locality, allowing for relatively efficient retrieval of individual rows with a single I/O operation, which is a key benefit of the hybrid approach.37

3.2 Compression Algorithms and Application

The compression engine will use industry-standard, high-performance compression libraries like LZ4 (optimized for decompression speed) and ZSTD (offering higher compression ratios at the cost of more CPU). The choice of algorithm will be a tunable parameter at the table or column level.

The compression process within a CU leverages the columnar layout. For each column’s data segment, the engine builds a dictionary of unique values and replaces repeated values with compact integer references.37 This technique is especially potent for low-cardinality columns (e.g., a ‘status’ column with only a few distinct values), leading to very high compression ratios. The compressed data remains compressed on disk, in the Storage Node’s flash cache, during network transfer to the Compute Node, and in the Compute Node’s buffer cache, minimizing the data footprint at every stage of the query lifecycle.35
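
The dictionary step for one column segment can be sketched in a few lines; a real CU would additionally apply a general-purpose compressor (LZ4/ZSTD) on top of the encoded output.

Python

def dictionary_encode(values: list[str]) -> tuple[list[str], list[int]]:
    """Replace repeated values with compact integer references into a dictionary."""
    dictionary: dict[str, int] = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return list(dictionary), codes

# A low-cardinality 'status' column compresses to a tiny dictionary plus small ints.
dict_, codes = dictionary_encode(["OPEN", "CLOSED", "OPEN", "OPEN", "CLOSED"])
print(dict_, codes)   # ['OPEN', 'CLOSED'] [0, 1, 0, 0, 1]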

3.3 DML Operations on HCC Data

The high compression ratios of HCC come with a performance trade-off for Data Manipulation Language (DML) operations. Because data for a single row is spread across multiple compressed columnar segments within a CU, updating even a single value in a single row is a heavy operation. It requires the database to lock the entire Compression Unit, decompress it, apply the modification, and then recompress the entire unit.37

This overhead means that HCC is best suited for data that is written once and read many times, or updated very infrequently. Typical use cases include data warehousing fact tables, archival data, and logs. To accommodate this, HeliosDB’s storage engine will be designed to handle data in different formats. New or frequently updated “hot” data will reside in a write-optimized row format (within the LSM-tree’s Memtable). Through a background data lifecycle or compaction process, older, less frequently modified data can be automatically converted into the HCC format to optimize storage and scan performance.

To provide users with control over the trade-offs between storage, query performance, and load time, HeliosDB will offer distinct HCC modes, specified at table creation.

Table: HCC Compression Modes

| Mode | Target Use Case | Compression Ratio (Est.) | CPU Overhead (Load) | Query Performance Boost (Scans) |
| --- | --- | --- | --- | --- |
| WAREHOUSE_OPTIMIZED | Active Data Warehousing | High (e.g., ~6-10x) | Medium | High |
| ARCHIVE_OPTIMIZED | Infrequently Accessed Data | Very High (e.g., ~10-15x) | High | Medium (decompression overhead) |

query_execution.md

Section 1: Predicate Pushdown Engine (Inspired by Exadata Smart Scan)

A cornerstone of HeliosDB’s performance strategy for analytical queries is the ability to minimize data movement between the storage and compute tiers. The Predicate Pushdown Engine, functionally analogous to Oracle Exadata’s Smart Scan feature, is the primary mechanism for achieving this. It transforms the Storage Nodes from passive data repositories into active participants in the query execution process.11

1.1 Core Mechanism: Offloading WHERE Clause Processing

During the query optimization phase, the Compute Node analyzes the SQL WHERE clause to identify filtering conditions, or predicates, that can be executed directly at the storage layer.13 This includes a wide range of common SQL predicates:

  • Simple comparisons (e.g., column = value, column > value).
  • Range checks (e.g., column BETWEEN val1 AND val2).
  • Membership tests (e.g., column IN (val1, val2,…)).
  • Pattern matching (e.g., column LIKE 'pattern%').
  • Boolean logic combinations of the above (AND, OR, NOT).

Once identified, these predicates, along with the list of required columns from the SELECT clause (column projection), are serialized into a PredicatePushdownRequest message. This message is then transmitted via the HIDB protocol to all Storage Nodes that hold relevant data for the query.13 By sending the query to the data, instead of pulling all the data to the query, the system drastically reduces network traffic and offloads CPU cycles from the Compute Nodes.8
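
A sketch of a serializable predicate tree and its row-at-a-time evaluation on a Storage Node follows; the node types mirror the predicate classes listed above, but this particular representation is an assumption, not the wire format.

Python

from dataclasses import dataclass

@dataclass
class Comparison:
    column: str
    op: str          # one of '=', '>', '<', 'LIKE', ...
    value: object

@dataclass
class And:
    left: object
    right: object

def evaluate(pred, row: dict) -> bool:
    """Evaluate a pushed-down predicate tree against one row."""
    if isinstance(pred, And):
        return evaluate(pred.left, row) and evaluate(pred.right, row)
    if pred.op == "=":
        return row[pred.column] == pred.value
    if pred.op == ">":
        return row[pred.column] > pred.value
    raise NotImplementedError(pred.op)

pred = And(Comparison("region", "=", "EU"), Comparison("amount", ">", 100))
print(evaluate(pred, {"region": "EU", "amount": 250}))  # True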

1.2 Execution on Storage Nodes

Upon receiving a PredicatePushdownRequest, the Storage Node’s execution engine performs the filtering operation directly on its local data. The process is optimized for the underlying storage format:

  • For Row-Oriented Data: The engine performs a scan of the requested data blocks. For each row, it evaluates the predicate expression. If the row satisfies the condition, only the projected columns are extracted and added to the result set being streamed back to the Compute Node.
  • For Hybrid Columnar Compressed (HCC) Data: The process is significantly more efficient. The engine accesses only the compressed data for the columns involved in the predicate and the projection. It decompresses only these specific columns on-the-fly to evaluate the filter.40 All other columns in the table are never read from disk or decompressed, resulting in a massive reduction in I/O and CPU consumption.

In both cases, the result is a stream of data sent back to the coordinating Compute Node that is orders of magnitude smaller than the raw data on disk.42 This filtered result set is then used for subsequent processing steps, such as joins or final aggregations, at the compute layer. This transformation of the storage layer into a distributed, data-aware co-processor is the primary source of HeliosDB’s analytical performance advantage. It requires the Storage Node to have a sophisticated software stack, including its own query runtime and expression evaluation engine, but the performance gains justify this complexity.

1.3 Advanced Offloading (Future Work)

The HIDB protocol and Storage Node architecture will be designed with extensibility in mind to support more advanced offloading capabilities in future releases. This roadmap, inspired by the evolution of Exadata’s features, includes:

  • Join Offloading: Offloading parts of a join operation by creating Bloom filters on the Storage Nodes. For a join between a small dimension table and a large fact table, a Bloom filter of the join keys from the dimension table can be sent to the Storage Nodes scanning the fact table, allowing them to discard non-matching fact table rows at the source.13
  • Aggregation Offloading: For simple aggregation queries (GROUP BY, SUM, COUNT), partial aggregation can be performed on the Storage Nodes. Instead of returning all filtered rows, each Storage Node can return a much smaller set of partially aggregated results, which are then combined at the Compute Node.8

Section 2: Online Aggregation Engine (Inspired by LeanXcale)

For true HTAP capabilities, a database must handle real-time analytics on operational data without causing contention that slows down transactional workloads. HeliosDB will address this challenge by implementing an Online Aggregation Engine inspired by the novel architecture of LeanXcale.43

2.1 Feature Goal: Real-Time, Contention-Free Aggregates

The system will support the creation of ONLINE AGGREGATE objects, which function like real-time, incrementally updated materialized views.45 These objects will maintain pre-calculated aggregate values (such as COUNT, SUM, AVG, MIN, MAX) as data is ingested. Consequently, an application query for a pre-defined aggregate can be satisfied with a near-instantaneous lookup of a pre-computed value, rather than executing a costly full-table scan and aggregation at query time.46

For example, a dashboard query for SELECT COUNT(*) FROM user_logins WHERE event_date = CURRENT_DATE could be answered by reading a single, continuously updated row in an internal aggregate table, providing sub-second response times even on tables with billions of rows.

2.2 Implementation via Semantic Concurrency Control

The key innovation enabling this feature is a departure from traditional concurrency control mechanisms. Standard locking or Multi-Version Concurrency Control (MVCC) models create a severe bottleneck when thousands of concurrent transactions attempt to update a single “hot” row, such as an aggregate counter. In MVCC, these concurrent updates would result in a cascade of transaction aborts, as only one transaction could successfully commit its change to the counter row at a time.47

HeliosDB will implement Semantic Concurrency Control to eliminate this bottleneck.47 This approach leverages the mathematical properties of the aggregation operations themselves to allow for conflict-free concurrent updates.

Mechanism:

  1. DELTA Column Type: A new internal column type, DELTA, will be introduced. Instead of storing a final value, a DELTA column stores a representation of the operation to be performed (e.g., SUM(+1), SUM(+42.50)).
  2. Commutative Operations: The engine recognizes that aggregation functions like SUM and COUNT are based on commutative operations (e.g., addition, where a + b = b + a). The order in which increments are applied does not affect the final result.
  3. Conflict-Free Write Path: When a transaction inserts a new row that affects an online aggregate, it does not attempt to lock and update the final aggregate value row. Instead, it writes a new, immutable “operation row” or “delta record” to an internal table. This delta record contains the operation to be applied (e.g., +1 to the count for a specific group). Since these writes are appends of new, independent records, they do not conflict with each other, even if they logically apply to the same aggregate group.47
  4. On-the-Fly Read Path: When a query requests the aggregate value, the query engine reads the last known materialized value (the “snapshot”) and then applies all subsequent delta records that have been committed since that snapshot was created. This calculation happens on-the-fly, ensuring the returned value is completely up-to-date.
  5. Background Consolidation: A background process asynchronously and periodically merges the delta records into the materialized value row, creating a new snapshot and allowing the now-processed delta records to be garbage collected.

This mechanism represents a paradigm shift from traditional concurrency control. By understanding the semantics of the operations, the system can permit a level of parallelism that is impossible when only considering physical read/write conflicts. This is the key to resolving the fundamental tension in HTAP systems between transactional throughput and real-time analytical freshness.
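
A toy version of the delta-record write path and on-the-fly read path, assuming a single SUM aggregate per group; durability, transaction boundaries, and visibility checks are deliberately omitted.

Python

from collections import defaultdict

class OnlineAggregate:
    """Conflict-free SUM aggregate: writers append deltas, readers merge on the fly."""

    def __init__(self):
        self.snapshot = defaultdict(float)   # last consolidated value per group
        self.deltas = []                     # append-only (group, delta) records

    def record(self, group: str, delta: float) -> None:
        # Writers never touch the snapshot row, so concurrent appends never conflict.
        self.deltas.append((group, delta))

    def read(self, group: str) -> float:
        # Snapshot plus all committed deltas recorded since it was taken.
        return self.snapshot[group] + sum(d for g, d in self.deltas if g == group)

    def consolidate(self) -> None:
        # Background process: fold deltas into a new snapshot, then garbage-collect them.
        for g, d in self.deltas:
            self.snapshot[g] += d
        self.deltas.clear()

agg = OnlineAggregate()
agg.record("2025-01-01", 1)
agg.record("2025-01-01", 1)
print(agg.read("2025-01-01"))  # 2.0, fully up-to-date without locking a hot row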

Section 3: Distributed Query Execution

The execution of any given query is a coordinated effort between the Compute and Storage tiers, designed to maximize parallelism at every level.

3.1 Query Lifecycle

A typical distributed query follows these steps:

  1. Reception and Planning: A client sends a SQL query to any Compute Node. The node’s parser and optimizer generate a distributed execution plan, which is a tree of physical operators. The optimizer uses statistics about data distribution to decide which predicates and projections to push down.
  2. Dispatch: The coordinating Compute Node dispatches tasks based on the plan. PredicatePushdownRequest and VectorSearchRequest messages are sent in parallel to all Storage Nodes that hold relevant data shards. If the query involves complex joins between data on different shards, sub-queries may also be dispatched to other Compute Nodes.
  3. Storage-Layer Execution: Each receiving Storage Node executes its assigned tasks (filtering, ANN search) on its local data partitions and begins streaming the results back to the designated Compute Node(s).
  4. Compute-Layer Execution: The Compute Nodes receive the partial result streams and execute the remaining operators in the query plan. This includes performing joins across data from different shards, executing complex functions or window functions that could not be pushed down, and performing the final sorting and aggregation of the result set.
  5. Result Return: The coordinating Compute Node assembles the final result set and streams it back to the client.

3.2 Parallelism Model

HeliosDB is designed to exploit parallelism at every stage of query execution to minimize latency:

  • Multi-Shard Parallelism: The query coordinator dispatches requests to all relevant shards simultaneously. If a query needs to scan data across 10 shards, 10 parallel execution streams are initiated at the storage layer.
  • Intra-Shard Parallelism: Within a single Storage Node, if a table is locally partitioned, the scan operation can be parallelized across the relevant partitions.
  • Intra-Node Parallelism: On both Compute and Storage Nodes, individual operators (e.g., hash join, sort, scan) are designed to be multi-threaded and will utilize all available CPU cores. The architecture will draw inspiration from ScyllaDB’s shard-per-core model, where possible, to minimize cross-core contention and maximize CPU cache efficiency.

vector_integration.md

Section 1: The VECTOR Data Type and Storage

To natively support AI and machine learning workloads, HeliosDB will integrate vector database capabilities as a first-class feature, starting with a dedicated VECTOR data type and an optimized storage strategy that balances the needs of hybrid relational-vector queries.

1.1 DDL and Data Type Definition

A new data type, VECTOR(n), will be introduced into the SQL dialect, where n represents the number of dimensions of the vector.49 This allows developers to define tables that seamlessly mix traditional scalar data types with high-dimensional vector embeddings.

The Data Definition Language (DDL) syntax will be straightforward:

SQL

CREATE TABLE document_embeddings (
    doc_id BIGINT PRIMARY KEY,
    doc_text TEXT,
    author VARCHAR(255),
    creation_date DATE,
    embedding VECTOR(1536)
);

This example defines a table to store text documents, their metadata, and a 1536-dimensional vector embedding for each document’s content, such as one generated by a modern language model.

1.2 Internal Storage Strategy: Hybrid In-line / Out-of-line (TOAST)

The physical storage of vector data presents a significant performance challenge. Storing large vectors directly within the main table row can lead to wide rows, reducing the number of rows that fit on a single data page and potentially increasing I/O for queries that do not need the vector data. Conversely, always storing vectors separately introduces extra I/O for every vector search operation.

To address this, HeliosDB will adopt a flexible storage mechanism inspired by PostgreSQL’s The Oversized-Attribute Storage Technique (TOAST).51 This strategy allows the database to decide whether to store a vector in-line (within the primary row’s data page) or out-of-line (in a separate storage area with a pointer in the main row) based on its size.

Mechanism:

  1. A system-wide threshold, TOAST_TUPLE_THRESHOLD (defaulting to ~2 KB), will be defined.
  2. When a row containing a VECTOR column is inserted or updated, the system checks the total row size.
  3. If the vector’s size (e.g., 1536 dimensions * 4 bytes/dimension = 6144 bytes) causes the row to exceed the threshold, the vector data is moved to a dedicated TOAST storage area. The main table row stores only an 18-byte pointer to this out-of-line data.51
  4. If the vector is small enough (e.g., a 128-dimension vector at 512 bytes) and the total row size remains below the threshold, the vector data is stored directly in-line.

This default behavior can be overridden on a per-column basis using the SET STORAGE clause, providing database administrators with fine-grained control 51:

  • ALTER TABLE … ALTER COLUMN embedding SET STORAGE EXTERNAL; forces out-of-line storage for the vector, which is beneficial if the vector is large and rarely queried alongside scalar columns.
  • ALTER TABLE … ALTER COLUMN embedding SET STORAGE PLAIN; forces in-line storage and disables compression. This is the optimal setting for performance-critical vector search workloads, as it avoids the I/O overhead of fetching out-of-line data, but it may lead to errors if the row becomes too large to fit on a single data page.

The default storage strategy for the VECTOR type is a critical tuning parameter. While out-of-line storage is necessary for very large objects, it introduces a significant performance penalty for vector-only similarity searches. An Approximate Nearest Neighbor (ANN) index scan may need to evaluate thousands of candidate vectors to find the top results.54 If each of these vectors is stored out-of-line, the database must perform two I/O operations per candidate: one to read the pointer from the main table and a second to follow the pointer to the TOAST table to retrieve the vector data.52 This effectively doubles the I/O cost of the search. Therefore, for workloads where vector search performance is paramount, the recommended practice will be to use the PLAIN or MAIN storage strategy and ensure that the vector dimensionality and other row data are constrained such that TOASTing is avoided.55
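
The inline/out-of-line decision can be sketched as follows; the 2 KB threshold mirrors the default named above, the 18-byte pointer size is taken from the TOAST description, and four-byte float32 components are assumed.

Python

TOAST_TUPLE_THRESHOLD = 2048   # bytes, default threshold from the text
TOAST_POINTER_SIZE = 18        # bytes kept in-line when the vector is toasted

def plan_vector_storage(dimensions: int, other_columns_size: int) -> str:
    vector_size = dimensions * 4          # float32 components
    row_size = other_columns_size + vector_size
    if row_size <= TOAST_TUPLE_THRESHOLD:
        return "inline"                   # vector stays in the main row
    return "out-of-line"                  # row keeps an 18-byte TOAST pointer

print(plan_vector_storage(128, 200))    # inline:      128 * 4 = 512 bytes
print(plan_vector_storage(1536, 200))   # out-of-line: 1536 * 4 = 6144 bytes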

Section 2: Distributed Vector Indexing

To perform similarity searches efficiently over billions of vectors, a brute-force (exact) search is computationally infeasible. HeliosDB will therefore support the creation of Approximate Nearest Neighbor (ANN) indexes, which enable rapid retrieval of highly similar vectors by trading a small amount of accuracy for orders-of-magnitude speed improvements.

2.1 Supported Index Types: HNSW and IVF

The database will implement two of the most effective and widely adopted ANN indexing algorithms, allowing users to choose the best fit for their specific workload and resource constraints.56 These indexes are built locally on each Storage Node, containing only the data present on that shard.

  • HNSW (Hierarchical Navigable Small World): A graph-based index that connects vectors as nodes in a multi-layered graph. Edges connect similar vectors, with long-range connections in the upper layers for fast traversal and short-range connections in the lower layers for precision.54 HNSW offers state-of-the-art performance in terms of query latency and recall (accuracy) but is memory-intensive as the graph structure must be held in memory for fast traversal.57
  • IVF (Inverted File): This algorithm first partitions the vector space into a set of clusters using an algorithm like k-means. Each vector is then assigned to its nearest cluster’s “inverted list”.56 A search query first identifies the closest clusters and then performs a more localized search only within those lists. IVF has a much smaller memory footprint and faster build times than HNSW, making it suitable for extremely large datasets where memory is a constraint, though it may have slightly higher query latency.58

The choice between HNSW and IVF represents a classic trade-off between performance, memory consumption, and build time. The following table provides a guideline for users.

Table: Vector Index Type Comparison

| Index Type | Build Time | Memory Usage | Query Latency | Recall (Accuracy) | Ideal Use Case |
| --- | --- | --- | --- | --- | --- |
| HNSW | High | High | Very Low | Very High | Real-time, low-latency applications (e.g., interactive search) |
| IVF | Low | Low | Low-Medium | Good-High | Very large datasets, batch analytics, memory-constrained environments |

2.2 Distributed Index Build Process

The creation and maintenance of vector indexes are handled as a distributed process, but the core build operation is local to each Storage Node.

  1. A user issues a CREATE VECTOR INDEX… command, which is received by a Compute Node.
  2. The Compute Node validates the command and broadcasts it to all Storage Nodes in the cluster.
  3. Each Storage Node independently initiates a local index build process. It scans its own shards and partitions for the target table and builds an HNSW graph or IVF structure containing only the vectors it is responsible for.
  4. The resulting index files are stored on the local disk of the Storage Node. For HNSW, the graph structure is also loaded into a dedicated memory region (the “Vector Pool”) for fast query-time access.58

2.3 Hybrid Queries: Filter-Aware Vector Search

The true power of integrating a vector database into a relational system lies in the ability to perform hybrid queries that combine semantic vector search with precise scalar filtering. A common example is an e-commerce query: “Find products similar to this image (embedding), but only show those in the 'electronics' category (category = 'electronics') and under $100 (price < 100)”.59

Executing such queries efficiently is a significant technical challenge. Naive approaches, such as pre-filtering (finding all products under $100 and then doing a vector search) or post-filtering (finding the top 10 similar products and then checking if they are under $100), are either extremely slow or produce inaccurate/incomplete results.61

HeliosDB will implement a state-of-the-art filter-aware HNSW traversal algorithm at the Storage Node level to solve this problem effectively.63

Mechanism:

  1. The Compute Node’s query plan will package both the vector search parameters (query vector, k) and the scalar predicates (category = 'electronics', price < 100) into the VectorSearchRequest message sent to the Storage Nodes.
  2. On the Storage Node, the engine first uses its standard B-Tree indexes on the category and price columns to rapidly identify the set of row IDs that match the scalar filters. This result is materialized as a highly efficient in-memory data structure, such as a bitmap or a roaring bitmap, which serves as an “allow-list”.64
  3. The HNSW search algorithm is then initiated. During graph traversal, for every candidate vector (node) that is considered, the algorithm performs an immediate check against the allow-list before computing the expensive distance calculation.64 If the candidate’s ID is not in the allow-list, it is discarded, and the traversal continues.
  4. A critical challenge in this approach is that large sections of the HNSW graph may be “filtered out,” creating disconnected “islands” and preventing the search from finding the true nearest neighbors. To mitigate this, the algorithm will incorporate advanced traversal strategies. If the immediate neighbors of a node are all filtered out, the search will be expanded to explore the neighbors-of-neighbors (a two-hop search) to “jump” across the filtered-out gaps in the graph, ensuring high recall is maintained.65

This tight integration of the scalar and vector indexing engines is the key feature that elevates HeliosDB from a relational database with a vector extension to a true hybrid system. It avoids forcing developers to implement complex, multi-stage query logic in their applications and delivers both performance and accuracy for sophisticated, real-world AI use cases.
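
The allow-list check can be illustrated with a simplified, single-layer best-first traversal; a real HNSW index is multi-layered, the two-hop expansion is omitted for brevity, and (unlike the real engine, which skips distance computation for disallowed candidates) this toy computes distances for all traversed nodes to order the frontier.

Python

import heapq

def filtered_search(graph, vectors, dist, query, entry, allow, k, ef=16):
    """Toy single-layer, filter-aware best-first search.

    graph:  node ID -> list of neighbor node IDs (the proximity graph)
    allow:  set of node IDs that passed the scalar predicates (the allow-list)
    """
    visited = {entry}
    frontier = [(dist(vectors[entry], query), entry)]  # min-heap ordered by distance
    results = []                                       # accepted (distance, node) pairs
    expansions = 0
    while frontier and expansions < ef:
        d, node = heapq.heappop(frontier)
        expansions += 1
        # The allow-list gates entry into the result set, but filtered-out nodes
        # are still traversed so the search can "jump" across filtered gaps.
        if node in allow:
            results.append((d, node))
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)
                heapq.heappush(frontier, (dist(vectors[nbr], query), nbr))
    return [node for _, node in sorted(results)[:k]]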


ingestion_and_crud.md

Section 1: Write-Optimized Storage Engine: Log-Structured Merge-tree (LSM-tree)

To meet the demands of modern applications that require high data ingestion rates for both transactional and analytical workloads, the storage engine on each HeliosDB Storage Node will be built upon a Log-Structured Merge-tree (LSM-tree) architecture. This design, proven in high-performance NoSQL databases like ScyllaDB and Apache Cassandra, prioritizes write throughput by converting random write I/O patterns into sequential ones.66

1.1 Architecture

The LSM-tree storage engine is composed of several key components that work together to manage the data lifecycle from ingestion to persistent storage.

Write Path:

  1. Commit Log (Write-Ahead Log - WAL): When a write operation (INSERT, UPDATE, DELETE) is received by a Storage Node, it is immediately and sequentially appended to a commit log on a durable storage device. This step ensures data durability; if the node crashes before the data is fully persisted, the commit log can be replayed to recover the write.69
  2. Memtable: After being written to the commit log, the data is inserted into an in-memory data structure called a Memtable. The Memtable is typically a sorted structure, such as a balanced binary tree or a skip list, which keeps the data ordered by its primary key in memory.71 Acknowledgment of the write is sent back to the client at this stage, resulting in extremely low write latency as no random disk I/O has occurred yet.
  3. SSTable (Sorted String Table): When the Memtable reaches a configured size threshold, it is frozen (made immutable), and a new, empty Memtable is created to handle incoming writes. The contents of the frozen Memtable are then flushed to disk as a new, immutable file called an SSTable. Because the data in the Memtable is already sorted, this flush operation is a highly efficient sequential write.67

Read Path:
A read request for a specific key must potentially consult multiple data structures to find the most recent version of the data:

  1. The engine first checks the active Memtable for the key.
  2. If not found, it checks any recently frozen (but not yet flushed) Memtables.
  3. Finally, it searches the SSTables on disk, starting from the most recently created ones and working its way to the oldest.
    To avoid costly disk I/O for keys that do not exist in an SSTable, each SSTable will be accompanied by a Bloom filter. A Bloom filter is a probabilistic, in-memory data structure that can definitively say if a key is not in the SSTable, allowing the engine to skip reading that file entirely.69

The choice of an LSM-tree architecture is a deliberate engineering trade-off governed by the principles of the RUM (Read, Update, Memory) conjecture. This conjecture posits that it is impossible to simultaneously optimize for low read amplification (the number of disk reads required for a point query), low write/update amplification (the number of times a piece of data is rewritten on disk during its lifetime), and low memory amplification (the amount of memory used per stored data item). By prioritizing write performance, the LSM-tree design accepts a higher degree of read amplification (as a read may need to check multiple SSTables) and write amplification (due to the background compaction process). HeliosDB’s design will focus on providing highly tunable compaction strategies to allow administrators to balance these trade-offs according to their specific workload, whether it is read-heavy, write-heavy, or space-constrained.66
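
The write and read paths can be condensed into a toy engine; the commit log is a plain append-only file, the Memtable a dict that is sorted at flush time, and flushed "SSTables" are in-memory lists standing in for immutable on-disk files. Bloom filters and compaction are omitted (a tombstone-aware compaction merge is sketched at the end of Section 2 below).

Python

class ToyLSMEngine:
    """Illustrative LSM write/read path: WAL append, Memtable insert, flush."""

    MEMTABLE_LIMIT = 4   # flush threshold (rows) for this toy example

    def __init__(self, wal_path: str):
        self.wal = open(wal_path, "a")
        self.memtable: dict[str, str] = {}
        self.sstables: list[list[tuple[str, str]]] = []   # newest last

    def put(self, key: str, value: str) -> None:
        # 1. Durability first: sequential append to the commit log.
        self.wal.write(f"{key}\t{value}\n")
        self.wal.flush()
        # 2. Insert into the in-memory Memtable; the write is now acknowledged.
        self.memtable[key] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self._flush()

    def _flush(self) -> None:
        # 3. Freeze the Memtable and write it out as a sorted, immutable SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key: str) -> str | None:
        if key in self.memtable:                 # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):    # then newest SSTable to oldest
            for k, v in table:
                if k == key:
                    return v
        return None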

1.2 Compaction

Over time, the write process creates many small SSTable files on disk. To manage this proliferation and optimize read performance, a background process called compaction periodically merges multiple SSTables into a single, new SSTable.67 During this process, duplicate data is resolved (only the newest version is kept), and data marked for deletion is physically removed, thereby reclaiming disk space.69

HeliosDB will implement multiple compaction strategies, such as:

  • Size-Tiered Compaction Strategy (STCS): Merges SSTables of similar size. This strategy is optimized for write-heavy workloads but can result in higher read amplification.
  • Leveled Compaction Strategy (LCS): Organizes data into levels of increasing size, ensuring less overlap between SSTables. This strategy offers lower, more predictable read latency at the cost of higher write amplification.

The choice of compaction strategy will be a configurable table property, allowing for fine-grained performance tuning.

Section 2: Handling Updates and Deletes with Tombstones

The immutability of SSTables is a core principle of the LSM-tree architecture. This means that UPDATE and DELETE operations cannot be performed in-place. Instead, they are handled as new write operations that create records indicating a change in state.

2.1 Mechanism

  • Updates: An UPDATE to an existing row is treated as an INSERT of a new version of that row. The new version is written to the Memtable with a more recent timestamp. During a read, the query engine will encounter both the old and new versions and will return the one with the latest timestamp.
  • Deletes: A DELETE operation does not immediately remove data from disk. Instead, it inserts a special marker, known as a tombstone, for the specified primary key.70 This tombstone, which also has a timestamp, effectively acts as a “negative” record, signaling that the key has been deleted. When a read query encounters a tombstone, it stops its search and returns that the row does not exist, even if older versions of the data exist in other SSTables.73

2.2 Compaction and Garbage Collection

The physical removal of old data versions and deleted data occurs during the compaction process. When the compaction process merges multiple SSTables:

  • It identifies all versions of a given key across the input SSTables.
  • It discards all but the most recent version based on timestamp.
  • If the most recent version is a tombstone, both the tombstone and all older data versions for that key are discarded from the new, merged SSTable.74 This is how disk space is eventually reclaimed.

2.3 gc_grace_seconds

In a distributed system with multiple replicas, tombstones play a crucial role in ensuring the consistency of deletions. If a tombstone were removed immediately, a node that was temporarily offline could miss the delete operation. Upon rejoining the cluster, it might re-introduce the “deleted” data to other nodes during a repair process, creating a “zombie” record.

To prevent this, tombstones are retained for a configurable period known as gc_grace_seconds (grace period for garbage collection), which defaults to a safe value like 10 days.73 A tombstone will not be garbage-collected during compaction until it is older than this period. This ensures that there is ample time for delete markers to be propagated to all replicas across the cluster, even in the presence of transient node failures or network partitions. The gc_grace_seconds setting must be configured to be longer than the regular cluster repair interval to guarantee delete consistency.73
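
Compaction's tombstone handling can be sketched as a merge that keeps only the newest version per key and drops tombstones only after the grace period; the record format and timestamp handling are illustrative assumptions.

Python

import time

GC_GRACE_SECONDS = 10 * 24 * 3600   # default 10-day grace period from the text

def compact(sstables, now=None):
    """Merge SSTables of (key, timestamp, value) records; value None is a tombstone."""
    if now is None:
        now = time.time()
    newest = {}
    for table in sstables:
        for key, ts, value in table:
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, value)        # keep only the latest version
    merged = []
    for key, (ts, value) in sorted(newest.items()):
        if value is None and now - ts > GC_GRACE_SECONDS:
            continue                             # expired tombstone: reclaim the space
        merged.append((key, ts, value))          # live row, or tombstone still in grace
    return merged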


implementation_plan.md

Section 1: Hybrid Language Development Model

The selection of programming languages for a complex system like a distributed database is a critical architectural decision that balances performance, safety, productivity, and ecosystem support. HeliosDB will adopt a hybrid language model, leveraging the strengths of different languages for different components of the system, specifically separating the high-performance data plane from the flexible control plane.

1.1 Core Engine (Data Plane): Rust

The core, performance-critical components of the HeliosDB system will be implemented in Rust. This includes the entirety of the software running on both Storage Nodes and Compute Nodes that lies in the direct path of data processing and query execution. Key components to be written in Rust include:

  • The HIDB network protocol server and client logic.
  • The query execution engine, including all physical operators (scans, joins, aggregations).
  • The LSM-tree storage engine.
  • The Hybrid Columnar Compression and decompression routines.
  • The vector indexing (HNSW, IVF) and search algorithms.

Rationale:
The primary reason for choosing Rust is its ability to provide performance on par with C++ while offering strong compile-time guarantees of memory and thread safety.75 In a highly concurrent, distributed system, issues like data races, null pointer dereferences, and buffer overflows are common sources of bugs and security vulnerabilities. Rust’s ownership and borrowing model eliminates these entire classes of errors at compile time, a feature often referred to as “fearless concurrency”.76 This allows developers to write highly parallel and efficient code with greater confidence, reducing debugging time and increasing system stability. Unlike languages with a garbage collector (like Java or Go), Rust does not have runtime pauses for garbage collection, which is critical for providing predictable, low-latency performance required by a high-performance database.76 While C++ offers similar performance, it lacks Rust’s built-in safety guarantees, making large-scale refactoring risky and requiring a higher level of developer discipline to avoid memory-related bugs.75

1.2 Management and Tooling (Control Plane): Python

The components responsible for cluster management, deployment, automation, and high-level APIs will be implemented in Python. This includes:

  • The command-line interface (CLI) for database administrators.
  • Deployment and orchestration scripts (e.g., for Kubernetes).
  • The Metadata Service’s management API endpoint.
  • Monitoring and metrics collection agents.

Rationale:
Python is the de facto language for operations, automation, and data science. Its vast ecosystem of libraries for tasks such as web frameworks (for APIs), infrastructure automation, and data analysis makes it exceptionally productive for building the control plane and operational tooling.78 The development speed and ease of use of Python are well-suited for components that are not in the critical, low-latency data path.

1.3 Interfacing between Rust and Python

The two language domains will interact through well-defined, standardized interfaces:

  • Control Plane to Data Plane: The Python-based management tools will communicate with the Rust-based database nodes via a gRPC API. The database nodes will expose a set of administrative endpoints for tasks like triggering a cluster rebalance, querying node status, or updating configuration.
  • Client Driver: The primary client driver for Python applications will be constructed using a hybrid approach. The core driver logic, which handles the HIDB protocol, connection pooling, and shard-aware routing, will be implemented as a Rust library. This Rust core will then be exposed as a native Python module using the PyO3 framework.80 This strategy combines the performance and safety of Rust for the network-intensive protocol handling with a familiar, idiomatic Python interface for the application developer.
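As an illustration of this binding layer, the fragment below sketches how a single routine from the Rust driver core could be exposed to Python with PyO3 (using the PyO3 0.21+ `Bound` API). The module and function names are placeholders, and the hash shown is a stand-in, not the cluster’s real algorithm.

```rust
use pyo3::prelude::*;

// Hypothetical routing helper from the Rust driver core. The hash below is a
// placeholder; the real driver would reuse the cluster's consistent hashing.
#[pyfunction]
fn token_for_key(key: &str) -> u64 {
    key.bytes()
        .fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(b as u64))
}

// Exposes the function as a native Python module, importable as
// `import heliosdb_core` (module name hypothetical).
#[pymodule]
fn heliosdb_core(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(token_for_key, m)?)?;
    Ok(())
}
```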

Section 2: Python Native Client Driver

The initial client driver provided for HeliosDB will target Python, reflecting its prevalence in modern application development and data engineering. The driver will be designed for ease of use, performance, and seamless integration into the Python ecosystem.

2.1 Compliance and API Design

To ensure a familiar and intuitive experience for developers, the driver will strictly adhere to the PEP 249 – Python Database API Specification v2.0.78 This standard defines the common interface for database access in Python, including objects and methods such as:

  • connect(): Establishes a connection to the database cluster.
  • connection.cursor(): Creates a cursor object to execute queries.
  • cursor.execute(sql, params): Executes a SQL statement, with support for parameterized queries to prevent SQL injection.82
  • cursor.fetchone(), cursor.fetchall(), cursor.fetchmany(): Methods for retrieving query results.83

By following this standard, the HeliosDB driver can serve as a near drop-in replacement for other PEP 249 drivers, simplifying integration with existing applications, Object-Relational Mappers (ORMs), and data frameworks.
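A typical interaction would look like any other PEP 249 driver. In the sketch below, the module name `heliosdb`, the connection parameters, and the choice of the `format` paramstyle (`%s` placeholders) are illustrative assumptions; the calls themselves are plain PEP 249.

```python
# Hypothetical usage; module name and connection parameters are assumptions.
import heliosdb

conn = heliosdb.connect(host="helios.example.com", user="app", password="secret")
try:
    cur = conn.cursor()
    # Parameterized query: the driver binds values, preventing SQL injection.
    cur.execute("SELECT name, balance FROM accounts WHERE customer_id = %s", (123,))
    for name, balance in cur.fetchall():
        print(name, balance)
finally:
    conn.close()
```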

2.2 Shard-Aware Routing

A key performance feature of the driver will be its shard-aware routing capability, inspired by the high-performance drivers of databases like ScyllaDB.84 This mechanism bypasses the need for a centralized load balancer or proxy for every query, reducing network hops and latency.

Mechanism:

  1. Topology Caching: Upon initial connection to the cluster, the driver will contact the Metadata Service to retrieve the complete sharding topology map. This map details which data ranges (based on the hash of the sharding key) reside on which Storage Nodes. The driver will cache this map locally.86
  2. Local Key Hashing: When a user executes a query that includes a predicate on the sharding key (e.g., WHERE customer_id = 123), the driver will not send the query to a random Compute Node. Instead, it will apply the same consistent hashing algorithm used by the cluster to the customer_id value (123) directly within the client process.
  3. Direct Connection: The resulting hash value allows the driver to look up the exact primary Storage Node (and its associated Compute Nodes) responsible for that data. The driver then routes the query directly to the correct node, as sketched after this list.
  4. Topology Updates: The driver will subscribe to the Metadata Service for topology updates, allowing it to refresh its local cache when the cluster scales or data rebalancing occurs.
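The sketch below illustrates steps 2 and 3 under stated assumptions: the MD5-based token function, the ring layout, and the node addresses are placeholders, since the production driver must apply the cluster’s own consistent hashing algorithm rather than the stand-in shown here.

```python
import bisect
import hashlib

def token(key: bytes) -> int:
    """Step 2: map a sharding-key value onto a 64-bit hash ring (placeholder hash)."""
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big")

class TopologyCache:
    """Step 1: the topology map fetched from the Metadata Service and cached locally."""

    def __init__(self, ring):
        # ring: sorted (token_upper_bound, node_address) pairs
        self.bounds = [t for t, _ in ring]
        self.nodes = [n for _, n in ring]

    def node_for(self, key: bytes) -> str:
        """Step 3: find the node owning this key's token, wrapping around the ring."""
        i = bisect.bisect_left(self.bounds, token(key)) % len(self.nodes)
        return self.nodes[i]

# Hypothetical three-node ring; addresses are illustrative.
topology = TopologyCache([(2**62, "node-a:9042"), (2**63, "node-b:9042"), (2**64, "node-c:9042")])
print(topology.node_for(b"123"))  # WHERE customer_id = 123 routes straight to its owner
```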

2.3 Asynchronous Support

To cater to modern, high-concurrency Python applications built on frameworks like asyncio, the driver will provide both a traditional synchronous API and a fully asynchronous API using Python’s async and await syntax.82 This will allow applications to handle thousands of concurrent database operations efficiently without being blocked by network I/O, which is essential for building scalable microservices and web applications.
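Usage might look like the following, assuming a hypothetical `heliosdb.aio` module that mirrors the synchronous PEP 249 surface with awaitable methods; all names here are illustrative, not a committed API.

```python
import asyncio
import heliosdb.aio  # hypothetical asyncio variant of the driver

async def fetch_name(customer_id: int):
    conn = await heliosdb.aio.connect(host="helios.example.com")
    try:
        cur = conn.cursor()
        await cur.execute(
            "SELECT name FROM customers WHERE customer_id = %s", (customer_id,)
        )
        return await cur.fetchone()
    finally:
        await conn.close()

async def main() -> None:
    # Thousands of such coroutines can run concurrently; three suffice here
    # to show the event loop overlapping the network round-trips.
    results = await asyncio.gather(fetch_name(1), fetch_name(2), fetch_name(3))
    print(results)

asyncio.run(main())
```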


summary_for_ai.md

This document provides a distilled, structured summary of the HeliosDB technical implementation guideline, intended for consumption by an AI code generation tool.

# HeliosDB Technical Specification Summary

## 1. Core Architecture

- **Model**: 2-tier decoupled compute-storage, shared-nothing.
- **Compute Nodes**:
  - **Role**: Stateless query processing.
  - **Responsibilities**: SQL parsing, optimization, distributed query coordination, client connection management.
  - **State**: No durable user data. Maintains local cache for hot data blocks.
- **Storage Nodes**:
  - **Role**: Stateful, intelligent data storage and processing.
  - **Responsibilities**: Durable storage (LSM-tree), predicate filtering, columnar processing, filtered vector search.
- **Inter-Node Communication**:
  - **Protocol**: RDMA over Converged Ethernet (RoCEv2).
  - **RPC Framework**: HeliosDB Intelligent Database (HIDB) Protocol over gRPC.
  - **Key Messages**: `PredicatePushdownRequest`, `FilteredResultSet`, `VectorSearchRequest`, `CacheInvalidationNotice`.
- **Metadata Service**:
  - **Architecture**: Dedicated cluster of Metadata Nodes.
  - **Consensus**: Raft protocol, implemented using the `etcd/raft` library.
  - **Managed State**: Sharding topology, schema, replica status, cluster configuration.

## 2. Data Organization

- **Sharding**:
  - **Strategy**: System-managed consistent hashing on a user-defined sharding key.
  - **Replication**: One synchronous mirror per primary shard.
  - **Quorum**: Witness-based quorum for automatic failover and split-brain prevention.
- **Partitioning (Intra-Shard)**:
  - **Purpose**: Local data decomposition for partition pruning.
  - **Strategies**: `RANGE`, `LIST`, `HASH`, `COMPOSITE`.
- **Hybrid Columnar Compression (HCC)**:
  - **Structure**: Compression Unit (CU) with a hybrid row-columnar layout.
  - **Algorithms**: LZ4, ZSTD with dictionary encoding.
  - **DML Handling**: Requires CU-level lock; best for read-heavy data.
  - **Modes**: `WAREHOUSE_OPTIMIZED`, `ARCHIVE_OPTIMIZED`.

## 3. Query Execution

- **Predicate Pushdown Engine**:
  - **Function**: Offloads `WHERE` clause and column projection to Storage Nodes.
  - **Execution**: Operates directly on raw or HCC-compressed data at the storage layer.
- **Online Aggregation Engine**:
  - **Feature**: Real-time, contention-free aggregate maintenance via `ONLINE AGGREGATE` objects.
  - **Mechanism**: Semantic Concurrency Control.
  - **Implementation**: Uses a `DELTA` column type to store commutative operations (e.g., `SUM(+1)`) as immutable “operation rows”, avoiding locking on aggregate value rows. Reads compute results on-the-fly by applying deltas to the last materialized snapshot.
- **Distributed Query Execution**:
  - **Parallelism**: Multi-shard, intra-shard (across partitions), and intra-node (multi-threaded operators).

## 4. Vector Integration

- **Data Type**: `VECTOR(n)` for n-dimensional vectors.
- **Storage**: Hybrid in-line/out-of-line based on PostgreSQL TOAST.
  - **Threshold**: `TOAST_TUPLE_THRESHOLD` (~2KB).
  - **Tunable Strategies**: `PLAIN` (force in-line), `EXTERNAL` (force out-of-line). Recommendation: `PLAIN` for performance-critical vector search.
- **Indexing**:
  - **Types**: `HNSW` (graph-based, in-memory, fast query) and `IVF` (cluster-based, disk-friendly, large scale).
  - **Build Process**: Distributed command, but local build on each Storage Node.
- **Filtered ANN Search**:
  - **Strategy**: Filter-aware HNSW traversal (joint-filtering).
  - **Mechanism**: Use scalar indexes to create an in-memory “allow-list” (bitmap). Check candidates against the allow-list during HNSW graph traversal *before* distance calculation. Use multi-hop traversal to escape filtered “islands” in the graph. (A sketch of the allow-list check follows this list.)
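In the illustrative spirit of the signatures in Section 7 below, a minimal sketch of the allow-list check; all type and function names here are hypothetical.

```rust
// Hypothetical types. The allow-list bitmap is built from scalar indexes;
// candidates failing the filter are rejected before the (comparatively
// expensive) distance computation.
struct AllowList {
    bitmap: Vec<u64>, // one bit per row id
}

impl AllowList {
    fn contains(&self, row_id: usize) -> bool {
        (self.bitmap[row_id / 64] >> (row_id % 64)) & 1 == 1
    }
}

// Returns None for filtered-out candidates, Some(distance) otherwise.
fn score_candidate(allow: &AllowList, row_id: usize, query: &[f32], vector: &[f32]) -> Option<f32> {
    if !allow.contains(row_id) {
        return None; // skipped before any distance work
    }
    // Squared Euclidean (L2) distance.
    Some(query.iter().zip(vector).map(|(a, b)| (a - b) * (a - b)).sum())
}
```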

## 5. Data Ingestion and CRUD

- **Storage Engine**: Log-Structured Merge-tree (LSM-tree).
- **Write Path**: `Commit Log` -> `Memtable` (in-memory) -> `SSTable` (on-disk, immutable).
- **Read Path**: Check `Memtable`, then `SSTables` (newest to oldest). Use Bloom filters to skip SSTable reads.
- **Updates and Deletes**:
  - **Mechanism**: Handled as new writes. Deletes are recorded as **tombstones**.
  - **Garbage Collection**: Old versions and tombstones are physically removed during background compaction.
  - **Consistency**: Tombstones are retained for `gc_grace_seconds` to ensure propagation in the distributed system.

## 6. Implementation Plan

- **Language Strategy**:
  - **Data Plane (Core Engine)**: Rust (for performance and memory/thread safety).
  - **Control Plane (Tooling)**: Python (for productivity and ecosystem).
- **Client API**:
  - **Language**: Python.
  - **Standard**: PEP 249 compliant.
  - **Features**: Shard-aware routing (local key hashing, direct connection), synchronous and asynchronous (`async`/`await`) APIs.
  - **Implementation**: Core logic in Rust, exposed to Python via PyO3.

## 7. Core Data Structures and Function Signatures (Illustrative)

```rust
// Storage Node - LSM Engine
struct LsmStorageEngine {
    memtable: Memtable,
    immutable_memtables: Vec<Memtable>,
    sstables: Vec<SSTableManager>,
    commit_log: CommitLog,
}

impl LsmStorageEngine {
    async fn write(&self, key: Key, value: Value) -> Result<(), Error>;
    async fn read(&self, key: Key) -> Result<Option<Value>, Error>;
}

// Storage Node - Predicate Filter
enum Predicate {
    Equals { col_id: u32, value: Literal },
    // … other predicate types
}

fn apply_predicates(data_block: &DataBlock, predicates: &[Predicate]) -> FilteredDataBlock;

// Compute Node - Query Planner
struct DistributedPlan {
    storage_node_requests: HashMap<NodeId, PredicatePushdownRequest>,
    // … other plan details
}

fn generate_plan(sql_query: &str) -> Result<DistributedPlan, Error>;
```

```protobuf
// HIDB Protocol (gRPC/Protobuf definition)
service HeliosDB {
  rpc ExecutePredicatePushdown(PredicatePushdownRequest) returns (stream FilteredResultSet);
  rpc ExecuteVectorSearch(VectorSearchRequest) returns (stream VectorSearchResult);
}

message PredicatePushdownRequest {
  uint64 table_id = 1;
  repeated uint32 partition_ids = 2;
  repeated Predicate predicates = 3;
  repeated uint32 projection_columns = 4;
}
```

Works cited

  1. Exadata - Database Platform - Oracle, accessed October 10, 2025, https://www.oracle.com/engineered-systems/exadata/
  2. Oracle Exadata: Architecture and Internals Technical Deep Dive, accessed October 10, 2025, https://www.oracle.com/technetwork/database/exadata/trn4113-exadatadeepdive-5187039.pdf
  3. How Snowflake Architecture Enhances Performance and Flexibility - DBSync, accessed October 10, 2025, https://www.mydbsync.com/blogs/snowflake-architecture
  4. Explain Snowflake’s decoupling of storage and compute for scalability. | by Amar V | Medium, accessed October 10, 2025, https://medium.com/@amarficusoft/explain-snowflakes-decoupling-of-storage-and-compute-for-scalability-fddec7da5b10
  5. Must-Attend Exadata Sessions at Oracle AI World’s Data Deep Dive, accessed October 10, 2025, https://blogs.oracle.com/exadata/post/exadata-at-oracle-ai-world-2025
  6. Snowflake Architecture for Fast Analytics - Simplyblock, accessed October 10, 2025, https://www.simplyblock.io/glossary/what-is-snowflake/
  7. Key Concepts & Architecture | Snowflake Documentation, accessed October 10, 2025, https://docs.snowflake.com/en/user-guide/intro-key-concepts
  8. Exadata Deep Dive - Oracle, accessed October 10, 2025, https://www.oracle.com/technetwork/database/availability/exadata-deep-dive-3961445.pdf
  9. Separation of Storage & Compute in Snowflake [closed] - Stack Overflow, accessed October 10, 2025, https://stackoverflow.com/questions/78192350/separation-of-storage-compute-in-snowflake
  10. Compute-Storage Separation Explained, accessed October 10, 2025, https://www.thenile.dev/blog/storage-compute-separation
  11. Exadata Smart Scan - Database - Oracle, accessed October 10, 2025, https://www.oracle.com/database/technologies/exadata/software/smartscan/
  12. The Snowflake Elastic Data Warehouse, accessed October 10, 2025, https://www.cs.cmu.edu/~15721-f24/papers/Snowflake.pdf
  13. Smart Scan | flashdba, accessed October 10, 2025, https://flashdba.com/history-of-exadata/smart-scan/
  14. Oracle Sharding Architecture, accessed October 10, 2025, https://docs.oracle.com/en/database/oracle/oracle-database/12.2/cncpt/oracle-sharding-architecture.html
  15. Understanding TiDB’s Raft Consensus for Distributed Databases, accessed October 10, 2025, https://www.pingcap.com/article/understanding-tidbs-raft-consensus-for-distributed-databases/
  16. Exploring TiDB Architecture for Big Data Success, accessed October 10, 2025, https://www.pingcap.com/article/exploring-tidb-architecture-for-big-data-success/
  17. Replication Layer - CockroachDB, accessed October 10, 2025, https://www.cockroachlabs.com/docs/stable/architecture/replication-layer
  18. How is ETCD a highly available system, even though it uses Raft which is a CP algorithm?, accessed October 10, 2025, https://codemia.io/knowledge-hub/path/how_is_etcd_a_highly_available_system_even_though_it_uses_raft_which_is_a_cp_algorithm
  19. Categorizing How Distributed Databases Utilize Consensus Algorithms | by Adam Prout, accessed October 10, 2025, https://medium.com/@adamprout/categorizing-how-distributed-databases-utilize-consensus-algorithms-492c8ff9e916
  20. Understanding Raft Algorithm: Consensus and Leader Election Explained - Medium, accessed October 10, 2025, https://medium.com/@jitenderkmr/understanding-raft-algorithm-consensus-and-leader-election-explained-faadf28fd047
  21. etcd-io/raft: Raft library for maintaining a replicated state machine - GitHub, accessed October 10, 2025, https://github.com/etcd-io/raft
  22. Understanding etcd’s Raft Implementation: A Deep Dive into Raft Log - DEV Community, accessed October 10, 2025, https://dev.to/justlorain/understanding-etcds-raft-implementation-a-deep-dive-into-raft-log-bdn
  23. Building Distributed Key Value Database using Raft part II | by AKSHAY MOHITE | Medium, accessed October 10, 2025, https://medium.com/@mohiteakshay2020/building-distributed-key-value-database-using-raft-part-ii-c5ec6efa35c7
  24. RAFT based Key-Value Store with Transaction Support - Stanford Secure Computer Systems Group, accessed October 10, 2025, http://www.scs.stanford.edu/20sp-cs244b/projects/RAFT%20based%20Key-Value%20Store%20with%20Transaction%20Support.pdf
  25. In Search of an Understandable Consensus Algorithm (Extended Version), accessed October 10, 2025, https://raft.github.io/raft.pdf
  26. Database Sharding: Concepts & Examples - MongoDB, accessed October 10, 2025, https://www.mongodb.com/resources/products/capabilities/database-sharding-explained
  27. Oracle Sharding Linear Scalability, Fault Isolation and Geo-distribution for Web-scale OLTP Applications, accessed October 10, 2025, https://www.oracle.com/a/tech/docs/sharding-wp-12c.pdf
  28. What is Database Sharding? - Shard DB Explained - AWS - Updated 2025, accessed October 10, 2025, https://aws.amazon.com/what-is/database-sharding/
  29. Database Mirroring (SQL Server) - Microsoft Learn, accessed October 10, 2025, https://learn.microsoft.com/en-us/sql/database-engine/database-mirroring/database-mirroring-sql-server?view=sql-server-ver17
  30. 6 Partitions, Views, and Other Schema Objects - Oracle Help Center, accessed October 10, 2025, https://docs.oracle.com/en/database/oracle/oracle-database/21/cncpt/partitions-views-and-other-schema-objects.html
  31. Sharding vs. partitioning: What’s the difference? - PlanetScale, accessed October 10, 2025, https://planetscale.com/blog/sharding-vs-partitioning-whats-the-difference
  32. 3 Physical Organization of a Sharded Database - Oracle Help Center, accessed October 10, 2025, https://docs.oracle.com/en/database/oracle/oracle-database/18/shard/sharding-physical-organization.html
  33. Get the best out of Oracle Partitioning, accessed October 10, 2025, https://www.oracle.com/a/tech/docs/technical-resources/partitioning-guide-v20.pdf
  34. Partitioning Overview - Oracle, accessed October 10, 2025, https://www.oracle.com/database/technologies/partitioning.html
  35. Oracle Hybrid Columnar Compression, accessed October 10, 2025, https://www.oracle.com/a/ocom/docs/ehcc-twp-131254.pdf
  36. Oracle Hybrid Columnar Compression Overview, accessed October 10, 2025, https://www.oracle.com/a/ocom/docs/database/hybrid-columnar-compression-brief.pdf
  37. Exadata Hybrid Columnar Compression - Oracle Help Center, accessed October 10, 2025, https://docs.oracle.com/en/engineered-systems/exadata-database-machine/sagug/exadata-hybrid-columnar-compression.html
  38. Implementing Hybrid Columnar Compression on the Oracle ZFS Storage Appliance, accessed October 10, 2025, https://www.oracle.com/technetwork/server-storage/sun-unified-storage/documentation/problemsolver-hcc-52014-2202692.pdf
  39. Building Columnar Compression for Large PostgreSQL Databases - TigerData, accessed October 10, 2025, https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database
  40. Oracle Hybrid Columnar Compression, accessed October 10, 2025, https://www.oracle.com/technetwork/database/exadata/ehcc-twp-131254.pdf
  41. A Technical Overview of the Oracle Exadata Database Machine and Exadata Storage Server, accessed October 10, 2025, http://www.oracle.com/technetwork/server-storage/engineered-systems/exadata/dbmachine-x3-twp-1867467.pdf
  42. Exadata Smart Scan: A Quick Overview - Pythian, accessed October 10, 2025, https://www.pythian.com/blog/technical-track/exadata-smart-scan-overview
  43. LeanXcale | HPE, accessed October 10, 2025, https://www.hpe.com/us/en/software/marketplace/leanxcale.html
  44. Online Aggregations in LeanXcale – LeanXcale, accessed October 10, 2025, https://blog.leanxcale.com/hands-on/online-aggregations-in-leanxcale/
  45. Data Definition Language (DDL) Syntax - LeanXcale Documentation, accessed October 10, 2025, https://docs.leanxcale.com/leanxcale/v2.5/sql_reference/index.html
  46. On LeanXcale database. Interview with Patrick Valduriez and Ricardo Jimenez-Peris, accessed October 10, 2025, https://www.odbms.org/blog/2019/07/on-leanxcale-database-interview-with-patrick-valduriez-and-ricardo-jimenez-peris/
  47. KPI Calculation: LeanXcale Online Aggregates – LeanXcale, accessed October 10, 2025, https://blog.leanxcale.com/database-for-monitoring/kpi-calculation-leanxcale-online-aggregates/
  48. LeanXcale in Agritech - AWS, accessed October 10, 2025, https://doc-html.s3-eu-west-1.amazonaws.com/doc/resources/LeanXcale+in+Agritech.pdf
  49. What Is A Vector Database And How Does It Work? - Yugabyte, accessed October 10, 2025, https://www.yugabyte.com/blog/what-is-a-vector-database/
  50. Understanding pgvector: Optimizing Your Vector Database - EDB, accessed October 10, 2025, https://www.enterprisedb.com/blog/what-is-pgvector
  51. Documentation: 18: 66.2. TOAST - PostgreSQL, accessed October 10, 2025, https://www.postgresql.org/docs/current/storage-toast.html
  52. Postgres TOAST: The Greatest Thing Since Sliced Bread? | Crunchy Data Blog, accessed October 10, 2025, https://www.crunchydata.com/blog/postgres-toast-the-greatest-thing-since-sliced-bread
  53. What Is TOAST (and Why It Isn’t Enough for Data Compression in Postgres) | TigerData, accessed October 10, 2025, https://www.tigerdata.com/blog/what-is-toast-and-why-it-isnt-enough-for-data-compression-in-postgres
  54. Hierarchical Navigable Small Worlds (HNSW) - Pinecone, accessed October 10, 2025, https://www.pinecone.io/learn/series/faiss/hnsw/
  55. Best practices for using pgvector, accessed October 10, 2025, https://postgresql.us/events/pgconfnyc2024/sessions/session/1862/slides/172/pgvector_best_practices_pgconfnyc2024.pdf
  56. How does indexing work in a vector DB (IVF, HNSW, PQ, etc.)? - Milvus, accessed October 10, 2025, https://milvus.io/ai-quick-reference/how-does-indexing-work-in-a-vector-db-ivf-hnsw-pq-etc
  57. HNSW vs Inverted Index/IVF: which is a better ANN algorithm? - Vectroid Resources, accessed October 10, 2025, https://www.vectroid.com/resources/hnsw-vs-inverted-indexivf-which-is-a-better-ann-algorithm
  58. Using HNSW Vector Indexes in AI Vector Search - Oracle Blogs, accessed October 10, 2025, https://blogs.oracle.com/database/post/using-hnsw-vector-indexes-in-ai-vector-search
  59. Filtered Vector Search Techniques - Emergent Mind, accessed October 10, 2025, https://www.emergentmind.com/topics/filtered-vector-search
  60. Filtered Vector Search: State-of-the-art and Research Opportunities - VLDB Endowment, accessed October 10, 2025, https://www.vldb.org/pvldb/vol18/p5488-caminal.pdf
  61. Optimizing Filtered Vector Search in MyScale - Medium, accessed October 10, 2025, https://medium.com/@myscale/optimizing-filtered-vector-search-in-myscale-77675aaa849c
  62. The Missing WHERE Clause in Vector Search - Pinecone, accessed October 10, 2025, https://www.pinecone.io/learn/vector-search-filtering/
  63. Attribute Filtering in Approximate Nearest Neighbor Search: An In-depth Experimental Study, accessed October 10, 2025, https://arxiv.org/html/2508.16263v1
  64. Filtering - Weaviate Documentation, accessed October 10, 2025, https://docs.weaviate.io/weaviate/concepts/filtering
  65. Filtered HNSW search, fast mode - Elasticsearch Labs, accessed October 10, 2025, https://www.elastic.co/search-labs/blog/filtered-hnsw-knn-search
  66. Modernizing Write-Heavy Scenarios in Distributed Systems: LSM Trees and Scylla DB, accessed October 10, 2025, https://www.researchgate.net/publication/395101519_Modernizing_Write-Heavy_Scenarios_in_Distributed_Systems_LSM_Trees_and_Scylla_DB
  67. What is a Log Structured Merge Tree? Definition & FAQs | ScyllaDB, accessed October 10, 2025, https://www.scylladb.com/glossary/log-structured-merge-tree/
  68. Real-Time Write Heavy Database Workloads: Considerations & Tips - ScyllaDB, accessed October 10, 2025, https://www.scylladb.com/2025/02/04/real-time-write-heavy-workloads-considerations-tips/
  69. What Is A Log-Structured Merge-tree (LSM-tree)? - ITU Online IT Training, accessed October 10, 2025, https://www.ituonline.com/tech-definitions/what-is-a-log-structured-merge-tree-lsm-tree/
  70. Building an LSM-Tree Storage Engine from Scratch - DEV Community, accessed October 10, 2025, https://dev.to/justlorain/building-an-lsm-tree-storage-engine-from-scratch-3eom
  71. How NoSQL Databases Speed-Up Write-Heavy Workloads - Medium, accessed October 10, 2025, https://medium.com/@_sidharth_m_/how-nosql-databases-speed-up-write-heavy-workloads-49c41dceb849
  72. Lethe: Enabling Efficient Deletes in LSMs - Research - Projects, accessed October 10, 2025, https://disc-projects.bu.edu/lethe/
  73. Tombstones - ScyllaDB University, accessed October 10, 2025, https://university.scylladb.com/courses/scylla-operations/lessons/scylla-manager-repair-and-tombstones/topic/tombstones/
  74. Part 1: The Hidden Cost of Deletion – How Tombstones Work in ScyllaDB - Datanised, accessed October 10, 2025, https://datanised.com/2025/07/01/part-1-the-hidden-cost-of-deletion-how-tombstones-work-in-scylladb/
  75. Rust vs C++: A Real-World Perspective, accessed October 10, 2025, https://corrode.dev/blog/cpp-rust-interop/
  76. Any main reasons/points to choose rust over c++ - help - The Rust Programming Language Forum, accessed October 10, 2025, https://users.rust-lang.org/t/any-main-reasons-points-to-choose-rust-over-c/114323
  77. We still chose C++ (instead of Rust) for new database development - Hacker News, accessed October 10, 2025, https://news.ycombinator.com/item?id=45389744
  78. Databricks SQL Connector for Python, accessed October 10, 2025, https://docs.databricks.com/aws/en/dev-tools/python-sql-connector
  79. Connect to and query Azure SQL Database using Python and the pyodbc driver, accessed October 10, 2025, https://learn.microsoft.com/en-us/azure/azure-sql/database/azure-sql-python-quickstart?view=azuresql
  80. How to Mix Rust and Python in Your Project | by MatthieuL - Medium, accessed October 10, 2025, https://medium.com/@MatthieuL49/a-mixed-rust-python-project-24491e2af424
  81. Combining Rust and Python: The Best of Both Worlds? - YouTube, accessed October 10, 2025, https://www.youtube.com/watch?v=lyG6AKzu4ew
  82. Using the Python Connector | Snowflake Documentation, accessed October 10, 2025, https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-example
  83. sqlite3 — DB-API 2.0 interface for SQLite databases — Python 3.14.0 documentation, accessed October 10, 2025, https://docs.python.org/3/library/sqlite3.html
  84. Inside ScyllaDB Rust Driver 1.0: A Fully Async Shard-Aware CQL Driver Using Tokio, accessed October 10, 2025, https://www.scylladb.com/2025/03/31/inside-scylladb-rust-driver-1-0/
  85. A Shard-Aware ScyllaDB C/C++ Driver - ScyllaDB, accessed October 10, 2025, https://www.scylladb.com/2021/03/18/a-shard-aware-scylla-c-c-driver/
  86. Using Oracle Sharding, accessed October 10, 2025, https://docs.oracle.com/en/database/oracle/oracle-database/21/shard/developing-applications.html