# HeliosDB Metadata Service
Raft-based distributed metadata consensus service for HeliosDB.
## Overview
The metadata service is the brain of the HeliosDB cluster, responsible for:
- **Schema Management**: DDL operations (CREATE/DROP TABLE, schema registry)
- **Sharding Topology**: Mapping of data ranges to storage nodes
- **Node Registration**: Health monitoring of compute/storage nodes
- **Cache Invalidation**: Broadcasting metadata updates to compute nodes
## Architecture
### Raft Consensus
The metadata service uses the Raft consensus algorithm to ensure:
- **Consistency**: All metadata nodes agree on the same state
- **Fault Tolerance**: Survives minority node failures (typical deployments use 3 or 5 nodes)
- **Leader Election**: Automatic failover when the leader fails
- **Log Replication**: Changes are replicated before being applied
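The fault-tolerance bound falls out of Raft's majority-quorum rule, which is why 3- and 5-node deployments are the norm: a cluster of n nodes commits an entry once ⌊n/2⌋ + 1 nodes have replicated it, so it tolerates ⌊(n-1)/2⌋ failures. A quick illustration:

```rust
/// Nodes that must acknowledge an entry before it commits (majority quorum).
fn quorum(cluster_size: usize) -> usize {
    cluster_size / 2 + 1
}

/// Node failures the cluster survives while still able to reach quorum.
fn tolerated_failures(cluster_size: usize) -> usize {
    (cluster_size - 1) / 2
}

fn main() {
    assert_eq!((quorum(3), tolerated_failures(3)), (2, 1)); // 3 nodes: lose 1
    assert_eq!((quorum(5), tolerated_failures(5)), (3, 2)); // 5 nodes: lose 2
    assert_eq!((quorum(4), tolerated_failures(4)), (3, 1)); // even sizes add cost, not tolerance
}
```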
### Components
- **Raft Storage Layer** (`raft_storage.rs`)
  - RocksDB-backed persistent storage for the Raft log
  - Handles log entries, hard state, conf state, and snapshots
  - In-memory cache for fast access
- **State Machine** (`state_machine.rs`)
  - Applies committed Raft commands
  - Manages topology and schema state
  - Supports snapshot creation/restoration
- **gRPC Server** (`metadata_server.rs`)
  - Implements the `MetadataService` gRPC API
  - Routes requests to the Raft leader
  - Broadcasts updates to subscribers
- **Client Library** (`client.rs`)
  - Local cache with automatic invalidation
  - Subscribes to metadata updates
  - Shard-aware routing support
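These components compose into a single pipeline: the gRPC server turns requests into commands, Raft replicates them, and the state machine applies them once committed. A minimal sketch of the apply step, with an illustrative command enum and a simplified stand-in for the real topology/schema state (none of these names come from the actual code):

```rust
use std::collections::HashMap;

/// Illustrative command set; the real service defines its own variants.
enum MetadataCommand {
    CreateTable { table_id: u64, name: String },
    DropTable { table_id: u64 },
}

/// Simplified stand-in for ClusterTopology + SchemaRegistry.
#[derive(Default)]
struct StateMachine {
    tables: HashMap<u64, String>,
    version: u64,
}

impl StateMachine {
    /// Invoked only for entries Raft has already committed, so every
    /// replica applies the same commands in the same order.
    fn apply(&mut self, cmd: MetadataCommand) {
        match cmd {
            MetadataCommand::CreateTable { table_id, name } => {
                self.tables.insert(table_id, name);
            }
            MetadataCommand::DropTable { table_id } => {
                self.tables.remove(&table_id);
            }
        }
        self.version += 1; // monotonic version, used for client cache checks
    }
}

fn main() {
    let mut sm = StateMachine::default();
    sm.apply(MetadataCommand::CreateTable { table_id: 1, name: "users".into() });
    sm.apply(MetadataCommand::DropTable { table_id: 1 });
    assert_eq!(sm.version, 2);
}
```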
## Features
### 1. Table Management
```rust
// Create table
let mut table = TableSchema::new(1, "users".to_string());
table.add_column(ColumnSchema {
    column_id: 1,
    name: "id".to_string(),
    data_type: DataType::Int64,
    nullable: false,
    toast_strategy: None,
});

let table_id = service.create_table(table).await?;

// Drop table
service.drop_table(table_id).await?;
```

### 2. Topology Management
```rust
// Update cluster topology
let mut topology = service.get_topology();

// Add storage node
topology.add_node(NodeInfo {
    node_id: "storage_1".to_string(),
    addr: "127.0.0.1:6000".to_string(),
    node_type: NodeType::Storage,
    status: NodeStatus::Online,
});

// Add shard
topology.add_shard(ShardInfo {
    shard_id: 1,
    primary_node: "storage_1".to_string(),
    mirror_node: Some("storage_2".to_string()),
    witness_node: None,
    key_range: (vec![0x00], vec![0xFF]),
});

service.update_topology(topology).await?;
```

### 3. Cache Invalidation
The service broadcasts `MetadataUpdate` events via gRPC streaming:
```rust
// Client subscribes to updates
let mut client = MetadataClient::connect("http://127.0.0.1:9000").await?;
client.subscribe_updates().await?;

// Automatic cache invalidation when topology/schema changes
let topology = client.get_topology().await?; // Uses cache if valid
```
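Under the hood, a subscriber task of roughly this shape keeps the local cache coherent. The event type and cache layout below are illustrative (the real `MetadataUpdate` is a protobuf message), with a tokio channel standing in for the gRPC stream:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use tokio::sync::mpsc;

/// Illustrative stand-in for the protobuf MetadataUpdate message.
enum Update {
    SchemaChanged { table_id: u64 },
    TopologyChanged { version: u64 },
}

/// Drains the update stream and invalidates stale cache entries so the
/// next read falls through to a fresh fetch from a metadata node.
async fn invalidation_loop(
    mut updates: mpsc::Receiver<Update>,
    schema_cache: Arc<RwLock<HashMap<u64, Vec<u8>>>>,
    topology_version: Arc<RwLock<u64>>,
) {
    while let Some(update) = updates.recv().await {
        match update {
            Update::SchemaChanged { table_id } => {
                // Drop just the affected table's cached schema.
                schema_cache.write().unwrap().remove(&table_id);
            }
            Update::TopologyChanged { version } => {
                // Record the new version; readers compare against it.
                *topology_version.write().unwrap() = version;
            }
        }
    }
}
```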
## Running the Service

### Single Node (Development)
```bash
# Start metadata server
cargo run --example metadata_server -- 1

# In another terminal, run the client
cargo run --example metadata_client -- http://127.0.0.1:9000
```

### Multi-Node Cluster (Production)
```bash
# Node 1 (Leader)
cargo run --example metadata_server -- 1

# Node 2 (Follower)
cargo run --example metadata_server -- 2

# Node 3 (Follower)
cargo run --example metadata_server -- 3
```

## Configuration
```rust
let config = MetadataConfig {
    node_id: 1,                                  // Unique node ID
    raft_addr: "127.0.0.1:5000".to_string(),
    storage_path: "./metadata_data".to_string(),
    peers: vec![2, 3],                           // Other Raft nodes
    tick_interval: Duration::from_millis(100),
};
```
## gRPC API

### Table Operations
```protobuf
rpc CreateTable(CreateTableRequest) returns (CreateTableResponse);
rpc DropTable(DropTableRequest) returns (DropTableResponse);
rpc GetTable(GetTableRequest) returns (GetTableResponse);
rpc ListTables(ListTablesRequest) returns (ListTablesResponse);
```

### Topology Operations
```protobuf
rpc GetTopology(GetTopologyRequest) returns (Topology);
rpc UpdateShardMap(UpdateShardMapRequest) returns (UpdateShardMapResponse);
```

### Node Management
```protobuf
rpc RegisterNode(RegisterNodeRequest) returns (RegisterNodeResponse);
rpc UnregisterNode(UnregisterNodeRequest) returns (UnregisterNodeResponse);
rpc Heartbeat(HeartbeatRequest) returns (HeartbeatResponse);
```

### Subscription
```protobuf
rpc Subscribe(SubscribeRequest) returns (stream MetadataUpdate);
```

## Data Types Supported
- Integers: INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64
- Floating Point: FLOAT32, FLOAT64
- Boolean
- String, Bytes
- Timestamp, Date
- Decimal(precision, scale)
- Vector(dimensions) - For AI/ML embeddings
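For example, an embeddings column can be added to the `table` from the Table Management example above. The `DataType::Vector` constructor taking the dimension count is assumed from the list entry above:

```rust
// Hypothetical embeddings column (dimension count per the Vector entry above).
table.add_column(ColumnSchema {
    column_id: 2,
    name: "embedding".to_string(),
    data_type: DataType::Vector(768), // 768-dimensional embedding
    nullable: true,
    toast_strategy: None, // see the TOAST Strategy section below
});
```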
## TOAST Strategy
For large data types (especially vectors), TOAST (The Oversized-Attribute Storage Technique) controls storage:
- **PLAIN**: Always inline (fastest access)
- **EXTENDED**: Compress and move out-of-line if too large
- **EXTERNAL**: Always out-of-line, compressed
- **MAIN**: Prefer inline, move to external if needed
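For instance, a large embedding column would typically be marked EXTERNAL so the row page stays small; the `ToastStrategy::External` spelling is an assumption mirroring the names above:

```rust
// Same shape as the column example above, but stored out-of-line.
table.add_column(ColumnSchema {
    column_id: 3,
    name: "doc_embedding".to_string(),
    data_type: DataType::Vector(1536),
    nullable: true,
    // EXTERNAL: always out-of-line and compressed, so scans of the
    // other columns never page in the large vector payload.
    toast_strategy: Some(ToastStrategy::External),
});
```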
## Fault Tolerance
### Leader Election
- When the leader fails, followers detect it via heartbeat timeout
- A new leader is elected via Raft's voting mechanism
- Typical election time: 1-2 seconds
### Write Path (DDL Operations)
1. Client sends a request to any metadata node
2. A follower forwards the request to the leader
3. The leader proposes the command to the Raft cluster
4. The leader waits for a quorum (a majority of nodes) to replicate it
5. The command is applied to the state machine and success is returned
6. The update is broadcast to all subscribers
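On the leader, steps 3-5 reduce to "propose, wait for commit, respond". A hedged sketch of that shape, with the propose hook and result type invented for illustration (the service's actual interface may differ):

```rust
use tokio::sync::oneshot;

/// Illustrative outcome of a committed-and-applied command.
struct ApplyResult {
    version: u64,
}

/// Leader-side write path: the response is held back until a quorum has
/// replicated the entry and the state machine has applied it.
async fn handle_ddl(
    command: Vec<u8>,
    propose: impl Fn(Vec<u8>, oneshot::Sender<ApplyResult>),
) -> Result<ApplyResult, &'static str> {
    let (done_tx, done_rx) = oneshot::channel();
    propose(command, done_tx); // step 3: hand the command to the Raft group
    // Steps 4-5: the sender fires only after quorum replication + apply.
    done_rx.await.map_err(|_| "proposal dropped (leadership lost?)")
}
```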
### Read Path (Schema/Topology Queries)
- Reads can be served by any node (eventually consistent)
- Or by the leader only (strongly consistent)
- Clients cache results and subscribe to updates
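Client-side, the cached mode hinges on the monotonic metadata version (see Implementation Details): serve locally while the cached version is current, refetch otherwise. A rough sketch with illustrative names:

```rust
/// Illustrative topology cache keyed by the monotonic metadata version.
struct TopologyCache {
    version: u64,
    topology: Vec<u8>, // placeholder for a decoded ClusterTopology
}

impl TopologyCache {
    /// Serve from cache only if no subscription event has announced a
    /// newer version; otherwise the caller refetches from a metadata node.
    fn get(&self, latest_announced: u64) -> Option<&[u8]> {
        (self.version >= latest_announced).then_some(self.topology.as_slice())
    }
}
```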
## Performance Considerations
### Write Throughput
- DDL operations are serialized through the Raft leader (~1,000-5,000 ops/sec)
- Batching is recommended for bulk schema changes
### Read Throughput
- Effectively unbounded: reads are served from the local state machine
- No Raft consensus round-trip is required for reads
### Network
- gRPC over TCP (can be configured for TLS)
- Protobuf binary serialization
- Supports HTTP/2 multiplexing
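Assuming the transport is tonic (the usual Rust gRPC stack, though the source doesn't name it), client-side TLS looks roughly like this; the endpoint and domain name are placeholders:

```rust
use tonic::transport::{Channel, ClientTlsConfig};

async fn connect_tls() -> Result<Channel, tonic::transport::Error> {
    let tls = ClientTlsConfig::new().domain_name("metadata.helios.internal");
    Channel::from_static("https://127.0.0.1:9000") // placeholder endpoint
        .tls_config(tls)?
        .connect()
        .await
}
```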
## Testing
```bash
# Run unit tests
cargo test

# Run integration tests
cargo test --test integration_test

# Run with logging
RUST_LOG=debug cargo test -- --nocapture
```

## Monitoring
Key metrics to monitor:
- Raft leader status
- Proposal latency
- Log compaction size
- Snapshot creation time
- Number of subscribers
- Cache hit rate (clients)
## Implementation Details
### Raft Log Storage
- **Storage Engine**: RocksDB
- **Log Entry Format**: Bincode serialization
- **Compaction**: Automatic via snapshots
- **Key Prefixes**:
  - `entry_{index}` - Log entries
  - `raft_state` - Hard state
  - `conf_state` - Configuration state
  - `snapshot` - Latest snapshot
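A hedged sketch of reading and writing log entries under that layout with the rocksdb and bincode crates. The entry struct is simplified, and the zero-padded index is an illustrative choice (it makes lexicographic key order match numeric order; the actual on-disk format may differ):

```rust
use rocksdb::DB;

/// Simplified log entry; the real service stores Raft entries.
#[derive(serde::Serialize, serde::Deserialize)]
struct LogEntry {
    term: u64,
    index: u64,
    data: Vec<u8>,
}

fn append_entry(db: &DB, entry: &LogEntry) -> Result<(), Box<dyn std::error::Error>> {
    // Keys follow the entry_{index} prefix; padding is illustrative.
    let key = format!("entry_{:020}", entry.index);
    db.put(key.as_bytes(), bincode::serialize(entry)?)?;
    Ok(())
}

fn read_entry(db: &DB, index: u64) -> Result<Option<LogEntry>, Box<dyn std::error::Error>> {
    let key = format!("entry_{:020}", index);
    Ok(match db.get(key.as_bytes())? {
        Some(bytes) => Some(bincode::deserialize(&bytes)?),
        None => None,
    })
}
```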
### State Machine
- **Format**: In-memory (`ClusterTopology` + `SchemaRegistry`)
- **Snapshot**: Full state serialization
- **Version**: Monotonically increasing counter
### Concurrency
- `parking_lot::RwLock` for state machine access
- `parking_lot::Mutex` for Raft node access
- Lock-free reads for topology/schema queries
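Concretely, queries take the cheap shared side of the lock while the Raft apply path briefly takes the exclusive side; a minimal parking_lot sketch:

```rust
use parking_lot::RwLock;
use std::sync::Arc;

struct State {
    version: u64,
}

fn main() {
    let state = Arc::new(RwLock::new(State { version: 0 }));

    // Read path: shared lock, many readers proceed concurrently.
    let seen = state.read().version;

    // Apply path: exclusive lock, held only for the in-memory mutation.
    state.write().version = seen + 1;

    assert_eq!(state.read().version, 1);
}
```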
## Future Enhancements
- **Multi-Raft**: Separate Raft groups for topology and schema
- **Read Replicas**: Non-voting nodes for read scaling
- **Partition Spec**: Full support for range/list/hash partitioning
- **Index Metadata**: Track secondary indexes, vector indexes
- **Query Statistics**: Store table/column statistics for the optimizer
- **Access Control**: Table-level permissions
- **Change Data Capture**: Stream metadata changes