F6.21 Tenant Replication Architecture

System Architecture & Design Document

Feature ID: F6.21 | Version: 6.0 | Status: Design Phase | Priority: P1 (HIGH) | Author: System Architecture Team | Date: November 2, 2025 | Last Updated: November 2, 2025


Table of Contents

  1. Executive Summary
  2. System Architecture
  3. Component Design
  4. Data Architecture
  5. API Design
  6. Technology Stack
  7. Security Architecture
  8. Performance & Scalability
  9. Monitoring & Observability
  10. Disaster Recovery
  11. Deployment Architecture
  12. Architecture Decision Records
  13. Risk Analysis
  14. Conclusion

1. Executive Summary

1.1 Overview

The Tenant Replication system provides intelligent, unidirectional replication for multi-tenant databases with innovative features including AI-powered predictive replication, semantic conflict resolution, and zero-downtime tenant migration.

1.2 Key Architectural Principles

  • Tenant-First Design: All operations are scoped at the tenant level
  • Cloud-Native: Kubernetes-native, containerized deployment
  • Event-Driven: Asynchronous CDC-based architecture
  • Scalable: Horizontal scaling for replication workers
  • Observable: Comprehensive metrics and tracing
  • Secure: End-to-end encryption, RBAC, audit logging

1.3 Quality Attributes

| Attribute | Target | Measurement |
| --- | --- | --- |
| Availability | 99.99% | Uptime monitoring |
| Performance | <5s replication lag (P99) | Lag metrics |
| Scalability | 10K tenants per cluster | Load testing |
| Security | Zero data breaches | Security audits |
| Reliability | <30s RTO, <5s RPO | DR testing |
| Maintainability | <2 day bug fix cycle | Issue tracking |

2. System Architecture

2.1 High-Level Architecture (C4 Model - Level 1)

graph TB
subgraph "External Systems"
CLIENT[Client Applications]
MONITORING[Monitoring Systems<br/>Prometheus/Grafana]
ALERTS[Alert Systems<br/>PagerDuty/Slack]
end
subgraph "HeliosDB Tenant Replication System"
API[REST/gRPC API Gateway]
CONTROLLER[Replication Controller]
WORKERS[Replication Workers Pool]
PREDICTOR[AI Predictive Engine]
FAILOVER[Automatic Failover Controller]
end
subgraph "Data Layer"
SOURCE_DB[(Source DB<br/>Read-Write)]
TARGET_DB[(Target DB<br/>Read-Only)]
METADATA[(Metadata Store<br/>PostgreSQL)]
QUEUE[Message Queue<br/>Kafka/NATS]
end
CLIENT --> API
API --> CONTROLLER
CONTROLLER --> WORKERS
CONTROLLER --> PREDICTOR
CONTROLLER --> FAILOVER
WORKERS --> SOURCE_DB
WORKERS --> TARGET_DB
WORKERS --> QUEUE
PREDICTOR --> METADATA
FAILOVER --> METADATA
WORKERS --> MONITORING
FAILOVER --> ALERTS
style API fill:#4A90E2
style CONTROLLER fill:#50C878
style WORKERS fill:#FFB347
style PREDICTOR fill:#E74C3C
style FAILOVER fill:#9B59B6

2.2 Container Architecture (C4 Model - Level 2)

graph TB
subgraph "API Layer"
REST[REST API Server<br/>Axum Framework]
GRPC[gRPC Server<br/>Tonic Framework]
AUTH[Auth Service<br/>JWT/OAuth2]
end
subgraph "Control Plane"
ORCHESTRATOR[Replication Orchestrator]
SCHEDULER[QoS Scheduler]
MIGRATION[Migration Controller]
HEALTH[Health Monitor]
end
subgraph "Data Plane"
CDC[CDC Processor<br/>Logical Replication]
TRANSFORM[Transform Pipeline]
COMPRESS[Compression Engine]
APPLY[Apply Worker]
end
subgraph "AI/ML Layer"
PREDICTOR_ML[Access Pattern Predictor]
CONFLICT_AI[Semantic Conflict Resolver]
OPTIMIZER[Performance Optimizer]
end
subgraph "Storage Layer"
REPL_META[(Replication Metadata)]
CHECKPOINT[(Checkpoint Store)]
METRICS_DB[(Metrics Store)]
end
REST --> ORCHESTRATOR
GRPC --> ORCHESTRATOR
AUTH --> REST
AUTH --> GRPC
ORCHESTRATOR --> SCHEDULER
ORCHESTRATOR --> MIGRATION
ORCHESTRATOR --> HEALTH
SCHEDULER --> CDC
CDC --> TRANSFORM
TRANSFORM --> COMPRESS
COMPRESS --> APPLY
PREDICTOR_ML --> SCHEDULER
CONFLICT_AI --> APPLY
OPTIMIZER --> SCHEDULER
ORCHESTRATOR --> REPL_META
CDC --> CHECKPOINT
HEALTH --> METRICS_DB

2.3 Component Diagram (C4 Model - Level 3)

graph TB
subgraph "Replication Worker"
CDC_READER[CDC Reader<br/>PostgreSQL Logical Slot]
PARSER[WAL Parser<br/>Decode Binary Log]
TRANSFORMER[Data Transformer<br/>Anonymize/Aggregate]
COMPRESSOR[Schema-Aware Compressor<br/>Delta/Dictionary]
NETWORK[Network Client<br/>TLS 1.3]
APPLIER[Transaction Applier<br/>Batch Insert/Update]
end
subgraph "AI Engine"
ACCESS_ANALYZER[Access Pattern Analyzer<br/>Time-series ML]
PRIORITY_QUEUE[Priority Queue<br/>Hot Data First]
PATTERN_DETECTOR[Pattern Detector<br/>Anomaly Detection]
end
subgraph "Conflict Resolution"
DETECTOR[Conflict Detector<br/>Timestamp/Version]
SEMANTIC[Semantic Resolver<br/>LLM Integration]
STRATEGY[Resolution Strategy<br/>LWW/Custom]
end
CDC_READER --> PARSER
PARSER --> TRANSFORMER
TRANSFORMER --> COMPRESSOR
COMPRESSOR --> NETWORK
NETWORK --> APPLIER
ACCESS_ANALYZER --> PRIORITY_QUEUE
PRIORITY_QUEUE --> CDC_READER
PATTERN_DETECTOR --> ACCESS_ANALYZER
APPLIER --> DETECTOR
DETECTOR --> SEMANTIC
SEMANTIC --> STRATEGY
STRATEGY --> APPLIER

2.4 Data Flow Architecture

sequenceDiagram
participant Source as Source DB<br/>(Read-Write)
participant CDC as CDC Processor
participant Transform as Transform Pipeline
participant Compress as Compression
participant Queue as Message Queue
participant Apply as Apply Worker
participant Target as Target DB<br/>(Read-Only)
participant Monitor as Monitoring
Source->>CDC: Transaction Commit
CDC->>CDC: Read WAL/Redo Log
CDC->>Transform: Raw Change Events
Transform->>Transform: Apply Transforms<br/>(Anonymize/Aggregate)
Transform->>Compress: Transformed Data
Compress->>Compress: Schema-Aware<br/>Compression
Compress->>Queue: Compressed Batch
Queue->>Apply: Fetch Batch
Apply->>Apply: Decompress
Apply->>Apply: Conflict Detection
alt No Conflict
Apply->>Target: Apply Changes
Target-->>Apply: ACK
else Conflict Detected
Apply->>Apply: Semantic Resolution
Apply->>Target: Apply Merged
Target-->>Apply: ACK
end
Apply->>Monitor: Update Metrics<br/>(Lag, Throughput)
Monitor->>Monitor: Check SLA
alt SLA Violated
Monitor->>Monitor: Trigger Alert
end

3. Component Design

3.1 Replication Controller

Responsibility: Orchestrate replication lifecycle, manage worker pool, enforce SLAs.

graph LR
subgraph "Replication Controller"
API_HANDLER[API Handler]
LIFECYCLE[Lifecycle Manager]
POOL[Worker Pool Manager]
SLA[SLA Enforcer]
STATE[State Machine]
end
subgraph "State Machine States"
INIT[Initializing]
SYNC[Initial Sync]
CDC[CDC Streaming]
PAUSE[Paused]
FAILOVER[Failing Over]
STOPPED[Stopped]
end
API_HANDLER --> LIFECYCLE
LIFECYCLE --> STATE
STATE --> POOL
POOL --> SLA
STATE --> INIT
INIT --> SYNC
SYNC --> CDC
CDC --> PAUSE
CDC --> FAILOVER
PAUSE --> CDC
FAILOVER --> STOPPED
CDC --> STOPPED

Key Algorithms:

  1. Worker Pool Sizing:

     workers = min(
         max_workers,
         ceil(tenant_count / tenants_per_worker),
         available_cpu_cores
     )

  2. SLA Enforcement:

     if replication_lag > qos_threshold:
         if priority == Premium:
             allocate_dedicated_worker()
         elif priority == Standard:
             increase_batch_frequency()
         else:
             log_warning()
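
A minimal Rust sketch of both rules, using illustrative names (`QosTier`, `SlaAction`) rather than the controller's actual types:

// Sketch only: worker pool sizing and QoS-based lag handling as described above.
pub enum QosTier { BestEffort, Standard, Premium, Synchronous }

pub enum SlaAction { None, AllocateDedicatedWorker, IncreaseBatchFrequency, LogWarning }

/// workers = min(max_workers, ceil(tenants / tenants_per_worker), cpu_cores)
pub fn worker_count(max_workers: usize, tenants: usize, tenants_per_worker: usize, cpu_cores: usize) -> usize {
    let by_tenants = tenants.div_ceil(tenants_per_worker);
    max_workers.min(by_tenants).min(cpu_cores)
}

/// Pick a remediation when replication lag exceeds the tier's QoS threshold.
pub fn enforce_sla(lag_secs: f64, qos_threshold_secs: f64, tier: QosTier) -> SlaAction {
    if lag_secs <= qos_threshold_secs {
        return SlaAction::None;
    }
    match tier {
        QosTier::Premium | QosTier::Synchronous => SlaAction::AllocateDedicatedWorker,
        QosTier::Standard => SlaAction::IncreaseBatchFrequency,
        QosTier::BestEffort => SlaAction::LogWarning,
    }
}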

3.2 CDC Processor

Responsibility: Capture changes from source database using logical replication.

graph TB
subgraph "CDC Processor"
SLOT[Replication Slot<br/>PostgreSQL]
DECODER[Binary Decoder<br/>WAL Format]
FILTER[Table Filter<br/>Tenant Isolation]
BUFFER[Event Buffer<br/>Ring Buffer]
CHECKPOINT[Checkpoint Manager<br/>LSN Tracking]
end
SOURCE_DB[(Source DB)] --> SLOT
SLOT --> DECODER
DECODER --> FILTER
FILTER --> BUFFER
BUFFER --> TRANSFORM[To Transform Pipeline]
BUFFER --> CHECKPOINT
CHECKPOINT --> METADATA[(Metadata Store)]
style SLOT fill:#E74C3C
style DECODER fill:#3498DB
style FILTER fill:#2ECC71

Technologies:

  • PostgreSQL: Logical replication slots (pgoutput plugin)
  • Protocol: PostgreSQL replication protocol (streaming)
  • Format: Binary WAL decoding (faster than text)

Checkpoint Strategy:

pub struct Checkpoint {
    tenant_id: String,
    lsn: u64,                           // Log Sequence Number (WAL position)
    timestamp: DateTime<Utc>,
    transaction_id: u64,
    table_lsns: HashMap<String, u64>,   // per-table LSN high-water marks
}
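
A hedged sketch of how a checkpoint might be advanced: the replication slot is only acknowledged up to the LSN that has been durably applied on the target, so a worker restart resumes from the last persisted checkpoint. `MetadataStore` and `ReplicationSlot` are illustrative traits, not the actual APIs:

pub trait MetadataStore {
    fn persist_checkpoint(&mut self, tenant_id: &str, lsn: u64);
}

pub trait ReplicationSlot {
    fn confirm_flush_lsn(&mut self, lsn: u64); // tell the source it may recycle WAL up to `lsn`
}

/// Advance the checkpoint only after the batch is durably applied on the target.
pub fn advance_checkpoint<M: MetadataStore, S: ReplicationSlot>(
    store: &mut M,
    slot: &mut S,
    tenant_id: &str,
    applied_lsn: u64,
    last_checkpoint_lsn: &mut u64,
) {
    if applied_lsn > *last_checkpoint_lsn {
        store.persist_checkpoint(tenant_id, applied_lsn); // crash-safe resume point
        slot.confirm_flush_lsn(applied_lsn);              // allow WAL reclamation on the source
        *last_checkpoint_lsn = applied_lsn;
    }
}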

3.3 Transform Pipeline

Responsibility: Apply transformations to replicated data (anonymization, aggregation, filtering).

graph LR
INPUT[Input Batch] --> VALIDATOR[Schema Validator]
VALIDATOR --> ANONYMIZER[PII Anonymizer]
ANONYMIZER --> AGGREGATOR[Aggregator]
AGGREGATOR --> FILTER[Row Filter]
FILTER --> ENRICHER[Data Enricher]
ENRICHER --> OUTPUT[Output Batch]
SCHEMA[(Schema Metadata)] --> VALIDATOR
RULES[(Transform Rules)] --> ANONYMIZER
RULES --> AGGREGATOR
RULES --> FILTER
style ANONYMIZER fill:#E74C3C
style AGGREGATOR fill:#3498DB
style FILTER fill:#2ECC71
style ENRICHER fill:#9B59B6

Transform Types:

  1. Anonymization:

     pub enum AnonymizationMethod {
         Hash(HashAlgorithm),            // SHA-256 hash
         Tokenize(TokenService),         // Vault token
         Redact,                         // Replace with ***
         Generalize(GeneralizationRule), // Age -> Age Group
     }

  2. Aggregation:

     pub struct AggregationTransform {
         group_by: Vec<String>,
         aggregations: Vec<AggregationFunc>,
         window: TimeWindow,
         emit_policy: EmitPolicy,
     }
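
As a concrete illustration, a minimal hash-based anonymizer could look like the sketch below. It assumes the sha2 crate and a hypothetical per-tenant salt; it is not the actual transform implementation:

use sha2::{Digest, Sha256};

/// Sketch of the Hash method: replace a PII value with a salted SHA-256 digest.
/// Salting per tenant (an assumption here) keeps hashes from being comparable across tenants.
pub fn anonymize_hash(value: &str, tenant_salt: &[u8]) -> String {
    let mut hasher = Sha256::new();
    hasher.update(tenant_salt);
    hasher.update(value.as_bytes());
    let digest = hasher.finalize();
    digest.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Sketch of the Redact method: replace the value entirely.
pub fn anonymize_redact(_value: &str) -> String {
    "***".to_string()
}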

3.4 AI Predictive Engine

Responsibility: Predict access patterns and prioritize replication of hot data.

graph TB
subgraph "Predictive Engine"
COLLECTOR[Access Pattern Collector]
FEATURES[Feature Extractor]
MODEL[ML Model<br/>Gradient Boosting]
PREDICTOR[Access Predictor]
PRIORITIZER[Replication Prioritizer]
end
ACCESS_LOG[Access Logs] --> COLLECTOR
COLLECTOR --> FEATURES
FEATURES --> MODEL
MODEL --> PREDICTOR
PREDICTOR --> PRIORITIZER
PRIORITIZER --> SCHEDULER[QoS Scheduler]
subgraph "Features"
FREQ[Access Frequency]
RECENCY[Recency]
SIZE[Data Size]
PATTERN[Access Pattern<br/>Sequential/Random]
end
FEATURES --> FREQ
FEATURES --> RECENCY
FEATURES --> SIZE
FEATURES --> PATTERN
style MODEL fill:#E74C3C
style PREDICTOR fill:#3498DB

ML Model:

  • Algorithm: Gradient Boosting (XGBoost)
  • Features:
    • Access frequency (last 1h, 24h, 7d)
    • Last access timestamp
    • Table size and row count
    • Read:write ratio
    • User count accessing data
  • Output: Priority score (0-100)
  • Training: Online learning every 1 hour

Prioritization Algorithm:

priority_score = (
    0.4 * access_frequency_score +
    0.3 * recency_score +
    0.2 * size_score +
    0.1 * pattern_score
)
replication_order = sort_by(priority_score, descending)
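
A small Rust sketch of the weighted score, assuming each component has already been normalized to the 0-100 range by the feature extractor (the field names are illustrative):

/// Per-table feature scores, each normalized to 0-100 (illustrative names).
pub struct FeatureScores {
    pub access_frequency: f64,
    pub recency: f64,
    pub size: f64,
    pub pattern: f64,
}

/// Weighted priority score used to order tenant tables for replication (0-100).
pub fn priority_score(f: &FeatureScores) -> f64 {
    0.4 * f.access_frequency + 0.3 * f.recency + 0.2 * f.size + 0.1 * f.pattern
}

/// Sort table names by descending priority so hot data replicates first.
pub fn replication_order(mut tables: Vec<(String, FeatureScores)>) -> Vec<String> {
    tables.sort_by(|a, b| priority_score(&b.1).total_cmp(&priority_score(&a.1)));
    tables.into_iter().map(|(name, _)| name).collect()
}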

3.5 Compression Engine

Responsibility: Schema-aware compression for bandwidth optimization.

graph TB
subgraph "Compression Engine"
ANALYZER[Schema Analyzer]
SELECTOR[Compression Selector]
DELTA[Delta Encoding]
DICT[Dictionary Encoding]
GORILLA[Gorilla Compression]
ZSTD[Zstd Compression]
end
BATCH[Input Batch] --> ANALYZER
ANALYZER --> SELECTOR
SELECTOR -->|Integer| DELTA
SELECTOR -->|String| DICT
SELECTOR -->|Float/Timestamp| GORILLA
SELECTOR -->|JSON/Text| ZSTD
DELTA --> OUTPUT[Compressed Batch]
DICT --> OUTPUT
GORILLA --> OUTPUT
ZSTD --> OUTPUT
SCHEMA[(Schema)] --> ANALYZER
style DELTA fill:#3498DB
style DICT fill:#2ECC71
style GORILLA fill:#E74C3C
style ZSTD fill:#9B59B6

Compression Strategies:

| Data Type | Strategy | Ratio | Use Case |
| --- | --- | --- | --- |
| Integer (sequential) | Delta Encoding | 5-10x | IDs, counters |
| Integer (random) | Zstd | 2-3x | Random values |
| String (low cardinality) | Dictionary | 10-50x | Status codes, categories |
| String (high cardinality) | Zstd | 2-4x | Names, descriptions |
| Float (time-series) | Gorilla | 5-12x | Metrics, sensors |
| Timestamp | Delta-of-Delta | 8-15x | Event timestamps |
| JSON | Zstd | 3-5x | Document data |
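
A hedged sketch of the per-column selection logic the Compression Selector applies, using illustrative type and strategy enums rather than the engine's actual types:

/// Illustrative column classification produced by the schema analyzer.
pub enum ColumnKind {
    SequentialInt,
    RandomInt,
    LowCardinalityString,
    HighCardinalityString,
    TimeSeriesFloat,
    Timestamp,
    Json,
}

/// Illustrative compression strategies from the table above.
pub enum Strategy {
    Delta,
    DeltaOfDelta,
    Dictionary,
    Gorilla,
    Zstd,
}

/// Map each column type to the strategy with the best expected ratio.
pub fn select_strategy(kind: &ColumnKind) -> Strategy {
    match kind {
        ColumnKind::SequentialInt => Strategy::Delta,
        ColumnKind::Timestamp => Strategy::DeltaOfDelta,
        ColumnKind::LowCardinalityString => Strategy::Dictionary,
        ColumnKind::TimeSeriesFloat => Strategy::Gorilla,
        ColumnKind::RandomInt | ColumnKind::HighCardinalityString | ColumnKind::Json => Strategy::Zstd,
    }
}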

3.6 Failover Controller

Responsibility: Monitor source health, automatically promote replica on failure.

stateDiagram-v2
[*] --> Monitoring
Monitoring --> SourceHealthy: Health Check OK
SourceHealthy --> Monitoring: Continue
Monitoring --> SourceUnhealthy: Health Check Failed
SourceUnhealthy --> Evaluating: Evaluate Criteria
Evaluating --> Monitoring: Don't Failover<br/>(Transient Issue)
Evaluating --> InitiateFailover: Failover Needed
InitiateFailover --> StopReplication: Stop CDC
StopReplication --> PromoteReplica: Set Read-Write
PromoteReplica --> UpdateRouting: Update DNS/LB
UpdateRouting --> SendAlerts: Notify Operators
SendAlerts --> [*]
note right of Evaluating
Criteria:
- Downtime > 30s
- Replication lag < 5s
- Replica healthy
- Not maintenance window
end note

Health Check Algorithm:

pub async fn check_health(&self) -> HealthStatus {
    // Run all probes concurrently; a single slow check must not delay the others.
    let checks = vec![
        self.check_database_connectivity(),
        self.check_replication_lag(),
        self.check_disk_space(),
        self.check_cpu_usage(),
        self.check_network_connectivity(),
    ];
    let results = futures::future::join_all(checks).await;
    HealthStatus {
        is_healthy: results.iter().all(|r| r.is_ok()),
        checks: results,
        timestamp: Utc::now(),
    }
}
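
The promotion criteria from the state diagram above can be expressed as a pure decision function. The sketch below uses the thresholds noted in the diagram (30s downtime, 5s lag) and illustrative input names:

use std::time::Duration;

/// Inputs to the failover decision (illustrative struct, not the controller's real type).
pub struct FailoverInputs {
    pub source_downtime: Duration,
    pub replication_lag: Duration,
    pub replica_healthy: bool,
    pub in_maintenance_window: bool,
}

/// Promote the replica only when all criteria from the state diagram hold.
pub fn should_failover(i: &FailoverInputs) -> bool {
    i.source_downtime > Duration::from_secs(30)
        && i.replication_lag < Duration::from_secs(5)
        && i.replica_healthy
        && !i.in_maintenance_window
}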

4. Data Architecture

4.1 Replication Metadata Schema

-- Replication configuration
CREATE TABLE replication_configs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id VARCHAR(255) NOT NULL,
source_connection_id UUID NOT NULL REFERENCES connections(id),
target_connection_id UUID NOT NULL REFERENCES connections(id),
-- Configuration
qos_tier VARCHAR(50) NOT NULL, -- BestEffort, Standard, Premium, Synchronous
max_lag_seconds INTEGER NOT NULL,
priority SMALLINT NOT NULL DEFAULT 50, -- 0-100
-- Filters
table_filter JSONB, -- ["users.*", "orders.*"]
row_filter JSONB, -- Complex predicates
-- Transforms
transforms JSONB, -- Transform pipeline config
-- State
status VARCHAR(50) NOT NULL, -- Initializing, Syncing, Streaming, Paused, Stopped
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT unique_tenant_source_target UNIQUE(tenant_id, source_connection_id, target_connection_id)
);
CREATE INDEX idx_replication_configs_tenant ON replication_configs(tenant_id);
CREATE INDEX idx_replication_configs_status ON replication_configs(status);
CREATE INDEX idx_replication_configs_qos ON replication_configs(qos_tier);
-- Replication checkpoints
CREATE TABLE replication_checkpoints (
id BIGSERIAL PRIMARY KEY,
replication_id UUID NOT NULL REFERENCES replication_configs(id) ON DELETE CASCADE,
-- Checkpoint data
lsn BIGINT NOT NULL, -- Log Sequence Number
transaction_id BIGINT, -- Transaction ID
checkpoint_time TIMESTAMPTZ NOT NULL,
-- Per-table checkpoints
table_checkpoints JSONB, -- {table_name: lsn}
-- Metrics at checkpoint
rows_replicated BIGINT NOT NULL DEFAULT 0,
bytes_replicated BIGINT NOT NULL DEFAULT 0,
lag_seconds NUMERIC(10,3),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_checkpoints_replication ON replication_checkpoints(replication_id);
CREATE INDEX idx_checkpoints_time ON replication_checkpoints(checkpoint_time DESC);
-- Replication metrics (time-series)
CREATE TABLE replication_metrics (
id BIGSERIAL PRIMARY KEY,
replication_id UUID NOT NULL REFERENCES replication_configs(id) ON DELETE CASCADE,
timestamp TIMESTAMPTZ NOT NULL,
-- Lag metrics
replication_lag_seconds NUMERIC(10,3),
replication_lag_bytes BIGINT,
-- Throughput metrics
rows_per_second NUMERIC(12,2),
bytes_per_second BIGINT,
transactions_per_second NUMERIC(10,2),
-- Performance metrics
compression_ratio NUMERIC(5,2),
network_latency_ms NUMERIC(8,2),
apply_latency_ms NUMERIC(8,2),
-- Resource metrics
cpu_usage_percent NUMERIC(5,2),
memory_usage_mb BIGINT,
-- Error metrics
conflicts_detected INTEGER DEFAULT 0,
conflicts_resolved INTEGER DEFAULT 0,
errors_count INTEGER DEFAULT 0
);
-- Hypertable for time-series (TimescaleDB extension)
SELECT create_hypertable('replication_metrics', 'timestamp', if_not_exists => TRUE);
CREATE INDEX idx_metrics_replication_time ON replication_metrics(replication_id, timestamp DESC);
-- Replication events (audit log)
CREATE TABLE replication_events (
id BIGSERIAL PRIMARY KEY,
replication_id UUID NOT NULL REFERENCES replication_configs(id) ON DELETE CASCADE,
event_type VARCHAR(100) NOT NULL, -- StateChange, Failover, Error, etc.
severity VARCHAR(20) NOT NULL, -- Info, Warning, Error, Critical
message TEXT NOT NULL,
details JSONB,
-- Context
user_id UUID,
source_ip INET,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_events_replication ON replication_events(replication_id);
CREATE INDEX idx_events_type ON replication_events(event_type);
CREATE INDEX idx_events_severity ON replication_events(severity);
CREATE INDEX idx_events_time ON replication_events(created_at DESC);
-- Failover history
CREATE TABLE failover_history (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
replication_id UUID NOT NULL REFERENCES replication_configs(id),
-- Failover details
trigger VARCHAR(100) NOT NULL, -- Automatic, Manual, Scheduled
reason TEXT,
old_source_id UUID,
new_source_id UUID,
-- Metrics
failover_duration_ms BIGINT,
downtime_ms BIGINT,
data_loss_bytes BIGINT DEFAULT 0,
-- Status
status VARCHAR(50) NOT NULL, -- InProgress, Completed, Failed
error_message TEXT,
started_at TIMESTAMPTZ NOT NULL,
completed_at TIMESTAMPTZ,
created_by UUID
);
CREATE INDEX idx_failover_replication ON failover_history(replication_id);
CREATE INDEX idx_failover_status ON failover_history(status);
CREATE INDEX idx_failover_started ON failover_history(started_at DESC);
-- Migration history
CREATE TABLE migration_history (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id VARCHAR(255) NOT NULL,
-- Migration details
source_region VARCHAR(100) NOT NULL,
target_region VARCHAR(100) NOT NULL,
-- Phases
bulk_copy_completed_at TIMESTAMPTZ,
cdc_catchup_completed_at TIMESTAMPTZ,
cutover_completed_at TIMESTAMPTZ,
-- Metrics
data_size_gb NUMERIC(12,2),
total_duration_seconds INTEGER,
downtime_ms BIGINT,
-- Status
status VARCHAR(50) NOT NULL, -- InProgress, Completed, Failed, RolledBack
error_message TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
created_by UUID
);
CREATE INDEX idx_migration_tenant ON migration_history(tenant_id);
CREATE INDEX idx_migration_status ON migration_history(status);
CREATE INDEX idx_migration_created ON migration_history(created_at DESC);
-- AI model metadata
CREATE TABLE ml_models (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
model_type VARCHAR(100) NOT NULL, -- AccessPatternPredictor, ConflictResolver
version VARCHAR(50) NOT NULL,
-- Model artifacts
model_path TEXT NOT NULL,
model_size_mb NUMERIC(10,2),
-- Training metadata
training_dataset_size BIGINT,
training_duration_seconds INTEGER,
-- Performance metrics
accuracy NUMERIC(5,4),
precision_score NUMERIC(5,4),
recall NUMERIC(5,4),
f1_score NUMERIC(5,4),
-- Deployment
status VARCHAR(50) NOT NULL, -- Training, Validating, Deployed, Deprecated
deployed_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
created_by UUID
);
CREATE INDEX idx_ml_models_type ON ml_models(model_type);
CREATE INDEX idx_ml_models_status ON ml_models(status);
-- Access pattern predictions
CREATE TABLE access_predictions (
id BIGSERIAL PRIMARY KEY,
tenant_id VARCHAR(255) NOT NULL,
table_name VARCHAR(255) NOT NULL,
prediction_time TIMESTAMPTZ NOT NULL,
prediction_window_hours INTEGER NOT NULL,
-- Predictions
predicted_access_count INTEGER,
predicted_read_bytes BIGINT,
priority_score NUMERIC(5,2), -- 0-100
-- Confidence
confidence NUMERIC(5,4), -- 0-1
model_id UUID REFERENCES ml_models(id),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
SELECT create_hypertable('access_predictions', 'prediction_time', if_not_exists => TRUE);
CREATE INDEX idx_predictions_tenant ON access_predictions(tenant_id);
CREATE INDEX idx_predictions_priority ON access_predictions(priority_score DESC);

4.2 Database Indexes Strategy

Performance Indexes:

-- Lookup replication configs by tenant
CREATE INDEX idx_replication_tenant_active
ON replication_configs(tenant_id)
WHERE status IN ('Syncing', 'Streaming');
-- Find high-priority replications
CREATE INDEX idx_replication_priority_qos
ON replication_configs(priority DESC, qos_tier)
WHERE status = 'Streaming';
-- Checkpoint lookup for resume
CREATE INDEX idx_checkpoint_latest
ON replication_checkpoints(replication_id, checkpoint_time DESC);
-- Metrics time-range queries
CREATE INDEX idx_metrics_time_range
ON replication_metrics(replication_id, timestamp DESC)
WHERE replication_lag_seconds > 5; -- Alert threshold

4.3 Data Retention Policies

-- Retention policies (cleanup old data)
CREATE TABLE data_retention_policies (
table_name VARCHAR(255) PRIMARY KEY,
retention_days INTEGER NOT NULL,
cleanup_enabled BOOLEAN DEFAULT TRUE
);
INSERT INTO data_retention_policies VALUES
('replication_metrics', 90, TRUE), -- 90 days
('replication_events', 365, TRUE), -- 1 year
('replication_checkpoints', 30, TRUE), -- 30 days
('access_predictions', 7, TRUE), -- 7 days
('failover_history', 730, TRUE), -- 2 years
('migration_history', 730, TRUE); -- 2 years
-- Cleanup job (run daily)
CREATE OR REPLACE FUNCTION cleanup_old_data() RETURNS void AS $$
DECLARE
policy RECORD;
BEGIN
FOR policy IN SELECT * FROM data_retention_policies WHERE cleanup_enabled LOOP
EXECUTE format(
'DELETE FROM %I WHERE created_at < NOW() - INTERVAL ''%s days''',
policy.table_name,
policy.retention_days
);
END LOOP;
END;
$$ LANGUAGE plpgsql;

5. API Design

See separate document: F6.21_API_SPECIFICATION.md

Summary:

  • REST API: CRUD operations, management, monitoring
  • gRPC API: High-performance replication operations
  • WebSocket API: Real-time metrics streaming

6. Technology Stack

6.1 Technology Selection Matrix

| Component | Technology | Rationale | Alternatives Considered |
| --- | --- | --- | --- |
| Language | Rust | Performance, safety, async | Go (simpler, slower), Java (GC overhead) |
| Web Framework | Axum | Fast, ergonomic, Tower ecosystem | Actix-web (less maintained), Rocket (slower) |
| gRPC | Tonic | Native Rust, HTTP/2 | grpc-rs (C++ bindings), tarpc (simpler) |
| Async Runtime | Tokio | Industry standard, mature | async-std (smaller ecosystem) |
| Database | PostgreSQL 16+ | Logical replication, JSON, TimescaleDB | MySQL (weaker replication), MongoDB (NoSQL) |
| Message Queue | NATS JetStream | Lightweight, persistence, exactly-once | Kafka (heavier), RabbitMQ (slower) |
| CDC | PostgreSQL Logical Replication | Native, low overhead | Debezium (JVM required), custom WAL parser |
| Compression | Zstd, custom | Best ratio, fast | LZ4 (less compression), Snappy (faster, less ratio) |
| ML Framework | Candle (Rust) | Native Rust, no Python | PyTorch (requires Python bridge), ONNX Runtime |
| Metrics | Prometheus | Industry standard, pull-based | InfluxDB (time-series only), Datadog (expensive) |
| Tracing | OpenTelemetry | Standard, vendor-agnostic | Jaeger only (limited), custom |
| Config | YAML + Env Vars | Human-readable, secure secrets | TOML (less common), JSON (no comments) |
| Orchestration | Kubernetes | Cloud-native, scalable | Docker Swarm (simpler, fewer features), Nomad |

6.2 Dependency Graph

graph TB
subgraph "External Dependencies"
TOKIO[tokio 1.35+]
AXUM[axum 0.7+]
TONIC[tonic 0.11+]
SQLX[sqlx 0.7+]
NATS[async-nats 0.33+]
ZSTD[zstd 0.13+]
PROMETHEUS[prometheus-client 0.22+]
OTEL[opentelemetry 0.21+]
end
subgraph "Internal Dependencies"
MULTITENANCY[heliosdb-multitenancy]
METADATA[heliosdb-metadata]
SECURITY[heliosdb-security]
NETWORK[heliosdb-network]
STREAMING[heliosdb-streaming]
end
APP[heliosdb-tenant-replication] --> TOKIO
APP --> AXUM
APP --> TONIC
APP --> SQLX
APP --> NATS
APP --> ZSTD
APP --> PROMETHEUS
APP --> OTEL
APP --> MULTITENANCY
APP --> METADATA
APP --> SECURITY
APP --> NETWORK
APP --> STREAMING
style APP fill:#E74C3C
style MULTITENANCY fill:#3498DB
style METADATA fill:#2ECC71

6.3 Technology Versions

[dependencies]
# Async runtime
tokio = { version = "1.35", features = ["full"] }
tokio-util = "0.7"
# Web frameworks
axum = "0.7"
tower = "0.4"
tower-http = "0.5"
tonic = "0.11"
prost = "0.12"
# Database
sqlx = { version = "0.7", features = ["postgres", "runtime-tokio", "tls-rustls", "json", "uuid", "chrono"] }
deadpool-postgres = "0.12"
# Message queue
async-nats = "0.33"
# Serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
serde_yaml = "0.9"
# Compression
zstd = "0.13"
snap = "1.1" # Snappy
# Cryptography
aes-gcm = "0.10"
ring = "0.17"
# Metrics & Observability
prometheus-client = "0.22"
opentelemetry = "0.21"
opentelemetry-otlp = "0.14"
tracing = "0.1"
tracing-subscriber = "0.3"
# ML/AI
candle-core = "0.3"
candle-nn = "0.3"
# Utilities
anyhow = "1.0"
thiserror = "1.0"
uuid = { version = "1.6", features = ["v4", "serde"] }
chrono = { version = "0.4", features = ["serde"] }

7. Security Architecture

7.1 Security Layers

graph TB
subgraph "Network Security"
TLS[TLS 1.3<br/>All Connections]
MTLS[mTLS<br/>Service-to-Service]
FW[Firewall Rules<br/>Network Policies]
end
subgraph "Authentication & Authorization"
JWT[JWT Tokens<br/>API Access]
RBAC[RBAC<br/>Permission Model]
OAUTH[OAuth2<br/>Third-party Auth]
end
subgraph "Data Security"
ENCRYPT_TRANSIT[Encryption in Transit<br/>AES-256-GCM]
ENCRYPT_REST[Encryption at Rest<br/>Database TDE]
MASKING[Data Masking<br/>PII Protection]
end
subgraph "Audit & Compliance"
AUDIT_LOG[Audit Logging<br/>All Operations]
RETENTION[Data Retention<br/>GDPR Compliance]
ACCESS_LOG[Access Logs<br/>Security Events]
end
CLIENT[Client] --> TLS
TLS --> JWT
JWT --> RBAC
RBAC --> ENCRYPT_TRANSIT
ENCRYPT_TRANSIT --> ENCRYPT_REST
ENCRYPT_REST --> MASKING
RBAC --> AUDIT_LOG
ENCRYPT_TRANSIT --> ACCESS_LOG
MASKING --> RETENTION

7.2 Authentication Flow

sequenceDiagram
participant Client
participant API Gateway
participant Auth Service
participant JWT Validator
participant RBAC
participant Backend
Client->>API Gateway: Request + JWT Token
API Gateway->>JWT Validator: Validate Token
alt Invalid Token
JWT Validator-->>Client: 401 Unauthorized
else Valid Token
JWT Validator->>RBAC: Check Permissions
alt Insufficient Permissions
RBAC-->>Client: 403 Forbidden
else Authorized
RBAC->>Backend: Forward Request
Backend-->>Client: Response
end
end
Backend->>Auth Service: Log Access Event

7.3 Encryption Strategy

In-Transit Encryption:

pub struct EncryptionConfig {
    // TLS configuration: TLS 1.3 only, AEAD cipher suites
    tls_version: TlsVersion,            // TlsVersion::TLS13
    cipher_suites: Vec<CipherSuite>,    // TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256
    // Certificate management
    cert_path: String,
    key_path: String,
    ca_cert_path: String,
    // mTLS for service-to-service
    require_client_cert: bool,
}

At-Rest Encryption:

pub struct DataEncryption {
    algorithm: EncryptionAlgorithm,     // EncryptionAlgorithm::AES256GCM
    key_derivation: KeyDerivation,      // KeyDerivation::PBKDF2
    // Key management
    kms_provider: KMSProvider,          // AWS KMS, Azure Key Vault, GCP KMS
    key_rotation_days: u32,             // 30
    // Encrypted fields
    encrypted_columns: Vec<String>,     // password_hash, api_key, connection_string
}
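
For illustration, column-level encryption with the aes-gcm crate already listed in the dependencies might look like the sketch below. Key handling is deliberately simplified (the real design fetches keys from the configured KMS) and the function names are assumptions:

use aes_gcm::aead::{Aead, AeadCore, Error, KeyInit, OsRng};
use aes_gcm::{Aes256Gcm, Key, Nonce};

/// Encrypt one column value with AES-256-GCM; the random 96-bit nonce is prepended to the output.
/// In the real system the key comes from the KMS provider, not from the caller.
pub fn encrypt_column(key_bytes: &[u8; 32], plaintext: &[u8]) -> Result<Vec<u8>, Error> {
    let key = Key::<Aes256Gcm>::from_slice(key_bytes);
    let cipher = Aes256Gcm::new(key);
    let nonce = Aes256Gcm::generate_nonce(&mut OsRng); // unique per value
    let mut out = nonce.as_slice().to_vec();
    out.extend(cipher.encrypt(&nonce, plaintext)?);
    Ok(out)
}

/// Reverse of encrypt_column: split off the 12-byte nonce, then decrypt.
pub fn decrypt_column(key_bytes: &[u8; 32], data: &[u8]) -> Result<Vec<u8>, Error> {
    let key = Key::<Aes256Gcm>::from_slice(key_bytes);
    let cipher = Aes256Gcm::new(key);
    let (nonce, ciphertext) = data.split_at(12);
    cipher.decrypt(Nonce::from_slice(nonce), ciphertext)
}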

7.4 Access Control Model (RBAC)

pub enum Role {
Admin, // Full access
Operator, // Manage replications
ReadOnly, // View only
TenantAdmin, // Manage own tenant
}
pub enum Permission {
// Replication management
CreateReplication,
UpdateReplication,
DeleteReplication,
ViewReplication,
// Failover
TriggerFailover,
ViewFailoverHistory,
// Migration
StartMigration,
CancelMigration,
// Monitoring
ViewMetrics,
ViewLogs,
ConfigureAlerts,
}
pub struct RBACPolicy {
role: Role,
permissions: Vec<Permission>,
resource_filter: ResourceFilter, // Tenant-level isolation
}

Role-Permission Matrix:

| Permission | Admin | Operator | ReadOnly | TenantAdmin |
| --- | --- | --- | --- | --- |
| CreateReplication | | | | (own tenant) |
| UpdateReplication | | | | (own tenant) |
| DeleteReplication | | | | (own tenant) |
| ViewReplication | | | | (own tenant) |
| TriggerFailover | | | | |
| StartMigration | | | | |
| ViewMetrics | | | | (own tenant) |
| ConfigureAlerts | | | | |

7.5 Tenant Isolation

pub struct TenantIsolation {
// Database-level isolation
row_level_security: bool, // PostgreSQL RLS
tenant_id_column: String, // Tenant identifier
// Network isolation
vpc_per_tenant: bool, // VPC isolation (enterprise)
network_policies: Vec<NetworkPolicy>,
// Resource isolation
cpu_quota: CpuQuota,
memory_limit: MemoryLimit,
// Data isolation
encryption_key_per_tenant: bool,
}

7.6 Audit Logging

pub struct AuditLog {
event_id: Uuid,
event_type: AuditEventType,
timestamp: DateTime<Utc>,
// Actor
user_id: Option<Uuid>,
service_account: Option<String>,
source_ip: IpAddr,
// Resource
resource_type: ResourceType,
resource_id: String,
tenant_id: String,
// Action
action: Action,
status: ActionStatus, // Success, Failure
// Details
details: serde_json::Value,
// Compliance
gdpr_relevant: bool,
hipaa_relevant: bool,
}

8. Performance & Scalability

8.1 Performance Targets

| Metric | Target | Baseline | Stretch Goal |
| --- | --- | --- | --- |
| Replication Lag (P50) | <2 seconds | <5 seconds | <1 second |
| Replication Lag (P99) | <5 seconds | <15 seconds | <3 seconds |
| Throughput | >50K rows/sec | >10K rows/sec | >100K rows/sec |
| Latency (Apply) | <100ms | <500ms | <50ms |
| CPU Usage | <60% | <80% | <40% |
| Memory Usage | <4GB per worker | <8GB | <2GB |
| Network Bandwidth | <100 Mbps | <500 Mbps | <50 Mbps |
| Compression Ratio | 3-5x | 2x | 5-10x |

8.2 Scalability Architecture

graph TB
subgraph "Horizontal Scaling"
LB[Load Balancer<br/>Nginx/HAProxy]
subgraph "API Tier (Stateless)"
API1[API Server 1]
API2[API Server 2]
API3[API Server N]
end
subgraph "Worker Tier (Stateful)"
W1[Replication Worker 1<br/>Tenants 1-100]
W2[Replication Worker 2<br/>Tenants 101-200]
W3[Replication Worker N<br/>Tenants N]
end
end
subgraph "Data Tier"
PGPOOL[PgPool<br/>Connection Pooling]
subgraph "Database Cluster"
PG_PRIMARY[(PostgreSQL Primary)]
PG_REPLICA1[(PostgreSQL Replica 1)]
PG_REPLICA2[(PostgreSQL Replica 2)]
end
end
LB --> API1
LB --> API2
LB --> API3
API1 --> W1
API2 --> W2
API3 --> W3
W1 --> PGPOOL
W2 --> PGPOOL
W3 --> PGPOOL
PGPOOL --> PG_PRIMARY
PGPOOL --> PG_REPLICA1
PGPOOL --> PG_REPLICA2
PG_PRIMARY --> PG_REPLICA1
PG_PRIMARY --> PG_REPLICA2
style LB fill:#3498DB
style PGPOOL fill:#2ECC71
style PG_PRIMARY fill:#E74C3C

8.3 Scaling Strategies

Vertical Scaling:

# Worker resource limits
worker:
  cpu_request: "2000m"
  cpu_limit: "4000m"
  memory_request: "4Gi"
  memory_limit: "8Gi"

# Scaling triggers
scale_up_triggers:
  - cpu_usage > 70%
  - memory_usage > 75%
  - replication_lag > 10s
scale_down_triggers:
  - cpu_usage < 30% for 10 minutes
  - memory_usage < 40% for 10 minutes

Horizontal Scaling:

pub struct ScalingPolicy {
    min_workers: usize,
    max_workers: usize,
    scale_up_policy: ScaleUpPolicy,     // threshold: 100 tenants per worker, cooldown: 300s
    scale_down_policy: ScaleDownPolicy, // threshold: 30 tenants per worker, cooldown: 600s
}
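
A hedged sketch of how such a policy might be evaluated; the threshold parameters mirror the comments above, and cooldown handling is omitted:

/// Desired worker count given tenants-per-worker thresholds (cooldowns omitted for brevity).
pub fn desired_workers(
    tenants: usize,
    current_workers: usize,
    scale_up_threshold: usize,   // e.g. 100 tenants per worker
    scale_down_threshold: usize, // e.g. 30 tenants per worker
    min_workers: usize,
    max_workers: usize,
) -> usize {
    let per_worker = if current_workers == 0 { tenants } else { tenants / current_workers };
    let target = if per_worker > scale_up_threshold {
        current_workers + 1
    } else if per_worker < scale_down_threshold {
        current_workers.saturating_sub(1)
    } else {
        current_workers
    };
    target.clamp(min_workers, max_workers)
}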

8.4 Performance Optimization Techniques

Batching:

pub struct BatchConfig {
max_batch_size: usize, // 5000 rows
max_batch_delay: Duration, // 100ms
// Dynamic batching
adaptive_batch_size: bool, // Adjust based on load
batch_size_range: (usize, usize), // (1000, 10000)
}
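
A minimal sketch of the size-or-delay flush rule, assuming a tokio mpsc channel feeds change rows to the batcher; `ChangeRow` and `next_batch` are illustrative names, not the worker's actual API:

use std::time::Duration;
use tokio::sync::mpsc::Receiver;
use tokio::time::{timeout, Instant};

/// Placeholder for a decoded change event.
pub struct ChangeRow;

/// Collect up to `max_batch_size` rows, flushing early once `max_batch_delay` elapses.
pub async fn next_batch(
    rx: &mut Receiver<ChangeRow>,
    max_batch_size: usize,
    max_batch_delay: Duration,
) -> Vec<ChangeRow> {
    let mut batch = Vec::with_capacity(max_batch_size);
    let deadline = Instant::now() + max_batch_delay;
    while batch.len() < max_batch_size {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match timeout(remaining, rx.recv()).await {
            Ok(Some(row)) => batch.push(row), // got a row within the delay budget
            Ok(None) => break,                // channel closed; flush what we have
            Err(_) => break,                  // delay budget exhausted; flush partial batch
        }
    }
    batch
}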

Connection Pooling:

pub struct ConnectionPool {
min_connections: u32, // 10
max_connections: u32, // 100
connection_timeout: Duration, // 5s
idle_timeout: Duration, // 60s
// Per-tenant pools
tenant_pool_size: u32, // 2
}
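
With sqlx (already a dependency), the pool settings above could translate to something like the sketch below; the exact values and the per-tenant pooling strategy are assumptions:

use std::time::Duration;
use sqlx::postgres::{PgPool, PgPoolOptions};

/// Build a PostgreSQL pool mirroring the ConnectionPool settings above.
pub async fn build_pool(database_url: &str) -> Result<PgPool, sqlx::Error> {
    PgPoolOptions::new()
        .min_connections(10)
        .max_connections(100)
        .acquire_timeout(Duration::from_secs(5)) // connection_timeout
        .idle_timeout(Duration::from_secs(60))
        .connect(database_url)
        .await
}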

Caching:

pub struct CacheStrategy {
// Metadata cache
schema_cache_ttl: Duration, // 5 minutes
config_cache_ttl: Duration, // 1 minute
// Prediction cache
prediction_cache_size: usize, // 10K entries
prediction_cache_ttl: Duration, // 1 hour
}

8.5 Load Testing Plan

load_tests:
  - name: "Baseline Load"
    tenants: 100
    transactions_per_second: 1000
    duration_minutes: 30
  - name: "Peak Load"
    tenants: 1000
    transactions_per_second: 10000
    duration_minutes: 60
  - name: "Stress Test"
    tenants: 5000
    transactions_per_second: 50000
    duration_minutes: 30
  - name: "Endurance Test"
    tenants: 500
    transactions_per_second: 5000
    duration_hours: 24

9. Monitoring & Observability

9.1 Observability Stack

graph TB
subgraph "Application"
APP[Replication Service]
METRICS_EXPORTER[Metrics Exporter<br/>Prometheus Client]
TRACES_EXPORTER[Traces Exporter<br/>OTLP]
LOGS_EXPORTER[Logs Exporter<br/>Fluentd]
end
subgraph "Collection Layer"
PROMETHEUS[Prometheus<br/>Metrics Store]
JAEGER[Jaeger<br/>Distributed Tracing]
LOKI[Loki<br/>Log Aggregation]
end
subgraph "Visualization Layer"
GRAFANA[Grafana<br/>Dashboards]
ALERTMANAGER[Alertmanager<br/>Alert Routing]
end
subgraph "Notification Layer"
PAGERDUTY[PagerDuty]
SLACK[Slack]
EMAIL[Email]
end
APP --> METRICS_EXPORTER
APP --> TRACES_EXPORTER
APP --> LOGS_EXPORTER
METRICS_EXPORTER --> PROMETHEUS
TRACES_EXPORTER --> JAEGER
LOGS_EXPORTER --> LOKI
PROMETHEUS --> GRAFANA
JAEGER --> GRAFANA
LOKI --> GRAFANA
PROMETHEUS --> ALERTMANAGER
ALERTMANAGER --> PAGERDUTY
ALERTMANAGER --> SLACK
ALERTMANAGER --> EMAIL
style PROMETHEUS fill:#E74C3C
style JAEGER fill:#3498DB
style LOKI fill:#2ECC71
style GRAFANA fill:#9B59B6

9.2 Metrics Collection

System Metrics:

pub struct SystemMetrics {
// Replication lag
replication_lag_seconds: Histogram,
replication_lag_bytes: Gauge,
// Throughput
rows_replicated_total: Counter,
bytes_replicated_total: Counter,
transactions_replicated_total: Counter,
// Performance
replication_duration_seconds: Histogram,
compression_ratio: Histogram,
network_latency_seconds: Histogram,
// Errors
replication_errors_total: Counter,
conflicts_detected_total: Counter,
conflicts_resolved_total: Counter,
// Resource usage
cpu_usage_percent: Gauge,
memory_usage_bytes: Gauge,
connection_pool_active: Gauge,
connection_pool_idle: Gauge,
}

Business Metrics:

pub struct BusinessMetrics {
// Tenants
active_replications: Gauge,
tenants_by_qos_tier: GaugeVec, // Labels: [tier]
// SLA compliance
sla_violations_total: Counter,
sla_compliance_ratio: Gauge,
// Failovers
failovers_total: Counter,
failover_duration_seconds: Histogram,
// Migrations
migrations_total: Counter,
migration_downtime_seconds: Histogram,
}
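
A sketch of how a few of these metrics could be registered and exposed with the prometheus-client crate from the dependency list; the metric names and help strings here are illustrative:

use prometheus_client::encoding::text::encode;
use prometheus_client::metrics::counter::Counter;
use prometheus_client::metrics::gauge::Gauge;
use prometheus_client::registry::Registry;

/// Register a couple of the metrics above and render them in the Prometheus text format.
pub fn example_metrics() -> String {
    let mut registry = Registry::default();

    let rows_replicated: Counter = Counter::default();
    registry.register("rows_replicated", "Rows replicated", rows_replicated.clone());

    let active_replications: Gauge = Gauge::default();
    registry.register("active_replications", "Active replication streams", active_replications.clone());

    // Updated from the replication worker / controller as work happens.
    rows_replicated.inc_by(5_000);
    active_replications.set(42);

    let mut body = String::new();
    encode(&mut body, &registry).expect("text encoding");
    body // served on the /metrics endpoint scraped by Prometheus
}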

9.3 Dashboards

Overview Dashboard:

dashboard: "Tenant Replication Overview"
panels:
- title: "Replication Lag (P99)"
query: "histogram_quantile(0.99, replication_lag_seconds)"
visualization: "time_series"
- title: "Throughput"
query: "rate(rows_replicated_total[5m])"
visualization: "graph"
- title: "Active Replications by QoS"
query: "tenants_by_qos_tier"
visualization: "bar_chart"
- title: "SLA Compliance"
query: "sla_compliance_ratio"
visualization: "gauge"
- title: "Error Rate"
query: "rate(replication_errors_total[5m])"
visualization: "graph"

Performance Dashboard:

dashboard: "Performance Metrics"
panels:
- title: "Compression Ratio"
query: "avg(compression_ratio)"
- title: "Network Latency"
query: "histogram_quantile(0.95, network_latency_seconds)"
- title: "CPU Usage"
query: "avg(cpu_usage_percent)"
- title: "Memory Usage"
query: "avg(memory_usage_bytes)"

9.4 Alerting Rules

alerts:
  - name: "HighReplicationLag"
    severity: "critical"
    condition: "replication_lag_seconds > 30"
    duration: "5m"
    annotations:
      summary: "Replication lag is too high"
      description: "Replication lag for {{ $labels.tenant_id }} is {{ $value }}s"
  - name: "ReplicationError"
    severity: "error"
    condition: "rate(replication_errors_total[5m]) > 0.1"
    duration: "2m"
    annotations:
      summary: "High replication error rate"
  - name: "SLAViolation"
    severity: "warning"
    condition: "sla_compliance_ratio < 0.95"
    duration: "10m"
    annotations:
      summary: "SLA compliance below 95%"
  - name: "FailoverTriggered"
    severity: "critical"
    condition: "increase(failovers_total[5m]) > 0"
    annotations:
      summary: "Automatic failover triggered"
      description: "Failover occurred for {{ $labels.tenant_id }}"

9.5 Distributed Tracing

use anyhow::{Context, Result};
use opentelemetry::trace::{Span, Tracer};
use opentelemetry::KeyValue;

#[tracing::instrument(skip(self))]
pub async fn replicate_batch(&self, batch: ReplicationBatch) -> Result<()> {
    let mut span = self.tracer.start("replicate_batch");
    span.set_attribute(KeyValue::new("tenant_id", batch.tenant_id.clone()));
    span.set_attribute(KeyValue::new("batch_size", batch.size as i64));

    // CDC read
    let changes = self.read_changes(&batch).await
        .with_context(|| "Failed to read changes")?;
    span.add_event("cdc_read_complete", vec![KeyValue::new("rows", changes.len() as i64)]);

    // Transform
    let transformed = self.transform_data(changes).await?;
    span.add_event("transform_complete", vec![]);

    // Compress
    let compressed = self.compress_data(transformed).await?;
    span.add_event("compress_complete", vec![KeyValue::new("compression_ratio", compressed.ratio.to_string())]);

    // Apply
    self.apply_changes(compressed).await?;
    span.add_event("apply_complete", vec![]);

    span.end();
    Ok(())
}

9.6 Log Aggregation

use tracing::{info, warn, error};
// Structured logging
info!(
tenant_id = %tenant_id,
replication_id = %replication_id,
lag_seconds = lag.as_secs(),
"Replication lag measured"
);
warn!(
tenant_id = %tenant_id,
error = %err,
retry_count = retry_count,
"Replication retry after error"
);
error!(
tenant_id = %tenant_id,
failover_reason = %reason,
downtime_ms = downtime_ms,
"Automatic failover triggered"
);

10. Disaster Recovery

10.1 DR Architecture

graph TB
subgraph "Primary Region (us-east-1)"
PRIMARY_APP[Replication Service<br/>Primary]
PRIMARY_DB[(Source DB<br/>us-east-1)]
end
subgraph "DR Region (us-west-2)"
DR_APP[Replication Service<br/>Standby]
DR_DB[(Target DB<br/>us-west-2)]
end
subgraph "Failover Controller"
HEALTH[Health Monitor]
DECISION[Failover Decision Engine]
DNS[DNS Manager]
end
PRIMARY_APP --> PRIMARY_DB
PRIMARY_APP --> DR_DB
HEALTH --> PRIMARY_APP
HEALTH --> PRIMARY_DB
HEALTH --> DR_APP
HEALTH --> DR_DB
HEALTH --> DECISION
DECISION --> DNS
DNS -.->|Failover| DR_APP
style PRIMARY_APP fill:#2ECC71
style DR_APP fill:#E74C3C
style DECISION fill:#9B59B6

10.2 Failover Scenarios

Scenario 1: Database Failure

sequenceDiagram
participant Health as Health Monitor
participant Primary as Primary DB
participant Replica as Replica DB
participant DNS as DNS Manager
participant Client as Client Apps
Health->>Primary: Health Check
Primary-->>Health: Timeout/Error
Health->>Health: Evaluate Criteria<br/>(30s downtime)
Health->>Replica: Promote to Primary
Replica->>Replica: Set Read-Write
Health->>DNS: Update DNS<br/>(primary.db -> replica)
DNS->>Client: Propagate Change
Client->>Replica: Route Traffic
Note over Replica: Now Primary DB

Scenario 2: Region Failure

sequenceDiagram
participant Health as Health Monitor
participant PrimaryRegion as us-east-1
participant DRRegion as us-west-2
participant LB as Load Balancer
participant Client as Clients
Health->>PrimaryRegion: Health Check
PrimaryRegion-->>Health: Region Unavailable
Health->>Health: Confirm Outage<br/>(Multiple checks)
Health->>DRRegion: Activate DR Region
DRRegion->>DRRegion: Promote DBs<br/>Start Services
Health->>LB: Update Routing<br/>(Route 100% to DR)
LB->>Client: Redirect to DR
Client->>DRRegion: Routed Traffic
Note over DRRegion: Now Active Region

10.3 RTO/RPO Strategy

| Tier | RTO Target | RPO Target | Strategy |
| --- | --- | --- | --- |
| Synchronous | <5 seconds | 0 (zero data loss) | Synchronous replication, automatic failover |
| Premium | <30 seconds | <5 seconds | Asynchronous replication (<1s lag), automatic failover |
| Standard | <5 minutes | <30 seconds | Asynchronous replication, automatic failover |
| Best-Effort | <30 minutes | <5 minutes | Batch replication, manual failover |

10.4 Backup Strategy

backup_policy:
  # Full backups
  full_backup:
    frequency: "daily"
    time: "02:00 UTC"
    retention_days: 30
  # Incremental backups
  incremental_backup:
    frequency: "hourly"
    retention_days: 7
  # Transaction log backups
  transaction_log_backup:
    frequency: "5 minutes"
    retention_hours: 24
  # Cross-region backup
  cross_region_backup:
    enabled: true
    regions: ["us-west-2", "eu-west-1"]
  # Backup encryption
  encryption:
    enabled: true
    algorithm: "AES-256-GCM"
    key_rotation_days: 90

10.5 Recovery Procedures

Database Recovery:

# 1. Stop replication
./heliosdb-cli replication stop --tenant-id ${TENANT_ID}
# 2. Restore from backup
./heliosdb-cli backup restore \
--tenant-id ${TENANT_ID} \
--backup-id ${BACKUP_ID} \
--target-time "2025-11-02T10:30:00Z"
# 3. Apply transaction logs (point-in-time recovery)
./heliosdb-cli backup apply-logs \
--tenant-id ${TENANT_ID} \
--start-time "2025-11-02T10:30:00Z" \
--end-time "2025-11-02T11:00:00Z"
# 4. Validate data integrity
./heliosdb-cli validate data \
--tenant-id ${TENANT_ID}
# 5. Resume replication
./heliosdb-cli replication start --tenant-id ${TENANT_ID}

Region Failover:

# 1. Initiate failover
./heliosdb-cli failover initiate \
--source-region us-east-1 \
--target-region us-west-2 \
--dry-run false
# 2. Monitor failover progress
./heliosdb-cli failover status --failover-id ${FAILOVER_ID}
# 3. Validate failover
./heliosdb-cli failover validate \
--failover-id ${FAILOVER_ID}
# 4. Update application config
kubectl set env deployment/app \
DATABASE_URL=postgres://us-west-2.db.helios.com/db

11. Deployment Architecture

11.1 Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-replication-worker
  namespace: heliosdb
spec:
  replicas: 5
  selector:
    matchLabels:
      app: replication-worker
  template:
    metadata:
      labels:
        app: replication-worker
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: replication-worker
      containers:
        - name: worker
          image: heliosdb/tenant-replication:v6.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: metrics
              containerPort: 9090
            - name: health
              containerPort: 8080
          resources:
            requests:
              cpu: "2000m"
              memory: "4Gi"
            limits:
              cpu: "4000m"
              memory: "8Gi"
          env:
            - name: RUST_LOG
              value: "info"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
            - name: NATS_URL
              value: "nats://nats.heliosdb.svc.cluster.local:4222"
          livenessProbe:
            httpGet:
              path: /health
              port: health
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: health
            initialDelaySeconds: 10
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /etc/heliosdb
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: replication-config
---
apiVersion: v1
kind: Service
metadata:
  name: replication-worker
  namespace: heliosdb
spec:
  selector:
    app: replication-worker
  ports:
    - name: metrics
      port: 9090
      targetPort: metrics
    - name: health
      port: 8080
      targetPort: health
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: replication-worker-hpa
  namespace: heliosdb
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tenant-replication-worker
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
    - type: Pods
      pods:
        metric:
          name: replication_lag_seconds
        target:
          type: AverageValue
          averageValue: "5"

11.2 Infrastructure Components

infrastructure:
  # Database
  postgres:
    version: "16.1"
    instance_type: "db.r6g.2xlarge"
    storage_type: "io2"
    storage_size_gb: 1000
    iops: 10000
    backup_retention_days: 30
    multi_az: true
  # Message Queue
  nats:
    version: "2.10"
    cluster_size: 3
    storage_type: "gp3"
    storage_size_gb: 500
    jetstream_enabled: true
  # Kubernetes
  eks:
    version: "1.28"
    node_groups:
      - name: "replication-workers"
        instance_type: "c6i.2xlarge"
        min_size: 3
        max_size: 20
        disk_size: 100
  # Monitoring
  prometheus:
    retention_days: 90
    storage_size_gb: 500
  # Load Balancer
  alb:
    type: "application"
    scheme: "internet-facing"
    ssl_policy: "ELBSecurityPolicy-TLS13-1-2-2021-06"

11.3 Multi-Region Deployment

graph TB
subgraph "Global"
DNS[Route 53<br/>Global DNS]
CLOUDFRONT[CloudFront<br/>CDN]
end
subgraph "us-east-1"
LB_EAST[ALB us-east-1]
K8S_EAST[EKS Cluster]
DB_EAST[(RDS Primary)]
NATS_EAST[NATS Cluster]
end
subgraph "us-west-2"
LB_WEST[ALB us-west-2]
K8S_WEST[EKS Cluster]
DB_WEST[(RDS Replica)]
NATS_WEST[NATS Cluster]
end
subgraph "eu-west-1"
LB_EU[ALB eu-west-1]
K8S_EU[EKS Cluster]
DB_EU[(RDS Replica)]
NATS_EU[NATS Cluster]
end
DNS --> CLOUDFRONT
CLOUDFRONT --> LB_EAST
CLOUDFRONT --> LB_WEST
CLOUDFRONT --> LB_EU
LB_EAST --> K8S_EAST
LB_WEST --> K8S_WEST
LB_EU --> K8S_EU
K8S_EAST --> DB_EAST
K8S_WEST --> DB_WEST
K8S_EU --> DB_EU
K8S_EAST --> NATS_EAST
K8S_WEST --> NATS_WEST
K8S_EU --> NATS_EU
DB_EAST -.->|Replication| DB_WEST
DB_EAST -.->|Replication| DB_EU
NATS_EAST -.->|Sync| NATS_WEST
NATS_WEST -.->|Sync| NATS_EU

12. Architecture Decision Records

ADR-001: PostgreSQL Logical Replication for CDC

Status: Accepted Date: 2025-11-02 Context: Need efficient CDC mechanism with low overhead.

Decision: Use PostgreSQL logical replication slots with pgoutput plugin.

Rationale:

  • Native to PostgreSQL, no external dependencies
  • Low overhead (<5% CPU)
  • Transactional consistency guaranteed
  • Schema evolution support
  • Battle-tested in production

Alternatives Considered:

  1. Debezium: Requires JVM, higher resource usage, additional complexity
  2. Custom WAL Parser: Significant development effort, maintenance burden
  3. Trigger-based CDC: High overhead (20-30%), not scalable

Consequences:

  • Positive: Low overhead, native support, reliable
  • Negative: Locked to PostgreSQL, requires replication slot management

ADR-002: NATS JetStream for Message Queue

Status: Accepted Date: 2025-11-02 Context: Need reliable message queue for replication batches.

Decision: Use NATS JetStream for message queuing.

Rationale:

  • Lightweight (written in Go)
  • Exactly-once delivery semantics
  • Persistence with durability options
  • Lower latency than Kafka
  • Simpler operations

Alternatives Considered:

  1. Kafka: Heavier (JVM), more complex, higher latency
  2. RabbitMQ: Lower throughput, more resource-intensive
  3. Redis Streams: Less durable, not designed for this use case

Consequences:

  • Positive: Simple, fast, reliable, low resource usage
  • Negative: Smaller ecosystem than Kafka, fewer third-party integrations

ADR-003: Rust for Implementation Language

Status: Accepted Date: 2025-11-02 Context: Need high-performance, memory-safe language for replication.

Decision: Implement in Rust.

Rationale:

  • Memory safety without GC
  • High performance (C/C++ level)
  • Strong async ecosystem (Tokio)
  • Type safety reduces bugs
  • Excellent concurrency primitives

Alternatives Considered:

  1. Go: Simpler, but GC overhead, less performance
  2. Java: Mature ecosystem, but GC pauses, higher memory usage
  3. C++: High performance, but memory safety issues

Consequences:

  • Positive: Performance, safety, modern tooling
  • Negative: Steeper learning curve, longer compile times

ADR-004: Schema-Aware Compression

Status: Accepted Date: 2025-11-02 Context: Cross-region replication has high bandwidth costs.

Decision: Implement schema-aware compression using different algorithms per data type.

Rationale:

  • 3-5x better compression than generic algorithms
  • Reduces bandwidth costs by 60-70%
  • Faster decompression (type-specific)
  • Minimal CPU overhead

Alternatives Considered:

  1. Generic Zstd Only: Simpler, but less compression (2x vs 3-5x)
  2. No Compression: Highest bandwidth costs
  3. LZ4: Faster, but lower compression ratio

Consequences:

  • Positive: Significant bandwidth savings, cost reduction
  • Negative: More complex implementation, slight CPU increase

ADR-005: AI-Powered Predictive Replication

Status: Accepted Date: 2025-11-02 Context: Not all data is equally important; some is accessed more frequently.

Decision: Use ML to predict access patterns and prioritize hot data replication.

Rationale:

  • 40-60% reduction in lag for critical data
  • Better user experience during failover
  • Optimized bandwidth usage
  • Unique competitive advantage

Alternatives Considered:

  1. Round-Robin Replication: Simple, but treats all data equally
  2. Static Priority: Requires manual configuration
  3. LRU-based: Only considers recency, not access patterns

Consequences:

  • Positive: Improved performance for critical data, innovation
  • Negative: ML model training/maintenance overhead, complexity

ADR-006: Tenant-Level Granularity

Status: Accepted Date: 2025-11-02 Context: Traditional replication is database or table-level.

Decision: Implement replication at tenant granularity.

Rationale:

  • Selective replication (replicate only needed tenants)
  • Tenant-specific SLAs (Premium vs Standard)
  • Tenant mobility (migrate tenants across regions)
  • Cost allocation per tenant
  • World-first innovation

Alternatives Considered:

  1. Database-Level: Cannot isolate tenants
  2. Table-Level: Doesn’t capture tenant semantics
  3. Row-Level: Too granular, high overhead

Consequences:

  • Positive: Flexibility, innovation, business value
  • Negative: More complex routing, metadata management

ADR-007: Automatic Failover with Health Checks

Status: Accepted Date: 2025-11-02 Context: Manual failover is slow and error-prone.

Decision: Implement automatic failover based on health checks.

Rationale:

  • RTO <30 seconds (vs minutes for manual)
  • 24/7 availability (no human intervention)
  • Consistent decision-making
  • Reduced blast radius (tenant-level)

Alternatives Considered:

  1. Manual Failover Only: Slower, requires on-call engineer
  2. Automatic Without Health Checks: Risk of unnecessary failovers
  3. External Orchestrator (e.g., Consul): Additional dependency

Consequences:

  • Positive: Fast recovery, reliability, automation
  • Negative: Risk of false positives, needs careful tuning

13. Risk Analysis

13.1 Technical Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| CDC Performance Overhead | Medium | High | Use logical replication (low overhead), benchmark extensively |
| Cross-Region Latency | Medium | Medium | Predictive replication, compression, edge caching |
| Schema Evolution Bugs | Low | High | Extensive testing, gradual rollout, rollback capability |
| Data Corruption | Low | Critical | Checksums, validation, point-in-time recovery |
| Conflict Resolution Errors | Medium | Medium | Semantic resolution, manual override, audit logging |
| Failover False Positives | Medium | Medium | Multi-factor health checks, cooldown periods |

13.2 Operational Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Runaway Resource Usage | Medium | High | Resource limits, auto-scaling, monitoring alerts |
| Configuration Errors | Medium | Medium | Validation, dry-run mode, rollback capability |
| Monitoring Blind Spots | Low | High | Comprehensive metrics, distributed tracing, alerting |
| Security Vulnerabilities | Low | Critical | Security audits, penetration testing, bug bounty |

13.3 Business Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Market Timing | Low | Medium | Phased rollout, MVP first, iterate based on feedback |
| Competitive Response | Medium | Low | Patent filings, continuous innovation |
| Customer Adoption | Medium | High | Pilot programs, documentation, support |

14. Conclusion

14.1 Summary

The F6.21 Tenant Replication architecture provides a comprehensive, production-ready design for intelligent, tenant-level database replication with the following key innovations:

  1. AI-Powered Predictive Replication: Prioritizes hot data for faster replication
  2. Schema-Aware Compression: 3-5x better compression than generic algorithms
  3. Automatic Failover: <30s RTO with health-based promotion
  4. Tenant Mobility: Zero-downtime cross-region migration
  5. Semantic Conflict Resolution: AI-driven intelligent merging
  6. Differentiated QoS: SLA tiers per tenant priority

14.2 Next Steps

  1. Immediate (Week 1):

    • Review and approval
    • Team assignment
    • Environment setup
  2. Phase 1 (Months 1-2):

    • Core replication implementation
    • CDC integration
    • Basic monitoring
  3. Phase 2 (Months 3-4):

    • AI predictive engine
    • Transform pipeline
    • Advanced compression
  4. Phase 3 (Month 5):

    • Failover automation
    • Migration orchestration
  5. Phase 4 (Month 6):

    • QoS implementation
    • Production hardening
    • Documentation

14.3 Success Criteria

  • Replication lag <5 seconds (P99)
  • Throughput >50K rows/sec
  • RTO <30 seconds, RPO <5 seconds
  • Compression ratio 3-5x
  • 99.99% availability
  • Zero security breaches
  • 90%+ test coverage

Document Version: 1.0 | Status: Draft for Review | Authors: System Architecture Team | Reviewers: Engineering Lead, CTO | Approval Date: TBD


HeliosDB: World’s First Intelligent Tenant-Level Database Replication