F6.21 Tenant Replication Architecture
System Architecture & Design Document
Feature ID: F6.21
Version: 6.0
Status: Design Phase
Priority: P1 (HIGH)
Author: System Architecture Team
Date: November 2, 2025
Last Updated: November 2, 2025
Table of Contents
- Executive Summary
- System Architecture
- Component Design
- Data Architecture
- API Design
- Technology Stack
- Security Architecture
- Performance & Scalability
- Monitoring & Observability
- Disaster Recovery
- Deployment Architecture
- Architecture Decision Records
- Risk Analysis
- Conclusion
1. Executive Summary
1.1 Overview
The Tenant Replication system provides intelligent, unidirectional replication for multi-tenant databases with innovative features including AI-powered predictive replication, semantic conflict resolution, and zero-downtime tenant migration.
1.2 Key Architectural Principles
- Tenant-First Design: All operations granular at tenant level
- Cloud-Native: Kubernetes-native, containerized deployment
- Event-Driven: Asynchronous CDC-based architecture
- Scalable: Horizontal scaling for replication workers
- Observable: Comprehensive metrics and tracing
- Secure: End-to-end encryption, RBAC, audit logging
1.3 Quality Attributes
| Attribute | Target | Measurement |
|---|---|---|
| Availability | 99.99% | Uptime monitoring |
| Performance | <5s replication lag (P99) | Lag metrics |
| Scalability | 10K tenants per cluster | Load testing |
| Security | Zero data breaches | Security audits |
| Reliability | <30s RTO, <5s RPO | DR testing |
| Maintainability | <2 day bug fix cycle | Issue tracking |
2. System Architecture
2.1 High-Level Architecture (C4 Model - Level 1)
```mermaid
graph TB
  subgraph "External Systems"
    CLIENT[Client Applications]
    MONITORING[Monitoring Systems<br/>Prometheus/Grafana]
    ALERTS[Alert Systems<br/>PagerDuty/Slack]
  end

  subgraph "HeliosDB Tenant Replication System"
    API[REST/gRPC API Gateway]
    CONTROLLER[Replication Controller]
    WORKERS[Replication Workers Pool]
    PREDICTOR[AI Predictive Engine]
    FAILOVER[Automatic Failover Controller]
  end

  subgraph "Data Layer"
    SOURCE_DB[(Source DB<br/>Read-Write)]
    TARGET_DB[(Target DB<br/>Read-Only)]
    METADATA[(Metadata Store<br/>PostgreSQL)]
    QUEUE[Message Queue<br/>Kafka/NATS]
  end

  CLIENT --> API
  API --> CONTROLLER
  CONTROLLER --> WORKERS
  CONTROLLER --> PREDICTOR
  CONTROLLER --> FAILOVER

  WORKERS --> SOURCE_DB
  WORKERS --> TARGET_DB
  WORKERS --> QUEUE

  PREDICTOR --> METADATA
  FAILOVER --> METADATA

  WORKERS --> MONITORING
  FAILOVER --> ALERTS

  style API fill:#4A90E2
  style CONTROLLER fill:#50C878
  style WORKERS fill:#FFB347
  style PREDICTOR fill:#E74C3C
  style FAILOVER fill:#9B59B6
```
2.2 Container Architecture (C4 Model - Level 2)
```mermaid
graph TB
  subgraph "API Layer"
    REST[REST API Server<br/>Axum Framework]
    GRPC[gRPC Server<br/>Tonic Framework]
    AUTH[Auth Service<br/>JWT/OAuth2]
  end

  subgraph "Control Plane"
    ORCHESTRATOR[Replication Orchestrator]
    SCHEDULER[QoS Scheduler]
    MIGRATION[Migration Controller]
    HEALTH[Health Monitor]
  end

  subgraph "Data Plane"
    CDC[CDC Processor<br/>Logical Replication]
    TRANSFORM[Transform Pipeline]
    COMPRESS[Compression Engine]
    APPLY[Apply Worker]
  end

  subgraph "AI/ML Layer"
    PREDICTOR_ML[Access Pattern Predictor]
    CONFLICT_AI[Semantic Conflict Resolver]
    OPTIMIZER[Performance Optimizer]
  end

  subgraph "Storage Layer"
    REPL_META[(Replication Metadata)]
    CHECKPOINT[(Checkpoint Store)]
    METRICS_DB[(Metrics Store)]
  end

  REST --> ORCHESTRATOR
  GRPC --> ORCHESTRATOR
  AUTH --> REST
  AUTH --> GRPC

  ORCHESTRATOR --> SCHEDULER
  ORCHESTRATOR --> MIGRATION
  ORCHESTRATOR --> HEALTH

  SCHEDULER --> CDC
  CDC --> TRANSFORM
  TRANSFORM --> COMPRESS
  COMPRESS --> APPLY

  PREDICTOR_ML --> SCHEDULER
  CONFLICT_AI --> APPLY
  OPTIMIZER --> SCHEDULER

  ORCHESTRATOR --> REPL_META
  CDC --> CHECKPOINT
  HEALTH --> METRICS_DB
```
2.3 Component Diagram (C4 Model - Level 3)
```mermaid
graph TB
  subgraph "Replication Worker"
    CDC_READER[CDC Reader<br/>PostgreSQL Logical Slot]
    PARSER[WAL Parser<br/>Decode Binary Log]
    TRANSFORMER[Data Transformer<br/>Anonymize/Aggregate]
    COMPRESSOR[Schema-Aware Compressor<br/>Delta/Dictionary]
    NETWORK[Network Client<br/>TLS 1.3]
    APPLIER[Transaction Applier<br/>Batch Insert/Update]
  end

  subgraph "AI Engine"
    ACCESS_ANALYZER[Access Pattern Analyzer<br/>Time-series ML]
    PRIORITY_QUEUE[Priority Queue<br/>Hot Data First]
    PATTERN_DETECTOR[Pattern Detector<br/>Anomaly Detection]
  end

  subgraph "Conflict Resolution"
    DETECTOR[Conflict Detector<br/>Timestamp/Version]
    SEMANTIC[Semantic Resolver<br/>LLM Integration]
    STRATEGY[Resolution Strategy<br/>LWW/Custom]
  end

  CDC_READER --> PARSER
  PARSER --> TRANSFORMER
  TRANSFORMER --> COMPRESSOR
  COMPRESSOR --> NETWORK
  NETWORK --> APPLIER

  ACCESS_ANALYZER --> PRIORITY_QUEUE
  PRIORITY_QUEUE --> CDC_READER
  PATTERN_DETECTOR --> ACCESS_ANALYZER

  APPLIER --> DETECTOR
  DETECTOR --> SEMANTIC
  SEMANTIC --> STRATEGY
  STRATEGY --> APPLIER
```
2.4 Data Flow Architecture
```mermaid
sequenceDiagram
  participant Source as Source DB<br/>(Read-Write)
  participant CDC as CDC Processor
  participant Transform as Transform Pipeline
  participant Compress as Compression
  participant Queue as Message Queue
  participant Apply as Apply Worker
  participant Target as Target DB<br/>(Read-Only)
  participant Monitor as Monitoring

  Source->>CDC: Transaction Commit
  CDC->>CDC: Read WAL/Redo Log
  CDC->>Transform: Raw Change Events

  Transform->>Transform: Apply Transforms<br/>(Anonymize/Aggregate)
  Transform->>Compress: Transformed Data

  Compress->>Compress: Schema-Aware<br/>Compression
  Compress->>Queue: Compressed Batch

  Queue->>Apply: Fetch Batch
  Apply->>Apply: Decompress
  Apply->>Apply: Conflict Detection

  alt No Conflict
    Apply->>Target: Apply Changes
    Target-->>Apply: ACK
  else Conflict Detected
    Apply->>Apply: Semantic Resolution
    Apply->>Target: Apply Merged
    Target-->>Apply: ACK
  end

  Apply->>Monitor: Update Metrics<br/>(Lag, Throughput)
  Monitor->>Monitor: Check SLA

  alt SLA Violated
    Monitor->>Monitor: Trigger Alert
  end
```
3. Component Design
3.1 Replication Controller
Responsibility: Orchestrate replication lifecycle, manage worker pool, enforce SLAs.
```mermaid
graph LR
  subgraph "Replication Controller"
    API_HANDLER[API Handler]
    LIFECYCLE[Lifecycle Manager]
    POOL[Worker Pool Manager]
    SLA[SLA Enforcer]
    STATE[State Machine]
  end

  subgraph "State Machine States"
    INIT[Initializing]
    SYNC[Initial Sync]
    CDC[CDC Streaming]
    PAUSE[Paused]
    FAILOVER[Failing Over]
    STOPPED[Stopped]
  end

  API_HANDLER --> LIFECYCLE
  LIFECYCLE --> STATE
  STATE --> POOL
  POOL --> SLA

  STATE --> INIT
  INIT --> SYNC
  SYNC --> CDC
  CDC --> PAUSE
  CDC --> FAILOVER
  PAUSE --> CDC
  FAILOVER --> STOPPED
  CDC --> STOPPED
```
Key Algorithms:
- Worker Pool Sizing:
```
workers = min(
    max_workers,
    ceil(tenant_count / tenants_per_worker),
    available_cpu_cores
)
```
- SLA Enforcement:
```
if replication_lag > qos_threshold:
    if priority == Premium:
        allocate_dedicated_worker()
    elif priority == Standard:
        increase_batch_frequency()
    else:
        log_warning()
```
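A minimal Rust sketch of the sizing rule above; `max_workers` and `tenants_per_worker` are the configuration values named in this section, and the function shape is illustrative rather than a fixed interface:
```rust
// Hedged sketch: worker-pool sizing per the rule above. Demand grows
// with tenant count but is capped by configuration and available cores.
fn pool_size(
    tenant_count: usize,
    tenants_per_worker: usize,
    max_workers: usize,
    available_cpu_cores: usize,
) -> usize {
    // Ceiling division: how many workers the tenant count alone demands.
    let by_tenants = (tenant_count + tenants_per_worker - 1) / tenants_per_worker;
    by_tenants.min(max_workers).min(available_cpu_cores)
}
```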
3.2 CDC Processor
Responsibility: Capture changes from the source database using logical replication.
```mermaid
graph TB
  subgraph "CDC Processor"
    SLOT[Replication Slot<br/>PostgreSQL]
    DECODER[Binary Decoder<br/>WAL Format]
    FILTER[Table Filter<br/>Tenant Isolation]
    BUFFER[Event Buffer<br/>Ring Buffer]
    CHECKPOINT[Checkpoint Manager<br/>LSN Tracking]
  end

  SOURCE_DB[(Source DB)] --> SLOT
  SLOT --> DECODER
  DECODER --> FILTER
  FILTER --> BUFFER
  BUFFER --> TRANSFORM[To Transform Pipeline]

  BUFFER --> CHECKPOINT
  CHECKPOINT --> METADATA[(Metadata Store)]

  style SLOT fill:#E74C3C
  style DECODER fill:#3498DB
  style FILTER fill:#2ECC71
```
Technologies:
- PostgreSQL: Logical replication slots (pgoutput plugin)
- Protocol: PostgreSQL replication protocol (streaming)
- Format: Binary WAL decoding (faster than text)
Checkpoint Strategy:
```rust
pub struct Checkpoint {
    tenant_id: String,
    lsn: u64,                          // Log Sequence Number
    timestamp: DateTime<Utc>,
    transaction_id: u64,
    table_lsns: HashMap<String, u64>,
}
```
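On restart, the CDC reader resumes from the last durable checkpoint instead of re-reading the WAL from the beginning. A hedged sketch of that lookup, where `MetadataStore` and `latest_checkpoint` are hypothetical names for the metadata-store access path:
```rust
// Hedged sketch (hypothetical MetadataStore API): fetch the newest
// checkpoint for a tenant and resume streaming at its LSN; a missing
// checkpoint means the initial sync has not completed yet.
async fn resume_lsn(store: &MetadataStore, tenant_id: &str) -> Option<u64> {
    store
        .latest_checkpoint(tenant_id)
        .await
        .map(|checkpoint| checkpoint.lsn)
}
```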
3.3 Transform Pipeline
Responsibility: Apply transformations to replicated data (anonymization, aggregation, filtering).
```mermaid
graph LR
  INPUT[Input Batch] --> VALIDATOR[Schema Validator]
  VALIDATOR --> ANONYMIZER[PII Anonymizer]
  ANONYMIZER --> AGGREGATOR[Aggregator]
  AGGREGATOR --> FILTER[Row Filter]
  FILTER --> ENRICHER[Data Enricher]
  ENRICHER --> OUTPUT[Output Batch]

  SCHEMA[(Schema Metadata)] --> VALIDATOR
  RULES[(Transform Rules)] --> ANONYMIZER
  RULES --> AGGREGATOR
  RULES --> FILTER

  style ANONYMIZER fill:#E74C3C
  style AGGREGATOR fill:#3498DB
  style FILTER fill:#2ECC71
  style ENRICHER fill:#9B59B6
```
Transform Types:
- Anonymization:
```rust
pub enum AnonymizationMethod {
    Hash(HashAlgorithm),              // SHA-256 hash
    Tokenize(TokenService),           // Vault token
    Redact,                           // Replace with ***
    Generalize(GeneralizationRule),   // Age -> Age Group
}
```
- Aggregation:
```rust
pub struct AggregationTransform {
    group_by: Vec<String>,
    aggregations: Vec<AggregationFunc>,
    window: TimeWindow,
    emit_policy: EmitPolicy,
}
```
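As one concrete rendering of the `Hash` variant above, a minimal sketch using the `ring` crate already listed in the dependencies; the function name is illustrative:
```rust
use ring::digest::{digest, SHA256};

// Hedged sketch: SHA-256 anonymization of a single field value.
// Hashing is deterministic, so anonymized keys still join consistently
// across replicated tables.
fn anonymize_hash(value: &str) -> String {
    digest(&SHA256, value.as_bytes())
        .as_ref()
        .iter()
        .map(|byte| format!("{byte:02x}"))
        .collect()
}
```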
3.4 AI Predictive Engine
Responsibility: Predict access patterns and prioritize replication of hot data.
```mermaid
graph TB
  subgraph "Predictive Engine"
    COLLECTOR[Access Pattern Collector]
    FEATURES[Feature Extractor]
    MODEL[ML Model<br/>Gradient Boosting]
    PREDICTOR[Access Predictor]
    PRIORITIZER[Replication Prioritizer]
  end

  ACCESS_LOG[Access Logs] --> COLLECTOR
  COLLECTOR --> FEATURES

  FEATURES --> MODEL
  MODEL --> PREDICTOR
  PREDICTOR --> PRIORITIZER

  PRIORITIZER --> SCHEDULER[QoS Scheduler]

  subgraph "Features"
    FREQ[Access Frequency]
    RECENCY[Recency]
    SIZE[Data Size]
    PATTERN[Access Pattern<br/>Sequential/Random]
  end

  FEATURES --> FREQ
  FEATURES --> RECENCY
  FEATURES --> SIZE
  FEATURES --> PATTERN

  style MODEL fill:#E74C3C
  style PREDICTOR fill:#3498DB
```
ML Model:
- Algorithm: Gradient Boosting (XGBoost)
- Features:
- Access frequency (last 1h, 24h, 7d)
- Last access timestamp
- Table size and row count
- Read:write ratio
- User count accessing data
- Output: Priority score (0-100)
- Training: Online learning, model refreshed every hour
Prioritization Algorithm:
```
priority_score = (
    0.4 * access_frequency_score +
    0.3 * recency_score +
    0.2 * size_score +
    0.1 * pattern_score
)

replication_order = sort_by(priority_score, descending)
```
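The weighted sum translates directly into Rust; a sketch assuming the four component scores are already normalized to the 0.0-100.0 range:
```rust
// Hedged sketch: weighted priority score from the formula above,
// with component scores assumed pre-normalized.
fn priority_score(freq: f64, recency: f64, size: f64, pattern: f64) -> f64 {
    0.4 * freq + 0.3 * recency + 0.2 * size + 0.1 * pattern
}

// Replication then proceeds hottest-first (descending score).
fn order_by_priority(mut tables: Vec<(String, f64)>) -> Vec<(String, f64)> {
    tables.sort_by(|a, b| b.1.total_cmp(&a.1));
    tables
}
```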
3.5 Compression Engine
Responsibility: Schema-aware compression for bandwidth optimization.
```mermaid
graph TB
  subgraph "Compression Engine"
    ANALYZER[Schema Analyzer]
    SELECTOR[Compression Selector]
    DELTA[Delta Encoding]
    DICT[Dictionary Encoding]
    GORILLA[Gorilla Compression]
    ZSTD[Zstd Compression]
  end

  BATCH[Input Batch] --> ANALYZER
  ANALYZER --> SELECTOR

  SELECTOR -->|Integer| DELTA
  SELECTOR -->|String| DICT
  SELECTOR -->|Float/Timestamp| GORILLA
  SELECTOR -->|JSON/Text| ZSTD

  DELTA --> OUTPUT[Compressed Batch]
  DICT --> OUTPUT
  GORILLA --> OUTPUT
  ZSTD --> OUTPUT

  SCHEMA[(Schema)] --> ANALYZER

  style DELTA fill:#3498DB
  style DICT fill:#2ECC71
  style GORILLA fill:#E74C3C
  style ZSTD fill:#9B59B6
```
Compression Strategies:
| Data Type | Strategy | Ratio | Use Case |
|---|---|---|---|
| Integer (sequential) | Delta Encoding | 5-10x | IDs, counters |
| Integer (random) | Zstd | 2-3x | Random values |
| String (low cardinality) | Dictionary | 10-50x | Status codes, categories |
| String (high cardinality) | Zstd | 2-4x | Names, descriptions |
| Float (time-series) | Gorilla | 5-12x | Metrics, sensors |
| Timestamp | Delta-of-Delta | 8-15x | Event timestamps |
| JSON | Zstd | 3-5x | Document data |
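The selection logic in the table reduces to a match over the column's logical type; a hedged sketch with illustrative enum names (not the shipped API):
```rust
enum ColumnType {
    IntSequential, IntRandom,
    StringLowCard, StringHighCard,
    FloatSeries, Timestamp, Json,
}

enum Codec { Delta, DeltaOfDelta, Dictionary, Gorilla, Zstd }

// Hedged sketch: map each column's logical type to the codec in the
// table above; anything without a specialized codec falls back to Zstd.
fn select_codec(ty: &ColumnType) -> Codec {
    match ty {
        ColumnType::IntSequential => Codec::Delta,
        ColumnType::StringLowCard => Codec::Dictionary,
        ColumnType::FloatSeries => Codec::Gorilla,
        ColumnType::Timestamp => Codec::DeltaOfDelta,
        ColumnType::IntRandom | ColumnType::StringHighCard | ColumnType::Json => Codec::Zstd,
    }
}
```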
3.6 Failover Controller
Responsibility: Monitor source health, automatically promote replica on failure.
```mermaid
stateDiagram-v2
  [*] --> Monitoring

  Monitoring --> SourceHealthy: Health Check OK
  SourceHealthy --> Monitoring: Continue

  Monitoring --> SourceUnhealthy: Health Check Failed
  SourceUnhealthy --> Evaluating: Evaluate Criteria

  Evaluating --> Monitoring: Don't Failover<br/>(Transient Issue)
  Evaluating --> InitiateFailover: Failover Needed

  InitiateFailover --> StopReplication: Stop CDC
  StopReplication --> PromoteReplica: Set Read-Write
  PromoteReplica --> UpdateRouting: Update DNS/LB
  UpdateRouting --> SendAlerts: Notify Operators
  SendAlerts --> [*]

  note right of Evaluating
    Criteria:
    - Downtime > 30s
    - Replication lag < 5s
    - Replica healthy
    - Not maintenance window
  end note
```
Health Check Algorithm:
```rust
pub async fn check_health(&self) -> HealthStatus {
    let checks = vec![
        self.check_database_connectivity(),
        self.check_replication_lag(),
        self.check_disk_space(),
        self.check_cpu_usage(),
        self.check_network_connectivity(),
    ];

    // Run all probes concurrently; any single failure marks the node unhealthy.
    let results = futures::future::join_all(checks).await;

    HealthStatus {
        is_healthy: results.iter().all(|r| r.is_ok()),
        checks: results,
        timestamp: Utc::now(),
    }
}
```
4. Data Architecture
4.1 Replication Metadata Schema
```sql
-- Replication configuration
CREATE TABLE replication_configs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id VARCHAR(255) NOT NULL,
  source_connection_id UUID NOT NULL REFERENCES connections(id),
  target_connection_id UUID NOT NULL REFERENCES connections(id),

  -- Configuration
  qos_tier VARCHAR(50) NOT NULL,          -- BestEffort, Standard, Premium, Synchronous
  max_lag_seconds INTEGER NOT NULL,
  priority SMALLINT NOT NULL DEFAULT 50,  -- 0-100

  -- Filters
  table_filter JSONB,                     -- ["users.*", "orders.*"]
  row_filter JSONB,                       -- Complex predicates

  -- Transforms
  transforms JSONB,                       -- Transform pipeline config

  -- State
  status VARCHAR(50) NOT NULL,            -- Initializing, Syncing, Streaming, Paused, Stopped
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

  CONSTRAINT unique_tenant_source_target UNIQUE(tenant_id, source_connection_id, target_connection_id)
);

CREATE INDEX idx_replication_configs_tenant ON replication_configs(tenant_id);
CREATE INDEX idx_replication_configs_status ON replication_configs(status);
CREATE INDEX idx_replication_configs_qos ON replication_configs(qos_tier);

-- Replication checkpoints
CREATE TABLE replication_checkpoints (
  id BIGSERIAL PRIMARY KEY,
  replication_id UUID NOT NULL REFERENCES replication_configs(id) ON DELETE CASCADE,

  -- Checkpoint data
  lsn BIGINT NOT NULL,      -- Log Sequence Number
  transaction_id BIGINT,    -- Transaction ID
  checkpoint_time TIMESTAMPTZ NOT NULL,

  -- Per-table checkpoints
  table_checkpoints JSONB,  -- {table_name: lsn}

  -- Metrics at checkpoint
  rows_replicated BIGINT NOT NULL DEFAULT 0,
  bytes_replicated BIGINT NOT NULL DEFAULT 0,
  lag_seconds NUMERIC(10,3),

  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_checkpoints_replication ON replication_checkpoints(replication_id);
CREATE INDEX idx_checkpoints_time ON replication_checkpoints(checkpoint_time DESC);

-- Replication metrics (time-series)
CREATE TABLE replication_metrics (
  id BIGSERIAL PRIMARY KEY,
  replication_id UUID NOT NULL REFERENCES replication_configs(id) ON DELETE CASCADE,

  timestamp TIMESTAMPTZ NOT NULL,

  -- Lag metrics
  replication_lag_seconds NUMERIC(10,3),
  replication_lag_bytes BIGINT,

  -- Throughput metrics
  rows_per_second NUMERIC(12,2),
  bytes_per_second BIGINT,
  transactions_per_second NUMERIC(10,2),

  -- Performance metrics
  compression_ratio NUMERIC(5,2),
  network_latency_ms NUMERIC(8,2),
  apply_latency_ms NUMERIC(8,2),

  -- Resource metrics
  cpu_usage_percent NUMERIC(5,2),
  memory_usage_mb BIGINT,

  -- Error metrics
  conflicts_detected INTEGER DEFAULT 0,
  conflicts_resolved INTEGER DEFAULT 0,
  errors_count INTEGER DEFAULT 0
);

-- Hypertable for time-series (TimescaleDB extension)
SELECT create_hypertable('replication_metrics', 'timestamp', if_not_exists => TRUE);

CREATE INDEX idx_metrics_replication_time ON replication_metrics(replication_id, timestamp DESC);

-- Replication events (audit log)
CREATE TABLE replication_events (
  id BIGSERIAL PRIMARY KEY,
  replication_id UUID NOT NULL REFERENCES replication_configs(id) ON DELETE CASCADE,

  event_type VARCHAR(100) NOT NULL,  -- StateChange, Failover, Error, etc.
  severity VARCHAR(20) NOT NULL,     -- Info, Warning, Error, Critical

  message TEXT NOT NULL,
  details JSONB,

  -- Context
  user_id UUID,
  source_ip INET,

  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_events_replication ON replication_events(replication_id);
CREATE INDEX idx_events_type ON replication_events(event_type);
CREATE INDEX idx_events_severity ON replication_events(severity);
CREATE INDEX idx_events_time ON replication_events(created_at DESC);

-- Failover history
CREATE TABLE failover_history (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  replication_id UUID NOT NULL REFERENCES replication_configs(id),

  -- Failover details
  trigger VARCHAR(100) NOT NULL,  -- Automatic, Manual, Scheduled
  reason TEXT,

  old_source_id UUID,
  new_source_id UUID,

  -- Metrics
  failover_duration_ms BIGINT,
  downtime_ms BIGINT,
  data_loss_bytes BIGINT DEFAULT 0,

  -- Status
  status VARCHAR(50) NOT NULL,  -- InProgress, Completed, Failed
  error_message TEXT,

  started_at TIMESTAMPTZ NOT NULL,
  completed_at TIMESTAMPTZ,

  created_by UUID
);

CREATE INDEX idx_failover_replication ON failover_history(replication_id);
CREATE INDEX idx_failover_status ON failover_history(status);
CREATE INDEX idx_failover_started ON failover_history(started_at DESC);

-- Migration history
CREATE TABLE migration_history (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id VARCHAR(255) NOT NULL,

  -- Migration details
  source_region VARCHAR(100) NOT NULL,
  target_region VARCHAR(100) NOT NULL,

  -- Phases
  bulk_copy_completed_at TIMESTAMPTZ,
  cdc_catchup_completed_at TIMESTAMPTZ,
  cutover_completed_at TIMESTAMPTZ,

  -- Metrics
  data_size_gb NUMERIC(12,2),
  total_duration_seconds INTEGER,
  downtime_ms BIGINT,

  -- Status
  status VARCHAR(50) NOT NULL,  -- InProgress, Completed, Failed, RolledBack
  error_message TEXT,

  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  completed_at TIMESTAMPTZ,

  created_by UUID
);

CREATE INDEX idx_migration_tenant ON migration_history(tenant_id);
CREATE INDEX idx_migration_status ON migration_history(status);
CREATE INDEX idx_migration_created ON migration_history(created_at DESC);

-- AI model metadata
CREATE TABLE ml_models (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  model_type VARCHAR(100) NOT NULL,  -- AccessPatternPredictor, ConflictResolver
  version VARCHAR(50) NOT NULL,

  -- Model artifacts
  model_path TEXT NOT NULL,
  model_size_mb NUMERIC(10,2),

  -- Training metadata
  training_dataset_size BIGINT,
  training_duration_seconds INTEGER,

  -- Performance metrics
  accuracy NUMERIC(5,4),
  precision_score NUMERIC(5,4),
  recall NUMERIC(5,4),
  f1_score NUMERIC(5,4),

  -- Deployment
  status VARCHAR(50) NOT NULL,  -- Training, Validating, Deployed, Deprecated
  deployed_at TIMESTAMPTZ,

  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  created_by UUID
);

CREATE INDEX idx_ml_models_type ON ml_models(model_type);
CREATE INDEX idx_ml_models_status ON ml_models(status);

-- Access pattern predictions
CREATE TABLE access_predictions (
  id BIGSERIAL PRIMARY KEY,
  tenant_id VARCHAR(255) NOT NULL,
  table_name VARCHAR(255) NOT NULL,

  prediction_time TIMESTAMPTZ NOT NULL,
  prediction_window_hours INTEGER NOT NULL,

  -- Predictions
  predicted_access_count INTEGER,
  predicted_read_bytes BIGINT,
  priority_score NUMERIC(5,2),  -- 0-100

  -- Confidence
  confidence NUMERIC(5,4),      -- 0-1

  model_id UUID REFERENCES ml_models(id),
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

SELECT create_hypertable('access_predictions', 'prediction_time', if_not_exists => TRUE);

CREATE INDEX idx_predictions_tenant ON access_predictions(tenant_id);
CREATE INDEX idx_predictions_priority ON access_predictions(priority_score DESC);
```
4.2 Database Indexes Strategy
Performance Indexes:
```sql
-- Look up replication configs by tenant
CREATE INDEX idx_replication_tenant_active
  ON replication_configs(tenant_id)
  WHERE status IN ('Syncing', 'Streaming');

-- Find high-priority replications
CREATE INDEX idx_replication_priority_qos
  ON replication_configs(priority DESC, qos_tier)
  WHERE status = 'Streaming';

-- Checkpoint lookup for resume
CREATE INDEX idx_checkpoint_latest
  ON replication_checkpoints(replication_id, checkpoint_time DESC);

-- Metrics time-range queries
CREATE INDEX idx_metrics_time_range
  ON replication_metrics(replication_id, timestamp DESC)
  WHERE replication_lag_seconds > 5;  -- Alert threshold
```
4.3 Data Retention Policies
```sql
-- Retention policies (cleanup old data)
CREATE TABLE data_retention_policies (
  table_name VARCHAR(255) PRIMARY KEY,
  retention_days INTEGER NOT NULL,
  cleanup_enabled BOOLEAN DEFAULT TRUE
);

INSERT INTO data_retention_policies VALUES
  ('replication_metrics', 90, TRUE),      -- 90 days
  ('replication_events', 365, TRUE),      -- 1 year
  ('replication_checkpoints', 30, TRUE),  -- 30 days
  ('access_predictions', 7, TRUE),        -- 7 days
  ('failover_history', 730, TRUE),        -- 2 years
  ('migration_history', 730, TRUE);       -- 2 years

-- Cleanup job (run daily)
CREATE OR REPLACE FUNCTION cleanup_old_data() RETURNS void AS $$
DECLARE
  policy RECORD;
BEGIN
  FOR policy IN SELECT * FROM data_retention_policies WHERE cleanup_enabled LOOP
    EXECUTE format(
      'DELETE FROM %I WHERE created_at < NOW() - INTERVAL ''%s days''',
      policy.table_name,
      policy.retention_days
    );
  END LOOP;
END;
$$ LANGUAGE plpgsql;
```
5. API Design
See separate document: F6.21_API_SPECIFICATION.md
Summary:
- REST API: CRUD operations, management, monitoring
- gRPC API: High-performance replication operations
- WebSocket API: Real-time metrics streaming
6. Technology Stack
6.1 Technology Selection Matrix
| Component | Technology | Rationale | Alternatives Considered |
|---|---|---|---|
| Language | Rust | Performance, safety, async | Go (simpler, slower), Java (GC overhead) |
| Web Framework | Axum | Fast, ergonomic, Tower ecosystem | Actix-web (less maintained), Rocket (slower) |
| gRPC | Tonic | Native Rust, HTTP/2 | grpc-rs (C++ bindings), tarpc (simpler) |
| Async Runtime | Tokio | Industry standard, mature | async-std (smaller ecosystem) |
| Database | PostgreSQL 16+ | Logical replication, JSON, TimescaleDB | MySQL (weaker replication), MongoDB (NoSQL) |
| Message Queue | NATS JetStream | Lightweight, persistence, exactly-once | Kafka (heavier), RabbitMQ (slower) |
| CDC | PostgreSQL Logical Replication | Native, low overhead | Debezium (JVM required), custom WAL parser |
| Compression | Zstd, custom | Best ratio, fast | LZ4 (less compression), Snappy (faster, less ratio) |
| ML Framework | Candle (Rust) | Native Rust, no Python | PyTorch (requires Python bridge), ONNX Runtime |
| Metrics | Prometheus | Industry standard, pull-based | InfluxDB (time-series only), Datadog (expensive) |
| Tracing | OpenTelemetry | Standard, vendor-agnostic | Jaeger only (limited), custom |
| Config | YAML + Env Vars | Human-readable, secure secrets | TOML (less common), JSON (no comments) |
| Orchestration | Kubernetes | Cloud-native, scalable | Docker Swarm (simpler, less features), Nomad |
6.2 Dependency Graph
```mermaid
graph TB
  subgraph "External Dependencies"
    TOKIO[tokio 1.35+]
    AXUM[axum 0.7+]
    TONIC[tonic 0.11+]
    SQLX[sqlx 0.7+]
    NATS[async-nats 0.33+]
    ZSTD[zstd 0.13+]
    PROMETHEUS[prometheus-client 0.22+]
    OTEL[opentelemetry 0.21+]
  end

  subgraph "Internal Dependencies"
    MULTITENANCY[heliosdb-multitenancy]
    METADATA[heliosdb-metadata]
    SECURITY[heliosdb-security]
    NETWORK[heliosdb-network]
    STREAMING[heliosdb-streaming]
  end

  APP[heliosdb-tenant-replication] --> TOKIO
  APP --> AXUM
  APP --> TONIC
  APP --> SQLX
  APP --> NATS
  APP --> ZSTD
  APP --> PROMETHEUS
  APP --> OTEL

  APP --> MULTITENANCY
  APP --> METADATA
  APP --> SECURITY
  APP --> NETWORK
  APP --> STREAMING

  style APP fill:#E74C3C
  style MULTITENANCY fill:#3498DB
  style METADATA fill:#2ECC71
```
6.3 Technology Versions
```toml
[dependencies]
# Async runtime
tokio = { version = "1.35", features = ["full"] }
tokio-util = "0.7"

# Web frameworks
axum = "0.7"
tower = "0.4"
tower-http = "0.5"
tonic = "0.11"
prost = "0.12"

# Database
sqlx = { version = "0.7", features = ["postgres", "runtime-tokio", "tls-rustls", "json", "uuid", "chrono"] }
deadpool-postgres = "0.12"

# Message queue
async-nats = "0.33"

# Serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
serde_yaml = "0.9"

# Compression
zstd = "0.13"
snap = "1.1"  # Snappy

# Cryptography
aes-gcm = "0.10"
ring = "0.17"

# Metrics & observability
prometheus-client = "0.22"
opentelemetry = "0.21"
opentelemetry-otlp = "0.14"
tracing = "0.1"
tracing-subscriber = "0.3"

# ML/AI
candle-core = "0.3"
candle-nn = "0.3"

# Utilities
anyhow = "1.0"
thiserror = "1.0"
uuid = { version = "1.6", features = ["v4", "serde"] }
chrono = { version = "0.4", features = ["serde"] }
```
7. Security Architecture
7.1 Security Layers
```mermaid
graph TB
  subgraph "Network Security"
    TLS[TLS 1.3<br/>All Connections]
    MTLS[mTLS<br/>Service-to-Service]
    FW[Firewall Rules<br/>Network Policies]
  end

  subgraph "Authentication & Authorization"
    JWT[JWT Tokens<br/>API Access]
    RBAC[RBAC<br/>Permission Model]
    OAUTH[OAuth2<br/>Third-party Auth]
  end

  subgraph "Data Security"
    ENCRYPT_TRANSIT[Encryption in Transit<br/>AES-256-GCM]
    ENCRYPT_REST[Encryption at Rest<br/>Database TDE]
    MASKING[Data Masking<br/>PII Protection]
  end

  subgraph "Audit & Compliance"
    AUDIT_LOG[Audit Logging<br/>All Operations]
    RETENTION[Data Retention<br/>GDPR Compliance]
    ACCESS_LOG[Access Logs<br/>Security Events]
  end

  CLIENT[Client] --> TLS
  TLS --> JWT
  JWT --> RBAC

  RBAC --> ENCRYPT_TRANSIT
  ENCRYPT_TRANSIT --> ENCRYPT_REST
  ENCRYPT_REST --> MASKING

  RBAC --> AUDIT_LOG
  ENCRYPT_TRANSIT --> ACCESS_LOG
  MASKING --> RETENTION
```
7.2 Authentication Flow
```mermaid
sequenceDiagram
  participant Client
  participant Gateway as API Gateway
  participant Auth as Auth Service
  participant Validator as JWT Validator
  participant RBAC
  participant Backend

  Client->>Gateway: Request + JWT Token
  Gateway->>Validator: Validate Token

  alt Invalid Token
    Validator-->>Client: 401 Unauthorized
  else Valid Token
    Validator->>RBAC: Check Permissions

    alt Insufficient Permissions
      RBAC-->>Client: 403 Forbidden
    else Authorized
      RBAC->>Backend: Forward Request
      Backend-->>Client: Response
    end
  end

  Backend->>Auth: Log Access Event
```
7.3 Encryption Strategy
In-Transit Encryption:
```rust
pub struct EncryptionConfig {
    // TLS configuration
    tls_version: TlsVersion,          // TlsVersion::TLS13
    cipher_suites: Vec<CipherSuite>,  // TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256

    // Certificate management
    cert_path: String,
    key_path: String,
    ca_cert_path: String,

    // mTLS for service-to-service
    require_client_cert: bool,
}
```
At-Rest Encryption:
```rust
pub struct DataEncryption {
    algorithm: EncryptionAlgorithm,  // EncryptionAlgorithm::AES256GCM
    key_derivation: KeyDerivation,   // KeyDerivation::PBKDF2

    // Key management
    kms_provider: KMSProvider,       // AWS KMS, Azure Key Vault, GCP KMS
    key_rotation_days: u32,          // 30

    // Encrypted fields
    encrypted_columns: Vec<String>,  // "password_hash", "api_key", "connection_string"
}
```
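For the field-level encryption listed above, the `aes-gcm` crate from the dependency list provides AES-256-GCM directly. A minimal sketch; in production the key would come from the configured KMS rather than a fresh RNG draw, and the function name is illustrative:
```rust
use aes_gcm::{
    aead::{Aead, AeadCore, KeyInit, OsRng},
    Aes256Gcm,
};

// Hedged sketch: encrypt one sensitive field with AES-256-GCM.
// The key here is generated locally only for illustration.
fn encrypt_field(plaintext: &[u8]) -> anyhow::Result<Vec<u8>> {
    let key = Aes256Gcm::generate_key(&mut OsRng);
    let cipher = Aes256Gcm::new(&key);
    // 96-bit nonce, unique per message.
    let nonce = Aes256Gcm::generate_nonce(&mut OsRng);
    let mut ciphertext = cipher
        .encrypt(&nonce, plaintext)
        .map_err(|e| anyhow::anyhow!("encrypt failed: {e}"))?;
    // Prepend the nonce so decryption can recover it.
    let mut framed = nonce.to_vec();
    framed.append(&mut ciphertext);
    Ok(framed)
}
```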
7.4 Access Control Model (RBAC)
```rust
pub enum Role {
    Admin,        // Full access
    Operator,     // Manage replications
    ReadOnly,     // View only
    TenantAdmin,  // Manage own tenant
}

pub enum Permission {
    // Replication management
    CreateReplication,
    UpdateReplication,
    DeleteReplication,
    ViewReplication,

    // Failover
    TriggerFailover,
    ViewFailoverHistory,

    // Migration
    StartMigration,
    CancelMigration,

    // Monitoring
    ViewMetrics,
    ViewLogs,
    ConfigureAlerts,
}

pub struct RBACPolicy {
    role: Role,
    permissions: Vec<Permission>,
    resource_filter: ResourceFilter,  // Tenant-level isolation
}
```
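Authorization then reduces to a membership check plus tenant scoping. A hedged sketch over the types above; `matches_tenant` is a hypothetical helper on `ResourceFilter`, and `Permission` is assumed to derive `PartialEq`:
```rust
// Hedged sketch: a request is allowed when its role grants the
// permission and the target tenant passes the policy's resource filter.
impl RBACPolicy {
    pub fn allows(&self, permission: &Permission, tenant_id: &str) -> bool {
        self.permissions.contains(permission)
            && self.resource_filter.matches_tenant(tenant_id)
    }
}
```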
Role-Permission Matrix:
| Permission | Admin | Operator | ReadOnly | TenantAdmin |
|---|---|---|---|---|
| CreateReplication | ✅ | ✅ | ❌ | ✅ (own tenant) |
| UpdateReplication | ✅ | ✅ | ❌ | ✅ (own tenant) |
| DeleteReplication | ✅ | ✅ | ❌ | ✅ (own tenant) |
| ViewReplication | ✅ | ✅ | ✅ | ✅ (own tenant) |
| TriggerFailover | ✅ | ✅ | ❌ | ❌ |
| StartMigration | ✅ | ✅ | ❌ | ❌ |
| ViewMetrics | ✅ | ✅ | ✅ | ✅ (own tenant) |
| ConfigureAlerts | ✅ | ✅ | ❌ | ❌ |
7.5 Tenant Isolation
```rust
pub struct TenantIsolation {
    // Database-level isolation
    row_level_security: bool,  // PostgreSQL RLS
    tenant_id_column: String,  // Tenant identifier

    // Network isolation
    vpc_per_tenant: bool,      // VPC isolation (enterprise)
    network_policies: Vec<NetworkPolicy>,

    // Resource isolation
    cpu_quota: CpuQuota,
    memory_limit: MemoryLimit,

    // Data isolation
    encryption_key_per_tenant: bool,
}
```
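With PostgreSQL row-level security, each session must carry the tenant identity before queries run. A sketch using `sqlx`; the `app.current_tenant` setting name is illustrative, and in practice this would run on the specific connection or transaction serving the tenant's queries rather than against the pool:
```rust
use sqlx::PgPool;

// Hedged sketch: pin the tenant onto the session so RLS policies
// referencing current_setting('app.current_tenant') take effect.
async fn set_tenant(pool: &PgPool, tenant_id: &str) -> Result<(), sqlx::Error> {
    sqlx::query("SELECT set_config('app.current_tenant', $1, false)")
        .bind(tenant_id)
        .execute(pool)
        .await?;
    Ok(())
}
```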
7.6 Audit Logging
```rust
pub struct AuditLog {
    event_id: Uuid,
    event_type: AuditEventType,
    timestamp: DateTime<Utc>,

    // Actor
    user_id: Option<Uuid>,
    service_account: Option<String>,
    source_ip: IpAddr,

    // Resource
    resource_type: ResourceType,
    resource_id: String,
    tenant_id: String,

    // Action
    action: Action,
    status: ActionStatus,  // Success, Failure

    // Details
    details: serde_json::Value,

    // Compliance
    gdpr_relevant: bool,
    hipaa_relevant: bool,
}
```
8. Performance & Scalability
8.1 Performance Targets
| Metric | Target | Baseline | Stretch Goal |
|---|---|---|---|
| Replication Lag (P50) | <2 seconds | <5 seconds | <1 second |
| Replication Lag (P99) | <5 seconds | <15 seconds | <3 seconds |
| Throughput | >50K rows/sec | >10K rows/sec | >100K rows/sec |
| Latency (Apply) | <100ms | <500ms | <50ms |
| CPU Usage | <60% | <80% | <40% |
| Memory Usage | <4GB per worker | <8GB | <2GB |
| Network Bandwidth | <100 Mbps | <500 Mbps | <50 Mbps |
| Compression Ratio | 3-5x | 2x | 5-10x |
8.2 Scalability Architecture
```mermaid
graph TB
  subgraph "Horizontal Scaling"
    LB[Load Balancer<br/>Nginx/HAProxy]

    subgraph "API Tier (Stateless)"
      API1[API Server 1]
      API2[API Server 2]
      API3[API Server N]
    end

    subgraph "Worker Tier (Stateful)"
      W1[Replication Worker 1<br/>Tenants 1-100]
      W2[Replication Worker 2<br/>Tenants 101-200]
      W3[Replication Worker N<br/>Tenants N]
    end
  end

  subgraph "Data Tier"
    PGPOOL[PgPool<br/>Connection Pooling]

    subgraph "Database Cluster"
      PG_PRIMARY[(PostgreSQL Primary)]
      PG_REPLICA1[(PostgreSQL Replica 1)]
      PG_REPLICA2[(PostgreSQL Replica 2)]
    end
  end

  LB --> API1
  LB --> API2
  LB --> API3

  API1 --> W1
  API2 --> W2
  API3 --> W3

  W1 --> PGPOOL
  W2 --> PGPOOL
  W3 --> PGPOOL

  PGPOOL --> PG_PRIMARY
  PGPOOL --> PG_REPLICA1
  PGPOOL --> PG_REPLICA2

  PG_PRIMARY --> PG_REPLICA1
  PG_PRIMARY --> PG_REPLICA2

  style LB fill:#3498DB
  style PGPOOL fill:#2ECC71
  style PG_PRIMARY fill:#E74C3C
```
8.3 Scaling Strategies
Vertical Scaling:
```yaml
# Worker resource limits
worker:
  cpu_request: "2000m"
  cpu_limit: "4000m"
  memory_request: "4Gi"
  memory_limit: "8Gi"

# Scaling triggers
scale_up_triggers:
  - cpu_usage > 70%
  - memory_usage > 75%
  - replication_lag > 10s

scale_down_triggers:
  - cpu_usage < 30% for 10 minutes
  - memory_usage < 40% for 10 minutes
```
Horizontal Scaling:
```rust
pub struct ScalingPolicy {
    min_workers: usize,
    max_workers: usize,

    scale_up_policy: ScaleUpPolicy,      // threshold: 100 tenants/worker, cooldown: 300s
    scale_down_policy: ScaleDownPolicy,  // threshold: 30 tenants/worker, cooldown: 600s
}
```
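The policy above amounts to clamping tenant-driven demand between the two bounds; a minimal sketch (cooldown handling omitted):
```rust
// Hedged sketch: target worker count under the scale-up threshold,
// clamped to the policy's [min, max] band.
fn desired_workers(tenants: usize, tenants_per_worker: usize, min: usize, max: usize) -> usize {
    let demand = (tenants + tenants_per_worker - 1) / tenants_per_worker; // ceil division
    demand.clamp(min, max)
}
```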
8.4 Performance Optimization Techniques
Batching:
```rust
pub struct BatchConfig {
    max_batch_size: usize,             // 5000 rows
    max_batch_delay: Duration,         // 100ms

    // Dynamic batching
    adaptive_batch_size: bool,         // Adjust based on load
    batch_size_range: (usize, usize),  // (1000, 10000)
}
```
Connection Pooling:
```rust
pub struct ConnectionPool {
    min_connections: u32,          // 10
    max_connections: u32,          // 100
    connection_timeout: Duration,  // 5s
    idle_timeout: Duration,        // 60s

    // Per-tenant pools
    tenant_pool_size: u32,         // 2
}
```
Caching:
```rust
pub struct CacheStrategy {
    // Metadata cache
    schema_cache_ttl: Duration,      // 5 minutes
    config_cache_ttl: Duration,      // 1 minute

    // Prediction cache
    prediction_cache_size: usize,    // 10K entries
    prediction_cache_ttl: Duration,  // 1 hour
}
```
8.5 Load Testing Plan
```yaml
load_tests:
  - name: "Baseline Load"
    tenants: 100
    transactions_per_second: 1000
    duration_minutes: 30

  - name: "Peak Load"
    tenants: 1000
    transactions_per_second: 10000
    duration_minutes: 60

  - name: "Stress Test"
    tenants: 5000
    transactions_per_second: 50000
    duration_minutes: 30

  - name: "Endurance Test"
    tenants: 500
    transactions_per_second: 5000
    duration_hours: 24
```
9. Monitoring & Observability
9.1 Observability Stack
```mermaid
graph TB
  subgraph "Application"
    APP[Replication Service]
    METRICS_EXPORTER[Metrics Exporter<br/>Prometheus Client]
    TRACES_EXPORTER[Traces Exporter<br/>OTLP]
    LOGS_EXPORTER[Logs Exporter<br/>Fluentd]
  end

  subgraph "Collection Layer"
    PROMETHEUS[Prometheus<br/>Metrics Store]
    JAEGER[Jaeger<br/>Distributed Tracing]
    LOKI[Loki<br/>Log Aggregation]
  end

  subgraph "Visualization Layer"
    GRAFANA[Grafana<br/>Dashboards]
    ALERTMANAGER[Alertmanager<br/>Alert Routing]
  end

  subgraph "Notification Layer"
    PAGERDUTY[PagerDuty]
    SLACK[Slack]
    EMAIL[Email]
  end

  APP --> METRICS_EXPORTER
  APP --> TRACES_EXPORTER
  APP --> LOGS_EXPORTER

  METRICS_EXPORTER --> PROMETHEUS
  TRACES_EXPORTER --> JAEGER
  LOGS_EXPORTER --> LOKI

  PROMETHEUS --> GRAFANA
  JAEGER --> GRAFANA
  LOKI --> GRAFANA

  PROMETHEUS --> ALERTMANAGER
  ALERTMANAGER --> PAGERDUTY
  ALERTMANAGER --> SLACK
  ALERTMANAGER --> EMAIL

  style PROMETHEUS fill:#E74C3C
  style JAEGER fill:#3498DB
  style LOKI fill:#2ECC71
  style GRAFANA fill:#9B59B6
```
9.2 Metrics Collection
System Metrics:
```rust
pub struct SystemMetrics {
    // Replication lag
    replication_lag_seconds: Histogram,
    replication_lag_bytes: Gauge,

    // Throughput
    rows_replicated_total: Counter,
    bytes_replicated_total: Counter,
    transactions_replicated_total: Counter,

    // Performance
    replication_duration_seconds: Histogram,
    compression_ratio: Histogram,
    network_latency_seconds: Histogram,

    // Errors
    replication_errors_total: Counter,
    conflicts_detected_total: Counter,
    conflicts_resolved_total: Counter,

    // Resource usage
    cpu_usage_percent: Gauge,
    memory_usage_bytes: Gauge,
    connection_pool_active: Gauge,
    connection_pool_idle: Gauge,
}
```
Business Metrics:
```rust
pub struct BusinessMetrics {
    // Tenants
    active_replications: Gauge,
    tenants_by_qos_tier: GaugeVec,  // Labels: [tier]

    // SLA compliance
    sla_violations_total: Counter,
    sla_compliance_ratio: Gauge,

    // Failovers
    failovers_total: Counter,
    failover_duration_seconds: Histogram,

    // Migrations
    migrations_total: Counter,
    migration_downtime_seconds: Histogram,
}
```
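These counters and gauges map onto the `prometheus-client` crate from the dependency list. A hedged sketch of registering two of them; metric names here are illustrative:
```rust
use prometheus_client::metrics::{counter::Counter, gauge::Gauge};
use prometheus_client::registry::Registry;

// Hedged sketch: wire two of the metrics above into a registry that a
// /metrics endpoint can encode for Prometheus scrapes.
fn register_metrics(registry: &mut Registry) -> (Counter, Gauge) {
    let rows_replicated: Counter = Counter::default();
    let active_replications: Gauge = Gauge::default();
    registry.register(
        "rows_replicated",
        "Total rows replicated",
        rows_replicated.clone(),
    );
    registry.register(
        "active_replications",
        "Currently active replication streams",
        active_replications.clone(),
    );
    (rows_replicated, active_replications)
}
```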
9.3 Dashboards
Overview Dashboard:
dashboard: "Tenant Replication Overview"panels: - title: "Replication Lag (P99)" query: "histogram_quantile(0.99, replication_lag_seconds)" visualization: "time_series"
- title: "Throughput" query: "rate(rows_replicated_total[5m])" visualization: "graph"
- title: "Active Replications by QoS" query: "tenants_by_qos_tier" visualization: "bar_chart"
- title: "SLA Compliance" query: "sla_compliance_ratio" visualization: "gauge"
- title: "Error Rate" query: "rate(replication_errors_total[5m])" visualization: "graph"Performance Dashboard:
dashboard: "Performance Metrics"panels: - title: "Compression Ratio" query: "avg(compression_ratio)"
- title: "Network Latency" query: "histogram_quantile(0.95, network_latency_seconds)"
- title: "CPU Usage" query: "avg(cpu_usage_percent)"
- title: "Memory Usage" query: "avg(memory_usage_bytes)"9.4 Alerting Rules
```yaml
alerts:
  - name: "HighReplicationLag"
    severity: "critical"
    condition: "replication_lag_seconds > 30"
    duration: "5m"
    annotations:
      summary: "Replication lag is too high"
      description: "Replication lag for {{ $labels.tenant_id }} is {{ $value }}s"

  - name: "ReplicationError"
    severity: "error"
    condition: "rate(replication_errors_total[5m]) > 0.1"
    duration: "2m"
    annotations:
      summary: "High replication error rate"

  - name: "SLAViolation"
    severity: "warning"
    condition: "sla_compliance_ratio < 0.95"
    duration: "10m"
    annotations:
      summary: "SLA compliance below 95%"

  - name: "FailoverTriggered"
    severity: "critical"
    condition: "increase(failovers_total[5m]) > 0"
    annotations:
      summary: "Automatic failover triggered"
      description: "Failover occurred for {{ $labels.tenant_id }}"
```
9.5 Distributed Tracing
```rust
use opentelemetry::trace::{Span, Tracer};
use opentelemetry::KeyValue;

pub async fn replicate_batch(&self, batch: ReplicationBatch) -> Result<()> {
    let mut span = self.tracer.start("replicate_batch");
    span.set_attribute(KeyValue::new("tenant_id", batch.tenant_id.clone()));
    span.set_attribute(KeyValue::new("batch_size", batch.size as i64));

    // CDC read
    let changes = self.read_changes(&batch).await
        .with_context(|| "Failed to read changes")?;
    span.add_event(
        "cdc_read_complete",
        vec![KeyValue::new("rows", changes.len() as i64)],
    );

    // Transform
    let transformed = self.transform_data(changes).await?;
    span.add_event("transform_complete", vec![]);

    // Compress
    let compressed = self.compress_data(transformed).await?;
    span.add_event(
        "compress_complete",
        vec![KeyValue::new("compression_ratio", compressed.ratio)],
    );

    // Apply
    self.apply_changes(compressed).await?;
    span.add_event("apply_complete", vec![]);

    span.end();
    Ok(())
}
```
9.6 Log Aggregation
```rust
use tracing::{error, info, warn};

// Structured logging
info!(
    tenant_id = %tenant_id,
    replication_id = %replication_id,
    lag_seconds = lag.as_secs(),
    "Replication lag measured"
);

warn!(
    tenant_id = %tenant_id,
    error = %err,
    retry_count = retry_count,
    "Replication retry after error"
);

error!(
    tenant_id = %tenant_id,
    failover_reason = %reason,
    downtime_ms = downtime_ms,
    "Automatic failover triggered"
);
```
10. Disaster Recovery
10.1 DR Architecture
```mermaid
graph TB
  subgraph "Primary Region (us-east-1)"
    PRIMARY_APP[Replication Service<br/>Primary]
    PRIMARY_DB[(Source DB<br/>us-east-1)]
  end

  subgraph "DR Region (us-west-2)"
    DR_APP[Replication Service<br/>Standby]
    DR_DB[(Target DB<br/>us-west-2)]
  end

  subgraph "Failover Controller"
    HEALTH[Health Monitor]
    DECISION[Failover Decision Engine]
    DNS[DNS Manager]
  end

  PRIMARY_APP --> PRIMARY_DB
  PRIMARY_APP --> DR_DB

  HEALTH --> PRIMARY_APP
  HEALTH --> PRIMARY_DB
  HEALTH --> DR_APP
  HEALTH --> DR_DB

  HEALTH --> DECISION
  DECISION --> DNS

  DNS -.->|Failover| DR_APP

  style PRIMARY_APP fill:#2ECC71
  style DR_APP fill:#E74C3C
  style DECISION fill:#9B59B6
```
10.2 Failover Scenarios
Scenario 1: Database Failure
```mermaid
sequenceDiagram
  participant Health as Health Monitor
  participant Primary as Primary DB
  participant Replica as Replica DB
  participant DNS as DNS Manager
  participant Client as Client Apps

  Health->>Primary: Health Check
  Primary-->>Health: Timeout/Error

  Health->>Health: Evaluate Criteria<br/>(30s downtime)

  Health->>Replica: Promote to Primary
  Replica->>Replica: Set Read-Write

  Health->>DNS: Update DNS<br/>(primary.db -> replica)

  DNS->>Client: Propagate Change
  Client->>Replica: Route Traffic

  Note over Replica: Now Primary DB
```
Scenario 2: Region Failure
```mermaid
sequenceDiagram
  participant Health as Health Monitor
  participant PrimaryRegion as us-east-1
  participant DRRegion as us-west-2
  participant LB as Load Balancer
  participant Client as Clients

  Health->>PrimaryRegion: Health Check
  PrimaryRegion-->>Health: Region Unavailable

  Health->>Health: Confirm Outage<br/>(Multiple checks)

  Health->>DRRegion: Activate DR Region
  DRRegion->>DRRegion: Promote DBs<br/>Start Services

  Health->>LB: Update Routing<br/>(Route 100% to DR)

  LB->>Client: Redirect to DR
  Client->>DRRegion: Routed Traffic

  Note over DRRegion: Now Active Region
```
10.3 RTO/RPO Strategy
| Tier | RTO Target | RPO Target | Strategy |
|---|---|---|---|
| Synchronous | <5 seconds | 0 (zero data loss) | Synchronous replication, automatic failover |
| Premium | <30 seconds | <5 seconds | Asynchronous replication (<1s lag), automatic failover |
| Standard | <5 minutes | <30 seconds | Asynchronous replication, automatic failover |
| Best-Effort | <30 minutes | <5 minutes | Batch replication, manual failover |
10.4 Backup Strategy
```yaml
backup_policy:
  # Full backups
  full_backup:
    frequency: "daily"
    time: "02:00 UTC"
    retention_days: 30

  # Incremental backups
  incremental_backup:
    frequency: "hourly"
    retention_days: 7

  # Transaction log backups
  transaction_log_backup:
    frequency: "5 minutes"
    retention_hours: 24

  # Cross-region backup
  cross_region_backup:
    enabled: true
    regions: ["us-west-2", "eu-west-1"]

  # Backup encryption
  encryption:
    enabled: true
    algorithm: "AES-256-GCM"
    key_rotation_days: 90
```
10.5 Recovery Procedures
Database Recovery:
```bash
# 1. Stop replication
./heliosdb-cli replication stop --tenant-id ${TENANT_ID}

# 2. Restore from backup
./heliosdb-cli backup restore \
  --tenant-id ${TENANT_ID} \
  --backup-id ${BACKUP_ID} \
  --target-time "2025-11-02T10:30:00Z"

# 3. Apply transaction logs (point-in-time recovery)
./heliosdb-cli backup apply-logs \
  --tenant-id ${TENANT_ID} \
  --start-time "2025-11-02T10:30:00Z" \
  --end-time "2025-11-02T11:00:00Z"

# 4. Validate data integrity
./heliosdb-cli validate data \
  --tenant-id ${TENANT_ID}

# 5. Resume replication
./heliosdb-cli replication start --tenant-id ${TENANT_ID}
```
Region Failover:
```bash
# 1. Initiate failover
./heliosdb-cli failover initiate \
  --source-region us-east-1 \
  --target-region us-west-2 \
  --dry-run false

# 2. Monitor failover progress
./heliosdb-cli failover status --failover-id ${FAILOVER_ID}

# 3. Validate failover
./heliosdb-cli failover validate \
  --failover-id ${FAILOVER_ID}

# 4. Update application config
kubectl set env deployment/app \
  DATABASE_URL=postgres://us-west-2.db.helios.com/db
```
11. Deployment Architecture
11.1 Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-replication-worker
  namespace: heliosdb
spec:
  replicas: 5
  selector:
    matchLabels:
      app: replication-worker
  template:
    metadata:
      labels:
        app: replication-worker
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: replication-worker
      containers:
        - name: worker
          image: heliosdb/tenant-replication:v6.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: metrics
              containerPort: 9090
            - name: health
              containerPort: 8080
          resources:
            requests:
              cpu: "2000m"
              memory: "4Gi"
            limits:
              cpu: "4000m"
              memory: "8Gi"
          env:
            - name: RUST_LOG
              value: "info"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
            - name: NATS_URL
              value: "nats://nats.heliosdb.svc.cluster.local:4222"
          livenessProbe:
            httpGet:
              path: /health
              port: health
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: health
            initialDelaySeconds: 10
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /etc/heliosdb
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: replication-config
---
apiVersion: v1
kind: Service
metadata:
  name: replication-worker
  namespace: heliosdb
spec:
  selector:
    app: replication-worker
  ports:
    - name: metrics
      port: 9090
      targetPort: metrics
    - name: health
      port: 8080
      targetPort: health
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: replication-worker-hpa
  namespace: heliosdb
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tenant-replication-worker
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
    - type: Pods
      pods:
        metric:
          name: replication_lag_seconds
        target:
          type: AverageValue
          averageValue: "5"
```
11.2 Infrastructure Components
```yaml
infrastructure:
  # Database
  postgres:
    version: "16.1"
    instance_type: "db.r6g.2xlarge"
    storage_type: "io2"
    storage_size_gb: 1000
    iops: 10000
    backup_retention_days: 30
    multi_az: true

  # Message Queue
  nats:
    version: "2.10"
    cluster_size: 3
    storage_type: "gp3"
    storage_size_gb: 500
    jetstream_enabled: true

  # Kubernetes
  eks:
    version: "1.28"
    node_groups:
      - name: "replication-workers"
        instance_type: "c6i.2xlarge"
        min_size: 3
        max_size: 20
        disk_size: 100

  # Monitoring
  prometheus:
    retention_days: 90
    storage_size_gb: 500

  # Load Balancer
  alb:
    type: "application"
    scheme: "internet-facing"
    ssl_policy: "ELBSecurityPolicy-TLS13-1-2-2021-06"
```
11.3 Multi-Region Deployment
```mermaid
graph TB
  subgraph "Global"
    DNS[Route 53<br/>Global DNS]
    CLOUDFRONT[CloudFront<br/>CDN]
  end

  subgraph "us-east-1"
    LB_EAST[ALB us-east-1]
    K8S_EAST[EKS Cluster]
    DB_EAST[(RDS Primary)]
    NATS_EAST[NATS Cluster]
  end

  subgraph "us-west-2"
    LB_WEST[ALB us-west-2]
    K8S_WEST[EKS Cluster]
    DB_WEST[(RDS Replica)]
    NATS_WEST[NATS Cluster]
  end

  subgraph "eu-west-1"
    LB_EU[ALB eu-west-1]
    K8S_EU[EKS Cluster]
    DB_EU[(RDS Replica)]
    NATS_EU[NATS Cluster]
  end

  DNS --> CLOUDFRONT
  CLOUDFRONT --> LB_EAST
  CLOUDFRONT --> LB_WEST
  CLOUDFRONT --> LB_EU

  LB_EAST --> K8S_EAST
  LB_WEST --> K8S_WEST
  LB_EU --> K8S_EU

  K8S_EAST --> DB_EAST
  K8S_WEST --> DB_WEST
  K8S_EU --> DB_EU

  K8S_EAST --> NATS_EAST
  K8S_WEST --> NATS_WEST
  K8S_EU --> NATS_EU

  DB_EAST -.->|Replication| DB_WEST
  DB_EAST -.->|Replication| DB_EU

  NATS_EAST -.->|Sync| NATS_WEST
  NATS_WEST -.->|Sync| NATS_EU
```
12. Architecture Decision Records
ADR-001: PostgreSQL Logical Replication for CDC
Status: Accepted
Date: 2025-11-02
Context: Need an efficient CDC mechanism with low overhead.
Decision: Use PostgreSQL logical replication slots with pgoutput plugin.
Rationale:
- Native to PostgreSQL, no external dependencies
- Low overhead (<5% CPU)
- Transactional consistency guaranteed
- Schema evolution support
- Battle-tested in production
Alternatives Considered:
- Debezium: Requires JVM, higher resource usage, additional complexity
- Custom WAL Parser: Significant development effort, maintenance burden
- Trigger-based CDC: High overhead (20-30%), not scalable
Consequences:
- Positive: Low overhead, native support, reliable
- Negative: Locked to PostgreSQL, requires replication slot management
ADR-002: NATS JetStream for Message Queue
Status: Accepted
Date: 2025-11-02
Context: Need a reliable message queue for replication batches.
Decision: Use NATS JetStream for message queuing.
Rationale:
- Lightweight (written in Go)
- Exactly-once delivery semantics
- Persistence with durability options
- Lower latency than Kafka
- Simpler operations
Alternatives Considered:
- Kafka: Heavier (JVM), more complex, higher latency
- RabbitMQ: Lower throughput, more resource-intensive
- Redis Streams: Less durable, not designed for this use case
Consequences:
- Positive: Simple, fast, reliable, low resource usage
- Negative: Smaller ecosystem than Kafka, fewer third-party integrations
ADR-003: Rust for Implementation Language
Status: Accepted
Date: 2025-11-02
Context: Need a high-performance, memory-safe language for replication.
Decision: Implement in Rust.
Rationale:
- Memory safety without GC
- High performance (C/C++ level)
- Strong async ecosystem (Tokio)
- Type safety reduces bugs
- Excellent concurrency primitives
Alternatives Considered:
- Go: Simpler, but GC overhead, less performance
- Java: Mature ecosystem, but GC pauses, higher memory usage
- C++: High performance, but memory safety issues
Consequences:
- Positive: Performance, safety, modern tooling
- Negative: Steeper learning curve, longer compile times
ADR-004: Schema-Aware Compression
Status: Accepted
Date: 2025-11-02
Context: Cross-region replication has high bandwidth costs.
Decision: Implement schema-aware compression using different algorithms per data type.
Rationale:
- 3-5x better compression than generic algorithms
- Reduces bandwidth costs by 60-70%
- Faster decompression (type-specific)
- Minimal CPU overhead
Alternatives Considered:
- Generic Zstd Only: Simpler, but less compression (2x vs 3-5x)
- No Compression: Highest bandwidth costs
- LZ4: Faster, but lower compression ratio
Consequences:
- Positive: Significant bandwidth savings, cost reduction
- Negative: More complex implementation, slight CPU increase
ADR-005: AI-Powered Predictive Replication
Status: Accepted
Date: 2025-11-02
Context: Not all data is equally important; some is accessed more frequently.
Decision: Use ML to predict access patterns and prioritize hot data replication.
Rationale:
- 40-60% reduction in lag for critical data
- Better user experience during failover
- Optimized bandwidth usage
- Unique competitive advantage
Alternatives Considered:
- Round-Robin Replication: Simple, but treats all data equally
- Static Priority: Requires manual configuration
- LRU-based: Only considers recency, not access patterns
Consequences:
- Positive: Improved performance for critical data, innovation
- Negative: ML model training/maintenance overhead, complexity
ADR-006: Tenant-Level Granularity
Status: Accepted
Date: 2025-11-02
Context: Traditional replication is database- or table-level.
Decision: Implement replication at tenant granularity.
Rationale:
- Selective replication (replicate only needed tenants)
- Tenant-specific SLAs (Premium vs Standard)
- Tenant mobility (migrate tenants across regions)
- Cost allocation per tenant
- World-first innovation
Alternatives Considered:
- Database-Level: Cannot isolate tenants
- Table-Level: Doesn’t capture tenant semantics
- Row-Level: Too granular, high overhead
Consequences:
- Positive: Flexibility, innovation, business value
- Negative: More complex routing, metadata management
ADR-007: Automatic Failover with Health Checks
Status: Accepted
Date: 2025-11-02
Context: Manual failover is slow and error-prone.
Decision: Implement automatic failover based on health checks.
Rationale:
- RTO <30 seconds (vs minutes for manual)
- 24/7 availability (no human intervention)
- Consistent decision-making
- Reduced blast radius (tenant-level)
Alternatives Considered:
- Manual Failover Only: Slower, requires on-call engineer
- Automatic Without Health Checks: Risk of unnecessary failovers
- External Orchestrator (e.g., Consul): Additional dependency
Consequences:
- Positive: Fast recovery, reliability, automation
- Negative: Risk of false positives, needs careful tuning
13. Risk Analysis
13.1 Technical Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| CDC Performance Overhead | Medium | High | Use logical replication (low overhead), benchmark extensively |
| Cross-Region Latency | Medium | Medium | Predictive replication, compression, edge caching |
| Schema Evolution Bugs | Low | High | Extensive testing, gradual rollout, rollback capability |
| Data Corruption | Low | Critical | Checksums, validation, point-in-time recovery |
| Conflict Resolution Errors | Medium | Medium | Semantic resolution, manual override, audit logging |
| Failover False Positives | Medium | Medium | Multi-factor health checks, cooldown periods |
13.2 Operational Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Runaway Resource Usage | Medium | High | Resource limits, auto-scaling, monitoring alerts |
| Configuration Errors | Medium | Medium | Validation, dry-run mode, rollback capability |
| Monitoring Blind Spots | Low | High | Comprehensive metrics, distributed tracing, alerting |
| Security Vulnerabilities | Low | Critical | Security audits, penetration testing, bug bounty |
13.3 Business Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Market Timing | Low | Medium | Phased rollout, MVP first, iterate based on feedback |
| Competitive Response | Medium | Low | Patent filings, continuous innovation |
| Customer Adoption | Medium | High | Pilot programs, documentation, support |
14. Conclusion
14.1 Summary
The F6.21 Tenant Replication architecture provides a comprehensive, production-ready design for intelligent, tenant-level database replication with the following key innovations:
- AI-Powered Predictive Replication: Prioritizes hot data for faster replication
- Schema-Aware Compression: 3-5x better compression than generic algorithms
- Automatic Failover: <30s RTO with health-based promotion
- Tenant Mobility: Zero-downtime cross-region migration
- Semantic Conflict Resolution: AI-driven intelligent merging
- Differentiated QoS: SLA tiers per tenant priority
14.2 Next Steps
1. Immediate (Week 1):
  - Review and approval
  - Team assignment
  - Environment setup
2. Phase 1 (Months 1-2):
  - Core replication implementation
  - CDC integration
  - Basic monitoring
3. Phase 2 (Months 3-4):
  - AI predictive engine
  - Transform pipeline
  - Advanced compression
4. Phase 3 (Month 5):
  - Failover automation
  - Migration orchestration
5. Phase 4 (Month 6):
  - QoS implementation
  - Production hardening
  - Documentation
14.3 Success Criteria
- Replication lag <5 seconds (P99)
- Throughput >50K rows/sec
- RTO <30 seconds, RPO <5 seconds
- Compression ratio 3-5x
- 99.99% availability
- Zero security breaches
- 90%+ test coverage
Document Version: 1.0
Status: Draft for Review
Authors: System Architecture Team
Reviewers: Engineering Lead, CTO
Approval Date: TBD
HeliosDB: World’s First Intelligent Tenant-Level Database Replication