HeliosDB Multi-Cloud Federation User Guide
Overview
Multi-cloud federation in HeliosDB enables seamless querying across data stored in AWS S3, Azure Blob Storage, and Google Cloud Storage from a single SQL interface. This powerful capability allows organizations to leverage investments in multiple cloud providers while maintaining unified data access patterns, optimizing costs through intelligent query execution, and enforcing data sovereignty constraints across geographic regions.
Federation addresses a critical challenge in modern data infrastructure: the inevitable reality that enterprise data lives across multiple cloud platforms. Rather than forcing data consolidation (expensive and time-consuming), federation allows you to query and analyze data where it lives, with HeliosDB handling the complexity of distributed execution, intelligent caching, filter pushdown optimization, and cost tracking.
Why Use Federation?
Multi-Cloud Strategy: Avoid vendor lock-in and leverage each cloud provider’s strengths. Store analytics data on AWS S3 for cost-effectiveness, sensitive EU customer data on Azure in European regions for GDPR compliance, and machine learning models on GCP for BigQuery integration.
Cost Optimization: Reduce egress charges and minimize data scanning through intelligent filter and projection pushdown. HeliosDB analyzes queries and applies filtering at the storage layer, meaning you only pay for the data you actually need.
Data Sovereignty: Enforce geographic data residency requirements. Keep PII in specific regions, comply with GDPR/CCPA/HIPAA requirements, and automatically migrate data if policies change.
Operational Flexibility: Add or remove cloud providers without reimplementing data pipelines. Query results are cached intelligently to minimize repeated scans and API calls.
Performance: Parallel query execution across clouds, optimized data format selection, and multi-level caching ensure enterprise-grade query performance even with large-scale distributed datasets.
Federation Concepts
Distributed Query Execution
Federated queries are executed using a coordinator-worker pattern. The query planner determines which data sources must be accessed, estimates costs based on data location and format, and then routes query execution to specialized connectors for each cloud provider. Results are streamed back and merged, with optional caching at multiple levels to optimize repeated queries.
The federation engine automatically parallelizes work where possible. If your query joins S3 data with Azure data, separate scans execute in parallel with results merged efficiently. Filters are pushed down to the storage layer whenever supported, dramatically reducing network traffic.
Federated Learning Integration
Beyond data access, HeliosDB supports privacy-preserving analytics through federated learning. Train models across distributed datasets without moving sensitive data. Each cloud provider’s data contributes to model training while remaining in its home region. This is particularly valuable for healthcare (HIPAA compliance), financial services (data residency requirements), and regulated industries.
Data Sovereignty and Compliance
Data sovereignty ensures data never violates regulatory constraints. The sovereignty engine tracks data classification (PII, PHI, payment card data, etc.), enforces policies (GDPR, CCPA, HIPAA, custom regulations), and automatically migrates data when policies change. The compliance dashboard provides real-time visibility into data locations and regulatory status.
Compliance Considerations
Federated queries inherently handle compliance better than data consolidation. Rather than moving sensitive data and creating compliance risks, federation keeps data at rest while enabling analysis. The audit system logs all data access across clouds, creating complete compliance trails for regulatory inspection. Policies enforce requirements like “customer data from EU cannot be processed in US regions” automatically.
Cloud Backend Setup
AWS S3 Configuration
Authentication Methods
IAM Role-Based Access (Recommended for production)
When running HeliosDB on EC2 or ECS, attach an IAM role with S3 permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetObjectTagging"
      ],
      "Resource": [
        "arn:aws:s3:::my-analytics-bucket/*",
        "arn:aws:s3:::my-analytics-bucket"
      ]
    }
  ]
}
```

Access Key and Secret (For development/cross-account scenarios)
```rust
use heliosdb_federation::fdw::{CloudCredentials, CloudProvider};

let credentials = CloudCredentials {
    provider: CloudProvider::AWS,
    access_key: "AKIA...".to_string(),
    secret_key: "wJal...".to_string(),
    session_token: None,
    region: "us-east-1".to_string(),
};
```

Cross-Account Access (For multi-account organizations)
- Configure a role in the data account that trusts the HeliosDB account
- Use STS AssumeRole to gain temporary credentials
- Essential for enterprise multi-account strategies
Bucket and Prefix Configuration
```rust
use heliosdb_federation::fdw::{S3ForeignDataWrapper, S3Options};

let options = S3Options {
    use_s3_select: true,          // Enable for CSV/JSON queries
    use_virtual_host_style: true, // Newer buckets
    use_accelerate: true,         // If you have high latency
    connect_timeout_secs: 30,
    read_timeout_secs: 300,
    max_retry_attempts: 3,
    enable_partitioning: true,
    partition_columns: vec!["year".to_string(), "month".to_string()],
    enable_caching: true,
    cache_ttl_secs: 3600,
    auto_detect_format: true,
    ..Default::default()
};

let s3_fdw = S3ForeignDataWrapper::with_options(options);
```

Partition Pruning Strategy
Organize S3 data in partitioned directory structures:
```
s3://my-bucket/
  transactions/
    year=2024/
      month=01/
        day=15/
          data.parquet
      month=02/
        day=01/
          data.parquet
  customers/
    region=US/
      state=CA/
        data.parquet
    region=EU/
      country=Germany/
        data.parquet
```

The federation engine automatically prunes partitions during query execution, eliminating unnecessary scans. When you query WHERE year=2024 AND month=01, only that partition is accessed.
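The pruning step itself is simple to sketch: match Hive-style `column=value` path segments against the query's equality predicates and scan only surviving objects. This is an illustration of the idea, not HeliosDB's implementation:

```rust
// Illustrative only: prune Hive-style partition paths (year=2024/month=01/...)
// against equality predicates before any object is scanned.
use std::collections::HashMap;

/// Returns true if the object key satisfies every `column=value` predicate
/// that appears in its path. Columns absent from the path are not pruned on.
fn partition_matches(key: &str, predicates: &HashMap<&str, &str>) -> bool {
    for segment in key.split('/') {
        if let Some((col, val)) = segment.split_once('=') {
            if let Some(expected) = predicates.get(col) {
                if *expected != val {
                    return false;
                }
            }
        }
    }
    true
}

/// Keep only the object keys whose partition path can match the predicates.
fn prune<'a>(keys: &[&'a str], predicates: &HashMap<&str, &str>) -> Vec<&'a str> {
    keys.iter().copied().filter(|k| partition_matches(k, predicates)).collect()
}
```

With the layout above, predicates `year=2024, month=01` keep only the `month=01` objects; everything else is never listed for scanning.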
Azure Blob Storage Setup
Authentication Methods
Managed Identity (Recommended for production on Azure VMs/AKS)
```rust
use heliosdb_federation::fdw::{CloudCredentials, CloudProvider};

// Automatically uses managed identity from environment
let credentials = CloudCredentials {
    provider: CloudProvider::Azure,
    access_key: String::new(), // Not needed with managed identity
    secret_key: String::new(),
    session_token: None,
    region: "eastus".to_string(),
};
```

Storage Account Key (For development)
```rust
let credentials = CloudCredentials {
    provider: CloudProvider::Azure,
    access_key: "DefaultEndpointsProtocol=https;...".to_string(),
    secret_key: "AccountKey=...".to_string(),
    session_token: None,
    region: "eastus".to_string(),
};
```

Shared Access Signature (SAS) (For limited, time-scoped access)
- Ideal for third-party integrations or read-only scenarios
- Token automatically expires after specified duration
- Granular permissions control
Container and Blob Organization
```
https://mystorageaccount.blob.core.windows.net/
  loyalty/
    2024/
      01/
        customers.parquet
        transactions.csv
  compliance/
    pii/
      eu/
        customer-data.parquet
      us/
        customer-data.parquet
  analytics/
    models/
      recommendation/
        v1.0.parquet
```

Synapse Integration
Azure Synapse Analytics integration enables leveraging PolyBase for high-performance scans:
```sql
-- Configure Azure Synapse endpoint
CREATE EXTERNAL DATA SOURCE SynapseSource
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://mystorageaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

-- Query through Synapse for better performance on large scans
SELECT * FROM SynapseSource WHERE partitionKey = '2024-01-15';
```

Google Cloud Storage Setup
Authentication Methods
Service Account Key (Recommended)
```rust
use heliosdb_federation::fdw::{CloudCredentials, CloudProvider};

let credentials = CloudCredentials {
    provider: CloudProvider::GCP,
    access_key: "gcp-project-id".to_string(),
    secret_key: r#"{"type": "service_account", "project_id": ...}"#.to_string(),
    session_token: None,
    region: "us-central1".to_string(),
};
```

Application Default Credentials (For development)
- Automatically uses credentials from the GOOGLE_APPLICATION_CREDENTIALS environment variable
- Simplifies local development without hardcoded keys
Workload Identity (For GKE deployments)
- Pod service accounts directly assume GCP service accounts
- No credential files needed in containers
- Best practice for Kubernetes deployments
BigQuery Integration
Combine GCS data access with BigQuery analysis:
```rust
use heliosdb_federation::fdw::GcpForeignDataWrapper;

let gcp_fdw = GcpForeignDataWrapper::new();

// Query GCS data and optionally load into BigQuery for analysis
// Seamless integration through BigQuery external tables
```

Bucket and Dataset Organization
```
gs://my-analytics-bucket/
  customer_data/
    regions/
      us/
        2024/
          01/
            data.parquet
      eu/
        2024/
          01/
            data.parquet
  ml_models/
    versions/
      recommendation/
        v1/
          model.parquet
  raw_events/
    year=2024/
      month=01/
        day=15/
          events.json
```

Supported Data Formats
Parquet (Recommended)
Advantages:
- Excellent filter and projection pushdown support (column pruning)
- Compression (Snappy, GZIP, Brotli) reduces storage and network costs
- Row group pruning enables filtering entire blocks without reading
- Statistics metadata enables cost estimation before scanning
- Widely adopted for data lakes
Usage:
```sql
-- Query Parquet files from S3
SELECT customer_id, SUM(amount) as total_sales
FROM s3://analytics-bucket/sales/2024/01/*.parquet
WHERE date >= '2024-01-01' AND status = 'completed'
GROUP BY customer_id
LIMIT 100;
```

Optimization Tips:
- Partition by frequently filtered columns with moderate cardinality (date, region)
- Compress with Snappy for balance of compression and speed
- Use smaller row groups (128MB) for better pruning
- Organize columns with frequently-filtered ones first
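Row group pruning, listed among the advantages above, compares a predicate against each group's min/max statistics and skips groups that cannot possibly match. A simplified sketch of that decision:

```rust
// Illustrative sketch of row-group pruning: skip any group whose
// [min, max] range cannot satisfy a simple comparison predicate.
struct RowGroupStats {
    min: i64,
    max: i64,
}

enum Pred {
    Gt(i64),
    Lt(i64),
    Eq(i64),
}

/// True if the row group MIGHT contain matching rows and must be read.
/// (Statistics can only prove absence, never presence.)
fn must_read(stats: &RowGroupStats, pred: &Pred) -> bool {
    match *pred {
        Pred::Gt(v) => stats.max > v,
        Pred::Lt(v) => stats.min < v,
        Pred::Eq(v) => stats.min <= v && v <= stats.max,
    }
}
```

With smaller row groups (the 128MB tip above), each group covers a narrower value range, so more groups can be proven irrelevant and skipped.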
ORC (Columnar Optimization)
Advantages:
- Similar compression to Parquet but often tighter (better for storage costs)
- Built-in ACID support enables updates on cloud storage
- Stripe-level indexing for precise filtering
- Better for time-series data with temporal partitioning
Usage:
```sql
-- Query ORC files from Azure
SELECT product_id, event_count, avg_duration
FROM azure://data-container/events/2024/01/*.orc
WHERE event_type IN ('purchase', 'add_to_cart')
ORDER BY event_count DESC;
```

Optimization Tips:
- Use ACID transactions for evolving datasets
- Stripe size of 64MB-256MB provides good balance
- Enable dictionary encoding for string columns
- Partition by date for temporal queries
CSV (Schema Flexibility)
Advantages:
- Human-readable, easy to import from legacy systems
- No schema definition needed (auto-detection supported)
- Supports variable numbers of columns
- Ideal for data ingestion and ETL workflows
Limitations:
- No built-in compression (compress whole files externally, e.g. with gzip)
- No schema metadata (requires inference)
- Slower parsing than columnar formats
- No filter pushdown support
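Schema inference (the auto-detection mentioned above) can be sketched as scanning a sample of rows and widening each column's type until every sampled value fits. The three-type lattice below is an illustration, not HeliosDB's actual type system:

```rust
// Naive CSV schema inference: pick the narrowest type (Int -> Float -> Str)
// that accommodates every value seen in a column's sample.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ColType {
    Int,
    Float,
    Str,
}

fn infer_value(v: &str) -> ColType {
    if v.parse::<i64>().is_ok() {
        ColType::Int
    } else if v.parse::<f64>().is_ok() {
        ColType::Float
    } else {
        ColType::Str
    }
}

/// Widen a column's running type to also fit a newly observed type.
fn widen(a: ColType, b: ColType) -> ColType {
    use ColType::*;
    match (a, b) {
        (Str, _) | (_, Str) => Str,
        (Float, _) | (_, Float) => Float,
        _ => Int,
    }
}

/// Infer one type per column from sample rows (assumes at least one row,
/// all rows the same width).
fn infer_schema(rows: &[Vec<&str>]) -> Vec<ColType> {
    let mut types = vec![ColType::Int; rows[0].len()];
    for row in rows {
        for (i, v) in row.iter().enumerate() {
            types[i] = widen(types[i], infer_value(v));
        }
    }
    types
}
```

This is why the `infer_schema_rows` sample size matters: a column that looks numeric in the first 1000 rows but contains text later will fail at scan time.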
Usage:
```sql
-- Query CSV files with schema detection
SELECT id, name, email
FROM s3://import-bucket/customers/*.csv
WHERE email LIKE '%@enterprise.com'
LIMIT 1000;
```

Configuration:

```rust
let csv_options = CsvOptions {
    delimiter: b',',
    quote: b'"',
    has_headers: true,
    date_format: "%Y-%m-%d".to_string(),
    infer_schema: true,
    infer_schema_rows: 1000, // Scan first 1000 rows for types
};
```

JSON (Nested Data)
Advantages:
- Native support for nested structures
- Schema flexibility for semi-structured data
- Ideal for API responses and log aggregation
- Supports nested filtering
Limitations:
- Variable performance on large files
- No pushdown filtering support
- Requires full scan for filtering
Usage:
```sql
-- Query JSON with nested extraction
SELECT data->>'customer_id' as customer_id,
       (data->'order'->'total')::numeric as amount,
       data->'items' as items
FROM s3://events-bucket/orders/2024/*.json
WHERE data->>'status' = 'completed'
LIMIT 500;
```

Optimization:
- Use line-delimited JSON (JSONL) for large files
- Nest frequently-filtered fields at root level
- Consider converting to Parquet for production workloads
Arrow IPC (Efficient Interchange)
Advantages:
- Language-agnostic columnar format
- Zero-copy deserialization
- Streaming format perfect for pipelines
- Native Python/R/Java support
Usage:
```sql
-- Query Arrow IPC files
SELECT * FROM azure://staging/exports/*.arrow
WHERE timestamp > NOW() - INTERVAL '7 days';
```

Federated Queries
Setting Up Federated Data Sources
First, register your cloud credentials and FDW instances with the federation manager:
```rust
use heliosdb_federation::{FederationManager, FederationConfig};
use heliosdb_federation::fdw::{
    S3ForeignDataWrapper, AzureForeignDataWrapper, GcpForeignDataWrapper,
    CloudProvider, CloudCredentials
};
use std::sync::Arc;
use std::time::Duration;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Configure federation with performance settings
    let config = FederationConfig {
        max_concurrent_queries: 64,
        query_timeout: Duration::from_secs(300),
        enable_caching: true,
        cache_size_bytes: 10 * 1024 * 1024 * 1024, // 10GB
        cache_ttl: Duration::from_secs(3600),
        enable_pushdown: true,
        enable_parallel_scans: true,
        parallel_scan_batch_size: 8192,
        ..Default::default()
    };

    let manager = FederationManager::new(config).await?;

    // Register S3 FDW with main analytics bucket
    let s3_fdw = Arc::new(S3ForeignDataWrapper::new());
    manager
        .register_fdw("analytics-s3".to_string(), CloudProvider::AWS, s3_fdw)
        .await?;

    // Register Azure FDW with EU compliance data
    let azure_fdw = Arc::new(AzureForeignDataWrapper::new());
    manager
        .register_fdw("compliance-azure".to_string(), CloudProvider::Azure, azure_fdw)
        .await?;

    // Register GCP FDW for ML datasets
    let gcp_fdw = Arc::new(GcpForeignDataWrapper::new());
    manager
        .register_fdw("ml-gcp".to_string(), CloudProvider::GCP, gcp_fdw)
        .await?;

    Ok(())
}
```

Writing Federated Queries
Federation seamlessly integrates with standard SQL. Reference cloud storage paths using URL-style notation:
```sql
-- AWS S3 query
SELECT customer_id, SUM(amount) as lifetime_value
FROM s3://sales-bucket/transactions/2024/*.parquet
WHERE status = 'completed'
GROUP BY customer_id
ORDER BY lifetime_value DESC
LIMIT 100;
```

```sql
-- Azure Blob query
SELECT loyalty_id, points_balance, last_transaction_date
FROM azure://loyalty-container/members/*.parquet
WHERE points_balance > 0
ORDER BY points_balance DESC;
```

```sql
-- Google Cloud Storage query
SELECT user_id, recommendation_score, model_version
FROM gs://ml-bucket/predictions/2024/01/*.parquet
WHERE recommendation_score > 0.8;
```

Join Strategies Across Clouds
HeliosDB optimizes joins across clouds intelligently:
Broadcast Join (Small dimension tables)
```sql
-- Join small customer dimension from Azure with large transactions from S3
SELECT t.transaction_id, c.customer_name, t.amount, c.country
FROM s3://sales/transactions.parquet t
INNER JOIN azure://dimensions/customers.parquet c
    ON t.customer_id = c.customer_id
WHERE t.amount > 1000
  AND c.country IN ('US', 'CA', 'MX');
```

The small customer table is fetched entirely and broadcast to all S3 transaction scanners, minimizing network traffic.
Hash Join (Medium datasets)
```sql
-- Join similar-sized datasets from different clouds
SELECT s.session_id, s.user_id, g.game_data
FROM s3://gaming/sessions.parquet s
INNER JOIN gs://gaming/gameplay.parquet g
    ON s.session_id = g.session_id
WHERE s.duration > 600;
```

Both datasets are partitioned on the join key and pairs of partitions are joined locally.
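Which strategy a planner picks typically hinges on the size of the smaller input. A simplified heuristic sketch (the thresholds are illustrative, not HeliosDB's actual planner constants):

```rust
// Size-based join-strategy selection, as a planner might apply it.
#[derive(Debug, PartialEq)]
enum JoinStrategy {
    Broadcast,  // ship the small side to every scanner of the large side
    Hash,       // partition both sides on the join key, join partition pairs
    NestedLoop, // probe a tiny side row by row
}

const TINY_LIMIT: u64 = 1024 * 1024;           // 1 MB (illustrative)
const BROADCAST_LIMIT: u64 = 64 * 1024 * 1024; // 64 MB (illustrative)

fn choose_join(left_bytes: u64, right_bytes: u64) -> JoinStrategy {
    let smaller = left_bytes.min(right_bytes);
    if smaller <= TINY_LIMIT {
        JoinStrategy::NestedLoop
    } else if smaller <= BROADCAST_LIMIT {
        JoinStrategy::Broadcast
    } else {
        JoinStrategy::Hash
    }
}
```

Broadcasting avoids repartitioning the large side entirely, which matters across clouds where shuffling data means paying inter-cloud egress.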
Nested Loop Join (One-off analysis)
```sql
-- Small result set from one source joined with another
WITH top_customers AS (
    SELECT customer_id
    FROM s3://sales/transactions.parquet
    GROUP BY customer_id
    HAVING COUNT(*) > 100
    LIMIT 50
)
SELECT tc.customer_id, cl.sentiment_analysis
FROM top_customers tc
INNER JOIN azure://customer-surveys/feedback.parquet cl
    ON tc.customer_id = cl.customer_id;
```

Performance Characteristics
Federation performance depends on several factors:
Data Format Impact:
- Parquet: ~50-100MB/sec per scanner (highly optimized with pushdown)
- ORC: ~40-80MB/sec (compression overhead)
- CSV: ~10-20MB/sec (no column pruning)
- JSON: ~5-15MB/sec (parsing overhead)
Network Impact:
- Intra-cloud (same cloud provider): Minimal latency, 10Gbps+ bandwidth
- Inter-cloud: 100-500ms latency, bandwidth varies by region
- Cross-continent: 150-300ms latency, potential for bottleneck
Caching Impact:
- Cache hit: <1ms response (in-memory)
- Cache miss with pushdown: 5-30 seconds (depends on data size)
- Cache miss without pushdown: 30-300 seconds (proportional to data volume)
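These throughput figures can be combined into a rough planning estimate. A back-of-envelope sketch (the per-scanner rates are the guide's rough ranges, not measured constants):

```rust
/// Estimated wall-clock seconds to scan `bytes` at a given per-scanner
/// throughput (MB/s) with `scanners` parallel workers. Ignores startup
/// latency and merge cost; useful only for order-of-magnitude planning.
fn scan_seconds(bytes: u64, mb_per_sec_per_scanner: f64, scanners: u32) -> f64 {
    let mb = bytes as f64 / (1024.0 * 1024.0);
    mb / (mb_per_sec_per_scanner * scanners as f64)
}
```

For example, a 100 GB Parquet scan at ~75 MB/s per scanner across 16 parallel scanners works out to roughly 85 seconds; the same data as JSON at ~10 MB/s would take over ten minutes.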
Filter and Projection Pushdown
How Pushdown Works
Rather than fetching entire files and filtering in-database, filter pushdown pushes predicates to the storage layer:
Without Pushdown:
- Fetch entire Parquet file (10TB)
- Parse all 100 billion rows
- Filter in-database: 100 billion rows → 1 million qualifying rows
- Project requested columns
- Return 1 million rows
Network Cost: 10TB egress
Processing: Parse 100 billion rows, filter to 1 million
With Pushdown:
- Analyze query: WHERE status='active' AND amount > 1000
- Send filter predicates to S3/Azure/GCS
- Storage layer reads metadata, prunes row groups/stripes
- Return only matching rows (1 million rows)
- Return projected columns
Network Cost: ~40GB egress (99.6% reduction!)
Processing: Read file metadata, apply filters, parse 1 million rows
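The arithmetic behind these numbers, as a reusable sketch (the $/GB rate is illustrative; real egress rates vary by provider and region):

```rust
/// Egress cost for a given byte count at a flat $/GB rate.
fn egress_cost_usd(bytes: u64, usd_per_gb: f64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0 * 1024.0) * usd_per_gb
}

/// Percentage reduction in transferred bytes from pushdown.
fn reduction_pct(before_bytes: u64, after_bytes: u64) -> f64 {
    (1.0 - after_bytes as f64 / before_bytes as f64) * 100.0
}
```

At $0.12/GB, transferring the full 10 TB costs about $1,229 versus about $4.80 for the 40 GB that survives pushdown, a ~99.6% reduction in both bytes and egress spend.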
Performance Impact
For typical queries, pushdown provides:
- Data Transfer Reduction: 90-99% fewer bytes
- Query Speed: 10-100x faster
- Cost Reduction: 90-99% less egress charges
Example with concrete numbers:
```
Scenario: Query 10TB Parquet file with 100 billion customer records
Query: SELECT id, name WHERE status='active' AND signup_date > '2023-01-01'
Expected matches: 1 million rows across 10TB

Without Pushdown:
  - Transfer: 10TB = $1,000+ (at $0.12/GB typical AWS rate)
  - Query time: ~300 seconds

With Pushdown:
  - Transfer: ~40GB = ~$5
  - Query time: ~10 seconds
  - Savings: $995 per query, 30x faster!
```

When Pushdown Is Applied
Always Applied:
- Parquet format with supported operators (=, <, >, BETWEEN, IN, LIKE)
- ORC format with similar operators
- Filtering on partition columns
- Projection of subset of columns
Sometimes Applied:
- Complex filters (AND/OR combinations) - partially pushed
- CSV files (projection only, no filtering)
- JSON files (limited to structure-based filtering)
Never Applied:
- User-defined functions in filters
- Filters on computed columns
- Non-supported file formats
Optimization Strategies
Choose Pushdown-Friendly Formats:
```sql
-- Good: Will use pushdown
SELECT customer_id, amount
FROM s3://sales/2024/*.parquet
WHERE amount > 1000 AND status = 'completed';
```

```sql
-- Suboptimal: CSV doesn't support filter pushdown
SELECT customer_id, amount
FROM s3://sales/2024/*.csv
WHERE amount > 1000 AND status = 'completed';
```

Use Partition Pruning:
```sql
-- Good: Partition pruning happens automatically
SELECT * FROM s3://sales/year=2024/month=01/day=15/*.parquet
WHERE amount > 1000;
```

```sql
-- Less optimal: Requires reading multiple partitions
SELECT * FROM s3://sales/all-years/*.parquet
WHERE EXTRACT(year FROM date) = 2024 AND amount > 1000;
```

Order Filters for Selectivity:
```sql
-- Good: Most selective filter first
SELECT * FROM s3://customers/*.parquet
WHERE country = 'US'             -- 5% of rows
  AND loyalty_status = 'premium' -- 20% of remaining rows
  AND lifetime_value > 10000;    -- 30% of remaining rows
```

```sql
-- Less optimal: Less selective filter first
SELECT * FROM s3://customers/*.parquet
WHERE lifetime_value > 10000     -- 50% of rows
  AND loyalty_status = 'premium' -- 20% of remaining rows
  AND country = 'US';            -- 5% of remaining rows
```

Multi-Level Caching
Caching Strategy
HeliosDB implements three caching layers for optimal performance:
Layer 1: Query Result Cache
- Stores complete query results
- Fastest: Returns cached results in <1ms
- Best for: Repeated exact queries
- TTL: Configurable (default 1 hour)
- Size: Up to configured cache (default 10GB)
Layer 2: Metadata Cache
- Caches file schemas, statistics, partitions
- Avoids repeated metadata calls
- Particularly valuable across clouds
- TTL: Longer than query cache (default 24 hours)
Layer 3: Distributed Scan Cache
- Caches parsed and projected batches
- Shared across similar queries
- Enables partial result reuse
- Automatic eviction on data changes
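Conceptually, Layer 1 behaves like a memoizing map keyed by the query text: an exact repeat returns the stored result, anything else falls through to a scan. A simplified sketch (not the engine's actual data structures, which also track TTL and size limits):

```rust
// Minimal query-result cache: exact-match lookup before scanning.
use std::collections::HashMap;

struct ResultCache {
    entries: HashMap<String, Vec<u8>>,
    hits: u64,
    misses: u64,
}

impl ResultCache {
    fn new() -> Self {
        ResultCache { entries: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Return the cached result for `query`, or run `scan` once and
    /// remember its output for the next identical query.
    fn get_or_scan<F: FnOnce() -> Vec<u8>>(&mut self, query: &str, scan: F) -> Vec<u8> {
        if let Some(v) = self.entries.get(query) {
            self.hits += 1;
            return v.clone();
        }
        self.misses += 1;
        let v = scan();
        self.entries.insert(query.to_string(), v.clone());
        v
    }
}
```

Exact-match keying is why "repeated exact queries" benefit most: even a whitespace change produces a different key and a fresh scan.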
Cache Invalidation
Smart invalidation prevents stale data:
Time-Based Expiration:
- Query results: 1 hour by default
- Metadata: 24 hours by default
- Configurable per datasource
Dependency Tracking:
```rust
// Cache is invalidated when underlying Parquet files change
let scan_options = ScanOptionsBuilder::new()
    .with_projection(vec!["id".to_string(), "amount".to_string()])
    .with_cache_ttl(Duration::from_secs(3600))
    .build();
```

Manual Invalidation:
```sql
-- Clear all cached results from a source
INVALIDATE CACHE FOR s3://sales-bucket/*;

-- Clear specific pattern
INVALIDATE CACHE FOR azure://loyalty/members/*.parquet;
```

Automatic Invalidation:
```rust
// When files change (detected via S3 events or polling)
// Cache automatically refreshes
manager.enable_s3_event_invalidation("sales-bucket").await?;
```

Hit Rates and Performance
Monitor cache effectiveness:
```rust
let cache_stats = manager.get_cache_stats().await?;
println!("Cache Hits: {}", cache_stats.hits);
println!("Cache Misses: {}", cache_stats.misses);
println!("Hit Ratio: {:.2}%", cache_stats.hit_ratio * 100.0);
println!("Cache Size: {} GB", cache_stats.size_bytes / (1024 * 1024 * 1024));
println!("Entries: {}", cache_stats.entry_count);
```

Typical Hit Ratios:
- Well-optimized workload: 60-80%
- Interactive BI dashboards: 40-60%
- Ad-hoc analytics: 20-40%
- One-time scans: <5%
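These hit ratios translate directly into expected latency: hits return in about a millisecond, misses pay the full scan cost. A one-line model (the ~1 ms hit latency comes from the figures earlier in this guide):

```rust
/// Expected query latency given a cache hit ratio: hits cost ~1 ms,
/// misses cost the full scan latency.
fn expected_latency_ms(hit_ratio: f64, miss_latency_ms: f64) -> f64 {
    hit_ratio * 1.0 + (1.0 - hit_ratio) * miss_latency_ms
}
```

At a 70% hit ratio with 10-second scans on misses, average latency is about 3 seconds; pushing the hit ratio to 90% drops it to about 1 second, which is why the tuning tips below focus on raising it.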
Improving Hit Ratios:
- Pre-warm cache with common queries
- Longer TTL for stable datasets
- Partition data to enable result reuse
- Enable query result logging for pattern analysis
Cost Optimization
Reducing Data Scans
Scan Reduction Techniques:
1. Pushdown Everything Possible:

- Filters, projections, and limits are the primary cost lever
- A 99% scan reduction (through pushdown) saves 99% in egress costs
- Parquet/ORC recommended; CSV/JSON avoided for cost-sensitive workloads

2. Partition Pruning:

```sql
-- Good: Query specific partition
SELECT * FROM s3://data/year=2024/month=01/day=15/*.parquet
WHERE customer_status = 'active';
-- Scans only ~3GB (1/365 of annual data)

-- Bad: Query full year, filter in database
SELECT * FROM s3://data/year=2024/**/*.parquet
WHERE date = '2024-01-15' AND customer_status = 'active';
-- Scans 1TB+ of irrelevant data
```

3. Column Pruning:

```sql
-- Good: Select only needed columns
SELECT customer_id, amount
FROM s3://transactions/large_table.parquet;
-- Fewer columns = fewer bytes transferred

-- Bad: SELECT *
SELECT *
FROM s3://transactions/large_table.parquet
WHERE amount > 1000;
-- Transfers unnecessary columns
```

4. Query Sampling:

```sql
-- For exploratory queries on massive datasets
SELECT customer_id, SUM(amount) as total
FROM s3://transactions/2024/**.parquet
WHERE RAND() < 0.01 -- Analyze 1% sample
GROUP BY customer_id
LIMIT 1000;
```
Query Optimization for Cost
Pre-aggregation at Source:
```sql
-- Bad: Aggregate after transfer
SELECT customer_id, SUM(amount)
FROM s3://detailed_transactions/2024/**/*.parquet
GROUP BY customer_id;

-- Better: Use pre-aggregated data if available
SELECT customer_id, total_amount
FROM s3://aggregated_transactions/2024_monthly/*.parquet;
-- Much smaller dataset, same business results
```

Data Format Selection:
```
Data Size Comparison (1TB raw data):
- CSV:     ~300GB uncompressed, ~100GB gzipped (~$12/query)
- JSON:    ~400GB uncompressed, ~120GB gzipped (~$14/query)
- ORC:     ~250GB (~$30/month storage, ~$3/query)
- Parquet: ~200GB (~$24/month storage, ~$2.40/query)

For frequent queries, Parquet breaks even after ~5 queries
```

Caching Strategy:
```rust
// Cache high-value queries that run repeatedly
let cached_config = FederationConfig {
    cache_ttl: Duration::from_secs(86400), // 24 hours
    cache_size_bytes: 100 * 1024 * 1024 * 1024, // 100GB
    enable_caching: true,
    ..Default::default()
};
```

Pricing Comparison
Typical Cost per Query (10GB dataset scan):
| Provider | Storage | API Calls | Egress | Total |
|---|---|---|---|---|
| AWS S3 (without pushdown) | $0.23 | $0.004 | $1.20 | $1.43 |
| AWS S3 (with pushdown 99%) | $0.23 | $0.004 | $0.012 | $0.25 |
| Azure Blob (without pushdown) | $0.018 | $0.001 | $0.80 | $0.82 |
| Azure Blob (with pushdown 99%) | $0.018 | $0.001 | $0.008 | $0.03 |
| GCP GCS (without pushdown) | $0.020 | $0.004 | $1.23 | $1.25 |
| GCP GCS (with pushdown 99%) | $0.020 | $0.004 | $0.012 | $0.04 |
Note: Costs vary by region, volume discounts apply at scale, and exact rates change quarterly.
Cost Monitoring and Controls
```rust
// Enable cost tracking
let config = FederationConfig {
    enable_statistics: true,
    ..Default::default()
};

// Monitor federation statistics
let stats = manager.get_federation_stats().await?;
println!("Bytes Scanned: {} GB", stats.total_bytes_scanned / (1024 * 1024 * 1024));
println!("Estimated Cost: ${:.2}",
    stats.total_bytes_scanned as f64 * 0.12 / (1024.0 * 1024.0 * 1024.0));
println!("Queries by Provider:");
for (provider, count) in stats.queries_by_provider {
    println!("  {}: {}", provider, count);
}
```

Cost Control Strategies:
- Set query timeouts to prevent runaway scans
- Monitor bytes_scanned metrics
- Alert on sudden spikes in data transfer
- Use sampling for exploratory queries
- Schedule heavy workloads during off-peak hours
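The spike alert above can be sketched as a trailing-average check over the daily bytes-scanned metric (the threshold factor is illustrative; tune it to your workload's variance):

```rust
/// Flag today's scanned bytes as a spike if they exceed `factor` times
/// the trailing average of recent days. Returns false with no history.
fn is_spike(history_bytes: &[u64], today_bytes: u64, factor: f64) -> bool {
    if history_bytes.is_empty() {
        return false;
    }
    let avg = history_bytes.iter().sum::<u64>() as f64 / history_bytes.len() as f64;
    today_bytes as f64 > avg * factor
}
```

Wiring this to the `total_bytes_scanned` statistic catches runaway `SELECT *` scans or a regression that disabled pushdown before the bill arrives.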
Data Sovereignty and Compliance
Keeping Data in Specific Regions
The sovereignty engine enforces geographic data residency:
```rust
use heliosdb_sovereignty::{
    ResidencyPolicy, PolicyEngine, PlacementEngine,
    DataClassification, Regulation
};

// Create GDPR-compliant policy for EU data
let mut eu_policy = ResidencyPolicy::new(
    "gdpr-eu-only".to_string(),
    "EU GDPR Compliance".to_string(),
    DataClassification::PII,
);

eu_policy.allowed_countries = vec![
    "Ireland".to_string(),
    "Germany".to_string(),
    "France".to_string(),
    "Netherlands".to_string(),
];

eu_policy.required_regulations = vec![Regulation::GDPR];

policy_engine.create_policy(eu_policy).await?;

// Data placement is automatically enforced
let placement = placement_engine
    .place_data(
        "customer_table",
        Some("cust_123".to_string()),
        &eu_policy,
        customer_data.as_bytes(),
        None,
    )
    .await?;

println!("Data placed in: {:?}", placement.region);
// Output: Data placed in: "eu-west-1" (Ireland)
```

GDPR and Data Residency
The GDPR (Chapter V, Articles 44-49) requires that transfers of personal data to non-EU countries have appropriate safeguards. HeliosDB enforces compliance:
Automatic Enforcement:
- Personal data of EU residents stays in EU regions
- International transfers blocked automatically
- Audit logging of all access and transfers
- User consent tracking and revocation
Implementation:
```sql
-- Data residency enforced at query time
-- This query runs entirely in EU regions despite multi-cloud setup
SELECT customer_id, email, purchase_history
FROM s3://eu-customers/pii.parquet -- EU region
WHERE customer_gdpr_consent = true;

-- This would fail (attempt to cross-border transfer)
SELECT eu.customer_id, us.recommendation_score
FROM s3://eu-customers/pii.parquet eu          -- EU region
JOIN gs://us-recommendations/scores.parquet us -- US region
    ON eu.customer_id = us.customer_id;
-- Error: GDPR violation - cannot join EU PII with US data
```

Compliance Frameworks
HIPAA (Healthcare):
```rust
let mut hipaa_policy = ResidencyPolicy::new(
    "hipaa-us-only".to_string(),
    "HIPAA Compliance".to_string(),
    DataClassification::PHI,
);

hipaa_policy.allowed_countries = vec!["United States".to_string()];
hipaa_policy.required_regulations = vec![Regulation::HIPAA];
hipaa_policy.encrypt_at_rest = true;
hipaa_policy.encrypt_in_transit = true;
policy_engine.create_policy(hipaa_policy).await?;
```

CCPA (California):
```rust
let mut ccpa_policy = ResidencyPolicy::new(
    "ccpa-consumer-pii".to_string(),
    "CCPA Compliance".to_string(),
    DataClassification::PII,
);

ccpa_policy.user_rights_enforced = true;
ccpa_policy.deletion_enforced = true;
ccpa_policy.required_regulations = vec![Regulation::CCPA];
policy_engine.create_policy(ccpa_policy).await?;
```

PCI-DSS (Payment Cards):
```rust
let mut pci_policy = ResidencyPolicy::new(
    "pci-card-data".to_string(),
    "PCI-DSS Compliance".to_string(),
    DataClassification::PaymentCardData,
);

pci_policy.tokenization_required = true;
pci_policy.encryption_required = true;
pci_policy.required_regulations = vec![Regulation::PCIDSS];
policy_engine.create_policy(pci_policy).await?;
```

Audit Trails Across Clouds
Complete audit logging across all cloud providers:
```rust
// Monitor compliance
let compliance_report = policy_engine.generate_compliance_report().await?;

println!("Policies: {}", compliance_report.policies.len());
println!("Violations: {}", compliance_report.violations.len());
println!("Data Locations:");
for (data_id, location) in compliance_report.data_locations {
    println!("  {} -> {} ({})", data_id, location.region, location.provider);
}

// Access audit trail
let audit_events = audit_logger.query_events(
    Some("customer_table"),
    None,                             // All users
    Some(Duration::from_secs(86400)), // Last 24 hours
).await?;

for event in audit_events {
    println!("{}: User {} accessed {} in {}",
        event.timestamp, event.user_id, event.data_id, event.location);
}
```

Federated Learning
Distributed ML Training
Train models on data distributed across clouds without moving sensitive information:
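The aggregation step at the heart of this is FedAvg: each site trains locally, and only weight updates (never raw rows) are combined, weighted by each site's sample count. A self-contained sketch of the math, independent of HeliosDB's engine:

```rust
/// FedAvg: average per-site weight vectors in proportion to each site's
/// local sample count. `updates` pairs a weight vector with its row count.
fn fed_avg(updates: &[(Vec<f64>, usize)]) -> Vec<f64> {
    let total: usize = updates.iter().map(|(_, n)| n).sum();
    let dim = updates[0].0.len();
    let mut global = vec![0.0; dim];
    for (weights, n) in updates {
        let share = *n as f64 / *&total as f64;
        for (g, w) in global.iter_mut().zip(weights) {
            *g += share * w;
        }
    }
    global
}
```

A site with three times the data pulls the global model three times as hard, while its records never leave its home region.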
```rust
use heliosdb_federation::federated_learning::{
    FederatedLearningEngine, FederatedLearningConfig,
    TrainingConfig, AggregationStrategy
};

let fl_config = FederatedLearningConfig {
    num_rounds: 10,
    learning_rate: 0.01,
    batch_size: 32,
    aggregation_strategy: AggregationStrategy::FedAvg,
    ..Default::default()
};

let engine = FederatedLearningEngine::new(fl_config).await?;

// Training happens on data in each cloud
// Updates are aggregated, not raw data
let model = engine.train_model("customer_churn_model", vec![
    TrainingDataset {
        name: "aws_customers".to_string(),
        path: "s3://customer-data/".to_string(),
    },
    TrainingDataset {
        name: "azure_customers".to_string(),
        path: "azure://customer-data/".to_string(),
    },
    TrainingDataset {
        name: "gcp_customers".to_string(),
        path: "gs://customer-data/".to_string(),
    },
]).await?;

println!("Model trained without moving sensitive customer data");
```

Privacy-Preserving Analytics
Differential privacy ensures individual privacy while enabling aggregate analysis:
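The standard mechanism is to add Laplace noise scaled to sensitivity/epsilon. A minimal illustrative sketch (not the engine's implementation; a toy LCG stands in for a cryptographic RNG, and real systems also track the cumulative privacy budget):

```rust
// Laplace mechanism for differential privacy, as a toy sketch.
fn laplace_noise(scale: f64, seed: &mut u64) -> f64 {
    // Advance a small LCG and map the state to a uniform u in (-0.5, 0.5)
    *seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    let u = ((*seed >> 11) as f64 + 0.5) / (1u64 << 53) as f64 - 0.5;
    // Inverse CDF of the Laplace(0, scale) distribution
    -scale * u.signum() * (1.0 - 2.0 * u.abs()).ln()
}

/// A COUNT(*) has sensitivity 1 (one individual changes it by at most 1),
/// so the noise scale is 1/epsilon.
fn private_count(true_count: u64, epsilon: f64, seed: &mut u64) -> f64 {
    true_count as f64 + laplace_noise(1.0 / epsilon, seed)
}
```

Smaller epsilon means more noise and stronger privacy: at epsilon = 0.1 the noise scale is 10, so individual membership is hidden while a count of thousands stays usable.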
```rust
// Query with differential privacy
let result = manager.execute_query_with_privacy(
    "SELECT COUNT(*) FROM customer_table WHERE country='US'",
    DifferentialPrivacyParams {
        epsilon: 0.1, // Privacy budget
        delta: 1e-5,  // Failure probability
    },
).await?;

// Result includes noise to prevent identifying individuals
println!("Count (with differential privacy): {}", result);
```

Model Deployment
Deploy trained models back to cloud providers for inference:
```rust
// Deploy model to S3 for batch scoring
engine.deploy_model_to_s3(
    "churn_model_v1",
    "s3://models/churn-prediction/v1.0",
    ModelFormat::ONNX,
).await?;

// Deploy to Azure for real-time inference
engine.deploy_model_to_azure(
    "churn_model_v1",
    "azure://models/churn-prediction",
    AzureDeploymentTarget::MachineLearning,
).await?;

// Deploy to BigQuery for integrated analytics
engine.deploy_model_to_bigquery(
    "churn_model_v1",
    "project_id.dataset.churn_prediction_udf",
).await?;
```

Architecture Patterns
Pattern 1: Multi-Cloud Data Lake
Scenario: Large enterprise using multiple clouds for different business units.
Architecture:
```
┌─────────────────────────────────────────────────────┐
│              HeliosDB Federation Layer              │
├─────────────────────────────────────────────────────┤
│ Query Router │ Cost Optimizer │ Compliance Engine   │
│      ↓       │       ↓        │        ↓            │
├─────────────────────────────────────────────────────┤
│  S3 (AWS)    │  Blob (Azure)  │  GCS (GCP)          │
│  Finance     │  Compliance    │  Analytics          │
│  Data Lake   │  Data (GDPR)   │  & ML Models        │
└─────────────────────────────────────────────────────┘
```

Benefits:
- Finance team uses AWS S3 for cost optimization
- Compliance team uses Azure for GDPR enforcement
- ML team uses GCP for BigQuery integration
- Single query interface across all sources
Implementation:
```sql
-- Cross-cloud financial dashboard
SELECT f.fiscal_quarter,
       f.department,
       f.revenue,
       c.compliance_status,
       m.forecast_adjustment
FROM s3://finance/quarterly_results.parquet f
INNER JOIN azure://compliance/approval_status.parquet c
    ON f.fiscal_quarter = c.period
INNER JOIN gs://forecasting/predictions.parquet m
    ON f.department = m.department
WHERE f.fiscal_quarter = '2024-Q1'
ORDER BY f.revenue DESC;
```

Pattern 2: Hybrid Cloud with On-Prem
Scenario: Data in on-premises database plus cloud cold storage.
Architecture:
```
┌────────────────────────────────────────────────────┐
│            HeliosDB Federation Manager             │
├────────────────────────────────────────────────────┤
│ On-Prem PostgreSQL (Hot)  │  S3 Glacier (Cold)     │
│ Recent 90 days            │  Historical Archive    │
│ Fast, expensive storage   │  Cheap, slow storage   │
│ Real-time operational     │  Compliance archive    │
└────────────────────────────────────────────────────┘
```
Benefits:
- Keep hot data on-premises for performance
- Archive to S3 Glacier for cost
- Single query combines both
- Automatic tiering as data cools
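The age-based tiering rule behind this pattern can be sketched in a few lines. The `Tier` enum, the `tier_for_age` helper, and the 90-day threshold are illustrative names chosen here, not part of the HeliosDB API:

```rust
// Illustrative age-based tiering rule: records inside the hot window stay
// on-prem; older records are routed to cold archive storage.
#[derive(Debug, PartialEq)]
enum Tier {
    HotOnPrem,   // recent, fast, expensive storage
    ColdGlacier, // historical, cheap, slow storage
}

fn tier_for_age(age_days: u32, hot_window_days: u32) -> Tier {
    if age_days < hot_window_days {
        Tier::HotOnPrem
    } else {
        Tier::ColdGlacier
    }
}

fn main() {
    // 90-day hot window, matching the architecture above
    assert_eq!(tier_for_age(30, 90), Tier::HotOnPrem);
    assert_eq!(tier_for_age(90, 90), Tier::ColdGlacier);
    assert_eq!(tier_for_age(365, 90), Tier::ColdGlacier);
    println!("tiering ok");
}
```

In practice the threshold would come from your retention policy, and the move to Glacier would be handled by the tiering service rather than per-record checks.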
Implementation:
```sql
-- Query spans hot and cold storage
SELECT date, revenue, customer_id
FROM postgres://prod/sales_current                 -- On-prem hot
WHERE date >= CURRENT_DATE - INTERVAL '90 days'

UNION ALL

SELECT date, revenue, customer_id
FROM s3://archive/sales_historical/*/data.parquet  -- Cold archive
WHERE date < CURRENT_DATE - INTERVAL '90 days'
ORDER BY date DESC;
```
Pattern 3: Regional Distribution
Scenario: Global company with regional data sovereignty requirements.
Architecture:
```
┌─────────────────────────────────────────────────┐
│           HeliosDB Federation Layer             │
│          with Sovereignty Enforcement           │
├─────────────────────────────────────────────────┤
│  EU Region   │  US Region    │  APAC Region     │
│  Azure EU    │  AWS US       │  GCP APAC        │
│  GDPR Data   │  CCPA Data    │  Local Data      │
│  (Ireland)   │  (Virginia)   │  (Singapore)     │
└─────────────────────────────────────────────────┘
```
Benefits:
- Automatic region routing by data classification
- Compliance enforced at query time
- Cross-region analytics with privacy
- Data sovereignty built-in
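The region routing above amounts to a lookup from data classification to the region allowed to execute the query. A minimal sketch, assuming hypothetical `Classification` and region names (the real routing table lives in the sovereignty policy configuration):

```rust
// Illustrative routing table: data classification -> execution region.
// Classification variants and region identifiers are hypothetical,
// mirroring the three-region architecture above.
#[derive(Debug, PartialEq)]
enum Classification {
    Gdpr,  // EU personal data
    Ccpa,  // US consumer data
    Local, // APAC-resident data
}

fn region_for(class: &Classification) -> &'static str {
    match class {
        Classification::Gdpr => "azure-eu-ireland",
        Classification::Ccpa => "aws-us-virginia",
        Classification::Local => "gcp-apac-singapore",
    }
}

fn main() {
    assert_eq!(region_for(&Classification::Gdpr), "azure-eu-ireland");
    assert_eq!(region_for(&Classification::Ccpa), "aws-us-virginia");
    assert_eq!(region_for(&Classification::Local), "gcp-apac-singapore");
    println!("routing ok");
}
```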
Implementation:
```rust
// Queries automatically route to the correct regions
let results = manager.execute_query(
    "SELECT * FROM customers WHERE revenue > 100000",
    PrivacyPolicyId::GDPR, // Only runs in EU regions
).await?;

// Multi-region summary (respects boundaries)
let summary = manager.execute_query(
    "SELECT region, COUNT(*) as count FROM customers GROUP BY region",
    PrivacyPolicyId::Public, // Aggregate across all regions
).await?;
```
Performance Tuning
Query Optimization
Analyze Execution Plans:
```sql
EXPLAIN VERBOSE
SELECT
    c.customer_id,
    c.name,
    SUM(o.amount) AS total_orders
FROM s3://customers/*.parquet c
INNER JOIN azure://orders/2024/*.parquet o
    ON c.customer_id = o.customer_id
WHERE c.country = 'US'
GROUP BY c.customer_id, c.name
ORDER BY total_orders DESC
LIMIT 100;
```
Look for:
- Pushdown applied to filters and projections
- Partition pruning for date/region columns
- Join strategy (broadcast vs hash vs nested loop)
- Estimated rows and bytes
Partition Pruning:
- Always query specific date partitions when possible
- Use partition columns in WHERE clause early
- Avoid computing on partition columns (date arithmetic)
Format Selection:
- Parquet for frequently-queried analytics data
- ORC for ACID workloads and compression
- CSV only for one-time imports
- JSON for semi-structured/API data
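The format guidance above can be captured as a simple decision rule. A sketch with hypothetical `Workload` names; it is not a HeliosDB configuration API, just the table above in code form:

```rust
// Illustrative workload -> format mapping, following the guidance above.
#[derive(Debug, PartialEq)]
enum Workload {
    Analytics,      // frequently-queried analytical data
    Acid,           // ACID workloads needing strong compression
    OneTimeImport,  // throwaway loads
    SemiStructured, // API payloads, nested records
}

fn recommended_format(w: &Workload) -> &'static str {
    match w {
        Workload::Analytics => "parquet",
        Workload::Acid => "orc",
        Workload::OneTimeImport => "csv",
        Workload::SemiStructured => "json",
    }
}

fn main() {
    assert_eq!(recommended_format(&Workload::Analytics), "parquet");
    assert_eq!(recommended_format(&Workload::Acid), "orc");
    println!("format selection ok");
}
```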
Partition Pruning
Good Partitioning:
```
s3://sales/
  year=2024/
    month=01/
      day=15/
        region=US/
          data.parquet
```
Query with maximum pruning:
```sql
SELECT *
FROM s3://sales/**/*.parquet
WHERE year = 2024
  AND month IN (1, 2, 3)
  AND day >= 10
  AND region = 'US'
LIMIT 1000;
```
Only ~60 files scanned out of ~1500 possible (a ~96% reduction).
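Conceptually, the planner applies the query's partition predicates to the `key=value` path components before listing a single file. A minimal sketch of that pruning step, over a synthetic partition space (the `Partition` struct and the hard-coded predicate mirror the example query; this is not HeliosDB's planner):

```rust
// Illustrative hive-style partition pruning: evaluate partition predicates
// against path components, skipping file listings for pruned partitions.
struct Partition {
    year: u32,
    month: u32,
    day: u32,
    region: &'static str,
}

// Mirrors: WHERE year = 2024 AND month IN (1, 2, 3) AND day >= 10 AND region = 'US'
fn survives(p: &Partition) -> bool {
    p.year == 2024 && (1..=3).contains(&p.month) && p.day >= 10 && p.region == "US"
}

fn main() {
    // Synthetic layout: 2 years x 12 months x 28 days x 3 regions = 2016 partitions
    let mut all = Vec::new();
    for year in [2023u32, 2024] {
        for month in 1..=12 {
            for day in 1..=28 {
                for region in ["US", "EU", "APAC"] {
                    all.push(Partition { year, month, day, region });
                }
            }
        }
    }
    let scanned = all.iter().filter(|p| survives(p)).count();
    // 1 year x 3 months x 19 days (10..=28) x 1 region = 57 partitions survive
    assert_eq!(scanned, 57);
    println!("scanned {} of {} partitions", scanned, all.len());
}
```

Note that the predicate compares partition columns directly; wrapping them in arithmetic or functions would force the planner to list every partition.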
Format Selection for Performance
Scenario: 1TB dataset, need to run 100 queries this month
CSV (uncompressed):
- 1 TB stored, no conversion cost
- 100 queries × 1 TB = 100 TB egress = $12,000
- Total cost: $12,000

Parquet (with pushdown):
- 100 GB, $24/month storage
- 100 queries × ~100 MB average = 10 GB egress = $1.20
- Total cost: $25.20
- Savings: $11,974.80
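The arithmetic behind this comparison can be checked directly. The $0.12/GB egress rate is an assumption chosen so the figures above work out; it is illustrative, not a provider price sheet:

```rust
// Reproduces the CSV vs Parquet cost comparison above.
const EGRESS_PER_GB: f64 = 0.12; // assumed egress rate, $/GB (illustrative)

// CSV without pushdown: every query transfers the full dataset.
fn csv_total_cost(queries: f64, dataset_gb: f64) -> f64 {
    queries * dataset_gb * EGRESS_PER_GB
}

// Parquet with pushdown: only the pruned columns/rows leave the provider.
fn parquet_total_cost(queries: f64, avg_scan_gb: f64, storage_per_month: f64) -> f64 {
    storage_per_month + queries * avg_scan_gb * EGRESS_PER_GB
}

fn main() {
    let csv = csv_total_cost(100.0, 1000.0);            // 100 TB egress -> $12,000
    let parquet = parquet_total_cost(100.0, 0.1, 24.0); // $24 storage + $1.20 egress
    println!(
        "CSV: ${:.2}  Parquet: ${:.2}  Savings: ${:.2}",
        csv,
        parquet,
        csv - parquet
    );
}
```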
Converting to Parquet pays for itself almost immediately; for annual workloads, the savings are massive.

Monitoring
Query Performance Across Clouds
```rust
// Get detailed query statistics
let stats = manager.get_federation_stats().await?;

println!("=== Federation Performance ===");
println!("Total Queries: {}", stats.total_queries);
println!("Total Bytes Scanned: {} GB", stats.total_bytes_scanned / (1024 * 1024 * 1024));
println!("Avg Query Duration: {:.2}ms", stats.avg_query_duration_ms);
println!("Cache Hit Ratio: {:.2}%", stats.cache_hit_ratio * 100.0);

println!("\n=== Queries by Provider ===");
for (provider, count) in stats.queries_by_provider {
    println!("{}: {} queries", provider, count);
}
```
Cost Tracking
```rust
// Monitor federation costs in real time
let cost_report = manager.get_cost_report().await?;

println!("=== Cost Summary ===");
println!("AWS S3: ${:.2}", cost_report.aws_cost);
println!("Azure: ${:.2}", cost_report.azure_cost);
println!("GCP: ${:.2}", cost_report.gcp_cost);
println!("Total: ${:.2}", cost_report.total_cost);

println!("\n=== Cost Breakdown ===");
println!("Storage: ${:.2}", cost_report.storage_cost);
println!("API Calls: ${:.2}", cost_report.api_cost);
println!("Egress: ${:.2}", cost_report.egress_cost);
```
Data Transfer Monitoring
```sql
-- Monitor data transfer patterns
SELECT
    provider,
    DATE_TRUNC('hour', query_time) AS hour,
    COUNT(*) AS query_count,
    SUM(bytes_scanned) AS bytes_scanned,
    SUM(bytes_transferred) AS bytes_transferred
FROM federation_audit_log
WHERE query_time > NOW() - INTERVAL '7 days'
GROUP BY provider, DATE_TRUNC('hour', query_time)
ORDER BY hour DESC;
```
Troubleshooting
Common Issues
“Connection timed out” to S3/Azure/GCP
Cause: Network connectivity or credential issues.

Solution:
1. Verify credentials in the FederationManager configuration
2. Check security group / firewall rules
3. Test connectivity directly: `aws s3 ls`, `az storage blob list`, `gsutil ls`
4. Verify IAM permissions for the service account
5. Check for temporary quota exhaustion (usually auto-resolves)

“Pushdown not applied - query running slow”

Cause: Format or filters that do not support pushdown.

Solution:
1. Convert CSV/JSON data to Parquet or ORC
2. Simplify filters (avoid UDFs and computed columns)
3. Use partition columns in the WHERE clause
4. Check the EXPLAIN plan for pushdown status
5. Verify that statistics collection is enabled

“Cache hit ratio too low”

Cause: Queries are too diverse or the cache is too small.

Solution:
1. Increase the cache size (more memory)
2. Increase the cache TTL for stable data
3. Analyze query patterns and standardize them where possible
4. Pre-warm the cache with common queries
5. Check whether the data changes frequently (invalidation)

“High egress costs”

Cause: Pushdown not working or unnecessary columns selected.

Solution:
1. Use pushdown-friendly formats (Parquet)
2. SELECT only the columns you need
3. Apply the most selective filters first
4. Use partition pruning
5. Sample data for exploratory queries
6. Consider regional replication for frequently accessed data

Related Documentation
- Federated Learning User Guide: Advanced privacy-preserving distributed ML training
- Data Sovereignty Guide: Complete compliance policy management
- Cache Configuration Guide: Advanced caching strategies and tuning
- Query Optimization Guide: Performance tuning for complex queries
- Cost Management Guide: Budget controls and cost allocation
- Cloud Integration Guides: Specific setup for AWS, Azure, GCP
- Architecture Reference: Federation system design and internals
Last Updated: December 2024
Version: 1.0