
HeliosDB Multi-Cloud Federation User Guide

Overview

Multi-cloud federation in HeliosDB enables seamless querying across data stored in AWS S3, Azure Blob Storage, and Google Cloud Storage from a single SQL interface. This powerful capability allows organizations to leverage investments in multiple cloud providers while maintaining unified data access patterns, optimizing costs through intelligent query execution, and enforcing data sovereignty constraints across geographic regions.

Federation addresses a critical challenge in modern data infrastructure: the inevitable reality that enterprise data lives across multiple cloud platforms. Rather than forcing data consolidation (expensive and time-consuming), federation allows you to query and analyze data where it lives, with HeliosDB handling the complexity of distributed execution, intelligent caching, filter pushdown optimization, and cost tracking.

Why Use Federation?

Multi-Cloud Strategy: Avoid vendor lock-in and leverage each cloud provider’s strengths. Store analytics data on AWS S3 for cost-effectiveness, sensitive EU customer data on Azure in European regions for GDPR compliance, and machine learning models on GCP for BigQuery integration.

Cost Optimization: Reduce egress charges and minimize data scanning through intelligent filter and projection pushdown. HeliosDB analyzes queries and applies filtering at the storage layer, meaning you only pay for the data you actually need.

Data Sovereignty: Enforce geographic data residency requirements. Keep PII in specific regions, comply with GDPR/CCPA/HIPAA requirements, and automatically migrate data if policies change.

Operational Flexibility: Add or remove cloud providers without reimplementing data pipelines. Query results are cached intelligently to minimize repeated scans and API calls.

Performance: Parallel query execution across clouds, optimized data format selection, and multi-level caching ensure enterprise-grade query performance even with large-scale distributed datasets.

Federation Concepts

Distributed Query Execution

Federated queries are executed using a coordinator-worker pattern. The query planner determines which data sources must be accessed, estimates costs based on data location and format, and then routes query execution to specialized connectors for each cloud provider. Results are streamed back and merged, with optional caching at multiple levels to optimize repeated queries.

The federation engine automatically parallelizes work where possible. If your query joins S3 data with Azure data, separate scans execute in parallel with results merged efficiently. Filters are pushed down to the storage layer whenever supported, dramatically reducing network traffic.
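The coordinator-worker flow can be sketched with plain threads (illustrative only, not the HeliosDB internals; `scan_source` is a hypothetical stand-in for a cloud connector call):

```rust
use std::thread;

// Hypothetical per-source scan: in the real engine this would call a
// cloud connector; here it just returns the rows it "found".
fn scan_source(source: &str, rows: Vec<u64>) -> (String, Vec<u64>) {
    (source.to_string(), rows)
}

// Run two scans on separate threads and merge the streamed results,
// mirroring the parallel S3 + Azure example above.
fn parallel_scan_and_merge() -> Vec<u64> {
    let s3 = thread::spawn(|| scan_source("s3://sales", vec![1, 2, 3]));
    let az = thread::spawn(|| scan_source("azure://dims", vec![4, 5]));
    let mut merged = Vec::new();
    for handle in [s3, az] {
        let (_name, rows) = handle.join().expect("scan thread panicked");
        merged.extend(rows); // merge step: results streamed back to coordinator
    }
    merged.sort(); // deterministic order for downstream operators
    merged
}

fn main() {
    println!("{:?}", parallel_scan_and_merge());
}
```

The point of the sketch is only the shape: scans proceed independently and the coordinator merges whatever arrives.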

Federated Learning Integration

Beyond data access, HeliosDB supports privacy-preserving analytics through federated learning. Train models across distributed datasets without moving sensitive data. Each cloud provider’s data contributes to model training while remaining in its home region. This is particularly valuable for healthcare (HIPAA compliance), financial services (data residency requirements), and regulated industries.

Data Sovereignty and Compliance

Data sovereignty ensures data never violates regulatory constraints. The sovereignty engine tracks data classification (PII, PHI, payment card data, etc.), enforces policies (GDPR, CCPA, HIPAA, custom regulations), and automatically migrates data when policies change. The compliance dashboard provides real-time visibility into data locations and regulatory status.

Compliance Considerations

Federated queries inherently handle compliance better than data consolidation. Rather than moving sensitive data and creating compliance risks, federation keeps data at rest while enabling analysis. The audit system logs all data access across clouds, creating complete compliance trails for regulatory inspection. Policies enforce requirements like “customer data from EU cannot be processed in US regions” automatically.

Cloud Backend Setup

AWS S3 Configuration

Authentication Methods

IAM Role-Based Access (Recommended for production)

When running HeliosDB on EC2 or ECS, attach an IAM role with S3 permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetObjectTagging"
      ],
      "Resource": [
        "arn:aws:s3:::my-analytics-bucket/*",
        "arn:aws:s3:::my-analytics-bucket"
      ]
    }
  ]
}

Access Key and Secret (For development/cross-account scenarios)

use heliosdb_federation::fdw::{CloudCredentials, CloudProvider};

let credentials = CloudCredentials {
    provider: CloudProvider::AWS,
    access_key: "AKIA...".to_string(),
    secret_key: "wJal...".to_string(),
    session_token: None,
    region: "us-east-1".to_string(),
};

Cross-Account Access (For multi-account organizations)

  • Configure a role in the data account that trusts the HeliosDB account
  • Use STS AssumeRole to gain temporary credentials
  • Essential for enterprise multi-account strategies

Bucket and Prefix Configuration

use heliosdb_federation::fdw::{S3ForeignDataWrapper, S3Options};

let options = S3Options {
    use_s3_select: true,          // Enable for CSV/JSON queries
    use_virtual_host_style: true, // Newer buckets
    use_accelerate: true,         // If you have high latency
    connect_timeout_secs: 30,
    read_timeout_secs: 300,
    max_retry_attempts: 3,
    enable_partitioning: true,
    partition_columns: vec!["year".to_string(), "month".to_string()],
    enable_caching: true,
    cache_ttl_secs: 3600,
    auto_detect_format: true,
    ..Default::default()
};
let s3_fdw = S3ForeignDataWrapper::with_options(options);

Partition Pruning Strategy

Organize S3 data in partitioned directory structures:

s3://my-bucket/
  transactions/
    year=2024/
      month=01/
        day=15/
          data.parquet
      month=02/
        day=01/
          data.parquet
  customers/
    region=US/
      state=CA/
        data.parquet
    region=EU/
      country=Germany/
        data.parquet

The federation engine automatically prunes partitions during query execution, eliminating unnecessary scans. When you query WHERE year=2024 AND month=01, only that partition is accessed.
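The pruning idea can be sketched in a few lines (simplified; the actual planner is more sophisticated): parse the Hive-style `key=value` segments out of each object path and skip objects whose partition values fail the predicate.

```rust
use std::collections::HashMap;

// Extract Hive-style partition values ("year=2024/month=01") from a path.
fn partition_values(path: &str) -> HashMap<String, String> {
    path.split('/')
        .filter_map(|seg| seg.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

// Keep only paths whose partition columns match the required values.
fn prune<'a>(paths: &[&'a str], required: &[(&str, &str)]) -> Vec<&'a str> {
    paths
        .iter()
        .filter(|p| {
            let vals = partition_values(p);
            required
                .iter()
                .all(|(k, v)| vals.get(*k).map(String::as_str) == Some(*v))
        })
        .copied()
        .collect()
}

fn main() {
    let paths = [
        "s3://my-bucket/transactions/year=2024/month=01/day=15/data.parquet",
        "s3://my-bucket/transactions/year=2024/month=02/day=01/data.parquet",
    ];
    // WHERE year=2024 AND month=01 touches only the first object.
    println!("{:?}", prune(&paths, &[("year", "2024"), ("month", "01")]));
}
```

Because pruning happens on path metadata alone, no object bytes are read for partitions that cannot match.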

Azure Blob Storage Setup

Authentication Methods

Managed Identity (Recommended for production on Azure VMs/AKS)

use heliosdb_federation::fdw::{CloudCredentials, CloudProvider};

// Automatically uses managed identity from the environment
let credentials = CloudCredentials {
    provider: CloudProvider::Azure,
    access_key: String::new(), // Not needed with managed identity
    secret_key: String::new(),
    session_token: None,
    region: "eastus".to_string(),
};

Storage Account Key (For development)

let credentials = CloudCredentials {
    provider: CloudProvider::Azure,
    access_key: "DefaultEndpointsProtocol=https;...".to_string(),
    secret_key: "AccountKey=...".to_string(),
    session_token: None,
    region: "eastus".to_string(),
};

Shared Access Signature (SAS) (For limited, time-scoped access)

  • Ideal for third-party integrations or read-only scenarios
  • Token automatically expires after specified duration
  • Granular permissions control

Container and Blob Organization

https://mystorageaccount.blob.core.windows.net/
  loyalty/
    2024/
      01/
        customers.parquet
        transactions.csv
  compliance/
    pii/
      eu/
        customer-data.parquet
      us/
        customer-data.parquet
  analytics/
    models/
      recommendation/
        v1.0.parquet

Synapse Integration

Azure Synapse Analytics integration enables leveraging PolyBase for high-performance scans:

-- Configure Azure Synapse endpoint
CREATE EXTERNAL DATA SOURCE SynapseSource
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://mystorageaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

-- Query through Synapse for better performance on large scans
SELECT * FROM SynapseSource WHERE partitionKey = '2024-01-15';

Google Cloud Storage Setup

Authentication Methods

Service Account Key (Recommended)

use heliosdb_federation::fdw::{CloudCredentials, CloudProvider};

let credentials = CloudCredentials {
    provider: CloudProvider::GCP,
    access_key: "gcp-project-id".to_string(),
    secret_key: r#"{"type": "service_account", "project_id": ...}"#.to_string(),
    session_token: None,
    region: "us-central1".to_string(),
};

Application Default Credentials (For development)

  • Automatically uses credentials from GOOGLE_APPLICATION_CREDENTIALS environment variable
  • Simplifies local development without hardcoded keys

Workload Identity (For GKE deployments)

  • Pod service accounts directly assume GCP service accounts
  • No credential files needed in containers
  • Best practice for Kubernetes deployments

BigQuery Integration

Combine GCS data access with BigQuery analysis:

use heliosdb_federation::fdw::GcpForeignDataWrapper;
let gcp_fdw = GcpForeignDataWrapper::new();
// Query GCS data and optionally load into BigQuery for analysis
// Seamless integration through BigQuery external tables

Bucket and Dataset Organization

gs://my-analytics-bucket/
  customer_data/
    regions/
      us/
        2024/
          01/
            data.parquet
      eu/
        2024/
          01/
            data.parquet
  ml_models/
    versions/
      recommendation/
        v1/
          model.parquet
  raw_events/
    year=2024/
      month=01/
        day=15/
          events.json

Supported Data Formats

Parquet (Recommended)

Advantages:

  • Excellent filter and projection pushdown support (column pruning)
  • Compression (Snappy, GZIP, Brotli) reduces storage and network costs
  • Row group pruning enables filtering entire blocks without reading
  • Statistics metadata enables cost estimation before scanning
  • Widely adopted for data lakes

Usage:

-- Query Parquet files from S3
SELECT customer_id, SUM(amount) as total_sales
FROM s3://analytics-bucket/sales/2024/01/*.parquet
WHERE date >= '2024-01-01' AND status = 'completed'
GROUP BY customer_id
LIMIT 100;

Optimization Tips:

  • Partition by low-cardinality columns (date, region)
  • Compress with Snappy for balance of compression and speed
  • Use smaller row groups (128MB) for better pruning
  • Organize columns with frequently-filtered ones first

ORC (Columnar Optimization)

Advantages:

  • Similar compression to Parquet but often tighter (better for storage costs)
  • Built-in ACID support enables updates on cloud storage
  • Stripe-level indexing for precise filtering
  • Better for time-series data with temporal partitioning

Usage:

-- Query ORC files from Azure
SELECT product_id, event_count, avg_duration
FROM azure://data-container/events/2024/01/*.orc
WHERE event_type IN ('purchase', 'add_to_cart')
ORDER BY event_count DESC;

Optimization Tips:

  • Use ACID transactions for evolving datasets
  • Stripe size of 64MB-256MB provides good balance
  • Enable dictionary encoding for string columns
  • Partition by date for temporal queries

CSV (Schema Flexibility)

Advantages:

  • Human-readable, easy to import from legacy systems
  • No schema definition needed (auto-detection supported)
  • Supports variable numbers of columns
  • Ideal for data ingestion and ETL workflows

Limitations:

  • No built-in compression (compress files externally, e.g. with gzip)
  • No schema metadata (requires inference)
  • Slower parsing than columnar formats
  • No filter pushdown support

Usage:

-- Query CSV files with schema detection
SELECT id, name, email
FROM s3://import-bucket/customers/*.csv
WHERE email LIKE '%@enterprise.com'
LIMIT 1000;

Configuration:

use heliosdb_federation::fdw::CsvOptions;

let csv_options = CsvOptions {
    delimiter: b',',
    quote: b'"',
    has_headers: true,
    date_format: "%Y-%m-%d".to_string(),
    infer_schema: true,
    infer_schema_rows: 1000, // Scan first 1000 rows for types
};

JSON (Nested Data)

Advantages:

  • Native support for nested structures
  • Schema flexibility for semi-structured data
  • Ideal for API responses and log aggregation
  • Supports nested filtering

Limitations:

  • Variable performance on large files
  • No pushdown filtering support
  • Requires full scan for filtering

Usage:

-- Query JSON with nested extraction
SELECT
    data->>'customer_id' as customer_id,
    (data->'order'->'total')::numeric as amount,
    data->'items' as items
FROM s3://events-bucket/orders/2024/*.json
WHERE data->>'status' = 'completed'
LIMIT 500;

Optimization:

  • Use line-delimited JSON (JSONL) for large files
  • Nest frequently-filtered fields at root level
  • Consider converting to Parquet for production workloads

Arrow IPC (Efficient Interchange)

Advantages:

  • Language-agnostic columnar format
  • Zero-copy deserialization
  • Streaming format perfect for pipelines
  • Native Python/R/Java support

Usage:

-- Query Arrow IPC files
SELECT * FROM azure://staging/exports/*.arrow
WHERE timestamp > NOW() - INTERVAL '7 days';

Federated Queries

Setting Up Federated Data Sources

First, register your cloud credentials and FDW instances with the federation manager:

use heliosdb_federation::{FederationManager, FederationConfig};
use heliosdb_federation::fdw::{
    S3ForeignDataWrapper, AzureForeignDataWrapper, GcpForeignDataWrapper,
    CloudProvider, CloudCredentials,
};
use std::sync::Arc;
use std::time::Duration;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Configure federation with performance settings
    let config = FederationConfig {
        max_concurrent_queries: 64,
        query_timeout: Duration::from_secs(300),
        enable_caching: true,
        cache_size_bytes: 10 * 1024 * 1024 * 1024, // 10GB
        cache_ttl: Duration::from_secs(3600),
        enable_pushdown: true,
        enable_parallel_scans: true,
        parallel_scan_batch_size: 8192,
        ..Default::default()
    };
    let manager = FederationManager::new(config).await?;

    // Register S3 FDW with main analytics bucket
    let s3_fdw = Arc::new(S3ForeignDataWrapper::new());
    manager
        .register_fdw("analytics-s3".to_string(), CloudProvider::AWS, s3_fdw)
        .await?;

    // Register Azure FDW with EU compliance data
    let azure_fdw = Arc::new(AzureForeignDataWrapper::new());
    manager
        .register_fdw("compliance-azure".to_string(), CloudProvider::Azure, azure_fdw)
        .await?;

    // Register GCP FDW for ML datasets
    let gcp_fdw = Arc::new(GcpForeignDataWrapper::new());
    manager
        .register_fdw("ml-gcp".to_string(), CloudProvider::GCP, gcp_fdw)
        .await?;

    Ok(())
}

Writing Federated Queries

Federation seamlessly integrates with standard SQL. Reference cloud storage paths using URL-style notation:

-- AWS S3 query
SELECT customer_id, SUM(amount) as lifetime_value
FROM s3://sales-bucket/transactions/2024/*.parquet
WHERE status = 'completed'
GROUP BY customer_id
ORDER BY lifetime_value DESC
LIMIT 100;

-- Azure Blob query
SELECT loyalty_id, points_balance, last_transaction_date
FROM azure://loyalty-container/members/*.parquet
WHERE points_balance > 0
ORDER BY points_balance DESC;

-- Google Cloud Storage query
SELECT user_id, recommendation_score, model_version
FROM gs://ml-bucket/predictions/2024/01/*.parquet
WHERE recommendation_score > 0.8;

Join Strategies Across Clouds

HeliosDB optimizes joins across clouds intelligently:

Broadcast Join (Small dimension tables)

-- Join small customer dimension from Azure with large transactions from S3
SELECT
    t.transaction_id,
    c.customer_name,
    t.amount,
    c.country
FROM s3://sales/transactions.parquet t
INNER JOIN azure://dimensions/customers.parquet c
    ON t.customer_id = c.customer_id
WHERE t.amount > 1000
  AND c.country IN ('US', 'CA', 'MX');

The small customer table is fetched entirely and broadcast to all S3 transaction scanners, minimizing network traffic.

Hash Join (Medium datasets)

-- Join similar-sized datasets from different clouds
SELECT
    s.session_id,
    s.user_id,
    g.game_data
FROM s3://gaming/sessions.parquet s
INNER JOIN gs://gaming/gameplay.parquet g
    ON s.session_id = g.session_id
WHERE s.duration > 600;

Both datasets are partitioned on the join key and pairs of partitions are joined locally.
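The build/probe mechanics can be sketched as a plain in-memory hash join (illustrative only; the engine streams partitioned batches rather than materializing whole tables):

```rust
use std::collections::HashMap;

// Build a hash table on one side's join key, then probe with the other.
// Returns (session_id, user_id, game_data) tuples that matched.
fn hash_join(
    sessions: &[(u64, &str)], // (session_id, user_id)
    gameplay: &[(u64, &str)], // (session_id, game_data)
) -> Vec<(u64, String, String)> {
    // Build phase: index one side by the join key.
    let mut built: HashMap<u64, &str> = HashMap::new();
    for (sid, user) in sessions {
        built.insert(*sid, *user);
    }
    // Probe phase: stream the other side and emit matches.
    let mut out = Vec::new();
    for (sid, data) in gameplay {
        if let Some(user) = built.get(sid) {
            out.push((*sid, user.to_string(), data.to_string()));
        }
    }
    out
}

fn main() {
    let sessions = [(1, "alice"), (2, "bob")];
    let gameplay = [(2, "level-9"), (3, "level-1")];
    println!("{:?}", hash_join(&sessions, &gameplay));
}
```

In the federated case, each partition pair is joined this way locally, so only rows sharing a hash partition ever meet.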

Nested Loop Join (One-off analysis)

-- Small result set from one source joined with another
WITH top_customers AS (
    SELECT customer_id
    FROM s3://sales/transactions.parquet
    GROUP BY customer_id
    HAVING COUNT(*) > 100
    LIMIT 50
)
SELECT
    tc.customer_id,
    cl.sentiment_analysis
FROM top_customers tc
INNER JOIN azure://customer-surveys/feedback.parquet cl
    ON tc.customer_id = cl.customer_id;

Performance Characteristics

Federation performance depends on several factors:

Data Format Impact:

  • Parquet: ~50-100MB/sec per scanner (highly optimized with pushdown)
  • ORC: ~40-80MB/sec (compression overhead)
  • CSV: ~10-20MB/sec (no column pruning)
  • JSON: ~5-15MB/sec (parsing overhead)

Network Impact:

  • Intra-cloud (same cloud provider): Minimal latency, 10Gbps+ bandwidth
  • Inter-cloud: 100-500ms latency, bandwidth varies by region
  • Cross-continent: 150-300ms latency, potential for bottleneck

Caching Impact:

  • Cache hit: <1ms response (in-memory)
  • Cache miss with pushdown: 5-30 seconds (depends on data size)
  • Cache miss without pushdown: 30-300 seconds (proportional to data volume)

Filter and Projection Pushdown

How Pushdown Works

Rather than fetching entire files and filtering in-database, filter pushdown pushes predicates to the storage layer:

Without Pushdown:

  1. Fetch entire Parquet file (10TB)
  2. Parse all 100 billion rows
  3. Filter in-database: 100 billion rows → 1 million qualifying rows
  4. Project requested columns
  5. Return 1 million rows

Network Cost: 10TB ingress
Processing: Parse 100 billion rows, filter to 1 million

With Pushdown:

  1. Analyze query: WHERE status='active' AND amount > 1000
  2. Send filter predicates to S3/Azure/GCS
  3. Storage layer reads metadata, prunes row groups/stripes
  4. Return only matching rows (1 million rows)
  5. Return projected columns

Network Cost: ~40GB egress (99.6% reduction!)
Processing: Read file metadata, apply filters, parse 1 million rows

Performance Impact

For typical queries, pushdown provides:

  • Data Transfer Reduction: 90-99% fewer bytes
  • Query Speed: 10-100x faster
  • Cost Reduction: 90-99% less egress charges

Example with concrete numbers:

Scenario: Query 10TB Parquet file with 100 billion customer records
Query: SELECT id, name WHERE status='active' AND signup_date > '2023-01-01'
Expected matches: 1 million rows across 10TB
Without Pushdown:
- Transfer: 10TB = $1,000+ (at $0.12/GB typical AWS rate)
- Query time: ~300 seconds
With Pushdown:
- Transfer: ~40GB = ~$5
- Query time: ~10 seconds
- Savings: $995 per query, 30x faster!
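The arithmetic behind those numbers is straightforward (a sketch using the $0.12/GB rate assumed above; real rates vary by region and tier):

```rust
// Egress cost in dollars for a scan of `gb` gigabytes at `rate_per_gb` $/GB.
fn egress_cost(gb: f64, rate_per_gb: f64) -> f64 {
    gb * rate_per_gb
}

fn main() {
    let rate = 0.12; // assumed $/GB, varies by provider and region
    let without = egress_cost(10.0 * 1024.0, rate); // full 10TB scanned
    let with = egress_cost(40.0, rate);             // ~40GB after pushdown
    println!("without pushdown: ${without:.2}, with pushdown: ${with:.2}");
    println!("savings: ${:.2} per query", without - with);
}
```

At that rate the full scan costs roughly $1,229 and the pushed-down scan under $5, matching the order of magnitude quoted above.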

When Pushdown Is Applied

Always Applied:

  • Parquet format with supported operators (=, <, >, BETWEEN, IN, LIKE)
  • ORC format with similar operators
  • Filtering on partition columns
  • Projection of subset of columns

Sometimes Applied:

  • Complex filters (AND/OR combinations) - partially pushed
  • CSV files (projection only, no filtering)
  • JSON files (limited to structure-based filtering)

Never Applied:

  • User-defined functions in filters
  • Filters on computed columns
  • Non-supported file formats

Optimization Strategies

Choose Pushdown-Friendly Formats:

-- Good: Will use pushdown
SELECT customer_id, amount
FROM s3://sales/2024/*.parquet
WHERE amount > 1000 AND status = 'completed';
-- Suboptimal: CSV doesn't support filter pushdown
SELECT customer_id, amount
FROM s3://sales/2024/*.csv
WHERE amount > 1000 AND status = 'completed';

Use Partition Pruning:

-- Good: Partition pruning happens automatically
SELECT * FROM s3://sales/year=2024/month=01/day=15/*.parquet
WHERE amount > 1000;
-- Less optimal: Requires reading multiple partitions
SELECT * FROM s3://sales/all-years/*.parquet
WHERE EXTRACT(year FROM date) = 2024
AND amount > 1000;

Order Filters for Selectivity:

-- Good: Most selective filter first
SELECT * FROM s3://customers/*.parquet
WHERE country = 'US' -- 5% of rows
AND loyalty_status = 'premium' -- 20% of remaining rows
AND lifetime_value > 10000; -- 30% of remaining rows
-- Less optimal: Least selective filter first
SELECT * FROM s3://customers/*.parquet
WHERE lifetime_value > 10000 -- 30% of rows
AND loyalty_status = 'premium' -- 20% of remaining rows
AND country = 'US'; -- 5% of remaining rows
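Why ordering matters is easiest to see with a little arithmetic (a sketch with the hypothetical selectivities above; the final row count is identical either way, but the intermediate work is not):

```rust
// Total rows inspected across all filters for a given evaluation order,
// where each selectivity is the fraction of rows a filter keeps.
fn rows_examined(total: f64, selectivities: &[f64]) -> f64 {
    let mut remaining = total;
    let mut examined = 0.0;
    for s in selectivities {
        examined += remaining; // each filter inspects what survived so far
        remaining *= s;
    }
    examined
}

fn main() {
    let total = 1_000_000.0;
    // Most selective first: 5%, then 20%, then 30%.
    let good = rows_examined(total, &[0.05, 0.20, 0.30]);
    // Least selective first: 30%, then 20%, then 5%.
    let bad = rows_examined(total, &[0.30, 0.20, 0.05]);
    println!("good order examines {good} rows, bad order examines {bad}");
}
```

On a million rows the good ordering inspects about 1.06M filter evaluations versus 1.36M for the bad one; the gap grows with more filters and wider selectivity spreads.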

Multi-Level Caching

Caching Strategy

HeliosDB implements three caching layers for optimal performance:

Layer 1: Query Result Cache

  • Stores complete query results
  • Fastest: Returns cached results in <1ms
  • Best for: Repeated exact queries
  • TTL: Configurable (default 1 hour)
  • Size: Up to configured cache (default 10GB)

Layer 2: Metadata Cache

  • Caches file schemas, statistics, partitions
  • Avoids repeated metadata calls
  • Particularly valuable across clouds
  • TTL: Longer than query cache (default 24 hours)

Layer 3: Distributed Scan Cache

  • Caches parsed and projected batches
  • Shared across similar queries
  • Enables partial result reuse
  • Automatic eviction on data changes

Cache Invalidation

Smart invalidation prevents stale data:

Time-Based Expiration:

  • Query results: 1 hour by default
  • Metadata: 24 hours by default
  • Configurable per datasource
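Time-based expiration can be sketched as a tiny TTL map (illustrative, not the HeliosDB cache implementation):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// A minimal TTL cache: entries expire `ttl` after insertion.
struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V> TtlCache<V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    fn put(&mut self, key: &str, value: V) {
        self.entries.insert(key.to_string(), (Instant::now(), value));
    }

    // Returns the value only while it is fresh; expired entries are dropped.
    fn get(&mut self, key: &str) -> Option<&V> {
        let fresh = match self.entries.get(key) {
            Some((at, _)) => at.elapsed() < self.ttl,
            None => return None,
        };
        if !fresh {
            self.entries.remove(key);
            return None;
        }
        self.entries.get(key).map(|(_, v)| v)
    }
}

fn main() {
    let mut cache = TtlCache::new(Duration::from_secs(3600));
    cache.put("query:sales-2024", 42u64);
    println!("{:?}", cache.get("query:sales-2024")); // Some(42) while fresh
}
```

A per-datasource TTL then just means one such map (or one TTL setting) per registered source.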

Dependency Tracking:

// Cache is invalidated when underlying Parquet files change
let scan_options = ScanOptionsBuilder::new()
    .with_projection(vec!["id".to_string(), "amount".to_string()])
    .with_cache_ttl(Duration::from_secs(3600))
    .build();

Manual Invalidation:

-- Clear all cached results from a source
INVALIDATE CACHE FOR s3://sales-bucket/*;
-- Clear specific pattern
INVALIDATE CACHE FOR azure://loyalty/members/*.parquet;

Automatic Invalidation:

// When files change (detected via S3 events or polling)
// Cache automatically refreshes
manager.enable_s3_event_invalidation("sales-bucket").await?;

Hit Rates and Performance

Monitor cache effectiveness:

let cache_stats = manager.get_cache_stats().await?;
println!("Cache Hits: {}", cache_stats.hits);
println!("Cache Misses: {}", cache_stats.misses);
println!("Hit Ratio: {:.2}%", cache_stats.hit_ratio * 100.0);
println!("Cache Size: {} GB", cache_stats.size_bytes / (1024*1024*1024));
println!("Entries: {}", cache_stats.entry_count);

Typical Hit Ratios:

  • Well-optimized workload: 60-80%
  • Interactive BI dashboards: 40-60%
  • Ad-hoc analytics: 20-40%
  • One-time scans: <5%

Improving Hit Ratios:

  • Pre-warm cache with common queries
  • Longer TTL for stable datasets
  • Partition data to enable result reuse
  • Enable query result logging for pattern analysis

Cost Optimization

Reducing Data Scans

Scan Reduction Techniques:

  1. Pushdown Everything Possible:

    • Filters, projections, and limits are the primary cost lever
    • A 99% scan reduction (through pushdown) saves 99% in egress costs
    • Prefer Parquet/ORC; avoid CSV/JSON for cost-sensitive workloads
  2. Partition Pruning:

    -- Good: Query specific partition
    SELECT * FROM s3://data/year=2024/month=01/day=15/*.parquet
    WHERE customer_status = 'active';
    -- Scans only ~3GB (1/365 of annual data)
    -- Bad: Query full year, filter in database
    SELECT * FROM s3://data/year=2024/**/*.parquet
    WHERE date = '2024-01-15' AND customer_status = 'active';
    -- Scans 1TB+ of irrelevant data
  3. Column Pruning:

    -- Good: Select only needed columns
    SELECT customer_id, amount
    FROM s3://transactions/large_table.parquet;
    -- Fewer columns = fewer bytes transferred
    -- Bad: SELECT *
    SELECT *
    FROM s3://transactions/large_table.parquet
    WHERE amount > 1000;
    -- Transfers unnecessary columns
  4. Query Sampling:

    -- For exploratory queries on massive datasets
    SELECT customer_id, SUM(amount) as total
    FROM s3://transactions/2024/**.parquet
    WHERE RAND() < 0.01 -- Analyze 1% sample
    GROUP BY customer_id
    LIMIT 1000;

Query Optimization for Cost

Pre-aggregation at Source:

-- Bad: Aggregate after transfer
SELECT customer_id, SUM(amount)
FROM s3://detailed_transactions/2024/**/*.parquet
GROUP BY customer_id;
-- Better: Use pre-aggregated data if available
SELECT customer_id, total_amount
FROM s3://aggregated_transactions/2024_monthly/*.parquet
-- Much smaller dataset, same business results

Data Format Selection:

Data Size Comparison (1TB raw data):
- CSV: ~300GB as plain text, ~100GB gzipped (~$12/query)
- JSON: ~400GB as plain text, ~120GB gzipped (~$14/query)
- ORC: ~250GB (~$30/month storage, ~$3/query)
- Parquet: ~200GB (~$24/month storage, ~$2.40/query)
For frequent queries, Parquet breaks even after ~5 queries.
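The break-even arithmetic can be sketched as follows (the per-query figures come from the comparison above; the one-time conversion cost is an assumption for illustration):

```rust
// Queries needed before a format conversion pays for itself:
// one-time conversion cost divided by per-query egress saving.
fn break_even_queries(conversion_cost: f64, saving_per_query: f64) -> f64 {
    conversion_cost / saving_per_query
}

fn main() {
    // Per-query egress from the comparison above: CSV ~$12, Parquet ~$2.40.
    let saving = 12.0 - 2.40;
    // Hypothetical one-time cost to rewrite the dataset as Parquet.
    let conversion = 48.0;
    println!(
        "break-even after ~{:.0} queries",
        break_even_queries(conversion, saving)
    );
}
```

Under these assumptions the conversion pays for itself after about five queries; plug in your own conversion and egress numbers to get a real estimate.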

Caching Strategy:

// Cache high-value queries that run repeatedly
let cached_config = FederationConfig {
    cache_ttl: Duration::from_secs(86400),      // 24 hours
    cache_size_bytes: 100 * 1024 * 1024 * 1024, // 100GB
    enable_caching: true,
    ..Default::default()
};

Pricing Comparison

Typical Cost per Query (10GB dataset scan):

Provider                         Storage   API Calls   Egress   Total
AWS S3 (without pushdown)        $0.23     $0.004      $1.20    $1.43
AWS S3 (with 99% pushdown)       $0.23     $0.004      $0.012   $0.25
Azure Blob (without pushdown)    $0.018    $0.001      $0.80    $0.82
Azure Blob (with 99% pushdown)   $0.018    $0.001      $0.008   $0.03
GCP GCS (without pushdown)       $0.020    $0.004      $1.23    $1.25
GCP GCS (with 99% pushdown)      $0.020    $0.004      $0.012   $0.04

Note: Costs vary by region, volume discounts apply at scale, and exact rates change quarterly.

Cost Monitoring and Controls

// Enable cost tracking
let config = FederationConfig {
    enable_statistics: true,
    ..Default::default()
};

// Monitor federation statistics
let stats = manager.get_federation_stats().await?;
println!("Bytes Scanned: {} GB", stats.total_bytes_scanned / (1024 * 1024 * 1024));
println!(
    "Estimated Cost: ${:.2}",
    stats.total_bytes_scanned as f64 * 0.12 / (1024.0 * 1024.0 * 1024.0)
);
println!("Queries by Provider:");
for (provider, count) in stats.queries_by_provider {
    println!("  {}: {}", provider, count);
}

Cost Control Strategies:

  • Set query timeouts to prevent runaway scans
  • Monitor bytes_scanned metrics
  • Alert on sudden spikes in data transfer
  • Use sampling for exploratory queries
  • Schedule heavy workloads during off-peak hours

Data Sovereignty and Compliance

Keeping Data in Specific Regions

The sovereignty engine enforces geographic data residency:

use heliosdb_sovereignty::{
    ResidencyPolicy, PolicyEngine, PlacementEngine,
    DataClassification, Regulation,
};

// Create GDPR-compliant policy for EU data
let mut eu_policy = ResidencyPolicy::new(
    "gdpr-eu-only".to_string(),
    "EU GDPR Compliance".to_string(),
    DataClassification::PII,
);
eu_policy.allowed_countries = vec![
    "Ireland".to_string(),
    "Germany".to_string(),
    "France".to_string(),
    "Netherlands".to_string(),
];
eu_policy.required_regulations = vec![Regulation::GDPR];
policy_engine.create_policy(eu_policy.clone()).await?;

// Data placement is automatically enforced
let placement = placement_engine
    .place_data(
        "customer_table",
        Some("cust_123".to_string()),
        &eu_policy,
        customer_data.as_bytes(),
        None,
    )
    .await?;
println!("Data placed in: {:?}", placement.region);
// Output: Data placed in: "eu-west-1" (Ireland)

GDPR and Data Residency

GDPR Chapter V (Articles 44-49) requires that transfers of personal data outside the EU have appropriate safeguards. HeliosDB ensures compliance:

Automatic Enforcement:

  • Personal data of EU residents stays in EU regions
  • International transfers blocked automatically
  • Audit logging of all access and transfers
  • User consent tracking and revocation

Implementation:

-- Data residency enforced at query time
-- This query runs entirely in EU regions despite the multi-cloud setup
SELECT customer_id, email, purchase_history
FROM s3://eu-customers/pii.parquet -- EU region
WHERE customer_gdpr_consent = true;

-- This would fail (attempted cross-border transfer)
SELECT
    eu.customer_id,
    us.recommendation_score
FROM s3://eu-customers/pii.parquet eu          -- EU region
JOIN gs://us-recommendations/scores.parquet us -- US region
    ON eu.customer_id = us.customer_id;

Compliance Frameworks

HIPAA (Healthcare):

let mut hipaa_policy = ResidencyPolicy::new(
    "hipaa-us-only".to_string(),
    "HIPAA Compliance".to_string(),
    DataClassification::PHI,
);
hipaa_policy.allowed_countries = vec!["United States".to_string()];
hipaa_policy.required_regulations = vec![Regulation::HIPAA];
hipaa_policy.encrypt_at_rest = true;
hipaa_policy.encrypt_in_transit = true;
policy_engine.create_policy(hipaa_policy).await?;

CCPA (California):

let mut ccpa_policy = ResidencyPolicy::new(
    "ccpa-consumer-pii".to_string(),
    "CCPA Compliance".to_string(),
    DataClassification::PII,
);
ccpa_policy.user_rights_enforced = true;
ccpa_policy.deletion_enforced = true;
ccpa_policy.required_regulations = vec![Regulation::CCPA];
policy_engine.create_policy(ccpa_policy).await?;

PCI-DSS (Payment Cards):

let mut pci_policy = ResidencyPolicy::new(
    "pci-card-data".to_string(),
    "PCI-DSS Compliance".to_string(),
    DataClassification::PaymentCardData,
);
pci_policy.tokenization_required = true;
pci_policy.encryption_required = true;
pci_policy.required_regulations = vec![Regulation::PCIDSS];
policy_engine.create_policy(pci_policy).await?;

Audit Trails Across Clouds

Complete audit logging across all cloud providers:

// Monitor compliance
let compliance_report = policy_engine.generate_compliance_report().await?;
println!("Policies: {}", compliance_report.policies.len());
println!("Violations: {}", compliance_report.violations.len());
println!("Data Locations:");
for (data_id, location) in compliance_report.data_locations {
    println!("  {} -> {} ({})", data_id, location.region, location.provider);
}

// Access audit trail
let audit_events = audit_logger.query_events(
    Some("customer_table"),
    None,                             // All users
    Some(Duration::from_secs(86400)), // Last 24 hours
).await?;
for event in audit_events {
    println!(
        "{}: User {} accessed {} in {}",
        event.timestamp, event.user_id, event.data_id, event.location
    );
}

Federated Learning

Distributed ML Training

Train models on data distributed across clouds without moving sensitive information:

use heliosdb_federation::federated_learning::{
    FederatedLearningEngine, FederatedLearningConfig,
    TrainingConfig, TrainingDataset, AggregationStrategy,
};

let fl_config = FederatedLearningConfig {
    num_rounds: 10,
    learning_rate: 0.01,
    batch_size: 32,
    aggregation_strategy: AggregationStrategy::FedAvg,
    ..Default::default()
};
let engine = FederatedLearningEngine::new(fl_config).await?;

// Training happens on the data in each cloud;
// model updates are aggregated, not raw data
let model = engine.train_model("customer_churn_model", vec![
    TrainingDataset {
        name: "aws_customers".to_string(),
        path: "s3://customer-data/".to_string(),
    },
    TrainingDataset {
        name: "azure_customers".to_string(),
        path: "azure://customer-data/".to_string(),
    },
    TrainingDataset {
        name: "gcp_customers".to_string(),
        path: "gs://customer-data/".to_string(),
    },
]).await?;
println!("Model trained without moving sensitive customer data");

Privacy-Preserving Analytics

Differential privacy ensures individual privacy while enabling aggregate analysis:

// Query with differential privacy
let result = manager.execute_query_with_privacy(
    "SELECT COUNT(*) FROM customer_table WHERE country='US'",
    DifferentialPrivacyParams {
        epsilon: 0.1, // Privacy budget
        delta: 1e-5,  // Failure probability
    },
).await?;

// Result includes noise to prevent identifying individuals
println!("Count (with differential privacy): {}", result);
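The mechanism behind this can be sketched with the standard Laplace mechanism (this is not HeliosDB's implementation; the tiny linear-congruential generator exists only to keep the example self-contained and must not be used for real privacy):

```rust
// Laplace noise scale for a query: sensitivity / epsilon.
// A COUNT query has sensitivity 1 (one person changes the count by at most 1).
fn laplace_scale(sensitivity: f64, epsilon: f64) -> f64 {
    sensitivity / epsilon
}

// Sample Laplace(0, scale) by inverse CDF from a uniform u in (-0.5, 0.5).
fn laplace_sample(scale: f64, uniform: f64) -> f64 {
    -scale * uniform.signum() * (1.0 - 2.0 * uniform.abs()).ln()
}

// Toy uniform source (LCG) purely for self-containment; a real
// implementation must use a cryptographically secure RNG.
fn next_uniform(state: &mut u64) -> f64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    (*state >> 11) as f64 / (1u64 << 53) as f64 - 0.5
}

fn main() {
    let true_count = 12_345.0;
    let scale = laplace_scale(1.0, 0.1); // epsilon = 0.1 as in the query above
    let mut state = 42u64;
    let noisy = true_count + laplace_sample(scale, next_uniform(&mut state));
    println!("noisy count: {noisy:.1} (noise scale = {scale})");
}
```

Smaller epsilon means a larger noise scale, so stronger privacy trades directly against accuracy.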

Model Deployment

Deploy trained models back to cloud providers for inference:

// Deploy model to S3 for batch scoring
engine.deploy_model_to_s3(
    "churn_model_v1",
    "s3://models/churn-prediction/v1.0",
    ModelFormat::ONNX,
).await?;

// Deploy to Azure for real-time inference
engine.deploy_model_to_azure(
    "churn_model_v1",
    "azure://models/churn-prediction",
    AzureDeploymentTarget::MachineLearning,
).await?;

// Deploy to BigQuery for integrated analytics
engine.deploy_model_to_bigquery(
    "churn_model_v1",
    "project_id.dataset.churn_prediction_udf",
).await?;

Architecture Patterns

Pattern 1: Multi-Cloud Data Lake

Scenario: Large enterprise using multiple clouds for different business units.

Architecture:

┌───────────────────────────────────────────────────┐
│             HeliosDB Federation Layer             │
├───────────────────────────────────────────────────┤
│  Query Router │ Cost Optimizer │ Compliance Engine│
│       ↓       │       ↓        │        ↓         │
├───────────────────────────────────────────────────┤
│    S3 (AWS)   │  Blob (Azure)  │    GCS (GCP)     │
│    Finance    │   Compliance   │    Analytics     │
│   Data Lake   │   Data (GDPR)  │    & ML Models   │
└───────────────────────────────────────────────────┘

Benefits:

  • Finance team uses AWS S3 for cost optimization
  • Compliance team uses Azure for GDPR enforcement
  • ML team uses GCP for BigQuery integration
  • Single query interface across all sources

Implementation:

-- Cross-cloud financial dashboard
SELECT
    f.fiscal_quarter,
    f.department,
    f.revenue,
    c.compliance_status,
    m.forecast_adjustment
FROM s3://finance/quarterly_results.parquet f
INNER JOIN azure://compliance/approval_status.parquet c
    ON f.fiscal_quarter = c.period
INNER JOIN gs://forecasting/predictions.parquet m
    ON f.department = m.department
WHERE f.fiscal_quarter = '2024-Q1'
ORDER BY f.revenue DESC;

Pattern 2: Hybrid Cloud with On-Prem

Scenario: Data in on-premises database plus cloud cold storage.

Architecture:

┌────────────────────────────────────────────────────┐
│            HeliosDB Federation Manager             │
├────────────────────────────────────────────────────┤
│ On-Prem PostgreSQL (Hot)  │   S3 Glacier (Cold)    │
│      Recent 90 days       │   Historical Archive   │
│  Fast, expensive storage  │  Cheap, slow storage   │
│   Real-time operational   │   Compliance archive   │
└────────────────────────────────────────────────────┘

Benefits:

  • Keep hot data on-premises for performance
  • Archive to S3 Glacier for cost
  • Single query combines both
  • Automatic tiering as data cools

Implementation:

-- Query spans hot and cold storage
SELECT date, revenue, customer_id
FROM postgres://prod/sales_current -- On-prem hot
WHERE date >= CURRENT_DATE - INTERVAL '90 days'
UNION ALL
SELECT date, revenue, customer_id
FROM s3://archive/sales_historical/*/data.parquet -- Cold archive
WHERE date < CURRENT_DATE - INTERVAL '90 days'
ORDER BY date DESC;
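
The "automatic tiering" bullet above boils down to an age threshold that mirrors the query's 90-day split. A minimal sketch of such a rule, where the `Tier` enum and the threshold constant are illustrative rather than HeliosDB API:

```rust
// Illustrative age-based tiering rule matching the 90-day hot/cold
// split above. Real tiering policies are usually configurable.
#[derive(Debug, PartialEq)]
enum Tier {
    HotPostgres,   // recent, low-latency operational store
    ColdS3Glacier, // historical, low-cost archive
}

const HOT_WINDOW_DAYS: u32 = 90;

/// Decide where a row should live based on its age in days.
fn tier_for(age_days: u32) -> Tier {
    if age_days < HOT_WINDOW_DAYS {
        Tier::HotPostgres
    } else {
        Tier::ColdS3Glacier
    }
}

fn main() {
    println!("day-30 row: {:?}", tier_for(30));   // HotPostgres
    println!("day-400 row: {:?}", tier_for(400)); // ColdS3Glacier
}
```

Because the UNION ALL query filters each branch with the same boundary, every row is read from exactly one tier and no row is double-counted.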

Pattern 3: Regional Distribution

Scenario: Global company with regional data sovereignty requirements.

Architecture:

┌───────────────────────────────────────────────┐
│           HeliosDB Federation Layer           │
│         with Sovereignty Enforcement          │
├───────────────────────────────────────────────┤
│   EU Region   │   US Region   │  APAC Region  │
│   Azure EU    │    AWS US     │   GCP APAC    │
│   GDPR Data   │   CCPA Data   │  Local Data   │
│   (Ireland)   │  (Virginia)   │  (Singapore)  │
└───────────────────────────────────────────────┘

Benefits:

  • Automatic region routing by data classification
  • Compliance enforced at query time
  • Cross-region analytics with privacy
  • Data sovereignty built-in

Implementation:

// Queries automatically route to correct regions
let results = manager.execute_query(
    "SELECT * FROM customers WHERE revenue > 100000",
    PrivacyPolicyId::GDPR, // Only runs in EU regions
).await?;

// Multi-region summary (respects boundaries)
let summary = manager.execute_query(
    "SELECT region, COUNT(*) as count FROM customers GROUP BY region",
    PrivacyPolicyId::Public, // Aggregate across all regions
).await?;
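
Conceptually, the routing step maps a privacy policy to the set of regions a query may touch. A simplified sketch of that mapping, using hypothetical stand-in types for HeliosDB's internal routing tables:

```rust
// Sketch of policy-to-region routing. The enums and region set here
// are illustrative assumptions, not HeliosDB's actual types.
#[derive(Debug, PartialEq)]
enum Region {
    EuWest,        // e.g. Azure Ireland
    UsEast,        // e.g. AWS Virginia
    ApacSoutheast, // e.g. GCP Singapore
}

enum PrivacyPolicy {
    Gdpr,
    Ccpa,
    Public,
}

/// Return the regions a query is allowed to execute in under a policy.
fn allowed_regions(policy: &PrivacyPolicy) -> Vec<Region> {
    match policy {
        PrivacyPolicy::Gdpr => vec![Region::EuWest],
        PrivacyPolicy::Ccpa => vec![Region::UsEast],
        PrivacyPolicy::Public => {
            vec![Region::EuWest, Region::UsEast, Region::ApacSoutheast]
        }
    }
}

fn main() {
    println!("GDPR queries run in: {:?}", allowed_regions(&PrivacyPolicy::Gdpr));
}
```

The key property is that the restriction is applied before execution: a GDPR-scoped query never reaches US or APAC storage, so sovereignty cannot be violated by a mistyped path.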

Performance Tuning

Query Optimization

Analyze Execution Plans:

EXPLAIN VERBOSE
SELECT c.customer_id, c.name, SUM(o.amount) as total_orders
FROM s3://customers/*.parquet c
INNER JOIN azure://orders/2024/*.parquet o
ON c.customer_id = o.customer_id
WHERE c.country = 'US'
GROUP BY c.customer_id, c.name
ORDER BY total_orders DESC
LIMIT 100;

Look for:

  • Pushdown applied to filters and projections
  • Partition pruning for date/region columns
  • Join strategy (broadcast vs hash vs nested loop)
  • Estimated rows and bytes

Partition Pruning:

  • Always query specific date partitions when possible
  • Use partition columns in WHERE clause early
  • Avoid computing on partition columns (date arithmetic)

Format Selection:

  • Parquet for frequently-queried analytics data
  • ORC for ACID workloads and compression
  • CSV only for one-time imports
  • JSON for semi-structured/API data

Partition Pruning

Good Partitioning:

s3://sales/
    year=2024/
        month=01/
            day=15/
                region=US/
                    data.parquet

Query with maximum pruning:

SELECT * FROM s3://sales/**/*.parquet
WHERE year = 2024
AND month IN (1, 2, 3)
AND day >= 10
AND region = 'US'
LIMIT 1000;

Only ~60 files scanned out of ~1500 possible (96% reduction!)
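
To see where the ~60-of-~1500 figure comes from, the sketch below enumerates the leaf partitions of a layout like the one above (assuming 4 regions, all of 2024, and one file per leaf partition) and applies the query's predicates to the path values alone:

```rust
// Sketch: how partition pruning narrows the file list before any bytes
// are read. Partition values are encoded in the path (year=/month=/day=/
// region=), so predicates on those columns discard files without
// opening them. Layout assumed: 4 regions, one data.parquet per leaf.

fn days_in_month(year: u32, month: u32) -> u32 {
    match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        2 if year % 4 == 0 && (year % 100 != 0 || year % 400 == 0) => 29,
        _ => 28,
    }
}

fn main() {
    let regions = ["US", "EU", "APAC", "LATAM"]; // assumed region set
    let (mut total, mut scanned) = (0u32, 0u32);
    for month in 1u32..=12 {
        for day in 1..=days_in_month(2024, month) {
            for region in &regions {
                total += 1;
                // Predicates from the query:
                // month IN (1, 2, 3) AND day >= 10 AND region = 'US'
                if (1..=3).contains(&month) && day >= 10 && *region == "US" {
                    scanned += 1;
                }
            }
        }
    }
    println!("{} of {} files scanned", scanned, total); // 64 of 1464
}
```

Under these assumptions the planner touches 64 of 1,464 files, in line with the ~96% reduction quoted above.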

Format Selection for Performance

Scenario: a 1 TB dataset that must serve 100 queries this month (illustrative prices):

CSV (uncompressed): every query scans the full 1 TB
- 100 queries × 1 TB = 100 TB egress = $12,000
- Total cost: ~$12,000

Parquet (with pushdown): ~100 GB stored, $24/month storage
- 100 queries × 100 MB avg = 10 GB egress = $1.20
- Total cost: $25.20
- Savings: $11,974.80!

Converting to Parquet pays for itself on the very first query.
For annual workloads, the savings compound accordingly.
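
The arithmetic above can be reproduced with a two-line cost model. The unit prices used here ($0.24/GB-month storage, $0.12/GB egress) are assumptions chosen to match the figures in this section, not real price-list values:

```rust
// Back-of-envelope model of the CSV vs Parquet comparison above.
// Unit prices are illustrative assumptions, not cloud list prices.
fn monthly_cost(
    storage_gb: f64,
    egress_gb_per_query: f64,
    queries: u32,
    storage_price_per_gb: f64,
    egress_price_per_gb: f64,
) -> f64 {
    storage_gb * storage_price_per_gb
        + egress_gb_per_query * queries as f64 * egress_price_per_gb
}

fn main() {
    let egress_price = 0.12; // $/GB, assumed
    // CSV: every query scans the full 1 TB (storage cost ignored here).
    let csv = monthly_cost(1000.0, 1000.0, 100, 0.0, egress_price);
    // Parquet: compression plus pushdown cuts each query to ~100 MB.
    let parquet = monthly_cost(100.0, 0.1, 100, 0.24, egress_price);
    println!("CSV: ${:.2}, Parquet: ${:.2}", csv, parquet);
}
```

Note that egress dominates: even doubling the assumed Parquet storage price barely moves the total, which is why pushdown-friendly formats matter more than storage optimization.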

Monitoring

Query Performance Across Clouds

// Get detailed query statistics
let stats = manager.get_federation_stats().await?;
println!("=== Federation Performance ===");
println!("Total Queries: {}", stats.total_queries);
println!(
    "Total Bytes Scanned: {} GB",
    stats.total_bytes_scanned / (1024 * 1024 * 1024)
);
println!("Avg Query Duration: {:.2}ms", stats.avg_query_duration_ms);
println!("Cache Hit Ratio: {:.2}%", stats.cache_hit_ratio * 100.0);

println!("\n=== Queries by Provider ===");
for (provider, count) in stats.queries_by_provider {
    println!("{}: {} queries", provider, count);
}

Cost Tracking

// Monitor federation costs in real-time
let cost_report = manager.get_cost_report().await?;
println!("=== Cost Summary ===");
println!("AWS S3: ${:.2}", cost_report.aws_cost);
println!("Azure: ${:.2}", cost_report.azure_cost);
println!("GCP: ${:.2}", cost_report.gcp_cost);
println!("Total: ${:.2}", cost_report.total_cost);
println!("\n=== Cost Breakdown ===");
println!("Storage: ${:.2}", cost_report.storage_cost);
println!("API Calls: ${:.2}", cost_report.api_cost);
println!("Egress: ${:.2}", cost_report.egress_cost);

Data Transfer Monitoring

-- Monitor data transfer patterns
SELECT
    provider,
    DATE_TRUNC('hour', query_time) as hour,
    COUNT(*) as query_count,
    SUM(bytes_scanned) as bytes_scanned,
    SUM(bytes_transferred) as bytes_transferred
FROM federation_audit_log
WHERE query_time > NOW() - INTERVAL '7 days'
GROUP BY provider, DATE_TRUNC('hour', query_time)
ORDER BY hour DESC;

Troubleshooting

Common Issues

“Connection timed out” to S3/Azure/GCP

Cause: Network connectivity or credential issues
Solution:
1. Verify credentials in FederationManager configuration
2. Check security group / firewall rules
3. Test connectivity: `aws s3 ls`, `az storage blob list`, `gsutil ls`
4. Verify IAM permissions for service account
5. Check for temporary quota exhaustion (usually auto-resolves)

“Pushdown not applied - query running slow”

Cause: The file format or the filter expression does not support pushdown
Solution:
1. Convert to Parquet/ORC from CSV/JSON
2. Simplify filters (avoid UDFs, computed columns)
3. Use partition columns in WHERE clause
4. Check EXPLAIN plan for pushdown status
5. Verify statistics collection is enabled

“Cache hit ratio too low”

Cause: Queries are too diverse or cache too small
Solution:
1. Increase cache size (more memory)
2. Increase cache TTL for stable data
3. Analyze query patterns - standardize where possible
4. Pre-warm cache with common queries
5. Check if data changes frequently (invalidation)
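
To build intuition for why a longer TTL and standardized queries raise the hit ratio, here is a toy result cache with a TTL measured in logical "ticks". This is illustrative only; HeliosDB manages its cache internally and the types here are hypothetical:

```rust
use std::collections::HashMap;

// Toy query-result cache with a TTL in logical ticks, showing why a
// longer TTL raises the hit ratio for repeated queries on stable data.
struct QueryCache {
    ttl_ticks: u64,
    entries: HashMap<String, (u64, String)>, // query -> (inserted_at, result)
    hits: u64,
    misses: u64,
}

impl QueryCache {
    fn new(ttl_ticks: u64) -> Self {
        Self { ttl_ticks, entries: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Serve from cache if the entry is fresh, otherwise recompute.
    fn get_or_compute(&mut self, now: u64, query: &str, compute: impl Fn() -> String) -> String {
        if let Some((at, result)) = self.entries.get(query) {
            if now - at <= self.ttl_ticks {
                self.hits += 1;
                return result.clone();
            }
        }
        self.misses += 1;
        let result = compute();
        self.entries.insert(query.to_string(), (now, result.clone()));
        result
    }

    fn hit_ratio(&self) -> f64 {
        self.hits as f64 / (self.hits + self.misses) as f64
    }
}

fn main() {
    let mut cache = QueryCache::new(10);
    // The same (standardized) query repeated once per tick: only the
    // first call and the post-expiry refresh miss.
    for tick in 0..20 {
        cache.get_or_compute(tick, "SELECT region, COUNT(*) FROM t GROUP BY region", || {
            "result".to_string()
        });
    }
    println!("hit ratio: {:.2}", cache.hit_ratio());
}
```

The same loop with 20 distinct query strings would score a 0% hit ratio, which is the "queries too diverse" failure mode: normalizing literals and column order turns near-duplicate queries into cache hits.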

“High egress costs”

Cause: Pushdown not working or unnecessary columns selected
Solution:
1. Use pushdown-friendly formats (Parquet)
2. Only SELECT needed columns
3. Apply filters for highest selectivity
4. Use partition pruning
5. Sample data for exploratory queries
6. Consider regional replication for frequently-accessed data

Related Documentation

  • Federated Learning User Guide: Advanced privacy-preserving distributed ML training
  • Data Sovereignty Guide: Complete compliance policy management
  • Cache Configuration Guide: Advanced caching strategies and tuning
  • Query Optimization Guide: Performance tuning for complex queries
  • Cost Management Guide: Budget controls and cost allocation
  • Cloud Integration Guides: Specific setup for AWS, Azure, GCP
  • Architecture Reference: Federation system design and internals

Last Updated: December 2024 · Version: 1.0