
Multi-Tenancy — 10,000+ Tenants on a Single Cluster

UVP

Run 10,000+ tenants on a single HeliosDB Full cluster with <3% query overhead. The tenancy layer combines automatic SQL rewriting (RLS for every query, every protocol), per-tenant resource quotas (storage / QPS / connections / CPU / memory), three isolation modes (logical / physical / hybrid), tenant-scoped caching that prevents cross-tenant cache poisoning, and a full REST + SQL admin surface. Per-tenant Prometheus metrics and an auto-generated Grafana dashboard ship with the crate. Provision an isolated tenant in <100 ms; never write a WHERE tenant_id = … clause again.


Prerequisites

  • HeliosDB Full v8.0.3
  • ~30 minutes
  • A working understanding of row-level security (RLS) — useful but not required; the crate handles the SQL rewriting for you

The crate is heliosdb-multi-tenancy (in heliosdb-tenancy/crates/multi-tenancy/). It’s a workspace-internal crate; the public surface is the SQL CREATE TENANT syntax and the REST /api/v1/tenants endpoint.


1. Three Isolation Modes — Pick Your Trade-Off

| Mode | What it gives you | Overhead | Use when |
|---|---|---|---|
| Logical | Single schema, automatic WHERE tenant_id = … injection, per-tenant namespace in the shared cache | <3% per query | 90% of SaaS workloads — cheapest, scales to 10K tenants |
| Physical | Separate database per tenant | ~0% query overhead, more storage | High-security tenants, regulated industries (healthcare, finance) |
| Hybrid | Logical for shared tables, physical for sensitive ones | Between the two | When some data must be physically isolated but most can stay logical |

You set the mode per tenant, not per cluster. A single HeliosDB instance can host 9,950 logical tenants and 50 physical tenants happily.
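
For example, here is a minimal sketch of provisioning a physical-isolation tenant next to your logical ones. It reuses the TenantProvisionRequest shape from the Quick Start below, changing only isolation_mode; the tenant name is illustrative.

use heliosdb_multi_tenancy::*;

// Sketch: a physical-isolation tenant gets its own database,
// while logical tenants on the same instance share a schema.
async fn provision_regulated_tenant(system: &MultiTenancySystem) -> anyhow::Result<()> {
    let tenant = system.tenant_manager.create_tenant(TenantProvisionRequest {
        name: "medica_health".into(),
        quotas: QuotaConfig::default(),
        isolation_mode: Some(IsolationMode::Physical), // own database
        parent_id: None,
        settings: None,
        features: None,
        tags: None,
    }).await?;
    println!("physical tenant: {} ({})", tenant.name, tenant.id);
    Ok(())
}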


2. Quick Start — From Zero to First Tenant

2a. Bring up the system

use heliosdb_multi_tenancy::*;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let system = MultiTenancySystem::with_defaults()?;
    let tenant = system.tenant_manager.create_tenant(TenantProvisionRequest {
        name: "acme_corp".into(),
        quotas: QuotaConfig {
            storage_quota_bytes: 100 * 1024 * 1024 * 1024, // 100 GB
            qps_limit: 1000,
            max_connections: 100,
            ..Default::default()
        },
        isolation_mode: Some(IsolationMode::Logical),
        parent_id: None,
        settings: None,
        features: None,
        tags: None,
    }).await?;
    println!("Created tenant: {} ({})", tenant.name, tenant.id);
    Ok(())
}

2b. Or — create with SQL

CREATE TENANT 'acme_corp' WITH (
    STORAGE_QUOTA   = '100GB',
    QPS_LIMIT       = 1000,
    MAX_CONNECTIONS = 100,
    ISOLATION_MODE  = 'logical'
);

SHOW TENANTS;
SHOW TENANT USAGE FOR 'acme_corp';

2c. Or — create over REST

curl -X POST http://localhost:8080/api/v1/tenants \
  -H "Content-Type: application/json" \
  -d '{
    "name": "acme_corp",
    "quotas": {
      "storage_quota_bytes": 107374182400,
      "qps_limit": 1000,
      "max_connections": 100
    }
  }'

All three paths produce the same outcome — pick whichever fits your provisioning automation.
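
If your provisioning automation lives in Rust, the REST surface is easy to drive programmatically. A minimal sketch, assuming reqwest and serde_json as dependencies (any HTTP client works); it prints whatever JSON the documented list endpoint returns:

use serde_json::Value;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // GET the tenant list from the REST admin API (see §8).
    let body: Value = reqwest::get("http://localhost:8080/api/v1/tenants")
        .await?
        .json()
        .await?;
    println!("{body:#}"); // pretty-print the response
    Ok(())
}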


3. RLS — Automatic Query Rewriting

Once you enable RLS on a table, every query against it is automatically rewritten to include the tenant filter. Your application code never has to know about tenant_id.

// Enable RLS for a table
system.rls_manager.enable_table_rls("users")?;
system.rls_manager.enable_table_rls("orders")?;

// Set the tenant context (your auth layer does this once per connection)
let context = RlsContext::for_tenant(tenant.id);

// Query — written as if single-tenant
let sql = "SELECT * FROM users WHERE active = true";
let rewritten = system.query_rewriter.rewrite_query(sql, &context)?;
// → "SELECT * FROM users WHERE active = true AND users.tenant_id = '<uuid>'"

Custom policies

The default policy (tenant_id = {tenant_id}) covers most cases, but you can write fine-grained policies — e.g. letting a child tenant read its parent tenant's rows in a tenant hierarchy:

let policy = RlsPolicy {
    name: "tenant_isolation".into(),
    table: "sensitive_data".into(),
    expression: "tenant_id = {tenant_id} OR {tenant_id} IN (SELECT child FROM tenant_hierarchy WHERE parent = tenant_id)".into(),
    operations: vec![RlsOperation::Select],
    allow_superuser_bypass: true,
    enabled: true,
};
system.rls_manager.create_policy(policy)?;
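
To sanity-check a custom policy, run a query through the rewriter and inspect the output. A sketch, assuming the rewriter injects the policy expression with {tenant_id} substituted; assert on a stable substring rather than the full SQL string:

// Confirm the custom policy is applied on reads of sensitive_data.
let context = RlsContext::for_tenant(tenant.id);
let rewritten = system
    .query_rewriter
    .rewrite_query("SELECT * FROM sensitive_data", &context)?;
assert!(rewritten.contains("tenant_hierarchy"));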

4. Quotas — Defence in Depth

Every tenant has eight enforceable limits:

| Quota | Default | Purpose |
|---|---|---|
| storage_quota_bytes | unlimited | Cap raw disk usage per tenant |
| qps_limit | unlimited | Rate-limit queries per second |
| max_connections | unlimited | Cap concurrent connections |
| max_concurrent_queries | unlimited | Cap parallel queries |
| query_timeout_seconds | 300 | Hard SLA ceiling |
| max_cpu_seconds | unlimited | Per-query CPU budget |
| max_memory_bytes | unlimited | Per-query memory budget |
| network_egress_bytes | unlimited | Per-tenant network cap (Beta) |

Three suggested tiers, ready to copy-paste:

// Size helpers for the snippets below (defined here for copy-paste;
// the crate may or may not export its own).
const GB: u64 = 1024 * 1024 * 1024;
const TB: u64 = 1024 * GB;

// Starter
QuotaConfig { storage_quota_bytes: 10 * GB, qps_limit: 100, max_connections: 25, ..Default::default() }
// Professional
QuotaConfig { storage_quota_bytes: 100 * GB, qps_limit: 1000, max_connections: 100, ..Default::default() }
// Enterprise
QuotaConfig { storage_quota_bytes: 1 * TB, qps_limit: 10000, max_connections: 500, ..Default::default() }

Pre-flight check

let result = system.quota_manager.check_query_quota(tenant.id).await?;
if !result.is_allowed() {
    return Err(anyhow!("Quota exceeded: {}", result.error_message().unwrap()));
}

The quota manager runs this check before the query is even parsed — cheap, at roughly 1.3 µs per call (see the benchmarks below).
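
Wired into a request path, the gate looks something like the sketch below. execute_query is a hypothetical stand-in for your real execution entry point; only check_query_quota, is_allowed, and error_message (shown above) are from the crate.

use anyhow::anyhow;
use uuid::Uuid;

// Sketch: quota gate ahead of parse/plan/execute.
async fn run_for_tenant(system: &MultiTenancySystem, tenant_id: Uuid, sql: &str) -> anyhow::Result<()> {
    let verdict = system.quota_manager.check_query_quota(tenant_id).await?;
    if !verdict.is_allowed() {
        return Err(anyhow!(
            "quota exceeded: {}",
            verdict.error_message().unwrap_or_default()
        ));
    }
    execute_query(system, tenant_id, sql).await
}

// Hypothetical stand-in for the real execution entry point.
async fn execute_query(_system: &MultiTenancySystem, _tenant: Uuid, _sql: &str) -> anyhow::Result<()> {
    Ok(())
}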


5. Tenant-Scoped Caching

A naive shared cache is a cache-poisoning attack waiting to happen. The crate ships TenantCache<K, V> — every entry is keyed by (key, tenant_id), so tenant A literally cannot see tenant B’s cached entries.

let cache: TenantCache<String, Vec<u8>> = TenantCache::new(
    1000,                     // max entries
    Duration::from_secs(300), // TTL
);

cache.put("user_123".into(), data.clone(), tenant_id_a)?;
assert_eq!(cache.get(&"user_123".into(), tenant_id_a), Some(data));
assert_eq!(cache.get(&"user_123".into(), tenant_id_b), None); // Isolation enforced

The same isolation pattern is applied to the result cache, plan cache, and prepared-statement cache — all are tenant-scoped.
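
A typical read-through pattern on top of TenantCache looks like this. A sketch: load_user_blob is a hypothetical loader standing in for your real data path; only TenantCache's new/get/put (shown above) are assumed from the crate.

use std::time::Duration;
use uuid::Uuid;

// Read-through per-tenant caching: hits and misses are both scoped
// to the caller's tenant namespace.
fn cached_fetch(
    cache: &TenantCache<String, Vec<u8>>,
    tenant_id: Uuid,
    key: &str,
) -> anyhow::Result<Vec<u8>> {
    if let Some(hit) = cache.get(&key.to_string(), tenant_id) {
        return Ok(hit); // served from this tenant's entries only
    }
    let value = load_user_blob(tenant_id, key)?;
    cache.put(key.to_string(), value.clone(), tenant_id)?;
    Ok(value)
}

// Hypothetical loader.
fn load_user_blob(_tenant: Uuid, key: &str) -> anyhow::Result<Vec<u8>> {
    Ok(key.as_bytes().to_vec())
}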


6. Per-Tenant Encryption Keys (v5.5+)

Multi-tenancy v2 uses a single cluster-wide KEK (key-encryption key); per-tenant DEKs (data-encryption keys) are on the v5.5 roadmap (see the crate’s Roadmap section). For most workloads the cluster KEK + RLS combination is sufficient — and physical isolation is the recommended path for tenants who need cryptographic separation today.

If you need per-tenant DEKs now, use IsolationMode::Physical and rotate the KEK per database — that’s the supported path until v5.5 ships.


7. Observability

Prometheus metrics

heliosdb_tenant_query_count{tenant_id, query_type}
heliosdb_tenant_storage_bytes{tenant_id}
heliosdb_tenant_active_connections{tenant_id}
heliosdb_tenant_quota_usage{tenant_id, quota_type}
heliosdb_tenant_query_duration_seconds{tenant_id, query_type}
heliosdb_tenant_error_count{tenant_id, error_type}
heliosdb_tenant_cache_hits{tenant_id}

Auto-generated Grafana dashboard

use heliosdb_multi_tenancy::GrafanaDashboardGenerator;
GrafanaDashboardGenerator::export_dashboard_to_file("./grafana-dashboard.json")?;

Import the JSON into Grafana — you get tenant-level QPS, storage usage, error rate, p95 latency, cache hit ratio, and quota saturation panels out of the box.

Pre-built alert rules

use heliosdb_multi_tenancy::AlertRuleGenerator;
let rules = AlertRuleGenerator::generate_default_rules();
let prometheus_rules = AlertRuleGenerator::export_prometheus_rules(&rules);

Default alerts cover: storage at 80%, storage exceeded, error-rate spike, p95 over 5 s, connection limit approaching.


8. REST Admin API

| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/tenants | Create |
| GET | /api/v1/tenants | List |
| GET | /api/v1/tenants/{id} | Detail |
| GET | /api/v1/tenants/{id}/usage | Live usage |
| PUT | /api/v1/tenants/{id}/quotas | Update quotas |
| POST | /api/v1/tenants/{id}/suspend | Pause |
| POST | /api/v1/tenants/{id}/resume | Resume |
| DELETE | /api/v1/tenants/{id} | Delete |

Suspend / resume

Suspended tenants reject all queries with a tenant suspended error. Useful for billing failures, compliance freezes, or controlled migrations.

curl -X POST http://localhost:8080/api/v1/tenants/$TID/suspend
# tenant rejects new connections; existing ones are gracefully drained
curl -X POST http://localhost:8080/api/v1/tenants/$TID/resume

9. Migration — Single Tenant → Multi-Tenant

If you’re already running HeliosDB without tenancy, the migration is four steps:

-- 1. Add tenant_id to existing tables
ALTER TABLE users ADD COLUMN tenant_id UUID;
ALTER TABLE orders ADD COLUMN tenant_id UUID;
-- 2. Backfill (assign existing data to the default tenant)
UPDATE users SET tenant_id = '<default-tenant-uuid>';
UPDATE orders SET tenant_id = '<default-tenant-uuid>';
-- 3. Enable RLS
SELECT heliosdb.enable_table_rls('users');
SELECT heliosdb.enable_table_rls('orders');
-- 4. Application: set tenant context on each connection
-- (your auth layer should do this from the JWT or session)

There is no required downtime for steps 1-3 — they’re online operations. Step 4 must be rolled out consistently across all app instances; gate it behind a feature flag, as sketched below.
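
A hedged sketch of that step-4 rollout: tenancy_flag_enabled and the session-tenant extraction are hypothetical stand-ins for your feature-flag and auth layers; only RlsContext::for_tenant is from the crate.

use heliosdb_multi_tenancy::RlsContext;
use uuid::Uuid;

// Set the tenant context only once the rollout flag is on.
fn on_connection_open(session_tenant: Uuid) -> Option<RlsContext> {
    if !tenancy_flag_enabled() {
        return None; // pre-migration behaviour: no tenant filter yet
    }
    Some(RlsContext::for_tenant(session_tenant))
}

// Hypothetical flag check.
fn tenancy_flag_enabled() -> bool {
    std::env::var("TENANCY_ENFORCED").is_ok()
}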


10. Performance Reference

Validated on the project benchmark suite (cargo bench --bench multi_tenancy_benchmarks):

| Operation | Time |
|---|---|
| query_rewriting/simple_select | 12.5–13.1 µs |
| query_rewriting/select_with_join | 18.2–19.0 µs |
| tenant_cache/cache_get_hit | 85–90 ns |
| quota_checks/check_query_quota | 1.2–1.4 µs |
| overhead_comparison/rls_enforced | 13.1–13.8 µs |
| overhead_comparison/baseline | 12.7–13.3 µs |

That’s ~3% overhead on RLS-enforced queries — well under the 5% target.

Capacity:

  • 10,000+ concurrent tenants per instance (validated)
  • Up to 10,000 QPS per tenant (validated; total instance QPS depends on hardware)
  • 85%+ cache hit rate with tenant-scoped caching

11. Production Checklist

  • Picked the right isolation mode for each tenant tier (don’t put paying enterprises on Logical alongside free tier — use Hybrid or Physical)
  • All multi-tenant tables have RLS enabled
  • Default policy is deny-overrides — RLS bypass requires allow_superuser_bypass = true and a superuser session
  • Quotas are set on every tenant, even free-tier ones
  • Suspend / resume is wired into your billing system (failed payment → suspend)
  • Prometheus is scraping; alerts on storage saturation and error rate are firing to a real on-call channel
  • Grafana dashboard is imported and on the team’s screen
  • Tenant deletion is soft by default in your policy (status Deleted, data retained for 30 days)
  • An emergency superuser role exists, and access is logged + MFA-gated


References

  • Source: /home/app/Helios/Full/heliosdb-tenancy/crates/multi-tenancy/
  • 100+ isolation tests, 50+ integration tests, 90% coverage
  • Benchmark suite: multi_tenancy_benchmarks.rs