
Multi-Tenancy — 10,000+ Tenants on a Single Cluster

UVP

Run 10,000+ tenants on a single HeliosDB Full cluster with <3% query overhead. The tenancy layer combines automatic SQL rewriting (RLS for every query, every protocol), per-tenant resource quotas (storage / QPS / connections / CPU / memory), three isolation modes (logical / physical / hybrid), tenant-scoped caching that prevents cross-tenant cache poisoning, and a full REST + SQL admin surface. Per-tenant Prometheus metrics and an auto-generated Grafana dashboard ship with the crate. Provision an isolated tenant in <100 ms; never write a WHERE tenant_id = … clause again.


Prerequisites

  • HeliosDB Full v8.0.3
  • ~30 minutes
  • A working understanding of row-level security (RLS) — useful but not required; the crate handles the SQL rewriting for you

The crate is heliosdb-multi-tenancy (in heliosdb-tenancy/crates/multi-tenancy/). It’s a workspace-internal crate; the public surface is the SQL CREATE TENANT syntax and the REST /api/v1/tenants endpoint.


1. Three Isolation Modes — Pick Your Trade-Off

| Mode | What it gives you | Overhead | Use when |
|---|---|---|---|
| Logical | Single schema, automatic WHERE tenant_id = … injection, per-tenant namespace in the shared cache | <3% per query | 90% of SaaS workloads — cheapest, scales to 10K tenants |
| Physical | Separate database per tenant | ~0% query overhead, more storage | High-security tenants, regulated industries (healthcare, finance) |
| Hybrid | Logical for shared tables, physical for sensitive ones | Between the two | When some data must be physically isolated but most can stay logical |

You set the mode per tenant, not per cluster. A single HeliosDB instance can host 9,950 logical tenants and 50 physical tenants happily.
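
For example, here is a minimal sketch of provisioning a physical-isolation tenant next to your logical ones. It reuses the TenantProvisionRequest shape from the Quick Start below, changing only isolation_mode; the tenant name is illustrative.

use heliosdb_multi_tenancy::*;

// Sketch: a physical-isolation tenant gets its own database,
// while logical tenants on the same instance share a schema.
async fn provision_regulated_tenant(system: &MultiTenancySystem) -> anyhow::Result<()> {
    let tenant = system.tenant_manager.create_tenant(TenantProvisionRequest {
        name: "medica_health".into(),
        quotas: QuotaConfig::default(),
        isolation_mode: Some(IsolationMode::Physical), // own database
        parent_id: None,
        settings: None,
        features: None,
        tags: None,
    }).await?;
    println!("physical tenant: {} ({})", tenant.name, tenant.id);
    Ok(())
}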


2. Quick Start — From Zero to First Tenant

2a. Bring up the system

use heliosdb_multi_tenancy::*;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let system = MultiTenancySystem::with_defaults()?;
    let tenant = system.tenant_manager.create_tenant(TenantProvisionRequest {
        name: "acme_corp".into(),
        quotas: QuotaConfig {
            storage_quota_bytes: 100 * 1024 * 1024 * 1024, // 100 GB
            qps_limit: 1000,
            max_connections: 100,
            ..Default::default()
        },
        isolation_mode: Some(IsolationMode::Logical),
        parent_id: None,
        settings: None,
        features: None,
        tags: None,
    }).await?;
    println!("Created tenant: {} ({})", tenant.name, tenant.id);
    Ok(())
}

2b. Or — create with SQL

CREATE TENANT 'acme_corp' WITH (
    STORAGE_QUOTA   = '100GB',
    QPS_LIMIT       = 1000,
    MAX_CONNECTIONS = 100,
    ISOLATION_MODE  = 'logical'
);

SHOW TENANTS;
SHOW TENANT USAGE FOR 'acme_corp';

2c. Or — create over REST

curl -X POST http://localhost:8080/api/v1/tenants \
  -H "Content-Type: application/json" \
  -d '{
    "name": "acme_corp",
    "quotas": {
      "storage_quota_bytes": 107374182400,
      "qps_limit": 1000,
      "max_connections": 100
    }
  }'

All three paths produce the same outcome — pick whichever fits your provisioning automation.
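
If your provisioning automation lives in Rust, the REST surface is easy to drive programmatically. A minimal sketch, assuming reqwest and serde_json as dependencies (any HTTP client works); it prints whatever JSON the documented list endpoint returns:

use serde_json::Value;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // GET the tenant list from the REST admin API (see §8).
    let body: Value = reqwest::get("http://localhost:8080/api/v1/tenants")
        .await?
        .json()
        .await?;
    println!("{body:#}"); // pretty-print the response
    Ok(())
}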


3. RLS — Automatic Query Rewriting

Once you enable RLS on a table, every query against it is automatically rewritten to include the tenant filter. Your application code never has to know about tenant_id.

// Enable RLS for a table
system.rls_manager.enable_table_rls("users")?;
system.rls_manager.enable_table_rls("orders")?;

// Set the tenant context (your auth layer does this once per connection)
let context = RlsContext::for_tenant(tenant.id);

// Query — written as if single-tenant
let sql = "SELECT * FROM users WHERE active = true";
let rewritten = system.query_rewriter.rewrite_query(sql, &context)?;
// → "SELECT * FROM users WHERE active = true AND users.tenant_id = '<uuid>'"

Custom policies

The default policy (tenant_id = {tenant_id}) covers most cases, but you can write fine-grained policies — e.g. letting a child tenant read its parent tenant's rows in a tenant hierarchy:

let policy = RlsPolicy {
    name: "tenant_isolation".into(),
    table: "sensitive_data".into(),
    expression: "tenant_id = {tenant_id} OR {tenant_id} IN (SELECT child FROM tenant_hierarchy WHERE parent = tenant_id)".into(),
    operations: vec![RlsOperation::Select],
    allow_superuser_bypass: true,
    enabled: true,
};
system.rls_manager.create_policy(policy)?;
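
To sanity-check a custom policy, run a query through the rewriter and inspect the output. A sketch, assuming the rewriter injects the policy expression with {tenant_id} substituted; assert on a stable substring rather than the full SQL string:

// Confirm the custom policy is applied on reads of sensitive_data.
let context = RlsContext::for_tenant(tenant.id);
let rewritten = system
    .query_rewriter
    .rewrite_query("SELECT * FROM sensitive_data", &context)?;
assert!(rewritten.contains("tenant_hierarchy"));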

4. Quotas — Defence in Depth

Every tenant has eight enforceable limits:

| Quota | Default | Purpose |
|---|---|---|
| storage_quota_bytes | unlimited | Cap raw disk usage per tenant |
| qps_limit | unlimited | Rate-limit queries per second |
| max_connections | unlimited | Cap concurrent connections |
| max_concurrent_queries | unlimited | Cap parallel queries |
| query_timeout_seconds | 300 | Hard SLA ceiling |
| max_cpu_seconds | unlimited | Per-query CPU budget |
| max_memory_bytes | unlimited | Per-query memory budget |
| network_egress_bytes | unlimited | Per-tenant network cap (Beta) |

Three suggested tiers, ready to copy-paste:

// Size helpers for the snippets below (defined here for copy-paste;
// the crate may or may not export its own).
const GB: u64 = 1024 * 1024 * 1024;
const TB: u64 = 1024 * GB;

// Starter
QuotaConfig { storage_quota_bytes: 10 * GB, qps_limit: 100, max_connections: 25, ..Default::default() }
// Professional
QuotaConfig { storage_quota_bytes: 100 * GB, qps_limit: 1000, max_connections: 100, ..Default::default() }
// Enterprise
QuotaConfig { storage_quota_bytes: 1 * TB, qps_limit: 10000, max_connections: 500, ..Default::default() }

Pre-flight check

let result = system.quota_manager.check_query_quota(tenant.id).await?;
if !result.is_allowed() {
    return Err(anyhow!("Quota exceeded: {}", result.error_message().unwrap()));
}

The quota manager runs this check before the query is even parsed — cheap, at roughly 1.3 µs per call (see the benchmarks below).
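
Wired into a request path, the gate looks something like the sketch below. execute_query is a hypothetical stand-in for your real execution entry point; only check_query_quota, is_allowed, and error_message (shown above) are from the crate.

use anyhow::anyhow;
use uuid::Uuid;

// Sketch: quota gate ahead of parse/plan/execute.
async fn run_for_tenant(system: &MultiTenancySystem, tenant_id: Uuid, sql: &str) -> anyhow::Result<()> {
    let verdict = system.quota_manager.check_query_quota(tenant_id).await?;
    if !verdict.is_allowed() {
        return Err(anyhow!(
            "quota exceeded: {}",
            verdict.error_message().unwrap_or_default()
        ));
    }
    execute_query(system, tenant_id, sql).await
}

// Hypothetical stand-in for the real execution entry point.
async fn execute_query(_system: &MultiTenancySystem, _tenant: Uuid, _sql: &str) -> anyhow::Result<()> {
    Ok(())
}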


5. Tenant-Scoped Caching

A naive shared cache is a cache-poisoning attack waiting to happen. The crate ships TenantCache<K, V> — every entry is keyed by (key, tenant_id), so tenant A literally cannot see tenant B’s cached entries.

let cache: TenantCache<String, Vec<u8>> = TenantCache::new(
    1000,                     // max entries
    Duration::from_secs(300), // TTL
);

cache.put("user_123".into(), data.clone(), tenant_id_a)?;
assert_eq!(cache.get(&"user_123".into(), tenant_id_a), Some(data));
assert_eq!(cache.get(&"user_123".into(), tenant_id_b), None); // Isolation enforced

The same isolation pattern is applied to the result cache, plan cache, and prepared-statement cache — all are tenant-scoped.
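
A typical read-through pattern on top of TenantCache looks like this. A sketch: load_user_blob is a hypothetical loader standing in for your real data path; only TenantCache's new/get/put (shown above) are assumed from the crate.

use std::time::Duration;
use uuid::Uuid;

// Read-through per-tenant caching: hits and misses are both scoped
// to the caller's tenant namespace.
fn cached_fetch(
    cache: &TenantCache<String, Vec<u8>>,
    tenant_id: Uuid,
    key: &str,
) -> anyhow::Result<Vec<u8>> {
    if let Some(hit) = cache.get(&key.to_string(), tenant_id) {
        return Ok(hit); // served from this tenant's entries only
    }
    let value = load_user_blob(tenant_id, key)?;
    cache.put(key.to_string(), value.clone(), tenant_id)?;
    Ok(value)
}

// Hypothetical loader.
fn load_user_blob(_tenant: Uuid, key: &str) -> anyhow::Result<Vec<u8>> {
    Ok(key.as_bytes().to_vec())
}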


6. Per-Tenant Encryption Keys (v5.5+)

Multi-tenancy v2 uses a single cluster-wide KEK (key-encryption key); per-tenant DEKs (data-encryption keys) are on the v5.5 roadmap (see the crate’s Roadmap section). For most workloads the cluster KEK + RLS combination is sufficient — and physical isolation is the recommended path for tenants who need cryptographic separation today.

If you need per-tenant DEKs now, use IsolationMode::Physical and rotate the KEK per database — that’s the supported path until v5.5 ships.


7. Observability

Prometheus metrics

heliosdb_tenant_query_count{tenant_id, query_type}
heliosdb_tenant_storage_bytes{tenant_id}
heliosdb_tenant_active_connections{tenant_id}
heliosdb_tenant_quota_usage{tenant_id, quota_type}
heliosdb_tenant_query_duration_seconds{tenant_id, query_type}
heliosdb_tenant_error_count{tenant_id, error_type}
heliosdb_tenant_cache_hits{tenant_id}

Auto-generated Grafana dashboard

use heliosdb_multi_tenancy::GrafanaDashboardGenerator;
GrafanaDashboardGenerator::export_dashboard_to_file("./grafana-dashboard.json")?;

Import the JSON into Grafana — you get tenant-level QPS, storage usage, error rate, p95 latency, cache hit ratio, and quota saturation panels out of the box.

Pre-built alert rules

use heliosdb_multi_tenancy::AlertRuleGenerator;
let rules = AlertRuleGenerator::generate_default_rules();
let prometheus_rules = AlertRuleGenerator::export_prometheus_rules(&rules);

Default alerts cover: storage at 80%, storage exceeded, error-rate spike, p95 over 5 s, connection limit approaching.


8. REST Admin API

| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/tenants | Create |
| GET | /api/v1/tenants | List |
| GET | /api/v1/tenants/{id} | Detail |
| GET | /api/v1/tenants/{id}/usage | Live usage |
| PUT | /api/v1/tenants/{id}/quotas | Update quotas |
| POST | /api/v1/tenants/{id}/suspend | Pause |
| POST | /api/v1/tenants/{id}/resume | Resume |
| DELETE | /api/v1/tenants/{id} | Delete |

Suspend / resume

Suspended tenants reject all queries with a tenant suspended error. Useful for billing failures, compliance freezes, or controlled migrations.

curl -X POST http://localhost:8080/api/v1/tenants/$TID/suspend
# tenant rejects new connections; existing ones are gracefully drained
curl -X POST http://localhost:8080/api/v1/tenants/$TID/resume

9. Migration — Single Tenant → Multi-Tenant

If you’re already running HeliosDB without tenancy, the migration is four steps:

-- 1. Add tenant_id to existing tables
ALTER TABLE users ADD COLUMN tenant_id UUID;
ALTER TABLE orders ADD COLUMN tenant_id UUID;
-- 2. Backfill (assign existing data to the default tenant)
UPDATE users SET tenant_id = '<default-tenant-uuid>';
UPDATE orders SET tenant_id = '<default-tenant-uuid>';
-- 3. Enable RLS
SELECT heliosdb.enable_table_rls('users');
SELECT heliosdb.enable_table_rls('orders');
-- 4. Application: set tenant context on each connection
-- (your auth layer should do this from the JWT or session)

There is no required downtime for steps 1-3 — they’re online operations. Step 4 must be rolled out consistently across all app instances; gate it behind a feature flag, as sketched below.
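
A hedged sketch of that step-4 rollout: tenancy_flag_enabled and the session-tenant extraction are hypothetical stand-ins for your feature-flag and auth layers; only RlsContext::for_tenant is from the crate.

use heliosdb_multi_tenancy::RlsContext;
use uuid::Uuid;

// Set the tenant context only once the rollout flag is on.
fn on_connection_open(session_tenant: Uuid) -> Option<RlsContext> {
    if !tenancy_flag_enabled() {
        return None; // pre-migration behaviour: no tenant filter yet
    }
    Some(RlsContext::for_tenant(session_tenant))
}

// Hypothetical flag check.
fn tenancy_flag_enabled() -> bool {
    std::env::var("TENANCY_ENFORCED").is_ok()
}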


10. Performance Reference

Validated on the project benchmark suite (cargo bench --bench multi_tenancy_benchmarks):

| Operation | Time |
|---|---|
| query_rewriting/simple_select | 12.5–13.1 µs |
| query_rewriting/select_with_join | 18.2–19.0 µs |
| tenant_cache/cache_get_hit | 85–90 ns |
| quota_checks/check_query_quota | 1.2–1.4 µs |
| overhead_comparison/rls_enforced | 13.1–13.8 µs |
| overhead_comparison/baseline | 12.7–13.3 µs |

That’s ~3% overhead on RLS-enforced queries — well under the 5% target.

Capacity:

  • 10,000+ concurrent tenants per instance (validated)
  • Up to 10,000 QPS per tenant (validated; total instance QPS depends on hardware)
  • 85%+ cache hit rate with tenant-scoped caching

11. Production Checklist

  • Picked the right isolation mode for each tenant tier (don’t put paying enterprises on Logical alongside free tier — use Hybrid or Physical)
  • All multi-tenant tables have RLS enabled
  • Default policy is deny-overrides — RLS bypass requires allow_superuser_bypass = true and a superuser session
  • Quotas are set on every tenant, even free-tier ones
  • Suspend / resume is wired into your billing system (failed payment → suspend)
  • Prometheus is scraping; alerts on storage saturation and error rate are firing to a real on-call channel
  • Grafana dashboard is imported and on the team’s screen
  • Tenant deletion is soft by default in your policy (status Deleted, data retained for 30 days)
  • An emergency superuser role exists, and access is logged + MFA-gated


References

  • Source: /home/app/Helios/Full/heliosdb-tenancy/crates/multi-tenancy/
  • 100+ isolation tests, 50+ integration tests, 90% coverage
  • Benchmark suite: multi_tenancy_benchmarks.rs