Multi-Tenancy — 10,000+ Tenants on a Single Cluster
UVP
Run 10,000+ tenants on a single HeliosDB Full cluster with <3% query overhead. The tenancy layer combines automatic SQL rewriting (RLS for every query, every protocol), per-tenant resource quotas (storage / QPS / connections / CPU / memory), three isolation modes (logical / physical / hybrid), tenant-scoped caching that prevents cross-tenant cache poisoning, and a full REST + SQL admin surface. Per-tenant Prometheus metrics and an auto-generated Grafana dashboard ship with the crate. Provision an isolated tenant in <100 ms; never write a WHERE tenant_id = … clause again.
Prerequisites
- HeliosDB Full v8.0.3
- ~30 minutes
- A working understanding of row-level security (RLS) — useful but not required; the crate handles the SQL rewriting for you
The crate is heliosdb-multi-tenancy (in heliosdb-tenancy/crates/multi-tenancy/). It’s a workspace-internal crate; the public surface is the SQL CREATE TENANT syntax and the REST /api/v1/tenants endpoint.
1. Three Isolation Modes — Pick Your Trade-Off
| Mode | What it gives you | Overhead | Use when |
|---|---|---|---|
| Logical | Single schema, automatic WHERE tenant_id = … injection, shared cache namespace per tenant | <3% per query | 90% of SaaS workloads — cheapest, scales to 10K tenants |
| Physical | Separate database per tenant | ~0% query overhead, more storage | High-security tenants, regulated industries (healthcare, finance) |
| Hybrid | Logical for shared tables, physical for sensitive ones | Between the two | When some data must be physically isolated but most can stay logical |
You set the mode per tenant, not per cluster. A single HeliosDB instance can host 9,950 logical tenants and 50 physical tenants happily.
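As a rule of thumb, the table above collapses into a small decision function. The sketch below is illustrative only — the `IsolationMode` variants mirror the names used later in this guide, but `mode_for_tier` and its inputs are assumptions, not crate API:

```rust
/// Illustrative tier-to-mode mapping; not part of the crate's API.
#[derive(Debug, PartialEq)]
enum IsolationMode {
    Logical,
    Physical,
    Hybrid,
}

/// Pick an isolation mode from two questions about the tenant.
fn mode_for_tier(regulated: bool, has_sensitive_tables: bool) -> IsolationMode {
    match (regulated, has_sensitive_tables) {
        // Regulated industries: pay the storage cost, take full physical isolation.
        (true, _) => IsolationMode::Physical,
        // Only some tables are sensitive: isolate those, share the rest.
        (false, true) => IsolationMode::Hybrid,
        // Default: cheapest mode, scales to the most tenants.
        (false, false) => IsolationMode::Logical,
    }
}

fn main() {
    assert_eq!(mode_for_tier(true, false), IsolationMode::Physical);
    assert_eq!(mode_for_tier(false, true), IsolationMode::Hybrid);
    assert_eq!(mode_for_tier(false, false), IsolationMode::Logical);
}
```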
2. Quick Start — From Zero to First Tenant
2a. Bring up the system
```rust
use heliosdb_multi_tenancy::*;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let system = MultiTenancySystem::with_defaults()?;

    let tenant = system.tenant_manager.create_tenant(TenantProvisionRequest {
        name: "acme_corp".into(),
        quotas: QuotaConfig {
            storage_quota_bytes: 100 * 1024 * 1024 * 1024, // 100 GB
            qps_limit: 1000,
            max_connections: 100,
            ..Default::default()
        },
        isolation_mode: Some(IsolationMode::Logical),
        parent_id: None,
        settings: None,
        features: None,
        tags: None,
    }).await?;

    println!("Created tenant: {} ({})", tenant.name, tenant.id);
    Ok(())
}
```
2b. Or — create with SQL
```sql
CREATE TENANT 'acme_corp' WITH (
    STORAGE_QUOTA   = '100GB',
    QPS_LIMIT       = 1000,
    MAX_CONNECTIONS = 100,
    ISOLATION_MODE  = 'logical'
);

SHOW TENANTS;
SHOW TENANT USAGE FOR 'acme_corp';
```
2c. Or — create over REST
```bash
curl -X POST http://localhost:8080/api/v1/tenants \
  -H "Content-Type: application/json" \
  -d '{
    "name": "acme_corp",
    "quotas": {
      "storage_quota_bytes": 107374182400,
      "qps_limit": 1000,
      "max_connections": 100
    }
  }'
```
All three paths produce the same outcome — pick whichever fits your provisioning automation.
3. RLS — Automatic Query Rewriting
Once you enable RLS on a table, every query against it is automatically rewritten to include the tenant filter. Your application code never has to know about tenant_id.
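To make the mechanism concrete, here is a deliberately naive string-level sketch of the rewrite. It is illustrative only — the crate rewrites on the parsed AST and handles JOINs, subqueries, and quoting, and `inject_tenant_filter` is an assumed name, not crate API:

```rust
/// Naive sketch of tenant-filter injection (illustrative; not the crate's
/// real rewriter, which operates on the parsed AST).
fn inject_tenant_filter(sql: &str, table: &str, tenant_id: &str) -> String {
    // If the statement already has a WHERE clause, AND the predicate on;
    // otherwise add one. (A real rewriter must also handle ORDER BY,
    // LIMIT, subqueries, JOINs, and identifier quoting.)
    if sql.to_uppercase().contains("WHERE") {
        format!("{sql} AND {table}.tenant_id = '{tenant_id}'")
    } else {
        format!("{sql} WHERE {table}.tenant_id = '{tenant_id}'")
    }
}

fn main() {
    let out = inject_tenant_filter(
        "SELECT * FROM users WHERE active = true",
        "users",
        "abc-123",
    );
    // → SELECT * FROM users WHERE active = true AND users.tenant_id = 'abc-123'
    println!("{out}");
}
```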
```rust
// Enable RLS for a table
system.rls_manager.enable_table_rls("users")?;
system.rls_manager.enable_table_rls("orders")?;

// Set the tenant context (your auth layer does this once per connection)
let context = RlsContext::for_tenant(tenant.id);

// Query — written as if single-tenant
let sql = "SELECT * FROM users WHERE active = true";
let rewritten = system.query_rewriter.rewrite_query(sql, &context)?;
// → "SELECT * FROM users WHERE active = true AND users.tenant_id = '<uuid>'"
```
Custom policies
The default policy (tenant_id = {tenant_id}) covers most cases, but you can write fine-grained policies — e.g. tenant-admins can see other tenants’ rows in a parent-tenant relationship:
```rust
let policy = RlsPolicy {
    name: "tenant_isolation".into(),
    table: "sensitive_data".into(),
    expression: "tenant_id = {tenant_id} OR {tenant_id} IN \
                 (SELECT child FROM tenant_hierarchy WHERE parent = tenant_id)".into(),
    operations: vec![RlsOperation::Select],
    allow_superuser_bypass: true,
    enabled: true,
};
system.rls_manager.create_policy(policy)?;
```
4. Quotas — Defence in Depth
Every tenant has eight enforceable limits:
| Quota | Default | Purpose |
|---|---|---|
| storage_quota_bytes | unlimited | Cap raw disk usage per tenant |
| qps_limit | unlimited | Rate-limit queries per second |
| max_connections | unlimited | Cap concurrent connections |
| max_concurrent_queries | unlimited | Cap parallel queries |
| query_timeout_seconds | 300 | Hard SLA ceiling |
| max_cpu_seconds | unlimited | Per-query CPU budget |
| max_memory_bytes | unlimited | Per-query memory budget |
| network_egress_bytes | unlimited | Per-tenant network cap (Beta) |
Three suggested tiers, copy-paste:
```rust
// GB and TB are byte-size constants assumed to be in scope, e.g.:
// const GB: u64 = 1024 * 1024 * 1024;
// const TB: u64 = 1024 * GB;

// Starter
QuotaConfig { storage_quota_bytes: 10 * GB, qps_limit: 100, max_connections: 25, ..Default::default() }

// Professional
QuotaConfig { storage_quota_bytes: 100 * GB, qps_limit: 1000, max_connections: 100, ..Default::default() }

// Enterprise
QuotaConfig { storage_quota_bytes: 1 * TB, qps_limit: 10000, max_connections: 500, ..Default::default() }
```
Pre-flight check
```rust
let result = system.quota_manager.check_query_quota(tenant.id).await?;
if !result.is_allowed() {
    return Err(anyhow!("Quota exceeded: {}", result.error_message().unwrap()));
}
```
The quota manager runs a check before the query is parsed — cheap, ~1 µs per call (see benchmarks below).
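For intuition, a QPS pre-flight check of this kind maps naturally onto a fixed-window counter. The sketch below is illustrative only — `QpsWindow` is an assumed name and says nothing about the real QuotaManager's internals:

```rust
use std::time::{Duration, Instant};

/// Illustrative fixed-window QPS limiter (not the crate's QuotaManager).
struct QpsWindow {
    limit: u32,
    count: u32,
    window_start: Instant,
}

impl QpsWindow {
    fn new(limit: u32) -> Self {
        Self { limit, count: 0, window_start: Instant::now() }
    }

    /// Returns true if the next query is allowed under the per-tenant limit.
    fn check(&mut self) -> bool {
        // Reset the counter once the one-second window has elapsed.
        if self.window_start.elapsed() >= Duration::from_secs(1) {
            self.window_start = Instant::now();
            self.count = 0;
        }
        if self.count < self.limit {
            self.count += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut w = QpsWindow::new(2);
    assert!(w.check());
    assert!(w.check());
    assert!(!w.check()); // third query in the same second is rejected
}
```

A production limiter would more likely use a token bucket or sliding window to avoid burst artifacts at window boundaries; the point here is only that the check is a few integer comparisons, which is why it can run in about a microsecond.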
5. Tenant-Scoped Caching
A naive shared cache is a cache-poisoning attack waiting to happen. The crate ships TenantCache<K, V> — every entry is keyed by (key, tenant_id), so tenant A literally cannot see tenant B’s cached entries.
```rust
let cache: TenantCache<String, Vec<u8>> = TenantCache::new(
    1000,                      // capacity
    Duration::from_secs(300),  // TTL
);

cache.put("user_123".into(), data, tenant_id_a)?;

assert_eq!(cache.get(&"user_123".into(), tenant_id_a), Some(data.clone()));
assert_eq!(cache.get(&"user_123".into(), tenant_id_b), None); // Isolation enforced
```
The same isolation pattern is applied to the result cache, plan cache, and prepared-statement cache — all are tenant-scoped.
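The composite-key idea can be sketched in a few lines of plain Rust. This is illustrative only — `TenantScopedCache` is an assumed name, and the shipped TenantCache additionally enforces capacity and TTL:

```rust
use std::collections::HashMap;

/// Illustrative tenant-scoped cache: every entry is keyed by
/// (tenant_id, key), so a lookup under the wrong tenant id can
/// never return another tenant's data.
struct TenantScopedCache<V> {
    entries: HashMap<(u64, String), V>,
}

impl<V: Clone> TenantScopedCache<V> {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    fn put(&mut self, key: &str, value: V, tenant_id: u64) {
        self.entries.insert((tenant_id, key.to_string()), value);
    }

    fn get(&self, key: &str, tenant_id: u64) -> Option<V> {
        self.entries.get(&(tenant_id, key.to_string())).cloned()
    }
}

fn main() {
    let mut cache = TenantScopedCache::new();
    cache.put("user_123", vec![1u8, 2, 3], 1);
    assert_eq!(cache.get("user_123", 1), Some(vec![1, 2, 3]));
    assert_eq!(cache.get("user_123", 2), None); // isolation enforced by the key
}
```

Because the tenant id is part of the map key rather than a filter applied after lookup, isolation holds by construction — there is no code path that could return a cross-tenant entry.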
6. Per-Tenant Encryption Keys (v5.5+)
Multi-tenancy v2 uses a single cluster KEK (key-encryption key); per-tenant DEKs (data-encryption keys) are on the v5.5 roadmap (see the crate’s Roadmap section). For most workloads the cluster KEK + RLS combination is sufficient — and physical isolation is the recommended path for tenants who need cryptographic separation today.
If you need per-tenant DEKs now, use IsolationMode::Physical and rotate the KEK per database — that’s the supported path until v5.5 ships.
7. Observability
Prometheus metrics
```
heliosdb_tenant_query_count{tenant_id, query_type}
heliosdb_tenant_storage_bytes{tenant_id}
heliosdb_tenant_active_connections{tenant_id}
heliosdb_tenant_quota_usage{tenant_id, quota_type}
heliosdb_tenant_query_duration_seconds{tenant_id, query_type}
heliosdb_tenant_error_count{tenant_id, error_type}
heliosdb_tenant_cache_hits{tenant_id}
```
Auto-generated Grafana dashboard
```rust
use heliosdb_multi_tenancy::GrafanaDashboardGenerator;

GrafanaDashboardGenerator::export_dashboard_to_file("./grafana-dashboard.json")?;
```
Import the JSON into Grafana — you get tenant-level QPS, storage usage, error rate, p95 latency, cache hit ratio, and quota saturation panels out of the box.
Pre-built alert rules
```rust
use heliosdb_multi_tenancy::AlertRuleGenerator;

let rules = AlertRuleGenerator::generate_default_rules();
let prometheus_rules = AlertRuleGenerator::export_prometheus_rules(&rules);
```
Default alerts cover: storage at 80%, storage exceeded, error-rate spike, p95 over 5 s, connection limit approaching.
8. REST Admin API
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /api/v1/tenants | Create |
| GET | /api/v1/tenants | List |
| GET | /api/v1/tenants/{id} | Detail |
| GET | /api/v1/tenants/{id}/usage | Live usage |
| PUT | /api/v1/tenants/{id}/quotas | Update quotas |
| POST | /api/v1/tenants/{id}/suspend | Pause |
| POST | /api/v1/tenants/{id}/resume | Resume |
| DELETE | /api/v1/tenants/{id} | Delete |
Suspend / resume
Suspended tenants reject all queries with a tenant suspended error. Useful for billing failures, compliance freezes, or controlled migrations.
```bash
curl -X POST http://localhost:8080/api/v1/tenants/$TID/suspend
# tenant rejects new connections; existing ones are gracefully drained
curl -X POST http://localhost:8080/api/v1/tenants/$TID/resume
```
9. Migration — Single Tenant → Multi-Tenant
If you’re already running HeliosDB without tenancy, the migration is four steps:
```sql
-- 1. Add tenant_id to existing tables
ALTER TABLE users ADD COLUMN tenant_id UUID;
ALTER TABLE orders ADD COLUMN tenant_id UUID;

-- 2. Backfill (assign existing data to the default tenant)
UPDATE users SET tenant_id = '<default-tenant-uuid>';
UPDATE orders SET tenant_id = '<default-tenant-uuid>';

-- 3. Enable RLS
SELECT heliosdb.enable_table_rls('users');
SELECT heliosdb.enable_table_rls('orders');

-- 4. Application: set tenant context on each connection
-- (your auth layer should do this from the JWT or session)
```
There is no required downtime for steps 1-3 — they’re online operations. Step 4 needs to be deployed coherently across all app instances (gate it behind a feature flag).
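A feature-flagged version of step 4 can be sketched as below. This is illustrative only — the claim name "tenant_id" and `tenant_for_connection` are assumptions about your auth layer, not crate API:

```rust
use std::collections::HashMap;

/// Illustrative: resolve the tenant id from auth-layer claims once per
/// connection, behind a rollout flag. With the flag off the app behaves
/// as single-tenant, so instances can be upgraded incrementally.
fn tenant_for_connection(
    claims: &HashMap<String, String>,
    tenancy_enabled: bool,
) -> Option<String> {
    if !tenancy_enabled {
        return None; // flag off: no tenant context set during rollout
    }
    claims.get("tenant_id").cloned()
}

fn main() {
    let mut claims = HashMap::new();
    claims.insert("tenant_id".to_string(), "acme_corp".to_string());

    assert_eq!(tenant_for_connection(&claims, true), Some("acme_corp".to_string()));
    assert_eq!(tenant_for_connection(&claims, false), None);
}
```

Once every instance runs with the flag on and the backfill from step 2 is verified, the flag can be removed and a missing tenant claim treated as an authentication error.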
10. Performance Reference
Validated on the project benchmark suite (cargo bench --bench multi_tenancy_benchmarks):
| Operation | Time |
|---|---|
| query_rewriting/simple_select | 12.5–13.1 µs |
| query_rewriting/select_with_join | 18.2–19.0 µs |
| tenant_cache/cache_get_hit | 85–90 ns |
| quota_checks/check_query_quota | 1.2–1.4 µs |
| overhead_comparison/rls_enforced | 13.1–13.8 µs |
| overhead_comparison/baseline | 12.7–13.3 µs |
That’s ~3% overhead on RLS-enforced queries — well under the 5% target.
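The headline figure can be reproduced from the table: comparing the midpoints of the two overhead_comparison ranges gives roughly 3–3.5%. Illustrative arithmetic only — `overhead_pct` is not a crate API:

```rust
/// Midpoint-based overhead calculation (illustrative; not a crate API).
fn overhead_pct(baseline: (f64, f64), rls: (f64, f64)) -> f64 {
    let b = (baseline.0 + baseline.1) / 2.0; // baseline midpoint, µs
    let r = (rls.0 + rls.1) / 2.0;           // RLS-enforced midpoint, µs
    (r - b) / b * 100.0
}

fn main() {
    // Range midpoints from the benchmark table above (µs).
    let pct = overhead_pct((12.7, 13.3), (13.1, 13.8));
    println!("RLS overhead ≈ {pct:.1}%"); // ≈ 3.5% at the midpoints
}
```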
Capacity:
- 10,000+ concurrent tenants per instance (validated)
- Up to 10,000 QPS per tenant (validated; total instance QPS depends on hardware)
- 85%+ cache hit rate with tenant-scoped caching
11. Production Checklist
- Picked the right isolation mode for each tenant tier (don’t put paying enterprises on Logical alongside free tier — use Hybrid or Physical)
- All multi-tenant tables have RLS enabled
- Default policy is deny-overrides — RLS bypass requires allow_superuser_bypass = true and a superuser session
- Quotas are set on every tenant, even free-tier ones
- Suspend / resume is wired into your billing system (failed payment → suspend)
- Prometheus is scraping; alerts on storage saturation and error rate are firing to a real on-call channel
- Grafana dashboard is imported and on the team’s screen
- Tenant deletion is soft by default in your policy (status Deleted, data retained for 30 days)
- An emergency superuser role exists, and access is logged + MFA-gated
Where Next
- ABAC + SSO Setup — wire your IdP groups to per-tenant roles
- Cognitive Agents — the agents are tenant-aware out of the box; pin the SchemaManager per-tenant for hierarchical workloads
- Intelligent Tiering — tier free-tier tenants to cold storage automatically
- PITR Recovery — backup retention per tenant tier
References
- Source: /home/app/Helios/Full/heliosdb-tenancy/crates/multi-tenancy/
- 100+ isolation tests, 50+ integration tests, 90% coverage
- Benchmark suite: multi_tenancy_benchmarks.rs