How a B2B SaaS platform replaced 200+ RDS instances with a single HeliosDB-Lite deployment using copy-on-write branching.
A B2B SaaS company provides a project management platform used by over 200 companies ranging from 10-person startups to 2,000-person enterprises. Each tenant has between 500 and 150,000 records across projects, tasks, time entries, documents, and team configurations. The platform had been running on AWS RDS PostgreSQL since its founding, starting with a shared-database model where all tenants lived in a single PostgreSQL instance with a tenant_id column on every table.
As the company grew past 100 tenants, the shared-database model started breaking down. Large tenants' queries caused lock contention that affected smaller tenants. A single tenant accidentally running an unfiltered SELECT * on the time entries table would saturate the connection pool. Row-Level Security policies added complexity but not true isolation -- a bug in any query could theoretically expose cross-tenant data. After a near-miss where a reporting query briefly returned rows belonging to another tenant (caught in QA, never reached production), the CTO mandated a move to database-per-tenant isolation.
The database-per-tenant migration solved the isolation problem but created a cost problem. Each tenant got their own RDS instance (db.t3.medium minimum for production SLA), and the monthly bill went from $2,800 for a single large RDS instance to $18,400 for 200+ individual instances. Worse, provisioning a new tenant took 30 seconds -- spinning up an RDS instance, running migrations, and seeding default data -- an eternity when a prospect was signing up for a free trial. Schema migrations became a nightmare: rolling out a single ALTER TABLE meant orchestrating changes across 200+ independent databases, and any failure left the fleet in an inconsistent state.
| Problem | Impact |
|---|---|
| Linear cost scaling | Each tenant = one RDS instance ($72-$180/month minimum) |
| $18.4K/month for 200 tenants | Projected $50K+ at 500 tenants; unsustainable unit economics |
| 30-second tenant provisioning | Free trial signup conversion dropped 15% due to loading time |
| Schema migration across 200+ DBs | 4-6 hours of orchestrated rolling updates per migration |
| Migration failure rate | ~3% of instances failed per migration, requiring manual intervention |
| Idle resource waste | 60% of tenant instances averaged <5% CPU utilization |
| Backup management | 200+ individual backup schedules and retention policies |
| Connection pool exhaustion | Each instance maintained a warm pool; 200 pools = 2,000+ idle connections |
The team replaced their fleet of 200+ RDS instances with a single HeliosDB-Lite deployment using database branching for tenant isolation. Each tenant gets their own branch -- a lightweight, copy-on-write fork of a template database. Branches share unchanged data pages, so 200 tenants with similar schemas use a fraction of the storage that 200 independent databases would require.
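The page-sharing claim is easiest to see in miniature. The sketch below is a toy Python model of copy-on-write branching -- illustrative only, and not a description of HeliosDB-Lite's actual storage engine: reads fall through to the parent branch until a page is written, at which point only that page is materialized on the child.

```python
# Toy model of copy-on-write branching (illustrative only; this is not
# HeliosDB-Lite's storage engine). Each branch stores only the pages it
# has modified; unmodified reads fall through to the parent branch.

class Branch:
    def __init__(self, parent=None):
        self.parent = parent
        self.pages = {}  # page_id -> contents written on THIS branch

    def read(self, page_id):
        # Walk up the branch chain until some ancestor owns the page.
        branch = self
        while branch is not None:
            if page_id in branch.pages:
                return branch.pages[page_id]
            branch = branch.parent
        raise KeyError(page_id)

    def write(self, page_id, contents):
        # Copy-on-write: the page is materialized on this branch only.
        self.pages[page_id] = contents

    def owned_pages(self):
        return len(self.pages)


# A template with 1,000 pages, forked for 200 tenants.
template = Branch()
for page in range(1000):
    template.write(page, f"seed-{page}")

tenants = [Branch(parent=template) for _ in range(200)]
tenants[0].write(0, "tenant-0 edit")  # only this one page is copied

# Forking was free: a tenant branch owns zero pages until it writes.
total_pages = template.owned_pages() + sum(t.owned_pages() for t in tenants)
print(total_pages)         # 1001, not 200 * 1000
print(tenants[1].read(0))  # "seed-0" -- unchanged, read from template
print(tenants[0].read(0))  # "tenant-0 edit" -- diverged copy
```

Storage grows with the number of modified pages, not with the number of tenants -- which is why 200 branches of a shared template cost a fraction of 200 full database copies.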
The deployment uses a single HeliosDB-Lite server with a template branch containing the base schema and default data. When a new tenant signs up, the platform creates a branch from the template in under 500 milliseconds. Schema migrations are applied to the template branch first, then each tenant branch is rebased -- or, for breaking changes, a new branch is created from the updated template and data is migrated.
-- The template branch contains the base schema and seed data.
-- It is maintained as the "golden master" for all tenants.
-- Step 1: Create the template (done once, updated with each migration)
CREATE BRANCH template_v47;
-- Switch to the template branch to set up the schema
USE BRANCH template_v47;
CREATE TABLE projects (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
description TEXT,
status VARCHAR(20) DEFAULT 'active',
owner_id BIGINT NOT NULL,
settings JSONB DEFAULT '{}',
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE tasks (
id BIGSERIAL PRIMARY KEY,
project_id BIGINT NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
title VARCHAR(500) NOT NULL,
description TEXT,
status VARCHAR(20) DEFAULT 'todo',
priority INT DEFAULT 0,
assignee_id BIGINT,
due_date DATE,
estimated_hours DECIMAL(6, 2),
tags JSONB DEFAULT '[]',
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE time_entries (
id BIGSERIAL PRIMARY KEY,
task_id BIGINT NOT NULL REFERENCES tasks(id) ON DELETE CASCADE,
user_id BIGINT NOT NULL,
started_at TIMESTAMP NOT NULL,
ended_at TIMESTAMP,
duration_minutes INT,
notes TEXT,
billable BOOLEAN DEFAULT TRUE
);
CREATE TABLE team_members (
id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL UNIQUE,
display_name VARCHAR(200) NOT NULL,
role VARCHAR(50) DEFAULT 'member',
settings JSONB DEFAULT '{}',
created_at TIMESTAMP DEFAULT NOW()
);
-- Indexes
CREATE INDEX idx_tasks_project ON tasks (project_id, status);
CREATE INDEX idx_tasks_assignee ON tasks (assignee_id) WHERE status != 'done';
CREATE INDEX idx_time_task ON time_entries (task_id, started_at DESC);
-- Seed default data (project templates, default settings, etc.)
INSERT INTO projects (name, description, status, owner_id, settings)
VALUES ('Getting Started', 'Your first project', 'active', 0,
'{"template": true, "tutorial": true}');
-- Step 2: Create a tenant branch (takes <500ms)
-- This is a copy-on-write fork -- no data is physically copied.
CREATE BRANCH tenant_acme FROM template_v47;
-- The tenant immediately has the full schema + seed data.
-- Writes go to the tenant's branch; reads fall through to shared pages.
import { Pool } from "pg";
interface TenantInfo {
tenantId: string;
companyName: string;
plan: "free" | "pro" | "enterprise";
branchName: string;
createdAt: Date;
}
interface ProvisioningResult {
tenantId: string;
branchName: string;
provisioningMs: number;
status: "ready" | "error";
}
class TenantManager {
private adminPool: Pool;
private tenantPools: Map<string, Pool> = new Map();
constructor(heliosDsn: string) {
// Admin pool for branch management operations
this.adminPool = new Pool({
connectionString: heliosDsn,
max: 10,
});
}
/**
* Provision a new tenant by creating a branch from the template.
*
* Previous architecture (AWS RDS):
* 1. CreateDBInstance API call (15-25s)
* 2. Wait for instance available (5-10s)
* 3. Run schema migrations (2-3s)
* 4. Seed default data (1-2s)
* Total: ~30 seconds
*
* HeliosDB branching:
* 1. CREATE BRANCH (copy-on-write fork) (<500ms)
* Total: <500ms
*/
async provisionTenant(
tenantId: string,
companyName: string,
plan: "free" | "pro" | "enterprise"
): Promise<ProvisioningResult> {
    const start = Date.now();
    // Branch names are interpolated into DDL below, so reject tenant IDs
    // containing characters outside a safe set before building the name.
    if (!/^[a-z0-9_-]+$/.test(tenantId)) {
      throw new Error(`Invalid tenant id: ${tenantId}`);
    }
    const branchName = `tenant_${tenantId}`;
    const client = await this.adminPool.connect();
try {
// Create a copy-on-write branch from the template.
// This is nearly instantaneous regardless of template size
// because no data is physically copied.
await client.query(`CREATE BRANCH ${branchName} FROM template_v47`);
// Switch to the new branch and customize for this tenant
await client.query(`USE BRANCH ${branchName}`);
// Set tenant-specific configuration
await client.query(
`
INSERT INTO team_members (email, display_name, role, settings)
VALUES ($1, $2, 'admin', $3)
`,
[
`admin@${tenantId}.example.com`,
`${companyName} Admin`,
JSON.stringify({ plan, onboarding: true }),
]
);
// Update the getting-started project for this tenant
await client.query(
`
UPDATE projects
SET settings = settings || $1::jsonb
WHERE settings->>'template' = 'true'
`,
[JSON.stringify({ company_name: companyName, plan })]
);
const elapsed = Date.now() - start;
return {
tenantId,
branchName,
provisioningMs: elapsed,
status: "ready",
};
    } catch (err) {
      console.error(`Failed to provision tenant ${tenantId}:`, err);
      // Clean up on failure. Switch back to the template first, since the
      // session may still be sitting on the half-provisioned branch.
      try {
        await client.query(`USE BRANCH template_v47`);
        await client.query(`DROP BRANCH IF EXISTS ${branchName}`);
      } catch {
        // Best effort cleanup
      }
      return {
        tenantId,
        branchName,
        provisioningMs: Date.now() - start,
        status: "error",
      };
    } finally {
      // Reset the session branch so the pooled admin connection does not
      // leak tenant-branch state to later callers.
      try {
        await client.query(`USE BRANCH template_v47`);
      } catch {
        // Ignore -- an unusable connection is discarded by the pool.
      }
      client.release();
    }
}
/**
* Get a connection pool for a specific tenant's branch.
* Pools are created lazily and cached.
*/
getTenantPool(tenantId: string): Pool {
const branchName = `tenant_${tenantId}`;
if (!this.tenantPools.has(tenantId)) {
const pool = new Pool({
connectionString: `postgresql://app:password@localhost:5433/saasdb?options=-c%20helios.branch=${branchName}`,
max: 5, // Per-tenant pool size -- much smaller than RDS per-instance
idleTimeoutMillis: 30000,
});
this.tenantPools.set(tenantId, pool);
}
return this.tenantPools.get(tenantId)!;
}
/**
* Execute a query in a tenant's isolated branch.
* Complete data isolation -- no tenant_id filters needed.
*/
async queryTenant<T>(
tenantId: string,
sql: string,
params?: any[]
): Promise<T[]> {
const pool = this.getTenantPool(tenantId);
const result = await pool.query(sql, params);
return result.rows as T[];
}
/**
* Delete a tenant and all their data by dropping the branch.
* Instant, regardless of data volume.
*/
async deprovisionTenant(tenantId: string): Promise<void> {
const branchName = `tenant_${tenantId}`;
// Close and remove the connection pool
const pool = this.tenantPools.get(tenantId);
if (pool) {
await pool.end();
this.tenantPools.delete(tenantId);
}
// Drop the branch -- instant, frees copy-on-write pages
const client = await this.adminPool.connect();
try {
await client.query(`DROP BRANCH ${branchName}`);
} finally {
client.release();
}
}
}
// Usage: Tenant provisioning during free trial signup
const manager = new TenantManager(
"postgresql://admin:password@localhost:5433/saasdb"
);
// This completes in <500ms -- the prospect sees their dashboard immediately
const result = await manager.provisionTenant(
"acme-corp",
"Acme Corporation",
"free"
);
console.log(`Provisioned in ${result.provisioningMs}ms`);
// Querying is standard SQL -- no tenant_id filters needed
// The branch provides complete isolation at the database level
const projects = await manager.queryTenant<{ name: string; status: string }>(
"acme-corp",
"SELECT name, status FROM projects ORDER BY created_at DESC LIMIT 10"
);
The most operationally impactful capability -- migrating 200+ tenant databases safely:
import psycopg2
import time
from dataclasses import dataclass
from typing import Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
@dataclass
class MigrationResult:
tenant_id: str
status: str # 'success', 'failed', 'rolled_back'
duration_ms: float
error: Optional[str] = None
class BranchMigrator:
"""
Performs zero-downtime schema migrations across all tenant branches.
Strategy:
1. Apply migration to the template branch (golden master)
2. For each tenant, create a migration branch, apply changes, verify, swap
3. If any step fails, the tenant stays on their original branch
Previous approach (200 independent RDS instances):
- Serial execution: 4-6 hours
- ~3% failure rate requiring manual intervention
- Risky: failed migration left database in inconsistent state
HeliosDB branching approach:
- Parallel execution: 2-3 minutes for 200 tenants
- 0% data loss risk: original branch preserved until verified
- Atomic swap: tenant sees old or new schema, never a partial state
"""
def __init__(self, dsn: str, max_workers: int = 20):
self.dsn = dsn
self.max_workers = max_workers
def migrate_all_tenants(
self, migration_sql: str, version: int
) -> list[MigrationResult]:
"""Apply a migration to all tenant branches in parallel."""
# Step 1: Apply to template first
print(f"Applying migration v{version} to template...")
self._apply_to_template(migration_sql, version)
# Step 2: Get all tenant branches
tenant_ids = self._list_tenant_branches()
print(f"Migrating {len(tenant_ids)} tenant branches...")
# Step 3: Migrate tenants in parallel
results = []
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
futures = {
executor.submit(
self._migrate_tenant, tid, migration_sql, version
): tid
for tid in tenant_ids
}
for future in as_completed(futures):
result = future.result()
results.append(result)
symbol = "OK" if result.status == "success" else "FAIL"
print(
f" [{symbol}] {result.tenant_id} "
f"({result.duration_ms:.0f}ms)"
)
# Summary
succeeded = sum(1 for r in results if r.status == "success")
failed = sum(1 for r in results if r.status != "success")
print(f"\nMigration v{version} complete: "
f"{succeeded} succeeded, {failed} failed")
return results
def _apply_to_template(self, migration_sql: str, version: int) -> None:
"""Apply migration to the template branch (golden master)."""
conn = psycopg2.connect(self.dsn)
try:
with conn:
with conn.cursor() as cur:
# Update template to new version
cur.execute(
f"CREATE BRANCH template_v{version} "
f"FROM template_v{version - 1}"
)
cur.execute(f"USE BRANCH template_v{version}")
cur.execute(migration_sql)
finally:
conn.close()
def _migrate_tenant(
self, tenant_id: str, migration_sql: str, version: int
) -> MigrationResult:
"""
Migrate a single tenant using branch-and-swap strategy.
1. Create a migration branch from the tenant's current branch
2. Apply the migration SQL
3. Run verification queries
4. Swap the tenant to the migrated branch
5. Keep old branch for 24h rollback window
"""
start = time.time()
conn = psycopg2.connect(self.dsn)
old_branch = f"tenant_{tenant_id}"
new_branch = f"tenant_{tenant_id}_v{version}"
try:
with conn:
with conn.cursor() as cur:
# Create migration branch (copy-on-write, instant)
cur.execute(
f"CREATE BRANCH {new_branch} FROM {old_branch}"
)
# Apply migration on the new branch
cur.execute(f"USE BRANCH {new_branch}")
cur.execute(migration_sql)
# Verify: check row counts match (data integrity)
cur.execute(f"USE BRANCH {old_branch}")
cur.execute(
"SELECT COUNT(*) FROM projects"
)
old_count = cur.fetchone()[0]
cur.execute(f"USE BRANCH {new_branch}")
cur.execute(
"SELECT COUNT(*) FROM projects"
)
new_count = cur.fetchone()[0]
if old_count != new_count:
raise ValueError(
f"Row count mismatch: {old_count} vs {new_count}"
)
# Swap: rename branches atomically
cur.execute(
f"ALTER BRANCH {old_branch} "
f"RENAME TO {old_branch}_retired_v{version}"
)
cur.execute(
f"ALTER BRANCH {new_branch} "
f"RENAME TO {old_branch}"
)
elapsed = (time.time() - start) * 1000
return MigrationResult(tenant_id, "success", elapsed)
except Exception as e:
elapsed = (time.time() - start) * 1000
# Rollback: drop the migration branch, tenant stays on original
try:
with conn:
with conn.cursor() as cur:
cur.execute(f"DROP BRANCH IF EXISTS {new_branch}")
except Exception:
pass
return MigrationResult(tenant_id, "failed", elapsed, str(e))
finally:
conn.close()
def _list_tenant_branches(self) -> list[str]:
"""List all active tenant branch names."""
conn = psycopg2.connect(self.dsn)
try:
with conn.cursor() as cur:
                cur.execute(r"""
                    SELECT branch_name
                    FROM helios_branches()
                    WHERE branch_name LIKE 'tenant\_%'
                      AND branch_name NOT LIKE '%\_retired\_%'
                      AND branch_name NOT LIKE '%\_failed\_%'
                    ORDER BY branch_name
                """)
                # Strip only the leading prefix; str.replace("tenant_", "")
                # would also mangle tenant ids that contain "tenant_".
                return [row[0].removeprefix("tenant_") for row in cur.fetchall()]
finally:
conn.close()
def rollback_tenant(self, tenant_id: str, version: int) -> None:
"""
Rollback a single tenant to their pre-migration branch.
The retired branch is preserved for exactly this purpose.
"""
conn = psycopg2.connect(self.dsn)
try:
with conn:
with conn.cursor() as cur:
current = f"tenant_{tenant_id}"
retired = f"tenant_{tenant_id}_retired_v{version}"
# Swap back
cur.execute(
f"ALTER BRANCH {current} "
f"RENAME TO tenant_{tenant_id}_failed_v{version}"
)
cur.execute(
f"ALTER BRANCH {retired} RENAME TO {current}"
)
finally:
conn.close()
# Usage: Adding a 'priority_label' column to tasks table
migrator = BranchMigrator(
"postgresql://admin:password@localhost:5433/saasdb",
max_workers=20,
)
migration_sql = """
ALTER TABLE tasks ADD COLUMN priority_label VARCHAR(20);
UPDATE tasks SET priority_label = CASE
WHEN priority >= 8 THEN 'critical'
WHEN priority >= 5 THEN 'high'
WHEN priority >= 3 THEN 'medium'
ELSE 'low'
END;
CREATE INDEX idx_tasks_priority_label ON tasks (priority_label)
WHERE status != 'done';
"""
results = migrator.migrate_all_tenants(migration_sql, version=48)
# If any tenant had issues, roll them back individually
for r in results:
if r.status == "failed":
print(f"Rolling back {r.tenant_id}: {r.error}")
migrator.rollback_tenant(r.tenant_id, version=48)
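The migrator's docstring promises a 24-hour rollback window, but nothing above ever reclaims the retired branches, so they would accumulate. A periodic cleanup job could drop them once the window passes; the selection logic is sketched below. Recording when each branch was retired (for example, in an ops table written during the swap) is an assumption here -- the code above does not show where that timestamp would come from.

```python
# Sketch: selecting retired branches whose 24-hour rollback window has
# passed, so a periodic job can DROP BRANCH them. The (name, retired_at)
# pairs are assumed to come from bookkeeping done at swap time; that
# bookkeeping is an assumption, not shown in the migration code.

import re
from datetime import datetime, timedelta

RETIRED = re.compile(r"^tenant_.+_retired_v\d+$")

def expired_retired_branches(branches, now, window=timedelta(hours=24)):
    """branches: iterable of (branch_name, retired_at) pairs."""
    return [
        name
        for name, retired_at in branches
        if RETIRED.match(name) and now - retired_at >= window
    ]

now = datetime(2024, 6, 2, 12, 0)
branches = [
    ("tenant_acme", datetime(2024, 5, 1)),                    # live branch: never matched
    ("tenant_acme_retired_v48", datetime(2024, 6, 1, 9)),     # retired >24h ago
    ("tenant_globex_retired_v48", datetime(2024, 6, 2, 11)),  # still in the window
]
print(expired_retired_branches(branches, now))
# ['tenant_acme_retired_v48']
```

Each name returned would then be dropped with `DROP BRANCH IF EXISTS`, mirroring the cleanup path in `_migrate_tenant`.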
Queries that run within a tenant's isolated branch -- no WHERE tenant_id = ? needed:
-- Dashboard query: project health overview
-- Runs in the tenant's branch -- complete isolation, no tenant_id filter
-- Time entries are pre-aggregated per task; joining them row-by-row would
-- fan out the tasks join and inflate every count and sum.
WITH task_time AS (
    SELECT task_id, SUM(duration_minutes) AS minutes_logged
    FROM time_entries
    GROUP BY task_id
)
SELECT
    p.id,
    p.name,
    p.status,
    COUNT(t.id) AS total_tasks,
    SUM(CASE WHEN t.status = 'done' THEN 1 ELSE 0 END) AS completed_tasks,
    SUM(CASE WHEN t.status = 'done' THEN 1 ELSE 0 END)::DECIMAL
        / NULLIF(COUNT(t.id), 0) * 100 AS completion_pct,
    SUM(CASE WHEN t.due_date < CURRENT_DATE AND t.status != 'done'
        THEN 1 ELSE 0 END) AS overdue_tasks,
    COALESCE(SUM(tt.minutes_logged), 0) / 60.0 AS total_hours_logged,
    COALESCE(SUM(t.estimated_hours), 0) AS total_estimated_hours
FROM projects p
LEFT JOIN tasks t ON t.project_id = p.id
LEFT JOIN task_time tt ON tt.task_id = t.id
WHERE p.status = 'active'
GROUP BY p.id, p.name, p.status
ORDER BY overdue_tasks DESC, completion_pct ASC;
-- Team utilization report: hours by member over the past 4 weeks
WITH weekly_hours AS (
SELECT
tm.display_name,
tm.role,
DATE_TRUNC('week', te.started_at) AS week,
SUM(te.duration_minutes) / 60.0 AS hours_worked,
SUM(CASE WHEN te.billable THEN te.duration_minutes ELSE 0 END)
/ 60.0 AS billable_hours
FROM team_members tm
JOIN time_entries te ON te.user_id = tm.id
WHERE te.started_at >= NOW() - INTERVAL '4 weeks'
GROUP BY tm.display_name, tm.role, DATE_TRUNC('week', te.started_at)
)
SELECT
display_name,
role,
ROUND(AVG(hours_worked)::DECIMAL, 1) AS avg_weekly_hours,
ROUND(AVG(billable_hours)::DECIMAL, 1) AS avg_billable_hours,
ROUND(
AVG(billable_hours) / NULLIF(AVG(hours_worked), 0) * 100, 1
) AS billable_pct,
ROUND(MAX(hours_worked)::DECIMAL, 1) AS peak_week_hours
FROM weekly_hours
GROUP BY display_name, role
ORDER BY avg_weekly_hours DESC;
-- Cross-project task search with JSONB tag filtering
SELECT
t.id,
t.title,
t.status,
t.priority_label,
p.name AS project_name,
tm.display_name AS assignee,
t.due_date,
t.tags
FROM tasks t
JOIN projects p ON t.project_id = p.id
LEFT JOIN team_members tm ON t.assignee_id = tm.id
WHERE t.status != 'done'
AND t.tags @> '["urgent"]'::jsonb
ORDER BY t.priority DESC, t.due_date ASC NULLS LAST;
| Metric | Before (200 RDS Instances) | After (HeliosDB-Lite Branching) | Improvement |
|---|---|---|---|
| Monthly infrastructure cost | $18,400 | $2,700 | 85% reduction |
| Cost per tenant | $92 | $13.50 | 85% reduction |
| Projected cost at 500 tenants | $46,000 | $4,200 | 91% reduction |
| Tenant provisioning time | 30 seconds | 0.5 seconds | 98% faster |
| Schema migration (200 tenants) | 4-6 hours (serial) | 2-3 minutes (parallel) | 99% faster |
| Migration failure rate | ~3% per migration | 0% (branch-and-swap) | Eliminated |
| Migration rollback time | 30-60 minutes (restore from backup) | <1 second (branch rename) | Near-instant |
| Idle connection pools | 2,000+ connections | 200 connections (shared) | 90% reduction |
| Storage efficiency | 200 full copies | Copy-on-write (shared pages) | ~70% reduction |
| Backup management | 200 individual schedules | 1 unified backup | 99.5% simpler |
| Trial signup conversion | 72% (30s wait) | 89% (<1s wait) | +17 percentage points |
| Cost Category | Before (Monthly) | After (Monthly) | Notes |
|---|---|---|---|
| RDS instances (200x db.t3.medium) | $14,400 | $0 | Replaced by HeliosDB-Lite |
| RDS storage (200x 20GB gp3) | $1,600 | $0 | Copy-on-write branching |
| RDS backups | $800 | $0 | Single unified backup |
| RDS data transfer | $600 | $0 | Local connections |
| Migration tooling (Step Functions) | $200 | $0 | Built-in branching |
| Connection pooling (PgBouncer on EC2) | $800 | $0 | Not needed |
| HeliosDB-Lite compute (2x c6g.2xlarge) | $0 | $2,200 | Primary + standby |
| HeliosDB-Lite storage (2TB gp3) | $0 | $320 | Shared via COW |
| Monitoring | $0 | $180 | Prometheus + Grafana |
| Total | $18,400 | $2,700 | 85% reduction |
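The headline per-tenant figures follow directly from the cost table. A quick arithmetic check -- all figures are taken from the tables above, not from an independent cost model:

```python
# Sanity-checking the unit economics from the cost tables above. Under
# instance-per-tenant, cost scales linearly with tenant count; under
# branching, a fixed cluster cost is amortized across all tenants.

RDS_TOTAL = 18_400                 # 200 tenants on RDS, per the cost table
HELIOS_TOTAL = 2_200 + 320 + 180   # compute + storage + monitoring

tenants = 200
print(RDS_TOTAL / tenants)     # 92.0 dollars per tenant
print(HELIOS_TOTAL / tenants)  # 13.5 dollars per tenant
print(round((1 - HELIOS_TOTAL / RDS_TOTAL) * 100))  # 85 percent saved

# The 500-tenant projection assumes per-instance costs stay linear while
# the branching cluster grows only modestly (figures from the results
# table, not derived independently).
print(46_000 / 500, 4_200 / 500)  # 92.0 vs 8.4 dollars per tenant
```

The asymmetry is the point: the RDS line stays at $92/tenant at any scale, while the branching deployment's per-tenant cost falls as tenants are added.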
Removing the tenant_id filter from every query also reduced bugs. With branch-based isolation, there is no possibility of cross-tenant data leakage from a missing WHERE clause. Every query runs in a tenant's isolated branch by default. This eliminated an entire class of security bugs and simplified the application code significantly -- no more RLS policies, no more mandatory tenant_id parameters on every database function.

| Component | Technology |
|---|---|
| Database | HeliosDB-Lite (server mode, 2-node primary/standby) |
| Multi-tenancy | Database branching (copy-on-write) |
| Client Protocol | PostgreSQL wire protocol (SCRAM-SHA-256 auth) |
| Application | TypeScript (Next.js), Python (migration tooling) |
| Driver | pg (Node.js), psycopg2 (Python) |
| Deployment | AWS EC2 (c6g.2xlarge ARM64), gp3 EBS storage |
| Monitoring | Prometheus + Grafana (branch-level metrics) |
| CI/CD | GitHub Actions (migration testing against template branch) |
| Previous Stack (replaced) | 200+ AWS RDS PostgreSQL instances, PgBouncer, AWS Step Functions (migration orchestration) |
Schedule a demo to see how HeliosDB can transform your data infrastructure.