How a 12-person legal tech startup eliminated Pinecone costs and achieved ACID guarantees across document and embedding operations.
A 12-person legal tech startup was building an AI-powered contract analysis platform used by mid-size law firms. The core product allows attorneys to upload contracts, ask natural language questions ("Does this agreement contain a non-compete clause wider than 50 miles?"), and receive answers grounded in the actual document text with precise citations. The system needed to handle documents ranging from 2-page NDAs to 400-page M&A agreements, with a corpus of over 180,000 documents across all customers.
Their initial architecture followed the standard RAG playbook: documents were stored in S3, metadata and relational data lived in PostgreSQL on RDS, and embeddings were indexed in Pinecone. The retrieval pipeline called OpenAI's embedding API, queried Pinecone, fetched the source documents from S3, looked up metadata in PostgreSQL, assembled the context, and called GPT-4 for the final answer. Each question required four network round-trips across four separate services before the LLM could even begin generating a response.
The architecture had a deeper problem beyond latency. When a user uploaded a new version of a contract, the system had to update PostgreSQL, re-embed the chunks and upsert to Pinecone, and replace the file in S3 -- three separate operations with no transactional guarantee. In production, they discovered that roughly 2% of document updates resulted in stale embeddings in Pinecone pointing to old document versions in S3, causing the AI to cite text that no longer existed in the current contract. For a product used by attorneys making legal decisions, this was unacceptable.
| Problem | Impact |
|---|---|
| 4 network round-trips per query | ~440ms (P50) before LLM generation even starts; over 1.2s with cold starts |
| No cross-service transactions | 2% of document updates leave stale embeddings citing deleted text |
| Pinecone costs scaling with corpus | $2,100/month for 180K documents, projected $8K/month at 500K |
| S3 eventual consistency | Occasionally served old document version after update |
| Schema split | Document metadata in PostgreSQL, content in S3, embeddings in Pinecone -- no single source of truth |
| Cold start latency | Pinecone serverless indexes had 200-400ms cold start on infrequent queries |
| Embedding sync pipeline | Custom Lambda function to keep Pinecone in sync with S3 uploads -- frequent failures |
| No hybrid search | Could not combine vector similarity with full-text keyword search in a single query |
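The consistency failure in the second row is easy to reproduce in miniature. The sketch below is a toy simulation with hypothetical in-memory stand-ins for the three stores, not the actual pipeline code; a crash between the Pinecone upsert and the S3 write (the order could equally be reversed) leaves the stores disagreeing about the current version:

```python
# Toy simulation of the old non-transactional update path.
# pg_meta, pinecone_index, s3_bucket are hypothetical stand-ins.
pg_meta: dict = {}
pinecone_index: dict = {}
s3_bucket: dict = {}

def update_contract(doc_id: int, new_text: str, fail_before_s3: bool = False) -> None:
    """Three independent writes with no transaction spanning them."""
    # Step 1: bump version metadata in PostgreSQL
    pg_meta[doc_id] = {"version": pg_meta.get(doc_id, {}).get("version", 0) + 1}
    # Step 2: re-embed and upsert to Pinecone
    pinecone_index[doc_id] = f"embedding-of:{new_text}"
    if fail_before_s3:
        raise RuntimeError("S3 write failed")  # the inconsistency window
    # Step 3: replace the file in S3
    s3_bucket[doc_id] = new_text

update_contract(1, "v1 text")
try:
    update_contract(1, "v2 text", fail_before_s3=True)
except RuntimeError:
    pass
# Pinecone now holds v2 embeddings while S3 still serves the v1 text.
```

No amount of retry logic fully closes this window; only a transaction spanning all three writes does, which is what the consolidation below provides.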
The team consolidated their entire data layer into HeliosDB-Lite, using its combined relational, vector, and full-text search capabilities within a single ACID-compliant database.
-- Documents table with metadata
CREATE TABLE documents (
    id              BIGSERIAL PRIMARY KEY,
    tenant_id       INT NOT NULL,
    title           TEXT NOT NULL,
    doc_type        VARCHAR(50) NOT NULL,  -- 'nda', 'msa', 'employment', 'merger'
    parties         JSONB,                 -- structured party information
    effective_date  DATE,
    expiration_date DATE,
    full_text       TEXT NOT NULL,         -- complete document text for FTS
    uploaded_at     TIMESTAMP DEFAULT NOW(),
    updated_at      TIMESTAMP DEFAULT NOW()
);

-- Chunked embeddings table -- vector and text in the same database
CREATE TABLE document_chunks (
    id           BIGSERIAL PRIMARY KEY,
    document_id  BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index  INT NOT NULL,
    chunk_text   TEXT NOT NULL,          -- the actual text of this chunk
    embedding    VECTOR(1536) NOT NULL,  -- OpenAI text-embedding-3-small
    section_type VARCHAR(50),            -- 'preamble', 'definitions', 'obligations', 'termination'
    page_start   INT,
    page_end     INT,
    UNIQUE (document_id, chunk_index)
);

-- HNSW index for fast approximate nearest neighbor search
CREATE INDEX idx_chunks_embedding ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

-- Full-text search over the complete document text
-- (GIN indexes a tsvector expression, not the raw TEXT column)
CREATE INDEX idx_documents_fulltext ON documents
    USING gin (to_tsvector('english', full_text));

-- Supports the tenant-isolation and doc-type filters used on every query
CREATE INDEX idx_documents_tenant ON documents (tenant_id, doc_type);
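The retrieval queries later in this piece rank chunks with the cosine distance operator `<=>` (which `vector_cosine_ops` enables on the HNSW index) and report `1 - distance` as similarity. A minimal Python sketch of that relationship, useful for sanity-checking scores offline:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as computed by the `<=>` operator:
    1 - (a . b) / (|a| * |b|). Similarity is then 1 - distance."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Identical vectors have distance 0 (similarity 1), orthogonal vectors distance 1, and opposite vectors distance 2, which is why the retrieval queries can use a fixed similarity threshold like 0.72.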
The critical improvement: document content, metadata, and embeddings are written in a single transaction.
import psycopg2
import psycopg2.extras
import openai
from typing import Optional


class DocumentIngester:
    """
    Ingests documents into HeliosDB-Lite with transactional consistency.
    Document text, metadata, and vector embeddings are committed atomically.
    """

    def __init__(self, helios_dsn: str, openai_key: str):
        self.conn = psycopg2.connect(helios_dsn)
        self.openai = openai.OpenAI(api_key=openai_key)
        self.CHUNK_SIZE = 512  # tokens per chunk
        self.CHUNK_OVERLAP = 64

    def ingest_document(
        self,
        tenant_id: int,
        title: str,
        doc_type: str,
        full_text: str,
        parties: Optional[dict] = None,
    ) -> int:
        """
        Ingest a document with all its embeddings in a single ACID transaction.

        Previously this required:
          1. Upload to S3
          2. INSERT into PostgreSQL
          3. Embed chunks via OpenAI
          4. Upsert to Pinecone
        Any failure between steps left the system in an inconsistent state.
        Now it's a single transaction -- all or nothing.
        """
        chunks = self._chunk_text(full_text)
        embeddings = self._embed_chunks(chunks)

        with self.conn:
            with self.conn.cursor() as cur:
                # Insert document and chunks in one transaction
                cur.execute("""
                    INSERT INTO documents
                        (tenant_id, title, doc_type, parties, full_text)
                    VALUES (%s, %s, %s, %s::jsonb, %s)
                    RETURNING id
                """, (tenant_id, title, doc_type,
                      psycopg2.extras.Json(parties), full_text))
                doc_id = cur.fetchone()[0]

                # Insert all chunks with embeddings -- same transaction.
                # str(embedding) renders the list as '[0.1, 0.2, ...]',
                # the text input format the VECTOR type accepts.
                for i, (chunk_text, embedding) in enumerate(
                    zip(chunks, embeddings)
                ):
                    section = self._classify_section(chunk_text)
                    cur.execute("""
                        INSERT INTO document_chunks
                            (document_id, chunk_index, chunk_text,
                             embedding, section_type)
                        VALUES (%s, %s, %s, %s::vector, %s)
                    """, (doc_id, i, chunk_text, str(embedding), section))

        # Transaction commits here -- atomic.
        # No stale embeddings, no orphaned chunks, no S3 sync issues.
        return doc_id

    def update_document(self, doc_id: int, new_text: str) -> None:
        """
        Atomically replace document text and all embeddings.
        The old version remains accessible via time-travel queries.
        """
        chunks = self._chunk_text(new_text)
        embeddings = self._embed_chunks(chunks)

        with self.conn:
            with self.conn.cursor() as cur:
                # Update document text
                cur.execute("""
                    UPDATE documents
                    SET full_text = %s, updated_at = NOW()
                    WHERE id = %s
                """, (new_text, doc_id))

                # Delete old chunks and insert new ones -- same transaction
                cur.execute(
                    "DELETE FROM document_chunks WHERE document_id = %s",
                    (doc_id,)
                )
                for i, (chunk_text, embedding) in enumerate(
                    zip(chunks, embeddings)
                ):
                    section = self._classify_section(chunk_text)
                    cur.execute("""
                        INSERT INTO document_chunks
                            (document_id, chunk_index, chunk_text,
                             embedding, section_type)
                        VALUES (%s, %s, %s, %s::vector, %s)
                    """, (doc_id, i, chunk_text, str(embedding), section))

        # Atomic commit: text + embeddings update together.
        # Old version still accessible via:
        #   SELECT * FROM documents AS OF TIMESTAMP '...'

    def _embed_chunks(self, chunks: list[str]) -> list[list[float]]:
        response = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=chunks
        )
        return [item.embedding for item in response.data]

    def _chunk_text(self, text: str) -> list[str]:
        # Simplified chunking -- production uses tiktoken with overlap
        words = text.split()
        chunks = []
        for i in range(0, len(words), self.CHUNK_SIZE - self.CHUNK_OVERLAP):
            chunk = " ".join(words[i:i + self.CHUNK_SIZE])
            if chunk:
                chunks.append(chunk)
        return chunks

    def _classify_section(self, text: str) -> str:
        lower = text.lower()
        if "whereas" in lower or "recital" in lower:
            return "preamble"
        if "shall mean" in lower or "defined as" in lower:
            return "definitions"
        if "shall not" in lower or "obligat" in lower:
            return "obligations"
        if "terminat" in lower or "expir" in lower:
            return "termination"
        return "general"
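The `_chunk_text` method above approximates tokens with words; the comment notes that production uses tiktoken. A token-accurate variant of the same sliding window is sketched below. It operates on a list of token IDs, which in production would come from a tokenizer (tiktoken's `cl100k_base` encoding is the one matching OpenAI's embedding models); here the chunking logic itself is the point:

```python
def chunk_token_ids(
    token_ids: list[int], size: int = 512, overlap: int = 64
) -> list[list[int]]:
    """Sliding window over token IDs: each chunk holds up to `size` tokens,
    and consecutive chunks share `overlap` tokens, mirroring the word-based
    logic in _chunk_text. Each chunk would be decoded back to text before
    embedding."""
    step = size - overlap
    return [
        token_ids[i:i + size]
        for i in range(0, len(token_ids), step)
        if token_ids[i:i + size]
    ]
```

With 1,000 tokens, `size=100`, and `overlap=20`, this yields 13 chunks whose boundaries advance 80 tokens at a time, so no clause is ever split without appearing whole in an adjacent chunk.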
The most impactful capability -- combining three search modalities in a single query that previously required orchestrating three services:
-- RAG retrieval: find the most relevant chunks for a user question
-- Combines vector similarity with keyword matching and metadata filtering
-- This single query replaces: Pinecone search + S3 fetch + PostgreSQL metadata lookup
WITH question_embedding AS (
    -- In practice, the application computes this and passes it as a parameter
    SELECT '[0.023, -0.041, 0.067, ...]'::vector(1536) AS vec
),
vector_matches AS (
    -- Vector similarity search via HNSW index
    SELECT
        dc.id,
        dc.document_id,
        dc.chunk_text,
        dc.section_type,
        dc.page_start,
        dc.page_end,
        1 - (dc.embedding <=> (SELECT vec FROM question_embedding)) AS similarity
    FROM document_chunks dc
    JOIN documents d ON dc.document_id = d.id
    WHERE d.tenant_id = 42                      -- tenant isolation
      AND d.doc_type IN ('nda', 'employment')   -- metadata filter
      AND d.effective_date >= '2024-01-01'      -- date range filter
    ORDER BY dc.embedding <=> (SELECT vec FROM question_embedding)
    LIMIT 20
),
keyword_boost AS (
    -- Boost chunks that also contain exact keyword matches
    -- (text @@ query requires a tsvector and tsquery; '&' is tsquery AND)
    SELECT
        vm.*,
        CASE
            WHEN to_tsvector('english', vm.chunk_text)
                 @@ to_tsquery('english', 'non-compete & radius') THEN 0.15
            WHEN to_tsvector('english', vm.chunk_text)
                 @@ to_tsquery('english', 'non-compete') THEN 0.08
            ELSE 0.0
        END AS keyword_score
    FROM vector_matches vm
)
SELECT
    document_id,
    chunk_text,
    section_type,
    page_start,
    page_end,
    similarity + keyword_score AS combined_score
FROM keyword_boost
WHERE similarity > 0.72  -- relevance threshold
ORDER BY combined_score DESC
LIMIT 5;
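The CTE above hard-codes a placeholder literal; in the application the embedding is computed at query time and bound as a parameter. With psycopg2, one way to do that is to format the float list as the vector type's text representation and bind it to a `%s::vector` placeholder. This is an illustrative helper, not a library API:

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Render a float list as the vector type's input syntax,
    e.g. '[0.1,0.2,0.3]', suitable for a %s::vector placeholder."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Usage with a psycopg2 cursor (sketch):
#   cur.execute(
#       "SELECT ... WHERE embedding <=> %s::vector < 0.3",
#       (to_vector_literal(query_embedding),),
#   )
```

Binding the vector as a parameter rather than interpolating it into the SQL string keeps the query plan cacheable and avoids any escaping concerns.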
import { Client } from "pg";
import OpenAI from "openai";

interface RetrievedChunk {
  documentId: number;
  chunkText: string;
  sectionType: string;
  pageStart: number;
  pageEnd: number;
  score: number;
}

interface AnswerResult {
  answer: string;
  citations: { documentId: number; page: number; text: string }[];
  latencyMs: number;
}

class ContractAnalyzer {
  private db: Client;
  private openai: OpenAI;

  constructor(heliosDsn: string, openaiKey: string) {
    this.db = new Client({ connectionString: heliosDsn });
    this.openai = new OpenAI({ apiKey: openaiKey });
  }

  async connect(): Promise<void> {
    await this.db.connect();
  }
  /**
   * Full RAG pipeline: embed question -> hybrid search -> LLM answer.
   *
   * Previous architecture (4 network round-trips):
   *   1. App -> OpenAI Embeddings API (120ms)
   *   2. App -> Pinecone search (180ms + cold start)
   *   3. App -> S3 fetch source text (95ms)
   *   4. App -> PostgreSQL metadata (45ms)
   *   Total: ~440ms before LLM call, plus 2% stale data risk
   *
   * HeliosDB architecture (1 database query):
   *   1. App -> OpenAI Embeddings API (120ms)
   *   2. App -> HeliosDB hybrid search (28ms) -- vectors + text + metadata
   *   Total: ~148ms before LLM call, 0% stale data risk
   */
  async analyzeContract(
    tenantId: number,
    question: string,
    docTypes?: string[],
    dateAfter?: string
  ): Promise<AnswerResult> {
    const start = Date.now();

    // Step 1: Embed the question
    const embeddingResp = await this.openai.embeddings.create({
      model: "text-embedding-3-small",
      input: question,
    });
    const queryVec = embeddingResp.data[0].embedding;

    // Step 2: Single hybrid query -- vectors + FTS + metadata filters
    const chunks = await this.hybridSearch(
      tenantId,
      queryVec,
      question,
      docTypes,
      dateAfter
    );

    // Step 3: Generate answer with citations
    const context = chunks
      .map(
        (c, i) =>
          `[Source ${i + 1} | Doc ${c.documentId}, Page ${c.pageStart}]\n${c.chunkText}`
      )
      .join("\n\n---\n\n");

    const completion = await this.openai.chat.completions.create({
      model: "gpt-4o",
      messages: [
        {
          role: "system",
          content: `You are a legal contract analyst. Answer questions based ONLY on
the provided contract excerpts. Always cite your sources using [Source N] notation.
If the answer cannot be determined from the provided text, say so explicitly.`,
        },
        {
          role: "user",
          content: `Contract excerpts:\n\n${context}\n\nQuestion: ${question}`,
        },
      ],
      temperature: 0.1,
    });

    const answer = completion.choices[0].message.content ?? "";
    const citations = this.extractCitations(answer, chunks);

    return {
      answer,
      citations,
      latencyMs: Date.now() - start,
    };
  }
  private async hybridSearch(
    tenantId: number,
    queryVec: number[],
    questionText: string,
    docTypes?: string[],
    dateAfter?: string
  ): Promise<RetrievedChunk[]> {
    // Extract key terms for the full-text boost.
    // websearch_to_tsquery treats whitespace-separated terms as AND
    // and tolerates arbitrary user input (including an empty string).
    const keywords = this.extractKeyTerms(questionText);
    const ftsQuery = keywords.join(" ");
    const vecString = `[${queryVec.join(",")}]`;

    const result = await this.db.query(
      `
      WITH vector_matches AS (
        SELECT
          dc.id,
          dc.document_id,
          dc.chunk_text,
          dc.section_type,
          dc.page_start,
          dc.page_end,
          1 - (dc.embedding <=> $1::vector) AS similarity
        FROM document_chunks dc
        JOIN documents d ON dc.document_id = d.id
        WHERE d.tenant_id = $2
          AND ($3::text[] IS NULL OR d.doc_type = ANY($3))
          AND ($4::date IS NULL OR d.effective_date >= $4::date)
        ORDER BY dc.embedding <=> $1::vector
        LIMIT 20
      )
      SELECT
        document_id,
        chunk_text,
        section_type,
        page_start,
        page_end,
        similarity + CASE
          WHEN to_tsvector('english', chunk_text)
               @@ websearch_to_tsquery('english', $5) THEN 0.12
          ELSE 0.0
        END AS combined_score
      FROM vector_matches
      WHERE similarity > 0.70
      ORDER BY combined_score DESC
      LIMIT 5
      `,
      [vecString, tenantId, docTypes ?? null, dateAfter ?? null, ftsQuery]
    );

    return result.rows.map((row) => ({
      documentId: row.document_id,
      chunkText: row.chunk_text,
      sectionType: row.section_type,
      pageStart: row.page_start,
      pageEnd: row.page_end,
      score: parseFloat(row.combined_score),
    }));
  }
  private extractKeyTerms(question: string): string[] {
    const stopWords = new Set([
      "the", "a", "an", "is", "are", "does", "do", "this",
      "that", "in", "on", "for", "with", "any", "how", "what",
    ]);
    return question
      .toLowerCase()
      .replace(/[^\w\s-]/g, "")
      .split(/\s+/)
      .filter((w) => w.length > 2 && !stopWords.has(w));
  }

  private extractCitations(
    answer: string,
    chunks: RetrievedChunk[]
  ): { documentId: number; page: number; text: string }[] {
    const cited: { documentId: number; page: number; text: string }[] = [];
    const sourceRefs = answer.matchAll(/\[Source (\d+)\]/g);
    for (const match of sourceRefs) {
      const idx = parseInt(match[1], 10) - 1;
      if (idx >= 0 && idx < chunks.length) {
        const chunk = chunks[idx];
        cited.push({
          documentId: chunk.documentId,
          page: chunk.pageStart,
          text: chunk.chunkText.substring(0, 200) + "...",
        });
      }
    }
    return cited;
  }
}
// Usage
const analyzer = new ContractAnalyzer(
  "postgresql://app:password@localhost:5433/legalai",
  process.env.OPENAI_API_KEY!
);
await analyzer.connect();

const result = await analyzer.analyzeContract(
  42,
  "Does this agreement contain a non-compete clause wider than 50 miles?",
  ["employment", "nda"],
  "2024-01-01"
);

console.log(`Answer: ${result.answer}`);
console.log(`Latency: ${result.latencyMs}ms`);
console.log(`Citations: ${result.citations.length} sources`);
A capability that was impossible with the previous architecture:
-- Attorney wants to see what changed between contract versions
-- Old version accessible via time-travel, current version is live
-- Get the relevant chunks from the previous version
WITH old_chunks AS (
    SELECT chunk_index, chunk_text, section_type
    FROM document_chunks
        AS OF TIMESTAMP '2025-03-01T00:00:00Z'
    WHERE document_id = 1847
),
current_chunks AS (
    SELECT chunk_index, chunk_text, section_type
    FROM document_chunks
    WHERE document_id = 1847
)
SELECT
    COALESCE(o.chunk_index, c.chunk_index) AS chunk,
    o.section_type AS old_section,
    c.section_type AS new_section,
    CASE
        WHEN o.chunk_text IS NULL THEN 'ADDED'
        WHEN c.chunk_text IS NULL THEN 'REMOVED'
        ELSE 'MODIFIED'
    END AS change_type
FROM old_chunks o
FULL OUTER JOIN current_chunks c ON o.chunk_index = c.chunk_index
WHERE o.chunk_text IS DISTINCT FROM c.chunk_text  -- unchanged chunks filtered out
ORDER BY chunk;
| Metric | Before (PG + Pinecone + S3) | After (HeliosDB-Lite) | Improvement |
|---|---|---|---|
| Retrieval latency (P50) | 440ms | 118ms | 73% reduction |
| Retrieval latency (P95) | 1,200ms (Pinecone cold start) | 195ms | 84% reduction |
| End-to-end query time (P50) | 2.8s | 1.9s | 32% reduction |
| Stale embedding rate | ~2% of updates | 0% (ACID) | Eliminated |
| Monthly Pinecone cost | $2,100 | $0 | Eliminated |
| Monthly S3 cost | $340 | $0 | Eliminated |
| Monthly RDS cost | $580 | $0 | Eliminated |
| Total monthly data layer cost | $3,020 | $180 (compute only) | 94% reduction |
| Services to maintain | 4 (PG, Pinecone, S3, Lambda sync) | 1 (HeliosDB-Lite) | 75% fewer |
| Document update consistency | Eventual (3 systems) | Immediate (ACID) | Guaranteed |
| Hybrid search capability | Not possible (separate systems) | Native (single query) | New capability |
| Version comparison | Not possible | Time-travel queries | New capability |
| Component | Technology |
|---|---|
| Database | HeliosDB-Lite (embedded, single binary) |
| Vector Index | HNSW (m=16, ef_construction=200), cosine similarity |
| Embedding Model | OpenAI text-embedding-3-small (1536 dimensions) |
| LLM | GPT-4o (question answering) |
| Client Protocol | PostgreSQL wire protocol (SCRAM-SHA-256 auth) |
| Application | TypeScript (Next.js API routes), Python (ingestion pipeline) |
| Driver | pg (Node.js), psycopg2 (Python) |
| Deployment | AWS ECS Fargate, HeliosDB-Lite embedded in application container |
| Chunking Strategy | 512-token chunks with 64-token overlap, section-aware boundaries |
| Previous Stack (replaced) | PostgreSQL 15 (RDS), Pinecone (Serverless), S3, Lambda (sync pipeline) |
Schedule a demo to see how HeliosDB can transform your data infrastructure.