Legal Tech Startup: 73% Latency Reduction by Unifying Document Storage, Vector Search, and RAG
Legal Tech Startup: 73% Latency Reduction by Unifying Document Storage, Vector Search, and RAG
Industry: Legal Technology / Artificial Intelligence Challenge: Building a production-grade AI contract analysis platform without the latency and consistency gaps of a multi-service architecture Solution: HeliosDB-Lite (embedded, single-binary deployment) Results: 73% end-to-end latency reduction, $2K/month Pinecone costs eliminated, ACID guarantees across document and embedding operations
The Challenge
A 12-person legal tech startup was building an AI-powered contract analysis platform used by mid-size law firms. The core product allows attorneys to upload contracts, ask natural language questions (“Does this agreement contain a non-compete clause wider than 50 miles?”), and receive answers grounded in the actual document text with precise citations. The system needed to handle documents ranging from 2-page NDAs to 400-page M&A agreements, with a corpus of over 180,000 documents across all customers.
Their initial architecture followed the standard RAG playbook: documents were stored in S3, metadata and relational data lived in PostgreSQL on RDS, embeddings were indexed in Pinecone, and the retrieval pipeline called OpenAI’s embedding API before querying Pinecone, then fetched the source documents from S3, assembled the context, and called GPT-4 for the final answer. Each question required four network round-trips across three external services before the LLM could even begin generating a response.
The architecture had a deeper problem beyond latency. When a user uploaded a new version of a contract, the system had to update PostgreSQL, re-embed the chunks and upsert to Pinecone, and replace the file in S3 — three separate operations with no transactional guarantee. In production, they discovered that roughly 2% of document updates resulted in stale embeddings in Pinecone pointing to old document versions in S3, causing the AI to cite text that no longer existed in the current contract. For a product used by attorneys making legal decisions, this was unacceptable.
Why PostgreSQL + Pinecone + S3 Wasn’t Enough
| Problem | Impact |
|---|---|
| 4 network round-trips per query | 820ms average before LLM generation even starts |
| No cross-service transactions | 2% of document updates leave stale embeddings citing deleted text |
| Pinecone costs scaling with corpus | $2,100/month for 180K documents, projected $8K/month at 500K |
| S3 eventual consistency | Occasionally served old document version after update |
| Schema split | Document metadata in PostgreSQL, content in S3, embeddings in Pinecone — no single source of truth |
| Cold start latency | Pinecone serverless indexes had 200-400ms cold start on infrequent queries |
| Embedding sync pipeline | Custom Lambda function to keep Pinecone in sync with S3 uploads — frequent failures |
| No hybrid search | Could not combine vector similarity with full-text keyword search in a single query |
The HeliosDB Solution
The team consolidated their entire data layer into HeliosDB-Lite, using its combined relational, vector, and full-text search capabilities within a single ACID-compliant database.
Schema Design
-- Documents table with metadataCREATE TABLE documents ( id BIGSERIAL PRIMARY KEY, tenant_id INT NOT NULL, title TEXT NOT NULL, doc_type VARCHAR(50) NOT NULL, -- 'nda', 'msa', 'employment', 'merger' parties JSONB, -- structured party information effective_date DATE, expiration_date DATE, full_text TEXT NOT NULL, -- complete document text for FTS uploaded_at TIMESTAMP DEFAULT NOW(), updated_at TIMESTAMP DEFAULT NOW());
-- Chunked embeddings table -- vector and text in the same databaseCREATE TABLE document_chunks ( id BIGSERIAL PRIMARY KEY, document_id BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE, chunk_index INT NOT NULL, chunk_text TEXT NOT NULL, -- the actual text of this chunk embedding VECTOR(1536) NOT NULL, -- OpenAI text-embedding-3-small section_type VARCHAR(50), -- 'preamble', 'definitions', 'obligations', 'termination' page_start INT, page_end INT, UNIQUE(document_id, chunk_index));
-- HNSW index for fast approximate nearest neighbor searchCREATE INDEX idx_chunks_embedding ON document_chunks USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 200);
-- Full-text search on the complete documentCREATE INDEX idx_documents_fulltext ON documents USING gin (full_text);
-- Compound index for tenant isolationCREATE INDEX idx_chunks_tenant ON document_chunks (document_id) WHERE document_id IN (SELECT id FROM documents);Document Ingestion with ACID Guarantees
The critical improvement: document content, metadata, and embeddings are written in a single transaction.
import psycopg2import openaifrom typing import Optional
class DocumentIngester: """ Ingests documents into HeliosDB-Lite with transactional consistency. Document text, metadata, and vector embeddings are committed atomically. """
def __init__(self, helios_dsn: str, openai_key: str): self.conn = psycopg2.connect(helios_dsn) self.openai = openai.OpenAI(api_key=openai_key) self.CHUNK_SIZE = 512 # tokens per chunk self.CHUNK_OVERLAP = 64
def ingest_document( self, tenant_id: int, title: str, doc_type: str, full_text: str, parties: Optional[dict] = None ) -> int: """ Ingest a document with all its embeddings in a single ACID transaction.
Previously this required: 1. Upload to S3 2. INSERT into PostgreSQL 3. Embed chunks via OpenAI 4. Upsert to Pinecone Any failure between steps left the system in an inconsistent state.
Now it's a single transaction -- all or nothing. """ chunks = self._chunk_text(full_text) embeddings = self._embed_chunks(chunks)
with self.conn: with self.conn.cursor() as cur: # Insert document and chunks in one transaction cur.execute(""" INSERT INTO documents (tenant_id, title, doc_type, parties, full_text) VALUES (%s, %s, %s, %s::jsonb, %s) RETURNING id """, (tenant_id, title, doc_type, psycopg2.extras.Json(parties), full_text))
doc_id = cur.fetchone()[0]
# Insert all chunks with embeddings -- same transaction for i, (chunk_text, embedding) in enumerate( zip(chunks, embeddings) ): section = self._classify_section(chunk_text) cur.execute(""" INSERT INTO document_chunks (document_id, chunk_index, chunk_text, embedding, section_type) VALUES (%s, %s, %s, %s::vector, %s) """, (doc_id, i, chunk_text, embedding, section))
# Transaction commits here -- atomic. # No stale embeddings, no orphaned chunks, no S3 sync issues.
return doc_id
def update_document(self, doc_id: int, new_text: str) -> None: """ Atomically replace document text and all embeddings. The old version remains accessible via time-travel queries. """ chunks = self._chunk_text(new_text) embeddings = self._embed_chunks(chunks)
with self.conn: with self.conn.cursor() as cur: # Update document text cur.execute(""" UPDATE documents SET full_text = %s, updated_at = NOW() WHERE id = %s """, (new_text, doc_id))
# Delete old chunks and insert new ones -- same transaction cur.execute( "DELETE FROM document_chunks WHERE document_id = %s", (doc_id,) )
for i, (chunk_text, embedding) in enumerate( zip(chunks, embeddings) ): section = self._classify_section(chunk_text) cur.execute(""" INSERT INTO document_chunks (document_id, chunk_index, chunk_text, embedding, section_type) VALUES (%s, %s, %s, %s::vector, %s) """, (doc_id, i, chunk_text, embedding, section))
# Atomic commit: text + embeddings update together. # Old version still accessible via: # SELECT * FROM documents AS OF TIMESTAMP '...'
def _embed_chunks(self, chunks: list[str]) -> list[list[float]]: response = self.openai.embeddings.create( model="text-embedding-3-small", input=chunks ) return [item.embedding for item in response.data]
def _chunk_text(self, text: str) -> list[str]: # Simplified chunking -- production uses tiktoken with overlap words = text.split() chunks = [] for i in range(0, len(words), self.CHUNK_SIZE - self.CHUNK_OVERLAP): chunk = " ".join(words[i:i + self.CHUNK_SIZE]) if chunk: chunks.append(chunk) return chunks
def _classify_section(self, text: str) -> str: lower = text.lower() if "whereas" in lower or "recital" in lower: return "preamble" if "shall mean" in lower or "defined as" in lower: return "definitions" if "shall not" in lower or "obligat" in lower: return "obligations" if "terminat" in lower or "expir" in lower: return "termination" return "general"Hybrid Search: Vector + Full-Text + SQL in One Query
The most impactful capability — combining three search modalities in a single query that previously required orchestrating three services:
-- RAG retrieval: find the most relevant chunks for a user question-- Combines vector similarity with keyword matching and metadata filtering-- This single query replaces: Pinecone search + S3 fetch + PostgreSQL metadata lookup
WITH question_embedding AS ( -- In practice, the application computes this and passes it as a parameter SELECT '[0.023, -0.041, 0.067, ...]'::vector(1536) AS vec),vector_matches AS ( -- Vector similarity search via HNSW index SELECT dc.id, dc.document_id, dc.chunk_text, dc.section_type, dc.page_start, dc.page_end, 1 - (dc.embedding <=> (SELECT vec FROM question_embedding)) AS similarity FROM document_chunks dc JOIN documents d ON dc.document_id = d.id WHERE d.tenant_id = 42 -- tenant isolation AND d.doc_type IN ('nda', 'employment') -- metadata filter AND d.effective_date >= '2024-01-01' -- date range filter ORDER BY dc.embedding <=> (SELECT vec FROM question_embedding) LIMIT 20),keyword_boost AS ( -- Boost chunks that also contain exact keyword matches SELECT vm.*, CASE WHEN vm.chunk_text @@ 'non-compete AND radius' THEN 0.15 WHEN vm.chunk_text @@ 'non-compete' THEN 0.08 ELSE 0.0 END AS keyword_score FROM vector_matches vm)SELECT document_id, chunk_text, section_type, page_start, page_end, similarity + keyword_score AS combined_scoreFROM keyword_boostWHERE similarity > 0.72 -- relevance thresholdORDER BY combined_score DESCLIMIT 5;RAG Pipeline (TypeScript)
import { Client } from "pg";import OpenAI from "openai";
interface RetrievedChunk { documentId: number; chunkText: string; sectionType: string; pageStart: number; pageEnd: number; score: number;}
interface AnswerResult { answer: string; citations: { documentId: number; page: number; text: string }[]; latencyMs: number;}
class ContractAnalyzer { private db: Client; private openai: OpenAI;
constructor(heliosDsn: string, openaiKey: string) { this.db = new Client({ connectionString: heliosDsn }); this.openai = new OpenAI({ apiKey: openaiKey }); }
async connect(): Promise<void> { await this.db.connect(); }
/** * Full RAG pipeline: embed question -> hybrid search -> LLM answer. * * Previous architecture (4 network round-trips): * 1. App -> OpenAI Embeddings API (120ms) * 2. App -> Pinecone search (180ms + cold start) * 3. App -> S3 fetch source text (95ms) * 4. App -> PostgreSQL metadata (45ms) * Total: ~440ms before LLM call, plus 2% stale data risk * * HeliosDB architecture (1 database query): * 1. App -> OpenAI Embeddings API (120ms) * 2. App -> HeliosDB hybrid search (28ms) -- vectors + text + metadata * Total: ~148ms before LLM call, 0% stale data risk */ async analyzeContract( tenantId: number, question: string, docTypes?: string[], dateAfter?: string ): Promise<AnswerResult> { const start = Date.now();
// Step 1: Embed the question const embeddingResp = await this.openai.embeddings.create({ model: "text-embedding-3-small", input: question, }); const queryVec = embeddingResp.data[0].embedding;
// Step 2: Single hybrid query -- vectors + FTS + metadata filters const chunks = await this.hybridSearch( tenantId, queryVec, question, docTypes, dateAfter );
// Step 3: Generate answer with citations const context = chunks .map( (c, i) => `[Source ${i + 1} | Doc ${c.documentId}, Page ${c.pageStart}]\n${c.chunkText}` ) .join("\n\n---\n\n");
const completion = await this.openai.chat.completions.create({ model: "gpt-4o", messages: [ { role: "system", content: `You are a legal contract analyst. Answer questions based ONLY onthe provided contract excerpts. Always cite your sources using [Source N] notation.If the answer cannot be determined from the provided text, say so explicitly.`, }, { role: "user", content: `Contract excerpts:\n\n${context}\n\nQuestion: ${question}`, }, ], temperature: 0.1, });
const answer = completion.choices[0].message.content ?? ""; const citations = this.extractCitations(answer, chunks);
return { answer, citations, latencyMs: Date.now() - start, }; }
private async hybridSearch( tenantId: number, queryVec: number[], questionText: string, docTypes?: string[], dateAfter?: string ): Promise<RetrievedChunk[]> { // Extract key terms for full-text boost const keywords = this.extractKeyTerms(questionText); const ftsQuery = keywords.join(" AND "); const vecString = `[${queryVec.join(",")}]`;
const result = await this.db.query( ` WITH vector_matches AS ( SELECT dc.id, dc.document_id, dc.chunk_text, dc.section_type, dc.page_start, dc.page_end, 1 - (dc.embedding <=> $1::vector) AS similarity FROM document_chunks dc JOIN documents d ON dc.document_id = d.id WHERE d.tenant_id = $2 AND ($3::text[] IS NULL OR d.doc_type = ANY($3)) AND ($4::date IS NULL OR d.effective_date >= $4::date) ORDER BY dc.embedding <=> $1::vector LIMIT 20 ) SELECT document_id, chunk_text, section_type, page_start, page_end, similarity + CASE WHEN chunk_text @@ $5 THEN 0.12 ELSE 0.0 END AS combined_score FROM vector_matches WHERE similarity > 0.70 ORDER BY combined_score DESC LIMIT 5 `, [vecString, tenantId, docTypes ?? null, dateAfter ?? null, ftsQuery] );
return result.rows.map((row) => ({ documentId: row.document_id, chunkText: row.chunk_text, sectionType: row.section_type, pageStart: row.page_start, pageEnd: row.page_end, score: parseFloat(row.combined_score), })); }
private extractKeyTerms(question: string): string[] { const stopWords = new Set([ "the", "a", "an", "is", "are", "does", "do", "this", "that", "in", "on", "for", "with", "any", "how", "what", ]); return question .toLowerCase() .replace(/[^\w\s-]/g, "") .split(/\s+/) .filter((w) => w.length > 2 && !stopWords.has(w)); }
private extractCitations( answer: string, chunks: RetrievedChunk[] ): { documentId: number; page: number; text: string }[] { const cited: { documentId: number; page: number; text: string }[] = []; const sourceRefs = answer.matchAll(/\[Source (\d+)\]/g);
for (const match of sourceRefs) { const idx = parseInt(match[1]) - 1; if (idx >= 0 && idx < chunks.length) { const chunk = chunks[idx]; cited.push({ documentId: chunk.documentId, page: chunk.pageStart, text: chunk.chunkText.substring(0, 200) + "...", }); } } return cited; }}
// Usageconst analyzer = new ContractAnalyzer( "postgresql://app:password@localhost:5433/legalai", process.env.OPENAI_API_KEY!);
await analyzer.connect();
const result = await analyzer.analyzeContract( 42, "Does this agreement contain a non-compete clause wider than 50 miles?", ["employment", "nda"], "2024-01-01");
console.log(`Answer: ${result.answer}`);console.log(`Latency: ${result.latencyMs}ms`);console.log(`Citations: ${result.citations.length} sources`);Comparing Document Versions with Time-Travel
A capability that was impossible with the previous architecture:
-- Attorney wants to see what changed between contract versions-- Old version accessible via time-travel, current version is live
-- Get the relevant chunks from the previous versionWITH old_chunks AS ( SELECT chunk_index, chunk_text, section_type FROM document_chunks AS OF TIMESTAMP '2025-03-01T00:00:00Z' WHERE document_id = 1847),current_chunks AS ( SELECT chunk_index, chunk_text, section_type FROM document_chunks WHERE document_id = 1847)SELECT COALESCE(o.chunk_index, c.chunk_index) AS chunk, o.section_type AS old_section, c.section_type AS new_section, CASE WHEN o.chunk_text IS NULL THEN 'ADDED' WHEN c.chunk_text IS NULL THEN 'REMOVED' WHEN o.chunk_text != c.chunk_text THEN 'MODIFIED' ELSE 'UNCHANGED' END AS change_typeFROM old_chunks oFULL OUTER JOIN current_chunks c ON o.chunk_index = c.chunk_indexWHERE o.chunk_text IS DISTINCT FROM c.chunk_textORDER BY chunk;Results
| Metric | Before (PG + Pinecone + S3) | After (HeliosDB-Lite) | Improvement |
|---|---|---|---|
| Retrieval latency (P50) | 440ms | 118ms | 73% reduction |
| Retrieval latency (P95) | 1,200ms (Pinecone cold start) | 195ms | 84% reduction |
| End-to-end query time (P50) | 2.8s | 1.9s | 32% reduction |
| Stale embedding rate | ~2% of updates | 0% (ACID) | Eliminated |
| Monthly Pinecone cost | $2,100 | $0 | Eliminated |
| Monthly S3 cost | $340 | $0 | Eliminated |
| Monthly RDS cost | $580 | $0 | Eliminated |
| Total monthly data layer cost | $3,020 | $180 (compute only) | 94% reduction |
| Services to maintain | 4 (PG, Pinecone, S3, Lambda sync) | 1 (HeliosDB-Lite) | 75% fewer |
| Document update consistency | Eventual (3 systems) | Immediate (ACID) | Guaranteed |
| Hybrid search capability | Not possible (separate systems) | Native (single query) | New capability |
| Version comparison | Not possible | Time-travel queries | New capability |
Key Takeaways
-
Unified vector + relational + full-text search eliminates the “RAG tax.” The standard RAG architecture (separate vector DB + document store + relational DB) introduces latency from cross-service calls and consistency gaps from distributed writes. HeliosDB-Lite collapses these into a single query with a single transaction boundary. The 73% latency reduction comes almost entirely from eliminating network round-trips.
-
ACID transactions on document + embedding writes eliminated stale data. The 2% stale embedding rate was a serious liability for a legal product. With HeliosDB-Lite, document text and its vector embeddings are updated in a single transaction — either both succeed or neither does. This guarantee is impossible with separate Pinecone and S3 stores.
-
Hybrid search (vector + FTS + SQL filters) in one query improved relevance. Being able to boost vector similarity scores with keyword matches and filter by metadata (document type, date ranges, parties) in the same query produced measurably better retrieval quality than vector-only search, which improved the final LLM answer quality.
-
Time-travel queries enabled contract version comparison. Attorneys can now ask “What changed in the non-compete clause between the March and June versions?” — a feature that was on the roadmap but deemed infeasible with the previous architecture because historical document states were not preserved in Pinecone.
-
The cost savings funded an additional ML engineer. Eliminating $2,800/month in Pinecone + S3 costs and reducing operational complexity freed up budget and engineering time that was redirected to improving the AI model’s contract understanding capabilities.
Technical Stack
| Component | Technology |
|---|---|
| Database | HeliosDB-Lite (embedded, single binary) |
| Vector Index | HNSW (m=16, ef_construction=200), cosine similarity |
| Embedding Model | OpenAI text-embedding-3-small (1536 dimensions) |
| LLM | GPT-4o (question answering) |
| Client Protocol | PostgreSQL wire protocol (SCRAM-SHA-256 auth) |
| Application | TypeScript (Next.js API routes), Python (ingestion pipeline) |
| Driver | pg (Node.js), psycopg2 (Python) |
| Deployment | AWS ECS Fargate, HeliosDB-Lite embedded in application container |
| Chunking Strategy | 512-token chunks with 64-token overlap, section-aware boundaries |
| Previous Stack (replaced) | PostgreSQL 15 (RDS), Pinecone (Serverless), S3, Lambda (sync pipeline) |