Legal Tech Startup: 73% Latency Reduction by Unifying Document Storage, Vector Search, and RAG

Industry: Legal Technology / Artificial Intelligence Challenge: Building a production-grade AI contract analysis platform without the latency and consistency gaps of a multi-service architecture Solution: HeliosDB-Lite (embedded, single-binary deployment) Results: 73% end-to-end latency reduction, $2K/month Pinecone costs eliminated, ACID guarantees across document and embedding operations

The Challenge

A 12-person legal tech startup was building an AI-powered contract analysis platform used by mid-size law firms. The core product allows attorneys to upload contracts, ask natural language questions (“Does this agreement contain a non-compete clause wider than 50 miles?”), and receive answers grounded in the actual document text with precise citations. The system needed to handle documents ranging from 2-page NDAs to 400-page M&A agreements, with a corpus of over 180,000 documents across all customers.

Their initial architecture followed the standard RAG playbook: documents were stored in S3, metadata and relational data lived in PostgreSQL on RDS, embeddings were indexed in Pinecone, and the retrieval pipeline called OpenAI’s embedding API before querying Pinecone, then fetched the source documents from S3, assembled the context, and called GPT-4 for the final answer. Each question required four network round-trips across three external services before the LLM could even begin generating a response.

The architecture had a deeper problem beyond latency. When a user uploaded a new version of a contract, the system had to update PostgreSQL, re-embed the chunks and upsert to Pinecone, and replace the file in S3 — three separate operations with no transactional guarantee. In production, they discovered that roughly 2% of document updates resulted in stale embeddings in Pinecone pointing to old document versions in S3, causing the AI to cite text that no longer existed in the current contract. For a product used by attorneys making legal decisions, this was unacceptable.

Why PostgreSQL + Pinecone + S3 Wasn’t Enough

Problem	Impact
4 network round-trips per query	820ms average before LLM generation even starts
No cross-service transactions	2% of document updates leave stale embeddings citing deleted text
Pinecone costs scaling with corpus	$2,100/month for 180K documents, projected $8K/month at 500K
S3 eventual consistency	Occasionally served old document version after update
Schema split	Document metadata in PostgreSQL, content in S3, embeddings in Pinecone — no single source of truth
Cold start latency	Pinecone serverless indexes had 200-400ms cold start on infrequent queries
Embedding sync pipeline	Custom Lambda function to keep Pinecone in sync with S3 uploads — frequent failures
No hybrid search	Could not combine vector similarity with full-text keyword search in a single query

The HeliosDB Solution

The team consolidated their entire data layer into HeliosDB-Lite, using its combined relational, vector, and full-text search capabilities within a single ACID-compliant database.

Schema Design

-- Documents table with metadata
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    tenant_id INT NOT NULL,
    title TEXT NOT NULL,
    doc_type VARCHAR(50) NOT NULL,         -- 'nda', 'msa', 'employment', 'merger'
    parties JSONB,                          -- structured party information
    effective_date DATE,
    expiration_date DATE,
    full_text TEXT NOT NULL,                -- complete document text for FTS
    uploaded_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Chunked embeddings table -- vector and text in the same database
CREATE TABLE document_chunks (
    id BIGSERIAL PRIMARY KEY,
    document_id BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INT NOT NULL,
    chunk_text TEXT NOT NULL,               -- the actual text of this chunk
    embedding VECTOR(1536) NOT NULL,        -- OpenAI text-embedding-3-small
    section_type VARCHAR(50),               -- 'preamble', 'definitions', 'obligations', 'termination'
    page_start INT,
    page_end INT,
    UNIQUE(document_id, chunk_index)
);

-- HNSW index for fast approximate nearest neighbor search
CREATE INDEX idx_chunks_embedding ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

-- Full-text search on the complete document
CREATE INDEX idx_documents_fulltext ON documents USING gin (full_text);

-- Compound index for tenant isolation
CREATE INDEX idx_chunks_tenant ON document_chunks (document_id)
    WHERE document_id IN (SELECT id FROM documents);

Document Ingestion with ACID Guarantees

The critical improvement: document content, metadata, and embeddings are written in a single transaction.

import psycopg2
import openai
from typing import Optional

class DocumentIngester:
    """
    Ingests documents into HeliosDB-Lite with transactional consistency.
    Document text, metadata, and vector embeddings are committed atomically.
    """

    def __init__(self, helios_dsn: str, openai_key: str):
        self.conn = psycopg2.connect(helios_dsn)
        self.openai = openai.OpenAI(api_key=openai_key)
        self.CHUNK_SIZE = 512  # tokens per chunk
        self.CHUNK_OVERLAP = 64

    def ingest_document(
        self,
        tenant_id: int,
        title: str,
        doc_type: str,
        full_text: str,
        parties: Optional[dict] = None
    ) -> int:
        """
        Ingest a document with all its embeddings in a single ACID transaction.

        Previously this required:
          1. Upload to S3
          2. INSERT into PostgreSQL
          3. Embed chunks via OpenAI
          4. Upsert to Pinecone
        Any failure between steps left the system in an inconsistent state.

        Now it's a single transaction -- all or nothing.
        """
        chunks = self._chunk_text(full_text)
        embeddings = self._embed_chunks(chunks)

        with self.conn:
            with self.conn.cursor() as cur:
                # Insert document and chunks in one transaction
                cur.execute("""
                    INSERT INTO documents
                        (tenant_id, title, doc_type, parties, full_text)
                    VALUES (%s, %s, %s, %s::jsonb, %s)
                    RETURNING id
                """, (tenant_id, title, doc_type,
                      psycopg2.extras.Json(parties), full_text))

                doc_id = cur.fetchone()[0]

                # Insert all chunks with embeddings -- same transaction
                for i, (chunk_text, embedding) in enumerate(
                    zip(chunks, embeddings)
                ):
                    section = self._classify_section(chunk_text)
                    cur.execute("""
                        INSERT INTO document_chunks
                            (document_id, chunk_index, chunk_text,
                             embedding, section_type)
                        VALUES (%s, %s, %s, %s::vector, %s)
                    """, (doc_id, i, chunk_text, embedding, section))

                # Transaction commits here -- atomic.
                # No stale embeddings, no orphaned chunks, no S3 sync issues.

        return doc_id

    def update_document(self, doc_id: int, new_text: str) -> None:
        """
        Atomically replace document text and all embeddings.
        The old version remains accessible via time-travel queries.
        """
        chunks = self._chunk_text(new_text)
        embeddings = self._embed_chunks(chunks)

        with self.conn:
            with self.conn.cursor() as cur:
                # Update document text
                cur.execute("""
                    UPDATE documents
                    SET full_text = %s, updated_at = NOW()
                    WHERE id = %s
                """, (new_text, doc_id))

                # Delete old chunks and insert new ones -- same transaction
                cur.execute(
                    "DELETE FROM document_chunks WHERE document_id = %s",
                    (doc_id,)
                )

                for i, (chunk_text, embedding) in enumerate(
                    zip(chunks, embeddings)
                ):
                    section = self._classify_section(chunk_text)
                    cur.execute("""
                        INSERT INTO document_chunks
                            (document_id, chunk_index, chunk_text,
                             embedding, section_type)
                        VALUES (%s, %s, %s, %s::vector, %s)
                    """, (doc_id, i, chunk_text, embedding, section))

                # Atomic commit: text + embeddings update together.
                # Old version still accessible via:
                #   SELECT * FROM documents AS OF TIMESTAMP '...'

    def _embed_chunks(self, chunks: list[str]) -> list[list[float]]:
        response = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=chunks
        )
        return [item.embedding for item in response.data]

    def _chunk_text(self, text: str) -> list[str]:
        # Simplified chunking -- production uses tiktoken with overlap
        words = text.split()
        chunks = []
        for i in range(0, len(words), self.CHUNK_SIZE - self.CHUNK_OVERLAP):
            chunk = " ".join(words[i:i + self.CHUNK_SIZE])
            if chunk:
                chunks.append(chunk)
        return chunks

    def _classify_section(self, text: str) -> str:
        lower = text.lower()
        if "whereas" in lower or "recital" in lower:
            return "preamble"
        if "shall mean" in lower or "defined as" in lower:
            return "definitions"
        if "shall not" in lower or "obligat" in lower:
            return "obligations"
        if "terminat" in lower or "expir" in lower:
            return "termination"
        return "general"

Hybrid Search: Vector + Full-Text + SQL in One Query

The most impactful capability — combining three search modalities in a single query that previously required orchestrating three services:

-- RAG retrieval: find the most relevant chunks for a user question
-- Combines vector similarity with keyword matching and metadata filtering
-- This single query replaces: Pinecone search + S3 fetch + PostgreSQL metadata lookup

WITH question_embedding AS (
    -- In practice, the application computes this and passes it as a parameter
    SELECT '[0.023, -0.041, 0.067, ...]'::vector(1536) AS vec
),
vector_matches AS (
    -- Vector similarity search via HNSW index
    SELECT
        dc.id,
        dc.document_id,
        dc.chunk_text,
        dc.section_type,
        dc.page_start,
        dc.page_end,
        1 - (dc.embedding <=> (SELECT vec FROM question_embedding)) AS similarity
    FROM document_chunks dc
    JOIN documents d ON dc.document_id = d.id
    WHERE d.tenant_id = 42                          -- tenant isolation
      AND d.doc_type IN ('nda', 'employment')       -- metadata filter
      AND d.effective_date >= '2024-01-01'           -- date range filter
    ORDER BY dc.embedding <=> (SELECT vec FROM question_embedding)
    LIMIT 20
),
keyword_boost AS (
    -- Boost chunks that also contain exact keyword matches
    SELECT
        vm.*,
        CASE
            WHEN vm.chunk_text @@ 'non-compete AND radius' THEN 0.15
            WHEN vm.chunk_text @@ 'non-compete' THEN 0.08
            ELSE 0.0
        END AS keyword_score
    FROM vector_matches vm
)
SELECT
    document_id,
    chunk_text,
    section_type,
    page_start,
    page_end,
    similarity + keyword_score AS combined_score
FROM keyword_boost
WHERE similarity > 0.72                             -- relevance threshold
ORDER BY combined_score DESC
LIMIT 5;

RAG Pipeline (TypeScript)

import { Client } from "pg";
import OpenAI from "openai";

interface RetrievedChunk {
  documentId: number;
  chunkText: string;
  sectionType: string;
  pageStart: number;
  pageEnd: number;
  score: number;
}

interface AnswerResult {
  answer: string;
  citations: { documentId: number; page: number; text: string }[];
  latencyMs: number;
}

class ContractAnalyzer {
  private db: Client;
  private openai: OpenAI;

  constructor(heliosDsn: string, openaiKey: string) {
    this.db = new Client({ connectionString: heliosDsn });
    this.openai = new OpenAI({ apiKey: openaiKey });
  }

  async connect(): Promise<void> {
    await this.db.connect();
  }

  /**
   * Full RAG pipeline: embed question -> hybrid search -> LLM answer.
   *
   * Previous architecture (4 network round-trips):
   *   1. App -> OpenAI Embeddings API (120ms)
   *   2. App -> Pinecone search (180ms + cold start)
   *   3. App -> S3 fetch source text (95ms)
   *   4. App -> PostgreSQL metadata (45ms)
   *   Total: ~440ms before LLM call, plus 2% stale data risk
   *
   * HeliosDB architecture (1 database query):
   *   1. App -> OpenAI Embeddings API (120ms)
   *   2. App -> HeliosDB hybrid search (28ms) -- vectors + text + metadata
   *   Total: ~148ms before LLM call, 0% stale data risk
   */
  async analyzeContract(
    tenantId: number,
    question: string,
    docTypes?: string[],
    dateAfter?: string
  ): Promise<AnswerResult> {
    const start = Date.now();

    // Step 1: Embed the question
    const embeddingResp = await this.openai.embeddings.create({
      model: "text-embedding-3-small",
      input: question,
    });
    const queryVec = embeddingResp.data[0].embedding;

    // Step 2: Single hybrid query -- vectors + FTS + metadata filters
    const chunks = await this.hybridSearch(
      tenantId,
      queryVec,
      question,
      docTypes,
      dateAfter
    );

    // Step 3: Generate answer with citations
    const context = chunks
      .map(
        (c, i) =>
          `[Source ${i + 1} | Doc ${c.documentId}, Page ${c.pageStart}]\n${c.chunkText}`
      )
      .join("\n\n---\n\n");

    const completion = await this.openai.chat.completions.create({
      model: "gpt-4o",
      messages: [
        {
          role: "system",
          content: `You are a legal contract analyst. Answer questions based ONLY on
the provided contract excerpts. Always cite your sources using [Source N] notation.
If the answer cannot be determined from the provided text, say so explicitly.`,
        },
        {
          role: "user",
          content: `Contract excerpts:\n\n${context}\n\nQuestion: ${question}`,
        },
      ],
      temperature: 0.1,
    });

    const answer = completion.choices[0].message.content ?? "";
    const citations = this.extractCitations(answer, chunks);

    return {
      answer,
      citations,
      latencyMs: Date.now() - start,
    };
  }

  private async hybridSearch(
    tenantId: number,
    queryVec: number[],
    questionText: string,
    docTypes?: string[],
    dateAfter?: string
  ): Promise<RetrievedChunk[]> {
    // Extract key terms for full-text boost
    const keywords = this.extractKeyTerms(questionText);
    const ftsQuery = keywords.join(" AND ");
    const vecString = `[${queryVec.join(",")}]`;

    const result = await this.db.query(
      `
      WITH vector_matches AS (
          SELECT
              dc.id,
              dc.document_id,
              dc.chunk_text,
              dc.section_type,
              dc.page_start,
              dc.page_end,
              1 - (dc.embedding <=> $1::vector) AS similarity
          FROM document_chunks dc
          JOIN documents d ON dc.document_id = d.id
          WHERE d.tenant_id = $2
            AND ($3::text[] IS NULL OR d.doc_type = ANY($3))
            AND ($4::date IS NULL OR d.effective_date >= $4::date)
          ORDER BY dc.embedding <=> $1::vector
          LIMIT 20
      )
      SELECT
          document_id,
          chunk_text,
          section_type,
          page_start,
          page_end,
          similarity + CASE
              WHEN chunk_text @@ $5 THEN 0.12
              ELSE 0.0
          END AS combined_score
      FROM vector_matches
      WHERE similarity > 0.70
      ORDER BY combined_score DESC
      LIMIT 5
      `,
      [vecString, tenantId, docTypes ?? null, dateAfter ?? null, ftsQuery]
    );

    return result.rows.map((row) => ({
      documentId: row.document_id,
      chunkText: row.chunk_text,
      sectionType: row.section_type,
      pageStart: row.page_start,
      pageEnd: row.page_end,
      score: parseFloat(row.combined_score),
    }));
  }

  private extractKeyTerms(question: string): string[] {
    const stopWords = new Set([
      "the", "a", "an", "is", "are", "does", "do", "this",
      "that", "in", "on", "for", "with", "any", "how", "what",
    ]);
    return question
      .toLowerCase()
      .replace(/[^\w\s-]/g, "")
      .split(/\s+/)
      .filter((w) => w.length > 2 && !stopWords.has(w));
  }

  private extractCitations(
    answer: string,
    chunks: RetrievedChunk[]
  ): { documentId: number; page: number; text: string }[] {
    const cited: { documentId: number; page: number; text: string }[] = [];
    const sourceRefs = answer.matchAll(/\[Source (\d+)\]/g);

    for (const match of sourceRefs) {
      const idx = parseInt(match[1]) - 1;
      if (idx >= 0 && idx < chunks.length) {
        const chunk = chunks[idx];
        cited.push({
          documentId: chunk.documentId,
          page: chunk.pageStart,
          text: chunk.chunkText.substring(0, 200) + "...",
        });
      }
    }
    return cited;
  }
}

// Usage
const analyzer = new ContractAnalyzer(
  "postgresql://app:password@localhost:5433/legalai",
  process.env.OPENAI_API_KEY!
);

await analyzer.connect();

const result = await analyzer.analyzeContract(
  42,
  "Does this agreement contain a non-compete clause wider than 50 miles?",
  ["employment", "nda"],
  "2024-01-01"
);

console.log(`Answer: ${result.answer}`);
console.log(`Latency: ${result.latencyMs}ms`);
console.log(`Citations: ${result.citations.length} sources`);

Comparing Document Versions with Time-Travel

A capability that was impossible with the previous architecture:

-- Attorney wants to see what changed between contract versions
-- Old version accessible via time-travel, current version is live

-- Get the relevant chunks from the previous version
WITH old_chunks AS (
    SELECT chunk_index, chunk_text, section_type
    FROM document_chunks
    AS OF TIMESTAMP '2025-03-01T00:00:00Z'
    WHERE document_id = 1847
),
current_chunks AS (
    SELECT chunk_index, chunk_text, section_type
    FROM document_chunks
    WHERE document_id = 1847
)
SELECT
    COALESCE(o.chunk_index, c.chunk_index) AS chunk,
    o.section_type AS old_section,
    c.section_type AS new_section,
    CASE
        WHEN o.chunk_text IS NULL THEN 'ADDED'
        WHEN c.chunk_text IS NULL THEN 'REMOVED'
        WHEN o.chunk_text != c.chunk_text THEN 'MODIFIED'
        ELSE 'UNCHANGED'
    END AS change_type
FROM old_chunks o
FULL OUTER JOIN current_chunks c ON o.chunk_index = c.chunk_index
WHERE o.chunk_text IS DISTINCT FROM c.chunk_text
ORDER BY chunk;

Results

Metric	Before (PG + Pinecone + S3)	After (HeliosDB-Lite)	Improvement
Retrieval latency (P50)	440ms	118ms	73% reduction
Retrieval latency (P95)	1,200ms (Pinecone cold start)	195ms	84% reduction
End-to-end query time (P50)	2.8s	1.9s	32% reduction
Stale embedding rate	~2% of updates	0% (ACID)	Eliminated
Monthly Pinecone cost	$2,100	$0	Eliminated
Monthly S3 cost	$340	$0	Eliminated
Monthly RDS cost	$580	$0	Eliminated
Total monthly data layer cost	$3,020	$180 (compute only)	94% reduction
Services to maintain	4 (PG, Pinecone, S3, Lambda sync)	1 (HeliosDB-Lite)	75% fewer
Document update consistency	Eventual (3 systems)	Immediate (ACID)	Guaranteed
Hybrid search capability	Not possible (separate systems)	Native (single query)	New capability
Version comparison	Not possible	Time-travel queries	New capability

Key Takeaways

Unified vector + relational + full-text search eliminates the “RAG tax.” The standard RAG architecture (separate vector DB + document store + relational DB) introduces latency from cross-service calls and consistency gaps from distributed writes. HeliosDB-Lite collapses these into a single query with a single transaction boundary. The 73% latency reduction comes almost entirely from eliminating network round-trips.
ACID transactions on document + embedding writes eliminated stale data. The 2% stale embedding rate was a serious liability for a legal product. With HeliosDB-Lite, document text and its vector embeddings are updated in a single transaction — either both succeed or neither does. This guarantee is impossible with separate Pinecone and S3 stores.
Hybrid search (vector + FTS + SQL filters) in one query improved relevance. Being able to boost vector similarity scores with keyword matches and filter by metadata (document type, date ranges, parties) in the same query produced measurably better retrieval quality than vector-only search, which improved the final LLM answer quality.
Time-travel queries enabled contract version comparison. Attorneys can now ask “What changed in the non-compete clause between the March and June versions?” — a feature that was on the roadmap but deemed infeasible with the previous architecture because historical document states were not preserved in Pinecone.
The cost savings funded an additional ML engineer. Eliminating $2,800/month in Pinecone + S3 costs and reducing operational complexity freed up budget and engineering time that was redirected to improving the AI model’s contract understanding capabilities.

Technical Stack

Component	Technology
Database	HeliosDB-Lite (embedded, single binary)
Vector Index	HNSW (m=16, ef_construction=200), cosine similarity
Embedding Model	OpenAI text-embedding-3-small (1536 dimensions)
LLM	GPT-4o (question answering)
Client Protocol	PostgreSQL wire protocol (SCRAM-SHA-256 auth)
Application	TypeScript (Next.js API routes), Python (ingestion pipeline)
Driver	pg (Node.js), psycopg2 (Python)
Deployment	AWS ECS Fargate, HeliosDB-Lite embedded in application container
Chunking Strategy	512-token chunks with 64-token overlap, section-aware boundaries
Previous Stack (replaced)	PostgreSQL 15 (RDS), Pinecone (Serverless), S3, Lambda (sync pipeline)