High-Speed DISTINCT Deduplication: Business Use Case for HeliosDB-Lite

Document ID: 52_DISTINCT_DEDUPLICATION.md
Version: 1.0
Created: 2026-02-12
Category: Query Performance & Analytics
HeliosDB-Lite Version: 3.5.8+

Executive Summary

SQL DISTINCT queries are foundational to business analytics — extracting unique values for dashboards, dropdowns, report dimensions, and data quality audits. In production workloads, DISTINCT is one of the most frequently executed operations, running thousands of times per minute for real-time filtering UIs, autocomplete suggestions, and ETL pipelines.

HeliosDB-Lite delivers roughly 70x faster DISTINCT performance than PostgreSQL 16 in the benchmark reported in this document, completing SELECT DISTINCT region FROM customers in ~3 microseconds versus PostgreSQL’s ~211 microseconds on a 200-row table with 5 distinct regions. The advantage comes from the architecture: embedded execution eliminates network overhead, and in-memory hash-based deduplication operates entirely within the application process.

Key Business Value:

70x faster DISTINCT queries: 3μs vs 211μs on the 200-row benchmark table
Zero network latency: embedded execution, no client-server round-trip
Real-time analytics: microsecond-scale response enables interactive dashboards
Reduced infrastructure: no separate database server for analytics workloads

Problem Being Solved

The Business Need for DISTINCT

Every data-driven application relies on DISTINCT queries:

Use Case	Example Query	Business Impact
Filter dropdowns	`SELECT DISTINCT region FROM customers`	UI responsiveness for end users
Dashboard dimensions	`SELECT DISTINCT product_category FROM orders`	Real-time analytics drill-down
Data quality audit	`SELECT DISTINCT status FROM orders`	Detect invalid/unexpected values
ETL deduplication	`SELECT DISTINCT email FROM raw_leads`	Clean data before processing
Autocomplete	`SELECT DISTINCT city FROM addresses WHERE city LIKE 'San%'`	Search-as-you-type UX
Reporting dimensions	`SELECT DISTINCT department, role FROM employees`	Report parameter selection

Why Speed Matters

When a user opens a dashboard filter dropdown, the application runs SELECT DISTINCT to populate the options. At 211μs per query (PostgreSQL), serving 1,000 concurrent dashboard users requires:

211ms of cumulative query time per refresh cycle
Network round-trip overhead adds a further 0.5-2ms per query when the database runs on a separate host
Connection pool contention at scale degrades P99 latency

At 3μs per query (HeliosDB-Lite):

3ms of cumulative query time — 70x reduction
Zero network overhead — embedded in the application process
No connection pool — direct function call
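
The cumulative figures above are simple multiplications of per-query latency by query count. A minimal sketch of that arithmetic follows, assuming one DISTINCT query per user per refresh cycle; the function name `cumulative_time` is illustrative and not part of any HeliosDB-Lite API.

```rust
use std::time::Duration;

/// Summed execution time if `queries` queries each take `per_query`.
/// This is total work, not wall-clock time: concurrent queries overlap.
fn cumulative_time(per_query: Duration, queries: u32) -> Duration {
    per_query * queries
}

fn main() {
    let users = 1_000; // assumption: one query per user per refresh cycle

    let postgres = cumulative_time(Duration::from_micros(211), users);
    let helios = cumulative_time(Duration::from_micros(3), users);

    println!("PostgreSQL 16: {:?}", postgres);
    println!("HeliosDB-Lite: {:?}", helios);
    println!(
        "ratio: {:.1}x",
        postgres.as_secs_f64() / helios.as_secs_f64()
    );
}
```

Note that these totals describe aggregate query work per refresh cycle. The latency an individual user sees is still the per-query figure plus any network and queuing delay.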

Benchmark Results

Test Setup

Dataset: 200 customers with 5 distinct regions (“East”, “West”, “North”, “South”, “Central”)
Query: SELECT DISTINCT region FROM customers
HeliosDB-Lite: v3.5.8, embedded in-memory, NativeBackend
PostgreSQL: v16, localhost connection, shared_buffers tuned
Method: 1,000 iterations, median timing
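
For readers who want to reproduce the method, the following is a hedged sketch of a timing harness that runs an operation for a fixed number of iterations and reports median and P99. It is not the harness used for the published numbers; `time_iterations` and `percentile` are hypothetical helpers, the percentile uses the nearest-rank definition, and the workload in `main` is a placeholder to be replaced with the actual query call.

```rust
use std::time::{Duration, Instant};

/// Runs `op` `iterations` times and returns per-call timings, sorted ascending.
fn time_iterations<F: FnMut()>(iterations: usize, mut op: F) -> Vec<Duration> {
    let mut samples = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        op();
        samples.push(start.elapsed());
    }
    samples.sort();
    samples
}

/// Nearest-rank percentile of an ascending-sorted, non-empty slice.
/// `p` is in the range 0.0..=100.0.
fn percentile(sorted: &[Duration], p: f64) -> Duration {
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

fn main() {
    // Placeholder workload: swap in the DISTINCT query call being measured.
    let data: Vec<u64> = (0..200).collect();
    let samples = time_iterations(1_000, || {
        // black_box keeps the compiler from optimizing the work away.
        std::hint::black_box(data.iter().sum::<u64>());
    });

    println!("median: {:?}", percentile(&samples, 50.0));
    println!("p99:    {:?}", percentile(&samples, 99.0));
}
```

At microsecond scale, timer resolution and warm-up effects matter, so a few warm-up iterations before sampling are usually worth adding.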

Results

Metric	HeliosDB-Lite	PostgreSQL 16	Advantage
Median latency	~3 μs	~211 μs	70x faster
P99 latency	~8 μs	~450 μs	56x faster
Throughput	333,000 queries/sec	4,700 queries/sec	70x higher
Network overhead	0 μs	50-200 μs	Eliminated
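
The throughput and advantage columns appear to follow directly from the latency figures: throughput as the reciprocal of median latency for a single sequential caller, and advantage as the ratio of the two latencies. That reading is an inference from the numbers, not a statement of how they were measured. A small sketch of the derivation, with illustrative function names:

```rust
/// Queries per second for one sequential caller at a fixed per-query latency.
fn throughput_per_sec(latency_micros: f64) -> f64 {
    1_000_000.0 / latency_micros
}

/// How many times faster `fast` is than `slow` (both in the same unit).
fn speedup(slow: f64, fast: f64) -> f64 {
    slow / fast
}

fn main() {
    println!("HeliosDB-Lite: {:.0} queries/sec", throughput_per_sec(3.0));
    println!("PostgreSQL 16: {:.0} queries/sec", throughput_per_sec(211.0));
    println!("median speedup: {:.2}x", speedup(211.0, 3.0));
    println!("P99 speedup:    {:.2}x", speedup(450.0, 8.0));
}
```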

Why HeliosDB-Lite Is Faster

Embedded execution: No TCP/IP stack, no serialization/deserialization, no connection management. The query executes as a function call within the application process.
In-memory hash deduplication: The Volcano-model executor streams tuples through a HashSet, emitting each unique value exactly once. With 200 rows and 5 distinct values, the entire operation fits in L1 cache.
Low query planning overhead: The optimized query plan is generated once, and the Volcano iterator model avoids materializing intermediate results.
No cross-process coordination: PostgreSQL clients talk to a separate server process over a socket, and its backend processes coordinate through shared memory. HeliosDB-Lite has no process boundary.
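
The hash-based deduplication described above can be illustrated with a pull-based operator that wraps a child iterator and tracks seen values in a HashSet. This is a minimal sketch of the general technique using only the Rust standard library; the `Distinct` type is hypothetical and is not HeliosDB-Lite's internal executor.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Pull-based DISTINCT operator: yields each value from `child`
/// the first time it is seen and skips later duplicates.
struct Distinct<I: Iterator> {
    child: I,
    seen: HashSet<I::Item>,
}

impl<I> Distinct<I>
where
    I: Iterator,
    I::Item: Eq + Hash + Clone,
{
    fn new(child: I) -> Self {
        Distinct { child, seen: HashSet::new() }
    }
}

impl<I> Iterator for Distinct<I>
where
    I: Iterator,
    I::Item: Eq + Hash + Clone,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // Keep pulling from the child until an unseen value appears.
        while let Some(value) = self.child.next() {
            // `insert` returns true only if the value was not already present.
            if self.seen.insert(value.clone()) {
                return Some(value);
            }
        }
        None
    }
}

fn main() {
    let regions = ["East", "West", "North", "South", "Central"];
    // 200 rows cycling through 5 regions, mirroring the benchmark's shape.
    let rows = (0..200).map(|i| regions[i % regions.len()]);
    let unique: Vec<&str> = Distinct::new(rows).collect();
    println!("{:?}", unique);
}
```

Because the operator emits values as it encounters them, downstream consumers can start work before the scan finishes, and memory use is bounded by the number of distinct values rather than the number of rows.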

Concrete Example

Scenario: E-Commerce Analytics Dashboard

An e-commerce platform serves 500 internal users who access real-time dashboards. Each dashboard load populates 6 filter dropdowns via DISTINCT queries:

-- Dashboard filter queries (executed on every page load)
SELECT DISTINCT region FROM customers;           -- Geographic filter
SELECT DISTINCT product_category FROM orders;    -- Category filter
SELECT DISTINCT status FROM orders;              -- Status filter
SELECT DISTINCT payment_method FROM transactions; -- Payment filter
SELECT DISTINCT shipping_method FROM shipments;  -- Shipping filter
SELECT DISTINCT warehouse FROM inventory;        -- Warehouse filter

Application Code

use heliosdb_lite::EmbeddedDatabase;
use heliosdb_lite::storage::NativeBackend;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize embedded database (no external server needed)
    let backend = NativeBackend::open("./analytics.db")?;
    let db = EmbeddedDatabase::new(backend);

    // Setup schema and data (one-time)
    db.execute("CREATE TABLE customers (
        id INT PRIMARY KEY,
        name TEXT,
        email TEXT,
        region TEXT,
        segment TEXT,
        signup_date TEXT
    )")?;

    // ... populate with production data ...

    // Dashboard filter population (called on every page load)
    let regions = db.query("SELECT DISTINCT region FROM customers", &[])?;
    let categories = db.query("SELECT DISTINCT product_category FROM orders", &[])?;
    let statuses = db.query("SELECT DISTINCT status FROM orders", &[])?;

    // Each query completes in ~3μs on the benchmark dataset, so all 6 filters
    // (the 3 shown here plus payment, shipping, and warehouse) populate in <20μs total
    // PostgreSQL equivalent: ~1.3ms total (6 × 211μs)

    // Build dropdown options
    let region_options: Vec<String> = regions.iter()
        .filter_map(|row| match &row.values[0] {
            heliosdb_lite::Value::String(s) => Some(s.clone()),
            _ => None,
        })
        .collect();

    println!("Available regions: {:?}", region_options);
    // Example output (row order is not guaranteed without ORDER BY):
    // ["East", "West", "North", "South", "Central"]

    Ok(())
}
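
Since SELECT DISTINCT without ORDER BY does not guarantee row order, a dropdown built from the result may want a stable order imposed in the application. The sketch below shows one way to do that. The `Value` and `Row` types here are local stand-ins defined only for this example; they are assumptions and are not the HeliosDB-Lite types, and `dropdown_options` is a hypothetical helper.

```rust
/// Stand-in for a database value; NOT the HeliosDB-Lite `Value` type.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    String(String),
    Null,
}

/// Stand-in for a result row; NOT the HeliosDB-Lite row type.
struct Row {
    values: Vec<Value>,
}

/// Collects the first column of each row as a string, skips anything
/// that is not a string, and sorts so the dropdown order is stable.
fn dropdown_options(rows: &[Row]) -> Vec<String> {
    let mut options: Vec<String> = rows
        .iter()
        .filter_map(|row| match row.values.first() {
            Some(Value::String(s)) => Some(s.clone()),
            _ => None,
        })
        .collect();
    options.sort();
    options
}

fn main() {
    let rows = vec![
        Row { values: vec![Value::String("West".to_string())] },
        Row { values: vec![Value::Null] },
        Row { values: vec![Value::String("East".to_string())] },
    ];
    println!("{:?}", dropdown_options(&rows));
}
```

An alternative is to add ORDER BY to the query itself, which keeps the ordering logic in SQL at the cost of a sort step in the executor.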

Impact Analysis

Metric	PostgreSQL 16	HeliosDB-Lite	Improvement
6 DISTINCT queries	~1.3 ms	18 μs	~70x faster
500 users × 6 queries	633 ms cumulative	9 ms cumulative	70x reduction
Dashboard load time	50-100 ms (with network)	<1 ms	Interactive feel
Server required	Yes (PostgreSQL instance)	No (embedded)	Zero ops
Connection pool	Required (50-100 connections)	Not needed	Simpler architecture

At Scale

For a SaaS platform with 10,000 dashboard users making 10 page loads per hour:

PostgreSQL: 10,000 × 10 × 6 × 0.211ms ≈ 127 seconds of query time per hour
HeliosDB-Lite: 10,000 × 10 × 6 × 0.003ms = 1.8 seconds of query time per hour

This 70x reduction means a single application instance can serve the same analytics workload that would require multiple PostgreSQL read replicas.
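
The hourly totals can be recomputed for other user counts or latencies with a few lines of code. The sketch below parameterizes the scenario; the `Workload` struct and its field names are illustrative, and the result is summed query execution time, not server utilization.

```rust
/// Dashboard workload parameters for the scenario in the text.
struct Workload {
    users: u64,
    page_loads_per_user_per_hour: u64,
    queries_per_page_load: u64,
}

impl Workload {
    /// Total DISTINCT queries issued per hour.
    fn queries_per_hour(&self) -> u64 {
        self.users * self.page_loads_per_user_per_hour * self.queries_per_page_load
    }

    /// Summed query execution time per hour, in seconds, at a fixed latency.
    fn query_seconds_per_hour(&self, latency_micros: u64) -> f64 {
        (self.queries_per_hour() * latency_micros) as f64 / 1_000_000.0
    }
}

fn main() {
    let saas = Workload {
        users: 10_000,
        page_loads_per_user_per_hour: 10,
        queries_per_page_load: 6,
    };

    println!("queries/hour:  {}", saas.queries_per_hour());
    println!("PostgreSQL 16: {:.1} s/hour", saas.query_seconds_per_hour(211));
    println!("HeliosDB-Lite: {:.1} s/hour", saas.query_seconds_per_hour(3));
}
```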

Technical Advantages

Comparison

Capability	HeliosDB-Lite	PostgreSQL 16	SQLite	DuckDB
DISTINCT latency (200 rows)	~3 μs	~211 μs	~50 μs	~15 μs
Embedded execution	Yes	No (client-server)	Yes	Yes
Network overhead	0	50-200 μs	0	0
ACID transactions	Yes	Yes	Yes (WAL)	Yes
Concurrent readers	Yes (MVCC)	Yes (MVCC)	Limited (WAL)	Yes
Full SQL support	Yes (JOINs, CTEs, windows)	Yes	Partial	Yes (analytical)
Vector search	Yes (HNSW+PQ)	Via extension	No	No
Wire protocol	PostgreSQL compatible	Native	N/A	N/A

When to Use This

HeliosDB-Lite’s DISTINCT performance advantage is most impactful for:

Real-time dashboards — Filter dropdowns, faceted search, interactive analytics where sub-millisecond response matters
Embedded analytics — Applications that need SQL analytics without deploying a separate database server
Edge computing — IoT gateways, mobile apps, and edge nodes where network latency to a central database is unacceptable
Microservices — Each service embeds its own database, eliminating shared database bottlenecks
High-frequency queries — Autocomplete, search suggestions, and real-time filtering where queries execute thousands of times per second
