High-Speed DISTINCT Deduplication: Business Use Case for HeliosDB-Lite

Document ID: 52_DISTINCT_DEDUPLICATION.md
Version: 1.0
Created: 2026-02-12
Category: Query Performance & Analytics
HeliosDB-Lite Version: 3.5.8+


Executive Summary

SQL DISTINCT queries are foundational to business analytics — extracting unique values for dashboards, dropdowns, report dimensions, and data quality audits. In production workloads, DISTINCT is one of the most frequently executed operations, running thousands of times per minute for real-time filtering UIs, autocomplete suggestions, and ETL pipelines.

In the benchmark described below, HeliosDB-Lite runs DISTINCT roughly 70x faster than PostgreSQL 16: SELECT DISTINCT region FROM customers completes in ~3 microseconds versus PostgreSQL’s ~211 microseconds on a 200-row table with 5 distinct regions. Two design choices account for the gap: embedded execution eliminates network overhead, and in-memory hash-based deduplication operates entirely within the application process.

Key Business Value:

  • 70x faster DISTINCT queries: 3μs vs 211μs on the 200-row benchmark table
  • Zero network latency: embedded execution, no client-server round-trip
  • Real-time analytics: microsecond-scale response enables interactive dashboards
  • Reduced infrastructure: no separate database server for analytics workloads

Problem Being Solved

The Business Need for DISTINCT

Every data-driven application relies on DISTINCT queries:

| Use Case | Example Query | Business Impact |
| --- | --- | --- |
| Filter dropdowns | SELECT DISTINCT region FROM customers | UI responsiveness for end users |
| Dashboard dimensions | SELECT DISTINCT product_category FROM orders | Real-time analytics drill-down |
| Data quality audit | SELECT DISTINCT status FROM orders | Detect invalid/unexpected values |
| ETL deduplication | SELECT DISTINCT email FROM raw_leads | Clean data before processing |
| Autocomplete | SELECT DISTINCT city FROM addresses WHERE city LIKE 'San%' | Search-as-you-type UX |
| Reporting dimensions | SELECT DISTINCT department, role FROM employees | Report parameter selection |

Why Speed Matters

When a user opens a dashboard filter dropdown, the application runs SELECT DISTINCT to populate the options. At 211μs per query (PostgreSQL), one filter query for each of 1,000 concurrent dashboard users costs:

  • 211ms of cumulative query time per refresh cycle (1,000 × 211μs)
  • An additional 0.5-2ms of network round-trip per query when the database runs on a separate host
  • Connection pool contention that degrades P99 latency at scale

At 3μs per query (HeliosDB-Lite):

  • 3ms of cumulative query time — 70x reduction
  • Zero network overhead — embedded in the application process
  • No connection pool — direct function call

Benchmark Results

Test Setup

  • Dataset: 200 customers with 5 distinct regions (“East”, “West”, “North”, “South”, “Central”)
  • Query: SELECT DISTINCT region FROM customers
  • HeliosDB-Lite: v3.5.8, embedded in-memory, NativeBackend
  • PostgreSQL: v16, localhost connection, shared_buffers tuned
  • Method: 1,000 iterations, median timing
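
The method above (repeat the query, then report the median and P99) can be sketched in a few lines of Rust. This is a minimal illustration, not the harness used for the published numbers: the `measure`, `percentile`, and `distinct_regions` names are hypothetical, the percentile uses the nearest-rank method (an assumption, since the source does not say which definition was used), and the workload is a plain in-memory `HashSet` deduplication standing in for a real database call.

```rust
use std::collections::HashSet;
use std::time::{Duration, Instant};

/// Returns the value at percentile `p` (0.0..=100.0) from an ascending-sorted
/// slice, using the nearest-rank method.
fn percentile(sorted: &[Duration], p: f64) -> Duration {
    assert!(!sorted.is_empty(), "need at least one sample");
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Runs `work` `iterations` times and returns (median, p99) wall-clock latency.
fn measure<F: FnMut()>(iterations: usize, mut work: F) -> (Duration, Duration) {
    let mut samples: Vec<Duration> = (0..iterations)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect();
    samples.sort();
    (percentile(&samples, 50.0), percentile(&samples, 99.0))
}

/// Stand-in workload (NOT HeliosDB-Lite): count the distinct region labels.
fn distinct_regions(rows: &[&str]) -> usize {
    rows.iter().copied().collect::<HashSet<&str>>().len()
}
```

To time a real query, pass a closure that issues it, for example `measure(1_000, || { std::hint::black_box(distinct_regions(&rows)); })`. Wrapping the result in `std::hint::black_box` keeps the compiler from optimizing the measured work away.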

Results

| Metric | HeliosDB-Lite | PostgreSQL 16 | Advantage |
| --- | --- | --- | --- |
| Median latency | ~3 μs | ~211 μs | 70x faster |
| P99 latency | ~8 μs | ~450 μs | 56x faster |
| Throughput | 333,000 queries/sec | 4,700 queries/sec | 70x higher |
| Network overhead | 0 μs | 50-200 μs | Eliminated |
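
The throughput and advantage figures in the Results table are consistent with simple arithmetic on the latencies, assuming a single caller issuing queries back to back. The source does not state how throughput was measured, so treat the sketch below as a cross-check rather than the benchmark's method. Both function names are hypothetical.

```rust
/// Queries per second that one serial caller can issue at the given latency.
fn throughput_per_sec(latency_us: f64) -> f64 {
    1_000_000.0 / latency_us
}

/// How many times faster a `fast_us` latency is than a `slow_us` latency.
fn speedup(slow_us: f64, fast_us: f64) -> f64 {
    slow_us / fast_us
}
```

With the table's inputs, `throughput_per_sec(3.0)` is about 333,333 and `throughput_per_sec(211.0)` is about 4,739, while `speedup(211.0, 3.0)` is about 70.3 and `speedup(450.0, 8.0)` is 56.25. These match the rounded values shown.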

Why HeliosDB-Lite Is Faster

  1. Embedded execution: No TCP/IP stack, no serialization/deserialization, no connection management. The query executes as a function call within the application process.

  2. In-memory hash deduplication: The Volcano-model executor streams tuples through a HashSet, emitting each unique value exactly once. With 200 rows and 5 distinct values, the entire operation fits in L1 cache.

  3. Minimal query planning overhead: The optimized query plan is generated once, and the Volcano iterator model pulls tuples through the operators one at a time, so intermediate results are never materialized.

  4. No process boundary: PostgreSQL runs each connection in its own backend process, so every query crosses a client/server socket and the backends coordinate with each other through shared memory. HeliosDB-Lite has no process boundary.
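
The hash-based deduplication in point 2 can be illustrated with a small pull-based operator. This is a generic sketch of the technique, not HeliosDB-Lite's executor: the `Distinct` type is hypothetical, and it wraps an ordinary Rust iterator in place of a real child operator. Each call to `next` pulls from the child until it finds a value the `HashSet` has not seen, so every unique value is emitted exactly once and nothing is buffered beyond the set itself.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Pull-based DISTINCT operator: wraps a child iterator and yields each
/// value only the first time it is seen.
struct Distinct<I: Iterator> {
    child: I,
    seen: HashSet<I::Item>,
}

impl<I> Distinct<I>
where
    I: Iterator,
    I::Item: Eq + Hash + Clone,
{
    fn new(child: I) -> Self {
        Distinct { child, seen: HashSet::new() }
    }
}

impl<I> Iterator for Distinct<I>
where
    I: Iterator,
    I::Item: Eq + Hash + Clone,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // Pull from the child until a not-yet-seen value appears.
        while let Some(value) = self.child.next() {
            // `insert` returns true only when the value was not already present.
            if self.seen.insert(value.clone()) {
                return Some(value);
            }
        }
        None
    }
}
```

Because the operator is generic over the item type, the same code covers multi-column DISTINCT by using tuples as rows, for example `Distinct::new(rows.into_iter())` over `(department, role)` pairs. This sketch emits values in first-seen order. The source does not specify the order HeliosDB-Lite returns, so callers that need a stable order should add ORDER BY.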


Concrete Example

Scenario: E-Commerce Analytics Dashboard

An e-commerce platform serves 500 internal users who access real-time dashboards. Each dashboard load populates 6 filter dropdowns via DISTINCT queries:

-- Dashboard filter queries (executed on every page load)
SELECT DISTINCT region FROM customers; -- Geographic filter
SELECT DISTINCT product_category FROM orders; -- Category filter
SELECT DISTINCT status FROM orders; -- Status filter
SELECT DISTINCT payment_method FROM transactions; -- Payment filter
SELECT DISTINCT shipping_method FROM shipments; -- Shipping filter
SELECT DISTINCT warehouse FROM inventory; -- Warehouse filter

Application Code

use heliosdb_lite::storage::NativeBackend;
use heliosdb_lite::{EmbeddedDatabase, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize embedded database (no external server needed)
    let backend = NativeBackend::open("./analytics.db")?;
    let db = EmbeddedDatabase::new(backend);

    // Setup schema and data (one-time)
    db.execute("CREATE TABLE customers (
        id INT PRIMARY KEY,
        name TEXT,
        email TEXT,
        region TEXT,
        segment TEXT,
        signup_date TEXT
    )")?;
    // ... create the orders table and populate with production data ...

    // Dashboard filter population (called on every page load).
    // Three of the six filters are shown; the rest follow the same pattern.
    let regions = db.query("SELECT DISTINCT region FROM customers", &[])?;
    let categories = db.query("SELECT DISTINCT product_category FROM orders", &[])?;
    let statuses = db.query("SELECT DISTINCT status FROM orders", &[])?;
    // At ~3μs per query, all 6 filters are populated in <20μs total
    // PostgreSQL equivalent: ~1.3ms total (6 × 211μs)

    // Build dropdown options, skipping any non-string (e.g. NULL) values
    let region_options: Vec<String> = regions
        .iter()
        .filter_map(|row| match &row.values[0] {
            Value::String(s) => Some(s.clone()),
            _ => None,
        })
        .collect();

    println!("Available regions: {:?}", region_options);
    // Example output (row order is not guaranteed without ORDER BY):
    // Available regions: ["East", "West", "North", "South", "Central"]
    Ok(())
}

Impact Analysis

| Metric | PostgreSQL 16 | HeliosDB-Lite | Improvement |
| --- | --- | --- | --- |
| 6 DISTINCT queries | 1.3 ms | 18 μs | ~70x faster |
| 500 users × 6 queries | 633 ms cumulative | 9 ms cumulative | 70x reduction |
| Dashboard load time | 50-100 ms (with network) | <1 ms | Interactive feel |
| Server required | Yes (PostgreSQL instance) | No (embedded) | Zero ops |
| Connection pool | Required (50-100 connections) | Not needed | Simpler architecture |

At Scale

For a SaaS platform with 10,000 dashboard users making 10 page loads per hour:

  • PostgreSQL: 10,000 × 10 × 6 × 0.211ms ≈ 127 seconds of query time per hour
  • HeliosDB-Lite: 10,000 × 10 × 6 × 0.003ms = 1.8 seconds of query time per hour

This 70x reduction in cumulative query time leaves far more headroom on each application instance and removes the need for a separate database tier, or read replicas, to serve this analytics workload.
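
The scale estimate above is a straight product of users, page loads, queries per load, and per-query latency. A small helper makes it easy to rerun with other inputs. The function name is hypothetical, and like the text it assumes every query takes the benchmark's median latency, which real tables with more rows or more distinct values may not match.

```rust
/// Cumulative query time, in seconds, for one hour of dashboard traffic.
/// Assumes every query costs the same `latency_us` (microseconds).
fn query_seconds_per_hour(
    users: u64,
    loads_per_hour: u64,
    queries_per_load: u64,
    latency_us: f64,
) -> f64 {
    let queries = (users * loads_per_hour * queries_per_load) as f64;
    queries * latency_us / 1_000_000.0
}
```

For the scenario in the text, `query_seconds_per_hour(10_000, 10, 6, 211.0)` gives about 126.6 seconds and `query_seconds_per_hour(10_000, 10, 6, 3.0)` gives about 1.8 seconds.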


Technical Advantages

Comparison

| Capability | HeliosDB-Lite | PostgreSQL 16 | SQLite | DuckDB |
| --- | --- | --- | --- | --- |
| DISTINCT latency (200 rows) | ~3 μs | ~211 μs | ~50 μs | ~15 μs |
| Embedded execution | Yes | No (client-server) | Yes | Yes |
| Network overhead | 0 | 50-200 μs | 0 | 0 |
| ACID transactions | Yes | Yes | Yes (WAL) | Limited |
| Concurrent readers | Yes (MVCC) | Yes (MVCC) | Limited (WAL) | Yes |
| Full SQL support | Yes (JOINs, CTEs, windows) | Yes | Partial | Yes (analytical) |
| Vector search | Yes (HNSW+PQ) | Via extension | No | No |
| Wire protocol | PostgreSQL compatible | Native | N/A | N/A |

When to Use This

HeliosDB-Lite’s DISTINCT performance advantage is most impactful for:

  1. Real-time dashboards — Filter dropdowns, faceted search, interactive analytics where sub-millisecond response matters
  2. Embedded analytics — Applications that need SQL analytics without deploying a separate database server
  3. Edge computing — IoT gateways, mobile apps, and edge nodes where network latency to a central database is unacceptable
  4. Microservices — Each service embeds its own database, eliminating shared database bottlenecks
  5. High-frequency queries — Autocomplete, search suggestions, and real-time filtering where queries execute thousands of times per second

Document Classification: Public Reference
HeliosDB-Lite Version: 3.5.8