
Approximate Query Processing Module

BlinkDB-style approximate query processing for ultra-fast analytics with error bounds.

Overview

This module implements BlinkDB-inspired approximate query processing, providing:

  • Multi-dimensional Stratified Sampling: Create samples stratified by different column combinations
  • Error Bound Computation: Statistical confidence intervals using CLT, Hoeffding’s inequality, and bootstrap methods
  • Optimal Sample Selection: Automatically select the best sample for each query based on predicates
  • Sample Quality Monitoring: Track sample freshness, coverage, and stratification balance
  • BlinkDB Integration: Manage sample families and execute queries with automatic sample selection

Architecture

approximate/
├── mod.rs # Core engine and data structures
├── blinkdb.rs # BlinkDB-style sample management
├── sampling.rs # Stratified sampling algorithms
├── error_bounds.rs # Statistical error bound computation
└── sample_selection.rs # Optimal sample selection logic

Key Concepts

Stratified Sampling

Stratified sampling divides data into homogeneous strata (e.g., by region) and samples from each stratum proportionally. This reduces variance and improves accuracy compared to simple random sampling.

Example:

let sampler = StratifiedSampler::new();
let (sample, strata_info) = sampler.create_stratified_sample(
    &data,
    &[region_col, category_col], // Stratify by region and category
    10.0,                        // 10% sample size
)?;

Error Bounds

Error bounds provide confidence intervals for approximate results:

  • Wilson Score Interval: For COUNT queries (binomial proportions)
  • Central Limit Theorem: For SUM and AVG queries (sample means)
  • Hoeffding’s Inequality: Conservative bounds without distribution assumptions
  • Bootstrap: Non-parametric confidence intervals

Example:

let calculator = ErrorBoundsCalculator::new();
let (lower, upper) = calculator.compute_avg_bounds(
    &sample_values,
    0.95, // 95% confidence
);

Sample Selection

The sample selector chooses the optimal sample for a query based on:

  1. Predicate Coverage: Does the sample stratification match query filters?
  2. Sample Size: Larger samples generally provide better accuracy
  3. Freshness: Recent samples reflect current data

Scoring Formula:

score = 0.6 * predicate_score + 0.3 * coverage_score + 0.1 * freshness_score
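The weighting gives predicate match the dominant influence. As an illustration only (the function and parameter names here are hypothetical, not the module's actual API), the formula reduces to a one-liner:

```rust
// Hypothetical sketch of the weighted scoring formula above.
// All three inputs are assumed to be normalized to [0, 1].
fn score_sample(predicate_score: f64, coverage_score: f64, freshness_score: f64) -> f64 {
    0.6 * predicate_score + 0.3 * coverage_score + 0.1 * freshness_score
}

fn main() {
    // A sample whose stratification exactly matches the query predicates
    // wins even when it is somewhat stale.
    let s = score_sample(1.0, 0.8, 0.5);
    println!("score = {s:.2}"); // 0.6 + 0.24 + 0.05 = 0.89
}
```

Because the predicate weight is twice the coverage weight, a perfectly matching but half-fresh sample still outranks a fresh sample with poor predicate coverage.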

Usage Examples

Basic Approximate Query

use heliosdb_compute::approximate::*;

// Create engine
let engine = ApproximateEngine::new().await?;

// Create sample
let config = SampleConfig {
    name: "sales_sample".to_string(),
    base_table: "sales".to_string(),
    stratify_columns: vec!["region".to_string()],
    sample_size_percent: 10.0,
    target_error_percent: 5.0,
    filter: None,
    refresh_interval: None,
};
engine.create_sample(config).await?;

// Execute an approximate COUNT query
let query = ApproximateQuery {
    sample_name: "sales_sample".to_string(),
    query: "SELECT COUNT(*) FROM sales WHERE region = 'US'".to_string(),
    confidence_level: 0.95,
    aggregate_type: Some(AggregateType::Count),
    aggregate_column: None,
};
let result = engine.approx_count(query).await?;

println!("Estimated Count: {:.0}", result.value);
println!("Error: ±{:.2}%", result.error_percent);
println!("95% CI: [{:.0}, {:.0}]", result.lower_bound, result.upper_bound);

BlinkDB Sample Families

let manager = BlinkDBManager::new();

// Create multiple samples with different stratifications
let base_config = SampleConfig {
    name: "sales_family".to_string(),
    base_table: "sales".to_string(),
    stratify_columns: vec![],
    sample_size_percent: 10.0,
    target_error_percent: 5.0,
    filter: None,
    refresh_interval: None,
};
let schemes = vec![
    vec!["region".to_string()],
    vec!["category".to_string()],
    vec!["region".to_string(), "category".to_string()],
];
let sample_names = manager.create_sample_family(base_config, schemes).await?;

// Execute a query - BlinkDB selects the best sample automatically
let query = ApproximateQuery {
    sample_name: "".to_string(), // empty: let the selector choose
    query: "SELECT COUNT(*) FROM sales WHERE region = 'US'".to_string(),
    confidence_level: 0.95,
    aggregate_type: Some(AggregateType::Count),
    aggregate_column: None,
};
let result = manager.execute_approximate_query(query).await?;
println!("Selected sample: {}", result.selected_sample);

Sample Quality Monitoring

let quality = engine.get_sample_quality("sales_sample").await?;
println!("Coverage: {:.2}%", quality.coverage);
println!("Freshness: {:.2}s", quality.freshness.as_secs_f64());
println!("Estimated Error: {:.2}%", quality.estimated_error);
println!("Stratification Balance: {:.2}", quality.stratification_balance);

SQL Syntax (Planned)

-- Create sample
CREATE SAMPLE sales_sample ON sales
STRATIFIED BY region, product_category
WITH SIZE 1% ERROR 5%;

-- Approximate query
SELECT
    region,
    APPROX_COUNT(*) AS count,
    APPROX_AVG(revenue) AS avg_revenue
FROM sales_sample
WHERE date >= '2025-01-01'
GROUP BY region
WITH CONFIDENCE 95%;

-- Result includes error bounds:
-- region | count | count_lower | count_upper | avg_revenue | revenue_error
-- US     | 1000  | 950         | 1050        | 123.45      | 5.2%

Performance Characteristics

Operation         | Time Complexity | Space Complexity | Notes
------------------|-----------------|------------------|-------------------------------------
Sample Creation   | O(n)            | O(s)             | n = base table size, s = sample size
Query Execution   | O(s)            | O(1)             | s = sample size
Sample Selection  | O(k)            | O(1)             | k = number of samples
Error Computation | O(s)            | O(1)             | s = sample size

Speedup Example:

  • Base table: 1 billion rows
  • Sample (1%): 10 million rows
  • Query speedup: ~100x
  • Accuracy: ±5% error with 95% confidence

Error Bounds Formulas

COUNT (Wilson Score Interval)

p̂ = sample_count / sample_size
CI = (p̂ + z²/2n ± z√(p̂(1-p̂)/n + z²/4n²)) / (1 + z²/n)
where z is the z-score for confidence level (e.g., 1.96 for 95%)
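The interval above can be sketched directly in Rust. This is a standalone illustration of the Wilson score formula, not the module's `ErrorBoundsCalculator` implementation:

```rust
/// Wilson score interval for a binomial proportion.
/// `successes` = rows matching the predicate in the sample,
/// `n` = sample size, `z` = z-score for the confidence level.
fn wilson_interval(successes: u64, n: u64, z: f64) -> (f64, f64) {
    let n = n as f64;
    let p = successes as f64 / n;
    let z2 = z * z;
    let denom = 1.0 + z2 / n;
    let center = (p + z2 / (2.0 * n)) / denom;
    let half = z * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt() / denom;
    (center - half, center + half)
}

fn main() {
    // 400 matching rows in a 1,000-row sample at 95% confidence (z = 1.96).
    let (lo, hi) = wilson_interval(400, 1000, 1.96);
    println!("proportion CI: [{lo:.4}, {hi:.4}]"); // roughly [0.37, 0.43]
}
```

Scaling the proportion interval by the base-table row count yields the COUNT bounds.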

SUM and AVG (Central Limit Theorem)

SE = σ / √n
CI = x̄ ± z * SE
where:
x̄ = sample mean
σ = sample standard deviation
n = sample size
z = z-score for confidence level
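A minimal sketch of the CLT-based bounds for AVG, assuming only the formula above (again illustrative, not the module's actual code):

```rust
/// CLT confidence interval for a sample mean.
fn avg_bounds(values: &[f64], z: f64) -> (f64, f64) {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    // Sample variance with Bessel's correction (n - 1).
    let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let se = var.sqrt() / n.sqrt(); // standard error = sigma / sqrt(n)
    (mean - z * se, mean + z * se)
}

fn main() {
    let sample = [10.0, 12.0, 9.0, 11.0, 13.0];
    let (lo, hi) = avg_bounds(&sample, 1.96); // 95% confidence
    println!("95% CI for AVG: [{lo:.2}, {hi:.2}]");
}
```

For SUM, the same interval is scaled by the estimated population size.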

Hoeffding’s Inequality (Conservative)

P(|x̄ - μ| > ε) ≤ 2 * exp(-2nε²/R²)
Solving for ε given confidence δ:
ε = R * √(ln(2/δ) / 2n)
where R is the range of values
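Solving Hoeffding's bound for ε is a one-line computation; this hypothetical helper shows how the half-width shrinks as √n grows:

```rust
/// Hoeffding half-width: eps = R * sqrt(ln(2/delta) / (2n)),
/// where delta = 1 - confidence and R is the value range.
fn hoeffding_epsilon(range: f64, n: usize, confidence: f64) -> f64 {
    let delta = 1.0 - confidence;
    range * ((2.0 / delta).ln() / (2.0 * n as f64)).sqrt()
}

fn main() {
    // Values in [0, 1], 10,000-row sample, 95% confidence.
    let eps = hoeffding_epsilon(1.0, 10_000, 0.95);
    println!("eps = {eps:.4}"); // distribution-free, so wider than CLT bounds
}
```

Note the 1/√n decay: quadrupling the sample halves the error, which is why these bounds are conservative but cheap to tighten.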

Best Practices

  1. Stratification Strategy:

    • Stratify by columns frequently used in WHERE clauses
    • Use multi-dimensional stratification for complex queries
    • Create sample families covering common query patterns
  2. Sample Size Selection:

    • Larger samples → lower error, higher storage/query cost
    • Target: 1-10% for most use cases
    • Use compute_required_sample_size() for error targets
  3. Refresh Strategy:

    • Refresh samples when base data changes significantly
    • Use incremental refresh for large tables
    • Monitor sample quality metrics
  4. When to Use Approximate Queries:

    • ✅ Exploratory analytics and dashboards
    • ✅ Trend analysis and aggregations
    • ✅ Large datasets where exact results are too slow
    • ❌ Regulatory reports requiring exact values
    • ❌ Small datasets where exact queries are already fast
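For sample-size targeting, inverting the CLT interval gives the required n for a desired absolute error. This back-of-the-envelope version is illustrative only and is not the module's `compute_required_sample_size()` implementation:

```rust
/// Minimum sample size for absolute error `target_error` at z-score `z`,
/// from the CLT relation: target_error = z * std_dev / sqrt(n).
fn required_sample_size(std_dev: f64, target_error: f64, z: f64) -> usize {
    ((z * std_dev / target_error).powi(2)).ceil() as usize
}

fn main() {
    // std dev 10, target error ±1, 95% confidence (z = 1.96).
    let n = required_sample_size(10.0, 1.0, 1.96);
    println!("required rows: {n}"); // 385
}
```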

Testing

The module includes comprehensive tests:

  • Unit Tests: 50+ tests covering all components
  • Integration Tests: 15+ end-to-end scenarios
  • Test Coverage: >80%

Run tests:

cargo test -p heliosdb-compute approximate
cargo test -p heliosdb-compute --test approximate_integration_test

Future Enhancements

  • Online sample maintenance (incremental updates)
  • Adaptive sampling (adjust sample size based on query workload)
  • Join support (approximate join results)
  • Time-series aware sampling
  • GPU-accelerated sampling
  • Sample compression