Approximate Query Processing Module
BlinkDB-style approximate query processing for ultra-fast analytics with error bounds.
Overview
This module implements approximate query processing inspired by BlinkDB, providing:
- Multi-dimensional Stratified Sampling: Create samples stratified by different column combinations
- Error Bound Computation: Statistical confidence intervals using CLT, Hoeffding’s inequality, and bootstrap methods
- Optimal Sample Selection: Automatically select the best sample for each query based on predicates
- Sample Quality Monitoring: Track sample freshness, coverage, and stratification balance
- BlinkDB Integration: Manage sample families and execute queries with automatic sample selection
Architecture
```
approximate/
├── mod.rs               # Core engine and data structures
├── blinkdb.rs           # BlinkDB-style sample management
├── sampling.rs          # Stratified sampling algorithms
├── error_bounds.rs      # Statistical error bound computation
└── sample_selection.rs  # Optimal sample selection logic
```

Key Concepts
Stratified Sampling
Stratified sampling divides data into homogeneous strata (e.g., by region) and samples from each stratum proportionally. This reduces variance and improves accuracy compared to simple random sampling.
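The core mechanics can be sketched in a few lines of dependency-free Rust. Note that `stratified_sample`, the `(stratum_key, value)` row shape, and the systematic every-k-th pick are illustrative stand-ins for the module's randomized sampler:

```rust
use std::collections::BTreeMap;

/// Take roughly `fraction` of the rows in every stratum.
/// Rows are (stratum_key, value) pairs for illustration; a real sampler
/// would pick randomly within each stratum instead of every k-th row.
fn stratified_sample(rows: &[(String, f64)], fraction: f64) -> Vec<(String, f64)> {
    // Partition rows into strata by key (e.g., region).
    let mut strata: BTreeMap<String, Vec<(String, f64)>> = BTreeMap::new();
    for row in rows {
        strata.entry(row.0.clone()).or_default().push(row.clone());
    }
    // Sample the same fraction from each stratum, but never zero rows,
    // so rare strata stay represented -- the point of stratification.
    let mut sample = Vec::new();
    for members in strata.values() {
        let take = ((members.len() as f64 * fraction).round() as usize)
            .clamp(1, members.len());
        let step = members.len() / take;
        for i in 0..take {
            sample.push(members[i * step].clone());
        }
    }
    sample
}
```

With 8 "US" rows and 2 "EU" rows at `fraction = 0.5`, this yields 4 and 1 rows respectively, preserving each stratum's share of the data.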
Example:
```rust
let sampler = StratifiedSampler::new();
let (sample, strata_info) = sampler.create_stratified_sample(
    &data,
    &[region_col, category_col], // Stratify by region and category
    10.0,                        // 10% sample size
)?;
```

Error Bounds
Error bounds provide confidence intervals for approximate results:
- Wilson Score Interval: For COUNT queries (binomial proportions)
- Central Limit Theorem: For SUM and AVG queries (sample means)
- Hoeffding’s Inequality: Conservative bounds without distribution assumptions
- Bootstrap: Non-parametric confidence intervals
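As a concrete illustration of the last method, here is a dependency-free percentile-bootstrap interval for a mean. The `bootstrap_mean_ci` helper and its tiny LCG random generator are stand-ins, not the module's actual implementation:

```rust
/// Percentile-bootstrap confidence interval for the mean of `sample`.
/// A small linear congruential generator stands in for a real RNG so
/// the sketch has no external dependencies.
fn bootstrap_mean_ci(sample: &[f64], confidence: f64, resamples: usize) -> (f64, f64) {
    let n = sample.len();
    let mut seed: u64 = 42;
    let mut next = move || {
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (seed >> 33) as usize
    };
    // Draw `resamples` resamples with replacement and record each mean.
    let mut means: Vec<f64> = (0..resamples)
        .map(|_| {
            let sum: f64 = (0..n).map(|_| sample[next() % n]).sum();
            sum / n as f64
        })
        .collect();
    means.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // The CI endpoints are the (1 ± confidence)/2 percentiles of those means.
    let lo = ((1.0 - confidence) / 2.0 * resamples as f64) as usize;
    let hi = (((1.0 + confidence) / 2.0 * resamples as f64) as usize).min(resamples - 1);
    (means[lo], means[hi])
}
```

Because it makes no distributional assumptions, the bootstrap is useful when CLT-based bounds are suspect (e.g., heavily skewed data), at the cost of extra computation per query.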
Example:
```rust
let calculator = ErrorBoundsCalculator::new();
let (lower, upper) = calculator.compute_avg_bounds(
    &sample_values,
    0.95, // 95% confidence
);
```

Sample Selection
The sample selector chooses the optimal sample for a query based on:
- Predicate Coverage: Does the sample stratification match query filters?
- Sample Size: Larger samples generally provide better accuracy
- Freshness: Recent samples reflect current data
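A minimal sketch of such a selector in Rust. The `CandidateSample` struct, the coverage field semantics, and the one-hour freshness half-life are illustrative assumptions; only the 0.6/0.3/0.1 weights match the module's scoring formula:

```rust
/// Hypothetical description of one candidate sample.
struct CandidateSample {
    stratify_columns: Vec<String>, // columns the sample is stratified by
    coverage: f64,                 // fraction of base-table strata represented, in 0..=1
    age_secs: f64,                 // time since the sample was last refreshed
}

/// Score a candidate against the columns referenced in the query's WHERE clause.
fn selection_score(query_predicates: &[String], s: &CandidateSample) -> f64 {
    // Predicate score: fraction of predicate columns covered by the stratification.
    let matched = query_predicates
        .iter()
        .filter(|p| s.stratify_columns.contains(*p))
        .count();
    let predicate_score = if query_predicates.is_empty() {
        1.0
    } else {
        matched as f64 / query_predicates.len() as f64
    };
    // Freshness decays with age; the one-hour half-life is an arbitrary choice.
    let freshness_score = 0.5_f64.powf(s.age_secs / 3600.0);
    0.6 * predicate_score + 0.3 * s.coverage + 0.1 * freshness_score
}
```

A fresh, fully covering sample stratified on every predicate column scores 1.0; the selector picks the highest-scoring candidate for each query.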
Scoring Formula:
```
score = 0.6 * predicate_score + 0.3 * coverage_score + 0.1 * freshness_score
```

Usage Examples
Basic Approximate Query
```rust
use heliosdb_compute::approximate::*;

// Create engine
let engine = ApproximateEngine::new().await?;

// Create sample
let config = SampleConfig {
    name: "sales_sample".to_string(),
    base_table: "sales".to_string(),
    stratify_columns: vec!["region".to_string()],
    sample_size_percent: 10.0,
    target_error_percent: 5.0,
    filter: None,
    refresh_interval: None,
};
engine.create_sample(config).await?;

// Execute approximate COUNT query
let query = ApproximateQuery {
    sample_name: "sales_sample".to_string(),
    query: "SELECT COUNT(*) FROM sales WHERE region = 'US'".to_string(),
    confidence_level: 0.95,
    aggregate_type: Some(AggregateType::Count),
    aggregate_column: None,
};

let result = engine.approx_count(query).await?;

println!("Estimated Count: {:.0}", result.value);
println!("Error: ±{:.2}%", result.error_percent);
println!("95% CI: [{:.0}, {:.0}]", result.lower_bound, result.upper_bound);
```

BlinkDB Sample Families
```rust
let manager = BlinkDBManager::new();

// Create multiple samples with different stratifications
let base_config = SampleConfig {
    name: "sales_family".to_string(),
    base_table: "sales".to_string(),
    stratify_columns: vec![],
    sample_size_percent: 10.0,
    target_error_percent: 5.0,
    filter: None,
    refresh_interval: None,
};

let schemes = vec![
    vec!["region".to_string()],
    vec!["category".to_string()],
    vec!["region".to_string(), "category".to_string()],
];

let sample_names = manager.create_sample_family(base_config, schemes).await?;

// Execute query - BlinkDB selects the best sample automatically
let query = ApproximateQuery {
    sample_name: "".to_string(),
    query: "SELECT COUNT(*) FROM sales WHERE region = 'US'".to_string(),
    confidence_level: 0.95,
    aggregate_type: Some(AggregateType::Count),
    aggregate_column: None,
};

let result = manager.execute_approximate_query(query).await?;
println!("Selected sample: {}", result.selected_sample);
```

Sample Quality Monitoring
```rust
let quality = engine.get_sample_quality("sales_sample").await?;

println!("Coverage: {:.2}%", quality.coverage);
println!("Freshness: {:.2}s", quality.freshness.as_secs_f64());
println!("Estimated Error: {:.2}%", quality.estimated_error);
println!("Stratification Balance: {:.2}", quality.stratification_balance);
```

SQL Syntax (Planned)
```sql
-- Create sample
CREATE SAMPLE sales_sample ON sales
STRATIFIED BY region, product_category
WITH SIZE 1% ERROR 5%;

-- Approximate query
SELECT region,
       APPROX_COUNT(*) as count,
       APPROX_AVG(revenue) as avg_revenue
FROM sales_sample
WHERE date >= '2025-01-01'
GROUP BY region
WITH CONFIDENCE 95%;

-- Result includes error bounds:
-- region | count | count_lower | count_upper | avg_revenue | revenue_error
-- US     | 1000  | 950         | 1050        | 123.45      | 5.2%
```

Performance Characteristics
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Sample Creation | O(n) | O(s) | n = base table size, s = sample size |
| Query Execution | O(s) | O(1) | s = sample size |
| Sample Selection | O(k) | O(1) | k = number of samples |
| Error Computation | O(s) | O(1) | s = sample size |
Speedup Example:
- Base table: 1 billion rows
- Sample (1%): 10 million rows
- Query speedup: ~100x
- Accuracy: ±5% error with 95% confidence
Error Bounds Formulas
COUNT (Wilson Score Interval)
```
p̂ = sample_count / sample_size
CI = (p̂ + z²/2n ± z√(p̂(1-p̂)/n + z²/4n²)) / (1 + z²/n)
```

where z is the z-score for the confidence level (e.g., 1.96 for 95%).

SUM and AVG (Central Limit Theorem)
```
SE = σ / √n
CI = x̄ ± z * SE
```

where:
- x̄ = sample mean
- σ = sample standard deviation
- n = sample size
- z = z-score for confidence level

Hoeffding’s Inequality (Conservative)
```
P(|x̄ - μ| > ε) ≤ 2 * exp(-2nε²/R²)
```

Solving for ε given confidence δ:

```
ε = R * √(ln(2/δ) / 2n)
```

where R is the range of values.

Best Practices
1. Stratification Strategy:
   - Stratify by columns frequently used in WHERE clauses
   - Use multi-dimensional stratification for complex queries
   - Create sample families covering common query patterns

2. Sample Size Selection:
   - Larger samples → lower error, higher storage/query cost
   - Target: 1-10% for most use cases
   - Use compute_required_sample_size() for error targets

3. Refresh Strategy:
   - Refresh samples when base data changes significantly
   - Use incremental refresh for large tables
   - Monitor sample quality metrics

4. When to Use Approximate Queries:
   - ✅ Exploratory analytics and dashboards
   - ✅ Trend analysis and aggregations
   - ✅ Large datasets where exact results are too slow
   - ❌ Regulatory reports requiring exact values
   - ❌ Small datasets where exact queries are fast
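The compute_required_sample_size() helper mentioned above is not shown in this document; a conservative version can be derived by solving Hoeffding's ε = R * √(ln(2/δ) / 2n) for n. The function below is a hypothetical standalone sketch, not the module's actual signature:

```rust
/// Rows needed so that, with confidence 1 - delta, the sample mean lies
/// within `epsilon` of the true mean, by Hoeffding's inequality.
/// `range` is the spread (max - min) of the aggregated column.
fn required_sample_size(range: f64, epsilon: f64, delta: f64) -> u64 {
    // n = R² * ln(2/δ) / (2 * ε²), rounded up.
    let n = range.powi(2) * (2.0 / delta).ln() / (2.0 * epsilon.powi(2));
    n.ceil() as u64
}
```

For values in a unit range, a ±0.01 absolute-error target at 95% confidence needs roughly 18.4k rows; because the bound is distribution-free it is conservative, and CLT-based sizing would typically require fewer.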
Testing
The module includes comprehensive tests:
- Unit Tests: 50+ tests covering all components
- Integration Tests: 15+ end-to-end scenarios
- Test Coverage: >80%
Run tests:
```shell
cargo test -p heliosdb-compute approximate
cargo test -p heliosdb-compute --test approximate_integration_test
```

References
- BlinkDB: Queries with Bounded Errors and Bounded Response Times
- Stratified Sampling in Statistics
- Hoeffding’s Inequality
- Central Limit Theorem
Future Enhancements
- Online sample maintenance (incremental updates)
- Adaptive sampling (adjust sample size based on query workload)
- Join support (approximate join results)
- Time-series aware sampling
- GPU-accelerated sampling
- Sample compression