HeliosDB Data Quality Management Guide

Version: 1.0 | Last Updated: 2025-11-30 | Status: Complete


Table of Contents

  1. Overview
  2. Core Concepts
  3. Getting Started
  4. Quality Metrics
  5. Data Validation
  6. Anomaly Detection
  7. Auto-Correction
  8. Monitoring & Alerting
  9. Best Practices
  10. Troubleshooting

Overview

HeliosDB Data Quality Management provides enterprise-grade data quality monitoring, validation, and automated correction. It uses machine learning to detect anomalies, enforce data standards, and maintain data integrity across your database.

Key Features

  • Automated Quality Monitoring - Continuous quality metric tracking
  • ML-Based Anomaly Detection - Identify unusual patterns and outliers
  • Validation Rule Engine - Define and enforce quality rules
  • Auto-Correction - Automatically fix common data quality issues
  • Quality Dashboards - Real-time quality visualization
  • Compliance Reporting - Track compliance with quality standards
  • Predictive Profiling - ML models for pattern learning and detection

Use Cases

  • Data governance and compliance (GDPR, HIPAA, SOC 2)
  • Master data management (MDM)
  • Data lake quality assurance
  • ETL process monitoring
  • Regulatory reporting accuracy
  • Customer data quality

Core Concepts

Quality Dimensions

HeliosDB data quality management tracks seven key dimensions:

1. Completeness

Measure of non-null, non-empty values in a field.

-- Check completeness of customer_email
SELECT quality_metric('completeness', 'customers', 'email') AS completeness_pct;
-- Result: 98.5% of records have non-null email

Thresholds:

  • Acceptable: 95%+
  • Warning: 80-95%
  • Critical: <80%
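
For intuition, the completeness metric itself is a simple ratio; a minimal sketch in Python (illustrative only, not the HeliosDB implementation):

```python
def completeness_pct(values):
    """Percentage of values that are neither None nor empty/whitespace strings."""
    if not values:
        return 0.0
    present = sum(
        1 for v in values
        if v is not None and (not isinstance(v, str) or v.strip() != "")
    )
    return 100.0 * present / len(values)

emails = ["alice@example.com", None, "", "bob@example.com", "   "]
print(completeness_pct(emails))  # 2 of 5 present -> 40.0
```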

2. Accuracy

Measure of data conforming to expected format/range.

-- Validate phone numbers match E.164 format
SELECT quality_metric('accuracy', 'customers', 'phone_number',
'pattern:^\+[1-9]\d{1,14}$') AS accuracy_pct;
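
The E.164 pattern used here can be sanity-checked outside the database; a quick Python sketch:

```python
import re

# Same pattern as the quality_metric call above: '+', then 2-15 digits,
# first digit non-zero.
E164 = re.compile(r"^\+[1-9]\d{1,14}$")

def is_e164(phone):
    """True if the string is a plausible E.164 phone number."""
    return bool(E164.match(phone))

print(is_e164("+15551234567"))   # True
print(is_e164("555-123-4567"))   # False: no '+', contains dashes
```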

3. Consistency

Measure of uniform data representation across sources.

-- Check if customer names are consistent (no mixed cases)
SELECT quality_metric('consistency', 'customers', 'name') AS consistency_pct;

4. Timeliness

Measure of data freshness and update frequency.

-- Check if order records are updated within 1 hour
SELECT quality_metric('timeliness', 'orders', 'updated_at',
'interval:1 hour') AS timeliness_pct;

5. Validity

Measure of data type and constraint compliance.

-- Validate that dates are valid and in expected range
SELECT quality_metric('validity', 'customers', 'birth_date',
'range:[1920-01-01,2010-12-31]') AS validity_pct;

6. Uniqueness

Measure of duplicate detection.

-- Check for duplicate customer records
SELECT quality_metric('uniqueness', 'customers', 'email') AS uniqueness_pct;
-- Identify duplicates
SELECT email, COUNT(*) as duplicate_count
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

7. Conformity

Measure of data conforming to business rules.

-- Check if orders conform to business rule (quantity > 0)
SELECT quality_metric('conformity', 'orders', 'quantity',
'rule:quantity > 0') AS conformity_pct;

Getting Started

Step 1: Enable Data Quality Monitoring

-- Enable quality monitoring on a table
ALTER TABLE customers ENABLE QUALITY_MONITORING;
-- Configure quality parameters
ALTER TABLE customers SET QUALITY_PARAMS (
completeness_threshold = 95,
accuracy_threshold = 98,
anomaly_sensitivity = 0.8
);

Step 2: Define Quality Rules

-- Create quality rule for completeness
CREATE QUALITY_RULE email_required AS
COLUMN customers.email
VALIDATE NOT NULL
ON VIOLATION ACTION = LOG;
-- Create quality rule for accuracy
CREATE QUALITY_RULE valid_email AS
COLUMN customers.email
VALIDATE MATCHES '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
ON VIOLATION ACTION = QUARANTINE;
-- Create business rule
CREATE QUALITY_RULE positive_amount AS
COLUMN orders.amount
VALIDATE amount > 0
ON VIOLATION ACTION = BLOCK;

Step 3: Enable Anomaly Detection

-- Enable ML-based anomaly detection
ALTER TABLE sales ENABLE ANOMALY_DETECTION;
-- Configure anomaly detection parameters
ALTER TABLE sales SET ANOMALY_PARAMS (
sensitivity = 0.85, -- 0.5 (sensitive) to 0.99 (strict)
min_baseline_records = 1000, -- Train on at least 1000 records
detection_interval = '1 hour',
models = ['isolation_forest', 'one_class_svm'] -- ML algorithms to use
);

Step 4: Configure Auto-Correction

-- Enable auto-correction for common issues
ALTER TABLE customers ENABLE AUTO_CORRECTION;
-- Configure correction strategies
ALTER TABLE customers SET CORRECTION_PARAMS (
trim_whitespace = true, -- Remove leading/trailing spaces
normalize_case = 'PROPER', -- Convert to proper case
fix_phone_format = true, -- Standardize phone numbers
remove_duplicates = true, -- Remove exact duplicates
fix_date_format = 'YYYY-MM-DD', -- Standardize dates
null_strategy = 'PRESERVE' -- How to handle nulls
);

Quality Metrics

Retrieving Quality Metrics

Real-Time Metrics

-- Get overall quality score for a table
SELECT quality_score('customers') AS quality_pct;
-- Result: 94.2
-- Get quality score for specific column
SELECT quality_score('customers', 'email') AS email_quality;
-- Get all quality dimensions
SELECT
dimension,
score,
last_updated,
status
FROM quality_metrics('customers')
ORDER BY score;

Quality Report by Date Range

-- Get quality trend over time
SELECT
DATE(measured_at) as date,
quality_score,
completeness,
accuracy,
consistency,
duplicates_found
FROM quality_history
WHERE table_name = 'customers'
AND measured_at >= NOW() - INTERVAL '30 days'
ORDER BY measured_at;

Column-Level Metrics

-- Detailed metrics for a column
SELECT
column_name,
null_count,
null_pct,
unique_count,
duplicate_count,
distinct_values,
cardinality,
min_value,
max_value,
avg_length,
data_type
FROM column_metrics('customers', 'email');

Quality Score Calculation

HeliosDB calculates overall quality using weighted metrics:

Overall Quality Score =
(Completeness × 0.25) +
(Accuracy × 0.25) +
(Consistency × 0.20) +
(Validity × 0.15) +
(Uniqueness × 0.10) +
(Conformity × 0.05)
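
To make the weighting concrete, here is the same calculation in Python (weights copied from the formula above; the per-dimension scores are made up for illustration):

```python
# Dimension weights from the overall quality score formula.
WEIGHTS = {
    "completeness": 0.25,
    "accuracy": 0.25,
    "consistency": 0.20,
    "validity": 0.15,
    "uniqueness": 0.10,
    "conformity": 0.05,
}

def overall_quality(scores):
    """Weighted average of per-dimension scores (each on a 0-100 scale)."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

scores = {
    "completeness": 98.0, "accuracy": 96.0, "consistency": 94.0,
    "validity": 99.0, "uniqueness": 100.0, "conformity": 90.0,
}
print(round(overall_quality(scores), 2))  # 96.65
```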

Data Validation

Built-In Validators

Type Validation

CREATE QUALITY_RULE type_check AS
COLUMN orders.order_date
VALIDATE TYPE = 'DATE'
ON VIOLATION ACTION = LOG;
CREATE QUALITY_RULE numeric_check AS
COLUMN products.price
VALIDATE TYPE IN ('DECIMAL', 'FLOAT')
ON VIOLATION ACTION = BLOCK;

Pattern Validation

-- Email validation
CREATE QUALITY_RULE email_format AS
COLUMN customers.email
VALIDATE PATTERN '^[^@]+@[^@]+\.[^@]+$'
ON VIOLATION ACTION = QUARANTINE;
-- Phone number (E.164 format)
CREATE QUALITY_RULE phone_format AS
COLUMN customers.phone
VALIDATE PATTERN '^\+[1-9]\d{1,14}$'
ON VIOLATION ACTION = LOG;
-- URL validation
CREATE QUALITY_RULE url_format AS
COLUMN websites.url
VALIDATE PATTERN '^https?://'
ON VIOLATION ACTION = QUARANTINE;

Range Validation

-- Date range
CREATE QUALITY_RULE birth_date_range AS
COLUMN customers.birth_date
VALIDATE BETWEEN '1920-01-01' AND CURRENT_DATE
ON VIOLATION ACTION = LOG;
-- Numeric range
CREATE QUALITY_RULE price_range AS
COLUMN products.price
VALIDATE BETWEEN 0.01 AND 999999.99
ON VIOLATION ACTION = BLOCK;
-- Age validation
CREATE QUALITY_RULE adult_age AS
COLUMN customers.age
VALIDATE >= 18
ON VIOLATION ACTION = QUARANTINE;

Business Rule Validation

-- Order rules
CREATE QUALITY_RULE order_quantity AS
COLUMN orders.quantity
VALIDATE quantity > 0
ON VIOLATION ACTION = BLOCK;
CREATE QUALITY_RULE order_amount AS
COLUMN orders.amount
VALIDATE amount > (quantity * unit_price * 0.99)
ON VIOLATION ACTION = QUARANTINE;
-- Referential integrity
CREATE QUALITY_RULE valid_customer AS
COLUMN orders.customer_id
VALIDATE EXISTS (SELECT 1 FROM customers WHERE id = customer_id)
ON VIOLATION ACTION = BLOCK;

Violation Actions

Action       Behavior                       Use Case
LOG          Log violation, accept data     Non-critical issues
WARN         Log warning, accept data       Monitor and notify
QUARANTINE   Move to quarantine table       Review before accepting
BLOCK        Reject data                    Critical business rules
AUTO_FIX     Attempt automatic correction   Fixable issues
CUSTOM       Call custom function           Complex logic
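
In an ingest pipeline these semantics amount to a dispatch step; an illustrative Python sketch covering a subset of the actions (not HeliosDB internals):

```python
from enum import Enum

class Action(Enum):
    LOG = "log"
    WARN = "warn"
    QUARANTINE = "quarantine"
    BLOCK = "block"

def apply_action(action, record, quarantine, log):
    """Route a violating record according to its rule's action.
    Returns True if the record is still accepted into the target table."""
    if action in (Action.LOG, Action.WARN):
        log.append((action.name, record))
        return True            # accepted; violation only recorded
    if action is Action.QUARANTINE:
        quarantine.append(record)
        return False           # held aside for review
    return False               # BLOCK: rejected outright

q, lg = [], []
print(apply_action(Action.LOG, {"id": 1}, q, lg))         # True
print(apply_action(Action.QUARANTINE, {"id": 2}, q, lg))  # False
```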

Anomaly Detection

ML-Based Anomaly Detection

HeliosDB uses machine learning to detect unusual patterns without predefined rules.

Statistical Anomalies

-- Detect anomalous sales amounts (statistical outliers)
SELECT
order_id,
amount,
expected_range_min,
expected_range_max,
deviation_score,
anomaly_probability
FROM detect_anomalies('orders', 'amount',
method='isolation_forest',
threshold=0.85)
WHERE anomaly_probability > 0.85
ORDER BY anomaly_probability DESC;
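
Isolation forests are one of several ways to flag statistical outliers; for intuition, here is a much simpler z-score sketch in Python (illustrative only, not what the isolation_forest method does internally):

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return (index, value) pairs more than `threshold` standard
    deviations from the mean -- a crude statistical-outlier test."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# One wildly large order among typical amounts. Note the small-sample
# caveat: a huge outlier inflates the stdev, so a looser threshold is used.
amounts = [120, 95, 110, 105, 98, 102, 5000]
print(zscore_outliers(amounts, threshold=2.0))  # [(6, 5000)]
```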

Behavioral Anomalies

-- Detect unusual customer behavior
-- (e.g., customer ordering 100x their normal quantity)
SELECT
customer_id,
order_count,
avg_order_amount,
this_order_amount,
behavior_score,
anomaly_reason
FROM detect_behavioral_anomalies('orders',
baseline='30 days',
threshold=0.80)
WHERE behavior_score > 0.80;

Time Series Anomalies

-- Detect anomalies in time series data
-- (e.g., sudden drop in website traffic)
SELECT
hour,
traffic_count,
expected_traffic,
deviation_pct,
anomaly_severity
FROM detect_time_series_anomalies('traffic_logs',
column='request_count',
interval='1 hour',
baseline='7 days',
method='ARIMA')
WHERE anomaly_severity > 0.7
ORDER BY hour DESC;

Configuring Anomaly Sensitivity

-- Sensitivity levels: 0.5 (very sensitive) to 0.99 (very strict)
-- Sensitive detection (catches more, including false positives)
ALTER TABLE orders SET ANOMALY_PARAMS (
sensitivity = 0.60,
min_baseline_records = 100
);
-- Balanced detection
ALTER TABLE orders SET ANOMALY_PARAMS (
sensitivity = 0.80,
min_baseline_records = 500
);
-- Strict detection (catches only major anomalies)
ALTER TABLE orders SET ANOMALY_PARAMS (
sensitivity = 0.95,
min_baseline_records = 1000
);

Auto-Correction

Automatic Fixing Strategies

Text Normalization

-- Enable automatic text corrections
ALTER TABLE customers SET CORRECTION_PARAMS (
trim_whitespace = true, -- Remove leading/trailing spaces
normalize_case = 'PROPER', -- Convert to Title Case
normalize_spaces = true -- Convert multiple spaces to single
);
-- Example:
-- " john DOE " → "John Doe"
-- "john smith" → "John Smith"

Format Standardization

-- Enable format standardization
ALTER TABLE customers SET CORRECTION_PARAMS (
fix_phone_format = true, -- Standardize phone numbers
fix_date_format = 'YYYY-MM-DD', -- Standardize dates
fix_email_format = true -- Normalize email (lowercase)
);
-- Examples:
-- "555.123.4567" → "+15551234567" (E.164)
-- "1/15/2024" → "2024-01-15"
-- "John.Doe@EXAMPLE.COM" → "john.doe@example.com"

Duplicate Removal

-- Enable duplicate detection and removal
ALTER TABLE customers SET CORRECTION_PARAMS (
remove_exact_duplicates = true, -- Remove identical rows
remove_near_duplicates = true, -- Remove similar rows
duplicate_threshold = 0.95, -- 95% similarity
keep_strategy = 'FIRST' -- Keep first occurrence
);
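
The duplicate_threshold is a similarity score; one way such a ratio can be computed is with Python's standard difflib (the actual HeliosDB similarity measure is not specified here):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def near_duplicates(rows, threshold=0.95):
    """All index pairs of rows whose similarity meets the threshold."""
    return [(i, j)
            for i in range(len(rows))
            for j in range(i + 1, len(rows))
            if similarity(rows[i], rows[j]) >= threshold]

names = ["John Smith", "John  Smith", "Jane Doe"]
print(near_duplicates(names, threshold=0.9))  # [(0, 1)]
```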

Type Coercion

-- Enable safe type coercion
ALTER TABLE orders SET CORRECTION_PARAMS (
auto_coerce_types = true,
string_to_number = true,
string_to_date = true,
invalid_handling = 'NULL' -- What to do with values that can't be coerced
);
-- Examples:
-- " 123 " → 123 (string to number)
-- "2024-01-15" → DATE '2024-01-15'

Custom Correction Functions

-- Define custom correction logic
CREATE CORRECTION_FUNCTION fix_phone_number(phone_text VARCHAR)
RETURNS VARCHAR AS $$
DECLARE cleaned VARCHAR;
BEGIN
-- Remove all non-digits
cleaned := REGEXP_REPLACE(phone_text, '[^0-9]', '');
-- Ensure 10 digits (US)
IF LENGTH(cleaned) = 10 THEN
RETURN '+1' || cleaned;
ELSIF LENGTH(cleaned) = 11 AND cleaned LIKE '1%' THEN
RETURN '+' || cleaned;
ELSE
RETURN NULL;
END IF;
END;
$$ LANGUAGE plpgsql;
-- Apply to quality rule
CREATE QUALITY_RULE phone_correction AS
COLUMN customers.phone
AUTO_CORRECT USING fix_phone_number(phone)
ON VIOLATION ACTION = AUTO_FIX;
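
A quick way to sanity-check correction logic before deploying it is to mirror it outside the database; a Python equivalent of the fix_phone_number function above:

```python
import re

def fix_phone_number(phone_text):
    """Mirror of the plpgsql correction: strip non-digits, then normalize
    10-digit (US) or 11-digit '1'-prefixed numbers to E.164."""
    cleaned = re.sub(r"\D", "", phone_text)
    if len(cleaned) == 10:
        return "+1" + cleaned
    if len(cleaned) == 11 and cleaned.startswith("1"):
        return "+" + cleaned
    return None  # unfixable -> NULL, matching the SQL function

print(fix_phone_number("555.123.4567"))    # "+15551234567"
print(fix_phone_number("1-555-123-4567"))  # "+15551234567"
print(fix_phone_number("12345"))           # None
```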

Monitoring & Alerting

Quality Dashboards

-- Create a quality dashboard view
CREATE VIEW quality_dashboard AS
SELECT
t.table_name,
quality_score(t.table_name) as overall_quality,
(SELECT COUNT(*) FROM quality_violations v
WHERE v.table_name = t.table_name
AND v.severity = 'CRITICAL') as critical_issues,
(SELECT COUNT(*) FROM quality_violations v
WHERE v.table_name = t.table_name
AND v.severity = 'WARNING') as warnings,
CASE
WHEN quality_score(t.table_name) >= 95 THEN 'EXCELLENT'
WHEN quality_score(t.table_name) >= 85 THEN 'GOOD'
WHEN quality_score(t.table_name) >= 75 THEN 'FAIR'
ELSE 'POOR'
END as quality_status,
CURRENT_TIMESTAMP as measured_at
FROM information_schema.tables t
WHERE t.table_schema = 'public';
-- Query the dashboard
SELECT * FROM quality_dashboard ORDER BY overall_quality;

Setting Up Alerts

-- Create alert for quality drops below threshold
CREATE ALERT quality_alert_customers AS
WHEN quality_score('customers') < 90
THEN NOTIFY 'data-quality-team'
WITH MESSAGE = 'Customer data quality below threshold'
SEVERITY = 'HIGH';
-- Create alert for critical violations
CREATE ALERT critical_violations_alert AS
WHEN (SELECT COUNT(*) FROM quality_violations
WHERE severity = 'CRITICAL') > 10
THEN NOTIFY 'data-ops-team'
WITH MESSAGE = 'Critical data quality violations detected'
SEVERITY = 'CRITICAL';
-- Enable alerts
ALTER ALERT quality_alert_customers ENABLE;
ALTER ALERT critical_violations_alert ENABLE;

Monitoring Violations

-- View recent violations
SELECT
rule_name,
table_name,
column_name,
violation_type,
affected_records,
severity,
violation_time,
corrected
FROM quality_violations
WHERE violation_time >= NOW() - INTERVAL '24 hours'
ORDER BY violation_time DESC;
-- Get violation summary
SELECT
rule_name,
COUNT(*) as violation_count,
SUM(affected_records) as total_records_affected,
AVG(CAST(corrected AS INT)) as correction_rate
FROM quality_violations
WHERE violation_time >= NOW() - INTERVAL '30 days'
GROUP BY rule_name
ORDER BY violation_count DESC;

Best Practices

1. Start with Critical Columns

-- Prioritize core business columns
ALTER TABLE customers ENABLE QUALITY_MONITORING;
CREATE QUALITY_RULE email_required AS
COLUMN customers.email
VALIDATE NOT NULL AND PATTERN '^.+@.+\..+$'
ON VIOLATION ACTION = BLOCK;
CREATE QUALITY_RULE phone_required AS
COLUMN customers.phone
VALIDATE NOT NULL
ON VIOLATION ACTION = LOG;

2. Use Tiered Thresholds

-- Different rules for different data types
-- Critical business data: strict rules
ALTER TABLE financial_transactions SET QUALITY_PARAMS (
completeness_threshold = 99,
accuracy_threshold = 99.5,
violation_action = 'BLOCK'
);
-- Reference data: moderate rules
ALTER TABLE product_catalog SET QUALITY_PARAMS (
completeness_threshold = 95,
accuracy_threshold = 98,
violation_action = 'LOG'
);
-- Analytical data: lenient rules
ALTER TABLE analytics_events SET QUALITY_PARAMS (
completeness_threshold = 80,
accuracy_threshold = 90,
violation_action = 'WARN'
);

3. Regular Baseline Updates

-- Update ML anomaly detection baselines weekly
ALTER TABLE sales REFRESH ANOMALY_BASELINE
USING DATA FROM (NOW() - INTERVAL '7 days', NOW());
-- Monitor baseline effectiveness
SELECT
baseline_date,
baseline_records,
anomalies_detected,
false_positive_rate,
effectiveness_score
FROM anomaly_baseline_metrics('sales');

4. Test Rules Before Enforcement

-- Test quality rule in DRY RUN mode
CREATE QUALITY_RULE test_rule AS
COLUMN orders.amount
VALIDATE amount > 0
ON VIOLATION ACTION = LOG
MODE = 'DRY_RUN'; -- Don't actually enforce yet
-- After 7 days, review violations and enable enforcement
ALTER QUALITY_RULE test_rule MODE = 'ENFORCE';

5. Document Quality Rules

-- Add documentation to rules
CREATE QUALITY_RULE well_documented_rule AS
COLUMN customers.age
VALIDATE age >= 18
ON VIOLATION ACTION = BLOCK
DESCRIPTION = 'Enforces legal age requirement for customer accounts'
BUSINESS_OWNER = 'compliance-team@company.com'
SLA = '99% compliance'
CREATED_DATE = '2025-01-15'
REASON = 'Legal compliance with child protection laws';

Troubleshooting

Issue 1: Quality Score Dropping Suddenly

Symptoms:

  • Quality score drops from 95% to 70% overnight
  • Multiple new violations reported

Diagnosis:

-- Check what changed
SELECT
rule_name,
violation_count,
first_violation_time,
last_violation_time
FROM quality_violations_by_rule
WHERE violation_time >= NOW() - INTERVAL '24 hours'
ORDER BY violation_count DESC;
-- Check data changes
SELECT
column_name,
null_count,
null_pct_change,
distribution_change
FROM column_metrics('table_name')
WHERE measured_at >= NOW() - INTERVAL '24 hours';

Solutions:

-- If data import caused issues, temporarily adjust thresholds
ALTER TABLE affected_table SET QUALITY_PARAMS (
completeness_threshold = 80 -- Reduced from 95
);
-- Investigate root cause
-- Check ETL logs, recent schema changes, new data sources
-- After fixing, restore thresholds
ALTER TABLE affected_table SET QUALITY_PARAMS (
completeness_threshold = 95
);

Issue 2: Too Many False Positives in Anomaly Detection

Symptoms:

  • Anomaly detection alerts for normal, expected variations
  • Alert fatigue from false positives

Solution:

-- Raise sensitivity toward strict to reduce false positives
-- (recall the scale: 0.5 = very sensitive, 0.99 = very strict)
ALTER TABLE orders SET ANOMALY_PARAMS (
sensitivity = 0.90, -- Increased from 0.80
min_baseline_records = 2000 -- Increased from 1000
);
-- Or exclude certain patterns
ALTER TABLE orders ADD ANOMALY_EXCLUSION AS
WHERE order_type = 'BULK' -- Bulk orders are naturally different
OR customer_segment = 'VIP'; -- VIPs may have unusual patterns

Issue 3: Auto-Correction Creating More Problems

Symptoms:

  • Auto-corrections are changing data incorrectly
  • Business values being modified unexpectedly

Solution:

-- Disable auto-correction and review violations first
ALTER TABLE customers DISABLE AUTO_CORRECTION;
-- Review what would have been corrected
SELECT
column_name,
original_value,
corrected_value,
confidence_score
FROM correction_preview('customers')
WHERE corrected_value != original_value;
-- Re-enable with more conservative settings
ALTER TABLE customers SET CORRECTION_PARAMS (
trim_whitespace = true, -- Safe
normalize_case = false, -- Disable case normalization
null_strategy = 'PRESERVE' -- Never create nulls
);
ALTER TABLE customers ENABLE AUTO_CORRECTION;

Issue 4: Performance Impact of Quality Monitoring

Symptoms:

  • Query performance degraded after enabling quality monitoring
  • Monitoring overhead too high

Solution:

-- Reduce monitoring frequency
ALTER TABLE large_table SET QUALITY_PARAMS (
monitoring_frequency = '1 hour', -- Changed from '1 minute'
sampling_rate = 0.1 -- Monitor 10% of data
);
-- Monitor only critical columns
ALTER TABLE large_table ENABLE QUALITY_MONITORING
ON COLUMNS (customer_id, order_date); -- Not all columns
-- Disable anomaly detection during peak hours
ALTER TABLE orders SET ANOMALY_PARAMS (
enabled = false
) WITH SCHEDULE DURING '09:00' TO '17:00' UTC;

Summary

HeliosDB Data Quality Management provides a comprehensive platform for:

  • Monitoring data quality dimensions across your database
  • Validating data against business and technical rules
  • Detecting anomalies using machine learning
  • Correcting common data quality issues automatically
  • Alerting teams to quality issues in real-time
  • Reporting on compliance and quality trends

By implementing data quality management, you can ensure your database meets the highest standards of accuracy, consistency, and reliability.


Related Documentation: