HeliosDB Data Quality Management Guide
Version: 1.0 Last Updated: 2025-11-30 Status: Complete
Table of Contents
- Overview
- Core Concepts
- Getting Started
- Quality Metrics
- Data Validation
- Anomaly Detection
- Auto-Correction
- Monitoring & Alerting
- Best Practices
- Troubleshooting
Overview
HeliosDB Data Quality Management provides enterprise-grade data quality monitoring, validation, and automated correction. It uses machine learning to detect anomalies, enforce data standards, and maintain data integrity across your database.
Key Features
- Automated Quality Monitoring - Continuous quality metric tracking
- ML-Based Anomaly Detection - Identify unusual patterns and outliers
- Validation Rule Engine - Define and enforce quality rules
- Auto-Correction - Automatically fix common data quality issues
- Quality Dashboards - Real-time quality visualization
- Compliance Reporting - Track compliance with quality standards
- Predictive Profiling - ML models for pattern learning and detection
Use Cases
- Data governance and compliance (GDPR, HIPAA, SOC 2)
- Master data management (MDM)
- Data lake quality assurance
- ETL process monitoring
- Regulatory reporting accuracy
- Customer data quality
Core Concepts
Quality Dimensions
HeliosDB data quality management tracks seven key dimensions:
1. Completeness
Measure of non-null, non-empty values in a field.
```sql
-- Check completeness of customer_email
SELECT quality_metric('completeness', 'customers', 'email') AS completeness_pct;
-- Result: 98.5% of records have non-null email
```
Thresholds:
- Acceptable: 95%+
- Warning: 80-95%
- Critical: <80%
2. Accuracy
Measure of data conforming to expected format/range.
```sql
-- Validate phone numbers match E.164 format
SELECT quality_metric('accuracy', 'customers', 'phone_number', 'pattern:^\+[1-9]\d{1,14}$') AS accuracy_pct;
```
3. Consistency
Measure of uniform data representation across sources.
```sql
-- Check if customer names are consistent (no mixed cases)
SELECT quality_metric('consistency', 'customers', 'name') AS consistency_pct;
```
4. Timeliness
Measure of data freshness and update frequency.
```sql
-- Check if order records are updated within 1 hour
SELECT quality_metric('timeliness', 'orders', 'updated_at', 'interval:1 hour') AS timeliness_pct;
```
5. Validity
Measure of data type and constraint compliance.
```sql
-- Validate that dates are valid and in expected range
SELECT quality_metric('validity', 'customers', 'birth_date', 'range:[1920-01-01,2010-12-31]') AS validity_pct;
```
6. Uniqueness
Measure of duplicate detection.
```sql
-- Check for duplicate customer records
SELECT quality_metric('uniqueness', 'customers', 'email') AS uniqueness_pct;

-- Identify duplicates
SELECT email, COUNT(*) AS duplicate_count
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
```
7. Conformity
Measure of data conforming to business rules.
```sql
-- Check if orders conform to business rule (quantity > 0)
SELECT quality_metric('conformity', 'orders', 'quantity', 'rule:quantity > 0') AS conformity_pct;
```
Getting Started
Step 1: Enable Data Quality Monitoring
```sql
-- Enable quality monitoring on a table
ALTER TABLE customers ENABLE QUALITY_MONITORING;
```
```sql
-- Configure quality parameters
ALTER TABLE customers SET QUALITY_PARAMS (
  completeness_threshold = 95,
  accuracy_threshold = 98,
  anomaly_sensitivity = 0.8
);
```
Step 2: Define Quality Rules
```sql
-- Create quality rule for completeness
CREATE QUALITY_RULE email_required AS
  COLUMN customers.email
  VALIDATE NOT NULL
  ON VIOLATION ACTION = LOG;
```
```sql
-- Create quality rule for accuracy
CREATE QUALITY_RULE valid_email AS
  COLUMN customers.email
  VALIDATE MATCHES '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
  ON VIOLATION ACTION = QUARANTINE;
```
```sql
-- Create business rule
CREATE QUALITY_RULE positive_amount AS
  COLUMN orders.amount
  VALIDATE amount > 0
  ON VIOLATION ACTION = BLOCK;
```
Step 3: Enable Anomaly Detection
```sql
-- Enable ML-based anomaly detection
ALTER TABLE sales ENABLE ANOMALY_DETECTION;
```
```sql
-- Configure anomaly detection parameters
ALTER TABLE sales SET ANOMALY_PARAMS (
  sensitivity = 0.85,           -- 0.5 (sensitive) to 0.99 (strict)
  min_baseline_records = 1000,  -- Train on at least 1000 records
  detection_interval = '1 hour',
  models = ['isolation_forest', 'one_class_svm']  -- ML algorithms to use
);
```
Step 4: Configure Auto-Correction
```sql
-- Enable auto-correction for common issues
ALTER TABLE customers ENABLE AUTO_CORRECTION;
```
```sql
-- Configure correction strategies
ALTER TABLE customers SET CORRECTION_PARAMS (
  trim_whitespace = true,          -- Remove leading/trailing spaces
  normalize_case = 'PROPER',       -- Convert to proper case
  fix_phone_format = true,         -- Standardize phone numbers
  remove_duplicates = true,        -- Remove exact duplicates
  fix_date_format = 'YYYY-MM-DD',  -- Standardize dates
  null_strategy = 'PRESERVE'       -- How to handle nulls
);
```
Quality Metrics
Retrieving Quality Metrics
Real-Time Metrics
```sql
-- Get overall quality score for a table
SELECT quality_score('customers') AS quality_pct;
-- Result: 94.2
```
```sql
-- Get quality score for a specific column
SELECT quality_score('customers', 'email') AS email_quality;
```
```sql
-- Get all quality dimensions
SELECT dimension, score, last_updated, status
FROM quality_metrics('customers')
ORDER BY score;
```
Quality Report by Date Range
```sql
-- Get quality trend over time
SELECT
  DATE(measured_at) AS date,
  quality_score,
  completeness,
  accuracy,
  consistency,
  duplicates_found
FROM quality_history
WHERE table_name = 'customers'
  AND measured_at >= NOW() - INTERVAL '30 days'
ORDER BY measured_at;
```
Column-Level Metrics
```sql
-- Detailed metrics for a column
SELECT
  column_name, null_count, null_pct,
  unique_count, duplicate_count, distinct_values,
  cardinality, min_value, max_value,
  avg_length, data_type
FROM column_metrics('customers', 'email');
```
Quality Score Calculation
HeliosDB calculates overall quality using weighted metrics:
```
Overall Quality Score = (Completeness × 0.25)
                      + (Accuracy     × 0.25)
                      + (Consistency  × 0.20)
                      + (Validity     × 0.15)
                      + (Uniqueness   × 0.10)
                      + (Conformity   × 0.05)
```
Data Validation
Built-In Validators
Type Validation
```sql
CREATE QUALITY_RULE type_check AS
  COLUMN orders.order_date
  VALIDATE TYPE = 'DATE'
  ON VIOLATION ACTION = LOG;
```
```sql
CREATE QUALITY_RULE numeric_check AS
  COLUMN products.price
  VALIDATE TYPE IN ('DECIMAL', 'FLOAT')
  ON VIOLATION ACTION = BLOCK;
```
Pattern Validation
```sql
-- Email validation
CREATE QUALITY_RULE email_format AS
  COLUMN customers.email
  VALIDATE PATTERN '^[^@]+@[^@]+\.[^@]+$'
  ON VIOLATION ACTION = QUARANTINE;
```
```sql
-- Phone number (E.164 format)
CREATE QUALITY_RULE phone_format AS
  COLUMN customers.phone
  VALIDATE PATTERN '^\+[1-9]\d{1,14}$'
  ON VIOLATION ACTION = LOG;
```
```sql
-- URL validation
CREATE QUALITY_RULE url_format AS
  COLUMN websites.url
  VALIDATE PATTERN '^https?://'
  ON VIOLATION ACTION = QUARANTINE;
```
Range Validation
```sql
-- Date range
CREATE QUALITY_RULE birth_date_range AS
  COLUMN customers.birth_date
  VALIDATE BETWEEN '1920-01-01' AND CURRENT_DATE
  ON VIOLATION ACTION = LOG;
```
```sql
-- Numeric range
CREATE QUALITY_RULE price_range AS
  COLUMN products.price
  VALIDATE BETWEEN 0.01 AND 999999.99
  ON VIOLATION ACTION = BLOCK;
```
```sql
-- Age validation
CREATE QUALITY_RULE adult_age AS
  COLUMN customers.age
  VALIDATE >= 18
  ON VIOLATION ACTION = QUARANTINE;
```
Business Rule Validation
```sql
-- Order rules
CREATE QUALITY_RULE order_quantity AS
  COLUMN orders.quantity
  VALIDATE quantity > 0
  ON VIOLATION ACTION = BLOCK;
```
```sql
CREATE QUALITY_RULE order_amount AS
  COLUMN orders.amount
  VALIDATE amount > (quantity * unit_price * 0.99)
  ON VIOLATION ACTION = QUARANTINE;
```
```sql
-- Referential integrity
CREATE QUALITY_RULE valid_customer AS
  COLUMN orders.customer_id
  VALIDATE EXISTS (SELECT 1 FROM customers WHERE id = customer_id)
  ON VIOLATION ACTION = BLOCK;
```
Violation Actions
| Action | Behavior | Use Case |
|---|---|---|
| LOG | Log violation, accept data | Non-critical issues |
| WARN | Log warning, accept data | Monitor and notify |
| QUARANTINE | Move to quarantine table | Review before accepting |
| BLOCK | Reject data | Critical business rules |
| AUTO_FIX | Attempt automatic correction | Fixable issues |
| CUSTOM | Call custom function | Complex logic |
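As a client-side illustration (not part of HeliosDB itself), the action semantics in the table above can be sketched as a small dispatch map; the `dispose` function and its disposition names are hypothetical:

```python
# Hypothetical sketch of the dispatch behind ON VIOLATION ACTION.
# The action names come from the table above; the dispositions are illustrative.
def dispose(action: str) -> str:
    """Return what happens to a row that violates a rule."""
    dispositions = {
        "LOG": "accept",       # violation logged, data kept
        "WARN": "accept",      # warning raised, data kept
        "QUARANTINE": "hold",  # moved to a quarantine table for review
        "BLOCK": "reject",     # the write is refused
        "AUTO_FIX": "correct", # a correction is attempted before accepting
        "CUSTOM": "custom",    # a user-defined function decides
    }
    return dispositions[action]
```

The key distinction to internalize: `LOG` and `WARN` never stop data from landing, while `QUARANTINE` and `BLOCK` keep it out of the target table.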
Anomaly Detection
ML-Based Anomaly Detection
HeliosDB uses machine learning to detect unusual patterns without requiring predefined rules.
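HeliosDB's models (isolation forest, one-class SVM, ARIMA) are more sophisticated, but the core idea of statistical outlier detection can be illustrated with a simple z-score check in Python; this is an illustration only, not the engine's actual algorithm:

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Flag values whose distance from the mean exceeds `threshold` std devs."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]

amounts = [10, 11, 9, 10, 12, 100]
print(zscore_outliers(amounts))  # [100]
```

Model-based detectors improve on this by handling multi-modal distributions, seasonality, and multivariate interactions that a single global mean/stdev cannot capture.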
Statistical Anomalies
```sql
-- Detect anomalous sales amounts (statistical outliers)
SELECT
  order_id, amount,
  expected_range_min, expected_range_max,
  deviation_score, anomaly_probability
FROM detect_anomalies('orders', 'amount', method='isolation_forest', threshold=0.85)
WHERE anomaly_probability > 0.85
ORDER BY anomaly_probability DESC;
```
Behavioral Anomalies
```sql
-- Detect unusual customer behavior
-- (e.g., a customer ordering 100x their normal quantity)
SELECT
  customer_id, order_count, avg_order_amount,
  this_order_amount, behavior_score, anomaly_reason
FROM detect_behavioral_anomalies('orders', baseline='30 days', threshold=0.80)
WHERE behavior_score > 0.80;
```
Time Series Anomalies
```sql
-- Detect anomalies in time series data
-- (e.g., a sudden drop in website traffic)
SELECT
  hour, traffic_count, expected_traffic,
  deviation_pct, anomaly_severity
FROM detect_time_series_anomalies(
  'traffic_logs',
  column = 'request_count',
  interval = '1 hour',
  baseline = '7 days',
  method = 'ARIMA'
)
WHERE anomaly_severity > 0.7
ORDER BY hour DESC;
```
Configuring Anomaly Sensitivity
Sensitivity ranges from 0.5 (very sensitive) to 0.99 (very strict):
```sql
-- Sensitive detection (catches more, including false positives)
ALTER TABLE orders SET ANOMALY_PARAMS (
  sensitivity = 0.60,
  min_baseline_records = 100
);
```
```sql
-- Balanced detection
ALTER TABLE orders SET ANOMALY_PARAMS (
  sensitivity = 0.80,
  min_baseline_records = 500
);
```
```sql
-- Strict detection (catches only major anomalies)
ALTER TABLE orders SET ANOMALY_PARAMS (
  sensitivity = 0.95,
  min_baseline_records = 1000
);
```
Auto-Correction
Automatic Fixing Strategies
Text Normalization
```sql
-- Enable automatic text corrections
ALTER TABLE customers SET CORRECTION_PARAMS (
  trim_whitespace = true,     -- Remove leading/trailing spaces
  normalize_case = 'PROPER',  -- Convert to Title Case
  normalize_spaces = true     -- Convert multiple spaces to single
);
```
```sql
-- Examples:
-- "  john DOE  "  →  "John Doe"
-- "john smith"    →  "John Smith"
```
Format Standardization
```sql
-- Enable format standardization
ALTER TABLE customers SET CORRECTION_PARAMS (
  fix_phone_format = true,         -- Standardize phone numbers
  fix_date_format = 'YYYY-MM-DD',  -- Standardize dates
  fix_email_format = true          -- Normalize email (lowercase)
);
```
```sql
-- Examples:
-- "555.123.4567"          →  "+15551234567" (E.164)
-- "1/15/2024"             →  "2024-01-15"
-- "John.Doe@EXAMPLE.COM"  →  "john.doe@example.com"
```
Duplicate Removal
```sql
-- Enable duplicate detection and removal
ALTER TABLE customers SET CORRECTION_PARAMS (
  remove_exact_duplicates = true,  -- Remove identical rows
  remove_near_duplicates = true,   -- Remove similar rows
  duplicate_threshold = 0.95,      -- 95% similarity
  keep_strategy = 'FIRST'          -- Keep the first occurrence
);
```
Type Coercion
```sql
-- Enable safe type coercion
ALTER TABLE orders SET CORRECTION_PARAMS (
  auto_coerce_types = true,
  string_to_number = true,
  string_to_date = true,
  invalid_handling = 'NULL'  -- What to do with values that can't be coerced
);
```
```sql
-- Examples:
-- "  123  "     →  123 (string to number)
-- "2024-01-15"  →  DATE '2024-01-15'
```
Custom Correction Functions
```sql
-- Define custom correction logic
CREATE CORRECTION_FUNCTION fix_phone_number(phone_text VARCHAR)
RETURNS VARCHAR AS $$
DECLARE
  cleaned VARCHAR;
BEGIN
  -- Remove all non-digits ('g' flag replaces every match, not just the first)
  cleaned := REGEXP_REPLACE(phone_text, '[^0-9]', '', 'g');

  -- Ensure 10 digits (US)
  IF LENGTH(cleaned) = 10 THEN
    RETURN '+1' || cleaned;
  ELSIF LENGTH(cleaned) = 11 AND cleaned LIKE '1%' THEN
    RETURN '+' || cleaned;
  ELSE
    RETURN NULL;
  END IF;
END;
$$ LANGUAGE plpgsql;
```
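For readers more comfortable outside SQL, the same normalization logic can be sketched in Python (illustrative only; in HeliosDB the `CORRECTION_FUNCTION` above is the production path):

```python
import re

def fix_phone_number(phone_text: str):
    """Normalize a US phone number to E.164, or return None if unrecognizable."""
    cleaned = re.sub(r"[^0-9]", "", phone_text)  # strip all non-digits
    if len(cleaned) == 10:                        # bare 10-digit US number
        return "+1" + cleaned
    if len(cleaned) == 11 and cleaned.startswith("1"):
        return "+" + cleaned                      # already carries the US country code
    return None                                   # can't safely correct

print(fix_phone_number("555.123.4567"))  # +15551234567
```

Returning `None` (NULL in the SQL version) for anything else is deliberate: a correction function should refuse to guess rather than fabricate a plausible-looking but wrong value.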
```sql
-- Apply to quality rule
CREATE QUALITY_RULE phone_correction AS
  COLUMN customers.phone
  AUTO_CORRECT USING fix_phone_number(phone)
  ON VIOLATION ACTION = AUTO_FIX;
```
Monitoring & Alerting
Quality Dashboards
```sql
-- Create a quality dashboard view
CREATE VIEW quality_dashboard AS
SELECT
  t.table_name,
  quality_score(t.table_name) AS overall_quality,
  (SELECT COUNT(*) FROM quality_violations v
     WHERE v.table_name = t.table_name AND v.severity = 'CRITICAL') AS critical_issues,
  (SELECT COUNT(*) FROM quality_violations v
     WHERE v.table_name = t.table_name AND v.severity = 'WARNING') AS warnings,
  CASE
    WHEN quality_score(t.table_name) >= 95 THEN 'EXCELLENT'
    WHEN quality_score(t.table_name) >= 85 THEN 'GOOD'
    WHEN quality_score(t.table_name) >= 75 THEN 'FAIR'
    ELSE 'POOR'
  END AS quality_status,
  CURRENT_TIMESTAMP AS measured_at
FROM information_schema.tables t
WHERE t.table_schema = 'public';
```
```sql
-- Query the dashboard
SELECT * FROM quality_dashboard ORDER BY overall_quality;
```
Setting Up Alerts
```sql
-- Create alert for quality dropping below threshold
CREATE ALERT quality_alert_customers AS
  WHEN quality_score('customers') < 90
  THEN NOTIFY 'data-quality-team'
  WITH MESSAGE = 'Customer data quality below threshold'
       SEVERITY = 'HIGH';
```
```sql
-- Create alert for critical violations
CREATE ALERT critical_violations_alert AS
  WHEN (SELECT COUNT(*) FROM quality_violations WHERE severity = 'CRITICAL') > 10
  THEN NOTIFY 'data-ops-team'
  WITH MESSAGE = 'Critical data quality violations detected'
       SEVERITY = 'CRITICAL';
```
```sql
-- Enable alerts
ALTER ALERT quality_alert_customers ENABLE;
ALTER ALERT critical_violations_alert ENABLE;
```
Monitoring Violations
```sql
-- View recent violations
SELECT
  rule_name, table_name, column_name,
  violation_type, affected_records, severity,
  violation_time, corrected
FROM quality_violations
WHERE violation_time >= NOW() - INTERVAL '24 hours'
ORDER BY violation_time DESC;
```
```sql
-- Get violation summary
SELECT
  rule_name,
  COUNT(*) AS violation_count,
  SUM(affected_records) AS total_records_affected,
  AVG(CAST(corrected AS INT)) AS correction_rate
FROM quality_violations
WHERE violation_time >= NOW() - INTERVAL '30 days'
GROUP BY rule_name
ORDER BY violation_count DESC;
```
Best Practices
1. Start with Critical Columns
```sql
-- Prioritize core business columns
ALTER TABLE customers ENABLE QUALITY_MONITORING;

CREATE QUALITY_RULE email_required AS
  COLUMN customers.email
  VALIDATE NOT NULL AND PATTERN '^.+@.+\..+$'
  ON VIOLATION ACTION = BLOCK;
```
```sql
CREATE QUALITY_RULE phone_required AS
  COLUMN customers.phone
  VALIDATE NOT NULL
  ON VIOLATION ACTION = LOG;
```
2. Use Tiered Thresholds
```sql
-- Different rules for different data types
-- Critical business data: strict rules
ALTER TABLE financial_transactions SET QUALITY_PARAMS (
  completeness_threshold = 99,
  accuracy_threshold = 99.5,
  violation_action = 'BLOCK'
);
```
```sql
-- Reference data: moderate rules
ALTER TABLE product_catalog SET QUALITY_PARAMS (
  completeness_threshold = 95,
  accuracy_threshold = 98,
  violation_action = 'LOG'
);
```
```sql
-- Analytical data: lenient rules
ALTER TABLE analytics_events SET QUALITY_PARAMS (
  completeness_threshold = 80,
  accuracy_threshold = 90,
  violation_action = 'WARN'
);
```
3. Regular Baseline Updates
```sql
-- Update ML anomaly detection baselines weekly
ALTER TABLE sales REFRESH ANOMALY_BASELINE
  USING DATA FROM (NOW() - INTERVAL '7 days', NOW());
```
```sql
-- Monitor baseline effectiveness
SELECT
  baseline_date, baseline_records, anomalies_detected,
  false_positive_rate, effectiveness_score
FROM anomaly_baseline_metrics('sales');
```
4. Test Rules Before Enforcement
```sql
-- Test quality rule in DRY RUN mode
CREATE QUALITY_RULE test_rule AS
  COLUMN orders.amount
  VALIDATE amount > 0
  ON VIOLATION ACTION = LOG
  MODE = 'DRY_RUN';  -- Don't actually enforce yet
```
```sql
-- After 7 days, review violations and enable enforcement
ALTER QUALITY_RULE test_rule MODE = 'ENFORCE';
```
5. Document Quality Rules
```sql
-- Add documentation to rules
CREATE QUALITY_RULE well_documented_rule AS
  COLUMN customers.age
  VALIDATE age >= 18
  ON VIOLATION ACTION = BLOCK
  DESCRIPTION = 'Enforces legal age requirement for customer accounts'
  BUSINESS_OWNER = 'compliance-team@company.com'
  SLA = '99% compliance'
  CREATED_DATE = '2025-01-15'
  REASON = 'Legal compliance with child protection laws';
```
Troubleshooting
Issue 1: Quality Score Dropping Suddenly
Symptoms:
- Quality score drops from 95% to 70% overnight
- Multiple new violations reported
Diagnosis:
```sql
-- Check what changed
SELECT
  rule_name, violation_count,
  first_violation_time, last_violation_time
FROM quality_violations_by_rule
WHERE violation_time >= NOW() - INTERVAL '24 hours'
ORDER BY violation_count DESC;
```
```sql
-- Check data changes
SELECT
  column_name, null_count,
  null_pct_change, distribution_change
FROM column_metrics('table_name')
WHERE measured_at >= NOW() - INTERVAL '24 hours';
```
Solutions:
```sql
-- If a data import caused the issues, temporarily adjust thresholds
ALTER TABLE affected_table SET QUALITY_PARAMS (
  completeness_threshold = 80  -- Reduced from 95
);
```
```sql
-- Investigate root cause:
-- check ETL logs, recent schema changes, and new data sources.
-- After fixing, restore thresholds
ALTER TABLE affected_table SET QUALITY_PARAMS (
  completeness_threshold = 95
);
```
Issue 2: Too Many False Positives in Anomaly Detection
Symptoms:
- Anomaly detection alerts for normal, expected variations
- Alert fatigue from false positives
Solution:
```sql
-- Raise the sensitivity threshold (stricter detection) to reduce false positives
ALTER TABLE orders SET ANOMALY_PARAMS (
  sensitivity = 0.90,           -- Increased from 0.80
  min_baseline_records = 2000   -- Increased from 1000
);
```
```sql
-- Or exclude certain patterns
ALTER TABLE orders ADD ANOMALY_EXCLUSION AS
  WHERE order_type = 'BULK'        -- Bulk orders are naturally different
     OR customer_segment = 'VIP';  -- VIPs may have unusual patterns
```
Issue 3: Auto-Correction Creating More Problems
Symptoms:
- Auto-corrections are changing data incorrectly
- Business values being modified unexpectedly
Solution:
```sql
-- Disable auto-correction and review violations first
ALTER TABLE customers DISABLE AUTO_CORRECTION;
```
```sql
-- Review what would have been corrected
SELECT column_name, original_value, corrected_value, confidence_score
FROM correction_preview('customers')
WHERE corrected_value != original_value;
```
```sql
-- Re-enable with more conservative settings
ALTER TABLE customers SET CORRECTION_PARAMS (
  trim_whitespace = true,      -- Safe
  normalize_case = false,      -- Disable case normalization
  null_strategy = 'PRESERVE'   -- Never create nulls
);
```
```sql
ALTER TABLE customers ENABLE AUTO_CORRECTION;
```
Issue 4: Performance Impact of Quality Monitoring
Symptoms:
- Query performance degraded after enabling quality monitoring
- Monitoring overhead too high
Solution:
```sql
-- Reduce monitoring frequency
ALTER TABLE large_table SET QUALITY_PARAMS (
  monitoring_frequency = '1 hour',  -- Changed from '1 minute'
  sampling_rate = 0.1               -- Monitor 10% of the data
);
```
```sql
-- Monitor only critical columns
ALTER TABLE large_table ENABLE QUALITY_MONITORING
  ON COLUMNS (customer_id, order_date);  -- Not all columns
```
```sql
-- Disable anomaly detection during peak hours
ALTER TABLE orders SET ANOMALY_PARAMS (
  enabled = false
) WITH SCHEDULE DURING '09:00' TO '17:00' UTC;
```
Summary
HeliosDB Data Quality Management provides a comprehensive platform for:
- Monitoring data quality dimensions across your database
- Validating data against business and technical rules
- Detecting anomalies using machine learning
- Correcting common data quality issues automatically
- Alerting teams to quality issues in real-time
- Reporting on compliance and quality trends
By implementing data quality management, you can ensure your database meets the highest standards of accuracy, consistency, and reliability.
Related Documentation: