ML-Based Intelligent Data Tiering
Unique Value Proposition
Most three-tier storage systems force you to write rules: “move data older than 30 days to cold”. HeliosDB Full’s tiering layer learns the access pattern itself. A Random Forest + Neural Network ensemble (12 engineered features) predicts whether each object should live on Hot NVMe, Warm SATA-SSD, or Cold S3, then a cost-aware optimizer migrates only when the move is net-positive against your latency SLA. Production deployments report 60-85% storage cost reduction with <10% p95 latency impact and 82-87% prediction accuracy.
Prerequisites
- HeliosDB Full v8.0.3
- Storage tiers configured — ideally all three (NVMe, SATA-SSD, object store), though any subset works
- A workload with at least 100 access samples (the ML model needs training data; below that, the rule-based fallback kicks in)
- Prometheus or any metric scraper (recommended; the tiering layer exports rich metrics)
- ~25 minutes
The crate is heliosdb-ml-tiering. It sits on top of the base 3-tier storage manager (F68) — this tutorial assumes the base tiering is already configured.
1. The Three Tiers
| Tier | Backing | Latency target | Cost / GB / mo | Use case |
|---|---|---|---|---|
| Hot | NVMe SSD | ~1 ms | $0.15 | Active rows, high IOPS |
| Warm | SATA SSD | ~5 ms | $0.04 | Moderate access, working set |
| Cold | S3 Standard / Azure Blob / GCS | ~50 ms | $0.02 | Archive, infrequent reads |
These figures are the defaults; override them in MLTieringConfig to match your actual cloud bill. The optimizer uses the cost numbers you give it — if they’re wrong, the migration plan is wrong.
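As a sketch of such an override (struct and field names as they appear in the configuration reference in section 6, assuming `TierConfig` is re-exported at the crate root; the dollar figures are placeholders, not recommendations):

```rust
use heliosdb_ml_tiering::{MLTieringConfig, TierConfig};

// Hypothetical per-tier prices copied from your own invoice;
// everything else stays at the crate defaults.
let config = MLTieringConfig {
    hot_tier:  TierConfig { cost_per_gb: 0.125, latency_ms: 1,  ..Default::default() },
    warm_tier: TierConfig { cost_per_gb: 0.045, latency_ms: 5,  ..Default::default() },
    cold_tier: TierConfig { cost_per_gb: 0.021, latency_ms: 50, ..Default::default() },
    ..Default::default()
};
```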
2. Quick Start
```rust
use heliosdb_ml_tiering::{MLTieringConfig, MLTieringManager};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = MLTieringConfig::default(); // sane defaults
    let manager = MLTieringManager::new(config).await?;
    manager.start().await?; // background ML loop begins

    // Use storage normally — the manager intercepts and routes
    let data = b"{\"plan\":\"pro\"}".to_vec(); // any payload bytes
    manager.put("user/data/profile.json".into(), data).await?;
    let bytes = manager.get("user/data/profile.json").await?;

    // After ~24h, ask for the savings
    let report = manager.get_cost_report().await?;
    println!("Monthly savings: ${:.2}", report.monthly_savings);
    println!("Cost reduction: {:.1}%", report.cost_reduction_pct);
    println!("Prediction accuracy: {:.1}%", report.prediction_accuracy * 100.0);

    Ok(())
}
```

That’s the entire integration. The ML loop runs every 5 minutes (`background_job_interval_secs: 300`), retrains every 24 hours, and only migrates when confidence ≥ 0.75 and the projected savings beat the migration cost.
3. The Six Components
```
┌─────────────────────────────────────────────┐
│           ML Tiering Orchestrator           │
└─────────────────────────────────────────────┘
                      │
                 ┌────┼────┐
                 ▼    ▼    ▼
┌─────────┐  ┌──────────┐  ┌────────────┐
│Predictor│→ │Optimizer │← │  Policy    │
│  (ML)   │  │  (Cost)  │  │  Engine    │
└─────────┘  └──────────┘  └────────────┘
     ▲            │              │
┌─────────┐  ┌──────────┐        │
│ Monitor │  │ Migrator │←───────┘
└─────────┘  └──────────┘
```

| Module | File | Job |
|---|---|---|
| Access Pattern Predictor | predictor.rs | RF + NN ensemble, 12 features, confidence-scored |
| Cost Model | cost_model.rs | Per-tier $ math, savings projections |
| Tier Optimizer | optimizer.rs | Cost-aware ranking, latency-bounded plan |
| Policy Engine | policy_engine.rs | Pin / exclude / threshold rules — overrides ML |
| Access Monitor | monitor.rs | Records every get/put for the trainer |
| Data Migrator | migrator.rs | Bandwidth-capped, retry-with-backoff movement |
The policy engine takes precedence over the ML model — useful when the model is still learning, or when compliance forces a tier (e.g. EU customer data must stay on EU-region warm storage).
4. Cost-Savings Math (How the Optimizer Decides)
For each object, the optimizer computes:
```
expected_savings = (current_tier_cost - candidate_tier_cost) × size_gb × months_horizon
expected_penalty = (candidate_latency - current_latency) × predicted_access_count × $/ms_SLA
migration_cost   = size_gb × bandwidth_$/GB
net_value        = expected_savings - expected_penalty - migration_cost
```

Migrate iff `net_value > 0` and `confidence_score ≥ confidence_threshold`. That second clause is the safety net — even if the math says “move it”, the model has to be confident enough.
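The same rule as a minimal Rust sketch. Every name here is illustrative (these are not the crate’s types or functions), but the arithmetic matches the formulas above:

```rust
/// One object considered for migration. Units follow the cost model:
/// $/GB/month for storage, ms for latency.
struct MoveCandidate {
    size_gb: f64,
    current_cost_per_gb: f64,    // current tier, $/GB/month
    candidate_cost_per_gb: f64,  // candidate tier, $/GB/month
    current_latency_ms: f64,
    candidate_latency_ms: f64,
    predicted_access_count: f64, // predicted accesses over the horizon
    confidence: f64,             // model confidence in that prediction
}

/// Migrate iff net_value > 0 AND the model is confident enough.
fn should_migrate(
    c: &MoveCandidate,
    months_horizon: f64,
    dollars_per_ms_sla: f64,    // $ penalty per ms of added latency per access
    bandwidth_cost_per_gb: f64, // one-off transfer cost
    confidence_threshold: f64,  // e.g. 0.75
) -> bool {
    let expected_savings =
        (c.current_cost_per_gb - c.candidate_cost_per_gb) * c.size_gb * months_horizon;
    let expected_penalty = (c.candidate_latency_ms - c.current_latency_ms)
        * c.predicted_access_count
        * dollars_per_ms_sla;
    let migration_cost = c.size_gb * bandwidth_cost_per_gb;
    let net_value = expected_savings - expected_penalty - migration_cost;
    net_value > 0.0 && c.confidence >= confidence_threshold
}
```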
Worked example — a 1 TB deployment
| Strategy | Hot | Warm | Cold | $ / month | Annual |
|---|---|---|---|---|---|
| Baseline (all hot) | 100% | — | — | $150 | $1,800 |
| ML-optimized | 5% | 25% | 70% | $31.50 | $378 |
Saving: $1,422 / year (79% reduction) for a 1 TB workload. Multiply by 100 for a 100 TB enterprise — that’s $142,200 / year off the cloud bill from a single feature flag.
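The table’s figures follow directly from the default per-GB prices; a throwaway check:

```rust
fn main() {
    // Re-derive the worked example: 1 TB ≈ 1,000 GB at the default tier prices.
    let (hot, warm, cold) = (0.05_f64, 0.25, 0.70); // ML-optimized split
    let monthly = 1_000.0 * (hot * 0.15 + warm * 0.04 + cold * 0.02);
    assert!((monthly - 31.50).abs() < 1e-9); // $31.50/month, $378/year
    println!("${monthly:.2}/month, ${:.2}/year", monthly * 12.0);
}
```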
5. Policy Engine — When the ML Model Is Wrong
The model is wrong sometimes. Real-world cases:
- New product launches: no access history yet, the model is cold-starting
- Compliance pin: “this customer’s data must live on-prem”
- Predictable bursts: end-of-month billing reads
Three rule types cover these:
```rust
use heliosdb_ml_tiering::policy::{PolicyEngine, PinRule, ExcludeRule, ThresholdRule, Tier};

let policy = PolicyEngine::new()
    // Hard pin — overrides ML
    .add(PinRule::new("billing/end_of_month/*", Tier::Hot))
    // Compliance — never go cold
    .add(ExcludeRule::new("eu/customer/*", Tier::Cold))
    // Promote anything accessed >100x/day
    .add(ThresholdRule::new("*").access_count_per_day(100).promote_to(Tier::Hot));
```

The policy engine runs before the optimizer. Pin rules win; the ML model only decides among the tiers the policy hasn’t already constrained.
6. Configuration Reference
```rust
MLTieringConfig {
    // Storage tiers (override the defaults to match your bill)
    hot_tier:  TierConfig { cost_per_gb: 0.15, latency_ms: 1,  ..Default::default() },
    warm_tier: TierConfig { cost_per_gb: 0.04, latency_ms: 5,  ..Default::default() },
    cold_tier: TierConfig { cost_per_gb: 0.02, latency_ms: 50, ..Default::default() },

    // ML model
    ml_config: MLModelConfig {
        model_type: "Ensemble".into(), // RandomForest | NeuralNetwork | Ensemble
        confidence_threshold: 0.75,    // 0.0-1.0; raise for safer migrations
        retrain_interval_hours: 24,
    },

    // Cost optimizer
    cost_optimization: CostOptimizationConfig {
        target_cost_reduction: 0.70,   // ambition; 70% by default
        max_latency_degradation: 0.10, // hard ceiling
        enable_predictive_migration: true,
    },

    // Migration runtime
    migration: MigrationConfig {
        max_concurrent_migrations: 10,
        bandwidth_limit_mbps: 100,
        cooldown_period_secs: 300,
    },

    auto_tiering_enabled: true,
    background_job_interval_secs: 300, // 5 minutes
}
```

Recommended starting points (a combined sketch follows the list):

- Tighten `confidence_threshold` to 0.85 in the first 7 days; relax to 0.75 once the model converges
- Cap `max_concurrent_migrations` at 10% of your I/O budget
- Set `max_latency_degradation` based on your customer-facing SLA — the optimizer treats it as a hard constraint
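A cautious first-week profile combining those points might look like this (field names from the reference above; the `Default` impls are assumed, as the `..Default::default()` snippets suggest):

```rust
use heliosdb_ml_tiering::{MLModelConfig, MLTieringConfig, MigrationConfig};

// Warm-up profile: high confidence bar, modest migration concurrency,
// everything else at crate defaults.
let config = MLTieringConfig {
    ml_config: MLModelConfig {
        confidence_threshold: 0.85, // relax to 0.75 once the model converges
        ..Default::default()
    },
    migration: MigrationConfig {
        max_concurrent_migrations: 4, // well under 10% of the I/O budget
        bandwidth_limit_mbps: 50,
        ..Default::default()
    },
    ..Default::default()
};
```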
7. Observability
Every cycle reports:
```rust
let stats = manager.get_stats();
println!("Prediction accuracy: {:.1}%", stats.prediction_accuracy * 100.0);
println!("Total migrations: {}", stats.total_migrations);
println!("Bytes moved (24h): {} MB", stats.bytes_moved_24h / 1_000_000);

let report = manager.get_cost_report().await?;
println!("Monthly savings: ${:.2}", report.monthly_savings);
println!("Cost reduction: {:.1}%", report.cost_reduction_pct);
```

Prometheus metrics exported:

```
heliosdb_tiering_predictions_total{model="ensemble",outcome="correct|wrong"}
heliosdb_tiering_migrations_total{from_tier,to_tier,result}
heliosdb_tiering_bytes_moved_total{from_tier,to_tier}
heliosdb_tiering_savings_dollars{period="monthly"}
heliosdb_tiering_latency_p95_seconds{tier}
```

8. Performance Reference
| Operation | Latency | Throughput |
|---|---|---|
| Feature extraction (1000 objects) | 47 ms | 21,277 obj/s |
| Cost calculation (1000 ops) | 0.8 ms | 1,250,000 ops/s |
| ML training (100 samples) | 950 ms | — |
| ML training (1000 samples) | 5.4 s | — |
| Policy application (1 rule) | 12 µs | — |
| Access recording (1000 ops) | 120 ms | 8,333 ops/s |
The 24-hour retraining cycle is the dominant background cost; on a 1000-sample workload it takes ~5 seconds and runs once per day.
9. Production Checklist
- Three tiers are configured, with realistic `cost_per_gb` numbers from your provider invoice
- At least 100 access samples have been collected before enabling auto-migration (use `auto_tiering_enabled: false` while warming up)
- `max_latency_degradation` reflects a real SLA, not a guess
- Pin rules cover the obvious “never move this” categories (billing, compliance, audit logs)
- Bandwidth limit is set to a sane fraction of total I/O (10% is a good first guess)
- Prometheus is scraping; an alert exists on the `heliosdb_tiering_predictions_total{outcome="wrong"}` rate
- The cold-tier object store is in the same region as the rest of your stack (egress is the silent killer)
Where Next
- Cognitive Agents — pair the storage tiering loop with the schema-manager agent for full-stack autonomy
- PITR Recovery — make sure cold-tier objects participate in your PITR plan
- Multi-Tenancy Setup — per-tenant tiering policies via the policy engine
References
- Source: `/home/app/Helios/Full/heliosdb-ai/crates/ml-tiering/`
- Test coverage: 40+ tests, 95% line coverage
- Detailed design doc: `heliosdb-ai/crates/ml-tiering/docs/ML_TIERING.md` (24 pages)