ML-Based Intelligent Data Tiering

Value Proposition

Most three-tier storage systems force you to write rules: “move data older than 30 days to cold”. HeliosDB Full’s tiering layer learns the access pattern itself. A Random Forest + Neural Network ensemble (12 engineered features) predicts whether each object should live on Hot NVMe, Warm SATA-SSD, or Cold S3, then a cost-aware optimizer migrates only when the move is net-positive against your latency SLA. Production deployments report 60-85% storage cost reduction with <10% p95 latency impact and 82-87% prediction accuracy.


Prerequisites

  • HeliosDB Full v8.0.3
  • At least three storage tiers configured (NVMe, SATA-SSD, object store — or any subset)
  • A workload with at least 100 access samples (the ML model needs training data; below that, the rule-based fallback kicks in)
  • Prometheus or any metric scraper (recommended; the tiering layer exports rich metrics)
  • ~25 minutes

The crate is heliosdb-ml-tiering. It sits on top of the base 3-tier storage manager (F68) — this tutorial assumes the base tiering is already configured.


1. The Three Tiers

Tier   Backing                         Latency target   Cost / GB / mo   Use case
Hot    NVMe SSD                        ~1 ms            $0.15            Active rows, high IOPS
Warm   SATA SSD                        ~5 ms            $0.04            Moderate access, working set
Cold   S3 Standard / Azure Blob / GCS  ~50 ms           $0.02            Archive, infrequent reads

These figures are the defaults; override them in MLTieringConfig to match your actual cloud bill. The optimizer uses the cost numbers you give it — if they’re wrong, the migration plan is wrong.


2. Quick Start

use heliosdb_ml_tiering::{MLTieringConfig, MLTieringManager};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = MLTieringConfig::default(); // sane defaults
    let manager = MLTieringManager::new(config).await?;
    manager.start().await?; // background ML loop begins

    // Use storage normally — the manager intercepts and routes
    let data = b"{\"plan\":\"pro\"}".to_vec(); // any payload
    manager.put("user/data/profile.json".into(), data).await?;
    let _bytes = manager.get("user/data/profile.json").await?;

    // After ~24h, ask for the savings
    let report = manager.get_cost_report().await?;
    println!("Monthly savings: ${:.2}", report.monthly_savings);
    println!("Cost reduction: {:.1}%", report.cost_reduction_pct);
    println!("Prediction accuracy: {:.1}%", report.prediction_accuracy * 100.0);
    Ok(())
}

That’s the entire integration. The ML loop runs every 5 minutes (background_job_interval_secs: 300), retrains every 24 hours, and only migrates when confidence ≥ 0.75 and the projected savings beat the migration cost.
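Both gates (net-positive value and sufficient confidence) reduce to a single predicate. The snippet below is a standalone sketch of that decision rule, not the crate's code; the function and parameter names are illustrative:

```rust
/// Sketch of the migration gate: a move must be net-positive in dollars
/// AND the model must clear the confidence bar (0.75 by default).
fn should_migrate(net_value_dollars: f64, confidence: f64, threshold: f64) -> bool {
    net_value_dollars > 0.0 && confidence >= threshold
}

fn main() {
    assert!(should_migrate(29.02, 0.80, 0.75));  // profitable and confident: move
    assert!(!should_migrate(29.02, 0.60, 0.75)); // profitable but uncertain: stay
    assert!(!should_migrate(-3.0, 0.99, 0.75));  // confident but a net loss: stay
    println!("gate ok");
}
```

Note that both conditions must hold independently; a very confident model never overrides a money-losing move, and vice versa.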


3. The Six Components

┌─────────────────────────────────────┐
│       ML Tiering Orchestrator       │
└───────────────────┬─────────────────┘
        ┌───────────┼───────────┐
        ▼           ▼           ▼
   ┌─────────┐ ┌─────────┐ ┌─────────┐
   │Predictor│→│Optimizer│←│ Policy  │
   │  (ML)   │ │ (Cost)  │ │ Engine  │
   └─────────┘ └─────────┘ └─────────┘
        ▲           │           │
   ┌─────────┐ ┌─────────┐      │
   │ Monitor │ │Migrator │←─────┘
   └─────────┘ └─────────┘
Module                    File              Job
Access Pattern Predictor  predictor.rs      RF + NN ensemble, 12 features, confidence-scored
Cost Model                cost_model.rs     Per-tier $ math, savings projections
Tier Optimizer            optimizer.rs      Cost-aware ranking, latency-bounded plan
Policy Engine             policy_engine.rs  Pin / exclude / threshold rules — overrides ML
Access Monitor            monitor.rs        Records every get/put for the trainer
Data Migrator             migrator.rs       Bandwidth-capped, retry-with-backoff movement

The policy engine takes precedence over the ML model — useful when the model is still learning, or when compliance forces a tier (e.g. EU customer data must stay on EU-region warm storage).


4. Cost-Savings Math (How the Optimizer Decides)

For each object, the optimizer computes:

expected_savings = (current_tier_cost - candidate_tier_cost) × size_gb × months_horizon
expected_penalty = (candidate_latency - current_latency) × predicted_access_count × $/ms_SLA
migration_cost = size_gb × bandwidth_$/GB
net_value = expected_savings - expected_penalty - migration_cost

Migrate iff net_value > 0 and confidence_score ≥ confidence_threshold. That second clause is the safety net — even if the math says “move it”, the model has to be confident enough.
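The decision math above can be reproduced in a few lines of standalone Rust. This is a sketch with made-up numbers, not the crate's optimizer; the struct and field names are illustrative:

```rust
/// Illustrative stand-in for the optimizer's inputs; names are invented
/// for this sketch, not taken from the crate.
struct Candidate {
    current_cost_per_gb: f64,    // $/GB/mo on the current tier
    candidate_cost_per_gb: f64,  // $/GB/mo on the proposed tier
    current_latency_ms: f64,
    candidate_latency_ms: f64,
    size_gb: f64,
    months_horizon: f64,
    predicted_access_count: f64, // accesses expected over the horizon
    dollars_per_ms_sla: f64,     // what 1 ms of added latency costs you
    bandwidth_dollars_per_gb: f64,
}

/// net_value = expected_savings - expected_penalty - migration_cost
fn net_value(c: &Candidate) -> f64 {
    let savings = (c.current_cost_per_gb - c.candidate_cost_per_gb) * c.size_gb * c.months_horizon;
    let penalty = (c.candidate_latency_ms - c.current_latency_ms)
        * c.predicted_access_count
        * c.dollars_per_ms_sla;
    let migration = c.size_gb * c.bandwidth_dollars_per_gb;
    savings - penalty - migration
}

fn main() {
    // 100 GB of rarely-read data, hot -> cold, 3-month horizon.
    let c = Candidate {
        current_cost_per_gb: 0.15,
        candidate_cost_per_gb: 0.02,
        current_latency_ms: 1.0,
        candidate_latency_ms: 50.0,
        size_gb: 100.0,
        months_horizon: 3.0,
        predicted_access_count: 20.0,
        dollars_per_ms_sla: 0.001,
        bandwidth_dollars_per_gb: 0.09,
    };
    // savings   = 0.13 * 100 * 3   = $39.00
    // penalty   = 49 * 20 * 0.001  = $0.98
    // migration = 100 * 0.09       = $9.00
    // net       = 39.00 - 0.98 - 9.00 = $29.02 -> migrate, if confidence clears the bar
    println!("net value: ${:.2}", net_value(&c));
}
```

Notice how the penalty term scales with predicted access count: the same hot-to-cold move flips to a net loss once the object is read often enough, which is exactly the behaviour you want.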

Worked example — a 1 TB deployment

Strategy            Hot    Warm   Cold   $ / month   Annual
Baseline (all hot)  100%   0%     0%     $150.00     $1,800
ML-optimized        5%     25%    70%    $31.50      $378

Saving: $1,422 / year (79% reduction) for a 1 TB workload. Multiply by 100 for a 100 TB enterprise with the same access mix — that’s $142,200 / year off the cloud bill from a single feature flag.
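The table's numbers are easy to verify with the default per-tier prices from section 1. Standalone arithmetic, nothing crate-specific:

```rust
fn main() {
    let gb = 1000.0; // 1 TB
    let (hot, warm, cold) = (0.15, 0.04, 0.02); // default $/GB/mo

    let baseline = gb * hot; // everything on hot NVMe
    let optimized = gb * (0.05 * hot + 0.25 * warm + 0.70 * cold); // 5/25/70 split

    println!("baseline:  ${:.2}/mo", baseline);  // $150.00
    println!("optimized: ${:.2}/mo", optimized); // $31.50
    println!("annual saving: ${:.0}", (baseline - optimized) * 12.0); // $1422
    println!("reduction: {:.0}%", 100.0 * (1.0 - optimized / baseline)); // 79%
}
```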


5. Policy Engine — When the ML Model Is Wrong

The model is wrong sometimes. Real-world cases:

  • New product launches: no access history yet, the model is cold-starting
  • Compliance pin: “this customer’s data must live on-prem”
  • Predictable bursts: end-of-month billing reads

Three rule types cover these:

use heliosdb_ml_tiering::policy::{PolicyEngine, PinRule, ExcludeRule, ThresholdRule, Tier};

let policy = PolicyEngine::new()
    // Hard pin — overrides ML
    .add(PinRule::new("billing/end_of_month/*", Tier::Hot))
    // Compliance — never go cold
    .add(ExcludeRule::new("eu/customer/*", Tier::Cold))
    // Promote anything accessed >100x/day
    .add(ThresholdRule::new("*").access_count_per_day(100).promote_to(Tier::Hot));

The policy engine runs before the optimizer. Pin rules win; the ML model only decides among the tiers the policy hasn’t already constrained.
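That precedence order can be pictured as a filter: rules shrink the set of legal tiers, and only then does the model choose. A minimal self-contained sketch of the idea (the rule struct and selection logic here are illustrative, not the crate's implementation):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Tier { Hot, Warm, Cold }

/// Illustrative rule set for this sketch: a pin fixes the tier outright;
/// an exclusion removes tiers from consideration. Real rules match key patterns.
struct Rules { pin: Option<Tier>, excluded: Vec<Tier> }

/// Policy first, ML second: the model only chooses among what's left.
fn choose_tier(rules: &Rules, ml_ranked: &[Tier]) -> Tier {
    if let Some(t) = rules.pin {
        return t; // a pin wins unconditionally
    }
    ml_ranked
        .iter()
        .copied()
        .find(|t| !rules.excluded.contains(t))
        .unwrap_or(Tier::Hot) // everything excluded: fail safe to hot
}

fn main() {
    // ML prefers Cold, but a compliance rule excludes it -> Warm wins.
    let rules = Rules { pin: None, excluded: vec![Tier::Cold] };
    assert_eq!(choose_tier(&rules, &[Tier::Cold, Tier::Warm, Tier::Hot]), Tier::Warm);

    // A hard pin ignores the ML ranking entirely.
    let pinned = Rules { pin: Some(Tier::Hot), excluded: vec![] };
    assert_eq!(choose_tier(&pinned, &[Tier::Cold, Tier::Warm]), Tier::Hot);
    println!("policy precedence ok");
}
```

The key property: the ML ranking is consulted in order, so excluding a tier degrades gracefully to the model's next-best choice rather than to an arbitrary one.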


6. Configuration Reference

MLTieringConfig {
    // Storage tiers (override the defaults to match your bill)
    hot_tier: TierConfig { cost_per_gb: 0.15, latency_ms: 1, ..Default::default() },
    warm_tier: TierConfig { cost_per_gb: 0.04, latency_ms: 5, ..Default::default() },
    cold_tier: TierConfig { cost_per_gb: 0.02, latency_ms: 50, ..Default::default() },

    // ML model
    ml_config: MLModelConfig {
        model_type: "Ensemble".into(), // RandomForest | NeuralNetwork | Ensemble
        confidence_threshold: 0.75,    // 0.0-1.0; raise for safer migrations
        retrain_interval_hours: 24,
    },

    // Cost optimizer
    cost_optimization: CostOptimizationConfig {
        target_cost_reduction: 0.70,   // ambition; 70% by default
        max_latency_degradation: 0.10, // hard ceiling
        enable_predictive_migration: true,
    },

    // Migration runtime
    migration: MigrationConfig {
        max_concurrent_migrations: 10,
        bandwidth_limit_mbps: 100,
        cooldown_period_secs: 300,
    },

    auto_tiering_enabled: true,
    background_job_interval_secs: 300, // 5 minutes
}

Recommended starting points:

  • Tighten confidence_threshold to 0.85 in the first 7 days; relax to 0.75 once the model converges
  • Cap max_concurrent_migrations at 10% of your I/O budget
  • Set max_latency_degradation based on your customer-facing SLA — the optimizer treats it as a hard constraint

7. Observability

Every cycle reports:

let stats = manager.get_stats();
println!("Prediction accuracy: {:.1}%", stats.prediction_accuracy * 100.0);
println!("Total migrations: {}", stats.total_migrations);
println!("Bytes moved (24h): {} MB", stats.bytes_moved_24h / 1_000_000);
let report = manager.get_cost_report().await?;
println!("Monthly savings: ${:.2}", report.monthly_savings);
println!("Cost reduction: {:.1}%", report.cost_reduction_pct);

Prometheus metrics exported:

heliosdb_tiering_predictions_total{model="ensemble",outcome="correct|wrong"}
heliosdb_tiering_migrations_total{from_tier,to_tier,result}
heliosdb_tiering_bytes_moved_total{from_tier,to_tier}
heliosdb_tiering_savings_dollars{period="monthly"}
heliosdb_tiering_latency_p95_seconds{tier}

8. Performance Reference

Operation                          Latency   Throughput
Feature extraction (1000 objects)  47 ms     21,277 obj/s
Cost calculation (1000 ops)        0.8 ms    1,250,000 ops/s
ML training (100 samples)          950 ms    n/a
ML training (1000 samples)         5.4 s     n/a
Policy application (1 rule)        12 µs     n/a
Access recording (1000 ops)        120 ms    8,333 ops/s

The 24-hour retraining cycle is the dominant background cost; on a 1000-sample workload it takes ~5 seconds and runs once per day.


9. Production Checklist

  • Three tiers are configured, with realistic cost_per_gb numbers from your provider invoice
  • At least 100 access samples have been collected before enabling auto-migration (use auto_tiering_enabled: false while warming up)
  • max_latency_degradation reflects a real SLA, not a guess
  • Pin rules cover the obvious “never move this” categories (billing, compliance, audit logs)
  • Bandwidth limit is set to a sane fraction of total I/O (10% is a good first guess)
  • Prometheus is scraping; an alert exists on heliosdb_tiering_predictions_total{outcome="wrong"} rate
  • The cold-tier object store is in the same region as the rest of your stack (egress is the silent killer)

Where Next

  • Cognitive Agents — pair the storage tiering loop with the schema-manager agent for full-stack autonomy
  • PITR Recovery — make sure cold-tier objects participate in your PITR plan
  • Multi-Tenancy Setup — per-tenant tiering policies via the policy engine

References

  • Source: /home/app/Helios/Full/heliosdb-ai/crates/ml-tiering/
  • Test coverage: 40+ tests, 95% line coverage
  • Detailed design doc: heliosdb-ai/crates/ml-tiering/docs/ML_TIERING.md (24 pages)