ML-Based Intelligent Data Tiering
Unique Value Proposition
Most three-tier storage systems force you to write rules: “move data older than 30 days to cold”. HeliosDB Full’s tiering layer learns the access pattern itself. A Random Forest + Neural Network ensemble (12 engineered features) predicts whether each object should live on Hot NVMe, Warm SATA-SSD, or Cold S3, then a cost-aware optimizer migrates only when the move is net-positive against your latency SLA. Production deployments report 60-85% storage cost reduction with <10% p95 latency impact and 82-87% prediction accuracy.
Prerequisites
- HeliosDB Full v8.0.3
- Storage tiers configured — ideally all three (NVMe, SATA-SSD, object store), though any subset works
- A workload with at least 100 access samples (the ML model needs training data; below that, the rule-based fallback kicks in)
- Prometheus or any metric scraper (recommended; the tiering layer exports rich metrics)
- ~25 minutes
The crate is heliosdb-ml-tiering. It sits on top of the base 3-tier storage manager (F68) — this tutorial assumes the base tiering is already configured.
1. The Three Tiers
| Tier | Backing | Latency target | Cost / GB / mo | Use case |
|---|---|---|---|---|
| Hot | NVMe SSD | ~1 ms | $0.15 | Active rows, high IOPS |
| Warm | SATA SSD | ~5 ms | $0.04 | Moderate access, working set |
| Cold | S3 Standard / Azure Blob / GCS | ~50 ms | $0.02 | Archive, infrequent reads |
These figures are the defaults; override them in MLTieringConfig to match your actual cloud bill. The optimizer uses the cost numbers you give it — if they’re wrong, the migration plan is wrong.
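As a sketch of such an override (struct and field names as they appear in the configuration reference in section 6, assuming `TierConfig` is re-exported at the crate root; the dollar figures are placeholders, not recommendations):

```rust
use heliosdb_ml_tiering::{MLTieringConfig, TierConfig};

// Hypothetical per-tier prices copied from your own invoice;
// everything else stays at the crate defaults.
let config = MLTieringConfig {
    hot_tier:  TierConfig { cost_per_gb: 0.125, latency_ms: 1,  ..Default::default() },
    warm_tier: TierConfig { cost_per_gb: 0.045, latency_ms: 5,  ..Default::default() },
    cold_tier: TierConfig { cost_per_gb: 0.021, latency_ms: 50, ..Default::default() },
    ..Default::default()
};
```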
2. Quick Start
```rust
use heliosdb_ml_tiering::{MLTieringConfig, MLTieringManager};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = MLTieringConfig::default(); // sane defaults
    let manager = MLTieringManager::new(config).await?;
    manager.start().await?; // background ML loop begins

    // Use storage normally — the manager intercepts and routes
    let data = b"{\"plan\":\"pro\"}".to_vec(); // any payload bytes
    manager.put("user/data/profile.json".into(), data).await?;
    let bytes = manager.get("user/data/profile.json").await?;

    // After ~24h, ask for the savings
    let report = manager.get_cost_report().await?;
    println!("Monthly savings: ${:.2}", report.monthly_savings);
    println!("Cost reduction: {:.1}%", report.cost_reduction_pct);
    println!("Prediction accuracy: {:.1}%", report.prediction_accuracy * 100.0);

    Ok(())
}
```

That’s the entire integration. The ML loop runs every 5 minutes (`background_job_interval_secs: 300`), retrains every 24 hours, and only migrates when confidence ≥ 0.75 and the projected savings beat the migration cost.
3. The Six Components
```
┌─────────────────────────────────────────────┐
│           ML Tiering Orchestrator           │
└─────────────────────────────────────────────┘
                      │
                 ┌────┼────┐
                 ▼    ▼    ▼
┌─────────┐  ┌──────────┐  ┌────────────┐
│Predictor│→ │Optimizer │← │  Policy    │
│  (ML)   │  │  (Cost)  │  │  Engine    │
└─────────┘  └──────────┘  └────────────┘
     ▲            │              │
┌─────────┐  ┌──────────┐        │
│ Monitor │  │ Migrator │←───────┘
└─────────┘  └──────────┘
```

| Module | File | Job |
|---|---|---|
| Access Pattern Predictor | predictor.rs | RF + NN ensemble, 12 features, confidence-scored |
| Cost Model | cost_model.rs | Per-tier $ math, savings projections |
| Tier Optimizer | optimizer.rs | Cost-aware ranking, latency-bounded plan |
| Policy Engine | policy_engine.rs | Pin / exclude / threshold rules — overrides ML |
| Access Monitor | monitor.rs | Records every get/put for the trainer |
| Data Migrator | migrator.rs | Bandwidth-capped, retry-with-backoff movement |
The policy engine takes precedence over the ML model — useful when the model is still learning, or when compliance forces a tier (e.g. EU customer data must stay on EU-region warm storage).
4. Cost-Savings Math (How the Optimizer Decides)
For each object, the optimizer computes:
```
expected_savings = (current_tier_cost - candidate_tier_cost) × size_gb × months_horizon
expected_penalty = (candidate_latency - current_latency) × predicted_access_count × $/ms_SLA
migration_cost   = size_gb × bandwidth_$/GB
net_value        = expected_savings - expected_penalty - migration_cost
```

Migrate iff `net_value > 0` and `confidence_score ≥ confidence_threshold`. That second clause is the safety net — even if the math says “move it”, the model has to be confident enough.
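The same rule as a minimal Rust sketch. Every name here is illustrative (these are not the crate’s types or functions), but the arithmetic matches the formulas above:

```rust
/// One object considered for migration. Units follow the cost model:
/// $/GB/month for storage, ms for latency.
struct MoveCandidate {
    size_gb: f64,
    current_cost_per_gb: f64,    // current tier, $/GB/month
    candidate_cost_per_gb: f64,  // candidate tier, $/GB/month
    current_latency_ms: f64,
    candidate_latency_ms: f64,
    predicted_access_count: f64, // predicted accesses over the horizon
    confidence: f64,             // model confidence in that prediction
}

/// Migrate iff net_value > 0 AND the model is confident enough.
fn should_migrate(
    c: &MoveCandidate,
    months_horizon: f64,
    dollars_per_ms_sla: f64,    // $ penalty per ms of added latency per access
    bandwidth_cost_per_gb: f64, // one-off transfer cost
    confidence_threshold: f64,  // e.g. 0.75
) -> bool {
    let expected_savings =
        (c.current_cost_per_gb - c.candidate_cost_per_gb) * c.size_gb * months_horizon;
    let expected_penalty = (c.candidate_latency_ms - c.current_latency_ms)
        * c.predicted_access_count
        * dollars_per_ms_sla;
    let migration_cost = c.size_gb * bandwidth_cost_per_gb;
    let net_value = expected_savings - expected_penalty - migration_cost;
    net_value > 0.0 && c.confidence >= confidence_threshold
}
```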
Worked example — a 1 TB deployment
| Strategy | Hot | Warm | Cold | $ / month | Annual |
|---|---|---|---|---|---|
| Baseline (all hot) | 100% | — | — | $150 | $1,800 |
| ML-optimized | 5% | 25% | 70% | $31.50 | $378 |
Saving: $1,422 / year (79% reduction) for a 1 TB workload. Multiply by 100 for a 100 TB enterprise — that’s $142,200 / year off the cloud bill from a single feature flag.
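The table’s figures follow directly from the default per-GB prices; a throwaway check:

```rust
fn main() {
    // Re-derive the worked example: 1 TB ≈ 1,000 GB at the default tier prices.
    let (hot, warm, cold) = (0.05_f64, 0.25, 0.70); // ML-optimized split
    let monthly = 1_000.0 * (hot * 0.15 + warm * 0.04 + cold * 0.02);
    assert!((monthly - 31.50).abs() < 1e-9); // $31.50/month, $378/year
    println!("${monthly:.2}/month, ${:.2}/year", monthly * 12.0);
}
```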
5. Policy Engine — When the ML Model Is Wrong
The model is wrong sometimes. Real-world cases:
- New product launches: no access history yet, the model is cold-starting
- Compliance pin: “this customer’s data must live on-prem”
- Predictable bursts: end-of-month billing reads
Three rule types cover these:
```rust
use heliosdb_ml_tiering::policy::{PolicyEngine, PinRule, ExcludeRule, ThresholdRule, Tier};

let policy = PolicyEngine::new()
    // Hard pin — overrides ML
    .add(PinRule::new("billing/end_of_month/*", Tier::Hot))
    // Compliance — never go cold
    .add(ExcludeRule::new("eu/customer/*", Tier::Cold))
    // Promote anything accessed >100x/day
    .add(ThresholdRule::new("*").access_count_per_day(100).promote_to(Tier::Hot));
```

The policy engine runs before the optimizer. Pin rules win; the ML model only decides among the tiers the policy hasn’t already constrained.
6. Configuration Reference
```rust
MLTieringConfig {
    // Storage tiers (override the defaults to match your bill)
    hot_tier:  TierConfig { cost_per_gb: 0.15, latency_ms: 1,  ..Default::default() },
    warm_tier: TierConfig { cost_per_gb: 0.04, latency_ms: 5,  ..Default::default() },
    cold_tier: TierConfig { cost_per_gb: 0.02, latency_ms: 50, ..Default::default() },

    // ML model
    ml_config: MLModelConfig {
        model_type: "Ensemble".into(), // RandomForest | NeuralNetwork | Ensemble
        confidence_threshold: 0.75,    // 0.0-1.0; raise for safer migrations
        retrain_interval_hours: 24,
    },

    // Cost optimizer
    cost_optimization: CostOptimizationConfig {
        target_cost_reduction: 0.70,   // ambition; 70% by default
        max_latency_degradation: 0.10, // hard ceiling
        enable_predictive_migration: true,
    },

    // Migration runtime
    migration: MigrationConfig {
        max_concurrent_migrations: 10,
        bandwidth_limit_mbps: 100,
        cooldown_period_secs: 300,
    },

    auto_tiering_enabled: true,
    background_job_interval_secs: 300, // 5 minutes
}
```

Recommended starting points (a combined sketch follows the list):

- Tighten `confidence_threshold` to 0.85 in the first 7 days; relax to 0.75 once the model converges
- Cap `max_concurrent_migrations` at 10% of your I/O budget
- Set `max_latency_degradation` based on your customer-facing SLA — the optimizer treats it as a hard constraint
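A cautious first-week profile combining those points might look like this (field names from the reference above; the `Default` impls are assumed, as the `..Default::default()` snippets suggest):

```rust
use heliosdb_ml_tiering::{MLModelConfig, MLTieringConfig, MigrationConfig};

// Warm-up profile: high confidence bar, modest migration concurrency,
// everything else at crate defaults.
let config = MLTieringConfig {
    ml_config: MLModelConfig {
        confidence_threshold: 0.85, // relax to 0.75 once the model converges
        ..Default::default()
    },
    migration: MigrationConfig {
        max_concurrent_migrations: 4, // well under 10% of the I/O budget
        bandwidth_limit_mbps: 50,
        ..Default::default()
    },
    ..Default::default()
};
```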
7. Observability
Every cycle reports:
```rust
let stats = manager.get_stats();
println!("Prediction accuracy: {:.1}%", stats.prediction_accuracy * 100.0);
println!("Total migrations: {}", stats.total_migrations);
println!("Bytes moved (24h): {} MB", stats.bytes_moved_24h / 1_000_000);

let report = manager.get_cost_report().await?;
println!("Monthly savings: ${:.2}", report.monthly_savings);
println!("Cost reduction: {:.1}%", report.cost_reduction_pct);
```

Prometheus metrics exported:

```
heliosdb_tiering_predictions_total{model="ensemble",outcome="correct|wrong"}
heliosdb_tiering_migrations_total{from_tier,to_tier,result}
heliosdb_tiering_bytes_moved_total{from_tier,to_tier}
heliosdb_tiering_savings_dollars{period="monthly"}
heliosdb_tiering_latency_p95_seconds{tier}
```

8. Performance Reference
| Operation | Latency | Throughput |
|---|---|---|
| Feature extraction (1000 objects) | 47 ms | 21,277 obj/s |
| Cost calculation (1000 ops) | 0.8 ms | 1,250,000 ops/s |
| ML training (100 samples) | 950 ms | — |
| ML training (1000 samples) | 5.4 s | — |
| Policy application (1 rule) | 12 µs | — |
| Access recording (1000 ops) | 120 ms | 8,333 ops/s |
The 24-hour retraining cycle is the dominant background cost; on a 1000-sample workload it takes ~5 seconds and runs once per day.
9. Production Checklist
- Three tiers are configured, with realistic `cost_per_gb` numbers from your provider invoice
- At least 100 access samples have been collected before enabling auto-migration (use `auto_tiering_enabled: false` while warming up)
- `max_latency_degradation` reflects a real SLA, not a guess
- Pin rules cover the obvious “never move this” categories (billing, compliance, audit logs)
- Bandwidth limit is set to a sane fraction of total I/O (10% is a good first guess)
- Prometheus is scraping; an alert exists on the `heliosdb_tiering_predictions_total{outcome="wrong"}` rate
- The cold-tier object store is in the same region as the rest of your stack (egress is the silent killer)
Where Next
- Cognitive Agents — pair the storage tiering loop with the schema-manager agent for full-stack autonomy
- PITR Recovery — make sure cold-tier objects participate in your PITR plan
- Multi-Tenancy Setup — per-tenant tiering policies via the policy engine
References
- Source: `/home/app/Helios/Full/heliosdb-ai/crates/ml-tiering/`
- Test coverage: 40+ tests, 95% line coverage
- Detailed design doc: `heliosdb-ai/crates/ml-tiering/docs/ML_TIERING.md` (24 pages)