Error Handling Best Practices for HeliosDB

Version: 1.0
Date: November 9, 2025
Status: Production Guidelines
Audience: All HeliosDB developers

Purpose

This document establishes error handling best practices for HeliosDB to ensure production-ready code quality, eliminate panics, and provide excellent error diagnostics.

Key Principle: Production code MUST NEVER panic. All error conditions must be handled gracefully with descriptive error messages.

📋 Quick Reference

The Golden Rules

NEVER use .unwrap() in production code (the narrow exceptions are listed under Special Cases)
NEVER use .expect() in production code
Always propagate errors with the ? operator
Provide descriptive error messages
Use Result<T, E> for fallible operations
Use Option for optional values
Test code CAN use .unwrap() for simplicity
Document why unsafe is safe (when unavoidable)
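
The last rule is the only one without an example elsewhere in this document. The following is a minimal sketch of a documented unsafe block; the function name last_byte is hypothetical, and in real code the safe equivalent data.last().copied() would be preferable. The point is only the shape of the SAFETY comment.

```rust
/// Returns the last byte of `data`, or `None` if the slice is empty.
/// (Hypothetical helper used only to illustrate a SAFETY comment.)
pub fn last_byte(data: &[u8]) -> Option<u8> {
    if data.is_empty() {
        return None;
    }
    // SAFETY: `data` is non-empty (checked above), so `data.len() - 1`
    // is a valid index and `get_unchecked` cannot read out of bounds.
    Some(unsafe { *data.get_unchecked(data.len() - 1) })
}
```

The comment states the invariant and where it is established, so a reviewer can check the claim without reading the rest of the module.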

🚫 Anti-Patterns (DO NOT DO THIS)

Anti-Pattern 1: unwrap() in Production Code

Problem: Causes panic and crashes the database

// ❌ BAD: Will panic if sorted_entries is empty
fn create_sstable(sorted_entries: Vec<Entry>) -> SSTable {
    let min_key = sorted_entries.first().unwrap().key.clone();
    let max_key = sorted_entries.last().unwrap().key.clone();

    SSTable {
        min_key,
        max_key,
        entries: sorted_entries,
    }
}

Why It’s Bad:

Panics on empty input (data loss)
No error context (hard to debug)
Cannot recover (database crash)
Production risk: CRITICAL

Real Impact: SSTable creation failure → Database corruption

Anti-Pattern 2: expect() with Generic Messages

// ❌ BAD: Generic error message, still panics
fn get_latest_timestamp() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("Time went backwards")  // Unhelpful in logs
        .as_millis() as u64
}

Why It’s Bad:

Still panics (same as unwrap)
Generic message doesn’t help debugging
No error propagation
Cannot recover when the system clock reads earlier than UNIX_EPOCH (e.g. after a bad NTP sync or manual clock adjustment)

Real Impact: XA transaction coordination failure → Data inconsistency

Anti-Pattern 3: Nested unwrap() Chains

// ❌ VERY BAD: Multiple panic points, hard to debug
fn process_data(data: &HashMap<String, Vec<Value>>) -> Value {
    data.get("values")
        .unwrap()
        .first()
        .unwrap()
        .clone()
}

Why It’s Bad:

2 panic points in a single expression
Which unwrap failed? Unknown
No error context
Impossible to recover
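
One way to remove both panic points is to give each step of the chain its own error. This is a minimal sketch: Value and LookupError are hypothetical stand-ins, and HeliosDB's real types may look different.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the document's `Value` type.
#[derive(Debug, Clone, PartialEq)]
pub struct Value(pub i64);

// Hypothetical error type: one variant per failure point.
#[derive(Debug, PartialEq)]
pub enum LookupError {
    MissingKey(&'static str),
    EmptyList(&'static str),
}

pub fn process_data(data: &HashMap<String, Vec<Value>>) -> Result<Value, LookupError> {
    // Step 1: the key may be absent.
    let values = data
        .get("values")
        .ok_or(LookupError::MissingKey("values"))?;
    // Step 2: the list may be present but empty.
    let first = values
        .first()
        .ok_or(LookupError::EmptyList("values"))?;
    Ok(first.clone())
}
```

Because each step maps to a distinct variant, the log line answers the "which unwrap failed?" question directly.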

Anti-Pattern 4: Silent Error Swallowing

// ❌ BAD: Errors are hidden, causes silent failures
fn load_config(path: &Path) -> Config {
    match File::open(path) {
        Ok(file) => parse_config(file),
        Err(_) => Config::default(),  // Error lost!
    }
}

Why It’s Bad:

Permission errors hidden
File not found hidden
Corrupt file hidden
Wrong behavior, no diagnostics
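
A possible fix is sketched below. It returns every failure to the caller and treats "file not found" as the only case where a default is chosen, and then only in an explicitly named function. The Config fields, the ConfigError type, and the toy `port=<number>` file format are all assumptions made for illustration.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical config with a single field.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Config {
    pub port: u16,
}

// Hypothetical error type that keeps the path and the root cause.
#[derive(Debug)]
pub enum ConfigError {
    Io { path: PathBuf, source: io::Error },
    Parse { path: PathBuf, reason: String },
}

/// Parses the toy format `port=<number>` (an assumption for this sketch).
pub fn parse_config(text: &str) -> Result<Config, String> {
    let raw = text
        .trim()
        .strip_prefix("port=")
        .ok_or_else(|| "expected `port=<number>`".to_string())?;
    let port = raw
        .parse::<u16>()
        .map_err(|e| format!("invalid port '{}': {}", raw, e))?;
    Ok(Config { port })
}

/// Every failure (permissions, missing file, corrupt content) reaches the caller.
pub fn load_config(path: &Path) -> Result<Config, ConfigError> {
    let text = fs::read_to_string(path).map_err(|source| ConfigError::Io {
        path: path.to_path_buf(),
        source,
    })?;
    parse_config(&text).map_err(|reason| ConfigError::Parse {
        path: path.to_path_buf(),
        reason,
    })
}

/// Falls back to defaults only when the file does not exist.
pub fn load_config_or_default(path: &Path) -> Result<Config, ConfigError> {
    match load_config(path) {
        Err(ConfigError::Io { ref source, .. }) if source.kind() == io::ErrorKind::NotFound => {
            Ok(Config::default())
        }
        other => other,
    }
}
```

Permission errors and corrupt files still surface as errors; only the missing-file case is turned into a default, and that decision is visible in the function name.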

Anti-Pattern 5: Generic Error Types

// ❌ BAD: Generic error loses context
fn read_sstable(id: u64) -> Result<SSTable, String> {
    let path = format!("data/{}.sst", id);
    let data = std::fs::read(path)
        .map_err(|e| e.to_string())?;  // Context lost!

    deserialize(&data)
        .map_err(|e| e.to_string())  // Error kind lost again
}

Why It’s Bad:

Cannot distinguish IO vs. deserialization errors
Cannot implement retries
Poor error reporting
Hard to debug

Best Practices (DO THIS INSTEAD)

Best Practice 1: Proper Result Handling

// ✅ GOOD: Proper error handling with context
use crate::error::HeliosError;

fn create_sstable(sorted_entries: Vec<Entry>) -> Result<SSTable, HeliosError> {
    if sorted_entries.is_empty() {
        return Err(HeliosError::Storage(
            "Cannot create SSTable from empty entries".to_string()
        ));
    }

    let min_key = sorted_entries
        .first()
        .ok_or_else(|| HeliosError::Storage(
            "Empty sorted entries after validation".to_string()
        ))?
        .key.clone();

    let max_key = sorted_entries
        .last()
        .ok_or_else(|| HeliosError::Storage(
            "Empty sorted entries after validation".to_string()
        ))?
        .key.clone();

    Ok(SSTable {
        min_key,
        max_key,
        entries: sorted_entries,
    })
}

Why It’s Good:

No panics possible
Descriptive error messages
Early validation
Errors propagate with ?
Caller can handle or propagate
Logs show exact error

Best Practice 2: Helper Functions for Common Patterns

// ✅ GOOD: Helper function encapsulates error handling
/// Get current timestamp in milliseconds since UNIX_EPOCH
///
/// Safe timestamp generation that handles edge cases:
/// - System clock adjustments
/// - NTP sync
/// - Clock going backwards
/// - Virtualization time skew
///
/// Returns 0 if the system clock reads earlier than UNIX_EPOCH (extremely rare).
#[inline]
fn current_timestamp_millis() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or(Duration::from_secs(0))  // Safe fallback
        .as_millis() as u64
}

// Usage is simple and safe
fn create_transaction() -> Transaction {
    Transaction {
        id: generate_id(),
        timestamp: current_timestamp_millis(),  // Never panics
        data: vec![],
    }
}

Why It’s Good:

Centralizes error handling logic
Documented edge cases
Safe fallback (0 timestamp is detectable)
Reusable across codebase
Inline for performance
Never panics

When to Use: Common patterns (SystemTime, parsing, conversions)
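
The same idea applies to parsing. The sketch below shows one generic helper that attaches the field name and the raw input to every parse failure. The names parse_field and InvalidConfig are hypothetical; in HeliosDB the error would presumably be a HeliosError variant.

```rust
use std::fmt::Display;
use std::str::FromStr;

// Hypothetical error type standing in for a HeliosError variant.
#[derive(Debug, Clone, PartialEq)]
pub struct InvalidConfig(pub String);

/// Parses `raw` into `T`, naming the field and echoing the input on failure.
pub fn parse_field<T>(name: &str, raw: &str) -> Result<T, InvalidConfig>
where
    T: FromStr,
    T::Err: Display,
{
    raw.trim()
        .parse::<T>()
        .map_err(|e| InvalidConfig(format!("Invalid {} '{}': {}", name, raw, e)))
}
```

A call such as `parse_field::<u16>("port", port_str)?` then replaces a hand-written map_err at every call site.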

Best Practice 3: Early Validation

// ✅ GOOD: Validate inputs early
fn process_batch(items: Vec<Item>) -> Result<Vec<ProcessedItem>, HeliosError> {
    // Validate inputs first
    if items.is_empty() {
        return Err(HeliosError::InvalidInput(
            "Batch cannot be empty".to_string()
        ));
    }

    if items.len() > MAX_BATCH_SIZE {
        return Err(HeliosError::InvalidInput(
            format!("Batch size {} exceeds maximum {}", items.len(), MAX_BATCH_SIZE)
        ));
    }

    // Process with confidence (inputs validated)
    let mut results = Vec::with_capacity(items.len());
    for item in items {
        results.push(process_item(item)?);
    }

    Ok(results)
}

Why It’s Good:

Fail fast on invalid input
Clear error messages
No partial processing of invalid batches
Easy to test
Performance: validate once

Best Practice 4: Descriptive Error Types

// ✅ GOOD: Structured error types with context
#[derive(Debug, Clone)]
pub enum SSTableError {
    EmptyEntries,
    InvalidRange { min: Vec<u8>, max: Vec<u8> },
    IOError { path: PathBuf, source: String },
    CorruptedData { offset: u64, expected: u32, found: u32 },
}

impl std::fmt::Display for SSTableError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            SSTableError::EmptyEntries => {
                write!(f, "Cannot create SSTable from empty entries")
            }
            SSTableError::InvalidRange { min, max } => {
                write!(f, "Invalid key range: min={:?} > max={:?}", min, max)
            }
            SSTableError::IOError { path, source } => {
                write!(f, "IO error reading {}: {}", path.display(), source)
            }
            SSTableError::CorruptedData { offset, expected, found } => {
                write!(f, "Corrupted data at offset {}: expected checksum {}, found {}",
                       offset, expected, found)
            }
        }
    }
}

impl std::error::Error for SSTableError {}

// Usage
fn read_sstable(path: &Path) -> Result<SSTable, SSTableError> {
    let data = std::fs::read(path)
        .map_err(|e| SSTableError::IOError {
            path: path.to_path_buf(),
            source: e.to_string(),
        })?;

    // ... deserialize with proper error handling
    Ok(sstable)
}

Why It’s Good:

Type-safe error handling
Structured error data
Easy to match on error type
Excellent error messages
Enables retries based on error type
Good logging
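
To illustrate the "retries based on error type" point, here is a minimal sketch of a retry loop that matches on the error variant. ReadError and read_with_retry are hypothetical names, and which failures count as retryable is an assumption made for the example.

```rust
/// Hypothetical error type: one retryable variant, one permanent variant.
#[derive(Debug, Clone, PartialEq)]
pub enum ReadError {
    Busy(String),
    Corrupted { offset: u64 },
}

/// Calls `op` at least once and at most `max_attempts` times,
/// retrying only when the failure is `ReadError::Busy`.
pub fn read_with_retry<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, ReadError>,
) -> Result<T, ReadError> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Transient failure with attempts left: try again.
            Err(ReadError::Busy(_)) if attempt < max_attempts => attempt += 1,
            // Permanent failure, or out of attempts: give up.
            Err(err) => return Err(err),
        }
    }
}
```

With a String error, this decision would require parsing the message text; with an enum it is a single match arm.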

Best Practice 5: Error Context Propagation

// ✅ GOOD: Errors carry full context up the stack
use anyhow::{Context, Result};  // Or use custom error chaining

fn load_sstable_file(id: u64) -> Result<SSTable> {
    let path = get_sstable_path(id)?;

    let data = std::fs::read(&path)
        .with_context(|| format!("Failed to read SSTable file: {}", path.display()))?;

    let sstable = deserialize_sstable(&data)
        .with_context(|| format!("Failed to deserialize SSTable {}", id))?;

    validate_sstable(&sstable)
        .with_context(|| format!("SSTable {} failed validation", id))?;

    Ok(sstable)
}

// Error output example (Debug format, assuming the caller adds its own "Failed to load SSTable 12345" context):
// Error: Failed to load SSTable 12345
// Caused by:
//     0: Failed to read SSTable file: /data/sstables/12345.sst
//     1: No such file or directory (os error 2)

Why It’s Good:

Full error chain visible
Easy to debug
Context at each layer
Root cause preserved
Great for logs
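
For crates that cannot depend on anyhow, the "custom error chaining" alternative can be sketched with the standard library alone. ContextError, ResultExt, and error_chain are hypothetical names for this illustration; this is not anyhow's implementation.

```rust
use std::error::Error;
use std::fmt;

/// A message plus the lower-level error that caused it.
#[derive(Debug)]
pub struct ContextError {
    context: String,
    source: Box<dyn Error + Send + Sync + 'static>,
}

impl fmt::Display for ContextError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.context)
    }
}

impl Error for ContextError {
    // Exposing the cause here is what makes the chain walkable.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Adds a `.context(...)` method to any `Result` whose error implements `Error`.
pub trait ResultExt<T> {
    fn context(self, msg: impl Into<String>) -> Result<T, ContextError>;
}

impl<T, E: Error + Send + Sync + 'static> ResultExt<T> for Result<T, E> {
    fn context(self, msg: impl Into<String>) -> Result<T, ContextError> {
        self.map_err(|e| ContextError {
            context: msg.into(),
            source: Box::new(e),
        })
    }
}

/// Collects the messages from the outermost error down to the root cause.
pub fn error_chain(err: &(dyn Error + 'static)) -> Vec<String> {
    let mut messages = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        messages.push(cause.to_string());
        current = cause.source();
    }
    messages
}
```

Each layer adds one message, and the root cause stays reachable through source(), so a logger can print the whole chain.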

Best Practice 6: Option Handling

// ✅ GOOD: Proper Option handling
fn get_user_by_id(id: u64) -> Result<User, HeliosError> {
    let users = get_user_cache()?;

    users.get(&id)
        .cloned()
        .ok_or_else(|| HeliosError::NotFound(
            format!("User {} not found in cache", id)
        ))
}

// Alternative: Return Option when "not found" is valid
fn get_cached_value(key: &str) -> Option<Value> {
    let cache = CACHE.lock().unwrap();  // Lock unwrap is OK (poison)
    cache.get(key).cloned()
}

// Caller decides how to handle None
match get_cached_value("key") {
    Some(value) => use_value(value),
    None => load_from_disk("key")?,  // Fallback
}

Why It’s Good:

Clear semantics (Option vs Result)
Descriptive errors for Result
None is valid for Option
Caller flexibility

Common Patterns & Solutions

Pattern 1: Array/Collection Access

// ❌ BAD
let first = collection.first().unwrap();
let last = collection.last().unwrap();

// ✅ GOOD: Validate first
if collection.is_empty() {
    return Err(HeliosError::EmptyCollection);
}
let first = &collection[0];
let last = &collection[collection.len() - 1];

// ✅ ALSO GOOD: Propagate Option
let first = collection.first()
    .ok_or(HeliosError::EmptyCollection)?;
let last = collection.last()
    .ok_or(HeliosError::EmptyCollection)?;

Pattern 2: SystemTime Operations

// ❌ BAD
let duration = SystemTime::now()
    .duration_since(UNIX_EPOCH)
    .unwrap();

// ✅ GOOD: Helper function with safe fallback
fn current_timestamp_millis() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or(Duration::from_secs(0))
        .as_millis() as u64
}

// ✅ ALSO GOOD: Return Result if precision matters
fn precise_timestamp() -> Result<u64, HeliosError> {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis() as u64)
        .map_err(|e| HeliosError::SystemTime(e.to_string()))
}
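
The Result-returning variant is easier to test if the clock reading is passed in as a parameter. The sketch below does that and also replaces the `as u64` cast, which truncates silently, with a checked conversion. The names millis_since_epoch and TimestampError are hypothetical.

```rust
use std::convert::TryFrom;
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical error type covering both ways the conversion can fail.
#[derive(Debug, Clone, PartialEq)]
pub enum TimestampError {
    /// The clock reading is earlier than UNIX_EPOCH.
    BeforeEpoch,
    /// The millisecond count does not fit in a u64.
    Overflow,
}

/// Milliseconds between UNIX_EPOCH and `now`, with no lossy cast.
pub fn millis_since_epoch(now: SystemTime) -> Result<u64, TimestampError> {
    let elapsed = now
        .duration_since(UNIX_EPOCH)
        .map_err(|_| TimestampError::BeforeEpoch)?;
    // `as_millis` returns u128; `try_from` rejects values that exceed u64.
    u64::try_from(elapsed.as_millis()).map_err(|_| TimestampError::Overflow)
}
```

Production code would call `millis_since_epoch(SystemTime::now())`, while tests can pass a fixed instant to exercise the error path.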

Pattern 3: Deque/VecDeque Access

// ❌ BAD
let front = deque.front().unwrap();
let back = deque.back().unwrap();

// ✅ GOOD
let front = deque.front()
    .ok_or_else(|| HeliosError::Storage("Deque unexpectedly empty".to_string()))?;
let back = deque.back()
    .ok_or_else(|| HeliosError::Storage("Deque unexpectedly empty".to_string()))?;

Pattern 4: Parsing Strings/Numbers

// ❌ BAD
let port: u16 = port_str.parse().unwrap();

// ✅ GOOD
let port: u16 = port_str.parse()
    .map_err(|e| HeliosError::InvalidConfig(
        format!("Invalid port '{}': {}", port_str, e)
    ))?;

// ✅ ALSO GOOD: With default (only when silently ignoring bad input is acceptable)
let port: u16 = port_str.parse().unwrap_or(5432);

Pattern 5: HashMap/BTreeMap Get

// ❌ BAD
let value = map.get(&key).unwrap();

// ✅ GOOD: When key MUST exist
let value = map.get(&key)
    .ok_or_else(|| HeliosError::InvalidState(
        format!("Required key '{}' not found in map", key)
    ))?;

// ✅ ALSO GOOD: When key might not exist
if let Some(value) = map.get(&key) {
    process(value);
} else {
    use_default();
}

Pattern 6: Channel Operations

// ❌ BAD
sender.send(msg).unwrap();
let msg = receiver.recv().unwrap();

// ✅ GOOD
sender.send(msg)
    .map_err(|e| HeliosError::ChannelClosed(
        format!("Failed to send message: {}", e)
    ))?;

let msg = receiver.recv()
    .map_err(|e| HeliosError::ChannelClosed(
        format!("Failed to receive message: {}", e)
    ))?;

Pattern 7: Mutex/RwLock Poisoning

// ⚠ SPECIAL CASE: Lock poisoning
// Mutex/RwLock unwrap() is acceptable because:
// 1. Poison means panic happened while locked
// 2. Data may be corrupted, cannot safely continue
// 3. Unwrap propagates panic (correct behavior)

// ✅ ACCEPTABLE in most cases
let guard = mutex.lock().unwrap();

// ✅ BETTER: Handle poison if recovery possible
let guard = mutex.lock()
    .unwrap_or_else(|poisoned| {
        error!("Mutex poisoned, data may be corrupted");
        poisoned.into_inner()  // Use data anyway (risky!)
    });

// ✅ BEST: Avoid shared mutable state
// Use message passing (channels) instead of locks
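
The message-passing option can be sketched as follows: one thread owns the state and everyone else talks to it through a channel, so there is no lock to poison. CounterMsg, spawn_counter, and read_total are hypothetical names, and the String errors are kept only to keep the example short.

```rust
use std::sync::mpsc;
use std::thread;

/// Requests understood by the owning thread.
pub enum CounterMsg {
    Add(u64),
    /// Ask for the current total; the reply comes back on the enclosed sender.
    Get(mpsc::Sender<u64>),
}

/// Spawns a thread that owns the counter; no Mutex is involved.
pub fn spawn_counter() -> mpsc::Sender<CounterMsg> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut total = 0u64;
        // The loop ends when every sender has been dropped.
        for msg in rx {
            match msg {
                CounterMsg::Add(n) => total = total.saturating_add(n),
                CounterMsg::Get(reply) => {
                    // The requester may have gone away; that is not fatal here.
                    let _ = reply.send(total);
                }
            }
        }
    });
    tx
}

/// Reads the total, reporting a closed channel as an error instead of panicking.
pub fn read_total(counter: &mpsc::Sender<CounterMsg>) -> Result<u64, String> {
    let (reply_tx, reply_rx) = mpsc::channel();
    counter
        .send(CounterMsg::Get(reply_tx))
        .map_err(|e| format!("Counter thread is gone: {}", e))?;
    reply_rx
        .recv()
        .map_err(|e| format!("Counter thread dropped the reply: {}", e))
}
```

If the owning thread dies, callers see a send or recv error they can handle, rather than a poisoned lock.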

Special Cases

When unwrap() IS Acceptable

1. Test Code

#[cfg(test)]
mod tests {
    #[test]
    fn test_sstable_creation() {
        let entries = vec![entry1, entry2];
        let sstable = create_sstable(entries).unwrap();  // ✅ OK in tests
        assert_eq!(sstable.entries.len(), 2);
    }
}

2. Static/Compile-Time Validated Data

// ✅ OK: The pattern is a fixed literal, so it cannot be broken by runtime input
static EMAIL_REGEX: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
        .unwrap()  // Panics on first use if the regex is invalid; cover it with a unit test
});

3. Mutex/RwLock Poison (See Pattern 7)

4. Initialization (Once Cell, Lazy Static)

// ✅ OK: Initialize once, panic if fails
static CONFIG: OnceCell<Config> = OnceCell::new();

fn init_config(path: &Path) {
    let config = load_config(path).unwrap();  // Panic on startup if config invalid
    CONFIG.set(config).unwrap();
}
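
Where a startup panic is not acceptable to the caller, the same one-time initialization can report failure as a value. This is a minimal sketch using std::sync::OnceLock (stable since Rust 1.70) in place of the OnceCell type shown above; StartupConfig and InitError are hypothetical names.

```rust
use std::sync::OnceLock;

// Hypothetical config type for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct StartupConfig {
    pub port: u16,
}

#[derive(Debug, PartialEq)]
pub enum InitError {
    AlreadyInitialized,
}

static CONFIG: OnceLock<StartupConfig> = OnceLock::new();

/// Stores the config once; a second call is reported, not panicked on.
pub fn init_config(config: StartupConfig) -> Result<(), InitError> {
    CONFIG.set(config).map_err(|_| InitError::AlreadyInitialized)
}

/// Returns `None` until `init_config` has succeeded.
pub fn config() -> Option<&'static StartupConfig> {
    CONFIG.get()
}
```

The caller (typically main) then decides whether an initialization failure should abort the process.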

When to Use expect() vs unwrap()

General Rule: Prefer neither. Use ? or explicit error handling.

If You Must:

// Slightly better: expect() with explanation
let config = load_config(path)
    .expect("Config file must be valid at startup");

// Same as:
let config = load_config(path).unwrap();

Verdict: expect() is marginally better than unwrap() (message in panic), but both should be avoided in production code.

🧪 Testing Error Handling

Test That Errors Are Returned

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_create_sstable_empty_entries() {
        let result = create_sstable(vec![]);
        assert!(result.is_err());

        let err = result.unwrap_err();
        assert!(matches!(err, HeliosError::Storage(_)));
    }

    #[test]
    fn test_create_sstable_valid() {
        let entries = vec![
            Entry::new(b"key1".to_vec(), b"value1".to_vec()),
            Entry::new(b"key2".to_vec(), b"value2".to_vec()),
        ];

        let result = create_sstable(entries);
        assert!(result.is_ok());

        let sstable = result.unwrap();
        assert_eq!(sstable.entries.len(), 2);
    }

    #[test]
    fn test_error_message_quality() {
        let result = create_sstable(vec![]);
        let err_msg = format!("{}", result.unwrap_err());

        // Error messages should be descriptive
        assert!(err_msg.contains("empty"));
        assert!(err_msg.contains("SSTable") || err_msg.contains("entries"));
    }
}

Refactoring Guidelines

Step 1: Identify unwrap() Calls

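# Find all unwrap() in production code (exclude tests)
# Note: --exclude-dir only skips tests/ directories; inline #[cfg(test)] modules under src/ still match
grep -r "\.unwrap()" crate-name/src/ --exclude-dir=tests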

# Priority order:
# 1. CRITICAL: In storage/transaction/consensus paths
# 2. HIGH: In frequently executed paths
# 3. MEDIUM: In utility functions
# 4. LOW: In rarely executed paths

Step 2: Categorize Each unwrap()

For each unwrap(), ask:

Can this fail? (Yes = must fix, No = consider expect())
How often is it called? (Frequent = higher priority)
What happens if it panics? (Data loss = CRITICAL)
Is there a better pattern? (Helper function? Early validation?)

Step 3: Apply Appropriate Fix

Use the patterns in this document to fix each unwrap().

Step 4: Update Tests

Ensure tests cover both success and error cases.

Step 5: Verify

# Compile
cargo check -p crate-name

# Test
cargo test -p crate-name

# Clippy (optional: enforce no unwrap)
cargo clippy -p crate-name -- -D clippy::unwrap_used

🎓 Learning Resources

Internal Resources

Security Audit Report: docs/SECURITY_AUDIT_REPORT.md
Security Remediation Plan: docs/SECURITY_REMEDIATION_PLAN.md
Day 1 Completion Report: docs/SECURITY_FIX_DAY1_COMPLETE.md

Example Files (Good Error Handling)

heliosdb-security/src/ (Grade: 8.5/10)
- Zero unwrap() in production code
- Model for other crates

External Resources

Rust Error Handling Survey
anyhow Crate - Ergonomic error handling
thiserror Crate - Derive macros for custom errors

Checklist for Code Review

For Authors

Before submitting code, verify:

Zero unwrap() in production code (except special cases)
Zero expect() in production code
All Result types propagated with ?
Error messages are descriptive
Edge cases validated (empty collections, None, etc.)
Tests cover error cases
Documentation explains error conditions

For Reviewers

Check for:

No unwrap() or expect() in production paths
Proper Result<T, E> usage
Descriptive error types (not String)
Error context preserved
Edge cases handled
Tests for error paths
unsafe blocks documented (if any)

Quick Migration Guide

Before (Panic-Prone)

fn process(data: &[u8]) -> Vec<u8> {
    let first = data.first().unwrap();
    let last = data.last().unwrap();
    let result = compute(*first, *last).unwrap();
    result.to_vec()
}

After (Safe)

fn process(data: &[u8]) -> Result<Vec<u8>, HeliosError> {
    if data.is_empty() {
        return Err(HeliosError::InvalidInput("Data cannot be empty".to_string()));
    }

    let first = data[0];
    let last = data[data.len() - 1];

    let result = compute(first, last)
        .map_err(|e| HeliosError::Computation(e.to_string()))?;

    Ok(result.to_vec())
}

Changes

Return Result instead of plain type
Early validation (empty check)
Index access (safe after validation)
Error propagation with ?
Descriptive error messages

Tips & Tricks

Tip 1: Use Clippy Lints

# Cargo.toml (the [lints] table requires Rust 1.74 or newer)
[lints.clippy]
unwrap_used = "deny"
expect_used = "warn"
panic = "deny"
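
The same lints can also be scoped in source code, which helps when a crate is migrated one module at a time. The sketch below denies unwrap() and expect() for a single module and re-allows unwrap() in its tests; the module name storage and the function parse_block_size are hypothetical.

```rust
// Outer attributes scope the lints to this module only.
#[deny(clippy::unwrap_used, clippy::expect_used)]
pub mod storage {
    /// Parses a block size, returning a descriptive error instead of panicking.
    pub fn parse_block_size(raw: &str) -> Result<u32, String> {
        raw.trim()
            .parse::<u32>()
            .map_err(|e| format!("Invalid block size '{}': {}", raw, e))
    }

    #[cfg(test)]
    #[allow(clippy::unwrap_used)] // Tests may unwrap, per the Golden Rules.
    mod tests {
        use super::parse_block_size;

        #[test]
        fn parses_valid_size() {
            assert_eq!(parse_block_size("4096").unwrap(), 4096);
        }
    }
}
```

These attributes only take effect when the code is checked with cargo clippy; a plain cargo build ignores them.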

Tip 2: Pre-commit Hook

#!/bin/bash
# Deny unwrap() in production code
# (--diff-filter=ACM skips deleted files, which grep could not open)
if git diff --cached --name-only --diff-filter=ACM | grep -E "src/.*\.rs$" | xargs grep -l "\.unwrap()" ; then
    echo "ERROR: unwrap() found in production code!"
    echo "Please use proper error handling."
    exit 1
fi

Tip 3: IDE Configuration

Configure your IDE to highlight unwrap() calls:

VS Code: Rust Analyzer → Diagnostics → Clippy
IntelliJ IDEA: Rust Plugin → Inspections → Enable clippy

Tip 4: Error Message Template

Use this template for error messages:

"<What failed>: <Why it failed> [<Context>]"

Examples:
"Failed to create SSTable: empty entries"
"Failed to read file /data/sstable.db: Permission denied"
"Invalid port '99999': number too large to fit in target type"

📚 Summary

Key Takeaways

Never unwrap() in production - Use Result and ? operator
Validate early - Check inputs before processing
Descriptive errors - Help debugging with good messages
Helper functions - Centralize common error handling patterns
Test error paths - Ensure errors are handled correctly
Code review - Catch unwrap() before merge

Production-Ready Error Handling

Before: Code with unwrap() = 🔴 Production risk
After: Code with Result + ? = Production ready

Questions?

Contact: security-team@heliosdb.com

Document Version: 1.0
Last Updated: November 9, 2025
Status: Active Guidelines
Next Review: December 9, 2025