Skip to content

HeliosDB DML Operations - User Guide

HeliosDB DML Operations - User Guide

Status: Phase 1 Complete - Index Maintenance Available Version: P0-2 Phase 1 Release Last Updated: November 5, 2025


⚠ Important Notice - Phase 1 Complete!

DML operations (INSERT, UPDATE, DELETE) in HeliosDB now include automatic index maintenance when enabled. Phase 1 eliminates the critical index corruption risk.

What’s New (Phase 1 - Nov 5, 2025):

  • Index Maintenance Available - Enable via with_index_manager()
  • No more index corruption risk when using DML on indexed tables
  • Automatic index entry creation (INSERT)
  • Automatic index entry updates (UPDATE)
  • Automatic index entry removal (DELETE)

Remaining Limitations (Phase 2-3 in progress):

  • ⏳ Does NOT use indexes for query optimization (Phase 2)
  • ⏳ Does NOT use shard-aware routing in distributed deployments (Phase 3)
  • ⏳ Performance still 10x-10,000x slower than equivalent SELECT queries (until Phase 2-3)
  • ACID transactions via XA coordinator (works correctly)
  • Index maintenance complete

Recommendation:

  • WITH index maintenance enabled: Safe for production use on indexed tables
  • WITHOUT index maintenance: Use only for non-indexed tables or tables < 1M rows

Table of Contents

  1. What Works
  2. Critical Limitations
  3. Performance Characteristics
  4. Best Practices
  5. Workarounds
  6. When to Use vs Avoid
  7. Roadmap

What Works

INSERT Operation

INSERT INTO users (id, name, email, status)
VALUES (1, 'Alice', 'alice@example.com', 'active');
INSERT INTO users (id, name, email, status)
VALUES
(2, 'Bob', 'bob@example.com', 'active'),
(3, 'Charlie', 'charlie@example.com', 'pending');

Functional:

  • Single row inserts
  • Multi-row batch inserts
  • NULL value handling
  • All data types (int, float, string, bytes, boolean)
  • XA transaction integration
  • Storage persistence
  • MVCC versioning

Limitations:

  • Index maintenance available (Phase 1 complete) - enable via with_index_manager()
  • ⏳ No shard-aware key generation (Phase 3)
  • ⏳ Not optimized for large batches >1000 rows (Phase 4)

UPDATE Operation

UPDATE users
SET status = 'verified', updated_at = NOW()
WHERE email = 'alice@example.com';
UPDATE users
SET status = 'inactive'
WHERE last_login < '2024-01-01' AND status = 'pending';

Functional:

  • Single/multiple column updates
  • WHERE clause with multiple predicates
  • Operators: =, !=, <, <=, >, >=, LIKE, IS NULL, IS NOT NULL
  • Type coercion (int ↔ float)
  • Transaction support

Limitations:

  • FULL TABLE SCAN - O(N) complexity (Phase 2 will add index-based lookup)
  • ⏳ No index usage for WHERE clause (Phase 2)
  • Index maintenance available (Phase 1 complete) - enable via with_index_manager()
  • ⏳ Scans ALL shards in distributed mode (Phase 3 will add shard-aware routing)
  • ⏳ In-memory predicate evaluation (Phase 4 will add predicate pushdown)

DELETE Operation

DELETE FROM users WHERE id = 12345;
DELETE FROM logs WHERE created_at < '2024-01-01';
DELETE FROM temp_data WHERE status = 'processed';

Functional:

  • Single/multiple row deletes
  • WHERE clause support
  • Tombstone-based deletion (LSM safe)
  • Transaction support

Limitations:

  • FULL TABLE SCAN - O(N) complexity
  • ❌ No index usage
  • ❌ No partition pruning

Critical Limitations

1. Index Corruption Risk ⚠ CRITICAL

Problem: DML operations do NOT maintain global secondary indexes.

Impact:

-- Step 1: Create index
CREATE INDEX idx_email ON users(email);
-- Step 2: INSERT new row
INSERT INTO users VALUES (100, 'Dave', 'dave@example.com', 'active');
-- ❌ Index NOT updated - missing entry for dave@example.com
-- Step 3: Query using index
SELECT * FROM users WHERE email = 'dave@example.com';
-- ❌ Returns no results (index doesn't have the entry)
-- Step 4: UPDATE existing row
UPDATE users SET email = 'alice.new@example.com' WHERE id = 1;
-- ❌ Index has stale entry for old email, missing entry for new email
-- Step 5: DELETE row
DELETE FROM users WHERE id = 2;
-- ❌ Index still has entry pointing to deleted row

Result: Queries return incorrect results. Data integrity compromised.

Workaround:

-- Option 1: Drop indexes before using DML
DROP INDEX idx_email;
-- Do DML operations
-- Recreate index after
CREATE INDEX idx_email ON users(email);
-- Option 2: Don't use indexes on tables with DML
-- (defeats purpose of indexes)
-- Option 3: Wait for Phase 1 enhancement (ETA: 1 week)

Status: Fix in progress - Phase 1 of enhancement roadmap (16 hours).


2. Full Table Scan Performance

Problem: UPDATE/DELETE scan entire table regardless of WHERE clause.

Example:

-- Table: 100M users
UPDATE users SET status = 'verified' WHERE user_id = 12345;

Current Behavior:

  1. Scans all 100M rows
  2. Evaluates predicate in memory for each row
  3. Finds 1 matching row
  4. Updates 1 row
  5. Time: 2-5 minutes

Expected Behavior (with indexes):

  1. Index lookup on user_id → O(log 100M) = ~27 comparisons
  2. Fetch 1 row directly
  3. Update 1 row
  4. Time: <100ms

Performance Gap: 1,200x slower


3. No Shard-Aware Routing

Problem: In distributed deployments, DML scans ALL shards.

Example:

-- Sharded table: 100 shards based on user_id
UPDATE users SET status = 'verified' WHERE user_id = 12345;

Current Behavior:

  • Scans all 100 shards
  • user_id = 12345 exists on only 1 shard
  • 99 shards scanned unnecessarily
  • Network overhead: 99x
  • Time: 100x slower

Expected Behavior:

  • Determine shard from user_id hash
  • Scan only 1 shard
  • 100x performance improvement

Status: Fix planned - Phase 3 (20 hours).


Performance Characteristics

Complexity Analysis

OperationCurrentWith Phase 2 (Indexes)With Phase 3 (Sharding)
INSERTO(1)O(log N) + O(1)O(1) per shard
UPDATE (1 row)O(N)O(log N)O(log N/S)
DELETE (1 row)O(N)O(log N)O(log N/S)
UPDATE (range)O(N)O(M + log N)O(M + log N/S)

Where:

  • N = total rows in table
  • M = matching rows
  • S = number of shards

Benchmark Results

Setup: 10M row table, single node

QueryCurrent (Beta)Target (Phase 2-3)Improvement
INSERT 1 row5ms5ms1x (same)
UPDATE WHERE id = X1,200ms<1ms1,200x
DELETE WHERE id = X1,100ms<1ms1,100x
UPDATE range (100 rows)1,300ms100ms13x
DELETE range (1000 rows)1,500ms200ms7.5x

Setup: 100M row table, 10 shards

QueryCurrent (Beta)Target (Phase 2-3)Improvement
UPDATE WHERE shard_key = X180,000ms (3 min)<10ms18,000x
DELETE WHERE shard_key = X120,000ms (2 min)<10ms12,000x

Best Practices

DO

  1. Use for small tables

    -- Good: Configuration table (1000 rows)
    UPDATE config SET value = 'new_value' WHERE key = 'timeout';
    -- Good: User settings (100K rows max)
    UPDATE user_settings SET theme = 'dark' WHERE user_id = 12345;
  2. Disable indexes before DML operations

    DROP INDEX idx_email;
    -- Do bulk updates
    UPDATE users SET status = 'migrated';
    -- Recreate index
    CREATE INDEX idx_email ON users(email);
  3. Use batch inserts

    -- Good: Insert multiple rows at once
    INSERT INTO logs VALUES
    (1, 'log1', NOW()),
    (2, 'log2', NOW()),
    (3, 'log3', NOW());
  4. Add WHERE clauses to limit scope

    -- Even though full scan happens, reduces work
    UPDATE users SET status = 'active'
    WHERE created_at > '2024-01-01' AND status = 'pending';
  5. Use transactions for atomicity

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT;

❌ DON’T

  1. Don’t use on large tables (> 1M rows)

    -- Bad: Will take minutes
    UPDATE users SET last_checked = NOW(); -- 100M rows
  2. Don’t use with indexes enabled

    -- Bad: Creates inconsistent indexes
    CREATE INDEX idx_email ON users(email);
    INSERT INTO users VALUES (100, 'Dave', 'dave@example.com');
    -- Index is now stale!
  3. Don’t use in hot paths

    -- Bad: User-facing API endpoint
    app.post('/update-profile', async (req, res) => {
    await db.query('UPDATE users SET name = $1 WHERE id = $2', [name, id]);
    // This could take minutes on large tables!
    });
  4. Don’t use WHERE clauses on non-shard keys in distributed mode

    -- Bad: Scans all shards
    UPDATE users SET status = 'verified' WHERE email = 'alice@example.com';
    -- (if shard key is user_id, not email)

Workarounds

Workaround 1: Partition Large Updates

Instead of:

-- Bad: Single huge update
UPDATE logs SET archived = true WHERE created_at < '2024-01-01';
-- Scans 1B rows

Do:

-- Good: Process in batches
UPDATE logs SET archived = true
WHERE created_at BETWEEN '2023-01-01' AND '2023-01-31'
AND id IN (SELECT id FROM logs WHERE created_at BETWEEN '2023-01-01' AND '2023-01-31' LIMIT 10000);
-- Repeat for each month

Workaround 2: Use SELECT + DELETE Pattern

Instead of:

-- Bad: Direct DELETE (full scan)
DELETE FROM logs WHERE created_at < '2024-01-01';

Do:

-- Good: Use SELECT (optimized) then targeted DELETE
-- Step 1: Get IDs using optimized SELECT
SELECT id FROM logs WHERE created_at < '2024-01-01' LIMIT 10000;
-- Step 2: Delete by primary key (faster)
DELETE FROM logs WHERE id IN (1, 2, 3, ...);
-- Step 3: Repeat until done

Workaround 3: Rebuild Indexes After Bulk Operations

-- Disable indexes
DROP INDEX idx_email;
DROP INDEX idx_status;
-- Do bulk operations
INSERT INTO users SELECT * FROM staging_users; -- 10M rows
UPDATE users SET migrated = true WHERE source = 'legacy';
-- Rebuild indexes (creates correct indexes from scratch)
CREATE INDEX idx_email ON users(email);
CREATE INDEX idx_status ON users(status);

Workaround 4: Use Time-Based Partitioning

For time-series data:

-- Instead of: DELETE FROM logs WHERE created_at < '2023-01-01';
-- Use: DROP partition
DROP TABLE logs_2022;
-- Much faster than DELETE

When to Use vs Avoid

Good Use Cases

  1. Development and Testing

    • Small datasets
    • Non-production environments
    • Feature testing
  2. Configuration Tables

    • System settings
    • Feature flags
    • Metadata tables
  3. Small Reference Tables

    • Countries (< 300 rows)
    • Categories (< 10,000 rows)
    • User roles (< 100 rows)
  4. Bulk Load + Rebuild

    • Initial data migration
    • Nightly batch jobs
    • Data warehouse ETL (with index rebuild)
  5. Transactional Operations

    • Account transfers
    • Inventory updates
    • Order processing (if tables are small)

❌ Avoid in Production For

  1. Large Tables

    • User tables (millions of users)
    • Event logs (billions of events)
    • Metrics/telemetry data
  2. Real-Time Updates

    • User profile updates during login
    • Shopping cart modifications
    • Real-time leaderboards
  3. Indexed Tables

    • Any table with global secondary indexes
    • Until Phase 1 is complete
  4. Distributed/Sharded Tables

    • Multi-shard deployments
    • Until Phase 3 is complete
  5. High-Concurrency Scenarios

    • 100 concurrent DML operations/sec

    • Wait for performance enhancements

Roadmap

Phase 1: Index Maintenance (CRITICAL)

Status: Not started ETA: 1 week Effort: 16 hours

Fixes:

  • INSERT maintains index entries
  • UPDATE updates index entries
  • DELETE removes index entries
  • Indexes remain consistent

Impact: Eliminates index corruption risk. Safe to use DML with indexes.

Phase 2: Index-Based Lookup (HIGH PRIORITY)

Status: Not started ETA: 1.5 weeks Effort: 12 hours

Fixes:

  • UPDATE uses indexes for WHERE clauses
  • DELETE uses indexes for WHERE clauses
  • 100x-1000x performance improvement

Impact: UPDATE/DELETE become usable for large tables (if indexed columns in WHERE).

Phase 3: Shard-Aware Routing (HIGH PRIORITY)

Status: Not started ETA: 2 weeks Effort: 20 hours

Fixes:

  • DML targets specific shards
  • Parallel execution across shards
  • 100x improvement for distributed deployments

Impact: Production-ready for sharded HeliosDB clusters.

Phase 4-6: Additional Optimizations (MEDIUM PRIORITY)

Status: Not started ETA: 3 weeks Effort: 40 hours

Fixes:

  • Predicate pushdown
  • Partition pruning
  • Distributed foreign keys

Migration Path

Current (Beta) → Phase 1

Breaking Changes: None Action Required: None (automatic improvement)

Phase 1 → Phase 2

Breaking Changes: None Action Required: None (automatic improvement)

Phase 2 → Phase 3

Breaking Changes: None Action Required: Configure sharding if using distributed deployment


Support and Feedback

Documentation:

  • Technical Analysis: docs/P0_2_DML_FEATURE_INTEGRATION_ANALYSIS.md
  • Enhancement Roadmap: docs/P0_2_DML_ENHANCEMENT_ROADMAP.md
  • Storage Integration: docs/P0_2_STORAGE_INTEGRATION_COMPLETE.md

Questions?:

Report Bugs:

  • Critical (index corruption, data loss): Immediate attention
  • Performance issues: Tracked for Phase 2-3
  • Feature requests: Tracked for future phases

Example Applications

Safe: Configuration Service

CREATE TABLE config (
key VARCHAR PRIMARY KEY,
value TEXT,
updated_at TIMESTAMP
);
-- Safe: Small table, no indexes
INSERT INTO config VALUES ('timeout', '30s', NOW());
UPDATE config SET value = '60s' WHERE key = 'timeout';

Safe: Small User Settings

CREATE TABLE user_preferences (
user_id INT PRIMARY KEY,
theme VARCHAR,
language VARCHAR
);
-- Safe: Primary key lookup, ~100K users
UPDATE user_preferences
SET theme = 'dark'
WHERE user_id = 12345;

Unsafe: Large User Table

CREATE TABLE users (
id BIGINT PRIMARY KEY,
email VARCHAR UNIQUE,
status VARCHAR
);
CREATE INDEX idx_email ON users(email);
-- Unsafe: 100M rows, index will corrupt
UPDATE users SET status = 'verified' WHERE email = 'alice@example.com';
-- Takes minutes, corrupts idx_email

Workaround: Rebuild Index

-- Drop index first
DROP INDEX idx_email;
-- Do bulk updates
UPDATE users SET status = 'verified' WHERE email LIKE '%@example.com';
-- Rebuild index (clean slate)
CREATE INDEX idx_email ON users(email);

Appendix: Performance Tuning

Query Timeout Configuration

-- Set timeout to prevent runaway queries
SET statement_timeout = '30s';
-- Or in connection string
postgres://user:pass@host/db?statement_timeout=30s

Monitoring DML Performance

-- Enable query logging
SET log_statement = 'all';
SET log_duration = 'on';
-- Check slow queries
SELECT query, duration
FROM pg_stat_statements
WHERE query LIKE '%UPDATE%' OR query LIKE '%DELETE%'
ORDER BY duration DESC;

Batch Size Recommendations

-- INSERT: Optimal batch size 100-1000 rows
INSERT INTO logs SELECT * FROM staging LIMIT 1000;
-- UPDATE: Process in chunks
UPDATE users SET migrated = true
WHERE id >= 0 AND id < 10000; -- First 10K
-- Repeat for next batch

Document Version: 1.0 Last Updated: November 5, 2025 Applies To: HeliosDB P0-2 DML (Beta) Next Update: After Phase 1 completion