HeliosDB DML Operations - User Guide
Status: Phase 1 Complete - Index Maintenance Available
Version: P0-2 Phase 1 Release
Last Updated: November 5, 2025
⚠ Important Notice - Phase 1 Complete!
DML operations (INSERT, UPDATE, DELETE) in HeliosDB now include automatic index maintenance when enabled. Phase 1 eliminates the critical index corruption risk.
What’s New (Phase 1 - Nov 5, 2025):
- Index Maintenance Available - enable via `with_index_manager()`
- No more index corruption risk when using DML on indexed tables
- Automatic index entry creation (INSERT)
- Automatic index entry updates (UPDATE)
- Automatic index entry removal (DELETE)
Remaining Limitations (Phase 2-3 in progress):
- ⏳ Does NOT use indexes for query optimization (Phase 2)
- ⏳ Does NOT use shard-aware routing in distributed deployments (Phase 3)
- ⏳ Performance still 10x-10,000x slower than equivalent SELECT queries (until Phase 2-3)
- ACID transactions via XA coordinator (works correctly)
- Index maintenance complete
Recommendation:
- WITH index maintenance enabled: Safe for production use on indexed tables
- WITHOUT index maintenance: Use only for non-indexed tables or tables < 1M rows
Table of Contents
- What Works
- Critical Limitations
- Performance Characteristics
- Best Practices
- Workarounds
- When to Use vs Avoid
- Roadmap
What Works
INSERT Operation
```sql
-- Single row
INSERT INTO users (id, name, email, status)
VALUES (1, 'Alice', 'alice@example.com', 'active');

-- Multi-row batch
INSERT INTO users (id, name, email, status)
VALUES (2, 'Bob', 'bob@example.com', 'active'),
       (3, 'Charlie', 'charlie@example.com', 'pending');
```

Functional:
- Single row inserts
- Multi-row batch inserts
- NULL value handling
- All data types (int, float, string, bytes, boolean)
- XA transaction integration
- Storage persistence
- MVCC versioning
Limitations:
- Index maintenance available (Phase 1 complete) - enable via `with_index_manager()`
- ⏳ No shard-aware key generation (Phase 3)
- ⏳ Not optimized for large batches >1000 rows (Phase 4)
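The INSERT guarantees above include MVCC versioning: each write creates a new row version tagged with a transaction ID, and a reader resolves the newest version visible at its snapshot. The sketch below is illustrative Python only; HeliosDB's actual version format is not documented in this guide.

```python
# Minimal multi-version (MVCC) store sketch: writes append versions,
# reads resolve the newest version visible at a given snapshot.
# Illustrative only -- not HeliosDB's actual storage layout.
class MvccStore:
    def __init__(self):
        self.versions = {}  # key -> list of (txn_id, value), txn_id ascending

    def put(self, key, value, txn_id):
        # A write never overwrites; it appends a new version.
        self.versions.setdefault(key, []).append((txn_id, value))

    def get(self, key, snapshot):
        # Return the newest value written at or before `snapshot`.
        for txn_id, value in reversed(self.versions.get(key, [])):
            if txn_id <= snapshot:
                return value
        return None  # key did not exist at this snapshot

store = MvccStore()
store.put("user:1", "Alice", txn_id=10)
store.put("user:1", "Alice B.", txn_id=20)
print(store.get("user:1", snapshot=15))  # sees the txn-10 version: Alice
```

A snapshot between the two writes sees the older version, which is what lets concurrent readers proceed without blocking writers.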
UPDATE Operation
```sql
UPDATE users
SET status = 'verified', updated_at = NOW()
WHERE email = 'alice@example.com';

UPDATE users
SET status = 'inactive'
WHERE last_login < '2024-01-01' AND status = 'pending';
```

Functional:
- Single/multiple column updates
- WHERE clause with multiple predicates
- Operators: =, !=, <, <=, >, >=, LIKE, IS NULL, IS NOT NULL
- Type coercion (int ↔ float)
- Transaction support
Limitations:
- ⏳ FULL TABLE SCAN - O(N) complexity (Phase 2 will add index-based lookup)
- ⏳ No index usage for WHERE clause (Phase 2)
- Index maintenance available (Phase 1 complete) - enable via `with_index_manager()`
- ⏳ Scans ALL shards in distributed mode (Phase 3 will add shard-aware routing)
- ⏳ In-memory predicate evaluation (Phase 4 will add predicate pushdown)
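To make the limitations above concrete, here is a toy Python model of a full-scan UPDATE: every row is visited and the WHERE predicate is evaluated in memory, even when only one row matches. This illustrates the O(N) complexity class only; it is not HeliosDB code.

```python
# Toy model of a full-table-scan UPDATE: every row is touched and the
# predicate is evaluated in memory, regardless of how few rows match.
def full_scan_update(rows, predicate, changes):
    """Apply `changes` to rows matching `predicate`; return (matched, scanned)."""
    scanned = matched = 0
    for row in rows:
        scanned += 1              # O(N): the scan cost is paid for every row
        if predicate(row):
            row.update(changes)   # only matching rows are actually modified
            matched += 1
    return matched, scanned

users = [{"id": i, "status": "pending"} for i in range(100_000)]
matched, scanned = full_scan_update(
    users, lambda r: r["id"] == 12345, {"status": "verified"}
)
print(matched, scanned)  # 1 match, but 100000 rows scanned
```

Phase 2's index-based lookup replaces the `for row in rows` scan with an O(log N) index probe, which is where the projected speedups come from.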
DELETE Operation
```sql
DELETE FROM users WHERE id = 12345;
DELETE FROM logs WHERE created_at < '2024-01-01';
DELETE FROM temp_data WHERE status = 'processed';
```

Functional:
- Single/multiple row deletes
- WHERE clause support
- Tombstone-based deletion (LSM safe)
- Transaction support
Limitations:
- ⏳ FULL TABLE SCAN - O(N) complexity (Phase 2 will add index-based lookup)
- ⏳ No index usage for WHERE clause (Phase 2)
- ⏳ No partition pruning (Phase 4)
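"Tombstone-based deletion" in the functional list means a DELETE writes a deletion marker rather than erasing data in place, which is the safe approach in an LSM tree because older data files are immutable. A minimal illustrative sketch (not HeliosDB internals):

```python
# Tombstone deletes in an LSM-style store: DELETE writes a marker instead of
# mutating older immutable data; reads treat the marker as "key absent".
TOMBSTONE = object()  # sentinel deletion marker

class LsmSketch:
    def __init__(self):
        self.memtable = {}  # newest write for a key wins

    def put(self, key, value):
        self.memtable[key] = value

    def delete(self, key):
        # Don't erase -- record a tombstone that shadows older versions.
        self.memtable[key] = TOMBSTONE

    def get(self, key):
        value = self.memtable.get(key)
        return None if value is TOMBSTONE else value

db = LsmSketch()
db.put("user:2", "Bob")
db.delete("user:2")
print(db.get("user:2"))  # None -- the tombstone hides the old value
```

In a real LSM engine the tombstone is eventually dropped during compaction once it has shadowed every older copy of the key.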
Critical Limitations
1. Index Corruption Risk ⚠ CRITICAL (when index maintenance is disabled)
Problem: Without index maintenance enabled (via `with_index_manager()`), DML operations do NOT maintain global secondary indexes.
Impact:

```sql
-- Step 1: Create index
CREATE INDEX idx_email ON users(email);

-- Step 2: INSERT new row
INSERT INTO users VALUES (100, 'Dave', 'dave@example.com', 'active');
-- ❌ Index NOT updated - missing entry for dave@example.com

-- Step 3: Query using index
SELECT * FROM users WHERE email = 'dave@example.com';
-- ❌ Returns no results (index doesn't have the entry)

-- Step 4: UPDATE existing row
UPDATE users SET email = 'alice.new@example.com' WHERE id = 1;
-- ❌ Index has stale entry for old email, missing entry for new email

-- Step 5: DELETE row
DELETE FROM users WHERE id = 2;
-- ❌ Index still has entry pointing to deleted row
```

Result: Queries return incorrect results. Data integrity compromised.
Workaround:

```sql
-- Option 1 (preferred): Enable index maintenance via with_index_manager()

-- Option 2: Drop indexes before using DML
DROP INDEX idx_email;
-- Do DML operations
-- Recreate index after
CREATE INDEX idx_email ON users(email);

-- Option 3: Don't use indexes on tables with DML
-- (defeats purpose of indexes)
```

Status: Fixed in Phase 1 (November 5, 2025) - enable index maintenance via `with_index_manager()`.
2. Full Table Scan Performance
Problem: UPDATE/DELETE scan entire table regardless of WHERE clause.
Example:
```sql
-- Table: 100M users
UPDATE users SET status = 'verified' WHERE user_id = 12345;
```

Current Behavior:
- Scans all 100M rows
- Evaluates predicate in memory for each row
- Finds 1 matching row
- Updates 1 row
- Time: 2-5 minutes
Expected Behavior (with indexes):
- Index lookup on user_id → O(log 100M) = ~27 comparisons
- Fetch 1 row directly
- Update 1 row
- Time: <100ms
Performance Gap: 1,200x slower
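The "~27 comparisons" and the 1,200x gap quoted above follow from simple arithmetic on this section's numbers:

```python
# Sanity-check the figures above: balanced-index lookup depth and the
# full-scan vs indexed-path performance gap.
import math

n = 100_000_000                        # 100M rows
comparisons = math.ceil(math.log2(n))  # depth of a balanced index lookup
print(comparisons)                     # 27

full_scan_ms = 2 * 60 * 1000  # low end of the "2-5 minutes" estimate
indexed_ms = 100              # "<100ms" target
print(full_scan_ms // indexed_ms)  # 1200 -> the quoted 1,200x gap
```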
3. No Shard-Aware Routing
Problem: In distributed deployments, DML scans ALL shards.
Example:
```sql
-- Sharded table: 100 shards based on user_id
UPDATE users SET status = 'verified' WHERE user_id = 12345;
```

Current Behavior:
- Scans all 100 shards
- user_id = 12345 exists on only 1 shard
- 99 shards scanned unnecessarily
- Network overhead: 99x
- Time: 100x slower
Expected Behavior:
- Determine shard from user_id hash
- Scan only 1 shard
- 100x performance improvement
Status: Fix planned - Phase 3 (20 hours).
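The routing described above is typically a stable hash of the shard key modulo the shard count. The SHA-1-based hash below is an assumption for illustration only; this guide does not specify HeliosDB's actual shard function.

```python
# Hash-based shard routing sketch: map a shard key to exactly one of S shards.
# The hash choice (SHA-1) is an assumption -- the guide does not document
# HeliosDB's real shard function.
import hashlib

def shard_for(key, num_shards):
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

shard = shard_for(12345, 100)
# An UPDATE ... WHERE user_id = 12345 would then be routed to this one
# shard instead of being broadcast to all 100.
print(shard)
```

The point is determinism: the same key always maps to the same shard, so a WHERE clause on the shard key lets the planner skip the other 99 shards entirely.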
Performance Characteristics
Complexity Analysis
| Operation | Current | With Phase 2 (Indexes) | With Phase 3 (Sharding) |
|---|---|---|---|
| INSERT | O(1) | O(log N) + O(1) | O(1) per shard |
| UPDATE (1 row) | O(N) | O(log N) | O(log N/S) |
| DELETE (1 row) | O(N) | O(log N) | O(log N/S) |
| UPDATE (range) | O(N) | O(M + log N) | O(M + log N/S) |
Where:
- N = total rows in table
- M = matching rows
- S = number of shards
Benchmark Results
Setup: 10M row table, single node
| Query | Current (Beta) | Target (Phase 2-3) | Improvement |
|---|---|---|---|
| INSERT 1 row | 5ms | 5ms | 1x (same) |
| UPDATE WHERE id = X | 1,200ms | <1ms | 1,200x |
| DELETE WHERE id = X | 1,100ms | <1ms | 1,100x |
| UPDATE range (100 rows) | 1,300ms | 100ms | 13x |
| DELETE range (1000 rows) | 1,500ms | 200ms | 7.5x |
Setup: 100M row table, 10 shards
| Query | Current (Beta) | Target (Phase 2-3) | Improvement |
|---|---|---|---|
| UPDATE WHERE shard_key = X | 180,000ms (3 min) | <10ms | 18,000x |
| DELETE WHERE shard_key = X | 120,000ms (2 min) | <10ms | 12,000x |
Best Practices
DO
- Use for small tables

```sql
-- Good: Configuration table (1000 rows)
UPDATE config SET value = 'new_value' WHERE key = 'timeout';

-- Good: User settings (100K rows max)
UPDATE user_settings SET theme = 'dark' WHERE user_id = 12345;
```

- Disable indexes before bulk DML (if index maintenance is not enabled)

```sql
DROP INDEX idx_email;
-- Do bulk updates
UPDATE users SET status = 'migrated';
-- Recreate index
CREATE INDEX idx_email ON users(email);
```

- Use batch inserts

```sql
-- Good: Insert multiple rows at once
INSERT INTO logs VALUES
  (1, 'log1', NOW()),
  (2, 'log2', NOW()),
  (3, 'log3', NOW());
```

- Add WHERE clauses to limit scope

```sql
-- Even though a full scan still happens, this reduces the write work
UPDATE users SET status = 'active'
WHERE created_at > '2024-01-01' AND status = 'pending';
```

- Use transactions for atomicity

```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
```
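The BEGIN/COMMIT transfer above is the classic atomicity example. Because this guide shows no client-driver code, the same pattern is illustrated below with Python's standard-library `sqlite3` module as a stand-in for a HeliosDB client: either both balance updates commit, or neither does.

```python
# Atomic two-account transfer, demonstrated with stdlib sqlite3 as a
# stand-in client (this guide does not show a HeliosDB driver API).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 100)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
except sqlite3.Error:
    pass  # on failure the rollback leaves both balances untouched

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 400, 2: 200} -- total is unchanged
```

If either UPDATE raised, the context manager would roll back and the total across both accounts would still be 600, which is exactly the invariant the SQL transaction protects.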
❌ DON'T

- Don't use on large tables (> 1M rows)

```sql
-- Bad: Will take minutes
UPDATE users SET last_checked = NOW(); -- 100M rows
```

- Don't use on indexed tables without index maintenance enabled

```sql
-- Bad: Creates inconsistent indexes when with_index_manager() is not enabled
CREATE INDEX idx_email ON users(email);
INSERT INTO users VALUES (100, 'Dave', 'dave@example.com');
-- Index is now stale!
```

- Don't use in hot paths

```javascript
// Bad: User-facing API endpoint
app.post('/update-profile', async (req, res) => {
  await db.query('UPDATE users SET name = $1 WHERE id = $2', [name, id]);
  // This could take minutes on large tables!
});
```

- Don't use WHERE clauses on non-shard keys in distributed mode

```sql
-- Bad: Scans all shards
UPDATE users SET status = 'verified' WHERE email = 'alice@example.com';
-- (if shard key is user_id, not email)
```
Workarounds
Workaround 1: Partition Large Updates
Instead of:

```sql
-- Bad: Single huge update
UPDATE logs SET archived = true WHERE created_at < '2024-01-01';
-- Scans 1B rows
```

Do:

```sql
-- Good: Process in batches
UPDATE logs SET archived = true
WHERE created_at BETWEEN '2023-01-01' AND '2023-01-31'
AND id IN (
  SELECT id FROM logs
  WHERE created_at BETWEEN '2023-01-01' AND '2023-01-31'
  LIMIT 10000
);
-- Repeat for each month
```

Workaround 2: Use SELECT + DELETE Pattern
Instead of:

```sql
-- Bad: Direct DELETE (full scan)
DELETE FROM logs WHERE created_at < '2024-01-01';
```

Do:

```sql
-- Good: Use SELECT (optimized) then targeted DELETE
-- Step 1: Get IDs using optimized SELECT
SELECT id FROM logs WHERE created_at < '2024-01-01' LIMIT 10000;
-- Step 2: Delete by primary key (faster)
DELETE FROM logs WHERE id IN (1, 2, 3, ...);
-- Step 3: Repeat until done
```

Workaround 3: Rebuild Indexes After Bulk Operations
```sql
-- Disable indexes
DROP INDEX idx_email;
DROP INDEX idx_status;

-- Do bulk operations
INSERT INTO users SELECT * FROM staging_users; -- 10M rows
UPDATE users SET migrated = true WHERE source = 'legacy';

-- Rebuild indexes (creates correct indexes from scratch)
CREATE INDEX idx_email ON users(email);
CREATE INDEX idx_status ON users(status);
```

Workaround 4: Use Time-Based Partitioning
For time-series data:
```sql
-- Instead of: DELETE FROM logs WHERE created_at < '2023-01-01';
-- Use: DROP partition
DROP TABLE logs_2022;
-- Much faster than DELETE
```

When to Use vs Avoid
Good Use Cases
- Development and Testing
  - Small datasets
  - Non-production environments
  - Feature testing
- Configuration Tables
  - System settings
  - Feature flags
  - Metadata tables
- Small Reference Tables
  - Countries (< 300 rows)
  - Categories (< 10,000 rows)
  - User roles (< 100 rows)
- Bulk Load + Rebuild
  - Initial data migration
  - Nightly batch jobs
  - Data warehouse ETL (with index rebuild)
- Transactional Operations
  - Account transfers
  - Inventory updates
  - Order processing (if tables are small)
❌ Avoid in Production For
- Large Tables
  - User tables (millions of users)
  - Event logs (billions of events)
  - Metrics/telemetry data
- Real-Time Updates
  - User profile updates during login
  - Shopping cart modifications
  - Real-time leaderboards
- Indexed Tables (without index maintenance)
  - Any table with global secondary indexes, unless index maintenance is enabled via `with_index_manager()` (Phase 1)
- Distributed/Sharded Tables
  - Multi-shard deployments
  - Until Phase 3 is complete
- High-Concurrency Scenarios
  - More than ~100 concurrent DML operations/sec
  - Wait for performance enhancements
Roadmap
Phase 1: Index Maintenance (CRITICAL)
Status: Complete (November 5, 2025) Effort: 16 hours
Delivered:
- INSERT maintains index entries
- UPDATE updates index entries
- DELETE removes index entries
- Indexes remain consistent
Impact: Eliminates index corruption risk. Safe to use DML with indexes.
Phase 2: Index-Based Lookup (HIGH PRIORITY)
Status: In progress ETA: 1.5 weeks Effort: 12 hours
Fixes:
- UPDATE uses indexes for WHERE clauses
- DELETE uses indexes for WHERE clauses
- 100x-1000x performance improvement
Impact: UPDATE/DELETE become usable for large tables (if indexed columns in WHERE).
Phase 3: Shard-Aware Routing (HIGH PRIORITY)
Status: In progress ETA: 2 weeks Effort: 20 hours
Fixes:
- DML targets specific shards
- Parallel execution across shards
- 100x improvement for distributed deployments
Impact: Production-ready for sharded HeliosDB clusters.
Phase 4-6: Additional Optimizations (MEDIUM PRIORITY)
Status: Not started ETA: 3 weeks Effort: 40 hours
Fixes:
- Predicate pushdown
- Partition pruning
- Distributed foreign keys
Migration Path
Current (Beta) → Phase 1
Breaking Changes: None Action Required: None (automatic improvement)
Phase 1 → Phase 2
Breaking Changes: None Action Required: None (automatic improvement)
Phase 2 → Phase 3
Breaking Changes: None Action Required: Configure sharding if using distributed deployment
Support and Feedback
Documentation:
- Technical Analysis: `docs/P0_2_DML_FEATURE_INTEGRATION_ANALYSIS.md`
- Enhancement Roadmap: `docs/P0_2_DML_ENHANCEMENT_ROADMAP.md`
- Storage Integration: `docs/P0_2_STORAGE_INTEGRATION_COMPLETE.md`
Questions?
- GitHub Issues: https://github.com/heliosdb/heliosdb/issues
- Discord: #dml-operations channel
Report Bugs:
- Critical (index corruption, data loss): Immediate attention
- Performance issues: Tracked for Phase 2-3
- Feature requests: Tracked for future phases
Example Applications
Safe: Configuration Service
```sql
CREATE TABLE config (
  key VARCHAR PRIMARY KEY,
  value TEXT,
  updated_at TIMESTAMP
);

-- Safe: Small table, no indexes
INSERT INTO config VALUES ('timeout', '30s', NOW());
UPDATE config SET value = '60s' WHERE key = 'timeout';
```

Safe: Small User Settings
```sql
CREATE TABLE user_preferences (
  user_id INT PRIMARY KEY,
  theme VARCHAR,
  language VARCHAR
);

-- Safe: Primary key lookup, ~100K users
UPDATE user_preferences
SET theme = 'dark'
WHERE user_id = 12345;
```

Unsafe: Large User Table
```sql
CREATE TABLE users (
  id BIGINT PRIMARY KEY,
  email VARCHAR UNIQUE,
  status VARCHAR
);
CREATE INDEX idx_email ON users(email);

-- Unsafe: 100M rows, full scan; idx_email corrupts if index maintenance is not enabled
UPDATE users SET status = 'verified' WHERE email = 'alice@example.com';
-- Takes minutes; corrupts idx_email without with_index_manager()
```

Workaround: Rebuild Index
```sql
-- Drop index first
DROP INDEX idx_email;

-- Do bulk updates
UPDATE users SET status = 'verified' WHERE email LIKE '%@example.com';

-- Rebuild index (clean slate)
CREATE INDEX idx_email ON users(email);
```

Appendix: Performance Tuning
Query Timeout Configuration
```sql
-- Set timeout to prevent runaway queries
SET statement_timeout = '30s';
```

Or in a connection string:

```
postgres://user:pass@host/db?statement_timeout=30s
```

Monitoring DML Performance
```sql
-- Enable query logging
SET log_statement = 'all';
SET log_duration = 'on';

-- Check slow queries
SELECT query, duration
FROM pg_stat_statements
WHERE query LIKE '%UPDATE%' OR query LIKE '%DELETE%'
ORDER BY duration DESC;
```

Batch Size Recommendations
```sql
-- INSERT: Optimal batch size 100-1000 rows
INSERT INTO logs SELECT * FROM staging LIMIT 1000;

-- UPDATE: Process in chunks
UPDATE users SET migrated = true
WHERE id >= 0 AND id < 10000; -- First 10K
-- Repeat for next batch
```

Document Version: 1.0
Last Updated: November 5, 2025
Applies To: HeliosDB P0-2 DML (Phase 1)
Next Update: After Phase 2 completion