HeliosDB REPL: Bulk Loading & Feature Exploration Guide
HeliosDB REPL: Bulk Loading & Feature Exploration Guide
This guide shows how to populate HeliosDB with test data and explore all HeliosCore features using the REPL.
Table of Contents
- Quick Start
- Data Population Methods
- REPL Bulk Loading Commands
- Exploring HeliosCore Features
- Example Queries
- System Views & Metadata
- Performance Testing
Quick Start
Option 1: Use the Population Utility (Recommended)
# Populate 1 million rows (default) - ~200MB on diskcargo run --release --example populate_test_data
# Populate 10 million rows - ~2GB on diskcargo run --release --example populate_test_data -- 10000000 ./largedata
# Start REPL with populated datacargo run --release -- --data-dir ./helios_testdataOption 2: Run Persistent Benchmark (Creates 15GB Dataset)
# Run full benchmark - creates 75M rows across 3 shardscargo run --release --example hybrid_benchmark_persistent -- 1.0 ./benchmark_data
# Explore one shard in REPLcargo run --release -- --data-dir ./benchmark_data/shard_0Option 3: Populate via REPL (Interactive)
# Start REPLcargo run --release
# Then run SQL commands to create and populate tablesData Population Methods
Method 1: External Utility (Fastest)
The populate_test_data example creates two tables:
| Table | Columns | Description |
|---|---|---|
users | id, name, email, age, score, active, created_at | User records |
orders | id, user_id, product, quantity, price, status, order_date | Order transactions |
# Usagecargo run --release --example populate_test_data -- [rows] [data_dir]
# Examplescargo run --release --example populate_test_data -- 100000 ./small_test # 100K rowscargo run --release --example populate_test_data -- 1000000 ./medium_test # 1M rowscargo run --release --example populate_test_data -- 10000000 ./large_test # 10M rowsPerformance: ~1.5-2M rows/sec on modern hardware
Method 2: SQL Bulk Insert via REPL
-- Create tableCREATE TABLE products ( id INT8 PRIMARY KEY, name TEXT NOT NULL, category TEXT, price FLOAT8, stock INT4, active BOOLEAN DEFAULT true);
-- Insert multiple rows in one statementINSERT INTO products (id, name, category, price, stock, active) VALUES (1, 'Widget A', 'electronics', 29.99, 100, true), (2, 'Widget B', 'electronics', 49.99, 50, true), (3, 'Gadget X', 'gadgets', 99.99, 25, true), (4, 'Tool Y', 'tools', 19.99, 200, false);
-- Use a loop for bulk generation (HeliosDB extension)INSERT INTO productsSELECT generate_series AS id, 'Product_' || generate_series AS name, CASE generate_series % 3 WHEN 0 THEN 'electronics' WHEN 1 THEN 'gadgets' ELSE 'tools' END AS category, (generate_series % 100) + 0.99 AS price, generate_series % 500 AS stock, generate_series % 2 = 0 AS activeFROM generate_series(1000, 10000);Method 3: COPY Command (CSV Import)
-- Create table firstCREATE TABLE logs ( timestamp TIMESTAMPTZ, level TEXT, message TEXT, source TEXT);
-- Import from CSVCOPY logs FROM '/path/to/logs.csv' WITH (FORMAT CSV, HEADER true);
-- Export to CSVCOPY logs TO '/path/to/export.csv' WITH (FORMAT CSV, HEADER true);REPL Bulk Loading Commands
Starting REPL with Data
# In-memory mode (data lost on exit)cargo run --release
# Persistent mode (data saved to disk)cargo run --release -- --data-dir /path/to/data
# With specific configurationcargo run --release -- --data-dir /path/to/data --cache-size 512REPL Meta-Commands
| Command | Description |
|---|---|
\dt | List all tables |
\d tablename | Describe table schema |
\di | List all indexes |
\ds | Show storage statistics |
\timing | Toggle query timing display |
\explain query | Show query execution plan |
\help | Show all commands |
Creating Tables for Testing
-- Users table with various data typesCREATE TABLE users ( id INT8 PRIMARY KEY, name TEXT NOT NULL, email TEXT UNIQUE, age INT4 CHECK (age >= 0 AND age <= 150), score FLOAT8 DEFAULT 0.0, active BOOLEAN DEFAULT true, created_at TIMESTAMPTZ DEFAULT NOW(), metadata JSONB);
-- Orders table with foreign keyCREATE TABLE orders ( id INT8 PRIMARY KEY, user_id INT8 REFERENCES users(id), product TEXT NOT NULL, quantity INT4 DEFAULT 1, price FLOAT8 NOT NULL, status TEXT DEFAULT 'pending', order_date DATE DEFAULT CURRENT_DATE);
-- Create indexes for common queriesCREATE INDEX idx_users_email ON users(email);CREATE INDEX idx_users_age ON users(age);CREATE INDEX idx_orders_user_id ON orders(user_id);CREATE INDEX idx_orders_status ON orders(status);Exploring HeliosCore Features
Feature 1: Probabilistic Filters
-- Check Bloom filter statisticsSELECT * FROM helios_filter_stats WHERE table_name = 'users';
-- Filter types available:-- - Bloom: Standard, good balance-- - XOR: Better space efficiency-- - Ribbon: Near-optimal, latest research-- - Cuckoo: Good for high load factors
-- View filter configurationSELECT filter_type, bits_per_key, false_positive_rateFROM helios_storage_config;Feature 2: Frequency Estimation (Count-Min Sketch)
-- Estimate frequency of a value without full scanSELECT helios_estimate_frequency('users', 'age', 25);
-- Compare with exact countSELECT COUNT(*) FROM users WHERE age = 25;
-- View sketch statisticsSELECT * FROM helios_sketch_stats;Feature 3: Cardinality Estimation (HyperLogLog)
-- Estimate distinct values (memory-efficient)SELECT helios_estimate_cardinality('users', 'email');
-- Compare with exact countSELECT COUNT(DISTINCT email) FROM users;
-- HyperLogLog uses ~12KB regardless of data size-- Error rate: ~2%Feature 4: MVCC & Snapshots
-- Create a snapshotSELECT helios_create_snapshot() AS snapshot_id;
-- Read from snapshot (consistent view)SELECT * FROM users AS OF SNAPSHOT 12345;
-- Compare current vs snapshotSELECT (SELECT COUNT(*) FROM users) AS current_count, (SELECT COUNT(*) FROM users AS OF SNAPSHOT 12345) AS snapshot_count;
-- List active snapshotsSELECT * FROM helios_snapshots;Feature 5: Compression Statistics
-- View compression ratiosSELECT table_name, raw_size_bytes, compressed_size_bytes, compression_ratio, compression_typeFROM helios_compression_stats;
-- Compression types: none, lz4, zstd, snappyFeature 6: WAL & Durability
-- View WAL statusSELECT * FROM helios_wal_status;
-- WAL configurationSELECT wal_mode, -- 'sync', 'async', 'disabled' sync_mode, -- 'fdatasync', 'fsync', 'o_direct' wal_size_bytes, last_checkpointFROM helios_wal_config;
-- Force checkpointSELECT helios_checkpoint();Feature 7: Storage Segments
-- View segment informationSELECT segment_id, table_name, row_count, size_bytes, min_key, max_key, created_atFROM helios_segmentsORDER BY created_at DESC;
-- Segment size is configurable (default 64MB)Example Queries
Basic CRUD Operations
-- INSERTINSERT INTO users (id, name, email, age, score, active)VALUES (1, 'Alice', 'alice@example.com', 30, 85.5, true);
-- SELECT with filteringSELECT * FROM users WHERE age BETWEEN 25 AND 35 AND active = true;
-- UPDATEUPDATE users SET score = score + 10 WHERE id = 1;
-- DELETEDELETE FROM users WHERE active = false AND score < 50;
-- UPSERT (INSERT ... ON CONFLICT)INSERT INTO users (id, name, email)VALUES (1, 'Alice Updated', 'alice@example.com')ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name;Aggregations
-- Basic aggregatesSELECT COUNT(*) AS total_users, AVG(age) AS avg_age, MIN(score) AS min_score, MAX(score) AS max_score, SUM(score) AS total_scoreFROM users;
-- GROUP BY with HAVINGSELECT CASE WHEN age < 30 THEN 'young' ELSE 'mature' END AS age_group, COUNT(*) AS count, AVG(score) AS avg_scoreFROM usersGROUP BY age_groupHAVING COUNT(*) > 10;
-- Window functionsSELECT name, score, RANK() OVER (ORDER BY score DESC) AS rank, score - LAG(score) OVER (ORDER BY score DESC) AS diff_from_prevFROM usersORDER BY score DESCLIMIT 10;Joins
-- INNER JOINSELECT u.name, o.product, o.quantity, o.priceFROM users uJOIN orders o ON u.id = o.user_idWHERE o.status = 'delivered';
-- LEFT JOIN with aggregationSELECT u.name, COUNT(o.id) AS order_count, COALESCE(SUM(o.price * o.quantity), 0) AS total_spentFROM users uLEFT JOIN orders o ON u.id = o.user_idGROUP BY u.id, u.nameORDER BY total_spent DESCLIMIT 10;Subqueries & CTEs
-- SubquerySELECT * FROM usersWHERE id IN ( SELECT user_id FROM orders WHERE status = 'delivered' GROUP BY user_id HAVING COUNT(*) > 5);
-- CTE (Common Table Expression)WITH high_value_customers AS ( SELECT user_id, SUM(price * quantity) AS total FROM orders GROUP BY user_id HAVING SUM(price * quantity) > 1000)SELECT u.name, h.totalFROM users uJOIN high_value_customers h ON u.id = h.user_idORDER BY h.total DESC;
-- Recursive CTE (for hierarchical data)WITH RECURSIVE category_tree AS ( SELECT id, name, parent_id, 0 AS level FROM categories WHERE parent_id IS NULL UNION ALL SELECT c.id, c.name, c.parent_id, ct.level + 1 FROM categories c JOIN category_tree ct ON c.parent_id = ct.id)SELECT * FROM category_tree;System Views & Metadata
Available System Views
-- List all system viewsSELECT * FROM helios_system_views;| View | Description |
|---|---|
helios_tables | All table metadata |
helios_columns | Column definitions |
helios_indexes | Index information |
helios_constraints | Constraints (PK, FK, CHECK) |
helios_storage_stats | Per-table storage statistics |
helios_filter_stats | Bloom/XOR filter statistics |
helios_compression_stats | Compression ratios |
helios_wal_status | WAL status and health |
helios_segments | Storage segment details |
helios_snapshots | MVCC snapshot list |
helios_transactions | Active transactions |
helios_locks | Current lock status |
Storage Statistics
-- Comprehensive storage overviewSELECT table_name, row_count, pg_size_pretty(data_size_bytes) AS data_size, pg_size_pretty(index_size_bytes) AS index_size, pg_size_pretty(total_size_bytes) AS total_size, compression_ratioFROM helios_storage_statsORDER BY total_size_bytes DESC;
-- Per-column statisticsSELECT table_name, column_name, null_fraction, distinct_count, avg_width, most_common_valuesFROM helios_column_statsWHERE table_name = 'users';Performance Diagnostics
-- Query execution historySELECT query_text, execution_time_ms, rows_returned, rows_scanned, index_usedFROM helios_query_logORDER BY execution_time_ms DESCLIMIT 10;
-- Cache hit ratesSELECT cache_type, hits, misses, hit_rate_percentFROM helios_cache_stats;
-- I/O statisticsSELECT reads, writes, bytes_read, bytes_written, avg_read_latency_us, avg_write_latency_usFROM helios_io_stats;Performance Testing
Benchmark Queries
-- Enable timing\timing on
-- Point lookup (should use index)SELECT * FROM users WHERE id = 12345;
-- Range scanSELECT COUNT(*) FROM users WHERE age BETWEEN 20 AND 30;
-- Full table scanSELECT AVG(score) FROM users;
-- Index scanSELECT * FROM orders WHERE status = 'pending' LIMIT 100;
-- Join performanceSELECT COUNT(*)FROM users uJOIN orders o ON u.id = o.user_id;Explain Plans
-- View execution planEXPLAIN SELECT * FROM users WHERE age > 30;
-- View with execution statisticsEXPLAIN ANALYZE SELECT * FROM users WHERE age > 30;
-- Verbose plan with costsEXPLAIN (VERBOSE, COSTS, BUFFERS)SELECT * FROM users WHERE age > 30;Load Testing with generate_series
-- Insert 100K rows quicklyINSERT INTO test_table (id, data)SELECT generate_series AS id, md5(generate_series::text) AS dataFROM generate_series(1, 100000);
-- Measure insert throughput\timing onINSERT INTO bulk_test SELECT * FROM generate_series(1, 10000);Memory-Efficient Operations
For Large Datasets (>1GB)
-- Use LIMIT for exploratory queriesSELECT * FROM large_table LIMIT 100;
-- Aggregate instead of fetching all rowsSELECT COUNT(*), AVG(value), MAX(value) FROM large_table;
-- Use cursors for streaming resultsDECLARE my_cursor CURSOR FOR SELECT * FROM large_table;FETCH 1000 FROM my_cursor;-- ... process rows ...CLOSE my_cursor;
-- Filter early to reduce dataSELECT * FROM large_tableWHERE created_at > NOW() - INTERVAL '1 day'LIMIT 1000;Batch Processing
-- Process in batches using offset/limitSELECT * FROM large_tableORDER BY idLIMIT 10000 OFFSET 0;
-- Use keyset pagination (more efficient)SELECT * FROM large_tableWHERE id > 10000ORDER BY idLIMIT 10000;Quick Reference
Starting Points
# Quick test dataset (100K rows, ~20MB)cargo run --release --example populate_test_data -- 100000 ./quicktestcargo run --release -- --data-dir ./quicktest
# Medium dataset (1M rows, ~200MB)cargo run --release --example populate_test_data -- 1000000 ./mediumcargo run --release -- --data-dir ./medium
# Large dataset (10M rows, ~2GB)cargo run --release --example populate_test_data -- 10000000 ./largecargo run --release -- --data-dir ./large
# Benchmark dataset (75M rows, ~15GB across 3 shards)cargo run --release --example hybrid_benchmark_persistent -- 1.0 ./benchmarkcargo run --release -- --data-dir ./benchmark/shard_0Most Useful REPL Commands
\dt -- List tables\d users -- Describe table\timing on -- Show query timingEXPLAIN ANALYZE SELECT ... -- Query plan with statsSELECT * FROM helios_storage_stats; -- Storage overview