Skip to content

HeliosDB REPL: Bulk Loading & Feature Exploration Guide

HeliosDB REPL: Bulk Loading & Feature Exploration Guide

This guide shows how to populate HeliosDB with test data and explore all HeliosCore features using the REPL.

Table of Contents

  1. Quick Start
  2. Data Population Methods
  3. REPL Bulk Loading Commands
  4. Exploring HeliosCore Features
  5. Example Queries
  6. System Views & Metadata
  7. Performance Testing

Quick Start

Terminal window
# Populate 1 million rows (default) - ~200MB on disk
cargo run --release --example populate_test_data
# Populate 10 million rows - ~2GB on disk
cargo run --release --example populate_test_data -- 10000000 ./largedata
# Start REPL with populated data
cargo run --release -- --data-dir ./helios_testdata

Option 2: Run Persistent Benchmark (Creates 15GB Dataset)

Terminal window
# Run full benchmark - creates 75M rows across 3 shards
cargo run --release --example hybrid_benchmark_persistent -- 1.0 ./benchmark_data
# Explore one shard in REPL
cargo run --release -- --data-dir ./benchmark_data/shard_0

Option 3: Populate via REPL (Interactive)

Terminal window
# Start REPL
cargo run --release
# Then run SQL commands to create and populate tables

Data Population Methods

Method 1: External Utility (Fastest)

The populate_test_data example creates two tables:

TableColumnsDescription
usersid, name, email, age, score, active, created_atUser records
ordersid, user_id, product, quantity, price, status, order_dateOrder transactions
Terminal window
# Usage
cargo run --release --example populate_test_data -- [rows] [data_dir]
# Examples
cargo run --release --example populate_test_data -- 100000 ./small_test # 100K rows
cargo run --release --example populate_test_data -- 1000000 ./medium_test # 1M rows
cargo run --release --example populate_test_data -- 10000000 ./large_test # 10M rows

Performance: ~1.5-2M rows/sec on modern hardware

Method 2: SQL Bulk Insert via REPL

-- Create table
CREATE TABLE products (
id INT8 PRIMARY KEY,
name TEXT NOT NULL,
category TEXT,
price FLOAT8,
stock INT4,
active BOOLEAN DEFAULT true
);
-- Insert multiple rows in one statement
INSERT INTO products (id, name, category, price, stock, active) VALUES
(1, 'Widget A', 'electronics', 29.99, 100, true),
(2, 'Widget B', 'electronics', 49.99, 50, true),
(3, 'Gadget X', 'gadgets', 99.99, 25, true),
(4, 'Tool Y', 'tools', 19.99, 200, false);
-- Use a loop for bulk generation (HeliosDB extension)
INSERT INTO products
SELECT
generate_series AS id,
'Product_' || generate_series AS name,
CASE generate_series % 3
WHEN 0 THEN 'electronics'
WHEN 1 THEN 'gadgets'
ELSE 'tools'
END AS category,
(generate_series % 100) + 0.99 AS price,
generate_series % 500 AS stock,
generate_series % 2 = 0 AS active
FROM generate_series(1000, 10000);

Method 3: COPY Command (CSV Import)

-- Create table first
CREATE TABLE logs (
timestamp TIMESTAMPTZ,
level TEXT,
message TEXT,
source TEXT
);
-- Import from CSV
COPY logs FROM '/path/to/logs.csv' WITH (FORMAT CSV, HEADER true);
-- Export to CSV
COPY logs TO '/path/to/export.csv' WITH (FORMAT CSV, HEADER true);

REPL Bulk Loading Commands

Starting REPL with Data

Terminal window
# In-memory mode (data lost on exit)
cargo run --release
# Persistent mode (data saved to disk)
cargo run --release -- --data-dir /path/to/data
# With specific configuration
cargo run --release -- --data-dir /path/to/data --cache-size 512

REPL Meta-Commands

CommandDescription
\dtList all tables
\d tablenameDescribe table schema
\diList all indexes
\dsShow storage statistics
\timingToggle query timing display
\explain queryShow query execution plan
\helpShow all commands

Creating Tables for Testing

-- Users table with various data types
CREATE TABLE users (
id INT8 PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE,
age INT4 CHECK (age >= 0 AND age <= 150),
score FLOAT8 DEFAULT 0.0,
active BOOLEAN DEFAULT true,
created_at TIMESTAMPTZ DEFAULT NOW(),
metadata JSONB
);
-- Orders table with foreign key
CREATE TABLE orders (
id INT8 PRIMARY KEY,
user_id INT8 REFERENCES users(id),
product TEXT NOT NULL,
quantity INT4 DEFAULT 1,
price FLOAT8 NOT NULL,
status TEXT DEFAULT 'pending',
order_date DATE DEFAULT CURRENT_DATE
);
-- Create indexes for common queries
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_age ON users(age);
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);

Exploring HeliosCore Features

Feature 1: Probabilistic Filters

-- Check Bloom filter statistics
SELECT * FROM helios_filter_stats WHERE table_name = 'users';
-- Filter types available:
-- - Bloom: Standard, good balance
-- - XOR: Better space efficiency
-- - Ribbon: Near-optimal, latest research
-- - Cuckoo: Good for high load factors
-- View filter configuration
SELECT filter_type, bits_per_key, false_positive_rate
FROM helios_storage_config;

Feature 2: Frequency Estimation (Count-Min Sketch)

-- Estimate frequency of a value without full scan
SELECT helios_estimate_frequency('users', 'age', 25);
-- Compare with exact count
SELECT COUNT(*) FROM users WHERE age = 25;
-- View sketch statistics
SELECT * FROM helios_sketch_stats;

Feature 3: Cardinality Estimation (HyperLogLog)

-- Estimate distinct values (memory-efficient)
SELECT helios_estimate_cardinality('users', 'email');
-- Compare with exact count
SELECT COUNT(DISTINCT email) FROM users;
-- HyperLogLog uses ~12KB regardless of data size
-- Error rate: ~2%

Feature 4: MVCC & Snapshots

-- Create a snapshot
SELECT helios_create_snapshot() AS snapshot_id;
-- Read from snapshot (consistent view)
SELECT * FROM users AS OF SNAPSHOT 12345;
-- Compare current vs snapshot
SELECT
(SELECT COUNT(*) FROM users) AS current_count,
(SELECT COUNT(*) FROM users AS OF SNAPSHOT 12345) AS snapshot_count;
-- List active snapshots
SELECT * FROM helios_snapshots;

Feature 5: Compression Statistics

-- View compression ratios
SELECT
table_name,
raw_size_bytes,
compressed_size_bytes,
compression_ratio,
compression_type
FROM helios_compression_stats;
-- Compression types: none, lz4, zstd, snappy

Feature 6: WAL & Durability

-- View WAL status
SELECT * FROM helios_wal_status;
-- WAL configuration
SELECT
wal_mode, -- 'sync', 'async', 'disabled'
sync_mode, -- 'fdatasync', 'fsync', 'o_direct'
wal_size_bytes,
last_checkpoint
FROM helios_wal_config;
-- Force checkpoint
SELECT helios_checkpoint();

Feature 7: Storage Segments

-- View segment information
SELECT
segment_id,
table_name,
row_count,
size_bytes,
min_key,
max_key,
created_at
FROM helios_segments
ORDER BY created_at DESC;
-- Segment size is configurable (default 64MB)

Example Queries

Basic CRUD Operations

-- INSERT
INSERT INTO users (id, name, email, age, score, active)
VALUES (1, 'Alice', 'alice@example.com', 30, 85.5, true);
-- SELECT with filtering
SELECT * FROM users WHERE age BETWEEN 25 AND 35 AND active = true;
-- UPDATE
UPDATE users SET score = score + 10 WHERE id = 1;
-- DELETE
DELETE FROM users WHERE active = false AND score < 50;
-- UPSERT (INSERT ... ON CONFLICT)
INSERT INTO users (id, name, email)
VALUES (1, 'Alice Updated', 'alice@example.com')
ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name;

Aggregations

-- Basic aggregates
SELECT
COUNT(*) AS total_users,
AVG(age) AS avg_age,
MIN(score) AS min_score,
MAX(score) AS max_score,
SUM(score) AS total_score
FROM users;
-- GROUP BY with HAVING
SELECT
CASE WHEN age < 30 THEN 'young' ELSE 'mature' END AS age_group,
COUNT(*) AS count,
AVG(score) AS avg_score
FROM users
GROUP BY age_group
HAVING COUNT(*) > 10;
-- Window functions
SELECT
name,
score,
RANK() OVER (ORDER BY score DESC) AS rank,
score - LAG(score) OVER (ORDER BY score DESC) AS diff_from_prev
FROM users
ORDER BY score DESC
LIMIT 10;

Joins

-- INNER JOIN
SELECT
u.name,
o.product,
o.quantity,
o.price
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.status = 'delivered';
-- LEFT JOIN with aggregation
SELECT
u.name,
COUNT(o.id) AS order_count,
COALESCE(SUM(o.price * o.quantity), 0) AS total_spent
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name
ORDER BY total_spent DESC
LIMIT 10;

Subqueries & CTEs

-- Subquery
SELECT * FROM users
WHERE id IN (
SELECT user_id FROM orders
WHERE status = 'delivered'
GROUP BY user_id
HAVING COUNT(*) > 5
);
-- CTE (Common Table Expression)
WITH high_value_customers AS (
SELECT user_id, SUM(price * quantity) AS total
FROM orders
GROUP BY user_id
HAVING SUM(price * quantity) > 1000
)
SELECT u.name, h.total
FROM users u
JOIN high_value_customers h ON u.id = h.user_id
ORDER BY h.total DESC;
-- Recursive CTE (for hierarchical data)
WITH RECURSIVE category_tree AS (
SELECT id, name, parent_id, 0 AS level
FROM categories WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.name, c.parent_id, ct.level + 1
FROM categories c
JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree;

System Views & Metadata

Available System Views

-- List all system views
SELECT * FROM helios_system_views;
ViewDescription
helios_tablesAll table metadata
helios_columnsColumn definitions
helios_indexesIndex information
helios_constraintsConstraints (PK, FK, CHECK)
helios_storage_statsPer-table storage statistics
helios_filter_statsBloom/XOR filter statistics
helios_compression_statsCompression ratios
helios_wal_statusWAL status and health
helios_segmentsStorage segment details
helios_snapshotsMVCC snapshot list
helios_transactionsActive transactions
helios_locksCurrent lock status

Storage Statistics

-- Comprehensive storage overview
SELECT
table_name,
row_count,
pg_size_pretty(data_size_bytes) AS data_size,
pg_size_pretty(index_size_bytes) AS index_size,
pg_size_pretty(total_size_bytes) AS total_size,
compression_ratio
FROM helios_storage_stats
ORDER BY total_size_bytes DESC;
-- Per-column statistics
SELECT
table_name,
column_name,
null_fraction,
distinct_count,
avg_width,
most_common_values
FROM helios_column_stats
WHERE table_name = 'users';

Performance Diagnostics

-- Query execution history
SELECT
query_text,
execution_time_ms,
rows_returned,
rows_scanned,
index_used
FROM helios_query_log
ORDER BY execution_time_ms DESC
LIMIT 10;
-- Cache hit rates
SELECT
cache_type,
hits,
misses,
hit_rate_percent
FROM helios_cache_stats;
-- I/O statistics
SELECT
reads,
writes,
bytes_read,
bytes_written,
avg_read_latency_us,
avg_write_latency_us
FROM helios_io_stats;

Performance Testing

Benchmark Queries

-- Enable timing
\timing on
-- Point lookup (should use index)
SELECT * FROM users WHERE id = 12345;
-- Range scan
SELECT COUNT(*) FROM users WHERE age BETWEEN 20 AND 30;
-- Full table scan
SELECT AVG(score) FROM users;
-- Index scan
SELECT * FROM orders WHERE status = 'pending' LIMIT 100;
-- Join performance
SELECT COUNT(*)
FROM users u
JOIN orders o ON u.id = o.user_id;

Explain Plans

-- View execution plan
EXPLAIN SELECT * FROM users WHERE age > 30;
-- View with execution statistics
EXPLAIN ANALYZE SELECT * FROM users WHERE age > 30;
-- Verbose plan with costs
EXPLAIN (VERBOSE, COSTS, BUFFERS)
SELECT * FROM users WHERE age > 30;

Load Testing with generate_series

-- Insert 100K rows quickly
INSERT INTO test_table (id, data)
SELECT
generate_series AS id,
md5(generate_series::text) AS data
FROM generate_series(1, 100000);
-- Measure insert throughput
\timing on
INSERT INTO bulk_test SELECT * FROM generate_series(1, 10000);

Memory-Efficient Operations

For Large Datasets (>1GB)

-- Use LIMIT for exploratory queries
SELECT * FROM large_table LIMIT 100;
-- Aggregate instead of fetching all rows
SELECT COUNT(*), AVG(value), MAX(value) FROM large_table;
-- Use cursors for streaming results
DECLARE my_cursor CURSOR FOR SELECT * FROM large_table;
FETCH 1000 FROM my_cursor;
-- ... process rows ...
CLOSE my_cursor;
-- Filter early to reduce data
SELECT * FROM large_table
WHERE created_at > NOW() - INTERVAL '1 day'
LIMIT 1000;

Batch Processing

-- Process in batches using offset/limit
SELECT * FROM large_table
ORDER BY id
LIMIT 10000 OFFSET 0;
-- Use keyset pagination (more efficient)
SELECT * FROM large_table
WHERE id > 10000
ORDER BY id
LIMIT 10000;

Quick Reference

Starting Points

Terminal window
# Quick test dataset (100K rows, ~20MB)
cargo run --release --example populate_test_data -- 100000 ./quicktest
cargo run --release -- --data-dir ./quicktest
# Medium dataset (1M rows, ~200MB)
cargo run --release --example populate_test_data -- 1000000 ./medium
cargo run --release -- --data-dir ./medium
# Large dataset (10M rows, ~2GB)
cargo run --release --example populate_test_data -- 10000000 ./large
cargo run --release -- --data-dir ./large
# Benchmark dataset (75M rows, ~15GB across 3 shards)
cargo run --release --example hybrid_benchmark_persistent -- 1.0 ./benchmark
cargo run --release -- --data-dir ./benchmark/shard_0

Most Useful REPL Commands

\dt -- List tables
\d users -- Describe table
\timing on -- Show query timing
EXPLAIN ANALYZE SELECT ... -- Query plan with stats
SELECT * FROM helios_storage_stats; -- Storage overview