HeliosDB REPL: Bulk Loading & Feature Exploration Guide

This guide shows how to populate HeliosDB with test data and explore all HeliosCore features using the REPL.

Quick Start
Data Population Methods
REPL Bulk Loading Commands
Exploring HeliosCore Features
Example Queries
System Views & Metadata
Performance Testing

Quick Start

Option 1: Use the Population Utility (Recommended)

# Populate 1 million rows (default) - ~200MB on disk
cargo run --release --example populate_test_data

# Populate 10 million rows - ~2GB on disk
cargo run --release --example populate_test_data -- 10000000 ./largedata

# Start REPL with populated data
cargo run --release -- --data-dir ./helios_testdata

Option 2: Run Persistent Benchmark (Creates 15GB Dataset)

# Run full benchmark - creates 75M rows across 3 shards
cargo run --release --example hybrid_benchmark_persistent -- 1.0 ./benchmark_data

# Explore one shard in REPL
cargo run --release -- --data-dir ./benchmark_data/shard_0

Option 3: Populate via REPL (Interactive)

# Start REPL
cargo run --release

# Then run SQL commands to create and populate tables

Data Population Methods

Method 1: External Utility (Fastest)

The populate_test_data example creates two tables:

Table	Columns	Description
`users`	id, name, email, age, score, active, created_at	User records
`orders`	id, user_id, product, quantity, price, status, order_date	Order transactions

# Usage
cargo run --release --example populate_test_data -- [rows] [data_dir]

# Examples
cargo run --release --example populate_test_data -- 100000 ./small_test   # 100K rows
cargo run --release --example populate_test_data -- 1000000 ./medium_test # 1M rows
cargo run --release --example populate_test_data -- 10000000 ./large_test # 10M rows

Performance: ~1.5-2M rows/sec on modern hardware

Method 2: SQL Bulk Insert via REPL

-- Create table
CREATE TABLE products (
    id INT8 PRIMARY KEY,
    name TEXT NOT NULL,
    category TEXT,
    price FLOAT8,
    stock INT4,
    active BOOLEAN DEFAULT true
);

-- Insert multiple rows in one statement
INSERT INTO products (id, name, category, price, stock, active) VALUES
    (1, 'Widget A', 'electronics', 29.99, 100, true),
    (2, 'Widget B', 'electronics', 49.99, 50, true),
    (3, 'Gadget X', 'gadgets', 99.99, 25, true),
    (4, 'Tool Y', 'tools', 19.99, 200, false);

-- Use a loop for bulk generation (HeliosDB extension)
INSERT INTO products
SELECT
    generate_series AS id,
    'Product_' || generate_series AS name,
    CASE generate_series % 3
        WHEN 0 THEN 'electronics'
        WHEN 1 THEN 'gadgets'
        ELSE 'tools'
    END AS category,
    (generate_series % 100) + 0.99 AS price,
    generate_series % 500 AS stock,
    generate_series % 2 = 0 AS active
FROM generate_series(1000, 10000);

Method 3: COPY Command (CSV Import)

-- Create table first
CREATE TABLE logs (
    timestamp TIMESTAMPTZ,
    level TEXT,
    message TEXT,
    source TEXT
);

-- Import from CSV
COPY logs FROM '/path/to/logs.csv' WITH (FORMAT CSV, HEADER true);

-- Export to CSV
COPY logs TO '/path/to/export.csv' WITH (FORMAT CSV, HEADER true);

REPL Bulk Loading Commands

Starting REPL with Data

# In-memory mode; add --dump-on-shutdown to dump on exit
cargo run --release

# Persistent mode (data saved to disk)
cargo run --release -- --data-dir /path/to/data

# With specific configuration
cargo run --release -- --data-dir /path/to/data --cache-size 512

REPL Meta-Commands

Command	Description
`\dt`	List all tables
`\d tablename`	Describe table schema
`\di`	List all indexes
`\ds`	Show storage statistics
`\timing`	Toggle query timing display
`\explain query`	Show query execution plan
`\help`	Show all commands

Creating Tables for Testing

-- Users table with various data types
CREATE TABLE users (
    id INT8 PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT UNIQUE,
    age INT4 CHECK (age >= 0 AND age <= 150),
    score FLOAT8 DEFAULT 0.0,
    active BOOLEAN DEFAULT true,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    metadata JSONB
);

-- Orders table with foreign key
CREATE TABLE orders (
    id INT8 PRIMARY KEY,
    user_id INT8 REFERENCES users(id),
    product TEXT NOT NULL,
    quantity INT4 DEFAULT 1,
    price FLOAT8 NOT NULL,
    status TEXT DEFAULT 'pending',
    order_date DATE DEFAULT CURRENT_DATE
);

-- Create indexes for common queries
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_age ON users(age);
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);

Exploring HeliosCore Features

Feature 1: Probabilistic Filters

-- Check Bloom filter statistics
SELECT * FROM helios_filter_stats WHERE table_name = 'users';

-- Filter types available:
-- - Bloom: Standard, good balance
-- - XOR: Better space efficiency
-- - Ribbon: Near-optimal, latest research
-- - Cuckoo: Good for high load factors

-- View filter configuration
SELECT filter_type, bits_per_key, false_positive_rate
FROM helios_storage_config;

Feature 2: Frequency Estimation (Count-Min Sketch)

-- Estimate frequency of a value without full scan
SELECT helios_estimate_frequency('users', 'age', 25);

-- Compare with exact count
SELECT COUNT(*) FROM users WHERE age = 25;

-- View sketch statistics
SELECT * FROM helios_sketch_stats;

Feature 3: Cardinality Estimation (HyperLogLog)

-- Estimate distinct values (memory-efficient)
SELECT helios_estimate_cardinality('users', 'email');

-- Compare with exact count
SELECT COUNT(DISTINCT email) FROM users;

-- HyperLogLog uses ~12KB regardless of data size
-- Error rate: ~2%

Feature 4: MVCC & Snapshots

-- Create a snapshot
SELECT helios_create_snapshot() AS snapshot_id;

-- Read from snapshot (consistent view)
SELECT * FROM users AS OF SNAPSHOT 12345;

-- Compare current vs snapshot
SELECT
    (SELECT COUNT(*) FROM users) AS current_count,
    (SELECT COUNT(*) FROM users AS OF SNAPSHOT 12345) AS snapshot_count;

-- List active snapshots
SELECT * FROM helios_snapshots;

Feature 5: Compression Statistics

-- View compression ratios
SELECT
    table_name,
    raw_size_bytes,
    compressed_size_bytes,
    compression_ratio,
    compression_type
FROM helios_compression_stats;

-- Compression types: none, lz4, zstd, snappy

Feature 6: WAL & Durability

-- View WAL status
SELECT * FROM helios_wal_status;

-- WAL configuration
SELECT
    wal_mode,           -- 'sync', 'async', 'disabled'
    sync_mode,          -- 'fdatasync', 'fsync', 'o_direct'
    wal_size_bytes,
    last_checkpoint
FROM helios_wal_config;

-- Force checkpoint
SELECT helios_checkpoint();

Feature 7: Storage Segments

-- View segment information
SELECT
    segment_id,
    table_name,
    row_count,
    size_bytes,
    min_key,
    max_key,
    created_at
FROM helios_segments
ORDER BY created_at DESC;

-- Segment size is configurable (default 64MB)

Example Queries

Basic CRUD Operations

-- INSERT
INSERT INTO users (id, name, email, age, score, active)
VALUES (1, 'Alice', 'alice@example.com', 30, 85.5, true);

-- SELECT with filtering
SELECT * FROM users WHERE age BETWEEN 25 AND 35 AND active = true;

-- UPDATE
UPDATE users SET score = score + 10 WHERE id = 1;

-- DELETE
DELETE FROM users WHERE active = false AND score < 50;

-- UPSERT (INSERT ... ON CONFLICT)
INSERT INTO users (id, name, email)
VALUES (1, 'Alice Updated', 'alice@example.com')
ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name;

Aggregations

-- Basic aggregates
SELECT
    COUNT(*) AS total_users,
    AVG(age) AS avg_age,
    MIN(score) AS min_score,
    MAX(score) AS max_score,
    SUM(score) AS total_score
FROM users;

-- GROUP BY with HAVING
SELECT
    CASE WHEN age < 30 THEN 'young' ELSE 'mature' END AS age_group,
    COUNT(*) AS count,
    AVG(score) AS avg_score
FROM users
GROUP BY age_group
HAVING COUNT(*) > 10;

-- Window functions
SELECT
    name,
    score,
    RANK() OVER (ORDER BY score DESC) AS rank,
    score - LAG(score) OVER (ORDER BY score DESC) AS diff_from_prev
FROM users
ORDER BY score DESC
LIMIT 10;

Joins

-- INNER JOIN
SELECT
    u.name,
    o.product,
    o.quantity,
    o.price
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.status = 'delivered';

-- LEFT JOIN with aggregation
SELECT
    u.name,
    COUNT(o.id) AS order_count,
    COALESCE(SUM(o.price * o.quantity), 0) AS total_spent
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name
ORDER BY total_spent DESC
LIMIT 10;

Subqueries & CTEs

-- Subquery
SELECT * FROM users
WHERE id IN (
    SELECT user_id FROM orders
    WHERE status = 'delivered'
    GROUP BY user_id
    HAVING COUNT(*) > 5
);

-- CTE (Common Table Expression)
WITH high_value_customers AS (
    SELECT user_id, SUM(price * quantity) AS total
    FROM orders
    GROUP BY user_id
    HAVING SUM(price * quantity) > 1000
)
SELECT u.name, h.total
FROM users u
JOIN high_value_customers h ON u.id = h.user_id
ORDER BY h.total DESC;

-- Recursive CTE (for hierarchical data)
WITH RECURSIVE category_tree AS (
    SELECT id, name, parent_id, 0 AS level
    FROM categories WHERE parent_id IS NULL
    UNION ALL
    SELECT c.id, c.name, c.parent_id, ct.level + 1
    FROM categories c
    JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree;

System Views & Metadata

Available System Views

-- List all system views
SELECT * FROM helios_system_views;

View	Description
`helios_tables`	All table metadata
`helios_columns`	Column definitions
`helios_indexes`	Index information
`helios_constraints`	Constraints (PK, FK, CHECK)
`helios_storage_stats`	Per-table storage statistics
`helios_filter_stats`	Bloom/XOR filter statistics
`helios_compression_stats`	Compression ratios
`helios_wal_status`	WAL status and health
`helios_segments`	Storage segment details
`helios_snapshots`	MVCC snapshot list
`helios_transactions`	Active transactions
`helios_locks`	Current lock status

Storage Statistics

-- Comprehensive storage overview
SELECT
    table_name,
    row_count,
    pg_size_pretty(data_size_bytes) AS data_size,
    pg_size_pretty(index_size_bytes) AS index_size,
    pg_size_pretty(total_size_bytes) AS total_size,
    compression_ratio
FROM helios_storage_stats
ORDER BY total_size_bytes DESC;

-- Per-column statistics
SELECT
    table_name,
    column_name,
    null_fraction,
    distinct_count,
    avg_width,
    most_common_values
FROM helios_column_stats
WHERE table_name = 'users';

Performance Diagnostics

-- Query execution history
SELECT
    query_text,
    execution_time_ms,
    rows_returned,
    rows_scanned,
    index_used
FROM helios_query_log
ORDER BY execution_time_ms DESC
LIMIT 10;

-- Cache hit rates
SELECT
    cache_type,
    hits,
    misses,
    hit_rate_percent
FROM helios_cache_stats;

-- I/O statistics
SELECT
    reads,
    writes,
    bytes_read,
    bytes_written,
    avg_read_latency_us,
    avg_write_latency_us
FROM helios_io_stats;

Performance Testing

Benchmark Queries

-- Enable timing
\timing on

-- Point lookup (should use index)
SELECT * FROM users WHERE id = 12345;

-- Range scan
SELECT COUNT(*) FROM users WHERE age BETWEEN 20 AND 30;

-- Full table scan
SELECT AVG(score) FROM users;

-- Index scan
SELECT * FROM orders WHERE status = 'pending' LIMIT 100;

-- Join performance
SELECT COUNT(*)
FROM users u
JOIN orders o ON u.id = o.user_id;

Explain Plans

-- View execution plan
EXPLAIN SELECT * FROM users WHERE age > 30;

-- View with execution statistics
EXPLAIN ANALYZE SELECT * FROM users WHERE age > 30;

-- Verbose plan with costs
EXPLAIN (VERBOSE, COSTS, BUFFERS)
SELECT * FROM users WHERE age > 30;

Load Testing with generate_series

-- Insert 100K rows quickly
INSERT INTO test_table (id, data)
SELECT
    generate_series AS id,
    md5(generate_series::text) AS data
FROM generate_series(1, 100000);

-- Measure insert throughput
\timing on
INSERT INTO bulk_test SELECT * FROM generate_series(1, 10000);

Memory-Efficient Operations

For Large Datasets (>1GB)

-- Use LIMIT for exploratory queries
SELECT * FROM large_table LIMIT 100;

-- Aggregate instead of fetching all rows
SELECT COUNT(*), AVG(value), MAX(value) FROM large_table;

-- Use cursors for streaming results
DECLARE my_cursor CURSOR FOR SELECT * FROM large_table;
FETCH 1000 FROM my_cursor;
-- ... process rows ...
CLOSE my_cursor;

-- Filter early to reduce data
SELECT * FROM large_table
WHERE created_at > NOW() - INTERVAL '1 day'
LIMIT 1000;

Batch Processing

-- Process in batches using offset/limit
SELECT * FROM large_table
ORDER BY id
LIMIT 10000 OFFSET 0;

-- Use keyset pagination (more efficient)
SELECT * FROM large_table
WHERE id > 10000
ORDER BY id
LIMIT 10000;

Quick Reference

Starting Points

# Quick test dataset (100K rows, ~20MB)
cargo run --release --example populate_test_data -- 100000 ./quicktest
cargo run --release -- --data-dir ./quicktest

# Medium dataset (1M rows, ~200MB)
cargo run --release --example populate_test_data -- 1000000 ./medium
cargo run --release -- --data-dir ./medium

# Large dataset (10M rows, ~2GB)
cargo run --release --example populate_test_data -- 10000000 ./large
cargo run --release -- --data-dir ./large

# Benchmark dataset (75M rows, ~15GB across 3 shards)
cargo run --release --example hybrid_benchmark_persistent -- 1.0 ./benchmark
cargo run --release -- --data-dir ./benchmark/shard_0

Most Useful REPL Commands

\dt                          -- List tables
\d users                     -- Describe table
\timing on                   -- Show query timing
EXPLAIN ANALYZE SELECT ...   -- Query plan with stats
SELECT * FROM helios_storage_stats;  -- Storage overview

HeliosDB REPL: Bulk Loading & Feature Exploration Guide

HeliosDB REPL: Bulk Loading & Feature Exploration Guide

Table of Contents

Quick Start

Option 1: Use the Population Utility (Recommended)

Option 2: Run Persistent Benchmark (Creates 15GB Dataset)

Option 3: Populate via REPL (Interactive)

Data Population Methods

Method 1: External Utility (Fastest)

Method 2: SQL Bulk Insert via REPL

Method 3: COPY Command (CSV Import)

REPL Bulk Loading Commands

Starting REPL with Data

REPL Meta-Commands

Creating Tables for Testing

Exploring HeliosCore Features

Feature 1: Probabilistic Filters

Feature 2: Frequency Estimation (Count-Min Sketch)

Feature 3: Cardinality Estimation (HyperLogLog)

Feature 4: MVCC & Snapshots

Feature 5: Compression Statistics

Feature 6: WAL & Durability

Feature 7: Storage Segments

Example Queries

Basic CRUD Operations

Aggregations

Joins

Subqueries & CTEs

System Views & Metadata

Available System Views

Storage Statistics

Performance Diagnostics

Performance Testing

Benchmark Queries

Explain Plans

Load Testing with generate_series

Memory-Efficient Operations

For Large Datasets (>1GB)

Batch Processing

Quick Reference

Starting Points

Most Useful REPL Commands