Docling Ingestion Tutorial
- Available since: v3.19.0 (2026-04-25)
- Build: `cargo build --release --features graph-rag` (implies `code-graph`)
- APIs: `EmbeddedDatabase::graph_rag_ingest_pdf`, `_office`, `_audio`, `_image`
- External dependency: docling-serve — runs out-of-process, typically as a Docker / Podman container
UVP
Docling is the document-conversion stack that handles PDFs, DOCX, PPTX, images, and audio. HeliosDB Nano embeds an idempotent adapter that POSTs your file to a docling-serve instance, parses the resulting DoclingDocument JSON, and projects sections + chunks + tables into the universal `_hdb_graph_*` schema with `CONTAINS` edges preserving hierarchy. Re-running the ingester against the same file is a safe no-op via `source_ref` keys. Drop a PDF into the same engine that holds your code symbols, your emails, and your issues — and run a single `graph_rag_search()` across all of them.
Prerequisites
- HeliosDB Nano v3.19+ source tree built with `--features graph-rag`
- Docker or Podman (for running docling-serve locally)
- A sample PDF, DOCX, or image to ingest
- The Graph-RAG Tutorial read first — this tutorial assumes familiarity with `_hdb_graph_nodes` / `_hdb_graph_edges`
- About 25 minutes (most of which is docling-serve's first-run model download)
1. Run docling-serve Locally
The official image:
```bash
docker run --rm -d \
  --name docling-serve \
  -p 5001:5001 \
  ghcr.io/docling-project/docling-serve:latest
```

First boot pulls the conversion models (~700 MB; takes 1–3 minutes depending on your link). Once ready, the API responds at:

```bash
curl -s http://localhost:5001/health
# {"status":"ok"}
```

The Nano adapter targets `http://localhost:5001/v1/convert/source` by default — change it via `DoclingIngestOptions::with_endpoint(...)` if your instance lives elsewhere or behind a gateway.
2. Wire HeliosDB Nano to docling-serve
The four ingestion entry points are one-per-modality:
| Method | What it ingests | Default `corpus_kind` |
|---|---|---|
| `graph_rag_ingest_pdf` | PDFs | `Pdf` |
| `graph_rag_ingest_office` | DOCX / PPTX / XLSX | `Office` |
| `graph_rag_ingest_audio` | Audio (text-only DoclingDocuments via docling's ASR) | `Audio` |
| `graph_rag_ingest_image` | Images | `Image` |
Each takes a single DoclingIngestOptions carrying the source (URL, local path, or in-memory bytes) plus the docling endpoint:
```rust
use heliosdb_nano::{
    graph_rag::DoclingIngestOptions,
    EmbeddedDatabase, Result,
};

fn main() -> Result<()> {
    let db = EmbeddedDatabase::new_in_memory()?;

    // 1. Build options. `from_path` defaults to the local docling-serve
    //    endpoint and the "Document" corpus kind — the per-modality
    //    entry point overrides the kind to "Pdf" / "Office" / etc.
    let opts = DoclingIngestOptions::from_path("/path/to/whitepaper.pdf");

    // 2. Run. Each call is idempotent — same source_ref → same node_id.
    let stats = db.graph_rag_ingest_pdf(&opts)?;

    println!(
        "ingested: {} nodes, {} edges, {} rows seen",
        stats.nodes_added, stats.edges_added, stats.rows_seen
    );
    Ok(())
}
```

Behind the scenes the adapter:

- Reads the file (or fetches the URL, or decodes the in-memory bytes), base64-encodes it, and POSTs `{"sources": [{"kind": "file", "filename": "...", "data_base64": "..."}], "to_formats": ["json"]}` to the docling endpoint.
- Parses the response's `documents[].json_content` field as a `DoclingDocument`.
- Walks the structure and emits one root node (`node_kind = "Pdf"` / `"Office"` / etc.), one `DocSection` per heading-style text, one `DocChunk` per paragraph, and one `DocTable` per table — wired together with `CONTAINS` edges.
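To make the wire format concrete, here is a minimal Python sketch of the request body described above. It is an illustrative model, not part of HeliosDB: `build_convert_payload` is a hypothetical helper, and only the JSON shape (`sources`, `data_base64`, `to_formats`) comes from this tutorial.

```python
import base64


def build_convert_payload(path):
    """Build a docling-serve-style convert request body:
    the file base64-encoded, plus the requested output format."""
    with open(path, "rb") as f:
        data = f.read()
    return {
        "sources": [{
            "kind": "file",
            # keep only the basename, as a filename hint for the converter
            "filename": path.rsplit("/", 1)[-1],
            "data_base64": base64.b64encode(data).decode("ascii"),
        }],
        "to_formats": ["json"],
    }
```

POSTing this dict as JSON to `/v1/convert/source` is equivalent to what the adapter does for the `from_path` source mode.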
3. The Three Source Modes
```rust
// (a) Local file — adapter reads + base64-encodes + POSTs.
let opts = DoclingIngestOptions::from_path("./paper.pdf");

// (b) Remote URL — docling-serve fetches it server-side.
let opts = DoclingIngestOptions::from_url("https://arxiv.org/pdf/2403.10131.pdf");

// (c) In-memory bytes — useful for non-filesystem corpora.
let opts = DoclingIngestOptions {
    source_bytes: Some(std::fs::read("paper.pdf")?),
    filename: Some("paper.pdf".into()),
    docling_endpoint: "http://localhost:5001/v1/convert/source".into(),
    corpus_kind: "Pdf".into(),
    timeout_ms: 60_000,
    ..Default::default()
};
```

The default 60 s HTTP timeout covers docling's first-model-load slow path. Override `timeout_ms` for big files or slow networks.
4. Override the Endpoint
```rust
let opts = DoclingIngestOptions::from_path("./paper.pdf")
    .with_endpoint("http://docling.internal:5001/v1/convert/source")
    .with_corpus_kind("ResearchPaper");
```

`auth_bearer` adds `Authorization: Bearer <token>` to the docling-serve POST — useful when the conversion service is behind an auth proxy.
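The effect of `auth_bearer` is just the standard HTTP Bearer scheme. A tiny Python sketch of the headers such a POST would carry (`convert_headers` is a hypothetical helper for illustration, not a HeliosDB API):

```python
def convert_headers(auth_bearer=None):
    """Headers for a JSON convert POST; a bearer token, when set,
    rides along as a standard Authorization header."""
    headers = {"Content-Type": "application/json"}
    if auth_bearer:
        headers["Authorization"] = f"Bearer {auth_bearer}"
    return headers
```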
5. Inspect the Result
After a single PDF ingestion:
```sql
SELECT node_kind, COUNT(*)
FROM _hdb_graph_nodes
WHERE source_ref LIKE 'docling:%'
GROUP BY node_kind
ORDER BY 1;

-- node_kind  | count
-- DocChunk   | 87
-- DocSection | 14
-- DocTable   | 3
-- Pdf        | 1
```

The hierarchy:

```sql
SELECT n_from.node_kind    AS parent_kind,
       n_from.title        AS parent_title,
       n_to.node_kind      AS child_kind,
       LEFT(n_to.text, 60) AS child_text
FROM _hdb_graph_edges e
JOIN _hdb_graph_nodes n_from ON n_from.node_id = e.from_node
JOIN _hdb_graph_nodes n_to   ON n_to.node_id   = e.to_node
WHERE e.edge_kind = 'CONTAINS'
  AND n_from.source_ref LIKE 'docling:%'
ORDER BY n_from.node_id, n_to.node_id
LIMIT 10;
```

`source_ref` keys follow stable patterns:
| Node | source_ref |
|---|---|
| Document root | docling:document:<filename> |
| Section heading | docling:section:<self_ref> |
| Paragraph chunk | docling:chunk:<self_ref> |
| Table | docling:table:<self_ref> |
Re-running the ingester against the same file finds the existing root via source_ref and skips re-inserting (the adapter does not update existing rows in this phase — it dedupes on source_ref).
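The dedup behavior can be pictured with a small Python sketch. This is an illustrative model of the `source_ref`-as-natural-key idea, not the actual adapter; the `ingest` helper and the exact `self_ref` suffixes are assumptions for the example:

```python
def ingest(nodes, filename, chunks):
    """Insert a document root plus chunks keyed by source_ref.
    Existing keys are skipped, mirroring the adapter's no-op re-runs."""
    added = 0
    root = f"docling:document:{filename}"
    if root not in nodes:
        nodes[root] = {"node_kind": "Pdf", "title": filename}
        added += 1
    for i, text in enumerate(chunks):
        key = f"docling:chunk:#/texts/{i}"  # hypothetical self_ref shape
        if key not in nodes:
            nodes[key] = {"node_kind": "DocChunk", "text": text}
            added += 1
    return added
```

Running `ingest` twice against the same filename adds rows only the first time, which is exactly the property that makes re-running `graph_rag_ingest_pdf` safe.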
6. End-to-End — PDF + Code Graph + Search
```rust
use heliosdb_nano::{
    code_graph::CodeIndexOptions,
    graph_rag::{Direction, DoclingIngestOptions, GraphRagOptions},
    EmbeddedDatabase, Result,
};

fn main() -> Result<()> {
    let db = EmbeddedDatabase::open("./data")?;

    // 1. Code corpus
    db.execute("CREATE TABLE IF NOT EXISTS src \
        (path TEXT PRIMARY KEY, lang TEXT, content TEXT)")?;
    db.execute("INSERT INTO src VALUES \
        ('parse.rs', 'rust', 'pub fn parse_query(s: &str) -> Result<Query, Error> { ... }') \
        ON CONFLICT DO NOTHING")?;
    db.code_index(CodeIndexOptions::for_table("src"))?;

    // 2. Spec PDF
    let pdf_opts = DoclingIngestOptions::from_path("./docs/sql_spec.pdf");
    let pdf_stats = db.graph_rag_ingest_pdf(&pdf_opts)?;
    println!("PDF: {} chunks projected", pdf_stats.nodes_added);

    // 3. Cross-modal MENTIONS — DocChunk → Function via whole-word match
    db.graph_rag_link_exact(&[])?;

    // 4. Search: starting from any chunk that mentions "parse_query",
    //    walk 2 hops out through CONTAINS / MENTIONS / CALLS edges.
    let hits = db.graph_rag_search(&GraphRagOptions {
        seed_text: "parse_query".into(),
        hops: 2,
        direction: Direction::Both,
        limit: 50,
        ..Default::default()
    })?;

    for h in hits {
        println!("[hop={}] {} :: {}",
            h.hop_distance, h.node_kind, h.title.unwrap_or_default());
    }
    Ok(())
}
```

You'll see the Function itself, the DocChunk rows that mention it (linked via MENTIONS from §3 of GRAPH_RAG_TUTORIAL), and the surrounding DocSection headings (linked via CONTAINS).
7. Drive It from MCP
The same call surfaces over MCP via helios_graphrag_search. Once the PDF is ingested, an MCP-aware agent (Claude Code, Cursor, …) can run:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "helios_graphrag_search",
    "arguments": { "seed_text": "parse_query", "hops": 2, "limit": 25 },
    "_meta": { "progressToken": "search-1" }
  }
}
```

The agent receives streaming `notifications/progress` events as the walk proceeds (starting with "seeding for parse_query, hops=2").
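If you are scripting the agent side yourself, the same request can be assembled with nothing but the standard library. A sketch (the `graphrag_search_request` helper is hypothetical; the tool name and argument shape come from this tutorial):

```python
import json


def graphrag_search_request(seed_text, hops=2, limit=25,
                            req_id=1, progress_token=None):
    """Serialize a JSON-RPC 2.0 tools/call request for helios_graphrag_search."""
    params = {
        "name": "helios_graphrag_search",
        "arguments": {"seed_text": seed_text, "hops": hops, "limit": limit},
    }
    if progress_token:
        # _meta.progressToken opts in to streaming progress notifications
        params["_meta"] = {"progressToken": progress_token}
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": params,
    })
```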
8. DOCX / PPTX / Audio / Image
The other three entry points share the exact same options shape:
```rust
db.graph_rag_ingest_office(&DoclingIngestOptions::from_path("./deck.pptx"))?;
db.graph_rag_ingest_audio(&DoclingIngestOptions::from_path("./meeting.mp3"))?;
db.graph_rag_ingest_image(&DoclingIngestOptions::from_path("./diagram.png"))?;
```

Audio runs through docling's ASR pipeline — the resulting DoclingDocument has only text (no headings), so the graph projection emits a single `Audio` root with one `DocChunk` per transcribed utterance. Images project the OCR'd text similarly under an `Image` root.
9. Where Next
- GRAPH_RAG_TUTORIAL — query the graph this tutorial just populated.
- MCP_ENDPOINT_TUTORIAL — surface ingestion results to AI agents.
- CODE_GRAPH_TUTORIAL — populate `_hdb_code_symbols` so the docling-projected DocChunks have something to MENTIONS-link to.
- SEMANTIC_HASH_INDEX_QUICKREF — keep re-indexing of the code side incremental.