
Docling Ingestion Tutorial


  • Available since: v3.19.0 (2026-04-25)
  • Build: cargo build --release --features graph-rag (implies code-graph)
  • APIs: EmbeddedDatabase::graph_rag_ingest_pdf, _office, _audio, _image
  • External dependency: docling-serve — runs out-of-process, typically as a Docker / Podman container


UVP

Docling is the document-conversion stack that handles PDFs, DOCX, PPTX, images, and audio. HeliosDB Nano embeds an idempotent adapter that POSTs your file to a docling-serve instance, parses the resulting DoclingDocument JSON, and projects sections + chunks + tables into the universal _hdb_graph_* schema with CONTAINS edges preserving hierarchy. Re-running the ingester against the same file is a safe no-op via source_ref keys. Drop a PDF into the same engine that holds your code symbols, your emails, and your issues — and run a single graph_rag_search() across all of them.


Prerequisites

  • HeliosDB Nano v3.19+ source tree with --features graph-rag
  • Docker or Podman (for running docling-serve locally)
  • A sample PDF, DOCX, or image to ingest
  • The Graph-RAG Tutorial, read first — this tutorial assumes familiarity with _hdb_graph_nodes / _hdb_graph_edges
  • About 25 minutes (most of which is docling-serve’s first-run model download)

1. Run docling-serve Locally

The official image:

Terminal window
docker run --rm -d \
--name docling-serve \
-p 5001:5001 \
ghcr.io/docling-project/docling-serve:latest

First boot pulls the conversion models (~700 MB; takes 1–3 minutes depending on your link). Once ready, the API responds at:

Terminal window
curl -s http://localhost:5001/health
# {"status":"ok"}

The Nano adapter targets http://localhost:5001/v1/convert/source by default — change it in DoclingIngestOptions::with_endpoint(...) if your instance lives elsewhere or behind a gateway.


2. Wire HeliosDB Nano to docling-serve

There are four ingestion entry points, one per modality:

Method                    What it ingests                                        Default corpus_kind
graph_rag_ingest_pdf      PDFs                                                   Pdf
graph_rag_ingest_office   DOCX / PPTX / XLSX                                     Office
graph_rag_ingest_audio    Audio (text-only DoclingDocuments via docling’s ASR)   Audio
graph_rag_ingest_image    Images                                                 Image

Each takes a single DoclingIngestOptions carrying the source (URL, local path, or in-memory bytes) plus the docling endpoint:

use heliosdb_nano::{
    graph_rag::DoclingIngestOptions,
    EmbeddedDatabase, Result,
};

fn main() -> Result<()> {
    let db = EmbeddedDatabase::new_in_memory()?;

    // 1. Build options. `from_path` defaults to the local docling-serve
    //    endpoint and the "Document" corpus kind — the per-modality
    //    entry point overrides the kind to "Pdf" / "Office" / etc.
    let opts = DoclingIngestOptions::from_path("/path/to/whitepaper.pdf");

    // 2. Run. Each call is idempotent — same source_ref → same node_id.
    let stats = db.graph_rag_ingest_pdf(&opts)?;
    println!(
        "ingested: {} nodes, {} edges, {} rows seen",
        stats.nodes_added, stats.edges_added, stats.rows_seen
    );
    Ok(())
}

Behind the scenes the adapter:

  1. Reads the file (or fetches the URL, or decodes the in-memory bytes), base64-encodes it, and POSTs {"sources": [{"kind": "file", "filename": "...", "data_base64": "..."}], "to_formats": ["json"]} to the docling endpoint.
  2. Parses the response’s documents[].json_content field as a DoclingDocument.
  3. Walks the structure and emits one root node (node_kind = "Pdf" / "Office" / etc.), one DocSection per heading-style text, one DocChunk per paragraph, and one DocTable per table — wired together with CONTAINS edges.
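Step 1 above can be sketched in a few lines of Python — illustrative only (the real adapter is Rust, and `build_convert_payload` is a made-up name); the payload shape is the one quoted above:

```python
import base64, json, os, tempfile

def build_convert_payload(path):
    """Mimic step 1: read the file, base64-encode it, and wrap it in the
    convert payload shape shown above (sources + to_formats)."""
    with open(path, "rb") as f:
        data = f.read()
    return json.dumps({
        "sources": [{
            "kind": "file",
            "filename": os.path.basename(path),
            "data_base64": base64.b64encode(data).decode("ascii"),
        }],
        "to_formats": ["json"],
    })

# Demo on a throwaway file — building the body needs no running docling-serve.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
    f.write(b"%PDF-1.4 demo")
    demo_path = f.name
body = json.loads(build_convert_payload(demo_path))
```

Note that base64 inflates the body by about a third, so very large files are cheaper to hand to docling-serve by URL (mode (b) below) than by upload.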

3. The Three Source Modes

// (a) Local file — adapter reads + base64-encodes + POSTs.
let opts = DoclingIngestOptions::from_path("./paper.pdf");

// (b) Remote URL — docling-serve fetches it server-side.
let opts = DoclingIngestOptions::from_url("https://arxiv.org/pdf/2403.10131.pdf");

// (c) In-memory bytes — useful for non-filesystem corpora.
let opts = DoclingIngestOptions {
    source_bytes: Some(std::fs::read("paper.pdf")?),
    filename: Some("paper.pdf".into()),
    docling_endpoint: "http://localhost:5001/v1/convert/source".into(),
    corpus_kind: "Pdf".into(),
    timeout_ms: 60_000,
    ..Default::default()
};

The default 60 s HTTP timeout covers docling’s first-model-load slow path. Override with timeout_ms for big files or slow networks.


4. Override the Endpoint

let opts = DoclingIngestOptions::from_path("./paper.pdf")
    .with_endpoint("http://docling.internal:5001/v1/convert/source")
    .with_corpus_kind("ResearchPaper");

auth_bearer adds Authorization: Bearer <token> to the docling-serve POST — useful when the conversion service is behind an auth proxy.
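What auth_bearer does to the outgoing request can be modelled in Python (a sketch; `convert_headers` is a hypothetical helper, not part of the API — the real adapter sets this header in Rust):

```python
def convert_headers(auth_bearer=None):
    """Headers for the docling-serve POST. A bearer token, when present,
    becomes a standard Authorization header; otherwise it is omitted."""
    headers = {"Content-Type": "application/json"}
    if auth_bearer:
        headers["Authorization"] = f"Bearer {auth_bearer}"
    return headers

with_auth = convert_headers("s3cr3t")
without_auth = convert_headers()
```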


5. Inspect the Result

After a single PDF ingestion:

SELECT node_kind, COUNT(*) FROM _hdb_graph_nodes
WHERE source_ref LIKE 'docling:%'
GROUP BY node_kind ORDER BY 1;
-- node_kind | count
-- DocChunk | 87
-- DocSection | 14
-- DocTable | 3
-- Pdf | 1

The hierarchy:

SELECT n_from.node_kind AS parent_kind, n_from.title AS parent_title,
       n_to.node_kind   AS child_kind,  LEFT(n_to.text, 60) AS child_text
FROM _hdb_graph_edges e
JOIN _hdb_graph_nodes n_from ON n_from.node_id = e.from_node
JOIN _hdb_graph_nodes n_to   ON n_to.node_id   = e.to_node
WHERE e.edge_kind = 'CONTAINS'
  AND n_from.source_ref LIKE 'docling:%'
ORDER BY n_from.node_id, n_to.node_id
LIMIT 10;

source_ref keys follow stable patterns:

Node             source_ref
Document root    docling:document:<filename>
Section heading  docling:section:<self_ref>
Paragraph chunk  docling:chunk:<self_ref>
Table            docling:table:<self_ref>

Re-running the ingester against the same file finds the existing root via source_ref and skips re-inserting (the adapter does not update existing rows in this phase — it dedupes on source_ref).
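The dedup behaviour can be modelled in a few lines of Python (illustrative; `FakeGraph` is a toy stand-in, and only the docling:<kind>:<id> key scheme comes from the table above):

```python
def source_ref(kind, ident):
    # Key scheme from the table above: docling:document:<filename>, etc.
    return f"docling:{kind}:{ident}"

class FakeGraph:
    """Toy stand-in for _hdb_graph_nodes, keyed on source_ref."""
    def __init__(self):
        self.nodes = {}

    def ingest(self, refs):
        added = 0
        for ref in refs:
            if ref not in self.nodes:   # existing rows are skipped,
                self.nodes[ref] = True  # never updated, in this phase
                added += 1
        return added

g = FakeGraph()
refs = [source_ref("document", "paper.pdf"), source_ref("chunk", "#/texts/0")]
first = g.ingest(refs)   # both nodes inserted
second = g.ingest(refs)  # safe no-op: nothing inserted
```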


6. Cross-Modal Search Across Code and Documents

use heliosdb_nano::{
    code_graph::CodeIndexOptions,
    graph_rag::{Direction, DoclingIngestOptions, GraphRagOptions},
    EmbeddedDatabase, Result,
};

fn main() -> Result<()> {
    let db = EmbeddedDatabase::open("./data")?;

    // 1. Code corpus
    db.execute("CREATE TABLE IF NOT EXISTS src \
        (path TEXT PRIMARY KEY, lang TEXT, content TEXT)")?;
    db.execute("INSERT INTO src VALUES \
        ('parse.rs', 'rust', 'pub fn parse_query(s: &str) -> Result<Query, Error> { ... }') \
        ON CONFLICT DO NOTHING")?;
    db.code_index(CodeIndexOptions::for_table("src"))?;

    // 2. Spec PDF
    let pdf_opts = DoclingIngestOptions::from_path("./docs/sql_spec.pdf");
    let pdf_stats = db.graph_rag_ingest_pdf(&pdf_opts)?;
    println!("PDF: {} chunks projected", pdf_stats.nodes_added);

    // 3. Cross-modal MENTIONS — DocChunk → Function via whole-word match
    db.graph_rag_link_exact(&[])?;

    // 4. Search: starting from any chunk that mentions "parse_query",
    //    walk 2 hops out through CONTAINS / MENTIONS / CALLS edges.
    let hits = db.graph_rag_search(&GraphRagOptions {
        seed_text: "parse_query".into(),
        hops: 2,
        direction: Direction::Both,
        limit: 50,
        ..Default::default()
    })?;
    for h in hits {
        println!("[hop={}] {} :: {}",
            h.hop_distance, h.node_kind, h.title.unwrap_or_default());
    }
    Ok(())
}

You’ll see the Function itself, the DocChunk rows that mention it (linked via MENTIONS from §3 of GRAPH_RAG_TUTORIAL), and the surrounding DocSection headings (linked via CONTAINS).
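The traversal itself can be modelled as a bounded breadth-first search: seed on nodes whose text contains the term, then walk up to `hops` edges in both directions. A toy Python model under those assumptions (the real engine runs this over _hdb_graph_edges; all names here are made up):

```python
from collections import deque

def graph_search(nodes, edges, seed_text, hops):
    """nodes: {id: text}; edges: [(from_id, to_id)].
    Returns {id: hop_distance} for every node reached within `hops`."""
    # Adjacency in both directions, matching Direction::Both.
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    # Seeds sit at hop distance 0.
    seen = {i: 0 for i, text in nodes.items() if seed_text in text}
    q = deque(seen)
    while q:
        cur = q.popleft()
        if seen[cur] == hops:
            continue  # hop budget exhausted on this branch
        for nxt in adj.get(cur, []):
            if nxt not in seen:
                seen[nxt] = seen[cur] + 1
                q.append(nxt)
    return seen

nodes = {
    "fn": "pub fn parse_query",          # Function node
    "chunk": "the parse_query routine",  # DocChunk that mentions it
    "section": "Grammar",                # DocSection containing the chunk
}
edges = [("chunk", "fn"), ("section", "chunk")]  # MENTIONS, CONTAINS
hits = graph_search(nodes, edges, "parse_query", hops=2)
```

Both the Function and the mentioning DocChunk match the seed text directly (hop 0), and the enclosing DocSection is reached one CONTAINS edge away.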


7. Drive It from MCP

The same call surfaces over MCP via helios_graphrag_search. Once the PDF is ingested, an MCP-aware agent (Claude Code, Cursor, …) can run:

{
  "jsonrpc": "2.0", "id": 1, "method": "tools/call",
  "params": {
    "name": "helios_graphrag_search",
    "arguments": { "seed_text": "parse_query", "hops": 2, "limit": 25 },
    "_meta": { "progressToken": "search-1" }
  }
}

The agent receives streaming notifications/progress events (“seeding for parse_query, hops=2” → “… hits”) followed by the final result. See MCP_ENDPOINT_TUTORIAL §6.


8. DOCX / PPTX / Audio / Image

The other three entry points share the exact same options shape:

db.graph_rag_ingest_office(&DoclingIngestOptions::from_path("./deck.pptx"))?;
db.graph_rag_ingest_audio(&DoclingIngestOptions::from_path("./meeting.mp3"))?;
db.graph_rag_ingest_image(&DoclingIngestOptions::from_path("./diagram.png"))?;

Audio runs through docling’s ASR pipeline — the resulting DoclingDocument has only text (no headings), so the graph projection emits a single Audio root with one DocChunk per transcribed utterance. Images project the OCR’d text similarly under an Image root.
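The flat audio projection described above can be sketched as follows — illustrative Python, with the node/edge shapes loosely following the _hdb_graph_* schema and source_ref key patterns from §5 (the #/texts/<i> self_ref form is an assumption about DoclingDocument internals):

```python
def project_transcript(filename, utterances):
    """One Audio root, one DocChunk per transcribed utterance, and a
    CONTAINS edge from the root down to each chunk — no section layer,
    because ASR output carries no headings."""
    root = {"node_kind": "Audio",
            "source_ref": f"docling:document:{filename}"}
    chunks = [
        {"node_kind": "DocChunk",
         "source_ref": f"docling:chunk:#/texts/{i}",  # assumed self_ref form
         "text": text}
        for i, text in enumerate(utterances)
    ]
    edges = [(root["source_ref"], c["source_ref"], "CONTAINS")
             for c in chunks]
    return root, chunks, edges

root, chunks, edges = project_transcript(
    "meeting.mp3",
    ["Welcome everyone.", "First item: the Q3 roadmap."])
```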


9. Where Next