Docling Ingestion Tutorial
- Available since: v3.19.0 (2026-04-25)
- Build: `cargo build --release --features graph-rag` (implies `code-graph`)
- APIs: `EmbeddedDatabase::graph_rag_ingest_pdf`, `_office`, `_audio`, `_image`
- External dependency: docling-serve — runs out-of-process, typically as a Docker / Podman container
UVP
Docling is the document-conversion stack that handles PDFs, DOCX, PPTX, images, and audio. HeliosDB Nano embeds an idempotent adapter that POSTs your file to a docling-serve instance, parses the resulting DoclingDocument JSON, and projects sections + chunks + tables into the universal `_hdb_graph_*` schema with `CONTAINS` edges preserving hierarchy. Re-running the ingester against the same file is a safe no-op via `source_ref` keys. Drop a PDF into the same engine that holds your code symbols, your emails, and your issues — and run a single `graph_rag_search()` across all of them.
Prerequisites
- HeliosDB Nano v3.19+ source tree built with `--features graph-rag`
- Docker or Podman (for running docling-serve locally)
- A sample PDF, DOCX, or image to ingest
- The Graph-RAG Tutorial read first — this tutorial assumes familiarity with `_hdb_graph_nodes` / `_hdb_graph_edges`
- About 25 minutes (most of which is docling-serve's first-run model download)
1. Run docling-serve Locally
The official image:
```bash
docker run --rm -d \
  --name docling-serve \
  -p 5001:5001 \
  ghcr.io/docling-project/docling-serve:latest
```

First boot pulls the conversion models (~700 MB; takes 1–3 minutes depending on your link). Once ready, the API responds at:

```bash
curl -s http://localhost:5001/health
# {"status":"ok"}
```

The Nano adapter targets `http://localhost:5001/v1/convert/source` by default — change it via `DoclingIngestOptions::with_endpoint(...)` if your instance lives elsewhere or behind a gateway.
2. Wire HeliosDB Nano to docling-serve
The four ingestion entry points are one-per-modality:
| Method | What it ingests | Default `corpus_kind` |
|---|---|---|
| `graph_rag_ingest_pdf` | PDFs | `Pdf` |
| `graph_rag_ingest_office` | DOCX / PPTX / XLSX | `Office` |
| `graph_rag_ingest_audio` | Audio (text-only DoclingDocuments via docling's ASR) | `Audio` |
| `graph_rag_ingest_image` | Images | `Image` |
Each takes a single DoclingIngestOptions carrying the source (URL, local path, or in-memory bytes) plus the docling endpoint:
```rust
use heliosdb_nano::{
    graph_rag::DoclingIngestOptions,
    EmbeddedDatabase, Result,
};

fn main() -> Result<()> {
    let db = EmbeddedDatabase::new_in_memory()?;

    // 1. Build options. `from_path` defaults to the local docling-serve
    //    endpoint and the "Document" corpus kind — the per-modality
    //    entry point overrides the kind to "Pdf" / "Office" / etc.
    let opts = DoclingIngestOptions::from_path("/path/to/whitepaper.pdf");

    // 2. Run. Each call is idempotent — same source_ref → same node_id.
    let stats = db.graph_rag_ingest_pdf(&opts)?;

    println!(
        "ingested: {} nodes, {} edges, {} rows seen",
        stats.nodes_added, stats.edges_added, stats.rows_seen
    );
    Ok(())
}
```

Behind the scenes the adapter:

- Reads the file (or fetches the URL, or decodes the in-memory bytes), base64-encodes it, and POSTs `{"sources": [{"kind": "file", "filename": "...", "data_base64": "..."}], "to_formats": ["json"]}` to the docling endpoint.
- Parses the response's `documents[].json_content` field as a `DoclingDocument`.
- Walks the structure and emits one root node (`node_kind = "Pdf"` / `"Office"` / etc.), one `DocSection` per heading-style text, one `DocChunk` per paragraph, and one `DocTable` per table — wired together with `CONTAINS` edges.
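To make the wire format concrete, here is a minimal Python sketch of the request body described above. It is an illustrative model, not part of HeliosDB: `build_convert_payload` is a hypothetical helper, and only the JSON shape (`sources`, `data_base64`, `to_formats`) comes from this tutorial.

```python
import base64


def build_convert_payload(path):
    """Build a docling-serve-style convert request body:
    the file base64-encoded, plus the requested output format."""
    with open(path, "rb") as f:
        data = f.read()
    return {
        "sources": [{
            "kind": "file",
            # keep only the basename, as a filename hint for the converter
            "filename": path.rsplit("/", 1)[-1],
            "data_base64": base64.b64encode(data).decode("ascii"),
        }],
        "to_formats": ["json"],
    }
```

POSTing this dict as JSON to `/v1/convert/source` is equivalent to what the adapter does for the `from_path` source mode.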
3. The Three Source Modes
```rust
// (a) Local file — adapter reads + base64-encodes + POSTs.
let opts = DoclingIngestOptions::from_path("./paper.pdf");

// (b) Remote URL — docling-serve fetches it server-side.
let opts = DoclingIngestOptions::from_url("https://arxiv.org/pdf/2403.10131.pdf");

// (c) In-memory bytes — useful for non-filesystem corpora.
let opts = DoclingIngestOptions {
    source_bytes: Some(std::fs::read("paper.pdf")?),
    filename: Some("paper.pdf".into()),
    docling_endpoint: "http://localhost:5001/v1/convert/source".into(),
    corpus_kind: "Pdf".into(),
    timeout_ms: 60_000,
    ..Default::default()
};
```

The default 60 s HTTP timeout covers docling's first-model-load slow path. Override `timeout_ms` for big files or slow networks.
4. Override the Endpoint
```rust
let opts = DoclingIngestOptions::from_path("./paper.pdf")
    .with_endpoint("http://docling.internal:5001/v1/convert/source")
    .with_corpus_kind("ResearchPaper");
```

`auth_bearer` adds `Authorization: Bearer <token>` to the docling-serve POST — useful when the conversion service is behind an auth proxy.
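The effect of `auth_bearer` is just the standard HTTP Bearer scheme. A tiny Python sketch of the headers such a POST would carry (`convert_headers` is a hypothetical helper for illustration, not a HeliosDB API):

```python
def convert_headers(auth_bearer=None):
    """Headers for a JSON convert POST; a bearer token, when set,
    rides along as a standard Authorization header."""
    headers = {"Content-Type": "application/json"}
    if auth_bearer:
        headers["Authorization"] = f"Bearer {auth_bearer}"
    return headers
```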
5. Inspect the Result
After a single PDF ingestion:
```sql
SELECT node_kind, COUNT(*)
FROM _hdb_graph_nodes
WHERE source_ref LIKE 'docling:%'
GROUP BY node_kind
ORDER BY 1;

-- node_kind  | count
-- DocChunk   | 87
-- DocSection | 14
-- DocTable   | 3
-- Pdf        | 1
```

The hierarchy:

```sql
SELECT n_from.node_kind    AS parent_kind,
       n_from.title        AS parent_title,
       n_to.node_kind      AS child_kind,
       LEFT(n_to.text, 60) AS child_text
FROM _hdb_graph_edges e
JOIN _hdb_graph_nodes n_from ON n_from.node_id = e.from_node
JOIN _hdb_graph_nodes n_to   ON n_to.node_id   = e.to_node
WHERE e.edge_kind = 'CONTAINS'
  AND n_from.source_ref LIKE 'docling:%'
ORDER BY n_from.node_id, n_to.node_id
LIMIT 10;
```

`source_ref` keys follow stable patterns:
| Node | source_ref |
|---|---|
| Document root | docling:document:<filename> |
| Section heading | docling:section:<self_ref> |
| Paragraph chunk | docling:chunk:<self_ref> |
| Table | docling:table:<self_ref> |
Re-running the ingester against the same file finds the existing root via source_ref and skips re-inserting (the adapter does not update existing rows in this phase — it dedupes on source_ref).
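The dedup behavior can be pictured with a small Python sketch. This is an illustrative model of the `source_ref`-as-natural-key idea, not the actual adapter; the `ingest` helper and the exact `self_ref` suffixes are assumptions for the example:

```python
def ingest(nodes, filename, chunks):
    """Insert a document root plus chunks keyed by source_ref.
    Existing keys are skipped, mirroring the adapter's no-op re-runs."""
    added = 0
    root = f"docling:document:{filename}"
    if root not in nodes:
        nodes[root] = {"node_kind": "Pdf", "title": filename}
        added += 1
    for i, text in enumerate(chunks):
        key = f"docling:chunk:#/texts/{i}"  # hypothetical self_ref shape
        if key not in nodes:
            nodes[key] = {"node_kind": "DocChunk", "text": text}
            added += 1
    return added
```

Running `ingest` twice against the same filename adds rows only the first time, which is exactly the property that makes re-running `graph_rag_ingest_pdf` safe.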
6. End-to-End — PDF + Code Graph + Search
```rust
use heliosdb_nano::{
    code_graph::CodeIndexOptions,
    graph_rag::{Direction, DoclingIngestOptions, GraphRagOptions},
    EmbeddedDatabase, Result,
};

fn main() -> Result<()> {
    let db = EmbeddedDatabase::open("./data")?;

    // 1. Code corpus
    db.execute("CREATE TABLE IF NOT EXISTS src \
        (path TEXT PRIMARY KEY, lang TEXT, content TEXT)")?;
    db.execute("INSERT INTO src VALUES \
        ('parse.rs', 'rust', 'pub fn parse_query(s: &str) -> Result<Query, Error> { ... }') \
        ON CONFLICT DO NOTHING")?;
    db.code_index(CodeIndexOptions::for_table("src"))?;

    // 2. Spec PDF
    let pdf_opts = DoclingIngestOptions::from_path("./docs/sql_spec.pdf");
    let pdf_stats = db.graph_rag_ingest_pdf(&pdf_opts)?;
    println!("PDF: {} chunks projected", pdf_stats.nodes_added);

    // 3. Cross-modal MENTIONS — DocChunk → Function via whole-word match
    db.graph_rag_link_exact(&[])?;

    // 4. Search: starting from any chunk that mentions "parse_query",
    //    walk 2 hops out through CONTAINS / MENTIONS / CALLS edges.
    let hits = db.graph_rag_search(&GraphRagOptions {
        seed_text: "parse_query".into(),
        hops: 2,
        direction: Direction::Both,
        limit: 50,
        ..Default::default()
    })?;

    for h in hits {
        println!("[hop={}] {} :: {}",
            h.hop_distance, h.node_kind, h.title.unwrap_or_default());
    }
    Ok(())
}
```

You'll see the Function itself, the DocChunk rows that mention it (linked via MENTIONS from §3 of GRAPH_RAG_TUTORIAL), and the surrounding DocSection headings (linked via CONTAINS).
7. Drive It from MCP
The same call surfaces over MCP via helios_graphrag_search. Once the PDF is ingested, an MCP-aware agent (Claude Code, Cursor, …) can run:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "helios_graphrag_search",
    "arguments": { "seed_text": "parse_query", "hops": 2, "limit": 25 },
    "_meta": { "progressToken": "search-1" }
  }
}
```

The agent receives streaming `notifications/progress` events as the walk proceeds (starting with "seeding for parse_query, hops=2").
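If you are scripting the agent side yourself, the same request can be assembled with nothing but the standard library. A sketch (the `graphrag_search_request` helper is hypothetical; the tool name and argument shape come from this tutorial):

```python
import json


def graphrag_search_request(seed_text, hops=2, limit=25,
                            req_id=1, progress_token=None):
    """Serialize a JSON-RPC 2.0 tools/call request for helios_graphrag_search."""
    params = {
        "name": "helios_graphrag_search",
        "arguments": {"seed_text": seed_text, "hops": hops, "limit": limit},
    }
    if progress_token:
        # _meta.progressToken opts in to streaming progress notifications
        params["_meta"] = {"progressToken": progress_token}
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": params,
    })
```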
8. DOCX / PPTX / Audio / Image
The other three entry points share the exact same options shape:
```rust
db.graph_rag_ingest_office(&DoclingIngestOptions::from_path("./deck.pptx"))?;
db.graph_rag_ingest_audio(&DoclingIngestOptions::from_path("./meeting.mp3"))?;
db.graph_rag_ingest_image(&DoclingIngestOptions::from_path("./diagram.png"))?;
```

Audio runs through docling's ASR pipeline — the resulting DoclingDocument has only text (no headings), so the graph projection emits a single `Audio` root with one `DocChunk` per transcribed utterance. Images project the OCR'd text similarly under an `Image` root.
9. Where Next
- GRAPH_RAG_TUTORIAL — query the graph this tutorial just populated.
- MCP_ENDPOINT_TUTORIAL — surface ingestion results to AI agents.
- CODE_GRAPH_TUTORIAL — populate `_hdb_code_symbols` so the docling-projected DocChunks have something to MENTIONS-link to.
- SEMANTIC_HASH_INDEX_QUICKREF — keep re-indexing of the code side incremental.