Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.
To do the visualizations yourself, simply run (change `word` into `char` or `line`):

```sh
uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive
```

| original | char | word | line |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |

Install the package from PyPI:

```sh
pip install docling-parse
```

docling-parse v7 split page parsing into two public configs:

- `DecodeConfig`: how to compute pages. This is fixed when a document is opened.
- `ContentConfig`: what to keep or materialize per page. This can be overridden per page.

```python
from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import (
    ContentConfig,
    ContentLevel,
    DecodeConfig,
    DoclingPdfParser,
)

parser = DoclingPdfParser(loglevel="fatal")

# decode_config is fixed at open time; content_config can be overridden per page.
pdf_doc = parser.load(
    path_or_stream="<path-to-pdf>",
    decode_config=DecodeConfig(
        do_sanitization=True,
        keep_glyphs=False,
    ),
    content_config=ContentConfig(
        char_cells_content_level=ContentLevel.SKIP,
        word_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
        line_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
        shapes_content_level=ContentLevel.SKIP,
        bitmaps_content_level=ContentLevel.SKIP,
    ),
)

for page_no, page in pdf_doc.iterate_pages():
    print(page_no, len(page.word_cells), len(page.textline_cells))

    # Iterate over the word cells of this page.
    for word in page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, word.text)

    # Render the page at word level and display the image.
    image = page.render_as_image(cell_unit=TextCellUnit.WORD)
    image.show()
```

If you open cheaply and later need richer output, request it per page. When the
new `content_config` needs entities that were previously skipped, that page is
re-decoded automatically:

```python
from docling_parse.pdf_parser import ContentConfig, ContentLevel

page = pdf_doc.get_page(
    1,
    content_config=ContentConfig(
        word_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
        line_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
    ),
)
```

The main API break in v7 is that the old public `DecodePageConfig` selection
flags were split into two concerns:

- `DecodeConfig`: compute-time tuning only
- `ContentConfig`: what to skip, compute, or materialize per page

In practice:

- open-time `decode_config` replaces the old per-page decode tuning
- per-page content selection now lives in `content_config`
- `materialize_bitmap_bytes` became `include_bitmap_bytes`
- threaded `page_materialization_config` became `page_content_config`

Typical migration examples:

- old `DecodePageConfig.keep_char_cells=True` -> `ContentConfig(char_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE)`
- old `DecodePageConfig.create_word_cells=True` without surfacing them everywhere -> `ContentConfig(word_cells_content_level=ContentLevel.COMPUTE)`
- old `materialize_bitmap_bytes=False` -> `ContentConfig(include_bitmap_bytes=False)`

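
The mappings above can be collected into a small helper. The sketch below is not part of docling-parse: `migrate_legacy_flags` is a hypothetical name, and it returns content levels as plain strings (the names of `ContentLevel` members) so that it runs without the package installed. It covers only the three flags listed above, and mapping `create_word_cells=True` to `COMPUTE` assumes the word cells do not need to be surfaced on every page.

```python
# Hypothetical migration helper; not part of docling-parse.
# It covers only the three legacy flags listed in the migration examples.


def migrate_legacy_flags(legacy: dict) -> dict:
    """Translate legacy flag values into ContentConfig keyword arguments.

    Content levels are returned as the names of ContentLevel members, so the
    sketch has no dependency on docling-parse itself.
    """
    kwargs = {}
    if legacy.get("keep_char_cells"):
        kwargs["char_cells_content_level"] = "COMPUTE_AND_MATERIALIZE"
    if legacy.get("create_word_cells"):
        # Assumption: words are computed but not surfaced on every page.
        kwargs["word_cells_content_level"] = "COMPUTE"
    if "materialize_bitmap_bytes" in legacy:
        kwargs["include_bitmap_bytes"] = bool(legacy["materialize_bitmap_bytes"])
    return kwargs


if __name__ == "__main__":
    old = {"keep_char_cells": True, "materialize_bitmap_bytes": False}
    print(migrate_legacy_flags(old))
```

With the package installed, each level name could be resolved with `getattr(ContentLevel, name)` before the result is passed to `ContentConfig(**kwargs)`.
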
One semantic change matters: `decode_config` is now fixed when the document or
threaded batch is opened. If you want richer page output later, override
`content_config` on `get_page(...)` instead. On the sequential path this may
re-decode that page; on the threaded path you can only materialize entities the
batch already computed.
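
One way to reason about this rule is to treat the three content levels as ordered. The model below is a stand-in written for illustration, not the library's implementation: the local `Level` enum only mirrors the `ContentLevel` member names, `can_serve_threaded` and `needs_redecode_sequential` are hypothetical helpers, and the ordering SKIP < COMPUTE < COMPUTE_AND_MATERIALIZE is an assumption based on those names.

```python
from enum import IntEnum


class Level(IntEnum):
    """Local stand-in mirroring the ContentLevel member names (assumed ordering)."""

    SKIP = 0
    COMPUTE = 1
    COMPUTE_AND_MATERIALIZE = 2


def can_serve_threaded(batch_level: Level, requested: Level) -> bool:
    """Threaded path: a request is servable only if the batch computed the entity."""
    if requested == Level.SKIP:
        return True  # nothing is being asked for
    return batch_level >= Level.COMPUTE


def needs_redecode_sequential(open_level: Level, requested: Level) -> bool:
    """Sequential path: the page is re-decoded when a skipped entity is requested."""
    return open_level == Level.SKIP and requested != Level.SKIP


if __name__ == "__main__":
    # Words were computed by the batch, so they can be materialized later.
    print(can_serve_threaded(Level.COMPUTE, Level.COMPUTE_AND_MATERIALIZE))
    # Chars were skipped by the batch, so they cannot be materialized later.
    print(can_serve_threaded(Level.SKIP, Level.COMPUTE_AND_MATERIALIZE))
```
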
Parse one or more PDFs in parallel with backpressure:

```python
from docling_parse.pdf_parser import (
    ContentConfig,
    ContentLevel,
    DecodeConfig,
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)

parser = DoclingThreadedPdfParser(
    parser_config=ThreadedPdfParserConfig(
        loglevel="fatal",
        threads=4,
        max_concurrent_results=32,
        page_content_config=ContentConfig(
            word_cells_content_level=ContentLevel.COMPUTE,
            line_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
        ),
    ),
    decode_config=DecodeConfig(),
)

doc_key = parser.load("doc_a.pdf", page_numbers=[1, 3, 5])
print(doc_key, parser.page_count(doc_key), parser.scheduled_page_count(doc_key))

for result in parser.iterate_results():
    if not result.success:
        print(result.doc_key, result.page_number, result.error_message)
        continue

    # Batch decode kept word cells in C++, but did not materialize them by default.
    page = result.get_page(
        ContentConfig(
            word_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
            line_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
        )
    )
    print(
        result.doc_key,
        result.page_number,
        len(page.word_cells),
        result.timings.total_s,
    )
```

For threaded parse-and-render workloads, set
`ThreadedPdfParserConfig.render_config` and use `result.get_image()`,
`result.get_image(scale=...)`, or `result.get_image(canvas_size=...)`.
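
Backpressure here means that workers stall once too many finished results are waiting to be consumed. How docling-parse implements this is not described in this document, so the sketch below shows only the general pattern using the standard library: a bounded `queue.Queue` plays the role that `max_concurrent_results` appears to play, and `fake_decode` is a placeholder for real page decoding.

```python
import queue
import threading


def fake_decode(page_number: int) -> str:
    """Placeholder for real page decoding."""
    return f"page-{page_number}"


def run_with_backpressure(page_numbers, threads=4, max_results=2):
    """Decode pages on worker threads and yield results through a bounded queue."""
    tasks = queue.Queue()
    for number in page_numbers:
        tasks.put(number)

    # put() blocks while the queue is full, which stalls the workers
    # until the consumer has taken results out.
    results = queue.Queue(maxsize=max_results)

    def worker():
        while True:
            try:
                number = tasks.get_nowait()
            except queue.Empty:
                break
            results.put((number, fake_decode(number)))
        results.put(None)  # sentinel: this worker has finished

    workers = [threading.Thread(target=worker, daemon=True) for _ in range(threads)]
    for w in workers:
        w.start()

    finished = 0
    while finished < threads:
        item = results.get()
        if item is None:
            finished += 1
        else:
            yield item

    for w in workers:
        w.join()


if __name__ == "__main__":
    for page_number, payload in run_with_backpressure([1, 3, 5], threads=2):
        print(page_number, payload)
```

Results arrive in completion order rather than page order, which is why the loop above reads the page number from each result instead of assuming a sequence.
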
Use the CLI:

```sh
$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
```

Current perf tooling lives under `perf/`:

- `perf/run_perf.py`: per-page CSV benchmarking across `docling`, `docling-threaded`, `pdfplumber`, `pypdfium2`, and `pymupdf`
- `perf/run_scaling.py`: pages/sec and scaling sweeps for threaded parse and render workloads
- `docs/performance_code.md`: usage notes and interpretation

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.
To build the parser, simply run the following command in the root folder:

```sh
rm -rf build; cmake -B ./build; cd build; make
```

You can run the parser from your build folder:

```sh
% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage
```

If you don't have an input file, a template input file will be printed on the terminal.
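
For scripted runs, the options listed in the help text can be assembled programmatically. The sketch below only builds and launches the command line; `build_parse_command` and `run_parse` are hypothetical helper names, the default executable path assumes you are calling from the build folder, and the format of the output file is not specified in this document.

```python
import subprocess


def build_parse_command(pdf_path, output_path=None, page=None,
                        loglevel="error", exe="./parse.exe"):
    """Assemble an argument list from the options shown in the help text."""
    cmd = [exe, "-i", str(pdf_path), "-l", loglevel]
    if page is not None:
        # Omitting -p leaves the parser on its default (all pages).
        cmd += ["-p", str(page)]
    if output_path is not None:
        cmd += ["-o", str(output_path)]
    return cmd


def run_parse(pdf_path, **kwargs):
    """Run the parser; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(
        build_parse_command(pdf_path, **kwargs),
        check=True,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Print the command instead of running it, since the binary may not exist.
    print(build_parse_command("input.pdf", output_path="result.out", page=2))
```
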
To build the package, simply run (make sure uv is installed):

```sh
uv sync
```

This only works after a clean git clone. If you are developing and updating C++ code, please use

```sh
# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"
```

or

```sh
BUILD_THREADS=12 uv pip install --force-reinstall --no-deps -e ".[perf]"
```

To test the package, run:

```sh
uv run pytest ./tests -v -s
```

Please read Contributing to Docling Parse for details.
If you use Docling in your projects, please consider citing the following:
```bibtex
@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}
```

The Docling Parse codebase is under MIT license. For individual model usage, please refer to the model licenses found in the original packages.
Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.
The project was started by the AI for knowledge team at IBM Research Zurich.