Docling Parse

A simple package to extract text, paths, and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser, with char-, word-, and line-level output for text, in addition to the extracted paths and bitmap resources.

To produce the visualizations yourself, run the following (replace word with char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Image table with columns: original, char, word, line]

Quick start

Install the package from PyPI:

pip install docling-parse

Sequential parsing

docling-parse v7 splits page parsing into two public configs:

DecodeConfig: how to compute pages. This is fixed when a document is opened.
ContentConfig: what to keep or materialize per page. This can be overridden per page.

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import (
    ContentConfig,
    ContentLevel,
    DecodeConfig,
    DoclingPdfParser,
)

parser = DoclingPdfParser(loglevel="fatal")

pdf_doc = parser.load(
    path_or_stream="<path-to-pdf>",
    decode_config=DecodeConfig(
        do_sanitization=True,
        keep_glyphs=False,
    ),
    content_config=ContentConfig(
        char_cells_content_level=ContentLevel.SKIP,
        word_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
        line_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
        shapes_content_level=ContentLevel.SKIP,
        bitmaps_content_level=ContentLevel.SKIP,
    ),
)

for page_no, page in pdf_doc.iterate_pages():
    print(page_no, len(page.word_cells), len(page.textline_cells))

    for word in page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, word.text)

    image = page.render_as_image(cell_unit=TextCellUnit.WORD)
    image.show()

If you open cheaply and later need richer output, request it per page. When the new content_config needs entities that were previously skipped, that page is re-decoded automatically:

from docling_parse.pdf_parser import ContentConfig, ContentLevel

page = pdf_doc.get_page(
    1,
    content_config=ContentConfig(
        word_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
        line_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
    ),
)

v6 -> v7 migration

The main API break in v7 is that the old public DecodePageConfig selection flags were split into two concerns:

DecodeConfig: compute-time tuning only
ContentConfig: what to skip, compute, or materialize per page

In practice:

open-time decode_config replaces the old per-page decode tuning
per-page content selection now lives in content_config
materialize_bitmap_bytes became include_bitmap_bytes
threaded page_materialization_config became page_content_config

Typical migration examples:

old DecodePageConfig.keep_char_cells=True -> ContentConfig(char_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE)
old DecodePageConfig.create_word_cells=True without surfacing them everywhere -> ContentConfig(word_cells_content_level=ContentLevel.COMPUTE)
old materialize_bitmap_bytes=False -> ContentConfig(include_bitmap_bytes=False)
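
The three mappings above can be sketched as a small translation helper. This is only an illustration: migrate_legacy_flags is a hypothetical function, not part of docling-parse, and it returns the names of ContentLevel members as strings so that it runs without the package installed. It covers only the flags listed above.

```python
# Hypothetical helper (not part of docling-parse): translates the legacy
# settings listed above into keyword arguments for ContentConfig.
# Levels are returned as the *names* of ContentLevel members.


def migrate_legacy_flags(legacy: dict) -> dict:
    """Map old DecodePageConfig-style flags to ContentConfig keyword names."""
    kwargs = {}

    # keep_char_cells=True -> char cells are computed and materialized
    if legacy.get("keep_char_cells"):
        kwargs["char_cells_content_level"] = "COMPUTE_AND_MATERIALIZE"

    # create_word_cells=True -> word cells are computed but not surfaced
    if legacy.get("create_word_cells"):
        kwargs["word_cells_content_level"] = "COMPUTE"

    # materialize_bitmap_bytes was renamed to include_bitmap_bytes
    if "materialize_bitmap_bytes" in legacy:
        kwargs["include_bitmap_bytes"] = bool(legacy["materialize_bitmap_bytes"])

    return kwargs


new_kwargs = migrate_legacy_flags(
    {
        "keep_char_cells": True,
        "create_word_cells": True,
        "materialize_bitmap_bytes": False,
    }
)
print(new_kwargs)
```

To build a real config from the result, each level name can be resolved with getattr(ContentLevel, name) before passing the keywords to ContentConfig.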

One semantic change matters: decode_config is now fixed when the document or threaded batch is opened. If you want richer page output later, override content_config on get_page(...) instead. On the sequential path this may re-decode that page; on the threaded path you can only materialize entities the batch already computed.
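
The difference between the two paths can be modelled as a small decision function. This is a toy model of the rule described above, not the library's logic: Level is a local stand-in for ContentLevel, and resolve_request and its return strings are hypothetical names chosen for illustration.

```python
from enum import Enum


class Level(Enum):
    """Local stand-in for ContentLevel (assumption of this sketch)."""

    SKIP = "skip"
    COMPUTE = "compute"
    COMPUTE_AND_MATERIALIZE = "compute_and_materialize"


def resolve_request(computed: Level, requested: Level, threaded: bool) -> str:
    """Describe what a per-page content_config override leads to."""
    if requested is Level.SKIP:
        # Nothing is being asked for.
        return "skip"
    if computed is not Level.SKIP:
        # The entity exists from the initial decode, so it can be served.
        return "serve from existing decode"
    # The entity was skipped at decode time: only the sequential path can
    # decode the page again; a threaded batch cannot.
    return "unavailable" if threaded else "re-decode page"


for threaded in (False, True):
    outcome = resolve_request(Level.SKIP, Level.COMPUTE_AND_MATERIALIZE, threaded)
    print("threaded" if threaded else "sequential", "->", outcome)
```

The practical consequence is that a threaded batch should be opened with at least COMPUTE for every entity you may want to materialize later.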

Parallel parsing (multi-threaded)

Parse one or more PDFs in parallel with backpressure:

from docling_parse.pdf_parser import (
    ContentConfig,
    ContentLevel,
    DecodeConfig,
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)

parser = DoclingThreadedPdfParser(
    parser_config=ThreadedPdfParserConfig(
        loglevel="fatal",
        threads=4,
        max_concurrent_results=32,
        page_content_config=ContentConfig(
            word_cells_content_level=ContentLevel.COMPUTE,
            line_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
        ),
    ),
    decode_config=DecodeConfig(),
)

doc_key = parser.load("doc_a.pdf", page_numbers=[1, 3, 5])
print(doc_key, parser.page_count(doc_key), parser.scheduled_page_count(doc_key))

for result in parser.iterate_results():
    if not result.success:
        print(result.doc_key, result.page_number, result.error_message)
        continue

    # Batch decode kept word cells in C++, but did not materialize them by default.
    page = result.get_page(
        ContentConfig(
            word_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
            line_cells_content_level=ContentLevel.COMPUTE_AND_MATERIALIZE,
        )
    )
    print(
        result.doc_key,
        result.page_number,
        len(page.word_cells),
        result.timings.total_s,
    )

For threaded parse-and-render workloads, set ThreadedPdfParserConfig.render_config and use result.get_image(), result.get_image(scale=...), or result.get_image(canvas_size=...).
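
To illustrate what backpressure means here, the following is a generic standard-library sketch using a bounded queue. It is not how docling-parse is implemented; it assumes that max_concurrent_results caps the number of finished pages waiting to be consumed, and MAX_CONCURRENT_RESULTS, worker, and the fake "decoded page" strings are placeholders for this toy.

```python
import queue
import threading

MAX_CONCURRENT_RESULTS = 2  # plays the role of max_concurrent_results in this toy
PAGES = list(range(1, 9))

# Bounded queue: producers block once this many results are waiting.
results = queue.Queue(maxsize=MAX_CONCURRENT_RESULTS)

# Unbounded queue of page numbers still to be processed.
work = queue.Queue()
for page_number in PAGES:
    work.put(page_number)


def worker() -> None:
    """Take page numbers until none are left and publish a fake result for each."""
    while True:
        try:
            page_number = work.get_nowait()
        except queue.Empty:
            return
        # put() blocks while the results queue is full: this is the backpressure.
        results.put((page_number, f"decoded page {page_number}"))


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

consumed = []
peak = 0  # largest number of buffered results observed by the consumer
while len(consumed) < len(PAGES):
    peak = max(peak, results.qsize())
    consumed.append(results.get())

for t in threads:
    t.join()

print(sorted(n for n, _ in consumed), "peak buffered:", peak)
```

Because workers stall when the consumer falls behind, memory use stays bounded by the queue size rather than by the number of pages scheduled.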

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
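
If you want to call the CLI from a script, a minimal wrapper could look like the sketch below. It relies only on the --pdf option shown in the help text above; build_command and run_cli are hypothetical helper names, and the format of the CLI's output is not assumed.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path: str) -> list[str]:
    """Build the argument list for the docling-parse CLI."""
    return ["docling-parse", "--pdf", str(Path(pdf_path))]


def run_cli(pdf_path: str) -> subprocess.CompletedProcess:
    """Run the CLI on one PDF and capture stdout/stderr as text."""
    if shutil.which("docling-parse") is None:
        raise FileNotFoundError("docling-parse is not on PATH")
    return subprocess.run(
        build_command(pdf_path),
        capture_output=True,
        text=True,
        check=False,  # inspect returncode yourself instead of raising
    )


print(build_command("report.pdf"))
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.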

Performance Benchmarks

Current perf tooling lives under perf/:

perf/run_perf.py: per-page CSV benchmarking across docling, docling-threaded, pdfplumber, pypdfium2, and pymupdf
perf/run_scaling.py: pages/sec and scaling sweeps for threaded parse and render workloads
docs/performance_code.md: usage notes and interpretation

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.

Python

To build the package, run the following (make sure uv is installed):

uv sync

This will only work after a clean git clone. If you are developing and updating C++ code, use:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

or

BUILD_THREADS=12 uv pip install --force-reinstall --no-deps -e ".[perf]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.
