-
Notifications
You must be signed in to change notification settings - Fork 53
fix: prevent KeyError in reading order prediction and preserve PageElement type #139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
…ement type - Add safety check to ensure index is valid before accessing state.dn_map and state.up_map - Fix coordinate conversion to preserve PageElement type instead of converting to BoundingBox - Add comprehensive unit tests for reading order edge cases - Import CoordOrigin for proper coordinate system handling Co-authored-by: cau-git <[email protected]>
|
@copilot Can you generate hypothesis why the index bound violation could happen? |
There are several scenarios where the index bound violation could occur: Hypothesis 1: Stale l2r_map references after dilated elements processing The most likely cause is at line 256 where
Hypothesis 2: l2r_map chain creates circular reference or points beyond bounds Though currently disabled (line 317: Hypothesis 3: Race condition with shared state The Why the fix works: The safety check |
|
✅ DCO Check Passed Thanks @Copilot, all your commits are properly signed off. 🎉 |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Issue resolved by this Pull Request:
Resolves the unprecedented KeyError occurring during PDF conversion in reading order prediction.
Description
The reading order predictor raised
KeyError: 22(or other indices) when processing certain PDFs. Two bugs were identified:l2r_mapchain could produce indices outside the valid range forstate.dn_mapandstate.up_mapPageElementobjects by converting them toBoundingBox, losingcid,label, and other critical attributesChanges:
Added bounds check before dictionary access in
_init_ud_maps:Fixed coordinate conversion to preserve type:
Checklist:
Warning
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
astral.shcurl -LsSf REDACTED(dns block)huggingface.co/home/REDACTED/work/docling-ibm-models/docling-ibm-models/.venv/bin/python /home/REDACTED/work/docling-ibm-models/docling-ibm-models/.venv/bin/pytest tests/test_reading_order.py -v --tb=short(dns block)/home/REDACTED/work/docling-ibm-models/docling-ibm-models/.venv/bin/python /home/REDACTED/work/docling-ibm-models/docling-ibm-models/.venv/bin/pytest tests/ -k not test_readingorder -x --tb=short(dns block)If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
This section details on the original issue you should resolve
<issue_title>Unprecedented KeyError occuring</issue_title>
<issue_description>### Bug
I was using docling to convert pdf file, and noticed KeyError originating from docling from the line
doc = converter.convert(pdf_path)
the pdf path contains the pdf file.
The error is occurring for same pdf but with different numbers on KeyError.
...
Steps to reproduce
Unfortunately cant share the same pdf.
...
Docling version
Docling version: 2.58.0
Docling Core version: 2.49.0
Docling IBM Models version: 3.10.1
Docling Parse version: 4.7.0
Python: cpython-312 (3.12.3)
Platform: Linux-6.14.0-1012-aws-x86_64-with-glibc2.39
...
Python version
File "home/src/extractors/pdf_markdown.py", line 79, in process_single_page
doc = converter.convert(pdf_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/pydantic/_internal/_validate_call.py", line 39, in wrapper_function
return wrapper(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/pydantic/_internal/_validate_call.py", line 136, in call
res = self.pydantic_validator.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling/document_converter.py", line 237, in convert
return next(all_res)
^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling/document_converter.py", line 260, in convert_all
for conv_res in conv_res_iter:
File "home/path/venv./lib/python3.12/site-packages/docling/document_converter.py", line 332, in _convert
for item in map(
File "home/path/venv./lib/python3.12/site-packages/docling/document_converter.py", line 379, in _process_document
conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling/document_converter.py", line 402, in _execute_pipeline
conv_res = pipeline.execute(in_doc, raises_on_error=raises_on_error)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling/pipeline/base_pipeline.py", line 80, in execute
raise e
File "home/path/venv./lib/python3.12/site-packages/docling/pipeline/base_pipeline.py", line 73, in execute
conv_res = self._assemble_document(conv_res)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling/pipeline/standard_pdf_pipeline.py", line 153, in _assemble_document
conv_res.document = self.reading_order_model(conv_res)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling/models/readingorder_model.py", line 410, in call
sorted_elements = self.ro_model.predict_reading_order(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling_ibm_models/reading_order/reading_order_rb.py", line 108, in predict_reading_order
page_to_elems[page_no] = self._predict_page(elems)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/path/venv./lib/python3.12/site-packages/docling_ibm_models/reading_order/reading_order_rb.py", line 239, in _predict_page
self._init_ud_maps(page_elements)
File "home/path/venv./lib/python3.12/site-packages/docling_ibm_models/reading_order/reading_order_rb.py", line 366, in _init_ud_maps
self.dn_map[i].append(j)
~~~~~~~~~~~^^^
KeyError: 22
...
</issue_description>
<agent_instructions># Repository Guidelines
Project Structure & Module Organization
Docling evaluation toolkit lives in
docling_eval/. Typer CLI entry points sit incli/; dataset ingestion underdataset_builders/; metric logic inevaluators/; helpers come viaaggregations/andutils/. Provider adapters (AWS, Azure, Google) live inprediction_providers/. Benchmark configs and sample outputs are underbenchmarks/, while reference docs live indocs/. Tests mirror runtime modules intests/, reusingtests/data/fixtures and the git-ignoredtests/scratch/for generated artifacts.Important tools to handle CVAT XML annotations are under
cvat_tools/, and more tools to manage a full annotation campaign and evaluation undercampaign_tools(relying heavily oncvat_tools).Build, Test & Development Commands
Install dependencies with uv:
uv syncseeds the virt...✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.