Docling fails with RuntimeError: #-instructions 1 does not match expected value 2 for PDF operation: m #847

Closed
marfago opened this issue Jan 31, 2025 · 2 comments

marfago commented Jan 31, 2025

Bug

Docling fails with a RuntimeError raised from the docling-parse v2 backend while converting an arXiv PDF.

Steps to reproduce

Run the following script:

import logging
import sys

from docling.document_converter import DocumentConverter

if __name__ == "__main__":

    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(logging.DEBUG)
    formatter = logging.Formatter(
        "%(asctime)s %(levelname)s - [Process %(process)d] - [%(threadName)s] - %(name)s - %(message)s"
    )
    handler.setFormatter(formatter)
    root.addHandler(handler)

    print(DocumentConverter().convert("https://arxiv.org/pdf/2309.05406v5"))

Log output and stack trace

2025-01-30 17:51:06,158 DEBUG - [Process 35668] - [MainThread] - urllib3.connectionpool - Starting new HTTPS connection (1): arxiv.org:443
2025-01-30 17:51:06,270 DEBUG - [Process 35668] - [MainThread] - urllib3.connectionpool - https://arxiv.org:443 "GET /pdf/2309.05406v5 HTTP/1.1" 200 5150002
2025-01-30 17:51:07,560 INFO - [Process 35668] - [MainThread] - docling.document_converter - Going to convert document batch...
2025-01-30 17:51:07,561 DEBUG - [Process 35668] - [MainThread] - urllib3.connectionpool - Starting new HTTPS connection (1): huggingface.co:443
2025-01-30 17:51:07,697 DEBUG - [Process 35668] - [MainThread] - urllib3.connectionpool - https://huggingface.co:443 "GET /api/models/ds4sd/docling-models/revision/v2.1.0 HTTP/1.1" 200 1264
2025-01-30 17:51:07,731 INFO - [Process 35668] - [MainThread] - docling.utils.accelerator_utils - Accelerator device: 'cpu'
2025-01-30 17:51:08,725 INFO - [Process 35668] - [MainThread] - docling.utils.accelerator_utils - Accelerator device: 'cpu'
2025-01-30 17:51:08,940 DEBUG - [Process 35668] - [MainThread] - docling_ibm_models.layoutmodel.layout_predictor - LayoutPredictor settings: {'safe_tensors_file': 'C:\\Users\\fagom\\.cache\\huggingface\\hub\\models--ds4sd--docling-models\\snapshots\\36bebf56681740529abd09f5473a93a69373fbf0\\model_artifacts\\layout\\model.safetensors', 'device': 'cpu', 'num_threads': 4, 'image_size': 640, 'threshold': 0.3}
2025-01-30 17:51:08,940 INFO - [Process 35668] - [MainThread] - docling.utils.accelerator_utils - Accelerator device: 'cpu'
2025-01-30 17:51:09,108 INFO - [Process 35668] - [MainThread] - docling.pipeline.base_pipeline - Processing document 2309.05406v5.pdf
2025-01-30 17:51:10,319 WARNING - [Process 35668] - [MainThread] - docling.pipeline.base_pipeline - Encountered an error during conversion of document 64901092dec5889cacddcecb334242ea2381d67ea3743a60ed0b02ce65800306:
Traceback (most recent call last):

  File "F:\workspace\proj\.venv\Lib\site-packages\docling\pipeline\base_pipeline.py", line 161, in _build_document
    for p in pipeline_pages:  # Must exhaust!
             ^^^^^^^^^^^^^^

  File "F:\workspace\proj\.venv\Lib\site-packages\docling\pipeline\base_pipeline.py", line 127, in _apply_on_pages
    yield from page_batch

  File "F:\workspace\proj\.venv\Lib\site-packages\docling\models\page_assemble_model.py", line 60, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File "F:\workspace\proj\.venv\Lib\site-packages\docling\models\table_structure_model.py", line 136, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File "F:\workspace\proj\.venv\Lib\site-packages\docling\models\layout_model.py", line 102, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File "F:\workspace\proj\.venv\Lib\site-packages\docling\models\easyocr_model.py", line 82, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File "F:\workspace\proj\.venv\Lib\site-packages\docling\models\page_preprocessing_model.py", line 25, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File "F:\workspace\proj\.venv\Lib\site-packages\docling\pipeline\standard_pdf_pipeline.py", line 179, in initialize_page
    page._backend = conv_res.input._backend.load_page(page.page_no)  # type: ignore
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "F:\workspace\proj\.venv\Lib\site-packages\docling\backend\docling_parse_v2_backend.py", line 239, in load_page
    return DoclingParseV2PageBackend(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "F:\workspace\proj\.venv\Lib\site-packages\docling\backend\docling_parse_v2_backend.py", line 27, in __init__
    parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

RuntimeError: #-instructions 1 does not match expected value 2 for PDF operation: m

Traceback (most recent call last):
  File "F:\workspace\proj\tests\docling_test.py", line 24, in <module>
    print(DocumentConverter().convert("https://arxiv.org/pdf/2309.05406v5"))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\pydantic\_internal\_validate_call.py", line 38, in wrapper_function
    return wrapper(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\pydantic\_internal\_validate_call.py", line 111, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\document_converter.py", line 195, in convert
    return next(all_res)
           ^^^^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\document_converter.py", line 216, in convert_all
    for conv_res in conv_res_iter:
                    ^^^^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\document_converter.py", line 251, in _convert
    for item in map(
                ^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\document_converter.py", line 292, in _process_document
    conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\document_converter.py", line 315, in _execute_pipeline
    conv_res = pipeline.execute(in_doc, raises_on_error=raises_on_error)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\pipeline\base_pipeline.py", line 53, in execute
    raise e
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\pipeline\base_pipeline.py", line 45, in execute
    conv_res = self._build_document(conv_res)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\pipeline\base_pipeline.py", line 196, in _build_document
    raise e
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\pipeline\base_pipeline.py", line 161, in _build_document
    for p in pipeline_pages:  # Must exhaust!
             ^^^^^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\pipeline\base_pipeline.py", line 127, in _apply_on_pages
    yield from page_batch
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\models\page_assemble_model.py", line 60, in __call__
    for page in page_batch:
                ^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\models\table_structure_model.py", line 136, in __call__
    for page in page_batch:
                ^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\models\layout_model.py", line 102, in __call__
    for page in page_batch:
                ^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\models\easyocr_model.py", line 82, in __call__
    for page in page_batch:
                ^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\models\page_preprocessing_model.py", line 25, in __call__
    for page in page_batch:
                ^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\pipeline\standard_pdf_pipeline.py", line 179, in initialize_page
    page._backend = conv_res.input._backend.load_page(page.page_no)  # type: ignore
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\backend\docling_parse_v2_backend.py", line 239, in load_page
    return DoclingParseV2PageBackend(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\workspace\proj\.venv\Lib\site-packages\docling\backend\docling_parse_v2_backend.py", line 27, in __init__
    parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: #-instructions 1 does not match expected value 2 for PDF operation: m

Docling version

2.17.0

Python version

3.12.8
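
For context on the error message in the stack trace above: in PDF content streams, the path operator `m` (moveto) takes exactly two operands, the x and y coordinates of the new current point. The message suggests that the parser found only one operand before an `m` on some page of this file. The sketch below illustrates that kind of operand-count check. It is not docling-parse's actual code: `EXPECTED_OPERANDS` and `check_operands` are hypothetical names, and the reading of "#-instructions" as an operand count is an assumption. The operand counts themselves come from the PDF specification.

```python
# Illustrative sketch only; NOT the docling-parse implementation.
# Operand counts for PDF path-construction operators, per the PDF specification.
EXPECTED_OPERANDS = {
    "m": 2,   # moveto: x y
    "l": 2,   # lineto: x y
    "c": 6,   # curveto: x1 y1 x2 y2 x3 y3
    "v": 4,   # curveto variant: x2 y2 x3 y3
    "y": 4,   # curveto variant: x1 y1 x3 y3
    "re": 4,  # rectangle: x y width height
    "h": 0,   # closepath: no operands
}


def check_operands(operator, operands):
    """Raise RuntimeError if `operands` has the wrong length for `operator`.

    Hypothetical helper; the message format mirrors the one in this report.
    """
    expected = EXPECTED_OPERANDS.get(operator)
    if expected is None:
        return  # operator not covered by this sketch, so it is not checked
    if len(operands) != expected:
        raise RuntimeError(
            f"#-instructions {len(operands)} does not match expected value "
            f"{expected} for PDF operation: {operator}"
        )
```

Calling `check_operands("m", [10.0, 20.0])` passes silently, while `check_operands("m", [10.0])` raises a RuntimeError with the same text as the one reported here.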

@marfago marfago added the bug Something isn't working label Jan 31, 2025
@dolfim-ibm dolfim-ibm added the pdf parsing PDF issue related to docling-parse label Jan 31, 2025
Contributor

cau-git commented Jan 31, 2025

Thanks for reporting, we are tracking it here.
Duplicate of #669, closing.

@cau-git cau-git closed this as completed Jan 31, 2025
@PeterStaar-IBM
Contributor

Will be resolved in PR: docling-project/docling-parse#91
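
Until a release containing that fix is installed, one possible workaround (not proposed by the maintainers in this thread) is to retry a failed conversion with a PDF backend that does not go through docling-parse. The sketch below is a minimal illustration under stated assumptions: `convert_with_fallback` and `build_docling_converters` are hypothetical helper names, and it assumes the failure surfaces as a RuntimeError, as it does in the trace above. Whether the pypdfium2 backend handles this particular file is untested here, and its output may differ from what the default backend would produce.

```python
from typing import Any, Callable, Sequence


def convert_with_fallback(source: str, converters: Sequence[Callable[[str], Any]]) -> Any:
    """Try each converter callable in order and return the first successful result.

    Hypothetical helper: if every converter raises RuntimeError, the last
    error is re-raised so the caller still sees the failure.
    """
    if not converters:
        raise ValueError("at least one converter is required")
    last_error = None
    for convert in converters:
        try:
            return convert(source)
        except RuntimeError as exc:  # the trace above ends in a RuntimeError
            last_error = exc
    raise last_error


def build_docling_converters():
    """Return [default convert, pypdfium2-backed convert].

    Imports are local so the helper above can be used without docling installed.
    """
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    default = DocumentConverter()
    fallback = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
        }
    )
    return [default.convert, fallback.convert]
```

With docling installed, `convert_with_fallback("https://arxiv.org/pdf/2309.05406v5", build_docling_converters())` would first try the default backend and switch to the pypdfium2 backend only if the first attempt raises.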

