fix(ocr): clamp PDF image bboxes before cropping #2192

Open
cyphercodes wants to merge 1 commit into microsoft:main from cyphercodes:fix-ocr-clamp-image-bbox-2097

@cyphercodes

Summary

  • Clamp PDF image crop bboxes to the pdfplumber page bbox before calling within_bbox.
  • Keep OCR extraction working when image metadata has tiny floating-point overflows, such as a slightly negative top value.
  • Add a focused regression test for the out-of-bounds bbox case.
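
A minimal sketch of the clamping idea, not the code from this PR: the helper name `clamp_bbox` and the choice to return `None` for an empty box are assumptions made for illustration. It takes an image bbox in pdfplumber's `(x0, top, x1, bottom)` order and intersects it with the page bbox, so a value such as a slightly negative `top` is pulled back onto the page edge.

```python
from typing import Optional, Tuple

# (x0, top, x1, bottom), the same ordering pdfplumber uses for page.bbox
BBox = Tuple[float, float, float, float]


def clamp_bbox(bbox: BBox, page_bbox: BBox) -> Optional[BBox]:
    """Intersect bbox with page_bbox (hypothetical helper, not the PR's code).

    Returns None when the clamped box has no area, so the caller can skip
    the image instead of attempting to crop an empty region.
    """
    x0, top, x1, bottom = bbox
    page_x0, page_top, page_x1, page_bottom = page_bbox

    clamped = (
        max(x0, page_x0),
        max(top, page_top),
        min(x1, page_x1),
        min(bottom, page_bottom),
    )

    # A box lying entirely outside the page collapses to zero or negative size.
    if clamped[0] >= clamped[2] or clamped[1] >= clamped[3]:
        return None
    return clamped


# A US Letter page and an image whose top overflows by a tiny float error.
page_bbox = (0.0, 0.0, 612.0, 792.0)
image_bbox = (10.0, -0.0001, 100.0, 50.0)
safe_bbox = clamp_bbox(image_bbox, page_bbox)
```

With pdfplumber, the result would presumably be passed on as `page.within_bbox(safe_bbox)` after checking that it is not `None`, with `page.bbox` supplying `page_bbox`.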

Fixes #2097

Tests

  • pytest -q packages/markitdown-ocr/tests/test_pdf_converter.py::test_extract_images_clamps_slightly_out_of_bounds_bbox
  • pytest -q packages/markitdown-ocr/tests/test_pdf_converter.py -k 'not test_pdf_multipage'
  • git diff --check

Note: running the full packages/markitdown-ocr/tests/test_pdf_converter.py currently fails on test_pdf_multipage with the current dependency set; I reproduced the same failure on unmodified main before re-applying this patch.

Development

Successfully merging this pull request may close these issues.

[BUG]: fix, resolve the markitdown-ocr image recognition failure when parsing PDF files, caused by the bbox top value