fix(ocr): clamp PDF image bboxes before cropping by cyphercodes · Pull Request #2192 · microsoft/markitdown

cyphercodes · 2026-07-05T03:16:01Z

Summary

Clamp PDF image crop bboxes to the pdfplumber page bbox before calling within_bbox.
Keep OCR extraction working when image metadata has tiny floating-point overflows, such as a slightly negative top value.
Add a focused regression test for the out-of-bounds bbox case.
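
For illustration, the clamping step described above could look roughly like the sketch below. This is not the code from the patch: the helper name clamp_bbox and the choice to return None for an empty intersection are assumptions made for this example. It relies only on the pdfplumber convention that a bbox is an (x0, top, x1, bottom) tuple.

```python
from typing import Optional, Tuple

# pdfplumber-style bounding box: (x0, top, x1, bottom)
BBox = Tuple[float, float, float, float]


def clamp_bbox(bbox: BBox, page_bbox: BBox) -> Optional[BBox]:
    """Clamp an image bbox to the page bbox (hypothetical helper).

    Returns None when the clamped box has no area, i.e. the image lies
    entirely outside the page.
    """
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox

    # Pull each edge back inside the page bounds.
    clamped = (
        max(x0, px0),
        max(top, ptop),
        min(x1, px1),
        min(bottom, pbottom),
    )

    # A degenerate box cannot be cropped, so signal that to the caller.
    if clamped[0] >= clamped[2] or clamped[1] >= clamped[3]:
        return None
    return clamped


# Example: a slightly negative top value caused by floating-point overflow.
page_bbox = (0.0, 0.0, 612.0, 792.0)
image_bbox = (10.0, -0.0001, 100.0, 50.0)
print(clamp_bbox(image_bbox, page_bbox))
```

With a pdfplumber page object, the result would then be passed on as page.within_bbox(clamped) once the None case has been skipped, with page.bbox supplying the page bounds.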

Fixes #2097

Tests

pytest -q packages/markitdown-ocr/tests/test_pdf_converter.py::test_extract_images_clamps_slightly_out_of_bounds_bbox
pytest -q packages/markitdown-ocr/tests/test_pdf_converter.py -k 'not test_pdf_multipage'
git diff --check

Note: running the full packages/markitdown-ocr/tests/test_pdf_converter.py fails on test_pdf_multipage with the current dependency set. I reproduced the same failure on unmodified main before re-applying this patch, so the failure predates this change.

fix(ocr): clamp PDF image bboxes to page bounds

e294280

cyphercodes wants to merge 1 commit into microsoft:main from cyphercodes:fix-ocr-clamp-image-bbox-2097
