Garbled order of OCR'ed contents #1035

rkevk · 2022-11-16T15:58:18Z

I've encountered a typewritten but high-resolution and clearly legible PDF whose OCR text is somehow misplaced after being generated: The file in question is this thesis. For example, page 3 is a good pure-text example of where this behavior occurs (I won't paste the page here because I'm unsure about copyright issues).

More explicitly, it seems that Tesseract (more or less) correctly identifies all the words and their order (as can also be verified using --sidecar), but when grafting the OCR contents back onto the PDF, the text or its bounding box gets misplaced. For example, when highlighting OCR'ed phrases spanning more than one word, the position of the words evidently does not follow the left-to-right, top-to-bottom order that it should, and instead of highlighting two adjacent words, half the page is highlighted. This can also be verified by copy-pasting (i.e., CTRL-A, CTRL-C) the entire text, which results in a garbled version of the original Tesseract/sidecar text.
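One way to quantify the garbling described above is to compare the word order of the sidecar text with the text copied out of the finished PDF. The following is only a sketch: the function name `word_order_similarity` is hypothetical, and it assumes both texts are already available as strings (how the PDF text is extracted, for example by copy-paste or an external tool, is left open).

```python
from difflib import SequenceMatcher


def word_order_similarity(sidecar_text: str, extracted_text: str) -> float:
    """Return a score in [0, 1]; 1.0 means both texts have identical word order."""
    sidecar_words = sidecar_text.split()
    extracted_words = extracted_text.split()
    # autojunk=False keeps frequent words from being ignored on long pages.
    matcher = SequenceMatcher(None, sidecar_words, extracted_words, autojunk=False)
    return matcher.ratio()


# Made-up sample strings, not taken from the document in the issue.
reference = "the quick brown fox jumps over the lazy dog"
garbled = "lazy dog the quick over the brown fox jumps"

print(word_order_similarity(reference, reference))
print(word_order_similarity(reference, garbled))
```

A score well below 1.0 on a page where the sidecar text looks correct would point at text placement rather than at recognition.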

This occurred on both v13.4 and v13.5 with tesseract 4.1.1 when running OCRmyPDF without any further options. Is there an option I should be trying here?

jbarlow83 · 2022-11-16T22:04:10Z

You can try --pdf-renderer hocr which uses a different renderer to produce the PDF.

Some PDF viewers also struggle with OCR placement.
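As a rough illustration of the suggestion above, the command line could be assembled like this. It is a sketch, not an official recipe: the helper name `build_ocrmypdf_command` and the file names are hypothetical, and only the `--pdf-renderer` and `--sidecar` options mentioned in this thread are used.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, renderer="hocr", sidecar=None):
    """Build an argument list for the ocrmypdf CLI with the chosen renderer."""
    cmd = ["ocrmypdf", "--pdf-renderer", renderer]
    if sidecar is not None:
        # Also write the recognized text to a separate file for comparison.
        cmd += ["--sidecar", sidecar]
    cmd += [input_pdf, output_pdf]
    return cmd


cmd = build_ocrmypdf_command("input.pdf", "output.pdf", sidecar="output.txt")
# Print the command; it could be run with subprocess.run(cmd, check=True)
# on a machine where ocrmypdf is installed.
print(shlex.join(cmd))
```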

rkevk · 2022-11-18T19:34:43Z

For the PDF viewer in question (Evince), hocr didn't make a difference.
It turns out, though, that the Firefox PDF viewer reads the placement of the OCR contents correctly, regardless of which renderer I use (I'll note that Evince worked fine for most other OCR'ed documents). Feel free to close the issue if you feel the problem lies more with Evince than with OCRmyPDF.

cristobaltapia · 2025-02-04T07:17:22Z

I also have this problem with a typewritten document: the content is misplaced, shifted downwards by about two lines. The hocr option does not make a difference. I also tried different readers (Evince, zathura and Firefox), with no difference.

cristobaltapia · 2025-02-04T16:42:08Z

So, it appears that the problem for me happens when I use the option --language deu. Without this option it works as expected.

Edit: This is not true. I probably just looked at the one page that had the text in the correct position.

cristobaltapia · 2025-02-05T10:15:13Z

Here is an example of how it looks in the file I was ocr'ing:

cristobaltapia · 2025-02-06T12:19:55Z

OK, I figured it out. The problem was the --clean argument: the document is processed with unpaper for the OCR step, but the original file is kept in the output. Since I was also deskewing, the text was no longer in the same position as the page image. So it's not a bug, at least in my case.
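The mismatch described in this comment comes from OCR running on a cleaned image while the output keeps the original one. OCRmyPDF also has a --clean-final option, which keeps the cleaned image in the output PDF. The sketch below only contrasts the two command lines; the helper name `build_clean_command` and the file names are hypothetical, and whether --clean-final removes the shift for this particular document is not confirmed in the thread.

```python
def build_clean_command(input_pdf, output_pdf, keep_cleaned_image):
    """Build an ocrmypdf argument list using --clean or --clean-final."""
    # --clean: unpaper output is used for OCR only, the original image is kept.
    # --clean-final: the cleaned image is also used in the output PDF.
    flag = "--clean-final" if keep_cleaned_image else "--clean"
    return ["ocrmypdf", flag, input_pdf, output_pdf]


print(build_clean_command("input.pdf", "output.pdf", keep_cleaned_image=False))
print(build_clean_command("input.pdf", "output.pdf", keep_cleaned_image=True))
```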
