[Feature]: Process hOCR textangle attribute in hOCR to PDF transform #1467

0dinD · 2025-01-31T14:10:19Z

Describe the proposed feature

Currently, the hOCR to PDF transform ignores the hOCR textangle attribute and assumes a textangle of 0 degrees (the default when textangle is not specified). This causes problems in the PDF output whenever textangle is present with a non-zero value, which frequently happens when processing documents with mixed 90-degree text orientations (example provided below). Note that Tesseract is not perfect either and sometimes produces garbled hOCR output, but that is a separate issue, which I've reported as tesseract-ocr/tesseract#4387. In case it is of interest, here is the Tesseract code that sets textangle: https://github.com/tesseract-ocr/tesseract/blob/3157ff0e741ea5c85e16fbd1c6edf20f30eccbd3/src/api/hocrrenderer.cpp#L43-L58

Below is a case where Tesseract does produce valid hOCR, with a non-zero textangle in some parts of the document:

Command used: ocrmypdf text-mixed-orientation.png text-mixed-orientation.pdf

Input image:

The text in the output PDF is not correct when it comes to the 90-degree text, because OCRmyPDF treats it as 0-degree text even though it has a textangle of 90 in the hOCR (I used the -k option to see this in the hOCR output).
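
To illustrate the kind of handling being requested, here is a minimal sketch of reading textangle from the title attribute of an hOCR line, falling back to 0 degrees when the property is absent. The function name parse_textangle and the regular expression are hypothetical and are not taken from OCRmyPDF's code; the sketch only assumes that hOCR properties appear in the title attribute separated by semicolons.

```python
import re

# Hypothetical helper, not OCRmyPDF's actual implementation.
# Matches a "textangle <number>" property at the start of the title
# string or right after a ";" separator.
_TEXTANGLE_RE = re.compile(r"(?:^|;)\s*textangle\s+(-?\d+(?:\.\d+)?)")


def parse_textangle(title: str) -> float:
    """Return the textangle in degrees from an hOCR title attribute.

    Falls back to 0.0 when the property is not present, which mirrors
    the default described in this issue.
    """
    match = _TEXTANGLE_RE.search(title)
    return float(match.group(1)) if match else 0.0
```

With this sketch, a title such as "bbox 100 200 140 600; textangle 90; x_size 30" yields 90.0, while a title that has no textangle property yields 0.0. A transform that honors the attribute would then rotate the text for that line by the parsed angle instead of always assuming 0.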

0dinD added the enhancement and triage (Issue needs triage) labels Jan 31, 2025

0dinD assigned jbarlow83 Jan 31, 2025

0dinD linked a pull request Jan 31, 2025 that will close this issue

Process hOCR textangle attribute #1468

Open
