[Feature]: Process hOCR textangle attribute in hOCR to PDF transform #1467

Open
0dinD opened this issue Jan 31, 2025 · 0 comments · May be fixed by #1468
Labels: enhancement, triage

0dinD commented Jan 31, 2025

Describe the proposed feature

Currently, the hOCR to PDF transform ignores the hOCR textangle attribute and assumes a textangle of 0 degrees (the default when textangle is not specified). This produces incorrect PDF output whenever textangle is present with a non-zero value, which frequently happens with documents that mix text in 90-degree orientations (example provided below). Note that Tesseract is not perfect either and sometimes produces garbled hOCR output; that is a separate issue, which I've reported as tesseract-ocr/tesseract#4387. In case it is of interest, here is the Tesseract code that sets textangle: https://github.com/tesseract-ocr/tesseract/blob/3157ff0e741ea5c85e16fbd1c6edf20f30eccbd3/src/api/hocrrenderer.cpp#L43-L58
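To make the expected behavior concrete, here is a minimal sketch of reading textangle from the `title` attribute of an hOCR line element, falling back to 0 when it is absent. The function name `parse_textangle` and the sample `title` strings are made up for illustration; this is not OCRmyPDF's actual parser, nor the code in the linked pull request.

```python
import re

# Matches a "textangle <number>" property inside an hOCR title attribute,
# e.g. "bbox 10 20 30 200; textangle 90; x_size 25".
_TEXTANGLE_RE = re.compile(r"(?:^|;)\s*textangle\s+(-?\d+(?:\.\d+)?)")


def parse_textangle(title: str) -> float:
    """Return the textangle in degrees, or 0.0 when the property is absent.

    Hypothetical helper for illustration only.
    """
    match = _TEXTANGLE_RE.search(title)
    return float(match.group(1)) if match else 0.0


# Illustrative title strings (not taken from real Tesseract output).
rotated = parse_textangle("bbox 10 20 30 200; textangle 90; x_size 25")
upright = parse_textangle("bbox 10 20 300 40; baseline 0 -5")
```

With a helper like this, the transform could branch on a non-zero angle instead of always treating the line as horizontal.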

Below is a case where Tesseract does produce valid hOCR, with a non-zero textangle in some parts of the document:

Command used: ocrmypdf text-mixed-orientation.png text-mixed-orientation.pdf

Input image: text-mixed-orientation.png (attached to the original issue)

In the output PDF, the text layer for the 90-degree text is wrong: OCRmyPDF treats it as 0-degree text even though the hOCR gives it a textangle of 90 (I used the -k option to keep the temporary files and inspect the hOCR output).
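As a rough sketch of what honoring textangle could involve, the snippet below converts an angle in degrees into the six coefficients of a PDF rotation matrix. It assumes the hOCR convention that positive angles mean counterclockwise rotation and that the text is placed in standard PDF user space (y axis pointing up); translating to the line's position on the page is left out. The name `rotation_matrix` is hypothetical, and this is not necessarily how OCRmyPDF or the linked pull request handles it.

```python
import math


def rotation_matrix(textangle_deg: float) -> tuple[float, float, float, float, float, float]:
    """Return (a, b, c, d, e, f) for a counterclockwise rotation in PDF user space.

    Hypothetical helper for illustration; the translation part (e, f) is left at 0.
    """
    theta = math.radians(textangle_deg)
    # Round to suppress floating-point noise such as cos(90 deg) = 6.1e-17.
    cos_t = round(math.cos(theta), 12)
    sin_t = round(math.sin(theta), 12)
    return (cos_t, sin_t, -sin_t, cos_t, 0.0, 0.0)


upright = rotation_matrix(0)    # identity: text runs left to right
vertical = rotation_matrix(90)  # text runs bottom to top
```

For the example document, lines with a textangle of 90 would get the second matrix rather than the identity that is effectively applied today.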
