Describe the proposed feature
Currently, the hOCR to PDF transform just ignores the `textangle` hOCR attribute and assumes that `textangle` is 0 degrees (the default if `textangle` is not specified). This creates problems with the PDF output when `textangle` is present with a non-zero value. This frequently happens when processing documents with mixed 90-degree text orientations (example provided below). Note that Tesseract is not perfect either and sometimes produces garbled hOCR output as well, but this is a separate issue that I've reported as tesseract-ocr/tesseract#4387. Here is the Tesseract code which sets `textangle`, should it be of interest to you: https://github.com/tesseract-ocr/tesseract/blob/3157ff0e741ea5c85e16fbd1c6edf20f30eccbd3/src/api/hocrrenderer.cpp#L43-L58

Below is a case where Tesseract does produce valid hOCR, with a non-zero `textangle` in some parts of the document:

Command used:
ocrmypdf text-mixed-orientation.png text-mixed-orientation.pdf
Input image:
The text in the output PDF is not correct when it comes to the 90-degree text, because OCRmyPDF treats it as 0-degree text even though it has a `textangle` of 90 in the hOCR (I used the `-k` option to see this in the hOCR output).
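
To illustrate what honoring `textangle` could involve, here is a minimal sketch in Python. It is not OCRmyPDF's code: the function names `parse_textangle` and `rotation_matrix` are hypothetical, the sample `title` string is made up, and I am assuming that the angle is counter-clockwise in degrees as I understand the hOCR spec to define it. It reads the angle from an hOCR `title` attribute and derives the rotation part of a PDF text matrix.

```python
import math
import re


def parse_textangle(title: str) -> float:
    """Return the textangle (degrees) from an hOCR title string.

    Falls back to 0.0 when the property is absent, which matches the
    default described in this issue.
    """
    match = re.search(r"(?:^|;)\s*textangle\s+(-?\d+(?:\.\d+)?)", title)
    return float(match.group(1)) if match else 0.0


def rotation_matrix(angle_deg: float) -> tuple:
    """Return (a, b, c, d) for a counter-clockwise rotation by angle_deg.

    These are the first four entries of a PDF text matrix
    [cos t, sin t, -sin t, cos t, e, f]; the translation (e, f) would
    still have to be derived from the line's bbox, which is not shown.
    """
    rad = math.radians(angle_deg)
    cos_t, sin_t = math.cos(rad), math.sin(rad)
    # Round to suppress floating-point noise such as 6e-17 for cos(90 deg).
    return tuple(round(v, 6) for v in (cos_t, sin_t, -sin_t, cos_t))


# Hypothetical title attribute of an ocr_line element with rotated text.
title = "bbox 36 92 580 122; textangle 90; x_size 30"
angle = parse_textangle(title)
matrix = rotation_matrix(angle)
```

The sketch leaves out the harder part, which is placing the rotated text so that it lines up with the bounding box in PDF coordinates (where the y axis points up, unlike in the image).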