[Enhancement]: Propagate ocr confidence to output hocr file #86
jflesch left a comment
Basically this change is a good idea, but as done currently, I think it will break libtesseract support (and possibly Cuneiform support).
-    def add_word(self, word, box):
-        self.word_boxes.append(Box(word, box))
+    def add_word(self, word, box, confidence):
+        self.word_boxes.append(Box(word, box, confidence))
add_word() functions are called by pyocr.libtesseract.image_to_string(). Since you haven't updated libtesseract support, this change will break it.
I suggest you try running the tests. Make sure you have cuneiform, tesseract, libtesseract, tesseract-ocr-fra, and tesseract-ocr-jpn installed, and simply run ./run_tests.py.
Unfortunately, the outputs of the tests vary slightly depending on the exact version of Tesseract and Cuneiform you're using (and wind direction, I guess...). So you will have to filter the failed tests manually: ignore those that failed just because the output has slightly changed, and focus on the ones that failed because the API broke due to your changes.
Fair point! Here is what I did:
- I've implemented the changes in libtesseract/__init__.py and libtesseract/tesseract_raw.py to support the extraction of the confidence measure using the C-API
- The confidence parameter for the Box constructor has a default value of 0
All in all, this basically means that the confidence measure is propagated to the output hocr files for the tesseract and libtesseract interfaces and a value of 0 is used for all the words when using cuneiform. Do you think that makes sense?
I've done some manual testing with Cuneiform, Tesseract and libtesseract to verify everything was working as expected (looking at the output hocr files). There are however still many failing unit-tests when running the test suite. I must say that it's quite hard to know if it's because I broke the API or just because tesseract feels like spitting out a different output 😅
Yeah, I know, they are a pain. Unfortunately I haven't found a better way :/
Hint: if they fail on assertEqual, it usually means the output differs from what was expected (aka the wind is blowing north now). For this change, you can ignore those tests and focus on those where an uncaught exception has been raised.
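The hint above lines up with how Python's unittest reports results: assertion mismatches end up in `TestResult.failures`, while unexpected exceptions end up in `TestResult.errors`. A minimal sketch of separating the two, assuming a hypothetical test case (`make_demo_case` and its test methods are invented for illustration and are not part of pyocr's suite):

```python
import io
import unittest


def make_demo_case():
    """Build a hypothetical TestCase with one failure and one error."""

    class DemoTests(unittest.TestCase):
        def test_output_changed(self):
            # Fails on assertEqual: the recognized text merely differs.
            self.assertEqual("Tne quick", "The quick")

        def test_api_broken(self):
            # Raises an uncaught TypeError: a required argument is missing.
            def add_word(word, box, confidence):
                return (word, box, confidence)

            add_word("word", ((0, 0), (1, 1)))

    return DemoTests


def run_and_split(test_case):
    """Run a TestCase class and return (failure ids, error ids)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    failures = [test.id() for test, _ in result.failures]
    errors = [test.id() for test, _ in result.errors]
    return failures, errors


failures, errors = run_and_split(make_demo_case())
print("output drift (can be ignored for this change):", failures)
print("uncaught exceptions (worth a closer look):", errors)
```

With this split, only the second list needs attention when checking whether an API change broke callers.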
src/pyocr/builders.py
Outdated
         and reread using read_box_file().
         """
-        return to_unicode("%s %d %d %d %d") % (
+        return to_unicode("%s %s %d %d %d %d") % (
As is, it will break tesseract.CharBoxBuilder.
CharBoxBuilder corresponds to a file format specific to Tesseract (configuration 'makebox'). If you modify this function, the format won't be the same as Tesseract's anymore, and CharBoxBuilder.read_file() won't be able to read files written by CharBoxBuilder.write_file().
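To illustrate the round-trip concern, here is a small sketch. The helpers `write_box_line` and `read_box_line` are hypothetical stand-ins, not pyocr's actual implementation; the five-field layout is taken from the format string in the diff, and the `((x1, y1), (x2, y2))` box shape is an assumption.

```python
def write_box_line(char, box):
    """Format one character box using the five-field layout from the diff."""
    (x1, y1), (x2, y2) = box
    return "%s %d %d %d %d" % (char, x1, y1, x2, y2)


def read_box_line(line):
    """Parse a five-field line back into (char, box)."""
    fields = line.split(" ")
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    char = fields[0]
    x1, y1, x2, y2 = (int(field) for field in fields[1:])
    return char, ((x1, y1), (x2, y2))


box = ((10, 20), (30, 40))
# The unmodified format survives a write/read round trip.
assert read_box_line(write_box_line("a", box)) == ("a", box)

# Inserting a confidence field changes the layout, so the reader rejects it.
line_with_confidence = "%s %s %d %d %d %d" % ("a", 70, 10, 20, 30, 40)
try:
    read_box_line(line_with_confidence)
    round_trip_ok = True
except ValueError:
    round_trip_ok = False
print(round_trip_ok)
```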
Thanks for the quick review!
src/pyocr/builders.py
Outdated
         self.built_text.append(u"")

-    def add_word(self, word, box):
+    def add_word(self, word, box, confidence=None):
- Either make confidence optional for all add_word() or for none. But the API of the builders must be the same for all of them.
- I would recommend '0' as default value instead. Seems safer.
I think you missed my point 1 :-)
Either you specify a default value for confidence on all the Builders.add_word() methods (making the argument optional), or on none of them. All builders must have the same API (see the base class builders.BaseBuilder).
Ah sorry, I did miss your point! The parameter has now been made optional for all add_word methods.
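A rough sketch of what a uniform add_word() signature buys. The classes below are simplified stand-ins written for illustration (the `Sketch*` names are hypothetical), not the real pyocr builders:

```python
class SketchBox:
    """Stand-in for a word box; confidence defaults to 0."""

    def __init__(self, content, position, confidence=0):
        self.content = content
        self.position = position
        self.confidence = confidence


class SketchBaseBuilder:
    """Every builder shares the same add_word() signature."""

    def add_word(self, word, box, confidence=0):
        raise NotImplementedError


class SketchTextBuilder(SketchBaseBuilder):
    def __init__(self):
        self.built_text = []

    def add_word(self, word, box, confidence=0):
        # A text-only builder can simply ignore the confidence.
        self.built_text.append(word)


class SketchWordBoxBuilder(SketchBaseBuilder):
    def __init__(self):
        self.word_boxes = []

    def add_word(self, word, box, confidence=0):
        self.word_boxes.append(SketchBox(word, box, confidence))


# Callers can treat all builders alike, with or without a confidence.
text_builder = SketchTextBuilder()
box_builder = SketchWordBoxBuilder()
for builder in (text_builder, box_builder):
    builder.add_word("old", ((0, 0), (10, 10)))      # caller unaware of confidence
    builder.add_word("new", ((12, 0), (22, 10)), 70)  # caller passing confidence
```

Because the argument is optional everywhere, callers that were never updated keep working, and callers that do pass a confidence do not need to know which builder they are talking to.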
src/pyocr/builders.py
Outdated
+                continue
+            confidence = piece.split(" ")[1]
+            return int(confidence)
+        raise Exception("Invalid hocr confidence measure: %s" % title)
- This will be a problem. Paperwork uses this code to write and read hOCR files. In other words, there are already a lot of people (me included) with a lot of documents/hOCR files written without the confidence.
  This code must not raise an exception if the confidence is not found. However, it can display a trace instead (this is not really unexpected nor a problem --> not a warning --> logger.info).
  In other words, please do not break the compatibility (API or file formats) :-)
- The message is not correct. The problem here is not that the confidence is invalid. The problem is that it hasn't been found.
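One way to satisfy both points could look like the sketch below. The function name `parse_confidence` and its `default` parameter are hypothetical; only the `x_wconf` property, the `logger.info` level and the fallback behaviour come from the discussion.

```python
import logging

logger = logging.getLogger(__name__)


def parse_confidence(title, default=0):
    """Return the x_wconf value from an hOCR title, or `default` if absent."""
    for piece in title.split(";"):
        piece = piece.strip()
        if not piece.startswith("x_wconf"):
            continue
        parts = piece.split()
        try:
            return int(parts[1])
        except (IndexError, ValueError):
            # Present but unusable: fall through to the default.
            break
    # Older hOCR files have no confidence; this is expected, so only log it.
    logger.info("No usable confidence measure in hOCR title: %s", title)
    return default


with_confidence = parse_confidence("bbox 638 1797 751 1823; x_wconf 70")
without_confidence = parse_confidence("bbox 638 1797 751 1823")
print(with_confidence, without_confidence)
```

Files written before this change are still read without an exception, and the log message describes a missing value rather than an invalid one.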
…form will default to confidence=0)
01b1b2c to 1475c9e
Yes, I did run the tests. I have 8, 3 and 4 failing tests for libtesseract, tesseract and cuneiform respectively. I suspect most of the tests fail because of the comparisons of the output of the ocr with the content of the
Good point, test reference outputs will have to be updated. I'll take care of that later.
Also, what should be the default behaviour if you're parsing an hocr file that doesn't have a confidence measure? Should the Box confidence attribute be set to 0 as it's done with cuneiform?
Good point, I missed the fact that it's not returning anything. I think 0 is fine. If it wasn't used before, we can safely assume it won't be used now.
Finding any information about the confidence measure is very hard (not much in the documentation, nothing in the changelog). I could however find a related issue in a cached version of some now defunct code.google thread. It seems that some versions of tesseract (<=3.02) had negative confidence values (between 0 and -7 :/). In the more recent versions it's a number between 0 and 100 (%). So I think defaulting it to 0 is the safest bet.
Looks good to me. I'll update the tests later this week and then do a release with this new feature.
No problem! Thank you for your work on this project, it's been super useful for a project where I work 👍
Hi, is there a plan for when this feature will be included in a new release?
Sorry, I've been busy with personal matters (moving to another flat, etc) and I forgot to do the release :( (thanks for reminding me :). I'll try to do it this evening (France; GMT+1).
Awesome! Thanks a lot :)
Thanks!
This PR makes it possible to parse the individual word confidence measures from the Tesseract output and write them to the simplified output hocr file, in the title attribute of the elements generated for the Box objects.
Example output:
<span class="ocrx_word" title="bbox 638 1797 751 1823; x_wconf 70">Word</span>
Note: directly relates to #74 and #58 and less so to #12
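For reference, the example output in the description could be produced along these lines. This is only a sketch: `format_word_span` is a hypothetical helper, and the `((x1, y1), (x2, y2))` box shape is an assumption rather than something stated in the PR.

```python
from html import escape


def format_word_span(word, box, confidence=0):
    """Render one word as an hOCR span with bbox and x_wconf in the title."""
    (x1, y1), (x2, y2) = box
    title = "bbox %d %d %d %d; x_wconf %d" % (x1, y1, x2, y2, confidence)
    # Escape the word so characters like '<' or '&' keep the markup valid.
    return '<span class="ocrx_word" title="%s">%s</span>' % (title, escape(word))


span = format_word_span("Word", ((638, 1797), (751, 1823)), 70)
print(span)
```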