| @property | ||
| def TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD(self) -> float: | ||
| """Tesseract predictions with confidence below this threshold are ignored""" | ||
| return self._get_float("TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD", 0.0) |
I wonder whether we'd like a really low default threshold, e.g. 0.1, just to filter out complete garbage chars?
I am OK with 0; the current default behavior is no filtering at all, so this PR should keep that for now. We can change this value in follow-ups.
| image: np.ndarray, | ||
| lang: str = "eng", | ||
| config: str = "", | ||
| character_confidence_threshold: float = 0.5, |
Here we are adding some default, so maybe let's also keep it in config?
I see below that we again have 0.5 as a default in hocr_to_dataframe, so either way, I would unify those.
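One way to unify the defaults mentioned above is to resolve them through a single constant. This is a minimal sketch, not the PR's code: `DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD`, `resolve_threshold`, `ocr_image`, and `parse_hocr` are hypothetical names used only for illustration, and the value 0.0 mirrors the config default quoted earlier in the thread.

```python
from typing import Optional

# Hypothetical single source of truth; 0.0 mirrors the config default (no filtering).
DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD = 0.0


def resolve_threshold(threshold: Optional[float] = None) -> float:
    """Return the caller's threshold, or the shared default when none is given."""
    if threshold is None:
        return DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD
    return threshold


def parse_hocr(hocr: str, character_confidence_threshold: Optional[float] = None) -> float:
    """Hypothetical parser stub: only shows how the default is resolved."""
    return resolve_threshold(character_confidence_threshold)


def ocr_image(hocr: str, character_confidence_threshold: Optional[float] = None) -> float:
    """Hypothetical entry point: forwards the value without adding its own default."""
    return parse_hocr(hocr, character_confidence_threshold)
```

With `None` as the signature default, neither function hard-codes 0.5, so changing the constant (or reading it from config) changes the behavior everywhere at once.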
| ocr_df = self.hocr_to_dataframe(hocr, character_confidence_threshold) | ||
| return ocr_df | ||
|
|
||
| def hocr_to_dataframe( |
What's the compute performance of this code? We were essentially relying on Tesseract's internal C++ code to parse results, but here we do it in Python.
I have not analyzed this. We simply iterate over ~300 words, so I doubt there is any risk of significant slowdowns. What do you think?
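A quick way to check the performance question is to time a pure-Python loop of similar size. The sketch below is an assumption-laden stand-in: it uses synthetic word tuples and a hypothetical `build_rows` helper instead of the PR's actual hOCR parsing, so it only gives a rough lower bound on the per-page cost.

```python
import timeit

# Hypothetical stand-in for parsed hOCR words: (text, left, top, right, bottom).
words = [(f"word{i}", i, 2 * i, i + 30, 2 * i + 12) for i in range(300)]


def build_rows(words):
    """Build one dict per word, mimicking a pure-Python parsing loop."""
    return [
        {
            "text": text,
            "left": left,
            "top": top,
            "width": right - left,
            "height": bottom - top,
        }
        for text, left, top, right, bottom in words
    ]


# Average seconds per call over 1000 runs; the number depends on the machine.
seconds_per_call = timeit.timeit(lambda: build_rows(words), number=1000) / 1000
print(f"{seconds_per_call * 1e6:.1f} microseconds per 300-word page")
```

Timing the real `hocr_to_dataframe` the same way on a representative page would give the number that actually matters here.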
| "width": right - left, | ||
| "height": bottom - top, |
Small nit on performance: we can create the df from the bboxes first, then use vector ops to compute width and height (overwriting the data for right and bottom).
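The suggestion above could look roughly like this. It is a sketch under assumptions: the bbox values and texts are made up, and the column names `left`, `top`, `width`, `height` are inferred from the diff rather than taken from the full source.

```python
import pandas as pd

# Hypothetical word boxes as (left, top, right, bottom); the values are made up.
bboxes = [(10, 20, 50, 40), (60, 20, 120, 45)]
texts = ["Hello", "world"]

# Load right/bottom into the columns that will end up holding width/height.
ocr_df = pd.DataFrame(bboxes, columns=["left", "top", "width", "height"])
ocr_df["text"] = texts

# Vectorized overwrite: width = right - left, height = bottom - top.
ocr_df["width"] -= ocr_df["left"]
ocr_df["height"] -= ocr_df["top"]
```

This replaces the per-row subtraction in the Python loop with two column-wise operations, though at ~300 rows the difference is likely small.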
This change adds the ability to filter out characters predicted by Tesseract with low confidence scores.
Some notes:
- I intentionally disabled it by default; I think some low score (like 0.9-0.95 for Tesseract) could be a safe choice though.
- I wanted to use character bboxes and combine them into word bboxes later. However, a bug in Tesseract returns incorrect character bboxes in some specific scenarios (unit tests caught it 🥳). More in the comment in the code.
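
The filtering idea can be illustrated with a small sketch. It assumes each character arrives as a `(char, confidence)` pair with confidence in the 0-1 range; `filter_characters` is a hypothetical helper, not the PR's implementation, which works on Tesseract's hOCR output.

```python
from typing import List, Tuple


def filter_characters(chars: List[Tuple[str, float]], threshold: float = 0.0) -> str:
    """Join characters whose confidence is at or above the threshold.

    Characters below the threshold are ignored; with the default of 0.0
    nothing is filtered, matching the "disabled by default" behavior.
    """
    return "".join(ch for ch, conf in chars if conf >= threshold)


# Hypothetical per-character predictions for one word.
word = [("c", 0.99), ("a", 0.97), ("~", 0.20), ("t", 0.98)]
print(filter_characters(word))        # default threshold keeps every character
print(filter_characters(word, 0.9))   # drops the low-confidence "~"
```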