Added space_between_characters#197
sebastianGehrmann left a comment
I think the scope for this could be expanded by not adding spaces between every letter of a word. Something like "house" -> "h ouse" or "h o use" is a much more plausible error for OCR systems.
This could easily be implemented by moving from whitespace tokenization to character-based tokenization, which would also make the transformation applicable to many more languages.
> ## What are the limitations of this transformation?
> - The transformation's outputs are very simple.
> - It is not capable of generating linguistically diverse text.
> - This transformation will mainly affect the perfornamce of token/word-level models, while character-level models should be much robust.
nit: "much more robust"
Thank you @sebastianGehrmann for your suggestion. I agree that it would be interesting to extend this transformation so that a space is not always inserted. I have implemented this in b551a5c, where I added a new argument controlling the probability of inserting a space between two characters in a token.
I have also updated the README in 28b1301.
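For illustration, a minimal sketch of what such an argument could look like. The code from b551a5c is not shown in this thread, so the function name `space_within_tokens` and the parameters `prob` and `seed` are hypothetical, and whitespace tokenization is assumed:

```python
import random


def space_within_tokens(text, prob=0.5, seed=0):
    """Insert a space between adjacent characters of each token with probability `prob`.

    Hypothetical sketch: names and defaults are assumptions, not the PR's actual API.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    perturbed = []
    for token in text.split():  # whitespace tokenization
        chars = [token[0]]
        for ch in token[1:]:
            # Decide independently at each character boundary inside the token
            if rng.random() < prob:
                chars.append(" ")
            chars.append(ch)
        perturbed.append("".join(chars))
    return " ".join(perturbed)
```

With `prob=1.0` this reduces to the original behaviour of spacing out every letter, while intermediate values can yield partial splits such as "h ouse".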
> TaskType.TEXT_TO_TEXT_GENERATION,
> TaskType.TEXT_TAGGING,
> ]
> languages = ["en"]
By using another tokenizer, this could also work for other languages. The "en" choice is surprising here.
You are right, thank you for spotting this. I have changed it to "all" in 28b1301.
Thanks for the changes! A couple small things now:
I think an easy fix for (1) is to drop the separate per-word probability and use only a per-character probability. That way you are truly language-agnostic.
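A rough sketch of the per-character variant suggested here, which avoids tokenization altogether. The name `insert_spaces_per_character` and its parameters are hypothetical and not taken from the PR:

```python
import random


def insert_spaces_per_character(text, prob=0.1, seed=0):
    """Insert a space after a character with probability `prob`, without tokenizing.

    Hypothetical sketch: a space is only considered between two non-whitespace
    characters, so existing whitespace is left untouched.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        nxt = text[i + 1:i + 2]  # next character, or "" at the end of the string
        if nxt and not ch.isspace() and not nxt.isspace() and rng.random() < prob:
            out.append(" ")
    return "".join(out)
```

Because it never splits on whitespace, the same function applies to scripts written without spaces between words, for example `insert_spaces_per_character("東京都", prob=1.0)`.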
Thanks for the comments:
Thank you.