Skip to content

Added space_between_characters#197

Open
marco-digio wants to merge 5 commits into
GEM-benchmark:mainfrom
marco-digio:space_between_characters
Open

Added space_between_characters#197
marco-digio wants to merge 5 commits into
GEM-benchmark:mainfrom
marco-digio:space_between_characters

Conversation

@marco-digio
Copy link
Copy Markdown
Contributor

No description provided.

Copy link
Copy Markdown
Collaborator

@sebastianGehrmann sebastianGehrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the scope for this could be expanded by not adding spaces between every letter of a word. Something like "house" -> "h ouse" or "h o use" is a much more plausible error for OCR systems.

This could easily be implemented by moving away from whitespace tokenization to a character-based one which would also make the transformation applicable to many more languages.

## What are the limitations of this transformation?
- The transformation's outputs are very simple.
- It is not capable of generating linguistically diverse text.
- This transformation will mainly affect the perfornamce of token/word-level models, while character-level models should be much robust. No newline at end of file
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "much more robust"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sebastianGehrmann for your suggestion. I agree that it is very interesting to expand this transformation by adding the possibility of not having a space. I have implemented it in b551a5c where I added a new argument controlling the probability of inserting a space between 2 characters in a token.
I have also updated the README in 28b1301

TaskType.TEXT_TO_TEXT_GENERATION,
TaskType.TEXT_TAGGING,
]
languages = ["en"]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using another tokenizer, this could also work for other languages. The "en" choice is surprising here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, thank you for spotting this. I have changed it to "all" in 28b1301

@sebastianGehrmann
Copy link
Copy Markdown
Collaborator

Thanks for the changes! A couple small things now:

  1. You are still using a whitespace tokenizer which I am not sure is completely necessary (especially since it excludes languages like Chinese which don't use spaces
  2. The tests don't cover the new cases

I think an easy fix for (1) is to no longer differentiate between probability per-word and instead just have the probability per character. That way you are truly language-agnostic.

@marco-digio
Copy link
Copy Markdown
Contributor Author

Thanks for the changes! A couple small things now:

  1. You are still using a whitespace tokenizer which I am not sure is completely necessary (especially since it excludes languages like Chinese which don't use spaces
  2. The tests don't cover the new cases

I think an easy fix for (1) is to no longer differentiate between probability per-word and instead just have the probability per character. That way you are truly language-agnostic.

Thanks for the comments:

  1. Languages that do not use spaces, like Chinese, will not benefit from this transformation. The goal of this PR is to transform words in a full-width style, separating every character with spaces. Since Chinese and similar languages do not use spaces, I believe that this transformation should not be applied to them. I think that similar transformations applied to languages that do not use spaces could be useful (I do not speak any of those languages so I am not completely sure about this) but they should be kept separate from this one.
  2. You are right, but I don't know how to set the argument prob_char different from the default value in the test cases. If you know a transformation that uses different arguments in the tests, please link it to me so I can check how to implement it here. I have not found similar cases in the ones that I have checked, nor documentation that describes it.

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants