Added space_between_characters #197
851acd8
b551a5c
28b1301
b418cbe
85e4549
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| # Space Between Characters | ||
| This perturbation adds noise to all types of text sources (sentence, paragraph, etc.). | ||
|
|
||
| Author name: Marco Di Giovanni | ||
| Author email: marco.digiovanni@polimi.it | ||
| Author Affiliation: Politecnico di Milano and University of Bologna | ||
|
|
||
| ## What type of a transformation is this? | ||
| This transformation acts as a perturbation to test robustness. A few words are picked at random, and a space is inserted between each pair of adjacent characters (e.g., Marco -> M a r c o). | ||
|
|
||
| The generated text stays very close to the source sentence: removing the inserted spaces recovers the original, so the outputs remain readable. | ||
|
|
||
| ## What tasks does it intend to benefit? | ||
| This perturbation would benefit all tasks which have a sentence/paragraph/document as input like text classification, text generation, etc. | ||
|
|
||
|
|
||
| ## What are the limitations of this transformation? | ||
| - The transformation's outputs are very simple. | ||
| - It is not capable of generating linguistically diverse text. | ||
| - This transformation will mainly affect the performance of token/word-level models, while character-level models should be much robust. | ||
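
The last limitation can be illustrated with a small sketch. It is not part of the PR; the helper name `space_out` and the example sentence are made up for illustration. It compares what a whitespace tokenizer sees with what remains at the character level once spaces are ignored.

```python
def space_out(word: str) -> str:
    """Insert a single space between every pair of adjacent characters."""
    return " ".join(word)


original = "Marco returned the book"
# Perturb only the word "Marco" so the effect is easy to follow.
perturbed = " ".join(
    space_out(w) if w == "Marco" else w for w in original.split(" ")
)

# Word-level view: whitespace tokenization now yields one-character tokens.
word_tokens_before = original.split()
word_tokens_after = perturbed.split()

# Character-level view: ignoring spaces, the character sequence is unchanged.
chars_before = [c for c in original if c != " "]
chars_after = [c for c in perturbed if c != " "]

print(perturbed)
print(len(word_tokens_before), len(word_tokens_after))
print(chars_before == chars_after)
```

A character-level model still sees the extra space characters, so this only suggests why such models are expected to degrade less; it is not a measurement.
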
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| from .transformation import * |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| { | ||
| "type": "space_between_characters", | ||
| "test_cases": [ | ||
| { | ||
| "class": "SpaceBetweenCharacters", | ||
| "inputs": { | ||
| "sentence": "Andrew finally returned the French book to Chris that I bought last week" | ||
| }, | ||
| "outputs": [{ | ||
| "sentence": "Andrew f i n a l l y returned the French book to C h r i s that I bought last w e e k" | ||
| }] | ||
| }, | ||
| { | ||
| "class": "SpaceBetweenCharacters", | ||
| "inputs": { | ||
| "sentence": "Sentences with gapping, such as Paul likes coffee and Mary tea, lack an overt predicate to indicate the relation between two or more arguments." | ||
| }, | ||
| "outputs": [{ | ||
| "sentence": "Sentences w i t h gapping, such as Paul likes c o f f e e and M a r y tea, lack a n overt predicate to indicate the relation b e t w e e n two or more arguments." | ||
| }] | ||
| }, | ||
| { | ||
| "class": "SpaceBetweenCharacters", | ||
| "inputs": { | ||
| "sentence": "Alice in Wonderland is a 2010 American live-action/animated dark fantasy adventure film" | ||
| }, | ||
| "outputs": [{ | ||
| "sentence": "Alice i n Wonderland is a 2010 American l i v e - a c t i o n / a n i m a t e d dark f a n t a s y adventure film" | ||
| }] | ||
| }, | ||
| { | ||
| "class": "SpaceBetweenCharacters", | ||
| "inputs": { | ||
| "sentence": "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001" | ||
| }, | ||
| "outputs": [{ | ||
| "sentence": "Ujjal D e v Dosanjh served as 33rd Premier o f British C o l u m b i a from 2000 t o 2001" | ||
| }] | ||
| }, | ||
| { | ||
| "class": "SpaceBetweenCharacters", | ||
| "inputs": { | ||
| "sentence": "Neuroplasticity is a continuous processing allowing short-term, medium-term, and long-term remodeling of the neuronosynaptic organization." | ||
| }, | ||
| "outputs": [{ | ||
| "sentence": "Neuroplasticity i s a continuous processing allowing short-term, m e d i u m - t e r m , and l o n g - t e r m remodeling of t h e neuronosynaptic organization." | ||
| }] | ||
| } | ||
| ] | ||
| } |
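
A quick sanity check for test cases like the ones above might look as follows. This is a sketch, not something the PR contains: the function name `is_valid_perturbation` is hypothetical, and it assumes each source word appears in the output either unchanged or with exactly one space between every pair of its characters.

```python
def is_valid_perturbation(source: str, output: str) -> bool:
    """Return True if every source word appears in the output, in order,
    either unchanged or fully spaced out (one space between characters)."""
    rest = output
    for word in source.split(" "):
        spaced = " ".join(word)
        for candidate in (spaced, word):
            if rest == candidate:  # last word of the output
                rest = ""
                break
            if rest.startswith(candidate + " "):
                rest = rest[len(candidate) + 1:]
                break
        else:
            return False  # neither form of the word matched
    return rest == ""


# One input/output pair copied from the test cases above.
src = "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001"
out = "Ujjal D e v Dosanjh served as 33rd Premier o f British C o l u m b i a from 2000 t o 2001"

print(is_valid_perturbation(src, out))
print(is_valid_perturbation("Marco", "M arco"))
```

The check is independent of the random seed, so it can flag a malformed expected output without rerunning the transformation.
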
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| import random | ||
| from typing import List | ||
|
|
||
| from interfaces.SentenceOperation import SentenceOperation | ||
| from tasks.TaskTypes import TaskType | ||
|
|
||
|
|
||
| def add_spaces(text, prob=0.1, seed=0, max_outputs=1): | ||
|     random.seed(seed) | ||
|
|
||
|     words = text.split(" ") | ||
|     perturbed_texts = [] | ||
|     for _ in range(max_outputs): | ||
|         perturbed_text = [] | ||
|         for word in words: | ||
|             if random.random() <= prob: | ||
|                 new_word = " ".join(word) | ||
|             else: | ||
|                 new_word = word | ||
|             perturbed_text.append(new_word) | ||
|         perturbed_texts.append(" ".join(perturbed_text)) | ||
|     return perturbed_texts | ||
|
|
||
|
|
||
| class SpaceBetweenCharacters(SentenceOperation): | ||
|     tasks = [ | ||
|         TaskType.TEXT_CLASSIFICATION, | ||
|         TaskType.TEXT_TO_TEXT_GENERATION, | ||
|         TaskType.TEXT_TAGGING, | ||
|     ] | ||
|     languages = ["en"] | ||
|
Collaborator
By using another tokenizer, this could also work for other languages. The "en" choice is surprising here.
Contributor
Author
You are right, thank you for spotting this. I have changed it to "all" in 28b1301 |
||
|
|
||
|     def __init__(self, seed=42, max_outputs=1, prob=0.1): | ||
|         super().__init__(seed, max_outputs=max_outputs) | ||
|         self.prob = prob | ||
|
|
||
|     def generate(self, sentence: str) -> List[str]: | ||
|         perturbed_texts = add_spaces( | ||
|             text=sentence, | ||
|             prob=self.prob, | ||
|             seed=self.seed, | ||
|             max_outputs=self.max_outputs, | ||
|         ) | ||
|         return perturbed_texts | ||
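
Because the class depends on the repository's `SentenceOperation` interface, the core logic is easier to try on its own. The sketch below mirrors the word-level loop of `add_spaces` with one deliberate difference that is an assumption of this example, not what the PR does: it uses a local `random.Random(seed)` instead of reseeding the global generator with `random.seed(seed)`, so calling it does not disturb other code that uses `random`. The name `add_spaces_local` is hypothetical.

```python
import random
from typing import List


def add_spaces_local(
    text: str, prob: float = 0.1, seed: int = 0, max_outputs: int = 1
) -> List[str]:
    """Space out each word with probability `prob`, using a private RNG."""
    rng = random.Random(seed)  # local generator; global `random` state is untouched
    words = text.split(" ")
    outputs = []
    for _ in range(max_outputs):
        perturbed = [
            " ".join(word) if rng.random() <= prob else word for word in words
        ]
        outputs.append(" ".join(perturbed))
    return outputs


# prob=1.0 spaces out every word; the same seed always gives the same result.
print(add_spaces_local("ab cd", prob=1.0))
print(add_spaces_local("some longer sentence here", prob=0.5, seed=7, max_outputs=3))
```

With `max_outputs` greater than 1, each output comes from fresh draws of the same generator, so the returned variants can differ from one another while the whole list stays reproducible for a given seed.
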
nit: "much more robust"
Thank you @sebastianGehrmann for your suggestion. I agree that it is very interesting to expand this transformation by adding the possibility of not having a space. I have implemented it in b551a5c where I added a new argument controlling the probability of inserting a space between 2 characters in a token.
I have also updated the README in 28b1301
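
The code from b551a5c is not shown in this view, so the following is only a sketch of how such a per-character probability could work. The names `add_spaces_partial`, `space_word`, `prob_word`, and `prob_space` are hypothetical, and the sketch uses a strict `<` comparison so that a probability of 0 never fires and a probability of 1 always does.

```python
import random
from typing import List


def space_word(word: str, prob_space: float, rng: random.Random) -> str:
    """Insert a space before each character after the first with probability `prob_space`."""
    if not word:
        return word
    pieces = [word[0]]
    for ch in word[1:]:
        if rng.random() < prob_space:
            pieces.append(" ")
        pieces.append(ch)
    return "".join(pieces)


def add_spaces_partial(
    text: str,
    prob_word: float = 0.1,
    prob_space: float = 0.5,
    seed: int = 0,
    max_outputs: int = 1,
) -> List[str]:
    """Pick words with probability `prob_word`, then space them out partially."""
    rng = random.Random(seed)
    words = text.split(" ")
    outputs = []
    for _ in range(max_outputs):
        perturbed = [
            space_word(w, prob_space, rng) if rng.random() < prob_word else w
            for w in words
        ]
        outputs.append(" ".join(perturbed))
    return outputs


# prob_space=1.0 reproduces the original all-or-nothing behaviour for chosen words.
print(add_spaces_partial("ab cd", prob_word=1.0, prob_space=1.0))
print(add_spaces_partial("Marco returned the book", prob_word=0.5, prob_space=0.5, seed=3))
```

Setting `prob_space` to 1.0 recovers the behaviour of the first commit, while lower values produce outputs such as partially split tokens.
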