Added Synonym insertion #160
Merged
11 commits
beb7548 add synonym insertion transformation (JosephSefara)
d0d9c80 Update README.md (vukosim)
f4201c8 Update test.json (vukosim)
f8cd666 Update README.md (vukosim)
b762f37 Update README.md (vukosim)
b81cc9e update readme (JosephSefara)
1a56e0c format json (JosephSefara)
17e78ea Merge branch 'GEM-benchmark:main' into synonym_insertion (JosephSefara)
732b44d added list of keywords (JosephSefara)
96c84d4 update robust evaluation results on readme and test.json (JosephSefara)
2126a52 minor updates (JosephSefara)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| # Synonym Insertion | ||
| This perturbation adds noise to all types of text sources (sentence, paragraph, etc.) by randomly inserting synonyms of randomly selected words, excluding punctuation and stopwords. | ||
|
|
||
| Author1 name: Tshephisho Sefara | ||
|
|
||
| Author1 email: sefaratj@gmail.com | ||
|
|
||
| Author1 Affiliation: Council for Scientific and Industrial Research | ||
|
|
||
| Author2 name: Vukosi Marivate | ||
|
|
||
| Author2 email: vukosi.marivate@cs.up.ac.za, vima@vima.co.za | ||
|
|
||
| Author2 Affiliation: Department of Computer Science, University of Pretoria | ||
|
|
||
| ## What type of a transformation is this? | ||
| This transformation can augment the semantic representation of a sentence and test model robustness by inserting synonyms of random words, excluding punctuation and stopwords. | ||
|
|
||
|
|
||
| ## What tasks does it intend to benefit? | ||
| This perturbation would benefit text classification and text generation tasks. | ||
|
|
||
| Benchmark results: | ||
|
|
||
| - Text Classification: we run sentiment analysis on a 1% sample of the IMDB dataset. The original accuracy is 96.0 and the perturbed accuracy is 94.0. | ||
| ``` | ||
| Applying transformation: | ||
| 100%|██████████| 250/250 [00:18<00:00, 13.85it/s] | ||
| Finished transformation! 250 examples generated from 250 original examples, with 250 successfully transformed and 0 unchanged (1.0 perturb rate) | ||
| Here is the performance of the model on the transformed set | ||
| The accuracy on this subset which has 250 examples = 94.0 | ||
|
|
||
| {'accuracy': 96.0, | ||
| 'dataset_name': 'imdb', | ||
| 'model_name': 'aychang/roberta-base-imdb', | ||
| 'no_of_examples': 250, | ||
| 'pt_accuracy': 94.0, | ||
| 'split': 'test[:1%]'} | ||
| ``` | ||
| - Text Generation: we run text generation on a 1% sample of the XSum dataset. The original BLEU is 16.0 and the perturbed BLEU is 13.85. | ||
| ``` | ||
| Applying transformation: | ||
| 100%|██████████| 113/113 [00:12<00:00, 9.31it/s] | ||
| Finished transformation! 113 examples generated from 113 original examples, with 113 successfully transformed and 0 unchanged (1.0 perturb rate) | ||
| Here is the performance of the model on the transformed set | ||
| Length of Evaluation dataset is 113 | ||
| Predicted BLEU score = 13.849736846663058 | ||
| {'bleu': 16.0, | ||
| 'dataset_name': 'xsum', | ||
| 'model_name': 'sshleifer/distilbart-xsum-12-6', | ||
| 'pt_bleu': 13.8, | ||
| 'split': 'test[:1%]'} | ||
| ``` | ||
|
|
||
| ## Related Work | ||
| This perturbation is adapted from our TextAugment library: https://github.com/dsfsi/textaugment | ||
| ```bibtex | ||
| @inproceedings{marivate2020improving, | ||
| title={Improving short text classification through global augmentation methods}, | ||
| author={Marivate, Vukosi and Sefara, Tshephisho}, | ||
| booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction}, | ||
| pages={385--399}, | ||
| year={2020}, | ||
| organization={Springer} | ||
| } | ||
| ``` | ||
|
|
||
| The synonyms are drawn from WordNet via NLTK. | ||
|
|
||
| ```bibtex | ||
| @book{miller1998wordnet, | ||
| title={WordNet: An electronic lexical database}, | ||
| author={Miller, George A}, | ||
| year={1998}, | ||
| publisher={MIT press} | ||
| } | ||
| @inproceedings{bird2006nltk, | ||
| title={NLTK: the natural language toolkit}, | ||
| author={Bird, Steven}, | ||
| booktitle={Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions}, | ||
| pages={69--72}, | ||
| year={2006} | ||
| } | ||
| ``` | ||
|
|
||
|
|
||
| ## What are the limitations of this transformation? | ||
| The space of synonyms depends on WordNet and could be limited. The transformation might introduce ungrammatical segments. |
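The idea described in this README can be illustrated with a minimal sketch. It is not the code from this PR: the synonym table and stopword set below are hypothetical stand-ins for WordNet and the NLTK stopword list, and whitespace splitting stands in for a real tokenizer.

```python
import random

# Hypothetical stand-ins for the WordNet lookup and the NLTK stopword list.
TOY_SYNONYMS = {
    "book": ["volume", "record"],
    "bought": ["purchased"],
    "week": ["hebdomad"],
}
TOY_STOPWORDS = {"the", "to", "that", "i", "last"}


def insert_synonyms(sentence, prob=0.5, seed=0):
    """Append a synonym after each eligible word with probability `prob`."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    result = []
    for word in sentence.split():
        result.append(word)  # the original word is always kept
        candidates = TOY_SYNONYMS.get(word.lower(), [])
        if word.lower() in TOY_STOPWORDS or not candidates:
            continue  # stopwords and words without synonyms stay as they are
        if rng.random() < prob:
            result.append(rng.choice(candidates))
    return " ".join(result)


original = "Andrew finally returned the French book to Chris that I bought last week"
print(insert_synonyms(original, prob=1.0))
print(insert_synonyms(original, prob=0.0))  # prob=0.0 leaves the text unchanged
```

Words missing from the table are left untouched, which mirrors the coverage limitation noted above: the transformation can only insert what the lexical resource provides.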
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| from .transformation import * |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| nltk>=3.4 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| { | ||
| "type": "synonym_insertion", | ||
| "test_cases": [ | ||
| { | ||
| "class": "SynonymInsertion", | ||
| "inputs": { | ||
| "sentence": "Andrew finally returned the French book to Chris that I bought last week" | ||
| }, | ||
| "outputs": [ | ||
| { | ||
| "sentence": "Andrew finally returned the French book koran to Chris that I bought last final week" | ||
| } | ||
| ] | ||
| }, | ||
| { | ||
| "class": "SynonymInsertion", | ||
| "inputs": { | ||
| "sentence": "Sentences with gapping, such as Paul likes coffee and Mary tea, lack an overt predicate to indicate the relation between two or more arguments." | ||
| }, | ||
| "outputs": [ | ||
| { | ||
| "sentence": "Sentences with gapping, such as Paul likes coffee chocolate and Mary tea, lack an overt predicate to indicate argue the relation between two or more arguments controversy." | ||
| } | ||
| ] | ||
| }, | ||
| { | ||
| "class": "SynonymInsertion", | ||
| "inputs": { | ||
| "sentence": "Alice in Wonderland is a 2010 American live-action/animated dark fantasy adventure film" | ||
| }, | ||
| "outputs": [ | ||
| { | ||
| "sentence": "Alice in Wonderland is a 2010 American live - action / animated animize dark fantasy illusion adventure film" | ||
| } | ||
| ] | ||
| }, | ||
| { | ||
| "class": "SynonymInsertion", | ||
| "inputs": { | ||
| "sentence": "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001" | ||
| }, | ||
| "outputs": [ | ||
| { | ||
| "sentence": "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001" | ||
| } | ||
| ] | ||
| }, | ||
| { | ||
| "class": "SynonymInsertion", | ||
| "inputs": { | ||
| "sentence": "Neuroplasticity is a continuous processing allowing short-term, medium-term, and long-term remodeling of the neuronosynaptic organization." | ||
| }, | ||
| "outputs": [ | ||
| { | ||
| "sentence": "Neuroplasticity is a continuous processing allowing short - term terminus, medium - term terminus, and long - term condition remodeling of the neuronosynaptic organization." | ||
| } | ||
| ] | ||
| } | ||
| ] | ||
| } |
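One property these test cases share is that the transformation only inserts words: the tokens of each input appear in the output in their original order. A small sketch of that check follows, using two pairs copied from the test cases above. The helper names `tokens` and `is_subsequence` are illustrative and not part of the PR, and the regex tokenizer is an assumption that only approximates the spaCy tokenization used by the transformation.

```python
import re


def tokens(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def is_subsequence(short, long):
    """Return True if every item of `short` appears in `long` in the same order."""
    remaining = iter(long)
    return all(item in remaining for item in short)


# Two input/output pairs copied from the test cases above.
cases = [
    (
        "Andrew finally returned the French book to Chris that I bought last week",
        "Andrew finally returned the French book koran to Chris that I bought last final week",
    ),
    (
        "Alice in Wonderland is a 2010 American live-action/animated dark fantasy adventure film",
        "Alice in Wonderland is a 2010 American live - action / animated animize dark fantasy illusion adventure film",
    ),
]

for source, perturbed in cases:
    assert is_subsequence(tokens(source), tokens(perturbed))
    inserted = len(tokens(perturbed)) - len(tokens(source))
    print(f"{inserted} token(s) inserted")
```

Comparing tokens rather than raw strings matters here because the outputs space out hyphens and slashes ("live - action / animated"), so a plain substring comparison would fail.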
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,120 @@ | ||
| import random | ||
| import re | ||
| from abc import ABC | ||
|
|
||
| import nltk | ||
| import spacy | ||
| from nltk.corpus import wordnet, stopwords | ||
|
|
||
| from interfaces.SentenceOperation import SentenceOperation | ||
| from tasks.TaskTypes import TaskType | ||
| from initialize import spacy_nlp | ||
|
|
||
| """ | ||
| Base Class for implementing the different input transformations a generation should be robust against. | ||
| """ | ||
|
|
||
|
|
||
| class InsertWordTransformation: | ||
| nlp = None | ||
|
|
||
| def __init__(self, seed=0, max_outputs=1, prob=0.5): | ||
| self.nlp = spacy_nlp if spacy_nlp else spacy.load("en_core_web_sm") | ||
| self.max_outputs = max_outputs | ||
| self.seed = seed | ||
| self.prob = prob | ||
| self.stopwords = stopwords.words('english') | ||
|
|
||
| def untokenize(self, words: list): | ||
| """ | ||
| Untokenizing a text undoes the tokenizing operation, restoring | ||
| punctuation and spaces to the places that people expect them to be. | ||
| Ideally, `untokenize(tokenize(text))` should be identical to `text`, | ||
| except for line breaks. | ||
| ref: https://github.com/commonsense/metanl/blob/master/metanl/token_utils.py#L28 | ||
| """ | ||
| text = " ".join(words) | ||
| step1 = ( | ||
| text.replace("`` ", '"').replace(" ''", '"').replace(". . .", "...") | ||
| ) | ||
| step2 = step1.replace(" ( ", " (").replace(" ) ", ") ") | ||
| step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2) | ||
| step4 = re.sub(r" ([.,:;?!%]+)$", r"\1", step3) | ||
| step5 = ( | ||
| step4.replace(" '", "'") | ||
| .replace(" n't", "n't") | ||
| .replace("can not", "cannot") | ||
| ) | ||
| step6 = step5.replace(" ` ", " '") | ||
| return step6.strip() | ||
|
|
||
| def transform(self, input_text: str): | ||
| random.seed(self.seed) | ||
| pos_wordnet_dict = { | ||
| "VERB": "v", | ||
| "NOUN": "n", | ||
| "ADV": "r", | ||
| "ADJ": "s", | ||
| } | ||
| doc = self.nlp(input_text) | ||
| results = set() | ||
| for _ in range(self.max_outputs): | ||
| result = [] | ||
| for token in doc: | ||
| word = token.text | ||
| wordnet_pos = pos_wordnet_dict.get(token.pos_) | ||
| if not wordnet_pos: | ||
| result.append(word) | ||
| elif word.lower() in self.stopwords:  # NLTK stopwords are lowercase | ||
| result.append(word) | ||
| else: | ||
| synsets = wordnet.synsets(word, pos=wordnet_pos) | ||
| if len(synsets) > 0: | ||
| synsets = [syn.name().split(".")[0] for syn in synsets] | ||
| synsets = [syn for syn in synsets if syn.lower() != word.lower()] | ||
| synsets = list(set(synsets)) # remove duplicate synonyms | ||
| if len(synsets) > 0 and random.random() < self.prob: | ||
| syn = random.choice(synsets) | ||
| syn = syn.replace("_", " ") | ||
| result.append(word) | ||
| result.append(syn) | ||
| else: | ||
| result.append(word) | ||
| else: | ||
| result.append(word) | ||
| result = self.untokenize(result) # rebuild the sentence | ||
| results.add(result) | ||
| return list(results) | ||
|
|
||
|
|
||
| """ | ||
| Insert words such as synonyms from WordNet via nltk. | ||
| """ | ||
|
|
||
|
|
||
| class SynonymInsertion(SentenceOperation, ABC): | ||
| """ | ||
| This class is an implementation of synonym insertion in the sentence. Created by the Authors of TextAugment | ||
| https://github.com/dsfsi/textaugment | ||
| """ | ||
| tasks = [TaskType.TEXT_CLASSIFICATION, TaskType.TEXT_TO_TEXT_GENERATION] | ||
| languages = ["en"] | ||
| heavy = False | ||
| keywords = [ | ||
| "tokenizer", "external-knowledge-based", "lexical", "low-precision", "low-coverage", "low-generations" | ||
| ] | ||
|
|
||
| def __init__(self, seed=0, prob=0.5, max_outputs=1): | ||
| super().__init__(seed, max_outputs=max_outputs) | ||
| nltk.download(["wordnet", "stopwords"]) | ||
| self.insert_word_transformation = InsertWordTransformation( | ||
| seed, max_outputs, prob | ||
| ) | ||
|
|
||
| def generate(self, sentence: str): | ||
| result = self.insert_word_transformation.transform( | ||
| input_text=sentence, | ||
| ) | ||
| if self.verbose: | ||
| print(f"Perturbed Input from {self.name()} : {result}") | ||
| return result | ||
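The spaced-out hyphens in the test outputs (for example "short - term") appear to follow from how `untokenize` rebuilds the sentence. The sketch below is a standalone copy of that method for experimentation. It assumes the tokenizer has already split hyphenated words into separate tokens, which is consistent with the test outputs but is not verified against spaCy here.

```python
import re


def untokenize(words):
    """Standalone copy of the `untokenize` method shown in the diff above."""
    text = " ".join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace(". . .", "...")
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r" ([.,:;?!%]+)$", r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace("can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()


# Commas and sentence-final punctuation are re-attached to the preceding word.
print(untokenize(["short", "-", "term", ",", "medium", "-", "term", "."]))
# Hyphens and slashes are not in the punctuation class, so they stay spaced out.
print(untokenize(["live", "-", "action", "/", "animated", "film"]))
```

If round-tripping hyphenated words matters for a downstream task, one option would be to extend the punctuation handling, though that would change the expected outputs in test.json.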
Also, please add keywords here.
I would also recommend adding a robustness evaluation for your PR so that it can be added to the leaderboard.