GEM-benchmark · kaustubhdhole · Oct 28, 2021 · Jul 24, 2021
diff --git a/transformations/synonym_insertion/README.md b/transformations/synonym_insertion/README.md
@@ -0,0 +1,88 @@
+# Synonym Insertion
+This perturbation adds noise to text of any length (sentence, paragraph, etc.) by inserting a synonym immediately after randomly selected words, skipping punctuation and stopwords.
+
+Author1 name: Tshephisho Sefara
+
+Author1 email: sefaratj@gmail.com
+
+Author1 Affiliation: Council for Scientific and Industrial Research
+
+Author2 name: Vukosi Marivate
+
+Author2 email: vukosi.marivate@cs.up.ac.za, vima@vima.co.za
+
+Author2 Affiliation: Department of Computer Science, University of Pretoria
+
+## What type of a transformation is this?
+This transformation can augment a sentence with semantically related words and can be used to test model robustness. It inserts synonyms of randomly selected words, excluding punctuation and stopwords.
+
+
+## What tasks does it intend to benefit?
+This perturbation is intended to benefit text classification and text generation tasks.
+
+Benchmark results:
+
+- Text Classification: we ran sentiment analysis on the first 1% of the IMDB test split (250 examples) with `aychang/roberta-base-imdb`. Accuracy drops from 96.0 on the original text to 94.0 on the perturbed text.
+```
+Applying transformation:
+100%|██████████| 250/250 [00:18<00:00, 13.85it/s]
+Finished transformation! 250 examples generated from 250 original examples, with 250 successfully transformed and 0 unchanged (1.0 perturb rate)
+Here is the performance of the model on the transformed set
+The accuracy on this subset which has 250 examples = 94.0
+
+ {'accuracy': 96.0,
+ 'dataset_name': 'imdb',
+ 'model_name': 'aychang/roberta-base-imdb',
+ 'no_of_examples': 250,
+ 'pt_accuracy': 94.0,
+ 'split': 'test[:1%]'}
+```
+- Text Generation: we ran text generation on the first 1% of the XSum test split (113 examples) with `sshleifer/distilbart-xsum-12-6`. BLEU drops from 16.0 on the original text to 13.85 on the perturbed text.
+```
+Applying transformation:
+100%|██████████| 113/113 [00:12<00:00,  9.31it/s]
+Finished transformation! 113 examples generated from 113 original examples, with 113 successfully transformed and 0 unchanged (1.0 perturb rate)
+Here is the performance of the model on the transformed set
+Length of Evaluation dataset is 113
+Predicted BLEU score = 13.849736846663058
+{'bleu': 16.0,
+ 'dataset_name': 'xsum',
+ 'model_name': 'sshleifer/distilbart-xsum-12-6',
+ 'pt_bleu': 13.8,
+ 'split': 'test[:1%]'}
+```
+
+## Related Work
+This perturbation is adapted from our TextAugment library: https://github.com/dsfsi/textaugment
+```bibtex
+@inproceedings{marivate2020improving,
+  title={Improving short text classification through global augmentation methods},
+  author={Marivate, Vukosi and Sefara, Tshephisho},
+  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
+  pages={385--399},
+  year={2020},
+  organization={Springer}
+}
+```
+
+The synonyms are taken from WordNet, accessed through NLTK.
+
+```bibtex
+@book{miller1998wordnet,
+  title={WordNet: An electronic lexical database},
+  author={Miller, George A},
+  year={1998},
+  publisher={MIT press}
+}
+@inproceedings{bird2006nltk,
+  title={NLTK: the natural language toolkit},
+  author={Bird, Steven},
+  booktitle={Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions},
+  pages={69--72},
+  year={2006}
+}
+```
+
+
+## What are the limitations of this transformation?
+The set of available synonyms is limited to what WordNet covers. Because synonyms are inserted without word-sense disambiguation or inflection, the output may contain ungrammatical or semantically odd segments.
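
A minimal sketch of the insertion step described in the README, under simplifying assumptions: `TOY_SYNONYMS` and `TOY_STOPWORDS` are hypothetical stand-ins for the WordNet lookup and the NLTK stopword list, `insert_synonyms` is an illustrative name that does not appear in this PR, and a plain token list replaces the spaCy tokenizer and part-of-speech filter.

```python
import random

# Hypothetical stand-ins for the WordNet lookup and the NLTK stopword list.
TOY_SYNONYMS = {
    "book": ["volume"],
    "week": ["hebdomad"],
}
TOY_STOPWORDS = {"the", "to", "that"}


def insert_synonyms(tokens, prob=0.5, seed=0):
    """Return a copy of tokens with a synonym inserted after some eligible words."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        out.append(tok)  # the original word is always kept
        if tok in TOY_STOPWORDS:
            continue  # stopwords never receive a synonym
        # Drop candidates that are identical to the word itself.
        candidates = [s for s in TOY_SYNONYMS.get(tok.lower(), []) if s.lower() != tok.lower()]
        # Insert with probability `prob`, only when a candidate exists.
        if candidates and rng.random() < prob:
            out.append(rng.choice(candidates))
    return out


print(" ".join(insert_synonyms(["the", "French", "book", "last", "week"], prob=1.0)))
```

With `prob=1.0` every word that has a candidate receives one; with `prob=0.0` the input is returned unchanged.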
diff --git a/transformations/synonym_insertion/__init__.py b/transformations/synonym_insertion/__init__.py
@@ -0,0 +1 @@
+from .transformation import *
diff --git a/transformations/synonym_insertion/requirements.txt b/transformations/synonym_insertion/requirements.txt
@@ -0,0 +1 @@
+nltk>=3.4
diff --git a/transformations/synonym_insertion/test.json b/transformations/synonym_insertion/test.json
@@ -0,0 +1,60 @@
+{
+    "type": "synonym_insertion",
+    "test_cases": [
+      {
+        "class": "SynonymInsertion",
+        "inputs": {
+          "sentence": "Andrew finally returned the French book to Chris that I bought last week"
+        },
+        "outputs": [
+          {
+            "sentence": "Andrew finally returned the French book koran to Chris that I bought last final week"
+          }
+        ]
+      },
+      {
+        "class": "SynonymInsertion",
+        "inputs": {
+          "sentence": "Sentences with gapping, such as Paul likes coffee and Mary tea, lack an overt predicate to indicate the relation between two or more arguments."
+        },
+        "outputs": [
+          {
+            "sentence": "Sentences with gapping, such as Paul likes coffee chocolate and Mary tea, lack an overt predicate to indicate argue the relation between two or more arguments controversy."
+          }
+        ]
+      },
+      {
+        "class": "SynonymInsertion",
+        "inputs": {
+          "sentence": "Alice in Wonderland is a 2010 American live-action/animated dark fantasy adventure film"
+        },
+        "outputs": [
+          {
+            "sentence": "Alice in Wonderland is a 2010 American live - action / animated animize dark fantasy illusion adventure film"
+          }
+        ]
+      },
+      {
+        "class": "SynonymInsertion",
+        "inputs": {
+          "sentence": "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001"
+        },
+        "outputs": [
+          {
+            "sentence": "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001"
+          }
+        ]
+      },
+      {
+        "class": "SynonymInsertion",
+        "inputs": {
+          "sentence": "Neuroplasticity is a continuous processing allowing short-term, medium-term, and long-term remodeling of the neuronosynaptic organization."
+        },
+        "outputs": [
+          {
+            "sentence": "Neuroplasticity is a continuous processing allowing short - term terminus, medium - term terminus, and long - term condition remodeling of the neuronosynaptic organization."
+          }
+        ]
+      }
+    ]
+  }
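
The spaces around `-` and `/` in some expected outputs (for example `live - action / animated`) appear to come from how tokens are rejoined: `untokenize` joins tokens with single spaces and only reattaches a fixed set of punctuation marks. The sketch below copies that function from `transformation.py` as a standalone helper and runs it on hand-written token lists. The token boundaries are an assumption about what the spaCy tokenizer produces, not captured output.

```python
import re


def untokenize(words):
    """Standalone copy of InsertWordTransformation.untokenize, for illustration."""
    text = " ".join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace(". . .", "...")
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    # Reattach sentence punctuation to the preceding word.
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r" ([.,:;?!%]+)$", r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace("can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()


# Assumed token lists; hyphens and slashes are treated as separate tokens.
print(untokenize(["live", "-", "action", "/", "animated"]))  # live - action / animated
print(untokenize(["short", "-", "term", "terminus", ",", "medium", "-", "term", "."]))
```

Commas and periods are reattached, but `-` and `/` are not in the handled set, so they stay surrounded by spaces.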
diff --git a/transformations/synonym_insertion/transformation.py b/transformations/synonym_insertion/transformation.py
@@ -0,0 +1,120 @@
+import random
+import re
+from abc import ABC
+
+import nltk
+import spacy
+from nltk.corpus import wordnet, stopwords
+
+from interfaces.SentenceOperation import SentenceOperation
+from tasks.TaskTypes import TaskType
+from initialize import spacy_nlp
+
+# Helper that implements the synonym-insertion logic used by the
+# SynonymInsertion operation defined below: it inserts a WordNet synonym
+# after randomly selected content words.
+
+
+class InsertWordTransformation:
+    nlp = None
+
+    def __init__(self, seed=0, max_outputs=1, prob=0.5):
+        self.nlp = spacy_nlp if spacy_nlp else spacy.load("en_core_web_sm")
+        self.max_outputs = max_outputs
+        self.seed = seed
+        self.prob = prob
+        self.stopwords = stopwords.words('english')
+
+    def untokenize(self, words: list):
+        """
+        Untokenizing a text undoes the tokenizing operation, restoring
+        punctuation and spaces to the places that people expect them to be.
+        Ideally, `untokenize(tokenize(text))` should be identical to `text`,
+        except for line breaks.
+        ref: https://github.com/commonsense/metanl/blob/master/metanl/token_utils.py#L28
+        """
+        text = " ".join(words)
+        step1 = (
+            text.replace("`` ", '"').replace(" ''", '"').replace(". . .", "...")
+        )
+        step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
+        step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
+        step4 = re.sub(r" ([.,:;?!%]+)$", r"\1", step3)
+        step5 = (
+            step4.replace(" '", "'")
+            .replace(" n't", "n't")
+            .replace("can not", "cannot")
+        )
+        step6 = step5.replace(" ` ", " '")
+        return step6.strip()
+
+    def transform(self, input_text: str):
+        random.seed(self.seed)
+        pos_wordnet_dict = {
+            "VERB": "v",
+            "NOUN": "n",
+            "ADV": "r",
+            "ADJ": "s",
+        }
+        doc = self.nlp(input_text)
+        results = set()
+        for _ in range(self.max_outputs):
+            result = []
+            for token in doc:
+                word = token.text
+                wordnet_pos = pos_wordnet_dict.get(token.pos_)
+                if not wordnet_pos:
+                    result.append(word)
+                elif word in self.stopwords:
+                    result.append(word)
+                else:
+                    synsets = wordnet.synsets(word, pos=wordnet_pos)
+                    if len(synsets) > 0:
+                        synsets = [syn.name().split(".")[0] for syn in synsets]
+                        synsets = [syn for syn in synsets if syn.lower() != word.lower()]
+                        synsets = list(dict.fromkeys(synsets))  # remove duplicates in a stable order so seeded runs are reproducible
+                        if len(synsets) > 0 and random.random() < self.prob:
+                            syn = random.choice(synsets)
+                            syn = syn.replace("_", " ")
+                            result.append(word)
+                            result.append(syn)
+                        else:
+                            result.append(word)
+                    else:
+                        result.append(word)
+            result = self.untokenize(result)  # rebuild the sentence
+            results.add(result)
+        return list(results)
+
+
+"""
+Insert words such as synonyms from WordNet via NLTK.
+"""
+
+
+class SynonymInsertion(SentenceOperation, ABC):
+    """
+    Inserts WordNet synonyms after randomly selected words in a sentence. Created by the authors of TextAugment:
+    https://github.com/dsfsi/textaugment
+    """
+    tasks = [TaskType.TEXT_CLASSIFICATION, TaskType.TEXT_TO_TEXT_GENERATION]
+    languages = ["en"]
+    heavy = False
+    keywords = [
+        "tokenizer", "external-knowledge-based", "lexical", "low-precision", "low-coverage", "low-generations"
+    ]
+
+    def __init__(self, seed=0, prob=0.5, max_outputs=1):
+        super().__init__(seed, max_outputs=max_outputs)
+        nltk.download(["wordnet", "stopwords"])
+        self.insert_word_transformation = InsertWordTransformation(
+            seed, max_outputs, prob
+        )
+
+    def generate(self, sentence: str):
+        result = self.insert_word_transformation.transform(
+            input_text=sentence,
+        )
+        if self.verbose:
+            print(f"Perturbed Input from {self.name()} : {result}")
+        return result
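
A note on reproducibility when sampling from deduplicated candidates: the order of a list built from a `set` of strings can differ between interpreter runs, because Python randomizes string hashes by default unless `PYTHONHASHSEED` is fixed. A seeded `random.choice` over such a list may therefore not pick the same synonym each time. A minimal sketch of an order-preserving alternative follows; `pick_synonym` is a hypothetical helper, not part of this PR.

```python
import random


def pick_synonym(candidates, seed=0):
    """Deduplicate while keeping first-seen order, then sample with a seeded RNG."""
    # dict keys preserve insertion order, so the result does not depend on hashing.
    unique = list(dict.fromkeys(candidates))
    rng = random.Random(seed)
    return unique, rng.choice(unique)


unique, choice = pick_synonym(["book", "volume", "book", "record", "volume"])
print(unique)  # ['book', 'volume', 'record']
print(choice)
```

Sorting the candidates would give the same stability; `dict.fromkeys` is used here only because it keeps the order in which WordNet returned them.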