88 changes: 88 additions & 0 deletions transformations/synonym_insertion/README.md
@@ -0,0 +1,88 @@
# Synonym Insertion
This perturbation adds noise to any kind of text source (sentence, paragraph, etc.) by inserting synonyms of randomly selected words, excluding punctuation and stopwords.

Author1 name: Tshephisho Sefara

Author1 email: sefaratj@gmail.com

Author1 Affiliation: Council for Scientific and Industrial Research

Author2 name: Vukosi Marivate

Author2 email: vukosi.marivate@cs.up.ac.za, vima@vima.co.za

Author2 Affiliation: Department of Computer Science, University of Pretoria

## What type of a transformation is this?
This transformation can augment the semantic representation of a sentence and test model robustness by inserting synonyms of randomly selected words, excluding punctuation and stopwords.
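
The idea can be sketched in a few lines. This is a minimal illustration, not the implementation in `transformation.py`: the toy synonym table and stopword set are assumptions standing in for WordNet and the NLTK stopword list, `insert_synonyms` is a hypothetical helper name, and the part-of-speech filtering done with spaCy is left out.

```python
import random

# Toy stand-ins for WordNet and the NLTK stopword list (assumptions for illustration only).
TOY_SYNONYMS = {
    "book": ["volume", "record"],
    "bought": ["purchased"],
    "week": ["hebdomad"],
}
TOY_STOPWORDS = {"the", "to", "that", "i", "last"}


def insert_synonyms(tokens, prob=0.5, seed=0):
    """Return a copy of `tokens` with a synonym inserted after some eligible words."""
    rng = random.Random(seed)  # local RNG keeps the sketch reproducible
    out = []
    for tok in tokens:
        out.append(tok)  # the original word is always kept
        if tok.lower() in TOY_STOPWORDS:
            continue  # stopwords never receive a synonym
        candidates = [s for s in TOY_SYNONYMS.get(tok.lower(), []) if s != tok.lower()]
        if candidates and rng.random() < prob:
            out.append(rng.choice(candidates))  # insert right after the word
    return out


tokens = "Andrew returned the book that I bought last week".split()
print(" ".join(insert_synonyms(tokens, prob=1.0)))
```

With `prob=1.0` every word that has a candidate gets one inserted; with `prob=0.0` the input comes back unchanged. The original words always stay in place and in order, which is what distinguishes insertion from synonym replacement.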


## What tasks does it intend to benefit?
This perturbation would benefit text classification and text generation tasks.

Benchmark results:

- Text Classification: we ran sentiment analysis on a 1% sample of the IMDB test set (250 examples). Accuracy drops from 96.0 on the original data to 94.0 on the perturbed data.
```
Applying transformation:
100%|██████████| 250/250 [00:18<00:00, 13.85it/s]
Finished transformation! 250 examples generated from 250 original examples, with 250 successfully transformed and 0 unchanged (1.0 perturb rate)
Here is the performance of the model on the transformed set
The accuracy on this subset which has 250 examples = 94.0

{'accuracy': 96.0,
'dataset_name': 'imdb',
'model_name': 'aychang/roberta-base-imdb',
'no_of_examples': 250,
'pt_accuracy': 94.0,
'split': 'test[:1%]'}
```
- Text Generation: we ran text generation on a 1% sample of the XSum test set (113 examples). BLEU drops from 16.0 on the original data to 13.85 on the perturbed data.
```
Applying transformation:
100%|██████████| 113/113 [00:12<00:00, 9.31it/s]
Finished transformation! 113 examples generated from 113 original examples, with 113 successfully transformed and 0 unchanged (1.0 perturb rate)
Here is the performance of the model on the transformed set
Length of Evaluation dataset is 113
Predicted BLEU score = 13.849736846663058
{'bleu': 16.0,
'dataset_name': 'xsum',
'model_name': 'sshleifer/distilbart-xsum-12-6',
'pt_bleu': 13.8,
'split': 'test[:1%]'}
```

## Related Work
This perturbation is adapted from our TextAugment library: https://github.com/dsfsi/textaugment
```bibtex
@inproceedings{marivate2020improving,
title={Improving short text classification through global augmentation methods},
author={Marivate, Vukosi and Sefara, Tshephisho},
booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
pages={385--399},
year={2020},
organization={Springer}
}
```

The synonyms are drawn from WordNet, accessed via NLTK.

```bibtex
@book{miller1998wordnet,
title={WordNet: An electronic lexical database},
author={Miller, George A},
year={1998},
publisher={MIT press}
}
@inproceedings{bird2006nltk,
title={NLTK: the natural language toolkit},
author={Bird, Steven},
booktitle={Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions},
pages={69--72},
year={2006}
}
```


## What are the limitations of this transformation?
The space of synonyms depends on WordNet and may be limited. The transformation may also introduce ungrammatical segments.
1 change: 1 addition & 0 deletions transformations/synonym_insertion/__init__.py
@@ -0,0 +1 @@
from .transformation import *
1 change: 1 addition & 0 deletions transformations/synonym_insertion/requirements.txt
@@ -0,0 +1 @@
nltk>=3.4
60 changes: 60 additions & 0 deletions transformations/synonym_insertion/test.json
@@ -0,0 +1,60 @@
{
    "type": "synonym_insertion",
    "test_cases": [
        {
            "class": "SynonymInsertion",
            "inputs": {
                "sentence": "Andrew finally returned the French book to Chris that I bought last week"
            },
            "outputs": [
                {
                    "sentence": "Andrew finally returned the French book koran to Chris that I bought last final week"
                }
            ]
        },
        {
            "class": "SynonymInsertion",
            "inputs": {
                "sentence": "Sentences with gapping, such as Paul likes coffee and Mary tea, lack an overt predicate to indicate the relation between two or more arguments."
            },
            "outputs": [
                {
                    "sentence": "Sentences with gapping, such as Paul likes coffee chocolate and Mary tea, lack an overt predicate to indicate argue the relation between two or more arguments controversy."
                }
            ]
        },
        {
            "class": "SynonymInsertion",
            "inputs": {
                "sentence": "Alice in Wonderland is a 2010 American live-action/animated dark fantasy adventure film"
            },
            "outputs": [
                {
                    "sentence": "Alice in Wonderland is a 2010 American live - action / animated animize dark fantasy illusion adventure film"
                }
            ]
        },
        {
            "class": "SynonymInsertion",
            "inputs": {
                "sentence": "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001"
            },
            "outputs": [
                {
                    "sentence": "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001"
                }
            ]
        },
        {
            "class": "SynonymInsertion",
            "inputs": {
                "sentence": "Neuroplasticity is a continuous processing allowing short-term, medium-term, and long-term remodeling of the neuronosynaptic organization."
            },
            "outputs": [
                {
                    "sentence": "Neuroplasticity is a continuous processing allowing short - term terminus, medium - term terminus, and long - term condition remodeling of the neuronosynaptic organization."
                }
            ]
        }
    ]
}
120 changes: 120 additions & 0 deletions transformations/synonym_insertion/transformation.py
@@ -0,0 +1,120 @@
import random
import re
from abc import ABC

import nltk
import spacy
from nltk.corpus import wordnet, stopwords

from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType
from initialize import spacy_nlp

"""
Base Class for implementing the different input transformations a generation should be robust against.
"""


class InsertWordTransformation:
    nlp = None

    def __init__(self, seed=0, max_outputs=1, prob=0.5):
        self.nlp = spacy_nlp if spacy_nlp else spacy.load("en_core_web_sm")
        self.max_outputs = max_outputs
        self.seed = seed
        self.prob = prob
        self.stopwords = stopwords.words('english')

    def untokenize(self, words: list):
        """
        Untokenizing a text undoes the tokenizing operation, restoring
        punctuation and spaces to the places that people expect them to be.
        Ideally, `untokenize(tokenize(text))` should be identical to `text`,
        except for line breaks.
        ref: https://github.com/commonsense/metanl/blob/master/metanl/token_utils.py#L28
        """
        text = " ".join(words)
        step1 = (
            text.replace("`` ", '"').replace(" ''", '"').replace(". . .", "...")
        )
        step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
        step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
        step4 = re.sub(r" ([.,:;?!%]+)$", r"\1", step3)
        step5 = (
            step4.replace(" '", "'")
            .replace(" n't", "n't")
            .replace("can not", "cannot")
        )
        step6 = step5.replace(" ` ", " '")
        return step6.strip()

    def transform(self, input_text: str):
        random.seed(self.seed)
        pos_wordnet_dict = {
            "VERB": "v",
            "NOUN": "n",
            "ADV": "r",
            "ADJ": "s",
        }
        doc = self.nlp(input_text)
        results = set()
        for _ in range(self.max_outputs):
            result = []
            for token in doc:
                word = token.text
                wordnet_pos = pos_wordnet_dict.get(token.pos_)
                if not wordnet_pos:
                    result.append(word)
                elif word in self.stopwords:
                    result.append(word)
                else:
                    synsets = wordnet.synsets(word, pos=wordnet_pos)
                    if len(synsets) > 0:
                        synsets = [syn.name().split(".")[0] for syn in synsets]
                        synsets = [syn for syn in synsets if syn.lower() != word.lower()]
                        synsets = list(set(synsets))  # remove duplicate synonyms
                        if len(synsets) > 0 and random.random() < self.prob:
                            syn = random.choice(synsets)
                            syn = syn.replace("_", " ")
                            result.append(word)
                            result.append(syn)
                        else:
                            result.append(word)
                    else:
                        result.append(word)
            result = self.untokenize(result)  # rebuild the sentence
            results.add(result)
        return list(results)


"""
Insert words such as synonyms from WordNet via nltk.
"""


class SynonymInsertion(SentenceOperation, ABC):
    """
    This class is an implementation of synonym insertion in the sentence. Created by the Authors of TextAugment
    https://github.com/dsfsi/textaugment
    """
    tasks = [TaskType.TEXT_CLASSIFICATION, TaskType.TEXT_TO_TEXT_GENERATION]
    languages = ["en"]
Collaborator

Also, please add keywords here.

Collaborator

I would also recommend adding a robustness evaluation to your PR so that it can be added to the leaderboard.

Contributor

  • keywords added
  • readme and test.json contain the results of the Robustness Evaluation for
    • Text Classification
    • Text Generation

    heavy = False
    keywords = [
        "tokenizer", "external-knowledge-based", "lexical", "low-precision", "low-coverage", "low-generations"
    ]

    def __init__(self, seed=0, prob=0.5, max_outputs=1):
        super().__init__(seed, max_outputs=max_outputs)
        nltk.download(["wordnet", "stopwords"])
        self.insert_word_transformation = InsertWordTransformation(
            seed, max_outputs, prob
        )

    def generate(self, sentence: str):
        result = self.insert_word_transformation.transform(
            input_text=sentence,
        )
        if self.verbose:
            print(f"Perturbed Input from {self.name()} : {result}")
        return result
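
The spacing in the `test.json` outputs (for example `live - action / animated`) is consistent with how `untokenize` rebuilds a sentence: it re-attaches common punctuation and contractions but has no rule for hyphens or slashes. The sketch below copies the helper's logic into a standalone function so this can be checked without spaCy or NLTK; the token lists are hand-written assumptions, not actual tokenizer output.

```python
import re


def untokenize(words):
    """Standalone copy of the untokenize logic from the transformation above."""
    text = " ".join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace(". . .", "...")
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    # Re-attach punctuation that is followed by a space or a quote character.
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    # Re-attach punctuation at the very end of the text.
    step4 = re.sub(r" ([.,:;?!%]+)$", r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace("can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()


print(untokenize(["Hello", ",", "world", "!"]))  # Hello, world!
print(untokenize(["do", "n't", "stop"]))         # don't stop
print(untokenize(["short", "-", "term"]))        # short - term
```

Commas, sentence-final punctuation, and `n't` are joined back onto the preceding word, while a hyphen that arrives as its own token stays surrounded by spaces. If exact round-tripping of hyphenated words matters, one option would be an extra rule for `-` and `/`, at the cost of changing the expected outputs in `test.json`.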