Open
25 commits
e562770
Create branch.
sajantanand Sep 1, 2021
beff00f
Working barebones version.
sajantanand Sep 1, 2021
90b1f43
No generated punctuation.
sajantanand Sep 1, 2021
18c4ef9
Update requirements.txt
sajantanand Sep 2, 2021
5b8def2
Passing pytest.
sajantanand Sep 4, 2021
805d49e
Merge branch 'GEM-benchmark:main' into random_walk
sajantanand Sep 4, 2021
e06451e
Update README.md
sajantanand Sep 4, 2021
2fd4753
Update README.md
sajantanand Sep 4, 2021
3d61fde
Update transformation.py
sajantanand Sep 4, 2021
41cb2ff
Truncating long sequences.
sajantanand Sep 4, 2021
4056276
Merging using colab.
sajantanand Sep 4, 2021
38663ed
Working truncation of very long messages.
sajantanand Sep 4, 2021
bc5ff32
Update README.md
sajantanand Sep 4, 2021
0f85695
Fix partial word masking.
sajantanand Sep 12, 2021
4a1450d
New json.
sajantanand Sep 12, 2021
0f36104
Address reviewer comments and improve documentation.
sajantanand Sep 21, 2021
296ecda
Update README.md
sajantanand Sep 21, 2021
e27b2d5
Merge branch 'GEM-benchmark:main' into random_walk
sajantanand Sep 21, 2021
f7325fd
Merge branch 'GEM-benchmark:main' into random_walk
sajantanand Oct 5, 2021
f5ca532
Adding named entities.
sajantanand Oct 5, 2021
efac865
Merge branch 'GEM-benchmark:main' into random_walk
sajantanand Oct 11, 2021
2a5941c
Merge branch 'GEM-benchmark:main' into random_walk
sajantanand Oct 13, 2021
3fc94eb
Evaluation results.:
sajantanand Oct 13, 2021
6f88999
Fix tabs and add option to choose least probable replacement.
sajantanand Oct 13, 2021
ee57b60
Add "descending" parameter to README.md
sajantanand Oct 13, 2021
1 change: 1 addition & 0 deletions test/mapper.py
@@ -71,6 +71,7 @@
"pinyin",
"punctuation",
"quora_trained_t5_for_qa",
"random_walk",
"sentence_reordering",
"synonym_substitution",
"token_replacement",
62 changes: 62 additions & 0 deletions transformations/random_walk/README.md
@@ -0,0 +1,62 @@
# Random Walk using Masked-Language Modeling
This transformation performs a random walk on the original sentence by randomly masking a word and replacing it with a suggestion from the BERT language model.

Author names:

- Sajant Anand (sajant@berkeley.edu, UC Berkeley)
- Roy Rinberg (royrinberg@gmail.com, Columbia University)
- Jamie Simon (james.simon@berkeley.edu, UC Berkeley)
- Chandan Singh (chandan_singh@berkeley.edu, UC Berkeley)

## Data and Code Provenance

This transformation requires the 'bert-large-cased' pretrained model (~1 GB) from the Hugging Face Transformers library and the 'all-mpnet-base-v2' pretrained model (~400 MB) from the Sentence Transformers library. Provided that the libraries are installed (as they should be from 'requirements.txt'), these models will be downloaded the first time this transformation is run. Both libraries are released under the Apache 2.0 license. Additionally, the spaCy library is required but is installed by default when using this benchmark.

## What type of a transformation is this?
This transformation acts as a perturbation to test robustness and to generate sentences with similar syntactic content. By randomly replacing words with their most likely replacements, as determined by a bidirectional model that incorporates context clues from previous and later words, we hope to generate similar sentences that make grammatical sense. We measure the similarity between the original and random-walked sentences by embedding both sentences and then calculating the cosine similarity between the embedded vectors.

## How it works
At each step in the random walk, we randomly choose a word and replace it with the mask token recognized by BERT. Care is taken to preserve punctuation where possible so that the generated sentence has the same punctuation as the original sentence. Additionally, we can exclude named entities found by the spaCy model from random selection for replacement. With a word masked, we run BERT on the sentence and apply a softmax to the output logits. We then select the highest-probability replacement words for the masked token and use these to construct new sentences. Note that BERT has a maximum input length of 512 tokens, so for long inputs we split the sentence into chunks of fewer than 512 tokens.
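
A single step of the walk can be sketched as follows. This is a minimal illustration rather than the code in `transformation.py`: the helper names (`softmax`, `top_k`, `mask_random_word`) are hypothetical, and the tiny `vocab` and `logits` lists are made-up stand-ins for the scores BERT would return for the masked position.

```python
import math
import random

MASK = "[MASK]"  # mask token used by BERT tokenizers


def softmax(logits):
    """Turn raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(vocab, logits, k, descending=True):
    """Return k (word, probability) pairs, most probable first by default."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda p: p[1], reverse=descending)
    return ranked[:k]


def mask_random_word(words, rng):
    """Replace one randomly chosen word with the mask token."""
    i = rng.randrange(len(words))
    masked = list(words)
    masked[i] = MASK
    return i, masked


rng = random.Random(0)
words = "I bought the French book last week".split()
position, masked = mask_random_word(words, rng)

# Assumption: toy scores for the masked slot; a real run would take these from BERT.
vocab = ["book", "box", "picture", "car"]
logits = [2.0, 1.0, 0.5, -1.0]

candidates = top_k(vocab, logits, k=2)
new_sentences = [
    " ".join(masked[:position] + [word] + masked[position + 1:])
    for word, _ in candidates
]
```

Passing `descending=False` to the hypothetical `top_k` helper returns the least probable candidates instead, which is one plausible reading of the `descending` option described in this README.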

The differences between original and generated sentences are generally controlled by two class initialization parameters, `steps` and `k`.
- `steps`: number of random walk steps to do
- `k`: number of high probability replacements for the masked word to consider

This process generates $k^{\text{steps}}$ new sentences. We then randomly select a subset of these, as specified by `max_outputs`.
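
The growth in the number of candidates can be seen in a small sketch. The functions `walk` and `toy_propose` are hypothetical: `toy_propose` stands in for the masked-language-model step and simply appends a tag so that every branch stays distinct.

```python
import random


def walk(sentence, steps, k, propose):
    """Expand every sentence in the frontier into k successors at each step."""
    frontier = [sentence]
    for _ in range(steps):
        frontier = [new for s in frontier for new in propose(s, k)]
    return frontier


def toy_propose(sentence, k):
    # Assumption: a stand-in for "mask one word and take the top-k replacements".
    return [f"{sentence}-{i}" for i in range(k)]


rng = random.Random(0)
candidates = walk("s", steps=3, k=2, propose=toy_propose)  # 2**3 = 8 sentences

# Randomly keep at most max_outputs of the generated sentences.
max_outputs = 3
chosen = rng.sample(candidates, min(max_outputs, len(candidates)))
```

Because the frontier multiplies by `k` at every step, both runtime and memory grow exponentially in `steps` whenever `k > 1`.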

The seed of the random generators (both from `numpy` and the `random` module) is set by the `seed` parameter in the class initializer. Choosing a fixed value will lead to reproducible results.
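
The effect of a fixed seed can be illustrated with the standard library alone. The function `sample_positions` is a hypothetical helper that picks which word index to mask at each step; it uses a local `random.Random` instance, whereas seeding the global generators would use `random.seed(seed)` and `numpy.random.seed(seed)`.

```python
import random


def sample_positions(seed, n_words, steps):
    """Pick the word index to mask at each step, using a seeded generator."""
    rng = random.Random(seed)
    return [rng.randrange(n_words) for _ in range(steps)]


# The same seed yields the same sequence of masked positions.
first = sample_positions(seed=0, n_words=12, steps=5)
second = sample_positions(seed=0, n_words=12, steps=5)
```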

Sentence similarity is computed by first mapping the original and random-walked sentences to 768-dimensional vectors using a pre-trained sentence transformer. We then calculate the cosine similarity between the two vectors. We note that generated sentences with low similarity to the original sentence will still typically make grammatical sense; however, the meaning of the sentence may not be close to the original (e.g. changing the verb 'love' to 'hate'). The class initialization function takes a parameter `sim_req`, which is the minimum similarity score that a generated sentence must have to be considered valid.
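
The similarity filter can be sketched in plain Python. The 3-dimensional vectors below are made-up stand-ins for the 768-dimensional sentence embeddings, and `filter_by_similarity` is a hypothetical helper, not the function used in `transformation.py`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_by_similarity(original_vec, candidates, sim_req):
    """Keep sentences whose embedding is at least sim_req similar to the original."""
    return [
        sentence
        for sentence, vec in candidates
        if cosine_similarity(original_vec, vec) >= sim_req
    ]


# Assumption: toy embeddings chosen so one candidate passes and one fails.
original = [1.0, 0.0, 1.0]
candidates = [
    ("close paraphrase", [0.9, 0.1, 1.1]),
    ("unrelated sentence", [-1.0, 1.0, 0.0]),
]
kept = filter_by_similarity(original, candidates, sim_req=0.5)
```

With `sim_req = 0`, every candidate with a non-negative similarity passes the filter.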

Finally a boolean `names` specifies whether or not we replace named entities and a boolean `descending` controls the order of the most probable tokens for masked-word replacement.

## What tasks does it intend to benefit?
This perturbation would benefit all tasks which have a sentence/paragraph/document as input like text classification, text generation, etc. Evaluating the perturbation using Google Colab is currently in progress.

## Robustness Evaluation

This transformation was evaluated with the model aychang/roberta-base-imdb on the test[:20%] split of the imdb dataset. Note: due to the computational demands of this transformation and our limited resources (our only GPU access is through Colab), we evaluated the transformation with the following parameters:
- `seed = 0`
- `max_outputs = 1` : Produce a single sentence
- `steps = 5` : Randomly select a word to replace 5 times
- `k = 1`: Number of high probability replacements for the masked word to consider
- `sim_req = 0` : Similarity requirement for generated sentences (long sentences tend to have low similarity)
- `named_entities = True` : Do not replace named entities
- `descending = True` : Choose most probable replacements (we will rarely use `False`; we included it for kicks.)

Wall Time: 00:03:52 (DD:HH:MM)
Performance: Of 1000 original sentences, 985 successfully transformed and 15 unchanged (0.985 perturb rate). Accuracy: 96.0 -> 96.0

Performance is strongly affected by parameters `steps` and `k`, as larger values of each will lead to greater variation in generated sentences, at the expense of longer runtimes.

## What are the limitations of this transformation?

This transformation can generate nonsensical words when the random walk has many steps (roughly, when `steps` is comparable to or larger than the number of words in the sentence).

## References
1) Saketh Kotamraju; "How to use BERT from the Hugging face transformer library"; https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209

As far as we know, this type of transformation where words are randomly perturbed has not been studied in published literature. Random walks have been used to measure sentence similarity, e.g. the papers listed below.

2) Daniel Ramage, Anna N. Rafferty, and Christopher D. Manning; "Random Walks for Text Semantic Similarity"; https://nlp.stanford.edu/pubs/wordwalk-textgraphs09.pdf
3) Ahmed Hassan, Amjad Abu-Jbara, Wanchen Lu, and Dragomir Radev; "A Random Walk–Based Model for Identifying Semantic Orientation"; https://aclanthology.org/J14-3003.pdf
2 changes: 2 additions & 0 deletions transformations/random_walk/__init__.py
@@ -0,0 +1,2 @@
from .transformation import *

2 changes: 2 additions & 0 deletions transformations/random_walk/requirements.txt
@@ -0,0 +1,2 @@
sentence-transformers==2.0.0
transformers==4.6.1
90 changes: 90 additions & 0 deletions transformations/random_walk/test.json
@@ -0,0 +1,90 @@
{
"type": "random_walk",
"test_cases": [
{
"class": "RandomWalk",
"inputs": {
"sentence": "Andrew finally returned the French book to Chris that I bought last week."
},
"outputs": [
{
"sentence": "She finally returned the French book to me that I bought last year."
},
{
"sentence": "She finally returned the picture book for me that I bought last year."
},
{
"sentence": "Andrew finally returned the French box to Chris that I bought last week."
}
]
},
{
"class": "RandomWalk",
"inputs": {
"sentence": "Sentences with gapping, such as Paul likes coffee and Mary tea, lack an overt predicate to indicate the relation between two or more arguments."
},
"outputs": [
{
"sentence": "Sentences with gapping, such as Paul likes coffee and Mary tea, lack an explicit predicate to explain the relation between two or more arguments."
},
{
"sentence": "Sentences involving gapping, such as John likes coffee and Mary tea, lack an appropriate predicate to indicate the relation between two or three arguments."
},
{
"sentence": "Examples with gapping, Such as Paul likes coffee and Mary tea, lack an appropriate predicate to indicate the relation between two or more arguments."
}
]
},
{
"class": "RandomWalk",
"inputs": {
"sentence": "Alice in Wonderland is a 2010 American live-action/animated dark fantasy adventure film"
},
"outputs": [
{
"sentence": "Alice in Wonderland is a 2010 American live-Action/Animated romantic fantasy adventure film"
},
{
"sentence": "Alice In Wonderland is a 2010 Canadian live-Action/animated dark fantasy adventure film"
},
{
"sentence": "Alice in Wonderland is a 2010 American live-Action/Animated dark fantasy adventure film"
}
]
},
{
"class": "RandomWalk",
"inputs": {
"sentence": "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001"
},
"outputs": [
{
"sentence": "Ujjal Dev who served as the Premier of Sri Columbia from 2000 until 2001"
},
{
"sentence": "Ram Dev Dosanjh served as 33rd Premier of British Columbia from 2000 until 2003"
},
{
"sentence": "and Dev Dosanjh serving as Deputy Premier of British Columbia from 2000 to 2001"
}
]
},
{
"class": "RandomWalk",
"inputs": {
"sentence": "Neuroplasticity is a continuous processing allowing short-term, medium-term, and long-term remodeling of the neuronosynaptic organization."
},
"outputs": [
{
"sentence": "Neuroplasticity is a continuous processing allowing short-term, mid-term, and long-term remodeling of the brain organization."
},
{
"sentence": "Neuroplasticity is a neural processing allowing short-term, medium-term, and long-term remodeling of the neuronosynaptic organization."
},
{
"sentence": "It is a dynamic processing allowing short-duration, medium-term, and long-term remodeling of the neuronosynaptic organization."
}
]
}
]
}