initial commit of alliteration filter #251
Open
ahoimarie wants to merge 12 commits into GEM-benchmark:main from ahoimarie:alliteration
Changes from 11 commits
Commits
3c12348 re-added submission, after having cleaned up the tree (ahoimarie)
e09da52 added keyword (ahoimarie)
527ffb0 added evaluation results and tweaked code to make evaluations work. (ahoimarie)
64e174d added Data statement (ahoimarie)
9940840 added minimum_alliteration_length, included all input text sentences,… (ahoimarie)
dbcd9fb updated README and robustness scores (ahoimarie)
7172a66 changed criterion to check for alliterations (ahoimarie)
1ac30cf corrected docstring for rolling_window (ahoimarie)
866a310 Update Makefile (ahoimarie)
58e744c remove Makefile from tracking (ahoimarie)
5abb695 added Makefile (ahoimarie)
95e2d67 Removed a modified makefile from pull request (ahoimarie)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| ## Alliteration filter | ||
|
|
||
| **Author: Marie Tolkiehn**\ | ||
| Center for Data and Computing in Natural Sciences, Universität Hamburg\ | ||
| marie.tolkiehn@desy.de | ||
|
|
||
|
|
||
| ## What type of a filter is this? | ||
|
|
||
| This filter returns True if any of the input sentences is an alliteration and False otherwise. | ||
| By default, stop words are removed and do not count toward the alliteration. | ||
| However, if a sentence consists solely of stop words, they are not removed. | ||
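The stop-word rule described above can be sketched in a few lines. This is a minimal illustration, not the filter's own code: `content_words` and the tiny `STOP_WORDS` set are hypothetical names introduced here, whereas the filter itself relies on spaCy's English stop-word list.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the filter in this pull request uses spaCy's English stop words instead.
STOP_WORDS = {"a", "an", "the", "of", "is", "it", "in", "she", "some", "about"}


def content_words(sentence: str, remove_stopwords: bool = True) -> list:
    """Lowercase, strip ASCII punctuation, split, and optionally drop stop words."""
    words = (
        sentence.lower()
        .translate(str.maketrans("", "", string.punctuation))
        .split()
    )
    # A sentence made up solely of stop words is left untouched.
    if remove_stopwords and not set(words).issubset(STOP_WORDS):
        words = [w for w in words if w not in STOP_WORDS]
    return words


print(content_words("Peter Piper picked a peck of pickled peppers."))
# ['peter', 'piper', 'picked', 'peck', 'pickled', 'peppers']
print(content_words("It is in the."))
# ['it', 'is', 'in', 'the']
```

Keeping all-stop-word sentences intact means such a sentence can still be tested for alliteration instead of collapsing to an empty word list.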
|
|
||
| A sentence is deemed an alliteration if it contains words starting with the same character or digraph ("ch", "ph", "sh", "th"). | ||
| The minimum alliteration length governs how many words sharing that initial character or digraph are required for a valid alliteration. | ||
| The default minimum alliteration length is 3. | ||
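As a rough sketch of the graphematic criterion, the snippet below extracts the initial character or digraph of each word and counts how often each one occurs. The helper names `initial_sound` and `count_initial_sounds` are assumptions made for this example; only the digraph list comes from the description above.

```python
from collections import Counter

# Digraphs that are treated as a single initial sound, as listed above.
DIGRAPHS = ("ch", "ph", "sh", "th")


def initial_sound(word: str) -> str:
    """Return the leading digraph if there is one, else the first character."""
    word = word.lower()
    if word[:2] in DIGRAPHS:
        return word[:2]
    return word[:1]


def count_initial_sounds(words) -> Counter:
    """Count how many words share each initial character or digraph."""
    return Counter(initial_sound(w) for w in words)


counts = count_initial_sounds(["Sandy", "showed", "Shawn", "shady", "shandy"])
print(counts["sh"], counts["s"])  # 4 1
```

Checking digraphs first is what keeps "Sandy" (initial "s") from being counted together with "shady" (initial "sh").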
|
|
||
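| These alliterative words do not need to appear contiguously in the sentence. | ||
| The check runs over rolling windows of `min_alliteration_length + allowed_offwords` words (`allowed_offwords` defaults to 2), so only a limited number of non-alliterative words may intervene. | ||
| For example, "Peter Aquarium prepared a pepperoni pizza." is a valid alliteration | ||
| because it contains at least the (default) 3 alliterative non-stopword words (despite "Aquarium"). | ||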
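The non-contiguous check can be illustrated with a pure-Python sliding window. This is a simplified sketch rather than the NumPy-based code in `filter.py`: `has_alliteration` is a hypothetical name, and it assumes the initial sounds have already been extracted. The parameters `min_alliteration_length` and `allowed_offwords` mirror the constructor arguments of the filter.

```python
from collections import Counter


def has_alliteration(sounds, min_alliteration_length=3, allowed_offwords=2):
    """Return True if some window holds enough identical initial sounds.

    The window spans min_alliteration_length + allowed_offwords items,
    shrunk to the sequence length when the sequence is shorter.
    """
    window = min(len(sounds), min_alliteration_length + allowed_offwords)
    if window == 0:
        return False
    for start in range(len(sounds) - window + 1):
        # Count of the most frequent initial sound inside this window.
        top_count = Counter(sounds[start:start + window]).most_common(1)[0][1]
        if top_count >= min_alliteration_length:
            return True
    return False


# "Peter Aquarium prepared a pepperoni pizza." without the stop word "a":
print(has_alliteration(["p", "a", "p", "p", "p"]))  # True
# Three "p" words spread too thinly to fit into one five-word window:
print(has_alliteration(["p", "a", "b", "c", "p", "d", "e", "p"]))  # False
```

A larger `allowed_offwords` widens the window and therefore tolerates more interruptions between alliterative words.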
|
|
||
| ## Why is this filter important? | ||
| Alliterations attract audiences. | ||
| Alliterations are a stylistic device and trope of literature or poetry. | ||
| However, alliterations are around us all the time. From newspaper headlines | ||
| ("Beer Baron Beats Banner" or "Banner Bars Booze (Booze Barred By Banner)" (c) The Simpsons) | ||
| to ads ("Taco Tuesdays"), company/brand names ("Coca Cola", "Bed, Bath & Beyond", "PayPal"), | ||
| protagonists ("Peter Pevensie", "Peter Pan", "Bilbo Baggins", "Donald Duck"), | ||
| and even academic publications, writers often use alliterations to catch the reader's (or listener's) attention: | ||
| through sound repetition, they are catchy and easy to remember. | ||
| Alliterations generally sound pleasing, and different phonemes create different rhythms and vibes. | ||
| For example, alliterations starting with S are often connected to snake-like features, | ||
| whereas alliterations with plosives such as P create a particular rhythm. | ||
|
|
||
| This filter could be used to check how prevalent alliterations are in various types of text and whether they are especially common in particular domains. | ||
| A good language model may then be able to generate synonymous alliterations from non-alliterative texts. | ||
|
|
||
| ## Robustness Evaluation | ||
| ### Removing Stopwords (True), minimum alliteration length = 3 | ||
| Here is the performance of the model on the filtered set: | ||
| * **IMDB**\ | ||
| `python evaluate.py -f Alliteration -task "TEXT_CLASSIFICATION" -m "textattack/roberta-base-imdb" -d "imdb" -p 20`\ | ||
| The accuracy on this subset which has 612 examples = 95.0 | ||
|
|
||
| * **SST-2**\ | ||
| `python evaluate.py -f Alliteration -task "TEXT_CLASSIFICATION" -m "textattack/roberta-base-SST-2" -d "sst2" -p 20`\ | ||
| The accuracy on this subset which has 17 examples = 88.0 | ||
|
|
||
| * **QQP** \ | ||
| `python evaluate.py -f Alliteration -task "TEXT_CLASSIFICATION" -m "textattack/bert-base-uncased-QQP" -d "qqp" -p 20`\ | ||
| The accuracy on this subset which has 31 examples = 97.0 | ||
|
|
||
| * **MNLI**\ | ||
| `python evaluate.py -f Alliteration -task "TEXT_CLASSIFICATION" -m "roberta-large-mnli" -d "multi_nli" -p 20`\ | ||
| The accuracy on this subset which has 128 examples = 91.0 | ||
|
|
||
|
|
||
| ### Not removing stopwords (False), minimum alliteration length = 3 | ||
| * **IMDB**\ | ||
| `python evaluate.py -f Alliteration -task "TEXT_CLASSIFICATION" -m "textattack/roberta-base-imdb" -d "imdb" -p 20`\ | ||
| The accuracy on this subset which has 886 examples = 95.0 | ||
| * **SST-2**\ | ||
| `python evaluate.py -f Alliteration -task "TEXT_CLASSIFICATION" -m "textattack/roberta-base-SST-2" -d "sst2" -p 20`\ | ||
| The accuracy on this subset which has 34 examples = 97.0 | ||
| * **QQP** \ | ||
| `python evaluate.py -f Alliteration -task "TEXT_CLASSIFICATION" -m "textattack/bert-base-uncased-QQP" -d "qqp" -p 20`\ | ||
| The accuracy on this subset which has 111 examples = 94.0 | ||
| * **MNLI**\ | ||
| `python evaluate.py -f Alliteration -task "TEXT_CLASSIFICATION" -m "roberta-large-mnli" -d "multi_nli" -p 20`\ | ||
| The accuracy on this subset which has 233 examples = 92.0 | ||
|
|
||
|
|
||
|
|
||
| ## Data and code source | ||
| The data were created entirely by the author, | ||
| except for the test case involving "Peter and his famous pickled peppers", which first appeared in print in 1813 in John Harris's Peter Piper's Practical Principles of Plain and Perfect Pronunciation. | ||
|
|
||
|
|
||
| ## What are the limitations of this filter? | ||
| There may be phonetic alliterations that are not captured by a graphematic approach. For example, `Phonetic` and `Fine` are phonetic alliterations but not graphematic ones. | ||
| This could be ameliorated by more sophisticated methods, e.g. by looking up each word in a pronouncing dictionary such as the Carnegie Mellon University Pronouncing Dictionary and comparing the initial phonemes. |
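To make the limitation concrete, here is a minimal sketch of a phonetic comparison. It assumes a lookup table that maps words to ARPAbet-style phoneme lists, in the spirit of a pronouncing dictionary; `TOY_PRONUNCIATIONS` is a hand-written, illustrative excerpt, and both function names are hypothetical. It is not part of this pull request.

```python
# Hand-written, illustrative entries in ARPAbet-style notation. A real
# implementation would load a full pronouncing dictionary instead.
TOY_PRONUNCIATIONS = {
    "phonetic": ["F", "AH0", "N", "EH1", "T", "IH0", "K"],
    "fine": ["F", "AY1", "N"],
    "knight": ["N", "AY1", "T"],
    "night": ["N", "AY1", "T"],
    "kite": ["K", "AY1", "T"],
}


def first_phoneme(word, lexicon=TOY_PRONUNCIATIONS):
    """Return the first phoneme of a word, or None if the word is unknown."""
    phones = lexicon.get(word.lower())
    return phones[0] if phones else None


def alliterate_phonetically(a, b, lexicon=TOY_PRONUNCIATIONS):
    """True if both words are known and start with the same phoneme."""
    pa, pb = first_phoneme(a, lexicon), first_phoneme(b, lexicon)
    return pa is not None and pa == pb


print(alliterate_phonetically("Phonetic", "Fine"))  # True
print(alliterate_phonetically("knight", "kite"))    # False
```

The second call shows the opposite failure mode of the graphematic approach: "knight" and "kite" share an initial letter but not an initial sound.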
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| from .filter import * |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,144 @@ | ||
| #!/usr/bin/env python3 | ||
| # -*- coding: utf-8 -*- | ||
|
|
||
| import string | ||
|
|
||
| import numpy as np | ||
| import spacy | ||
|
|
||
| from initialize import spacy_nlp | ||
| from interfaces.SentenceOperation import SentenceOperation | ||
| from tasks.TaskTypes import TaskType | ||
|
|
||
|
|
||
| class Alliteration(SentenceOperation): | ||
|     tasks = [TaskType.TEXT_CLASSIFICATION, TaskType.TEXT_TO_TEXT_GENERATION] | ||
|     languages = ["en"] | ||
|     keywords = ["morphological"] | ||
|
|
||
|     def __init__( | ||
|         self, | ||
|         stopwords: bool = True, | ||
|         min_alliteration_length: int = 3, | ||
|         allowed_offwords: int = 2, | ||
|     ): | ||
|         super().__init__() | ||
|         self.stopwords = stopwords | ||
|         self.min_alliteration_length = min_alliteration_length | ||
|         self.allowed_offwords = allowed_offwords | ||
|         self.nlp = spacy_nlp if spacy_nlp else spacy.load("en_core_web_sm") | ||
|
|
||
|     def filter(self, sentence: str = None, min_sentence_length=3) -> bool: | ||
|         """ | ||
|         This filter returns True if any of the input sentences is an alliteration. | ||
|         A sentence is deemed an alliteration if it contains at least min_alliteration_length (default 3) | ||
|         words that start with the same character or digraph. | ||
|         These alliterative words do not need to appear contiguously. | ||
|         This means that e.g. "Peter Aquarium prepared a pepperoni pizza." is an alliteration | ||
|         as it contains at least 3 alliterative non-stopword words (despite "Aquarium"). | ||
|         By default, stop words are removed and do not count toward the alliteration. | ||
|         """ | ||
|
|
||
|         def get_phonemes(word: str): | ||
|             """ | ||
|             Return the initial digraph of a word if it has one, otherwise its first character. | ||
|             Digraphs are checked first so that e.g. 'sand' and 'shady' do not alliterate. | ||
|             """ | ||
|             digraphs = ["ch", "ph", "sh", "th"] | ||
|             if word[:2] in digraphs: | ||
|                 return word[:2] | ||
|             else: | ||
|                 return word[:1] | ||
|
|
||
|         def segment_sentences(self, sentence, min_sentence_length): | ||
|             """ | ||
|             If the input contains multiple sentences, keep only the sentences that are longer than | ||
|             min_sentence_length characters and that contain at least one letter. | ||
|             """ | ||
|             sent = self.nlp(sentence.lstrip()) | ||
|             segmented_sentence = list(sent.sents) | ||
|             all_stopwords = self.nlp.Defaults.stop_words | ||
|             filt_sentences = [] | ||
|             for k in segmented_sentence: | ||
|                 # Skip 'sentences' that are too short or contain no letters | ||
|                 if ( | ||
|                     len(k.text) > min_sentence_length | ||
|                     and k.text.lower().islower() | ||
|                 ): | ||
|                     valid_sentences = k.text | ||
|                 else: | ||
|                     continue | ||
|
|
||
|                 # Convert to lower, remove punctuation, tokenize into words | ||
|                 sentenceS = ( | ||
|                     valid_sentences.lower() | ||
|                     .translate(str.maketrans("", "", string.punctuation)) | ||
|                     .split() | ||
|                 ) | ||
|
|
||
|                 if self.stopwords: | ||
|                     # Leave sentences that consist solely of stopwords untouched | ||
|                     if not set(sentenceS).issubset( | ||
|                         self.nlp.Defaults.stop_words | ||
|                     ): | ||
|                         # Remove all stopwords from our sentence | ||
|                         sentenceS = [ | ||
|                             word | ||
|                             for word in sentenceS | ||
|                             if word not in all_stopwords | ||
|                         ] | ||
|                 filt_sentences.append(sentenceS) | ||
|
|
||
|             return filt_sentences | ||
|
|
||
|         def rolling_window(data, windowlen): | ||
|             """ | ||
|             Create a 1-dimensional rolling window of size windowlen. | ||
|             If the windowlen is larger than the length of the data, use the length of the data instead. | ||
|             """ | ||
|             if len(data) < windowlen: | ||
|                 windowlen = len(data) | ||
|             shape = data.shape[:-1] + ( | ||
|                 data.shape[-1] - windowlen + 1, | ||
|                 windowlen, | ||
|             ) | ||
|             strides = data.strides + (data.strides[-1],) | ||
|             return np.lib.stride_tricks.as_strided( | ||
|                 data, shape=shape, strides=strides | ||
|             ) | ||
|
|
||
|         def find_contiguous_elements( | ||
|             elements, min_alliteration_length, allowed_offwords | ||
|         ): | ||
|             """ | ||
|             Create rolling windows of size min_alliteration_length + allowed_offwords | ||
|             and check if any window contains at least min_alliteration_length identical elements. | ||
|             Return True if such a window is found, False otherwise. | ||
|             """ | ||
|             rolling_sent = rolling_window( | ||
|                 elements, min_alliteration_length + allowed_offwords | ||
|             ) | ||
|
|
||
|             for windows in rolling_sent: | ||
|                 # Count the occurrences of the most frequent element in this window | ||
|                 if ( | ||
|                     windows == max(set(windows), key=sorted(windows).count) | ||
|                 ).sum() >= min_alliteration_length: | ||
|                     return True | ||
|
|
||
|             return False | ||
|
|
||
|         # Process input sentences | ||
|         sentenceS = segment_sentences(self, sentence, min_sentence_length) | ||
|
|
||
|         # Iterate through sentences | ||
|         sentence_count = [] | ||
|         for sen in sentenceS: | ||
|             cat_sentence = np.array([get_phonemes(word) for word in sen]) | ||
|             phonemes_bool = find_contiguous_elements( | ||
|                 cat_sentence, | ||
|                 self.min_alliteration_length, | ||
|                 self.allowed_offwords, | ||
|             ) | ||
|             sentence_count.append(phonemes_bool) | ||
|
|
||
|         return any( | ||
|             sentence_count | ||
|         )  # return True if any of the input sentences are alliterative |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| spacytextblob==3.0.1 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,118 @@ | ||
| { | ||
| "type": "alliteration", | ||
| "test_cases": [ | ||
| { | ||
| "class": "Alliteration", | ||
| "args": { | ||
| "stopwords": true | ||
| }, | ||
| "inputs": { | ||
| "sentence": "Andrew always asks Anne about anchovies." | ||
| }, | ||
| "outputs": true | ||
| }, | ||
| { | ||
| "class": "Alliteration", | ||
| "args": { | ||
| "stopwords": true | ||
| }, | ||
| "inputs": { | ||
| "sentence": "She showed Shawn shady shandy." | ||
| }, | ||
| "outputs": true | ||
| }, | ||
| { | ||
| "class": "Alliteration", | ||
| "args": { | ||
| "stopwords": true | ||
| }, | ||
| "inputs": { | ||
| "sentence": "She showed Shawn some shady shandy." | ||
| }, | ||
| "outputs": true | ||
| }, | ||
| { | ||
| "class": "Alliteration", | ||
| "args": { | ||
| "stopwords": true | ||
| }, | ||
| "inputs": { | ||
| "sentence": "Peter Piper picked a peck of pickled peppers." | ||
| }, | ||
| "outputs": true | ||
| }, | ||
| { | ||
| "class": "Alliteration", | ||
| "args": { | ||
| "stopwords": false | ||
| }, | ||
| "inputs": { | ||
| "sentence": "Andrew always asks Anne about anchovies." | ||
| }, | ||
| "outputs": true | ||
| }, | ||
| { | ||
| "class": "Alliteration", | ||
| "args": { | ||
| "stopwords": false | ||
| }, | ||
| "inputs": { | ||
| "sentence": "She showed Shawn shady shandy." | ||
| }, | ||
| "outputs": true | ||
| }, | ||
| { | ||
| "class": "Alliteration", | ||
| "args": { | ||
| "stopwords": false | ||
| }, | ||
| "inputs": { | ||
| "sentence": "She showed Shawn some shady shandy." | ||
| }, | ||
| "outputs": true | ||
| }, | ||
| { | ||
| "class": "Alliteration", | ||
| "args": { | ||
| "stopwords": false | ||
| }, | ||
| "inputs": { | ||
| "sentence": "Peter Piper picked a peck of pickled peppers." | ||
| }, | ||
| "outputs": true | ||
| }, | ||
| { | ||
| "class": "Alliteration", | ||
| "args": { | ||
| "stopwords": true | ||
| }, | ||
| "inputs": { | ||
| "sentence": "4 *((( ::). She showed Aquarium Shawn shady shandy. This is the second sentence Sandy sorted. It is imminent in Iowa." | ||
| }, | ||
| "outputs": true | ||
| }, | ||
| { | ||
| "class": "Alliteration", | ||
| "args": { | ||
| "stopwords": false, | ||
| "min_alliteration_length": 5 | ||
| }, | ||
| "inputs": { | ||
| "sentence": "4 *((( ::). She offered Shawn super shandy. This is the second sentence Sandy sorted. It is imminent in Iowa." | ||
| }, | ||
| "outputs": true | ||
| }, | ||
| { | ||
| "class": "Alliteration", | ||
| "args": { | ||
| "stopwords": true, | ||
| "min_alliteration_length": 5 | ||
| }, | ||
| "inputs": { | ||
| "sentence": "4 *((( ::). She offered Shawn super shandy. This is the second sentence Sandy sorted. It is imminent in Iowa." | ||
| }, | ||
| "outputs": false | ||
| } | ||
| ] | ||
| } | ||