Native Language Identification with Big Bird Embeddings

This is the repository for the code used in the paper Native Language Identification with Big Bird Embeddings by Sergey Kramp, Giovanni Cassani and Chris Emmery. The code is released under the MIT license.

To cite this work, please use the following BibTeX entry:

```bibtex
@inproceedings{Kramp2023NativeLI,
  title={Native Language Identification with Big Bird Embeddings},
  author={Sergey Kramp and Giovanni Cassani and Chris Emmery},
  year={2023},
  url={https://api.semanticscholar.org/CorpusID:261705924}
}
```

TL;DR

In this work, we used embeddings from a fine-tuned Big Bird model (Zaheer et al., 2020) to perform Native Language Identification (NLI) on the Reddit L2 dataset. Here you will find the code used to sample the data, the scripts used for fine-tuning, and the notebooks containing the experiments described in the paper.

What you will not find in this repository is the data itself. You can download the Reddit L2 dataset here (it appears as Reddit-L2 chunks).

The weights of the fine-tuned Big Bird model are available on HuggingFace.
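As a minimal sketch, the released weights can be loaded with the transformers library. Note that the model id below is a hypothetical placeholder; use the actual id from the HuggingFace page:

```python
# Minimal sketch: load the fine-tuned Big Bird weights from the Hugging Face Hub.
# NOTE: "SergeyKramp/bigbird-nli" is a hypothetical placeholder id, not the real one.
from transformers import AutoModel, AutoTokenizer

model_id = "SergeyKramp/bigbird-nli"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
```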

How to use the code

  • data: contains classes for working with the data.
    • databalancer.py: contains a DataBalancer class that balances the data by sampling chunks or authors from the Reddit L2 dataset directory and creating one folder per label (the author's native language rather than the author's country, as in the original dataset). To use this class, you need to have the Reddit L2 dataset downloaded and extracted.
    • dataprocessor.py: contains a DataProcessor class that is used to discover the text chunks from the Reddit L2 dataset directory and create datasets.
    • data_chunk.py: contains a Chunk class, created by the DataProcessor, that holds a text chunk and its metadata, such as the author name and the label. It also provides methods for tokenizing the text and getting token ids.
    • reddit_dataset.py: contains a RedditDataset class, created by the DataProcessor, that holds Chunks. It inherits from the PyTorch Dataset class and can be used with a PyTorch DataLoader for fine-tuning (see the fine-tuning sketch after this list for the Dataset pattern). It also provides a method for getting a pandas DataFrame with the data.
  • feature_extractors: contains classes for extracting linguistic features and embeddings from the data.
    • normal_feature_extractor.py: contains a NormalFeatureExtractor class that is used to extract linguistic features from the data. In particular, it extracts the following features (see the feature sketch after this list):

      1. Character and word n-grams using the scikit-learn implementation.
      2. Spelling mistake features using the symspellpy package.
      3. Grammar mistake features using the language-tool-python package.
      4. Function word usage, based on the list of function words from Volansky et al. (2015).
      5. POS (part-of-speech) tags using the NLTK library.
      6. Average sentence length.
    • transformer_feature_extractor.py: contains a TransformerFeatureExtractor class that is used to extract embeddings from the data. In our case, we used it with Big Bird, but it can be used with any model from the Hugging Face Hub (see the embedding sketch after this list).

  • fine_tuning_scripts: contains scripts for fine-tuning Big Bird, as well as some log files produced during fine-tuning (see the fine-tuning sketch after this list).
  • language_checkers: contains wrapper classes around the symspellpy and language-tool-python packages that are used by the NormalFeatureExtractor, as well as some required text files.
  • notebooks: all notebooks whose names start with "experiment" correspond to an experiment described in the paper. balance_data.ipynb contains the data balancing code, and figures.ipynb contains the code for generating the figures in the paper.
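To make the feature list above concrete, here is a minimal sketch of how such features can be computed with the same underlying libraries. This is not the repository's NormalFeatureExtractor API; the spelling and grammar features (symspellpy, language-tool-python) are omitted for brevity:

```python
# Minimal sketch of the kinds of features NormalFeatureExtractor computes,
# using the same underlying libraries (scikit-learn and NLTK).
import nltk
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

texts = ["This is an example chunk. It has two sentences."]

# 1. Character and word n-grams via scikit-learn.
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(texts)
word_ngrams = CountVectorizer(analyzer="word", ngram_range=(1, 2)).fit_transform(texts)

# 5. POS tags via NLTK.
tokens = nltk.word_tokenize(texts[0])
pos_tags = nltk.pos_tag(tokens)

# 6. Average sentence length, in tokens.
sentences = nltk.sent_tokenize(texts[0])
avg_sentence_length = sum(len(nltk.word_tokenize(s)) for s in sentences) / len(sentences)
```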
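And a minimal embedding sketch in the style of TransformerFeatureExtractor, using the base Big Bird checkpoint as a stand-in for the fine-tuned model. Mean pooling over the last hidden state is an assumption here, not necessarily the pooling used in the paper:

```python
# Minimal sketch: extract a document embedding with a Hugging Face model.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "google/bigbird-roberta-base"  # any Hub model works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("An example text chunk.", return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden state into a single embedding vector (an assumption).
embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
```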
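Finally, a hedged fine-tuning sketch showing the torch Dataset pattern together with the Hugging Face Trainer. The actual scripts in fine_tuning_scripts and the real RedditDataset/Chunk classes may differ; ToyRedditDataset, NUM_LABELS, and the toy data are placeholders:

```python
# Hedged sketch of sequence-classification fine-tuning; not the repository's
# actual fine-tuning script.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

NUM_LABELS = 5  # placeholder: set to the number of native-language classes

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/bigbird-roberta-base", num_labels=NUM_LABELS
)


class ToyRedditDataset(Dataset):
    """Toy stand-in for the repository's RedditDataset."""

    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


train_dataset = ToyRedditDataset(["an example text chunk"], [0])
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fine_tuned_models"),
    train_dataset=train_dataset,
)
trainer.train()
```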

Note

Throughout the code you might encounter folders that don't exist in the repository, in particular fine_tuned_models (used for storing the model checkpoints and tokenizers) and pickles (used for storing intermediate results). You can create these folders yourself, for example as shown below, or change the code to store the results in a different location.
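
For example, from the repository root:

```python
# Create the expected working directories at the repository root.
import os

for folder in ("fine_tuned_models", "pickles"):
    os.makedirs(folder, exist_ok=True)
```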
