This work was done as part of Larisa Poghosyan's capstone project for the BS in Data Science degree at the Zaven and Sonia Akian College of Science and Engineering, supervised by Vahe Hakobyan.
We provide the precomputed embeddings for each model on each dataset discussed in the project. You can find the assets at: https://github.com/larissapoghosyan/Capstone_Project/releases/tag/embeddings
Alternatively, you can replicate the results presented in this work by following the instructions below; see the corresponding sections for details.
The notebook Transformers_BERT_embeddings_both_datasets provides a pipeline to extract BERT embeddings and save them as h5py files (a minimal saving/loading sketch is shown below). The notebooks W2V_FastText_embeddings_both_datasets and ELMo_embeddings_both_datasets provide Word2Vec, FastText, and ELMo embeddings, respectively.
The notebook baseline_classifiers_IMDb_Dataset contains the code to calculate point estimates for all of the listed models on the IMDb Movie Reviews sentiment classification task. Pyro_for_pp_IMDb contains the probabilistic programming pipeline using Pyro; this notebook also provides the accuracy distributions and HDI plots for the IMDb dataset.
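For reference, here is a minimal sketch of storing and reloading an embedding matrix with h5py; the file name and dataset key are illustrative, not necessarily the ones used in the notebooks:

```python
import h5py
import numpy as np

# Illustrative only: a (num_samples, hidden_dim) matrix of sentence embeddings.
embeddings = np.random.rand(1000, 768).astype(np.float32)

# Save the matrix to an HDF5 file.
with h5py.File("bert_embeddings_imdb.h5", "w") as f:
    f.create_dataset("embeddings", data=embeddings, compression="gzip")

# Load it back for the downstream classifiers.
with h5py.File("bert_embeddings_imdb.h5", "r") as f:
    embeddings = f["embeddings"][:]
```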
To retrieve the IMDb data, please see:
https://github.com/larissapoghosyan/Capstone_Project/releases/download/embeddings/IMDb_Reviews.csv
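Once downloaded, the CSV can be loaded with pandas, for example (the column layout is not assumed here):

```python
import pandas as pd

# Load the IMDb reviews downloaded from the release page above.
df = pd.read_csv("IMDb_Reviews.csv")
print(df.shape)
print(df.head())
```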
To download Word2Vec pretrained embeddings, please run:
import gensim.downloader as api
w2v = api.load('word2vec-google-news-300')
To download FastText embeddings, please run:
wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
unzip wiki-news-300d-1M.vec.zip
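One way to load the unzipped vectors is with gensim, for example:

```python
from gensim.models import KeyedVectors

# The .vec file is in plain word2vec text format, so gensim can read it directly.
ft = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")
print(ft["movie"].shape)  # (300,)
```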
To extract ELMo embeddings, please run:
pip install allennlp==0.9.0
pip install flair
pip install sacremoses
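With these packages installed, one possible route for extracting ELMo token embeddings is flair's wrapper around the AllenNLP ELMo implementation; this is a minimal sketch and not necessarily the exact pipeline used in the notebook:

```python
from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings

# ELMoEmbeddings relies on AllenNLP under the hood (hence the allennlp dependency).
elmo = ELMoEmbeddings()

sentence = Sentence("This movie was surprisingly good.")
elmo.embed(sentence)

# Each token now carries its contextual ELMo vector.
for token in sentence:
    print(token.text, token.embedding.shape)
```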
To use the pre-trained models of TinyBERT, BERT, RoBERTa, Sentence-BERT and DistilBERT, run the following code to install Hugging Face transformers:
pip install transformers
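As an illustration (a minimal sketch, not the notebook's exact code), sentence embeddings can be obtained from a Hugging Face checkpoint like this:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Other listed checkpoints (e.g. DistilBERT, RoBERTa, TinyBERT, Sentence-BERT) can be
# substituted for "bert-base-uncased".
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("This movie was surprisingly good.", return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden states into a single sentence vector (one common choice).
sentence_embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(sentence_embedding.shape)  # torch.Size([768])
```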
To run the Pyro pipeline, please use the following code:
pip3 install pyro-ppl
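To illustrate what the Pyro pipeline is about, here is a minimal sketch (not the notebook's exact model, and the counts are hypothetical) that infers a posterior over a classifier's accuracy from its number of correct test predictions; an HDI can then be computed from the posterior samples:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

# Hypothetical numbers: the classifier got `correct` out of `total` test examples right.
correct, total = torch.tensor(4300.0), torch.tensor(5000.0)

def model(correct, total):
    # Uniform Beta(1, 1) prior over the true accuracy.
    acc = pyro.sample("accuracy", dist.Beta(1.0, 1.0))
    pyro.sample("obs", dist.Binomial(total, acc), obs=correct)

nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_samples=1000, warmup_steps=200)
mcmc.run(correct, total)

samples = mcmc.get_samples()["accuracy"]
print(samples.mean(), samples.std())
```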
All of the experiments were performed on Colab and Colab Pro+ on a single (K40 / V100 / P100) GPU. A GPU is not required for the project, but running on CPU is expected to take much longer. For example, the ELMo experiments take approximately 1 hour on a K40 GPU, while on CPU they take up to 10 hours.
Pre-trained models like BERT, Word2Vec, FastText, etc., are widely used in many NLP applications such as chatbots, text classification, and machine translation. Such models are trained on huge corpora of text data and can capture statistical, semantic, and relational properties of the language. As a result, they provide numeric representations of text tokens (words, sentences) that can be used in downstream tasks. Having such pre-trained models off the shelf is convenient in practice, as it may not be possible to obtain good-quality representations by training them from scratch due to lack of data or resource constraints.
In practical settings, such embeddings are often used as inputs to models built for the task at hand. For example, in a sentence classification task, it is possible to train a Logistic Regression on top of averaged Word2Vec embeddings (a sketch is given below). Using such embeddings on real-life industrial problems can produce seemingly optimistic improvements over baselines; however, it is not clear whether those improvements are reliable. In our study, we investigate this question by formulating multiple tasks that are applicable and viable in industry and by replicating the typical data-science workflow. Our goal is to construct models of varying sophistication that use embeddings as inputs and to apply a methodology for reporting confidence bounds on the metric of interest. With this experiment we hope to better understand the phenomenon of results that look optimal on paper but may not be optimal in practice, and thereby to find a reliable method that aids decision making and facilitates model selection.
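Below is a minimal sketch of that baseline (toy texts and labels stand in for the IMDb reviews; the helper function is illustrative, not the notebook's exact code):

```python
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the IMDb reviews (1 = positive, 0 = negative).
texts = ["a wonderful and moving film", "a dull and pointless movie"]
labels = [1, 0]

# Pretrained Word2Vec used as a fixed featurizer.
w2v = api.load("word2vec-google-news-300")

def average_embedding(text, kv, dim=300):
    # Average the vectors of in-vocabulary tokens; zeros if no token is known.
    vectors = [kv[tok] for tok in text.lower().split() if tok in kv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

X = np.vstack([average_embedding(t, w2v) for t in texts])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.score(X, labels))
```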
Kernel Density Estimation plots of Accuracy Distributions for each Model on IMDb Movie Sentiment Classification task