- Hasan Alp Caferoğlu
- Ann Neşe Rende
- Ekrem Polat
For a detailed explanation of the Stance Detection project, please visit my personal blog.
In this section, the contents of the project files are briefly explained:
- T5-Text Translation
- DistilBERT and LSTM
- LSTM with GloVe embeddings
- DistilBERT fine-tune
- DistilBERT with stream data
We utilized the XStance dataset in our project. However, XStance contains sentences in German and French, so we needed to translate them into English. We first uploaded the dataset to Google Drive and set its path in Google Colab, then removed the unnecessary columns. Since the translation process takes a long time, we used only 10,000 rows from the dataset. The code then uses two pre-trained T5 models, one for French-to-English and one for German-to-English, to perform the translation, and finally exports the translated content as a CSV file for the later steps.
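A minimal sketch of this translation step is shown below. It assumes the generic `t5-base` checkpoint with T5's task prefixes; the exact checkpoints and column names used in the notebook may differ.

```python
# Hedged sketch: t5-base supports French->English and German->English via task prefixes.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def translate(text, source_lang):
    # Prepend the task prefix, generate the English translation, and decode it.
    prefix = f"translate {source_lang} to English: "
    inputs = tokenizer(prefix + text, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_length=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(translate("Das ist ein Beispiel.", "German"))
print(translate("Ceci est un exemple.", "French"))
```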
The following notebooks share the same data preprocessing. We first uploaded the dataset to Google Drive and set its path in Google Colab, then cleaned the text by removing mentions and #SemEval hashtags, and finally shuffled the dataset.
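The sketch below illustrates this preprocessing; the Drive path and the `text` column name are placeholders rather than the exact names used in the notebooks.

```python
import re
import pandas as pd

# Hypothetical path and column name; the notebooks use their own.
df = pd.read_csv("/content/drive/MyDrive/stance_dataset.csv")

def clean_text(text):
    text = re.sub(r"@\w+", "", text)                               # remove @mentions
    text = re.sub(r"#SemEval\w*", "", text, flags=re.IGNORECASE)   # remove #SemEval hashtags
    return text.strip()

df["text"] = df["text"].astype(str).apply(clean_text)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)     # shuffle rows
```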
Since this notebook combines DistilBERT and LSTM, the model has many parameters, so we used only 12,000 rows from our dataset to avoid memory issues. We first tokenized the inputs with the DistilBERT tokenizer and then extracted embeddings from the tokenized inputs using a pre-trained DistilBERT model. Next, we built a network consisting of the DistilBERT embeddings, an LSTM layer, and Dense layers. Finally, we trained the model, tracked performance metrics, and evaluated it on the test dataset. We also ran several experiments with different model structures and hyperparameters.
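A minimal sketch of this setup is given below; the placeholder data, layer sizes, and hyperparameters are assumptions for illustration, not the values used in the notebook.

```python
import numpy as np
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertModel

MAX_LEN = 64
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
bert = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

texts = ["example sentence one", "example sentence two"]   # placeholder data
labels = np.array([0, 1])                                   # placeholder labels

enc = tokenizer(texts, padding="max_length", truncation=True,
                max_length=MAX_LEN, return_tensors="tf")
# Extract contextual embeddings once, then feed them to a small LSTM head.
embeddings = bert(enc["input_ids"], attention_mask=enc["attention_mask"]).last_hidden_state

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN, 768)),        # DistilBERT hidden size is 768
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(embeddings, labels, epochs=3, batch_size=32)
```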
To solve the memory issue and train the model with more data, we used word embeddings from GloVe. We added helper functions to read the GloVe embeddings from file, load them into a dictionary, and create an embedding matrix that serves as the input to the LSTM. This time, instead of the DistilBERT tokenizer, we used the Keras tokenizer, and we replaced the Input layer in front of the bidirectional LSTM with an Embedding layer matching the GloVe vector dimensions.
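The sketch below outlines this GloVe-based pipeline, assuming 100-dimensional `glove.6B` vectors; the vocabulary size, sequence length, file path, and placeholder data are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS, MAX_LEN, EMB_DIM = 20000, 64, 100
texts = ["example sentence one", "example sentence two"]   # placeholder data
labels = np.array([0, 1])

tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token="<unk>")
tokenizer.fit_on_texts(texts)
sequences = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)

# Load GloVe vectors into a dictionary, then build an embedding matrix for our vocabulary.
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

embedding_matrix = np.zeros((MAX_WORDS, EMB_DIM))
for word, i in tokenizer.word_index.items():
    if i < MAX_WORDS and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(MAX_WORDS, EMB_DIM, weights=[embedding_matrix],
                              input_length=MAX_LEN, trainable=False),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(sequences, labels, epochs=3, batch_size=32)
```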
After the data preprocessing part, we tokenized our inputs using an AutoTokenizer from the Hugging Face transformers library. Then, we fine-tuned the DistilBERT model using the Hugging Face Trainer API. We plotted training performance metrics and evaluated the model on the test dataset. We also deployed the model to the Hugging Face Hub so it could be used in the streaming step.
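A minimal sketch of the fine-tuning step is shown below; the dataset construction, output directory, and hyperparameters are assumptions for illustration.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Placeholder data; the notebook uses the preprocessed XStance rows instead.
data = Dataset.from_dict({"text": ["example one", "example two"],
                          "label": [0, 1]})

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=64)

data = data.map(tokenize, batched=True)
split = data.train_test_split(test_size=0.5)

args = TrainingArguments(output_dir="stance-distilbert",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=split["train"], eval_dataset=split["test"])
trainer.train()
trainer.evaluate()
# trainer.push_to_hub()  # would deploy the fine-tuned model to the Hugging Face Hub
```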
In this notebook, we used the DistilBERT model deployed to the Hugging Face Hub in the previous step. We treated different portions of the data as a stream and fed each portion to the model. If the model's accuracy on a portion fell below 50%, we updated the model by retraining it on that portion; otherwise, we left the model unchanged. The notebook includes several experiments with different stream portions. In the end, we split 8,000 rows from XStance into eight stream portions and observed that none of them required updating the model.
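The sketch below outlines this streaming-update loop under stated assumptions: the base checkpoint stands in for the deployed Hub model, and the placeholder stream portions, threshold handling, and retraining hyperparameters are illustrative, not the notebook's exact choices.

```python
import numpy as np
import torch
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# In the project, the fine-tuned model deployed to the Hugging Face Hub would be
# loaded here; "distilbert-base-uncased" is only a runnable stand-in.
MODEL_ID = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

def evaluate_portion(texts, labels):
    # Accuracy of the current model on one stream portion.
    enc = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
    with torch.no_grad():
        preds = model(**enc).logits.argmax(dim=-1).numpy()
    return (preds == np.array(labels)).mean()

def retrain_on_portion(texts, labels):
    # Update the model by training on the portion that it handled poorly.
    ds = Dataset.from_dict({"text": texts, "label": labels})
    ds = ds.map(lambda b: tokenizer(b["text"], padding="max_length",
                                    truncation=True, max_length=64), batched=True)
    args = TrainingArguments(output_dir="stream-update", num_train_epochs=1,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, train_dataset=ds).train()

# Placeholder stream; the notebook uses eight slices of 8,000 XStance rows.
stream_portions = [(["example one", "example two"], [0, 1])]

for texts, labels in stream_portions:
    acc = evaluate_portion(texts, labels)
    if acc < 0.5:                        # update only when accuracy drops below 50%
        retrain_on_portion(texts, labels)
```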