Commit 8656ba7
Update README.md
pankace authored Sep 7, 2024
1 parent ccb5657 commit 8656ba7
Showing 1 changed file with 1 addition and 1 deletion.
README.md
@@ -51,7 +51,7 @@ py -m streamlit run app.py
```

# 3. Conclusion
-The dataset we used was annotated in [this research](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044360/). We preprocessed it into a usable format and analyzed it, drawing some insights about its token distribution. We then explored two methods of vectorizing sentences, TF-IDF and embeddings, and found that embeddings generally perform better because they capture the semantics of language. From model exploration, we found that logistic regression and random forest on sentence embeddings already achieved both strong performance (79% accuracy) and a small model size. While LSTMs and transformers should, in theory, perform better because they can exploit the sequential information in sentences, in our experiments they gave similar performance with more complex architectures and higher resource consumption. Overall performance, especially of the neural network models, should improve with a larger dataset.
+The dataset we used was annotated in [Benítez-Andrades JA, González-Jiménez Á, López-Brea Á, Aveleira-Mata J, Alija-Pérez JM, García-Ordás MT. Detecting racism and xenophobia using deep learning models on Twitter data: CNN, LSTM and BERT. PeerJ Comput Sci. 2022 Mar 1;8:e906. doi: 10.7717/peerj-cs.906. PMID: 35494847; PMCID: PMC9044360.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044360/). We preprocessed it into a usable format and analyzed it, drawing some insights about its token distribution. We then explored two methods of vectorizing sentences, TF-IDF and embeddings, and found that embeddings generally perform better because they capture the semantics of language. From model exploration, we found that logistic regression and random forest on sentence embeddings already achieved both strong performance (79% accuracy) and a small model size. While LSTMs and transformers should, in theory, perform better because they can exploit the sequential information in sentences, in our experiments they gave similar performance with more complex architectures and higher resource consumption. Overall performance, especially of the neural network models, should improve with a larger dataset.

# 4. Future Work
### Docker Containerization:
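For context on the pipeline the conclusion describes, here is a minimal sketch of classifying short texts with sentence embeddings and logistic regression. It is an illustration, not the repository's actual code: the sentence-transformers model name, the toy examples, and the train/test split are all assumptions.

```python
# Minimal sketch: sentence embeddings + logistic regression.
# Assumes `pip install sentence-transformers scikit-learn`; the model name
# and the toy data below are placeholders, not this project's real setup.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

texts = [
    "friendly example tweet",
    "another harmless tweet",
    "hostile example tweet",
    "another hostile tweet",
]
labels = [0, 0, 1, 1]  # 0 = non-racist, 1 = racist (toy labels)

# Encode each sentence into a dense vector that captures its semantics.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
X = encoder.encode(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# A TF-IDF baseline would instead vectorize `texts` with
# sklearn.feature_extraction.text.TfidfVectorizer before fitting the model.
```

Swapping `LogisticRegression` for `sklearn.ensemble.RandomForestClassifier` gives the random-forest variant the conclusion mentions; both keep the trained model small compared with an LSTM or transformer.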
