Predict which Tweets are about real disasters and which ones are not
Explore the docs »
View Demo
·
Report Bug
·
Request Feature
Table of Contents
- About The Project
- Project Structure
- Project Overview
- Dataset Overview
- Data Cleaning
- Latent Dirichlet Allocation (LDA)
- Feature Engineering
- Classifier Model Training
- Performance on Test Set Using Random Forest (Bagging)
- Performance on Twitter API Data Using Random Forest (Bagging)
- Performance on Twitter API Data Using Random Forest (Boosting)
- Performance on Twitter API Data Using Neural Network
- Reference
- Contact
This is a course project for Rutgers MSDS597 Data Wrangling and Husbandry.
The project is based on the Natural Language Processing with Disaster Tweets competition on Kaggle. We aim to build a classifier that distinguishes whether a public tweet is about a real disaster or not.
Twitter has become an important communication channel in times of emergency. The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies).
But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:
The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.
In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.
Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.
- Bin - Directory which stores all code
  - Project.Rmd - The R Markdown file for the project
- Config - Directory which stores all the configuration files
  - kaggle.json - A JSON file storing your Kaggle username and API key (replace this file with your own data)
  - twitter.json - A JSON file storing your Twitter API key, value, and bearer token (replace this file with your own data)
- Data - Directory which stores all data used in this project (auto-generated by R if it does not exist)
  - Kaggle - Directory which stores the dataset downloaded from Kaggle (auto-generated by R if it does not exist; the dataset is downloaded into it automatically)
  - Twitter API Data - Directory which stores the dataset we downloaded using the Twitter API
- Lexicon - Directory which stores the lexicons we use for sentiment analysis
  - NRC-Hashtag-Emotion-Lexicon-v0.2.txt - NRC Hashtag Emotion Lexicon
- Output - Directory which stores all the output files
  - Images - Output directory for images
  - LDA - Output directory for the LDA model
  - Model - Output directory for trained models
  - Presentation Slides - Directory for presentation slides
  - Rmd Knit - Output directory for the knitted R Markdown
- id - unique identifier for each tweet
- keyword - keyword of the text content
- location - location of the tweet
- text - text content of each tweet
- target - binary value (0 for a non-disaster tweet, 1 for a disaster tweet)
- Clean mislabeled tweets
- Extract links from text - get link_count
- Convert special URL characters (e.g. %20 as white space) to readable strings and replace newline characters \n in text - get line_count
- Extract tags and '@'s from text - get tag_count and at_count
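The extraction steps above can be sketched with base R regular expressions. This is illustrative; the exact patterns used in Project.Rmd may differ.

```r
# Count regex matches per tweet; gregexpr returns -1 when there is no match.
count_matches <- function(text, pattern) {
  m <- gregexpr(pattern, text, perl = TRUE)
  vapply(m, function(x) sum(x > 0), integer(1))
}

tweets <- c(
  "Forest fire near La Ronge #wildfire @CBCNews https://t.co/abc123",
  "Just love this sunset!\nNo filter needed"
)

link_count <- count_matches(tweets, "https?://\\S+")  # links
tag_count  <- count_matches(tweets, "#\\w+")          # hashtags
at_count   <- count_matches(tweets, "@\\w+")          # '@'s
line_count <- count_matches(tweets, "\n") + 1L        # newlines + 1
```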
- link_count - number of links in a tweet
- line_count - number of lines in a tweet
- tag_count - number of tags in a tweet
- at_count - number of '@'s in a tweet
- char_length - number of characters in a tweet
- word_count - number of words in a tweet
- mean_word_length - the average number of characters per word in a tweet
- unique_word_count - the number of distinct words used in a tweet
- punc_count - the number of punctuation marks in a tweet
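A minimal base-R sketch of the count features above; the actual implementation in Project.Rmd likely uses tidyverse/stringr, so treat this as illustrative.

```r
tweets <- c("Forest fire near La Ronge", "so hot today so so hot")

char_length       <- nchar(tweets)                # characters per tweet
words             <- strsplit(tweets, "\\s+")     # tokenize on whitespace
word_count        <- lengths(words)
mean_word_length  <- vapply(words, function(w) mean(nchar(w)), numeric(1))
unique_word_count <- vapply(words, function(w) length(unique(w)), integer(1))
punc_count        <- vapply(gregexpr("[[:punct:]]", tweets),
                            function(x) sum(x > 0), integer(1))
```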
- Extract common disaster and non-disaster keywords, tags, '@'s and n-grams
- N-gram analysis
- Lexicon: NRC-Hashtag-Emotion-Lexicon-v0.2 [1, 2]
- Sentiment score - sum of the 4 positive emotion scores minus the sum of the 4 negative emotion scores
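The score can be written out as follows. Note that the grouping of the eight NRC emotions into four positive (joy, trust, anticipation, surprise) and four negative (anger, fear, disgust, sadness) is an assumption here, as are the example score values.

```r
# Hypothetical per-tweet emotion scores aggregated from the lexicon.
emotion_scores <- data.frame(
  joy = 0.8, trust = 0.3, anticipation = 0.2, surprise = 0.1,
  anger = 0.5, fear = 0.9, disgust = 0.1, sadness = 0.4
)
positive <- c("joy", "trust", "anticipation", "surprise")  # assumed grouping
negative <- c("anger", "fear", "disgust", "sadness")

# Sentiment score = sum of positive emotion scores - sum of negative ones.
sentiment_score <- rowSums(emotion_scores[positive]) -
  rowSums(emotion_scores[negative])
```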
- 80% train - 43% disaster tweets, 57% non-disaster tweets
- 20% validation - 43% disaster tweets, 57% non-disaster tweets
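A stratified 80/20 split like the one above can be sketched in base R; the project's actual split may be done differently (e.g. with caret or rsample), so this is illustrative.

```r
set.seed(42)
target <- rep(c(1, 0), times = c(43, 57))   # 43% disaster, 57% non-disaster

# Sample 80% of the indices within each class so both splits keep the ratio.
train_idx <- unlist(lapply(split(seq_along(target), target),
                           function(i) sample(i, size = round(0.8 * length(i)))))
train_target <- target[train_idx]
valid_target <- target[-train_idx]
```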
- Bagging
- Boosting
- package: neuralnet
- number of hidden nodes: 5
- min threshold: 0.06
- activation function: logistic
- error function: ce (cross-entropy)
- stepmax: 1000000
- accuracy on validation set: 75.61%
- confusion matrix on validation set:
- visualization:
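Put together, the hyperparameters above correspond to a `neuralnet` call roughly like this. The formula and the synthetic data frame columns are placeholders, not the project's actual feature set.

```r
library(neuralnet)

# Tiny synthetic stand-in for the engineered feature table (placeholder columns).
set.seed(1)
train_data <- data.frame(
  target          = rbinom(40, 1, 0.43),
  link_count      = rpois(40, 1),
  word_count      = rpois(40, 12),
  sentiment_score = rnorm(40)
)

nn <- neuralnet(
  target ~ link_count + word_count + sentiment_score,  # placeholder formula
  data          = train_data,
  hidden        = 5,          # 5 hidden nodes
  threshold     = 0.06,       # stopping threshold on the error gradient
  act.fct       = "logistic",
  err.fct       = "ce",       # cross-entropy error
  stepmax       = 1e6,
  linear.output = FALSE       # classification: apply act.fct at the output
)
plot(nn)  # the network visualization mentioned above
```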
[1] Using Hashtags to Capture Fine Emotion Categories from Tweets. Saif M. Mohammad, Svetlana Kiritchenko, Computational Intelligence, Volume 31, Issue 2, Pages 301-326, May 2015.
[2] #Emotional Tweets, Saif Mohammad, In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*Sem), June 2012, Montreal, Canada.
Feiyu Zheng - [email protected]
Junyi Li - [email protected]
Ziyu Zhou - [email protected]
Project Link: https://github.com/ChaserZ98/Natural-Language-Processing-with-Disaster-Tweets