
Natural Language Processing with Disaster Tweets

Predict which Tweets are about real disasters and which ones are not
Explore the docs »

View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Project Structure
  3. Project Overview
  4. Reference
  5. Contact

About The Project

This is a course project for Rutgers MSDS597: Data Wrangling and Husbandry.

Built with

Purpose

The project is based on the Natural Language Processing with Disaster Tweets competition on Kaggle. We build a classifier to predict whether a public tweet is about a real disaster or not.

Competition Description

Twitter has become an important communication channel in times of emergency. The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies).

But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:

The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

(back to top)

Project Structure

  • Bin - Directory containing all code

    • Project.Rmd - The R Markdown file for the project
  • Config - Directory containing all configuration files

    • kaggle.json - A JSON file storing your Kaggle username and API key (replace this file with your own credentials; see the example after this list)
    • twitter.json - A JSON file storing your Twitter API key, value, and bearer token (replace this file with your own credentials)
  • Data - Directory containing all data used in this project (auto-generated by R if it does not exist)

    • Kaggle - Directory for the dataset downloaded from Kaggle (auto-generated by R if it does not exist; the dataset is downloaded to this directory automatically)
    • Twitter API Data - Directory for the dataset we downloaded using the Twitter API
  • Lexicon - Directory containing the lexicons used for sentiment analysis

    • NRC-Hashtag-Emotion-Lexicon-v0.2.txt - NRC Hashtag Emotion Lexicon
  • Output - Directory containing all output files

    • Images - Output directory for images
    • LDA - Output directory for the LDA model
    • Model - Output directory for trained models
    • Presentation Slides - Directory for presentation slides
    • Rmd Knit - Output directory for the knitted R Markdown
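
For reference, kaggle.json follows Kaggle's standard API credential format (downloadable from your Kaggle account settings). The values below are placeholders, and the jsonlite read is only a sketch of one way to load the file in R:

    # Config/kaggle.json (placeholder values)
    # {
    #   "username": "your-kaggle-username",
    #   "key": "your-kaggle-api-key"
    # }

    # Reading the credentials in R (jsonlite is an assumption about the package used)
    kaggle_cred <- jsonlite::fromJSON("Config/kaggle.json")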

(back to top)

Project Overview

Dataset overview

Columns

  • id - unique identifier for each tweet
  • keyword - keyword of the text content
  • location - location of the tweet
  • text - text content of each tweet
  • target - binary value (0 for non-disaster tweet and 1 for disaster tweet)
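
As a quick sketch, the training file can be loaded and these columns inspected in R (the path is an assumption about where the Kaggle download lands):

    library(readr)

    # Load the Kaggle training data (assumed path)
    train <- read_csv("Data/Kaggle/train.csv")

    # Inspect the five columns described above and the class balance
    str(train)
    table(train$target)  # 0 = non-disaster, 1 = disaster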

Missing Value

missing value plot

Target distribution

target distribution plot

Word Cloud for text

  • Word cloud for all tweets

    word cloud all
  • Word cloud for disaster tweets

    word cloud disaster

USA Disaster Distribution

usa disaster map

Data Cleaning

  • Clean mislabeled tweets
  • Extract links from text - get link_count
  • Convert URL-encoded characters (e.g. %20 for a space) to readable text and replace newline characters (\n) in text - get line_count
  • Extract hashtags and '@' mentions from text - get tag_count and at_count (see the sketch below)
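
A minimal sketch of how these counts could be extracted with stringr (the exact regular expressions used in Project.Rmd may differ):

    library(dplyr)
    library(stringr)

    train <- train %>%
      mutate(
        link_count = str_count(text, "https?://\\S+"),            # number of links
        text       = str_replace_all(text, "https?://\\S+", " "), # strip links from the text
        text       = str_replace_all(text, "%20", " "),           # decode %20 to a space
        line_count = str_count(text, "\n") + 1,                   # lines, via newline characters
        text       = str_replace_all(text, "\n", " "),            # replace newlines
        tag_count  = str_count(text, "#\\w+"),                    # hashtags
        at_count   = str_count(text, "@\\w+")                     # '@' mentions
      )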

Latent Dirichlet Allocation (LDA)

  • Tokenization

  • Create Document Term Matrix (DTM)

  • Model Tuning

    coherence plot
  • Topics

    topic plot
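
A minimal sketch of the tokenization, DTM, and LDA fit using tidytext and topicmodels (the package choice and the number of topics are assumptions; the project selects the topic count from the coherence plot):

    library(dplyr)
    library(tidytext)
    library(topicmodels)

    # Tokenize, drop stop words, and cast to a document-term matrix
    dtm <- train %>%
      unnest_tokens(word, text) %>%
      anti_join(stop_words, by = "word") %>%
      count(id, word) %>%
      cast_dtm(document = id, term = word, value = n)

    # Fit LDA; k = 10 is a placeholder, tuned in the project via coherence
    lda_model <- LDA(dtm, k = 10, control = list(seed = 42))
    terms(lda_model, 10)  # top terms per topic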

Feature Engineering

Numeric

  • link_count - number of links in a tweet

  • line_count - number of lines in a tweet

  • tag_count - number of tags in a tweet

  • at_count - number of '@'s in a tweet

  • char_length - number of characters in a tweet

  • word_count - number of words in a tweet

  • mean_word_length - char_length divided by word_count, i.e. the average number of characters per word in a tweet

  • unique_word_count - the number of distinct words used in a tweet

  • punc_count - the number of punctuation marks in a tweet

  • stop_word_count - the number of stop words in a tweet

    distribution plot
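
A rough sketch of how these numeric features could be computed (the regular expressions and the stop-word list are assumptions):

    library(dplyr)
    library(stringr)
    library(tidytext)

    train <- train %>%
      mutate(
        char_length       = str_length(text),
        word_count        = str_count(text, "\\S+"),
        mean_word_length  = char_length / word_count,
        unique_word_count = sapply(str_split(text, "\\s+"), function(w) length(unique(w))),
        punc_count        = str_count(text, "[[:punct:]]"),
        stop_word_count   = sapply(str_split(tolower(text), "\\s+"),
                                   function(w) sum(w %in% stop_words$word))
      )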

Ratio Analysis

  • Extract common disaster and non-disaster keywords, tags, '@' mentions, and n-grams
  • N-gram analysis
    • Unigram

      top 20 unigram plot
    • Bigram

      top 20 bigram plot
    • Trigram

      top 20 trigram plot
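
The n-gram counts can be produced with tidytext::unnest_tokens; a minimal sketch for bigrams (unigrams and trigrams use n = 1 and n = 3):

    library(dplyr)
    library(tidytext)

    top_bigrams <- train %>%
      unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
      filter(!is.na(bigram)) %>%
      count(target, bigram, sort = TRUE) %>%
      group_by(target) %>%
      slice_max(n, n = 20)   # top 20 bigrams per class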

Sentiment Analysis

  • Lexicon: NRC-Hashtag-Emotion-Lexicon-v0.2 [1, 2]
    • score: non-negative real-valued association score
    • size: 16862 unigrams
    • emotions
      • positive: anticipation, joy, surprise, trust
      • negative: anger, disgust, fear, sadness
  • Sentiment Score - the sum of the four positive emotion scores minus the sum of the four negative emotion scores
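
A minimal sketch of the lexicon join and score aggregation, assuming the lexicon file reads into a three-column table (emotion, word, score; the column order is an assumption):

    library(dplyr)
    library(readr)
    library(tidytext)

    # NRC Hashtag Emotion Lexicon (tab-separated)
    lexicon <- read_tsv("Lexicon/NRC-Hashtag-Emotion-Lexicon-v0.2.txt",
                        col_names = c("emotion", "word", "score"))

    positive <- c("anticipation", "joy", "surprise", "trust")
    negative <- c("anger", "disgust", "fear", "sadness")

    sentiment_scores <- train %>%
      unnest_tokens(word, text) %>%
      inner_join(lexicon, by = "word") %>%
      mutate(signed_score = ifelse(emotion %in% positive, score, -score)) %>%
      group_by(id) %>%
      summarise(sentiment_score = sum(signed_score))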

Classifier Model Training

Train-validation Split

  • 80% train - 43% disaster tweets, 57% non-disaster tweets
  • 20% validation - 43% disaster tweets, 57% non-disaster tweets
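
A minimal sketch of a stratified 80/20 split that preserves the class ratio shown above (the seed is an assumption):

    set.seed(42)  # assumed seed for reproducibility

    # Sample 80% of the row indices within each target class
    train_idx <- unlist(lapply(split(seq_len(nrow(train)), train$target),
                               function(idx) sample(idx, size = floor(0.8 * length(idx)))))

    train_set <- train[train_idx, ]
    valid_set <- train[-train_idx, ]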

Random Forest

  • Bagging

    • package: randomForest

    • ntree: 2000

    • accuracy on validation set: 77.78%

    • confusion matrix on validation set:

  • Boosting

    • package: gbm

    • ntree: 5000

    • interaction depth: 3

    • shrinkage: 0.2

    • accuracy on validation set: 76%

    • confusion matrix on validation set:
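
A minimal sketch of both fits with the hyperparameters listed above (the modelling formula is an assumption; the project trains on the engineered numeric features):

    library(randomForest)
    library(gbm)

    # Bagging: random forest with 2000 trees
    rf_fit  <- randomForest(as.factor(target) ~ ., data = train_set, ntree = 2000)
    rf_pred <- predict(rf_fit, newdata = valid_set)
    table(predicted = rf_pred, actual = valid_set$target)  # confusion matrix

    # Boosting: gbm with 5000 trees, interaction depth 3, shrinkage 0.2
    gbm_fit  <- gbm(target ~ ., data = train_set, distribution = "bernoulli",
                    n.trees = 5000, interaction.depth = 3, shrinkage = 0.2)
    gbm_prob <- predict(gbm_fit, newdata = valid_set, n.trees = 5000, type = "response")
    table(predicted = as.integer(gbm_prob > 0.5), actual = valid_set$target)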

Neural Network

  • package: neuralnet

  • number of hidden nodes: 5

  • min threshold: 0.06

  • activation function: logistic

  • error function: ce

  • stepmax: 1000000

  • accuracy on validation set: 75.61%

  • confusion matrix on validation set:

  • visualization:

    neural network plot
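
A minimal sketch of the neuralnet fit with the settings listed above (the feature set and any scaling are assumptions; neuralnet expects numeric inputs):

    library(neuralnet)

    # One hidden layer of 5 nodes, logistic activation, cross-entropy error
    nn_fit <- neuralnet(target ~ ., data = train_set,
                        hidden = 5, threshold = 0.06,
                        act.fct = "logistic", err.fct = "ce",
                        linear.output = FALSE, stepmax = 1e6)

    nn_prob <- predict(nn_fit, newdata = valid_set)
    table(predicted = as.integer(nn_prob > 0.5), actual = valid_set$target)
    plot(nn_fit)  # network visualization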

Performance on Test Set Using Random Forest (Bagging)

  • accuracy: 75.45%

  • confusion matrix:

Performance on Twitter API Data Using Random Forest (Bagging)

  • accuracy: 65.52%

  • confusion matrix:

Performance on Twitter API Data Using Random Forest (Boosting)

  • accuracy: 60.1%

  • confusion matrix:

Performance on Twitter API Data Using Neural Network

  • accuracy: 72.91%

  • confusion matrix:

(back to top)

Reference

[1] Saif M. Mohammad and Svetlana Kiritchenko. Using Hashtags to Capture Fine Emotion Categories from Tweets. Computational Intelligence, 31(2), 301-326, May 2015.

[2] Saif M. Mohammad. #Emotional Tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*Sem), Montreal, Canada, June 2012.

(back to top)

Contact

Feiyu Zheng - [email protected]

Junyi Li - [email protected]

Ziyu Zhou - [email protected]

Project Link: https://github.com/ChaserZ98/Natural-Language-Processing-with-Disaster-Tweets

(back to top)
