
Natural Language Processing with Disaster Tweets

Predict which Tweets are about real disasters and which ones are not
Explore the docs »

View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Project Structure
  3. Project Overview
  4. Reference
  5. Contact

About The Project

This is a course project for Rutgers MSDS597: Data Wrangling and Husbandry.

Built with

Purpose

The project is based on the Natural Language Processing with Disaster Tweets competition on Kaggle. We build a classifier to predict whether a public tweet is about a real disaster or not.

Competition Description

Twitter has become an important communication channel in times of emergency. The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies).

But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:

The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

(back to top)

Project Structure

  • Bin - Directory containing all code

    • Project.Rmd - The R Markdown file for the project
  • Config - Directory containing all configuration files

    • kaggle.json - A JSON file storing your Kaggle username and API key (replace this file with your own credentials; see the example after this list)
    • twitter.json - A JSON file storing your Twitter API key, value, and bearer token (replace this file with your own credentials)
  • Data - Directory containing all data used in this project (auto-generated by R if it does not exist)

    • Kaggle - Directory for the dataset downloaded from Kaggle (auto-generated by R if it does not exist; the dataset is downloaded to this directory automatically)
    • Twitter API Data - Directory for the dataset we downloaded using the Twitter API
  • Lexicon - Directory containing the lexicons used for sentiment analysis

    • NRC-Hashtag-Emotion-Lexicon-v0.2.txt - NRC Hashtag Emotion Lexicon
  • Output - Directory containing all output files

    • Images - Output directory for images
    • LDA - Output directory for the LDA model
    • Model - Output directory for trained models
    • Presentation Slides - Directory for presentation slides
    • Rmd Knit - Output directory for the knitted R Markdown
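
For reference, kaggle.json follows Kaggle's standard API credential format (downloadable from your Kaggle account settings). The values below are placeholders, and the jsonlite read is only a sketch of one way to load the file in R:

    # Config/kaggle.json (placeholder values)
    # {
    #   "username": "your-kaggle-username",
    #   "key": "your-kaggle-api-key"
    # }

    # Reading the credentials in R (jsonlite is an assumption about the package used)
    kaggle_cred <- jsonlite::fromJSON("Config/kaggle.json")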

(back to top)

Project Overview

Dataset overview

Columns

  • id - unique identifier for each tweet
  • keyword - keyword of the text content
  • location - location of the tweet
  • text - text content of each tweet
  • target - binary value (0 for non-disaster tweet and 1 for disaster tweet)
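
As a quick sketch, the training file can be loaded and these columns inspected in R (the path is an assumption about where the Kaggle download lands):

    library(readr)

    # Load the Kaggle training data (assumed path)
    train <- read_csv("Data/Kaggle/train.csv")

    # Inspect the five columns described above and the class balance
    str(train)
    table(train$target)  # 0 = non-disaster, 1 = disaster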

Missing Value

missing value plot

Target distribution

target distribution plot

Word Cloud for text

  • Word cloud for all tweets

    word cloud all
  • Word cloud for disaster tweets

    word cloud disaster

USA Disaster Distribution

usa disaster map

Data Cleaning

  • Clean mislabeled tweets
  • Extract links from text - get link_count
  • Convert URL-encoded characters (e.g. %20 for a space) to readable text and replace newline characters (\n) in text - get line_count
  • Extract hashtags and '@' mentions from text - get tag_count and at_count (see the sketch below)
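
A minimal sketch of how these counts could be extracted with stringr (the exact regular expressions used in Project.Rmd may differ):

    library(dplyr)
    library(stringr)

    train <- train %>%
      mutate(
        link_count = str_count(text, "https?://\\S+"),            # number of links
        text       = str_replace_all(text, "https?://\\S+", " "), # strip links from the text
        text       = str_replace_all(text, "%20", " "),           # decode %20 to a space
        line_count = str_count(text, "\n") + 1,                   # lines, via newline characters
        text       = str_replace_all(text, "\n", " "),            # replace newlines
        tag_count  = str_count(text, "#\\w+"),                    # hashtags
        at_count   = str_count(text, "@\\w+")                     # '@' mentions
      )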

Latent Dirichlet Allocation (LDA)

  • Tokenization

  • Create Document Term Matrix (DTM)

  • Model Tuning

    coherence plot
  • Topics

    topic plot
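
A minimal sketch of the tokenization, DTM, and LDA fit using tidytext and topicmodels (the package choice and the number of topics are assumptions; the project selects the topic count from the coherence plot):

    library(dplyr)
    library(tidytext)
    library(topicmodels)

    # Tokenize, drop stop words, and cast to a document-term matrix
    dtm <- train %>%
      unnest_tokens(word, text) %>%
      anti_join(stop_words, by = "word") %>%
      count(id, word) %>%
      cast_dtm(document = id, term = word, value = n)

    # Fit LDA; k = 10 is a placeholder, tuned in the project via coherence
    lda_model <- LDA(dtm, k = 10, control = list(seed = 42))
    terms(lda_model, 10)  # top terms per topic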

Feature Engineering

Numeric

  • link_count - number of links in a tweet

  • line_count - number of lines in a tweet

  • tag_count - number of tags in a tweet

  • at_count - number of '@'s in a tweet

  • char_length - number of characters in a tweet

  • word_count - number of words in a tweet

  • mean_word_length - char_length divided by word_count, i.e. the average number of characters per word in a tweet

  • unique_word_count - the number of distinct words used in a tweet

  • punc_count - the number of punctuation marks in a tweet

  • stop_word_count - the number of stop words in a tweet

    distribution plot
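
A rough sketch of how these numeric features could be computed (the regular expressions and the stop-word list are assumptions):

    library(dplyr)
    library(stringr)
    library(tidytext)

    train <- train %>%
      mutate(
        char_length       = str_length(text),
        word_count        = str_count(text, "\\S+"),
        mean_word_length  = char_length / word_count,
        unique_word_count = sapply(str_split(text, "\\s+"), function(w) length(unique(w))),
        punc_count        = str_count(text, "[[:punct:]]"),
        stop_word_count   = sapply(str_split(tolower(text), "\\s+"),
                                   function(w) sum(w %in% stop_words$word))
      )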

Ratio Analysis

  • Extract common disaster and non-disaster keywords, tags, '@' mentions, and n-grams
  • N-gram analysis
    • Unigram

      top 20 unigram plot
    • Bigram

      top 20 bigram plot
    • Trigram

      top 20 trigram plot
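
The n-gram counts can be produced with tidytext::unnest_tokens; a minimal sketch for bigrams (unigrams and trigrams use n = 1 and n = 3):

    library(dplyr)
    library(tidytext)

    top_bigrams <- train %>%
      unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
      filter(!is.na(bigram)) %>%
      count(target, bigram, sort = TRUE) %>%
      group_by(target) %>%
      slice_max(n, n = 20)   # top 20 bigrams per class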

Sentiment Analysis

  • Lexicon: NRC-Hashtag-Emotion-Lexicon-v0.2 [1, 2]
    • score: non-negative real-valued association score
    • size: 16862 unigrams
    • emotions
      • positive: anticipation, joy, surprise, trust
      • negative: anger, disgust, fear, sadness
  • Sentiment Score - the sum of the four positive emotion scores minus the sum of the four negative emotion scores
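
A minimal sketch of the lexicon join and score aggregation, assuming the lexicon file reads into a three-column table (emotion, word, score; the column order is an assumption):

    library(dplyr)
    library(readr)
    library(tidytext)

    # NRC Hashtag Emotion Lexicon (tab-separated)
    lexicon <- read_tsv("Lexicon/NRC-Hashtag-Emotion-Lexicon-v0.2.txt",
                        col_names = c("emotion", "word", "score"))

    positive <- c("anticipation", "joy", "surprise", "trust")
    negative <- c("anger", "disgust", "fear", "sadness")

    sentiment_scores <- train %>%
      unnest_tokens(word, text) %>%
      inner_join(lexicon, by = "word") %>%
      mutate(signed_score = ifelse(emotion %in% positive, score, -score)) %>%
      group_by(id) %>%
      summarise(sentiment_score = sum(signed_score))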

Classifier Model Training

Train-validation Split

  • 80% train - 43% disaster tweets, 57% non-disaster tweets
  • 20% validation - 43% disaster tweets, 57% non-disaster tweets
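
A minimal sketch of a stratified 80/20 split that preserves the class ratio shown above (the seed is an assumption):

    set.seed(42)  # assumed seed for reproducibility

    # Sample 80% of the row indices within each target class
    train_idx <- unlist(lapply(split(seq_len(nrow(train)), train$target),
                               function(idx) sample(idx, size = floor(0.8 * length(idx)))))

    train_set <- train[train_idx, ]
    valid_set <- train[-train_idx, ]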

Random Forest

  • Bagging

    • package: randomForest

    • ntree: 2000

    • accuracy on validation set: 77.78%

    • confusion matrix on validation set:

  • Boosting

    • package: gbm

    • ntree: 5000

    • interaction depth: 3

    • shrinkage: 0.2

    • accuracy on validation set: 76%

    • confusion matrix on validation set:
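
A minimal sketch of both fits with the hyperparameters listed above (the modelling formula is an assumption; the project trains on the engineered numeric features):

    library(randomForest)
    library(gbm)

    # Bagging: random forest with 2000 trees
    rf_fit  <- randomForest(as.factor(target) ~ ., data = train_set, ntree = 2000)
    rf_pred <- predict(rf_fit, newdata = valid_set)
    table(predicted = rf_pred, actual = valid_set$target)  # confusion matrix

    # Boosting: gbm with 5000 trees, interaction depth 3, shrinkage 0.2
    gbm_fit  <- gbm(target ~ ., data = train_set, distribution = "bernoulli",
                    n.trees = 5000, interaction.depth = 3, shrinkage = 0.2)
    gbm_prob <- predict(gbm_fit, newdata = valid_set, n.trees = 5000, type = "response")
    table(predicted = as.integer(gbm_prob > 0.5), actual = valid_set$target)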

Neural Network

  • package: neuralnet

  • number of hidden nodes: 5

  • min threshold: 0.06

  • activation function: logistic

  • error function: ce

  • stepmax: 1000000

  • accuracy on validation set: 75.61%

  • confusion matrix on validation set:

  • visualization:

    neural network plot
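
A minimal sketch of the neuralnet fit with the settings listed above (the feature set and any scaling are assumptions; neuralnet expects numeric inputs):

    library(neuralnet)

    # One hidden layer of 5 nodes, logistic activation, cross-entropy error
    nn_fit <- neuralnet(target ~ ., data = train_set,
                        hidden = 5, threshold = 0.06,
                        act.fct = "logistic", err.fct = "ce",
                        linear.output = FALSE, stepmax = 1e6)

    nn_prob <- predict(nn_fit, newdata = valid_set)
    table(predicted = as.integer(nn_prob > 0.5), actual = valid_set$target)
    plot(nn_fit)  # network visualization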

Performance on Test Set Using Random Forest (Bagging)

  • accuracy: 75.45%

  • confusion matrix:

Performance on Twitter API Data Using Random Forest (Bagging)

  • accuracy: 65.52%

  • confusion matrix:

Performance on Twitter API Data Using Random Forest (Boosting)

  • accuracy: 60.1%

  • confusion matrix:

Performance on Twitter API Data Using Neural Network

  • accuracy: 72.91%

  • confusion matrix:

(back to top)

Reference

[1] Saif M. Mohammad and Svetlana Kiritchenko. Using Hashtags to Capture Fine Emotion Categories from Tweets. Computational Intelligence, 31(2), 301-326, May 2015.

[2] Saif M. Mohammad. #Emotional Tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*Sem), Montreal, Canada, June 2012.

(back to top)

Contact

Feiyu Zheng - [email protected]

Junyi Li - [email protected]

Ziyu Zhou - [email protected]

Project Link: https://github.com/ChaserZ98/Natural-Language-Processing-with-Disaster-Tweets

(back to top)
