For this project, we will replicate the experiment by Waseem and Hovy (2016). All groups who choose this project will jointly retrieve tweets from the last 2-3 months using the hashtag list of Waseem and Hovy (see paper), and each project member will then annotate 150-200 tweets (depending on how many groups choose this project). We will then train and optimize (parameters, features) a classifier of our choice on the Waseem data and use it to classify the new data. The working hypothesis is that the results will be lower than those reported by Waseem and Hovy. The next step will be to investigate two possible causes of the discrepancy (e.g., different levels of explicit abuse, a different distribution of the original hashtags).
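As a rough sketch of the training step: Waseem and Hovy (2016) report character n-grams (up to length 4) with logistic regression as a strong baseline, so a comparable starting point in scikit-learn could look like the following. The exact feature set and parameters here are placeholders, to be replaced during our own optimization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Character 1-4-grams with logistic regression, loosely following the
# feature setup of Waseem and Hovy (2016). All parameters below are
# illustrative defaults, not the tuned values.
classifier = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),
    LogisticRegression(max_iter=1000),
)

# Intended usage: fit on the Waseem data, then predict on the newly
# fetched tweets, e.g.
#   classifier.fit(train_tweets, train_labels)
#   predictions = classifier.predict(new_tweets)
```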
- Clone this repository
- Install the dependencies from `requirements.txt`, in a virtual environment or globally, using the command `pip install -r requirements.txt`
- In order to fetch Tweets through the Twitter API, a `credentials.py` file is needed, and it must never be committed to git. The file needs the following variables, with string values, to be defined:
  - `consumer_key` (a.k.a. API Key)
  - `consumer_secret` (a.k.a. API Secret)
  - `access_token`
  - `access_token_secret`
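For reference, a minimal `credentials.py` could look like the following (the values shown are placeholders; use the keys from your own Twitter developer account):

```python
# credentials.py -- never commit this file to git (add it to .gitignore).
# Replace each placeholder string with the corresponding key from your
# Twitter developer account.
consumer_key = "YOUR_API_KEY"
consumer_secret = "YOUR_API_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```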
This script uses the Twitter API to fetch all tweets matching a certain search
query (specified by the `search_query` variable) within a time range
(specified by the `date_start` and `date_end` variables). The fetched Tweets
are stored in a CSV file in the `Data` folder. A stratified sample of
these tweets will then be extracted for annotation as our test set. No
arguments need to be passed when running this script. The script
requires the `credentials.py` file mentioned in the installation section for
accessing the Twitter API.
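The stratified sampling step is not spelled out above; one plausible stdlib-only sketch, which draws from each stratum in proportion to its size, is shown below. The grouping key (`hashtag`) and the helper name are assumptions for illustration; the actual script may stratify differently.

```python
import random
from collections import defaultdict

def stratified_sample(tweets, key, total):
    """Sample `total` tweets, proportionally per stratum given by `key`.

    `tweets` is a list of dicts (e.g. rows read from the CSV file);
    `key` is the dict field defining the strata, e.g. the hashtag a
    tweet was fetched with.
    """
    strata = defaultdict(list)
    for tweet in tweets:
        strata[tweet[key]].append(tweet)
    sample = []
    for group in strata.values():
        # Each stratum contributes in proportion to its size,
        # but always at least one tweet.
        k = max(1, round(total * len(group) / len(tweets)))
        sample.extend(random.sample(group, min(k, len(group))))
    return sample
```

For example, with 80 tweets under `#a` and 20 under `#b`, a sample of size 10 would contain 8 tweets from `#a` and 2 from `#b`.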