Changes from all commits (57 commits):

986f904 requirements.txt (Dec 7, 2020)
c67d4eb requirements.txt (Dec 7, 2020)
de022ee stub of a Flask frontend (Dec 18, 2020)
5cb21a9 stub of a Flask frontend (Dec 18, 2020)
3cdb61c some thoughts on the backend (Jan 3, 2021)
95c8fde some thoughts on the backend (Jan 3, 2021)
2c8ce6d updated UI with temporarily disabled forms (Jan 4, 2021)
7b8a95d updated UI with temporarily disabled forms (Jan 4, 2021)
e170abf commenting (Jan 7, 2021)
535e3df commenting (Jan 7, 2021)
c876292 added tests and checks (Feb 26, 2021)
dc2523d added tests and checks (Feb 26, 2021)
aff5d1e refactored as a factory (Mar 1, 2021)
e06da64 refactored as a factory (Mar 1, 2021)
86b9b17 queueing (Mar 3, 2021)
c1266f9 queueing (Mar 3, 2021)
a88e60b updated to start processing jobs (Mar 6, 2021)
7c06ec4 updated to start processing jobs (Mar 6, 2021)
6da4c0f simplest case works without queueing (Mar 10, 2021)
60cec14 simplest case works without queueing (Mar 10, 2021)
95252b6 template and settings updates (Mar 12, 2021)
dc3d007 template and settings updates (Mar 12, 2021)
aca08e0 whitespace and some dirname changes bc case matters in linux (Mar 25, 2021)
2c7f6a9 whitespace and some dirname changes bc case matters in linux (Mar 25, 2021)
4e3eacd refactor file naming (Apr 6, 2021)
48b63e9 refactor file naming (Apr 6, 2021)
cc9c93a fix bugs with unique filename and file stream pointer (Apr 9, 2021)
f1d5c67 fix bugs with unique filename and file stream pointer (Apr 9, 2021)
7c7d46e enable VAN export cleaning and support two output files (Apr 13, 2021)
01a84cd enable VAN export cleaning and support two output files (Apr 13, 2021)
a5971d3 added loader that you can't see (Apr 14, 2021)
8ec3c60 added loader that you can't see (Apr 14, 2021)
1b59797 in theory enabled two more upload types (Apr 14, 2021)
91a71c2 in theory enabled two more upload types (Apr 14, 2021)
236def6 Merge branch 'master' into frontend (Apr 19, 2021)
7ba3539 Merge branch 'master' into frontend (Apr 19, 2021)
ede82a3 email sending works (Apr 22, 2021)
989b591 switched to instance relative config (Apr 22, 2021)
fc755c5 moved test target email to config (Apr 23, 2021)
fbbd52d readd tasks file (Apr 23, 2021)
b481710 Merge branch 'frontend' of github.com:MoveOnOrg/votetripling into fro… (Apr 23, 2021)
fba797a Merge branch 'master' into frontend (Apr 24, 2021)
c15471b Merge branch 'frontend' into queue (Apr 24, 2021)
0d7b0e5 enabled sms aggregation script with no interactivity (Apr 27, 2021)
c444955 extraction has three outputs, not two (Apr 28, 2021)
066347a support more args for aggregation (Apr 28, 2021)
183616c removed db to simplify and all works in synchronous mode (Apr 28, 2021)
4d295ce works in async mode (Apr 29, 2021)
a34f940 file cleanup now works (May 4, 2021)
b680a0f no more db (May 4, 2021)
23800f7 using csv reader allows multiple types of csv formatting (May 18, 2021)
5f879a9 fixed position messages: (May 18, 2021)
40db4c7 specify inbound and outbound text (May 18, 2021)
47f40fc made file upload and email required in each form (May 21, 2021)
470ed68 add required headers for script 1 and make script 1 headers match eve… (May 21, 2021)
f222588 insert original filename (May 21, 2021)
3b50cac now emailing errors (May 21, 2021)
18 changes: 17 additions & 1 deletion .gitignore
@@ -1,2 +1,18 @@
*.DS_Store
.DS_Store
*.csv
*.xlsx
settings.py
config.py

__pycache__
*.pyc
*.rdb
*.sqlite

.pytest_cache/
.coverage
htmlcov/

dist/
build/
*.egg-info/
Binary file not shown.
@@ -380,7 +380,7 @@ def main(args):
)
PARSER.add_argument(
"-p", "--phoneCol",
default="EndpointPhoneNumber",
default="ContactPhone",
help="name of the column in input data containing the phone number. Any unique identifier for the recipient will suffice"
)
PARSER.add_argument(
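The diff above changes the default for `--phoneCol` from `EndpointPhoneNumber` to `ContactPhone`. A minimal reconstruction of just this argument shows the effect; only the names visible in the diff are real, the rest of the parser setup is assumed:

```python
import argparse

# Sketch of the changed argument; the surrounding parser setup is assumed.
PARSER = argparse.ArgumentParser()
PARSER.add_argument(
    "-p", "--phoneCol",
    default="ContactPhone",  # was "EndpointPhoneNumber" before this change
    help="name of the column in input data containing the phone number. "
         "Any unique identifier for the recipient will suffice"
)

print(PARSER.parse_args([]).phoneCol)                             # ContactPhone
print(PARSER.parse_args(["-p", "EndpointPhoneNumber"]).phoneCol)  # EndpointPhoneNumber
```

Callers relying on the old column name can still opt back in with `-p EndpointPhoneNumber`.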
Binary file removed Projects/NLP/SMS_Annotation/Input_Data/.DS_Store
Binary file not shown.
1 change: 1 addition & 0 deletions Projects/NLP/SMS_Annotation/Input_Data/readme.txt
@@ -0,0 +1 @@
Uploaded data will show up in this dir!
86 changes: 65 additions & 21 deletions README_local.md
@@ -11,6 +11,10 @@ This document describes how to use 5 versions of name extraction scripts for vot
- Levenshtein `pip install python-Levenshtein`
- NLTK `pip install nltk`

- Alternatively, if you are in an environment where you can't or don't want to install Anaconda, install Python 3.6.9+, create and activate a Python 3 virtual environment (see [pipenv and virtualenv](https://docs.python-guide.org/dev/virtualenvs/)), and run `pip install -r requirements.txt`

- You'll also need to run `spacy download en` once.

## Getting Started
Find your use case below and add your input data to the appropriate place, then run the specified python script.
All of these scripts should be run out of the directory `Projects/NLP/SMS_Annotation`
@@ -21,7 +25,7 @@ All output data (after running a script) will be found in `Projects/NLP/SMS_Anno
**Use Case:** I need to aggregate SMS messages by conversation. This step is necessary before performing any extraction on SMS data.

**Inputs:**
Add a csv to the Input_Data folder. This csv should be raw individual SMS messages, not grouped by conversation.

**Instructions:**
Open the script aggregate_text_messages.R in RStudio and follow the instructions to aggregate messages into a single row per conversation.
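The aggregation itself is done by the R script in this repo; as a sketch of what "a single row per conversation" means, grouping raw message rows might look like this (the column names here are hypothetical, the real ones live in aggregate_text_messages.R):

```python
from collections import defaultdict

def aggregate_messages(rows):
    """Collapse raw per-message rows into one row per conversation.

    `rows` is a list of dicts with hypothetical keys "conversation_id"
    and "body"; this is an illustration, not the repo's R logic.
    """
    conversations = defaultdict(list)
    for row in rows:
        conversations[row["conversation_id"]].append(row["body"])
    return [
        {"conversation_id": cid, "messages": " | ".join(bodies)}
        for cid, bodies in conversations.items()
    ]

rows = [
    {"conversation_id": 1, "body": "Will you remind 3 friends to vote?"},
    {"conversation_id": 1, "body": "Sure! Ana, Ben and Cal"},
    {"conversation_id": 2, "body": "STOP"},
]
result = aggregate_messages(rows)  # two rows: one per conversation
```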
@@ -45,16 +49,16 @@ A file (filename specified by you in the R script) with a single row representin


## SMS Conversation Categorization and Name Extraction
**Use Case:** I have SMS conversations and I need to figure out which text recipients volunteered to triple, which chose to opt out, what names they provided, and whether they moved.

**Inputs:**
Add a CSV to the Input_Data folder. This CSV file must be of the same format as the output of the aggregation in step 1.

**Instructions:**
In this directory, run `python3 Code/annotate_conversations.py -i [input_filename]`.

**Outputs:**
This script will output three files:
1. A file of triplers called `sms_triplers.csv`. For each tripler, we provide the following fields (each row represents one text message conversation):
- *ConversationId* a unique identifier for the conversation
- *contact_phone* the phone number of the target
@@ -74,46 +78,45 @@ This script will output two files:
- *wrong_number* guess for did we have the wrong number for this person (to be reviewed)
- *names_extract* guess for what names (if any) were provided by this person as tripling targets (to be reviewed)


3. A file of opt-outs

## Text Banker Log Cleaning
**Use Case:** I have text banker logs for names provided by vote triplers. I need these logs cleaned up and standardized.

**Inputs:**
Add a csv to the Input_Data folder. This csv file must contain a column 'names' containing the names logged by a text banker.

**Instructions:**
In this directory, run `python3 Code/name_cleaning.py -i [input_filename]`

**Outputs:**
A file in `Output_Data` named `labeled_names_cleaned_no_response.csv` with the cleaned names in a column titled "clean_names", along with any other columns in the initial file

## Text Banker Log Cleaning (utilizing text message conversation)
**Use Case:** I have text banker logs for names provided by vote triplers. I also have access to the initial text conversation. I need these logs cleaned up and standardized. We use a different script for these cases, because we can clean up the logs better and perform spell check by looking at the original messages.

**Inputs:**
Add a csv to the Input_Data folder.
This csv file must be of the same format as the output of the aggregation in step 1.
This csv file must also contain column 'names' containing the names logged by a text banker.

**Instructions:**
In this directory, run `python3 Code/name_cleaning_with_responses.py -i [input_filename]`

**Outputs:**
A file in `Output_Data` named `labeled_names_cleaned_with_response.csv` with the cleaned names in a column titled "clean_names", along with any other columns in the initial file



## VAN Export Cleaning
**Use Case:** I have a VAN Export and I need to extract any tripling target names from the note text.

**Inputs:**
Add a csv to the Input_Data folder. This csv file must contain the following columns:
- *voter_file_vanid* a unique ID for this row
- *ContactName* the name of the tripler
- *NoteText* free text possibly including names of tripling targets

**Instructions:**
In this directory, run `python3 Code/van_export_cleaning.py -d [input_filename]`

**Outputs:**
This script will output two files:
@@ -126,3 +129,44 @@ This script will output two files:
- *ContactName* the name of the tripler
- *NoteText* free text possibly including names of tripling targets
- *names_extract* a guess for the extracted names (to be reviewed)

# Running the app frontend
app.py is a Python 3, Flask-based frontend that provides a dedicated UI for uploading data sets and running the scripts above on them.

Make sure you've created and activated a virtual environment (see Requirements) and installed everything in requirements.txt.

You'll need to [install Redis](https://redis.io/topics/quickstart). On macOS, install Homebrew and then `brew install redis`. You may also need to run `pip install "celery[redis]"`

Copy `instance/config.py.example` to `instance/config.py` and fill it in.

To run an instance of the frontend locally, from the project root directory run:
```
export FLASK_APP=parser
export FLASK_ENV=development
flask run
```
and access the running application at [http://localhost:5000/](http://localhost:5000/)

## Configuring email

Email config variables in the example config file assume you are using Gmail for testing. Two important notes:
* Gmail probably isn't adequate for production scale; you can only send about 100 emails a day.
* Gmail doesn't consider apps that send mail over the SMTP protocol secure. When you try to run the app with a Gmail account, you'll get security warnings on that account unless you have enabled what Google calls ["Less Secure Apps"](https://support.google.com/accounts/answer/6010255?hl=en).
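For reference, the sending path behind these settings boils down to standard-library SMTP over TLS. The app itself uses Flask-Mail, so this is only a sketch; the addresses and credentials below are placeholders mirroring the example config:

```python
import smtplib
from email.message import EmailMessage

def build_message(sender, recipient, subject, body):
    """Assemble a plain-text email using only the standard library."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg

msg = build_message("notarealemail@gmail.com", "you@example.com",
                    "SMS transcript processing", "Your results are attached.")

# Actually sending would look like this (requires real credentials and a
# Gmail account with "Less Secure Apps" enabled):
# with smtplib.SMTP("smtp.gmail.com", 587) as server:
#     server.starttls()
#     server.login("notarealemail@gmail.com", "password1234")
#     server.send_message(msg)
```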

## Running scripts async in the background vs. waiting for results

If config.PROCESS_ASYNC is set to True, the app uses Celery workers, a Redis queue, and Flask-Mail to manage script jobs and email results in the background. If config.PROCESS_ASYNC is set to False, the app runs script jobs synchronously and waits to deliver the results as linked files.

Synchronous mode is not recommended for production if you expect lots of large files that take a while (> 30 seconds) to process.

If you set config.PROCESS_ASYNC to True, you'll need to run Celery and Redis (which Celery uses to manage its queue):
* `celery -A celery_worker.celery worker --loglevel=info` will spin up a celery worker for you in a local dev environment. [More on celery workers](https://docs.celeryproject.org/en/stable/userguide/workers.html)
* Run redis in a different terminal window with `redis-server`.
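Stripped of the Flask and Celery specifics, the PROCESS_ASYNC switch amounts to choosing between enqueueing a job and running it inline. A toy sketch of that branch (the function names here are hypothetical, not the app's actual code):

```python
def dispatch_job(run_inline, enqueue, process_async):
    """Run a job now, or hand it to a queue, depending on config.

    `run_inline` and `enqueue` stand in for the real script runner and a
    Celery task's .delay(); both names are hypothetical.
    """
    if process_async:
        enqueue()          # a worker processes it and emails results later
        return "queued"
    return run_inline()    # the request blocks and gets linked files back

# Synchronous mode: the caller waits and receives the result directly.
sync_result = dispatch_job(lambda: "results.csv", lambda: None, process_async=False)

# Async mode: the job is queued and the HTTP request returns immediately.
async_result = dispatch_job(lambda: "results.csv", lambda: None, process_async=True)
```

This is why synchronous mode risks timeouts on large files: the web request holds open for the whole `run_inline` call.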

## Testing the app frontend

`pytest` should run all the tests in the `tests` folder.

## TODO

A Docker container would ease deployment.
34 changes: 34 additions & 0 deletions instance/config.py.example
@@ -0,0 +1,34 @@
import os

SECRET_KEY = 'dev' # change this for prod!
BASE_URL = 'http://localhost:5000/' # local dev
MAX_CONTENT_LENGTH = 16 * 1024 * 1024 # 16 MB

APP_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# note: no trailing commas here; a trailing comma would turn each path into a tuple
UPLOAD_FOLDER = os.path.join(os.path.dirname(APP_ROOT), 'Projects/NLP/SMS_Annotation/Input_Data')
RESULTS_FOLDER = os.path.join(os.path.dirname(APP_ROOT), 'Projects/NLP/SMS_Annotation/Output_Data')
SCRIPTS_FOLDER = os.path.join(os.path.dirname(APP_ROOT), 'Projects/NLP/SMS_Annotation/Code')

# If PROCESS_ASYNC is set to True, we run scripts in the background and email
# results with celery, redis and flask-mail.
# If PROCESS_ASYNC is set to False, we run scripts synchronously and await a
# link to the results. If a script takes too long to run (as it can with larger
# files), the web app may time out before the script finishes.
PROCESS_ASYNC = True

# Show full script error output in error message to user
SHOW_SCRIPT_ERRORS = False

CELERY_BROKER_URL = 'redis://localhost:6379/0'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/1'

MAIL_SERVER = 'smtp.gmail.com'
MAIL_PORT = 587
MAIL_USE_TLS = True
MAIL_USERNAME = 'notarealemail@gmail.com'
MAIL_PASSWORD = 'password1234'
EMAIL_SENDER = 'Votetripling SMS Transcript Processing, notarealemail@gmail.com'
EMAIL_SUBJECT = 'SMS transcript processing'
TEST_TARGET_EMAIL = 'your.email@example.com'

FILE_LIFE = 72 # number of hours uploaded and result files are kept on the server
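A FILE_LIFE-style expiry check reduces to comparing file modification times against a cutoff. A sketch of the idea (the helper name is hypothetical, not the app's actual cleanup code):

```python
import os
import time

def stale_files(folder, life_hours=72, now=None):
    """Return paths in `folder` not modified within the last `life_hours`.

    Hypothetical helper illustrating the FILE_LIFE setting; 72 mirrors the
    example config value above.
    """
    now = time.time() if now is None else now
    cutoff = now - life_hours * 3600
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.getmtime(os.path.join(folder, name)) < cutoff
    )
```

A periodic task (cron or a Celery beat schedule) could then delete whatever this returns.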
43 changes: 43 additions & 0 deletions parser/__init__.py
@@ -0,0 +1,43 @@
# because this code gets run from several different places, update PATH
# so we can find modules from wherever we run things
import os
import sys
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'instance'))

from celery import Celery
from flask import Flask
import config

celery = Celery(__name__, broker=config.CELERY_BROKER_URL, result_backend=config.CELERY_RESULT_BACKEND)

def create_app(test_config=None):
# create and configure the app
app = Flask(__name__, instance_relative_config=True)
app.instance_path = (os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'instance')) # maybe setting this manually fixes the config problem in blueprint celery tasks
app.config.from_mapping(
SECRET_KEY='dev',
DATABASE=os.path.join(app.instance_path, 'parser.sqlite')
)
if test_config is None:
# load the instance config, if it exists, when not testing
app.config.from_pyfile('config.py', silent=True)
else:
# load the test config if passed in
app.config.from_mapping(test_config)

import main
app.register_blueprint(main.bp)

# ensure the instance folder exists
try:
os.makedirs(app.instance_path)
except OSError:
pass

# redis_client.init_app(app)
celery.conf.update(app.config)

return app
6 changes: 6 additions & 0 deletions parser/celery_worker.py
@@ -0,0 +1,6 @@
#!/usr/bin/env python
import os
from __init__ import celery, create_app

app = create_app()  # create_app takes an optional test-config mapping, not a config name
app.app_context().push()