Install the following dependencies:
- python3
- pip
(Optional) Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
Install the required Python packages:
pip install -r requirements.txt
The user table, population table, and output file must be specified as arguments (--user_table, --population_table, --output). Every value passed to --demographics must be a valid column name in the user table.
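Before running the script, it can help to confirm that each requested demographic is actually a column in the user table. The following is a minimal sketch, not part of robust_poststrat.py: the helper name missing_demographics and the sample table contents are hypothetical, and it assumes the user table is a CSV file whose first row is a header.

```python
import csv
import io


def missing_demographics(user_table_file, demographics):
    """Return the requested demographics that are not columns in the user table.

    Assumes the table is CSV with a header row (an assumption, not a
    documented requirement of robust_poststrat.py).
    """
    reader = csv.reader(user_table_file)
    header = next(reader, [])  # empty file -> no columns
    columns = {name.strip() for name in header}
    return [d for d in demographics if d not in columns]


# Hypothetical user table contents, used only for illustration.
sample = io.StringIO("user_id,income,education\n1,45000,college\n")
print(missing_demographics(sample, ["income", "age"]))  # prints ['age']
```

For a real file, open it with open("/path/to/user_table.csv", newline="") and pass the file object in place of the in-memory sample.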
The script can be run using any combination of the following options:
- multiple demographics (raking or naive post-stratification)
- redistribution
- smooth before binning
- uninformed smoothing (ignores smoothing_k)
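For readers unfamiliar with raking, the idea can be sketched as iterative proportional fitting: user weights are rescaled one demographic at a time until the weighted sample matches the population share of every category. This is a generic textbook illustration, not the implementation in robust_poststrat.py; the function names, the toy users, and the target shares below are all hypothetical, and binning, smoothing, and redistribution are left out.

```python
from collections import defaultdict


def rake(users, targets, iterations=50):
    """Generic raking sketch (iterative proportional fitting).

    users:   list of dicts mapping demographic -> category
    targets: {demographic: {category: population share}}
    Returns one weight per user.
    """
    weights = [1.0] * len(users)
    for _ in range(iterations):
        for demo, shares in targets.items():
            total = sum(weights)
            # Current weighted total of each category for this demographic.
            current = defaultdict(float)
            for user, w in zip(users, weights):
                current[user[demo]] += w
            # Rescale so each category matches its population share.
            for i, user in enumerate(users):
                cat = user[demo]
                weights[i] *= shares[cat] * total / current[cat]
    return weights


def weighted_share(users, weights, demo, category):
    """Weighted fraction of users falling in `category` of `demo`."""
    total = sum(weights)
    return sum(w for u, w in zip(users, weights) if u[demo] == category) / total


# Hypothetical toy data, for illustration only.
users = [
    {"income": "low", "education": "hs"},
    {"income": "low", "education": "college"},
    {"income": "high", "education": "college"},
    {"income": "high", "education": "college"},
]
targets = {
    "income": {"low": 0.5, "high": 0.5},
    "education": {"hs": 0.4, "college": 0.6},
}
weights = rake(users, targets)
```

After raking, weighted_share(users, weights, "income", "low") is close to 0.5 and weighted_share(users, weights, "education", "hs") is close to 0.4, so both marginals are matched without requiring the joint income-by-education distribution of the population.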
Single correction factor (income)
python3 robust_poststrat.py --demographics income --smoothing_k 10 --mininum_bin_threshold 50 --user_table /path/to/user_table.csv --population_table /path/to/population_table.csv --output /path/to/output.csv
Single correction factor (income) with redistribution
python3 robust_poststrat.py --demographics income --smoothing_k 10 --mininum_bin_threshold 50 --redistribution --user_table /path/to/user_table.csv --population_table /path/to/population_table.csv --output /path/to/output.csv
Single correction factor (income) with smooth before binning
python3 robust_poststrat.py --demographics income --smoothing_k 10 --mininum_bin_threshold 50 --smooth_before_binning --user_table /path/to/user_table.csv --population_table /path/to/population_table.csv --output /path/to/output.csv
Single correction factor (income) with uninformed smoothing (smoothing_k is ignored)
python3 robust_poststrat.py --demographics income --smoothing_k 10 --mininum_bin_threshold 50 --uninformed_smoothing --user_table /path/to/user_table.csv --population_table /path/to/population_table.csv --output /path/to/output.csv
Multiple correction factors (income + education) using raking
python3 robust_poststrat.py --demographics income education --smoothing_k 10 --mininum_bin_threshold 50 --user_table /path/to/user_table.csv --population_table /path/to/population_table.csv --output /path/to/output.csv
Multiple correction factors (income + education) using naive post-stratification
python3 robust_poststrat.py --demographics income education --smoothing_k 10 --mininum_bin_threshold 50 --naive_poststrat --user_table /path/to/user_table.csv --population_table /path/to/population_table.csv --output /path/to/output.csv
Multiple correction factors (age + gender + income + education) using raking with redistribution
python3 robust_poststrat.py --demographics age gender income education --smoothing_k 10 --mininum_bin_threshold 50 --redistribution --user_table /path/to/user_table.csv --population_table /path/to/population_table.csv --output /path/to/output.csv
Python package dependencies:
- pandas
- numpy
- quantipy3
If you use this code in your work, please cite the following paper:
@article{giorgi2022correcting,
title={Correcting Sociodemographic Selection Biases for Population Prediction from Social Media},
author={Salvatore Giorgi and Veronica Lynn and Keshav Gupta and Farhan Ahmed and Sandra Matz and Lyle Ungar and H. Andrew Schwartz},
year={2022},
journal={Proceedings of the International AAAI Conference on Web and Social Media},
}