All code is made available under an Apache license.
Scripts can be found in the parent folder.
Initial sampling of approximately 200 OCR articles plus metadata per content/data query combination for 'contentious words', 'alternative words' and 'additional words' (refer to the datasheets).
cd sample_1
python3 PIPELINE.py
see sample_1/PIPELINE.py for details.
Sample 5-sentence extracts centred on contentious, alternative and additional target words, in the ratio 20:20:5.
cd sample_2/
python3 PIPELINE.py
see sample_2/PIPELINE.py for details.
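As a rough illustration of this step (not the code in sample_2/PIPELINE.py), the sketch below takes the sentence containing a target word plus two sentences on either side, then draws extracts per category. The function names, the naive regex sentence splitter, and the reading of 20:20:5 as counts per group of 45 are all assumptions made for the example.

```python
import random
import re


def five_sentence_extract(text, target, window=2):
    """Return the first sentence containing `target` plus `window` sentences
    on either side, or None if the target word does not occur."""
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            return " ".join(sentences[start:end])
    return None


# Assumed interpretation of the 20:20:5 ratio: counts drawn per category.
RATIO = {"contentious": 20, "alternative": 20, "additional": 5}


def sample_by_ratio(pools, ratio=RATIO, seed=0):
    """Draw ratio[category] extracts from each pool, without replacement."""
    rng = random.Random(seed)
    picked = []
    for category, n in ratio.items():
        picked.extend(rng.sample(pools[category], n))
    return picked
```

Extracts near the start or end of an article come back shorter than five sentences, because the window is clipped rather than shifted.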
Build blocks (batches) of 50 sequential annotations, sampled without replacement from the previous sampling step.
Requires:
- control.csv, a CSV file with the URL, query word and text for each control sample
Run:
python3 make_blocks.py
see here for details
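The block-building idea can be sketched as follows. This is an illustrative assumption, not the contents of make_blocks.py: the function name and the fixed seed are hypothetical, and the handling of control samples is left out. Shuffling once and slicing guarantees that no item appears in more than one block.

```python
import random


def make_blocks(items, block_size=50, seed=0):
    """Split `items` into consecutive blocks of `block_size`.

    The items are shuffled once and then sliced, so every item is used
    exactly once (sampling without replacement). The last block may be
    shorter if len(items) is not a multiple of block_size.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [
        shuffled[i:i + block_size]
        for i in range(0, len(shuffled), block_size)
    ]
```

For example, 120 extracts would yield two full blocks of 50 and a final block of 20.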
Code for generating the flask web interface can be found here
see here for details
Refer to create_datasets.py and run it.