Instructions for creating the ConConCor dataset

All code is made available under an Apache license.

Scripts can be found in the parent folder.

Initial sampling of approximately 200 OCR articles plus metadata per content/data query combination for 'contentious words', 'alternative words' and 'additional words' (refer to the datasheets).

```shell
cd sample_1
python3 PIPELINE.py
```

see sample_1/PIPELINE.py for details.
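The per-query sampling step can be sketched as follows. This is a minimal illustration only: the function name, record structure, and matching logic are assumptions, not the actual implementation in sample_1/PIPELINE.py.

```python
import random

def sample_articles(records, query_words, per_query=200, seed=0):
    """For each query word, sample up to `per_query` matching OCR articles.

    `records` is assumed to be a list of dicts with a "text" field;
    the real pipeline also carries article metadata alongside.
    """
    rng = random.Random(seed)
    out = {}
    for q in query_words:
        matches = [r for r in records if q in r["text"]]
        out[q] = rng.sample(matches, min(per_query, len(matches)))
    return out
```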

Scoring of P(sentence) based on bigram probabilities, applied to the output of sample_1/PIPELINE.py.
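Bigram sentence scoring of this kind can be sketched as below. The add-one smoothing and function names are assumptions for illustration, not necessarily what the pipeline script uses.

```python
import math
from collections import Counter

def train_bigrams(corpus_tokens):
    """Count unigram and bigram frequencies over a token list."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return unigrams, bigrams

def sentence_logprob(tokens, unigrams, bigrams):
    """Log P(sentence) under an add-one-smoothed bigram model."""
    vocab = len(unigrams)
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return logp
```

A sentence containing bigrams seen in the corpus scores higher than one built from unseen bigrams.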

Sample 5-sentence extracts centred on contentious, alternative and additional target words according to the ratios 20:20:5.

```shell
cd sample_2/
python3 PIPELINE.py
```

see sample_2/PIPELINE.py for details.
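The windowing and ratio sampling described above can be sketched as follows; the function names and pool structure are hypothetical, and sample_2/PIPELINE.py may differ in detail.

```python
import random

def centred_extract(sentences, target_idx, window=5):
    """Return up to `window` sentences centred on the sentence at `target_idx`."""
    half = window // 2
    start = max(0, target_idx - half)
    return sentences[start:start + window]

def sample_by_ratio(contentious, alternative, additional, ratios=(20, 20, 5), seed=0):
    """Sample from the three pools according to the 20:20:5 ratio,
    capped at each pool's size."""
    rng = random.Random(seed)
    pools = (contentious, alternative, additional)
    return [rng.sample(pool, min(k, len(pool))) for pool, k in zip(pools, ratios)]
```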

Build blocks (batches) of 50 sequential annotations, sampled without replacement from the previous sampling step.

Requires:

  • control.csv, a CSV of URL, query word, and text for each control sample

Run:

```shell
python3 make_blocks.py
```
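Block building amounts to drawing items without replacement into fixed-size batches, which can be sketched as follows. The function name and shuffle-then-slice approach are assumptions; make_blocks.py additionally handles the control samples.

```python
import random

def make_blocks(samples, block_size=50, seed=0):
    """Partition samples into blocks of `block_size`, drawn without replacement.

    A single shuffle followed by slicing is equivalent to repeatedly
    sampling without replacement: each item appears in exactly one block.
    """
    rng = random.Random(seed)
    pool = list(samples)
    rng.shuffle(pool)
    return [pool[i:i + block_size] for i in range(0, len(pool), block_size)]
```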

Assemble the Google Forms from the batches

see here for details

Web interface redirecting Prolific users to the assembled Google Forms

Code for generating the Flask web interface can be found here

Retrieve the annotations

see here for details

Build the datasets

Refer to and run create_datasets.py.
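The final assembly step writes the collected annotations out as a dataset file. A minimal sketch, assuming a simple per-annotation row layout (the column names here are hypothetical, not those used by create_datasets.py):

```python
import csv

def build_dataset(annotation_rows, out_path="dataset.csv"):
    """Write collected annotations to a CSV, one row per annotation.

    `annotation_rows` is assumed to be a list of dicts keyed by the
    hypothetical column names below.
    """
    fieldnames = ["url", "target_word", "extract", "annotation"]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(annotation_rows)
```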