Description
Following extensive testing that showed oracle performed much better than classic docmatching, we disabled curation of daily and weekly docmatching in June 2024. The crontab that automatically looks for curated files was disabled in mid-June, but doing so left some of the infrastructure active within the pipeline itself. The pieces that still need to be disabled are:
- uploading to Google Sheets
- broadcast of docmatching status via Slack
Additionally, the existing doc-matching process exports results to a backoffice file formatted for Google Sheets; see adsdocmatch/match_w_metadata.py, L86. Currently, the backoffice matching script reprocesses this file using grep/sed/awk to extract only the rows flagged as "Match" into a three-column file (preprint bibcode, published bibcode, and score), and it is this result that is uploaded to oracle (using the -mf option) on a daily/weekly basis.
We can handle the first two issues easily with changes to run.py, and there is a PR in progress that addresses these. The third issue will require a small amount of extra coding both in this repository and in the backoffice match scripts; we will need to update classic so that it expects a file that is already correctly formatted.
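
As a starting point for the third item, here is a minimal sketch of how the "Match"-only filtering now done with grep/sed/awk might be done in Python instead. It is not taken from the existing code: the function names, the column names (`label`, `source bibcode`, `matched bibcode`, `confidence`), and the tab-delimited output are all assumptions and would need to be checked against what match_w_metadata.py actually writes and what oracle expects.

```python
import csv


def extract_matches(rows, label_col="label", source_col="source bibcode",
                    matched_col="matched bibcode", score_col="confidence",
                    match_label="Match"):
    """Yield (preprint bibcode, published bibcode, score) for matched rows.

    `rows` is any iterable of dicts, e.g. a csv.DictReader. The default
    column names are hypothetical placeholders, not the real headers.
    """
    for row in rows:
        # Keep only rows whose label is exactly the match label.
        if (row.get(label_col) or "").strip() == match_label:
            yield (row[source_col].strip(),
                   row[matched_col].strip(),
                   row[score_col].strip())


def write_three_column_file(in_path, out_path, **column_overrides):
    """Read the Sheets-formatted CSV and write only the matched triples.

    The output is written tab-delimited here, which is an assumption.
    Returns the number of rows written.
    """
    count = 0
    with open(in_path, newline="") as fin, \
            open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout, delimiter="\t")
        for triple in extract_matches(csv.DictReader(fin), **column_overrides):
            writer.writerow(triple)
            count += 1
    return count
```

If something like this lived in this repository, the backoffice script could skip the reprocessing step and pass the resulting file straight to the existing upload.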