This repo contains code used to prepare DWR manuscript data for classification by discipline and subject matter. Specifically, this repo houses R scripts for processing publication DOIs and generating RIS citation files that include abstracts. It includes two main workflows:
- Generate RIS file: A script that reads a text file containing DOIs and creates a basic RIS file with minimal citation fields. It also logs any errors that indicate that the DOI is invalid.
  Script: 01_generate_ris.R
- Enrich RIS with abstracts: A script that reads the previously generated RIS file, retrieves publication abstracts from the Crossref API, and appends them to the RIS records to create a RIS file with abstracts. It also logs any errors or cases where no abstract is available.
  Script: 02_pull_abstracts.R
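To make the two steps concrete, here is a minimal sketch in base R of what a basic RIS record looks like and how an abstract can be appended to it. The function names `make_ris_record` and `add_abstract`, the `JOUR` record type, and the placeholder DOI are assumptions for illustration; the actual scripts may be organized differently and may write other fields.

```r
# Hypothetical helpers for illustration; not taken from 01_generate_ris.R
# or 02_pull_abstracts.R.

# Build a minimal RIS record for one DOI (journal article type is assumed).
make_ris_record <- function(doi, title = NA_character_) {
  lines <- c("TY  - JOUR", paste0("DO  - ", doi))
  if (!is.na(title)) {
    lines <- c(lines, paste0("TI  - ", title))
  }
  # Every RIS record is closed by an ER line.
  c(lines, "ER  - ")
}

# Insert an abstract (AB tag) just before the end-of-record line.
add_abstract <- function(record, abstract) {
  end_idx <- which(grepl("^ER  -", record))[1]
  if (is.na(end_idx)) {
    stop("record has no ER line")
  }
  # Collapse line breaks and repeated spaces so the abstract stays on one line.
  abstract <- gsub("\\s+", " ", trimws(abstract))
  append(record, paste0("AB  - ", abstract), after = end_idx - 1)
}

# Placeholder DOI and text, not real publication data.
record <- make_ris_record("10.1000/example", title = "An example title")
enriched <- add_abstract(record, "First line\nsecond line.")
writeLines(enriched)
```

Placing the `AB` line before `ER` keeps the record well-formed, since reference managers treat `ER` as the end of a record.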
Note: the input and output documents currently in this repository are meant to illustrate the workflow and should not be used as final products.
Errors can be observed in the error logs generated by each script. Currently, known errors are:
- Incorrect DOIs: if DOIs contain typos or are otherwise incorrect, the workflow will fail to pull citation information and abstracts for them. The incorrect DOIs are logged in the DOI error log produced by the first script in the pipeline.
- API does not pull abstracts: for a minority of publications (perhaps 15% to 20%), the Crossref API does not return an abstract. This is an issue with the way that scientific journals and Crossref interact and is a challenge that cannot be solved with automation. For this subset of publications, logged in the abstract error log produced by the second script in the pipeline, abstracts will need to be manually added. It is possible to produce a script to support manual elements of the workflow, but that script does not exist yet.
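The two known error types can be told apart programmatically. The sketch below is an assumption about how such a check could look, not a copy of the repository's scripts: the helper names, the status labels, and the DOI pattern (a rough syntactic heuristic) are all hypothetical. It relies on the Crossref REST API `works` endpoint, which includes an `abstract` field in the returned `message` only when the publisher has deposited one.

```r
# Hypothetical helpers for illustration; names and status labels are assumptions.

# Rough syntactic check: "10.", a 4-9 digit prefix, a slash, then a suffix.
looks_like_doi <- function(doi) {
  grepl("^10\\.[0-9]{4,9}/\\S+$", doi, perl = TRUE)
}

# Return the abstract as plain text, or NA when Crossref has none.
# Crossref abstracts are JATS XML fragments, so markup tags are stripped.
extract_abstract <- function(msg) {
  abstract <- msg$abstract
  if (is.null(abstract) || !nzchar(abstract)) {
    return(NA_character_)
  }
  text <- gsub("<[^>]+>", " ", abstract)
  trimws(gsub("\\s+", " ", text))
}

# Query Crossref for one DOI; returns NULL when the DOI is not found.
# Requires the httr and jsonlite packages and network access.
fetch_crossref_message <- function(doi) {
  url <- paste0("https://api.crossref.org/works/", utils::URLencode(doi))
  resp <- httr::GET(url, httr::timeout(30))
  if (httr::status_code(resp) != 200) {
    return(NULL)
  }
  body <- httr::content(resp, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(body, simplifyVector = FALSE)$message
}

# Classify a DOI into one of the cases described above.
# `fetch` is a parameter so the logic can be tried without network access.
classify_doi <- function(doi, fetch = fetch_crossref_message) {
  if (!looks_like_doi(doi)) return("invalid_doi")
  msg <- fetch(doi)
  if (is.null(msg)) return("not_found")
  if (is.na(extract_abstract(msg))) return("no_abstract")
  "ok"
}

# Offline demonstration with a stub in place of the real API call.
stub_no_abstract <- function(doi) list(title = list("Placeholder title"))
print(classify_doi("10.1000/example", fetch = stub_no_abstract))
```

DOIs classified as `no_abstract` correspond to the subset that needs abstracts added by hand.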
The scripts use R and the following packages:
- here
- httr
- jsonlite
Make sure these packages are installed. You can install them by running:
install.packages(c("here", "httr", "jsonlite"))
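If you would rather check what is missing before installing anything, a small sketch like the following may help. The helper name `missing_packages` is an assumption for illustration and is not part of the repository.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("here", "httr", "jsonlite")
not_installed <- missing_packages(required)

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
  # Uncomment to install only what is missing:
  # install.packages(not_installed)
} else {
  message("All required packages are installed.")
}
```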