We have requested, and HT will set up, a one-time rsync endpoint for images from the PPA volumes.
The image rsync endpoint will be a static copy, so we want to use rsync to capture a snapshot of the image and full-text data. We don't have the staffing to handle rsync updates in production (e.g., reviewing modified excerpts and page ranges), so here's what I propose:
- use rsync to download the image files from the endpoint HT will provide
- use the Ansible replication playbook to copy PPA production data to staging, so that the database and pairtree text match production
- run `hathi_rsync` in staging to update all PPA volumes; preserve the CSV report it generates so we have a record of which files rsync updated
- write a brief README documenting what is in the folder and how it was created
- optional (for the aligned PPA page corpus):
  - clear Solr and index all pages and works in staging
  - generate the text export from staging
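The steps above can be sketched as a shell script. The endpoint path, playbook name, and management commands below are placeholders (assumptions, not confirmed values), and the `run` helper only prints each command so the plan can be reviewed before anything actually executes:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command instead of executing it; this is a plan sketch,
# not a finished script.
run() { echo "+ $*"; }

# 1. Download image files from the endpoint HT provides (placeholder host/path)
run rsync -av "HT_ENDPOINT::ppa-images/" ./ht_images/

# 2. Copy PPA production data to staging so the database and pairtree
#    text match production (placeholder playbook name)
run ansible-playbook replicate_production_to_staging.yml

# 3. Update all PPA volumes in staging; hathi_rsync generates a CSV report
#    of updated files (assumed here to be a Django manage.py command)
run python manage.py hathi_rsync

# Optional, for the aligned PPA page corpus (placeholder command names)
run python manage.py index                  # clear Solr and reindex pages and works
run python manage.py generate_text_corpus   # text export from staging
```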
The results of both the image and full-text rsync operations should be copied and preserved elsewhere (TigerData?) and labeled somehow (containing folder?) with the date they were captured.
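One way to label the preserved copies, assuming the "containing folder" approach with the capture date in the folder name (the naming convention is an assumption, not a decided choice):

```shell
# Create a date-labeled containing folder for the captured snapshot;
# the folder name and images/text layout are illustrative assumptions.
SNAPSHOT_DATE=$(date +%Y-%m-%d)
DEST="ppa_hathi_snapshot_${SNAPSHOT_DATE}"
mkdir -p "$DEST/images" "$DEST/text"
echo "created $DEST"
```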
Optional: It might be useful to run the text corpus export script in production so we have a record of the PPA production texts that are included in the image/text corpus.
We could also consider running the text corpus export in staging so we have the page content in our preferred JSONL format, but this requires reindexing in Solr first.