download synchronized snapshot of image+full text hathitrust PPA content #702

rlskoeser · 2025-01-13T21:13:09Z

We are requesting a one-time rsync endpoint for images from the PPA volumes, which HT (HathiTrust) will set up.

The image rsync endpoint will be a static copy, so we want to use rsync to capture a synchronized snapshot of both the image and full-text data. We don't have the staffing to deal with rsync updates in production (e.g., reviewing modified excerpts and page ranges), so here's what I propose:

use rsync to download the image files from the endpoint HT will provide
use ansible replication playbook to copy PPA production data to staging so that the database and pairtree text match
run hathi_rsync in staging to update all PPA volumes; preserve the CSV report it generates so we have a record of which files rsync updated
write brief readme documenting what is in the folder and how it was created

optional (for aligned ppa page corpus)

clear solr and index all pages and works in staging
generate text export from staging

The results of both image and full-text rsync operations should be copied and preserved elsewhere (TigerData?) and labeled somehow (containing folder?) with the date that they were captured.
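One way to label the preserved copies is a date suffix on the containing folder. This is a minimal sketch only: `snapshot_dir_name` is a hypothetical helper, and the `HathiTrust_image-text` base name and `YYYYMMDD` format are assumptions, not a decided convention.

```shell
# Hypothetical helper: join a base folder name and a capture date (YYYYMMDD).
snapshot_dir_name() {
    printf '%s-%s\n' "$1" "$2"
}

# Use today's date as the capture date (assumption: the label reflects
# the day the rsync snapshot was taken).
capture_date=$(date +%Y%m%d)
snapshot_dir_name "HathiTrust_image-text" "$capture_date"
```

Keeping the date in the folder name means the label travels with the data wherever it is copied.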

Optional: It might be useful to run the text corpus export script in production so we have a record of the PPA production texts that are included in the image/text corpus.

We could also consider running the text corpus export in staging so we have the page content in our preferred jsonl format, but this requires reindexing in Solr first.


rlskoeser · 2025-02-19T17:02:32Z

Here's the rsync command I'm using for the image dataset:

rsync -az --info=progress2 --inplace --whole-file --copy-links --recursive --times datasets.hathitrust.org::ppa /mnt/tigerdata/cdh/prosody/HathiTrust_image-text-20240219

I ran it first with --dry-run and -v (verbose) options to check the output before starting the rsync for real.
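For reference, the dry-run check can be sketched as follows. The variable names are illustrative only; the options, source, and destination are copied from the command above, with `--dry-run` and `-v` added. The sketch only prints the command for review and does not contact the endpoint.

```shell
# Options, source, and destination copied from the command above.
RSYNC_OPTS="-az --info=progress2 --inplace --whole-file --copy-links --recursive --times"
SRC="datasets.hathitrust.org::ppa"
DEST="/mnt/tigerdata/cdh/prosody/HathiTrust_image-text-20240219"

# Add --dry-run (report what would be transferred, change nothing) and
# -v (list each file). Print the command for review instead of running it.
DRY_RUN_CMD="rsync $RSYNC_OPTS --dry-run -v $SRC $DEST"
echo "$DRY_RUN_CMD"
```

Note that `-a` already implies `--recursive` and `--times`, so listing them again is redundant but harmless.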

I'll need to update the file permissions after the rsync completes; right now it looks like only the file owner has permissions.
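A possible permissions fix, sketched on a throwaway directory. Which users or groups should get access has not been decided here, so group read access is an assumption, and `open_group_read` is a hypothetical helper name.

```shell
# Hypothetical helper: recursively grant the group read access.
# Capital X adds execute only to directories (and files that are
# already executable), so directories stay traversable without
# marking image files as executable.
open_group_read() {
    chmod -R g+rX "$1"
}

# Demonstrate on a temporary directory standing in for the snapshot folder.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/vol"
touch "$demo_dir/vol/example-page.jpg"   # placeholder file name
chmod -R go-rwx "$demo_dir"              # mimic owner-only permissions

open_group_read "$demo_dir"
ls -l "$demo_dir/vol"

# Remove the demo directory.
rm -rf "$demo_dir"
```

On the real data this would be run once against the snapshot folder after the rsync finishes.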

rlskoeser added this to Iteration Planning Board Feb 19, 2025

rlskoeser moved this to To Do in Iteration Planning Board Feb 19, 2025

rlskoeser self-assigned this Feb 19, 2025

rlskoeser moved this from To Do to In Progress in Iteration Planning Board Feb 19, 2025
