download synchronized snapshot of image+full text hathitrust PPA content #702

Open
5 of 6 tasks
rlskoeser opened this issue Jan 13, 2025 · 1 comment
@rlskoeser
Contributor

rlskoeser commented Jan 13, 2025

We are requesting a one-time rsync endpoint for images from the PPA volumes, which HT will set up.

The image rsync endpoint will be a static copy, so we want to use rsync to capture a snapshot of the image and full-text data. We don't have staffing to deal with rsync updates in production (e.g., reviewing modified excerpts and page ranges), so here's what I propose:

  • use rsync to download the image files from the endpoint HT will provide
  • use the Ansible replication playbook to copy PPA production data to staging so that the database and pairtree text match
  • run hathi_rsync in staging to update all PPA volumes; preserve the CSV report it generates so we have a record of which files were updated by rsync
  • write a brief readme documenting what is in the folder and how it was created

optional (for aligned ppa page corpus)

  • clear solr and index all pages and works in staging
  • generate text export from staging

The results of both image and full-text rsync operations should be copied and preserved elsewhere (TigerData?) and labeled somehow (containing folder?) with the date that they were captured.
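
One possible way to label the snapshot with its capture date is to put the date in the containing folder name. The following is only a sketch: the base path, the `images` and `fulltext` subfolder names, and the README file name are all assumptions for illustration, not an agreed layout.

```shell
#!/bin/sh
# Hypothetical base location; substitute the real storage path.
base_dir="${SNAPSHOT_BASE:-/tmp/ppa-snapshots}"

# Compute the capture date once so the folder name and readme agree.
capture_date="$(date +%Y%m%d)"
snapshot_dir="$base_dir/HathiTrust_image-text-$capture_date"

# Assumed layout: one subfolder per rsync operation.
mkdir -p "$snapshot_dir/images" "$snapshot_dir/fulltext"

# Stub readme recording when the snapshot was captured.
printf 'Snapshot captured: %s\n' "$capture_date" > "$snapshot_dir/README.txt"

echo "$snapshot_dir"
```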

Optional: It might be useful to run the text corpus export script in production so we have a record of the PPA production texts that are included in the image/text corpus.

We could consider running the text corpus export in staging so we have the page content in our preferred jsonl format, but this requires reindexing in Solr first.

@rlskoeser rlskoeser self-assigned this Feb 19, 2025
@rlskoeser rlskoeser moved this from To Do to In Progress in Iteration Planning Board Feb 19, 2025
@rlskoeser
Contributor Author

Here's the rsync command I'm using for the image dataset:

rsync -az --info=progress2 --inplace --whole-file --copy-links --recursive --times datasets.hathitrust.org::ppa /mnt/tigerdata/cdh/prosody/HathiTrust_image-text-20240219

I ran it first with --dry-run and -v (verbose) options to check the output before starting the rsync for real.
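
For reference, a rough sketch of the dry-run check: the same command with `--dry-run` and `-v` added. The `DRY_RUN` variable is a hypothetical toggle added for this sketch, and the script only prints the assembled command so it can be reviewed before anything is transferred.

```shell
#!/bin/sh
# Source and destination copied from the command above.
src="datasets.hathitrust.org::ppa"
dest="/mnt/tigerdata/cdh/prosody/HathiTrust_image-text-20240219"

# DRY_RUN is a hypothetical toggle; it defaults to a dry run.
if [ "${DRY_RUN:-1}" = "1" ]; then
    extra="--dry-run -v"
else
    extra=""
fi

# Assemble the command as a string and print it rather than executing it.
cmd="rsync -az --info=progress2 --inplace --whole-file --copy-links --recursive --times${extra:+ $extra} $src $dest"
echo "$cmd"
```

Setting `DRY_RUN=0` drops the two extra options, leaving the command as originally posted.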

I'll need to update the file permissions after the rsync completes; right now it looks like only the file owner has permissions.
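
A minimal sketch of one way to open up the permissions afterwards, assuming read access for group and others is what's wanted (the actual policy may need to be narrower). It runs against a throwaway directory with a made-up file name; on the real data the target would be the snapshot folder.

```shell
#!/bin/sh
# Demo on a throwaway directory, not the real snapshot.
target="$(mktemp -d)"
mkdir "$target/vol1"
touch "$target/vol1/page1.jpg"      # hypothetical file name

# Simulate the owner-only permissions seen after the rsync.
chmod 700 "$target" "$target/vol1"
chmod 600 "$target/vol1/page1.jpg"

# Add read for group and others; capital X adds execute only on
# directories (and on files that already have an execute bit).
chmod -R go+rX "$target"

ls -lR "$target"
```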
