
Provide breadcrumbs / code hooks for "partial re-ingestion" of content #152

Open
jacobwegner opened this issue Jun 14, 2023 · 3 comments

@jacobwegner
Contributor

The current ingestion process is idempotent; it assumes that we're always building up data from scratch, because that's what we do when we deploy the site.

During local development, I have a few shortcuts that I use to give a tighter "feedback loop" when working on a particular annotation.

I'd like to add this to support @jchill-git, @gregorycrane, and others who may be doing a lot more content previewing / editing than I have been in the past. It will also help us get better at incremental updates to content when content moves out of this "code" repo and into content repos like https://github.com/PerseusDL/canonical-greekLit or https://github.com/scaife-viewer/ogl-pdl-annotations.

@jacobwegner
Contributor Author

@jchill-git I've been working on this today and will hopefully circle back tomorrow. Ping me here or on Slack if there is anything else I can help with as far as getting the texts / alignments in.

If you have been able to get your "version" into the database, this code snippet might be helpful for getting the tokens out:

from pathlib import Path
from scaife_viewer.atlas.parallel_tokenizers import tokenize_text_parts

outdir = Path(".")  # the current directory, e.g. backend/

version_urn = "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:"  # replace with your version URN

outf = tokenize_text_parts(outdir, version_urn)  # writes out to urnctsgreeklittlg0012tlg001perseus-grc2.csv

A sample of that CSV file:

https://gist.github.com/jacobwegner/3a96e1763b7bc22d827680db1351a377


This gives you a CSV, handy for loading into a dataframe, with the calculated ve_ref value for each token.
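Something like this could work for loading it, assuming pandas and that the CSV includes a ve_ref column (any other column names will depend on the tokenizer output):

import pandas as pd

# load the tokenizer output written by tokenize_text_parts above
df = pd.read_csv("urnctsgreeklittlg0012tlg001perseus-grc2.csv")

# each row is a token; ve_ref addresses it within the version
print(df["ve_ref"].head())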

@jacobwegner
Contributor Author

I'll keep working on the backend branch and provide updates on my progress on this issue.

@jacobwegner
Contributor Author

My commit in scaife-viewer/backend@f4b4ecf wasn't working for the Arabic content in the Codespace today; I need to take a closer look.

The other thing I'd like to capture here (and add a hook / documentation for) is how the SV_ATLAS_DATA_DIR setting works.

If we made it an environment variable, that would allow folks to use a subset of the data in a data-wip directory or similar, e.g.:

data-wip/
├─ library/
│  ├─ <textgroup>/
│  │  ├─ metadata.json  # textgroup metadata
│  │  ├─ <work>/
│  │  │  ├─ metadata.json  # work and version metadata
│  │  │  ├─ <version>.txt  # version content

export SV_ATLAS_DATA_DIR=data-wip

./manage.py prepare_atlas_db --force

Files could be worked on from within data-wip (and even tracked in Git).

Once the file was ready for promotion to data/, it would be moved and updated in Git.
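Roughly, the settings hook could look like this (the "data" default and settings-module placement here are assumptions for illustration, not necessarily how the setting is defined today):

import os

# fall back to the full data/ directory when the variable isn't set (assumed default)
SV_ATLAS_DATA_DIR = os.environ.get("SV_ATLAS_DATA_DIR", "data")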
