
Provide breadcrumbs / code hooks for "partial re-ingestion" of content #152

Open
jacobwegner opened this issue Jun 14, 2023 · 3 comments

@jacobwegner
Contributor

The current ingestion process is idempotent; it assumes that we're always building up data from scratch, because that's what we do when we deploy the site.

During local development, I have a few shortcuts that I use to give a tighter "feedback loop" when working on a particular annotation.

I'd like to add this to support @jchill-git, @gregorycrane, and others who may be doing a lot more content previewing / editing than I have been in the past. It will also help us get better at incremental updates to content when content moves out of this "code" repo and into content repos like https://github.com/PerseusDL/canonical-greekLit or https://github.com/scaife-viewer/ogl-pdl-annotations.

@jacobwegner
Contributor Author

@jchill-git I've been working on this today and will hopefully circle back tomorrow. Ping me here or on Slack if there is anything else I can help with as far as getting the texts / alignments in.

If you have been able to get your "version" into the database, this code snippet might be helpful for getting the tokens out:

from pathlib import Path
from scaife_viewer.atlas.parallel_tokenizers import tokenize_text_parts

outdir = Path(".")  # the current directory, e.g. backend/

version_urn = "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:"  # replace with your version URN

outf = tokenize_text_parts(outdir, version_urn)  # writes out to urnctsgreeklittlg0012tlg001perseus-grc2.csv

A sample of that CSV file:

https://gist.github.com/jacobwegner/3a96e1763b7bc22d827680db1351a377


This gives you a CSV, handy for loading into a dataframe, with the calculated ve_ref value for each token.
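Something like this could work for loading it, assuming pandas and that the CSV includes a ve_ref column (any other column names will depend on the tokenizer output):

import pandas as pd

# load the tokenizer output written by tokenize_text_parts above
df = pd.read_csv("urnctsgreeklittlg0012tlg001perseus-grc2.csv")

# each row is a token; ve_ref addresses it within the version
print(df["ve_ref"].head())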

@jacobwegner
Contributor Author

I'll keep working on the backend branch and provide updates on my progress on this issue.

@jacobwegner
Contributor Author

My commit in scaife-viewer/backend@f4b4ecf wasn't working for the Arabic content in the Codespace today; I need to take a closer look.

The other thing I'd like to capture here (and add a hook / documentation for) is how the SV_ATLAS_DATA_DIR setting works.

If we made it an environment variable, that would allow folks to use a subset of the data in a data-wip directory or similar, e.g.:

data-wip/
├─ library/
│  ├─ <textgroup>/
│  │  ├─ metadata.json  # textgroup metadata
│  │  ├─ <work>/
│  │  │  ├─ metadata.json  # work and version metadata
│  │  │  ├─ <version>.txt  # version content

export SV_ATLAS_DATA_DIR=data-wip

./manage.py prepare_atlas_db --force

Files could be worked on from within data-wip (and even tracked in Git).

Once the file was ready for promotion to data/, it would be moved and updated in Git.
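Roughly, the settings hook could look like this (the "data" default and settings-module placement here are assumptions for illustration, not necessarily how the setting is defined today):

import os

# fall back to the full data/ directory when the variable isn't set (assumed default)
SV_ATLAS_DATA_DIR = os.environ.get("SV_ATLAS_DATA_DIR", "data")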
