Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add WACZ filename, depth, favIconUrl, isSeed to pages #2352

Merged
merged 18 commits into from
Feb 5, 2025

Conversation

tw4l
Copy link
Member

@tw4l tw4l commented Jan 30, 2025

Fixes #2348

Adds filename to pages, pointed to the WACZ file those files come from.

Also adds idempotent migration to backfill this information for existing pages.

Update: Now also adding depth, isSeed, and favIconUrl to pages as well, both for crawls moving forward and in the backfill migration.

Manual testing

  1. Spin up a local environment off of latest main
  2. Create some crawls, verify pages are added to database
  3. Re-deploy from this branch
  4. Verify migration ran successfully in backend logs, didn't raise errors
  5. Verify filename and other fields have been added for pages in older crawls
  6. Create a new crawl, verify filename is present and matches values in backfilled pages (namely, that it's just the WACZ filename, with no oid directory prefix)

@tw4l tw4l requested a review from ikreymer January 30, 2025 17:53
@tw4l tw4l marked this pull request as draft January 30, 2025 19:05
@tw4l
Copy link
Member Author

tw4l commented Jan 30, 2025

Converted to draft to add some additional features:

  • Adding depth, faviconUrl, and seed (potentially? duplicates depth a bit) to pages and to backfill migration
  • Looking into ensuring long-running migrations finish without issue, perhaps in init_container or by modifying migration to spawn background jobs

@ikreymer
Copy link
Member

ikreymer commented Feb 4, 2025

Converted to draft to add some additional features:

  • Adding depth, faviconUrl, and seed (potentially? duplicates depth a bit) to pages and to backfill migration
  • Looking into ensuring long-running migrations finish without issue, perhaps in init_container or by modifying migration to spawn background jobs

Added depth so far. Should we add faviconUrl? Could potentially use it later.
Didn't add seed as that's just depth == 0, main reason would be for an index that we can filter on isSeed true/false, which we might want to do for querying. Though, could still do that with an index on depth.

Re: init_container, I think we can just bump the startupProbe max time for now as the easiest option.

@tw4l
Copy link
Member Author

tw4l commented Feb 4, 2025

Added depth so far. Should we add faviconUrl? Could potentially use it later. Didn't add seed as that's just depth == 0, main reason would be for an index that we can filter on isSeed true/false, which we might want to do for querying. Though, could still do that with an index on depth.

It's such a small amount of additional information, I think it's probably worth the overhead to add all of these fields into the db if there's even a chance they'll be useful to have later (certainly I can see the favicon url being used before long!).

@tw4l tw4l force-pushed the issue-2348-page-wacz-mapping branch from 13d9d44 to 1e4b7f1 Compare February 4, 2025 21:07
@tw4l tw4l force-pushed the issue-2348-page-wacz-mapping branch from 32cbd3f to f013d69 Compare February 4, 2025 21:56
@tw4l tw4l marked this pull request as ready for review February 4, 2025 22:24
@tw4l
Copy link
Member Author

tw4l commented Feb 4, 2025

@ikreymer this should be ready to go for re-review/testing on dev now, there is a test failure but it's unrelated (QA issue, maybe due to a crawler or wabac change)

@tw4l tw4l requested a review from ikreymer February 5, 2025 16:11
@tw4l tw4l changed the title Add WACZ filename to pages Add WACZ filename, depth, favIconUrl, isSeed to pages Feb 5, 2025
@tw4l tw4l merged commit 0e9e70f into main Feb 5, 2025
20 of 22 checks passed
@tw4l tw4l deleted the issue-2348-page-wacz-mapping branch February 5, 2025 20:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support generating page->WACZ mapping on the backend.
2 participants