Add WACZ filename, depth, favIconUrl, isSeed to pages #2352
Converted to draft to add some additional features:
Added.
Re: init_container, I think we can just bump the startupProbe max time for now as the easiest option.
It's such a small amount of additional information, I think it's probably worth the overhead to add all of these fields into the db if there's even a chance they'll be useful to have later (certainly I can see the favicon url being used before long!).
fill in 'filename' if available for new crawls
set 'filename' when readding pages
work for #2348
13d9d44 to 1e4b7f1
32cbd3f to f013d69
@ikreymer this should be ready to go for re-review/testing on dev now. There is a test failure, but it's unrelated (QA issue, maybe due to a crawler or wabac change).
Fixes #2348
Adds `filename` to pages, pointing to the WACZ file each page comes from.

Also adds an idempotent migration to backfill this information for existing pages.

Update: Now also adding `depth`, `isSeed`, and `favIconUrl` to pages, both for crawls moving forward and in the backfill migration.

Manual testing
- `filename` and other fields have been added for pages in older crawls
- `filename` is present and matches values in backfilled pages (namely, that it's just the WACZ filename, with no oid directory prefix)
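
To illustrate what "idempotent" and "no oid directory prefix" mean here, below is a minimal sketch in Python. It is not the migration from this PR: `wacz_basename` and `backfill_page` are hypothetical helpers, plain dicts stand in for database documents, and the WACZ page record is assumed to already use the same key names (`depth`, `isSeed`, `favIconUrl`) as the page model.

```python
from pathlib import PurePosixPath
from typing import Any


def wacz_basename(storage_path: str) -> str:
    """Return only the WACZ filename, dropping any directory prefix."""
    return PurePosixPath(storage_path).name


def backfill_page(
    page: dict[str, Any], wacz_record: dict[str, Any], storage_path: str
) -> bool:
    """Fill in fields missing from `page`; return True if anything changed.

    Hypothetical helper: existing values are never overwritten, so running
    it a second time on the same page is a no-op.
    """
    updates: dict[str, Any] = {}

    if "filename" not in page:
        updates["filename"] = wacz_basename(storage_path)

    # Assumed key names; checks use `in` so falsy values like depth 0 are kept.
    for field in ("depth", "isSeed", "favIconUrl"):
        if field not in page and field in wacz_record:
            updates[field] = wacz_record[field]

    page.update(updates)
    return bool(updates)
```

Under these assumptions, the first call on an older page adds the missing fields and returns `True`; a repeat call finds nothing to change and returns `False`, which is the property that makes re-running such a backfill safe.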