Add WACZ filename, depth, favIconUrl, isSeed to pages #2352
Converted to draft to add some additional features:
Re: init_container, I think we can just bump the startupProbe max time for now as the easiest option.
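For context on what "startupProbe max time" amounts to: in Kubernetes, a container gets roughly `failureThreshold * periodSeconds` (plus any `initialDelaySeconds`) to start before the probe gives up. A small sketch of that arithmetic follows. The helper name and the numbers are illustrative assumptions, not the values used in this deployment.

```python
def startup_budget_seconds(
    failure_threshold: int,
    period_seconds: int,
    initial_delay_seconds: int = 0,
) -> int:
    """Approximate time a container has to pass its startupProbe.

    Hypothetical helper for illustration only; the arguments mirror the
    Kubernetes probe fields failureThreshold, periodSeconds and
    initialDelaySeconds.
    """
    return initial_delay_seconds + failure_threshold * period_seconds


# Illustrative values only: 30 attempts, 10 seconds apart.
budget = startup_budget_seconds(failure_threshold=30, period_seconds=10)
```

Raising either `failureThreshold` or `periodSeconds` gives a long-running startup step, such as a migration, more time to finish.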
It's such a small amount of additional information, I think it's probably worth the overhead to add all of these fields into the db if there's even a chance they'll be useful to have later (certainly I can see the favicon url being used before long!).
fill in 'filename' if available for new crawls
set 'filename' when re-adding pages
work for #2348
13d9d44 to 1e4b7f1
32cbd3f to f013d69
@ikreymer this should be ready to go for re-review/testing on dev now. There is a test failure, but it's unrelated (a QA issue, maybe due to a crawler or wabac change).
Fixes #2348
Adds `filename` to pages, pointing to the WACZ file those pages come from.

Also adds an idempotent migration to backfill this information for existing pages.

Update: Now also adding `depth`, `isSeed`, and `favIconUrl` to pages as well, both for crawls moving forward and in the backfill migration.

Manual testing

- `filename` and other fields have been added for pages in older crawls
- `filename` is present and matches values in backfilled pages (namely, that it's just the WACZ filename, with no oid directory prefix)
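The backfill behaviour described above (idempotent, and storing only the WACZ filename without the oid directory prefix) could be sketched roughly as below. This is an in-memory illustration, not the migration in this PR. The `wacz_basename` and `backfill_pages` helpers, the dict-shaped page documents, and the `id`/`seed` keys on the per-WACZ page records are all assumptions made for the example.

```python
from pathlib import PurePosixPath


def wacz_basename(storage_path: str) -> str:
    """Return only the WACZ filename, dropping any directory prefix (e.g. an oid)."""
    return PurePosixPath(storage_path).name


def backfill_pages(db_pages: dict, wacz_pages: dict) -> int:
    """Fill in missing fields on stored pages and return how many pages changed.

    db_pages:   page id -> stored page document (hypothetical shape)
    wacz_pages: WACZ storage path -> page records read from that WACZ
    """
    changed = 0
    for storage_path, records in wacz_pages.items():
        filename = wacz_basename(storage_path)
        for record in records:
            page = db_pages.get(record["id"])
            if page is None:
                continue  # no stored page for this record; nothing to backfill
            updates = {
                "filename": filename,
                "depth": record.get("depth"),
                "isSeed": record.get("seed", False),
                "favIconUrl": record.get("favIconUrl"),
            }
            # Only set fields that are absent, so a second run changes nothing.
            missing = {k: v for k, v in updates.items() if k not in page}
            if missing:
                page.update(missing)
                changed += 1
    return changed
```

Because only absent fields are written, running the function a second time over the same data leaves every page untouched, which is the property that makes a migration like this safe to re-run.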