How should we handle playback of redirects to the web archive itself? #591
Comments
Hm, it looks like that 301 response does not include a Location header, hence the blank page. Is that what's actually in the WARC record? Did the crawler end up crawling UKWA itself, or did it stop there? |
Hm, the WARC records look alright to me (see below). We do have some crufty records from accidentally crawling our own archive in the past, but we don't seem to have one for this particular page.
|
A-ha, I think this arises because there's a |
Hm, something weird is going on. I've deployed our latest PyWB on our BETA service, and made it filter out revisits, leading to this calendar: The ones prior to 20181013061545 work, but 20181013061545 onwards does not work. I've added an API so you can access the raw WARC record directly: |
To be clear, limiting pywb/pywb/warcserver/index/indexsource.py Lines 116 to 121 in 54d8bcc
|
Proposed #606. |
Following update to run under 2.5.0, this should work fine I think. Under a test server, it still says:
But I think that's because it's not running on |
Unfortunately, this doesn't seem to work on live, e.g. https://www.webarchive.org.uk/act/wayback/archive/20181013061546/https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx still says
EDIT: there are some suggestions the service might be seeing its internal server name |
Hm, yes, the error message suggests that something odd is happening with the URL lookup, and maybe some sort of mismatch on the prefix. |
Hey @ikreymer I don't suppose you've any idea how to fix this? |
Looking at: https://github.com/webrecorder/pywb/blob/main/pywb/warcserver/resource/responseloader.py#L120 I feel pretty sure the cases that are causing problems are a whole class of redirects that are not covered at all by the current implementation. The current implementation catches redirect loops, where the redirect location is the same as the current URL. The cases I'm hitting are cases where the redirect of If I understand the code flow, I think this could be added to the |
We at Arquivo.pt had a similar issue: we archived a page which includes a link to another archived page. When we try to follow the archived link, pywb tries to redirect us to an archived version of our own archived page. Since we don't archive ourselves, pywb fails to replay anything. We would have preferred pywb to recognize that it's trying to replay an archived page of our own archive and redirect to the original page instead. (Link to our github issue)
However, in our case I believe this could easily be fixed during replay: since we're not dealing with an HTTP redirect, we don't need to look at headers or anything, so this all happens on the client side. I thought that maybe we could just prevent pywb from processing URLs that point towards ourselves, and instead go straight to the desired endpoint.
As a proof of concept, I messed with the template of our pywb instance and implemented a very crude way to detect links that point towards ourselves (see below). This worked, so we will probably use a more robust version of this as a workaround for now, but it'd be great if this could be a configurable behavior for pywb.
window.addEventListener("message", onMessage, false);
function onMessage(event) {
  var data = event.data; // message payload posted by the replay frame
  if (data && typeof data.wb_type !== 'undefined') {
    if (data.wb_type == "load" || data.wb_type == "replace-url") {
      // the pywb event includes data.url, which is the original URL of the archived link
      if (typeof data.url === 'string' && data.url.includes('arquivo.pt/wayback')) {
        // force the browser to go to that URL instead of trying to get an archived version
        window.location.href = data.url;
      }
    }
  }
}
|
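A slightly more robust variant of the check above could parse the URL instead of using a substring match, so that a link such as `https://example.com/?ref=arquivo.pt/wayback` is not mistaken for a link to the archive. This is only a sketch: `isSelfArchiveUrl`, `archiveHost` and `replayPrefix` are hypothetical names, not part of pywb, and the host and prefix would have to come from each deployment's own configuration.

```javascript
// Hypothetical helper (not a pywb API): returns true when `url` points at
// this archive's own replay endpoint.
// `archiveHost` and `replayPrefix` are assumed, deployment-specific values.
function isSelfArchiveUrl(url, archiveHost, replayPrefix) {
  var parsed;
  try {
    parsed = new URL(url);
  } catch (e) {
    return false; // not an absolute URL, so leave it to the normal replay path
  }
  // Compare hosts case-insensitively and ignore a leading "www."
  var host = parsed.hostname.toLowerCase().replace(/^www\./, "");
  var wanted = archiveHost.toLowerCase().replace(/^www\./, "");
  if (host !== wanted) {
    return false;
  }
  // Match the prefix on a path-segment boundary, so "/waybackup" does not
  // count as being under "/wayback"
  var path = parsed.pathname;
  return path === replayPrefix || path.indexOf(replayPrefix + "/") === 0;
}
```

In the message handler, `data.url.includes('arquivo.pt/wayback')` would then become `isSelfArchiveUrl(data.url, 'arquivo.pt', '/wayback')`.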
Expected behavior
We've archived this page in the past: http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
The 2008 copy works fine, but it's been replaced with a redirect to us, the UK Web Archive: https://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
And since then, we've archived the redirect, so now the archive points at itself. This ends with a blank page (at least when using a more recent pywb, here: https://beta.webarchive.org.uk/wayback/archive/20140613220103mp_/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx)
It should ideally somehow know those are self-redirects and drop them, rolling back to the 2008 version: http://beta.webarchive.org.uk/wayback/archive/cdx?url=http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
EDIT to try and make what's going on clear: The actual WARC response record has a Location header that points back to us, the UK Web Archive, i.e. we indexed a redirect to ourselves, because they put in redirects to us, but we kept archiving their pages.
Really, I guess we don't want to index responses that point to any web archive, so perhaps this is an indexing problem not a playback problem?
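To make the "drop self-redirects and roll back" idea concrete, here is a rough sketch of the selection logic, written in JavaScript purely for illustration. It is not how pywb works internally: the capture shape (`timestamp`, `status`, `location`), the function names, and the list of archive hosts are all assumptions, and it only looks backwards in time from the requested timestamp.

```javascript
// Hypothetical capture shape (not pywb's index format):
//   { timestamp: "14-digit string", status: 301, location: "https://..." }

// True when the capture is a 3xx response whose Location points at one of
// the hosts in `archiveHosts` (an assumed, configured list).
function isArchiveRedirect(capture, archiveHosts) {
  if (capture.status < 300 || capture.status >= 400 || !capture.location) {
    return false;
  }
  try {
    var host = new URL(capture.location).hostname.toLowerCase();
    return archiveHosts.indexOf(host) !== -1;
  } catch (e) {
    return false; // relative or malformed Location: cannot name another host
  }
}

// Returns the latest capture at or before `requestedTs` that is not a
// redirect into a web archive, or null if there is none.
function pickCapture(captures, requestedTs, archiveHosts) {
  var best = null;
  captures.forEach(function (c) {
    if (c.timestamp > requestedTs || isArchiveRedirect(c, archiveHosts)) {
      return; // too late, or a redirect to an archive: skip it
    }
    // Fixed-width timestamps compare correctly as plain strings
    if (best === null || c.timestamp > best.timestamp) {
      best = c;
    }
  });
  return best;
}
```

The same predicate could equally be applied at indexing time, so that such records never reach the index in the first place.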
What actually happened
Blank page instead of 2008 instance.
Browser
All.