How should we handle playback of redirects to the web archive itself? #591

anjackson opened this issue Nov 12, 2020 · 12 comments
@anjackson
Contributor

anjackson commented Nov 12, 2020

Expected behavior

We've archived this page in the past: http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx

The 2008 copy works fine, but it's been replaced with a redirect to us, the UK Web Archive: https://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx

And since then, we've archived the redirect, so now the archive points at itself. This ends with a blank page (at least when using a more recent pywb, here: https://beta.webarchive.org.uk/wayback/archive/20140613220103mp_/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx)

It should ideally somehow know those are self-redirects and drop them, rolling back to the 2008 version: http://beta.webarchive.org.uk/wayback/archive/cdx?url=http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx


EDIT to try and make what's going on clear: The actual WARC response record has a Location header that points back to us, the UK Web Archive, i.e. we indexed a redirect to ourselves, because they put in redirects to us, but we kept archiving their pages.

Really, I guess we don't want to index responses that point to any web archive, so perhaps this is an indexing problem not a playback problem?
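As a rough illustration of the indexing-side idea, here is a minimal Python sketch that drops redirect records whose Location header points at an archive host. The `ARCHIVE_HOSTS` set, the function names, and the record layout (a dict with `status` and `location`) are assumptions made for this example; they are not part of pywb or any indexer.

```python
from urllib.parse import urlsplit

# Assumed configuration: hosts known to serve archived content.
ARCHIVE_HOSTS = frozenset({"www.webarchive.org.uk", "beta.webarchive.org.uk"})


def redirects_to_archive(status, location, archive_hosts=ARCHIVE_HOSTS):
    """Return True if a response is a 3xx redirect into a web archive."""
    if not 300 <= int(status) < 400 or not location:
        return False
    host = urlsplit(location).hostname
    return host is not None and host in archive_hosts


def indexable(records, archive_hosts=ARCHIVE_HOSTS):
    """Yield only the records that should reach the index."""
    for record in records:
        if not redirects_to_archive(record["status"], record.get("location"), archive_hosts):
            yield record
```

With a filter like this, the 2019 redirects to www.webarchive.org.uk would never be indexed, while the ordinary http-to-https redirect on www.jisc.ac.uk would still be kept.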


What actually happened

Blank page instead of 2008 instance.

Browser

All.

@ikreymer
Member

Hm, it looks like that 301 response does not include a Location header, hence the blank page. Is that what it is in the WARC record? Did the crawler end up crawling UKWA itself, or did it stop there?
It seems like this should just be a special case for the self-redirect check, but not entirely clear from the response yet...

@anjackson
Contributor Author

Hm, the WARC records look alright to me (see below). We do have some crufty records from accidentally crawling our own archive in the past, but we don't seem to have one for this particular page.

/heritrix/output/frequent-npld/20191203215907/warcs/BL-NPLD-20191207020008229-03586-75~npld-heritrix3-worker-1~8443.warc.gz 634600266 0
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2019-12-07T05:49:33Z
WARC-IP-Address: 52.84.141.67
WARC-Payload-Digest: sha1:Z6IJ46JXZU7TCLCDINT3OMVFHV5GZPYU
WARC-Record-ID: <urn:uuid:73e71ddc-1886-4431-af60-d7792f41716e>
Content-Type: application/http; msgtype=response
Content-Length: 635

HTTP/1.1 301 Moved Permanently
Server: CloudFront
Date: Sat, 07 Dec 2019 05:49:33 GMT
Content-Type: text/html
Content-Length: 183
Connection: close
Location: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
X-Cache: Redirect from cloudfront
Via: 1.1 3ddebf82c7d3a31f75ae0b53cadb99f3.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: MAN50-C3
X-Amz-Cf-Id: BWDyc1daE0vinNIGPj4gwlw9nl-SwIAhmKDNQtH7EM3Gx5n06DFUTA==

<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>CloudFront</center>
</body>
</html>



/heritrix/output/warcs/quarterly/20191001020435/BL-20191005233634759-01995-62~ukwa-h3-pulse-quarterly~8443.warc.gz 383937190 0
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2019-10-05T23:42:42Z
WARC-IP-Address: 54.192.33.125
WARC-Payload-Digest: sha1:Z6S5IZX7WMF4M6AQ7W4C3MIH7IUZN3QT
WARC-Record-ID: <urn:uuid:15b1d8ec-45c1-492b-954d-bb2339f41d63>
Content-Type: application/http; msgtype=response
Content-Length: 1401

HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=iso-8859-1
Content-Length: 350
Connection: close
Date: Sat, 05 Oct 2019 23:42:42 GMT
Set-Cookie: AWSALB=Vrci55fIlihaFHr0WbnluCDpZfXRjrPdDr3JvSry9znByUayv6KtF4h3/AAK2wOo3de3me9gbcg6po1sdD5puEy3ISo6n8YsniPmgBg3Le2PNebeVlTOzFvP668R; Expires=Sat, 12 Oct 2019 23:42:42 GMT; Path=/
Server: Apache
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=31536000
Feature-Policy: microphone 'none'; payment 'none'; sync-xhr 'self' https://www.jisc.ac.uk”
Referrer-Policy: same-origin
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Location: http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
Cache-Control: max-age=1209600
Expires: Sat, 19 Oct 2019 23:42:42 GMT
X-Cache: Miss from cloudfront
Via: 1.1 3eb04a11bfe0f7e0abb7389a916f0d41.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: MAN50-C1
X-Amz-Cf-Id: ocfzaOYcoFIy_K7w7LiEp4s_awox_A9ZwW9ezra34owjGN8xUfmRlw==

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx">here</a>.</p>
</body></html>



/heritrix/output/warcs/quarterly/20190701020558/BL-20190704080945660-00599-63~ukwa-h3-pulse-quarterly~8443.warc.gz 959790954 0
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2019-07-04T10:24:33Z
WARC-IP-Address: 54.192.34.75
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Payload-Digest: sha1:Z6S5IZX7WMF4M6AQ7W4C3MIH7IUZN3QT
WARC-Refers-To-Date: 2019-04-03T19:48:41Z
WARC-Refers-To-Target-URI: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Record-ID: <urn:uuid:25369e12-dfe1-46bd-878e-c12521176c7c>
Content-Type: application/http; msgtype=response
Content-Length: 1051

HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=iso-8859-1
Content-Length: 350
Connection: close
Date: Thu, 04 Jul 2019 10:24:33 GMT
Set-Cookie: AWSALB=XWV2MwRQmJwO4YL/voHHHea8XDmOWBK9tcsyquOhIyceJF52oDj6ZHlABSdG5I9oKwyk9zZ0eA/GBMVg+4Y7Jtkfs05FvUFjCeMcR+VwAqtgEpaASOEUguGc0tRJ; Expires=Thu, 11 Jul 2019 10:24:33 GMT; Path=/
Server: Apache
x-frame-options: SAMEORIGIN
Strict-Transport-Security: max-age=31536000
Feature-Policy: microphone 'none'; payment 'none'; sync-xhr 'self' https://www.jisc.ac.uk”
Referrer-Policy: same-origin
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Location: http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
Cache-Control: max-age=1209600
Expires: Thu, 18 Jul 2019 10:24:33 GMT
X-Cache: Miss from cloudfront
Via: 1.1 a364335587d085de3832514f7712e0e0.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: MAN50-C1
X-Amz-Cf-Id: 3DYqGb2l7Hhe7loZSahKhc0WzoGjrsLKBUJX6i4tpo9lRzukWKN2fQ==




/heritrix/output/warcs/quarterly/20190401020202/BL-20190403192157681-01649-62~ukwa-h3-pulse-quarterly~8443.warc.gz 904849596 0
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2019-04-03T19:48:41Z
WARC-IP-Address: 54.192.33.85
WARC-Payload-Digest: sha1:Z6S5IZX7WMF4M6AQ7W4C3MIH7IUZN3QT
WARC-Record-ID: <urn:uuid:8545283f-6a9e-460b-8079-f9686b7f7fe8>
Content-Type: application/http; msgtype=response
Content-Length: 1377

HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=iso-8859-1
Content-Length: 350
Connection: close
Date: Wed, 03 Apr 2019 19:48:41 GMT
Set-Cookie: AWSALB=UykjQ7bRJJiF79tYvgooUZaygW6Ms4qy4z9V7fR4YCNJ79mZ+Qc80QTP2y8zVY32/k070noNtxK98AHX2+f6Sujfg+obKz+Al03s+gBPz7XqtGM5eKZ8X50ukhlT; Expires=Wed, 10 Apr 2019 19:48:41 GMT; Path=/
Server: Apache
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=31536000
Feature-Policy: microphone 'none'; payment 'none'; sync-xhr 'self' https://www.jisc.ac.uk”
Referrer-Policy: same-origin
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Location: http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
Cache-Control: max-age=1209600
Expires: Wed, 17 Apr 2019 19:48:41 GMT
X-Cache: Miss from cloudfront
Via: 1.1 c6c27fb3a8bc413f99e81981948a67c6.cloudfront.net (CloudFront)
X-Amz-Cf-Id: ydQSmLN9906psJSAJK0v21hrcA30BKpfRpIrtMR7Q8Ct1gare825cg==

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx">here</a>.</p>
</body></html>



/heritrix/output/warcs/quarterly/20190401020202/BL-20190403192157680-01648-62~ukwa-h3-pulse-quarterly~8443.warc.gz 882680918 0
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2019-04-03T19:48:40Z
WARC-IP-Address: 54.192.33.85
WARC-Payload-Digest: sha1:Z6IJ46JXZU7TCLCDINT3OMVFHV5GZPYU
WARC-Record-ID: <urn:uuid:f52e8a14-3662-484b-9789-ff0d5be11a3a>
Content-Type: application/http; msgtype=response
Content-Length: 611

HTTP/1.1 301 Moved Permanently
Server: CloudFront
Date: Wed, 03 Apr 2019 19:48:40 GMT
Content-Type: text/html
Content-Length: 183
Connection: close
Location: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
X-Cache: Redirect from cloudfront
Via: 1.1 5df88084d2e6c90392a3f4e5a634f39d.cloudfront.net (CloudFront)
X-Amz-Cf-Id: LVWNGCzgau2FFu6GgC-i51SJPBwqoAB7C6JtGZDtuRUu5ntyTXVZIQ==

<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>CloudFront</center>
</body>
</html>



/heritrix/output/warcs/quarterly/20181001021312/BL-20181013053614140-01151-63~ukwa-h3-pulse-quarterly~8443.warc.gz 623624382 0
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2018-10-13T06:15:46Z
WARC-IP-Address: 13.33.54.2
WARC-Payload-Digest: sha1:Z6S5IZX7WMF4M6AQ7W4C3MIH7IUZN3QT
WARC-Record-ID: <urn:uuid:882211ed-fcc1-45f0-a1aa-be342c641bf7>
Content-Type: application/http; msgtype=response
Content-Length: 1443

HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=iso-8859-1
Content-Length: 350
Connection: close
Date: Sat, 13 Oct 2018 06:15:46 GMT
Server: Apache
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Feature-Policy: microphone 'none'; payment 'none'; sync-xhr 'self' https://www.jisc.ac.uk”
Referrer-Policy: same-origin
Public-Key-Pins: pin-sha256='X3pGTSOuJeEVw989IJ/cEtXUEmy52zs1TZQrU06KUKg='; pin-sha256='MHJYVThihUrJcxW6wcqyOISTXIsInsdj3xK8QrZbHec='; pin-sha256='isi41AizREkLvvft0IRW4u3XMFR2Yg7bvrF7padyCJg='; includeSubdomains; max-age=2592000
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Location: http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
Cache-Control: max-age=1209600
Expires: Sat, 27 Oct 2018 06:15:46 GMT
X-Cache: Miss from cloudfront
Via: 1.1 4583e6648e47a3495c29f53f72bab417.cloudfront.net (CloudFront)
X-Amz-Cf-Id: GnI55yw_sYGl6cyVo3HiaES8TeMc3iBJDNkZv2RTAbUMdO6zwcG7ww==

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx">here</a>.</p>
</body></html>



/heritrix/output/warcs/quarterly/20181001021312/BL-20181013053614145-01152-63~ukwa-h3-pulse-quarterly~8443.warc.gz 635049293 0
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Date: 2018-10-13T06:15:45Z
WARC-IP-Address: 13.33.54.2
WARC-Payload-Digest: sha1:Z6IJ46JXZU7TCLCDINT3OMVFHV5GZPYU
WARC-Record-ID: <urn:uuid:9ef26f78-33b1-4b42-8d87-c2b040076ee6>
Content-Type: application/http; msgtype=response
Content-Length: 611

HTTP/1.1 301 Moved Permanently
Server: CloudFront
Date: Sat, 13 Oct 2018 06:15:45 GMT
Content-Type: text/html
Content-Length: 183
Connection: close
Location: https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
X-Cache: Redirect from cloudfront
Via: 1.1 17570bdaeda2a4497e4f831a500e55ff.cloudfront.net (CloudFront)
X-Amz-Cf-Id: msONVO-E33_sEVZ63a55-FDGPwH7U32RF2dtRVV5q2HX-ib_2G6Qvw==

<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>CloudFront</center>
</body>
</html>



/data/129641/8618135/WARCS/BL-8618135-72.warc.gz 44738434 0
WARC/1.0
WARC-Type: response
WARC-Date: 2008-06-25T15:26:08Z
WARC-Record-ID: <urn:uuid:2b18bb8e-14bc-47bf-8688-5469fe75767d>
WARC-IP-Address: 83.137.214.22
WARC-Target-URI: http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx
WARC-Payload-Digest: sha512:3c8ece225eeef7b8484991d572e59f10335b4acaf689e6923554b087a69b8056e5703c0aaaed22da452a911ae74faa3a9ec3f2fa0e4668b2059b22d6b80386fe
Content-Type: application/http;msgtype=response
WARC-Identified-Payload-Type: text/html
Content-Length: 16490

HTTP/1.1 200 OK
Connection: close
Date: Wed, 25 Jun 2008 15:26:09 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Cache-Control: no-cache, no-store
Pragma: no-cache
Expires: -1
Content-Type: text/html; charset=utf-8
Content-Length: 16207


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head> 
	<meta content="text/html; charset=UTF-8" http-equiv="Content-Type" />
	<link href="/css/print.css" rel="style
...

@anjackson
Contributor Author

A-ha, I think this arises because there's a closest_limit of 10 that's used when looking up the URL in OutbackCDX. PyWB appends &limit=10&matchType=exact to the query and that fails if there's a lot of revisits etc. I can't find a way to configure this setting!? HELP! :-)
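To show why a fixed limit can hide the wanted capture, here is a simplified Python model of a "closest N" lookup. It is not pywb's or OutbackCDX's actual code: the `closest_usable` function and the record layout are hypothetical, and the twelve revisit timestamps are synthetic values made up for the demonstration.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a 14-digit capture timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest_usable(captures, target, limit=None):
    """Rank captures by distance from `target`, keep at most `limit`,
    and return the first one that is not a revisit (or None)."""
    wanted = parse_ts(target)
    ranked = sorted(captures, key=lambda c: abs(parse_ts(c["timestamp"]) - wanted))
    window = ranked if limit is None else ranked[:limit]
    for capture in window:
        if capture["type"] != "revisit":
            return capture
    return None


# One real response from 2008 plus twelve synthetic revisits from July 2019.
captures = [{"timestamp": "20080625152608", "type": "response"}]
captures += [
    {"timestamp": "201907%02d102433" % day, "type": "revisit"}
    for day in range(1, 13)
]
```

In this model, `closest_usable(captures, "20190704102433", limit=10)` finds nothing because all ten slots are taken by revisits, whereas the same call without a limit reaches the 2008 response.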

@anjackson
Contributor Author

Hm, something weird is going on. I've deployed our latest PyWB on our BETA service, and made it filter out revisits, leading to this calendar:

https://beta.webarchive.org.uk/wayback/archive/*/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx#

The ones prior to 20181013061545 work, but 20181013061545 onwards does not work. I've added an API so you can access the raw WARC record directly:

https://beta.webarchive.org.uk/api/query/warc/20181013061545/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx

@anjackson
Contributor Author

anjackson commented Jan 20, 2021

To be clear, limiting closest_limit to a hardcoded value of 10 is definitely a problem and is causing various playback issues. It may not be the only problem.

def __init__(self, api_url, replay_url, url_field='load_url', closest_limit=10):
    self.api_url = api_url
    self.replay_url = replay_url
    self.url_field = url_field
    self.closest_limit = closest_limit
    self._init_sesh()

@anjackson
Contributor Author

Proposed #606.

@anjackson
Contributor Author

Following update to run under 2.5.0, this should work fine I think. Under a test server, it still says:

The url http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx could not be found in this collection. 

But I think that's because it's not running on www.webarchive.org.uk and that'll be fine once on live.

@anjackson
Contributor Author

anjackson commented Nov 25, 2021

Unfortunately, this doesn't seem to work on live, e.g. https://www.webarchive.org.uk/act/wayback/archive/20181013061546/https://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx still says

The url http://www.webarchive.org.uk/wayback/archive/20140613220103/http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx could not be found in this collection. 

EDIT: there are some suggestions the service might be seeing its internal server name prod1, at the NGINX level at least. See ukwa/w3act#664

@ikreymer
Member

Hm, yes, the error message suggests that something odd is happening with the URL lookup, and maybe some sort of mismatch on the prefix.

@anjackson
Contributor Author

anjackson commented Jan 30, 2023

Hey @ikreymer I don't suppose you've any idea how to fix this? It's now causing us major problems. Okay, we have a workaround, but this is a bit of a pain. Perhaps it's easier to modify cdxj-indexer to stop the records getting to the index.

@anjackson
Contributor Author

Looking at: https://github.com/webrecorder/pywb/blob/main/pywb/warcserver/resource/responseloader.py#L120 I feel pretty sure the cases that are causing problems are a whole class of redirects that are not covered at all by the current implementation.

The current implementation catches redirect loops, where the redirect location is the same as the current URL.

The cases I'm hitting are ones where the redirect for URL goes to https://www.webarchive.org.uk/wayback/archive/URL, i.e. extra configuration/logic is needed to drop redirects that go to hosts like *.webarchive.org.uk. It may be possible to block these at indexing time (see webrecorder/cdxj-indexer#21) but ideally they should be blocked here too.

If I understand the code flow, I think this could be added to the raise_on_self_redirect function as an additional case it deals with and treats as a self-redirect, so that those corresponding index entries get skipped.
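A standalone Python sketch of that additional case might look like the following. It does not reproduce pywb's real raise_on_self_redirect signature: `ArchiveRedirect`, `check_archive_redirect`, `first_playable`, and the capture dict layout are all hypothetical names chosen for this example, and the host patterns are assumed to come from configuration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


class ArchiveRedirect(Exception):
    """Raised when a capture is a redirect into a web archive."""


def check_archive_redirect(status, location, host_patterns):
    """Raise ArchiveRedirect if a 3xx response points at an archive host."""
    if not str(status).startswith("3") or not location:
        return
    host = (urlsplit(location).hostname or "").lower()
    if any(fnmatchcase(host, pattern) for pattern in host_patterns):
        raise ArchiveRedirect(location)


def first_playable(captures, host_patterns):
    """Return the first capture, in the given order, that is not an
    archive redirect; captures that raise are skipped."""
    for capture in captures:
        try:
            check_archive_redirect(
                capture["status"], capture.get("location"), host_patterns
            )
        except ArchiveRedirect:
            continue
        return capture
    return None
```

Given candidates ordered by closeness and the pattern `*.webarchive.org.uk`, the 2018 and 2019 redirects would be skipped and the loop would fall through to the 2008 capture. Note that this pattern does not match the bare host `webarchive.org.uk`, which would need its own entry.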

@tw4l tw4l self-assigned this Feb 2, 2023
@VascoRatoFCCN

VascoRatoFCCN commented Sep 4, 2023

We at Arquivo.pt had a similar issue: we archived a page which includes a link to another archived page. So when we try to follow the archived link, pywb tries to redirect us to an archived version of our own archived page. Since we don't archive ourselves, pywb fails to replay anything. We would have preferred it if pywb could recognize that it's trying to replay an archived page of our own archive and redirect to the original page instead. (Link to our github issue)

However, in our case I believe this could be easily fixed during replay: since we're not dealing with an HTTP redirect we don't need to look at headers or anything, so this is all happening on the client side. I thought that maybe we could just prevent pywb from processing URLs that point towards ourselves, and instead just go to the desired endpoint.

As a proof of concept, I messed with the template of our pywb instance and implemented a very crude way to detect links that point towards ourselves (see below). This worked, so we will probably use a more robust version of this as a workaround for now, but it'd be great if this could be a configurable behavior for pywb.

window.addEventListener("message", onMessage, false);        

function onMessage(event) {
  var data = event.data;  // the message payload; `data` was otherwise undefined in this handler
  if (data && typeof data.wb_type !== 'undefined') {
    if (data.wb_type == "load" || data.wb_type == "replace-url") {
      // the pywb event includes the data.url parameter, which is the original url of the archived link
      if (typeof data.url === 'string' && data.url.includes('arquivo.pt/wayback')) {
        // force the browser to go to the url instead of trying to get an archived version
        window.location.href = data.url;
      }
    }
  }
}
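A more robust check than a substring test could parse the link as a wayback-style URL and recover the timestamp and the original URL. The Python sketch below assumes the common layout of a 14-digit timestamp, an optional two-letter modifier such as `mp_`, and then the original URL; the `unwrap_archive_url` name and the regular expression are illustrative assumptions, not pywb code.

```python
import re

# Assumed layout: <scheme>://<host>/<prefix>/<14-digit timestamp>[<modifier>_]/<original url>
WAYBACK_URL = re.compile(
    r"^https?://(?P<host>[^/]+)/(?P<prefix>.*?)"
    r"(?P<timestamp>\d{14})(?P<modifier>[a-z]{2}_)?/(?P<url>.+)$"
)


def unwrap_archive_url(url):
    """Return (host, timestamp, original_url) for a wayback-style URL,
    or None if the URL does not follow that layout."""
    match = WAYBACK_URL.match(url)
    if match is None:
        return None
    return match.group("host"), match.group("timestamp"), match.group("url")
```

The returned host can then be compared against the archive's own host names before deciding whether to go straight to the embedded timestamp and URL.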
