Ways of handling problematic WARC records #21

Open
anjackson opened this issue Dec 7, 2022 · 1 comment

Comments

@anjackson

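We've found some malformed WARC records like the one below, where the record body starts with access-log-style lines rather than an HTTP status line: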

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/feed/
WARC-Date: 2017-09-19T03:35:35Z
WARC-IP-Address: 176.58.112.27
WARC-Payload-Digest: sha1:ZQZJUQJW34BYM2R23SI7PDFMYFUTXGVU
WARC-Record-ID: <urn:uuid:d15353f7-1bb7-4441-92bf-1f2268639d52>
Content-Type: application/http; msgtype=response
Content-Length: 7026

19/Sep/2017:03:35:35 +0000|v1|40.77.167.54|www.mobyaffiliates.com|200|17922|35.197.249.238:80|0.019|0.019|GET /wp-content/uploads/2015/05/i6d2e3jOCVVc-e1432221090328.jpg HTTP/1.1||
19/Sep/2017: 03:35:35 +0000|v1|24.18.58.84|thestar.ie|200|73232|162.13.191.183:80|0.061|0.374|GET /wp-content/uploads/2015/12/video-woman-abusing-mcdonalds-cookies-brandy-wooten-353018.jpg HTTP/1.1||
19/Sep/2017: 03:35:36 +0000|v1|5.62.39.244|markom2020.no|403|0|35.197.196.129:80|0.339|0.339|GET /?author=1 HTTP/1.1||
19/Sep/2017: 03:35:36 +0000|v1|54.82.184.78|thestar.ie|200|0|162.13.191.183:80|0.389|0.389|HEAD /about-us/out-in-the-open-ace-back-at-work-hours-after-pittsburgh-defeat/ HTTP/1.1||
19/Sep/2017: 03:35:36 +0000|v1|69.162.124.230|www.adventure-holidays.ie|301|178|35.197.246.117:80|0.022|0.022|GET / HTTP/1.1||
19/Sep/2017: 03:35:36 +0000|v1|180.76.15.136|www.mobyaffiliates.com|200|20225|35.197.249.238:80|0.945|0.945|GET /mobile-advertising-networks/?key-markets=japan+indonesia&targeting=custom+operator HTTP/1.1||
19/Sep/2017: 03:35:37 +0000|v1|51.255.71.100|thestar.ie|200|32351|162.13.191.183:80|0.416|0.416|GET /about-us/sharon-corr-we-dont-judge-age/ HTTP/1.0||
19/Sep/2017: 03:35:37 +0000|v1|5.9.60.241|gullfoss.is|200|32745|35.197.192.76:80|2.819|2.819|GET /shop/?_wpnonce=9c17844d42&add_to_wishlist=3015 HTTP/1.1||
19/Sep/2017: 03:35:37 +0000|v1|188.163.72.15|www.alnouran.com|200|18017|35.189.109.142:80|0.006|0.006|GET /en/corporate-governance/corporate-social-responsibilities/ HTTP/1.1||
19/Sep/2017: 03:35:37 +0000|v1|178.154.200.9|canieatthere.eu|301|178|104.155.26.132:80|0.018|0.018|GET /robots.txt HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|141.8.142.44|canieatthere.co.uk|301|178|104.155.26.132:80|0.016|0.016|GET / HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|54.80.111.161|ravatherm.com|200|201784|104.199.60.90:80|0.071|0.071|GET /files/2016/03/DoP_RAVATHERM_300WB180_SK.pdf HTTP/1.0||
19/Sep/2017: 03:35:38 +0000|v1|131.253.25.146|www.grandunionhousing.co.uk|200|2117|35.189.99.79:80|0.060|0.060|GET /wp-content/uploads/2017/05/twitter.png HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|131.253.25.146|www.grandunionhousing.co.uk|200|4746|35.189.99.79:80|0.060|0.060|GET /wp-content/uploads/2017/05/google-plus.png HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|131.253.25.146|www.grandunionhousing.co.uk|200|1746|35.189.99.79:80|0.061|0.061|GET /wp-content/uploads/2017/05/facebook-icon.png HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|45.55.55.18|www.lanzarotesurf.com|200|28740|35.197.214.99:80|1.859|1.859|GET /es/reservas/surf-camp-nivel-intermedio/ HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|207.46.13.65|www.stickybottle.com|200|11209|134.213.209.62:80|1.007|1.007|GET /latest-news/hotly-contest-shay-elliott-memorial-in-prospect-as-top-men-fine-tune-ras-form/ HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|66.249.85.10|nutroexpertos.com|200|33426|35.189.69.242:80|0.022|0.022|GET /wp-content/uploads/2015/05/Ejercicio-perro-484x330.jpg HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|157.55.39.239|www.janminihane.co.uk|200|4319|35.197.245.96:80|0.017|0.017|GET /wp-includes/js/jquery/jquery-migrate.min.js HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|35.189.215.158|www.axbom.se|301|178|35.197.249.238:80|0.005|0.005|GET /feed/axbom-se HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|66.249.85.10|nutroexpertos.com|200|96474|35.189.69.242:80|0.016|0.395|GET /wp-content/uploads/2015/06/Post-c%C3%B3mo-ense%C3%B1ar-a-cachorro-a-hacer-sus-necesidades-sobre-los-peri%C3%B3dicos.jpg HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|66.249.85.8|nutroexpertos.com|200|65231|35.189.69.242:80|0.040|0.428|GET /wp-content/uploads/2014/12/garrapatas-en-perros-2-484x330.jpg HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|35.197.192.76|gullfoss.is|200|6275|35.197.192.76:80|0.005|0.007|GET /wp-content/uploads/2016/07/Logo-Gullfoss_website-XI-1.png HTTP/1.0||
19/Sep/2017: 03:35:40 +0000|v1|54.82.184.78|thestar.ie|200|0|162.13.191.183:80|0.424|0.424|HEAD /about-us/1ds-niall-ill-find-the-next-mcilroy/ HTTP/1.1||
19/Sep/2017: 03:35:40 +0000|v1|218.90.137.18|laorcare.com|200|0|35.189.99.79:80|1.079|1.079|HEAD /wp-json/oembed/1.0/embed?url=http%3A%2F%2Flaorcare.com%2F HTTP/1.1||
19/Sep/2017: 03:35:40 +0000|v1|218.90.137.18|laorcare.com|200|2507|35.189.99.79:80|0.006|0.006|GET /wp-json/oembed/1.0/embed?url=http%3A%2F%2Flaorcare.com%2F HTTP/1.1||
Server: nginx
Date: Tue, 19 Sep 2017 03:35:41 GMT
Content-Type: application/rss+xml; charset=UTF-8
Connection: close
X-Cacheable: CacheAlways: feed
Cache-Control: max-age=600, must-revalidate
X-Cache: MISS
X-Cache-Group: bot
X-Pingback: http://www.estiethirionphotography.co.za/xmlrpc.php
Link: <http://www.estiethirionphotography.co.za/wp-json/>; rel="https://api.w.org/"
Link: <http://wp.me/p2ZY6I-Mb>; rel=shortlink
X-Type: feed
ETag: "1f5dd55566f2f1de600da749924ac5fb-gzip"
X-Pass-Why: 
Last-Modified: Fri, 27 Jan 2017 11:12:31 GMT

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	>
<channel>
	<title>Comments on: Fransua &#038; Anne-Louise wedding</title>
	<atom:link href="http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/</link>
	<description>Photography</description>
	<lastBuildDate>Fri, 27 Jan 2017 11:12:31 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.8.1</generator>
	<item>
		<title>By: nastassja harvey</title>
		<link>http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/#comment-9616</link>
		<dc:creator><![CDATA[nastassja harvey]]></dc:creator>
		<pubDate>Thu, 27 Oct 2011 17:41:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.estiethirionphotography.co.za/?p=2987#comment-9616</guid>
		<description><![CDATA[sooo mooi estie! :)]]></description>
		<content:encoded><![CDATA[<p>sooo mooi estie! 🙂</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kathryn van Eck</title>
		<link>http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/#comment-9609</link>
		<dc:creator><![CDATA[Kathryn van Eck]]></dc:creator>
		<pubDate>Wed, 26 Oct 2011 09:48:58 +0000</pubDate>
		<guid isPermaLink="false">http://www.estiethirionphotography.co.za/?p=2987#comment-9609</guid>
		<description><![CDATA[Beautiful work! I love the softness of your images and how you captured the couples joy.]]></description>
		<content:encoded><![CDATA[<p>Beautiful work! I love the softness of your images and how you captured the couples joy.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
19/Sep/2017:03:35:41 +0000|v1|194.66.232.93|www.estiethirionphotography.co.za|200|1985|162.13.104.162:80|5.773|5.773|GET /2011/10/fransua-anne-louise-wedding/feed/ HTTP/1.0||

This gets indexed as a malformed CDX record, with a fragment of the first log line ending up in the "status" field:

za,co,estiethirionphotography)/2011/10/fransua-anne-louise-wedding/feed 20170919033535 {"url": "http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/feed/", "mime": "application/rss+xml", "status": "+0000|v1|40.77.167.54|www.mobyaffiliates.com|200|17922|35.197.249.238:80|0.019|0.019|GET", "digest": "sha1:ZQZJUQJW34BYM2R23SI7PDFMYFUTXGVU", "length": "2785", "offset": "861793820", "filename": "test.warc.gz"}

I think it would be better to skip these records entirely rather than index them?
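
One possible way to detect such records is to check that the payload of a response record starts with a well-formed HTTP status line before indexing it. This is only a rough sketch using the Python standard library: the names `http_status` and `should_index` are hypothetical and are not part of this project or of any WARC library, and it assumes the indexer has access to the raw record body as bytes.

```python
import re
from typing import Optional

# An HTTP response must start with a status line such as "HTTP/1.1 200 OK".
# The reason phrase is optional, so the code may be followed by a space,
# a line ending, or the end of the data.
STATUS_LINE = re.compile(rb"^HTTP/\d(?:\.\d)? (\d{3})(?: |\r?\n|$)")


def http_status(payload: bytes) -> Optional[int]:
    """Return the status code if the payload starts with a valid status line, else None."""
    match = STATUS_LINE.match(payload)
    return int(match.group(1)) if match else None


def should_index(payload: bytes) -> bool:
    """Hypothetical policy: only index response records with a parseable status line."""
    return http_status(payload) is not None
```

With the record above, the first line of the body is a log line, so `http_status` returns `None` and the record would be skipped instead of producing a CDX entry with a bogus status.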

@anjackson

We also have redirects that point back to the web archive, which PyWB is unable to handle (webrecorder/pywb#591). It would be great to be able to filter out records that redirect to a particular host (www.webarchive.org.uk in this case).
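
A filter of this kind could look something like the sketch below. It is an illustration only: `redirects_to_blocked_host` and `BLOCKED_REDIRECT_HOSTS` are hypothetical names, and it assumes the indexer can supply the record's target URI, the parsed HTTP status code, and the response headers as a mapping.

```python
from typing import Collection, Mapping
from urllib.parse import urljoin, urlsplit

# Hypothetical configuration: hosts that archived redirects must not point at.
BLOCKED_REDIRECT_HOSTS = frozenset({"www.webarchive.org.uk"})


def redirects_to_blocked_host(
    target_uri: str,
    status: int,
    headers: Mapping[str, str],
    blocked: Collection[str] = BLOCKED_REDIRECT_HOSTS,
) -> bool:
    """Return True if this is a 3xx response whose Location resolves to a blocked host."""
    if not 300 <= status < 400:
        return False
    # Header names are case-insensitive, so compare them in lower case.
    location = next((v for k, v in headers.items() if k.lower() == "location"), None)
    if not location:
        return False
    # Location may be relative, so resolve it against the record's target URI.
    host = urlsplit(urljoin(target_uri, location.strip())).hostname
    return host is not None and host in blocked
```

Records for which this returns `True` could then be dropped at indexing time, so that PyWB never sees a redirect back into the archive.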
