SURT are not created for HTTP CONNECT requests in WARC file #20

Open
ARiedijk opened this issue Nov 30, 2022 · 0 comments

Hi, we are using this cdx-indexer tool and found that when replaying our WACZ files in the ReplayWeb.page player, certain resources were sometimes not found, even though they were present in the WARC files.

It turned out that our WARC files contain CONNECT requests, and these are not converted to a SURT. For example, url=distillery.wistia.com:443 is still distillery.wistia.com:443 after the surt.surt(url) call. The ReplayWeb.page player checks whether index.idx contains SURTs using useSurt = prefix.indexOf(")/") > 0; in MultiWacz.js. If the last line of a block happens to be a CONNECT entry, that block of the CDX is treated as surt = false, and the lookup in the browser DB using the upperBound method then does not work properly.
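
To make the failing check concrete, here is a rough Python equivalent of the JavaScript expression quoted above. The function name looks_like_surt is made up for illustration; only the ")/" test itself comes from MultiWacz.js.

```python
def looks_like_surt(prefix: str) -> bool:
    """Mirror of the JavaScript check: prefix.indexOf(")/") > 0.

    str.find returns -1 when the marker is absent, just like indexOf.
    """
    return prefix.find(")/") > 0


# A proper SURT key contains ")/" after the reversed host.
surt_key_ok = looks_like_surt("com,wistia,distillery)/")

# The unconverted CONNECT target has no ")/" at all.
connect_key_ok = looks_like_surt("distillery.wistia.com:443")
```

With these two inputs the first call returns True and the second returns False, which is why an index line holding the raw CONNECT target is not recognised as a SURT.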

Given:

A WARC file with:

WARC/1.0
Content-Length: 308
Content-Type: application/http;msgtype=request
WARC-Block-Digest: sha1:XDTRC67IG3EYGKYRBFK7BOYLBRJHW52X
WARC-Date: 2022-09-14T14:45:01Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Record-ID: <urn:uuid:d083e59a-e1c5-4079-bb20-cf6115fa342d>
WARC-Target-URI: distillery.wistia.com:443
WARC-Type: request

CONNECT distillery.wistia.com:443 HTTP/1.1
Accept-Encoding: *, compress;q=0, br;q=0
Content-Length: 0
Host: distillery.wistia.com:443
Proxy-Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/105.0.5195.102 Safari/537.36

When running the cdxj_indexer with the following parameters:

main.py -p -o index.idx -c index.cdx.gz -s -d -l 1024 small.warc

Then the result in the index is:

!meta 0 {"format": "cdxj-gzip-1.0", "filename": "c:\\temp\\index.cdx.gz"}
distillery.wistia.com:443 20220914144501 {"offset": 0, "length": 371, "digest": "sha256:8e8d3aa0f13b077615de09a2d349121130ec5fca9783c97d10c07721e1d13585"}

Expected:

com,wistia,distillery)/ 20220914144501 {"offset": 0, "length": 377, "digest": "sha256:b75ede157ec02f31a25126270771b287d1ccc42554c9678ebc2c1446249a554d"}
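
As a possible direction for a fix, below is a minimal sketch of how a scheme-less host:port target could be turned into the expected key. This is not the indexer's actual code: the helper connect_target_to_surt and its regular expression are hypothetical, it uses only the standard library, and it assumes that ports 80 and 443 are dropped while any other port is kept after the reversed host.

```python
import re
from typing import Optional

# Hypothetical pattern for a CONNECT target in host:port form (no scheme, no path).
_HOST_PORT = re.compile(r"^(?P<host>[A-Za-z0-9.-]+):(?P<port>\d+)$")

# Assumption: these ports are treated as defaults and omitted from the key.
_DEFAULT_PORTS = {"80", "443"}


def connect_target_to_surt(target: str) -> Optional[str]:
    """Return a SURT-style key for a host:port target, or None if it does not match."""
    match = _HOST_PORT.match(target)
    if match is None:
        return None  # Not a bare host:port; leave it to the normal SURT handling.

    host = match.group("host").lower().rstrip(".")
    labels = [label for label in host.split(".") if label]
    key = ",".join(reversed(labels))  # distillery.wistia.com -> com,wistia,distillery

    port = match.group("port")
    if port not in _DEFAULT_PORTS:
        key += ":" + port  # Assumption: non-default ports stay in the key.

    return key + ")/"


example_key = connect_target_to_surt("distillery.wistia.com:443")
```

For the record above, example_key is com,wistia,distillery)/, which matches the expected index line and passes the ")/" check used by the player.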
