Description
Describe the bug
I'm trying to issue Range requests for some bytes out of PyPI wheels. The ETag response header is not constant across ranges, and I think it should be.
The immediate application is some code to extract metadata (yes, I know it's being backfilled), but I would also like to use this same code as a stopgap to enable targeted malware scanning of large wheels without downloading the whole thing.
I need to figure out if this is something that can be fixed, or if I need to work around changing ETags by sniffing Cache-Control: immutable or something.
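If the workaround ends up keying off Cache-Control, the check could look roughly like the following. This is a minimal sketch: `is_immutable` is a hypothetical helper name, and it is an assumption (not verified here) that the file host sends the `immutable` directive consistently for every range.

```python
from typing import Optional


def is_immutable(cache_control: Optional[str]) -> bool:
    """Return True if a Cache-Control header value carries the 'immutable' directive."""
    if not cache_control:
        return False
    # Directives are comma-separated; some carry '=value' (e.g. max-age=60).
    directives = {
        part.strip().split("=", 1)[0].lower()
        for part in cache_control.split(",")
    }
    return "immutable" in directives
```

A response without the directive (or without the header at all) would then fall back to stricter validation.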
Expected behavior
Requesting ranges from the beginning and end of wheels returns the same etag and content-type.
To Reproduce
$ curl -I https://files.pythonhosted.org/packages/88/8a/76a32ba459b4c376cc3780dca0f600bbbb63b3610249a068f7eb20991ee3/pandas-2.2.0-cp310-cp310-macosx_10_9_x86_64.whl
...
etag: "40d570a6b0f1aa3c1de02d4bf7b8e3ae-2"
content-type: binary/octet-stream
$ curl -I -H "Range: bytes=0-1" https://files.pythonhosted.org/packages/88/8a/76a32ba459b4c376cc3780dca0f600bbbb63b3610249a068f7eb20991ee3/pandas-2.2.0-cp310-cp310-macosx_10_9_x86_64.whl
...
etag: "40d570a6b0f1aa3c1de02d4bf7b8e3ae-2"
content-type: binary/octet-stream
$ curl -I -H "Range: bytes=10000000-10000001" https://files.pythonhosted.org/packages/88/8a/76a32ba459b4c376cc3780dca0f600bbbb63b3610249a068f7eb20991ee3/pandas-2.2.0-cp310-cp310-macosx_10_9_x86_64.whl
etag: "a0ee78f286e2d5f37a25879378e0c316"
content-type: application/octet-stream
The last-modified dates also differ by a few seconds, and the server and x-amz-server-side-encryption response headers are only present when requesting the beginning of the file. The cutoff appears to be at 3145728 bytes (3 * 2**20).
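The comparison above can be scripted instead of eyeballing curl output. A rough sketch using only the standard library; `head_with_range` and `differing_headers` are hypothetical names, and `CUTOFF` is just the observed value from this report, not a documented boundary.

```python
import urllib.request

# Observed cutoff from the report: 3 MiB.
CUTOFF = 3 * 2**20

# Headers that differed (or were missing) between the two kinds of response.
WATCHED = ("etag", "content-type", "last-modified", "server",
           "x-amz-server-side-encryption")


def head_with_range(url, start=None, end=None):
    """Send a HEAD request, optionally with a Range header; return lowercased headers."""
    req = urllib.request.Request(url, method="HEAD")
    if start is not None:
        last = "" if end is None else end  # 'bytes=N-' means "from N to the end"
        req.add_header("Range", f"bytes={start}-{last}")
    with urllib.request.urlopen(req) as resp:
        return {k.lower(): v for k, v in resp.headers.items()}


def differing_headers(a, b, names=WATCHED):
    """Map each watched header name to its (a, b) values where the two differ."""
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}
```

Calling `differing_headers(head_with_range(url, CUTOFF - 2, CUTOFF - 1), head_with_range(url, CUTOFF, CUTOFF + 1))` would probe either side of the apparent cutoff, though whether the boundary sits exactly there has not been confirmed.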
I'm assuming this is either bad cached content (if the etag has definitively changed) or a misbehavior in S3 (the docs are unclear on whether it can change over time, and my read of the Mozilla writeup on ETags is that it shouldn't depend on the range). Happy to work around it if you have ideas for doing so safely. The caching layer doesn't appear to support If-Range requests, so I went down the road of verifying them myself.
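One possible shape for that client-side verification, given that the ETag can't be trusted across ranges, is to compare the total length reported in each response's Content-Range header. This is only a sketch with hypothetical helper names, and it is a weaker validator than an ETag: it catches a size change but not a same-size replacement.

```python
import re

# Content-Range for a satisfied byte range: "bytes <start>-<end>/<total or *>"
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(value):
    """Parse a Content-Range value into (start, end, total); total is None for '*'."""
    m = _CONTENT_RANGE.match(value.strip())
    if m is None:
        raise ValueError(f"unrecognized Content-Range: {value!r}")
    total = None if m.group(3) == "*" else int(m.group(3))
    return int(m.group(1)), int(m.group(2)), total


def same_total_length(content_ranges):
    """True if every Content-Range value reports the same, known total length."""
    totals = {parse_content_range(v)[2] for v in content_ranges}
    return len(totals) == 1 and None not in totals
```

If the totals disagree between two range responses for the same URL, the safest reaction is probably to discard both and fall back to a full download.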