ETag on artifacts seems inconsistent #15387

Open
@thatch

Describe the bug
I'm trying to issue Range requests for some bytes out of PyPI wheels. The ETag response header is not constant across ranges of the same file, and I think it should be.

The immediate application is some code to extract metadata (yes, I know it's being backfilled), but I would also like to use this same code as a stopgap to enable targeted malware scanning of large wheels without downloading the whole thing.

I need to figure out whether this is something that can be fixed, or whether I need to work around changing ETags, for example by sniffing for Cache-Control: immutable.

Expected behavior
Requesting ranges from the beginning and end of a wheel returns the same ETag and Content-Type.

To Reproduce

$ curl -I https://files.pythonhosted.org/packages/88/8a/76a32ba459b4c376cc3780dca0f600bbbb63b3610249a068f7eb20991ee3/pandas-2.2.0-cp310-cp310-macosx_10_9_x86_64.whl 
...
etag: "40d570a6b0f1aa3c1de02d4bf7b8e3ae-2"
content-type: binary/octet-stream

$ curl -I -H "Range: bytes=0-1" https://files.pythonhosted.org/packages/88/8a/76a32ba459b4c376cc3780dca0f600bbbb63b3610249a068f7eb20991ee3/pandas-2.2.0-cp310-cp310-macosx_10_9_x86_64.whl
...
etag: "40d570a6b0f1aa3c1de02d4bf7b8e3ae-2"
content-type: binary/octet-stream

$ curl -I -H "Range: bytes=10000000-10000001" https://files.pythonhosted.org/packages/88/8a/76a32ba459b4c376cc3780dca0f600bbbb63b3610249a068f7eb20991ee3/pandas-2.2.0-cp310-cp310-macosx_10_9_x86_64.whl
etag: "a0ee78f286e2d5f37a25879378e0c316"
content-type: application/octet-stream

The last-modified dates also differ by a few seconds, and the server and x-amz-server-side-encryption response headers are only present when requesting the beginning of the file. The cutoff appears to be at 3145728 bytes (3 * 2**20).
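
A minimal sketch of how the cutoff could be located, assuming the ETag switches exactly once between two offsets. The helper names `head_etag` and `find_etag_cutoff` are hypothetical, and the sketch only uses the standard library; it is not necessarily how the number above was found.

```python
import urllib.request
from typing import Callable


def head_etag(url: str, offset: int) -> str:
    """Send a HEAD request for two bytes at `offset` and return the ETag header."""
    req = urllib.request.Request(
        url,
        method="HEAD",
        headers={"Range": f"bytes={offset}-{offset + 1}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("ETag", "")


def find_etag_cutoff(etag_at: Callable[[int], str], lo: int, hi: int) -> int:
    """Return the smallest offset in (lo, hi] whose ETag differs from the one at `lo`.

    Assumption: there is a single switch point between `lo` and `hi`.
    """
    base = etag_at(lo)
    if etag_at(hi) == base:
        raise ValueError("ETag is the same at both ends; no cutoff in this range")
    # Invariant: etag_at(lo) == base and etag_at(hi) != base.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if etag_at(mid) == base:
            lo = mid
        else:
            hi = mid
    return hi
```

Calling `find_etag_cutoff(lambda off: head_etag(url, off), 0, 10_000_000)` with the wheel URL above would issue a few dozen HEAD requests at most. Passing the lookup as a callable keeps the search testable without touching the network.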

I'm assuming this is either bad cached content (if the ETag has definitively changed) or a misbehavior in S3 (the docs are unclear on whether the ETag can change over time, and my read of the Mozilla writeup on ETags is that it shouldn't depend on the range). Happy to work around it if you have ideas for doing so safely. The caching layer doesn't appear to support If-Range requests, so I went down the road of verifying the ETags myself.
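
For illustration, a client-side check standing in for If-Range could look roughly like the sketch below. It assumes each range response carries an ETag and a Content-Range header, and the names `total_length` and `same_representation` are hypothetical. Weak ETags are rejected because If-Range requires a strong comparison.

```python
import re
from typing import Dict, Mapping, Optional

# Matches e.g. "bytes 0-1/12345"; an unknown total ("*") is deliberately not accepted.
_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+)")


def _lower(headers: Mapping[str, str]) -> Dict[str, str]:
    """Normalize header names to lowercase so lookups are case-insensitive."""
    return {k.lower(): v.strip() for k, v in headers.items()}


def total_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return the total size from Content-Range, or None if absent or malformed."""
    m = _CONTENT_RANGE.fullmatch(_lower(headers).get("content-range", ""))
    return int(m.group(3)) if m else None


def same_representation(first: Mapping[str, str], second: Mapping[str, str]) -> bool:
    """True only if both responses report the same strong ETag and total size."""
    a, b = _lower(first), _lower(second)
    etag = a.get("etag")
    if etag is None or etag.startswith("W/") or etag != b.get("etag"):
        return False
    size = total_length(a)
    return size is not None and size == total_length(b)
```

With the ETags from the reproduction above, this check reports a mismatch between the first range and the range at offset 10000000, so a client relying on it would have to refetch or give up rather than stitch the two ranges together.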

Labels: CDN/network (Issues related to our CDN, users having problems connecting to PyPI), bug 🐛
