Notes on "high-level" HTTP support #124
Description
I've heard repeatedly that `requests` is a "high-level" HTTP API and `urllib3` is a "low-level" HTTP API, but I wasn't super clear on what exactly that meant in practice. So I read `requests` today in hopes of learning more about what secret sauce it's adding. These are some notes/thoughts I wrote down while doing that.
Things that requests does:
- Convenience functions: `get`, `post`, etc., plus all the helpers for encoding and decoding bodies. Including:
  - the thing where you can pass in a bunch of different formats of request body and requests will try to DWIM
  - ability to read the response body as bytes, text, json, streaming bytes, streaming text, lines, ... (see the body-reading example after this list)
  - charset autodetection when reading text/json
  - adding `http://` on the front of URLs that are missing a scheme
  - choosing between `Content-Length` and `Transfer-Encoding: chunked` framing (see the framing sketch after this list)
  - measures time from starting the request until response headers are received (or something like that)
- lots of fiddling with URLs and URL encodings. It even seems to normalize URL encoding. Uses `urlparse`, which is highly dubious. Not sure what this is actually about.
- a "hooks" system that lets you register callbacks on particular events: https://2.python-requests.org//en/master/user/advanced/#event-hooks
  - In practice, it just lets you pass some functions that get to inspect/replace the `Response` object before it's returned to the user (see the hooks example after this list)
- automatic redirect handling. This has a lot of interesting complexities:
  - there are some redirect defaults in `requests/sessions.py` for `get`, `options`, `head`... but not the other methods, for some reason? I'm not sure what this actually does. (They're also in `requests/api.py`, but AFAICT that's totally redundant.)
  - requests accepts file objects as request bodies, and streams them. This requires seeking back to the beginning on redirects
  - switching proxy configuration when redirected
  - saves the list of response objects from intermediate requests. I think the body is unconditionally read and stashed in memory on the response object! This is plausibly correct, since I guess redirect bodies are always small, and you want to consume the body so the connection can be released back to the pool? Though there's some kind of DoS opportunity here if the server gives back a large body – normally a client can protect themselves against large response bodies by choosing to use the streaming body-reading APIs (see the streaming example after this list), but in this case requests unconditionally fetches and stores the whole body and there's no way to stop it. Not the scariest thing, but it should probably be considered a bug.
  - there's a lower-level "incremental" API: if you set `allow_redirects=False`, you get back the response object for the first request in the chain, and that object has some extra state on it that tracks where it is in the request chain, which you can access with `requests.Response.next` (see the manual-redirect example after this list). However, I'm not 100% sure how you use this... it seems like it would be more useful for `next()` to go ahead and perform the next request and return the response? Maybe if the API were that easy, it could be what gets recommended to everyone who wants to see the detailed history of the redirect chain, and avoid some of the awkwardness about intermediate bodies mentioned in the previous bullet point?
  - the redirect code uses a ton of mutable state and is entangled with a lot of the other features here, which makes it hard to understand and reason about
- authorization
  - digest auth handling is really complex: thread-local variables! relies on the hook system to integrate into redirect handling! reimplements the file-seek-on-redirect handling (I have no idea why)! some silly code for generating nonces that combines `os.urandom` and ad hoc entropy sources!
  - if the user passes `https://username:password@host/` URLs, requests strips the credentials out and converts them into an `HTTPBasicAuth` setup (see the URL-auth sketch after this list)
- cookies: you can pass in a cookiejar of cookies to use on requests, and it will automatically update it across a `Session` (see the cookie example after this list)
  - also interacts with redirects
- mounting different adapters (see the mounting example after this list)
- setting up TLS trust via certifi
- proxy autoconfiguration from the environment (reading `.netrc`, `$http_proxy`, `$NO_PROXY`, etc. -- see the proxy example after this list)
- forces all HTTP verbs to be uppercase -- interesting choice. The HTTP RFCs say that method names are case-sensitive, so in principle `GET` and `get` are both valid HTTP verbs, and they're different from each other. In practice, I've never heard of anyone using anything except all-uppercase, and apparently the requests devs never have either.
- pickling!
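
To make some of these concrete, here are a few sketches, starting with the body-reading conveniences as they look in requests' public API (the fallback details are from memory, so treat them as approximate):

```python
import requests

resp = requests.get("https://example.com/")
raw = resp.content    # whole body as bytes
text = resp.text      # decoded to str
data = None           # resp.json() would parse a JSON body

# If the Content-Type header doesn't declare a charset, resp.encoding
# is None and requests falls back to charset autodetection:
if resp.encoding is None:
    print("guessed charset:", resp.apparent_encoding)

# Streaming variants require stream=True, and the body can only be
# consumed once:
streamed = requests.get("https://example.com/", stream=True)
for chunk in streamed.iter_content(chunk_size=4096):
    pass  # each chunk is a bytes object
```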
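The framing choice, as a sketch: `choose_framing` is a hypothetical helper I made up to illustrate the decision, not requests' actual code:

```python
import os

def choose_framing(body):
    # Hypothetical sketch: if we can learn the body length up front,
    # declare Content-Length; otherwise fall back to chunked framing.
    if body is None:
        return {"Content-Length": "0"}
    if isinstance(body, str):
        body = body.encode("utf-8")  # requests encodes text bodies too
    if isinstance(body, (bytes, bytearray)):
        return {"Content-Length": str(len(body))}
    try:
        # Real file objects: remaining length = size minus current offset
        size = os.fstat(body.fileno()).st_size - body.tell()
        return {"Content-Length": str(size)}
    except (AttributeError, OSError):
        # Generators and other unsized iterables get chunked framing
        return {"Transfer-Encoding": "chunked"}
```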
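The hooks system in action (this part is documented, so I'm fairly confident in the shape):

```python
import requests

def on_response(response, *args, **kwargs):
    # A response hook gets to inspect the Response; returning a value
    # (other than None) replaces the Response the caller ultimately sees.
    print(response.status_code, response.url)
    return response

resp = requests.get("https://example.com/", hooks={"response": on_response})
```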
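On the DoS point above: here's the kind of self-protection a caller can normally do with the streaming API, and which the redirect handler bypasses by reading intermediate bodies unconditionally (the cap is an arbitrary number for illustration):

```python
import requests

MAX_BODY = 1024 * 1024  # arbitrary 1 MiB cap, for illustration

resp = requests.get("https://example.com/", stream=True)
received = 0
chunks = []
for chunk in resp.iter_content(chunk_size=8192):
    received += len(chunk)
    if received > MAX_BODY:
        resp.close()
        raise RuntimeError("response body too large")
    chunks.append(chunk)
body = b"".join(chunks)
```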
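My best guess at how the incremental redirect API is meant to be driven, assuming `Response.next` holds the prepared request for the next hop:

```python
import requests

session = requests.Session()
resp = session.get("http://github.com/", allow_redirects=False)
history = [resp]
# Response.next is (apparently) the PreparedRequest for the next hop
# in the chain, or None once there are no more redirects to follow:
while resp.next is not None:
    resp = session.send(resp.next, allow_redirects=False)
    history.append(resp)
```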
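The userinfo-in-URL handling, approximately: `split_url_auth` is a made-up helper sketching the transformation, not requests' actual code:

```python
from urllib.parse import urlsplit, urlunsplit
from requests.auth import HTTPBasicAuth

def split_url_auth(url):
    # Strip userinfo out of the URL and hand it back as HTTPBasicAuth.
    parts = urlsplit(url)
    if parts.username is None and parts.password is None:
        return url, None
    auth = HTTPBasicAuth(parts.username or "", parts.password or "")
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    bare = urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
    return bare, auth

url, auth = split_url_auth("https://user:secret@example.com/api")
# url == "https://example.com/api"; auth carries ("user", "secret")
```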
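Cookie persistence across a `Session` (real API; httpbin is just a convenient test server):

```python
import requests

session = requests.Session()
# Cookies the server sets are stored on session.cookies and sent
# back automatically on later requests through the same session:
session.get("https://httpbin.org/cookies/set?flavor=oatmeal")
print(session.cookies.get("flavor"))  # -> "oatmeal"

# You can also seed an individual request with extra cookies:
session.get("https://httpbin.org/cookies", cookies={"extra": "1"})
```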
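Mounting adapters (real API): adapters are matched by longest URL prefix, so per-host transport configuration can be swapped in without touching the defaults:

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Retry configuration for one host only; other URLs keep the default
# adapters that every Session mounts for http:// and https://.
session.mount("https://api.example.com/", HTTPAdapter(max_retries=3))
resp = session.get("https://api.example.com/things")
```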
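And the proxy autoconfiguration leans (at least partly, if I remember right) on the stdlib:

```python
import urllib.request

# getproxies() reads $http_proxy / $https_proxy / $no_proxy (plus
# platform-specific sources) and returns a {scheme: proxy_url} map.
print(urllib.request.getproxies())
```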
Dealing with cookies is really messy! AFAICT there isn't a modern WHATWG-style standard yet, so no one entirely knows how they're supposed to work. In Python there's the venerable `http.cookiejar`, whose API is super gnarly to work with, and I'm skeptical that its semantics are really suitable for the modern world, given that it comes from the same era as `http.client`. There's this `cookies` package on PyPI, which hasn't had a release since 2014, but the README has a bunch of important-sounding critiques of the stdlib's cookie handling. And... that's about it! Eek.
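
For a taste of the gnarliness, here's minimal `http.cookiejar` usage; the jar can only extract/add cookies via request/response objects shaped like `urllib.request`'s, which makes it awkward to bolt onto anything else:

```python
import urllib.request
from http.cookiejar import CookieJar

jar = CookieJar()
# The jar has to be wired in through urllib's opener machinery:
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.open("https://httpbin.org/cookies/set?tasty=yes")
for cookie in jar:
    print(cookie.name, cookie.value, cookie.domain)
```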
I've been thinking about architectures like: a `package.Session` with a high-level API (similar to requests), and a `package.lowlevel.ConnectionPool` with a low-level API (similar to what urllib3 has now, but simplified), where the low-level object is "mounted" onto the high-level object, so we can use the mounting feature as an API for users to inject quirky configuration stuff. This is along the same lines as httpx's middleware system; httpx wants to use this for redirects/auth/etc. Looking at the notes above, I'm not sure they're that easily splittable? But I need to think about it more. What parts would people reasonably need to be able to mix and match? (There's a rough sketch of the idea below.)
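
A purely hypothetical sketch of that mounting idea; none of these names exist in any library, it's just the design puzzle written down:

```python
class ConnectionPool:
    """Stand-in for the low-level API: knows how to send one request."""

    def request(self, method, url, headers, body):
        ...  # actual I/O would live here


class Session:
    """Stand-in for the high-level API, delegating to mounted transports."""

    def __init__(self):
        self._mounts = {}  # URL prefix -> low-level transport

    def mount(self, prefix, transport):
        self._mounts[prefix] = transport

    def request(self, method, url, headers=None, body=None):
        # Longest matching prefix wins, like requests' adapter lookup.
        prefix = max((p for p in self._mounts if url.startswith(p)),
                     key=len, default=None)
        if prefix is None:
            raise ValueError(f"no transport mounted for {url}")
        # High-level policy (redirects, auth, cookies) would wrap this
        # call -- the open question above is which of those parts can
        # be split out cleanly.
        return self._mounts[prefix].request(method, url, headers or {}, body)


session = Session()
session.mount("https://", ConnectionPool())
```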
Note: I literally just mean I've been thinking about these kinds of architectures, like as a design puzzle. I'm not saying that we should adopt an architecture like that for this project. But I am pondering what we should do.