Don't normalise or double-escape urls #6923
Open
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
I recently hit a url that I could not retrieve with
requests
, but that can be retrieved using another client executing an identical http request. Specifically: When a url contains a percent-escaped tilde~
(i.e.%7E
),requests
behaves differently than any other http client that I tried and performs unneeded normalization. In addition to that, it double encodes invalid urls, which again differs from any other client.Testing
To illustrate this, I wrote some test code that records the paths that a variety of clients request. In addition to
requests
, I tested the following http client libraries / browsers acrosspython
,java
,go
,C
,js
andrust
. All http clients that I tested behave the same and use the target as-is, exceptrequests
.requests
normalizes%7E
to~
:The test code (server + automation to run all listed clients against it) is available here: https://github.com/moben/bugs/tree/aa8e4eb928e8189f5d863748d6d06e980d4f8a87/requests_http_location
The test server redirects
/v
tof"/v/%7E/-._~/{urllib.parse.quote(string.printable)}"
, i.e./v/%7E/-._~/0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ%21%22%23%24%25%26%27%28%29%2A%2B%2C-./%3A%3B%3C%3D%3E%3F%40%5B%5C%5D%5E_%60%7B%7C%7D~%20%09%0A%0D%0B%0C
All clients listed above use this verbatim, but
requests
normalizes it to/v/~/....
. Note: I originally thought this was specific to redirects, but it also happens when passing the url directly torequests.get
.Root Cause & Proposed Fix
You might object that some of the tested clients are built on each other so they should be one data point. But on the other hand, note that
requests
behaves differently fromurllib3
here.The reason for
requests
differing fromurllib3
can be found in the history of the current normalization: The normalization (requote_uri
) was added in 2013 in #1361 to resolve #1360. But in 2019 handling of invalid urls was also solved inurllib3
in urllib3/urllib3#1647. Not only does theurllib3
implementation more closely match what all other clients are doing and does not change valid urls. It also means thatrequests
is double-encoding invalid uris. If the url is invalid we probably can't expect much but the path thatrequests
uses has no chance to decode to the same that any other client uses, even to the most lenient server.I believe the best way to fix this is to simply drop the "requoting" that
requests
is doing and rely onurllib3
's implementation here: https://github.com/urllib3/urllib3/blob/main/src/urllib3/util/url.py#L227This gives the same behavior as (almost, see below) all other clients for valid urls and avoids the double encoding for invalid ones. In my test code, this is
requests-patched
.Further notes
I also tested weird and invalid url edge cases:
-._~
, one other client differs from the rest.chrome
decodes specifically.
. I believe this to be a rather esotheric test case because unlike~
, which became unreserved in RFC 3986 compared to RFC 1738, these characters were always unreserved.requests
use a url that decodes to the same string via e.g.urllib.parse.unquote
. (requests
differs because of the double encoding)It can of course be argued that the server should treat
%7E
and~
in my original test case the same. But in the interest of interoperability I believe it still makes sense to align with what every other client is doing and also drop the double encoding.