Don't normalise or double-escape urls #6923

moben · 2025-03-28T23:47:51Z

Summary

I recently hit a url that I could not retrieve with requests, but that other clients retrieve with an otherwise identical http request. Specifically: when a url contains a percent-escaped tilde (%7E), requests normalizes it to ~, which no other http client that I tried does. In addition, requests double-encodes invalid urls, which again differs from every other client I tested.

Testing

To illustrate this, I wrote some test code that records the paths that a variety of clients request. In addition to requests, I tested the http client libraries and browsers listed below, across python, java, go, C, js and rust. Every client that I tested uses the target as-is, except requests, which normalizes %7E to ~. The tested clients:

- Go-http-client/1.1
- Java-http-client/21.0.5
- Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36
- Mozilla/5.0 (X11; Linux x86_64; rv:136.0) Gecko/20100101 Firefox/136.0
- Python-urllib/3.13
- axios/1.8.4
- curl/8.12.1
- go-resty/3.0.0-beta.1 (https://resty.dev)
- got (https://github.com/sindresorhus/got)
- node
- python-httpx/0.28.1
- python-urllib3/2.3.0
- requests-patched (see suggestion below)
- reqwest

The test code (server + automation to run all listed clients against it) is available here: https://github.com/moben/bugs/tree/aa8e4eb928e8189f5d863748d6d06e980d4f8a87/requests_http_location

The test server redirects /v to f"/v/%7E/-._~/{urllib.parse.quote(string.printable)}", i.e. /v/%7E/-._~/0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ%21%22%23%24%25%26%27%28%29%2A%2B%2C-./%3A%3B%3C%3D%3E%3F%40%5B%5C%5D%5E_%60%7B%7C%7D~%20%09%0A%0D%0B%0C
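For reference, the redirect target above can be reproduced with the standard library alone. This is a minimal sketch, assuming Python 3.7 or newer (where `urllib.parse.quote` leaves `~` unescaped); the variable names are mine and not taken from the test repository.

```python
import string
import urllib.parse

# Percent-encode every printable ASCII character except the ones that
# quote() treats as safe by default (letters, digits, "_.-~" and "/").
suffix = urllib.parse.quote(string.printable)

# Same shape as the redirect target described above: a literal %7E,
# the unreserved punctuation, then the encoded printable characters.
target = f"/v/%7E/-._~/{suffix}"
print(target)
```

Note that `quote` itself never produces %7E on current Python versions, which is why the test path spells it out literally.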

All clients listed above use this verbatim, but requests normalizes it to /v/~/..... Note: I originally thought this was specific to redirects, but it also happens when passing the url directly to requests.get.
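To make the difference concrete, here is a simplified, stdlib-only sketch of the kind of normalization described above: decoding percent-escapes only when they stand for an RFC 3986 unreserved character. `decode_unreserved` is a hypothetical helper written for this illustration; it is not the code that requests actually runs.

```python
import re
import string

# RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(path: str) -> str:
    """Hypothetical helper: decode only escapes of unreserved characters."""

    def repl(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Leave escapes of reserved or non-ASCII bytes untouched.
        return char if char in UNRESERVED else match.group(0)

    return _ESCAPE.sub(repl, path)


sent_as_is = "/v/%7E/-._~/%21"          # what the other clients put on the wire
normalized = decode_unreserved(sent_as_is)
print(normalized)                        # /v/~/-._~/%21
```

Both forms are equivalent under RFC 3986, but a server that compares raw paths sees two different requests.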

Root Cause & Proposed Fix

You might object that some of the tested clients are built on top of each other and should therefore count as one data point. Note, however, that requests behaves differently from urllib3 here, even though it is built on it.

The reason for requests differing from urllib3 can be found in the history of the current normalization: the normalization (requote_uri) was added in 2013 in #1361 to resolve #1360. But in 2019, handling of invalid urls was also solved in urllib3 in urllib3/urllib3#1647. The urllib3 implementation matches what all other clients are doing more closely and does not change valid urls. Having both also means that requests is double-encoding invalid uris. If the url is invalid we probably can't expect much, but the path that requests uses has no chance of decoding to the same string that any other client uses, even on the most lenient server.
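Why double encoding is unrecoverable can be shown with a generic stdlib sketch. This only illustrates the effect of quoting an already-quoted string; it does not reproduce the exact code path through requests and urllib3, and the example path is made up.

```python
from urllib.parse import quote, unquote

raw = "/a b"           # invalid as-is: contains an unencoded space
once = quote(raw)      # "/a%20b"
twice = quote(once)    # "/a%2520b": the "%" itself was escaped again

# A server that decodes once recovers the original from a single encoding...
decoded_once = unquote(once)
# ...but from the double-encoded form it only gets back the encoded string.
decoded_twice = unquote(twice)

print(once, twice, decoded_once, decoded_twice)
```

A single-encoding client and a double-encoding client therefore end up requesting different resources, no matter how leniently the server decodes.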

I believe the best way to fix this is to simply drop the "requoting" that requests is doing and rely on urllib3's implementation here: https://github.com/urllib3/urllib3/blob/main/src/urllib3/util/url.py#L227
This gives the same behavior as almost all other clients for valid urls (see below for the one exception) and avoids the double encoding for invalid ones. In my test code, this is requests-patched.

Further notes

I also tested weird and invalid url edge cases:

- When percent-encoding all characters, including alphanumeric and -._~, one other client differs from the rest: chrome specifically decodes the percent-encoded . (%2E). I believe this to be a rather esoteric test case because unlike ~, which became unreserved in RFC 3986 compared to RFC 1738, these characters were always unreserved.
- For invalid uris (broken percent encoding, unencoded characters) behavior differs wildly. But all clients except requests use a url that decodes to the same string via e.g. urllib.parse.unquote. (requests differs because of the double encoding.)
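The "decodes to the same string" comparison can be sketched as follows. The client names and paths are hypothetical values made up for this illustration, not results from the test run, and `group_by_decoded` is a helper I introduce here.

```python
from collections import defaultdict
from urllib.parse import unquote


def group_by_decoded(paths: dict[str, str]) -> dict[str, list[str]]:
    """Group client names by the string their requested path decodes to."""
    groups: dict[str, list[str]] = defaultdict(list)
    for client, path in paths.items():
        groups[unquote(path)].append(client)
    return dict(groups)


# Hypothetical recorded paths (assumption: made up, not measured).
recorded = {
    "client-a": "/x/%7E/a%20b",    # sends the target verbatim
    "client-b": "/x/~/a%20b",      # decodes %7E, still equivalent
    "client-c": "/x/~/a%2520b",    # double-encoded, not equivalent
}

groups = group_by_decoded(recorded)
for decoded, clients in groups.items():
    print(repr(decoded), clients)
```

Clients that land in the same group are interchangeable for a server that decodes the path once; a double-encoding client ends up in a group of its own.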

It can of course be argued that the server should treat %7E and ~ in my original test case the same. But in the interest of interoperability I believe it still makes sense to align with what every other client is doing and also drop the double encoding.

`requote_uri` was added in 2013 in psf#1361 to resolve psf#1360. But in 2019 this was also solved in `urllib3` in urllib3/urllib3#1647. The `urllib3` implementation more closely matches what all other clients are doing. Keeping both also means that we are double-encoding invalid uris. If the redirect is invalid we're in uncharted territory, but the path that `requests` uses has no chance of decoding to the same string that any other client uses, even on the most lenient server. Simply drop `requote_uri`, as `urllib3`'s `parse_url` will always handle invalid urls.

moben force-pushed the align_url_encoding branch 2 times, most recently from 7ec84f3 to 3a47592 Compare March 29, 2025 00:04

moben force-pushed the align_url_encoding branch from 3a47592 to 0aceb9f Compare March 29, 2025 00:23
