Skip to content
This repository has been archived by the owner on Apr 14, 2022. It is now read-only.

Add API for Request, SyncRequestData, and AsyncRequestData #151

Closed
wants to merge 2 commits into from

Conversation

sethmlarson
Copy link
Contributor

I started off the API design by removing everything under src/ and replacing it with only *.pyi files. We'll see how far I can push the API design while only using .pyi, if I need to switch to real .py files eventually that'll be fine I think?

This is the first commit which adds Request model along with the SyncRequestData and AsyncRequestData wrappers which are an API that will wrap all the types of data we accept for a request (bytes, str, files, file-like objects, iterators) which will be passed down to the HTTP lifecycle and eventually the low-level backend stream objects.

I'm tagging down my thoughts within the doc-strings. If you have additional comments I can add them inline as well

@pquentin
Copy link
Member

@sethmlarson Thank you for this! It's detailed enough for us to think about actual usage, but also small enough for us to give 30% feedback. And the way you prepared the api-design branch makes the PR really easy to read. So, again, thank you!

I initially assumed that this was the API we were going to expose to users, but in fact it's our internal representation of requests after some preprocessing, right?


In #124 @njsmith mentions "low level" and "high level" APIs. I think we should explicitly state whether we're going to have those two APIs or not. I think a single high-level API is enough?

I'm asking this because low-level urllib3 uses body while high-level requests uses data as you're doing here. I think it's best to continue using the requests terminology now that we target the high-level features of requests, so 👍 on the name here.


Regarding request data, we looked in #135 at all the things requests support. I think the end result is either bytes, a file-like object that can be rewinded or just an iterator, and this is exactly what you're supporting in SyncRequestData, so that looks good!

For the async API, I don't know what would be best and will let @njsmith comment. Maybe the situation is similar to that of response data that we explored in #123?


Oh, and by the way, since we've removed all the code in src/urllib3, I think we can remove the travis and appveyor configuration files as they're currently distracting. (But probably not in this PR.)

@sethmlarson
Copy link
Contributor Author

There'll probably be quite a few merge conflicts early on if I'm doing these concurrently so make sure these are squash-merged :)

@sethmlarson sethmlarson requested a review from njsmith November 27, 2019 14:42
Ref: https://github.com/python-trio/urllib3/issues/135
"""

def read(self, nbytes: int) -> bytes:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding there only being one interface, I think we should have something like SyncRequestData.sync_read() and AsyncRequestData.async_read() or maybe we switch to using async for chunk in request.data which would automatically get transformed by unasync but we'd forfeit the ability to grab a certain amount of data. I guess that interface is easier to implement anyways?

The reason I think this would be good to have the split is you'd only have to subclass AsyncRequestData and could technically use that class in both async and synchronous requests. Would just need to have good error reporting if you're passed input data that is for example, an async generator, and then using it in a sync session. I'm guessing that wouldn't happen often and an error message would fix any ambiguity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of my plan in action:

import typing

StrOrInt = typing.Union[str, int]
FormType = typing.Union[
    typing.Sequence[typing.Tuple[str, StrOrInt]],
    typing.Mapping[str, typing.Union[StrOrInt, typing.Sequence[StrOrInt]]],
]


def _int_to_urlenc():
    values = {}
    special = {0x2A, 0x2D, 0x2E, 0x5F}
    for byte in range(256):
        if (
            (0x61 <= byte <= 0x7A)
            or (0x030 <= byte <= 0x5A and byte != 0x40)
            or (byte in special)
        ):  # Keep the ASCII
            values[byte] = bytes((byte,))
        elif byte == 0x020:  # Space -> '+'
            values[byte] = b"+"
        else:  # Percent-encoded
            values[byte] = b"%" + hex(byte)[2:].upper().encode()
    return values


INT_TO_URLENC = _int_to_urlenc()


class URLEncoded(AsyncRequestData):
    """Implements x-www-form-urlencoded as a RequestData object"""
    def __init__(self, form: FormType):
        self._form = form
        self._data = None

    @property
    def content_type(self) -> str:
        return "application/x-www-form-urlencoded"

    @property
    def content_length(self) -> int:
        return len(self._encode_form())

    def is_rewindable(self) -> bool:
        return True

    def __iter__(self) -> typing.Iterable[bytes]:
        yield self._encode_form()

    async def __aiter__(self) -> typing.AsyncIterable[bytes]:
        yield self._encode_form()

    def _encode_form(self) -> bytes:
        if self._data is None:

            def serialize(x: str) -> bytes:
                return b"".join([INT_TO_URLENC[byte] for byte in x.encode("utf-8")])

            output: typing.List[bytes] = []
            for k, vs in (
                self._form.items() if hasattr(self._form, "items") else self._form
            ):
                if isinstance(vs, str) or not hasattr(vs, "__iter__"):
                    vs = (vs,)
                for v in vs:
                    output.append(serialize(k) + b"=" + serialize(v))

            self._data = b"&".join(output)

        return self._data

Suddenly this one class can be used for both synchronous and asynchronous requests. We keep the is_rewindable (or maybe rename it to is_replayable? or something else) and then we'd signal a "rewind" by calling __aiter__ or __iter__ again.

in the case of a timeout/socket error on an idempotent request, a redirect,
etc so that the new request can be sent. This works for file-like objects
and bytes / strings (where it's a no-op).
"""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also add a content_type property so that can be passed to the request as a default if none is set.

@njsmith
Copy link
Member

njsmith commented Dec 1, 2019

What do you think about using something like this as our internal representation for request bodies?

class AsyncRequestData:
    content_length: Optional[int]
    async def data_chunks(self) -> AsyncIterable[bytes]:
       ...

So instead of having a stateful object that we rewind, it acts as a factory for iterators.

(I'm focusing on the async version because I'm assuming the sync version will be identical except for being unasync'ed.)

Random thoughts:

  • Checking the length of a file might be an async operation, but accessing content_length here is synchronous. So this means that when you're wrapping a file into one of these things, you'd have to check the length up front and stash it in the object. That seems fine to me.

  • For objects like files that actually use a stateful rewind-style API underneath, there's a hidden constraint: if you call data_chunks several times, then you can get several "independent" iterator objects, but in fact they're not independent; you have to make sure that you stop using the old iterator once the new one is created. In practice we'd never want to do this anyway, and I guess if we want to be extra-careful we could use some logic to invalidate old iterators when a new one is created.

  • If the underlying data isn't rewindable, I guess in this approach the way we'd handle that is that iter_data would raise an exception if you tried to call it a second time. Is that a good solution? Or are there cases where we really need to check can_rewind ahead of time, because we'll do something different depending on whether we anticipate rewinding being possible?

  • Do we need to have a close method on this thing? If you pass a file object to requests, does it close it when its done?

@sethmlarson
Copy link
Contributor Author

What do you think about using something like this as our internal representation for request bodies?
So instead of having a stateful object that we rewind, it acts as a factory for iterators.

I'd rather go with this approach, the only issue I see is how an "unsynced" Session could use a single "Async/SyncRequestData"? I was proposing the interface with __iter__ and __aiter__ because we'd be able to rely on unasync transforming AsyncRequestData.__aiter__() into SyncRequestData.__iter__(). Would be beneficial to not have to have two implementations of each custom body type, I guess we could live with it though?

* Checking the length of a file might be an async operation, but accessing `content_length` here is synchronous. So this means that when you're wrapping a file into one of these things, you'd have to check the length up front and stash it in the object. That seems fine to me.

You're on the same track here as me.

* For objects like files that actually use a stateful rewind-style API underneath, there's a hidden constraint: if you call `data_chunks` several times, then you can get several "independent" iterator objects, but in fact they're not independent; you have to make sure that you stop using the old iterator once the new one is created. In practice we'd never want to do this anyway, and I guess if we want to be extra-careful we could use some logic to invalidate old iterators when a new one is created.

Definitely true.

* If the underlying data _isn't_ rewindable, I guess in this approach the way we'd handle that is that `iter_data` would raise an exception if you tried to call it a second time. Is that a good solution? Or are there cases where we really need to check `can_rewind` ahead of time, because we'll do something different depending on whether we anticipate rewinding being possible?

That's true, there's not much we can do here if the underlying data isn't rewindable. We can drop that section of the API.

* Do we need to have a `close` method on this thing? If you pass a file object to requests, does it close it when its done?
>>> import requests
>>> f = open("setup.py", "rb")
>>> requests.post("https://httpbin.org/anything", data=f)
<Response [200]>
>>> f.closed
False

Looks like no, maybe we don't need a close method either!

@njsmith
Copy link
Member

njsmith commented Dec 1, 2019

I'd rather go with this approach, the only issue I see is how an "unsynced" Session could use a single "Async/SyncRequestData"? I was proposing the interface with __iter__ and __aiter__ because we'd be able to rely on unasync transforming AsyncRequestData.__aiter__() into SyncRequestData.__iter__(). Would be beneficial to not have to have two implementations of each custom body type, I guess we could live with it though?

Hmm, I was imagining we'd have something like:

class RequestData:
    async def data_chunks(self):
        if self._consumed:
            raise ...
        elif self._bytes is not None:
            yield self._bytes
       elif self._sync_iterable is not None:
           self._consumed = True
           for chunk in self._sync_iterable:
               yield chunk
       elif self._async_iterable is not None:
           self._consumed = True
           async for chunk in self._async_iterable:
               yield chunk
       elif self._sync_file is not None:
           # you get the idea

This would live at hip._async.RequestData and be used by hip._async.Session. And then unasync would take care of creating hip._sync.RequestData, and also creating hip._sync.Session that uses it. And I think everything would just work? Maybe I'm not understanding the question :-)

@sethmlarson
Copy link
Contributor Author

The AsyncRequestData and SyncRequestData interface aren't the issue, that can be unasynced and will work fine. I'm more worried about having an interface that can be sub-classed into things like URLEncodedForm, MultipartFormData, etc and only having to be sub-classed once instead of under _async/. But I guess in the end it's the same amount of code to maintain thanks to unasync so shouldn't be a problem?

@njsmith
Copy link
Member

njsmith commented Dec 1, 2019

Well, I'm one of those weirdos who refuses to use subclassing unless there's absolutely no other option :-). But assuming you mean that this might be an abstract interface that has multiple implementations, then yeah, unasync should just take care of that, I think?

@sethmlarson
Copy link
Contributor Author

Closed in favor of #185.

@sethmlarson sethmlarson closed this Dec 2, 2019
@sethmlarson sethmlarson deleted the request-api branch December 4, 2019 18:16
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants