ENH: add basic DataFrame.from_arrow class method for importing through Arrow PyCapsule interface #59696
Conversation
cc @kylebarron
pandas/core/frame.py (Outdated)

@@ -1746,6 +1746,52 @@ def __rmatmul__(self, other) -> DataFrame:
    # ----------------------------------------------------------------------
    # IO methods (to / from other formats)

    @classmethod
    def from_arrow(cls, data) -> DataFrame:
We might want to type data through some typing protocol? (@kylebarron, like you have ArrayStreamExportable in https://kylebarron.dev/arro3/latest/api/core/table/#arro3.core.Table.from_arrow.) I am not super familiar with typing, but could I just copy-paste https://github.com/kylebarron/arro3/blob/main/arro3-core/python/arro3/core/types.py?
Indeed, you can copy from there. Those come originally from the part of the spec here: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#protocol-typehints
I wanted to add a bit more documentation on them so that the docs website would be friendly.
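For reference, the protocol typehints on that spec page look roughly like the following (a sketch following the spec's shape, not code from this PR; the capsules are typed as plain object since PyCapsule has no public Python type):

from __future__ import annotations

from typing import Protocol


class ArrowSchemaExportable(Protocol):
    """An object exporting an Arrow schema via a PyCapsule."""

    def __arrow_c_schema__(self) -> object: ...


class ArrowArrayExportable(Protocol):
    """An object exporting a single Arrow array or record batch."""

    def __arrow_c_array__(
        self, requested_schema: object | None = None
    ) -> tuple[object, object]: ...


class ArrowStreamExportable(Protocol):
    """An object exporting a stream of Arrow record batches."""

    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object: ...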
"'_arrow_c_array__' or '__arrow_c_stream__' method), got " | ||
f"'{type(data).__name__}' instead." | ||
) | ||
data = pa.table(data) |
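For context, the hunk above only shows the tail of the check; the surrounding method presumably looks something like this (a reconstruction from the visible fragments, not the exact PR code; the leading part of the error message and the final conversion step are guesses):

    @classmethod
    def from_arrow(cls, data) -> DataFrame:
        # assumption: pandas' optional-import helper for pyarrow
        pa = import_optional_dependency("pyarrow")
        if not (
            hasattr(data, "__arrow_c_array__") or hasattr(data, "__arrow_c_stream__")
        ):
            raise TypeError(
                "Expected an Arrow-compatible tabular object (i.e. having an "
                "'__arrow_c_array__' or '__arrow_c_stream__' method), got "
                f"'{type(data).__name__}' instead."
            )
        data = pa.table(data)
        return data.to_pandas()  # guess at the final conversion step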
Does this actually work for things that only expose __arrow_c_array__?
In [28]: arr = pa.array([1, 2, 3])
In [29]: hasattr(arr, "__arrow_c_array__")
Out[29]: True
In [30]: pa.table(arr)
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[30], line 1
----> 1 pa.table(arr)
File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:6022, in pyarrow.lib.table()
File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:5841, in pyarrow.lib.record_batch()
File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:3886, in pyarrow.lib.RecordBatch._import_from_c_device_capsule()
File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
ArrowInvalid: Cannot import schema: ArrowSchema describes non-struct type int64
Exposing __arrow_c_array__ is necessary but not sufficient. Both Array and RecordBatch expose the same __arrow_c_array__ interface. It's overloaded to be able to interpret a RecordBatch the same as an Array with type Struct.
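To illustrate: both kinds of object carry the same dunder, and a record batch exports itself as a struct-typed array (a small pyarrow example, runnable as-is):

import pyarrow as pa

batch = pa.record_batch({"a": [1, 2, 3]})
struct = pa.StructArray.from_arrays([pa.array([1, 2, 3])], names=["a"])

hasattr(batch, "__arrow_c_array__")   # True
hasattr(struct, "__arrow_c_array__")  # True: same interface, so the
                                      # consumer must disambiguate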
And to be fair, RecordBatch has both the __arrow_c_array__ and __arrow_c_stream__ dunder methods, so just testing with RecordBatch does not actually prove that pa.table(..) works with objects that only implement the array version. But because in the tests I wrap the record batch in a dummy object that only exposes __arrow_c_array__, the tests should cover this and assert that DataFrame.from_arrow() works with both dunder methods.
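A sketch of what such a dummy test object might look like (the class name here is hypothetical, not the actual test helper):

class ArrayOnlyProducer:
    """Wraps a RecordBatch but exposes only the array dunder."""

    def __init__(self, batch):
        self._batch = batch

    def __arrow_c_array__(self, requested_schema=None):
        # delegate the capsule export to the wrapped batch
        return self._batch.__arrow_c_array__(requested_schema)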
> It's overloaded to be able to interpret a RecordBatch the same as an Array with type Struct.

Ah OK, that's good to know. So essentially it's up to the producer to determine whether this makes sense, right?

I think there is still a consistency problem with how we as a consumer then work. A RecordBatch can be read through both the array and the stream interface, but a Table can only be read through the latter (unless it is forced to consolidate chunks and produce an Array).

I'm sure PyArrow has that covered well, but unless the spec clarifies how the array interface is expected to behave, that might push libraries into making the (presumably poor) decision that their streams should also produce consolidated array data.
I'd say it's up to the consumer to decide if the input makes sense. The producer just says "here's my data".

But I think the key added part is user intention. A struct array can represent either a single array or a full RecordBatch, and we need a hint from the user about which is which. This is why I couldn't add PyCapsule Interface support to polars.from_arrow: it's missing the user intention of "this object is a series" versus "this object is a DataFrame".

I'm not sure I follow the rest of your comment @WillAyd. A stream never needs to concatenate data before starting the stream.
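Concretely, the same struct array supports two reasonable readings, and only the user knows which one is meant (a small pyarrow illustration):

import pyarrow as pa

struct = pa.StructArray.from_arrays(
    [pa.array([1, 2]), pa.array(["x", "y"])], names=["a", "b"]
)

# Reading 1: a two-column record batch (DataFrame-like)
batch = pa.RecordBatch.from_struct_array(struct)

# Reading 2: a single struct-typed column (Series-like)
tbl = pa.table({"col": struct})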
> I'm not sure I follow the rest of your comment @WillAyd. A stream never needs to concatenate data before starting the stream.

A theoretical example is a library that produces Arrow data and thinks it needs to implement __arrow_c_array__ for its "Table" equivalent because it did so for its RecordBatch equivalent. If the Table contained multiple chunks of data, I assume it would need to combine all of the chunks to pass data on through the __arrow_c_array__ interface.
Maybe the spec should be more explicit about when to implement which interface. I think it's implicit that a RecordBatch can implement both, because both are zero copy, but a Table should only implement the stream interface, because only the stream interface is always zero copy.
I raised an issue a while ago to discuss consumer implications, if you haven't seen it: apache/arrow#40648
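That split is already visible in pyarrow's own objects:

import pyarrow as pa

batch = pa.record_batch({"a": [1, 2, 3]})
tbl = pa.table({"a": [1, 2, 3]})

hasattr(batch, "__arrow_c_array__")   # True: a batch is contiguous,
                                      # so export is zero-copy either way
hasattr(batch, "__arrow_c_stream__")  # True
hasattr(tbl, "__arrow_c_stream__")    # True
hasattr(tbl, "__arrow_c_array__")     # False: would force concatenating chunks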
Ah OK great - thanks for sharing. I'll track that issue upstream
        libraries.

        .. _Arrow PyCapsule Protocol: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html
        .. _Arrow C Data Interface: https://arrow.apache.org/docs/format/CDataInterface.html
Nit: maybe this should link to the stream interface page instead? https://arrow.apache.org/docs/format/CStreamInterface.html
lgtm ex @kylebarron feedback
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
See #59631

For now, this adds the most basic method, just converting an Arrow tabular object, without exposing any keyword arguments (and without exposing it in pd.DataFrame() directly).
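A quick sketch of the proposed usage (this is the API the PR adds; it is not in any released pandas version):

import pandas as pd
import pyarrow as pa

tbl = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Any object exposing __arrow_c_stream__ (or __arrow_c_array__) would work,
# not just pyarrow objects:
df = pd.DataFrame.from_arrow(tbl)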