[FEA] Support for GPU Buffer in cudf.read_parquet in Python #17742

Open
JigaoLuo opened this issue Jan 15, 2025 · 2 comments
Labels
feature request New feature or request

Comments

@JigaoLuo

Is your feature request related to a problem? Please describe.

Hi all,

Following a previous discussion in the community, I would like to request an enhancement to the cudf.read_parquet(filepath_or_buffer, ...) API: accepting GPU buffers as input, so that Parquet files already in GPU memory can be processed directly.

Reasons for this improvement:

  • Symmetry: Currently, cudf.read_parquet supports CPU memory buffers but not GPU memory buffers.
  • Performance clarity: By using GPU buffers directly, this enhancement would be able to show off the decompression performance on data already in GPU memory, without any I/O overhead.

Describe the solution you'd like
As discussed, implementing this feature would likely require only plumbing through the libcudf bindings to handle GPU buffers seamlessly.


Additional context: Current limitation & Error
When attempting to use cudf.read_parquet with a GPU memory buffer (e.g., a CuPy ndarray), the following error is encountered:

ValueError: Sources must be a list of str/paths, bytes, io.BytesIO, io.StringIO, or a Datasource 
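
To make the limitation concrete, here is a minimal sketch of the kind of source-type dispatch that produces this error, plus the extra branch the request is asking for. It is not cuDF's actual code: `classify_source` and `FakeDeviceArray` are hypothetical names introduced only for illustration. The one real convention it relies on is `__cuda_array_interface__`, the attribute that CuPy arrays and similar GPU containers expose.

```python
import io
import os

# Message copied from the error reported above.
_ERROR = (
    "Sources must be a list of str/paths, bytes, io.BytesIO, "
    "io.StringIO, or a Datasource"
)


def classify_source(src):
    """Hypothetical helper: label one input the way a reader front end might."""
    if isinstance(src, (str, os.PathLike)):
        return "path"
    if isinstance(src, (bytes, io.BytesIO, io.StringIO)):
        return "host_buffer"
    # GPU containers such as CuPy arrays expose this attribute.
    if hasattr(src, "__cuda_array_interface__"):
        return "device_buffer"  # the branch this issue asks for
    raise ValueError(_ERROR)


class FakeDeviceArray:
    """Stand-in for a GPU array so the sketch runs without a GPU."""

    __cuda_array_interface__ = {
        "shape": (4,),
        "typestr": "|u1",
        "data": (0, False),
        "version": 3,
    }
```

Without the `device_buffer` branch, a GPU array falls through to the `ValueError`. Until such a branch exists, the likely workaround is to copy the data back to the host first (for a CuPy `uint8` array `arr`, something like `io.BytesIO(arr.get().tobytes())`), which reintroduces the transfer this request aims to avoid.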
@JigaoLuo JigaoLuo added the feature request New feature or request label Jan 15, 2025
@GregoryKimball
Contributor

Thank you @JigaoLuo for your request. It's likely that we would add device buffer data source support first through our new pylibcudf library rather than the "cuDF classic" API, at least to start.

@vyasr @Matt711 Would you please let me know if there is an issue about adding a device buffer data source in pylibcudf?

@Matt711
Contributor

Matt711 commented Jan 27, 2025

> @vyasr @Matt711 Would you please let me know if there is an issue about adding a device buffer data source in pylibcudf?

No, it doesn't look like there's one, but @vyasr may know if there's an older issue not tagged as pylibcudf. I wonder if it would be sufficient to flesh out cudf::io::datasource in pylibcudf? It seems to be an acceptable type for our readers today.
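
For readers unfamiliar with the datasource idea, the following is a toy Python analogue, not the pylibcudf or libcudf API. The method names loosely mirror members of the C++ `cudf::io::datasource` class (`size`, `host_read`, `device_read`, `supports_device_read`), but the real signatures differ (for example, device reads take a CUDA stream). `ToyDatasource` is a hypothetical name, and a plain `bytes` object stands in for GPU memory so the sketch runs anywhere.

```python
class ToyDatasource:
    """Toy analogue of a datasource; host bytes stand in for GPU memory."""

    def __init__(self, data: bytes):
        self._data = memoryview(data)

    def size(self) -> int:
        # Total number of bytes available from this source.
        return len(self._data)

    def supports_device_read(self) -> bool:
        # A device-buffer source would answer True so a reader can skip host copies.
        return True

    def device_read(self, offset: int, size: int) -> memoryview:
        # Zero-copy view into the buffer; slicing clamps at the end.
        return self._data[offset:offset + size]

    def host_read(self, offset: int, size: int) -> bytes:
        # Host reads copy the requested range out of the buffer.
        return bytes(self._data[offset:offset + size])
```

A reader built on such an interface only asks for byte ranges. For instance, it can check the 4-byte `PAR1` magic at the end of a Parquet file with `src.host_read(src.size() - 4, 4)` and then request column chunks through `device_read`, regardless of where the bytes actually live.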


3 participants