Pandas and cuDF DataFrames in DocumentDataset #195

Open
sarahyurick opened this issue Aug 8, 2024 · 0 comments · May be fixed by #494
Labels
bug Something isn't working

Comments

@sarahyurick
Collaborator

Right now, there is some confusion about which DataFrame types can be passed into DocumentDataset. We currently expect Dask or Dask-cuDF DataFrames, so we should add stronger type checking to enforce this.

Let's also investigate automatically converting Pandas and cuDF DataFrames to Dask and Dask-cuDF DataFrames, respectively. There are two options when a user tries to create a DocumentDataset with one of them: throw an error, or automatically convert it to its Dask equivalent.

We should at least throw the error for now.
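
As a rough illustration of what the stricter type check could look like, here is a minimal sketch. The function names `importable_dask_dataframe_types` and `check_dataset_df` and the `accepted_types` parameter are hypothetical and not part of the current codebase; the only library attributes assumed are `dask.dataframe.DataFrame` and `dask_cudf.DataFrame`.

```python
def importable_dask_dataframe_types() -> tuple:
    """Collect the Dask DataFrame classes that can be imported in this environment."""
    accepted = []
    try:
        import dask.dataframe as dd

        accepted.append(dd.DataFrame)
    except ImportError:
        pass  # Dask is not installed here
    try:
        import dask_cudf

        accepted.append(dask_cudf.DataFrame)
    except ImportError:
        pass  # the GPU stack is not installed here
    return tuple(accepted)


def check_dataset_df(df, accepted_types=None):
    """Return df unchanged, or raise TypeError if it is not an accepted type.

    accepted_types is a hypothetical hook so the check can be exercised
    without Dask or cuDF installed.
    """
    if accepted_types is None:
        accepted_types = importable_dask_dataframe_types()
    if not isinstance(df, accepted_types):
        raise TypeError(
            "DocumentDataset expects a Dask or Dask-cuDF DataFrame, "
            f"but got {type(df).__module__}.{type(df).__qualname__}."
        )
    return df
```

Something like this could be called from the DocumentDataset constructor, so that a Pandas or cuDF DataFrame fails immediately with a message naming the offending type instead of failing later in a less obvious place.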

If we decide to convert automatically, this could involve using get_current_client, is_cudf_type, from_pandas, and creating a from_cudf function. If we go this route, it would probably be a good idea to tell the user that the conversion is happening, to avoid any confusion when they look at DocumentDataset.df.
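
For the conversion route, a minimal sketch might look like the following. The helper name `to_dask_if_needed` and its `npartitions` default are hypothetical. It does not use the project's own get_current_client or is_cudf_type helpers; instead it assumes only `dask.dataframe.from_pandas` and `dask_cudf.from_cudf`, and detects the input type by its defining package so that cuDF is never imported on a CPU-only machine. That detection is a simplification and may not hold for proxy or subclassed DataFrame types.

```python
import warnings


def to_dask_if_needed(df, npartitions=1):
    """Convert a Pandas or cuDF DataFrame to its Dask equivalent, with a warning."""
    # Dispatch on the defining package so cuDF is only imported when needed.
    root_module = type(df).__module__.split(".")[0]
    is_dataframe = type(df).__name__ == "DataFrame"

    if is_dataframe and root_module == "pandas":
        import dask.dataframe as dd

        warnings.warn(
            "Converting Pandas DataFrame to a Dask DataFrame.",
            UserWarning,
            stacklevel=2,
        )
        return dd.from_pandas(df, npartitions=npartitions)

    if is_dataframe and root_module == "cudf":
        import dask_cudf

        warnings.warn(
            "Converting cuDF DataFrame to a Dask-cuDF DataFrame.",
            UserWarning,
            stacklevel=2,
        )
        return dask_cudf.from_cudf(df, npartitions=npartitions)

    # Anything else (including Dask collections) is passed through untouched.
    return df
```

The warning is what tells the user that the object stored in DocumentDataset.df is no longer the one they passed in. Whether a single partition is a sensible default, or whether it should depend on the active client, is left open here.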

Somewhat related to #79.

cc @ayushdg @ryantwolf @VibhuJawa
