from_pandas should be more flexible than requiring a full row on ingestion #2158

Open
kylemann16 opened this issue Feb 11, 2025 · 0 comments

kylemann16 commented Feb 11, 2025

I am converting my project from sparse arrays to dense arrays, and I ran into a lot of problems with the methods I had been using on sparse arrays, specifically from_pandas.

Is it correct that TileDB requires an entire row of data to be consumed at the same time in order to use from_pandas?

The data I work with is represented in a dataframe with a MultiIndex and varies widely in size (state-sized LiDAR point cloud data), so it is quite likely that a single row of data is too large to consume at once while also running all the pre-TileDB processing I need to run over it.

To me, it should be possible to call from_pandas on a dataframe that matches your TileDB array and have it inserted into the array based on the indices it finds there. When I followed from_pandas through its flow, I noticed that much of the logic required for this is already available, but it is skipped over or not used in favor of a row index slice.
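To make the requested behavior concrete, here is a minimal plain-Python sketch of deriving a dense write region from the indices themselves rather than from a full-row slice. It is illustrative only: it makes no TileDB or pandas calls, it is not the code in the linked branch, and `dense_region_from_index` is a hypothetical helper name, not part of TileDB-Py.

```python
from itertools import product


def dense_region_from_index(cells):
    """Map {(i, j, ...): value} onto a dense sub-region.

    Hypothetical helper for illustration; not a TileDB-Py function.
    Returns (bounds, values), where bounds holds one inclusive (lo, hi)
    pair per dimension and values are ordered row-major over that region.
    Raises ValueError if the indices do not fill their bounding box,
    since a dense write needs a value for every cell in the region.
    """
    if not cells:
        raise ValueError("no cells to write")

    ndim = len(next(iter(cells)))
    # Bounding box of the indices actually present in the data.
    bounds = [
        (min(c[d] for c in cells), max(c[d] for c in cells))
        for d in range(ndim)
    ]
    # Every coordinate the bounding box contains, in row-major order.
    expected = list(product(*(range(lo, hi + 1) for lo, hi in bounds)))
    if len(expected) != len(cells) or any(c not in cells for c in expected):
        raise ValueError("index does not cover a full rectangular region")

    return bounds, [cells[c] for c in expected]


# A 2x2 block from the middle of a larger array, not a full row.
block = {(4, 10): 1.0, (4, 11): 2.0, (5, 10): 3.0, (5, 11): 4.0}
bounds, values = dense_region_from_index(block)
```

With a MultiIndex dataframe, a mapping like `block` could be built from the index tuples and one attribute column, for example `dict(zip(df.index, df["z"]))`, where `"z"` is an assumed column name. The resulting bounds would then select the sub-region of the dense array to write, instead of requiring the whole row.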

I have created a branch where I've written a preliminary implementation of the feature (and a test) with no interruption to current usage, and I can make a PR if you're interested in it: kylemann16@60defc0

It's a pretty rudimentary implementation, and I'm certain I don't know all the implications it would have, but it passes tests and works when I use it for my project.

If I'm missing something and this is redundant, or if it's not in line with how you'd like TileDB-py to work, I'd love to get some feedback or discussion going on this. As it currently stands, from_pandas is only useful to me in a sparse array scenario.

Thank you!

@johnkerl johnkerl changed the title from_pandas should be more flexible than requiring a full row on ingestion from_pandas should be more flexible than requiring a full row on ingestion Feb 11, 2025