I am trying to convert my project from using sparse arrays to dense arrays, and I ran into a lot of problems while trying to use the same methods I had been using on sparse arrays, specifically from_pandas.
Is it correct that TileDB requires an entire row of data to be consumed at the same time in order to use from_pandas?
The data I work with is represented in a dataframe with a MultiIndex and is highly variable in size (state-sized LiDAR point cloud data), with a high likelihood that a single row of data is too large to consume at once while also running all the pre-TileDB processing I need to run over it.
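For context, here is roughly what one chunk looks like and how I already ingest it into a sparse array today (a simplified sketch: the dimension names, dtypes, and URI are placeholders rather than my real schema):

```python
import numpy as np
import pandas as pd
import tiledb

# Simplified stand-in for one pre-processed LiDAR chunk: the MultiIndex levels
# are the coordinates I want to use as array dimensions.
df = pd.DataFrame(
    {"intensity": np.array([10, 20, 30, 40], dtype=np.uint16)},
    index=pd.MultiIndex.from_tuples(
        [(100, 200), (100, 201), (101, 200), (101, 201)], names=["X", "Y"]
    ),
)

# Sparse case: from_pandas can ingest using the index values as coordinates,
# so each chunk can be written independently of the others.
tiledb.from_pandas("lidar_sparse", df, sparse=True)
```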
To me, it should be possible to call from_pandas on a dataframe that matches your TileDB array and have it inserted into the array at the indices it finds there. When I followed from_pandas through its flow, I noticed that much of the logic required for this is already available, but it is skipped or left unused in favor of a row-index slice.
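Concretely, this is roughly the write I have to do by hand for the dense case today, and it is what I would like from_pandas to derive from the dataframe's index on its own (again just a sketch; the assumptions that each chunk covers a complete rectangular block and that the dense array already exists with matching dimensions and attribute are illustrative):

```python
import numpy as np
import pandas as pd
import tiledb

# Same kind of chunk as above; assumed to cover a complete rectangular (X, Y) block.
df = pd.DataFrame(
    {"intensity": np.array([10, 20, 30, 40], dtype=np.uint16)},
    index=pd.MultiIndex.from_tuples(
        [(100, 200), (100, 201), (101, 200), (101, 201)], names=["X", "Y"]
    ),
)

x = df.index.get_level_values("X")
y = df.index.get_level_values("Y")

# Reshape the chunk into a dense (X, Y) block and write it at the coordinates
# spanned by the MultiIndex, instead of at a running row offset.
# Assumes "lidar_dense" already exists with integer X/Y dimensions covering this
# block and an "intensity" attribute.
block = df["intensity"].unstack("Y").to_numpy()
with tiledb.open("lidar_dense", mode="w") as A:
    A[int(x.min()) : int(x.max()) + 1, int(y.min()) : int(y.max()) + 1] = {
        "intensity": block
    }
```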
I have created a branch where I've written a preliminary implementation of the feature (and a test) with no interruption to current usage, and I can make a PR if you're interested in it: kylemann16@60defc0
It's a pretty rudimentary implementation, and I'm certain I don't know all the implications it would have, but it passes tests and works when I use it for my project.
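With that branch, the idea is that the same chunk could go through from_pandas directly, along the lines of the call below (the keyword name is purely illustrative; the actual implementation is in the commit above):

```python
import tiledb

# df: the MultiIndex chunk from the sketches above.
# "use_index_coords" is a hypothetical name, chosen only to show the call shape.
tiledb.from_pandas(
    "lidar_dense",
    df,
    mode="append",
    use_index_coords=True,  # write at the MultiIndex coordinates, not a row-index slice
)
```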
If I'm missing something and this is redundant, or if it's not in line with how you'd like TileDB-py to work, I'd love to get some feedback/discussion going. As it currently stands, from_pandas is only useful to me in a sparse array scenario.
Thank you!
johnkerl changed the title to "from_pandas should be more flexible than requiring a full row on ingestion" on Feb 11, 2025