I am trying to convert my project from using sparse arrays to dense arrays, and I ran into a lot of problems while trying to use the same methods I had been using on sparse arrays, specifically from_pandas.
Is it correct that TileDB requires an entire row of data to be consumed at the same time in order to use from_pandas?
The data I work with is represented in a dataframe with a MultiIndex and is highly variable in size (state-sized LiDAR point cloud data), with a high likelihood that a single row of data is too large to consume at once while also running all the pre-TileDB processing I need to run over it.
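For context, here is roughly what one chunk looks like and how I already ingest it into a sparse array today (a simplified sketch: the dimension names, dtypes, and URI are placeholders rather than my real schema):

```python
import numpy as np
import pandas as pd
import tiledb

# Simplified stand-in for one pre-processed LiDAR chunk: the MultiIndex levels
# are the coordinates I want to use as array dimensions.
df = pd.DataFrame(
    {"intensity": np.array([10, 20, 30, 40], dtype=np.uint16)},
    index=pd.MultiIndex.from_tuples(
        [(100, 200), (100, 201), (101, 200), (101, 201)], names=["X", "Y"]
    ),
)

# Sparse case: from_pandas can ingest using the index values as coordinates,
# so each chunk can be written independently of the others.
tiledb.from_pandas("lidar_sparse", df, sparse=True)
```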
To me, it should be possible to call from_pandas on a dataframe that matches your TileDB array and have it inserted into the array at the indices it finds there. When I followed from_pandas through its flow, I noticed that much of the logic required for this is already available, but it is skipped or left unused in favor of a row-index slice.
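Concretely, this is roughly the write I have to do by hand for the dense case today, and it is what I would like from_pandas to derive from the dataframe's index on its own (again just a sketch; the assumptions that each chunk covers a complete rectangular block and that the dense array already exists with matching dimensions and attribute are illustrative):

```python
import numpy as np
import pandas as pd
import tiledb

# Same kind of chunk as above; assumed to cover a complete rectangular (X, Y) block.
df = pd.DataFrame(
    {"intensity": np.array([10, 20, 30, 40], dtype=np.uint16)},
    index=pd.MultiIndex.from_tuples(
        [(100, 200), (100, 201), (101, 200), (101, 201)], names=["X", "Y"]
    ),
)

x = df.index.get_level_values("X")
y = df.index.get_level_values("Y")

# Reshape the chunk into a dense (X, Y) block and write it at the coordinates
# spanned by the MultiIndex, instead of at a running row offset.
# Assumes "lidar_dense" already exists with integer X/Y dimensions covering this
# block and an "intensity" attribute.
block = df["intensity"].unstack("Y").to_numpy()
with tiledb.open("lidar_dense", mode="w") as A:
    A[int(x.min()) : int(x.max()) + 1, int(y.min()) : int(y.max()) + 1] = {
        "intensity": block
    }
```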
I have created a branch where I've written a preliminary implementation of the feature (and a test) with no interruption to current usage, and I can make a PR if you're interested in it: kylemann16@60defc0
It's a pretty rudimentary implementation, and I'm certain I don't know all the implications it would have, but it passes tests and works when I use it for my project.
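With that branch, the idea is that the same chunk could go through from_pandas directly, along the lines of the call below (the keyword name is purely illustrative; the actual implementation is in the commit above):

```python
import tiledb

# df: the MultiIndex chunk from the sketches above.
# "use_index_coords" is a hypothetical name, chosen only to show the call shape.
tiledb.from_pandas(
    "lidar_dense",
    df,
    mode="append",
    use_index_coords=True,  # write at the MultiIndex coordinates, not a row-index slice
)
```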
If I'm missing something and this is redundant, or if it's not in line with how you'd like TileDB-py to work, I'd love to get some feedback/discussion going. As it currently stands, from_pandas is only useful to me in a sparse array scenario.
Thank you!
johnkerl changed the title to "from_pandas should be more flexible than requiring a full row on ingestion" on Feb 11, 2025