Consider optional support for sparse array backing. #178
---
If you plot your raster array on a map, it usually doesn't look like this [image omitted]; I'd say it probably looks like this [image omitted]. The short answer is that we don't need every chunk to be a sparse array; we need a dask Array data structure that can represent sparse chunks. Using […]
@sharkinsspatial I definitely would be interested in seeing some performance numbers from using sparse arrays, if it's something you want to try. Right now, though, I'd be more interested in something like […] which would work quite well for the "either all empty, or all full" pattern we typically have, and would be much simpler for compatibility with other libraries (since everything is still plain NumPy arrays). As a first step, you could even try making […] I also do think that just giving in and implementing a […]
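To make the "broadcast-trick" idea concrete, here is a minimal sketch (the helper name `empty_chunk` is made up for illustration, not part of stackstac or dask): `np.broadcast_to` returns a read-only zero-stride view, so an entirely-empty chunk costs one scalar of real memory while still behaving like a plain dense NumPy array.

```python
import numpy as np

# Sketch of a "broadcast-trick" chunk (illustrative; `empty_chunk` is a
# hypothetical helper, not an existing API). np.broadcast_to returns a
# read-only view with all-zero strides, so an entirely-empty chunk stores
# a single scalar rather than shape's worth of fill values.
def empty_chunk(shape, fill_value=np.nan, dtype="float64"):
    return np.broadcast_to(np.asarray(fill_value, dtype=dtype), shape)

chunk = empty_chunk((4, 1024, 1024))
print(chunk.shape)    # (4, 1024, 1024) -- looks like a normal dense chunk
print(chunk.strides)  # (0, 0, 0) -- only one scalar is actually stored
```

Because the result is still an `ndarray`, downstream libraries that expect NumPy keep working, which is the inter-compatibility advantage over per-chunk sparse formats.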
---
@gjoseph92 I'd agree that there are many analysis cases with the totally-full or totally-sparse block characteristics you described above. In those cases, your recommendation for materializing "broadcast-trick" arrays would drastically improve memory efficiency. If I understand correctly, though, this would still result in identical graph sizes? I had followed the previous discussion dask/dask#7652 about improved graph efficiency for sparse blocks, and some implementation of both ideas would be great to see. In many of our ML workflows we deal with large temporal and spatial dimensions, resulting in arrays with many actually-sparse blocks. Before diving into any detailed work on stackstac, I followed your advice and built a quick performance example of this situation with […] I don't have a clear idea of why performance is so improved in this case when using […] If I did pursue implementing this […]
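To illustrate why "actually sparse" blocks save memory, here is a toy COO-style encoding in plain NumPy (illustrative only; the real `sparse` package's COO format is more sophisticated): only the coordinates and values of non-fill elements are stored, so memory scales with the number of valid pixels rather than the dense chunk shape.

```python
import numpy as np

# Toy COO-style encoding (illustrative only; NOT the `sparse` package's API).
# Memory scales with the number of non-fill elements, not the dense shape.
def to_coo(dense, fill=0):
    coords = np.argwhere(dense != fill)   # (nnz, ndim) index array
    values = dense[tuple(coords.T)]       # (nnz,) stored values
    return coords, values, dense.shape

def to_dense(coords, values, shape, fill=0):
    out = np.full(shape, fill, dtype=values.dtype)
    out[tuple(coords.T)] = values
    return out

a = np.zeros((100, 100))
a[10, 20] = 7.0                           # one valid pixel in the chunk
coords, values, shape = to_coo(a)
print(len(values))                        # 1 stored element vs 10,000 dense
```

The round trip `to_dense(*to_coo(a))` recovers the original array, which is the basic contract any sparse chunk format would need to satisfy.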
---
@sharkinsspatial are these arrays coming straight out of […]? If they're just becoming sparse through your processing, then could you do something like […]? If your data source is actually highly sparse, that's interesting, because I didn't realize that was very common. Maybe it's more common than I think, but I still have a feeling that dask/dask#7652 would be good enough for most use cases, and much more inter-compatible.
---
After reading through the excellent NCAR post on sparse array type interoperability with `xarray`, I was considering how this approach might alleviate some of the size and graph-complexity issues related to the dense representation used by `stackstac`. Much of the EO data described by STAC items is spatially and temporally sparse given the nature of orbital paths. Though it's not a panacea, sparse backing might help with some common scaling issues.

I've done an initial review of the `stackstac` code, and while this seems feasible given the current dask array construction model, I'm a bit unsure of the cleanest way to tackle it and didn't want to invest too much time without insight and recommendations from @gjoseph92. I'm also completely new to sparse arrays, so I don't have a full understanding of whether this sparse representation would reduce the size and complexity of graphs in the wide-mosaic and large-composite use cases enough to be of sufficient value.

I realize that `stackstac` is more of a side project without significant time available for heavy maintenance, so I'm more than willing to jump in and help on the implementation side with a bit of guidance, if this sounds like a worthwhile idea.
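A back-of-envelope sketch of why sparsity matters for the wide-mosaic case (all numbers below are illustrative assumptions, not figures from this thread): dense storage pays for every fill pixel in the time × space cube, while a COO-style layout pays roughly one value plus one coordinate per dimension for each valid pixel.

```python
# Illustrative memory arithmetic for a wide mosaic stacked over time.
# Every number here is an assumption for the sketch, not a measurement.
times, height, width = 100, 20_000, 20_000
itemsize = 8                                 # float64 values
coverage = 0.05                              # assume each scene covers ~5%

dense_bytes = times * height * width * itemsize
nnz = int(times * height * width * coverage)
coo_bytes = nnz * (itemsize + 3 * 8)         # value + one int64 coord per dim

print(f"dense: {dense_bytes / 1e9:.0f} GB")  # 320 GB
print(f"COO:   {coo_bytes / 1e9:.0f} GB")    # 64 GB
```

The ratio shrinks as coverage grows, so the win depends heavily on how sparse the stack really is, which is why the performance numbers discussed above matter.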