Consider optional support for sparse array backing. #178
---
If you plot your raster array on a map, it usually doesn't look like this [image omitted]; I'd say it probably looks like this [image omitted]. The short answer is that we don't need every chunk to be a sparse array; we need a dask Array data structure that can represent sparse chunks. Using […]
@sharkinsspatial I definitely would be interested in seeing some performance numbers from using sparse arrays, if it's something you want to try. Right now, though, I'd be more interested in something like […] which would work quite well for the "either all empty, or all full" pattern we typically have, and would be much simpler for compatibility with other libraries (since everything is still plain NumPy arrays). As a first step, you could even try making […] I also do think that just giving in and implementing a […]
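To make the "broadcast-trick" idea concrete, here is a minimal sketch (the helper name `empty_chunk` is made up for illustration, not part of stackstac or dask): `np.broadcast_to` returns a read-only zero-stride view, so an entirely-empty chunk costs one scalar of real memory while still behaving like a plain dense NumPy array.

```python
import numpy as np

# Sketch of a "broadcast-trick" chunk (illustrative; `empty_chunk` is a
# hypothetical helper, not an existing API). np.broadcast_to returns a
# read-only view with all-zero strides, so an entirely-empty chunk stores
# a single scalar rather than shape's worth of fill values.
def empty_chunk(shape, fill_value=np.nan, dtype="float64"):
    return np.broadcast_to(np.asarray(fill_value, dtype=dtype), shape)

chunk = empty_chunk((4, 1024, 1024))
print(chunk.shape)    # (4, 1024, 1024) -- looks like a normal dense chunk
print(chunk.strides)  # (0, 0, 0) -- only one scalar is actually stored
```

Because the result is still an `ndarray`, downstream libraries that expect NumPy keep working, which is the inter-compatibility advantage over per-chunk sparse formats.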
---
@gjoseph92 I'd agree that there are many analysis cases with the totally-full or totally-sparse block characteristics you described above. In those cases, your recommendation for materializing "broadcast-trick" arrays would drastically improve memory efficiency. If I understand correctly, though, this would still result in identical graph sizes? I had followed the previous discussion dask/dask#7652 about improved graph efficiency for sparse blocks, and some implementation of both ideas would be great to see. In many of our ML workflows we deal with large temporal and spatial dimensions, resulting in arrays with many actually-sparse blocks. Before diving into any detailed work on stackstac, I followed your advice and built a quick performance example of this situation with […] I don't have a clear idea of why performance is so improved in this case when using […] If I did pursue implementing this […]
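To illustrate why "actually sparse" blocks save memory, here is a toy COO-style encoding in plain NumPy (illustrative only; the real `sparse` package's COO format is more sophisticated): only the coordinates and values of non-fill elements are stored, so memory scales with the number of valid pixels rather than the dense chunk shape.

```python
import numpy as np

# Toy COO-style encoding (illustrative only; NOT the `sparse` package's API).
# Memory scales with the number of non-fill elements, not the dense shape.
def to_coo(dense, fill=0):
    coords = np.argwhere(dense != fill)   # (nnz, ndim) index array
    values = dense[tuple(coords.T)]       # (nnz,) stored values
    return coords, values, dense.shape

def to_dense(coords, values, shape, fill=0):
    out = np.full(shape, fill, dtype=values.dtype)
    out[tuple(coords.T)] = values
    return out

a = np.zeros((100, 100))
a[10, 20] = 7.0                           # one valid pixel in the chunk
coords, values, shape = to_coo(a)
print(len(values))                        # 1 stored element vs 10,000 dense
```

The round trip `to_dense(*to_coo(a))` recovers the original array, which is the basic contract any sparse chunk format would need to satisfy.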
---
@sharkinsspatial are these arrays coming straight out of […]? If they're just becoming sparse through your processing, then could you do something like […]? If your data source is actually highly sparse, that's interesting, because I didn't realize that was very common. Maybe it's more common than I think, but I still have a feeling that dask/dask#7652 would be good enough for most use cases, and much more inter-compatible.
---
After reading through the excellent NCAR post on sparse array type interoperability with `xarray`, I was considering how this approach might alleviate some of the size and graph-complexity issues related to the dense representation used by `stackstac`. Much of the EO data described by STAC items is spatially and temporally sparse given the nature of orbital paths. Though it's not a panacea, sparse backing might help with some common scaling issues.

I've done an initial review of the `stackstac` code, and while this seems feasible given the current dask array construction model, I'm a bit unsure of the cleanest way to tackle it and didn't want to invest too much time without insight and recommendations from @gjoseph92. I'm also completely new to sparse arrays, so I don't have a full understanding of whether this sparse representation would reduce the size and complexity of graphs in the wide-mosaic and large-composite use cases enough to be of sufficient value.

I realize that `stackstac` is more of a side project without significant time available for heavy maintenance, so I'm more than willing to jump in and help on the implementation side with a bit of guidance, if this sounds like a worthwhile idea.
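A back-of-envelope sketch of why sparsity matters for the wide-mosaic case (all numbers below are illustrative assumptions, not figures from this thread): dense storage pays for every fill pixel in the time × space cube, while a COO-style layout pays roughly one value plus one coordinate per dimension for each valid pixel.

```python
# Illustrative memory arithmetic for a wide mosaic stacked over time.
# Every number here is an assumption for the sketch, not a measurement.
times, height, width = 100, 20_000, 20_000
itemsize = 8                                 # float64 values
coverage = 0.05                              # assume each scene covers ~5%

dense_bytes = times * height * width * itemsize
nnz = int(times * height * width * coverage)
coo_bytes = nnz * (itemsize + 3 * 8)         # value + one int64 coord per dim

print(f"dense: {dense_bytes / 1e9:.0f} GB")  # 320 GB
print(f"COO:   {coo_bytes / 1e9:.0f} GB")    # 64 GB
```

The ratio shrinks as coverage grows, so the win depends heavily on how sparse the stack really is, which is why the performance numbers discussed above matter.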