[WIP] Incremental rechunking #28

Open · wants to merge 3 commits into base: master

Conversation

davidbrochart (Collaborator)

This is a limited implementation of incremental rechunking. There is still a lot to do, but I'd like to get early feedback on the approach.
Closes #8

codecov bot commented Jul 18, 2020

Codecov Report

Merging #28 into master will increase coverage by 0.75%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #28      +/-   ##
==========================================
+ Coverage   88.94%   89.70%   +0.75%     
==========================================
  Files           2        2              
  Lines         190      204      +14     
  Branches       44       50       +6     
==========================================
+ Hits          169      183      +14     
  Misses         11       11              
  Partials       10       10              
Impacted Files       Coverage Δ
rechunker/api.py     93.52% <100.00%> (+0.72%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fdddf0f...4506e97.

@rabernat (Member)

Thanks a lot for this PR @davidbrochart! I really appreciate your contribution. I will try to give a thorough review in the next few days.

@rabernat (Member) left a comment

Thanks a lot for this @davidbrochart. I have some questions about this implementation. I'd also like @TomAugspurger to have a look.

@@ -314,13 +374,21 @@ def _rechunk_array(
    source_read = dsa.from_zarr(
        source_array, chunks=read_chunks, storage_options=source_storage_options
    )
    if source_slice is not None:
        source_read = source_read[source_slice]
Member

It would be awesome to have access to Xarray's array adaptors here. Xarray has an internal implementation of lazily indexed arrays that would be preferable to slicing the dask array.

The downside of this approach is that, because source_read is much bigger than source_read[source_slice], we end up carrying around a big, unnecessary dask graph. This will be culled at compute time, but it still creates unnecessary overhead.
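
To illustrate that overhead, here is a minimal sketch (the store name "source.zarr" and the slice bounds are placeholders, not from this PR):

import dask
import dask.array as dsa

source_read = dsa.from_zarr("source.zarr")  # graph holds one task per source chunk
sliced = source_read[:20]                   # slicing adds tasks; nothing is dropped yet

print(len(sliced.__dask_graph__()))         # still references every source chunk
(culled,) = dask.optimize(sliced)           # culling happens when the graph is optimized/computed
print(len(culled.__dask_graph__()))         # only the chunks touched by the slice remain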

davidbrochart (Collaborator, Author)

I was thinking about that, but that would make xarray a dependency of this project, right? And we would use xarray.open_zarr instead of dask.array.from_zarr?

@shoyer (Collaborator) commented Jul 31, 2020

I agree, something like a "virtually sliced array" that effectively adds another slice onto supplied slices in __getitem__/__setitem__ would be really convenient here. This would let us reuse the same existing Executor interface for appending, just by passing in virtually sliced arrays for source and target.

I don't think we want to directly use xarray's array adapters (which come with lots of other baggage), but they are definitely good prior art. Rechunker's version can be much simpler than xarray's LazilyOuterIndexedArray, because it only needs to support slices of the form tuple(slice(int, int, None), ...), with start/stop bounds that never expand beyond the array itself.

Something like the following (untested!) might actually be enough, and should work OK with dask.array.from_array and dask.array.store:

class VirtuallySlicedArray:
    """Array-like view that confines all reads and writes to a fixed slice of `target`."""

    def __init__(self, target, key):
        # only plain, in-bounds slices with no step are supported
        if not all(
            k.start >= 0 and k.stop <= s and k.step is None
            for k, s in zip(key, target.shape)
        ):
            raise ValueError(f'invalid key {key} for shape {target.shape}')
        self.target = target
        self.key = key

    @property
    def shape(self):
        return tuple(k.stop - k.start for k in self.key)

    @property
    def dtype(self):
        return self.target.dtype

    def _expand_key(self, key):
        # offset the incoming key by the fixed slice so it indexes into `target`
        return tuple(
            slice(k1.start + k2.start, k1.start + k2.stop)
            for k1, k2 in zip(self.key, key)
        )

    def __getitem__(self, key):
        return self.target[self._expand_key(key)]

    def __setitem__(self, key, value):
        self.target[self._expand_key(key)] = value
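
A hedged usage sketch of the adapter above (also untested; the store names, slice, and chunk size are placeholders):

import zarr
import dask.array as dsa

source = zarr.open("source.zarr", mode="r")   # e.g. a 1-D array of length >= 20
target = zarr.open("target.zarr", mode="a")   # existing array of the same length

key = (slice(0, 20),)                         # the increment being rechunked/appended
arr = dsa.from_array(VirtuallySlicedArray(source, key), chunks=(10,))
dsa.store(arr, VirtuallySlicedArray(target, key), lock=False)

Here dask only ever reads and writes within the virtual slice, so the existing Executor machinery could be reused unchanged.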


# create target
shape = tuple(int(x) for x in shape) # ensure python ints for serialization
target_chunks = tuple(int(x) for x in target_chunks)
int_chunks = tuple(int(x) for x in int_chunks)
write_chunks = tuple(int(x) for x in write_chunks)

if target_append:
    target_store_or_group_original = target_store_or_group
    target_store_or_group = "to_append_" + target_store_or_group
Member

I don't think this will work on stores that are not paths. Note that the store can also be a mutable mapping (e.g. from s3fs). This is not covered by tests but should be!
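
One possible direction (just a sketch, not part of this PR; the helper name is made up) is to only prefix string paths and require an explicit temporary store otherwise:

from collections.abc import MutableMapping

def append_target_for(store, temp_store=None, prefix="to_append_"):
    # Plain paths/URLs can be prefixed, but a MutableMapping store (e.g. an
    # s3fs/fsspec mapping) has no path to prefix, so a temporary store has to
    # be supplied explicitly.
    if isinstance(store, str):
        return prefix + store
    if isinstance(store, MutableMapping):
        if temp_store is None:
            raise ValueError("an explicit temporary store is required for mapping stores")
        return temp_store
    raise TypeError(f"unsupported store type: {type(store)}")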

davidbrochart (Collaborator, Author)

Yes, this is what I meant by "limited implementation". I will work on that later.

return self._target

target = zarr.open(self._target_original)
target.append(self._target)
Member

I must admit, I don't understand why we are calling append here. Shouldn't the append operation happen inside the self.compute call?

davidbrochart (Collaborator, Author)

My idea was that when we append (target_append=True), we want to append new source data to an already existing zarr target_store. So we rechunk the new source data into a temporary target store (which is 'to_append_' + target_store for now, just to keep it simple) created by the self.compute call, and then use zarr's append method to append it to the original target_store. Do you think this is too expensive, and that we should append directly to the original target store?
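
Roughly, the flow described above would be (a sketch; store names are placeholders):

import zarr

# after self.compute has rechunked the new data into the temporary store
target = zarr.open("target_store")                # the pre-existing target
to_append = zarr.open("to_append_target_store")   # the freshly rechunked increment
target.append(to_append)                          # zarr concatenates along axis 0

Note that this final copy runs in the client process, which is the concern raised below.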

Member

I think it is best if we try to append directly to the original target store.

Also, we want to have all the computations inside the dask graph. As currently coded (append in execute), this part of the operation will not be run by dask, but rather will run from the client. That will be a huge bottleneck for large arrays.

Member

Agreed that we want this to happen inside the compute, so it should be added as a task.

Member

> so it should be added as a task.

But not just a single task. We need to parallelize the append operations just like we do all other writes, via dask.array.to_zarr.
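
A hedged sketch of that (not the PR's code; the store names and append axis are assumptions): resize the existing target, then write only the new region in parallel with dask.

import zarr
import dask.array as dsa

target = zarr.open("target_store", mode="a")       # existing array, shape (n, ...)
new = dsa.from_zarr("rechunked_increment_store")   # rechunked new data, shape (m, ...)

n, m = target.shape[0], new.shape[0]
target.resize((n + m,) + target.shape[1:])         # grow along the append axis

region = (slice(n, n + m),) + tuple(slice(0, s) for s in target.shape[1:])
dsa.store(new, target, regions=region)             # one write task per chunk, inside the graph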

The slice of the source to rechunk. The structure depends on ``source``.

- For a single array source, ``source_slice`` can be either a tuple (e.g.
``((0, 20), None, None)`` or a dictionary (e.g. ``{'time': (0, 20),
Member

I think these continuation lines need to be aligned with the F in For.

Comment on lines +84 to +85
def prod(iterable):
    return reduce(operator.mul, iterable, 1)
Member

Use np.prod?
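
For example, a drop-in equivalent of the reduce-based helper in the diff above:

import numpy as np

def prod(iterable):
    # np.prod of an empty sequence is 1, matching reduce(operator.mul, iterable, 1)
    return int(np.prod(list(iterable)))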

Comment on lines +80 to +82
import operator
from functools import reduce
import numpy as np
Member

Imports can go at the top.

@rabernat (Member)

I would wait on further action on this until #30 is merged. That is a pretty significant refactor to the internal structure of the package.

@davidbrochart (Collaborator, Author)

Yes, I agree.

@rabernat (Member)

@davidbrochart, now that #30 is done, we might want to revisit this.

Perhaps @shoyer has some ideas about how to best incorporate incremental rechunking / appending into the new code structure.

Again it seems like xarray's lazy indexing adaptors could come in very handy.

@davidbrochart (Collaborator, Author)

@rabernat, do you mean that rechunker would depend on xarray, or that we would pull xarray's lazy indexing logic into rechunker's code?

@rsemlal-murmuration

Was there any progress on this since then?

@rabernat (Member)

Hi @rsemlal-murmuration - it turns out that incremental rechunking is pretty tricky (lots of edge cases)! There hasn't been any work on this recently in rechunker.

However, at Earthmover, we are exploring many different approaches to this problem currently.

@rsemlal-murmuration

Understood! Thanks for the quick reply!

Looking into this as well at the moment.
The workaround we are considering: using rechunker to write the data slice into a new intermediate location, then appending it from there to the existing dataset using xarray.to_zarr(mode="a"). But it is obviously not the most efficient approach.

Would be interested if there are other approaches/workarounds out there.
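
For reference, a rough sketch of that workaround (the dataset, chunk sizes, stores, and append dimension are all placeholders):

import xarray as xr
from rechunker import rechunk

# 1. rechunk the new slice of data into an intermediate zarr store
plan = rechunk(
    new_slice_ds,                                   # an xarray.Dataset holding the new data
    target_chunks={"var": {"time": 100, "x": 50}},  # desired chunking, per variable
    max_mem="1GB",
    target_store="intermediate.zarr",
    temp_store="tmp.zarr",
)
plan.execute()

# 2. append the rechunked slice to the existing dataset
xr.open_zarr("intermediate.zarr").to_zarr("existing.zarr", mode="a", append_dim="time")

The extra intermediate copy is what makes this less efficient, as noted above.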

Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues: Incremental rechunk
5 participants