
Incremental rechunk #8

Open
davidbrochart opened this issue Jun 11, 2020 · 8 comments · May be fixed by #28

Comments

@davidbrochart
Collaborator

rechunker solves, in a much cleaner way, a problem I was trying to solve myself; thanks a lot for working on it. I've tried it on the GPM dataset and it seems to work fine.
Do you know whether it could work in an incremental mode? By that I mean: if I have already rechunked part of a dataset and want to continue later, is it possible to rechunk only the remaining source and append it to the already-rechunked destination?

@rabernat
Member

I'm glad this is helpful! 😄

> Do you know if it would work in an incremental mode?

It should definitely be possible in principle. But not as currently implemented.

We are trying to release this soon with its current feature set. Once we stabilize the API a bit, we would be happy to have a PR that would add incremental support.

@rabernat
Member

Hi @davidbrochart. We have done a first release and have some decent docs up. It would be fantastic if you wanted to tackle the incremental case. What sort of API did you have in mind?

@davidbrochart
Collaborator Author

Great @rabernat, I'll try to implement the incremental rechunking. As far as the API is concerned, we probably want to be able to slice the source, so that we don't have to rechunk the whole dataset and can restart later from a different position. For the initial rechunk we could have:

source = zarr.ones((4, 4), chunks=(1, 4), store="source.zarr")
intermediate = "intermediate.zarr"
target = "target.zarr"
rechunked = rechunk(source,
                    target_chunks=(2, 2),
                    target_store=target,
                    max_mem=256000,
                    temp_store=intermediate,
                    source_slice=((0, 2), (0, 4)))

And for the next rechunk we need to get the next slice and specify that it should be appended to the previous target:

rechunked = rechunk(source,
                    target_chunks=(2, 2),
                    target_store=target,
                    max_mem=256000,
                    temp_store=intermediate,
                    source_slice=((2, 4), (0, 4)),
                    target_append=True)

What do you think?

@rabernat
Member

I'm curious why we need the source_slice argument. It seems like we should be able to just pass a sliced array, no?

But I guess zarr may not support lazy slicing.

@davidbrochart
Collaborator Author

> But I guess zarr may not support lazy slicing.

Yes, I think if we slice the Zarr array we get an in-memory NumPy array.
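For illustration, basic indexing on a Zarr array reads the selection eagerly and returns a NumPy array, which is why passing a sliced array such as source[0:2, :] would lose the out-of-core representation (a quick standalone check):

import numpy as np
import zarr

source = zarr.ones((4, 4), chunks=(1, 4))

sliced = source[0:2, :]                # basic indexing loads the selection into memory
print(type(sliced))                    # <class 'numpy.ndarray'>
print(isinstance(sliced, np.ndarray))  # True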

@rabernat
Member

Thoughts on the API @TomAugspurger, @tomwhite?

@tomwhite
Collaborator

This feature will be very useful. The API looks good to me.

I briefly wondered if source_slice is needed at all, since in append mode only new data would be rechunked, but that's not safe if the source is being written to at the same time as it is being incrementally rechunked. So source_slice is needed. It should be optional, though, to support the non-incremental case.
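A small sketch of how the optional argument could behave (hypothetical helper, not part of rechunker): when source_slice is omitted, it defaults to the full extent of the source.

import zarr

# Hypothetical helper: default source_slice to the whole array when omitted.
def normalize_source_slice(source, source_slice=None):
    if source_slice is None:
        source_slice = tuple((0, size) for size in source.shape)
    return tuple(slice(start, stop) for start, stop in source_slice)

src = zarr.ones((4, 4), chunks=(1, 4))
normalize_source_slice(src)                    # (slice(0, 4), slice(0, 4))
normalize_source_slice(src, ((2, 4), (0, 4)))  # (slice(2, 4), slice(0, 4))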

@davidbrochart
Collaborator Author

Also, even if the source is not being written to, you may not want to rechunk the whole of it, because that can take a lot of time; you should be able to rechunk it in parts. So source_slice should be optional in the incremental case too, and when it is omitted the whole dataset should be rechunked.

@davidbrochart linked a pull request (Jul 18, 2020) that will close this issue.