A basic default ChunkManager for arrays that report their own chunks #8733
Even removing the `abstractmethod` decorator would be welcome ;)
Hi @hmaarrfk! I'm excited to see someone is actually using the chunk manager entrypoint system!

I'm curious to hear more about your use case - do you have the code for your custom chunk manager somewhere?

I'm not totally sure I understand what the purpose of a "default" chunkmanager would be - IIUC you're trying to expose your underlying array's `chunks` attribute?
Which one? Or you mean all of them except literally just `chunks`?
I've provided the patch that is essentially my code for the chunk manager ;). I am a little stricter and I check for my specific array classes.
100% correct. It's my own implementation of "lazy access" to TIFFs and MP4s as read-only arrays. It really only implements the very features I need, so it isn't fully feature-complete. I've found it necessary to implement this myself for performance reasons.
I mean all of them except `chunks` and `compute`.

edit: reordered some paragraphs
Okay overall this seems like a good idea to me! Basically allowing for the idea that there are really three types of arrays: un-chunked duck arrays, chunked duck arrays where the chunks might be read but won't be changed, and re-chunkable duck arrays where the re-chunking happens via some (probably parallel) processing framework.
Sorry, I'm not quite following this part. You're talking about inside …?
Also note there are two senses of the idea of a "default" chunkmanager to keep track of here: (1) what is used to request the …
This is my code ;)

```python
import numpy as np

from xarray.core.parallelcompat import ChunkManagerEntrypoint
from xarray.core.types import T_NormalizedChunks

from ._my_array import MyArray1, MyArray2


class MyChunkManager(ChunkManagerEntrypoint["MyArrayEntryPoint"]):
    array_cls: tuple[type, ...]
    available: bool = True

    def __init__(self) -> None:
        self.array_cls = (MyArray1, MyArray2)

    def chunks(self, data) -> T_NormalizedChunks:
        return data.chunks

    def compute(self, *data, **kwargs) -> tuple[np.ndarray, ...]:
        # "computing" here is just materializing each lazy array as numpy
        return tuple(np.asarray(d) for d in data)

    def apply_gufunc(self, *args, **kwargs):
        raise NotImplementedError()

    def from_array(self, *args, **kwargs):
        raise NotImplementedError()

    def normalize_chunks(self, *args, **kwargs):
        raise NotImplementedError()
```
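For illustration, here is a toy stand-in for the kind of array such a manager wraps - all names in this sketch are hypothetical, not from the patch above: a read-only array that reports `chunks` and materializes through `np.asarray`, which is exactly what `MyChunkManager.compute` relies on.

```python
import numpy as np


class TinyChunkedArray:
    """Hypothetical stand-in for MyArray1: chunk-reporting, read-only."""

    def __init__(self, data, chunklen):
        self._data = np.asarray(data)
        # dask-style normalized chunks: one tuple of block sizes per axis
        nblocks = self._data.shape[0] // chunklen
        self.chunks = ((chunklen,) * nblocks,)
        self.shape = self._data.shape
        self.dtype = self._data.dtype

    def __array__(self, dtype=None):
        # plain numpy coercion is the only "compute" this array supports
        return np.asarray(self._data, dtype=dtype)


arr = TinyChunkedArray(np.arange(8), chunklen=2)
print(arr.chunks)       # ((2, 2, 2, 2),)
print(np.asarray(arr))  # [0 1 2 3 4 5 6 7]
```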
You are correct. To be explicit, my problem is that implementing a whole chunkmanager, one that allows for rechunking and many other fancy operations, seems a little overkill when I (or other users) just try to expose a duck array that reports its own `chunks`:

```
  ...
    return dataset.isel(isel).compute()
  File "/home/mark/git/xarray/xarray/core/dataset.py", line 1011, in compute
    return new.load(**kwargs)
  File "/home/mark/git/xarray/xarray/core/dataset.py", line 842, in load
    chunkmanager = get_chunked_array_type(*lazy_data.values())
  File "/home/mark/git/xarray/xarray/core/parallelcompat.py", line 142, in get_chunked_array_type
    raise TypeError(
TypeError: Could not find a Chunk Manager which recognises type <class 'mark._my_array.MyArray1'>
```

I'm sorry for jumping to a proposal for a solution. I'm trying to get better at framing my asks overall (even beyond open source).
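For concreteness, a hedged, minimal reproduction of that failure mode - the class and its attributes are hypothetical, and it assumes xarray's duck-array check keys on the presence of `__array_ufunc__`/`__array_function__` plus `shape`/`dtype`/`ndim`; if the checks differ in your xarray version, the class may need more of the protocol:

```python
import numpy as np
import xarray as xr


class ChunkedDuck:
    """A duck array that reports chunks but has no registered chunkmanager."""

    __array_ufunc__ = None       # just enough to pass xarray's duck-array check
    __array_function__ = None

    def __init__(self, data, chunks):
        self._data = np.asarray(data)
        self.chunks = chunks     # dask-style: block sizes per axis
        self.shape = self._data.shape
        self.dtype = self._data.dtype
        self.ndim = self._data.ndim

    def __array__(self, dtype=None):
        return np.asarray(self._data, dtype=dtype)


da = xr.DataArray(ChunkedDuck(np.arange(4), ((2, 2),)), dims="x")
da.compute()  # TypeError: Could not find a Chunk Manager which recognises ...
```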
The other reason why I think this is good for xarray to consider is that I was pretty close to avoiding exposing `chunks` at all. The entrypoint system is clever: it avoids circular dependencies and allows plugins to be built. But a developer that writes …
No, this is great! Some initiative in thinking of possible solutions is always appreciated :)

So there's arguably a more general issue here: if you have a new duckarray type, xarray will happily wrap it and use it without you explicitly telling xarray what type it is (hence the term "duck-typed"). But currently, if you have a duckarray type that also implements a `chunks` attribute, xarray refuses to wrap it unless some registered chunkmanager recognises that type.

The reason we have chunkmanagers at all is for when chunked computation requires calling a framework-specific function like `apply_gufunc`.

One solution would be to just remove the check. Another solution is similar to what you're suggesting: define a default chunkmanager that accepts any array exposing `chunks`.
FYI @andersy005
@hmaarrfk would you be interested in submitting a PR to add the "default" chunk manager?
Is your ask simply that we forward `chunks` from the underlying array?
Yeah, we might have fallen down a rabbit hole here...
What kind of property could one use to identify arrays that support distributed computing?
dask and cubed both have a …
Sorry if I'm being dense, but I thought all you need is to access the underlying `chunks` attribute?
Yes, but I believe @hmaarrfk is asking about what to do with the existing …
No need to apologize, the thread got kinda long. All I need is for xarray not to complain if I add a `chunks` property to my arrays.

I'm trying to do so without destroying the pluggable backends you all built for dask/cubed.
Ah, thanks for explaining. Maybe …
Of course.
There currently is no global …
See the suggestion addressing the lack of a global in the PR. It seems like a solvable problem, if changes to the function signature are OK.
Sorry for not responding for a while. Another thing I learned in my usage is that a definition of `store` along these lines:

```python
def store(
    self,
    sources: Any,
    targets: Any,
    **kwargs: dict[str, Any],
) -> Any:
    # basic error checking for things I don't support, like regions
    for source, target in zip(sources, targets):
        # copy one chunk at a time to bound memory usage
        for s in iterchunk_slices(source.shape, source.chunks):
            target[s] = source[s]
```

was necessary to get things to serialize to an nc file. The fact that my implementation of …
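For reference, `iterchunk_slices` above is a user-side helper rather than anything xarray provides. A minimal sketch of what such a helper could look like, assuming dask-style normalized chunks (one tuple of block sizes per axis):

```python
from itertools import product


def iterchunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk (hypothetical helper sketch)."""
    for dim, sizes in zip(shape, chunks):
        assert sum(sizes) == dim, "chunks must tile each axis exactly"

    per_axis = []
    for sizes in chunks:
        slices, start = [], 0
        for size in sizes:
            slices.append(slice(start, start + size))
            start += size
        per_axis.append(slices)

    # the cartesian product of per-axis slices covers every chunk once
    yield from product(*per_axis)


# e.g. a (4, 4) array in 2x2 chunks -> four slice-tuples
for s in iterchunk_slices((4, 4), ((2, 2), (2, 2))):
    print(s)
```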
My turn to apologise for not responding for a while!
I realised I have exactly the same problem as you do in another library - see zarr-developers/VirtualiZarr#114.

My understanding of your PR is that it changes xarray's behaviour upon encountering an array with chunks that it doesn't know how to compute, from raising an error to silently passing the array through. This would solve both our errors, but the problem is that if you try to compute a dask/cubed array without the corresponding library installed, the correct chunkmanager won't be defined, and a non-loaded version of an array which definitely should be loaded will be passed further through the code.

An alternative suggestion would be to change the check from checking for `chunks` to checking for a …
It might be good to get usage feedback from one more person. Silently failing would have made my application's memory grow to 500 GB and just kill itself, so the error message was useful. We could link them to this discussion and learn more from another real usage case. My chunk manager will likely persist in my codebase for the foreseeable future.
Okay, so I just hit this issue again in virtualizarr, but for … In hindsight I now think that this taxonomy of mine is not quite right, and that the …

Instead we have arrays, which may have chunks (and may be rechunkable), and separately may be computable. One can imagine chunked arrays with no special computing semantics beyond defining `__array__`.

One suggestion for how to deal with this: pass all …

This suggestion is subtly different to either what @hmaarrfk or @dcherian suggested above. Crucially it means that (a) rechunking wrapped arrays doesn't require defining an entrypoint, but (b) computing with custom semantics always does require a corresponding entrypoint, and if there isn't the correct … This would imply minor changes to cubed-xarray, which would define a … Another interesting possible use case might be a …

What do you think @keewis @dcherian @negin513 @tomwhite @alxmrs @slevang? Is this a reasonable way to abstract out these special cases? Or is cupy/pint/sparse defining an entrypoint just so that their one final compute method works actually massive overkill? (Alternative idea for cupy - define …)

Note also that this is related to how the array API standard does not define a specific method for materializing lazy arrays (data-apis/array-api#748).
I think the error here was that the ChunkManager ABC is too greedy and handles anything with a `chunks` attribute.
Yes. You are also conflating "compute" with "to_numpy". And I don't think it's a fundamental property --- that VirtualiZarr does not implement it seems like an implementation detail. "Compute" in Xarray means load lazy arrays to memory -- here lazy arrays are wrappers around on-disk data, or dask/cubed arrays. Usually, this means load as numpy, but dask could end up "computing" to cupy or sparse instead. With Zarr-to-GPU support, even Xarray's lazy loading classes will resolve to cupy.
No, this would be quite the regression. IIUC this mostly gets resolved if the ChunkManager is less greedy and doesn't trigger on the mere existence of `chunks`.

PS: Apparently the array api defines …
thanks @dcherian
👍
I don't think I agree - in VirtualiZarr there is no definition of "Compute" that would return a numpy-like in-memory array to still be wrapped by xarray. The closest operation to "compute" in virtualizarr writes json/parquet files to disk. (Though if we integrated virtualizarr upstream in zarr-python then it would be a different story - see #9281.)
You're right, that was a sloppy description on my part.
But currently xarray has internal special-casing to know how to coerce cupy/pint/sparse arrays back to numpy.
Unless I've misunderstood something, I don't think we can do that, because the return type of …
I agree that this would probably close this particular issue though. Currently it triggers on seeing a `chunks` attribute.
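As a rough sketch of that "less greedy" check - a hypothetical function, assuming managers implement the ABC's `is_chunked_array` and that `list_chunkmanagers` returns the registered entrypoints:

```python
from xarray.core.parallelcompat import list_chunkmanagers


def needs_chunkmanager(arr) -> bool:
    # dispatch to chunked-computation machinery only when a registered
    # manager explicitly recognises this array type, instead of keying
    # on the mere presence of a `chunks` attribute
    return any(
        manager.is_chunked_array(arr)
        for manager in list_chunkmanagers().values()
    )
```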
There is no implemented definition of compute, in the sense of "give me concrete values from this abstract array". You make my point in the second line -- you could return concrete values, it just hasn't been implemented yet.
Not just that, but they also actively error on …
That's what I meant by "generalize". It seems like we can attempt to allow arbitrary array-array conversions.
Nice, let's do that!
Is your feature request related to a problem?
I'm creating duckarrays for various file-backed data structures of mine that are naturally "chunked", i.e. different parts of the array may live in completely different files.
Using these "chunks" and "strides", algorithms can better decide how to iterate in a convenient manner.
For example, an MP4 file's chunks may be delimited by I-frames, while images stored in a TIFF may be delimited by pages.
So for me, chunks are not so useful for parallel computing, but more for computing locally and choosing the appropriate way to iterate through large arrays (terabytes of uncompressed data).
Describe the solution you'd like
I think a default chunk manager could simply implement `compute` as `np.asarray`, act as a default instance, and be a catch-all for all other instances. Advanced users could then go in and reimplement their own chunkmanager, but I was unable to use my duckarrays that included a `chunks` property because they weren't associated with any chunk manager.

Something as simple as:
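(The original snippet did not survive extraction; below is a minimal sketch of what such a default could look like, reconstructed from the description above and the patch posted earlier in the thread. The class name and details are assumptions, not the original code.)

```python
import numpy as np

from xarray.core.parallelcompat import ChunkManagerEntrypoint
from xarray.core.types import T_NormalizedChunks


class DefaultChunkManager(ChunkManagerEntrypoint):
    """Hypothetical catch-all for any duck array exposing `chunks`."""

    available: bool = True

    def __init__(self) -> None:
        self.array_cls = object  # placeholder: matching is overridden below

    def is_chunked_array(self, data) -> bool:
        # recognise anything that reports its own chunks
        return hasattr(data, "chunks")

    def chunks(self, data) -> T_NormalizedChunks:
        return data.chunks

    def compute(self, *data, **kwargs) -> tuple[np.ndarray, ...]:
        # "compute" is plain coercion to numpy
        return tuple(np.asarray(d) for d in data)

    def apply_gufunc(self, *args, **kwargs):
        raise NotImplementedError("no parallel-computing semantics")

    def from_array(self, *args, **kwargs):
        raise NotImplementedError("rechunking is framework-specific")

    def normalize_chunks(self, *args, **kwargs):
        raise NotImplementedError("rechunking is framework-specific")
```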
Describe alternatives you've considered
I created my own chunk manager, with my own chunk manager entry point.
Kinda tedious...
Additional context
It seems that this is related to: #7019