How to prevent Zarr from returning NaN for missing chunks? #486
Looks like it's necessary to override https://github.com/zarr-developers/zarr-python/blob/e7708c948d2c0ff91863d675c1072b0dfe9ce2a6/zarr/core.py#L1581 et al. to make zarr raise on non-existing chunks?
Have you tried setting the fill value?
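[For context, a minimal sketch of what "setting the fill value" means in the zarr v2 API; the path, shape, and value below are made up for illustration:]

```python
import zarr

# fill_value is set when the array is created; chunks that have never been
# written then read back as this value instead of NaN.
z = zarr.open("example.zarr", mode="w", shape=(4, 4), chunks=(2, 2),
              dtype="f8", fill_value=-9999.0)
print(z[:2, :2])  # no chunk written yet -> all -9999.0
```

[Note that this only changes *which* value a missing chunk reads back as; it does not make the read raise, which is what the rest of this thread is about.]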
Yes, this is not currently supported, but it would be straightforward to add. I've actually been thinking myself recently that this would be useful for very similar reasons, i.e., when you want to ensure that an exception is raised when trying to access data from a missing chunk.

One way to achieve this could be to add a mechanism for activating this behaviour explicitly, e.g. something like:

```python
z = ...  # some zarr array
z.set_options(fill_missing_chunk=False)
```

Thoughts and suggestions welcome.
Just to say that I still think this is a valid thing to address in zarr, but there could/should also be some work upstream in fsspec to ensure that […]
(This was fixed in fsspec/filesystem_spec#259.)
Thanks @martindurant for the upstream fix. I think I will reopen this issue, however, as there may still be use cases where you want to change zarr's behaviour. E.g., you may know that you definitely do have some missing chunks in the data, and you want to make sure you don't accidentally request any data from a region overlapping a missing chunk.
Hi @alimanfoo and @willirath, thanks for raising this. Just ran across this issue while using zarr on Google Cloud. With huge jobs there are always I/O failures, but I never would have expected this behavior:

```python
import xarray as xr
import numpy as np
import os

ds = xr.Dataset({
    'x': [0, 1],
    'y': [0, 1],
    'myarr': (('x', 'y'), [[0., np.nan], [2., 3.]]),
})
ds.chunk({'x': 1, 'y': 1}).to_zarr('myzarr.zarr')

# chunk file disappears due to bad write, storage failure, gremlins...
os.remove('myzarr.zarr/myarr/1.0')

# I would LOVE for this to return an error
read_ds = xr.open_zarr('myzarr.zarr').compute()

# instead the error is here
xr.testing.assert_equal(ds, read_ds)
```

This seems pretty problematic when NaNs are actually meaningful: without additional inspection of the filesystem and zarr metadata, it's impossible to know whether a NaN is a real NaN, a failed write from a previous step, or just the chaos monkeys that live on cloud systems generally. Is there a patch you can recommend as a workaround? I'm less familiar with the zarr API, as I've only used it via xarray.
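[A workaround sketch, editorially added rather than from the thread; it assumes the zarr v2 API and the `myzarr.zarr` layout from the example above. It compares the chunks the array should have against the chunk keys actually present in the store before trusting any NaNs:]

```python
import itertools
import zarr

# Open the array directly (xarray wraps this same object).
arr = zarr.open("myzarr.zarr", mode="r")["myarr"]

# nchunks counts the chunks the array should have; nchunks_initialized
# counts the chunk keys actually present in the store.
if arr.nchunks_initialized < arr.nchunks:
    # Enumerate the expected chunk keys, e.g. 'myarr/1.0'.
    prefix = arr.path + "/" if arr.path else ""
    expected = {
        prefix + ".".join(map(str, idx))
        for idx in itertools.product(*(range(n) for n in arr.cdata_shape))
    }
    missing = sorted(k for k in expected if k not in arr.chunk_store)
    raise RuntimeError(f"missing chunk files: {missing}")
```

[Caveat: this cannot distinguish a chunk that was never written from one lost later, so it only helps when every chunk is expected to exist.]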
Did you see #489 (comment), @delgadom? Perhaps give that some testing to help drive it forward?
The fsspec mapper is designed to let you specify which basic exceptions get turned into KeyError (i.e., missing data) and which do not. I cannot quite seem to make it work for this use case, though; I expect the code needs a little work. Also, there is a difference between creating an fsspec mapper instance and passing it to open_zarr versus passing a […]. I think the bug is in zarr […]
@martindurant: do you assume #489 is unneeded then?
If we are willing to have people use fsspec for this kind of use case, then it can be fixed in fsspec and zarr's fsspec-specific code. This is an alternative option, one that fsspec would potentially benefit from too (although zarr is probably the only main user of the mapper interface). Of course, I don't mind if there's another way to achieve the same goal.
I think relying on […]. But I also see use cases where being able to raise the […]
If I'm interpreting the […] (this is separate from the issue @martindurant brought up about […]).

I'm also not sure if there's a more fsspec-specific way that #489 should be implemented, or whether the current approach makes sense. Does catching […]?
Correct, I am saying that working in the fsspec code is a possible alternative to #489: you could get the behaviour without that PR, but the fsspec code would need fixing; I don't think it's quite right now. But yes, in general, […]
I read https://github.com/zarr-developers/zarr-python/blob/master/docs/spec/v2.rst to say that you should get a None if there's a KeyError when fill_value is None […]. But trying that out, that's not the behavior I see: newly created arrays seem to get random values for the missing chunk.
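[A minimal sketch of the behavior described above, editorially added; assumes the zarr v2 API, with a made-up path and shape. With `fill_value=None` and no chunk ever written, the read returns uninitialized memory rather than raising:]

```python
import zarr

# fill_value=None means the spec defines no fill, so when a chunk key is
# absent zarr allocates the output buffer without initializing it.
z = zarr.open("spec_test.zarr", mode="w", shape=(4,), chunks=(2,),
              dtype="f8", fill_value=None)
print(z[:])  # arbitrary values, and no exception, despite no chunk existing
```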
@willirath: did you look into whether the behavior of a […]?
@joshmoore No, I didn't check this.
FYI I tried to use fsspec's missing_exceptions […]. Here's the code I used to reproduce this:

```python
import fsspec
import zarr

fs = fsspec.filesystem("file")

# create an array with no chunks on disk
mapper = fs.get_mapper("tmp.zarr")
za = zarr.open(mapper, mode="w", shape=(3, 3), chunks=(2, 2))

# ensure no exceptions are converted to KeyError
mapper = fs.get_mapper("tmp.zarr", missing_exceptions=())

# following should fail since chunks are missing
print(zarr.open(mapper, mode="r")[:])
```
Thanks for reporting this, Tom. I think I see why 4e633ad caused this problem. Now that all […] (see zarr/core.py, lines 1425 to 1430 at f542fca).
I thought I could make your example work by doing the following:

```python
store = zarr.storage.FSStore("tmp.zarr", exceptions=(), missing_exceptions=())
try:
    store['0.0']
except FileNotFoundError as e:
    print(type(e))
except KeyError as e:
    print(type(e))
```

At this point, we can verify that we are not raising a KeyError. But reading through an array opened from the same store still does not raise:

```python
a2 = zarr.open(store, mode="r")
print(a2[:])
```

I have not been able to track down why that is.
I suppose that coercing an FSMap to an FSStore should do its best to get all available options; but we should still figure out why the exception is not bubbling up. Is zarr doing an explicit […]?
One problem here is that there is redundant handling of exceptions in FSStore. The logic is implemented in the mapper object […]
Ah, you are right: if the user is coming with a ready-made mapper, then the store simply won't see those exceptions. In that case, the higher-level exceptions never get used unless they were exclusive […]
I'm a bit confused, after reading this issue, whether improvements are still needed in both fsspec and zarr-python, or just zarr-python (e.g., #489). Is anyone able to clarify whether #489 would be expected to work, or if #486 (comment) is blocking that PR? For reference, I believe this could help us solve issues with accessing our downscaled CMIP6 data on Planetary Computer (e.g., carbonplan/cmip6-downscaling#323).
We were talking about the mapper interface and FSStore, both of which are within zarr-python. fsspec's exception handling in FSMap is stable, and the question above is how zarr should handle it when given one of these rather than creating its own via FSStore (the latter is now the normal path, but the former still works). I suppose for complete control, you can always do […]
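[The snippet that followed did not survive extraction. As an assumed reconstruction, not the original, it presumably resembled constructing the store by hand, mirroring the FSStore kwargs used earlier in this thread; the bucket path is hypothetical:]

```python
import zarr

# Build the FSStore yourself so the exception-handling options are not
# lost when zarr wraps a user-supplied mapper.
store = zarr.storage.FSStore("s3://some-bucket/data.zarr",  # hypothetical path
                             exceptions=(), missing_exceptions=())
z = zarr.open(store, mode="r")
```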
Also note that some storage backends differentiate between types of error. In particular, referenceFS raises ReferenceNotReachable (a RuntimeError subclass) for a key that should be there, because it's in the reference list, but failed to load for some possibly intermittent reason.
I think there's still a problem in Zarr that needs fixing, whether or not #489 is added, in the following code (zarr/core.py, lines 1414 to 1421 at 3db4176). Not sure what the fix is, though...
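[The excerpt embedded here did not survive extraction. The pattern at issue is roughly the following, a simplified paraphrase of zarr 2.x chunk loading rather than a verbatim quote:]

```python
# Paraphrased sketch: why a lost chunk is silent. Any KeyError from the
# store is treated as "chunk never written" and the output region is
# filled, so a chunk that existed but failed to load looks identical to
# one that was never created.
try:
    cdata = self.chunk_store[ckey]
except KeyError:
    if self._fill_value is not None:
        out[out_selection] = self._fill_value
```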
I'm so confused about this. I'm trying to run this example notebook; sometimes it returns data, and sometimes NaNs, even when querying the same spatiotemporal bounds on different calls! I'm not sure where to ask this, since it may be an S3 thing, but if someone knows the answer: why is this not consistent? How do I ensure my calls receive data? I'm a Zarr-head and believe this should be the easiest way to access this CF-compliant data, but how is this usable if you're playing Russian roulette trying to access data and coming up with blanks most of the time? I'm not sure I'd rather see errors cancelling an entire operation, especially if some data is coming through. But I would like to be able to differentiate between missing data and understand what's hanging up S3 calls. fsspec v[…]
We really don't have a way of doing that. In another project I work on, dask-awkward, we have implemented IO reports that list all the failures, but that's in the context of dask tasks. Zarr only allows for data or nothing; you would need to have at least one other special value to show failure.
Can you explain what you mean by a special value? How does anyone actually use this in practice? Hit the bucket a hundred times waiting to see if data ever appears? The NaNs are not consistent. I'm querying four datasets in the same bucket, one after another, in a pandas apply operation with no wait time. One time I get hit-miss-miss-miss (hit returning data and miss returning NaNs); I try again a while later and get miss-miss-miss-hit. It doesn't seem like an API timeout issue, or it would consistently be hit-miss-miss-miss. How many times would I have to run this to be confident that the middle two datasets are actually empty?
Using an FSMap directly with xarray's open_dataset (or indeed zarr.open) is still allowed but deprecated; zarr will construct an FSStore over it, which is why any arguments you are passing are probably getting lost.

Edited: HTML text boxes aren't great for this :)
This gets even stranger: running on dataset 2 cell by cell, open_zarr does indeed return data this time. However, da.to_netcdf is producing a file with array shape 0; the time dimension lost all its values, and all the data is missing. A search doesn't show anyone else with the same issue, so it doesn't seem like a datetime format thing.
I'm not sure where to_netcdf happens, but if this is now using dask, make sure that the workers have the same working directory and environment variables (especially AWS ones) as the client.
Hi, just commenting to add a +1 to this. Current behavior is problematic in some cases, as it makes it impossible to distinguish whether the NaNs are legitimate NaNs in the data or the result of a missing chunk file. Also, checking for NaNs on large arrays is expensive. We usually interface with zarr via xarray, and would love to be able to do something like this:
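[The commenter's snippet did not survive extraction. As a purely hypothetical illustration of the kind of opt-in being requested, with a keyword that is invented and exists in neither xarray nor zarr:]

```python
import xarray as xr

# Hypothetical keyword, shown only to illustrate the request:
# raise on a missing chunk instead of silently filling with NaN.
ds = xr.open_zarr("store.zarr", fail_on_missing_chunks=True)
```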
Xarray maintainers have indicated this needs to be fixed at the zarr-python level, though.
Is there a way of preventing Zarr from returning NaNs if a chunk is missing?
Background of my question: we're seeing problems either with copying data to GCS or with GCS failing to reliably serve all chunks of a Zarr store.
In `arr` below, there are two types of NaN-filled chunks returned by Zarr.

First, there's a chunk that is completely flagged missing in the data (the chunk is over land in an ocean dataset) but present on GCS (https://console.cloud.google.com/storage/browser/_details/pangeo-data/eNATL60-BLBT02X-ssh/sossheig/0.0.0), and Zarr correctly finds all items marked as invalid […].

Then, there's a chunk (https://console.cloud.google.com/storage/browser/_details/pangeo-data/eNATL60-BLBT02X-ssh/sossheig/0.7.3) that is not present (at the time of writing this, I get a "load failed" and a tracking ID from GCS), and Zarr returns all items marked invalid as well […].
How do I make Zarr raise an Exception on the latter?
cc: @auraoupa
related: pangeo-data/pangeo#691