How to prevent Zarr from returning NaN for missing chunks? #486

willirath · 2019-10-19T11:52:54Z

Is there a way of preventing Zarr from returning NaNs if a chunk is missing?

Background of my question: We're seeing problems with either copying data to GCS or with GCS having problems to reliably serve all chunks of a Zarr store.

In arr below, there's two types of NaN filled chunks returned by Zarr.

from dask import array as darr
import numpy as np

arr = darr.from_zarr(""gs://pangeo-data/eNATL60-BLBT02X-ssh/sossheig/")

First, there's a chunk that is completely flagged missing in the data (chunk is over land in an Ocean dataset) but present on GCS (https://console.cloud.google.com/storage/browser/_details/pangeo-data/eNATL60-BLBT02X-ssh/sossheig/0.0.0) and Zarr correctly find all items marked as invalid:

np.isnan(arr.blocks[0, 0, 0]).mean().compute()
# -> 1.0

Then, there's a chunk (https://console.cloud.google.com/storage/browser/_details/pangeo-data/eNATL60-BLBT02X-ssh/sossheig/0.7.3) that is not present (at the time of writing this, I get a "load failed" and a tracking id from GCS) and Zarr returns all items marked invalid as well:

np.isnan(arr.blocks[0, 7, 3]).mean().compute()
# -> 1.0

How do I make Zarr raise an Exception on the latter?

cc: @auraoupa
related: pangeo-data/pangeo#691

The text was updated successfully, but these errors were encountered:

willirath · 2019-10-19T14:34:40Z

Looks like it's necessary to override

zarr-python/zarr/core.py

Line 1581 in e7708c9

except KeyError:

et al to make zarr raise on non-existing chunks?

jakirkham · 2019-10-19T14:51:50Z

Have you tried setting the fill value?

alimanfoo · 2019-10-19T22:20:42Z

Yes this is not currently supported, but would be straightforward to add. I've actually been thinking myself recently that this would be useful for very similar reasons, i.e., when you want to ensure that an exception is raised when trying to access data from a missing chunk. One way to achieve this could be to add a mechanism for activating this behaviour explicitly. E.g., something like: z = ... # some zarr array z.set_options(fill_missing_chunk=False) Thoughts and suggestions welcome.

…

On Sat, 19 Oct 2019, 15:34 Willi Rath, ***@***.***> wrote: Looks like it's necessary to override https://github.com/zarr-developers/zarr-python/blob/e7708c948d2c0ff91863d675c1072b0dfe9ce2a6/zarr/core.py#L1581 et al to make zarr raise on non-existing chunks? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#486?email_source=notifications&email_token=AAFLYQXHM2WFRK5BDATMIDDQPMLIDA5CNFSM4JCPOTIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBXRXRA#issuecomment-544152516>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAFLYQXAROADSJEALO3X4YLQPMLIDANCNFSM4JCPOTIA> .

alimanfoo · 2020-03-24T17:28:10Z

Just to say that I still think this is a valid thing to address in zarr, but there could/should also be some work upstream in fsspec to ensure that KeyError is only ever raised by the store in the case where it is certain that the underlying file/object does not exist. If some other problem is encountered such as a transient error attempting to read the file/object, then fsspec should raise as some other kind of error, which would get propagated up through zarr. I've raised an issue here: fsspec/filesystem_spec#255

jakirkham · 2020-03-29T21:33:22Z

cc @martindurant

martindurant · 2020-04-01T12:35:30Z

(was fixed in fsspec/filesystem_spec#259 )

alimanfoo · 2020-04-01T12:42:44Z

Thanks @martindurant for the upstream fix.

I think I will reopen this issue, however, as there may still be use cases where you want to change zarr's behaviour. E.g., you may know that you definitely do have some missing chunks in the data, and you want to make sure you don't accidentally request any data from a region overlapping a missing chunk.

delgadom · 2021-04-20T23:13:54Z

Hi @alimanfoo and @willirath - thanks for raising this. Just ran across this issue while using zarr on google cloud. With huge jobs there are always I/O failures, but I never would have expected this behavior:

import xarray as xr
import numpy as np
import os

ds = xr.Dataset({
    'x': [0, 1],
    'y': [0, 1],
    'myarr': (('x', 'y'), [[0., np.nan], [2., 3.]]),
})

ds.chunk({'x': 1, 'y': 1}).to_zarr('myzarr.zarr')

# chunk file disappears due to bad write, storage failure, gremlins...
os.remove('myzarr.zarr/myarr/1.0')

# I would LOVE for this to return an error
read_ds = xr.open_zarr('myzarr.zarr').compute()

# instead the error is here
xr.testing.assert_equal(ds, read_ds)

This seems pretty problematic when NaNs are actually meaningful and without doing additional inspection of the filesystem and zarr metadata it's impossible to know if a NaN is a NaN or a failed write from a previous step, or just the chaos monkeys that live on cloud systems generally.

Is there a patch you can recommend as a workaround? I'm less familiar with the zarr API as I've only used it via xarray.

joshmoore · 2021-04-21T07:11:31Z

Did you see #489 (comment), @delgadom? Perhaps give that some testing to help drive it forward?

martindurant · 2021-04-21T13:19:08Z

The fspsec mapper is designed to be able to specify which basic exceptions get turned into KeyError (i.e., missing data) and which do not. I cannot quite seem to make ti work for this use-case though, I expect the code needs a little work. Also, there is a difference between creating an fsspec mapper instance and passing it to open_zarr versus passing a "file://..." URL and opening it in zarr (which uses FSStore, a wrapper with its own exception catching mechanism).

I think the bug is in zarr _chunk_getitems calling .getitems(ckeys, on_error="omit") - it should raise for any error except KeyError.

joshmoore · 2021-04-21T13:36:17Z

@martindurant : do you assume #489 is unneeded then?

martindurant · 2021-04-21T13:40:13Z

If we are willing to have people use fsspec for this kind of use case, then it can be fixed in fsspec and zarr's fsspec-specific code. This is an alternative option, one that fsspec would potentially benefit from too (although zarr is probably the only main user of the mapper interface). Of course, I don't mind if there's another way to achieve the same goal.

willirath · 2021-04-21T19:09:02Z

I think relying on fsspec to catch these kinds of intermittent read errors from object stores will do.

But I also see use cases where being able to raise the KeyError is necessary (think of a RedisStore which drops keys after some time) and just filling in .fill_value may not be the best thing to do.

bolliger32 · 2021-04-21T20:06:00Z

If I'm interpreting the fsspec code correctly, it is properly catching these errors and raising a KeyError in response (FileNotFoundError is caught by default in FSMap objects), and then #489 is allowing for a flag as to what you want to do with that KeyError in the context of opening a zarr store. So if I'm understanding this correctly, #489 is needed, while the fsspec behavior doesn't necessarily need changing for this case. Does that sound right?

(This is separate from the issue @martindurant brought up about open_zarr handling "file://" URLs and FSMap objects differently.)

If we are willing to have people use fsspec for this kind of use case, then it can be fixed in fsspec and zarr's fsspec-specific code. This is an alternative option, one that fsspec would potentially benefit from too (although zarr is probably the only main user of the mapper interface). Of course, I don't mind if there's another way to achieve the same goal.

I'm also not sure if there's a more fsspec-specific way that #489 should be implemented or whether the current approach makes sense. Does catching KeyError make sense regardless of what MutableMapping is used when instantiating an array, or is something unique to fsspec

martindurant · 2021-04-21T20:21:17Z

Correct, I am saying that working in the fsspec code is a possible alternative to #489 - you could get the behaviour without that PR, but the fsspec code would need fixing, I don't think it's quite right now.

But yes, in general, KeyError is supposed to mean "use the fill value" when reading chunks. I'm not sure if this is explicitly documented.

joshmoore · 2021-04-22T06:28:30Z

I read https://github.com/zarr-developers/zarr-python/blob/master/docs/spec/v2.rst to say that you should get a None if there's a KeyError when fill_value is None:

fill_value:
A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.
...
There is no need for all chunks to be present within an array store. If a chunk is not present then it is considered to be in an uninitialized state. An unitialized chunk MUST be treated as if it was uniformly filled with the value of the "fill_value" field in the array metadata. If the "fill_value" field is null then the contents of the chunk are undefined.

But trying out:

import zarr
import json
import numpy as np
import os

z = zarr.storage.DirectoryStore("fill_value_test")
a = zarr.creation.create(shape=(2, 2), chunks=(1, 1), store=z,
                         fill_value=None, dtype=np.int8)
a[:] = 1
with open("fill_value_test/.zarray") as o:
    meta = json.load(o)
print(f"fill={meta.get('fill_value', 'missing')}")
print(f"value={a[0][1]}")
print(f"value={a[0][0]}")
print(f"value={a[:]}")

os.remove("fill_value_test/0.0")

that's not the behavior I see. Newly created arrays seem to get random values for the missing chunk.

joshmoore · 2021-04-26T06:32:38Z

@willirath: did you look into whether the behavior of a null fill value was behaving as you would have expected?

willirath · 2021-04-26T06:54:05Z

@joshmoore No I didn't check this.

tomwhite · 2023-08-14T11:02:20Z

FYI I tried to use fsspec's missing_exceptions, but it no longer works from commit 4e633ad onwards.

Here's the code I used to reproduce this:

import fsspec
import zarr

fs = fsspec.filesystem("file")

# create an array with no chunks on disk
mapper = fs.get_mapper("tmp.zarr")
za = zarr.open(mapper, mode="w", shape=(3, 3), chunks=(2, 2))

# ensure no exceptions are converted to KeyError
mapper = fs.get_mapper("tmp.zarr", missing_exceptions=())

# following should fail since chunks are missing
print(zarr.open(mapper, mode="r")[:])

rabernat · 2023-08-14T13:31:48Z

Thanks for reporting this Tom. I think I see why 4e633ad caused this problem. Now that all FSMap objects are coerced to FSStore objects, it looks like we have redundant mechanisms for dealing with exceptions. FSStore introduced the additional exceptions parameter, which is not being propagated correctly when we coerce from FSMap

zarr-python/zarr/storage.py

Lines 1425 to 1430 in f542fca

    
           def __getitem__(self, key): 
        
               key = self._normalize_key(key) 
        
               try: 
        
                   return self.map[key] 
        
               except self.exceptions as e: 
        
                   raise KeyError(key) from e

I thought I could make your example work by doing the following

store = zarr.storage.FSStore("tmp.zarr", exceptions=(), missing_exceptions=())

try:
    store['0.0']
except FileNotFoundError as e:
    print(type(e))
except KeyError as e:
    print(type(e))

At this point, we can verify that we are not raising a KeyError. However, it still fails to raise an error when accessing the array!

a2 = zarr.open(store, mode="r")
print(a2[:])

I have not been able to track down why that is.

martindurant · 2023-08-14T13:49:20Z

I suppose that coercing an FSMap->FSStore should do its best to get all available options; but we should still figure out why the exception is not bubbling up. Is zarr doing an explicit exists() call?

rabernat · 2023-08-14T14:33:00Z

I suppose that coercing an FSMap->FSStore should do its best to get all available options

One problem here is that there is redundant handling of exceptions in FSStore. The logic is implemented in the mapper object store.map via the missing_exceptions parameter and then again the FSStore level via the exceptions attribute (see __getitem__ code above).

martindurant · 2023-08-14T14:40:02Z

Ah you are right - if the user is coming with a ready made mapper, then the store simply won't see those exceptions. In that case, the higher level exceptions never get used unless they were exclusive - missing_exceptions=() seems right, then (but having them the same wouldn't hurt either).

maxrjones · 2024-02-13T16:34:01Z

I'm a bit confused after reading this issue if improvements are still needed in both fsspec and zarr-python or just zarr-python (e.g., #489). Is anyone able to clarify whether #489 would be expected to work, or if #486 (comment) is blocking that PR? For reference, I believe this could help us solve issues with accessing our downscaled CMIP6 data on Planetary Computer (e.g., carbonplan/cmip6-downscaling#323).

martindurant · 2024-02-13T16:42:33Z

We were talking about the mapper interface and FSStore, both of which are within zarr-python. fsspec's exception handling in FSMap is stable, and the question above is how zarr should handle it when given one of these rather than creating it's own via FSStore (the latter is now the normal path, but the former still works).

I suppose for complete control, you can always do

fsm = FSMap(..., missing_exceptions=())
store = FSStore(fsm, missing_exceptions=(...))  # check call, I'm not sure how to pass this
group = zarr.open(store)

Also note that some storage backends differentiate between types of error. In particular, referenceFS raises ReferenceNotReachable(RuntimeError) for a key that should be there, because it's in the reference list, but for some maybe intermittent reason, failed to load.

tomwhite · 2024-02-15T15:48:19Z

I'm a bit confused after reading this issue if improvements are still needed in both fsspec and zarr-python or just zarr-python (e.g., #489). Is anyone able to clarify whether #489 would be expected to work, or if #486 (comment) is blocking that PR? For reference, I believe this could help us solve issues with accessing our downscaled CMIP6 data on Planetary Computer (e.g., carbonplan/cmip6-downscaling#323).

I think there's still a problem in Zarr that needs fixing, whether or not #489 is added.

In the following code FSStore will always pass on_error="omit" to fsspec, regardless of the setting of exceptions, or missing_exceptions:

zarr-python/zarr/storage.py

Lines 1414 to 1421 in 3db4176

    
           def getitems( 
        
               self, keys: Sequence[str], *, contexts: Mapping[str, Context] 
        
           ) -> Mapping[str, Any]: 
        
               keys_transformed = [self._normalize_key(key) for key in keys] 
        
               results = self.map.getitems(keys_transformed, on_error="omit") 
        
               # The function calling this method may not recognize the transformed keys 
        
               # So we send the values returned by self.map.getitems back into the original key space. 
        
               return {keys[keys_transformed.index(rk)]: rv for rk, rv in results.items()}

Not sure what the fix is though...

openSourcerer9000 · 2024-03-28T17:50:42Z

I'm so confused about this. I'm trying to run this example notebook, sometimes it returns data, and sometimes NaNs, even when querying the same spatiotemporal bounds on different calls! I'm not sure where to ask this, since it may be an S3 thing, but if someone knows the answer: why is this not consistent? How do I ensure my calls receive data?

I'm a Zarr-head, and believe this should be the easiest way to access this CF-compliant data, but how is this usable if you're playing Russian roulette trying to access data and coming up with blanks most of the time?

I'm not sure I'd rather see errors cancelling an entire operation, especially if some data is coming through. But I would like to be able to differentiate between missing data and understand what's hanging up S3 calls.

fsspec v
'2024.3.1'
zarr v
'2.17.1'

martindurant · 2024-03-28T17:54:09Z

I'm not sure I'd rather see errors cancelling an entire operation, especially if some data is coming through. But I would like to be able to differentiate between missing data and understand what's hanging up S3 calls.

We really don't have a way of doing that. In another project I work on, dask-awkward, we have implemented IO reports that list all the failures, but that's in the context of dask tasks. Zarr only allows for data or nothing; you would need to have at least one other special value to show failure.

openSourcerer9000 · 2024-03-28T18:05:14Z

Can you explain what you mean by special value?

How does anyone actually use this in practice? Hit the bucket a hundred times waiting to see if data ever appears? The NaNs are not something consistent. I'm querying four datasets in the same bucket one after another in a pd apply operation with no wait time. one time I get hit-miss-miss-miss (hit returning data and miss returning NANs). I try again a while later and get miss-miss-miss-hit. it doesn't seem like an API timeout issue or it might be consistently hit-miss-miss-miss. how many times would I have to run this to be confident that the middle 2 datasets are actually empty?

martindurant · 2024-03-28T18:17:29Z

Using an FSMap directly with xarray open_dataset (or indeed zarr.open) is still allowed but deprecated - zarr will construct an FSStore over it, which is why any arguments you are passing are probably getting lost.

ds_single = xr.open_dataset(
    url, 
    backend_kwargs=dict(
        storage_options=dict(
            anon=True, exceptions=(FileNotFoundError, )
        ), 
    consolidated=True
    )
)

openSourcerer9000 · 2024-03-28T18:21:38Z

So that's the current format? Would you mind adding the missing closing bracket, it's not clear where it would go

Thanks

martindurant · 2024-03-28T18:26:03Z

Edited - HTML text boxes aren't great for this :)

openSourcerer9000 · 2024-03-28T18:26:21Z

This gets even stranger- running on dataset 2 cell by cell, the open_zarr does indeed return data this time. however, da.to_netcdf is producing a file of array shape 0, or the time dimension lost all its values, and all the data is missing. a search doesn't show anyone else with the same issue, so it doesn't seem like a datetime format thing.

martindurant · 2024-03-28T18:55:32Z

I'm not sure where to_netcdf happens, but if this is now using dask, make sure that the workers have the same working directory and environment variables (especially AWS ones) as the client.

alvarosg · 2024-10-31T17:54:15Z

Hi, just commenting to add a +1 to this. Current behavior is problematic in some cases as it makes it impossible to distinguish whether the nan's are legit nans in the data, or are a result of missing chunk file. Also checking for nan's on large arrays is expensive.

We usually interface with zarr via xarray, and woudl ove to be able to do somethig like this:

dataset = xarray.open_dataset("path_to_zarr_with_missing_chunk_for_2021-01-02.zarr", error_on_missing_chunks)

data_slice = dataset.sel(time="2021-01-01")
data_slice.compute()

data_slice = dataset.sel(time="2021-01-02")
data_slice.compute(). # Raises MissingChunkError("Could not retrieve data. At least one chunk for the selected slice is missing")

Xarray maintainers have indicated this needs to be fixed at the zarr-python level, though.

willirath mentioned this issue Oct 19, 2019

Missing rectangles in zarr dataset loaded from GCS pangeo-data/pangeo#691

Closed

willirath mentioned this issue Oct 22, 2019

Allow disabling filling of missing chunks #489

Open

7 tasks

martindurant closed this as completed Apr 1, 2020

alimanfoo reopened this Apr 1, 2020

brews mentioned this issue Feb 17, 2022

EC-Earth3 downscaled precip ssp126 data has nans ClimateImpactLab/downscaleCMIP6#581

Closed

dcherian mentioned this issue Oct 31, 2024

Error when loading Zarr chunk missing. pydata/xarray#9701

Closed

How to prevent Zarr from returning NaN for missing chunks? #486

How to prevent Zarr from returning NaN for missing chunks? #486

Comments

willirath commented Oct 19, 2019

willirath commented Oct 19, 2019

jakirkham commented Oct 19, 2019

alimanfoo commented Oct 19, 2019 via email

alimanfoo commented Mar 24, 2020

jakirkham commented Mar 29, 2020

martindurant commented Apr 1, 2020

alimanfoo commented Apr 1, 2020

delgadom commented Apr 20, 2021 • edited Loading

joshmoore commented Apr 21, 2021

martindurant commented Apr 21, 2021

joshmoore commented Apr 21, 2021

martindurant commented Apr 21, 2021

willirath commented Apr 21, 2021

bolliger32 commented Apr 21, 2021

martindurant commented Apr 21, 2021

joshmoore commented Apr 22, 2021

joshmoore commented Apr 26, 2021

willirath commented Apr 26, 2021

tomwhite commented Aug 14, 2023

rabernat commented Aug 14, 2023

martindurant commented Aug 14, 2023

rabernat commented Aug 14, 2023

martindurant commented Aug 14, 2023

maxrjones commented Feb 13, 2024

martindurant commented Feb 13, 2024

tomwhite commented Feb 15, 2024

openSourcerer9000 commented Mar 28, 2024

martindurant commented Mar 28, 2024

openSourcerer9000 commented Mar 28, 2024

martindurant commented Mar 28, 2024 • edited Loading

openSourcerer9000 commented Mar 28, 2024

martindurant commented Mar 28, 2024

openSourcerer9000 commented Mar 28, 2024

martindurant commented Mar 28, 2024

alvarosg commented Oct 31, 2024

delgadom commented Apr 20, 2021 •

edited

Loading

martindurant commented Mar 28, 2024 •

edited

Loading