
How to prevent Zarr from returning NaN for missing chunks? #486

Open
willirath opened this issue Oct 19, 2019 · 35 comments

@willirath

Is there a way of preventing Zarr from returning NaNs if a chunk is missing?

Background of my question: We're seeing problems either with copying data to GCS or with GCS failing to reliably serve all chunks of a Zarr store.

In arr below, there are two kinds of NaN-filled chunks returned by Zarr.

from dask import array as darr
import numpy as np

arr = darr.from_zarr("gs://pangeo-data/eNATL60-BLBT02X-ssh/sossheig/")

First, there's a chunk that is entirely flagged as missing in the data (the chunk lies over land in an ocean dataset) but is present on GCS (https://console.cloud.google.com/storage/browser/_details/pangeo-data/eNATL60-BLBT02X-ssh/sossheig/0.0.0), and Zarr correctly finds all items marked as invalid:

np.isnan(arr.blocks[0, 0, 0]).mean().compute()
# -> 1.0

Then, there's a chunk (https://console.cloud.google.com/storage/browser/_details/pangeo-data/eNATL60-BLBT02X-ssh/sossheig/0.7.3) that is not present (at the time of writing, I get a "load failed" and a tracking id from GCS), and Zarr likewise returns all items marked as invalid:

np.isnan(arr.blocks[0, 7, 3]).mean().compute()
# -> 1.0

How do I make Zarr raise an Exception on the latter?

cc: @auraoupa
related: pangeo-data/pangeo#691

@willirath (Author)

Looks like it's necessary to override the

except KeyError:

handling (et al.) to make zarr raise on non-existing chunks?
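
For anyone needing a stopgap, something along these lines might work: wrap the store so that a missing chunk key raises an exception that zarr does not translate into the fill value. This is only an untested sketch against the zarr v2 single-chunk read path (which catches KeyError only); StrictStore and MissingChunkError are made-up names, and metadata keys are exempted because zarr probes for them and expects KeyError when they are absent.

from collections.abc import MutableMapping


class MissingChunkError(RuntimeError):
    """Raised instead of KeyError so zarr cannot substitute fill_value."""


class StrictStore(MutableMapping):
    """Sketch: wrap a zarr v2 store so missing chunk keys become hard errors."""

    def __init__(self, inner):
        self.inner = inner

    def __getitem__(self, key):
        try:
            return self.inner[key]
        except KeyError:
            # .zarray/.zgroup/.zattrs may legitimately be absent while zarr
            # probes for them -- let those raise KeyError as usual.
            if key.rsplit("/", 1)[-1].startswith("."):
                raise
            raise MissingChunkError(key)

    def __setitem__(self, key, value):
        self.inner[key] = value

    def __delitem__(self, key):
        del self.inner[key]

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)

Opening would then be e.g. zarr.open(StrictStore(underlying_store), mode="r"); whether every zarr access path respects this is untested.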

@jakirkham (Member)

Have you tried setting the fill value?
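
(For illustration only: a sentinel fill value chosen at write time would at least make an absent chunk distinguishable from genuine NaNs. The shapes and the -9999.0 sentinel below are made up, not from the dataset above.)

import zarr

# a chunk that was never written (or has gone missing) reads back as the
# sentinel rather than NaN, so it can be told apart from real NaNs
z = zarr.create(shape=(4, 4), chunks=(2, 2), dtype="f8", fill_value=-9999.0)
print(z[:2, :2])  # -> [[-9999., -9999.], [-9999., -9999.]]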

@alimanfoo (Member)

alimanfoo commented Oct 19, 2019 via email

@alimanfoo (Member)

Just to say that I still think this is a valid thing to address in zarr, but there could/should also be some work upstream in fsspec to ensure that KeyError is only ever raised by the store in the case where it is certain that the underlying file/object does not exist. If some other problem is encountered, such as a transient error attempting to read the file/object, then fsspec should raise some other kind of error, which would get propagated up through zarr. I've raised an issue here: fsspec/filesystem_spec#255

@jakirkham (Member)

cc @martindurant

@martindurant (Member)

(was fixed in fsspec/filesystem_spec#259)

@alimanfoo (Member)

Thanks @martindurant for the upstream fix.

I think I will reopen this issue, however, as there may still be use cases where you want to change zarr's behaviour. E.g., you may know that you definitely do have some missing chunks in the data, and you want to make sure you don't accidentally request any data from a region overlapping a missing chunk.

alimanfoo reopened this Apr 1, 2020
@delgadom

delgadom commented Apr 20, 2021

Hi @alimanfoo and @willirath - thanks for raising this. Just ran across this issue while using zarr on google cloud. With huge jobs there are always I/O failures, but I never would have expected this behavior:

import xarray as xr
import numpy as np
import os

ds = xr.Dataset({
    'x': [0, 1],
    'y': [0, 1],
    'myarr': (('x', 'y'), [[0., np.nan], [2., 3.]]),
})

ds.chunk({'x': 1, 'y': 1}).to_zarr('myzarr.zarr')

# chunk file disappears due to bad write, storage failure, gremlins...
os.remove('myzarr.zarr/myarr/1.0')

# I would LOVE for this to return an error
read_ds = xr.open_zarr('myzarr.zarr').compute()

# instead the error is here
xr.testing.assert_equal(ds, read_ds)

This seems pretty problematic when NaNs are actually meaningful: without additional inspection of the filesystem and zarr metadata, it's impossible to know whether a NaN is a real NaN, a failed write from a previous step, or just the chaos monkeys that live on cloud systems generally.

Is there a patch you can recommend as a workaround? I'm less familiar with the zarr API as I've only used it via xarray.
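
One workaround sketch (not an official API): compare the chunk keys zarr expects against what is actually present in the store. nchunks_initialized and zarr.storage.listdir are zarr-python v2 APIs; the myzarr.zarr/myarr names come from the example above.

import itertools
import math

import zarr

arr = zarr.open_group("myzarr.zarr", mode="r")["myarr"]

# zarr keeps count of the chunks actually present in the store
if arr.nchunks_initialized < arr.nchunks:
    # enumerate exactly which chunk keys are absent
    grid = [math.ceil(s / c) for s, c in zip(arr.shape, arr.chunks)]
    expected = {".".join(map(str, idx))
                for idx in itertools.product(*(range(n) for n in grid))}
    present = set(zarr.storage.listdir(arr.store, "myarr")) - {".zarray", ".zattrs"}
    raise RuntimeError(f"missing chunks: {sorted(expected - present)}")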

@joshmoore (Member)

Did you see #489 (comment), @delgadom? Perhaps give that some testing to help drive it forward?

@martindurant (Member)

The fsspec mapper is designed to let you specify which basic exceptions get turned into KeyError (i.e., missing data) and which do not. I cannot quite seem to make it work for this use case though; I expect the code needs a little work. Also, there is a difference between creating an fsspec mapper instance and passing it to open_zarr versus passing a "file://..." URL and opening it in zarr (which uses FSStore, a wrapper with its own exception-catching mechanism).

I think the bug is in zarr _chunk_getitems calling .getitems(ckeys, on_error="omit") - it should raise for any error except KeyError.

@joshmoore (Member)

@martindurant : do you assume #489 is unneeded then?

@martindurant (Member)

If we are willing to have people use fsspec for this kind of use case, then it can be fixed in fsspec and zarr's fsspec-specific code. This is an alternative option, one that fsspec would potentially benefit from too (although zarr is probably the only main user of the mapper interface). Of course, I don't mind if there's another way to achieve the same goal.

@willirath (Author)

I think relying on fsspec to catch these kinds of intermittent read errors from object stores will do.

But I also see use cases where being able to raise the KeyError is necessary (think of a RedisStore which drops keys after some time), where just filling in .fill_value may not be the best thing to do.

@bolliger32

If I'm interpreting the fsspec code correctly, it is properly catching these errors and raising a KeyError in response (FileNotFoundError is caught by default in FSMap objects), and #489 then adds a flag for what you want to do with that KeyError in the context of opening a zarr store. So if I'm understanding this correctly, #489 is needed, while the fsspec behavior doesn't necessarily need changing for this case. Does that sound right?
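
(A quick, self-contained check of that default translation; the path below is hypothetical:)

import fsspec

fs = fsspec.filesystem("file")
m = fs.get_mapper("/tmp/does-not-exist")  # hypothetical empty location
try:
    m["0.0"]
except KeyError:
    # FSMap caught the underlying FileNotFoundError and re-raised it as
    # KeyError, which zarr then interprets as "use the fill value"
    print("missing file surfaced as KeyError")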

(This is separate from the issue @martindurant brought up about open_zarr handling "file://" URLs and FSMap objects differently.)

If we are willing to have people use fsspec for this kind of use case, then it can be fixed in fsspec and zarr's fsspec-specific code. This is an alternative option, one that fsspec would potentially benefit from too (although zarr is probably the only main user of the mapper interface). Of course, I don't mind if there's another way to achieve the same goal.

I'm also not sure if there's a more fsspec-specific way that #489 should be implemented or whether the current approach makes sense. Does catching KeyError make sense regardless of which MutableMapping is used when instantiating an array, or is it something unique to fsspec?

@martindurant (Member)

Correct, I am saying that working in the fsspec code is a possible alternative to #489 - you could get the behaviour without that PR, but the fsspec code would need fixing; I don't think it's quite right at the moment.

But yes, in general, KeyError is supposed to mean "use the fill value" when reading chunks. I'm not sure if this is explicitly documented.
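
(A minimal illustration of that contract, using a plain dict as the store; the array, the chunk key "1", and the -1 fill value are all made up:)

import numpy as np
import zarr

store = {}  # any MutableMapping works as a zarr v2 store
z = zarr.create(shape=(4,), chunks=(2,), dtype="i4", fill_value=-1, store=store)
z[:] = np.arange(4)

del store["1"]  # simulate a lost chunk; the dict raises KeyError on access
print(z[:])     # -> [ 0  1 -1 -1]: the KeyError silently became fill_value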

@joshmoore (Member)

I read https://github.com/zarr-developers/zarr-python/blob/master/docs/spec/v2.rst to say that you should get None back if there's a KeyError when fill_value is None:

fill_value:
A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.
...
There is no need for all chunks to be present within an array store. If a chunk is not present then it is considered to be in an uninitialized state. An uninitialized chunk MUST be treated as if it was uniformly filled with the value of the "fill_value" field in the array metadata. If the "fill_value" field is null then the contents of the chunk are undefined.

But trying out:

import zarr
import json
import numpy as np
import os

z = zarr.storage.DirectoryStore("fill_value_test")
a = zarr.creation.create(shape=(2, 2), chunks=(1, 1), store=z,
                         fill_value=None, dtype=np.int8)
a[:] = 1
with open("fill_value_test/.zarray") as o:
    meta = json.load(o)
print(f"fill={meta.get('fill_value', 'missing')}")
print(f"value={a[0][1]}")
print(f"value={a[0][0]}")
print(f"value={a[:]}")

os.remove("fill_value_test/0.0")
print(f"value={a[:]}")  # re-read: the missing chunk comes back with arbitrary values

that's not the behavior I see. The missing chunk seems to come back with random values (presumably whatever happened to be in the uninitialized output buffer).

@joshmoore (Member)

@willirath: did you look into whether the behavior of a null fill value was behaving as you would have expected?

@willirath (Author)

@joshmoore No I didn't check this.

@tomwhite (Contributor)

FYI I tried to use fsspec's missing_exceptions, but it no longer works from commit 4e633ad onwards.

Here's the code I used to reproduce this:

import fsspec
import zarr

fs = fsspec.filesystem("file")

# create an array with no chunks on disk
mapper = fs.get_mapper("tmp.zarr")
za = zarr.open(mapper, mode="w", shape=(3, 3), chunks=(2, 2))

# ensure no exceptions are converted to KeyError
mapper = fs.get_mapper("tmp.zarr", missing_exceptions=())

# following should fail since chunks are missing
print(zarr.open(mapper, mode="r")[:])

@rabernat (Contributor)

Thanks for reporting this, Tom. I think I see why 4e633ad caused this problem. Now that all FSMap objects are coerced to FSStore objects, it looks like we have redundant mechanisms for dealing with exceptions. FSStore introduced the additional exceptions parameter, which is not being propagated correctly when we coerce from FSMap:

zarr-python/zarr/storage.py

Lines 1425 to 1430 in f542fca

def __getitem__(self, key):
    key = self._normalize_key(key)
    try:
        return self.map[key]
    except self.exceptions as e:
        raise KeyError(key) from e

I thought I could make your example work by doing the following:

store = zarr.storage.FSStore("tmp.zarr", exceptions=(), missing_exceptions=())

try:
    store['0.0']
except FileNotFoundError as e:
    print(type(e))
except KeyError as e:
    print(type(e))

At this point, we can verify that we are not raising a KeyError. However, it still fails to raise an error when accessing the array!

a2 = zarr.open(store, mode="r")
print(a2[:])

I have not been able to track down why that is.

@martindurant (Member)

I suppose that coercing an FSMap->FSStore should do its best to get all available options; but we should still figure out why the exception is not bubbling up. Is zarr doing an explicit exists() call?

@rabernat (Contributor)

I suppose that coercing an FSMap->FSStore should do its best to get all available options

One problem here is that there is redundant handling of exceptions in FSStore. The logic is implemented in the mapper object store.map via the missing_exceptions parameter, and then again at the FSStore level via the exceptions attribute (see __getitem__ code above).

@martindurant (Member)

Ah, you are right - if the user comes with a ready-made mapper, then the store simply won't see those exceptions. In that case, the higher-level exceptions never get used unless they are exclusive - missing_exceptions=() seems right, then (though having them the same wouldn't hurt either).

@maxrjones (Member)

After reading this issue, I'm a bit confused about whether improvements are still needed in both fsspec and zarr-python or just in zarr-python (e.g., #489). Is anyone able to clarify whether #489 would be expected to work, or if #486 (comment) is blocking that PR? For reference, I believe this could help us solve issues with accessing our downscaled CMIP6 data on Planetary Computer (e.g., carbonplan/cmip6-downscaling#323).

@martindurant (Member)

We were talking about the mapper interface and FSStore, both of which are within zarr-python. fsspec's exception handling in FSMap is stable, and the question above is how zarr should handle it when given one of these rather than creating its own via FSStore (the latter is now the normal path, but the former still works).

I suppose for complete control, you can always do

from fsspec.mapping import FSMap
from zarr.storage import FSStore
import zarr

fsm = FSMap(..., missing_exceptions=())
store = FSStore(fsm, missing_exceptions=(...))  # check call, I'm not sure how to pass this
group = zarr.open(store)

Also note that some storage backends differentiate between types of error. In particular, referenceFS raises ReferenceNotReachable(RuntimeError) for a key that should be there, because it's in the reference list but, for some possibly intermittent reason, failed to load.

@tomwhite (Contributor)

After reading this issue, I'm a bit confused about whether improvements are still needed in both fsspec and zarr-python or just in zarr-python (e.g., #489). Is anyone able to clarify whether #489 would be expected to work, or if #486 (comment) is blocking that PR? For reference, I believe this could help us solve issues with accessing our downscaled CMIP6 data on Planetary Computer (e.g., carbonplan/cmip6-downscaling#323).

I think there's still a problem in Zarr that needs fixing, whether or not #489 is added.

In the following code, FSStore will always pass on_error="omit" to fsspec, regardless of the setting of exceptions or missing_exceptions:

zarr-python/zarr/storage.py

Lines 1414 to 1421 in 3db4176

def getitems(
    self, keys: Sequence[str], *, contexts: Mapping[str, Context]
) -> Mapping[str, Any]:
    keys_transformed = [self._normalize_key(key) for key in keys]
    results = self.map.getitems(keys_transformed, on_error="omit")
    # The function calling this method may not recognize the transformed keys
    # So we send the values returned by self.map.getitems back into the original key space.
    return {keys[keys_transformed.index(rk)]: rv for rk, rv in results.items()}

Not sure what the fix is though...
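
One possible shape for a fix (a sketch only, not a vetted patch): ask fsspec to return exceptions instead of omitting them, then treat KeyError as "missing chunk" and re-raise everything else. FSMap.getitems does accept on_error="return", which places the exception object in the result mapping:

# hypothetical replacement for the body of FSStore.getitems (untested)
results = self.map.getitems(keys_transformed, on_error="return")
out = {}
for k, v in results.items():
    if isinstance(v, KeyError):
        continue  # genuinely missing chunk -> caller substitutes fill_value
    elif isinstance(v, Exception):
        raise v   # transient/IO error -> propagate instead of silencing it
    out[k] = v
return {keys[keys_transformed.index(rk)]: rv for rk, rv in out.items()}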

@openSourcerer9000

I'm so confused about this. I'm trying to run this example notebook: sometimes it returns data and sometimes NaNs, even when querying the same spatiotemporal bounds on different calls! I'm not sure where to ask this, since it may be an S3 thing, but if someone knows the answer: why is this not consistent? How do I ensure my calls receive data?

I'm a Zarr-head and believe this should be the easiest way to access this CF-compliant data, but how is this usable if you're playing Russian roulette, trying to access data and coming up with blanks most of the time?

I'm not sure I'd rather see errors cancelling an entire operation, especially if some data is coming through. But I would like to be able to differentiate between missing data and understand what's hanging up S3 calls.

fsspec version: 2024.3.1
zarr version: 2.17.1

@martindurant (Member)

I'm not sure I'd rather see errors cancelling an entire operation, especially if some data is coming through. But I would like to be able to differentiate between missing data and understand what's hanging up S3 calls.

We really don't have a way of doing that. In another project I work on, dask-awkward, we have implemented IO reports that list all the failures, but that's in the context of dask tasks. Zarr only allows for data or nothing; you would need to have at least one other special value to show failure.

@openSourcerer9000

Can you explain what you mean by special value?

How does anyone actually use this in practice? Hit the bucket a hundred times waiting to see if data ever appears? The NaNs are not consistent. I'm querying four datasets in the same bucket one after another in a pd apply operation with no wait time. One time I get hit-miss-miss-miss (hit returning data, miss returning NaNs). I try again a while later and get miss-miss-miss-hit. It doesn't seem like an API timeout issue, or it would be consistently hit-miss-miss-miss. How many times would I have to run this to be confident that the middle two datasets are actually empty?

@martindurant (Member)

martindurant commented Mar 28, 2024

Using an FSMap directly with xarray's open_dataset (or indeed zarr.open) is still allowed but deprecated - zarr will construct an FSStore over it, which is why any arguments you are passing are probably getting lost.

ds_single = xr.open_dataset(
    url,
    backend_kwargs=dict(
        storage_options=dict(
            anon=True, exceptions=(FileNotFoundError,)
        ),
        consolidated=True,
    ),
)

@openSourcerer9000

So that's the current format? Would you mind adding the missing closing bracket? It's not clear where it would go. Thanks

@martindurant (Member)

Edited - HTML text boxes aren't great for this :)

@openSourcerer9000

This gets even stranger: running on dataset 2 cell by cell, open_zarr does indeed return data this time. However, da.to_netcdf is producing a file with array shape 0, i.e. the time dimension lost all its values and all the data is missing. A search doesn't show anyone else with the same issue, so it doesn't seem like a datetime format thing.

@martindurant (Member)

I'm not sure where to_netcdf happens, but if this is now using dask, make sure that the workers have the same working directory and environment variables (especially AWS ones) as the client.

@alvarosg

Hi, just commenting to add a +1 to this. The current behavior is problematic in some cases, as it makes it impossible to distinguish whether NaNs are legitimate NaNs in the data or the result of a missing chunk file. Also, checking for NaNs on large arrays is expensive.

We usually interface with zarr via xarray, and would love to be able to do something like this:

dataset = xarray.open_dataset("path_to_zarr_with_missing_chunk_for_2021-01-02.zarr", error_on_missing_chunks=True)

data_slice = dataset.sel(time="2021-01-01")
data_slice.compute()

data_slice = dataset.sel(time="2021-01-02")
data_slice.compute()  # Raises MissingChunkError("Could not retrieve data. At least one chunk for the selected slice is missing")

Xarray maintainers have indicated this needs to be fixed at the zarr-python level, though.
