Missing rectangles in zarr dataset loaded from GCS #691
Hi @rafa-guedes - could you share more information about your dataset? Like the xarray repr and the zarr |
Hi @rabernat please find some more info below.
|
Perhaps GCS is not returning the data reliably? So @martindurant might have some suggestions on how to get some debugging out of gcsfs. I seem to recall a trick to turn on http logging. |
You can set the level |
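A minimal sketch for turning on that logging, assuming gcsfs uses a standard Python logger named "gcsfs":

import logging

# Assumes gcsfs logs through a standard logger named "gcsfs";
# DEBUG level should include the HTTP-level detail mentioned above.
logging.basicConfig(level=logging.INFO)
logging.getLogger("gcsfs").setLevel(logging.DEBUG)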
Sorry for taking so long to reply, I am only now getting back to this. Thanks for the suggestions @rabernat, @martindurant. There is nothing obvious from those logs though, and the problem still persists, intermittently. I may try to recreate this zarr dataset and check again. I'd be interested to hear if anyone has seen similar behaviour in the past; this looks like a fairly serious issue and I don't know where it could be coming from. |
I'm seeing similar intermittent errors with data in the gs://pangeo-parcels bucket. Here's a gist showing that calculations fail with https://nbviewer.jupyter.org/gist/willirath/b7af4bc9a93a79b81772910f8ee5c630 |
Is there a way of forcing Zarr to retry if a chunk cannot be read rather than setting the chunk to all NaNs? |
I would open an issue in gcsfs for this.
https://github.com/dask/gcsfs/issues/
|
Does it happen if you set the number of threads per worker to one? |
Yes, configuring |
and |
Errr, yes, I meant and --nthreads 1. |
I honestly cannot think of why this is happening - help appreciated in debugging when/how this happens, the exact error, etc. If it happens in one thread, apparently not a race condition or effect of session sharing. If we are lucky, it is merely one particular exception from the remote backend which we don't consider as retriable, but should (e.g., API rate limit). |
Is there an easy way of preventing GCSFS from retrying at all? |
Not super-easy: the GCSFileSystem class has an attribute
I would also run with debug logging enabled. gcsfs.utils.is_retriable is the place where errors are sorted. |
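As a debugging sketch (assuming gcsfs.utils.is_retriable takes the exception and returns a bool, as its name suggests), the function can be wrapped to log every error gcsfs classifies, or replaced to disable retries entirely:

import gcsfs.core
import gcsfs.utils

_original_is_retriable = gcsfs.utils.is_retriable

def noisy_is_retriable(exception):
    # Log each exception gcsfs considers for a retry, then defer to the original logic.
    decision = _original_is_retriable(exception)
    print(f"gcsfs caught {exception!r}; retriable={decision}")
    return decision

# Monkey-patch for debugging only. Depending on how gcsfs imports the helper
# internally, both module attributes may need replacing. Substituting
# `lambda exc: False` would disable retries altogether.
gcsfs.utils.is_retriable = noisy_is_retriable
gcsfs.core.is_retriable = noisy_is_retriable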
One possible way to debug could be to access the data using fsspec's httpfilesystem, bypassing gcsfs completely. fsspec/filesystem_spec#144 shows how you might do this (but also reveals a bug, which has been fixed in the latest fsspec master.) |
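The linked issue isn't reproduced here, but a minimal sketch of that idea, assuming the bucket is publicly readable over plain HTTPS and the store has consolidated metadata, might look like:

import fsspec
import xarray as xr

# Public GCS objects are reachable at https://storage.googleapis.com/<bucket>/<key>;
# the store path is the one shared later in this thread.
url = ("https://storage.googleapis.com/pangeo-parcels/"
       "med_sea_connectivity_v2019.09.11.2/traj_data_without_stokes.zarr")
mapper = fsspec.get_mapper(url)  # uses fsspec's HTTP filesystem, bypassing gcsfs
ds = xr.open_zarr(mapper, consolidated=True)  # consolidated metadata is an assumption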
Can the data be public? I can spin up my own cluster (on pangeo!) and see if I can reproduce. Any other dataset on GCS known to show this? |
It's public: https://nbviewer.jupyter.org/gist/willirath/b7af4bc9a93a79b81772910f8ee5c630 |
dataset_version = "v2019.09.11.2"
bucket = f"pangeo-parcels/med_sea_connectivity_{dataset_version}/traj_data_without_stokes.zarr" |
Do we know if this is tied to dask / distributed, or can it be reproduced purely at the gcsfs / zarr level? |
(I was trying 'gs://oceanum-era5/wind_10m.zarr', sorry) |
Agree, @rabernat, it would be worthwhile paging through all of the pieces of a zarr in a single thread or on a local 4/8-thread dask scheduler. Here is an interesting snippet on mitigating request-rate related errors: |
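The request-rate snippet itself isn't preserved in this thread. As a sketch of the "page through every piece of the zarr in a single thread" suggestion (using the public store path shared above, with anonymous access assumed):

import fsspec

mapper = fsspec.get_mapper(
    "gs://pangeo-parcels/med_sea_connectivity_v2019.09.11.2/traj_data_without_stokes.zarr",
    token="anon",
)

# Fetch every key in the store sequentially, in a single thread, and record
# any key that cannot be read; failures here would point at gcsfs/GCS rather
# than at dask or zarr.
failures = []
for key in sorted(mapper):
    try:
        _ = mapper[key]
    except Exception as exc:
        failures.append((key, repr(exc)))

print(f"{len(failures)} keys failed to read")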
I started looking into this question and will post here once I get some insight.
I could make this one public if needed for further tests; I only need a way to ensure it will be pulled from the us-central region to avoid egress costs. |
@martindurant @rabernat I have done a few tests - the problem only happens for me when I load data from GCS on a dask distributed cluster. I loaded a spatial slice (always the same slice) from my dataset under 3 different conditions: [1] no dask distributed cluster, [2] a distributed cluster scaled up to 2 workers, and [3] up to 4 workers. The testing script is here, and the machine where I ran it is described below. Each test was run 170 times.

[1] No dask distributed cluster
There were no missing cells reading zarr from GCS when running the tests without a distributed cluster.

[2] Local dask cluster with 2 workers
I did not observe entire chunks missing using 2 workers. However, in 3 of the 170 cases (1.8%) there were missing values within a chunk. These 3 cases are shown below; there were 2100, 2100 and 840 missing cells (out of 1036800 cells).

[3] Local dask cluster with 4 workers
When using 4 workers, I observed entire chunks missing in 41 out of 170 runs (24%), as shown below. It is interesting to note that the missing chunks are usually in the first latitude row (this dataset has reversed coordinates, [90:-90]). This is in contrast to the missing partial chunks in [2], which always occurred in the last row.

VM used to run testing
|
Missing pieces within a chunk is totally out-of-this-world strange. Can you tell if the NaNs form a contiguous block in the original data? What was your worker setup, please, e.g., threads/processes? |
Another quick question: are there any NaNs in the whole dataset, or is any NaN coming at any point necessarily a symptom of the syndrome? |
There are certainly no NaNs in the original data for the slice I am loading in these tests. There should be no NaNs in the entire dataset, which covers 40 years, but I need to check that. |
I can check the cluster setup a bit later on. |
The two setups looked like this:
|
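The original setups are not preserved in this thread; a minimal sketch of the two local-cluster configurations described above (2 and 4 workers, single-threaded workers as discussed earlier) might be:

from dask.distributed import Client, LocalCluster

# Two-worker setup with single-threaded workers
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)
# ... run the slice-loading test ...
client.close()
cluster.close()

# Four-worker setup with single-threaded workers
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)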
Hm, tried it now with a middling (10 workers, 20 cores) and a small (2 workers) cluster, and got no missing data, working on the dask arrays directly. Note that the chunking I chose matches the zarr internal chunking, which is what I think xarray does automatically :|
(I get the same success with a local cluster too; all of this running in a google cluster) |
I only saw these errors with fairly large kubernetes clusters. This is a fully working example that returned different counts of NaNs (none of which should be there) on ocean.pangeo.io a few minutes ago: https://nbviewer.jupyter.org/gist/willirath/b40b0b6f281cb1a46fcf848150ca0367 |
In my case I was not using a kubernetes cluster but a single 8-core virtual machine, with small local dask cluster setup. Also, I was working with another dataset recently and did not notice this problem (though I did not test it extensively) so this may perhaps be related to some characteristics of the dataset, just guessing... |
@martindurant I interpreted case [2] (missing intra-chunk values) in those tests wrongly. My coordinates (longitude=1440, latitude=721) are not divisible by my chunk sizes (longitude=100, latitude=100), so the last chunk row and column have fewer cells. It turns out entire chunks are also missing in those 3 cases I mentioned, see plot.
|
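A quick check that those "partial" gaps are consistent with whole edge chunks, given the dimensions and chunk sizes above:

lon, lat, chunk = 1440, 721, 100

edge_lat = lat % chunk   # 21 rows in the last latitude chunk row
edge_lon = lon % chunk   # 40 columns in the last longitude chunk column

print(edge_lat * chunk)     # 2100 -> matches two of the missing-cell counts reported above
print(edge_lat * edge_lon)  # 840  -> matches the corner-chunk count reported above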
Just a bit more info in case it is useful.
|
Can you please try again with fsspec and gcsfs from master? |
It looks great @martindurant - no missing chunks on either a 2- or a 4-worker distributed cluster, repeating the tests 100 times. Thanks for looking into this! |
hurray! |
@willirath could you please check those also fix it for you? |
I get an error; I'll have a closer look tomorrow... |
You will need the worker and client environments to agree, which would mean remaking the image in this case, a bit of a pain... I am running the same thing in a single process and not finding any errors so far, but I have no idea how long this will take (did not think to make a progress bar...). |
Finished without a NaN:
(I do this via the dask array so that I don't have to worry about the interpretation of NaN - even one would make the result NaN) |
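A sketch of that kind of check, where one array is summed directly so that a single NaN anywhere would make the result NaN (the array name is an assumption):

import dask.array as da
import fsspec
import zarr

mapper = fsspec.get_mapper(
    "gs://pangeo-parcels/med_sea_connectivity_v2019.09.11.2/traj_data_without_stokes.zarr",
    token="anon",  # the bucket is stated above to be public
)
group = zarr.open_group(mapper, mode="r")

arr = da.from_zarr(group["lat"])  # "lat" is a hypothetical array name
print(arr.sum().compute())        # NaN here would indicate at least one bad chunk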
Could we close this? |
Waiting for the OK from @willirath, who may want to run with |
I've just created a setup with KubeCluster on the pangeo binder. Stay tuned... |
I still get NaNs. Gist (still running, so there will be an update) is here: https://gist.github.com/willirath/b40b0b6f281cb1a46fcf848150ca0367 My impression is that the errors occur a lot less frequently. But this might be due to adapted rate limits on the GCS side? |
# Assumes `ds` is the xarray Dataset opened from the zarr store, and that a
# dask distributed Client is already connected.
from time import sleep

results = []
for n in range(20):
    # build one lazy NaN-count per data variable
    vlist = [
        ds[vname].isnull().data.sum()
        for vname in ds.data_vars.keys()
    ]
    # retries=10 asks dask to re-run failed tasks before giving up
    results.append(sum(vlist).compute(retries=10))
    print(n, results[-1])
    sleep(20)  # allow for scaling down
|
I repeated my tests 1000 times now on a 4-worker local cluster; no NaNs at all for me |
OK, so closing this, and maybe @willirath is facing a different issue (another rate limit?). More logging may tell. |
I've raised zarr-developers/zarr-python#486 hoping to find out how to more easily trigger an Exception upon errors when loading chunks. |
Zarr deciding that a chunk which does not exist should be treated as NaN is the right behaviour. The question is what error is actually coming out of gcsfs (FileNotFoundError instead of some permission error in the file-system layer, and/or IndexError in the mapper) and what errors would be right for zarr. Is gcsfs retrying the underlying error? This info should be in the logs, or else the logging should be improved! |
Just to say we're hitting something like this currently, looking at the fsspec code I think there may still be a possibility that transient errors from storage are getting propagated as |
We're currently using this workaround which can be used where you know all chunks should be there for an array, i.e., you never expect a |
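The workaround itself isn't reproduced above; a minimal sketch of the idea, wrapping the store so that a missing chunk raises immediately instead of being silently filled, might look like this (assuming every chunk is expected to exist):

from collections.abc import MutableMapping

class RaiseOnMissingStore(MutableMapping):
    # Zarr treats a KeyError from the store as "chunk absent -> use fill_value";
    # this wrapper turns that silent fill into a hard error for chunk keys,
    # while still letting optional metadata keys (".zattrs", ...) be absent.
    def __init__(self, store):
        self._store = store

    def __getitem__(self, key):
        try:
            return self._store[key]
        except KeyError:
            if key.rsplit("/", 1)[-1].startswith("."):
                raise  # optional metadata key, absence is normal
            raise RuntimeError(f"chunk {key!r} unexpectedly missing")

    def __setitem__(self, key, value):
        self._store[key] = value

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)

It could then be used as, e.g., xr.open_zarr(RaiseOnMissingStore(fsspec.get_mapper(...))).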
For information, changes merged upstream in fsspec that should protect against this happening: fsspec/filesystem_spec#259 |
Thanks for tracking this down @alimanfoo! Over in pydata/xarray#3831, we have been discussing how to set up ecosystem-wide integration testing for all of the interconnected packages one needs to use xarray + zarr on the cloud. I'd love to get your thoughts on how best to do this. |
I have had some issues loading slices from a zarr dataset stored on GCS. I don't know where the problem could be coming from, so I am posting it here in case someone has experienced something similar.
The figure below shows what is going on. Upon slicing and loading a zarr dataset stored on GCS using xarray, I'm getting missing squares (often in the northern hemisphere). The squares change their positions each time I reload the same slice; sometimes they appear for one single variable, sometimes for more than one (at different locations), and sometimes they don't happen at all. The rectangle sizes match the chunking, which for this dataset corresponds to 25 degrees in lat/lon, so it looks like entire chunks are randomly failing to be transferred for some reason.
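A sketch of the access pattern being described, using the store path mentioned in the comments above (the variable name and slice are illustrative assumptions):

import fsspec
import xarray as xr

ds = xr.open_zarr(fsspec.get_mapper("gs://oceanum-era5/wind_10m.zarr"))
wind_slice = ds["u10"].isel(time=0).load()  # "u10" and the time index are assumptions

# Any non-zero count here corresponds to the missing rectangles shown in the figure
print(int(wind_slice.isnull().sum()))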
Has anyone seen something similar, or would anyone have an idea about this, please?
Thanks