Poor performance of iris with latest dask #4736
@bjlittle explained to me last week that constraints during FF/PP
As for Dask: for me this is just more support for #4572 and similar. Dask's default chunk handling changes frequently, AFAICT based on the assumption that users can flexibly adapt the chunking in their scripts. As we know, Iris currently buries chunking control inside the loading automation, so each Dask improvement risks becoming a 'breaking performance change'.
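Until Iris exposes that control at load time, a possible user-level workaround is to steer Dask's default chunk size or rechunk the lazy array yourself. This is only a sketch with a made-up file name; `array.chunk-size` is a standard Dask config key and `lazy_data()` / `copy(data=...)` / `has_lazy_data()` are public Iris API, but whether this actually helps any particular file is untested here:

```python
import dask.config
import iris

# Illustrative only: cap Dask's default target chunk size before loading.
dask.config.set({"array.chunk-size": "64MiB"})

cube = iris.load_cube("example.nc")  # hypothetical file

# Alternatively, rechunk the lazy array and wrap it in a new cube; this keeps
# the data lazy (assigning to cube.data would realise it instead).
lazy = cube.lazy_data().rechunk("auto")
rechunked = cube.copy(data=lazy)
print(rechunked.has_lazy_data())  # expected: True
```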
That PP load&extract gets a lot quicker if you simplify the latitude constraint:

```python
lat_point = c.coord('latitude').cell(45)
lat_const = iris.Constraint(latitude=lat_point.point)
```

Possibly because of Lines 308 to 315 in caae450.
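Reading that suggestion, the "simplification" seems to be passing the Cell's scalar `.point` to the constraint instead of the Cell itself. A minimal sketch of the two forms, assuming a hypothetical PP file and using `extract` purely to show the constraints (the report above applies them at load time):

```python
import iris

# Hypothetical file; `c` stands in for the cube from the user's script.
c = iris.load_cube("example.pp")

lat_point = c.coord('latitude').cell(45)

# Constraining on the whole Cell compares point and bounds for every candidate cell.
cell_const = iris.Constraint(latitude=lat_point)

# Simplified form reported to be much faster: compare on the scalar point only.
point_const = iris.Constraint(latitude=lat_point.point)

slow = c.extract(cell_const)
fast = c.extract(point_const)
```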
@jamesp where are you seeing this? When I run your code and then ask if the cube is lazy, it claims to be.
I'm seeing it in the recorded memory footprint. From our recent dash on MeshCoords (#4749) I understand a bit more about how the real/lazy mechanics of Iris work: it seems very possible that I could realise some/all of the data in a cube and yet it would still claim to be lazy.
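For reference, a minimal sketch of checking both things side by side - the laziness flag and the traced memory - with the file name and the collapse purely illustrative. `has_lazy_data()` only reports whether a cube's own array has been realised; it says nothing about memory used while Dask evaluates (or partially evaluates) the graph:

```python
import tracemalloc

import iris
import iris.analysis

tracemalloc.start()

cube = iris.load_cube("example.nc")  # hypothetical file
print("lazy after load:", cube.has_lazy_data())

mean = cube.collapsed("longitude", iris.analysis.MEAN)
print("lazy after collapse:", mean.has_lazy_data())

# Peak memory is measured independently of whether the cubes report lazy data,
# so "still lazy" and "low memory" stay separate questions.
current, peak = tracemalloc.get_traced_memory()
print(f"peak traced memory: {peak / 1e6:.1f} MB")
tracemalloc.stop()
```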
PP load with various environments, using the user-provided file that prompted @jamesp's concern:
We also worked out earlier that the netCDF file isn't realising everything; it's just pulling the entire chunk into memory (i.e. with a 2GB file it still pulls in about 200MB, but doesn't pull the full 2GB).
Oh that's fine then, it's only the worst of both worlds 😆
I don't know, I thought it would have to pull things in at least at the chunk level to read them, so it seems kind of alright? (At least it doesn't pull 2GB into memory - I reckon that's a win! :p)
Oh you're right, I misinterpreted this - I'm too ready to hear "poor chunking means the whole file is loaded".
Well yes, of course it must load "some few" chunks at some time.
In practice, I've seen that you can expect about 3× this.
In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity. If this issue is still important to you, then please comment on this issue and the stale label will be removed. Otherwise this issue will be automatically closed in 28 days' time.
🐛 Bug Report
Reading netCDF and PP files in Iris with the latest version of Dask is slower and uses a lot more memory.
I'm not familiar enough with the mechanics of Iris loading to understand whether this is a Dask issue or an Iris issue. I suspect it may be Dask; if so, I'd appreciate any help in pinpointing the specific thing I need to raise a bug report with Dask about.
The attached script creates a 4D cube full of random numbers and saves it to PP and netCDF. It then loads the file and performs the following operations (a rough sketch of both appears after the list):
load&extract: constrain to a single time and latitude when loading, then take the longitude mean
load,extract: take the longitude mean, then constrain to a single time and latitude
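A rough sketch of the two code paths (the attached script is authoritative; the constraint values and file name below are placeholders):

```python
import iris
import iris.analysis

# Placeholder constraints - the real script picks one time and one latitude value.
time_const = iris.Constraint(time=lambda cell: cell.point.hour == 0)
lat_const = iris.Constraint(latitude=45.0)

# load&extract: constrain while loading, then take the longitude mean.
cube = iris.load_cube("example.nc", time_const & lat_const)
result_a = cube.collapsed("longitude", iris.analysis.MEAN)

# load,extract: load everything, take the longitude mean, then extract.
cube = iris.load_cube("example.nc")
result_b = cube.collapsed("longitude", iris.analysis.MEAN).extract(time_const & lat_const)
```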
Both should end up with identical results. Here is the time taken and peak memory usage for Iris 3.2 on my machine:
netCDF file:
PP file:
How To Reproduce
Steps to reproduce the behaviour:
```
mamba create -y -c conda-forge -n iris-3.2-dask-2021.6 iris=3.2 dask=2021.6.0
mamba create -y -c conda-forge -n iris-3.2-dask-2022.2 iris=3.2 dask=2022.2.0
```
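For anyone reproducing this, a minimal sketch of timing one case and recording peak RSS using only the standard library (Unix-only; the file name is a placeholder and the attached script's own measurements take precedence):

```python
import resource
import time

import iris
import iris.analysis

start = time.perf_counter()

cube = iris.load_cube("example.nc")  # or the PP file
result = cube.collapsed("longitude", iris.analysis.MEAN)
_ = result.data  # force the lazy computation so the cost is actually paid

elapsed = time.perf_counter() - start
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
print(f"time: {elapsed:.1f} s, peak RSS: {peak_kb / 1024:.0f} MiB")
```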
Expected behaviour
Performance should be better in later versions of Iris and Dask.
Environment
Analysis script