How do I make this code faster #2821
-
I am trying to analyze some climate data. Below is the code:
pwg_eu10 = xr.open_mfdataset('*/*PGW_EU10*.nc')
pwg_ev10 = xr.open_mfdataset('*/*PGW_EV10*.nc')
This is the main code that I want to run, and for some reason it is incredibly slow. Just to calculate the wind speed takes about 2 hours.
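The post is cut off here; a minimal sketch of the kind of calculation being described, assuming the wind speed is computed for one lat/lon point at a time (as the replies below suggest) and using placeholder point indices:
import metpy.calc as mpcalc
import xarray as xr

# Open the multi-file datasets lazily as Dask-backed arrays
pwg_eu10 = xr.open_mfdataset('*/*PGW_EU10*.nc')
pwg_ev10 = xr.open_mfdataset('*/*PGW_EV10*.nc')

# Placeholder indices for a single point of interest
lat1, lon1 = 500, 700

# .compute() is what actually reads the data from disk and does the math
speed = mpcalc.wind_speed(pwg_eu10.EU10[:, lat1, lon1],
                          pwg_ev10.EV10[:, lat1, lon1]).compute()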
-
I'm working with @DOptimusPrime on this. The issue is a very slow …
-
I did some testing locally with some synthetic data I made of similar size (though for only two time chunks of 2208):
import metpy.calc as mpcalc
from metpy.units import units
import numpy as np
import xarray as xr
ntime = 2208
nlat = 1015
nlon = 1359
# Create one big array and then tell xarray to turn into Dask chunks
u = v = xr.DataArray(np.full((2 * ntime, nlat, nlon), 1., dtype=np.float32),
                     coords={'time': np.arange(2 * ntime),
                             'lat': np.linspace(-90, 90, nlat),
                             'lon': np.linspace(0, 360, nlon)},
                     attrs={'units': 'm/s'}).chunk({'time': 2208})
# Calculate, .compute() tells Dask to actually do the computation
s = mpcalc.wind_speed(u, v).compute()
The above code, which computes wind speed on the entire grid (not just one lat/lon point), takes ~100 seconds on my MacBook laptop with 64 GB of memory (~40 s of that is spent just generating the xarray source data). That means your problem is I/O-bound: of the two hours for your full dataset, most of the time is spent loading the data from disk into memory. As far as making it quicker, you can at least pull out all of the points you need in a single selection rather than one point at a time:
eu10 = pwg_eu10.EU10[:, [lat1, lat2, lat3], [lon1, lon2, lon3]]
ev10 = pwg_ev10.EV10[:, [lat1, lat2, lat3], [lon1, lon2, lon3]]
I'd expect this to still take 2 hours, but at least it doesn't scale to 12.
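Putting the pieces together, a rough sketch of that approach, using the file patterns and variable names from this thread and placeholder point indices:
import metpy.calc as mpcalc
import xarray as xr

# Open lazily; nothing is read from disk yet
pwg_eu10 = xr.open_mfdataset('*/*PGW_EU10*.nc')
pwg_ev10 = xr.open_mfdataset('*/*PGW_EV10*.nc')

# Placeholder point indices -- substitute the ones you actually need
lat_idx = [100, 200, 300]
lon_idx = [150, 250, 350]

# Select every point of interest up front so the dataset only has to be
# traversed once, rather than once per point
eu10 = pwg_eu10.EU10[:, lat_idx, lon_idx]
ev10 = pwg_ev10.EV10[:, lat_idx, lon_idx]

# One .compute() does the I/O and the wind speed calculation for all points
speed = mpcalc.wind_speed(eu10, ev10).compute()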
You might want to try asking on the Pangeo Discourse or the xarray GitHub discussions, since this isn't really a MetPy-specific question (the MetPy calculation in question is a glorified wrapper around numpy.hypot).
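For reference, the underlying computation is just the magnitude of the (u, v) wind vector, which is easy to sanity-check with plain NumPy:
import numpy as np

u = np.array([3.0, 0.0, 5.0])
v = np.array([4.0, 2.0, 12.0])

# Wind speed is sqrt(u**2 + v**2), i.e. the hypotenuse of the two components
speed = np.hypot(u, v)  # -> array([ 5.,  2., 13.])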