
CMOR fixes + clip_timerange interactions before time statistics pre-processors are causing outputs with missing data #2018

Neah-Ko opened this issue May 5, 2023 · 2 comments · May be fixed by #2039


Neah-Ko commented May 5, 2023

Describe the bug
Hello,
I am currently trying to build a dataset using ERA5 data downloaded from the CDS.

The requirements ask for the tasmin/tasmax variables, which are not available at daily frequency. However, hourly tas is available, so I am attempting to build them with the daily_statistics preprocessor.

Here is a minimal version of the recipe I am using for CMORization:

hourly: &hourly_data
  - {exp: reanalysis-era5-single-levels, project: native6, timerange: 194001/194012,
     dataset: ERA5, version: v1, tier: 3, type: reanaly}
preprocessors:
  hourly_to_daily_max:
    daily_statistics:
      operator: max
diagnostics:
  CMORize:
    title: ERA5 CMORisation
    description: ERA5 CMORisation
    variables:
      tasmax:
        raw: t2m
        short_name: tasmax
        mip: AERhr
        additional_datasets: *hourly_data
        preprocessor: hourly_to_daily_max

It produces the expected output file. However, when attempting to load it in another recipe, this happens during the run:

esmvalcore.cmor.check.CMORCheckError: There were errors in variable tasmax:
 time: Frequency day does not match input data

I went on to investigate this, and it turns out that the time coordinate of the cube is responsible:

>>> import iris
>>> x = iris.load_cube("OBS_ERA5_reanaly_v1_day_tasmax_19400101-19401231.nc")
>>> xpts = x.coord('time').points
>>> xpts
[32871.5, 32872.5, ..., 33235.5, 33236.47916667]

The very last point does not follow the regular daily spacing, and the deviation is above the accepted margin (i.e. 0.001), so the last loop of _check_time_coord from check.py finds it and raises the error.
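For illustration, here is a minimal sketch of the kind of spacing check that trips here (my own stand-in, not the actual _check_time_coord code), assuming frequency 'day' and the 0.001 margin:

import numpy as np

# Illustrative daily time points (in days); only the last step is short.
points = np.array([33233.5, 33234.5, 33235.5, 33236.47916667])

expected = 1.0      # expected step for frequency 'day'
tolerance = 0.001   # the accepted margin mentioned above
steps = np.diff(points)
print(np.abs(steps - expected) > tolerance)  # -> [False False  True]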

Now, one normally fixes this kind of issue by tweaking the CMORizer, right? But in the CMORization run, here is the order of the preprocessor steps:

2023-05-05 09:47:43,434 UTC [219849] DEBUG   Running preprocessor step concatenate
2023-05-05 09:47:43,620 UTC [219849] DEBUG   Running preprocessor step cmor_check_metadata
2023-05-05 09:47:43,699 UTC [219849] DEBUG   Running preprocessor step clip_timerange
2023-05-05 09:47:43,799 UTC [219849] DEBUG   Running preprocessor step fix_data
2023-05-05 09:47:43,804 UTC [219849] DEBUG   Running preprocessor step cmor_check_data
2023-05-05 09:47:43,808 UTC [219849] DEBUG   Running preprocessor step add_supplementary_variables
2023-05-05 09:47:43,811 UTC [219849] DEBUG   Running preprocessor step daily_statistics
2023-05-05 09:47:50,756 UTC [219849] DEBUG   Running preprocessor step save

I want to point out that the checks happen before the daily statistics and are not run again afterwards, so there is no point where I could plug in a fix for the time coordinate before it gets saved.

I thought about doing it in a script, but that means I have to load and realize the cube's data, because saving does not work with lazy data. That sounds like a pretty heavy operation just to add 0.02 to the last point of the time coordinate.
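For reference, a minimal sketch of that heavy workaround as a standalone script, assuming the time units are days (so half an hour is 1/48) and noting that bounds, if present, would need the same treatment:

import iris

cube = iris.load_cube("OBS_ERA5_reanaly_v1_day_tasmax_19400101-19401231.nc")

# Snap the last time point back onto the daily grid.
time = cube.coord('time')
points = time.points.copy()
points[-1] += 1 / 48  # the missing 0.0208333... of a day
time.points = points

# Saving realizes the (otherwise lazy) cube data.
iris.save(cube, "tasmax_fixed.nc")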

Explanation

What I think is happening

>>> a = 33236.47916667
>>> b = 33236.5
>>> b - a
0.020833330003370065
>>> 1/48
0.020833333333333332

The shift is exactly half an hour (1/48 of a day). This is most likely caused by the checking and fixing of the hourly data that happens before daily_statistics is called, and it is then carried over until the data is saved.

I tried using the following preprocessor instead, but it turns the time coordinate into a 'day_of_year' one, so not quite the expected result either:

    climate_statistics:
      operator: max
      period: day

So what I would like to know is whether there is a known way to change the frequency of a variable and save the resulting file in a way that is compliant with the tool. Otherwise, is it possible to explicitly call the cmor_check_metadata preprocessor in the recipe after daily_statistics?

  hourly_to_daily_max:
    daily_statistics:
      operator: max
    cmor_check_metadata:

Redefining it like this causes the run to fail because the {'mip', 'cmor_table', 'frequency', 'short_name'} arguments are missing; I would like those to be passed over from the last operation.
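For comparison, the underlying checker can be called directly from Python with those arguments spelled out explicitly; a hedged sketch, assuming esmvalcore.cmor.check.cmor_check_metadata keeps its current signature:

import iris
from esmvalcore.cmor.check import cmor_check_metadata

cube = iris.load_cube("OBS_ERA5_reanaly_v1_day_tasmax_19400101-19401231.nc")

# The same arguments the recipe step complains about being missing;
# they identify the CMOR table entry to check the cube against.
cube = cmor_check_metadata(
    cube,
    cmor_table='native6',
    mip='AERhr',
    short_name='tasmax',
    frequency='day',
)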

sloosvel commented May 5, 2023

Hi @Neah-Ko ,

In this recipe you have a way to deal with the issue of the last time point not being CMOR compatible: https://github.com/ESMValGroup/ESMValTool/blob/main/esmvaltool/recipes/cmorizers/recipe_daily_era5.yml

Neah-Ko self-assigned this May 18, 2023
Neah-Ko added the cmor (Related to the CMOR standard) label May 18, 2023

Neah-Ko commented May 18, 2023

Hello,
The solution suggested in that recipe is a possibility when extra data is available, which may or may not be the case.

I decided to investigate a bit further into what was causing this, besides the data being shifted half an interval back in time by the fixes.

Now, if we look again at the points of the output file's time coordinate, [32871.5, 32872.5, ..., 33235.5, 33236.47916667], and observe the first and last cell points during the execution flow of the preprocessing steps:

...
2023-05-05 09:47:43,434 UTC [219849] DEBUG   Running preprocessor step concatenate
    cube.coord('time').cell(0).point -> DatetimeGregorian(1940, 1, 1, 1, 0, 0, 0, has_year_zero=False)
    cube.coord('time').cell(-1).point -> DatetimeGregorian(1940, 1, 31, 23, 0, 0, 0, has_year_zero=False)
2023-05-05 09:47:43,620 UTC [219849] DEBUG   Running preprocessor step cmor_check_metadata
    cube.coord('time').cell(0).point -> DatetimeGregorian(1939, 12, 31, 13, 0, 0, 0, has_year_zero=False)
    cube.coord('time').cell(-1).point -> DatetimeGregorian(1940, 1, 31, 11, 0, 0, 0, has_year_zero=False)
2023-05-05 09:47:43,699 UTC [219849] DEBUG   Running preprocessor step clip_timerange
2023-05-17 15:18:39,313 UTC [303452] DEBUG   esmvalcore.preprocessor:327 Running preprocessor function 'clip_timerange' on the data
<iris 'Cube' of air_temperature / (K) (time: 744; latitude: 721; longitude: 1440)>
loaded from original input file(s)
[LocalFile('/data/ejodry/climate_data/Tier3/ERA5/reanalysis-era5-single-levels_194001.nc')]
with function argument(s)
timerange = '194001/194001'

    cube.coord('time').cell(0).point -> DatetimeGregorian(1940, 1, 1, 1, 0, 0, 0, has_year_zero=False)
    cube.coord('time').cell(-1).point -> DatetimeGregorian(1940, 1, 31, 11, 0, 0, 0, has_year_zero=False)
...

It is not that the last point increases differently from the others: the data for that point is actually missing. clip_timerange is called here with the recipe dataset's timerange, which covers the whole input. However, the time coordinate has already been shifted back in time by the CMOR fixes, so the last bit falls out of the timerange and is lost. That is why my resulting cubes had a wrong last time point (0.47916667 of a day is 11:30, which is consistent with taking the mean of the 23 remaining half-hour-shifted hourly points of that day). It is also why adding more points at the extremities "fixes" the problem: it is just making up for the lost data.
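A toy illustration of that mechanism with plain numbers (not a replay of the exact ERA5 fix), assuming a half-hour backward shift as computed above:

import numpy as np

# 24 hourly points for one day, at 00:00 ... 23:00, in fractions of a day.
hours = np.arange(24) / 24.0

# The fix shifts every point back by half an hour (1/48 of a day) ...
shifted = hours - 1 / 48

# ... and clipping to the original day then drops the point that was
# shifted out of range, losing one hour of data.
clipped = shifted[(shifted >= 0) & (shifted < 1)]
print(len(hours), len(clipped))        # -> 24 23

# The daily statistic is then pinned to the mean of the remaining points:
print(round(clipped.mean() * 24, 2))   # -> 11.5, i.e. 11:30 instead of 12:00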

Moreover, I think this might affect the mapping between the time coordinate and the data points: because everything is shifted, we are computing statistics on data that comes from half an interval later and pinning it on the current time. For short intervals that is more or less fine, but if we are talking about half a month, season or year, it already sounds more concerning.

I have designed an experimental feature in the attached pull request that attempts to fix this problem.

Neah-Ko changed the title from "Last time coord point wrong after applying daily_statistics: loading fails during checks" to "CMOR fixes + clip_timerange interactions before time statistics pre-processors are causing outputs with missing data" May 18, 2023