Feature: Allows single time stastical pre-processing step to preceed checks and fixes upon loading the data #2039

Neah-Ko · 2023-05-18T15:14:58Z

Description

Closes #2018

Hello esmvalgroup,

This pull request implements a feature that would correct the behavior that I encountered on this issue.
For the more general case I also think it could be a base to eventually support changes of frequencies between input and output data, and have more descriptive output files, that can be subsequently loaded in the tool.

It is for now still in draft phase, I am putting it out there in order to get some of the core development team's opinion.
In particular @bouweandela, you seem to be the main maintainer of the code on which I've added modifications from the git blame command.

I joined this project recently.. This is the first time I am touching functionalities located that deep in the core. I tinkered with several designs, but am still unsure about what kind of modifications are "allowed" in those parts.
For now as it is stated below, I restricted the scope to a single case, in order to minimize risk of causing side effects in other parts of the tool. In particular I don't have enough experience with it to understand fully how the chaining of pre-procesors work in detail. I wanted to have your opinions on it, and know if you think that this feature could be integrated or even extended to some other cases.

Logic behind

The conclusions from my investigations on the matter (c.f. issue), are that in the case of time statistics being applied such as '[daily | monthly | ....]_statistics', the output mip stays the same as the input. In consequence native data that is being loaded undergoes fixes that are not adapted for it's intended frequency.

Moreover, clip_timerange is being called before time statistics pre-processors. Some of the CMOR fixes effects are to shift back averaged data half a relevant interval back in time. Which is then 'clipped' and not available for the statistical step later on.

There is no easy way to go around that as upon loading, the defaults pre-processors are applied in priority before all the others pre-processing steps.

This feature core idea is to detect such changes of frequency, guess the output mip using relevant informations in the cmor tables and the definition of the time statistics pre-processor. Then pass it during the loading so that the statistics are applied after loading raw data but before the default steps.

Scope: The scope is strictly restricted to the case where exactly one non-implicit (as in not being part of INITAL_STEPS nor FINAL_STEPS) time stastical pre-processor (members of _time ending with 'statistics') is being applied on input data.

Do you think the functionality as it is could already be problematic ? From what I have seen, those statistics are iris's behind the hood. If data is loaded correctly then not being CMOR compliant should have no bad side effects.

Technical details

The detection and guessing of the use case and output mip is happening in the constructor of PreprocessorFile class
- Conveniently has all the information and is an entry point for Dataset loading calls
- guessing of the mip made by narrowing down _get_mips function from cmor package that gives all mips that are valid given the variable and the project.
  - still quite "hacky", needs rework (and probably won't pass complexity check)
  - So far I didn't found any efficient nor clean way of generically guessing a mip.
- sets a flag if conditions are met
- If the flag is up at load time, then change metadata accordingly and pass the settings bit of the preprocessor with its arguments to datasets loading function
- In particular I have added a kwargs argument to both Dataset._load_with_callback and Dataset._load functions
  - Avoids duplication of code, but touches onto critical fucntions

About the outputs

I tested generating daily tasmax from hourly to daily on ERA5 data downloaded from the CDS (product: reanalysis-era5-single-levels, year: 1940, mon: 01). The code might be faulty for other frequencies. I will conduct more testing on other frequencies after solving some of the design questions.

I checked how the generated outputs are looking like
- using ncview: I am no climate expert, but no glitches in the heatmaps
  - Header, time:units is of the desired frequency.
- The generated file has now a correct time coordinate, with equal jumps in time for all points, spanning the whole input data.

Link to documentation:

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

🧪 The new functionality is relevant and scientifically sound
🛠 This pull request has a descriptive title and labels
🛠 Code is written according to the code quality guidelines
🧪 and 🛠 Documentation is available
🛠 Unit tests have been added
🛠 Changes are backward compatible
🛠 Any changed dependencies have been added or removed correctly
🛠 The list of authors is up to date
🛠 All checks below this pull request were successful

…+ according metadata fixes for output files

Feature: Allows time stastical functions to preceed checks and fixes …

7489ddd

…+ according metadata fixes for output files

Neah-Ko added bug Something isn't working help wanted Extra attention is needed cmor Related to the CMOR standard data issue labels May 18, 2023

Neah-Ko requested review from bouweandela and schlunma May 18, 2023 15:14

Neah-Ko self-assigned this May 18, 2023

Neah-Ko marked this pull request as draft May 18, 2023 15:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Allows single time stastical pre-processing step to preceed checks and fixes upon loading the data #2039

Feature: Allows single time stastical pre-processing step to preceed checks and fixes upon loading the data #2039

Neah-Ko commented May 18, 2023 •

edited

Loading

Feature: Allows single time stastical pre-processing step to preceed checks and fixes upon loading the data #2039

Are you sure you want to change the base?

Feature: Allows single time stastical pre-processing step to preceed checks and fixes upon loading the data #2039

Conversation

Neah-Ko commented May 18, 2023 • edited Loading

Description

Logic behind

Technical details

About the outputs

Before you get started

Checklist

Neah-Ko commented May 18, 2023 •

edited

Loading