Feature: Allows a single time statistical pre-processing step to precede checks and fixes upon loading the data #2039
Description
Closes #2018
Hello esmvalgroup,
This pull request implements a feature that would correct the behavior I encountered in this issue.
For the more general case, I also think it could be a basis for eventually supporting changes of frequency between input and output data, and for producing more descriptive output files that can subsequently be loaded in the tool.
For now it is still in the draft phase. I am putting it out there to get the opinion of some of the core development team.
In particular @bouweandela: according to git blame, you seem to be the main maintainer of the code I have modified.
I joined this project recently. This is the first time I am touching functionality located that deep in the core. I tinkered with several designs, but I am still unsure what kind of modifications are "allowed" in those parts.
For now, as stated below, I have restricted the scope to a single case in order to minimize the risk of causing side effects in other parts of the tool. In particular, I don't have enough experience with the tool to fully understand how the chaining of pre-processors works in detail. I would like your opinions on it, and to know whether you think this feature could be integrated or even extended to other cases.
Logic behind
My investigation of the matter (cf. the issue) concluded that when a time statistics pre-processor such as '[daily | monthly | ....]_statistics' is applied, the output mip stays the same as the input mip. As a consequence, native data that is loaded undergoes fixes that are not adapted to its intended frequency.
Moreover, clip_timerange is called before the time statistics pre-processors. One effect of some CMOR fixes is to shift averaged data back in time by half of the relevant interval. That data is then 'clipped' and is not available for the statistical step later on.
There is no easy way around this because, upon loading, the default pre-processors are applied before all other pre-processing steps.
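The clipping problem described above can be illustrated with a minimal, standard-library-only sketch. The `clip` helper and the half-hour shift are simplified assumptions standing in for clip_timerange and a CMOR time fix; they are not the tool's actual implementations.

```python
from datetime import datetime, timedelta

# Hypothetical hourly time points covering one day (illustrative only).
start = datetime(1940, 1, 1)
hourly = [start + timedelta(hours=h) for h in range(24)]

# Assumed stand-in for a fix that moves each point half an interval back.
shifted = [t - timedelta(minutes=30) for t in hourly]


def clip(points, lower, upper):
    """Keep points in [lower, upper); a simplified stand-in for clipping."""
    return [t for t in points if lower <= t < upper]


# Clipping to the requested day after the shift drops the first point,
# so a later daily statistic would see 23 values instead of 24.
clipped = clip(shifted, start, start + timedelta(days=1))
print(len(hourly), len(clipped))
```

Applying the statistic before the shift and the clip would keep all 24 points available for the daily aggregate.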
The core idea of this feature is to detect such changes of frequency and to guess the output mip using the relevant information in the CMOR tables and the definition of the time statistics pre-processor. This information is then passed along during loading, so that the statistics are applied after the raw data is loaded but before the default steps.
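A rough sketch of the mip-guessing idea follows. Everything here is hypothetical: `STEP_TO_FREQUENCY`, `guess_output_mip`, and the candidate table are illustrative names and values, not part of the tool's API, and the real lookup would draw on the CMOR tables.

```python
# Assumed mapping from a time statistics step name to a frequency label.
STEP_TO_FREQUENCY = {
    "hourly_statistics": "1hr",
    "daily_statistics": "day",
    "monthly_statistics": "mon",
    "annual_statistics": "yr",
}


def guess_output_mip(step_name, candidate_mips):
    """Return the first candidate mip whose frequency matches the step.

    candidate_mips maps a mip name to its frequency label; it is a
    hypothetical stand-in for what the CMOR tables would provide.
    Returns None when the step or the frequency is not recognised.
    """
    frequency = STEP_TO_FREQUENCY.get(step_name)
    if frequency is None:
        return None
    for mip, mip_frequency in candidate_mips.items():
        if mip_frequency == frequency:
            return mip
    return None


# Illustrative candidates for one variable and project.
candidates = {"E1hr": "1hr", "day": "day", "Amon": "mon"}
print(guess_output_mip("daily_statistics", candidates))
```

Returning None for unrecognised steps keeps the sketch conservative: when no mip can be guessed, the caller can fall back to the existing behavior.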
Scope: The scope is strictly restricted to the case where exactly one non-implicit (that is, not part of INITIAL_STEPS nor FINAL_STEPS) time statistical pre-processor (a member of _time whose name ends with 'statistics') is applied to the input data.
Do you think the functionality as it stands could already be problematic? From what I have seen, those statistics are computed by iris under the hood. If the data is loaded correctly, then not being CMOR compliant should have no bad side effects.
Technical details
PreprocessorFile class
Dataset loading calls the _get_mips function from the cmor package, which gives all mips that are valid given the variable and the project.
settings bit of the pre-processor with its arguments to the datasets loading function
kwargs argument to both the Dataset._load_with_callback and Dataset._load functions
About the outputs
I tested generating daily tasmax from hourly ERA5 data downloaded from the CDS (product: reanalysis-era5-single-levels, year: 1940, mon: 01). The code might be faulty for other frequencies. I will conduct more testing on other frequencies after solving some of the design questions.
time:units is of the desired frequency.
Link to documentation:
Before you get started
Checklist
It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.