Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Allows single time stastical pre-processing step to preceed checks and fixes upon loading the data #2039

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Neah-Ko
Copy link

@Neah-Ko Neah-Ko commented May 18, 2023

Description

Closes #2018

Hello esmvalgroup,

This pull request implements a feature that would correct the behavior that I encountered on this issue.
For the more general case I also think it could be a base to eventually support changes of frequencies between input and output data, and have more descriptive output files, that can be subsequently loaded in the tool.

It is for now still in draft phase, I am putting it out there in order to get some of the core development team's opinion.
In particular @bouweandela, you seem to be the main maintainer of the code on which I've added modifications from the git blame command.

I joined this project recently.. This is the first time I am touching functionalities located that deep in the core. I tinkered with several designs, but am still unsure about what kind of modifications are "allowed" in those parts.
For now as it is stated below, I restricted the scope to a single case, in order to minimize risk of causing side effects in other parts of the tool. In particular I don't have enough experience with it to understand fully how the chaining of pre-procesors work in detail. I wanted to have your opinions on it, and know if you think that this feature could be integrated or even extended to some other cases.

Logic behind

The conclusions from my investigations on the matter (c.f. issue), are that in the case of time statistics being applied such as '[daily | monthly | ....]_statistics', the output mip stays the same as the input. In consequence native data that is being loaded undergoes fixes that are not adapted for it's intended frequency.

Moreover, clip_timerange is being called before time statistics pre-processors. Some of the CMOR fixes effects are to shift back averaged data half a relevant interval back in time. Which is then 'clipped' and not available for the statistical step later on.

There is no easy way to go around that as upon loading, the defaults pre-processors are applied in priority before all the others pre-processing steps.

This feature core idea is to detect such changes of frequency, guess the output mip using relevant informations in the cmor tables and the definition of the time statistics pre-processor. Then pass it during the loading so that the statistics are applied after loading raw data but before the default steps.

Scope: The scope is strictly restricted to the case where exactly one non-implicit (as in not being part of INITAL_STEPS nor FINAL_STEPS) time stastical pre-processor (members of _time ending with 'statistics') is being applied on input data.

Do you think the functionality as it is could already be problematic ? From what I have seen, those statistics are iris's behind the hood. If data is loaded correctly then not being CMOR compliant should have no bad side effects.

Technical details

  • The detection and guessing of the use case and output mip is happening in the constructor of PreprocessorFile class
    • Conveniently has all the information and is an entry point for Dataset loading calls
    • guessing of the mip made by narrowing down _get_mips function from cmor package that gives all mips that are valid given the variable and the project.
      • still quite "hacky", needs rework (and probably won't pass complexity check)
      • So far I didn't found any efficient nor clean way of generically guessing a mip.
    • sets a flag if conditions are met
    • If the flag is up at load time, then change metadata accordingly and pass the settings bit of the preprocessor with its arguments to datasets loading function
    • In particular I have added a kwargs argument to both Dataset._load_with_callback and Dataset._load functions
      • Avoids duplication of code, but touches onto critical fucntions

About the outputs

I tested generating daily tasmax from hourly to daily on ERA5 data downloaded from the CDS (product: reanalysis-era5-single-levels, year: 1940, mon: 01). The code might be faulty for other frequencies. I will conduct more testing on other frequencies after solving some of the design questions.

  • I checked how the generated outputs are looking like
    • using ncview: I am no climate expert, but no glitches in the heatmaps
      • Header, time:units is of the desired frequency.
    • The generated file has now a correct time coordinate, with equal jumps in time for all points, spanning the whole input data.

Link to documentation:


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


@Neah-Ko Neah-Ko added bug Something isn't working help wanted Extra attention is needed cmor Related to the CMOR standard data issue labels May 18, 2023
@Neah-Ko Neah-Ko requested review from bouweandela and schlunma May 18, 2023 15:14
@Neah-Ko Neah-Ko self-assigned this May 18, 2023
@Neah-Ko Neah-Ko marked this pull request as draft May 18, 2023 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cmor Related to the CMOR standard data issue help wanted Extra attention is needed
Projects
None yet
1 participant