Add preprocessors distance_metrics and histogram #2299

Merged
merged 68 commits into from
May 8, 2024

@schlunma schlunma commented Jan 12, 2024

Description

This PR adds a distance_metric preprocessor that calculates distance metrics between datasets and a reference dataset. In addition, it adds a histogram preprocessor, which is needed to calculate one of the metrics. These preprocessors have the following call signatures:

def distance_metric(
    products: set[PreprocessorFile] | Iterable[Cube],
    metric: MetricType,
    reference: Optional[Cube] = None,
    coords: Iterable[Coord] | Iterable[str] | None = None,
    keep_reference_dataset: bool = True,
    **kwargs,
) -> set[PreprocessorFile] | CubeList:
    ...


def histogram(
    cube: Cube,
    coords: Iterable[Coord] | Iterable[str] | None = None,
    bins: int | Sequence[float] = 10,
    bin_range: tuple[float, float] | None = None,
    weights: np.ndarray | da.Array | bool | None = None,
    normalization: Literal['sum', 'integral'] | None = None,
) -> Cube:
    ...
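
To illustrate what the two normalization options of histogram might mean, here is a minimal pure-Python sketch. It is not the implementation from this PR: simple_histogram is a hypothetical helper, and it assumes that 'sum' rescales the bin values so they add up to 1, while 'integral' rescales them so that the area under the histogram is 1.

```python
def simple_histogram(values, bins, bin_range, normalization=None):
    """Count values in equal-width bins and optionally normalize the counts.

    Hypothetical helper for illustration only (not the ESMValCore code).
    """
    low, high = bin_range
    width = (high - low) / bins
    edges = [low + i * width for i in range(bins + 1)]
    counts = [0.0] * bins
    for value in values:
        if value < low or value > high:
            continue  # values outside the requested range are ignored
        # Clamp the index so that the last bin also includes its right edge.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1.0
    total = sum(counts)
    if total == 0:
        return edges, counts  # nothing to normalize
    if normalization == "sum":
        # Assumed meaning: bin values add up to 1.
        counts = [count / total for count in counts]
    elif normalization == "integral":
        # Assumed meaning: sum of (bin value * bin width) is 1.
        counts = [count / (total * width) for count in counts]
    return edges, counts


sample = [0.5, 1.5, 3.2, 7.0]
_, by_sum = simple_histogram(sample, 4, (0.0, 8.0), normalization="sum")
_, by_integral = simple_histogram(sample, 4, (0.0, 8.0), normalization="integral")
print(by_sum)
print(by_integral)
```

With a bin width different from 1, the two options give different bin values, which is why the choice matters when histograms are compared between datasets.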

If used within a recipe, exactly one dataset among the products passed to distance_metric needs the key reference_for_metric: true. Example:

datasets:
  - {dataset: BCC-ESM1, project: CMIP6, exp: historical, ensemble: r1i1p1f1, grid: gn}
  - {dataset: bcc-csm1-1, version: v1, project: CMIP5, exp: historical, ensemble: r1i1p1, reference_for_metric: true}

preprocessors:
  calc_rmse:
    regrid:
      target_grid: 3x3
      scheme: linear
    distance_metric:
      metric: rmse
      coords: [latitude, longitude]
      # keep_reference_dataset: false

diagnostics:
  test:
    variables:
      tas:
        mip: Amon
        timerange: '2000/2005'
        preprocessor: calc_rmse
    scripts:
      null
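
As a rough illustration of what the calc_rmse preprocessor above computes, the following sketch collapses latitude and longitude and keeps one RMSE value per time step. It is a simplified stand-in, not the code from this PR: rmse_over_space is a hypothetical helper, the data are made-up numbers, and no area weighting is applied.

```python
import math


def rmse_over_space(data, reference):
    """Return one RMSE per time step for [time][lat][lon] nested lists."""
    result = []
    for field, ref_field in zip(data, reference):
        # Squared differences over all grid cells of this time step.
        squared_errors = [
            (value - ref_value) ** 2
            for row, ref_row in zip(field, ref_field)
            for value, ref_value in zip(row, ref_row)
        ]
        result.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return result


# Two time steps on a 2x2 grid (illustrative values only).
model = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
reference = [[[0.0, 2.0], [3.0, 2.0]], [[2.0, 2.0], [2.0, 2.0]]]
rmse_per_time_step = rmse_over_space(model, reference)
print(rmse_per_time_step)
```

The result has the time dimension only, which mirrors coords: [latitude, longitude] in the recipe: the listed coordinates are the ones that get collapsed.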

Currently supported metrics:

Closes #2266

@schlunma schlunma added the preprocessor Related to the preprocessor label Jan 12, 2024
@schlunma schlunma added this to the v2.11.0 milestone Jan 12, 2024
@schlunma schlunma requested a review from axel-lauer January 12, 2024 16:11
@schlunma schlunma self-assigned this Jan 12, 2024

codecov bot commented Jan 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.51%. Comparing base (8276a62) to head (5bdb89b).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2299      +/-   ##
==========================================
+ Coverage   94.44%   94.51%   +0.07%     
==========================================
  Files         246      246              
  Lines       13745    14020     +275     
==========================================
+ Hits        12981    13251     +270     
- Misses        764      769       +5     

@axel-lauer
Contributor

I took a first look. The example using two CMIP models works nicely, but when I try to use ERA-Interim or ERA5 as the reference dataset, I get an error message about insufficient coordinate metadata:

ValueError: Cannot calculate distance metric between cube and reference cube: Insufficient matching coordinate metadata to resolve cubes, cannot map dimension (0,) of the RHS cube 'air_temperature' to the LHS cube 'air_temperature'.

I have no idea what might cause this problem. This is the recipe I tried:

# recipe_test.yml
---
documentation:
  title: Test

  description: |
    Test new preproc.

  authors:
    - lauer_axel

  maintainer:
    - lauer_axel

  projects:
    - esmval

datasets:
  - {dataset: BCC-ESM1, project: CMIP6, exp: historical, ensemble: r1i1p1f1, grid: gn}
  - {dataset: bcc-csm1-1, version: v1, project: CMIP5, exp: historical, ensemble: r1i1p1}
#  - {dataset: ERA-Interim, project: OBS6, type: reanaly, version: '1', tier: 3, reference_for_metric: true}
  - {dataset: ERA5, project: native6, type: reanaly, version: v1, tier: 3, reference_for_metric: true}

preprocessors:
  calc_rmse:
    regrid:
      target_grid: 3x3
      scheme: linear
    distance_metric:
      metric: rmse
      coords: [latitude, longitude]
      # keep_reference_dataset: false

diagnostics:
  test:
    variables:
      tas:
        mip: Amon
        timerange: '2000/2005'
        preprocessor: calc_rmse
    scripts:
      null

@schlunma
Contributor Author

Thanks for testing, Axel! I think this error is caused by different time coordinates in the cubes. Could you try adding the preprocessor regrid_time to your recipe? I am not really optimistic that it will solve the problem (I don't have many good experiences with that preprocessor), but it's worth a try.

@axel-lauer
Contributor

Thanks for the idea. When using regrid_time, the error message changes to:

ValueError: Cannot calculate distance metric between cube and reference cube: Coordinate 'day_of_year' has different points for the LHS cube 'air_temperature' and RHS cube 'air_temperature'.

@schlunma
Contributor Author

Could you try to delete the following lines and run again?

iris.coord_categorisation.add_day_of_month(cube,
                                           cube.coord('time'),
                                           name='day_of_month')
iris.coord_categorisation.add_day_of_year(cube,
                                          cube.coord('time'),
                                          name='day_of_year')

@axel-lauer
Contributor

Commenting out those lines in _time.py results in this error message:

ValueError: Cannot calculate distance metric between cube and reference cube: Insufficient matching coordinate metadata to resolve cubes, cannot map dimension (0,) of the RHS cube 'air_temperature' to the LHS cube 'air_temperature'.

@schlunma
Contributor Author

Then I suggest fixing regrid_time for monthly and yearly data so that ALL cubes end up with the exact same time coordinate. There are several open issues about this; I just commented in #2106.
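
To make the idea of "the exact same time coordinate" concrete, here is a small sketch for monthly data. It is not the regrid_time implementation: to_mid_month is a hypothetical helper, the timestamps are illustrative, and the convention of placing every monthly point on day 15 at 00:00 is an assumption chosen for the example.

```python
from datetime import datetime


def to_mid_month(points):
    """Replace each timestamp by day 15, 00:00 of the same month."""
    return [datetime(point.year, point.month, 15) for point in points]


# Two monthly datasets that describe the same months with different points.
model_time = [datetime(2000, 1, 16, 12), datetime(2000, 2, 15, 12)]
reference_time = [datetime(2000, 1, 1), datetime(2000, 2, 1)]

# Before alignment the points differ, so the coordinates cannot be matched.
print(model_time == reference_time)

# After alignment both datasets share identical time points.
aligned_model = to_mid_month(model_time)
aligned_reference = to_mid_month(reference_time)
print(aligned_model == aligned_reference)
```

Any derived coordinates (such as day_of_month or day_of_year) would need to be recomputed from the aligned points, otherwise they can still differ between cubes.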

@axel-lauer
Contributor

Sounds like a plan!

@axel-lauer
Contributor

I found another little problem with this preprocessor: I would like to apply, e.g., the RMSE calculation to geographical distributions of multi-year annual means. For this, it would make sense to first use climate_statistics to calculate the annual means. When doing so, the variables lose their "time" dimension (the coordinate is still there but no longer used as a variable dimension). For this reason (I think), I get the following error message when running distance_metric:

  File "/work/bd0854/b380103/ESMValCore/esmvalcore/preprocessor/_bias.py", line 388, in _calculate_metric
    res_cube = cube.collapsed(coords, iris.analysis.MAX)
  File "/work/bd0854/b380103/mambaforge/envs/esmvaltool/lib/python3.10/site-packages/iris/cube.py", line 3888, in collapsed
    raise iris.exceptions.CoordinateCollapseError(msg)
iris.exceptions.CoordinateCollapseError: Cannot collapse a dimension which does not describe any data.

Any ideas how this could work? Is the cube.collapsed in line 388 of _bias.py really needed?

@schlunma
Contributor Author

I am not entirely sure I understand you correctly. Which options are you passing to climate_statistics? The default is to calculate the mean over the entire time period, so you will end up with just 1 value for each grid cell. In this case you cannot calculate RMSEs over time anymore (i.e., only coords: [latitude, longitude] would work).

If you want annual means for multiple years instead (1 value per year for each grid cell), I think annual_statistics is the better choice. In this case the input for distance_metric would probably be coords: [year].

@axel-lauer
Contributor

Yes that's exactly what I mean:

    climate_statistics:
      period: full

This would be very useful, however, to compare annual means (e.g. for a benchmarking map plot)...

@schlunma
Contributor Author

Ah, so you want to get an RMSE for each grid cell that has been calculated from one time step? In that case, the RMSE equals the absolute bias, so it's probably better to use the bias preprocessor.
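
A quick check of this statement, as a minimal sketch with made-up numbers (the rmse helper below is written for this example and is not the code from this PR): with a single sample, the mean of the squared errors is just one squared error, so its square root is the absolute difference.

```python
import math


def rmse(values, reference):
    """Root mean square error between two equally long sequences."""
    squared_errors = [(v - r) ** 2 for v, r in zip(values, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# One grid cell, one time step (illustrative temperatures in K).
value, reference_value = 287.4, 288.9
single_step_rmse = rmse([value], [reference_value])
absolute_bias = abs(value - reference_value)
print(single_step_rmse, absolute_bias)
```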

I think in general it's not possible to allow calculations of distance metrics over scalar dimensions. For example, the Pearson correlation is (AFAIK) undefined for one single value (it boils down to 0/0). In addition, it's not as easy as removing the line cube.collapsed(coords, iris.analysis.MAX); you would also need to make sure that the axis parameter (for example in np.mean) is handled correctly.
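
The 0/0 case can be seen in a small sketch of the Pearson correlation (pearson_r is a helper written for this example, not the code from this PR). For a single value, both the covariance and the variances are zero, so the ratio is undefined; the sketch returns NaN in that case.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; NaN if a variance is zero."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    denominator = math.sqrt(variance_x * variance_y)
    if denominator == 0.0:
        # Covers the single-value case, where the ratio would be 0/0.
        return math.nan
    return covariance / denominator


print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # perfectly correlated
print(pearson_r([1.0], [2.0]))  # single value: undefined
```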

@axel-lauer
Contributor

I guess I have to rethink this a bit. That happens when I try to be clever...

Contributor

@valeriupredoi valeriupredoi left a comment

approval from me on a technical basis (Manu, pls fix merge conflicts), lemme know when Axel approves from testing/sci side, and will mergy-merge 🍺

Contributor

@axel-lauer axel-lauer left a comment

Looks good to me. My tests with histogram and distance_metric were successful and the results look fine. A few small wording suggestions for the documentation (please see additional comments).

schlunma and others added 2 commits May 3, 2024 13:46
@valeriupredoi
Contributor

@schlunma I fixed the merge conflict (not really something to write home about - just the weights func moved to shared but GH was confused) but also fixed one of the tests in 2bffd03 - it makes more sense as it is now, but do please have a looksee and if it's OK for you then I will merge 🍺

@schlunma schlunma force-pushed the distance_metric_preproc branch from 2bffd03 to cbb6680 Compare May 8, 2024 08:14
@valeriupredoi valeriupredoi merged commit cffb1e9 into main May 8, 2024
5 of 6 checks passed
@valeriupredoi valeriupredoi deleted the distance_metric_preproc branch May 8, 2024 08:45
Labels
preprocessor Related to the preprocessor
Development

Successfully merging this pull request may close these issues.

New preprocessor: Distance metrics dataset vs. reference
4 participants