Fix multiprocessing and consolidate QC #68

Merged · 80 commits merged into main on Feb 19, 2025

Conversation

kyleredilla
Contributor

The initial goal of this PR was simply to improve the reliability of using multiprocessing to scrape grid info from files in the batch file generation step of the regridding pipeline, and in the QC section that sanity-checks all regridded files. The behavior we consistently see with multiprocessing (or the concurrent.futures paradigm) is that whatever method is used to dispatch a function over multiple netCDF files can sometimes hang indefinitely. This PR takes some steps to improve that, but total reliability here seems to be out of scope for now. Simply processing fewer files by breaking things into smaller groups, as we have begun doing for the prefect flows (with v1_1 and v1_2 variables, etc.), is a sane way forward for now. The QC step that was checking every single new file has also been changed to check only a random subset of the files, which should help with the hanging symptoms. You will notice that the quality control step that was originally outside of the QC notebook has been moved into that notebook, so that we have only one QC product to evaluate following a flow run.
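To illustrate the random-subset approach, here is a minimal sketch of dispatching a QC check over a sample of files with concurrent.futures; the names (qc_random_subset, check_fn, sample_size) are hypothetical and this is not the code used in this PR:

    import random
    from concurrent.futures import ProcessPoolExecutor, as_completed

    def qc_random_subset(regridded_files, check_fn, sample_size=100, max_workers=8, timeout=600):
        """Run check_fn over a random subset of regridded files in parallel."""
        files = list(regridded_files)
        sample = random.sample(files, min(sample_size, len(files)))
        results = {}
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(check_fn, fp): fp for fp in sample}
            # as_completed() raises TimeoutError if all results are not available within
            # `timeout` seconds, so a hung worker surfaces as an error instead of
            # blocking the QC step forever.
            for future in as_completed(futures, timeout=timeout):
                results[futures[future]] = future.result()
        return results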

To test, simply run a regridding flow using prefect, probably for a subset of variables and frequencies (such as monthly v1_2), and check out the quality control notebook.

@kyleredilla kyleredilla requested a review from cstephen October 2, 2024 17:01
@kyleredilla kyleredilla changed the title Fix multiprocessing Fix multiprocessing and consolidate QC Oct 2, 2024
Contributor

@cstephen cstephen left a comment

I like the idea of folding all of the QC work into a single spot, and this looks great overall! I ran into a few small issues along the way though:

  1. While running this Prefect flow using the v1_config branch of the prefect repo, it complained that the --qc_notebook option was missing from the QC command and quit/stalled out. I replaced the system call here https://github.com/ua-snap/prefect/blob/58843c3f0070bd8ef9094326d826b58e92136bfa/regridding/regridding_functions.py#L394 with the following and it seemed to work after that:

    f"export PATH=$PATH:/opt/slurm-22.05.4/bin:/opt/slurm-22.05.4/sbin:$HOME/miniconda3/bin && python {run_qc_script} --qc_notebook '{visual_qc_notebook}' --conda_init_script '{conda_init_script}' --conda_env_name '{conda_env_name}' --cmip6_directory '{cmip6_directory}' --output_directory '{output_directory}' --repo_regridding_directory '{repo_regridding_directory}' --vars '{vars}' --freqs '{freqs}' --models '{models}' --scenarios '{scenarios}'"
    
  2. The QC notebook complained that error_file was not defined in a couple places. See PR code review comments.

  3. After removing the error_file references to run the QC notebook to completion, the random src vs. regrid files it chose to inspect produced the following error (see the file-lookup sketch after this list):

    AssertionError: No files found for regridded file clt_Amon_MPI-ESM1-2-HR_historical_regrid_196201-196212.nc in /beegfs/CMIP6/arctic-cmip6/CMIP6/CMIP/DKRZ/MPI-ESM1-2-HR/historical with */Amon/clt/*/*/clt_Amon_MPI-ESM1-2-HR_historical_*.nc.
    

    My second run chose a different set of random files & succeeded, however. And it looks great!! Once those small issues are fixed, I think this is good to merge.
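
A hedged reconstruction of the kind of glob-based lookup that can raise the assertion in item 3; the function name and signature are assumptions, not the notebook's actual code:

    from pathlib import Path

    def find_source_files(cmip6_directory, glob_pattern):
        # Search the source tree for files matching the pattern built from the
        # regridded file's name; the assertion mirrors the error message above.
        matches = sorted(Path(cmip6_directory).glob(glob_pattern))
        assert len(matches) > 0, (
            f"No files found in {cmip6_directory} with {glob_pattern}."
        )
        return matches

    # The failing case from the error message corresponds roughly to:
    # find_source_files(
    #     "/beegfs/CMIP6/arctic-cmip6/CMIP6/CMIP/DKRZ/MPI-ESM1-2-HR/historical",
    #     "*/Amon/clt/*/*/clt_Amon_MPI-ESM1-2-HR_historical_*.nc",
    # )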

"print(f\"QC process complete: {error_count} errors found.\")\n",
"if len(ds_errors) > 0:\n",
" print(\n",
" f\"Errors in opening some datasets. {len(ds_errors)} files could not be opened. See {str(error_file)} for error log.\"\n",
Contributor

The QC notebook complained that error_file was not defined here and execution stopped. I got around this temporarily just by removing the reference to error_file here.

" )\n",
"if len(value_errors) > 0:\n",
" print(\n",
" f\"Errors in dataset values. {len(value_errors)} files have regridded values outside of source file range. See {str(error_file)} for error log.\"\n",
Contributor

Same as my above comment, error_file was not defined here either.
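
For reference, a minimal sketch of one way error_file could be defined before these summary cells, assuming the notebook already has output_directory, ds_errors, and value_errors in scope; the filename here is hypothetical:

    from pathlib import Path

    # Hypothetical definition; the actual path and name would be whatever the notebook intends.
    error_file = Path(output_directory) / "qc_error_log.txt"

    # Write any collected errors so the summary messages have a log to point at.
    with open(error_file, "w") as f:
        for err in ds_errors + value_errors:
            f.write(f"{err}\n")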

@kyleredilla
Contributor Author

OK - lots of updates since this was last reviewed! I am sorry this became a hero branch.

This PR now generalizes the regridding pipeline so that it can be used to regrid our cmip6 holdings to different grids. The branch started out as an attempt to improve the resilience of some scripts against getting stuck/hanging while calling the multiprocessing library for parallel processing of netCDF files. Some changes included here might help with that, but the issue still happens intermittently. By far the biggest help is breaking the processing down into smaller chunks, which is accomplished on the orchestration (prefect) side of things, e.g. running flows with fewer variables for daily resolutions. We decided to continue with this branch for development of other features, and it grew and grew. Here is a summary of the changes:

  • Adds some changes intended to help with multiprocessing hanging.
  • Adds functionality for regridding land/sea-only variables.
  • We now rely on the xESMF regridder to crop each dataset to the extent of the destination file provided, instead of manually cropping every single dataset (see the sketch after this list). This makes things a bit simpler in the regridding sense - we match the grid of the target file exactly, including extent. This means that the destination file should have the desired extent.
  • Cleaned up and standardized some documentation.
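
A minimal sketch of the extent-matching idea with xESMF (the paths here are placeholders, not the pipeline's actual code):

    import xarray as xr
    import xesmf as xe

    # The target grid file defines both the grid and the output extent.
    ds_src = xr.open_dataset("src_file.nc")        # placeholder path
    ds_target = xr.open_dataset("target_grid.nc")  # placeholder path

    # xESMF builds regridding weights only for the target grid cells, so the output
    # matches the target file's grid and extent exactly; no manual cropping is needed.
    regridder = xe.Regridder(ds_src, ds_target, method="bilinear")
    ds_regridded = regridder(ds_src)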

There have also been some changes to the transfers pipeline included in this branch:

  • A script to remove older versions of duplicate files (version in the strict CMIP6 sense, as found in the native ESGF directory structure).
  • Found that the set of all instances (i.e. all models/scenarios) of prsn in the Omon table_id is a subset of what is available under the Amon table_id. The Omon data appears to be on an ocean grid and thus eliminates prsn over the sea, so prsn is not truly a land/sea variable. Omon was therefore removed as a possible table ID in the config of the transfers pipeline, and all Omon prsn data was removed from the manifest files and the main CMIP6 ACDN filesystem.

To test:

  • Run the regrid_cmip6_common.py flow using prefect as you see fit, using the generalize_regridding branch (see Generalize cmip6 regridding flow prefect#51). At a minimum, run the regridding for the siconc and snw variables using the v1_landsea_config.json file, as this includes some additional land/sea variables that can now be regridded. See the prefect README for using these config files. Make sure the flow runs and the generated QC file looks good.
  • Run the regrid_cmip6_4km.py flow using prefect. Do this for at least one of tasmax, tasmin, and pr, at daily frequency.

Make sure to use the fix_multiprocessing branch of the cmip6-utils repo in the prefect arguments.

Note - there are still some weird things going on with some of the land/sea variables. For example, using the "conservative" interpolation method for siconc generates some strange behavior and may cause QC errors in the value checks, and using the bilinear method causes vertical stripes of nodata at the meridians.
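
For context, a hypothetical illustration of the kind of value check that would flag those artifacts (the function name and paths are assumptions, not the QC notebook's actual code):

    import xarray as xr

    def regridded_values_in_source_range(src_path, regrid_path, var):
        # Flag regridded values that fall outside the source file's min/max range,
        # which is how artifacts like the conservative-method oddities for siconc
        # could surface in the value checks.
        with xr.open_dataset(src_path) as src, xr.open_dataset(regrid_path) as rg:
            src_min, src_max = float(src[var].min()), float(src[var].max())
            rg_min, rg_max = float(rg[var].min()), float(rg[var].max())
        return src_min <= rg_min and rg_max <= src_max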

Closes #48

@Joshdpaul
Contributor

@kyleredilla Happy to dig into this hero branch! 🦸🏻 The comments below concern both this PR and this corresponding PR in the prefect repo.

I was able to successfully run the regridding for all V1 variables (from the v1_1_config.json, v1_2_config.json, and v1_landsea_config.json in the prefect repo) by using the cmip6-utils environment. As discussed in Slack, the QC step was failing when using the snap-geo environment.

I tried something new, which was to run the deployment using parameters in the JSON files via command line like so:

cat v1_2_config.json | prefect deployment run 'regrid-cmip6-common/regrid-cmip6-common' --param ssh_username='jdpaul3' --param ssh_private_key_path='/Users/joshpaul/.ssh/id_rsa' --param branch_name='fix_multiprocessing' --param scratch_directory='/beegfs/CMIP6/jdpaul3' --param conda_env_name='cmip6-utils' --params -

The individual --param arguments override the JSON parameters, so this was an interesting way to set my environment variables. Maybe this will be useful in other ways in the future, if we want to avoid the UI for starting specific flow runs?

Anyways, the coarse resolution regridding pipeline finished with no hangups, and the QC files look good to me with the exception of the weird vertical lines you mentioned:
[image: QC plot showing the vertical nodata stripes]
I really don't know what to make of that! Might need to create another issue to specifically look at the files that are turning out that way, since it only seems to happen on some of them.

For the 4km regridding, I had to comment out everything to do with create_target_grid_file() in regrid_4km_cmip6.py in the prefect repo in order to avoid the xarray import error. As discussed in Slack, this occurs because my conda base environment on Chinook lacks the xarray package. At the moment create_target_grid_file() is just a placeholder anyways, so it's up to you how you'd like to proceed (keep it commented out, delete it, add a conda block here too, etc.), but there needs to be a change here so that the flow can start from the standard base environment.

Otherwise, the 4km regridding runs fine and the QC looks good. I ran it for pr on a daily frequency, and the regridding + reprojection results appear reasonable on visual inspection. So this part is also approved contingent on the changes mentioned above.

@kyleredilla
Contributor Author

Awesome, thanks @Joshdpaul! I just dropped that unused function from the 4km flow for now. We can add something back in later if we see fit.
@cstephen, as Josh noted, you will probably have to use the cmip6-utils conda env to run this pipeline for the time being. That may have been your original thought anyway, because that was the whole purpose of a repo-based env! But we have collectively discussed using a shared env for processing pipelines (i.e. snap-geo) because we typically use the same tools, and there has maybe been more headache with the individual conda envs than expected. It should work here, though!

Contributor

@cstephen cstephen left a comment

This all looks good to me! I ran a handful of jobs through Prefect using this branch of cmip6-utils, including various combinations of:

  • v1.1 and v1.2 run configurations
  • common grid and 4km grid
  • clobber and no clobber

Most jobs succeeded and produced the expected output files. I looked through the QC notebook output and that looked good too 👍

I did run into a couple cases of processes hanging along the way, usually (maybe always) at the generate_batch_files.slurm step, with the Slurm job ultimately timing out. Most jobs did succeed, however, so merging this branch as-is sounds like a good move, especially since there seems to be no obvious way to fix the hanging processes.

@kyleredilla kyleredilla merged commit 63dad3d into main Feb 19, 2025
@kyleredilla kyleredilla deleted the fix_multiprocessing branch February 19, 2025 00:45