Fix multiprocessing and consolidate QC #68
Conversation
I like the idea of folding all of the QC work into a single spot, and this looks great overall! I ran into a few small issues along the way though:
- While running this Prefect flow using the `v1_config` branch of the prefect repo, it complained that the `--qc_notebook` option was missing from the QC command and quit/stalled out. I replaced the system call here https://github.com/ua-snap/prefect/blob/58843c3f0070bd8ef9094326d826b58e92136bfa/regridding/regridding_functions.py#L394 with the following and it seemed to work after that:

  ```
  f"export PATH=$PATH:/opt/slurm-22.05.4/bin:/opt/slurm-22.05.4/sbin:$HOME/miniconda3/bin && python {run_qc_script} --qc_notebook '{visual_qc_notebook}' --conda_init_script '{conda_init_script}' --conda_env_name '{conda_env_name}' --cmip6_directory '{cmip6_directory}' --output_directory '{output_directory}' --repo_regridding_directory '{repo_regridding_directory}' --vars '{vars}' --freqs '{freqs}' --models '{models}' --scenarios '{scenarios}'"
  ```

- The QC notebook complained that `error_file` was not defined in a couple of places. See PR code review comments.

- After removing the `error_file` references so the QC notebook could run to completion, the random src vs. regrid files it chose to inspect produced the following error (a sketch of the kind of lookup involved follows this comment):

  ```
  AssertionError: No files found for regridded file clt_Amon_MPI-ESM1-2-HR_historical_regrid_196201-196212.nc in /beegfs/CMIP6/arctic-cmip6/CMIP6/CMIP/DKRZ/MPI-ESM1-2-HR/historical with */Amon/clt/*/*/clt_Amon_MPI-ESM1-2-HR_historical_*.nc.
  ```

  My second run chose a different set of random files & succeeded, however.

And it looks great!! Once those small issues are fixed, I think this is good to merge.
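Editor's note: a minimal sketch of the kind of glob-based source-file lookup that would produce the AssertionError quoted above. This is not the actual `qc.ipynb` code; the function name and arguments are hypothetical, but it shows why a different random draw of regridded files can pass.

```python
from pathlib import Path


def find_source_files(regrid_name: str, search_dir: Path) -> list[Path]:
    """Locate the source files that correspond to a regridded file name (sketch only)."""
    # e.g. "clt_Amon_MPI-ESM1-2-HR_historical_regrid_196201-196212.nc"
    var_id, table_id, model, scenario = regrid_name.split("_")[:4]
    pattern = f"*/{table_id}/{var_id}/*/*/{var_id}_{table_id}_{model}_{scenario}_*.nc"
    matches = sorted(search_dir.glob(pattern))
    # Because the QC notebook inspects a random subset of regridded files,
    # a draw that includes a file with no matching source trips this assert,
    # while a different draw (as in the second run) can succeed.
    assert matches, (
        f"No files found for regridded file {regrid_name} "
        f"in {search_dir} with {pattern}."
    )
    return matches
```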
regridding/qc.ipynb (Outdated)
"print(f\"QC process complete: {error_count} errors found.\")\n", | ||
"if len(ds_errors) > 0:\n", | ||
" print(\n", | ||
" f\"Errors in opening some datasets. {len(ds_errors)} files could not be opened. See {str(error_file)} for error log.\"\n", |
The QC notebook complained that `error_file` was not defined here and execution stopped. I got around this temporarily just by removing the reference to `error_file` here.
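Editor's note: one possible fix, shown as a hedged sketch only. It assumes the notebook collects problems into lists like `ds_errors` (which does appear in the quoted cell); the output directory and log file names below are assumptions, not the notebook's actual values. The idea is simply to define `error_file` once before any reporting cell references it.

```python
# Hypothetical fix sketch: define error_file before the cells that reference it.
# Directory and file names here are assumptions, not the notebook's actual values.
from pathlib import Path

output_directory = Path("qc_output")  # wherever this QC run writes its results
output_directory.mkdir(parents=True, exist_ok=True)
error_file = output_directory / "qc_error_log.txt"

ds_errors: list[str] = []  # populated earlier by the dataset-opening checks
if len(ds_errors) > 0:
    error_file.write_text("\n".join(ds_errors))
    print(
        f"Errors in opening some datasets. {len(ds_errors)} files could not "
        f"be opened. See {str(error_file)} for error log."
    )
```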
regridding/qc.ipynb (Outdated)
" )\n", | ||
"if len(value_errors) > 0:\n", | ||
" print(\n", | ||
" f\"Errors in dataset values. {len(value_errors)} files have regridded values outside of source file range. See {str(error_file)} for error log.\"\n", |
Same as my above comment, `error_file` was not defined here either.
OK - lots of updates since this was last reviewed! I am sorry this became a hero branch. This PR now generalizes the regridding pipeline so that it may be used to regrid our CMIP6 holdings to different grids.

This branch started out as an attempt to improve the resilience of some scripts against getting stuck / hanging while calling the multiprocessing library for parallel processing of netCDF files. There are some changes included which might help with this, but the issue still happens intermittently. By far the biggest help is breaking the processing down into smaller chunks, which is accomplished on the orchestration (prefect) side of things, e.g. running flows with fewer variables for daily resolutions (see the sketch at the end of this comment). We decided to continue with this branch for development of other features, and it grew and grew.

Here is a summary of the changes:
There have been some additional changes to transfers included in this branch:
To test:
Make sure to use the …

Note - there are still some weird things going on with some of the land/sea variables. For example, using the "conservative" interpolation method for …

Closes #48
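Editor's note: as a rough illustration of the "smaller chunks" strategy mentioned above, here is a hedged sketch of splitting a variable list into groups so each flow run dispatches fewer files. The variable names (apart from `clt`, which appears in the QC error above) and the per-group run call are illustrative, not the actual prefect flow code.

```python
from typing import Iterator


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield successive groups of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i : i + size]


# Hypothetical variable list for one run configuration.
v1_2_vars = ["clt", "evspsbl", "hfls", "hfss", "psl", "rlds", "rsds"]

for var_group in chunked(v1_2_vars, size=2):
    # In practice this would kick off one prefect flow run per group, e.g. for
    # the daily frequencies that hang most often when given too many files.
    print(f"would run regridding flow with vars={var_group}")
```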
@kyleredilla Happy to dig into this hero branch! 🦸🏻 The comments below concern both this PR and the corresponding PR in the prefect repo.

I was able to successfully run the regridding for all V1 variables (from the …). I tried something new, which was to run the deployment using parameters in the JSON files via the command line, like so: …
The individual …

Anyways, the coarse resolution regridding pipeline finished with no hangups, and the QC files look good to me with the exception of the weird vertical lines you mentioned.

For the 4km regridding, I had to comment out everything to do with the … Otherwise, the 4km regridding runs fine and the QC looks good. I ran it for …
Awesome, thanks @Joshdpaul! I just dropped that unused function from the 4km flow for now. We can add something back in later if we see fit.
This all looks good to me! I ran a handful of jobs through Prefect using this branch of cmip6-utils, including various combinations of:
- v1.1 and v1.2 run configurations
- common grid and 4km grid
- clobber and no clobber
Most jobs succeeded and produced the expected output files. I looked through the QC notebook output and that looked good too 👍
I did run into a couple of cases of processes hanging along the way, usually (maybe always) at the `generate_batch_files.slurm` step, with the Slurm job ultimately timing out. Most jobs did succeed, however, so merging this branch as-is sounds like a good move, especially since there seems to be no obvious way to fix the hanging processes.
The initial goal of this PR was simply to improve the reliability of using multiprocessing to scrape the grid info from files in the batch file generation step of the regridding pipeline, and in the QC section that sanity-checks all regridded files. The behavior we consistently see when using multiprocessing (or the concurrent.futures paradigm) is that whatever method is used to dispatch a function over many netCDF files can sometimes hang indefinitely. This PR takes some steps to improve that, but total reliability here seems to be out of scope for now. Simply processing fewer files by breaking things up into smaller groups, as we have begun doing for the prefect flows (with the v1_1 and v1_2 variables, etc.), is a sane way forward for now.

The QC step that was checking every single new file has also been changed to only check a random subset of the files, which should help with the hanging symptoms. You will notice that the quality control step that was originally outside of the QC notebook has been moved into that notebook, so that we only have one QC product to evaluate following a flow run.
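Editor's note: a minimal sketch of the random-subset QC idea described above, with a per-result timeout as one way to keep a stuck worker from hanging the whole run. The `check_file` body, paths, worker count, and sample size are hypothetical stand-ins for the real QC logic, not the code in this PR.

```python
import random
from concurrent.futures import ProcessPoolExecutor, TimeoutError
from pathlib import Path


def check_file(path: Path) -> str:
    # Placeholder for the real check (open the dataset, compare regridded
    # values against the source file range, etc.)
    return f"{path.name}: ok"


if __name__ == "__main__":
    regridded_files = sorted(Path("regrid_output").rglob("*.nc"))
    # Only inspect a random subset, mirroring the change described above.
    sample = random.sample(regridded_files, k=min(50, len(regridded_files)))

    errors = []
    with ProcessPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(check_file, fp): fp for fp in sample}
        for future, fp in futures.items():
            try:
                # A per-result timeout turns an indefinite hang into a
                # reportable error instead of a stalled Slurm job.
                print(future.result(timeout=120))
            except TimeoutError:
                errors.append(f"{fp.name}: timed out")

    print(f"QC subset complete: {len(errors)} problems found.")
```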
To test, simply run a regridding flow using prefect, probably for a subset of variables and frequencies (such as monthly v1_2), and check out the quality control notebook.
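Editor's note: as a hedged starting point only, kicking off such a test run could look roughly like the sketch below, using the Prefect 2 `run_deployment` API with parameters loaded from a JSON config file. The deployment name and parameter file are hypothetical, not the repo's actual configuration.

```python
import json
from pathlib import Path

from prefect.deployments import run_deployment

# Parameters could come from one of the JSON run-config files mentioned above
# (hypothetical file name).
params = json.loads(Path("v1_2_monthly_params.json").read_text())

flow_run = run_deployment(
    name="regrid-cmip6/v1-2-monthly",  # hypothetical "<flow>/<deployment>" name
    parameters=params,
)
print(flow_run.id)
```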