Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transfers pipeline improvements #54

Merged
merged 47 commits into from
Apr 25, 2024
Merged

Transfers pipeline improvements #54

merged 47 commits into from
Apr 25, 2024

Conversation

Joshdpaul
Copy link
Contributor

This PR closes #39 , closes #38 , addresses #32, and closes #3.

Changes to model ensemble

The following models were added to the transfer pipeline:

  • E3SM-1-0, E3SM-1-1, E3SM-1-1-ECA
  • MPI-ESM1-2-HR (historical)
  • MRI-ESM2-0

and the following model was removed:

  • CESM2-WACCM

All files were successfully transferred by running the pipeline as described in the README, except the E3SM files. Though the E3SM models appear in the holdings, only the historical period is available on the LLNL ESGF2 globus node. Therefore the E3SM models did not have a variant chosen for the config.prod_variant_lu table and were not included in the batch files. We will need to keep #32 open for now and decide what to do about the incomplete data for the E3SM models.

Retry in ls utility

The utils.operation_ls() function was improved by incorporating a "retry" logic, to attempt 5 retries if the operation recieves an HTTP error. This seemed to fix the intermittent errors from issue #3 and resulted in a clean transfers run. By printing error statements from the holdings audit (see changes to esgf_holdings.get_filenames() function) and capturing the output in a text file (see the new esgf_holdings_output.txt) it should be possible to parse any persistent errors and potentially retry those specfic paths in a new transfer session. (This is somewhat related to issue #53 in that we may want to include the ability to audit/transfer smaller batches of files.) That output text file also prints errors for empty directories, which could be retried at a later date to check if more data has been uploaded to the globus node.

Allow multiple institutions & multiple grid types

The code was refactored to allow for including lists of institutions and grid types in config.py. Some functions were revised to return a list of dicts instead of just a dict, so that if multiple directories were encountered during ls operations, all contents of the directories could be audited. (Since there is no recursive ls function availble in the globus SDK, this was necessary in cases where we don't know which models have multiple grid types.) Using the new select_grids.ipynb notebook, we see that only two models (CNRM-CM6-1-HR and EC-Earth3-Veg) require more than one grid type in order to capture all frequency/variable combinations. Otherwise, models with more than one grid type have equal representation of frequency/variable combinations regardless of which grid type is chosen; in those cases, the native grid gn is always specified in the config.py.

In all cases where the config.model_inst_lu table is used, a conditional statement was added in order to allow for multiple institutions. This only applies to the MPI-ESM1-2-HR model, but the code now supports this situation in general without flagging this model explicitly. From what I can tell, these revisions to the transfers config.py should not affect the regridding or indicators pipelines since those use their own config.py files.

TO TEST:

I don't think you need to run the entire transfer, but you can if you really want to! What I'd really appreciate is a sanity check on the list logic in the esgf_holdings.py functions, and the conditional statements in generate_manifest.py, generate_batch_files.py, and tests/test_mirror.py. I think my testing confirms that this works for the cases we have identified, but are there other cases that could cause this to break or behave unexpectedly? Are there any obvious cases that are not captured by the conditional statements in esgf_holdings.get_filenames()?

Check out the new select_grids.ipynb to see the analysis. This notebook uses very similar logic and borrows some code from select_variants.ipynb.

Also, if there are any ideas to improve the printed error statements that end up in esgf_holdings_output.txt, let me know how that could be improved to make that output more useful in subsequent transfer attempts.

…l possibilities to avoid row_di reference before assignment error
…ptive ls error messages; handle multiple row returns in make_holdings_table() and check for non-dict rows; incorporate retry into operation_ls()
Copy link
Contributor

@kyleredilla kyleredilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all is looking good to me!! As you know I have made a few changes:

  • improved network error handling of the esgf_holdings.py - it seemed like we were really only getting 502 errors (at least I was) on directories that were in fact valid paths / could feasibly have data. I tried to make this more robust and add retry logic that, while it seems to work, can really stretch the execution time (exponential backoff). See also revision suggestion on the operation_ls function.
  • changed things in the config a bit to include E3SM models and added an explicit "models_of_interest" list and similar tweaks to relevant scripts that makes it clear "we are interested in some models. we want to download a subset of those based on results"
  • added a script esgf_holdings_e3sm.py for getting the holdings of E3SM only. This seemed like the less complex way of doing things at the time. it creates a CSV for E3SM holdings only.
  • updated the select_variants.ipynb based on updated holdings and knowledge of ensemble selection
  • update the holdings_summary.ipynb notebook based on updated holdings
  • reviewed made a couple of tweaks to the select_grids.ipynb notebook

Comment on lines 171 to 179
try:
r = tc.operation_ls(ep, path)
except globus_sdk.TransferAPIError as exc:
return exc.http_status

n = 5
for retry in range(n):
try:
r = tc.operation_ls(ep, path)
except globus_sdk.TransferAPIError as exc:
if retry < n - 1:
continue
else:
return exc.http_status

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we actually might want to keep this as is. I added an exponential time backoff to the call of this function should it return specific status. I think I was still getting errors even with the retries because this would retry immediately. I think it is kind of helpful to keep this function simply "try it once, and return the error code if it fails" and we can add wrappers for the retries like I did in esgf_holdings.py. Open to discussing though.

Suggested change
try:
r = tc.operation_ls(ep, path)
except globus_sdk.TransferAPIError as exc:
return exc.http_status
n = 5
for retry in range(n):
try:
r = tc.operation_ls(ep, path)
except globus_sdk.TransferAPIError as exc:
if retry < n - 1:
continue
else:
return exc.http_status
try:
r = tc.operation_ls(ep, path)
except globus_sdk.TransferAPIError as exc:
return exc.http_status

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revised in c5f3c2b

@Joshdpaul Joshdpaul merged commit 979b06b into main Apr 25, 2024
@Joshdpaul Joshdpaul deleted the add_models branch April 25, 2024 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Get MRI-ESM2-0 data! Missing historical MPI-ESM1-2-HR data Intermittent errors in ESGF holdings audit
2 participants