Rename, fix, and extend NAWQA (NWQN) demo #153
Conversation
Cool stuff! I was able to run the scripts by setting lithops to the LocalhostExecutor, and I used the small "testing" site list. I successfully downloaded several parquet files, though to be honest I haven't gotten to re-opening them and understanding how they're structured.
It would be nice overall to see more documentation of the code lines and different functions. Though I was able to (mostly) figure out what the code does, this is my first exposure to some of these functions, and I could've gotten to the point faster with more narrative on what was going on (for example, I've never seen the exponential_backoff method used to improve API call handling, and I'd like to know more about the mapping functionality in lithops... is it any more complicated than an "apply" function?).
Do you plan to "fill in" water quality values in some way, similar to streamflow?
Overall, very cool example of using dataretrieval-python with larger data calls.
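To the lithops mapping question above: conceptually, `map` is an "apply" over an iterable, except each call can run in its own worker (a serverless function, or a local process with the localhost backend). A minimal sketch, not taken from the demo; `retrieve_site` and the site list are hypothetical stand-ins for the demo's per-site download function:

```python
import lithops

def retrieve_site(site_no):
    # hypothetical stand-in for the demo's per-site retrieval function
    return {"site_no": site_no, "status": "ok"}

if __name__ == "__main__":
    sites = ["01170070", "01170100"]  # a small "testing" list

    # backend="localhost" mirrors running with the LocalhostExecutor
    fexec = lithops.FunctionExecutor(backend="localhost")
    fexec.map(retrieve_site, sites)   # one invocation per site, like a parallel apply
    results = fexec.get_result()      # gather the returned values
    print(results)
```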
demos/nwqn_data_pull/README.md (outdated)

-This examples walks through using lithops to retrieve data from every NAWQA
+This examples walks through using lithops to retrieve data from every NWQN
Suggested change:
-This examples walks through using lithops to retrieve data from every NWQN
+This example walks through using lithops to retrieve data from every NWQN
demos/nwqn_data_pull/README.md (outdated)

-This examples walks through using lithops to retrieve data from every NAWQA
+This examples walks through using lithops to retrieve data from every NWQN
 monitoring site, then writes the results to a parquet files on s3. Each
Suggested change:
-monitoring site, then writes the results to a parquet files on s3. Each
+monitoring site, then writes the results to a parquet file on s3. Each
-python retrieve_nawqa_with_lithops.py
+python retrieve_nwqn_samples.py
+
+python retrieve_nwqn_streamflow.py
 ```

 ## Cleaning up
Small typo: lithops
☝️
@@ -32,9 +34,11 @@ wget https://www.sciencebase.gov/catalog/file/get/655d2063d34ee4b6e05cc9e6?f=__d
 export DESTINATION_BUCKET=<path/to/bucket>
 ```

 1. Run the script
I can't seem to comment on unchanged lines, but this refers to line 27: I didn't know I needed to install wget (either through a system package manager or via pip) before downloading the sciencebase data using that method. Add a note about it, perhaps.
Hmm. That will be system-dependent, but I noted that alternatively you can navigate to the url to download the file.
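For readers without wget, a standard-library sketch of the "navigate to the url" alternative; the URL below is a placeholder because the full query string is truncated in the diff, and the output filename is only illustrative:

```python
import urllib.request

# Placeholder: copy the full ScienceBase link from the README;
# the query string is truncated in the diff shown above.
SCIENCEBASE_URL = "https://www.sciencebase.gov/catalog/file/get/<item-id>?f=<file>"

# Save the file locally; the filename here is illustrative.
urllib.request.urlretrieve(SCIENCEBASE_URL, "nwqn_site_list.zip")
```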
attempts += 1
if attempts > max_retries:
    raise e
wait_time = base_delay * (2 ** attempts)
I think I follow to this point: are you making it so that with every failed attempt, the wait time increases exponentially between attempts (until max_retries is satisfied)? Might be helpful to add a comment here.
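For readers who want the full shape of the pattern being discussed, here is a minimal, self-contained sketch of exponential backoff around a flaky call; the names and defaults are illustrative, not the demo's exact implementation:

```python
import time

def call_with_backoff(func, max_retries=5, base_delay=1.0):
    """Retry func(), doubling the wait after each failed attempt."""
    attempts = 0
    while True:
        try:
            return func()
        except Exception as e:
            attempts += 1
            if attempts > max_retries:
                raise e  # give up once max_retries is exceeded
            # wait 2, 4, 8, ... times base_delay seconds between attempts
            wait_time = base_delay * (2 ** attempts)
            time.sleep(wait_time)
```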
# compute fraction of drainage area
site_info = site_info[["drain_area_va"]].copy()
site_info["drain_fraction"] = site_info["drain_area_va"] / main_site_drainage_area
site_info["fraction_diff"] = np.abs(1 - site_info["drain_fraction"])
Do you have any cutoff(s) for drain fraction?
Just that imposed by the neighborhood search.
output = pd.DataFrame()

# loop through sites, updating the best data first
"best data" meaning the site with the most similar drainage area?
yes, revised
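Reading the snippets above, the "best" ordering appears to come from sorting neighbors by fraction_diff, i.e. by how close each neighbor's drainage area is to the main site's. A sketch of that idea, assuming site_info is indexed by site number (the column names come from the diff; the helper itself is hypothetical):

```python
import numpy as np
import pandas as pd

def rank_fill_sites(site_info: pd.DataFrame, main_site_drainage_area: float) -> list:
    # smallest fraction_diff first: the neighbor whose drainage area is
    # most similar to the main site's is used to fill gaps first
    site_info = site_info[["drain_area_va"]].copy()
    site_info["drain_fraction"] = site_info["drain_area_va"] / main_site_drainage_area
    site_info["fraction_diff"] = np.abs(1 - site_info["drain_fraction"])
    return site_info.sort_values("fraction_diff").index.tolist()
```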
output = update_dataframe(output, fill_data)

output = output.drop(columns=["drain_area_va", "drain_fraction", "fraction_diff"])
output["site_no"] = site
Should output include metadata indicating when data are from a different basin?
Not necessary for the demo.
output = output.drop(columns=["drain_area_va", "drain_fraction", "fraction_diff"])
output["site_no"] = site
I tried running this function line-by-line for site '01170095', which had 3 nearby sites: ['01170070', '01170095', '01170100', '01170103']. It appears that the only site with data was '01170070', so that site was wholly used to fill in for site '01170095'. Is this the desired outcome?
yes
# loop through sites, updating the best data first
for fill_site in fill_order:
    fill_data = df.loc[fill_site]
    output = update_dataframe(output, fill_data)
Snazzy!!
Fixing a bug in update_dataframe(). Bots wrote most of this code, so it might take a few iterations to find the bugs.
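update_dataframe itself isn't shown in this diff; one plausible reading of its intent, as a hedged sketch, is a combine_first style merge in which values already placed in output (from better-ranked sites) win and fill_data only fills the remaining gaps:

```python
import pandas as pd

def update_dataframe(output: pd.DataFrame, fill_data: pd.DataFrame) -> pd.DataFrame:
    # existing (better-ranked) values in `output` take precedence;
    # `fill_data` only supplies rows/values that are still missing
    if output.empty:
        return fill_data.copy()
    return output.combine_first(fill_data)
```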
That process is similar, except we don't rescale by drainage area. For the moment, I think the demo is a successful proof-of-concept for using the NLDI to aggregate data. In practice, we would want several more steps, but I have no plans to implement them at the moment. Otherwise, this is more or less working as intended and I've addressed the review comments.
Looks good. Thanks!
The National Water Quality Network (NWQN) demo uses AWS serverless to search and pull all NWQN data into an S3 bucket. This PR makes some fixes to the demo and incorporates streamflow.
For context, this is an advanced usage example, which does not currently appear in the doc page. Nevertheless, I host it in the repo for instructing others, but also to help us scope development of `dataretrieval` more generally. These pipelines stress several endpoints and help us expose failure modes that appear when we scale up our workflows.
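For a sense of the pattern the demo exercises, here is a condensed, hypothetical sketch of one such pipeline: lithops fans out one dataretrieval call per site, and each worker writes a parquet file to S3. The bucket name, parameter code, date range, and site list are placeholders, and writing parquet to s3:// paths assumes s3fs is installed; this is not the demo's actual script.

```python
import lithops
import dataretrieval.nwis as nwis

DESTINATION_BUCKET = "s3://example-bucket/nwqn-demo"  # placeholder

def pull_streamflow(site_no):
    # daily streamflow (NWIS parameter code 00060) for one site
    df, _ = nwis.get_dv(sites=site_no, parameterCd="00060",
                        start="1991-01-01", end="2023-12-31")
    df.to_parquet(f"{DESTINATION_BUCKET}/streamflow/{site_no}.parquet")
    return site_no

if __name__ == "__main__":
    sites = ["01170070", "01170100"]  # placeholder site list
    fexec = lithops.FunctionExecutor()  # AWS Lambda or localhost, per lithops config
    fexec.map(pull_streamflow, sites)
    fexec.get_result()
```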