
Bug harvesting data from WDC for services other than NWISUV & NWISDV #2417

Closed
emiliom opened this issue Oct 23, 2017 · 19 comments
Assignees
Labels
+ BiG CZ November demo BigCZ bug WPF Funding Source: William Penn Foundation

Comments

@emiliom
Contributor

emiliom commented Oct 23, 2017

At our last call @rajadain mentioned something wrong with the GHCN service available via WDC, such that the data are not plotted in the detailed view. I've done some digging and testing on that service and others. I also discussed this with @aufdenkampe on Thursday, before I had done more thorough testing.

I think the bug is in the portal code itself, and I suspect it has to do with trying to query data from services that don't include recent data (the last 1-3 months, the default window used by the BiG CZ portal).

The problem is not just with GHCN. No service other than NWISUV & NWISDV seems to work, not even NWISGW. I've taken two sites identified via the portal (one from GHCN and one from CoCoRaHS -- note these may be temporary links), and tried them on the CUAHSI/WDC HydroClient. In both cases, the sites worked fine there.

If confirmed, this is an important bug. WDC data access that's limited to NWISUV & NWISDV is a very constrained implementation.

cc @ajrobbins

@rajadain
Member

Thanks for the investigation and examples @emiliom, this is a great catch. I'll take a look when I get the chance.

@rajadain rajadain added in progress and removed 1 labels Oct 29, 2017
@rajadain rajadain self-assigned this Oct 29, 2017
@rajadain rajadain added the + label Oct 29, 2017
@rajadain
Member

Hi @emiliom, I just made a gist demoing the error we get for CoCoRaHS values: https://gist.github.com/rajadain/22b0bc5546fcd0dfdc0dba874bb31698. In this gist, I first do a GetSeriesCatalogForBox2 search, then find a CoCoRaHS series record, then fetch its date range using ulmo's site info call, and finally fetch values using ulmo's values call, which is where the failure happens.

Looking at the error output it seems to be related to how the return values are being parsed. Do you have any insight into this?
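For reference, the last two steps look roughly like this with ulmo (a sketch; the WSDL endpoint, site code, and variable code below are placeholders, not the actual values from the gist):

import ulmo

wsdl_url = 'http://example.org/cocorahs/cuahsi_1_1.asmx?WSDL'  # placeholder endpoint
site_code = 'CoCoRaHS:XX-YY-1'                                 # placeholder site code
variable_code = 'CoCoRaHS:PRCPS'                               # placeholder variable code

# GetSiteInfo: the series catalog here reports a begin/end date per variable,
# which is what we use to pick the date range for the values request
site_info = ulmo.cuahsi.wof.get_site_info(wsdl_url, site_code)

# GetValues: this is the call that fails for the CoCoRaHS series
values = ulmo.cuahsi.wof.get_values(wsdl_url, site_code, variable_code,
                                    start='2016-11-30', end='2016-12-31')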

@emiliom
Contributor Author

emiliom commented Oct 29, 2017

@rajadain I'm looking into it. I'll try to follow up today; otherwise, definitely by early Monday.

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

I've found that the date range you're searching (2016-11-30 to 2016-12-31) doesn't actually have any data, despite GetSiteInfo reporting that the time series ends on 2016-12-31 for the variable you're targeting (PRCPS). In fact, after some digging, it looks like the last time step actually available is 2016-04-30.

ulmo is not failing gracefully in this situation. That's somewhat understandable: the server should return a formal, informative exception (in XML) rather than a normal-looking time series response with no time series values (and no corresponding value elements) at all!
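Until that's fixed on one side or the other, a minimal defensive sketch on the client (reusing the placeholder names from the snippet above, and assuming ulmo raises on this empty-but-normal-looking response) would be to treat the parse failure as "no data in this window":

import ulmo

try:
    resp = ulmo.cuahsi.wof.get_values(wsdl_url, site_code, variable_code,
                                      start=from_date, end=to_date)
    series = resp.get('values', [])
except Exception:
    # an unparseable (empty) response is treated the same as "no data returned"
    series = []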

I have no clue how widespread this problem is within the CoCoRaHS service or others. But obviously in this particular case the problem is not on your end: you're applying correct logic, issuing date range parameters based on the metadata extracted from GetSiteInfo.

There's nothing constructive we can do at this time that I can think of, except maybe changing the default date interval harvested initially to one year. Let's keep in mind that we've tended to assume (biased by often focusing on NWISUV) that all data sets are fairly high frequency, such as hourly, but many datasets are likely to be daily. The volume of data is not a function solely of the date range, but also of the data time step. Even a year of hourly data might not be a very large response for a single variable. But we'll only find out by trial and error.
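Just to put rough numbers on that (illustrative only, for a single variable):

# expected value counts as a function of time step and request window
hourly_year = 24 * 365   # 8,760 values for a year of hourly data
daily_year = 365         # 365 values for a year of daily data
hourly_month = 24 * 30   # 720 values for the current one-month default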

I'll give it more thought on Monday ...

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

I should add that many datasets are much less frequent than daily! Some are very sporadic and sparse. For example, within the bounding box @rajadain used in his sample notebook, we can see these two examples (at least according to the series metadata):

  • ShaleNetworkODM:POWR_2299, ShaleNetwork:POWR_Turbidity_JTU: 7 values between 2000-04-06T13:40:00Z and 2005-10-11T11:00:00Z
  • NWISGW:395408075104001, NWISGW:72019: 84 values between 1954-09-14T00:00:00Z and 2017-09-14T00:00:00Z

@rajadain
Member

Thanks for the insight. Given that GetSiteInfo's values can't always be relied upon, are there some intelligent defaults we can assume? For NWISUV and NWISDV, I think there are too many values to fetch a year's worth, which may cause timeouts. If we can guarantee that CoCoRaHS and ShaleNetworkODM or others will have fewer values, we can hard code these to fetch a year.

I'll try to benchmark the time difference between fetching a month and a year for NWISUV / DV and report back.
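Something like this is what I have in mind for the benchmark (a sketch; the endpoint and codes are placeholders, not a specific NWISUV site):

import time
import ulmo

wsdl_url = 'http://example.org/nwisuv/cuahsi_1_1.asmx?WSDL'  # placeholder endpoint
site_code = 'NWISUV:01234567'                                # placeholder site code
variable_code = 'NWISUV:00060'                               # placeholder variable code

def time_fetch(start, end):
    t0 = time.time()
    ulmo.cuahsi.wof.get_values(wsdl_url, site_code, variable_code,
                               start=start, end=end)
    return time.time() - t0

one_month = time_fetch('2016-12-01', '2016-12-31')
one_year = time_fetch('2016-01-01', '2016-12-31')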

@rajadain
Member

rajadain commented Oct 30, 2017

I just tried fetching 2 years of data (1 January 2015 to 31 December 2016) and got the same error: https://gist.github.com/rajadain/a03c6009b0ef80d1b3326e41c3c174af

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

I just tried fetching 2 years of data (1 January 2015 to 31 December 2016) and got the same error:

My apologies; what I reported earlier about the CoCoRaHS site (being able to ingest data up to 2016-04-30) was probably based on tests that used lower-level, direct suds access rather than ulmo.cuahsi.wof.get_values; I should've been clear about that, but it was late at night :(

I'm able to reproduce the ulmo.cuahsi.wof.get_values error you report, and also to successfully issue a suds request that looks like this:

from suds.client import Client

# direct SOAP GetValues request against the service, bypassing ulmo's parsing
serv_client = Client(sample.ServURL + '?WSDL', timeout=5)
response = serv_client.service.GetValues(sample.location, sample.VarCode,
                                         startDate=from_date.strftime('%m/%d/%Y'),
                                         endDate=to_date.strftime('%m/%d/%Y'))

I'll try to track down the problem in ulmo (which, FYI, likely involves an idiosyncrasy with the service response).

But as @ajrobbins said in her email today, we need to step back and decide whether pursuing much more effort on this issue is worthwhile.

In the meantime, I'll comment briefly on your other comments:

Given that GetSiteInfo's values can't always be relied upon, are there some intelligent defaults we can assume?

That's the crux. I don't have anything new to say, yet, beyond my earlier suggestion for a year.

For NWISUV and NWISDV, I think there are too many values to fetch a year's worth, which may cause timeouts.

Well, NWISDV will definitely be fine. The D stands for "daily", so a one-year request will return 366 values at most. That shouldn't be a problem at all. For comparison, since NWISUV is typically hourly (I think), the one month request you're already doing returns 24*30 = 720 values.

If we can guarantee that CoCoRaHS and ShaleNetworkODM or others will have fewer values, we can hard code these to fetch a year.

No blanket guarantees. Every service is different. There will be some that are in fact hourly or more frequent, and others that are super sparse -- yearly, irregular, or a single value 80 years ago!

Let me see if I can come up with a scheme that doesn't involve much hard-wiring and is generally applicable.
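One idea, just a sketch of the kind of scheme I mean (nothing that exists anywhere yet): estimate the series' average sampling interval from the catalog metadata we already have (value count over the reported begin/end dates) and size the request window to target a roughly fixed number of values:

from datetime import timedelta

def request_window(begin_date, end_date, value_count, target_values=1000):
    # sparse series: just ask for the whole reported period
    span = end_date - begin_date
    if value_count <= target_values or span <= timedelta(0):
        return span
    # otherwise, size the window so roughly target_values values come back
    avg_interval = span / value_count
    return avg_interval * target_values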

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

I've found the source of the problem that @rajadain demoed in his latest notebook (where he tries fetching two years of data for a CoCoRaHS site). Unfortunately, it's due to an interaction between the conventions used in the WDC response (less than ideal, or possibly wrong, depending on what the standard's spec says) and ulmo's expectations.

I think I (with help from Don) can create a fix to ulmo by tomorrow; however, if you are currently using the ulmo conda package, you'd have to switch to pip install from an ulmo fork instead.

Anyway, more fodder for discussion and decision making.

@aufdenkampe
Member

@emiliom, it would be great if you and Don found the fix. I just had a conversation with @ajrobbins and let her know that we can use WPF funds to fix this, as it will be important for Monitor My Watershed.

So, if you do get a fix working in Ulmo by tomorrow, there might just be enough time for us to fix this bug before finalizing the Nov. release.

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

Thanks, @aufdenkampe. I assume that Azavea has the time, though? I thought the core bottleneck was more time than money. Be that as it may ...

I'd like to hear from @rajadain first whether he's able to install ulmo into their environment from a GitHub fork rather than the official conda or PyPI pip packages, without introducing other complexities and time sinks.

(This would be an interim solution, of course. Ultimately we would submit a PR to ulmo, but we can't count on a time frame for the PR to be accepted there and turned into a new release.)

@rajadain
Member

Yes we can support custom GitHub fork URLs. We'll incorporate it as soon as it is available.

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

Yes we can support custom GitHub fork URLs. We'll incorporate it as soon as it is available.

Great, thanks.

Fingers crossed ...

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

@lsetiawan Look over the exchanges on this issue, from the start. I'll send you more info later today. I'll take a shot late today/tonight at implementing the ulmo fix I mention here. But I'll probably ask for your help with testing and possibly additional work first thing Tue morning.

Or if things go well I just may be done (including testing) by tomorrow morning 😸

@emiliom
Contributor Author

emiliom commented Oct 31, 2017

@rajadain I have a tweaked fork of ulmo that "fixes" the fatal parsing error you were running into. I've tested it using pretty much the same notebook as in your gist (two years of data for the CoCoRaHS site).

You can grab it as the wml_values_md branch at https://github.com/emiliom/ulmo
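For anyone following along, installing from that branch with pip should look something like this (standard pip VCS-install syntax):

pip install git+https://github.com/emiliom/ulmo.git@wml_values_md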

Note that for expedience, my fix simply skips the "metadata" parsing for elements that were causing the error due to the convention mismatch. This means that metadata is not being read. I'm going to fix that later before I submit a PR to ulmo. But for the portal application, the skipped metadata is actually never used at this time, so you won't notice its absence.

Let me know if you have any questions or run into any issues. I hope it works cleanly!

Of course, we still have to deal with the issue of the GetSiteInfo CoCoRaHS end date not being accurate (FYI, I think it's incorrectly rounded up to the nearest 7/31 or 12/31 of the last year with data). I'll try to come up with an approach on Tuesday.

@aufdenkampe
Member

Thanks @emiliom and @rajadain for working hard to fix this bug before the November release!

rajadain added a commit that referenced this issue Oct 31, 2017
Because of parse issues in the main repo. See here for background:
#2417 (comment)
@rajadain rajadain added in review WPF Funding Source: William Penn Foundation and removed in progress labels Oct 31, 2017
@emiliom
Contributor Author

emiliom commented Oct 31, 2017

@rajadain Thanks for cc'ing me on #2446. That's looking great!

One quick question/suggestion: we're going with a year of data as the default. Hopefully that length of time is specified in such a way that we could throttle it easily later on, as we accumulate experience and conclude that, for example, it's slowing down NWISUV responses too much and should be reduced.

@rajadain
Member

I went with a year for all services because it seemed more consistent, although for many we don't actually get a year (I think NWISUV values only go back to July, so maybe a quarter?). We can have different limits for different service providers, and if it makes sense to limit NWISUV to one month, that can be done.

@emiliom
Contributor Author

emiliom commented Oct 31, 2017

I understand and agree with the reasoning for having a single length across all services (a year). What I'm suggesting is that if the implementation specifies that time interval in a single place/configuration, we'd be able to adjust it easily (still across all services) later on.
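Something along these lines is all I mean (a sketch; the setting name below is hypothetical, not the portal's actual code):

from datetime import datetime, timedelta

# single knob, shared by all WDC services, that can be tuned later
WDC_DEFAULT_FETCH_DAYS = 365

def default_date_range(now=None):
    to_date = now or datetime.utcnow()
    return to_date - timedelta(days=WDC_DEFAULT_FETCH_DAYS), to_date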

rajadain added a commit that referenced this issue Nov 1, 2017
…tch-errors

BiG-CZ: Fix WDC Fetch Errors

Connects #2417
Connects #2413