
Bug harvesting data from WDC for services other than NWISUV & NWISDV #2417

Closed
emiliom opened this issue Oct 23, 2017 · 19 comments
Assignees
Labels
+ BiG CZ November demo BigCZ bug WPF Funding Source: William Penn Foundation

Comments

@emiliom
Contributor

emiliom commented Oct 23, 2017

At our last call @rajadain mentioned something wrong with the GHCN service available via WDC, such that the data are not plotted in the detailed view. I've done some digging and testing on that service and others. I also discussed this with @aufdenkampe on Thursday, before I had done more thorough testing.

I think the bug is in the portal code itself, and I suspect it has to do with trying to query data from services that don't include recent data (the last 1-3 months, the default window used by the BiG CZ portal).

The problem is not just with GHCN. No service other than NWISUV & NWISDV seems to work, not even NWISGW. I've taken two sites identified via the portal (one from GHCN and one from CoCoRaHS -- note these may be temporary links), and tried them on the CUAHSI/WDC HydroClient. In both cases, the sites worked fine there.

If confirmed, this is an important bug. WDC data access that's limited to NWISUV & NWISDV is a very constrained implementation.

cc @ajrobbins

@rajadain
Member

Thanks for the investigation and examples @emiliom, this is a great catch. I'll take a look when I get the chance.

@rajadain rajadain added in progress and removed 1 labels Oct 29, 2017
@rajadain rajadain self-assigned this Oct 29, 2017
@rajadain rajadain added the + label Oct 29, 2017
@rajadain
Member

Hi @emiliom, I just made a gist demoing the error we get for CoCoRaHS values: https://gist.github.com/rajadain/22b0bc5546fcd0dfdc0dba874bb31698. In this gist, I first do a GetSeriesCatalogForBox2 search, then find a CoCoRaHS series record, then fetch its date range using ulmo's site info call, and finally fetch values using ulmo's values call, which is where the failure happens.

Looking at the error output it seems to be related to how the return values are being parsed. Do you have any insight into this?
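For reference, the last two steps look roughly like this with ulmo (a sketch; the WSDL endpoint, site code, and variable code below are placeholders, not the actual values from the gist):

import ulmo

wsdl_url = 'http://example.org/cocorahs/cuahsi_1_1.asmx?WSDL'  # placeholder endpoint
site_code = 'CoCoRaHS:XX-YY-1'                                 # placeholder site code
variable_code = 'CoCoRaHS:PRCPS'                               # placeholder variable code

# GetSiteInfo: the series catalog here reports a begin/end date per variable,
# which is what we use to pick the date range for the values request
site_info = ulmo.cuahsi.wof.get_site_info(wsdl_url, site_code)

# GetValues: this is the call that fails for the CoCoRaHS series
values = ulmo.cuahsi.wof.get_values(wsdl_url, site_code, variable_code,
                                    start='2016-11-30', end='2016-12-31')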

@emiliom
Contributor Author

emiliom commented Oct 29, 2017

@rajadain I'm looking into it. I'll try to follow up today; otherwise, definitely by early Monday.

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

I've found that the date range you're searching (2016-11-30 to 2016-12-31) doesn't actually have any data, despite GetSiteInfo reporting that the time series ends on 2016-12-31 for the variable you're targeting (PRCPS). In fact, after some digging, it looks like the last time step actually available is 2016-04-30.

ulmo is not failing gracefully in this situation. That's somewhat understandable: the server should return a formal, informative exception (in XML) rather than a normal-looking time series response with no time series values (and no corresponding value elements) at all!
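Until that's fixed on one side or the other, a minimal defensive sketch on the client (reusing the placeholder names from the snippet above, and assuming ulmo raises on this empty-but-normal-looking response) would be to treat the parse failure as "no data in this window":

import ulmo

try:
    resp = ulmo.cuahsi.wof.get_values(wsdl_url, site_code, variable_code,
                                      start=from_date, end=to_date)
    series = resp.get('values', [])
except Exception:
    # an unparseable (empty) response is treated the same as "no data returned"
    series = []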

I have no clue how widespread this problem is within the CoCoRaHS service or others. But obviously in this particular case the problem is not on your end: you're applying correct logic, issuing date range parameters based on the metadata extracted from GetSiteInfo.

There's nothing constructive we can do at this time that I can think of, except maybe changing the default date interval harvested initially to one year. Let's keep in mind that we've tended to assume (biased by often focusing on NWISUV) that all data sets are fairly high frequency, such as hourly, but many datasets are likely to be daily. The volume of data is not a function solely of the date range, but also of the data time step. Even a year of hourly data might not be a very large response for a single variable. But we'll only find out by trial and error.
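Just to put rough numbers on that (illustrative only, for a single variable):

# expected value counts as a function of time step and request window
hourly_year = 24 * 365   # 8,760 values for a year of hourly data
daily_year = 365         # 365 values for a year of daily data
hourly_month = 24 * 30   # 720 values for the current one-month default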

I'll give it more thought on Monday ...

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

I should add that many datasets are much less frequent than daily! Some are very sporadic and sparse. For example, within the bounding box @rajadain used in his sample notebook, we can see these two examples (at least according to the series metadata):

  • ShaleNetworkODM:POWR_2299, ShaleNetwork:POWR_Turbidity_JTU: 7 values between 2000-04-06T13:40:00Z and 2005-10-11T11:00:00Z
  • NWISGW:395408075104001, NWISGW:72019: 84 values between 1954-09-14T00:00:00Z and 2017-09-14T00:00:00Z

@rajadain
Member

Thanks for the insight. Given that GetSiteInfo's values can't always be relied upon, are there some intelligent defaults we can assume? For NWISUV and NWISDV, I think there are too many values to fetch a year's worth, which may cause timeouts. If we can guarantee that CoCoRaHS and ShaleNetworkODM or others will have fewer values, we can hard code these to fetch a year.

I'll try to benchmark the time difference between fetching a month and a year for NWISUV / DV and report back.
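Something like this is what I have in mind for the benchmark (a sketch; the endpoint and codes are placeholders, not a specific NWISUV site):

import time
import ulmo

wsdl_url = 'http://example.org/nwisuv/cuahsi_1_1.asmx?WSDL'  # placeholder endpoint
site_code = 'NWISUV:01234567'                                # placeholder site code
variable_code = 'NWISUV:00060'                               # placeholder variable code

def time_fetch(start, end):
    t0 = time.time()
    ulmo.cuahsi.wof.get_values(wsdl_url, site_code, variable_code,
                               start=start, end=end)
    return time.time() - t0

one_month = time_fetch('2016-12-01', '2016-12-31')
one_year = time_fetch('2016-01-01', '2016-12-31')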

@rajadain
Member

rajadain commented Oct 30, 2017

I just tried fetching 2 years of data (1 January 2015 to 31 December 2016) and got the same error: https://gist.github.com/rajadain/a03c6009b0ef80d1b3326e41c3c174af

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

I just tried fetching 2 years of data (1 January 2015 to 31 December 2016) and got the same error:

My apologies; what I reported earlier about the CoCoRaHS site (being able to ingest data up to 2016-04-30) was probably based on tests that used lower-level, direct suds access rather than ulmo.cuahsi.wof.get_values; I should've been clear about that, but it was late at night :(

I'm able to reproduce the ulmo.cuahsi.wof.get_values error you report, and also to successfully issue a suds request that looks like this:

from suds.client import Client

# direct SOAP GetValues request against the service, bypassing ulmo's parsing
serv_client = Client(sample.ServURL + '?WSDL', timeout=5)
response = serv_client.service.GetValues(sample.location, sample.VarCode,
                                         startDate=from_date.strftime('%m/%d/%Y'),
                                         endDate=to_date.strftime('%m/%d/%Y'))

I'll try to track down the problem in ulmo (which, FYI, likely involves an idiosyncrasy with the service response).

But as @ajrobbins said in her email today, we need to step back and decide whether pursuing much more effort on this issue is worthwhile.

In the meantime, I'll comment briefly on your other comments:

Given that GetSiteInfo's values can't always be relied upon, are there some intelligent defaults we can assume?

That's the crux. I don't have anything new to say, yet, beyond my earlier suggestion for a year.

For NWISUV and NWISDV, I think there are too many values to fetch a year's worth, which may cause timeouts.

Well, NWISDV will definitely be fine. The D stands for "daily", so a one-year request will return 366 values at most. That shouldn't be a problem at all. For comparison, since NWISUV is typically hourly (I think), the one month request you're already doing returns 24*30 = 720 values.

If we can guarantee that CoCoRaHS and ShaleNetworkODM or others will have fewer values, we can hard code these to fetch a year.

No blanket guarantees. Every service is different. There will be some that are in fact hourly or more frequent, and others that are super sparse -- yearly, irregular, or a single value 80 years ago!

Let me see if I can come up with a scheme that doesn't involve much hard-wiring and is generally applicable.
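One idea, just a sketch of the kind of scheme I mean (nothing that exists anywhere yet): estimate the series' average sampling interval from the catalog metadata we already have (value count over the reported begin/end dates) and size the request window to target a roughly fixed number of values:

from datetime import timedelta

def request_window(begin_date, end_date, value_count, target_values=1000):
    # sparse series: just ask for the whole reported period
    span = end_date - begin_date
    if value_count <= target_values or span <= timedelta(0):
        return span
    # otherwise, size the window so roughly target_values values come back
    avg_interval = span / value_count
    return avg_interval * target_values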

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

I've found the source of the problem that @rajadain demoed in his latest notebook (where he tries fetching two years of data for a CoCoRaHS site). Unfortunately, it's due to an interaction between the conventions used in the WDC response (less than ideal, or possibly wrong, depending on what the standard's spec says) and ulmo's expectations.

I think I (with help from Don) can create a fix to ulmo by tomorrow; however, if you are currently using the ulmo conda package, you'd have to switch to pip install from an ulmo fork instead.

Anyway, more fodder for discussion and decision making.

@aufdenkampe
Member

@emiliom, it would be great if you and Don found the fix. I just had a conversation with @ajrobbins and let her know that we can use WPF funds to fix this, as it will be important for Monitor My Watershed.

So, if you do get a fix working in Ulmo by tomorrow, there might just be enough time for us to fix this bug before finalizing the Nov. release.

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

Thanks, @aufdenkampe. I assume that Azavea has the time, though? I thought the core bottleneck was more time than money. Be that as it may ...

I'd like to hear from @rajadain first whether he's able to install ulmo into their environment from a GitHub fork rather than the official conda or PyPI pip packages, without introducing other complexities and time sinks.

(This would be an interim solution, of course. Ultimately we would submit a PR to ulmo, but we can't count on a time frame for the PR to be accepted there and turned into a new release.)

@rajadain
Member

Yes we can support custom GitHub fork URLs. We'll incorporate it as soon as it is available.

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

Yes we can support custom GitHub fork URLs. We'll incorporate it as soon as it is available.

Great, thanks.

Fingers crossed ...

@emiliom
Contributor Author

emiliom commented Oct 30, 2017

@lsetiawan Look over the exchanges on this issue, from the start. I'll send you more info later today. I'll take a shot late today/tonight at implementing the ulmo fix I mention here. But I'll probably ask for your help with testing and possibly additional work first thing Tue morning.

Or if things go well I just may be done (including testing) by tomorrow morning 😸

@emiliom
Contributor Author

emiliom commented Oct 31, 2017

@rajadain I have a tweaked fork of ulmo that "fixes" the fatal parsing error you were running into. I've tested it using pretty much the same notebook as in your gist (two years of data for the CoCoRaHS site).

You can grab it as the wml_values_md branch at https://github.com/emiliom/ulmo
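For anyone following along, installing from that branch with pip should look something like this (standard pip VCS-install syntax):

pip install git+https://github.com/emiliom/ulmo.git@wml_values_md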

Note that for expedience, my fix simply skips the "metadata" parsing for elements that were causing the error due to the convention mismatch. This means that metadata is not being read. I'm going to fix that later before I submit a PR to ulmo. But for the portal application, the skipped metadata is actually never used at this time, so you won't notice its absence.

Let me know if you have any questions or run into any issues. I hope it works cleanly!

Of course, we still have to deal with the issue of the GetSiteInfo CoCoRaHS end date not being accurate (FYI, I think it's incorrectly rounded up to the nearest 7/31 or 12/31 of the last year with data). I'll try to come up with an approach on Tuesday.

@aufdenkampe
Member

Thanks @emiliom and @rajadain for working hard to fix this bug before the November release!

rajadain added a commit that referenced this issue Oct 31, 2017
Because of parse issues in the main repo. See here for background:
#2417 (comment)
@rajadain rajadain added in review WPF Funding Source: William Penn Foundation and removed in progress labels Oct 31, 2017
@emiliom
Contributor Author

emiliom commented Oct 31, 2017

@rajadain Thanks for cc'ing me on #2446. That's looking great!

One quick question/suggestion: we're going with a year of data as the default. Hopefully that length of time is specified in such a way that we could throttle it easily later on, as we accumulate experience and conclude that, for example, it's slowing down NWISUV responses too much and should be reduced.

@rajadain
Member

I went with a year for all services because it seemed more consistent, although for many we don't actually get a year (I think NWISUV values only go back to July, so maybe a quarter?). We can have different limits for different service providers, and if it makes sense to limit NWISUV to one month, that can be done.

@emiliom
Contributor Author

emiliom commented Oct 31, 2017

I understand and agree with the reasoning for having a single length across all services (a year). What I'm suggesting is that if the implementation specifies that time interval in a single place/configuration, we'd be able to adjust it easily (still across all services) later on.
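Something along these lines is all I mean (a sketch; the setting name below is hypothetical, not the portal's actual code):

from datetime import datetime, timedelta

# single knob, shared by all WDC services, that can be tuned later
WDC_DEFAULT_FETCH_DAYS = 365

def default_date_range(now=None):
    to_date = now or datetime.utcnow()
    return to_date - timedelta(days=WDC_DEFAULT_FETCH_DAYS), to_date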

rajadain added a commit that referenced this issue Nov 1, 2017
…tch-errors

BiG-CZ: Fix WDC Fetch Errors

Connects #2417
Connects #2413