votes html scraper should not fail silently by default #4

OriHoch · 2017-03-19T10:21:26Z

following PR #3 - it looks like in case an unknown member is encountered it will fail silently

in knesset-data-python we would like to be as low-level as possible and make as little assumptions as possible about how the data will be used

in this case, silently ignoring an error might cause unexpected problems down the line

I prefer that by default we will raise an exception, and add a skip_exceptions parameter which will instead of raising an exception yield the exception object as part of the data stream - to allow the end-user decide how to handle the exception

this is done in the dataservice scrapers, see for example this unit test -
https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/dataservice/tests/base/test_exceptions.py#L38

OriHoch · 2017-03-19T10:22:02Z

@alonisser - I haven't actually tested it, please confirm/deny if my assumption that it fails silently is correct..
feel free to fix it as well :)

ytbaum · 2017-06-08T18:14:02Z

working on this

ytbaum · 2017-06-10T13:24:52Z

I'm a little bit confused about this issue. What is the failure and how exactly do I test or reproduce it?

More generally, I don't understand how the pieces of this project fit together. For example, I see the different classes representing different types of data (Bill, Vote, Member, etc.). But where are they instantiated or otherwise used?

OriHoch · 2017-06-10T13:42:38Z

knesset-data-python is a python module, you can pip install it, then call the functions / classes directly

regarding this bug -

reproduction - should raise exceptions

open jupyter notebook or python shell
from knesset_data.html_scrapers.votes import HtmlVote
HtmlVote.get_from_vote_id(24084)

expected

in case of any problem - raise an exception
ensure it never silently fails

actual

in case of missing member - it will silently fail

reproduction - skip exceptions param

HtmlVote.get_from_vote_id(24084, skip_exceptions=True)

expected

don't raise exceptions - try to return data you have, without the problematic items
exactly how it's handled might be different depending on the data / scraper

actual

no skip_exceptions param in HtmlVote class

ytbaum · 2017-06-10T14:42:21Z

Why should HtmlVote.get_from_vote_id(24084) cause an exception regarding unknown MK ID's? Looks like all that function is doing is grabbing the raw html. I assume the actual failure should occur in member_votes()?

I tried the following:

from knesset_data.html_scrapers.votes import HtmlVote h = HtmlVote.get_from_vote_id(24084) h.member_votes

but it simply returned []. Digging a little deeper, it looks like h.page is not the same html as I see when I manually inspect the page source in the browser. Specifically, in the page source I can see instances of "Vote_Bord" (the expression that the raw html is supposed to be split on), but that substring doesn't appear to exist in h.page:
str.index(h.page, "Vote_Bord") ValueError: substring not found

Am I doing something wrong here?

ytbaum · 2017-06-10T14:43:51Z

Sorry for some reason in the above comment github wouldn't let me put newlines in code blocks

OriHoch · 2017-06-10T14:56:39Z

You found one of the nice features of Knesset - arbitrary blocking with some kind of security solution called Reblaze (I think..). It seems to affect mostly IPs from outside Israel, but it happened in Israel as well sometimes..

We have a function that detects the reblaze block, you can add a call to it in the HtmlVote class.

(Our servers are on Knesset white-list so they aren't blocked..)

To continue, the best solution is to use unit tests with mock data. First, access the page in your browser, and save the source to a file. Then, mock the HtmlVote class to return this mock html..

You can see an example of such a mock class here: https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/html_scrapers/mocks.py
and the corresponding test that uses this mock class - https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/html_scrapers/tests/test_plenum.py

OriHoch · 2017-06-10T14:57:23Z

the reblaze detection function: https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/utils/reblaze.py#L1

alonisser · 2017-06-10T19:18:40Z

For tests you should use stubbed data anyway and not use an outside api
Regarding general work with knesset apis
You can work against the knesset api with the following work around
add:
127.0.0.1 main.knesset.gov.il
127.0.0.1 knesset.gov.il

To /etc/hosts (or similar if you are using windows. but I don't know how)
Then shutdown anything listening on port 80 on your machine (nginx or apache)
and (Using an ssh pem file we use to connect to oknesset)
sudo ssh -N -L 80:knesset.gov.il:80 [email protected] -i ~/.ssh/hasadna_amir_1.pem
then you'll have an open ssh tunnel and you can run against knesset without being blocked (since the data is actually sent from oknesset whitelisted servers)
Have fun

alonisser · 2017-06-10T19:19:48Z

Not sure why detecting reblaze would help him here (This is the problem of course) he needs to work on various exceptions, reblaze is just one

alonisser · 2017-06-10T19:24:13Z

BTW regarding #3 This should be published since solves a scraping problem.. for lots of votes, if not published then this is still happening (Especially if was redployed to server and was reinstalled with the original package)

about:

           if vote_result_code != "":
                member_id = re.search("""mk_individual_id_t=(\d+)""",i).group(1)		 +                try:
                results.append((member_id, vote_result_code))		 +                    member_id = re.search("""MKID=(\d+)""", i).group(1)
                    results.append((member_id, vote_result_code))
                except AttributeError as e:
                    logger.exception('Failed to find html vote for specific mk %s' % i)
                    continue
          return results		          return results

I'd argue that this is a good handling, for the specific case (missing mk in html vote) - move forward with the vote processing and log the error so we'll know that happend. what's missing is "above" this code . the case that other exceptions should be thrown or handled and not fail silently

OriHoch · 2017-06-11T07:29:48Z

PR #3 was published in v1.7.3

regarding the is_reblaze_content - it's useful for 2 things:

to confirm that indeed the problem is due to reblaze
to help the next developer that might encounter this, and prevent a WTF moment for him..

OriHoch · 2017-06-11T07:33:37Z

I also updated README regarding the reblaze block with a link to this issue for reference

alonisser · 2017-06-11T18:27:56Z

Hmm , preventing the WTF moment is a good idea. I'll open a following issue about that

OriHoch mentioned this issue Mar 19, 2017

Fixed html votes scraper #3

Merged

OriHoch added help wanted noob friendly labels Jun 8, 2017

OriHoch removed the help wanted label Jun 9, 2017

alonisser mentioned this issue Jun 11, 2017

Wrap knesset proxies with reblaze detection and throw a dedicated informative exception #18

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

votes html scraper should not fail silently by default #4

votes html scraper should not fail silently by default #4

OriHoch commented Mar 19, 2017

OriHoch commented Mar 19, 2017

ytbaum commented Jun 8, 2017

ytbaum commented Jun 10, 2017

OriHoch commented Jun 10, 2017

ytbaum commented Jun 10, 2017

ytbaum commented Jun 10, 2017

OriHoch commented Jun 10, 2017 •

edited

Loading

OriHoch commented Jun 10, 2017

alonisser commented Jun 10, 2017

alonisser commented Jun 10, 2017

alonisser commented Jun 10, 2017

OriHoch commented Jun 11, 2017

OriHoch commented Jun 11, 2017

alonisser commented Jun 11, 2017

votes html scraper should not fail silently by default #4

votes html scraper should not fail silently by default #4

Comments

OriHoch commented Mar 19, 2017

OriHoch commented Mar 19, 2017

ytbaum commented Jun 8, 2017

ytbaum commented Jun 10, 2017

OriHoch commented Jun 10, 2017

reproduction - should raise exceptions

expected

actual

reproduction - skip exceptions param

expected

actual

ytbaum commented Jun 10, 2017

ytbaum commented Jun 10, 2017

OriHoch commented Jun 10, 2017 • edited Loading

OriHoch commented Jun 10, 2017

alonisser commented Jun 10, 2017

alonisser commented Jun 10, 2017

alonisser commented Jun 10, 2017

OriHoch commented Jun 11, 2017

OriHoch commented Jun 11, 2017

alonisser commented Jun 11, 2017

OriHoch commented Jun 10, 2017 •

edited

Loading