Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

votes html scraper should not fail silently by default #4

Open
OriHoch opened this issue Mar 19, 2017 · 14 comments
Open

votes html scraper should not fail silently by default #4

OriHoch opened this issue Mar 19, 2017 · 14 comments

Comments

@OriHoch
Copy link
Contributor

OriHoch commented Mar 19, 2017

following PR #3 - it looks like in case an unknown member is encountered it will fail silently

in knesset-data-python we would like to be as low-level as possible and make as little assumptions as possible about how the data will be used

in this case, silently ignoring an error might cause unexpected problems down the line

I prefer that by default we will raise an exception, and add a skip_exceptions parameter which will instead of raising an exception yield the exception object as part of the data stream - to allow the end-user decide how to handle the exception

this is done in the dataservice scrapers, see for example this unit test -
https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/dataservice/tests/base/test_exceptions.py#L38

@OriHoch
Copy link
Contributor Author

OriHoch commented Mar 19, 2017

@alonisser - I haven't actually tested it, please confirm/deny if my assumption that it fails silently is correct..
feel free to fix it as well :)

@ytbaum
Copy link

ytbaum commented Jun 8, 2017

working on this

@ytbaum
Copy link

ytbaum commented Jun 10, 2017

I'm a little bit confused about this issue. What is the failure and how exactly do I test or reproduce it?

More generally, I don't understand how the pieces of this project fit together. For example, I see the different classes representing different types of data (Bill, Vote, Member, etc.). But where are they instantiated or otherwise used?

@OriHoch
Copy link
Contributor Author

OriHoch commented Jun 10, 2017

knesset-data-python is a python module, you can pip install it, then call the functions / classes directly

regarding this bug -

reproduction - should raise exceptions

  • open jupyter notebook or python shell
  • from knesset_data.html_scrapers.votes import HtmlVote
  • HtmlVote.get_from_vote_id(24084)

expected

  • in case of any problem - raise an exception
    ensure it never silently fails

actual

  • in case of missing member - it will silently fail

reproduction - skip exceptions param

  • HtmlVote.get_from_vote_id(24084, skip_exceptions=True)

expected

  • don't raise exceptions - try to return data you have, without the problematic items
  • exactly how it's handled might be different depending on the data / scraper

actual

  • no skip_exceptions param in HtmlVote class

@ytbaum
Copy link

ytbaum commented Jun 10, 2017

Why should HtmlVote.get_from_vote_id(24084) cause an exception regarding unknown MK ID's? Looks like all that function is doing is grabbing the raw html. I assume the actual failure should occur in member_votes()?

I tried the following:

from knesset_data.html_scrapers.votes import HtmlVote h = HtmlVote.get_from_vote_id(24084) h.member_votes

but it simply returned []. Digging a little deeper, it looks like h.page is not the same html as I see when I manually inspect the page source in the browser. Specifically, in the page source I can see instances of "Vote_Bord" (the expression that the raw html is supposed to be split on), but that substring doesn't appear to exist in h.page:
str.index(h.page, "Vote_Bord") ValueError: substring not found

Am I doing something wrong here?

@ytbaum
Copy link

ytbaum commented Jun 10, 2017

Sorry for some reason in the above comment github wouldn't let me put newlines in code blocks

@OriHoch
Copy link
Contributor Author

OriHoch commented Jun 10, 2017

You found one of the nice features of Knesset - arbitrary blocking with some kind of security solution called Reblaze (I think..). It seems to affect mostly IPs from outside Israel, but it happened in Israel as well sometimes..

We have a function that detects the reblaze block, you can add a call to it in the HtmlVote class.

(Our servers are on Knesset white-list so they aren't blocked..)

To continue, the best solution is to use unit tests with mock data. First, access the page in your browser, and save the source to a file. Then, mock the HtmlVote class to return this mock html..

You can see an example of such a mock class here: https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/html_scrapers/mocks.py
and the corresponding test that uses this mock class - https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/html_scrapers/tests/test_plenum.py

@OriHoch
Copy link
Contributor Author

OriHoch commented Jun 10, 2017

@alonisser
Copy link
Collaborator

For tests you should use stubbed data anyway and not use an outside api
Regarding general work with knesset apis
You can work against the knesset api with the following work around
add:
127.0.0.1 main.knesset.gov.il
127.0.0.1 knesset.gov.il

To /etc/hosts (or similar if you are using windows. but I don't know how)
Then shutdown anything listening on port 80 on your machine (nginx or apache)
and (Using an ssh pem file we use to connect to oknesset)
sudo ssh -N -L 80:knesset.gov.il:80 [email protected] -i ~/.ssh/hasadna_amir_1.pem
then you'll have an open ssh tunnel and you can run against knesset without being blocked (since the data is actually sent from oknesset whitelisted servers)
Have fun

@alonisser
Copy link
Collaborator

Not sure why detecting reblaze would help him here (This is the problem of course) he needs to work on various exceptions, reblaze is just one

@alonisser
Copy link
Collaborator

BTW regarding #3 This should be published since solves a scraping problem.. for lots of votes, if not published then this is still happening (Especially if was redployed to server and was reinstalled with the original package)

about:

           if vote_result_code != "":
                member_id = re.search("""mk_individual_id_t=(\d+)""",i).group(1)		 +                try:
                results.append((member_id, vote_result_code))		 +                    member_id = re.search("""MKID=(\d+)""", i).group(1)
                    results.append((member_id, vote_result_code))
                except AttributeError as e:
                    logger.exception('Failed to find html vote for specific mk %s' % i)
                    continue
          return results		          return results

I'd argue that this is a good handling, for the specific case (missing mk in html vote) - move forward with the vote processing and log the error so we'll know that happend. what's missing is "above" this code . the case that other exceptions should be thrown or handled and not fail silently

@OriHoch
Copy link
Contributor Author

OriHoch commented Jun 11, 2017

PR #3 was published in v1.7.3

regarding the is_reblaze_content - it's useful for 2 things:

  1. to confirm that indeed the problem is due to reblaze
  2. to help the next developer that might encounter this, and prevent a WTF moment for him..

@OriHoch
Copy link
Contributor Author

OriHoch commented Jun 11, 2017

I also updated README regarding the reblaze block with a link to this issue for reference

@alonisser
Copy link
Collaborator

Hmm , preventing the WTF moment is a good idea. I'll open a following issue about that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants