votes html scraper should not fail silently by default #4
Comments
@alonisser - I haven't actually tested it; please confirm or deny my assumption that it fails silently. |
working on this |
I'm a little bit confused about this issue. What is the failure and how exactly do I test or reproduce it? More generally, I don't understand how the pieces of this project fit together. For example, I see the different classes representing different types of data (Bill, Vote, Member, etc.). But where are they instantiated or otherwise used? |
knesset-data-python is a Python module: you can pip install it, then call the functions and classes directly.
Regarding this bug:
reproduction - should raise exceptions
expected
actual
reproduction - skip exceptions param
expected
actual
|
Why should I tried the following:
but it simply returned Am I doing something wrong here? |
Sorry, for some reason GitHub wouldn't let me put newlines in the code blocks in the above comment. |
You found one of the nice features of the Knesset: arbitrary blocking by some kind of security solution called Reblaze (I think). It seems to affect mostly IPs from outside Israel, but it has sometimes happened in Israel as well. We have a function that detects the Reblaze block; you can add a call to it in the HtmlVote class. (Our servers are on the Knesset whitelist, so they aren't blocked.)
To continue, the best solution is to use unit tests with mock data. First, access the page in your browser and save the source to a file. Then mock the HtmlVote class to return this mock HTML. You can see an example of such a mock class here: https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/html_scrapers/mocks.py |
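A minimal sketch of the mock approach described above, assuming a scraper whose network access is isolated in one method. The names `VotePageScraper`, `_fetch_html`, `title` and the saved HTML are hypothetical and are not the library's actual `HtmlVote` API; see the linked mocks.py for the project's own version.

```python
import re


class VotePageScraper:
    """Hypothetical stand-in for a scraper class; NOT the library's HtmlVote."""

    def __init__(self, url):
        self.url = url

    def _fetch_html(self):
        # A real implementation would perform the HTTP request here.
        # In this sketch it always fails, to show that tests never reach it.
        raise RuntimeError("network access is not allowed in unit tests")

    def title(self):
        # Parse the <title> text out of whatever _fetch_html returns.
        match = re.search(r"<title>(.*?)</title>", self._fetch_html(), re.S)
        if match is None:
            raise ValueError("no <title> element found")
        return match.group(1).strip()


class MockVotePageScraper(VotePageScraper):
    """Returns page source saved from the browser instead of using the network."""

    # Assumption: in a real test this would be read from the saved file.
    SAVED_HTML = "<html><head><title> Vote 1234 </title></head></html>"

    def _fetch_html(self):
        return self.SAVED_HTML


scraper = MockVotePageScraper("http://example.invalid/vote/1234")
assert scraper.title() == "Vote 1234"
```

Because only the fetch step is overridden, the parsing code under test is the same code that would run against the live page.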
the reblaze detection function: https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/utils/reblaze.py#L1 |
For tests you should use stubbed data anyway and not use an outside API. To /etc/hosts (or similar if you are using Windows, but I don't know how) |
Not sure why detecting Reblaze would help him here (this is the problem, of course). He needs to work on various exceptions; Reblaze is just one. |
BTW, regarding #3: this should be published, since it solves a scraping problem for lots of votes. If it is not published, then this is still happening (especially if it was redeployed to the server and reinstalled with the original package). About:
I'd argue that this is good handling for the specific case (missing MK in the HTML vote): move forward with the vote processing and log the error so we'll know that it happened. What's missing is "above" this code: the case where other exceptions should be thrown or handled and not fail silently. |
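The distinction drawn here can be sketched as follows: catch and log only the one expected error, and let everything else propagate. `UnknownMemberError`, `lookup_member` and `resolve_member_ids` are hypothetical names for illustration, not code from PR #3.

```python
import logging

logger = logging.getLogger("vote_scraper_sketch")


class UnknownMemberError(Exception):
    """Hypothetical error for a member name that cannot be resolved."""


def lookup_member(name, known_members):
    # Translate the expected "not found" case into a specific exception type.
    try:
        return known_members[name]
    except KeyError:
        raise UnknownMemberError(name) from None


def resolve_member_ids(names, known_members):
    member_ids = []
    for name in names:
        try:
            member_ids.append(lookup_member(name, known_members))
        except UnknownMemberError:
            # Expected case: log it and keep processing the vote.
            logger.error("unknown member in vote html: %r", name)
        # Any other exception is deliberately not caught, so it is never
        # swallowed silently.
    return member_ids


assert resolve_member_ids(["a", "x", "b"], {"a": 1, "b": 2}) == [1, 2]
```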
I also updated README regarding the reblaze block with a link to this issue for reference |
Hmm, preventing the WTF moment is a good idea. I'll open a follow-up issue about that. |
following PR #3 - it looks like in case an unknown member is encountered it will fail silently
in knesset-data-python we would like to be as low-level as possible and make as few assumptions as possible about how the data will be used
in this case, silently ignoring an error might cause unexpected problems down the line
I prefer that by default we raise an exception, and add a skip_exceptions parameter which, instead of raising, yields the exception object as part of the data stream - to let the end user decide how to handle the exception
this is done in the dataservice scrapers, see for example this unit test -
https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/dataservice/tests/base/test_exceptions.py#L38
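A minimal sketch of the proposed behavior, assuming a generator-based scraper. Only the `skip_exceptions` parameter name comes from this issue; `scrape_items` and its other arguments are hypothetical and do not reflect the dataservice scrapers' actual signatures.

```python
def scrape_items(raw_items, parse, skip_exceptions=False):
    """Yield parsed items; on failure either raise (default) or yield the error."""
    for raw in raw_items:
        try:
            item = parse(raw)
        except Exception as error:
            if not skip_exceptions:
                # Default: fail loudly so the caller cannot miss the problem.
                raise
            # Opt-in: hand the exception object to the caller in the stream.
            item = error
        yield item


# With skip_exceptions=True the caller decides what to do with each error.
results = list(scrape_items(["1", "x", "3"], int, skip_exceptions=True))
assert results[0] == 1 and results[2] == 3
assert isinstance(results[1], ValueError)
```

A consumer can then separate good rows from failures with an `isinstance(item, Exception)` check, while callers who pass nothing get the exception immediately.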