A simple utility that helps organize your scraper's code.
See the example directory for a complete example.
- Via pip

```shell
pip install roboparse
```

- Via git

```shell
git clone https://github.com/Toffooo/roboparse.git
cd roboparse
pip install -e .
```
You have two options when creating routers:
- make one big router for all the features you need, or
- divide it into small parts.
- Big router

```python
from roboparse import BaseRouter
from roboparse.schemas import RouterResponse


class BlogSiteRouter(BaseRouter):
    def get_posts(self) -> RouterResponse:
        response = self.create_router_response(
            path="<site_url>",  # path is just metadata; the parser does not use it
            linter={
                "type": "LIST",
                "tag": "li",
                "attrs": {"class": "content-list__item"},
                "children": {
                    "type": "ELEMENT",
                    "tag": "h2",
                    "attrs": {"class": "post__title"},
                    "children": {
                        "type": "ELEMENT",
                        "tag": "a",
                        "attrs": {"class": "post__title_link"}
                    }
                }
            }
        )
        return response

    def get_main(self) -> RouterResponse:
        response = self.create_router_response_from_json(
            path="json_file.json"
        )
        return response

    def _fb_exclude_none_blocks(self, data):
        # the _fb prefix registers this method as a filter
        return [element for element in data if element is not None]
```
- Small router

```python
from roboparse import BaseRouter
from roboparse.schemas import RouterResponse


class BlogFilters:
    def _fb_exclude_none_blocks(self, data):
        # the _fb prefix registers this method as a filter
        return [element for element in data if element is not None]


class BlogMainRouter(BaseRouter, BlogFilters):
    def get(self) -> RouterResponse:
        response = self.create_router_response_from_json(
            path="json_file.json"
        )
        return response


class BlogPostRouter(BaseRouter, BlogFilters):
    def get(self) -> RouterResponse:
        response = self.create_router_response(
            path="<site_url>",  # path is just metadata; the parser does not use it
            linter={
                "type": "LIST",
                "tag": "li",
                "attrs": {"class": "content-list__item"},
                "children": {
                    "type": "ELEMENT",
                    "tag": "h2",
                    "attrs": {"class": "post__title"},
                    "children": {
                        "type": "ELEMENT",
                        "tag": "a",
                        "attrs": {"class": "post__title_link"}
                    }
                }
            }
        )
        return response
```
Explanation:

`create_router_response`
- Every router method should return a router response like the one above; these responses are passed to the parser, which handles them.
  a) `path`: metadata about the page's URL
  b) `linter`: the hierarchy of HTML elements to match

`create_router_response_from_json`
- Same as `create_router_response`, but you provide the path of a JSON file and load your linter schema from it. The JSON structure must match the `linter` dict.

`_fb` prefix
- You can register filters for your router. In this example, the filter is declared by adding the `_fb` prefix to the method name, which registers that method in the class as a filter. The filter above simply removes `None` elements from the list and returns the cleaned data.
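As a sketch of what `json_file.json` might contain, the linter schema from the example above can be dumped to a file for `create_router_response_from_json` to load. Assuming the JSON mirrors the `linter` dict one-to-one (which the text implies but does not show):

```python
import json

# Hypothetical: the same linter schema used with create_router_response,
# serialized so create_router_response_from_json can load it.
# The exact JSON layout roboparse expects is an assumption here.
linter = {
    "type": "LIST",
    "tag": "li",
    "attrs": {"class": "content-list__item"},
    "children": {
        "type": "ELEMENT",
        "tag": "h2",
        "attrs": {"class": "post__title"},
        "children": {
            "type": "ELEMENT",
            "tag": "a",
            "attrs": {"class": "post__title_link"},
        },
    },
}

with open("json_file.json", "w") as f:
    json.dump(linter, f, indent=2)
```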
See the full code example at example/scraper.py.

- Handle with filters

```python
import requests

from roboparse import Parser
from .routers import BlogPostRouter

if __name__ == "__main__":
    response = requests.get("site_url")
    parser = Parser()
    router = BlogPostRouter("username", "password")
    data = parser.load(response.content, router.get(), router.filters)
    print(data)
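To see what passing `router.filters` actually does to the result, here is the logic of `_fb_exclude_none_blocks` run standalone on a made-up list (the sample data is hypothetical, for illustration only):

```python
def exclude_none_blocks(data):
    # Same logic as the router's _fb_exclude_none_blocks filter:
    # drop None entries, keep everything else in order.
    return [element for element in data if element is not None]


blocks = ["post one", None, "post two", None]  # hypothetical parser output
print(exclude_none_blocks(blocks))  # ['post one', 'post two']
```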
- Handle without filters

```python
import requests

from roboparse import Parser
from .routers import BlogPostRouter

if __name__ == "__main__":
    response = requests.get("site_url")
    parser = Parser()
    router = BlogPostRouter("username", "password")
    data = parser.load(response.content, router.get())
    print(data)
```