roboparse

A simple utility that helps you organize your scraper's code.

Example

Go to the example directory.

Installation

Via pip

pip install roboparse

Via git

git clone https://github.com/Toffooo/roboparse.git
cd roboparse
pip install -e .

Routers

There are two ways to organize routers:

Write one big router that covers every feature you need
Split it into several small routers

Big router

from roboparse import BaseRouter
from roboparse.schemas import RouterResponse


class BlogSiteRouter(BaseRouter):
    def get_posts(self) -> RouterResponse:
        response = self.create_router_response(
            path="<site_url>",  # path is metadata only; it is not used for anything
            linter={
                "type": "LIST",
                "tag": "li",
                "attrs": {"class": "content-list__item"},
                "children": {
                    "type": "ELEMENT",
                    "tag": "h2",
                    "attrs": {"class": "post__title"},
                    "children": {
                        "type": "ELEMENT",
                        "tag": "a",
                        "attrs": {"class": "post__title_link"}
                    }
                }
            }
        )
        return response
    
    def get_main(self) -> RouterResponse:
        response = self.create_router_response_from_json(
            path="json_file.json"
        )
        return response

    def _fb_exclude_none_blocks(self, data):
        return [element for element in data if element is not None]

Small router

from roboparse import BaseRouter
from roboparse.schemas import RouterResponse


class BlogFilters:
    def _fb_exclude_none_blocks(self, data):
        return [element for element in data if element is not None]


class BlogMainRouter(BaseRouter, BlogFilters):
    def get(self) -> RouterResponse:
        response = self.create_router_response_from_json(
            path="json_file.json"
        )
        return response


class BlogPostRouter(BaseRouter, BlogFilters):
    def get(self) -> RouterResponse:
        response = self.create_router_response(
            path="<site_url>",  # path is metadata only; it is not used for anything
            linter={
                "type": "LIST",
                "tag": "li",
                "attrs": {"class": "content-list__item"},
                "children": {
                    "type": "ELEMENT",
                    "tag": "h2",
                    "attrs": {"class": "post__title"},
                    "children": {
                        "type": "ELEMENT",
                        "tag": "a",
                        "attrs": {"class": "post__title_link"}
                    }
                }
            }
        )
        return response

Explanation:

create_router_response - Every router method should return a router response like the ones above. These responses are passed to the parser, which handles them.
a) path - Metadata about the URL of the page
b) linter - The hierarchy of HTML elements to extract
create_router_response_from_json - Same as create_router_response, but you provide the path to a JSON file and the linter schema is loaded from it. The JSON structure should be the same as the linter's.
_fb prefix - You can register filters for your router. In this example, the filter is declared by adding the _fb prefix to the method name, which registers the method in the class as a filter. This filter removes None elements from the list and returns the filtered data.
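
Since the JSON file is described as having the same structure as the linter, one way to produce json_file.json is to dump the linter dictionary with Python's standard json module. This is a minimal sketch, not part of roboparse: it assumes the file's top-level object is the linter dictionary itself, and it writes to a temporary directory only so the snippet is self-contained.

```python
import json
import os
import tempfile

# Same hierarchy as the linter passed to create_router_response above.
linter = {
    "type": "LIST",
    "tag": "li",
    "attrs": {"class": "content-list__item"},
    "children": {
        "type": "ELEMENT",
        "tag": "h2",
        "attrs": {"class": "post__title"},
        "children": {
            "type": "ELEMENT",
            "tag": "a",
            "attrs": {"class": "post__title_link"},
        },
    },
}

# Assumption: the JSON file holds the linter dictionary as its top-level object.
# A temporary directory is used here; in a project you would write json_file.json
# next to your routers.
schema_path = os.path.join(tempfile.mkdtemp(), "json_file.json")
with open(schema_path, "w", encoding="utf-8") as fh:
    json.dump(linter, fh, indent=4)

# Reading the file back yields the same nested structure.
with open(schema_path, encoding="utf-8") as fh:
    loaded = json.load(fh)
```

Keeping the schema in a file lets you change selectors without touching the router code.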

See the code example in example/scraper.py.

Parser

Handle with filters

import requests

from roboparse import Parser
from .routers import BlogPostRouter


if __name__ == "__main__":
    response = requests.get("site_url")
    parser = Parser()
    router = BlogPostRouter("username", "password")
    data = parser.load(response.content, router.get(), router.filters)
    print(data)

Handle without filters

import requests

from roboparse import Parser
from .routers import BlogPostRouter


if __name__ == "__main__":
    response = requests.get("site_url")
    parser = Parser()
    router = BlogPostRouter("username", "password")
    data = parser.load(response.content, router.get())
    print(data)
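
The two variants above differ only in whether the router's filters are passed to the parser. To illustrate the idea, here is a rough sketch of how prefix-based filter collection could work and what the example filter does to parsed data. FilterHost, FILTER_PREFIX and apply_filters are hypothetical names used for illustration; this is not roboparse's actual implementation.

```python
class FilterHost:
    """Hypothetical stand-in for a router; not roboparse's BaseRouter."""

    FILTER_PREFIX = "_fb"

    def _fb_exclude_none_blocks(self, data):
        # Same filter as in the router examples: drop None elements.
        return [element for element in data if element is not None]

    @property
    def filters(self):
        # Collect bound methods whose names start with the filter prefix.
        return [
            getattr(self, name)
            for name in sorted(dir(type(self)))
            if name.startswith(self.FILTER_PREFIX)
            and callable(getattr(self, name))
        ]


def apply_filters(data, filters):
    # Pass the data through each filter in turn.
    for filter_func in filters:
        data = filter_func(data)
    return data


# Made-up parsed data containing empty blocks.
parsed = ["First post", None, "Second post", None]
cleaned = apply_filters(parsed, FilterHost().filters)
print(cleaned)
```

Without filters, the None entries would stay in the result and the calling code would have to handle them.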
