Improve prioritizing algorithm to circumvent black holes #20

Closed
jogli5er opened this issue May 24, 2018 · 0 comments

jogli5er commented May 24, 2018

Currently, we have an issue with pages that contain a considerable number of links to themselves, such as websites that list all Bitcoin transactions or blocks, or websites that host an extensive library and let you read a book page by page. To circumvent this issue and get content from every host we have already discovered, we should first consider scraping pages of hosts that were not yet scraped.
The prioritization would then be as follows (always sorted by depth as well, since we want to scrape in depth order; see the sketch after the list):

  1. Scrape paths from hosts that have never been scraped yet
  2. Sort by the number of unique incoming/outgoing links (unique in the sense of distinct hosts that link to this host)
  3. Sort randomly
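
As a rough sketch of how these criteria could be combined in a single query (the columns depth, everScraped and inUniqueCount are hypothetical placeholders for the data computed by the snippets below):

        SELECT
             pathid
        FROM
             paths
        ORDER BY
             depth ASC,           -- always scrape in depth order
             everScraped ASC,     -- hosts never scraped yet come first (1.)
             inUniqueCount DESC,  -- unique incoming links next (2.)
             RANDOM()             -- random tie-break (3.); RAND() in MySQL
        LIMIT 2000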

Here are a few SQL snippets from today's meeting (@dionyziz, @zetavar):

Find all unique incoming/outgoing link counts

        SELECT
             destpathid, COUNT(DISTINCT p.baseUrlId) AS inuniquecount
        FROM
             links l JOIN paths p ON l.srcpathid = p.pathid
        WHERE
             destpathid IN (1, 2, 3, 4, ...)
        GROUP BY
             destpathid
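
The snippet above counts the incoming direction. A symmetric sketch for the outgoing direction (distinct destination hosts per source path), assuming the same links/paths schema:

        SELECT
             srcpathid, COUNT(DISTINCT p.baseUrlId) AS outuniquecount
        FROM
             links l JOIN paths p ON l.destpathid = p.pathid
        WHERE
             srcpathid IN (1, 2, 3, 4, ...)
        GROUP BY
             srcpathid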

Find all hosts that are not yet scraped

        SELECT
            baseUrlBaseUrlId, MIN(lastFinishedTimestamp) AS mintime, BOOL_OR(inProgress) AS ongoing
        FROM
            paths
        GROUP BY
            baseUrlBaseUrlId
        HAVING
            BOOL_OR(inProgress) = false AND
            MIN(lastFinishedTimestamp) = '0000-00-00 00:00:00'
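
Putting the two snippets together, a sketch that restricts candidate paths to hosts that were never scraped (reusing the query above as a subquery; table and column names as assumed above):

        SELECT
            p.pathid
        FROM
            paths p JOIN (
                SELECT
                    baseUrlBaseUrlId
                FROM
                    paths
                GROUP BY
                    baseUrlBaseUrlId
                HAVING
                    BOOL_OR(inProgress) = false AND
                    MIN(lastFinishedTimestamp) = '0000-00-00 00:00:00'
            ) fresh ON p.baseUrlBaseUrlId = fresh.baseUrlBaseUrlId
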
@jogli5er jogli5er added the enhancement New feature or request label May 31, 2018
@jogli5er jogli5er self-assigned this May 31, 2018
jogli5er added a commit that referenced this issue Jun 6, 2018
We introduced an incoming-links column, so that we scrape the most
important pages first. This column is needed because it is not
feasible to compute the data dynamically (even after optimizing the
query, adding indexes, and similar, it takes up to 1.5 minutes to
fetch 2000 entries). If we update only the paths we just found, the
update is much cheaper (typically only a few dozen entries), and
since we can sort cheaply (index on the column) and limit fast, this
small amount of storage seems well spent in contrast to the time
savings.
[#20]
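
A PostgreSQL-style sketch of the incremental update described above, assuming a hypothetical incoming column on paths that caches the distinct-host count; only the destination paths of the newly found links are touched (in MySQL the self-referencing subquery would need a derived-table workaround):

        UPDATE paths p
        SET incoming = (
            -- recount distinct source hosts for this destination path only
            SELECT COUNT(DISTINCT src.baseUrlId)
            FROM links l JOIN paths src ON l.srcpathid = src.pathid
            WHERE l.destpathid = p.pathid
        )
        WHERE p.pathid IN (1, 2, 3, 4, ...)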