Currently, we have an issue with pages that contain a very large number of links to themselves, such as websites that list all Bitcoin transactions or blocks, or websites that host an extensive library and let you read a book page by page. To circumvent this issue and still reach a content page for all not-yet-scraped hosts, we should first consider scraping pages of hosts that were not scraped yet.
So the prioritization would be as follows (always additionally sorted by depth, since we want to scrape in depth order); a sketch of the combined ordering follows the list:
Scrape paths from hosts that were never scraped yet
Sort by the number of unique incoming/outgoing links (unique in the sense of distinct hosts that link to this host)
Sort randomly as a tiebreaker
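A minimal sketch of how this combined ordering might look as a single query; the columns host_scraped (whether any path of that host was scraped before), incoming (the per-path count of distinct linking hosts) and depth are assumptions for illustration, not existing schema:
SELECT pathid
FROM paths
ORDER BY
    depth ASC,          -- always work through the frontier in depth order
    host_scraped ASC,   -- hosts never scraped yet come first (false sorts before true)
    incoming DESC,      -- then prefer paths linked to by many distinct hosts
    RANDOM()            -- finally break remaining ties randomly (RAND() in MySQL)
LIMIT 2000;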
Here are a few SQL snippets from today's meeting (@dionyziz, @zetavar):
Find all unique incoming/outgoing counts
-- Count, per destination path, how many distinct source hosts link to it
SELECT
    l.destpathid, COUNT(DISTINCT p.baseUrlId) AS inuniquecount
FROM
    links l JOIN paths p ON l.srcpathid = p.pathid
WHERE
    l.destpathid IN (1, 2, 3, 4, ...)
GROUP BY
    l.destpathid
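The query above covers the incoming direction only; the mirrored outgoing count (distinct destination hosts per source path) would presumably look like this, assuming the same links/paths schema:
-- Count, per source path, how many distinct destination hosts it links to
SELECT
    l.srcpathid, COUNT(DISTINCT p.baseUrlId) AS outuniquecount
FROM
    links l JOIN paths p ON l.destpathid = p.pathid
WHERE
    l.srcpathid IN (1, 2, 3, 4, ...)
GROUP BY
    l.srcpathid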
Find all hosts that are not yet scraped
-- Hosts where no path has ever finished scraping and none is currently in progress
SELECT
    baseUrlBaseUrlId, MIN(lastFinishedTimestamp) AS mintime, BOOL_OR(inProgress) AS ongoing
FROM
    paths
GROUP BY
    baseUrlBaseUrlId
HAVING
    BOOL_OR(inProgress) = false AND
    MIN(lastFinishedTimestamp) = '0000-00-00 00:00:00'
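To turn that host list into concrete scrape candidates, its result could be joined back against the paths of those hosts and ordered by depth. This is only a sketch, assuming a depth column on paths and the column names used above:
SELECT p.pathid
FROM paths p
JOIN (
    SELECT baseUrlBaseUrlId
    FROM paths
    GROUP BY baseUrlBaseUrlId
    HAVING BOOL_OR(inProgress) = false
       AND MIN(lastFinishedTimestamp) = '0000-00-00 00:00:00'
) AS unscraped ON p.baseUrlBaseUrlId = unscraped.baseUrlBaseUrlId
ORDER BY p.depth ASC   -- stay in depth order within the unscraped hosts
LIMIT 2000;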
We introduced an incoming link-count column, so that we scrape the most
important pages first. The column is necessary because it is not feasible
to calculate this data dynamically: even after optimizing the query,
adding indexes and the like, it takes up to 1.5 minutes to fetch 2000
entries. If we instead update only the paths we just found, the update is
much cheaper (typically only a few dozen entries), and since the column
can be indexed we can sort and limit quickly. The small amount of extra
storage seems well spent compared to the time savings.
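As a sketch of such an incremental update (PostgreSQL-style UPDATE ... FROM; the column name incoming and the id placeholder list are assumptions, not confirmed schema):
-- Recompute the counter only for the destination paths that just gained links,
-- instead of recalculating it for the whole table.
UPDATE paths
SET incoming = sub.inuniquecount
FROM (
    SELECT
        l.destpathid, COUNT(DISTINCT p.baseUrlId) AS inuniquecount
    FROM
        links l JOIN paths p ON l.srcpathid = p.pathid
    WHERE
        l.destpathid IN (1, 2, 3, 4, ...)  -- the paths we just found links to
    GROUP BY
        l.destpathid
) AS sub
WHERE paths.pathid = sub.destpathid;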
[#20]