Currently, we have an issue with pages that contain a very large number of links to themselves, such as websites that list all Bitcoin transactions or blocks, or websites that host an extensive library and let you read a book page by page. To circumvent this issue and still reach a content page for all not-yet-scraped hosts, we should first consider scraping pages of hosts that were not scraped yet.
So the prioritization would be as follows (always additionally sorted by depth, since we want to scrape in depth order); a sketch of the combined ordering follows the list:
Scrape paths from hosts that were never scraped yet
Sort by the number of unique incoming/outgoing links (unique in the sense of distinct hosts that link to this host)
Sort randomly as a tiebreaker
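A minimal sketch of how this combined ordering might look as a single query; the columns host_scraped (whether any path of that host was scraped before), incoming (the per-path count of distinct linking hosts) and depth are assumptions for illustration, not existing schema:
SELECT pathid
FROM paths
ORDER BY
    depth ASC,          -- always work through the frontier in depth order
    host_scraped ASC,   -- hosts never scraped yet come first (false sorts before true)
    incoming DESC,      -- then prefer paths linked to by many distinct hosts
    RANDOM()            -- finally break remaining ties randomly (RAND() in MySQL)
LIMIT 2000;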
Here are a few SQL snippets from today's meeting (@dionyziz, @zetavar):
Find all unique incoming/outgoing counts
-- Count, per destination path, how many distinct source hosts link to it
SELECT
    l.destpathid, COUNT(DISTINCT p.baseUrlId) AS inuniquecount
FROM
    links l JOIN paths p ON l.srcpathid = p.pathid
WHERE
    l.destpathid IN (1, 2, 3, 4, ...)
GROUP BY
    l.destpathid
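The query above covers the incoming direction only; the mirrored outgoing count (distinct destination hosts per source path) would presumably look like this, assuming the same links/paths schema:
-- Count, per source path, how many distinct destination hosts it links to
SELECT
    l.srcpathid, COUNT(DISTINCT p.baseUrlId) AS outuniquecount
FROM
    links l JOIN paths p ON l.destpathid = p.pathid
WHERE
    l.srcpathid IN (1, 2, 3, 4, ...)
GROUP BY
    l.srcpathid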
Find all hosts that are not yet scraped
-- Hosts where no path has ever finished scraping and none is currently in progress
SELECT
    baseUrlBaseUrlId, MIN(lastFinishedTimestamp) AS mintime, BOOL_OR(inProgress) AS ongoing
FROM
    paths
GROUP BY
    baseUrlBaseUrlId
HAVING
    BOOL_OR(inProgress) = false AND
    MIN(lastFinishedTimestamp) = '0000-00-00 00:00:00'
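To turn that host list into concrete scrape candidates, its result could be joined back against the paths of those hosts and ordered by depth. This is only a sketch, assuming a depth column on paths and the column names used above:
SELECT p.pathid
FROM paths p
JOIN (
    SELECT baseUrlBaseUrlId
    FROM paths
    GROUP BY baseUrlBaseUrlId
    HAVING BOOL_OR(inProgress) = false
       AND MIN(lastFinishedTimestamp) = '0000-00-00 00:00:00'
) AS unscraped ON p.baseUrlBaseUrlId = unscraped.baseUrlBaseUrlId
ORDER BY p.depth ASC   -- stay in depth order within the unscraped hosts
LIMIT 2000;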
We introduced an incoming link-count column, so that we scrape the most
important pages first. The column is necessary because it is not feasible
to calculate this data dynamically: even after optimizing the query,
adding indexes and the like, it takes up to 1.5 minutes to fetch 2000
entries. If we instead update only the paths we just found, the update is
much cheaper (typically only a few dozen entries), and since the column
can be indexed we can sort and limit quickly. The small amount of extra
storage seems well spent compared to the time savings.
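As a sketch of such an incremental update (PostgreSQL-style UPDATE ... FROM; the column name incoming and the id placeholder list are assumptions, not confirmed schema):
-- Recompute the counter only for the destination paths that just gained links,
-- instead of recalculating it for the whole table.
UPDATE paths
SET incoming = sub.inuniquecount
FROM (
    SELECT
        l.destpathid, COUNT(DISTINCT p.baseUrlId) AS inuniquecount
    FROM
        links l JOIN paths p ON l.srcpathid = p.pathid
    WHERE
        l.destpathid IN (1, 2, 3, 4, ...)  -- the paths we just found links to
    GROUP BY
        l.destpathid
) AS sub
WHERE paths.pathid = sub.destpathid;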
[#20]