
Pulsar concepts

English | 简体中文

This document describes Pulsar concepts.

Core concepts

Web Scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

Network As A Database

The Internet, especially the World Wide Web (WWW), is the largest database in the world, but extracting data from the Web has never been easy.

Pulsar treats the network as a database. For each webpage, Pulsar first checks the local storage; if the page does not exist, has expired, or any other fetch condition is triggered, Pulsar retrieves it from the Web.

Pulsar has SQL support, so we can turn the Web into tables and charts using simple SQL, and we can also query the Web with SQL directly.

Auto Extract

Web data extraction has evolved over the last three decades from statistical methods to more advanced machine learning methods. Machine learning techniques are much preferred nowadays due to time-to-market demands, developer productivity, and cost concerns. Using our cutting-edge technology, the entire end-to-end Web data extraction lifecycle is automated without any manual intervention.

We provide a preview project that shows how to use our world-leading machine learning algorithm to extract almost every field in webpages automatically: Exotic.

You can also use Exotic as an intelligent CSS selector generator; the generated CSS selectors can be used in any traditional scraping system to extract data from webpages.

Browser Rendering

Although Pulsar supports various web scraping methods, browser rendering is the primary way Pulsar scrapes webpages.

Browser rendering means every webpage is opened by a real browser to make sure all fields on the webpage are rendered correctly.

Pulsar Context

PulsarContext consists of a set of highly customizable components that provide the core set of interfaces of the system and is used to create PulsarSessions.

The PulsarContext is the interface to all other contexts.

A StaticPulsarContext consists of the default components.

A ClassPathXmlPulsarContext consists of components which are customized using Spring bean configuration files.

A SQLContext consists of components to support X-SQL.

Programmers can write their own Pulsar Contexts to extend the system.
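
A minimal sketch of how contexts and sessions fit together. PulsarContexts.create() and createSession() appear elsewhere in this document; calling createSession() directly on a context is an assumption based on the statement that a PulsarContext is used to create PulsarSessions.

// obtain a context backed by the default components
val context = PulsarContexts.create()
// create a session from the context (assumed factory method, see PulsarContext)
val session = context.createSession()
// PulsarContexts.createSession() is the shortcut used later in this document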

Pulsar Session

PulsarSession defines an interface to load web pages from local storage or fetch from the Internet, as well as methods for parsing, extracting, saving, indexing, and exporting web pages.

Key methods:

  • .load(): load a webpage from local storage, or fetch it from the Internet.

  • .parse(): parse a webpage into a document.

  • .scrape(): load a webpage, parse it into a document and then extract fields from the document.

  • .submit(): submit a URL to the URL pool; the URL will be processed in the main loop later.

And also the batch versions:

  • .loadOutPages(): load the portal page and out pages.

  • .scrapeOutPages(): load the portal page and out pages, then extract fields from the out pages.

Check PulsarSession to learn all methods.

The first thing to understand is how to load a page. Load methods like session.load() first check the local storage and return the local version if the required page exists and meets the requirements; otherwise, the page is fetched from the Internet.

The load parameters or load options can be used to specify when the system will fetch a webpage from the Internet:

  1. Expiration

  2. Force refresh

  3. Page size

  4. Required fields

  5. Other conditions

// create a pulsar session
val session = PulsarContexts.createSession()
val url = "..."
// load a page; fetch it from the Internet if the local copy is missing or expired
val page = session.load(url, "-expires 1d")
// parse the page content into an HTML document
val document = session.parse(page)
// load and parse in one call
val document2 = session.loadDocument(url, "-expires 1d")
// load the portal page and the out pages selected by `a[href~=item]`
val pages = session.loadOutPages(url, "-outLink a[href~=item] -expires 1d -itemExpires 7d -itemRequireSize 300000")
// load the page and extract the given fields from elements matching `li[data-sku]`
val fields = session.scrape(url, "-expires 1d", "li[data-sku]", listOf(".p-name em", ".p-price"))
// ...

Once a page is loaded from local storage, or fetched from the Internet, we proceed to the subsequent processing steps (a brief sketch follows this list):

  1. parse the page content into an HTML document

  2. extract fields from the HTML document

  3. write the fields into a destination, such as

    1. a plain file, an Avro file, CSV, Excel, MongoDB, MySQL, etc.

    2. Solr, Elasticsearch, etc.
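
As a rough sketch of steps 2 and 3, the fields returned by the scrape() call shown earlier can be written into one of the destinations above, here a plain CSV file; the output path is hypothetical and the return shape of scrape() (a list of field maps) is assumed.

import java.io.File

// extract fields (step 2) using the scrape() example from above
val fields = session.scrape(url, "-expires 1d", "li[data-sku]", listOf(".p-name em", ".p-price"))
// write the fields into a plain CSV file (step 3); the path is hypothetical
val csv = fields.joinToString("\n") { it.values.joinToString(",") }
File("fields.csv").writeText(csv)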

There are many ways to fetch the page content from the Internet:

  1. through the HTTP protocol

  2. through a real browser

Since webpages are becoming more and more complex, fetching webpages through real browsers is the primary way nowadays.

When we fetch webpages using a real browser, we may need to interact with the pages to ensure the desired fields are loaded correctly and completely. Activate PageEvent and use WebDriver to achieve this.

val options = session.options(args)
// scroll down when the DOM state is checked, so lazy-loaded fields are rendered
options.event.browseEvent.onDidDOMStateCheck.addLast { page, driver ->
  driver.scrollDown()
}
session.load(url, options)

WebDriver provides a complete method set for RPA, just like Selenium, Playwright, and Puppeteer. All actions and behaviors are optimized to mimic real people as closely as possible.

Web Driver

WebDriver defines a concise interface to visit and interact with webpages; all actions and behaviors are optimized to mimic real people as closely as possible, such as scrolling, clicking, typing text, dragging and dropping, etc.

The methods in this interface fall into three categories:

  1. Control of the browser itself

  2. Selection of elements, extracting textContent and attributes

  3. Interaction with the webpage

Key methods:

  • .navigateTo(): load a new webpage.

  • .scrollDown(): scroll down on a webpage so that it is fully loaded. Most modern webpages lazy-load content using Ajax, where the page content only starts to load when it is scrolled into view.

  • .pageSource(): retrieve the source code of a webpage.
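
A brief sketch combining these methods with the PageEvent handler shown earlier; only methods named in this document (scrollDown, pageSource) are used, and the handler name follows the previous example.

val options = session.options("-refresh")
options.event.browseEvent.onDidDOMStateCheck.addLast { page, driver ->
    // interact with the page before its content is captured
    driver.scrollDown()
    // retrieve the rendered source code of the webpage
    val html = driver.pageSource()
    println(html?.length)
}
session.load(url, options)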

URLs

A Uniform Resource Locator (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it.

A URL in Pulsar is a normal URL with extra information to describe a task. Every task in Pulsar is defined as some form of URL.

There are several basic forms of urls in Pulsar:

NormUrl stands for normal URL, which means the URL is in its final form and is usually passed to a real browser eventually.

Unless otherwise specified, a URL in string format actually means a configured URL, i.e., a URL with arguments, for example:

val url = "https://www.amazon.com/dp/B10000 -taskName amazon -expires 1d -ignoreFailure"
session.load(url)

The above code has the same meaning as the following code:

val url = "https://www.amazon.com/dp/B10000"
val args = "-taskName amazon -expires 1d -ignoreFailure"
session.load(url, args)

A UrlAware provides much richer control over crawl tasks. UrlAware is the interface of all hyperlinks; see the Hyperlinks section for details.

Finally, a DegenerateUrl is not actually a URL; it is an interface for any task to be executed in the crawl loop.

Hyperlinks

A hyperlink, or simply a link, refers specifically to a reference to data on the Web, usually containing a URL, a text, and a set of attributes, which the user can follow by clicking or tapping on it.

Hyperlinks in Pulsar are like normal hyperlinks, but with additional information to describe the task.

There are several hyperlinks predefined by Pulsar:

A ParsableHyperlink is a convenient abstraction to do fetch-and-parse tasks in continuous crawl jobs:

val parseHandler = { _: WebPage, document: FeaturedDocument ->
    // do something wonderful with the document
}

val urls = LinkExtractors.fromResource("seeds.txt")
    .map { ParsableHyperlink(it, parseHandler) }
PulsarContexts.create().submitAll(urls).await()

A CompletableHyperlink helps us to do Java-style asynchronous computation: submit a hyperlink and wait for the task to complete.

A ListenableHyperlink helps us attach event handlers:

val session = PulsarContexts.createSession()
val link = ListenableHyperlink(portalUrl, args = "-refresh -parse", event = PrintFlowEvent())
session.submit(link)

The example code can be found here: kotlin, along with the order in which the event handlers are executed.

A CompletableListenableHyperlink helps us do both:

fun executeQuery(request: ScrapeRequest): ScrapeResponse {
    // the hyperlink is a CompletableListenableHyperlink
    val hyperlink = createScrapeHyperlink(request)
    session.submit(hyperlink)
    // wait for the task to complete or timeout
    return hyperlink.get(3, TimeUnit.MINUTES)
}

The example code can be found here: kotlin.

Load Options

Almost every method in Pulsar Session accepts a parameter called load arguments, or load options, to control how to load, fetch and extract webpages.

There are three forms to combine URLs and their parameters:

  1. URL-arguments form

  2. URL-options form

  3. configured-URL form

// use URL-arguments form:
val page = session.load(url, "-expires 1d")
val page2 = session.load(url, "-refresh")
val document = session.loadDocument(url, "-expires 1d -ignoreFailure")
val pages = session.loadOutPages(url, "-outLink a[href~=item] -itemExpires 7d")
session.submit(Hyperlink(url, args = "-expires 1d"))

// Or use configured-URL form:
val page = session.load("$url -expires 1d")
val page2 = session.load("$url -refresh")
val document = session.loadDocument("$url -expires 1d -ignoreFailure")
val pages = session.loadOutPages("$url -expires 1d -ignoreFailure", "-outLink a[href~=item] -itemExpires 7d")
session.submit(Hyperlink("$url -expires 1d"))

// Or use URL-options form:
var options = session.options("-expires 1d -ignoreFailure")
val document = session.loadDocument(url, options)
options = session.options("-outLink a[href~=item] -itemExpires 7d")
val pages = session.loadOutPages("$url -expires 1d -ignoreFailure", options)

// ...

The configured-URL form can be mixed with the other two forms and has higher priority.

The most important load options are:

-expires     // The expiry time of a page
-itemExpires // The expiry time of item pages in batch scraping methods
-outLink     // The selector of out links to scrape
-refresh     // Force (re)fetch the page, just like hitting the refresh button on a real browser
-parse       // Activate parse subsystem
-resource    // Fetch the url as a resource without browser rendering

The load arguments are parsed into a LoadOptions object, check the code for all the supported options.

It is worth noting that when we execute the load() family of methods, the system does not parse the page; the parse() method is provided for that. However, once we add the -parse argument, the parsing subsystem is activated and the page is parsed automatically. We can register handlers to perform tasks such as data extraction, data persistence, and link collection.

There are two ways to register handlers in the parsing subsystem: Register a global ParseFilter with ParseFilters, or register a page wide event handler with PageEvent.

A good example of using ParseFilter to perform complex tasks is e-commerce site-wide data collection, where a separate ParseFilter is registered for each type of page to handle data extraction, extraction result persistence, link collection, etc.
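
A rough sketch of the second approach, a page-wide event handler attached through load options; the handler name onHTMLDocumentParsed is an assumption based on the PageEvent interface, so check PageEvent for the exact names.

// activate the parse subsystem and attach a page-wide handler;
// onHTMLDocumentParsed is assumed here, check PageEvent for the exact name
val options = session.options("-parse -expires 1d")
options.event.loadEvent.onHTMLDocumentParsed.addLast { page, document ->
    // extract fields, persist results, collect links, etc.
    println(document.title)
}
session.load(url, options)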

Browser Settings

BrowserSettings defines a convenient interface to specify the behavior of browser automation, such as:

  1. Headed or headless?

  2. SPA or not?

  3. Enable proxy IPs or not?

  4. Block media resources or not?

Check BrowserSettings for details.
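
A hedged sketch of what such configuration might look like; the method names below are assumptions for illustration rather than confirmed API, so check BrowserSettings for the actual calls.

// hypothetical calls, the real method names may differ (see BrowserSettings)
BrowserSettings
    .headless()   // run the browser without a visible window
    .withSPA()    // tune the system for single page applications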

Event Handler

Event handlers here are webpage event handlers that capture and process events throughout the lifecycle of webpages.

Check EventHandlerUsage for all available event handlers.

X-SQL

Pulsar supports the Network As A Database paradigm: we developed X-SQL to query the Web directly and to convert webpages into tables and charts.

Click X-SQL to see a detailed introduction and function descriptions about X-SQL.
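
A hedged sketch of running an X-SQL query from Kotlin; SQLContexts.create() and executeQuery() are assumed as the entry points of the SQLContext mentioned earlier, and the X-SQL functions and selectors reuse the scrape() example above for illustration.

// a sketch only: verify the X-SQL functions and the SQLContexts API against the X-SQL documentation
val context = SQLContexts.create()
val sql = """
    select
        dom_first_text(dom, '.p-name em') as name,
        dom_first_text(dom, '.p-price') as price
    from load_and_select('https://www.example.com/ -expires 1d', 'li[data-sku]')
""".trimIndent()
val rs = context.executeQuery(sql)
while (rs.next()) {
    println(rs.getString("name") + ", " + rs.getString("price"))
}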

Implementation concepts

Developers don’t need to study the implementation concepts, but knowing these concepts helps us better understand how the whole system works.

Url Pool

When running continuous crawls, urls are added into a UrlPool. A UrlPool contains a variety of UrlCaches to satisfy different requirements, for example, priority, delaying, deadline, external loading requirements, and so on.

Crawl Loop

When running continuous crawls, a crawl loop is started to keep fetching urls from the UrlPool, and then load/fetch them asynchronously in a PulsarSession.

Keep in mind that every task in Pulsar is a URL, so the crawl loop can accept and execute any kind of task.

Privacy Context

One of the biggest difficulties in web scraping tasks is bot stealth. For web scraping tasks, the website should have no idea whether a visit is from a human being or a bot. Once a page visit is suspected by the website, which we call a privacy leak, the privacy context has to be dropped, and Pulsar will visit the page in another privacy context.

Proxy Management

Obtain IPs from proxy vendors, record proxy status, rotate IPs smartly and automatically, and more.

Web Driver Stealth

When a browser is driven programmatically to access a webpage, the website may detect that the visit is automated. Web Driver stealth technology is used to prevent such detection.

Backend Storage

A variety of backend storage solutions are supported by Pulsar to meet our customers' pressing needs: Local File System, MongoDB, HBase, Gora, etc.