English | 简体中文
This document describes Pulsar concepts.
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.
The Internet, especially the World Wide Web (WWW), is the largest database in the world, but extracting data from the Web has never been easy.
Pulsar treats the network as a database. For each webpage, Pulsar first checks the local storage; if the page does not exist, has expired, or any other fetch condition is triggered, Pulsar retrieves it from the Web.
Pulsar supports SQL, so we can turn the Web into tables and charts using simple SQL statements, and we can also query the Web directly using SQL.
Web data extraction has evolved over the last three decades from statistical methods to more advanced machine learning methods. Machine learning techniques are much preferred nowadays due to time-to-market demands, developer productivity, and cost concerns. Using our cutting-edge technology, the entire end-to-end Web data extraction lifecycle is automated without any manual intervention.
We provide a preview project, Exotic, to show how to use our world-leading machine learning algorithm to extract almost every field in webpages automatically.
You can also use Exotic as an intelligent CSS selector generator; the generated CSS selectors can be used in any traditional scraping system to extract data from webpages.
Although Pulsar supports various web scraping methods, browser rendering is the primary way Pulsar scrapes webpages.
Browser rendering means every webpage is opened by a real browser to make sure all fields on the webpage are rendered correctly.
PulsarContext consists of a set of highly customizable components that provide the core set of interfaces of the system and is used to create PulsarSessions.
The PulsarContext is the interface to all other contexts.
A StaticPulsarContext consists of the default components.
A ClassPathXmlPulsarContext consists of components which are customized using Spring bean configuration files.
A SQLContext consists of components to support X-SQL.
Programmers can write their own Pulsar Contexts to extend the system.
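For example, a session can be obtained from a context. The snippet below is a minimal sketch: PulsarContexts.create() appears in later examples in this document, while createSession() on the context is assumed from the description above.

// A minimal sketch: obtain the default context and create a session from it.
// createSession() on the context is assumed from the description above.
val context = PulsarContexts.create()
val session = context.createSession()

// The shortcut used in the examples throughout this document:
val session2 = PulsarContexts.createSession()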
PulsarSession defines an interface to load web pages from local storage or fetch from the Internet, as well as methods for parsing, extracting, saving, indexing, and exporting web pages.
Key methods:
- .load(): load a webpage from local storage, or fetch it from the Internet.
- .parse(): parse a webpage into a document.
- .scrape(): load a webpage, parse it into a document, and then extract fields from the document.
- .submit(): submit a URL to the URL pool; the URL will be processed in the main loop later.
And also the batch versions:
- .loadOutPages(): load the portal page and out pages.
- .scrapeOutPages(): load the portal page and out pages, and extract fields from the out pages.
Check PulsarSession to learn about all its methods.
The first thing to understand is how to load a page. Load methods such as session.load() first check the local storage and return the local version if the required page exists and meets the requirements; otherwise the page is fetched from the Internet.
The load parameters, or load options, can be used to specify when the system will fetch a webpage from the Internet:
- Expiration
- Force refresh
- Page size
- Required fields
- Other conditions
val session = PulsarContexts.createSession()
val url = "..."

// Load the page; fetch it from the Web if the local copy is older than one day
val page = session.load(url, "-expires 1d")
// Parse the page content into an HTML document
val document = session.parse(page)
// Load and parse in a single call
val document2 = session.loadDocument(url, "-expires 1d")
// Load the portal page and the out pages selected by -outLink
val pages = session.loadOutPages(url, "-outLink a[href~=item] -expires 1d -itemExpires 7d -itemRequireSize 300000")
// Load the page and extract fields from elements matching the restrict selector
val fields = session.scrape(url, "-expires 1d", "li[data-sku]", listOf(".p-name em", ".p-price"))
// ...
Once a page is loaded from local storage or fetched from the Internet, we proceed to the next processing steps, sketched below:
- parse the page content into an HTML document
- extract fields from the HTML document
- write the fields into a destination, such as:
  - plain files, Avro files, CSV, Excel, MongoDB, MySQL, etc.
  - Solr, Elasticsearch, etc.
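A minimal sketch of these steps, reusing session, url, and the selectors from the earlier snippet. It assumes scrape() returns one map of selector to extracted text per matched element, and the CSV destination is purely illustrative:

import java.nio.file.Files
import java.nio.file.Paths

// Load, parse and extract in one call, reusing the selectors from the earlier example.
// Assumption: scrape() returns one map of selector -> extracted text per element
// matching the restrict selector "li[data-sku]".
val rows = session.scrape(url, "-expires 1d", "li[data-sku]", listOf(".p-name em", ".p-price"))

// Write the extracted fields to a CSV file (an illustrative destination).
val csv = rows.joinToString("\n") { row -> row.values.joinToString(",") }
Files.writeString(Paths.get("fields.csv"), csv)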
There are many ways to fetch the page content from the Internet:
- through the HTTP protocol
- through a real browser
Since webpages are becoming more and more complex, fetching webpages through real browsers is the primary way nowadays.
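The choice can be made per task with load options; the snippet below is a small sketch using the -resource option listed later in this document:

// Fetch through a real browser (the default behavior)
val rendered = session.load(url, "-expires 1d")

// Fetch the url as a plain resource over the HTTP protocol, without browser rendering
val resource = session.load(url, "-refresh -resource")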
When we fetch webpages using a real browser, we may need to interact with pages to ensure the desired fields are loaded correctly and completely. Activate PageEvent and use WebDriver to achieve this.
val options = session.options(args)
// Scroll down once the DOM is ready so that lazily loaded fields are rendered
options.event.browseEvent.onDidDOMStateCheck.addLast { page, driver ->
    driver.scrollDown()
}
session.load(url, options)
WebDriver provides a complete method set for RPA, just like Selenium, Playwright, and Puppeteer. It defines a concise interface to visit and interact with web pages, and all actions and behaviors are optimized to mimic real people as closely as possible: scrolling, clicking, typing text, dragging and dropping, and so on.
The methods in this interface fall into three categories:
- Control of the browser itself
- Selection of elements, and extraction of textContent and attributes
- Interaction with the webpage
Key methods:
- .navigateTo(): load a new webpage.
- .scrollDown(): scroll down on a webpage to fully load the page. Most modern webpages support lazy loading using AJAX, where the page content only starts to load when it is scrolled into view.
- .pageSource(): retrieve the source code of a webpage.
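A minimal sketch that puts these methods together inside the browse event handler shown earlier; the handler name and the WebDriver methods come from this document, while the rest of the snippet is illustrative:

val options = session.options("-refresh")
options.event.browseEvent.onDidDOMStateCheck.addLast { page, driver ->
    // scroll down so that lazily loaded content is rendered
    driver.scrollDown()
    // retrieve the rendered source of the webpage
    val html = driver.pageSource()
    println("page source size: ${html?.length}")
}
session.load(url, options)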
A Uniform Resource Locator (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it.
A URL in Pulsar is a normal URL with extra information to describe a task. Every task in Pulsar is defined as some form of URL.
There are several basic forms of URLs in Pulsar:
A NormUrl stands for a normal URL, which means the URL is in its final form and is usually passed to a real browser eventually.
If not specified, a URL in string format actually means a configured URL, or a URL with arguments, for example:
val url = "https://www.amazon.com/dp/B10000 -taskName amazon -expires 1d -ignoreFailure"
session.load(url)
The above code has the same meaning as the following code:
val url = "https://www.amazon.com/dp/B10000"
val args = "-taskName amazon -expires 1d -ignoreFailure"
session.load(url, args)
A UrlAware provides much richer control over crawl tasks. UrlAware is the interface of all Hyperlinks; see the Hyperlinks section for details.
Finally, a DegenerateUrl is not actually a URL; it is an interface for any task to be executed in the crawl loop.
A hyperlink, or simply a link, is a reference to data on the Web that the user can follow by clicking or tapping; it usually contains a URL, a text, and a set of attributes.
Hyperlinks in Pulsar are like normal hyperlinks, but with additional information to describe the task.
There are several hyperlinks predefined by Pulsar:
A ParsableHyperlink is a convenient abstraction to do fetch-and-parse tasks in continuous crawl jobs:
val parseHandler = { _: WebPage, document: FeaturedDocument ->
    // do something wonderful with the document
}

val urls = LinkExtractors.fromResource("seeds.txt")
    .map { ParsableHyperlink(it, parseHandler) }
PulsarContexts.create().submitAll(urls).await()
A CompletableHyperlink helps us do Java-style asynchronous computation: submit a hyperlink and wait for the task to complete.
A ListenableHyperlink helps us attach event handlers:
val session = PulsarContexts.createSession()
val link = ListenableHyperlink(portalUrl, args = "-refresh -parse", event = PrintFlowEvent())
session.submit(link)
The example code can be found here: kotlin, along with the order in which the event handlers are executed.
A CompletableListenableHyperlink helps us do both:
fun executeQuery(request: ScrapeRequest): ScrapeResponse {
    // the hyperlink is a CompletableListenableHyperlink
    val hyperlink = createScrapeHyperlink(request)
    session.submit(hyperlink)
    // wait for the task to complete or timeout
    return hyperlink.get(3, TimeUnit.MINUTES)
}
The example code can be found here: kotlin.
Almost every method in Pulsar Session accepts a parameter called load arguments, or load options, to control how to load, fetch and extract webpages.
There are three forms to combine URLs and their parameters:
- URL-arguments form
- URL-options form
- configured-URL form
// use URL-arguments form:
val page = session.load(url, "-expires 1d")
val page2 = session.load(url, "-refresh")
val document = session.loadDocument(url, "-expires 1d -ignoreFailure")
val pages = session.loadOutPages(url, "-outLink a[href~=item] -itemExpires 7d")
session.submit(Hyperlink(url, args = "-expires 1d"))
// Or use configured-URL form:
val page = session.load("$url -expires 1d")
val page2 = session.load("$url -refresh")
val document = session.loadDocument("$url -expires 1d -ignoreFailure")
val pages = session.loadOutPages("$url -expires 1d -ignoreFailure", "-outLink a[href~=item] -itemExpires 7d")
session.submit(Hyperlink("$url -expires 1d"))
// Or use URL-options form:
var options = session.options("-expires 1d -ignoreFailure")
val document = session.loadDocument(url, options)
options = session.options("-outLink a[href~=item] -itemExpires 7d")
val pages = session.loadOutPages("$url -expires 1d -ignoreFailure", options)
// ...
The configured-URL form can be mixed with the other two forms and has the highest priority.
The most important load options are:
-expires       // The expiry time of a page
-itemExpires   // The expiry time of item pages in batch scraping methods
-outLink       // The selector of out links to scrape
-refresh       // Force (re)fetch the page, just like hitting the refresh button on a real browser
-parse         // Activate the parse subsystem
-resource      // Fetch the url as a resource without browser rendering
The load arguments are parsed into a LoadOptions object, check the code for all the supported options.
It is worth noting that when we execute the load() family of methods, the system does not parse the page, but provides the parse() method to parse it. However, once we add the -parse argument, the parsing subsystem is activated and the page is parsed automatically. We can register handlers to perform tasks such as data extraction, data persistence, and link collection.
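For example, a short sketch based on the options above:

// Without -parse: the page is only loaded; parse it explicitly when needed
val page = session.load(url, "-expires 1d")
val document = session.parse(page)

// With -parse: the parse subsystem is activated and the page is parsed automatically,
// so registered handlers can run
val page2 = session.load(url, "-expires 1d -parse")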
There are two ways to register handlers in the parsing subsystem: Register a global ParseFilter with ParseFilters, or register a page wide event handler with PageEvent.
A good example of using ParseFilter to perform complex tasks is e-commerce site-wide data collection, where a separate ParseFilter is registered for each type of page to handle data extraction, extraction result persistence, link collection, etc.
BrowserSettings defines a convenient interface to specify the behavior of browser automation, such as:
- Headed or headless?
- SPA or not?
- Enable proxy IPs or not?
- Block media resources or not?
Check BrowserSettings for details.
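A hypothetical sketch of what such configuration might look like; the method names below are assumptions for illustration, so check BrowserSettings for the actual API:

// Assumed API, for illustration only - verify against BrowserSettings
BrowserSettings.headless()   // run the browser without a visible window
BrowserSettings.withSPA()    // tune the system for single page applications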
Event handlers here are webpage event handlers that capture and process events throughout the lifecycle of webpages.
Check EventHandlerUsage for all available event handlers.
Pulsar supports the Network As A Database paradigm, so we developed X-SQL to query the Web directly and convert webpages into tables and charts.
Click X-SQL to see a detailed introduction and function descriptions about X-SQL.
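As a rough sketch of the idea, an X-SQL query can be executed from a SQLContext. The entry points SQLContexts.create() and executeQuery(), as well as the X-SQL functions load_and_select and dom_first_text, are assumptions here; the selectors are reused from the earlier scrape example:

// A sketch only - the entry points and X-SQL functions below are assumptions,
// see the X-SQL documentation for the exact API
val context = SQLContexts.create()
val sql = """
    select
        dom_first_text(dom, '.p-name em') as name,
        dom_first_text(dom, '.p-price') as price
    from load_and_select('https://www.amazon.com/dp/B10000 -expires 1d', 'li[data-sku]')
"""
val rs = context.executeQuery(sql)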
Developers don’t need to study the implementation concepts, but knowing these concepts helps us better understand how the whole system works.
When running continuous crawls, a crawl loop is started that keeps taking URLs from the UrlPool and then loads or fetches them asynchronously in a PulsarSession.
Keep in mind that every task in Pulsar is a URL, so the crawl loop can accept and execute any kind of task.
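As a sketch, reduced to its core, submitting a task into the crawl loop and waiting for the loop looks like this; submit() on the context is an assumption mirroring session.submit() and the context.submitAll() call shown earlier:

// Submit a task into the crawl loop; it is taken from the URL pool and
// processed asynchronously in the main loop.
val context = PulsarContexts.create()
context.submit(Hyperlink(url, args = "-expires 1d"))   // submit() is assumed, see above
context.await()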
One of the biggest difficulties in web scraping is bot stealth: the website should have no idea whether a visit comes from a human being or a bot. Once a page visit is suspected by the website, which we call a privacy leak, the privacy context has to be dropped, and Pulsar will visit the page in another privacy context.
Obtain IPs from proxy vendors, record proxy status, rotate IPs smartly and automatically, and more.
When a browser is programmed to access a webpage, the website may detect that the visit is automated. WebDriver stealth technology is used to prevent such detection.