Scrape ecommerce site metadata for classification and keyword analysis.
- Use puppeteer to run a browser instance for scraping.
- URL validation and parsing are handled with the public suffix list.
- Iterate over a list of validated URLs (each input can be an origin, a hostname or a domain name).
- Process the input in batches (~5) using concurrent browser instances to load each page.
- For each, extract basic site metadata:
  - Html `lang` value to guide any subsequent analysis
  - Document `title`
  - Meta information: `keywords` and `description` if available
  - Social media handle anchors for major platforms
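The steps in the list above can be sketched roughly as follows. This is a minimal illustration and not the service's actual code: `toPageUrl`, `extractMetadata` and the platform list are hypothetical names, and the public suffix list check is left out so that the sketch only relies on the built-in `URL` class.

```javascript
// Hypothetical helper: turn an origin, hostname or bare domain name into a
// page URL. The public suffix list validation is NOT reproduced here.
function toPageUrl(input) {
  const trimmed = String(input).trim();
  // Assumption: default to https when the input carries no scheme.
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed);
  const candidate = hasScheme ? trimmed : `https://${trimmed}`;
  try {
    const url = new URL(candidate);
    // Crude stand-in for real validation: require at least one dot.
    return url.hostname.includes(".") ? `${url.origin}/` : null;
  } catch {
    return null; // not parseable as a URL
  }
}

// Hypothetical extractor, written against the DOM `document`. Everything it
// needs lives inside the function body so that it stays self-contained if it
// is serialised and run inside a page.
function extractMetadata(doc = document) {
  // Assumption: illustrative platform list, not the project's own.
  const platforms = {
    facebook: "facebook.com",
    instagram: "instagram.com",
    twitter: "twitter.com",
    youtube: "youtube.com",
    linkedin: "linkedin.com",
  };
  const metaContent = (name) => {
    const el = doc.querySelector(`meta[name="${name}"]`);
    const content = el ? el.getAttribute("content") : null;
    return content ? content.trim() : null;
  };
  const social = {};
  for (const anchor of doc.querySelectorAll("a[href]")) {
    let host;
    try {
      host = new URL(anchor.href).hostname.replace(/^www\./, "");
    } catch {
      continue; // skip hrefs that are not parseable absolute URLs
    }
    for (const [platform, domain] of Object.entries(platforms)) {
      const matches = host === domain || host.endsWith(`.${domain}`);
      // Keep only the first anchor found per platform.
      if (matches && !(platform in social)) social[platform] = anchor.href;
    }
  }
  return {
    lang: doc.documentElement.getAttribute("lang") || null,
    title: doc.title || null,
    keywords: metaContent("keywords"),
    description: metaContent("description"),
    social,
  };
}
```

With puppeteer, a function shaped like `extractMetadata` could be passed to `page.evaluate` after navigating to the URL returned by `toPageUrl`; missing values come back as `null` so that downstream analysis can tell "absent" from "empty".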
The service is intended to run infrequently (e.g. monthly), built and run from repository source via e.g. AWS CodeBuild...
Export variables to the environment:
- `<path>`: endpoint for index of url data to iterate on
- `<size>`: a reasonable batch size for concurrent browser instances (~5)
export INPUT_PATH=<path>
export BATCH_SIZE=<size>
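As an illustration of how these two variables might be consumed, the sketch below reads them and processes a list in batches. `readConfig` and `runInBatches` are hypothetical names, and both the fallback of 5 and the use of `Promise.allSettled` are assumptions rather than the service's confirmed behaviour.

```javascript
// Hypothetical config reader for the two variables exported above.
function readConfig(env = process.env) {
  const inputPath = env.INPUT_PATH;
  if (!inputPath) throw new Error("INPUT_PATH is not set");
  // Assumption: fall back to a batch size of 5 when BATCH_SIZE is unset.
  const batchSize = Number.parseInt(env.BATCH_SIZE ?? "5", 10);
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("BATCH_SIZE must be a positive integer");
  }
  return { inputPath, batchSize };
}

// Hypothetical batch runner: slices `items` into groups of `batchSize`,
// runs each group concurrently and the groups one after another.
async function runInBatches(items, batchSize, worker) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // allSettled so that one failing site does not reject the whole batch.
    const settled = await Promise.allSettled(batch.map((item) => worker(item)));
    results.push(...settled);
  }
  return results;
}
```

Here `worker` would be whatever function scrapes a single URL. Keeping the batches sequential caps the number of browser instances open at once at `batchSize`.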
Run the service:
npm run start
Dockerise to run headful puppeteer in container with xvfb.
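A headful browser needs a display, which is what xvfb provides inside the container. The sketch below shows one way the launch options might be assembled. `containerLaunchOptions` is a hypothetical helper, and the sandbox flags are an assumption that is common for Chromium running as root in Docker, not something this project confirms.

```javascript
// Hypothetical helper: options for a headful puppeteer launch in a container.
function containerLaunchOptions(env = process.env) {
  // A headful browser needs an X display; xvfb supplies a virtual one.
  if (!env.DISPLAY) {
    throw new Error("DISPLAY is not set; start the process under xvfb");
  }
  return {
    headless: false,
    // Assumption: the container runs as root, so the sandbox is disabled.
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  };
}
```

The returned object has the shape accepted by `puppeteer.launch`. Starting the service as `xvfb-run -a npm run start` would be one way to have `DISPLAY` set before the browser is launched.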