Site Metadata Scraper

Scrape ecommerce site metadata for classification and keyword analysis.

Iterate over a list of validated URLs (each input can be an origin, hostname, or domain name).
Process the input in batches (~5 URLs), opening each page in its own concurrent browser instance.
For each page, extract basic site metadata:
1. HTML lang attribute value, to guide any subsequent analysis
2. Document title
3. Meta tags: keywords and description, if available
4. Anchors linking to social media handles on major platforms
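The steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the helper names (`toOrigin`, `toBatches`, `runInBatches`, `socialHandles`) are hypothetical, defaulting to `https` for inputs without a scheme is an assumption, and the platform list is an assumed subset of "major platforms" whose handle is the first path segment.

```javascript
// Hypothetical helpers; names and defaults are illustrative assumptions.

// Accept an origin, a hostname, or a bare domain name and return an origin.
// Assumption: inputs without a scheme are reachable over https.
function toOrigin(input) {
  const trimmed = input.trim();
  const withScheme = /^https?:\/\//i.test(trimmed) ? trimmed : `https://${trimmed}`;
  return new URL(withScheme).origin;
}

// Split a list into consecutive batches of at most `size` items.
function toBatches(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `task` concurrently within each batch, one batch after another.
// Promise.allSettled keeps one failing site from aborting the whole batch.
async function runInBatches(items, size, task) {
  const results = [];
  for (const batch of toBatches(items, size)) {
    const settled = await Promise.allSettled(batch.map((item) => task(item)));
    results.push(...settled);
  }
  return results;
}

// Assumed subset of platforms where the first path segment is the handle.
const SOCIAL_HOSTS = new Map([
  ['facebook.com', 'facebook'],
  ['instagram.com', 'instagram'],
  ['twitter.com', 'twitter'],
  ['x.com', 'twitter'],
]);

// Given anchor hrefs collected from a page, keep the first handle per platform.
function socialHandles(hrefs) {
  const handles = {};
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href);
    } catch {
      continue; // skip values that are not absolute URLs
    }
    const platform = SOCIAL_HOSTS.get(url.hostname.replace(/^www\./, ''));
    const handle = url.pathname.split('/').filter(Boolean)[0];
    if (platform && handle && !(platform in handles)) {
      handles[platform] = handle;
    }
  }
  return handles;
}
```

In a Puppeteer task passed to `runInBatches`, the hrefs might come from something like `page.$$eval('a[href]', (as) => as.map((a) => a.href))`, the title from `page.title()`, and the language from `page.$eval('html', (el) => el.lang)`; how the service actually gathers them is not stated here.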

The service is intended to run infrequently (e.g. monthly), built and run from the repository source via a service such as AWS CodeBuild.

Export variables to the environment:

<path>: endpoint serving the index of URL data to iterate over
<size>: a reasonable batch size for concurrent browser instances (~5)

export INPUT_PATH=<path>
export BATCH_SIZE=<size>
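As a rough sketch of how the service might read these two variables at startup: the function name `readConfig` is hypothetical, and falling back to a batch size of 5 when `BATCH_SIZE` is unset is an assumption based on the "~5" suggested above, not documented behaviour.

```javascript
// Hypothetical config loader; not taken from the repository source.
function readConfig(env = process.env) {
  const inputPath = env.INPUT_PATH;
  if (!inputPath) {
    throw new Error('INPUT_PATH must be set to the URL index endpoint');
  }

  // Assumption: default to 5 concurrent browser instances when unset.
  const batchSize = Number.parseInt(env.BATCH_SIZE ?? '5', 10);
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error('BATCH_SIZE must be a positive integer');
  }

  return { inputPath, batchSize };
}
```

Failing fast on a missing or malformed value keeps a scheduled build from launching browsers with an unusable configuration.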

Run the service:

npm run start

Dockerise the service to run headful Puppeteer in a container with Xvfb.
