haydnba/site-metadata-scraper

Site Metadata Scraper

Scrape ecommerce site metadata for classification and keyword analysis.

Dependencies

Methodology

  • Iterate over a list of validated URLs (each input can be an origin, hostname, or domain name)
  • Process the input in batches (~5 URLs), loading each page in its own concurrent browser instance
  • For each page, extract basic site metadata:
    1. HTML lang value, to guide any subsequent analysis
    2. Document title
    3. Meta information: keywords and description, if available
    4. Social media handle anchors for major platforms
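
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper names (toUrl, toBatches, runInBatches, extractMetadata), the https default for inputs without a scheme, and the list of "major platforms" are all assumptions made for the example.

```javascript
// Hypothetical helper: turn an origin, hostname, or domain name into a URL string.
// Assumption: inputs without a scheme are reachable over https.
function toUrl(input) {
  const trimmed = input.trim();
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed);
  return new URL(hasScheme ? trimmed : `https://${trimmed}`).href;
}

// Split a list into consecutive batches of at most `size` items.
function toBatches(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Process batches one after another; items within a batch run concurrently.
// `worker` stands in for "open a browser instance and scrape one URL".
async function runInBatches(items, size, worker) {
  const results = [];
  for (const batch of toBatches(items, size)) {
    // allSettled keeps one failing site from discarding the rest of its batch.
    const settled = await Promise.allSettled(batch.map((item) => worker(item)));
    results.push(...settled);
  }
  return results;
}

// Collect the four pieces of metadata listed above from a document.
// Everything it needs is defined inside the function body so that it can
// also run in a page context, where `document` is the default argument.
function extractMetadata(doc = document) {
  // Assumed platform list; the source only says "major platforms".
  const platforms = ["facebook.com", "instagram.com", "twitter.com", "linkedin.com", "youtube.com"];
  const meta = (name) => {
    const el = doc.querySelector(`meta[name="${name}"]`);
    return el ? el.getAttribute("content") : null;
  };
  const hostOf = (href) => {
    try { return new URL(href).hostname; } catch { return ""; }
  };
  const social = Array.from(doc.querySelectorAll("a[href]"))
    .map((a) => a.href)
    .filter((href) => {
      const host = hostOf(href);
      return platforms.some((p) => host === p || host.endsWith(`.${p}`));
    });
  return {
    lang: doc.documentElement.lang || null,
    title: doc.title || null,
    keywords: meta("keywords"),
    description: meta("description"),
    social: [...new Set(social)], // de-duplicate repeated anchors
  };
}
```

With Puppeteer, a function shaped like extractMetadata could be passed to page.evaluate, which serialises it and runs it inside the loaded page. Processing batches sequentially keeps the number of open browser instances bounded by the batch size.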

The service is intended to run infrequently (e.g. monthly), built and run from the repository source via e.g. AWS CodeBuild...

Run

Export variables to the environment:

<path>: endpoint serving the index of URL data to iterate over
<size>: a reasonable batch size for concurrent browser instances (~5)

export INPUT_PATH=<path>
export BATCH_SIZE=<size>
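
As a rough sketch of how a service might read and validate these two variables at startup: readConfig is a hypothetical helper, not part of the repository, and the fallback batch size of 5 is an assumption based on the "~5" hint above.

```javascript
// Hypothetical helper: read INPUT_PATH and BATCH_SIZE from an environment object.
// Assumption: BATCH_SIZE falls back to 5 when unset; the source only suggests "~5".
function readConfig(env) {
  const inputPath = (env.INPUT_PATH || "").trim();
  if (!inputPath) {
    throw new Error("INPUT_PATH must be set");
  }
  const raw = env.BATCH_SIZE;
  const batchSize = raw === undefined || raw === "" ? 5 : Number(raw);
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("BATCH_SIZE must be a positive integer");
  }
  return { inputPath, batchSize };
}

// Usage in Node: const { inputPath, batchSize } = readConfig(process.env);
```

Failing fast on a missing or malformed value avoids launching browser instances with an unusable configuration.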

Run the service:

npm run start

TODO

Dockerise to run headful Puppeteer in a container with Xvfb.
