GitHub - haydnba/site-metadata-scraper: Gather seo metadata and social media handles etc.

Site Metadata Scraper

Scrape ecommerce site metadata for classification and keyword analysis.

Iterate over a list of validated URLs (each input can be an origin, hostname, or domain name)
Process the input in batches (~5), with concurrent browser instances loading the pages
For each, extract basic site metadata:
1. HTML lang value to guide any subsequent analysis
2. Document title
3. Meta information: keywords and description if available
4. Social media handle anchors for major platforms
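
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `src`: the helper names (`toUrl`, `toBatches`, `runInBatches`, `socialPlatform`), the `https` default scheme, and the platform list are all assumptions, and the browser work is left to a caller-supplied `worker` function.

```typescript
// Hypothetical helpers; none of these names come from the repository.

/** Turn an origin, hostname, or bare domain name into an origin string. */
function toUrl(input: string): string {
  const trimmed = input.trim();
  // Assumption: default to https when the input carries no scheme.
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed);
  return new URL(hasScheme ? trimmed : `https://${trimmed}`).origin;
}

/** Split a list into consecutive batches of at most `size` items. */
function toBatches<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("batch size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

/** Run `worker` over all items, one batch at a time, concurrently within a batch. */
async function runInBatches<T, R>(
  items: readonly T[],
  size: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of toBatches(items, size)) {
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}

// Assumption: an illustrative set of "major platforms", keyed by hostname.
const SOCIAL_HOSTS = new Map<string, string>([
  ["facebook.com", "facebook"],
  ["instagram.com", "instagram"],
  ["twitter.com", "twitter"],
  ["linkedin.com", "linkedin"],
  ["youtube.com", "youtube"],
]);

/** Map an anchor href to a platform name, or undefined if it is not a known platform. */
function socialPlatform(href: string): string | undefined {
  try {
    const host = new URL(href).hostname.replace(/^www\./, "");
    return SOCIAL_HOSTS.get(host);
  } catch {
    // Relative or malformed hrefs are not absolute social links.
    return undefined;
  }
}
```

Awaiting each batch before starting the next keeps the number of simultaneous browser instances bounded by the batch size, at the cost of waiting for the slowest page in every batch.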

The service is intended to run infrequently (e.g. monthly), built and run from the repository source via e.g. AWS CodeBuild...

Export variables to the environment:

<path>: endpoint for index of url data to iterate on
<size>: a reasonable batch size for concurrent browser instances (~5)

export INPUT_PATH=<path>
export BATCH_SIZE=<size>
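
As a rough sketch of how these two variables might be read and validated at startup: `readConfig` and `ScraperConfig` are hypothetical names, and the fallback batch size of 5 is an assumption based on the "~5" suggestion above rather than documented behaviour.

```typescript
// Hypothetical config reader; not taken from the repository source.
interface ScraperConfig {
  inputPath: string;
  batchSize: number;
}

function readConfig(env: Record<string, string | undefined>): ScraperConfig {
  const inputPath = env["INPUT_PATH"];
  if (!inputPath) {
    throw new Error("INPUT_PATH is not set");
  }
  // Assumption: fall back to 5 when BATCH_SIZE is not exported.
  const raw = env["BATCH_SIZE"];
  const batchSize = raw === undefined ? 5 : Number(raw);
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("BATCH_SIZE must be a positive integer");
  }
  return { inputPath, batchSize };
}
```

In Node this could be called as `readConfig(process.env)`, so that a missing or malformed variable fails fast before any browser instance is launched.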

Run the service:

npm run start

Dockerise to run headful Puppeteer in a container with Xvfb.
