GitHub - trentbuck/staticice-for-cameras: proof-of-concept web scraping for online camera shops

GOAL: be staticice, except for cameras.

Summary of proof-of-concept experiments:

msy.py - a fully working scraper for MSY's parts list, complete with example output, and a Functional Requirements document.

This is like staticice and steamprices -- it remembers and charts price changes over time, so you can observe long-term trends!
FIXME: haven't put the FR up yet.
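A minimal sketch of the "remember price changes over time" idea, not the actual code in msy.py. The `price_history` table, its columns, and the `record_price` helper are all made up for illustration. Prices are stored as integer cents so that "did it change?" is an exact comparison.

```python
import sqlite3

def record_price(conn, sku, price_cents, seen_on):
    """Store a row only when the price differs from the last one recorded.

    Table and column names are hypothetical, not taken from msy.py.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS price_history"
        " (sku TEXT, seen_on TEXT, price_cents INTEGER)"
    )
    last = conn.execute(
        "SELECT price_cents FROM price_history WHERE sku = ?"
        " ORDER BY seen_on DESC LIMIT 1",
        (sku,),
    ).fetchone()
    if last is not None and last[0] == price_cents:
        return False  # unchanged, so nothing new to chart
    conn.execute(
        "INSERT INTO price_history (sku, seen_on, price_cents) VALUES (?, ?, ?)",
        (sku, seen_on, price_cents),
    )
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
record_price(conn, "EXAMPLE-SKU", 10000, "2020-01-01")
record_price(conn, "EXAMPLE-SKU", 10000, "2020-01-02")  # same price, skipped
record_price(conn, "EXAMPLE-SKU", 9000, "2020-01-03")
history = conn.execute(
    "SELECT seen_on, price_cents FROM price_history ORDER BY seen_on"
).fetchall()
```

Storing only the changes keeps the table small, and ISO dates (YYYY-MM-DD) sort correctly as plain text, so `ORDER BY seen_on` gives the points for a chart in order.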
test2 - follow the scrapy tutorial, basic poking around scrapy
```
cd test2 && scrapy crawl quotes
```
test3 - work out how to make scrapy into a "normal" app. I couldn't quite reduce it to a single file, so it stays a directory that you run as a module:
```
python3 -m test3      # "run" the directory as-is
```
Also worked out how to make scrapy save to a database (badly).

Also worked out how to turn the database table into an Excel spreadsheet (xlsx), for showing to regular people.
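For the save-to-a-database part, a rough sketch of what a scrapy item pipeline writing to SQLite can look like. This is not the code in test3. The class name, the default file name, and the `url` column are assumptions; `SKUs`, `make` and `price` are borrowed from the example query further down. It assumes items are plain dicts and uses the classic `process_item(self, item, spider)` signature.

```python
import sqlite3

class SQLitePipeline:
    """Scrapy-style item pipeline; scrapy calls these methods by name."""

    def __init__(self, db_path="skus.db"):  # hypothetical file name
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the crawl starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS SKUs"
            " (url TEXT PRIMARY KEY, make TEXT, price REAL)"
        )

    def process_item(self, item, spider):
        # INSERT OR REPLACE keeps one row per URL across repeated crawls.
        self.conn.execute(
            "INSERT OR REPLACE INTO SKUs (url, make, price) VALUES (?, ?, ?)",
            (item["url"], item.get("make"), item.get("price")),
        )
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the crawl ends.
        self.conn.commit()
        self.conn.close()
```

Scrapy only runs a pipeline that is listed in the project's `ITEM_PIPELINES` setting. Nothing in the class imports scrapy, so it can be poked at from a plain Python prompt by calling the three methods by hand.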
test3-jb - because JB's sitemap has ALL products, not just cameras, I thought I'd ignore it and instead try to read from their user-facing pages like https://www.jbhifi.com.au/collections/cameras

Big mistake - it's all generated by hairy javascript, so the only way to do that would be to either
1. run an entire GUI browser in "headless" / "remote control" mode. That needs something like 2GB of RAM and 500MB of disk, and is just really bad.
2. reverse-engineer shopify's (deliberately confusing) javascript
3. pretend to be a shopify retailer and dig through their (paywalled?) retailer docs, hoping it gives away something.
So for now give up on that, and instead just read EVERY product, and throw away 98% of them (non-camera ones).
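The "read everything, throw most of it away" approach could be sketched like this, assuming the product URLs come out of a standard sitemap.xml and that a product's category is only known after its page has been scraped. The sample XML, the `type` field, and both helper names are made up; only the sitemap namespace URI is the real one from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]

def keep_cameras(products):
    """Keep only scraped products whose (hypothetical) type field says camera."""
    return [p for p in products if p.get("type") == "CAMERAS"]

# Made-up sample; a real sitemap would be fetched over HTTP.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/some-camera</loc></url>
  <url><loc>https://example.com/products/some-dvd</loc></url>
</urlset>"""

urls = sitemap_urls(SAMPLE)
```

This is wasteful (every product page still gets fetched), but it only needs plain HTTP and an XML parser, not a browser.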
test4.py - go back to scraping the "lo-fi" way, with no confusing OO middleware. scrapy is 3 MEGABYTES of code, we should be able to do this in about 0.04 MEGABYTES.
- Successfully scraping basic metadata from JB products.
- Add a quick hack to discard all the DVDs and CDs.
- Add a quick hack to NEVER re-scrape any product.
- test4.db
- test4.xlsx
- test4.csv
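One way the "NEVER re-scrape any product" hack could work, as a sketch only (test4.py may do it differently). It assumes a `scraped` table keyed by URL; the table and function names are invented for this example.

```python
import sqlite3

def init_db(conn):
    """Create the (hypothetical) table that remembers finished URLs."""
    conn.execute("CREATE TABLE IF NOT EXISTS scraped (url TEXT PRIMARY KEY)")

def should_scrape(conn, url):
    """True if no earlier run has recorded this URL."""
    row = conn.execute("SELECT 1 FROM scraped WHERE url = ?", (url,)).fetchone()
    return row is None

def mark_scraped(conn, url):
    """Record the URL; call this only after the page was fetched and parsed."""
    conn.execute("INSERT OR IGNORE INTO scraped (url) VALUES (?)", (url,))
    conn.commit()

conn = sqlite3.connect(":memory:")  # a real run would use a file, e.g. test4.db
init_db(conn)
todo = ["https://example.com/p/1", "https://example.com/p/2", "https://example.com/p/1"]
fetched = []
for url in todo:
    if should_scrape(conn, url):
        fetched.append(url)  # stand-in for the real download + parse
        mark_scraped(conn, url)
```

Marking a URL only after a successful scrape means a failed download gets retried on the next run. The obvious downside of "never" is that price changes on already-seen products are missed.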
test5 - have a go at using scrapy's helper code specifically designed to deal with sitemap.xml.
- upstream CSV writer (instead of database hack).
- upstream throttling options
- Basic scraper for digidirect.com.au.
- Initial "don't rescrape the same URL repeatedly" code.
Partial output: test5.csv (~4000 of ~10000 SKUs)
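For reference, scrapy's sitemap helper is `scrapy.spiders.SitemapSpider` (configured through `sitemap_urls` and `sitemap_rules`), and the upstream CSV writer and throttling are plain settings. The fragment below is a guess at what such a settings.py could contain, not a copy of test5's actual configuration; the setting names are real scrapy settings, the values are illustrative.

```python
# Fragment for a scrapy project's settings.py; values are illustrative guesses.

# Feed export: write every scraped item to a CSV file using the built-in exporter.
FEEDS = {
    "test5.csv": {"format": "csv"},
}

# Throttling: be polite to the shop's web server.
DOWNLOAD_DELAY = 1.0                # seconds between requests to the same site
AUTOTHROTTLE_ENABLED = True         # adapt the delay to how fast the server responds
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
```

With ~10000 SKUs and a one-second delay a full crawl takes hours, which is one plausible reason to want partial output and a way to skip URLs that were already scraped.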
sqlite2xlsx.py - since sqlitebrowser is a bit too simple and lobase + JDBC is really tedious, make a bare-bones report generator for non-IT stakeholders.

python3 sqlite2xlsx.py test4.db -q 'SELECT * FROM SKUs WHERE type = "CAMERAS" ORDER BY make, price DESC'
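The core of such a report generator is small. This is a sketch of the idea, not the contents of sqlite2xlsx.py: it assumes the third-party openpyxl library for writing the spreadsheet, and the function names are made up. openpyxl is imported inside `write_xlsx` so the query half works without it.

```python
import sqlite3

def run_query(db_path, query):
    """Run one SQL query and return (column_names, rows)."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(query)
        headers = [col[0] for col in cur.description]
        return headers, cur.fetchall()
    finally:
        conn.close()

def write_xlsx(headers, rows, out_path):
    """Write a header row plus the data rows to a one-sheet workbook."""
    from openpyxl import Workbook  # third-party dependency (an assumption)

    wb = Workbook()
    ws = wb.active
    ws.append(headers)
    for row in rows:
        ws.append(row)  # each row is a tuple of cell values
    wb.save(out_path)

# Hypothetical usage:
#   headers, rows = run_query("test4.db", "SELECT * FROM SKUs ORDER BY make")
#   write_xlsx(headers, rows, "report.xlsx")
```

A note on the query above: standard SQL uses single quotes for string literals. SQLite treats a double-quoted `"CAMERAS"` as a string only as a fallback when no column has that name, so `type = 'CAMERAS'` is the safer spelling if the shell quoting allows it.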
