This is a fork of Scrapy's QuotesBot. It adds a new spider called toscrape-hext.
$ pip install hext scrapy
$ scrapy list
toscrape-css
toscrape-hext
toscrape-xpath
$ scrapy crawl toscrape-hext -o quotes.json
Original readme:
This is a Scrapy project to scrape quotes from famous people from http://quotes.toscrape.com (github repo).
This project is only meant for educational purposes.
This project extracts quotes, combined with the respective author names and tags. The extracted data looks like this sample:
{
'author': 'Douglas Adams',
'text': '“I may not have gone where I intended to go, but I think I ...”',
'tags': ['life', 'navigation']
}
This project contains two spiders and you can list them using the list
command:
$ scrapy list
toscrape-css
toscrape-xpath
Both spiders extract the same data from the same website, but toscrape-css
employs CSS selectors, while toscrape-xpath
employs XPath expressions.
You can learn more about the spiders by going through the Scrapy Tutorial.
You can run a spider using the scrapy crawl
command, such as:
$ scrapy crawl toscrape-css
If you want to save the scraped data to a file, you can pass the -o
option:
$ scrapy crawl toscrape-css -o quotes.json