This is a report from my internship.
Improve crawler system. Focus on:
- Scheduling
- Running on a headless server
- Building an automated pipeline
- Design your system on
- Bypass security
Thanks to anh Đức's and anh Phúc's comments :))) on my work last time, I have improved my crawler system a little.
First, I fixed the hard-coded values you pointed out. Second, I made the system more automatic. Instead of changing the source code to crawl one or several pages at different points in time, I can now do all of that from a single command line. Here is how I did it.
The URL of each place is hard to generate, so I used Selenium to automatically search for a place and navigate to the first option in the drop-down menu.
Then I took the URL of the place and used it to generate the other pages for that place. For example, after clicking on the first option in the previous step, I got this:
place = 'quận-tân-bình'
Another thing I wanted to control is the period of historical weather data. So I adjusted the __init__() function of the spider to take two extra arguments: key, to search for the place, and period, to set the start and end time of the crawl on demand.
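def __init__(self, key=None, period=None, *args, **kwargs):
    super(WundergroundSpider, self).__init__(*args, **kwargs)
    self.key = key  # place name typed into the search box
    # period looks like '2020-11 2021-2': start and end month separated by a space
    self.start = period.split(' ')[0]
    self.end = period.split(' ')[1]
    PATH = r"..\chromedriver.exe"  # raw string so the backslash is kept literally
    self.driver = webdriver.Chrome(PATH)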
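Since period defaults to None, calling period.split() would fail when the argument is left out. Below is a minimal sketch of a more defensive way to split it. The helper name parse_period is my own assumption for illustration and is not part of the spider above.

```python
def parse_period(period):
    """Split a 'START END' period string such as '2020-11 2021-2'.

    Hypothetical helper: returns (start, end) as strings and raises
    ValueError with a readable message when the argument is unusable.
    """
    if period is None:
        raise ValueError("period is required, e.g. '2020-11 2021-2'")
    parts = period.split()  # split on any run of whitespace
    if len(parts) != 2:
        raise ValueError(f"expected 'START END', got {period!r}")
    start, end = parts
    return start, end


if __name__ == "__main__":
    print(parse_period('2020-11 2021-2'))
```

Inside __init__() this could be used as self.start, self.end = parse_period(period).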
For example, if I run a command with key='Quận Tân Bình' period='2020-11 2021-2', I will get all the recorded data for the 4 months 11/2020, 12/2020, 1/2021 and 2/2021 at Quận Tân Bình. I generated all the URLs with:
def get_url_list(place, start, end):
    url_lst = []
    datelist = get_datelist(start, end)
    for date in datelist:
        # one URL per month in the requested period
        new_url = f'{place}/VVTS/date/{date}'
        url_lst.append(new_url)
    return url_lst
get_datelist() is just a function I used to generate a list of dates from the given period.
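The report does not show the body of get_datelist(), so here is a minimal sketch of what such a function could look like. It assumes that start and end are 'YYYY-M' strings as in the period argument, and that each date in the list uses the same 'YYYY-M' form. Both points are my assumptions, not the original implementation.

```python
def get_datelist(start, end):
    """Return every month from start to end (inclusive) as 'YYYY-M' strings.

    Sketch only: the input and output formats are assumed from the
    period example '2020-11 2021-2'.
    """
    start_year, start_month = (int(p) for p in start.split('-'))
    end_year, end_month = (int(p) for p in end.split('-'))
    if (start_year, start_month) > (end_year, end_month):
        raise ValueError('start must not be after end')

    datelist = []
    year, month = start_year, start_month
    # Tuples compare year first, then month, so this walks month by month.
    while (year, month) <= (end_year, end_month):
        datelist.append(f'{year}-{month}')
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return datelist


if __name__ == "__main__":
    print(get_datelist('2020-11', '2021-2'))
```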
Finally, to get data from a place in a period of time, I used the command below:
$ scrapy crawl monthly -a key=<place-to-search> -a period=<crawl-period>
For example:
$ scrapy crawl monthly -a key='Quận Tân Bình' -a period='2020-11 2021-2' -o items.json
And I got this as the result:
But there is a drawback to my solution:
- The key argument from the command line needs to match the name of the place you want data for very closely. For example, if you want Quận Tân Bình but the argument in the command is only -a key='Tân Bình', the result will be wrong.
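One possible guard against this drawback is to compare the key with the place slug taken from the URL before crawling. The sketch below assumes the slug is the lowercased place name with spaces replaced by hyphens, as in 'quận-tân-bình' above. The function names are hypothetical and this check is not part of my current spider.

```python
import unicodedata


def slugify_key(key):
    """Lowercase the key and join its words with hyphens.

    NFC normalization makes accented Vietnamese characters compare
    equal whether they were typed in composed or decomposed form.
    """
    normalized = unicodedata.normalize('NFC', key.strip().lower())
    return '-'.join(normalized.split())


def key_matches_place(key, place):
    """Return True when the search key corresponds to the place slug."""
    return slugify_key(key) == unicodedata.normalize('NFC', place.lower())


if __name__ == "__main__":
    print(key_matches_place('Quận Tân Bình', 'quận-tân-bình'))
    print(key_matches_place('Tân Bình', 'quận-tân-bình'))
```

If the check fails, the spider could stop early instead of silently crawling the wrong place.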
Last week, I only scheduled the task with a fixed interval, like 'crawl every 5 minutes'. The disadvantage of an interval is that if I want the task to run at 9 PM (21:00) on the company server, I would have to wait until 9 PM and then start it with 'crawl every 1 day'. It's possible but not good. A solution for this issue is Celery's crontab schedule, which works like the cron command in Unix-like OSes.
First, I had to enable UTC and set the timezone:
enable_utc = True
CELERY_TIMEZONE = "Asia/Ho_Chi_Minh"
Then I changed the value of the schedule key to a crontab:
from celery.schedules import crontab

'crawl': {
    'task': 'tasks.daily_crawl',
    # 08:00 on every second day of the week
    'schedule': crontab(minute='0', hour='8', day_of_week='*/2', day_of_month='*', month_of_year='*'),
},
All the change is in file
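With the code above, my system runs at 8:00 AM on every second day of the week. Celery numbers the days of the week from 0 (Sunday) to 6 (Saturday), so day_of_week='*/2' means Sunday, Tuesday, Thursday and Saturday.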
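To check which days a '*/2' day-of-week field really selects, the expansion can be reproduced in plain Python. This is only a sketch of the cron step rule, not Celery's own code. It assumes cron-style day numbering with Sunday = 0, and the function names expand_step and next_runs are made up for this example.

```python
from datetime import datetime, timedelta


def expand_step(step, size):
    """Expand a cron '*/step' field over the values 0..size-1."""
    return list(range(0, size, step))


def next_runs(after, hour, cron_days, count):
    """Return the next `count` run times at hour:00 on the given cron days.

    cron_days uses cron numbering (Sunday = 0 ... Saturday = 6), while
    datetime.weekday() uses Monday = 0, so the value is shifted by one.
    Only times strictly later than `after` are returned.
    """
    runs = []
    day = after.replace(hour=hour, minute=0, second=0, microsecond=0)
    while len(runs) < count:
        cron_dow = (day.weekday() + 1) % 7
        if day > after and cron_dow in cron_days:
            runs.append(day)
        day += timedelta(days=1)
    return runs


if __name__ == "__main__":
    days = expand_step(2, 7)  # the days selected by day_of_week='*/2'
    print(days)
    # 2021-03-01 is a Monday
    for run in next_runs(datetime(2021, 3, 1), 8, set(days), 4):
        print(run.strftime('%A %Y-%m-%d %H:%M'))
```

Note that Saturday and Sunday are both selected, so the task runs on two consecutive days at the end of each week. It is not a strict 'every 48 hours' schedule.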