
[feature] Crawl all non-medium websites to fetch all articles #22

Open

vmeylan opened this issue Apr 7, 2024 · 1 comment

@vmeylan (Collaborator)

vmeylan commented Apr 7, 2024

TODO

  • Update src/populate_csv_files/get_article_content/crawl_non_medium_websites.py to crawl all post URLs from every website in data/links/websites.csv (the websites from data.mev.fyi). The websites can be visualized on the Websites tab at data.mev.fyi.
  • Input: website URLs. Output: a dict mapping each website to all of its article URLs (handling pagination).
  • Approach: a single general script to which per-website config items are passed (a minimal sketch follows this list).
  • Work in progress:
    • Fix pagination.
    • Make sure it works for all websites; the config skeleton might need to be updated.
    • If no articles are found on a website, first visit the site and check whether other index URLs are available (e.g. /technology or /writing [...]).
    • If there are new websites and the config already exists, add empty config items for them to the existing config file.
    • If new index pages become available, e.g. a /technology when only /writing was added, append this /technology to to_parse.csv.
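
A minimal sketch of the config-driven approach described above, assuming a simple config shape (base URL plus CSS selectors for article links and the next-page link). The config keys, the example selectors, and the helper names are illustrative assumptions, not the actual schema used in crawl_non_medium_websites.py:

```python
# Minimal sketch of a config-driven crawler with pagination.
# The config keys below are assumptions, not the repo's actual schema.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical config: one entry per website, with CSS selectors for article
# links and for the "next page" link used for pagination.
WEBSITE_CONFIGS = {
    "https://example-blog.com": {
        "start_path": "/writing",
        "article_selector": "article h2 a",
        "next_page_selector": "a.next",
    },
}

def crawl_website(base_url: str, config: dict, max_pages: int = 50) -> list[str]:
    """Follow pagination and collect all article URLs for one website."""
    article_urls: list[str] = []
    page_url = urljoin(base_url, config["start_path"])
    for _ in range(max_pages):  # hard cap so broken pagination cannot loop forever
        response = requests.get(page_url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.select(config["article_selector"]):
            href = link.get("href")
            if href:
                article_urls.append(urljoin(base_url, href))
        next_link = soup.select_one(config["next_page_selector"])
        if next_link is None or not next_link.get("href"):
            break  # no further pages
        page_url = urljoin(base_url, next_link["href"])
    return article_urls

def crawl_all(configs: dict) -> dict[str, list[str]]:
    """Return the required output shape: {website: [article URLs]}."""
    return {site: crawl_website(site, cfg) for site, cfg in configs.items()}
```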

Challenges:

  • Make sure pagination works.
  • Make the code general and robust: abstract all site-specific complexity into the config items. Expect several containers per site, each with its own selectors (see the config skeleton sketch below).
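
One possible shape for such a config skeleton is sketched below; the field names (containers, index_path, item_selector, link_selector, pagination) are assumptions for illustration, not the repo's actual format:

```python
# Possible config skeleton (field names are assumptions): each website can declare
# several containers, each with its own selectors, plus a shared pagination rule.
SITE_CONFIG_SKELETON = {
    "https://example-blog.com": {
        "containers": [
            {   # main /writing index
                "index_path": "/writing",
                "item_selector": "div.post-list article",
                "link_selector": "h2 a",
            },
            {   # secondary /technology index with a different layout
                "index_path": "/technology",
                "item_selector": "ul.articles li",
                "link_selector": "a.title",
            },
        ],
        "pagination": {
            "next_page_selector": "a[rel='next']",
            "max_pages": 50,
        },
    },
}
```

New websites picked up from websites.csv would get an entry with an empty containers list, matching the work-in-progress item about adding empty config items to the existing config file.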

End goal:

  • Get all the unique author blog posts, then crawl all their websites. Once all unique article URLs are indexed, scrape every article and add it to the database (a high-level pipeline sketch follows).
  • Expected effort: 2-3 hours to cover >50% of the websites. Challenge: the config format may need numerous updates.
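
A high-level sketch of that end-to-end flow. The function names, the websites.csv column name ("website"), and the reuse of crawl_all / WEBSITE_CONFIGS from the sketch above are assumptions for illustration only:

```python
# High-level pipeline sketch: index article URLs per site, dedupe, then hand off
# to the existing article scrapers (not shown). Names below are hypothetical.
import csv

def load_websites(path: str = "data/links/websites.csv") -> list[str]:
    """Read the website URLs listed in websites.csv (column name assumed)."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["website"] for row in csv.DictReader(f) if row.get("website")]

def run_pipeline() -> None:
    websites = load_websites()
    # 1. Index all unique article URLs per website (pagination handled in crawl_all).
    configs = {site: WEBSITE_CONFIGS[site] for site in websites if site in WEBSITE_CONFIGS}
    site_to_articles = crawl_all(configs)  # crawler from the sketch above
    unique_urls = sorted({url for urls in site_to_articles.values() for url in urls})
    # 2. Scraping each article and adding it to the database is out of scope here;
    #    it would reuse the existing get_article_content scrapers.
    print(f"indexed {len(unique_urls)} unique article URLs across {len(websites)} websites")
```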

FAQ

Task: obtain a list of all article URLs for each website

  • Classes are not important as long as the file works.
  • Input: the script is called from the CLI with no arguments.
  • Output: a dict mapping each website to a list of all its article links.
  • How does the code obtain the list of websites to crawl? -> via the config file generated from websites.csv; all that is needed now is to update the selectors for each website.
  • Medium articles should NOT be crawled because vmeylan is already working on them.
  • How do I know which websites should NOT be crawled because they only have one article?
  • Modify the existing file recently created by vmeylan.
  • Add logging to verify that pagination works (see the CLI sketch after this list).
  • Continue this chat: https://chat.openai.com/share/0f46d34f-156f-417a-ab1d-6924ac6462a2
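
A sketch of a no-argument CLI entry point with logging so that pagination can be verified from the logs. It reuses WEBSITE_CONFIGS and crawl_website from the earlier sketch; the logger name and messages are illustrative:

```python
# No-argument CLI entry point: crawl every configured site, log per-site counts,
# and print the resulting {website: [article URLs]} dict as JSON.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("crawl_non_medium_websites")

def main() -> None:
    results: dict[str, list[str]] = {}
    for site, cfg in WEBSITE_CONFIGS.items():  # config/crawler from the sketch above
        urls = crawl_website(site, cfg)
        # If pagination is broken, the count stays stuck at roughly one page's
        # worth of links, which shows up immediately in these log lines.
        logger.info("%s: collected %d article URLs", site, len(urls))
        results[site] = urls
    print(json.dumps(results, indent=2))

if __name__ == "__main__":
    main()
```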
@girotomas (Contributor)

Hours worked (detailed):

  • Initial call explaining the task: 1h
