
[feature] Crawl all non-medium websites to fetch all articles #22

Open

vmeylan opened this issue Apr 7, 2024 · 1 comment

@vmeylan (Collaborator)

vmeylan commented Apr 7, 2024

TODO

  • Update src/populate_csv_files/get_article_content/crawl_non_medium_websites.py to crawl all post URLs from every website in data/links/websites.csv (the websites from data.mev.fyi). The websites can be visualized on the Websites tab at data.mev.fyi.
  • Input: website URLs. Output: a dict mapping each website to all of its article URLs (handling pagination).
  • Approach: a single general script to which per-website config items are passed (a minimal sketch follows this list).
  • Work in progress:
    • Fix pagination.
    • Make sure it works for all websites; the config skeleton might need to be updated.
    • If no articles are found on a website, first visit the site and check whether other index URLs are available (e.g. /technology or /writing [...]).
    • If there are new websites and the config already exists, add empty config items for them to the existing config file.
    • If new index pages become available, e.g. a /technology when only /writing was added, append this /technology to to_parse.csv.
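
A minimal sketch of the config-driven approach described above, assuming a simple config shape (base URL plus CSS selectors for article links and the next-page link). The config keys, the example selectors, and the helper names are illustrative assumptions, not the actual schema used in crawl_non_medium_websites.py:

```python
# Minimal sketch of a config-driven crawler with pagination.
# The config keys below are assumptions, not the repo's actual schema.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical config: one entry per website, with CSS selectors for article
# links and for the "next page" link used for pagination.
WEBSITE_CONFIGS = {
    "https://example-blog.com": {
        "start_path": "/writing",
        "article_selector": "article h2 a",
        "next_page_selector": "a.next",
    },
}

def crawl_website(base_url: str, config: dict, max_pages: int = 50) -> list[str]:
    """Follow pagination and collect all article URLs for one website."""
    article_urls: list[str] = []
    page_url = urljoin(base_url, config["start_path"])
    for _ in range(max_pages):  # hard cap so broken pagination cannot loop forever
        response = requests.get(page_url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.select(config["article_selector"]):
            href = link.get("href")
            if href:
                article_urls.append(urljoin(base_url, href))
        next_link = soup.select_one(config["next_page_selector"])
        if next_link is None or not next_link.get("href"):
            break  # no further pages
        page_url = urljoin(base_url, next_link["href"])
    return article_urls

def crawl_all(configs: dict) -> dict[str, list[str]]:
    """Return the required output shape: {website: [article URLs]}."""
    return {site: crawl_website(site, cfg) for site, cfg in configs.items()}
```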

Challenges:

  • Make sure pagination works.
  • Make the code general and robust: abstract all site-specific complexity into the config items. Expect several containers per site, each with its own selectors (see the config skeleton sketch below).
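
One possible shape for such a config skeleton is sketched below; the field names (containers, index_path, item_selector, link_selector, pagination) are assumptions for illustration, not the repo's actual format:

```python
# Possible config skeleton (field names are assumptions): each website can declare
# several containers, each with its own selectors, plus a shared pagination rule.
SITE_CONFIG_SKELETON = {
    "https://example-blog.com": {
        "containers": [
            {   # main /writing index
                "index_path": "/writing",
                "item_selector": "div.post-list article",
                "link_selector": "h2 a",
            },
            {   # secondary /technology index with a different layout
                "index_path": "/technology",
                "item_selector": "ul.articles li",
                "link_selector": "a.title",
            },
        ],
        "pagination": {
            "next_page_selector": "a[rel='next']",
            "max_pages": 50,
        },
    },
}
```

New websites picked up from websites.csv would get an entry with an empty containers list, matching the work-in-progress item about adding empty config items to the existing config file.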

End goal:

  • Get all the unique author blog posts, then crawl all their websites. Once all unique article URLs are indexed, scrape every article and add it to the database (a high-level pipeline sketch follows).
  • Expected effort: 2-3 hours to cover >50% of the websites. Challenge: the config format may need numerous updates.
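
A high-level sketch of that end-to-end flow. The function names, the websites.csv column name ("website"), and the reuse of crawl_all / WEBSITE_CONFIGS from the sketch above are assumptions for illustration only:

```python
# High-level pipeline sketch: index article URLs per site, dedupe, then hand off
# to the existing article scrapers (not shown). Names below are hypothetical.
import csv

def load_websites(path: str = "data/links/websites.csv") -> list[str]:
    """Read the website URLs listed in websites.csv (column name assumed)."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["website"] for row in csv.DictReader(f) if row.get("website")]

def run_pipeline() -> None:
    websites = load_websites()
    # 1. Index all unique article URLs per website (pagination handled in crawl_all).
    configs = {site: WEBSITE_CONFIGS[site] for site in websites if site in WEBSITE_CONFIGS}
    site_to_articles = crawl_all(configs)  # crawler from the sketch above
    unique_urls = sorted({url for urls in site_to_articles.values() for url in urls})
    # 2. Scraping each article and adding it to the database is out of scope here;
    #    it would reuse the existing get_article_content scrapers.
    print(f"indexed {len(unique_urls)} unique article URLs across {len(websites)} websites")
```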

FAQ

Task: obtain a list of all article URLs for each website

  • Classes are not important as long as the file works.
  • Input: the script is called from the CLI with no arguments.
  • Output: a dict mapping each website to a list of all its article links.
  • How does the code obtain the list of websites to crawl? -> via the config file generated from websites.csv; all that is needed now is to update the selectors for each website.
  • Medium articles should NOT be crawled because vmeylan is already working on them.
  • How do I know which websites should NOT be crawled because they only have one article?
  • Modify the existing file recently created by vmeylan.
  • Add logging to verify that pagination works (see the CLI sketch after this list).
  • Continue this chat: https://chat.openai.com/share/0f46d34f-156f-417a-ab1d-6924ac6462a2
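
A sketch of a no-argument CLI entry point with logging so that pagination can be verified from the logs. It reuses WEBSITE_CONFIGS and crawl_website from the earlier sketch; the logger name and messages are illustrative:

```python
# No-argument CLI entry point: crawl every configured site, log per-site counts,
# and print the resulting {website: [article URLs]} dict as JSON.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("crawl_non_medium_websites")

def main() -> None:
    results: dict[str, list[str]] = {}
    for site, cfg in WEBSITE_CONFIGS.items():  # config/crawler from the sketch above
        urls = crawl_website(site, cfg)
        # If pagination is broken, the count stays stuck at roughly one page's
        # worth of links, which shows up immediately in these log lines.
        logger.info("%s: collected %d article URLs", site, len(urls))
        results[site] = urls
    print(json.dumps(results, indent=2))

if __name__ == "__main__":
    main()
```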
@girotomas (Contributor)

Hours worked (detailed):

  • Initial call explaining the task: 1h
