Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request for comments: Should we remove fetching HTTP functionality? #64

Open
Valian opened this issue Nov 12, 2024 · 4 comments
Open

Request for comments: Should we remove fetching HTTP functionality? #64

Valian opened this issue Nov 12, 2024 · 4 comments

Comments

@Valian
Copy link
Collaborator

Valian commented Nov 12, 2024

Right now there's Readability.summarize(url) function fetching the article and then parsing it.

I'm thinking about:

  • removing fetching functionality from Readability
  • removing httpoison from dependencies
  • relying on Readability.article(html) as an entrypoint to the library, with the expectation that user will get HTML on his own

Why?

  • there are various approaches to scraping. Some apps use Req, some HTTPoison, some Tesla. Using multiple clients in a single app doesn't really make sense.
  • People might need different settings - HTTP headers, proxy etc
  • There's some maintenance overhead of keeping it around - updating dependencies etc.

Thoughts? Maybe @vkryukov @philipbrown ?

@philipbrown
Copy link
Contributor

@Valian Yeah, that sounds good to me 👍

@vkryukov
Copy link
Contributor

I had the same exact idea! I'm using Req for my use case, and have essentially re-implemented Readability.summarize to work on raw html responses and URLs. +1

@vkryukov
Copy link
Contributor

While we are here (and since that might be a breaking change from the API perspective anyways), should we discuss renaming summarize? I don't think it's the best name as it does not technically summarize anything, just extracts different parts of the webpage.

@vkryukov
Copy link
Contributor

Some thoughts about simplifying the api:

  1. Readability.article(html), as proposed above, returns an %Article{} structure with all the fields populated.
  2. We don't have separate Readability.{title, published_at} etc. functions - they don't add much to the table (just parse the article and grab the fields you need).
  3. Potentially add some helpers, such as Readability.article_from_file(filename) and such.

Some downsides of this approach:

  1. Response headers can help determine the type of the file (e.g., we don't want to start parsing a PDF thinking that it's an HTML)
  2. URL also contains some useful information (e.g., newspaper3k extracts the date from it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants