Request for comments: Should we remove fetching HTTP functionality? #64

Valian · 2024-11-12T11:01:29Z

Right now there's a Readability.summarize(url) function that fetches the article and then parses it.

I'm thinking about:

removing fetching functionality from Readability
removing httpoison from dependencies
relying on Readability.article(html) as the entry point to the library, with the expectation that users will fetch the HTML on their own

Why?

There are various approaches to scraping. Some apps use Req, some HTTPoison, some Tesla. Using multiple clients in a single app doesn't really make sense.
People might need different settings: HTTP headers, proxies, etc.
There's some maintenance overhead in keeping it around: updating dependencies, etc.

Thoughts? Maybe @vkryukov @philipbrown ?

philipbrown · 2024-11-12T11:50:16Z

@Valian Yeah, that sounds good to me 👍

vkryukov · 2024-11-12T15:19:42Z

I had the exact same idea! I'm using Req for my use case, and have essentially re-implemented Readability.summarize to work on raw HTML responses and URLs. +1

vkryukov · 2024-11-12T15:21:45Z

While we are here (and since that might be a breaking change from the API perspective anyways), should we discuss renaming summarize? I don't think it's the best name as it does not technically summarize anything, just extracts different parts of the webpage.

vkryukov · 2024-11-12T15:51:46Z

Some thoughts about simplifying the API:

Readability.article(html), as proposed above, returns an %Article{} structure with all the fields populated.
We drop the separate Readability.{title, published_at} etc. functions: they don't bring much to the table (just parse the article and grab the fields you need).
Potentially add some helpers, such as Readability.article_from_file(filename) and such.

Some downsides of this approach:

Response headers can help determine the type of the file (e.g., we don't want to start parsing a PDF thinking that it's HTML).
The URL also contains some useful information (e.g., newspaper3k extracts the date from it).
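
To make these two downsides concrete, here is a minimal sketch of how a caller could recover that information on their own once fetching lives outside the library. The module name FetchHints and both function names are hypothetical and not part of Readability. The date pattern assumes a /YYYY/MM/DD/ path segment, which is only one of many URL conventions.

```elixir
defmodule FetchHints do
  @moduledoc """
  Hypothetical caller-side helpers (not part of Readability).
  They use only the Elixir standard library.
  """

  # Returns true when a Content-Type header value denotes HTML or XHTML.
  # Parameters such as "; charset=utf-8" are ignored.
  def html_content_type?(content_type) when is_binary(content_type) do
    media_type =
      content_type
      |> String.split(";", parts: 2)
      |> hd()
      |> String.trim()
      |> String.downcase()

    media_type in ["text/html", "application/xhtml+xml"]
  end

  def html_content_type?(_), do: false

  # Looks for a /YYYY/MM/DD segment in the URL path (an assumed convention)
  # and returns {:ok, date} when it forms a valid calendar date, else :error.
  def date_from_url(url) when is_binary(url) do
    with %URI{path: path} when is_binary(path) <- URI.parse(url),
         [_, y, m, d] <- Regex.run(~r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)", path),
         {:ok, date} <-
           Date.new(String.to_integer(y), String.to_integer(m), String.to_integer(d)) do
      {:ok, date}
    else
      _ -> :error
    end
  end
end
```

A caller would run html_content_type?/1 on the response header before handing the body to Readability.article(html), and keep the result of date_from_url/1 next to the parsed article as a fallback publication date.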
