Save a copy of the linked content in case of link rot #207
Comments
@ihavenogithub You're right, this was proposed a long time ago (#58). We have been triaging bugs and fixing issues at https://github.com/shaarli/Shaarli/, and concluded that Shaarli should not include complex features like web scraping (or should keep them as plugins, but we don't have a plugin system yet). I'm working on a Python script to handle this.
The script can be used from a client machine (laptop, whatever) or can be placed on the server itself and run periodically (if the host supports Python and cron jobs). At the moment the script works perfectly for me, but it needs some cleanup. Would this solve your problem? |
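For illustration, here is a minimal sketch of that approach (not the actual script): read a Shaarli HTML export (Netscape bookmarks format), fetch every linked page, and keep a local copy. The `bookmarks.html` file name and `archive/` directory are placeholders, not names from the real tool.

```python
# Minimal sketch, not the actual script: archive every link found in a
# Shaarli HTML (Netscape bookmarks) export. Paths are illustrative.
import os
import re
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the HREF of every <a> tag in the bookmarks export."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith("http"):
                self.links.append(href)

def archive(export_file="bookmarks.html", out_dir="archive"):
    os.makedirs(out_dir, exist_ok=True)
    parser = LinkExtractor()
    with open(export_file, encoding="utf-8") as f:
        parser.feed(f.read())
    for url in parser.links:
        # Build a filesystem-safe file name from the URL.
        name = re.sub(r"[^A-Za-z0-9._-]+", "_", url)[:150] + ".html"
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                data = resp.read()
            with open(os.path.join(out_dir, name), "wb") as out:
                out.write(data)
        except Exception as exc:  # unreachable host, link already rotten, etc.
            print(f"skipped {url}: {exc}")

if __name__ == "__main__":
    archive()
```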
Probably for a while, but I'd rather have this process done automatically. |
It will be automatic if you add it as a scheduled task (cron job). I'm now formatting the script so that it's usable/readable for everyone and will keep this updated. |
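For example, a crontab entry along these lines would run the archiver every night; the script name, interpreter path, and log file are placeholders:

```
# Run the archiving script every night at 03:00 (paths are illustrative).
0 3 * * * /usr/bin/python /home/user/bin/archive-links.py >> /home/user/archive.log 2>&1
```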
Hey @ihavenogithub, I've started rewriting my script from scratch (it was too damn ugly), check https://github.com/nodiscc/shaarchiver. For now it only downloads HTML exports and audio/video media (with tag filtering), not pages. Rather limited, but it's a clean start and more is planned (see the issues). Contributions welcome. |
Hi, |
So I don't think Wallabag could be useful for me. However, I agree that Wallabag should be able to automatically archive pages from RSS feeds. Did you report a bug for this on the Wallabag issue tracker? |
The wallabag remote server could be at your home ^_^ but I was only suggesting to use some of its components as a library, if possible. It would be a great thing to have a standard library for downloading web pages, available to all open-source and free software. I understand that can't be done if you're developing in Python and wallabag is PHP. Thank you for your tool BTW :] |
Thanks for the feedback @Epy. I guess the script could also be run automatically by your home server if set up with a cron job.
The patterns are very interesting, as they describe what to strip/extract to obtain "readable" versions of pages (example for bbc.co.uk). This feature could be added in the long run (in another script, or as a command-line option). For now I want to concentrate on keeping exact copies of the pages, then on removing just the ads (I don't remember where I saved it, but I have a draft for this: basically, download ad-blocking lists and fetch pages through a proxy that removes the ads). I'm rewriting it (again...) as the script was getting overcomplicated. The next version should be able to download video and audio (already implemented) as well as web pages, and generate a Markdown and HTML index of the archive. The version after that should make the HTML index filterable/searchable (text/tags), and the one after that should support ad blocking. Feel free to open a feature request so that I won't forget your ideas. I also think wallabag should really support auto-downloading articles from RSS feeds... |
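As a rough illustration of the planned index generation (not shaarchiver code; it assumes the archive is a flat directory of saved `.html` files, and all names are placeholders):

```python
# Rough sketch: build a simple Markdown index of an archive directory,
# plus a plain HTML version so the archive is browsable in a browser.
import os
import html

def build_index(archive_dir="archive", md_path="index.md", html_path="index.html"):
    entries = sorted(f for f in os.listdir(archive_dir) if f.endswith(".html"))
    # Markdown index: one list item per archived page.
    with open(md_path, "w", encoding="utf-8") as md:
        md.write("# Archive index\n\n")
        for name in entries:
            md.write(f"- [{name}]({archive_dir}/{name})\n")
    # HTML index with the same entries.
    with open(html_path, "w", encoding="utf-8") as out:
        out.write("<html><body><h1>Archive index</h1><ul>\n")
        for name in entries:
            link = html.escape(f"{archive_dir}/{name}")
            out.write(f'<li><a href="{link}">{html.escape(name)}</a></li>\n')
        out.write("</ul></body></html>\n")

if __name__ == "__main__":
    build_index()
```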
Okay, I just made the feature request in your GitHub repo :) FreshRSS is an RSS feed reader and can export to wallabag (as it can do with Shaarli, to export links only). |
From years of using del.icio.us, then Yahoo's Delicious, then self-hosted Scuttle and Semantic Scuttle, I really miss the ability to save a local copy of the linked page, so the content is still available when link rot eventually occurs.