Save a copy of the linked content in case of link rot #207
Comments
@ihavenogithub You're right, this was proposed a long time ago (#58). We have been triaging bugs and fixing issues at https://github.com/shaarli/Shaarli/, and concluded that Shaarli should not include complex features like web scraping (or should keep them as plugins, but we don't have a plugin system yet). I'm working on a Python script to handle this.
The script can be used from a client machine (laptop, whatever) or can be placed on the server itself and run periodically (if the host supports Python and cron jobs). At the moment the script works perfectly for me, but it needs some cleanup. Would this solve your problem? |
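For illustration, here is a minimal sketch of that approach (not the actual script): read a Shaarli HTML export (Netscape bookmarks format), fetch every linked page, and keep a local copy. The `bookmarks.html` file name and `archive/` directory are placeholders, not names from the real tool.

```python
# Minimal sketch, not the actual script: archive every link found in a
# Shaarli HTML (Netscape bookmarks) export. Paths are illustrative.
import os
import re
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the HREF of every <a> tag in the bookmarks export."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith("http"):
                self.links.append(href)

def archive(export_file="bookmarks.html", out_dir="archive"):
    os.makedirs(out_dir, exist_ok=True)
    parser = LinkExtractor()
    with open(export_file, encoding="utf-8") as f:
        parser.feed(f.read())
    for url in parser.links:
        # Build a filesystem-safe file name from the URL.
        name = re.sub(r"[^A-Za-z0-9._-]+", "_", url)[:150] + ".html"
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                data = resp.read()
            with open(os.path.join(out_dir, name), "wb") as out:
                out.write(data)
        except Exception as exc:  # unreachable host, link already rotten, etc.
            print(f"skipped {url}: {exc}")

if __name__ == "__main__":
    archive()
```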
Probably for a while, but I'd rather have this process done automatically. |
It will be automatic if you add it as a scheduled task (cron job). I'm now formatting the script so that it's usable/readable for everyone and will keep this updated. |
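For example, a crontab entry along these lines would run the archiver every night; the script name, interpreter path, and log file are placeholders:

```
# Run the archiving script every night at 03:00 (paths are illustrative).
0 3 * * * /usr/bin/python /home/user/bin/archive-links.py >> /home/user/archive.log 2>&1
```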
Hey @ihavenogithub, I've started rewriting my script from scratch (it was too damn ugly), check https://github.com/nodiscc/shaarchiver. For now it only downloads HTML exports and audio/video media (with tag filtering), not pages. Rather limited, but it's a clean start and more is planned (see the issues). Contributions welcome. |
Hi, |
So I don't think Wallabag could be useful for me. However, I agree that Wallabag should be able to automatically archive pages from RSS feeds. Did you report a bug for this on the Wallabag issue tracker? |
The wallabag remote server could be at your home ^_^ but I was only suggesting to use some of its components as a library, if possible. It would be a great thing to have a standard library for downloading web pages, available to all open-source and free software. I understand that can't be done if you're developing in Python and wallabag is PHP. Thank you for your tool BTW :] |
Thanks for the feedback @Epy. I guess the script could also be run automatically by your home server if set up with a cron job.
The patterns are very interesting, as they describe what to strip/extract to obtain "readable" versions of pages (example for bbc.co.uk). This feature could be added in the long run (in another script, or as a command-line option). For now I want to concentrate on keeping exact copies of the pages, then on removing just the ads (I don't remember where I saved it, but I have a draft for this: basically, download ad-blocking lists and fetch pages through a proxy that removes the ads). I'm rewriting it (again...) as the script was getting overcomplicated. The next version should be able to download video and audio (already implemented) as well as web pages, and generate a Markdown and HTML index of the archive. The version after that should make the HTML index filterable/searchable (text/tags), and the one after that should support ad blocking. Feel free to open a feature request so that I won't forget your ideas. I also think wallabag should really support auto-downloading articles from RSS feeds... |
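As a rough illustration of the planned index generation (not shaarchiver code; it assumes the archive is a flat directory of saved `.html` files, and all names are placeholders):

```python
# Rough sketch: build a simple Markdown index of an archive directory,
# plus a plain HTML version so the archive is browsable in a browser.
import os
import html

def build_index(archive_dir="archive", md_path="index.md", html_path="index.html"):
    entries = sorted(f for f in os.listdir(archive_dir) if f.endswith(".html"))
    # Markdown index: one list item per archived page.
    with open(md_path, "w", encoding="utf-8") as md:
        md.write("# Archive index\n\n")
        for name in entries:
            md.write(f"- [{name}]({archive_dir}/{name})\n")
    # HTML index with the same entries.
    with open(html_path, "w", encoding="utf-8") as out:
        out.write("<html><body><h1>Archive index</h1><ul>\n")
        for name in entries:
            link = html.escape(f"{archive_dir}/{name}")
            out.write(f'<li><a href="{link}">{html.escape(name)}</a></li>\n')
        out.write("</ul></body></html>\n")

if __name__ == "__main__":
    build_index()
```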
Okay, I just made the feature request in your GitHub repo :) FreshRSS is an RSS feed reader and can export to wallabag (as it can do with Shaarli, to export links only). |
From years of using del.icio.us, then Yahoo's Delicious, then self-hosted Scuttle and Semantic Scuttle, I really miss the ability to save a local copy of the linked page, so the content is still available when link rot eventually occurs.