Content tagged with "Webarchive"

Really enjoyed this blog post. I’ve been using ArchiveBox for a couple of years now, and when writing up notes in my digital garden I tend to link to the original and then to my archived ‘mirror’. I also use wallabag to preserve long-form articles without the yucky HTML and JS gunk, so I can read them later on my Kindle.

I have recently been working in partnership with the UK Web Archive to identify and parse large amounts of historic news data for an NLP task that I will blog about in the future. The NLP portion of this task will surely present its own challenges, but for now there is the small matter of identifying news data amongst the noise of 60 TB of web archive dumps covering the rest of the .uk top-level domain.