Brainsteam

Published on November 3, 2022 by James Ravenscroft

Really enjoyed this blog post. I’ve been using ArchiveBox for a couple of years now, and when writing up notes in my digital garden I tend to link to the original and then to my archived ‘mirror’. I also use wallabag to preserve long-form articles without the yucky HTML and JS gunk, so that I can read them later on my Kindle.

Exploring Web Archive Data – CDX Files

Published on June 5, 2017 by James Ravenscroft

#python #webarchive #PhD

I have recently been working in partnership with the UK Web Archive to identify and parse large amounts of historic news data for an NLP task that I will blog about in the future. The NLP portion of this task will surely present its own challenges, but for now there is the small matter of identifying news data amongst the noise of 60TB of web archive dumps covering the rest of the .uk top-level domain.
