I have recently been working in partnership with the UK Web Archive to identify and parse large amounts of historic news data for an NLP task that I will blog about in the future. The NLP portion of this task will surely present its own challenges, but for now there is the small matter of identifying news data amongst the noise of 60TB of web archive dumps covering the rest of the .uk top-level domain.

WARC and CDX Files

The Web Archive project has produced standardized file formats for describing historic web resources in a compressed archive. A website is scraped and its content is stored chronologically in a WARC file. A CDX index file is also produced, describing every URL scraped, when it was retrieved and which WARC file its content is stored in, along with some other metadata.

Our first task is to identify news content so that we can narrow our search down to a subset of WARC files (rather than filling 60TB of storage or having to traverse that amount of data). The CDX files allow us to do this. They are available for free download from the Web Archive website and are gzip-compressed down to around 10-20GB per file. If you try to expand these files locally, you’re looking at 60-120GB of uncompressed data – a great way to fill up your hard drive.

Processing Huge Gzip Files

Ideally we want to explore these files without having to uncompress them explicitly. This is possible using Python 3’s gzip module but it took me a long time to find the right options.

Python file I/O typically allows you to read a file line by line. If you have a text file, you can iterate over the lines using something like the following:

with open("my_text_file.txt", "r") as f:
    for line in f:
        print(line)

Now clearly, trying this approach on a .gz file isn’t going to work. Using the gzip module we can open and uncompress a gz file as a stream – examining parts of the file in memory and discarding data that we’ve already seen. This is the most efficient way of dealing with a file of this magnitude, which won’t fit into RAM on a modern machine and would fill a hard drive if uncompressed.

I tried a number of approaches: using the gzip library directly, running the gzip command-line utility via subprocess, and various combinations of TextIOWrapper and BufferedReader – all to no avail.

The Solution

The solution is actually incredibly simple in Python 3, and I wasn’t far off the money with TextIOWrapper. The gzip library offers a mode flag for accessing gzipped text in a buffered, line-by-line fashion, just as we did above for the uncompressed text file. Simply passing “rt” (read text) to the gzip.open() function will wrap the decompressed stream in a TextIOWrapper and allow you to read the file line by line.

import gzip

with gzip.open("2012.cdx.gz","rt") as gzipped:

    for i,line in enumerate(gzipped):
        print(line)
        # stop this thing running off and printing the whole file.
        if i == 10: 
            break

If you’re using an older version of Python (2.7 for example), or you would prefer to see explicitly what’s going on beneath the covers, you can also use the following code:

import io
import gzip

with io.TextIOWrapper(gzip.open("2012.cdx.gz","r")) as gzipped:
    
    for i,line in enumerate(gzipped):
        print(line)
        # stop this thing running off and printing the whole file.
        if i == 10:
            break

And it’s as simple as that. You can now start to break down each line in the file using tools like urllib to identify content stored in the archive from domains of interest.
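To show what that breakdown looks like, here is a quick illustration of urlparse on a made-up URL (the URL itself is purely hypothetical):

from urllib.parse import urlparse

# a made-up URL, purely for illustration
parsed = urlparse("http://www.example.co.uk/news/2012/story.html?page=2")

print(parsed.scheme)   # 'http'
print(parsed.netloc)   # 'www.example.co.uk'
print(parsed.path)     # '/news/2012/story.html'
print(parsed.query)    # 'page=2'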

Solving a Problem

We may want to understand how much content is available in the archive for a given domain. To put this another way: which domains have the most pages stored in the web archive? To answer this, we can run a simple script that parses all of the URLs, examines each domain name and counts instances of each.

import gzip
from collections import Counter
from urllib.parse import urlparse

# counter mapping each domain to the number of URLs seen for it
urlcounter = Counter()

with gzip.open("2012.cdx.gz","rt") as gzipped:

    for line in gzipped:

        parts = line.split(" ")

        urlbits = urlparse(parts[2])

        urlcounter[urlbits.netloc] += 1

# at the end we print out the top 10 domains
print(urlcounter.most_common(10))

Just to quickly explain what is going on here:

  1. We load up the CDX file in compressed text mode as described above
  2. We split each line on space characters. This gives us a list of fields, the order and content of which are described by the Web Archive team here.
  3. We parse the URL (which is at index 2) using the urlparse function which will break the URL up into things like domain, protocol (HTTP/HTTPS), path, query, fragment.
  4. We increment the counter for the current domain (found in the ‘netloc’ field of the parsed URL).
  5. After iterating we print out the domains with the most URLs in the CDX file.

This will take a long time to complete since we’re iterating over tens of gigabytes of uncompressed text per CDX file. I intend to investigate parallel processing of these CDX files as a next step.
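I haven’t settled on an approach yet, but here is a rough sketch of one option: one worker process per CDX file, each building its own Counter, with the results merged at the end. The file names here are placeholders for whichever CDX files you actually have.

import gzip
from collections import Counter
from multiprocessing import Pool
from urllib.parse import urlparse

def count_domains(cdx_path):
    # count URLs per domain in a single gzipped CDX file
    counter = Counter()
    with gzip.open(cdx_path, "rt") as gzipped:
        for line in gzipped:
            parts = line.split(" ")
            if len(parts) > 2:
                counter[urlparse(parts[2]).netloc] += 1
    return counter

if __name__ == "__main__":
    # placeholder file names, one CDX file per year in this example
    cdx_files = ["2012.cdx.gz", "2013.cdx.gz", "2014.cdx.gz"]
    with Pool() as pool:
        totals = sum(pool.map(count_domains, cdx_files), Counter())
    print(totals.most_common(10))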

Conclusion

We’ve looked into how to dynamically unzip and examine a CDX file in order to understand which domains host the most content. The next step is to identify which WARC files are of interest and request access to them from the Web Archive.
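As a taste of that next step, here is a rough sketch of how the relevant WARC filenames might be pulled out of the same CDX index for a single domain of interest. The target domain is made up, and I’m assuming the WARC filename is the last space-separated field on each line, which you should check against the field description linked above.

import gzip
from urllib.parse import urlparse

TARGET_DOMAIN = "example.co.uk"  # hypothetical domain of interest
warc_files = set()

with gzip.open("2012.cdx.gz", "rt") as gzipped:
    for line in gzipped:
        parts = line.split(" ")
        if len(parts) < 3:
            continue
        netloc = urlparse(parts[2]).netloc
        # match the domain itself or any of its subdomains
        if netloc == TARGET_DOMAIN or netloc.endswith("." + TARGET_DOMAIN):
            # assumption: the WARC filename is the last field on the line
            warc_files.add(parts[-1].strip())

print(len(warc_files), "WARC files reference", TARGET_DOMAIN)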