Open machine learning research is undergoing something of a reproducibility crisis. In fairness, it’s not usually the authors’ fault – or at least not entirely. We’re a fickle industry, and the tools and frameworks that were ‘in vogue’ and state of the art a couple of years ago are now obsolete. Furthermore, academics and open source contributors are under no obligation to keep their code up to date. It is often left up to the reproducer to figure out how to breathe life back into older work.
This is a topic I’ll write about in more detail some day, but for now I’ll focus on a case study: a fairly recent ML model that I wanted to run 2 years after it was written, and the challenges I faced making that happen.
The model in question is a PyTorch implementation of the Lei et al. (2016) paper, Rationalizing Neural Predictions. It’s a relatively simple (‘relatively’ doing the heavy lifting here) neural text model that tries to extract short summaries or ‘rationales’ from the text being classified that explain the predicted class. Again, a longer post about what exactly I’m doing with this will be coming soon.
Tao Lei did originally provide a Theano implementation of his model here, but running Theano (RIP) in 2021 poses much more of a challenge than even running 2-year-old PyTorch code. (I know exactly how ridiculous that sounds and I completely agree – not ideal at all! I might have a go at resurrecting this one later.)
First steps – taking stock
First we need to work out what is needed to run this model locally. Thankfully the author provides a relatively comprehensive README and pip requirements files that outline which libraries are needed to run the model. A quick peek inside requirements3.txt is quite revealing – it shows a dependency on a binary PyTorch 0.3.0 pre-built for Python 3.6. The versions of the remaining libraries are unconstrained, which means pip is left to resolve specific versions of those libs and pull them for us.
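To make the "pinned versus unconstrained" distinction concrete, here is a minimal sketch that sorts the lines of a requirements file into the two groups. The sample text is a hypothetical stand-in, not the real contents of requirements3.txt, and the function name split_requirements is my own.

```python
import re

# Hypothetical stand-in for a requirements file; NOT the real requirements3.txt.
SAMPLE_REQUIREMENTS = """\
# illustrative entries only
torch==0.3.0
numpy
scikit-learn
tqdm>=4.0
"""

# Version specifier operators that pip understands (PEP 440).
_SPECIFIER = re.compile(r"(===|==|~=|!=|<=|>=|<|>)")
# A comment starts at '#' only at line start or after whitespace,
# so URL fragments such as '#egg=name' are left alone.
_COMMENT = re.compile(r"(^|\s)#.*$")


def split_requirements(text):
    """Return (constrained, unconstrained) lists of requirement lines."""
    constrained, unconstrained = [], []
    for raw in text.splitlines():
        line = _COMMENT.sub("", raw).strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments and pip options such as '-r other.txt'
        # Direct URLs (e.g. a link to a specific wheel) count as constrained too.
        if "://" in line or _SPECIFIER.search(line):
            constrained.append(line)
        else:
            unconstrained.append(line)
    return constrained, unconstrained


if __name__ == "__main__":
    pinned, free = split_requirements(SAMPLE_REQUIREMENTS)
    print("constrained:", pinned)
    print("unconstrained:", free)
```

Anything that lands in the unconstrained group is a candidate for version drift when the project is reinstalled years later.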
We have enough information to have a first go at running this thing – we can do a quick git clone and then get started.
I’m a big fan of Miniconda for managing my Python virtual environments, not least because it gives you ‘free’ CUDA dependency management. That means I don’t have to faff about installing different NVIDIA drivers on my development system when switching between projects that use versions of TensorFlow or PyTorch built against different CUDA libraries.
I create a Python 3.6 environment and then try to install torch 0.3.0 via conda. No joy – this version of the library is compiled against CUDA 8.0 and the earliest version of CUDA available in conda is 9.0.
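When juggling PyTorch and CUDA versions like this, it helps to see what a given environment actually contains. The following is a rough sketch, assuming only that torch may or may not be importable; describe_environment is a name I made up for illustration.

```python
import importlib
import platform


def describe_environment():
    """Collect the version facts that matter when matching PyTorch to CUDA."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "torch_cuda_build": None,
        "cuda_available": False,
    }
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError):
        return info  # torch is missing or its shared libraries failed to load

    info["torch"] = str(torch.__version__)
    # torch.version.cuda is the CUDA version the binary was built against
    # (None for CPU-only builds); getattr guards releases that lack it.
    info["torch_cuda_build"] = getattr(getattr(torch, "version", None), "cuda", None)
    info["cuda_available"] = bool(torch.cuda.is_available())
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it in each candidate environment gives a quick side-by-side of the Python, PyTorch and CUDA build versions before any model code is touched.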
So our options are now:
1. Try running with a newer version of CUDA and see what happens (there might be unknown side effects)
2. Compile PyTorch 0.3.0 with a newer version of CUDA (again, possible unknown side effects, plus the faff of trying to compile things)
3. Try to find and install CUDA 8.0 libraries on our host system and then install PyTorch 0.3.0 (faff…)
I actually had a go at the first option but got an error message, so we can probably assume that there were several breaking API changes between PyTorch 0.3.0 running on CUDA 8 and PyTorch 1.7.0 running on CUDA 11.
Counter-intuitively, the third option is our next best choice.
The solution – containerise
If you’re not familiar with Docker and containers and you’re in software eng/ML eng then I’d highly recommend looking into them. Containers allow us to borrow from the mid-to-late 90s Java “write once, run anywhere” paradigm – they’re like super lightweight virtual machines that contain all of the dependencies you need to run your application. They are super useful when you want to experiment with libraries and dependencies in a controlled way without messing up your host operating system.
NVIDIA provide their own version of Docker that supports GPU passthrough – i.e. our “tiny virtual machine” can have access to the GPU in our host computer.
Next we build a lightweight Dockerfile for building a container around the model implementation.
In summary, the Dockerfile linked above:
- Starts from an Ubuntu base image with the NVIDIA tooling and CUDA 8.0 already installed
- Installs Miniconda
- Grabs the PyTorch 0.3.0 library and the other Python dependencies we need.
Since the CUDA libs are installed separately (as part of the Docker base image), the fact that conda tries to install the wrong CUDA toolkit isn’t a problem inside our container.
After placing the Dockerfile in the root of the git project, we can run
docker build -t yala_env . to build the image and then
docker run -it yala_env /bin/bash to start an interactive shell inside the container.
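If you rebuild often, the two commands can be wrapped in a small script. This is only a sketch that mirrors the commands exactly as given in this post; the function names are my own, and it assumes the docker executable is on your PATH and that you run it from the project root.

```python
import shutil
import subprocess


def docker_commands(tag="yala_env", context="."):
    """Return the build and run commands from the text as argument lists."""
    build = ["docker", "build", "-t", tag, context]
    run = ["docker", "run", "-it", tag, "/bin/bash"]
    return build, run


def build_and_enter(tag="yala_env", context="."):
    """Build the image, then open an interactive shell in a new container."""
    if shutil.which("docker") is None:
        raise RuntimeError("docker executable not found on PATH")
    build, run = docker_commands(tag, context)
    subprocess.run(build, check=True)  # raises CalledProcessError if the build fails
    subprocess.run(run, check=True)    # needs a real terminal because of -it


# Usage, from the root of the git project:
#     build_and_enter()
```

Passing the commands as argument lists rather than one shell string avoids quoting problems if the tag or build context ever contains spaces.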
We run the train script and – as if by magic – it works! Hooray!
So the main learnings from this experience were:
- Code maintenance and reproducibility in machine learning are hard problems to which the community is yet to find the solution.
- Conda can sometimes help with managing the NVIDIA/CUDA mess so that we don’t have to uninstall and reinstall system libraries.
- Docker and containers can provide a powerful and relatively simple way to isolate libraries and dependencies when even conda doesn’t help us.
I haven’t tried it yet, but I’m fairly sure this method would work well with even older stuff like the original Theano-based model I mentioned in this article. Docker seems like a really useful tool in the intrepid data scientist’s toolbox for re-running and reproducing old experiments.