Content tagged with "Python"

MLflow is a powerful open-source MLOps platform with a built-in framework for serving your trained ML models as REST APIs. The REST framework loads data provided in a pandas-compatible JSON or CSV format and passes it directly into your model. This is handy when your model expects a tabular list of numerical and categorical features, but it is less clear how to serve models and pipelines that expect unstructured text as their primary input. In this post we will explore how to train and then serve an NLP model using MLflow, scikit-learn and spacy.
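To give a flavour of where we're heading, here is a minimal sketch (illustrative names and a toy pipeline, not the exact code from the post) of wrapping a scikit-learn text pipeline in an mlflow.pyfunc model so that raw text can flow through the REST framework's pandas input:

import mlflow.pyfunc
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# a scikit-learn pipeline that consumes raw text rather than tabular features
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(["great service", "terrible service"], [1, 0])  # toy data

class TextClassifier(mlflow.pyfunc.PythonModel):
    """Adapts the pipeline to the pandas DataFrame input MLflow serves up."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def predict(self, context, model_input):
        # assume the incoming DataFrame has a 'text' column (illustrative)
        return self.pipeline.predict(model_input["text"])

mlflow.pyfunc.save_model("text_model", python_model=TextClassifier(pipeline))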

Read more...

I recently ran into an issue where the newest version of Torch (1.4.0 at the time of writing) requires a newer version of CUDA/NVIDIA drivers than I have installed.

Last time I tried to upgrade my CUDA version it took me several hours, if not days, so I didn’t really want to go through that again.

As it happens, PyTorch keeps an archive of compiled Python wheel (.whl) packages for different combinations of Python version (3.5, 3.6, 3.7, 3.8 – heck, even 2.X, which is no longer officially supported), CUDA version (9.2, 10.0, 10.1) and Torch version (from 0.1 to 1.4). You can install exactly the build you need if you know the right incantation. The full index is available here.
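For example, to pull in the CUDA 9.2 build of Torch 1.4.0 you can point pip at the stable-wheel index (a sketch – the torch_stable URL below is my understanding of where PyTorch publishes these wheels):

pip install torch==1.4.0+cu92 -f https://download.pytorch.org/whl/torch_stable.html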

Read more...

If you’re a frequent user of spacy and virtualenv you might well be all too familiar with the following:

python -m spacy download en_core_web_lg
Collecting en_core_web_lg==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz#egg=en_core_web_lg==2.0.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz (852.3MB)
5% |█▉ | 49.8MB 11.5MB/s eta 0:01:10

If you’re lucky and you have a decent internet connection then great; if not, it’s time to make a cup of tea.
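One way to avoid the repeated download – a sketch of the general idea rather than necessarily the exact fix from the post – is to fetch the model tarball once (the URL is the same one spacy resolves above), keep it somewhere central, and pip install it from disk into each virtualenv:

# download once
wget -P ~/spacy-models https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz

# then, inside each virtualenv
pip install ~/spacy-models/en_core_web_lg-2.0.0.tar.gz

Because the model is a normal pip package, spacy.load("en_core_web_lg") should work as usual afterwards.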

Read more...

As part of my PhD I’m currently interested in topic models that can take into account the dialect of the writing. That is, how can we build a model that compares topics discussed in different dialectal styles, such as scientific papers versus newspaper articles? If you’re new to the concept of topic modelling then this article can give you a quick primer.

Vanilla LDA

[Figure: a diagram of how the latent variables in the LDA model are connected]

Vanilla topic models such as Blei’s LDA are great but start to fall down when the wording around one particular concept varies too much. In a scientific paper you might expect to find words like “gastroenteritis”, “stomach” and “virus”, whereas in newspapers discussing the same topic you might find “tummy”, “sick” and “bug”. A vanilla LDA implementation might struggle to understand that these concepts are linked unless the contextual information around the words is similar (e.g. both articles have “uncooked meat” and “symptoms last 24 hours”).
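For contrast, fitting a vanilla LDA model takes only a few lines with gensim (a generic sketch, not code from the post):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# toy tokenised documents in two different "dialects"
docs = [["gastroenteritis", "stomach", "virus"],
        ["tummy", "sick", "bug"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# plain LDA only sees co-occurrence counts - it has no notion of dialect
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)
print(lda.print_topics())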

Read more...

I have recently been working in partnership with the UK Web Archive to identify and parse large amounts of historic news data for an NLP task that I will blog about in the future. The NLP portion of this task will surely present its own challenges, but for now there is the small matter of identifying news data amongst the noise of 60TB of web archive dumps covering the rest of the .uk top-level domain.

Read more...

I’ve written a small command line application for tracking my time on my PhD and other projects. We use Harvest at Filament, which is great if you’ve got a huge team and want the complexity (and, of course, the license charges) of an online cloud solution for time tracking.

If, like me, you’re just interested in seeing how much time you’re spending on your different projects, and you don’t have any requirement for fancy web interfaces or client billing, then timetrack might be for you. Personally, I was wondering how much of my week is spent on my PhD as opposed to Filament client work. I know it’s a fair amount, but I want some clear-cut numbers.

Read more...

I have recently been playing with Elasticsearch a lot for my PhD and have started trying some more complicated queries and pattern matching using the DSL syntax. I have an index on my local machine called impact_studies which contains all 6637 REF 2014 impact case studies in JSON format. One of the fields is “UOA”, which contains the title of the unit of assessment that the case study belongs to. We recently realised that we do not want to look at all units of assessment (my PhD is around impact in science, so domains such as Art History are largely irrelevant to me). Therefore I started trying to run queries like the following.
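Something along these lines with the Python Elasticsearch client (a sketch of the shape of the query – the UOA value here is just an illustrative example):

from elasticsearch import Elasticsearch

es = Elasticsearch()
res = es.search(index="impact_studies", body={
    "query": {
        # restrict results to a single unit of assessment
        "match": {"UOA": "Clinical Medicine"}
    }
})
print(res["hits"]["total"])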

Read more...

I’ve written a simple wrapper around FreeCite, the Brown University citation parser. I’m planning to use the service to pull out author names from references in REF impact studies and try to link them back to investigators listed on RCUK funding applications.

The code is here and is MIT licensed. It provides a simple function which takes a string representing a reference and returns a dict with each field separated out. There is also a parse_many function which takes a list of reference strings and returns a list of dicts.
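Usage looks roughly like this (a sketch: parse_many is the name given above, but the import path and the single-reference function name are my assumptions):

from freecite import parse, parse_many  # illustrative import path

ref = "Smith, J. (2014). A study of things. Journal of Stuff, 1(2), 3-4."
fields = parse(ref)            # dict with each field separated out
refs = parse_many([ref, ref])  # list of dicts, one per reference string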

Read more...

I know I’m doing a lot of flip-flopping between Solr and Elasticsearch at the moment – I’m trying to figure out their key similarities and differences, and where one is more suitable than the other.

The following is an example of how to map a function f onto an entire set of indexed data in Elasticsearch using the scroll API.

In Elasticsearch it is possible to do paging by adding a size and a from parameter. For example, if you wanted to retrieve results in pages of 5 starting from the 3rd page (i.e. showing results 11-15), you would do the following.
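With the Python Elasticsearch client that looks roughly like this (a sketch – the post’s own example may use the raw JSON DSL instead):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()

# paging: from = (page - 1) * size, so page 3 with size 5 skips the first 10 hits
page = es.search(index="impact_studies", body={
    "from": 10,
    "size": 5,
    "query": {"match_all": {}},
})

# mapping a function f over everything instead: the scan helper drives the
# scroll API for you, yielding every document in the index
f = lambda doc: doc.get("UOA")  # illustrative f
results = [f(hit["_source"]) for hit in scan(es, index="impact_studies")]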

Read more...