#BlackgangPi – a Raspberry Pi Hack at Blackgang Chine

I was very excited to be invited along with some other IBMers to the Blackgang Pi event run by Dr Lucy Rogers on a semi-regular basis at the Blackgang Chine theme park on the Isle of Wight.

Blackgang Chine is a theme park on the southern tip of the Isle of Wight and holds the title of oldest theme park in the United Kingdom. We were lucky enough to be invited along to help them modernise some of their animatronic exhibits, replacing some of the aging bespoke PCBs and controllers with Raspberry Pis running Node-RED and communicating using MQTT/Watson IOT.

Over the course of two days, my colleague James Sutton and I built a talking moose head using some of the IBM Watson Cognitive services.

We got it talking fairly quickly using IBM text to speech and had it listening for intents like “tell joke” or “check weather” via NLC.

I also built out a dialog that would monitor the state of the conversation and make the user comply with the knock knock joke format (i.e. if you say anything except “who’s there” it will moan and call you a spoil-sport).

Below is the video we managed to capture before we had to pack up yesterday.

Cognitive Quality Assurance Pt 2: Performance Metrics

Last time we discussed some good practices for collecting data and then splitting it into test and train in order to create a ground truth for your machine learning system. We then talked about calculating accuracy using test and blind data sets.

In this post we will talk about some more metrics you can calculate for your machine learning system, including Precision, Recall, F-measure and confusion matrices. These metrics give you a much deeper level of insight into how your system is performing and provide hints at how you could improve performance too!

A recap – Accuracy calculation

This is the simplest calculation but perhaps the least interesting. We are just looking at the percentage of times the classifier got it right versus the percentage of times it failed. Simply:

  1. sum up the number of results (count the rows),
  2. sum up the number of rows where the predicted label and the actual label match.
  3. Calculate percentage accuracy: correct / total * 100.

This tells you how good the classifier is in general across all classes. It does not help you in understanding how that result is made up.
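The three steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming your results are stored as (actual, predicted) pairs; the `accuracy` function and the sample labels are made up for the example.

```python
def accuracy(rows):
    """Return percentage accuracy for a list of (actual, predicted) pairs."""
    total = len(rows)                      # step 1: count the rows
    if total == 0:
        raise ValueError("no results to score")
    # step 2: count the rows where predicted and actual labels match
    correct = sum(1 for actual, predicted in rows if actual == predicted)
    return correct / total * 100           # step 3: percentage accuracy


# Hypothetical results: each row is (actual label, predicted label).
results = [
    ("weather", "weather"),
    ("joke", "joke"),
    ("joke", "weather"),
    ("weather", "weather"),
]
print(accuracy(results))  # 3 of the 4 rows match
```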

Going above and beyond accuracy: why is it important?

Imagine that you are a hospital and it is critically important to be able to predict different types of cancer and how urgently they should be treated. Your classifier is 73% accurate overall but that does not tell you anything about its ability to predict any one type of cancer. What if the 27% of the answers it got wrong were the cancers that need urgent treatment? We wouldn’t know!

This is exactly why we need to use measurements like precision, recall and f-measure as well as confusion matrices in order to understand what is really going on inside the classifier and which particular classes (if any) it is really struggling with.

Precision, Recall and F-measure and confusion matrices (Grandma’s Memory Game)

Precision, Recall and F-measure are incredibly useful for getting a deeper understanding of which classes the classifier is struggling with. They can be a little bit tricky to get your head around so let’s use a metaphor about Grandma’s memory.

Imagine Grandma has 24 grandchildren. As you can understand it is particularly difficult to remember their names. Thankfully, her 6 children (the grandchildren’s parents) each had 4 kids and named them after themselves. Her son Steve has 4 sons: Steve I, Steve II, Steve III and so on.

This makes things much easier for Grandma: she now only has to remember 6 names: Brian, Steve, Eliza, Diana, Nick and Reggie. The children do not like being called the wrong name so it is vitally important that she correctly classifies the child into the right name group when she sees them at the family reunion every Christmas.

I will now describe Precision, Recall, F-Measure and confusion matrices in terms of Grandma’s predicament.

Some Terminology

Before we get on to precision and recall, I need to introduce the concepts of true positive, false positive, true negative and false negative. Every time Grandma gets an answer wrong or right, we can talk about it in terms of these labels and this will also help us get to grips with precision and recall later.

These phrases are in terms of each class – you have TP, FP, FN, TN for each class. In this case we can have TP,FP,FN,TN with respect to Brian, with respect to Steve, with respect to Eliza and so on.

This table shows how these four labels apply to the class “Brian” – you can create an equivalent table for each of the other classes.

| | Brian | Not Brian |
|---|---|---|
| Grandma says “Brian” | True Positive | False Positive |
| Grandma says <not Brian> | False Negative | True Negative |

  • If Grandma calls a Brian, Brian then we have a true positive (with respect to the Brian class) – the answer is true in both senses- Brian’s name is indeed Brian AND Grandma said Brian – go Grandma!
  • If Grandma calls a Brian, Steve then we have a false negative (with respect to the Brian class). Brian’s name is Brian and Grandma said Steve. This is also a false positive with respect to the Steve Class.
  • If Grandma calls a Steve, Brian then we have a false positive (with respect to the Brian class). Steve’s name is Steve, Grandma wrongly said Brian (i.e. identified positively).
  • If Grandma calls an Eliza, Eliza, or Steve, or Diana, or Nick – the result is the same – we have a true negative (with respect to the Brian class). Eliza,Eliza would obviously be a true positive with respect to the Eliza class but because we are only interested in Brian and what is or isn’t Brian at this point, we are not measuring this.

When you are recording results, it is helpful to store them in terms of each of these labels where applicable. For example:

Steve,Steve (TP Steve, TN everything else)
Brian,Steve (FN Brian, FP Steve)

Precision and Recall

Grandma is in the kitchen, pouring herself a Christmas Sherry when three Brians and 2 Steves come in to top up their eggnogs.

Grandma correctly classifies 2 Brians but slips up and calls one of them Eliza. She only gets 1 of the Steves and calls the other Brian.

In terms of TP,FP,TN,FN we can say the following (true negative is the least interesting for us):

| Name | TP | FP | FN |
|---|---|---|---|
| Brian | 2 | 1 | 1 |
| Eliza | 0 | 1 | 0 |
| Steve | 1 | 0 | 1 |

  • She has correctly identified 2 people who are truly called Brian as Brian (TP)
  • She has falsely named someone Eliza when their name is not Eliza (FP)
  • She has falsely named someone whose name is truly Steve something else (FN)
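If you record each guess as an (actual, predicted) pair, the counts above can be derived mechanically. The following is a small sketch of that bookkeeping; the `count_outcomes` helper is an illustrative name, not part of any library.

```python
from collections import Counter


def count_outcomes(pairs):
    """Count TP, FP and FN per class from (actual, predicted) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, predicted in pairs:
        if actual == predicted:
            tp[actual] += 1
        else:
            fn[actual] += 1      # the real name was missed
            fp[predicted] += 1   # the guessed name was used wrongly
    return tp, fp, fn


# The kitchen scene: three Brians and two Steves walk in.
kitchen = [
    ("Brian", "Brian"), ("Brian", "Brian"), ("Brian", "Eliza"),
    ("Steve", "Steve"), ("Steve", "Brian"),
]
tp, fp, fn = count_outcomes(kitchen)
for name in ("Brian", "Eliza", "Steve"):
    print(name, tp[name], fp[name], fn[name])
```

A single wrong guess always produces two entries: a false negative for the real name and a false positive for the name Grandma used.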

True Positive, False Positive, True Negative and False negative are crucial to understand before you look at precision and recall so make sure you have fully understood this section before you move on.


Precision

Precision, like our TP/FP labels, is expressed in terms of each class or name. It is the number of true positive name guesses divided by the sum of true positive and false positive guesses.

Put another way, precision is how many times Grandma correctly guessed Brian versus how many times she called other people (like Steve) Brian.

For Grandma to be precise, she needs to be very good at correctly guessing Brians and also never call anyone else (Elizas and Steves) Brian.

Important: If Grandma came to the conclusion that 70% of her grandchildren were named Brian and decided to just randomly say “Brian” most of the time, she could still achieve a high overall accuracy. However, her Precision with respect to Brian would be poor because of all the Steves and Elizas she was mis-labelling. This is why precision is important.

| Name | TP | FP | FN | Precision |
|---|---|---|---|---|
| Brian | 2 | 1 | 1 | 66.7% |
| Eliza | 0 | 1 | 0 | 0% |
| Steve | 1 | 0 | 1 | 100% |

The results from this case are displayed above. As you can see, Grandma uses Brian to incorrectly label a Steve so precision for Brian is only 66.7%. Despite only getting one of the Steves correct, Grandma has 100% precision for Steve simply by never using the name incorrectly. Precision for Eliza is 0% because the only time Grandma said “Eliza” she was wrong: there were no true positive guesses for that name (0 / 1 is zero).

So what about false negatives? Surely it’s important to note how often Grandma is inaccurately calling Brians by other names? We’ll look at that now…


Recall

Continuing the theme, Recall is also expressed in terms of each class. It is the number of true positive name guesses divided by the sum of true positive and false negative guesses.

Another way to look at it is given a population of Brians, how many does Grandma correctly identify and how many does she give another name (i.e. Eliza or Steve)?

This tells us how “confusing” Brian is as a class. If Recall is high then it’s likely that Brians all have a very distinctive feature that distinguishes them as Brians (maybe they all have the same nose). If Recall is low, maybe Brians are very varied in appearance and perhaps look a lot like Elizas or Steves (this presents a problem of its own, check out confusion matrices below for more on this).

| Name | TP | FP | FN | Recall |
|---|---|---|---|---|
| Brian | 2 | 1 | 1 | 66.7% |
| Eliza | 0 | 1 | 0 | N/A |
| Steve | 1 | 0 | 1 | 50% |

You can see that recall for Brian is the same as its precision (of the 3 Brians in the room, Grandma only named one incorrectly). Recall for Steve is 50% because Grandma guessed correctly for 1 and incorrectly for the other Steve. Recall for Eliza can’t be calculated because there were no real Elizas in the room, so we end up trying to divide zero by zero.


F-measure

F-measure is effectively a measurement of how accurate the classifier is per class once you factor in both precision and recall. This gives you a holistic view of your classifier’s performance on a particular class.

In terms of Grandma, f-measure gives us an aggregate metric of how good Grandma is at dealing with Brians in terms of both precision AND recall.

It is very simple to calculate if you already have precision and recall:

F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

Here are the F-Measure results for Brian, Steve and Eliza from above.

| Name | TP | FP | FN | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| Brian | 2 | 1 | 1 | 66.7% | 66.7% | 66.7% |
| Eliza | 0 | 1 | 0 | 0% | N/A | N/A |
| Steve | 1 | 0 | 1 | 100% | 50% | 66.7% |

As you can see – the F-measure is the average (harmonic mean) of the two values – this can often give you a good overview of both precision and recall and is dramatically affected by one of the contributing measurements being poor.
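The formulas translate fairly directly into code. The sketch below is one way to do it, assuming you already have TP, FP and FN counts per class; the function names are illustrative. It returns `None` where a ratio would mean dividing by zero, so a name that was guessed but never guessed correctly gets a precision of 0, while recall stays undefined when no real examples of that name were present.

```python
def precision(tp, fp):
    """TP / (TP + FP), or None if the name was never guessed at all."""
    return tp / (tp + fp) if (tp + fp) else None


def recall(tp, fn):
    """TP / (TP + FN), or None if no real examples of the name were present."""
    return tp / (tp + fn) if (tp + fn) else None


def f_measure(p, r):
    """Harmonic mean of precision and recall, or None if it is undefined."""
    if p is None or r is None or (p + r) == 0:
        return None
    return 2 * p * r / (p + r)


# (TP, FP, FN) counts from the kitchen example.
counts = {"Brian": (2, 1, 1), "Eliza": (0, 1, 0), "Steve": (1, 0, 1)}
for name, (tp, fp, fn) in counts.items():
    p, r = precision(tp, fp), recall(tp, fn)
    print(name, p, r, f_measure(p, r))
```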

Confusion Matrices

When a class has a particularly low Recall or Precision, the next question should be why? Often you can improve a classifier’s performance by modifying  the data or (if you have control of the classifier) which features you are training on.

For example, what if we find out that Brians look a lot like Elizas? We could add a new feature (Grandma could start using their voice pitch to determine their gender and their gender to inform her name choice) or we could update the data (maybe we could make all Brians wear a blue jumper and all Elizas wear a green jumper).

Before we go down that road, we need to understand where there is confusion between classes  and where Grandma is doing well. This is where a confusion matrix helps.

A Confusion Matrix allows us to see which classes are being correctly predicted and which classes Grandma is struggling to predict and getting most confused about. It also crucially gives us insight into which classes Grandma is confusing as above. Here is an example of a confusion Matrix for Grandma’s family.

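| Actual \ Predicted | Steve | Brian | Eliza | Diana | Nick | Reggie |
|---|---|---|---|---|---|---|
| Steve | 4 | 1 | 0 | 1 | 0 | 0 |
| Brian | 1 | 3 | 0 | 0 | 1 | 1 |
| Eliza | 0 | 0 | 5 | 1 | 0 | 0 |
| Diana | 0 | 0 | 5 | 1 | 0 | 0 |
| Nick | 1 | 0 | 0 | 0 | 5 | 0 |
| Reggie | 0 | 0 | 0 | 0 | 0 | 6 |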

Ok so let’s have a closer look at the above.

Reading across the rows left to right these are the actual examples of each class – in this case there are 6 children with each name so if you sum over the row you will find that they each add up to 6.

Reading down the columns top-to-bottom you will find the predictions – i.e. what Grandma thought each child’s name was. You will find that these columns may add up to more than or less than 6 because Grandma may over-use one particular name. In this case she seems to think that all her female Grandchildren are called Eliza (she predicted 5/6 Elizas are called Eliza and 5/6 Dianas are also called Eliza).

Reading along the diagonal from top-left to bottom-right gives you the number of correctly predicted examples. In this case Reggie was 100% accurately predicted with 6/6 children called “Reggie” actually being predicted “Reggie”. Diana is the poorest performer with only 1/6 children being correctly identified. This can be explained as above with Grandma over-generalising and calling all female relatives “Eliza”.

Steve sings for a Rush tribute band - his Geddy Lee is impeccable.

Grandma seems to have gender nailed except in the case of one of the Steves (who in fairness does have a Pony Tail and can sing very high).  She is best at predicting Reggies and struggles with Brians (perhaps Brians have the most diverse appearance and look a lot like their respective male cousins). She is also pretty good at Nicks and Steves.

Grandma is terrible at her female grandchildren’s names. If this was a machine learning problem we would need to find a way to make it easier to identify the difference between Dianas and Elizas through some kind of further feature extraction or weighting or through the gathering of additional training data.
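The same reading of rows, columns and diagonal can be done in code. Here is a rough sketch using the numbers from Grandma’s matrix, with rows as actual names and columns as predictions; `most_confused` is a hypothetical helper that picks out the largest off-diagonal cell.

```python
names = ["Steve", "Brian", "Eliza", "Diana", "Nick", "Reggie"]
# Rows are actual names, columns are Grandma's predictions (same order).
matrix = [
    [4, 1, 0, 1, 0, 0],
    [1, 3, 0, 0, 1, 1],
    [0, 0, 5, 1, 0, 0],
    [0, 0, 5, 1, 0, 0],
    [1, 0, 0, 0, 5, 0],
    [0, 0, 0, 0, 0, 6],
]

for i, name in enumerate(names):
    correct = matrix[i][i]                            # diagonal cell
    actual_total = sum(matrix[i])                     # row sum
    predicted_total = sum(row[i] for row in matrix)   # column sum
    print(name, "correct:", correct,
          "actual:", actual_total, "predicted:", predicted_total)


def most_confused(matrix, names):
    """Return (actual, predicted, count) for the largest off-diagonal cell."""
    count, actual, predicted = max(
        (matrix[i][j], names[i], names[j])
        for i in range(len(names))
        for j in range(len(names))
        if i != j
    )
    return actual, predicted, count


print(most_confused(matrix, names))
```

Dividing the diagonal cell by the row sum gives recall for that name, and dividing it by the column sum gives precision.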


Machine learning is definitely no walk in the park. There are a lot of intricacies involved in assessing the effectiveness of a classifier. Accuracy is a great start if until now you’ve been praying to the gods and carrying four-leaf-clovers around with you to improve your cognitive system performance.

However, Precision, Recall, F-Measure and Confusion Matrices really give you the insight you need into which classes your system is struggling with and which classes confuse it the most.

A Note for Document Retrieval (Watson Retrieve & Rank) Users

This example is probably directly relevant to those building classification systems (i.e. extracting intent from questions or revealing whether an image contains a particular company’s logo). However all of this stuff works directly for document retrieval use cases too. Consider true positive to be when the first document returned from the query is the correct answer and false negative is when the first document returned is the wrong answer.

There are also variants on this that consider the top N retrieved answers (Precision@N). These tell you whether your system can return the correct answer in the top 1, 3, 5 or 10 answers by simply identifying “True Positive” as the correct document turning up in the top N answers returned by the query.
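As a rough sketch of the top-N idea described here, the snippet below treats a query as a hit when the correct document appears anywhere in the first N results. The function names, document ids and queries are all invented for illustration and are not tied to the Retrieve & Rank API.

```python
def hit_in_top_n(ranked_doc_ids, correct_id, n):
    """True if the correct document appears in the first n results."""
    return correct_id in ranked_doc_ids[:n]


def top_n_rate(queries, n):
    """Fraction of queries whose correct document is in the top n results."""
    hits = sum(hit_in_top_n(ranked, correct, n) for ranked, correct in queries)
    return hits / len(queries)


# Hypothetical results: (ranked document ids returned, id of the correct answer).
queries = [
    (["d3", "d1", "d7"], "d3"),   # correct answer ranked first
    (["d2", "d9", "d4"], "d4"),   # correct answer ranked third
    (["d5", "d6", "d8"], "d1"),   # correct answer not returned
]
for n in (1, 3):
    print(n, top_n_rate(queries, n))
```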


Overall I hope this tutorial has helped you to understand the ins and outs of machine learning evaluation.

Next time we look at cross-validation techniques and how to assess small corpora where carving out a 30% chunk of the documents would seriously impact the learning. Stay tuned for more!

IBM Watson – It’s for data scientists too!

Last week, my colleague Olly and I gave a talk at a data science meetup on how IBM Watson can be used for data science applications.

We had an amazing time and got some really great feedback from the event. We will definitely be doing more talks at events like these in the near future so keep an eye out for us!

I will also be writing a little bit more about the experiment I did around Core Scientific Concepts and Watson Natural Language Classifier in a future blog post.


Cognitive Quality Assurance – An Introduction

This article has a slant towards the IBM Watson Developer Cloud Services but the principles and rules of thumb expressed here are applicable to most cognitive/machine learning problems.


Quality assurance is arguably one of the most important parts of the software development lifecycle. In order to release a product that is production ready, it must be put under, and pass, a number of tests – these include unit testing, boundary testing, stress testing and other practices that many software testers are no doubt familiar with. The ways in which traditional software is tested are relatively clear. In a normal system, developers write deterministic functions, that is – if you put an input parameter in, unless there is a bug, you will always get the same output back. This principle makes it.. well not easy… but less difficult to write good test scripts and know that there is a bug or regression in your system if these scripts get a different answer back than usual.

Cognitive systems are not deterministic in nature. This means that you can receive different results from the same input data when training a system. Such systems tend to be randomly initialised and learn in different, nuanced ways every time they are trained. This is similar to how identical twins who may be biologically identical still learn their own preferences, memories and  skillsets.

Thus, a traditional unit testing approach with tests that pass and fail depending on how the output of the system compares to an expected result is not helpful.

This article is the first in a series on Cognitive Quality Assurance. Or in other words, how to test and validate the performance of non-deterministic, machine learning systems. In today’s article we look at how to build a good quality ground truth, how to carry out train/test/blind data segmentation and how you can use your ground truth to verify that a cognitive system is doing its job.

Ground Truth

Let’s take a step back for a moment and make sure we’re ok with the concept of ground truth.

In machine learning/cognitive applications, the ground truth is the dataset which you use to train and test the system. You can think of it like a school textbook that the cognitive system treats as the absolute truth and first point of reference for learning the subject at hand. Its structure and layout can vary depending on the nature of the system you are trying to build but it will always abide by a number of rules. As I like to remember them: R-C-S!

Representative of the problem

Like Da Vinci when he drew the anatomically correct Vitruvian man, strive to represent the data as clearly and accurately as possible – errors make learning harder!
  • The ground truth must accurately reflect the problem you are trying to solve.
  • If you are building a question answering system, how sure are you that the questions in the ground truth are also the questions that end users will be asking?
  • If you are building an image classification system, are the images in your ground truth of a similar size and quality to the images that you will need to tag and classify in production? Do your positive and negative examples truly represent the problem? For example, if you only have black and white images in your positive examples but are learning to find cats, the machine might learn to assume that black and white implies cat.
  • The proportion of each type is an important factor. If you have 10 classes of image or text and one particular class occurs 35% of the time in the field, you should try and reflect this in your ground truth too.


Consistent

  • The data in your ground truth must follow a logical set of rules – even if these are a bit “fuzzy” – after all if a human can’t decide on how to classify a set of data consistently, how can we expect a machine to do this?
  • Building a ground truth can often be a very large task requiring a team of people. When working in groups it may be useful to build a set of guidelines that detail which data belongs to which class and list some examples. I will cover this in more detail in my article on working in groups.
  • We humans can be inconsistent in nature so, if at all possible, try to automate some of the classification using dictionaries or pattern matching rules.

Important: never use cognitive systems to generate ground truth or you run the risk of introducing compounding learning errors.

Statistically Significant

  • More data points means that the cognitive system has more to work with - don't skimp on ground truth - it will cost you your accuracy!

    The ground truth should be as large as is affordable. When you were a child and learned the concept of dog or cat, the chances are you learned that from seeing a large number of these animals and were able to draw up mental rules for what a dog entails (4 legs, furry, barks, wags tail) vs what cat entails (4 legs, sometimes furry, meows, retractable claws). The more  diverse examples of these animals you see, the better you are able to refine your mental model for what each animal entails. The same applies with machine learning and cognitive systems.

  • Some of the Watson APIs list minimum ground truth quality requirements and these vary from service to service. You should always aim as high as possible but, as an absolute minimum, for at least 25% more than the service requirement so that you have some data left for blind testing (all will be revealed).

There are some test techniques for dealing with testing smaller corpuses that I will cover in a follow up article.

Training and Testing – Concepts

Once we are happy with our ground truth, we need to decide how best to train and test the system. In a standard software environment, you would want to test every combination of every function and make sure that all combinations work. It may be tempting to jump to this conclusion with Cognitive systems too. However, this is not the answer.

Taking a step back again, let’s remember when we were back at school. Over the course of a year we would learn about a topic and at the end there was an exam. We knew that the exam would test what we had learned during the year but we did not know:

  • The exact questions that we would be tested on – you have some idea of the sorts of questions you might be tested on but if you knew what the exact questions were you could go and find out what the answers are ahead of time
  • The exact exam answers that would get us the best results before we went into the exam room and took the test. That’d be cheating right?

With machine learning, this concept of learning and then blind testing is equally important. If we train the algorithm on all of the ground truth available to us and then test it, we are essentially asking it questions we already told it the answers to. We’re allowing it to cheat.

By splitting the ground truth into two datasets, training on one and then asking questions with the other – we are really demonstrating that the machine has learned the concepts we are trying to teach and not just memorised the answer sheet.

Training and Testing – Best Practices

Generally we split our data set into 80% training data and 20% testing data – this means that we are giving the cognitive system the larger chunk of information to learn from and testing it on a small subset of those concepts (in the same way that your professor gave you 12 weeks of lectures to learn from and then a 2 hour exam at the end of term).

It is important that the test questions are well represented in the train data (it would have been mean of your professors to ask you questions in the exam that were never taught in the lectures). Therefore, you should make sure to sample ground truth pairs from each class or concept that you are trying to teach.

You should not simply take the first 80% of the ground truth file and feed it into the algorithm and use the last 20% of the file to test the algorithm – this is making a huge assumption about how well each class is represented in the data. For example, you might find that all of the questions about car insurance come at the end of your banking FAQ ground truth resulting in:

  • The algorithm never seeing what a car insurance question looks like and not learning this concept.
  • The algorithm fails miserably at the test because most of the questions were on car insurance and it didn’t know much about that.
  • The algorithm has examples of mortgage and credit card questions but is never tested on these – we can’t make any assertions about how well it has learned to classify these concepts.
The best way to divide test and training data for the above NLC problem is as follows:
  1. Iterate over the ground truth – separating out each example into groups by class/concept
  2. Randomly select 80% of each of the groups to become the training data for that group/class
  3. Take the other 20% of each group and use this as the test data for that group/class
  4. Recombine the subgroups into two groups: test and train
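The four steps above could look something like this in Python. It is a minimal sketch that assumes the ground truth is a list of (text, label) pairs; the `stratified_split` name, the fixed seed and the sample questions are illustrative choices rather than anything prescribed by NLC.

```python
import random
from collections import defaultdict


def stratified_split(examples, train_fraction=0.8, seed=42):
    """Split (text, label) pairs so every class appears in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in examples:            # step 1: group by class
        by_class[label].append((text, label))

    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)                  # step 2: random selection per class
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])           # 80% of this class
        test.extend(group[cut:])            # step 3: the remaining 20%
    return train, test                      # step 4: recombined sets


# Hypothetical banking FAQ ground truth with two classes.
examples = [(f"mortgage question {i}", "mortgage") for i in range(10)]
examples += [(f"car insurance question {i}", "car_insurance") for i in range(10)]

train, test = stratified_split(examples)
print(len(train), len(test))
```

Because the split is done per class, the car insurance questions end up in both sets regardless of where they sit in the original file.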

With some of the other Watson cognitive APIs (I’m looking at you, Visual Recognition and Retrieve & Rank) you will need to alter this process a little bit. However the key here is making sure that the test data set is a fair representation (and a fair test) of the information in the train dataset.

Testing the model

Once you have your train set and test set, the next bit is easy. Train a classifier with the train set and then write a script that loads in your test set, asks the question (or shows the classifier the image) and then compares the answer that the classifier gives with the answer in the ground truth. If they match, increment a “correct” number. If they don’t match, too bad! You can then calculate the accuracy of your classifier – it is the percentage of the total number of answers that were marked as correct.

Blind Testing and Performance Reporting

In a typical workflow you may be training, testing, altering your ground truth to try and improve performance and re-training. This is perfectly normal and it often takes some time to tune and tweak a model in order to get optimal performance.

However, in doing this, you may be inadvertently biasing your model towards the test data – which in itself may change how the model performs in the real world. When you are happy with your test performance, you may wish to benchmark against another third dataset – a blind test set that the machine has not been ‘tweaked’ in order to perform better against. This will give you the most accurate view, with respect to the data available, of how well your classifier is performing in the real world.

In the case of three data sets (test, train, blind) you should use a similar algorithm/workflow as described in the above section. The important thing is that the three sets must not overlap in any way and should all be representative of the problem you are trying to train on.

There are a lot of differing opinions on what proportions to separate out the data set into. Some folks advocate 50%, 25%, 25% for train, test, blind respectively, others 70, 20, 10. I personally start at the latter and change these around if they don’t work – your mileage may vary depending on the type of model you are trying to build and the sort of problem you are trying to model.
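For illustration, here is one way a 70/20/10 split into non-overlapping sets might be written. It shuffles once and slices, which guarantees the three sets cannot share an example. The `three_way_split` helper is hypothetical, and for a classification ground truth you would apply it to each class group separately, as in the per-class procedure described earlier.

```python
import random


def three_way_split(examples, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle once, then slice into train, test and blind sets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(examples)               # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = round(n * fractions[0])
    test_end = train_end + round(n * fractions[1])
    # Slices of one shuffled list cannot overlap.
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]


items = [f"example {i}" for i in range(100)]
train_set, test_set, blind_set = three_way_split(items)
print(len(train_set), len(test_set), len(blind_set))
```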

Important: once you have done your blind test to get an accurate idea of how well your model performs in the real world, you must not do any more tuning on the model.
If you do, your metrics will be meaningless since you are now biasing the new model towards the blind data set. You can of course, start from scratch and randomly initialise a new set of test, train and blind data sets from your ground truth at any time.


Hopefully, this article has given you some ideas about how best to start assessing the quality of your cognitive application. In the next article, I cover some more in depth measurements that you can do on your model to find out where it is performing well and where it needs tuning beyond a simple accuracy rating. We will also discuss some other methods for segmenting test and train data for smaller corpuses in a future article.

ElasticSearch: Turning analysis off and why it’s useful

I have recently been playing with Elasticsearch a lot for my PhD and started trying to do some more complicated queries and pattern matching using the DSL syntax. I have an index on my local machine called impact_studies which contains all 6637 REF 2014 impact case studies in a JSON format. One of the fields is “UOA” which contains the title of the unit of assessment that the case study belongs to. We recently identified the fact that we do not want to look at all units of assessment (my PhD is around impact in science so domains such as Art History are largely irrelevant to me). Therefore I started trying to run queries like this:


               "UOA":"General Engineering"

For some reason this returns zero results. Now it took me ages to find this page in the elastic manual which talks about the exact phenomenon I’m running into above. It turns out that the default analyser is tokenizing every text field and so Elastic has no notion of UOA ever containing “General Engineering”. Instead it only knows of a UOA field that contains the word “general” and the word “engineering” independently of each other in the model somewhere (bag-of-words). To solve this you have to

  • Download the existing schema from Elasticsearch:
$ curl -XGET "http://localhost:9200/impact_studies/_mapping/study"
  • Delete the schema (unfortunately you can’t make this change on the fly) and then turn off the analyser which tokenizes the values in the field:
$ curl -XDELETE "http://localhost:9200/impact_studies"
  • Then recreate the schema with “index”:”not_analyzed” on the field you are interested in:
curl -XPUT "http://localhost:9200/impact_studies/" -d '{"mappings":{"study":{"properties":{"CaseStudyId":{"type":"string"},"Continent":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Country":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Funders":{"type":"string"},"ImpactDetails":{"type":"string"},"ImpactSummary":{"type":"string"},"ImpactType":{"type":"string"},"Institution":{"type":"string"},"Institutions":{"properties":{"AlternativeName":{"type":"string"},"InstitutionName":{"type":"string"},"PeerGroup":{"type":"string"},"Region":{"type":"string"},"UKPRN":{"type":"long"}}},"Panel":{"type":"string"},"PlaceName":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"References":{"type":"string"},"ResearchSubjectAreas":{"properties":{"Level1":{"type":"string"},"Level2":{"type":"string"},"Subject":{"type":"string"}}},"Sources":{"type":"string"},"Title":{"type":"string"},"UKLocation":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UKRegion":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UOA":{"type":"string", "index" : "not_analyzed"},"UnderpinningResearch":{"type":"string"}}}}}'

Once you’ve done this you’re good to go reingesting your data and your filter queries should be much more fruitful.
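To make the cause of the empty result set a bit more concrete, here is a small Python sketch that imitates what happens to the field value at index time. The `rough_default_analyser` function is only a crude approximation of the real analyser (lowercasing and splitting on non-word characters), written for illustration; it is not Elasticsearch code.

```python
import re


def rough_default_analyser(text):
    """Crude stand-in for the default analyser: lowercase, split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


value = "General Engineering"

# Analysed field: the index only holds the individual lowercase tokens.
analysed_terms = rough_default_analyser(value)
print(analysed_terms)               # ['general', 'engineering']
print(value in analysed_terms)      # False: no single term equals the whole phrase

# not_analyzed field: the whole value is kept as one exact term.
not_analysed_terms = [value]
print(value in not_analysed_terms)  # True

# Request body for an exact-term lookup once the field is not_analyzed.
term_query = {"query": {"term": {"UOA": value}}}
```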

Home automation with Raspberry Pi and Watson

I’ve recently been playing with trying to build a Watson powered home automation system using my Raspberry Pi and some other electronic bits that I have on hand.

There are already a lot of people doing work in this space. One of the most successful projects is JASPER, which uses speech to text and an always on background listening microphone to talk to you and carry out actions when you ask it things in natural language like “What’s the weather going to be like tomorrow?” and “What is the meaning of life?” Jasper works using a library called Sphinx developed by Carnegie Mellon University to do speech recognition. However the models aren’t great – especially if you have a British accent.

Jasper also allows you to use other speech to text libraries and services too such as the Google Speech service and the AT&T speech service. However there is no currently available code for using the Watson speech to text API – until now.

The below code snippet can be added to your stt.py file in your jasper project.

Then you need to create a Watson speech-to-text instance in Bluemix and add the following to your JASPER configuration:

stt_engine: watson
stt_passive_engine: sphinx
 username: "<Speech-to-text-credentials-username>"
 password: "<Speech-to-text-credentials-password>"

This configuration will use the local Sphinx engine to listen out for “JASPER” or whatever you choose to call your companion (which it is actually pretty good at) and then send off 10-15s of audio to Watson STT to be analysed more accurately once the trigger word has been detected. Here’s a video of the system in action:

Freecite python wrapper

I’ve written a simple wrapper around the Brown University Citation parser FreeCite. I’m planning to use the service to pull out author names from references in REF impact studies and try to link them back to investigators listed on RCUK funding applications.

The code is here and is MIT licensed. It provides a simple method which takes a string representing a reference and returns a dict with each field separated. There is also a parse_many function which takes an array of reference strings and returns an array of dicts.