timetrack improvements

I’ve just added a couple of improvements to timetrack that allow you to append to existing time recordings (either with an amount like 15m or using live to time additional minutes spent and append them).

You can also remove entries using timetrack rm instead of remove – saving keystrokes is what programming is all about.

You can find the updated code over at GitHub.

AI can’t solve all our problems, but that doesn’t mean it isn’t intelligent

Thomas Hobbes, perhaps most famous for his thinking on western politics, was also thinking about how the human mind “computes things” more than 350 years ago.

A recent opinion piece I read on Wired called for us to stop labelling our current specific machine learning models AI because they are not intelligent. I respectfully disagree.

AI is not a new concept. The idea that a computer could ‘think’ like a human and one day pass for a human has been around since Turing and even in some form long before him. The inner workings of the human brain and how we carry out computational processes have even been discussed by great philosophers such as Thomas Hobbes, who wrote in his book De Corpore in 1655 that “by reasoning, I understand computation. And to compute is to collect the sum of many things added together at the same time, or to know the remainder when one thing has been taken from another. To reason therefore is the same as to add or to subtract.” Over the years, AI has continued to capture the hearts and minds of great thinkers, scientists and of course creatives and artists.

The Matrix: a modern day telling of Rene Descartes’ “Evil Demon” theorem

Visionary science fiction authors of the 20th century such as Arthur C Clarke, Isaac Asimov and Philip K Dick built worlds of fantasy inhabited by self-aware artificial intelligence systems and robots, some of whom could pass for humans unless subject to a very specific and complicated test. Endless films have been released that “sex up” AI: The Terminator series, The Matrix, Ex Machina, the list goes on. However, like all good science fiction, these stories paint marvellous and thrilling visions of futures that are still in the future even in 2016.

The science of AI is a hugely exciting place to be too (I would say that, wouldn’t I). Over the past few decades we’ve mastered speech recognition, optical character recognition and machine translation well enough that I can visit Japan and communicate, via my mobile phone, with a local shopkeeper without either party having to learn the language of their counterpart. We have arrived at a point where we can train machine learning models to do some specific tasks better than people (including driving cars and diagnosing cancer). We call these current generation AI models “weak AI”. Computers that can solve any problem we throw at them (in other words, ones that have generalised intelligence, known as “strong AI” systems) are a long way off. However, that shouldn’t detract from what we have solved already with weak AI.

One of the problems with living in a world of 24/7 news cycles and clickbait titles is that nothing is new or exciting any more. Every small incremental change in the world is reported straight away across the globe. Every new discovery, every fractional increase in performance from AI gets a blog post or a news article. It makes everything seem boring. Oh Tesla’s cars can drive themselves? So what? Google’s cracked Go? Whatever…

If you lose 0.2 kg overnight, your spouse probably won’t notice. Lose 50 kg and I can guarantee they would.

If you lose 50 kg in weight over 6 months, your spouse is only going to notice when you buy a new shirt that’s 2 sizes smaller or when they notice a change in your muscle as you get out of the shower. A friend you meet up with once a year is going to see a huge change because last time they saw you, you were twice the size. In this day and age, technology moves on so quickly in tiny increments that we don’t notice the huge changes any more because we’re like the spouse – we constantly see the tiny changes.

What if we did see huge changes? What if we could cut ourselves off from the world for months at a time? If you went back in time to 1982 and told people that every day you talk to your phone using just your voice and it is able to tell you about your schedule and what restaurant to go to, would anyone question that what you describe is AI? If you told someone from 1995 that you can buy a self-driving car via a small glass tablet you carry around in your pocket, are they not going to wonder at the world that we live in? We have come a long, long way and we take it for granted. Most of us use AI on a day-to-day basis without even questioning it.

Another common criticism of current weak AI models is the exact lack of general reasoning skills that would make them strong AI.

DEEPMIND HAS SURPASSED the human mind on the Go board. Watson has crushed America’s trivia gods on Jeopardy. But ask DeepMind to play Monopoly or Watson to play Family Feud, and they won’t even know where to start.

That’s absolutely true. The AI/compsci definition of this constraint is the “no free lunch for optimisation” theorem. That is that you don’t get something for nothing when you train a machine learning model. In training a weak AI model for a specific task, you are necessarily hampering its ability to perform well at other tasks. I guess a human analogy would be the education system.

If you took away my laptop and told me to run cancer screening tests in a lab, I would look like this

Aged 14 in a high school in the UK, I was asked which 11 GCSEs I wanted to take. At 16 I had to reduce this scope to 5 A levels, aged 18 I was asked to specify a single degree and aged 21 I had to decide which tiny part of AI/Robotics (which I’d studied at degree level) I wanted to specialise in at PhD level. Now that I’m half way through a PhD in Natural Language Processing in my late 20s, would you suddenly turn around and say “actually you’re not intelligent because if I asked you to diagnose lung cancer in a child you wouldn’t be able to”? Does what I’ve achieved become irrelevant and pale against that which I cannot achieve? I do not believe that any reasonable person would make this argument.

The AI Singularity has not happened yet and it’s definitely a few years away. However, does that detract from what we have achieved so far? No. No it does not.


We need to talk about push notifications (and why I stopped wearing my smartwatch)

I own a Pebble Steel which I got for Christmas a couple of years ago. I’ve been very happy with it so far. I can control my music player from my wrist, get notifications and see a summary of my calendar. Recently, however, I’ve stopped wearing it. The reason is that constant streams of notifications stress me out and interrupt my workflow, whereas not wearing it makes me feel calmer and more in control and allows me to be more productive.

As you can imagine, trying to do a PhD and be a CTO at the same time has its challenges. I struggle with the cognitive dissonance between walling off my research days to focus on my PhD and making sure that the developers at work are getting on OK and being productive without me. I have thus far tended to compromise by leaving Slack running and fielding the odd question from colleagues even on my off days.

Conversely, when I’m working for Filament, I often get requests from University colleagues to produce reports and posters, share research notes and to resolve problems with SAPIENTA or Partridge infrastructure (or even run experiments on behalf of other academics). Both of these scenarios play havoc with my prioritisation of todos when I get notified about them.

Human Multitasking

Human Multitasking is something of a myth – as is the myth that women can multitask and men can’t. It turns out that we are all (except for a small group of people scientists call “supertaskers”) particularly rubbish at multi-tasking. I am no exception, however much I wish I was.

When we “multitask” we are actually context switching. Effectively, we’re switching between a number of different tasks very quickly, kind of like how a computer is able to run many applications on the same CPU core by executing different bits of each app – it might deal with an incoming email, then switch to rendering your Netflix movie, then switch back to downloading that email. It does this so quickly that it seems like both activities are happening at once. That’s obviously different for dual or quad core CPUs but that’s not really the point here since our brains are not “quad core”.

CPUs are really good at context switching very quickly. However, the human brain is really rubbish at this. Joel Spolsky has written a really cool computer analogy on why but if you don’t want to read a long article on it, let’s just say that where a computer can context-switch in milliseconds, a human needs a few minutes.

It also logically follows that the more cognitively intensive a job is, the more time a brain needs to swap contexts. For example, you might be able to press the “next” button on your car stereo while driving at 70 MPH down the motorway, but (aside from the obvious practical implications) you wouldn’t be able to perform brain surgery and drive at the same time. If you consider studying for a PhD and writing machine learning software for a company to be roughly as complex as the above example, you can hopefully understand why I’d struggle.

Push Notifications

The problem I find with “push” notifications is that they force you to context switch. We, as a society, have spent the last couple of decades training ourselves to stop what we are doing and check our phones as soon as that little vibration or bling noise comes through. If you are a paramedic or surgeon with a pager, that’s the best possible use case for this tech, and I’m not saying we should stop push notifications for emergency situations like that. However, when the notification is “check out this dank meme dude” but we are still stimulated into action this can have a very harmful effect on our concentration and ability to focus on the task at hand.

Mobile phone notifications are bad enough but occasionally, if your phone buzzes in your pocket and you are engrossed in another task, you won’t notice and you’ll check your phone later. Smartwatch notifications seem to get my attention 9 times out of 10  – I guess that’s what they’re designed for. Having something strapped directly to the skin on my wrist is much more distracting than something buzzing through a couple of layers of clothing on my leg.

I started to find that push notifications forcibly jolt me out of whatever task I’m doing and I immediately feel anxious until I’ve handled the new input stimulus. This means that I will often prioritise unimportant stuff like responding to memes that my colleague has posted in Slack over the research paper I’m reading. Maybe this means I miss something crucial, or maybe I just have to go back to the start of the page I’m looking at. Either way, time is a’wastin’.

The Solution

For me, it’s obvious. Push notifications need a huge re-think. I am currently reorganising the way I work, think and plan and ripping out as many push notification mechanisms as I can. I’ve also started keeping track of how I’m spending my time using a tool I wrote last week.

I can definitely see a use case for “machine learning” triage of notifications based on intent detection and personal priorities. If a relative is trying to get hold of me because there’s been an emergency, I wouldn’t mind being interrupted during a PhD reading session. If a notification asking for support on Sapienta or a work project comes through, that’s urgent but can probably wait half an hour until I finish my current reading session. If a colleague wants to send me a video of grumpy cat, that should wait in a list of things to check out after 5:30pm.

Until I, or someone with more time to do so, build a machine learning filter like this one, I’ve stopped wearing my smartwatch and my phone is on silent. If you need me and I’m ignoring you, don’t take it personally. I’ll get back to you when I’m done with my current task. If it’s urgent, you’ll just have to try phoning and hoping I notice the buzz in my pocket (until I find a more elegant way to screen urgent calls and messages).

timetrack – a simple time tracking application for developers

I’ve written a small command line application for tracking my time on my PhD and other projects. We use Harvest at Filament which is great if you’ve got a huge team and want the complexity (and of course license charges) of an online cloud solution for time tracking.

If, like me, you’re just interested to see how much time you are spending on your different projects and you don’t have any requirement for fancy web interfaces or client billing, then timetrack might be for you. For me personally, I was wondering how much of my week is spent on my PhD as opposed to Filament client work. I know it’s a fair amount but I want some clear-cut numbers.

timetrack is a simple application that allows you to log what time you’ve spent and where from the command line with relative ease. It logs everything to a text file which is relatively easy to read by !machines. However it also provides filtering and reporting functions so that you can see how much time you spend on your projects, how much time you used today and how much of your working day is left.

It’s written in Python with minimal dependencies on external libraries (save for progressbar2 which is used for the live tracker). The code is open source and available under an MIT license. Download it from GitHub.

The builder, the salesman and the property tycoon

A testament to marketers around the world is the myth that their AI platform X, Y or Z can solve all your problems with no effort. Perhaps it is this, combined with developers and data scientists often being hidden out of sight and out of mind that leads people to think this way.

Unfortunately, the truth of the matter is that ML and AI involve blood, sweat and tears, especially if you are building things from scratch rather than using APIs. If you are using third party APIs there are still challenges. The biggest players in the API space also have large pools of money, and that money can be spent on marketing literature to convince you that their product will solve all your problems with no effort required. I think this is dishonest and is one of the reasons I have so many conversations like the one below.

The take home message is clear! We need to do way more to help clients to understand AI tech and what it can do in a more transparent way. Simply getting customers excited about buzzwords without explaining things in layman’s terms is a guaranteed way to lose trust and build a bad reputation.

At Filament, we pride ourselves on being honest and transparent about what AI can do for your business and are happy to take the time to explain concepts and buzzwords in layman’s terms.

The following is an amusing anecdote about what happens when AI experts get their messaging wrong.

The builder, the salesman and the property tycoon

Imagine that a property tycoon is visiting an experienced builder for advice on construction of a new house. This is a hugely exaggerated example and all of the people in it are caricatures. No likeness or similarity intended. Our ‘master builders’ are patient, understanding and communicative and thankfully, have never met a ‘Mr Tycoon’ in real life.

Salesman(SM): Welcome Mr Tycoon, please allow me to introduce to you our master builder. She has over 25 years in the construction industry and qualifications in bricklaying, plumbing and electrics.

Master Builder (MB): Hi Mr Tycoon, great to meet you *handshake*

Tycoon(TC): Lovely to meet you both. I'm here today because I want some advice on my latest building project. I've been buying blocks of apartments and letting them out for years. My portfolio is worth £125 Million. However, I want to get into the construction game. 

MB: That's exciting. So how can we help?

TC: Ok I'm a direct kind of guy and I say things how I see them so I'll cut to the chase. I want to build a house. What tools shall I use?

MB: Good question... what kind of house are you looking to build?

TC: Well, whatever house I can build with tools.

MB: OK... well you know there are a lot of options and it depends on budget. I mean you must have something in mind? Bungalow? 2-storey family house? Manor house?

TC: Look, I don't see how this is relevant. I've been told that the tools can figure all this stuff out for me.

SM: Yes MB, we can provide tools that  will help TC decide what house to build right?

MB: That's not really how it works but ok... Let's say for the sake of argument that we're going to help you build a 2 bedroom townhouse.

TC: Fantastic! Yes, great initiative MB, a townhouse. Will tools help me build a townhouse?

MB: Yeah... tools will help you build a townhouse...

TC: That's great!

MB: TC, do you have any experience building houses? You said you mainly buy houses, not build them.

TC: No not really. However, SM told me that with the right tools, I don't need any experience, the tools will do all the work for me.

MB: Right... ok... SM did you say that?

SM: Well, with recent advances in building techniques and our latest generation of tools, anything is possible!

MB: Yes... that's true, tools do make things easier. However, you really do need to know how to use the tools. They're not 'magic' - you should understand which ones are useful in different situations.

TC: Oh, that's not the kind of answer I was looking for. SM, you said this wouldn't be a problem.

SM: It's not going to be a problem is it MB? I mean we can help TC figure out which tools to use?

MB: I suppose so...

SM: That's the attitude MB... Tell TC about our services

MB: Sure, I have had many years of experience building townhouses, we have a great architect at our firm who can design the house for you. My team will take care of the damp coursing, wooden frame, brickwork and plastering and then I will personally oversee the installation of the electrics and pipework.

TC: Let's not get bogged down in the detail here MB, I just want a townhouse... Now I have a question. Have you heard of mechanical excavators? I think you Brits call them "diggers".

MB: Yes... I have used diggers a number of times in the past.

TC: Oh that's great. MB, do you think diggers can help me build a house faster?

MB: Urm, well maybe. It depends on the state of the terrain that you want to build on.

TC: Oh that's great, some of our potential tenants have heard of diggers and if I tell them we used diggers to build the house they will be so excited.

MB: Wonderful...

TC: I've put an order in for 25 diggers - if we have more that means we can build the house faster right?

MB: Are you serious?

SM: Of course TC is serious, that's how it works right?

MB: Not exactly but ok, if you have already ordered them that's fine *tries to figure out what to do with 24 spare diggers*

TC: Great, it's settled then. One more thing, I don't know if I want to do a townhouse. Can you use diggers to build townhouses? I'm only interested in building things that diggers can build.

MB: Yes don't worry, you can build townhouses with diggers. I've done it personally a number of times

TC: I'm not so sure. I've heard of this new type of house called a Ford Mustang. Everyone in the industry is talking about how we can drive up ROI by building Ford Mustangs instead of Townhouses. What are your thoughts MB?

MB: That's not a... diggers... I... I'm really sorry TC, I've just received an urgent text message from one of our foremen at a building site, I have to go and resolve this. Thanks for your time, SM can you wrap up here? *calmly leaves room and breathes into a paper bag*

SM: Sorry about that TC, anyway yes I really love the Ford mustang idea, what's your budget?


This post is supposed to raise a chuckle and it’s not supposed to offend anyone in particular. However, on a more serious note, there is definitely a problem with buzzwords in machine learning and industry. Let’s try and fix it.

#BlackgangPi – a Raspberry Pi Hack at Blackgang Chine

I was very excited to be invited along with some other IBMers to the Blackgang Pi event run by Dr Lucy Rogers on a semi regular basis at the Blackgang Chine theme park on the Isle of Wight.

Blackgang Chine is a theme park on the southern tip of the Isle of Wight and holds the title of oldest theme park in the United Kingdom. We were lucky enough to be invited along to help them modernise some of their animatronic exhibits, replacing some of the aging bespoke PCBs and controllers with Raspberry Pis running Node-RED and communicating using MQTT/Watson IOT.

Over the course of two days, my colleague James Sutton and I built a talking moose head using some of the IBM Watson Cognitive services.

We got it talking fairly quickly using IBM text to speech and had it listening for intents like “tell joke” or “check weather” via NLC.

I also built out a dialog that would monitor the state of the conversation and make the user comply with the knock knock joke format (i.e. if you say anything except “who’s there” it will moan and call you a spoil-sport).

Below is a video we managed to capture before we had to pack up yesterday.

Cognitive Quality Assurance Pt 2: Performance Metrics

Last time we discussed some good practices for collecting data and then splitting it into test and train in order to create a ground truth for your machine learning system. We then talked about calculating accuracy using test and blind data sets.

In this post we will talk about some more metrics you can calculate for your machine learning system including Precision, Recall, F-measure and confusion matrices. These metrics give you a much deeper level of insight into how your system is performing and provide hints at how you could improve performance too!

A recap – Accuracy calculation

This is the most simple calculation but perhaps the least interesting. We are just looking at the percentage of times the classifier got it right versus the percentage of times it failed. Simply:

  1. sum up the number of results (count the rows),
  2. sum up the number of rows where the predicted label and the actual label match.
  3. Calculate percentage accuracy: correct / total * 100.

This tells you how good the classifier is in general across all classes. It does not help you in understanding how that result is made up.
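
The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than code from any particular toolkit: the `results` rows and the "yes"/"no" labels are made up, and each row is assumed to be an (actual label, predicted label) pair.

```python
# Each row is (actual_label, predicted_label). These rows are invented for illustration.
results = [
    ("yes", "yes"),
    ("yes", "no"),
    ("no", "no"),
    ("no", "no"),
    ("yes", "yes"),
]


def accuracy(rows):
    """Return percentage accuracy: correct / total * 100."""
    total = len(rows)  # step 1: count the rows
    # step 2: count the rows where the predicted label matches the actual label
    correct = sum(1 for actual, predicted in rows if actual == predicted)
    return correct / total * 100  # step 3: percentage accuracy


print(f"{accuracy(results):.1f}%")
```

With the five made-up rows above, four predictions match, so the sketch reports 80% accuracy, and says nothing about which label the one mistake belonged to.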

Going above and beyond accuracy: why is it important?

Imagine that you are a hospital and it is critically important to be able to predict different types of cancer and how urgently they should be treated. Your classifier is 73% accurate overall but that does not tell you anything about its ability to predict any one type of cancer. What if the 27% of the answers it got wrong were the cancers that need urgent treatment? We wouldn’t know!

This is exactly why we need to use measurements like precision, recall and f-measure as well as confusion matrices in order to understand what is really going on inside the classifier and which particular classes (if any) it is really struggling with.

Precision, Recall and F-measure and confusion matrices (Grandma’s Memory Game)

Precision, Recall and F-measure are incredibly useful for getting a deeper understanding of which classes the classifier is struggling with. They can be a little bit tricky to get your head around so let’s use a metaphor about Grandma’s memory.

Imagine Grandma has 24 grandchildren. As you can understand, it is particularly difficult to remember their names. Thankfully, her 6 children (the grandchildren’s parents) all had 4 kids and named them after themselves. Her son Steve has 4 sons: Steve I, Steve II, Steve III and Steve IV.

This makes things much easier for Grandma, she now only has to remember 6 names: Brian, Steve, Eliza, Diana, Nick and Reggie. The children do not like being called the wrong name so it is vitally important that she correctly classifies the child into the right name group when she sees them at the family reunion every Christmas.

I will now describe Precision, Recall, F-Measure and confusion matrices in terms of Grandma’s predicament.

Some Terminology

Before we get on to precision and recall, I need to introduce the concepts of true positive, false positive, true negative and false negative. Every time Grandma gets an answer wrong or right, we can talk about it in terms of these labels and this will also help us get to grips with precision and recall later.

These phrases are in terms of each class – you have TP, FP, FN, TN for each class. In this case we can have TP,FP,FN,TN with respect to Brian, with respect to Steve, with respect to Eliza and so on.

This table shows how these four labels apply to the class “Brian” – you can create a similar table for each of the other classes.

                          | Child is Brian | Child is not Brian
Grandma says “Brian”      | True Positive  | False Positive
Grandma says <not Brian>  | False Negative | True Negative

  • If Grandma calls a Brian, Brian then we have a true positive (with respect to the Brian class) – the answer is true in both senses: Brian’s name is indeed Brian AND Grandma said Brian. Go Grandma!
  • If Grandma calls a Brian, Steve then we have a false negative (with respect to the Brian class). Brian’s name is Brian and Grandma said Steve. This is also a false positive with respect to the Steve Class.
  • If Grandma calls a Steve, Brian then we have a false positive (with respect to the Brian class). Steve’s name is Steve, Grandma wrongly said Brian (i.e. identified positively).
  • If Grandma calls an Eliza, Eliza, or Steve, or Diana, or Nick – the result is the same – we have a true negative (with respect to the Brian class). Eliza,Eliza would obviously be a true positive with respect to the Eliza class but because we are only interested in Brian and what is or isn’t Brian at this point, we are not measuring this.

When you are recording results, it is helpful to store them in terms of each of these labels where applicable. For example:

Steve,Steve (TP Steve, TN everything else)
Brian,Steve (FN Brian, FP Steve)
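
One way to keep these tallies is sketched below. It assumes each recorded result is an (actual name, name Grandma said) pair, matching the order used in the two example records above; the `tally` helper is a made-up name for illustration, not part of any library.

```python
from collections import Counter


def tally(rows):
    """Count TP, FP and FN per class from (actual, predicted) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, predicted in rows:
        if actual == predicted:
            tp[actual] += 1      # right name: true positive for that class
        else:
            fn[actual] += 1      # the real name was missed
            fp[predicted] += 1   # the name Grandma said was used wrongly
    return tp, fp, fn


# The two example records from the text.
rows = [("Steve", "Steve"), ("Brian", "Steve")]
tp, fp, fn = tally(rows)
print(tp["Steve"], fp["Steve"], fn["Brian"])  # 1 1 1
```

True negatives are not stored here: for any class they are simply every row that touches neither the actual nor the predicted name, and as noted below they are the least interesting of the four.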

Precision and Recall

Grandma is in the kitchen, pouring herself a Christmas Sherry when 3 Brians and 2 Steves come in to top up their eggnogs.

Grandma correctly classifies 2 Brians but slips up and calls one of them Eliza. She only gets 1 of the Steves and calls the other Brian.

In terms of TP,FP,TN,FN we can say the following (true negative is the least interesting for us):

      | TP | FP | FN
Brian |  2 |  1 |  1
Eliza |  0 |  1 |  0
Steve |  1 |  0 |  1

  • She has correctly identified 2 people who are truly called Brian as Brian (TP)
  • She has falsely named someone Eliza when their name is not Eliza (FP)
  • She has falsely named someone whose name is truly Steve something else (FN)

True Positive, False Positive, True Negative and False negative are crucial to understand before you look at precision and recall so make sure you have fully understood this section before you move on.


Precision, like our TP/FP labels, is expressed in terms of each class or name. It is the number of true positive guesses divided by the total number of true positive and false positive guesses: TP / (TP + FP).

Put another way, precision is how many times Grandma correctly guessed Brian versus how many times she called other people (like Steve) Brian.

For Grandma to be precise, she needs to be very good at correctly guessing Brians and also never call anyone else (Elizas and Steves) Brian.

Important: If Grandma came to the conclusion that 70% of her grandchildren were named Brian and decided to just randomly say “Brian” most of the time, she could still achieve a high overall accuracy. However, her Precision with respect to Brian would be poor because of all the Steves and Elizas she was mislabelling. This is why precision is important.

      | TP | FP | FN | Precision
Brian |  2 |  1 |  1 | 66.7%
Eliza |  0 |  1 |  0 | 0%
Steve |  1 |  0 |  1 | 100%

The results from this case are displayed above. As you can see, Grandma incorrectly uses Brian to label one of the Steves, so precision for Brian is only 66.7%. Despite only getting one of the Steves correct, Grandma has 100% precision for Steve simply by never using the name incorrectly. Precision for Eliza is 0% because there were no true positive guesses for that name (0 / 1 is still zero).
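
As a rough sketch, precision can be computed per name from the TP and FP counts in the kitchen example. The `precision` helper is a made-up name for illustration. Note that precision is only undefined when a name was never guessed at all (TP + FP is zero); a name that was guessed but never correctly, like Eliza, works out to 0 / 1 = 0.

```python
def precision(tp, fp):
    """TP / (TP + FP), or None when the name was never guessed."""
    guesses = tp + fp
    return tp / guesses if guesses else None


# (TP, FP) per name, taken from the kitchen example.
counts = {"Brian": (2, 1), "Eliza": (0, 1), "Steve": (1, 0)}

for name, (tp, fp) in counts.items():
    print(name, precision(tp, fp))
```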

So what about false negatives? Surely it’s important to note how often Grandma is inaccurately calling Brian by other names? We’ll look at that now…


Continuing the theme, Recall is also expressed in terms of each class. It is the number of true positive guesses divided by the total number of true positive and false negative guesses: TP / (TP + FN).

Another way to look at it is given a population of Brians, how many does Grandma correctly identify and how many does she give another name (i.e. Eliza or Steve)?

This tells us how “confusing” Brian is as a class. If Recall is high then it’s likely that Brians all have a very distinctive feature that distinguishes them as Brians (maybe they all have the same nose). If Recall is low, maybe Brians are very varied in appearance and perhaps look a lot like Elizas or Steves (this presents a problem of its own, check out confusion matrices below for more on this).

      | TP | FP | FN | Recall
Brian |  2 |  1 |  1 | 66.7%
Eliza |  0 |  1 |  0 | N/A
Steve |  1 |  0 |  1 | 50%

You can see that recall for Brian is the same as his precision (of the 3 Brians in the kitchen, Grandma only guessed incorrectly for one). Recall for Steve is 50% because Grandma guessed correctly for 1 and incorrectly for the other Steve. Recall for Eliza can’t be calculated because there were no actual Elizas in the kitchen, so we end up trying to divide zero by zero.
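
Recall can be sketched the same way from the TP and FN counts. Again `recall` is just an illustrative helper name; it returns None for the zero-divided-by-zero case, which happens when no children with that name were actually present.

```python
def recall(tp, fn):
    """TP / (TP + FN), or None when no children with that name were present."""
    actual = tp + fn
    return tp / actual if actual else None


# (TP, FN) per name, taken from the kitchen example.
counts = {"Brian": (2, 1), "Eliza": (0, 0), "Steve": (1, 1)}

for name, (tp, fn) in counts.items():
    print(name, recall(tp, fn))
```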


F-measure is effectively a measurement of how accurate the classifier is per class once you factor in both precision and recall. This gives you a holistic view of your classifier’s performance on a particular class.

In terms of Grandma, F-measure gives us an aggregate metric of how good Grandma is at dealing with Brians in terms of both precision AND recall.

It is very simple to calculate if you already have precision and recall:

F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

Here are the F-Measure results for Brian, Steve and Eliza from above.

      | TP | FP | FN | Precision | Recall | F-measure
Brian |  2 |  1 |  1 | 66.7%     | 66.7%  | 66.7%
Eliza |  0 |  1 |  0 | 0%        | N/A    | N/A
Steve |  1 |  0 |  1 | 100%      | 50%    | 66.7%

As you can see – the F-measure is the average (harmonic mean) of the two values – this can often give you a good overview of both precision and recall and is dramatically affected by one of the contributing measurements being poor.
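
The formula above translates directly into code. The sketch below uses a made-up helper name, `f_measure`, and returns None when either input is unavailable or both are zero. The last two lines use invented values (precision 1.0, recall 0.1) to illustrate how much harder the harmonic mean punishes one poor score than a plain average would.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; None if it cannot be computed."""
    if precision is None or recall is None or precision + recall == 0:
        return None
    return 2 * precision * recall / (precision + recall)


print(f_measure(2 / 3, 2 / 3))  # Brian
print(f_measure(1.0, 0.5))      # Steve

# Invented values: one strong score cannot hide one weak score.
print(f_measure(1.0, 0.1))      # harmonic mean, roughly 0.18
print((1.0 + 0.1) / 2)          # plain average, 0.55
```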

Confusion Matrices

When a class has a particularly low Recall or Precision, the next question should be why? Often you can improve a classifier’s performance by modifying  the data or (if you have control of the classifier) which features you are training on.

For example, what if we find out that Brians look a lot like Elizas? We could add a new feature (Grandma could start using their voice pitch to determine their gender and their gender to inform her name choice) or we could update the data (maybe we could make all Brians wear a blue jumper and all Elizas wear a green jumper).

Before we go down that road, we need to understand where there is confusion between classes  and where Grandma is doing well. This is where a confusion matrix helps.

A Confusion Matrix allows us to see which classes are being correctly predicted and which classes Grandma is struggling to predict and getting most confused about. It also crucially gives us insight into which classes Grandma is confusing as above. Here is an example of a confusion Matrix for Grandma’s family.

       | Steve | Brian | Eliza | Diana | Nick | Reggie
Steve  |     4 |     1 |     0 |     1 |    0 |      0
Brian  |     1 |     3 |     0 |     0 |    1 |      1
Eliza  |     0 |     0 |     5 |     1 |    0 |      0
Diana  |     0 |     0 |     5 |     1 |    0 |      0
Nick   |     1 |     0 |     0 |     0 |    5 |      0
Reggie |     0 |     0 |     0 |     0 |    0 |      6

OK, so let’s have a closer look at the above.

Reading across the rows left to right, you will find the actual examples of each class. In this case there are 6 children with each name, so if you sum over each row you will find that it adds up to 6.

Reading down the columns top to bottom, you will find the predictions, i.e. what Grandma thought each child’s name was. These columns may add up to more than or less than 6 because Grandma may overuse one particular name. In this case she seems to think that all her female grandchildren are called Eliza (she predicted that 5/6 Elizas are called Eliza and that 5/6 Dianas are also called Eliza).

Reading along the diagonal from top left to bottom right gives you the number of correctly predicted examples. In this case Reggie was 100% accurately predicted, with 6/6 children called “Reggie” actually being predicted “Reggie”. Diana is the poorest performer with only 1/6 children being correctly identified. This can be explained as above by Grandma over-generalising and calling all female relatives “Eliza”.
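
The three ways of reading the matrix can be checked with a short sketch. It assumes the matrix is stored as a list of rows in the same order as the table above, with rows as actual names and columns as the names Grandma said; the variable names are made up for illustration.

```python
names = ["Steve", "Brian", "Eliza", "Diana", "Nick", "Reggie"]

# Rows are actual names, columns are the names Grandma said (same order as `names`).
matrix = [
    [4, 1, 0, 1, 0, 0],
    [1, 3, 0, 0, 1, 1],
    [0, 0, 5, 1, 0, 0],
    [0, 0, 5, 1, 0, 0],
    [1, 0, 0, 0, 5, 0],
    [0, 0, 0, 0, 0, 6],
]

row_totals = [sum(row) for row in matrix]            # actual children per name
col_totals = [sum(col) for col in zip(*matrix)]      # times Grandma said each name
correct = [matrix[i][i] for i in range(len(names))]  # the diagonal

for i, name in enumerate(names):
    print(f"{name}: {correct[i]}/{row_totals[i]} correct, name said {col_totals[i]} times")
```

The column totals make the overuse of “Eliza” easy to spot: Grandma says it 10 times even though there are only 6 Elizas, while “Diana” is said only 3 times.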

Steve sings for a Rush tribute band - his Geddy Lee is impeccable.

Grandma seems to have gender nailed except in the case of one of the Steves (who in fairness does have a ponytail and can sing very high). She is best at predicting Reggies and struggles with Brians (perhaps Brians have the most diverse appearance and look a lot like their respective male cousins). She is also pretty good at Nicks and Steves.

Grandma is terrible at female grandchildren’s names. If this was a machine learning problem we would need to find a way to make it easier to identify the difference between Dianas and Elizas through some kind of further feature extraction or weighting or through the gathering of additional training data.


Machine learning is definitely no walk in the park. There are a lot of intricacies involved in assessing the effectiveness of a classifier. Accuracy is a great start if until now you’ve been praying to the gods and carrying four-leaf-clovers around with you to improve your cognitive system performance.

However, Precision, Recall, F-Measure and Confusion Matrices really give you the insight you need into which classes your system is struggling with and which classes confuse it the most.

A Note for Document Retrieval (Watson Retrieve & Rank) Users

This example is probably most directly relevant to those building classification systems (i.e. extracting intent from questions or revealing whether an image contains a particular company’s logo). However, all of this works for document retrieval use cases too. Consider a true positive to be when the first document returned from the query is the correct answer and a false negative to be when the first document returned is the wrong answer.

There are also variants on this that consider the top N retrieved answers (Precision@N). These tell you whether your system can return the correct answer in the top 1, 3, 5 or 10 answers by simply identifying a “True Positive” as the correct document turning up in the top N answers returned by the query.
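
A minimal sketch of the top-N idea as described here: a query counts as a hit when the correct document appears anywhere in the first N results. The document IDs, the `queries` list and the `top_n_hit_rate` helper are all made up for illustration and do not come from Retrieve & Rank or any other service.

```python
def top_n_hit_rate(queries, n):
    """Fraction of queries whose correct document appears in the top n results."""
    hits = sum(1 for ranked, correct in queries if correct in ranked[:n])
    return hits / len(queries)


# Each entry is (ranked document IDs returned by a query, the correct document ID).
queries = [
    (["doc3", "doc1", "doc7"], "doc3"),  # correct answer is ranked first
    (["doc2", "doc9", "doc4"], "doc4"),  # correct answer is ranked third
    (["doc5", "doc6", "doc8"], "doc1"),  # correct answer is not returned at all
]

print(top_n_hit_rate(queries, 1))
print(top_n_hit_rate(queries, 3))
```

With these made-up queries, one in three is a hit at N=1 and two in three are hits at N=3, which shows how the score relaxes as N grows.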


Overall I hope this tutorial has helped you to understand the ins and outs of machine learning evaluation.

Next time we look at cross-validation techniques and how to assess small corpora where carving out a 30% chunk of the documents would seriously impact the learning. Stay tuned for more!