At the beginning of the month, I was lucky enough to spend a month embedded in the Watson Labs team in Austin, TX. These mysterious and enigmatic members of the Watson family have a super secret bat-cave known as “The Garage” located on the IBM Austin site – to which access is prohibited for normal IBMers unless accompanied by a labs team member.
During the week I was helping out with a couple of the internal projects, but I also got the chance to experiment with some of the new Watson Developer Cloud APIs to create some new tools for internal use. Since those tools are internal I won’t describe them here, but I can share with you a couple of the general techniques that I used, since I think they might be useful for a number of applications.
Technique number 1: query expansion using Part-of-speech tagging and the Concept Expansion API.
The idea here was to address the fact that a user might phrase their question using words that mean the same thing as, but differ from, the wording used in the data being searched or queried.
Our Retrieve and Rank service makes use of Apache SOLR, which already offers synonym expansion within queries. However, I found that adding a further layer using the Concept Expansion service (which builds a thesaurus from large corpora discussing related concepts) came up with some synonyms that SOLR didn’t. This might be because the SOLR query expansion system uses MeSH, which is a formal medical ontology, whereas Concept Expansion (or at least the demo) uses a corpus of Twitter data, which offers a lot more informal word pairings and implicit links. For example, feeding “Michael Jackson” into Concept Expansion will give you outputs like “Stevie Nicks” and “Bruce Springsteen”, both musicians who released music in roughly the same era as Michael Jackson. By contrast, Michael Jackson is (perhaps unsurprisingly) not present in the MeSH ontology.
Although “Stevie Nicks” might not be directly relevant to those who are looking for “Michael Jackson”, those of you who are music fans might know where I’m going next. The answer to the question “Who did Michael Jackson perform alongside at Bill Clinton’s 1993 inaugural ball?” is Fleetwood Mac, for whom Stevie Nicks sings. (That said, my question is specific enough that the keywords “bill clinton, 1993, inaugural ball, michael jackson” get you the right answer in Google, albeit at position 2 in the results.) So there is definitely some value in using Concept Expansion for this purpose, even if you have to be very clever and careful about matching up context around queries.
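To make the idea concrete, here is a minimal sketch of how expanded terms could be folded into a boolean search query. It is an illustration only, not the code used in the project: `expand_query` is a hypothetical helper, the `related` dictionary is a hand-written stand-in for whatever the Concept Expansion service returns, and joining the groups with AND is just one possible choice.

```python
def expand_query(terms, related_terms):
    """Build a boolean query where each term is OR-ed with its related terms.

    `related_terms` maps a lower-cased term to a list of related phrases.
    It stands in for the response of an expansion service (an assumption).
    Terms are assumed not to contain double quotes.
    """
    groups = []
    for term in terms:
        alternatives = [term] + related_terms.get(term.lower(), [])
        quoted = ['"{}"'.format(alt) for alt in alternatives]
        if len(quoted) > 1:
            # Several alternatives: wrap them in a parenthesised OR group.
            groups.append("(" + " OR ".join(quoted) + ")")
        else:
            groups.append(quoted[0])
    return " AND ".join(groups)


# Hand-written expansion data mirroring the example in the text.
related = {"michael jackson": ["Stevie Nicks", "Bruce Springsteen"]}
print(expand_query(["Michael Jackson", "inaugural ball"], related))
# ("Michael Jackson" OR "Stevie Nicks" OR "Bruce Springsteen") AND "inaugural ball"
```

In practice you would probably want to weight the expanded terms lower than the user's original wording, since they are only loosely related.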
The first problem you face using this approach is in choosing which words to send off to Concept Expansion and which ones not to bother with. We’re not interested in stopwords or personal pronouns (putting “we” into Concept Expansion comes back with interesting results like “testinitialize”, “usaian” and “linux preinstallation” because of the vast amount of noise around pronouns on Twitter). We are more interested in nouns like “Chair”, entities and people like “Michael Jackson”, adjectives like “enigmatic” and verbs like “going”. All of these words and phrases are things that could be expanded upon in some way to make our information retrieval query more useful.
To get around this problem, I used the Stanford Part of Speech Tagger to annotate the queries and only sent words labelled as one of the above-mentioned types to the service. Asking “how much does the CEO earn?” yields something like the output to the right.
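The filtering step might look something like the sketch below. It assumes the tagger emits Penn Treebank tags (NN* for nouns, JJ* for adjectives, VB* for verbs) and that the query has already been tagged. The tags in the example are assigned by hand for illustration rather than copied from real tagger output, and the tiny stopword list and the `select_expandable` name are hypothetical.

```python
# Penn Treebank tag prefixes: NN* nouns, JJ* adjectives, VB* verbs.
EXPANDABLE_PREFIXES = ("NN", "JJ", "VB")

# Tiny illustrative stopword list (an assumption, not from the original code).
STOPWORDS = {"do", "does", "did", "be", "is", "are", "was", "were",
             "have", "has", "had"}


def select_expandable(tagged_tokens):
    """Return the words worth sending for expansion, in their original order."""
    return [
        word
        for word, tag in tagged_tokens
        if tag.startswith(EXPANDABLE_PREFIXES) and word.lower() not in STOPWORDS
    ]


# Hand-assigned tags for the example question (not real tagger output).
tagged = [("how", "WRB"), ("much", "JJ"), ("does", "VBZ"), ("the", "DT"),
          ("CEO", "NNP"), ("earn", "VB"), ("?", ".")]
print(select_expandable(tagged))
# ['much', 'CEO', 'earn']
```

Note that auxiliary verbs such as “does” carry a verb tag, so a tag filter on its own lets them through. That is why the sketch also checks a stopword list.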
Another problem I ran into very quickly was dealing with nouns consisting of multiple words, for example “Michael Jackson”. In my code, I assume that any words tagged as nouns that sit next to each other belong to the same object and should be treated as a single phrase. This assumption seems to have worked so far for my limited set of test data.
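A rough sketch of that adjacent-noun assumption is shown below. Again it assumes Penn Treebank tags where every noun tag starts with “NN”. The function name is hypothetical and the example tags are written by hand.

```python
def merge_adjacent_nouns(tagged_tokens):
    """Join runs of consecutive noun-tagged words into single phrases.

    A merged phrase keeps the tag of the first word in the run.
    """
    merged = []
    previous_was_noun = False
    for word, tag in tagged_tokens:
        is_noun = tag.startswith("NN")
        if is_noun and previous_was_noun:
            # Extend the phrase that the previous noun started.
            last_word, last_tag = merged[-1]
            merged[-1] = (last_word + " " + word, last_tag)
        else:
            merged.append((word, tag))
        previous_was_noun = is_noun
    return merged


# Hand-assigned tags for illustration.
tagged = [("Who", "WP"), ("is", "VBZ"), ("Michael", "NNP"),
          ("Jackson", "NNP"), ("?", ".")]
print(merge_adjacent_nouns(tagged))
# [('Who', 'WP'), ('is', 'VBZ'), ('Michael Jackson', 'NNP'), ('?', '.')]
```

The obvious weakness is that two unrelated nouns that happen to sit side by side will also be glued together, so the assumption deserves testing on a larger set of queries.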
Alchemy API and Taxonomy Distance
Another small piece of work I carried out was around measuring how “similar” two documents are from a very high level, based on their distance in the Alchemy API taxonomy. If you didn’t know already, Alchemy has an API for classifying a document into a high level taxonomy. This can often give you a very early indication of how likely that document is to contain information relevant to your use case or requirements. For example, a document tagged “automotive manufacturers” is unlikely to contain medical text or instructions on sewing and embroidery.
The taxonomy is a tree structure which contains a huge list of different categories and subcategories. The idea here was to walk the tree from the category “node” assigned to one document to the category assigned to the second document and count the steps: more steps means further away. So for each document I made an Alchemy API call to get its taxonomy class. Then I split each class on “/” characters and counted how far away A is from B. It’s pretty straightforward. To the left you can see that a question about burgers and a question about salad dressings are roughly “2” categories away from each other: moving up to food from fast food counts as one jump, and moving back down to condiments and dressing counts as another.
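The step counting can be sketched as follows. This is a minimal illustration that works on label strings only and makes no API calls. The `taxonomy_distance` name is hypothetical, and the two example labels are made up to mirror the burger and salad dressing case rather than taken from real API responses.

```python
def taxonomy_distance(label_a, label_b):
    """Count the steps between two '/'-separated taxonomy labels.

    The distance is the number of steps up from A to the deepest shared
    ancestor plus the number of steps down from there to B.
    """
    path_a = [part for part in label_a.strip("/").split("/") if part]
    path_b = [part for part in label_b.strip("/").split("/") if part]

    # Length of the common prefix, i.e. depth of the deepest shared ancestor.
    shared = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared += 1

    return (len(path_a) - shared) + (len(path_b) - shared)


# Hypothetical labels modelled on the burger / salad dressing example.
burgers = "/food and drink/food/fast food"
dressings = "/food and drink/food/condiments and dressing"
print(taxonomy_distance(burgers, dressings))  # 2
print(taxonomy_distance(burgers, burgers))    # 0
```

Two labels that share no top-level category end up with a distance equal to the sum of their depths, which makes deep but unrelated categories look especially far apart.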
Interestingly, the API did seem to struggle with some questions. I used “What was the market share of Ford in Australia?” for my first document and “What type of car should I buy?” as my second, and got “/automative and vehicle/vehicle brands/ford” for my first classification and “/finance/personal finance/insurance/car” for my second. I have a suspicion that this API is not set up for dealing with short documents like questions and that this confused it, but I need to do some further testing.