Content tagged with "Python"

Introduction

Retrieve and Rank (R&R), in case you haven’t already heard about it, is IBM Watson’s new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works at a high level here.

R&R is based on the Apache Solr search engine, with a machine learning ranking plugin that learns which answers are most relevant to a given input query and presents them in the learnt “relevance” order.

Read more...

Introduction

As part of my continuing work on Partridge, I’ve been working on improving the sentence splitting capability of SSSplit, the component used to split academic papers from PLOS ONE and PubMed Central into separate sentences.

Papers arrive in our system as big blocks of text with the occasional diagram, formula or citation, and in order to apply CoreSC annotations to the sentences we need to know where each sentence starts and ends. Of course, that means we also have to take into account the other ‘stuff’ (listed above) floating around in the documents. We can’t just ignore formulae and citations; they’re pretty important! That’s what SSSplit does: it carves up papers into sentence elements whilst leaving the XML structure of the rest of the document intact.
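
To make the idea concrete, here is a minimal sketch of splitting a paragraph into sentence elements while keeping inline markup in place. It is not SSSplit's actual implementation: the `<s>` element name, the `<xref>` citation tag in the usage example and the naive boundary rule (sentence-ending punctuation, whitespace, then a capital letter) are all assumptions made for illustration.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Naive boundary (assumption): ., ! or ? followed by whitespace and a capital.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def _append_text(sentence, text):
    """Append text to a sentence, after its last child if it has one."""
    if len(sentence):
        last = sentence[-1]
        last.tail = (last.tail or "") + text
    else:
        sentence.text = (sentence.text or "") + text


def split_paragraph(paragraph):
    """Return a copy of paragraph with its content wrapped in <s> elements."""
    result = ET.Element(paragraph.tag, dict(paragraph.attrib))
    current = ET.SubElement(result, "s")  # hypothetical sentence element name

    def feed(text):
        nonlocal current
        for i, piece in enumerate(BOUNDARY.split(text)):
            if i > 0:
                # A boundary was crossed, so start a new sentence.
                current = ET.SubElement(result, "s")
            if piece:
                _append_text(current, piece)

    feed(paragraph.text or "")
    for child in paragraph:
        # Inline elements (citations, formulae) stay inside the sentence
        # in which they occur; the text after them is fed through the splitter.
        inline = copy.deepcopy(child)
        inline.tail = None
        current.append(inline)
        feed(child.tail or "")
    return result
```

For example, parsing `<p>Cells were grown <xref rid="b1">[1]</xref> overnight. Growth was measured.</p>` and passing it to `split_paragraph` yields two `<s>` elements, with the citation still sitting inside the first one. A real splitter needs far more care around abbreviations, decimals and boundaries that fall next to inline elements.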

Read more...

When I’m working on Partridge and SAPIENTA, I find myself dealing with a lot of badly formatted XML. I used to manually run xmllint --format against every file before opening it but that gets annoying very quickly (even if you have it saved in your bash history). So I decided to write a Nemo script that does it automatically for me.

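#!/bin/bash
# bash rather than sh, because the [[ ... ]] test below is a bash feature.

# Read the selected paths one per line so that paths containing spaces survive.
while IFS= read -r xmlfile; do

    # Only reformat files with an .xml extension.
    if [[ $xmlfile == *.xml ]]
    then
        # Replace the original only if xmllint succeeded.
        xmllint --format "$xmlfile" > "$xmlfile.tmp" && mv "$xmlfile.tmp" "$xmlfile"
    fi
done <<< "$NEMO_SCRIPT_SELECTED_FILE_PATHS"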
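
The same idea can be sketched in Python using only the standard library. This swaps xmllint for `xml.dom.minidom`, so the output will not be identical to `xmllint --format`, and the function names are made up for illustration. It also assumes Nemo passes the selection as newline-delimited paths in `NEMO_SCRIPT_SELECTED_FILE_PATHS`.

```python
import os
import tempfile
import xml.dom.minidom


def reformat_xml(path):
    """Pretty-print the XML file at path in place."""
    document = xml.dom.minidom.parse(path)
    pretty = document.toprettyxml(indent="  ")
    # Write to a temporary file in the same directory and then swap it in,
    # so a failure part-way through never truncates the original.
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile(
        "w", dir=directory, suffix=".tmp", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(pretty)
    os.replace(tmp.name, path)


def reformat_selected(selected_paths):
    """Reformat every .xml path in a newline-delimited string of paths."""
    reformatted = []
    for path in selected_paths.splitlines():
        if path.endswith(".xml"):
            reformat_xml(path)
            reformatted.append(path)
    return reformatted


if __name__ == "__main__":
    # Assumption: Nemo exposes the selection through this variable.
    reformat_selected(os.environ.get("NEMO_SCRIPT_SELECTED_FILE_PATHS", ""))
```

One caveat: `toprettyxml` keeps existing whitespace-only text nodes, so running it over a file that is already indented tends to add blank lines. xmllint handles that case more gracefully, which is a good reason to prefer the shell version when xmllint is available.
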
Read more...