Content tagged with "Phd"

I know I’m doing a lot of flip-flopping between Solr and Elasticsearch at the moment – I’m trying to figure out the key similarities and differences between them, and where one is more suitable than the other.

The following is an example of how to map a function *f* onto an entire set of indexed data in Elasticsearch using the scroll API.
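A minimal sketch of that pattern in Python, assuming the elasticsearch-py client (the helper name `scroll_map` and the page size are mine, not part of the original post); any client object exposing `search`, `scroll` and `clear_scroll` that returns the usual response dicts will work:

```python
def scroll_map(es, index, f, page_size=100, keep_alive="2m"):
    """Apply f to the _source of every document in `index`,
    fetching one page at a time via the scroll API."""
    resp = es.search(index=index, scroll=keep_alive, size=page_size,
                     body={"query": {"match_all": {}}})
    results = []
    while resp["hits"]["hits"]:
        results.extend(f(hit["_source"]) for hit in resp["hits"]["hits"])
        # Each scroll call returns the next page and a (possibly new) scroll id.
        resp = es.scroll(scroll_id=resp["_scroll_id"], scroll=keep_alive)
    es.clear_scroll(scroll_id=resp["_scroll_id"])
    return results
```

The `keep_alive` value just needs to be long enough for you to process one page before requesting the next.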

Elasticsearch also supports plain paging by adding a `size` and a `from` parameter to a query. For example, to retrieve results in pages of 5 starting from the 3rd page (i.e. results 11–15), you would set `size=5` and `from=10`.
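Sketching that request body as a Python dict (the `match_all` query is just a stand-in for whatever you are actually searching):

```python
page, page_size = 3, 5
query = {
    "from": (page - 1) * page_size,  # skip the first 10 hits
    "size": page_size,               # then return the next 5
    "query": {"match_all": {}},
}
```

Note that `from` is an offset in documents, not in pages, which is why it is `(page - 1) * page_size`.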

Read more...

Here is a recording of my recent keynote talk on the power of Natural Language Processing through Watson, and my academic/PhD topic – Partridge – at the York Doctoral Symposium.
  • 0–11 minutes – history of mankind, invention and the acceleration of scientific progress (warming people to the idea that farming out your scientific reading to a computer is a much better idea than trying to read every paper written)
  • 11–26 minutes – my personal academic work – scientific paper annotation and cognitive scientific research using NLP
  • 26–44 minutes – Watson – Jeopardy, MSK and Ecosystem
  • 44–48 minutes – Q&A on Watson and Partridge
  • Please don’t cringe too much at my technical explanation of Watson – especially those of you who know much more about WEA and the original DeepQA setup than I do! This was me after a few days of reading the original 2011 and 2012 papers and making copious notes!

Read more...

Hoorah! After a number of weeks I’ve finally managed to get SAPIENTA running inside docker containers on our EBI cloud instance. You can try it out at http://sapienta.papro.org.uk/.

The project was previously running via a number of very precarious scripts that had a habit of stopping and not coming back up. Hopefully the new docker environment should be a lot more stable.

Another improvement I’ve made is to create a websocket interface for calling the service, along with a Python-based command-line client. If you’re interested, I’m using socket.io and the relevant Python libraries (server and client). This means that anyone who needs to can now request annotations in large batches. I’m planning to use socket.io to interface Partridge with SAPIENTA, since they are hosted on separate servers and this approach avoids any complicated firewall issues.
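As a rough sketch of what a batch request might look like from the Python side using the python-socketio client – the event names (`annotate`, `annotation_result`) and payload shape here are hypothetical, and the real SAPIENTA protocol may well differ:

```python
def build_batch(papers):
    """Package (paper_id, xml_text) pairs into one request payload.
    The payload shape is made up for illustration only."""
    return {"papers": [{"id": pid, "xml": xml} for pid, xml in papers]}

def request_annotations(url, papers):
    """Send one batch of papers and collect the annotated results.
    Event names are hypothetical -- check the server for the real ones."""
    import socketio  # python-socketio client

    sio = socketio.Client()
    results = []

    @sio.on("annotation_result")
    def on_result(data):
        results.append(data)
        if len(results) == len(papers):
            sio.disconnect()

    sio.connect(url)
    sio.emit("annotate", build_batch(papers))
    sio.wait()  # block until the server has sent everything back
    return results
```

The nice property of this shape is that the client just blocks until the server has pushed back one result per paper, with no polling.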

Read more...

Warwick CDT intake 2015: from left to right – at the front Jacques, Zakiyya, Corinne, Neha and myself. Rear: David, John, Stephen (CDT director), Mo, Vaggelis, Malkiat and Greg

Hello again readers – those of you who follow me on other social media (Twitter, Instagram, Facebook etc.) probably know that I’ve just returned from a week in New York City as part of my PhD. My reason for visiting was a kind of ice-breaking activity called the CUSP (Center for Urban Science and Progress) Challenge Week. This consisted of working with my PhD cohort (photographed) as well as the 80-something NYU students starting their Urban Science masters courses at CUSP to tackle urban data problems.

Read more...

Introduction

As part of my continuing work on Partridge, I’ve been working on improving the sentence splitting capability of SSSplit – the component used to split academic papers from PLOS ONE and PubMed Central into separate sentences.

Papers arrive in our system as big blocks of text with the occasional diagram, formula or citation, and in order to apply CoreSC annotations to the sentences we need to know where each sentence starts and ends. Of course, that means we also have to take into account the other ‘stuff’ (listed above) floating around in the documents too. We can’t just ignore formulae and citations – they’re pretty important! That’s what SSSplit does: it carves up papers into sentence (`<s>`) elements whilst leaving the XML structure of the rest of the document intact.
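The core idea can be caricatured in a few lines of Python – SSSplit itself is far more careful about abbreviations, formulae and citations than this naive regex, so treat it only as an illustration of the problem:

```python
import re

def naive_split(text):
    """Split plain text into sentences on ., ! or ? followed by whitespace
    and a capital letter, wrapping each one in an <s> element."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return ["<s>%s</s>" % p for p in parts if p]
```

A splitter this simple falls over immediately on things like “et al.” or inline formulae, which is exactly why SSSplit has to work over the XML structure rather than raw text.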

Read more...

When I’m working on Partridge and SAPIENTA, I find myself dealing with a lot of badly formatted XML. I used to manually run `xmllint --format` against every file before opening it, but that gets annoying very quickly (even if you have it saved in your bash history). So I decided to write a Nemo script that does it automatically for me.

    #!/bin/bash

    # Nemo passes the selected paths in $NEMO_SCRIPT_SELECTED_FILE_PATHS,
    # one per line, so read them line by line to cope with spaces in names.
    echo "$NEMO_SCRIPT_SELECTED_FILE_PATHS" | while read -r xmlfile; do
        if [[ $xmlfile == *.xml ]]; then
            # Pretty-print to a temp file, then replace the original.
            xmllint --format "$xmlfile" > "$xmlfile.tmp" && mv "$xmlfile.tmp" "$xmlfile"
        fi
    done
Read more...