[][1]
Warwick CDT intake 2015: From left to right – at the front Jacques, Zakiyya, Corinne, Neha and myself. Rear: David, John, Stephen (CDT director), Mo, Vaggelis, Malkiat and Greg

Hello again readers – those of you who follow me on other social media (Twitter, Instagram, Facebook etc.) probably know that I’ve just returned from a week in New York City as part of my PhD. My reason for visiting was a kind of ice-breaking activity called the CUSP (Center for Urban Science + Progress) Challenge Week. This consisted of working with my PhD cohort (photographed) as well as the 80-something NYU students starting their Urban Science masters courses at CUSP to tackle urban data problems.

We were split into 20 random teams of 4 or 5 people and assigned an ‘Urban Science’ task. These tasks involved taking data sets, usually collected by CUSP staff members, and analysing them. Our group had to investigate “Street Quality in New York City”, which turned out to mean analysing data on the city’s potholes. The problem may sound a little dull but once you get going it actually gets quite exciting! Potholes cost NYC millions of dollars per year in litigation, and getting them fixed before someone falls into one or damages their car could save the city lots of money.

[An amusing image found by one of my pothole challenge teammates][2]
An amusing image found by one of my pothole challenge teammates

We were given a set of accelerometer readings tied to photographs captured by a device designed by our mentors Varun and Graham. The device is fitted to the dashboard of a car and takes 3D accelerometer readings every second as well as a photo as you drive along. The result is a dataset that roughly records where the driver encounters the most “bumpy” roads in the city. The dataset only covered one neighbourhood in Brooklyn known as Cobble Hill. However, it was still extensive enough to give us an idea of how you might identify poor quality roads in other areas of the city.

The other dataset we were given was the GPS coordinates of 311 complaints made about street quality during 2015. For those unfamiliar, 311 is a non-emergency hotline you can call in NYC to have a moan about some aspect of the city – no one’s been to pick up my rubbish, there’s a pothole in the road by my house, someone’s graffitied the bus stop – that sort of thing. I think the closest parallel we have to 311 in the UK is NHS Direct, i.e. you call 111 if you have a cold and 999 if you’re having a heart attack. Thanks to New York’s open data initiative, we had geo-plots for all ‘street quality’ related queries right at our fingertips.

It was time to get stuck in. The team had a brainstorm about what questions we should try and answer based on the data available. We came up with around 20, but since we only had about 10 hours to investigate in total, we decided to restrict ourselves to 3:

  1. Can we find a correlation between the accelerometer data and 311 complaints, in order to show that 311 complaints could be used as a proxy for potholes?
  2. Does the number of 311 complaints correlate with the population density or average salary of nearby residents? We hypothesised that more prosperous, well-travelled areas of the city might be more eager to complain about the road quality.
  3. Can we train a machine learning system to recognise the presence or absence of potholes given input images and accelerometer data?

Linfeng – a teammate and budding applied mathematician – answered question three first. Using the accelerometer data and associated images he set about building a binary classifier: a system that could take an X,Y,Z reading from the accelerometer and spit out a “yes that’s a pothole” or “no that’s not a pothole” reading. He did this by manually eyeballing all 843 images that the sensor snapped and putting each of the accelerometer readings into one or the other of the two categories. After training using 5-fold cross-validation, the final classifier worked with something like 73% accuracy. We felt that this was pretty good for a first run and that there may be other features that could help this classification. However, our initial finding was that yes, it is definitely possible to train a machine learning classifier to detect potholes.
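To make the cross-validation step concrete, here is a minimal sketch of 5-fold cross-validation in plain Python. The post does not say which classifier Linfeng used, so this swaps in a simple nearest-centroid classifier as a stand-in, and the readings are synthetic numbers invented for illustration (not the Cobble Hill data). All function names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def centroid(rows):
    """Component-wise mean of a list of (x, y, z) readings."""
    n = len(rows)
    return tuple(sum(r[d] for r in rows) / n for d in range(3))

def sq_dist(a, b):
    """Squared Euclidean distance between two readings."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def predict(reading, centroids):
    """Return the label whose centroid is closest to the reading."""
    return min(centroids, key=lambda label: sq_dist(reading, centroids[label]))

def cross_validate(readings, labels, k=5):
    """Mean accuracy of a nearest-centroid classifier over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(readings), k):
        held = set(held_out)
        train = [i for i in range(len(readings)) if i not in held]
        # One centroid per class, computed from the training folds only.
        centroids = {
            label: centroid([readings[i] for i in train if labels[i] == label])
            for label in (0, 1)
        }
        correct = sum(predict(readings[i], centroids) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

# Synthetic data (an assumption): label 0 = smooth road, label 1 = pothole,
# with potholes given a larger and noisier vertical (z) reading.
rng = random.Random(42)
readings, labels = [], []
for _ in range(100):
    readings.append((rng.gauss(0, 0.5), rng.gauss(0, 0.5), rng.gauss(9.8, 0.5)))
    labels.append(0)
    readings.append((rng.gauss(0, 1.0), rng.gauss(0, 1.0), rng.gauss(13.0, 1.0)))
    labels.append(1)

accuracy = cross_validate(readings, labels, k=5)
print(f"mean 5-fold accuracy: {accuracy:.2f}")
```

The synthetic classes are far easier to separate than real road data, so the accuracy printed here says nothing about the 73% figure above; the point is only the shape of the train-on-four, test-on-one loop.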

[Potholes and 311 complaints in Cobble Hill.][5]
Potholes (blue) and 311 complaints (orange) in Cobble Hill. Click to open interactive map.

The next task was to see if there was any correlation between 311 complaints and the pothole data. We used our classifier to keep only the points it judged to be potholes. The sensor data was recorded on the 3rd of May, so we also decided to filter 311 complaints by time of report. We thought it was reasonable to assume that potholes found in May would have been reported in April at the earliest and fixed by June. Including 311 complaints from too far in the past or future would add noise to our investigation and slow things down.
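The time filter can be sketched in a few lines. The exact cut-off days are an assumption (the post only says April at the earliest and June at the latest), and the record layout and the `in_window` helper are hypothetical.

```python
from datetime import date

# Assumed window: the exact start and end days are a guess based on
# "reported in April at the earliest and fixed by June".
WINDOW_START = date(2015, 4, 1)
WINDOW_END = date(2015, 6, 30)

def in_window(complaints, start=WINDOW_START, end=WINDOW_END):
    """Keep complaints whose report date falls inside [start, end]."""
    return [c for c in complaints if start <= c["reported"] <= end]

# Hypothetical records; the real 311 export has many more fields.
complaints = [
    {"id": 1, "reported": date(2015, 1, 12)},
    {"id": 2, "reported": date(2015, 4, 20)},
    {"id": 3, "reported": date(2015, 5, 3)},
    {"id": 4, "reported": date(2015, 9, 1)},
]

recent = in_window(complaints)
print([c["id"] for c in recent])
```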

We overlaid the 311 data onto the map of potholes to see how well they lined up. There was some loose correlation, but the two did not line up brilliantly. Upon reflection we realised that the GPS location associated with a 311 complaint represents where the call was made from rather than the location of the pothole the call was about. It is a fair assumption that most people would wait to make such a call from a safe location rather than stopping in the road as soon as they encounter a hole. We also realised that multiple calls could be made regarding the same pothole but from different locations. These two observations support the need for a more granular data capture system like Graham and Varun’s.
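We judged the overlay by eye, but one way to put a number on “how well they lined up” is to ask what share of detected potholes have a complaint within some radius. The sketch below uses the haversine formula for distance; the 50 m radius, the coordinates and the function names are all assumptions for illustration, not something we computed during the week.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def share_with_complaint_nearby(potholes, complaints, radius_m=50):
    """Fraction of potholes with at least one complaint within radius_m."""
    hits = sum(
        1 for p in potholes
        if any(haversine_m(p[0], p[1], c[0], c[1]) <= radius_m for c in complaints)
    )
    return hits / len(potholes)

# Made-up (lat, lon) points roughly in the Cobble Hill area.
potholes = [(40.6860, -73.9960), (40.6900, -73.9900)]
complaints = [(40.6861, -73.9960)]

share = share_with_complaint_nearby(potholes, complaints)
print(f"share of potholes with a complaint within 50 m: {share:.2f}")
```

Because the complaint location is where the call was made, a low share here would not prove that the potholes went unreported; it would mostly reflect the location problem described above.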

[Complaints and population density. Click for an interactive map][6]
Complaints and population density. Click for an interactive map.

Finally we looked into the issue of population density vs. potholes. I struggled for a while to find a map of population density and ended up having to make my own. New York City has a map of what it calls Neighborhood Tabulation Areas (NTAs). These are small geographical areas used to tabulate census statistics, i.e. each NTA has its own population density figure. I found a dataset of NTAs covering New York City and another dataset for population density by NTA. I was able to ‘wrangle’ the two datasets together and plot them on a map. I then did some geo-SQL, summing the 311 pothole complaints for each NTA and storing the totals in a database table. This allowed me to plot a map showing both population density and 311 complaint ‘density’ for the whole of New York.

Interestingly (but perhaps not surprisingly) the map shows a strong positive correlation between population density and 311 complaints. However, Staten Island as a borough is an outlier, having a lower population but a large number of complaints. I read that Staten Island has higher vehicle ownership per capita and that this might explain the discrepancy. However, I did not have time to investigate this further. The correlation between population density and complaint density further supports the need for more granular data collection. More people are complaining, but are they all complaining about the same potholes or are they just better at finding potholes? Are there more potholes in the road or just more people to complain about them? These are questions that could be answered with more data.
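For anyone curious what the per-NTA aggregation looks like, here is a rough Python stand-in for the geo-SQL step (not the queries I actually ran). It counts points per area with a ray-casting point-in-polygon test and then computes a Pearson correlation against density. The three square “areas”, their density figures and the complaint points are all invented for illustration.

```python
import math

def point_in_polygon(x, y, polygon):
    """Ray casting: toggle on each edge crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def count_by_area(points, areas):
    """Count how many points fall inside each named polygon."""
    counts = {name: 0 for name in areas}
    for x, y in points:
        for name, polygon in areas.items():
            if point_in_polygon(x, y, polygon):
                counts[name] += 1
                break  # areas are assumed not to overlap
    return counts

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

# Invented stand-ins for NTA polygons and their population densities.
areas = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "C": [(2, 0), (3, 0), (3, 1), (2, 1)],
}
density = {"A": 10, "B": 20, "C": 30}

# Invented complaint locations in the same coordinate system.
points = [(0.5, 0.5), (1.2, 0.5), (1.8, 0.3), (2.1, 0.9), (2.5, 0.5), (2.9, 0.1)]

counts = count_by_area(points, areas)
names = sorted(areas)
r = pearson([density[n] for n in names], [counts[n] for n in names])
print(counts, round(r, 3))
```

A spatial database does the same containment test with an index behind it, which matters once you have thousands of complaints and a couple of hundred polygons.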

I had a great week at CUSP and would like to thank all of their staff for hosting (and putting up with) us.