Data – it’s not just boring tables
Louisa Nolan from @DataSciCampus has blogged for us ahead of our data webinar with examples of the exciting possibilities out there coming from new types of data and new analytical tools. Join us on the 16th to find out more about Why using data effectively enables better decision making.
Data is exciting, and these days we can extract interesting information not just from tables of survey results or management information (although these are still of course important) but also from large volumes of documents, images or sensor readings. Data science gives us the tools to rapidly analyse these types of data, in ways that would not have been possible even just a few years ago. It is this combination of new types of data and new tools to analyse them that is so exciting, because it opens a whole new world of insight!
As a lead data scientist at the Data Science Campus of the Office for National Statistics, I get to think about data every day (this is a Good Thing!). We develop and deliver data science projects addressing difficult questions for our public sector customers, we offer advice and training on data science, and we run deep dives, hackathons and workshops with multidisciplinary teams to find solutions to data challenges.
In this blog, I’d like to share a few examples of how we have been using new types of data and applying data science techniques to tackle challenges that aren’t met by more traditional approaches. This is just a selection, so please visit our website, follow us on Twitter, or get in touch if you would like to discuss how we could support you to adopt or adapt these projects, or if you would like to discuss your own data science challenges.
What are people talking about when they talk about Wales on Twitter?
We were posed this question by the National Assembly for Wales, who wanted to understand what people were interested in when they talked about Wales. Our data scientists built a tool for topic analysis of the text of tweets containing #Wales. Topic analysis is a technique which groups text (in this case tweets) into related subjects. For the period we analysed, we found topics on tourism, sport (including rugby, of course), a business exposition in Cardiff, and, somewhat unexpectedly, a topic on Indian street children! This topic was related to the book ‘A Hundred Hands’, published that week by Diane Noble, a Welsh author. The tool can be easily adapted to analyse other hashtags of interest.
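To give a flavour of what "grouping text into related subjects" means, here is a minimal sketch in Python. It is not the tool our data scientists built: real topic analysis normally uses a statistical topic model, whereas this toy version simply puts tweets together when they share enough words. The function names, the stopword list and the 0.2 threshold are all assumptions made for illustration.

```python
import re

# Words too common to say anything about a topic. "wales" is included
# because every tweet in the sample carries the #Wales hashtag.
STOPWORDS = {"the", "a", "in", "at", "of", "to", "and", "for", "on", "is", "wales"}


def tokens(tweet):
    """Return the set of lower-case words in a tweet, minus stopwords."""
    words = re.findall(r"[a-z]+", tweet.lower())
    return {w for w in words if w not in STOPWORDS}


def jaccard(a, b):
    """Overlap between two word sets: 0.0 (nothing shared) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_tweets(tweets, threshold=0.2):
    """Greedily assign each tweet to the most similar existing group,
    or start a new group if none is similar enough (threshold is illustrative)."""
    groups = []  # each group: {"vocab": set of words, "tweets": list of str}
    for tweet in tweets:
        words = tokens(tweet)
        best = max(groups, key=lambda g: jaccard(words, g["vocab"]), default=None)
        if best is not None and jaccard(words, best["vocab"]) >= threshold:
            best["tweets"].append(tweet)
            best["vocab"] |= words
        else:
            groups.append({"vocab": set(words), "tweets": [tweet]})
    return groups
```

Fed a handful of invented tweets about rugby and about hiking holidays, this puts the rugby tweets in one group and the holiday tweets in another. A proper topic model does the same job far more robustly, and also tells you which words characterise each topic.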
Mapping the urban forest at street level
Using images sampled from Google Street View, the team has developed an experimental method to map the density of trees and vegetation at 10 metre intervals in English and Welsh towns and cities – this is hyper-local mapping! The team has also built a pipeline for processing and analysing the images, which could potentially be used for other types of analysis of Street View images.
Analysing the text of patent applications to understand emerging technologies
In this project, large volumes of US patent applications have been analysed, to explore whether emerging technology (aka ‘the Next Big Thing’!) can be identified from the text. This is a great demonstration of how the power of data science can unlock data. Even 5 years ago, text documents like these patent applications would likely have had to be analysed laboriously by hand. Now, we can rapidly analyse large quantities of text to extract useful information to inform decision-making.
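One simple way to think about spotting an emerging technology in text is to look for terms that appear in a much larger share of documents in a later period than in an earlier one. The sketch below illustrates that idea only; it is an assumption about how such a comparison could be set up, not a description of the project's actual method, and the function names and the `min_growth` parameter are hypothetical.

```python
import re
from collections import Counter


def term_shares(docs):
    """Share of documents (0.0 to 1.0) in which each word appears at least once."""
    if not docs:
        return {}
    counts = Counter()
    for doc in docs:
        # Use a set so a word repeated within one document is counted once.
        counts.update(set(re.findall(r"[a-z]+", doc.lower())))
    return {term: n / len(docs) for term, n in counts.items()}


def emerging_terms(earlier_docs, later_docs, min_growth=0.5):
    """Terms whose document share rose by at least min_growth between periods."""
    before = term_shares(earlier_docs)
    after = term_shares(later_docs)
    growth = {term: share - before.get(term, 0.0) for term, share in after.items()}
    return sorted(term for term, g in growth.items() if g >= min_growth)
```

With real patent text you would want to work with phrases rather than single words, and to filter out common words, but the underlying comparison between periods is the same.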
Turning free text lists into hierarchical groups
Sometimes, we have short, free text descriptions or lists – perhaps a list of products purchased or transported. To make use of these, we need to group them into similar products and account for spelling mistakes, typos and different abbreviations. This is theoretically possible by hand, but usually prohibitively labour-intensive. This project automates the hierarchical classification. Because the approach is both syntactic (how the word is spelled) and semantic (what the word means), we can group, for example, whisky and vodka together, and correctly assign steel products, steel prod, and steel produtc to the same category. This tool could be adapted for various datasets of free text responses.
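The syntactic half of this idea can be sketched in a few lines using Python's standard `difflib` module, which scores how similar two strings are. This is an illustration under assumptions rather than the project's code: the function names and the 0.8 threshold are made up for the example, and the semantic half (knowing that whisky and vodka are both spirits) needs a model of word meaning that is not shown here.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Spelling similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_by_spelling(descriptions, threshold=0.8):
    """Put each description in the first group whose representative
    (the group's first entry) is spelled similarly enough; otherwise
    start a new group. The threshold is an illustrative assumption."""
    groups = []
    for text in descriptions:
        for group in groups:
            if similarity(text, group[0]) >= threshold:
                group.append(text)
                break
        else:
            # No existing group was close enough.
            groups.append([text])
    return groups
```

Given steel products, steel prod, steel produtc, whisky and vodka, this groups the three steel entries together but leaves whisky and vodka in separate groups, because their spellings have little in common. That gap is exactly why the semantic side of the approach matters.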
I hope that has given you a taster of some of the things we are working on in the Campus, and maybe some ideas of what you might be able to do with your own data. And I hope I have also convinced you that data doesn’t have to be boring!