It's often repeated among the very smart people I work with that 80% of their work is finding the data in the first place, then cleaning it and making it fit for purpose. Only then can models be created, which are refined continuously as more data becomes available. COVID-19 is touching every conceivable facet of life, and that is reflected in the data being generated as a result.
Be it virus test results, traffic flows, shopper behaviour or what is spoken about on public social networks, the data flow can appear to be a torrent. Beneath all those public flows lies an iceberg of hidden data, a seemingly endless list of "known unknowns", and we can only wait for the "unknown unknowns" to come to light.
Another common idea among data scientists and machine learning and AI experts is that if we had all the data, no question would go unanswered. Indeed, good, trusted data is the lifeblood and foundation of machine learning and AI work we can place trust in. Every aspect of existence could potentially be modelled and forecast.
But the reality is we don't have all the data (at this moment). It may appear to an outside observer that there's a truly massive pool of data being generated on COVID-19 – and indeed, never before have I witnessed such an extensive public gathering and sharing of data on both a local and global stage – but much remains hidden.
To give but one example: One of the most important features in the study of virus spread is human movement itself. We understand the role airlines played in the initial spread. We can trace back airline timetables and routes. This tells us with a great deal of certainty that shutting down air travel is one of the crucial elements in mitigating spread.
But the daunting task we face now is to restrict more earth-bound human movement – an even more devastating vector than air travel. A very large proportion of the population carries one of the most important sensors ever developed: the mobile phone. And indeed, some regimes have made moves to encourage or enforce the installation of mobile apps to both track movement and keep citizens informed. No doubt that had a role to play in the apparent successes we have seen in some nations in combating the spread – and we're still in early days.
But the challenges of replicating the same model in other parts of the world are large indeed.
Several models have been developed around the projected spread of COVID-19, such as the Oxford and the Imperial College London models. Such models play a critical role in forecasting and informing decisions at the highest level. As more data becomes available, the models can incorporate that new data and their projections can be refined. This is the scientific process.
Ultimately, the more granular the data, ie the more localised and specific to a particular geography, the higher the confidence that can exist in the projection. There is a critical role to play for custodians of big data sets to make those data sets available to those who can apply them to the problem, while striking a balance and ensuring that civil liberties are preserved – lest we have a repeat of data being repurposed for an agenda other than that for which it was originally intended.
A small example: One little corner of the COVID-19 data pool is relationships and interactions on social media, both public and hidden. Studying it helps us identify mis-/disinformation, which has a very real impact outside of the virtual spaces. Posts on one popular social media platform are being generated at upwards of 20 per second globally, a testament to how deeply and extensively COVID-19 has touched every aspect of our lives.
Algorithms in this case can help us draw deeper insight, and paint a slightly more comprehensive picture, by identifying new themes and concerns as they develop; identifying the voices which make a positive or negative impact (please triple-check that message you’ve just received before forwarding it); and possibly playing a part in early detection of unforeseen disruptions caused by the virus.
These are just a few ways in which data, big and not so big, can help us all in this time.
CTO, VoxCroft Analytics