In this World of Smart Data blog, Rachel Franklin considers what’s needed for effective smart data research
Apologies to Gilbert and Sullivan, but data infrastructure isn’t a very exciting topic and I felt I could use all the entertainment support I could muster.
We talk a lot about smart data these days, especially how exciting it is and its potential to do all sorts of new things. We’re also thinking about the guardrails needed to ensure these data are used wisely, especially around ethics and governance, and how to maximise equitable access to data across the research landscape. Sometimes called “digital footprints”, smart data spans a wide range of data produced by humans and the machines we use. Good examples are store loyalty cards, financial transaction data, internet history, wearables and apps, social media, and sensors and imagery. There tends to be a commercial aspect to this data, and data owners are often private companies.
Data infrastructure
If smart data is exciting, smart data infrastructure tends to be anything but. Like any other form of infrastructure, such as electricity or water, we are happiest when our data infrastructure hums along solidly in the background, making our lives a bit easier and better, without demanding a lot of our attention.
What do I mean by data infrastructure, exactly? It’s datasets, of course, but also all the connecting bits around access, storage, metadata, and integration (or connection). Data infrastructure ensures researchers know what data exists. It tells us what’s included in the data. How to access it. And what applications it’s suitable for.
If you are used to working with data but aren’t convinced that there’s any infrastructure involved, then yours is probably working as it should. Infrastructure is what allows lots of users to access and download data at the same time. Or only download selected observations or variables. Or to integrate datasets and protect identifiable information. In other words, data infrastructure is incredibly important for researchers, consumers of research, and subjects of research.
Smart data
Smart data is still in its infancy, compared to other types of data commonly relied upon by researchers across the social sciences or related domains like health or planning. This is no surprise, given the relative novelty of many forms of smart data but also the ownership of these datasets and the personal nature of much of the information.
Unlike censuses (government-produced) or ongoing surveys (usually research council-funded), smart data sources are generally in private hands. Censuses have had decades to develop protocols and data products that protect the privacy of individual respondents and respond to the needs of the public, industry, and researchers. Surveys, which generate valuable individual, household, or firm-level data, have a different challenge. They must efficiently and effectively provide secure access to individual-level information. They too have had years to develop procedures for data security and user access.
In the case of smart data, however, there is no one owner responsible for facilitating data access. Indeed, there is no one type of smart data. Instead, there are diverse formats, types, and owners, each bringing different infrastructural requirements. Added to that, smart data owners often require additional assurances about data access and integration. Their first priorities are their customers and their bottom lines. Data owners need to be persuaded that providing researchers with access to their data is a good idea. They need to be assured that it won’t come back to harm them.
Infrastructure can help!
Constructing smart data infrastructure
Individual researchers have, of course, long negotiated their own access to smart data, building one-to-one relationships with industry partners. However, individual access is not infrastructure! Infrastructure provides common, uniform, democratic access to data.
In the UK, there’s been a lot of effort to develop smart data infrastructure, much of it funded by the Economic and Social Research Council. The Consumer Data Research Centre (CDRC) and the Urban Big Data Centre (UBDC) are two longstanding investments. Both have pioneered the creation of data infrastructure for UK researchers.
Smart data infrastructure is complicated. It includes the development of data interfaces and management systems so that researchers can explore, locate, and access new types of data. This involves providing timely updates, metadata, and other infrastructural supports. It also means ensuring the data is fit for purpose and the underlying systems are reliable. Unlike infrastructure for more traditional forms of data, smart data infrastructure also involves long-term relationships with data owners. They control what information to share, under what access rules, and with whom.
Built for integration
Access to smart data is increasingly well handled. Visibility is growing and new types of data are becoming available every week. Infrastructure, however, has tended to focus on strengthening provision of one type of data at a time. There are exceptions—for example, in-house development of composite metrics that integrate a range of smart data, offering aggregate outputs to researchers. Researchers hoping to combine multiple smart datasets themselves, though, have been largely out of luck. There are valid reasons for this, of course. Once data are combined, especially when geographical location is involved, the risks of disclosure and individual identification mount substantially. On the other hand, many of the arenas where smart data could contribute new knowledge (mobility and transport, for example, or health) would benefit from integration across multiple datasets, both smart and traditional.
Towards a data ecosystem for the 21st century (where data linkage and integration rule)
Here is where things get really interesting. How do we start to think about the integration of data infrastructures: smart, traditional, census, administrative, and more? What we need is an infrastructure of infrastructures. The UK has a powerful breadth of social and economic data infrastructures, from longitudinal surveys like Understanding Society to the Census. Others include ADR UK (Administrative Data Research UK), a host of Office for National Statistics datasets, and the smart data offerings described above.
These are exciting times for data.
Making the most of all this data will require a lot of thought and a lot of work. Part of the challenge is discoverability: not only what exists but what can be combined with what (and for what useful purpose). Another part is training: ensuring we are producing researchers who know how to use all this data fruitfully and impactfully.
The really meaty challenge, though, is infrastructural. How do we work across data types, disciplines, and domains to build a cross-cutting data infrastructure that users can engage with? This is about computing architecture and systems, but also about access, storage, computing environments and, especially, interfaces that work for researchers. Data infrastructure for the 21st century needs to function at scale, serving up diverse forms of data in a timely fashion to diverse communities of users. And it must provide the sorts of data and functionality that researchers need, not what data providers think they should need! This will require lots of testing, community engagement, and (you knew this was coming) investment. I can hardly wait.