Big Data

Explainer

Post date

8th February 2018

What is big data?

Big data is a term used to describe the application of analytical techniques to search, aggregate, and cross-reference large data sets in order to develop intelligence and insights. These large data sets can range from publicly available data sets to internal customer datasets held by a particular company. Increasingly, big data includes not only openly available information but extends to information collected by the private sector. This includes Twitter feeds, Google searches, and call detail records held by network providers.

Big data, now a ubiquitous, albeit vague, buzzword, has been heralded as a boon for several fields, including private industry, research, medicine, science, government, and humanitarian aid. Proponents of big data in any sector often claim that applying sophisticated algorithms to huge volumes of data will reveal greater insight into a given topic. While access to such data is posited as opening opportunities for a variety of fields, it also has the potential to seriously threaten the right of individuals to keep their personal information private and to have control over how their information is used.

'Big data' becomes a justification for amassing vast amounts of information, processing the information for multiple and often unforeseen reasons beyond what the individuals who may have shared that information intended it for, and using that information to glean intelligence about individuals, groups, and even whole societies.

Is 'big data' the same thing as 'data mining'?

Data mining (sometimes referred to as machine learning) is the process of extracting useful information from large amounts of data. What constitutes 'useful information' is task dependent, hence the term 'data mining' is slightly ambiguous. It can be used to describe collecting aggregate data, finding correlations in data, or using data to make predictions. Although these techniques have different consequences for privacy, a common theme in data mining is the collection of a mass amount of data, which raises its own privacy issues. Another common theme is that data mining is almost always a secondary use of the data. Data is very rarely generated for data mining; rather, data that was intended for another purpose (e.g. the content of communications) is mined in order to get useful information from it. This raises the issue of whether the person who created the data is aware that it is being mined.

When a user agrees that his or her information may be used for mining, he or she has very little knowledge of exactly how it will be used. Once the information is in a company's hands through a contractual agreement, the company can do what it wants with it, whether that is malicious or harmless. However, many users know very little about the potential for abuse when agreeing to such a requirement. In a big data world, the individual may not be aware at all that his or her information is even being collected, and has practically no control over the purposes for which it is mined and analysed, or over the ethics and effects of that analysis.

What are some examples of how big data works?

Let's say you shop at the same store often, and use a loyalty card. A company could take its customer database, which draws on this loyalty card information and the purchasing habits of customers, and cross-reference that data set with census data that contains aggregated income, property ownership, age and other information about particular areas. The company can then identify customers who would be most likely to respond to particular advertising.

In cross-referencing and combining data sets, big data analytics can put together pieces of data that in isolation contain seemingly innocuous information to produce a detailed picture of an individual's life. This generation of new personal information through the application of predictive algorithms raises serious privacy concerns: when joining a loyalty scheme, a customer may consent to sharing data, but may not appreciate - or consent to - the ultimate uses to which that data might be put.
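The loyalty-card scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (such as postcode_area and median_income), and the targeting thresholds are all hypothetical and are not drawn from any real retailer or census. It shows how area-level aggregates, once joined to individual records, become attributes attached to a named customer.

```python
# Hypothetical loyalty-card records: one row per identifiable customer.
customers = [
    {"customer_id": 1, "postcode_area": "AB1", "monthly_spend": 220.0},
    {"customer_id": 2, "postcode_area": "AB1", "monthly_spend": 35.0},
    {"customer_id": 3, "postcode_area": "CD2", "monthly_spend": 180.0},
    {"customer_id": 4, "postcode_area": "EF3", "monthly_spend": 60.0},
]

# Hypothetical census aggregates: one row per area, no individuals named.
census_by_area = {
    "AB1": {"median_income": 52000, "home_ownership_rate": 0.71},
    "CD2": {"median_income": 24000, "home_ownership_rate": 0.32},
}


def cross_reference(customers, census_by_area):
    """Attach area-level census attributes to each individual record."""
    enriched = []
    for customer in customers:
        area = census_by_area.get(customer["postcode_area"])
        if area is None:
            continue  # customers in areas with no census row drop out
        enriched.append({**customer, **area})
    return enriched


def select_targets(enriched, min_income, min_spend):
    """Pick customers whose inferred profile passes both thresholds."""
    return [
        c["customer_id"]
        for c in enriched
        if c["median_income"] >= min_income and c["monthly_spend"] >= min_spend
    ]


enriched = cross_reference(customers, census_by_area)
targets = select_targets(enriched, min_income=40000, min_spend=100.0)
print(targets)  # [1]
```

Neither data set on its own says anything about customer 1's likely income. The join is what produces that inference, and it is the step a customer joining the scheme is least likely to have anticipated.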

One famous case of big data gone awry involves US megastore Target, which used big data analysis of its own data sets to identify pregnant customers and sent "baby vouchers" to customers it believed to be pregnant. Understandably, receiving advertising that indicates a store knows you are pregnant - possibly before you've even told your family - was unnerving for many women. Target recognised this "creepy" factor and started adding in random advertisements alongside those for baby products.

As a Target executive explained: “[W]e found out that as long as a pregnant woman thinks she hasn’t been spied on, she’ll use the coupons. She just assumes that everyone else on her block got the same mailer for diapers and cribs. As long as we don’t spook her, it works.” This example highlights the need for new mechanisms to oversee, inform, and, when necessary, legislate, in order to establish good processes around big data.

Big data is also seen as a way to more efficiently fight crime and protect national security. While we're not quite at a "Minority Report" level of stopping "pre-crime", big data is a tool being trumpeted and deployed by law enforcement around the world to combat, for instance, gun crimes and sex crimes. Several cities across the US are using such tools to analyse crime statistics, map where crimes are more likely to occur, and allocate resources accordingly.

What are the problems with big data?

New technologies are enabling the creation of new forms and high quantities of data that can inform policy-making processes, increasing the potential effectiveness and efficiency of public policy and administration. However, inaccuracies can exist in the data used, because the data is not regularly updated, relates only to a sample of the population, or lacks contextual analysis.

Big data has the potential to discriminate in two ways. First, it can be used to identify aberrant data amongst larger sets, leading to the use of big data to discriminate against specific groups and activities. One example, quoted in the White House report on big data, comes from research showing that web searches involving black-identifying names (e.g., “Jermaine”) were more likely to display ads with the word “arrest” in them than searches with white-identifying names (e.g., “Geoffrey”). The Wall Street Journal also found a number of cases of price discrimination.

Second, big data will be used to draw conclusions about large groups of people, and yet some will be excluded because their data is not included in the sets, or because the quality of their data is poorer. For instance, there is a great level of interest in big data amongst developing countries and humanitarian organisations -- the very fields where the subjects of these analyses are least empowered, are less likely to be included in systems, and, when included, are likely to have inaccurate data recorded about them.

It is important to remember that data does not equal truth. It only offers correlations -- for example, links between two different types of activities -- but does not provide a causal link. After much hype, Google Flu Trends, which relied on analysis of searches, social media and other sources, failed spectacularly and massively overestimated the expected incidence of flu.
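The gap between correlation and causation can be shown with a small simulation. This is a minimal sketch with made-up numbers, not a model of Google Flu Trends or of any real data set: the names hidden_factor, activity_a and activity_b are hypothetical. Two activities are both driven by a third, unobserved factor, so they correlate strongly even though neither causes the other.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical unobserved driver that influences both activities.
hidden_factor = [rng.gauss(0, 1) for _ in range(5000)]

# Each activity depends on the hidden factor plus its own noise.
# Neither activity appears in the formula for the other.
activity_a = [h + rng.gauss(0, 0.3) for h in hidden_factor]
activity_b = [h + rng.gauss(0, 0.3) for h in hidden_factor]

observed_r = pearson(activity_a, activity_b)

# "Intervene" on activity A by setting it independently of everything else.
# Activity B is unchanged, and the correlation disappears.
forced_a = [rng.gauss(0, 1) for _ in range(5000)]
intervened_r = pearson(forced_a, activity_b)

print(f"observed correlation:   {observed_r:.2f}")
print(f"after intervening on A: {intervened_r:.2f}")
```

An analyst who sees only the two activities would find a strong link between them, yet changing one does nothing to the other. Only knowledge of how the data was generated reveals that.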

Nonetheless, governments and companies are seeking to accumulate and analyse vast amounts of data in the hope of deriving accurate insights into human behaviour. Since the 1970s, US industry in particular has been keen to accumulate large amounts of information on consumers and run algorithms against that data, and over the past twenty years this form of data mining and automated decision-making has increased rapidly. What began as an activity by credit record agencies has expanded to air travel (passenger profiling), anti-terrorist systems, border management (automated-targeting system), and money-laundering (suspicious transaction reporting and analysis).

What is new is that there is now an industry around big data, selling solutions to governments and companies, while there are new opportunities for data collection -- whether through mass communications surveillance, the merging of data sets, or the deployment of new sensor technologies and the emerging 'internet of things'.

What about big data used for humanitarian or development purposes?

While big data may carry benefits for development initiatives, it also carries serious risks, which are often ignored. In pursuit of the promised social benefits that big data may bring, it is critical that fundamental human rights and ethical values are not cast aside.

One key advocate and user of big data is UN Global Pulse, launched in 2009 in recognition of the need for more timely information to track and monitor the impacts of global and local socio-economic crises. This initiative explores how digital data sources and real-time analytics technologies can help policymakers understand human well-being and emerging vulnerabilities in real time, in order to better protect populations from shocks.

UN Global Pulse clearly identified the privacy concerns linked to its use of big data in “Big Data for Development: Challenges & Opportunities” and has adopted Privacy and Data Protection Principles. While these are positive steps in the right direction, more needs to be done, given the increasingly complex web of actors concerned, the expanding scope of their work, the growing amount of data that can be collected on individuals, and the poor legal protections in place.

A recurring criticism of big data and its use to analyse socio-economic trends for the purpose of developing policies and programmes is that the data collected does not necessarily represent those at whom these policies are targeted. The collection of data may itself be exclusionary when it relates only to users of a certain service (e.g. health care, social benefits), platform (e.g. Facebook users, Twitter account holders) or other grouping (e.g. online shoppers, loyalty card members of airlines or supermarkets).

In the developing world, only 31 per cent of households are online. More than 90 per cent of the 4 billion people who are not connected to the Internet are located in the developing world. Some countries have less than 10 per cent of their population active on the internet. This means whole populations can be excluded from data-based decision-making processes.
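A short worked example shows how much this kind of exclusion can distort a conclusion. The 31 per cent online share comes from the figure above. Everything else is an assumption made purely for illustration: the two rates (0.80 and 0.30) stand for a hypothetical indicator, such as the share of households with access to some service, and do not describe any real country.

```python
def weighted_mean(groups):
    """Average of each group's value, weighted by its population share."""
    total_share = sum(g["share"] for g in groups)
    return sum(g["share"] * g["value"] for g in groups) / total_share


# Hypothetical population split. Only online households leave digital traces.
groups = [
    {"name": "online households", "share": 0.31, "value": 0.80, "observed": True},
    {"name": "offline households", "share": 0.69, "value": 0.30, "observed": False},
]

# What is actually true across the whole population.
true_rate = weighted_mean(groups)

# What an analysis sees if it relies only on data from online households.
observed_rate = weighted_mean([g for g in groups if g["observed"]])

print(f"true rate:     {true_rate:.3f}")      # 0.455
print(f"observed rate: {observed_rate:.3f}")  # 0.800
```

Under these assumed numbers, a policy built on the online data alone would overstate the indicator by more than 30 percentage points, and the households it misjudges are exactly those absent from the data.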

The short- and long-term consequences of collecting data in environments where appropriate legal and institutional safeguards are lacking have not been properly explored. Amassing and analysing data always has the potential to enable surveillance, regardless of the well-intentioned objectives that may underpin its collection. Development is not merely about economic prosperity and social services. It is about providing individuals with a safe environment in which they can live in dignity.

What about anonymised data? Does there need to be consent for this data to be used?

Because big data is derived from aggregated data from various sources (which are not always identifiable), there is no process to request the consent of a person for the resulting data that emerges. In many cases, that data is more personal than the set of data the person consented to give.

In October 2012, MIT and the Université Catholique de Louvain, in Belgium, published research proving the uniqueness of human mobility traces and the implications this has on protecting privacy. The researchers analysed the anonymised data of 1.5 million mobile phone users in a small European country collected between April 2006 and June 2007, and found that just four points of reference, with fairly low spatial and temporal resolution, were sufficient to uniquely identify 95 per cent of them. This showed that even if anonymised datasets do not contain a name, home address, phone number or other obvious identifier, the uniqueness of individuals’ patterns (e.g. the locations a user visits most often) means the information could be linked back to them.
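The idea behind this finding can be illustrated with a toy data set. This is a sketch, not the researchers' method or data: the five traces below are invented, each point is a hypothetical (location, hour) pair, and the function counts a user as identifiable if any choice of k points from his or her trace matches no one else, which is a simplification of the study's random sampling of points.

```python
from itertools import combinations

# Hypothetical "anonymised" traces: no names, only (location, hour) points.
traces = {
    "u1": {("A", 8), ("B", 12), ("C", 18)},
    "u2": {("A", 8), ("B", 12), ("D", 18)},
    "u3": {("A", 8), ("E", 12), ("C", 18)},
    "u4": {("F", 8), ("B", 12), ("C", 18)},
    "u5": {("F", 8), ("E", 12), ("D", 18)},
}


def matching_users(traces, known_points):
    """Return every user whose trace contains all of the known points."""
    return [user for user, trace in traces.items() if known_points <= trace]


def fraction_identifiable(traces, k):
    """Share of users singled out by at least one set of k of their points."""
    identifiable = 0
    for user, trace in traces.items():
        for points in combinations(sorted(trace), k):
            if matching_users(traces, set(points)) == [user]:
                identifiable += 1
                break
    return identifiable / len(traces)


for k in (1, 2, 3):
    print(k, fraction_identifiable(traces, k))
# 1 0.0
# 2 0.8
# 3 1.0
```

In this toy set every single point is shared by at least two people, so one observation identifies no one. Knowing two points singles out four of the five users, and three points single out all of them, even though the data contains no names.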

Advocates of big data for development argue that there is no need to request consent because they concern themselves with unidentifiable anonymised data. Yet, even if one actor in one context uses data anonymously, this does not mean that the same data set will not be de-anonymised by another actor. UN Global Pulse can promise that it will not do anything that could potentially violate the right to privacy and permit re-identification, but can it guarantee that others along the process apply the same ethical safeguards?

Is big data here to stay?

While the term "big data" may disappear over time, it is safe to assume that these types of data analytics will be around for some time, in one shape or another. Efforts will continue to collect vast amounts of data, and to reuse it for secondary or further purposes unforeseeable to the individual. Intelligence will be gleaned from this data, and the individual will not be involved in decisions about him or her, about his or her data, or about how that data and the resulting analyses are put to use.

To begin with, there must be more accountability when using these datasets. "Data due process" has been advanced as one way to bring accountability to big data analytics; those who have had decisions made about them on the basis of big data analytics would have the right to know how that analysis was carried out. In their published paper, "Big data and Due Process: Towards A Framework to Redress Predictive Privacy Harms", Crawford and Schultz propose a new framework for a “right to procedural data due process,” arguing that “individuals who are privately and often secretly 'judged' by big data should have similar rights to those judged by the courts with respect to how their personal data has been used in such adjudications.”

There must also be greater recognition of the challenges of discrimination. Big data is not a perfect science that will inform and shape the perfectly modelled society. Big data can fuel discrimination. The lack of data will mean people will be excluded, and though this may become the motivation for stronger controls, worryingly the same point is also used as an argument for more data collection in the developing world. The logic being: because people are not within data sets, we must work harder to collect more data on people to combat discrimination.

Finally, given the large amount of secrecy around big data, the laws that protect individuals' personal information must apply to big data systems too. As organisations accumulate more information, they must be held to account for how they collected the information, how they put this information to use, how individuals are affected by the use of this information, and whether individuals were granted the opportunity to engage with the system. At the moment, their plans are to disclose as little as possible, decide at their whim how information will be processed, and keep the intelligence to themselves.

Whether it is a company or the state, this is unacceptable in an open and democratic society.