THE PROMISE AND PERILS OF MASSIVE DATA
Experiments, observations, and numerical simulations in many areas of science and business are currently generating terabytes of data, and in some cases are on the verge of generating petabytes and beyond. Analyses of the information contained in these data sets have already led to major breakthroughs in fields ranging from genomics to astronomy and high-energy physics and to the development of new information-based industries. Traditional methods of analysis have been based largely on the assumption that analysts can work with data within the confines of their own computing environment, but the growth of “big data” is changing that paradigm, especially in cases in which massive amounts of data are distributed across locations.
While the scientific community and the defense enterprise have long been leaders in generating and using large data sets, the emergence of e-commerce and massive search engines has led other sectors to confront the challenges of massive data. For example, Google, Yahoo!, Microsoft, and other Internet-based companies hold data measured in exabytes (10^18 bytes). Social media (e.g., Facebook, YouTube, Twitter) have exploded beyond anyone's wildest imagination, and today some of these companies have hundreds of millions of users. Data mining of these massive data sets is transforming the way we think about crisis response, marketing, entertainment, cybersecurity, and national intelligence. It is also transforming how we think about information storage and retrieval. Collections of documents, images, videos, and networks are being thought of not merely as bit strings to be stored, indexed, and retrieved, but also as potential sources of discovery and knowledge, requiring sophisticated analysis techniques that go far beyond classical indexing and keyword counting and instead aim to find relational and semantic interpretations of the phenomena underlying the data.
A number of challenges in both data management and data analysis require new approaches to support the big data era. These challenges span the generation of the data, its preparation for analysis, and policy-related questions surrounding its sharing and use.
To the extent that massive data can be exploited effectively, the hope is that science will extend its reach, and technology will become more adaptive, personalized, and robust. It is appealing to imagine, for example, a health-care system in which increasingly detailed data are maintained for each individual—including genomic, cellular, and environmental data—and in which such data can be combined with data from other individuals and with results from fundamental biological and medical research so that optimized treatments can be designed for each individual. One can also envision numerous business opportunities that combine knowledge of preferences and needs at the level of single individuals with fine-grained descriptions of goods, skills, and services to create new markets.
It is natural to be optimistic about the prospects. Several decades of research and development in databases and search engines have yielded a wealth of relevant experience in the design of scalable data-centric technology. In particular, these fields have fueled the advent of cloud computing and other parallel and distributed platforms that seem well suited to massive data analysis. Moreover, innovations in the fields of machine learning, data mining, statistics, and the theory of algorithms have yielded
data-analysis methods that can be applied to ever-larger data sets. However, such optimism must be tempered by an understanding of the major difficulties that arise in attempting to achieve the envisioned goals. In part, these difficulties are those familiar from implementations of large-scale databases—finding and mitigating bottlenecks, achieving simplicity and generality of the programming interface, propagating metadata, designing a system that is robust to hardware failure, and exploiting parallel and distributed hardware—all at an unprecedented scale. But the challenges for massive data go beyond the storage, indexing, and querying that have been the province of classical database systems (and classical search engines) and, instead, hinge on the ambitious goal of inference. Inference is the problem of turning data into knowledge, where knowledge is often expressed in terms of entities that are not present in the data per se but are present in models that one uses to interpret the data. Statistical rigor is necessary to justify the inferential leap from data to knowledge, and many difficulties arise in attempting to bring statistical principles to bear on massive data. Overlooking this foundation may yield results that are, at best, not useful and, at worst, harmful. In any discussion of massive data and inference, it is essential to be aware that it is quite possible to turn data into something that merely resembles knowledge, and it can be quite difficult to know when this has happened.
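To make that caution concrete, the following sketch (synthetic data and illustrative sizes only; nothing here is drawn from the report) fits an ordinary least-squares model to pure noise. Because the number of descriptors exceeds the number of observations, the fit "explains" the training data perfectly, yet it performs at chance on fresh data, and it is exactly this kind of check that separates apparent knowledge from real knowledge.

# Illustrative only: with more noise descriptors than observations, a model can
# "explain" the training data perfectly while carrying no knowledge at all.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 100                                  # fewer observations than descriptors

X_train = rng.normal(size=(n, p))               # pure noise features
y_train = rng.choice([-1.0, 1.0], size=n)       # labels unrelated to the features

# Minimum-norm least-squares fit; with p > n it can interpolate arbitrary labels.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
train_acc = np.mean(np.sign(X_train @ w) == y_train)

# Fresh data drawn from the same (informationless) source reveals the truth.
X_test = rng.normal(size=(n, p))
y_test = rng.choice([-1.0, 1.0], size=n)
test_acc = np.mean(np.sign(X_test @ w) == y_test)

print(f"training accuracy: {train_acc:.2f}")    # typically 1.00
print(f"test accuracy:     {test_acc:.2f}")     # hovers around 0.50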
Indeed, many issues impinge on the quality of inference. A major one is that of “sampling bias.” Data may have been collected according to a certain criterion (for example, in a way that favors “larger” items over “smaller” items), but the inferences and decisions made may refer to a different sampling criterion. This issue seems likely to be particularly severe in many massive data sets, which often consist of many subcollections of data, each collected according to a particular choice of sampling criterion and with little control over the overall composition. Another major issue is “provenance.” Many systems involve layers of inference, where “data” are not the original observations but are the products of an inferential procedure of some kind. This often occurs, for example, when there are missing entries in the original data. In a large system involving interconnected inferences, it can be difficult to avoid circularity, which can introduce additional biases and can amplify noise. Finally, there is the major issue of controlling error rates when many hypotheses are being considered. Indeed, massive data sets generally involve growth not merely in the number of individuals represented (the “rows” of the database) but also in the number of descriptors of those individuals (the “columns” of the database). Moreover, we are often interested in the predictive ability associated with combinations of the descriptors; this can lead to exponential growth in the number of hypotheses considered, with severe consequences for error rates. That is, a naive appeal to a “law of large numbers” for massive data is unlikely to be
justified; if anything, the perils associated with statistical fluctuations may actually increase as data sets grow in size.
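A minimal sketch of the sampling-bias issue, with purely hypothetical numbers: records enter the data set with probability proportional to their "size," so the naive average over the collected data answers a different question than the population average; if the sampling criterion is known, an inverse-probability reweighting (in the spirit of a Horvitz-Thompson correction) recovers the intended quantity.

import numpy as np

rng = np.random.default_rng(1)
population = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)   # true item sizes

# Size-biased collection: larger items are more likely to be recorded.
probs = population / population.sum()
sample = population[rng.choice(population.size, size=10_000, p=probs)]

print(f"population mean:        {population.mean():.3f}")    # about 1.65
print(f"naive sample mean:      {sample.mean():.3f}")         # biased upward, about 4.5

# Knowing that inclusion probability was proportional to size, weight each
# record by 1/size to estimate the population mean from the biased sample.
weights = 1.0 / sample
corrected = (sample * weights).sum() / weights.sum()
print(f"reweighted sample mean: {corrected:.3f}")             # close to 1.65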
While the field of statistics has developed tools that can address such issues in principle, in the context of massive data care must be taken with all such tools for two main reasons: (1) all statistical tools are based on assumptions about characteristics of the data set and the way it was sampled, and those assumptions may be violated in the process of assembling massive data sets; and (2) tools for assessing errors of procedures, and for diagnostics, are themselves computational procedures that may be computationally infeasible as data sets move into the massive scale.
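The second point can be illustrated with the bootstrap, a standard tool for assessing the error of an estimate: each bootstrap replicate touches a resample of the full size, so the diagnostic itself becomes costly as data grow. The sketch below, on synthetic data with illustrative sizes, contrasts the classical bootstrap with an m-out-of-n variant that resamples far fewer points and rescales the result; the variant is offered only as one possible way to reduce the cost, not as a recommendation of the report.

import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(scale=3.0, size=1_000_000)
n, B = data.size, 50

# Classical bootstrap: B resamples, each of the full size n (expensive at scale).
full_boot = np.array([rng.choice(data, size=n).mean() for _ in range(B)])
se_full = full_boot.std(ddof=1)

# m-out-of-n bootstrap: small resamples, rescaled by sqrt(m/n) to refer to size n.
m = 10_000
small_boot = np.array([rng.choice(data, size=m).mean() for _ in range(B)])
se_small = small_boot.std(ddof=1) * np.sqrt(m / n)

print(f"bootstrap SE, full resamples: {se_full:.5f}")
print(f"bootstrap SE, m-out-of-n:     {se_small:.5f}")
print(f"analytic SE for comparison:   {data.std(ddof=1) / np.sqrt(n):.5f}")   # about 0.003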
In spite of the cautions raised above, the Committee on the Analysis of Massive Data believes that many of the challenges involved in performing inference on massive data can be confronted usefully. These challenges must be addressed through a major, sustained research effort that is based solidly on both inferential and computational principles. This research effort must develop scalable computational infrastructures that embody inferential principles that themselves are based on considerations of scale. The research must take account of real-time decision cycles and the management of trade-offs between speed and accuracy. And new tools are needed to bring humans into the data-analysis loop at all stages, recognizing that knowledge is often subjective and context-dependent and that some aspects of human intelligence will not be replaced anytime soon by machines.
The current report is the result of a study that addressed this set of challenges. Thus, this report examines the frontiers of research that is enabling the analysis of massive data, organized around several major research areas.
CONCLUSIONS
The research and development necessary for the analysis of massive data goes well beyond the province of a single discipline, and one of the main conclusions of this report is the need for a thoroughgoing interdisciplinarity in approaching problems of massive data. Computer scientists involved in building big-data systems must develop a deeper awareness of inferential issues, while statisticians must concern themselves with scalability, algorithmic issues, and real-time decision-making. Mathematicians also have important roles to play, because areas such as applied linear algebra and optimization theory (already contributing to large-scale data analysis) are likely to continue to grow in importance. Also, as just mentioned, the role of human judgment in massive data analysis is essential, and contributions are needed from social scientists and psychologists as well as experts in visualization. Finally, domain scientists and users of technology have an essential role to play in the design of any system for data analysis, and particularly so in the realm of massive data, because of the explosion of design decisions and possible directions that analyses can follow.
The current report focuses on the technical issues—computational and inferential—that surround massive data, consciously setting aside major issues in areas such as public policy, law, and ethics that are beyond the current scope.
The committee reached the following conclusions:
One central conclusion is that computational thinking and inferential thinking must be brought jointly to bear in tackling the challenges of statistical inference, where the goal is to turn data into knowledge and to support effective decision-making. Assertions of knowledge require control over errors, and a major part of the challenge of massive data analysis is that of developing statistically well-founded procedures that provide control over errors in the setting of massive data, recognizing that these procedures are themselves computational procedures that consume resources.
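As one concrete instance of such an error-controlling procedure (shown here purely as an illustration, on synthetic p-values), the Benjamini-Hochberg step-up rule bounds the false discovery rate in a setting where a fixed per-test threshold would let false positives accumulate in large numbers.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
m_null, m_signal, n_obs = 9_500, 500, 50

# Two-sided z-test p-values: most hypotheses are null, a few carry a real shift.
z_null = rng.normal(0.0, 1.0, size=m_null)
z_sig = rng.normal(0.5 * np.sqrt(n_obs), 1.0, size=m_signal)
pvals = 2 * stats.norm.sf(np.abs(np.concatenate([z_null, z_sig])))
is_null = np.arange(pvals.size) < m_null

naive = pvals < 0.05
print(f"fixed 0.05 threshold: {naive.sum()} discoveries, "
      f"{(naive & is_null).sum()} of them false")

# Benjamini-Hochberg at level q: reject the k smallest p-values, where k is the
# largest rank with p_(k) <= k * q / m.
q = 0.05
order = np.argsort(pvals)
ranked = pvals[order]
passed = ranked <= q * np.arange(1, ranked.size + 1) / ranked.size
k = (np.max(np.nonzero(passed)[0]) + 1) if passed.any() else 0
bh = np.zeros(pvals.size, dtype=bool)
bh[order[:k]] = True
print(f"Benjamini-Hochberg:   {bh.sum()} discoveries, "
      f"{(bh & is_null).sum()} of them false")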
Another conclusion is that viewing only one facet of a massive data problem in isolation can lead researchers to pose a far broader problem than is actually required, and there may be no feasible solution to that broader problem; a suitable cross-disciplinary outlook can point researchers toward an essential refocusing. For example, absent appropriate insight, one might be led to analyzing worst-case algorithmic behavior, which can be very difficult or misleading, whereas a look at the totality of a problem could reveal that average-case algorithmic behavior is quite appropriate from a statistical perspective. Similarly, knowledge of typical query generation might allow one to confine an analysis to a relatively simple subset of all possible queries that would have to be considered in a more general case. And the difficulty of parallel programming in the most general settings may be sidestepped by focusing on useful classes of statistical algorithms that can be implemented with a simplified set of parallel programming motifs; moreover, these motifs may suggest natural patterns of storage and access of data on distributed hardware platforms.
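The following sketch illustrates one such simplified motif under minimal assumptions: the eight synthetic blocks merely stand in for data partitions held on separate machines, and no particular framework is assumed. Any estimator whose sufficient statistics combine additively, here the count, sum, and sum of squares behind a mean and variance, can be computed by mapping a constant-size summary over the blocks and reducing by addition, which is exactly the pattern that map-reduce-style platforms execute well.

from functools import reduce
import numpy as np

def summarize(block):
    """Map step: a constant-size summary of one block of data."""
    return block.size, float(block.sum()), float(np.square(block).sum())

def combine(a, b):
    """Reduce step: summaries merge by elementwise addition."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

rng = np.random.default_rng(4)
blocks = [rng.normal(loc=2.0, scale=1.5, size=100_000) for _ in range(8)]

count, total, total_sq = reduce(combine, map(summarize, blocks))
mean = total / count
variance = total_sq / count - mean ** 2        # population variance from the merged summary

print(f"mean ~ {mean:.3f}, variance ~ {variance:.3f}")   # close to 2.0 and 2.25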
A further conclusion concerns human judgment, which enters massive data analysis at many stages, from the initial formulation of a problem and the exploration of the data (often in real time) to the selection of questions to pursue further. Human input may also be obtained from crowdsourcing, a potentially powerful source of inputs that must be used with care, given the many kinds of errors and biases that can arise. In either case, there are many challenges that need to be faced in the design of effective visualizations and interfaces and, more generally, in linking human judgment with data analysis algorithms.
As part of the study that led to this report, the Committee on the Analysis of Massive Data developed a taxonomy of some of the major algorithmic problems arising in massive data analysis. It is hoped that this proposed taxonomy might help organize the research landscape and also provide a point of departure for the design of the middleware called for above. This taxonomy identifies major tasks that have proved useful in data analysis, grouping them roughly according to mathematical structure and computational strategy. Given the vast scope of the problem of data
analysis and the lack of existing general-purpose computational systems for massive data analysis from which to generalize, there may well be other ways to cluster these computational tasks, and the committee intends this list only to initiate a discussion. The committee identified seven major classes of tasks: basic statistics, generalized N-body problems, graph-theoretic computations, linear algebraic computations, optimization, integration, and alignment problems.
For each of these computational classes, there are computational constraints that arise within any particular problem domain and help to determine the specialized algorithmic strategy to be employed. Most work in the past has focused on a setting that involves a single processor with the entire data set fitting in random access memory (RAM). Additional important settings for which algorithms are needed include the streaming setting, in which data arrive in a single pass and only a limited summary can be retained; the disk-based setting, in which the data do not fit in RAM and must reside largely on disk; the distributed setting, in which the data are spread across multiple machines; and the multi-threaded setting, in which multiple processors share a common memory.
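As a small illustration of the streaming setting (with a synthetic generator standing in for a stream too large to store), Welford's one-pass update maintains a running mean and variance in constant memory, so no record ever needs to be kept or revisited.

import numpy as np

def stream(rng, total, chunk=100_000):
    """Yield values one at a time from a source never held in memory at once."""
    remaining = total
    while remaining > 0:
        block = rng.normal(loc=10.0, scale=2.0, size=min(chunk, remaining))
        remaining -= block.size
        yield from block

count, mean, m2 = 0, 0.0, 0.0                  # m2 accumulates squared deviations
for x in stream(np.random.default_rng(5), total=1_000_000):
    count += 1
    delta = x - mean
    mean += delta / count                      # running mean
    m2 += delta * (x - mean)                   # uses the updated mean (Welford's update)

variance = m2 / (count - 1)
print(f"streaming mean ~ {mean:.3f}, variance ~ {variance:.3f}")   # about 10.0 and 4.0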
Training students to work in massive data analysis will require experience with massive data and with computational infrastructure that permits the real problems associated with massive data to be revealed. The availability of benchmarks, repositories (of data and software), and computational infrastructure will be a necessity in training the next generation of “data scientists.” The same point, of course, can be made for academic research: significant new ideas will only emerge if academics are exposed to real-world massive data problems.
Finally, the committee emphasizes that massive data analysis is not one problem or one methodology. Data are often heterogeneous, and the best attack on a problem may involve finding sub-problems, where the best solution may be chosen for computational, inferential, or interpretational reasons. The discovery of such sub-problems might itself be an inferential problem. On the other hand, data often provide partial views onto a problem, and the solution may involve fusing multiple data sources. These perspectives of segmentation versus fusion will not often be in conflict, but substantial thought and domain knowledge may be required to reveal the appropriate combination.
One might hope that general, standardized procedures will emerge that can be used as a default for any massive data set, in much the way that the Fast Fourier Transform is a default procedure in classical signal processing. However, the committee is pessimistic that such procedures exist in general. That is not to say that useful general procedures and pipelines will not emerge; indeed, one of the goals of this report has been to suggest approaches for designing such procedures. But it is important to emphasize the need for flexibility and for tools that are sensitive to the overall goals of an analysis; massive data analysis cannot, in general, be reduced to turnkey procedures that consumers can use without thought. Rather, the design of a system for massive data analysis will require engineering skill and judgment, and the deployment of such a system will require modeling decisions, skill with approximations, attention to diagnostics, and robustness. As much as the committee expects to see the emergence of new software and hardware platforms geared to massive data analysis, it also expects to see the emergence of a new class of engineers whose skill is the management of such platforms in the service of solving real-world problems.