This report aims to increase the level of awareness of the intellectual and technical issues surrounding the analysis of massive data. This is not the first report written on massive data, and it will not be the last, but given the major attention currently being paid to massive data in science, technology, and government, the committee believes that it is a particularly appropriate time to be considering these issues.
This final chapter begins by summarizing some of the key conclusions from the report. It then provides a few additional concluding remarks. The study that led to this report reached the following conclusions:
One central challenge is that of developing statistically well-founded procedures that provide control over errors in the setting of massive data, recognizing that these procedures are themselves computational procedures that consume resources.
Considering a problem in its totality can simplify the analysis. A worst-case analysis of an algorithm viewed in isolation can be very difficult or misleading, whereas a look at the totality of a problem could reveal that average-case algorithmic behavior is quite appropriate from a statistical perspective. Similarly, knowledge of typical query generation might allow one to confine an analysis to a relatively simple subset of all possible queries that would have to be considered in a more general case. And the difficulty of parallel programming in the most general settings may be sidestepped by focusing on useful classes of statistical algorithms that can be implemented with a simplified set of parallel programming motifs; moreover, these motifs may suggest natural patterns of storage and access of data on distributed hardware platforms. (A small illustrative sketch of one such motif appears after these conclusions.)
There are significant challenges that need to be faced in the design of effective visualizations and interfaces and, more generally, in linking human judgment with data analysis algorithms.
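To make the notion of a simplified parallel programming motif concrete, the following sketch is offered; the report itself contains no code, and the data, partitioning, and function names here are illustrative assumptions. It computes a global mean and variance by summarizing each data partition independently and then merging the per-partition sufficient statistics, the map-and-combine pattern that such motifs typically take.

```python
# A minimal sketch (not from the report) of a map-and-combine motif for a
# statistical computation: each data partition is summarized independently
# ("map"), and the partial summaries are merged ("combine") to recover the
# global mean and variance exactly.  The partition summaries are sufficient
# statistics (count, sum, sum of squares), so the merge step is just addition.

from functools import reduce
import random

def summarize(partition):
    """Map step: per-partition sufficient statistics."""
    n = len(partition)
    s = sum(partition)
    ss = sum(x * x for x in partition)
    return (n, s, ss)

def combine(a, b):
    """Combine step: merge two partial summaries."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    n, s, ss = reduce(combine, (summarize(p) for p in partitions))
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance

if __name__ == "__main__":
    random.seed(0)
    data = [random.gauss(0.0, 1.0) for _ in range(10_000)]
    # In a distributed setting each partition would live on a different node;
    # here the partitions are simply interleaved slices of one list.
    partitions = [data[i::4] for i in range(4)]
    print(mean_and_variance(partitions))
```

Because the combine step is associative, the same summaries can be merged in any order, which is what makes the motif natural for distributed storage and access.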
The remainder of this chapter provides a few closing remarks on massive data analysis, focusing on issues that have not been highlighted earlier in the report.
The committee is agnostic as to whether a new name, such as “data science,” needs to be invoked in discussing research and development in massive data analysis. To the extent that such names invoke an interdisciplinary perspective, the committee feels that they are useful.
In particular, the committee recognizes that industry currently has major needs in the hiring of computer scientists with an appreciation of statistical ideas and statisticians with an appreciation of computational ideas. The use of terms such as “data science” reflects this interdisciplinary hiring profile. Moreover, the existing needs of industry suggest that academia should begin to develop programs that train bachelor's- and master's-level students in massive data analysis (in addition to programs at the Ph.D. level). Several such efforts are already under way, and many more are likely to emerge in the next few years. It is perhaps premature to suggest curricula for such programs, particularly given that much of the foundational research in massive data analysis remains to be done. Even if such programs do no more than solve the difficult problem of finding room in already-full computer science and statistics curricula, so that complementary ideas from the other field are taught, they will have made very significant progress.
A broader problem is that training in massive data analysis will require experience with massive data and with computational infrastructure that permits the real problems associated with massive data to be revealed. The availability of benchmarks, repositories (of data and software), and computational infrastructure will be a necessity in training the next generation of “data scientists.” The same point, of course, can be made for academic research: significant new ideas will only emerge if academics are exposed to real-world massive data problems.
The committee emphasizes that massive data analysis is not one problem or one methodology. Data are often heterogeneous, and the best attack on a problem may involve finding sub-problems, where “best” may be motivated by computational, inferential, or interpretational reasons. The discovery of such sub-problems might itself be an inferential problem. On the other hand, data often provide partial views of a problem, and the solution may involve fusing multiple data sources. These perspectives of segmentation versus fusion will often not be in conflict, but substantial thought and domain knowledge may be required to reveal the appropriate combination.
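As a toy illustration of the two perspectives (the data, group labels, and keys below are invented, not drawn from the report), the first half of the following sketch segments heterogeneous records into sub-problems that are summarized separately, while the second half fuses two partial views of the same units by joining them on a shared key.

```python
# A toy sketch contrasting the two perspectives: "segmentation" analyzes each
# discovered sub-problem separately, while "fusion" joins two partial views of
# the same units before analysis.  All names and values are hypothetical.

from statistics import mean

# Heterogeneous data: measurements tagged with a group that defines sub-problems.
records = [("rural", 2.0), ("rural", 2.4), ("urban", 5.1), ("urban", 4.7)]

# Segmentation: summarize each sub-problem on its own.
groups = {}
for group, value in records:
    groups.setdefault(group, []).append(value)
per_group_means = {g: mean(vals) for g, vals in groups.items()}

# Fusion: each source offers only a partial view; join them on a shared key.
source_a = {"unit1": {"age": 34}, "unit2": {"age": 51}}
source_b = {"unit1": {"income": 48_000}, "unit2": {"income": 72_000}}
fused = {k: {**source_a[k], **source_b[k]} for k in source_a.keys() & source_b.keys()}

print(per_group_means)
print(fused)
```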
One might hope that general, standardized procedures might emerge that can be used as a default for any massive data set, in much the way that the Fast Fourier Transform is a default procedure in classical signal processing. The committee is pessimistic that such procedures exist in general. To take a somewhat fanciful example that makes the point, consider a proposal that all textual data sets should be subject to spelling correction as a preprocessing step. Now suppose that an educational researcher wishes to investigate whether certain changes in the curricula in elementary schools in some state lead to improvements in spelling. Short of designing a standardized test, which may be difficult and costly to implement, the researcher might be able to use a data set such as the ensemble of queries to a search engine before and after the curriculum change was implemented. For such a researcher, it is exactly the pattern of misspellings that is the focus of inference, and a preprocessor that corrects spelling mistakes is an undesirable step that selectively removes the data of interest.
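The point can be made concrete with a toy sketch; the query data and correction dictionary below are invented for illustration. The quantity the researcher cares about is the misspelling rate itself, and a default spell-correcting preprocessor drives that rate to zero in both periods, erasing the comparison of interest.

```python
# A toy illustration (hypothetical data, not from the report) of why a
# "default" spell-correction preprocessor would be harmful here: the signal
# of interest is the misspelling rate, and correcting spellings before
# analysis erases exactly that signal.

CORRECTIONS = {"recieve": "receive", "definately": "definitely"}  # toy dictionary

def misspelling_rate(queries):
    words = [w for q in queries for w in q.split()]
    return sum(w in CORRECTIONS for w in words) / len(words)

def spell_correct(queries):
    return [" ".join(CORRECTIONS.get(w, w) for w in q.split()) for q in queries]

before = ["how to recieve mail", "definately free games", "weather today"]
after = ["how to receive mail", "definately free games", "weather today"]

# Analysis on the raw queries preserves the before/after difference ...
print(misspelling_rate(before), misspelling_rate(after))
# ... whereas running the standard preprocessor first removes it entirely.
print(misspelling_rate(spell_correct(before)), misspelling_rate(spell_correct(after)))
```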
Nevertheless, some useful general procedures and pipelines will surely emerge; indeed, one of the goals of this report is to suggest approaches for designing such procedures. But the committee emphasizes the need for flexibility and for tools that are sensitive to the overall goals of an analysis. Massive data analysis cannot, in general, be reduced to turnkey procedures that consumers can use without thought. Rather, as with any engineering discipline, the design of a system for massive data analysis will require engineering skill and judgment. Moreover, deployment of such a system will require modeling decisions, skill with approximations, attention to diagnostics, and robustness. As much as the committee expects to see the emergence of new software and hardware platforms geared to massive data analysis, it also expects to see the emergence of a new class of engineers whose skill is the management of such platforms in the context of the solution of real-world problems.
Finally, the committee notes that this report does not attempt to define “massive data.” This is, in part, because any definition is likely to be so context-dependent as to be of little general value. But the major reason for sidestepping a definition is that the committee views the underlying intellectual issue as that of finding general laws that apply at a variety of scales or, ideally, are scale-free. Data sets will continue to grow in size over the coming decades, and computers will grow more powerful, but there should exist underlying principles that link measures of inferential accuracy with intrinsic characteristics of the data-generating process and with computational resources such as time, space, and energy. Perhaps these principles can be uncovered once and for all, such that each successive generation of researchers does not need to reconsider the massive data problem afresh.
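One simple instance of such a principle, sketched below with an invented Gaussian data source rather than anything from the report, is the familiar link between sample size, accuracy, and work for a sample mean: the root-mean-square error falls roughly as 1/sqrt(n) while the computation grows roughly linearly in n, so each halving of the error costs about four times the computing.

```python
# A minimal sketch, assuming a toy Gaussian data source, of a principle linking
# inferential accuracy to computational resources: for the sample mean,
# error ~ 1/sqrt(n) while work ~ n, so halving the error costs ~4x the work.

import random
from statistics import mean

random.seed(1)
TRUE_MEAN = 0.0

def rmse_of_sample_mean(n, trials=200):
    """Root-mean-square error of a size-n sample mean over repeated trials."""
    errors = [mean(random.gauss(TRUE_MEAN, 1.0) for _ in range(n)) - TRUE_MEAN
              for _ in range(trials)]
    return mean(e * e for e in errors) ** 0.5

for n in (100, 400, 1600):
    # Each row does ~4x the work of the previous one and roughly halves the error.
    print(f"n = {n:5d}   rmse ~ {rmse_of_sample_mean(n):.4f}")
```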