The workshop wrapped up with an open discussion of issues raised during the presentations and ideas for moving forward.
Alexander Szalay (Johns Hopkins University) noted that one of the biggest differences between scientific computing and private sector systems, such as Google’s, is the ease with which data flows within the system. Elaborating on this point, David Konerding (Google, Inc.) gave the example of Dremel, Google’s internal SQL-like query system. Because the company standardizes on a single serialization format called protocol buffers, and because Dremel converts SQL statements into query plans over data stored in that format, Dremel is able to join large pools of unrelated data from different teams. Furthermore, because every Google system feeds its data into Google’s current file system, Colossus, it is possible for Google’s 20,000 computer scientists to inspect their own monitoring data and that of others to identify problems. In these ways, Google’s internal system is an example of a working data commons at a global scale, Konerding said, adding that he would like to find a way to apply the same model in Google’s cloud environment so that data scientists and domain researchers could benefit from it as well.
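The principle Konerding described can be sketched loosely in code. The example below is not Google’s implementation: it uses JSON rather than protocol buffers, and the dataset names, field names, and values are hypothetical. It shows only the underlying idea, that when independently produced datasets share one serialization format and a common key field, a generic engine can join them without any coordination between the teams that produced them.

```python
# Illustrative sketch only: a hash join over two independently produced
# datasets that happen to share a serialization format and a key field.
# JSON stands in for protocol buffers; all names and values are made up.
import json

# Two teams' data, serialized independently in the same record format.
search_logs = [json.dumps({"user_id": 1, "query": "protein folding"}),
               json.dumps({"user_id": 2, "query": "weather model"})]
ads_logs = [json.dumps({"user_id": 1, "ad_clicks": 3}),
            json.dumps({"user_id": 2, "ad_clicks": 0})]

def join_on(key, left_rows, right_rows):
    """Join two serialized datasets on a shared key field."""
    # Index the right-hand dataset by the key.
    right_index = {}
    for row in right_rows:
        rec = json.loads(row)
        right_index[rec[key]] = rec
    # Probe the index with each left-hand record and merge matches.
    joined = []
    for row in left_rows:
        rec = json.loads(row)
        match = right_index.get(rec[key])
        if match is not None:
            joined.append({**rec, **match})
    return joined

result = join_on("user_id", search_logs, ads_logs)
```

In a real system the shared format would be binary and schema-driven, but the payoff is the same: the join logic knows nothing about either dataset beyond the common format and key.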
Robert Grossman (University of Chicago) offered an example in the academic research space. Using the data commons his team built for the National Cancer Institute, researchers can apply BigQuery-like queries and analogous techniques across various clouds to perform gene-by-environment computations. This capability was not difficult to build into the system, he said, and multiple research teams have found it useful.
Michela Taufer (University of Tennessee, Knoxville) raised the related issue of data movement. The need for data movement comes up in many different situations, for example in the context of bringing sensor data into simulations. Taufer asserted that the community has not sufficiently defined how data movement should be handled from the cyberinfrastructure point of view. She noted that if data movement is not supported, there is a greater need to invest in tools and software, which must ultimately be connected with the data.
Szalay suggested that one answer could be to create larger data transfer nodes (DTNs) that are capable of supporting data and data movements of 100 gigabytes, yet inexpensive enough for NSF to provide to a wide array of institutions. He noted that he was working on a prototype for this sort of DTN with funding from the Schmidt Family Foundation.
Pete Beckman (Argonne National Laboratory) pointed out the importance of defining goals for improvement and articulating a metric that can be used to assess success. For example, if the goal is on-demand deployment of containers, there needs to be a way to measure that. William Gropp (University of Illinois at Urbana-Champaign) observed that if we want to reduce friction experienced by scientists using computing facilities, we must be able to quantify friction.
Grossman asserted that, while there is a need for convergence, it is also important for NSF to get experience building specialized, midscale data systems. Over-relying on companies like Google and Amazon to fill this space risks leaving the academic research community with a dearth of data science expertise relevant to developing and using midscale systems, he suggested.
Building on this point, Dan Reed (University of Utah) posited a few scenarios under which NSF might or might not invest in midscale infrastructure. One option, he said, would be for NSF to use its limited resources to focus on big data and big computation and encourage academic institutions to support midrange systems, much as they do now in provisioning campus networks that connect to national network backbones.
One implication of this, noted Robert Harrison (Stony Brook University), is that it would sacrifice the cost efficiencies that are currently created when NSF and the universities harmonize their investments such that each gets a better return. In the case of systems that are housed at universities, Thomas Furlani (University at Buffalo) argued that NSF gets a particularly high return on investment because, while NSF pays for the hardware, it does not, generally speaking, have to pay the ongoing costs for support personnel, education, outreach, and training needed to run and utilize the system. By contrast, Gropp observed, NSF covers support personnel as part of its funding for Tier 2 systems, thereby providing support for science users across the country. It may become increasingly difficult to depend on individual institutions for training and support at a time when many universities face their own resource constraints.
A second option Reed offered would be for NSF to seek substantial funding from Congress to support a 10-year effort to integrate midscale systems into a national infrastructure. This would require a Major Research Equipment and Facilities project supported by at least $100 million per year for a decade, he suggested.
A third option is to establish a new model for partnering with the private sector. Rather than a procurement model, where NSF and its grantees simply purchase products and services from companies, this new model would create partnerships in which the government and private sector work together to solve problems. One possibility for incentivizing companies to participate in such a partnership, Reed suggested, may be to offer companies the chance to monetize aspects of government data.
As an example of monetization of government data, Grossman pointed to the National Oceanic and Atmospheric Administration (NOAA), which has taken a radically different approach to data sharing from NSF’s. Unlike NSF, NOAA gives large cloud providers the rights to use NOAA data with no restrictions. The cloud providers benefit when they find ways to monetize that data, while NOAA benefits because the arrangement vastly reduces the cost of storing and handling its data.
Based on his experience in the nonprofit arena, Grossman said one incentive sometimes used to attract private sector partners is to offer companies early access to data that will eventually be made public. This approach works well for certain kinds of data where there is value in having early access, though it does not make sense in the case of simulations, an arena in which public-private partnership has been rarer.
Participants discussed changes that could be made to the way NSF grants are allocated and tracked, which could potentially improve return on investment. For example, Beckman suggested that NSF grants could be used to procure services instead of hardware, which can sometimes solve the same problem at a lower cost. Gropp noted that that approach had been implemented at some institutions, though not uniformly nationwide. Expanding on this, Reed said a current disincentive for institutions to take that approach is that they feel financially constrained and are reluctant to waive overhead charges for public cloud services. On the other hand, Furlani suggested that some might justify charging overhead because, even if a service is “free” to an institution’s researchers, it still requires support personnel to help them take advantage of the service. Gropp noted that the discussion points to the broader challenges involved in determining which costs should be direct versus overhead.
During the presentations and subsequent discussions, planning committee members who participated in the workshop noted a number of recurring themes.
* * *
Members of the planning committee and workshop participants alike were struck both by how rapidly the scientific computing landscape is changing and by how strong the shared sense of mission around convergence is. The changing landscape is marked by growing demand for both simulation and data-centric computing, the emergence of new ways of delivering computing and data services, new sources of data, new opportunities to partner with the private sector, and new strategic investments by a number of countries outside the United States. At the same time, there is a shared understanding of the challenges and opportunities in finding a way to support convergence in the broad sense of bringing data and computing together. The planning committee is pleased to see the many advances made since its 2016 report and believes that further discussions around convergence would foster additional progress.