Margaret Levenstein, University of Michigan, Moderator
Nuno Bandeira, University of California, San Diego
Jessie Tenenbaum, Duke University and the North Carolina Department of Health and Human Services
Georgia (Gina) Tourassi, Oak Ridge National Laboratory
Robert Williams, University of Tennessee Health Science Center
Margaret Levenstein, University of Michigan, invited the research community representatives who shared their perspectives on the first day of the workshop (see Chapter 2) to participate in the workshop’s final panel discussion. She asked the researchers to reflect, in light of the information shared over the course of the workshop, on how to incorporate cost practices and considerations into data preservation, archiving, and accessing decisions.
Robert Williams, University of Tennessee Health Science Center, observed that the first day of the workshop focused primarily on the preservation and curation of human data. He reiterated that there is important work in long-tail animal modeling and noted that the National Institute on Drug Abuse provides $250 million each year for rat research. However, almost none of those data are integrated into any kind of uniform database, and thus they are not linkable. He said that resources need to be built to allow investigators to link their data effectively, and he suggested educating investigators early and giving them tools that will connect data automatically. During the past 20 years, Williams has been building families of genetically diverse animals that can be used to compute correlation coefficients. Such work relies on multiplicity: some data should be available forever, and thus “life cycle” is the wrong phrase to describe data. He reiterated a concern that surfaced multiple times throughout the workshop about how to determine which data are valuable. He suggested that data are valuable (and should be kept) if they are linkable, usable, and able to “breathe and breed.”
Georgia (Gina) Tourassi, Oak Ridge National Laboratory, emphasized that data, algorithms, and code will continue to be produced faster than policy and regulation can adapt. She said that it is difficult to forecast lifetime costs and risks because the definition of “valuable data sets” will change over time. Considering the differences across application domains, she asserted, it is clear that a one-size-fits-all approach does not work. Costs and risks will depend on storage, computation, and the number of users accessing the resources. Moving forward, she suggested a two-pronged approach. First, because academic researchers will always be limited by the lifetime of their grants and their funding, it is unfair to ask them both to make scientific advances and to deploy data sets, algorithms, and software in formats that are of operational value; instead, the scientific community should develop policies for best practices. Second, at the end of the funding cycle, when data have become a federal asset, the data could move to an entity (e.g., a coordinated federal infrastructure) that would be responsible for their lifetime management. She noted that funding and well-defined metrics are needed to establish the value of different data sets, benchmark algorithms, and maintain transparency and reproducibility. She suggested increased funding for algorithms as well as for techniques for data privacy and data curation, which could help change the culture of the scientific community. Statistical methods are also needed to determine whether a synthetic data set is reliable. Lastly, because data science is infused across all disciplines, she noted a need for more undergraduate and graduate training programs on best practices.
Jessie Tenenbaum, Duke University and the North Carolina Department of Health and Human Services, echoed Butte’s and Tourassi’s assertions that requests for applications for data reuse and for curation tools and approaches would be very helpful. Because there are so many ways to integrate data, she noted that a review paper surveying the different approaches that people use could lead to a better understanding of the technical requirements for how data are shared. She championed improving education and changing the culture rather than coercing researchers with “carrots and sticks,” as well as involving all stakeholders from the start of the research process. She concluded by suggesting that researchers aim to conduct translucent rather than transparent research, especially when working with clinical data.
Nuno Bandeira, University of California, San Diego, said that a discussion about data preservation should include the costs of data reutilization: if data are not going to be reused, why pay to store them? He added that data need to be interoperable, integrated with tools, workflows, compute resources, and community-scale capabilities for meta-analysis. He suggested evaluating the “data community cost” instead of the “data storage cost.” Although he applauded the postsecondary institutions that recognize the value of data and have allocated resources toward preservation accordingly, he worried that it will be difficult to create a community around data if standards for data preservation are not uniform across institutions and data types. He offered a cautionary tale about the first proteomics mass spectrometry repository effort, which failed because it was a federated system (i.e., the responsibility for storing data was distributed across various institutions). He emphasized the need for stewards in the data community (i.e., people who are responsible for determining community needs; building standards; communicating; and promoting data persistence, interoperability, and reusability). Such entities are currently called repositories, but Bandeira and Clifford Lynch, Coalition for Networked Information, proposed using the term “platforms” instead. Bandeira noted that the additional cost of such an entity needs to be considered in conversations about data preservation. He closed by emphasizing that even though it is important to organize data communities, their members should not have to provide their own compute and storage capabilities.
Levenstein highlighted the panelists’ focus on “community” and the cost to create and maintain such a community around data, which is different from the cost to preserve data. She noted the panelists’ interest in creating a repository community in particular. Repositories, like researchers, need to be trained to prepare and preserve data as well as to understand what standards exist across other repositories, she continued; such actions create “stewardship.” Although these changes may not reduce costs, she emphasized that they will increase the value of what is preserved.
Williams suggested developing a funding mechanism that would enable the interoperability of research efforts, and Levenstein mentioned an organization of repositories in the social sciences and statistical communities called Data-PASS.1 She added that the Research Data Alliance has also tried to create a community. Patricia Flatley Brennan, NLM, explained that NLM would like to increase the efficiency of spending and decrease waste rather than simply cut costs. She appreciated Tourassi’s statement that NLM holds a federal asset, which society deserves to have fully utilized. Brennan said that NIH recognizes the need for enterprise-level solutions as well as institute-specific solutions, which complicates the “community approach”: many communities do not align directly with any single NIH institute or center. She reiterated her request that the National Academies’ study committee help NLM think about the preservation of existing data as well as preparation for the preservation of future data. She appreciated the participants’ comments about the importance of helping new investigators understand, at the start of their training, what it means to create a data strategy that focuses on future interoperability. She hopes that the committee’s work might inspire the scientific communities to take on the difficult task of providing metrics for data value. Levenstein reiterated the suggestion that NIH develop funding mechanisms for data preservation, data curation, and secondary use of data. She also reiterated the suggestion to require a section in proposals describing prior data collection. Brennan mentioned an NLM initiative to fund computational approaches to curation; NIH plans to release a separate research-resource funding mechanism soon. Philip Bourne, University of Virginia, expressed his support for such a mechanism and noted that certain constraints related to data governance should appear in the requests for applications, which would allow greater integration across different resources as they evolve.
___________________
1 For more information about Data-PASS, see http://data-pass.org, accessed September 25, 2019.

Lars Vilhuber, Cornell University, said that early career training for researchers (e.g., tools to think about data, methods to self-curate data, strategies to integrate platforms) is critical. The goal is not to transform researchers into data curators or programmers but rather to raise their awareness of possible solutions to problems. He mentioned the Registry of Research Data Repositories,2 which is a database of repositories, not a community of repositories. Although it has not been actively maintained, it has elements that could be leveraged to serve and build communities. Monica McCormick, University of Delaware Library, suggested that librarians and other partners in the research process should also be eligible for funded training. Warren Kibbe, Duke University, expressed his support for a separate research-resource funding mechanism but requested that it include awards for 7 years instead of 5 years. Bandeira pointed out that some journals require a 10-year period of data persistence, which extends beyond any current funding mechanism. Kibbe suggested that the process for building a community and engaging that community in the operation of a resource needs to be codified, which relates to the governance of each resource. He referenced a recent proposal to the National Cancer Institute to ensure that data management plans and data sharing plans are included in every submission, which would allow researchers to prepare to disseminate information, preserve data, and make data available for future reuse.
___________________
2 For more information about the Registry of Research Data Repositories, see http://re3data.org, accessed September 25, 2019.

Several important themes and opportunities were raised during the workshop presentations and discussions, including the following:

• The need for stakeholders to be able to estimate long-term data costs so they can plan accordingly (Brennan).
• transition (subgroup led by Clifford Lynch, Coalition for Networked Information).
Bourne mentioned an issue that had not been discussed during the workshop: the value of data coordination centers and the role that they play in preservation. Maryann Martone, University of California, San Diego, agreed and noted that the data ecosystem (i.e., where data are, who is responsible for them, who has access to them) remains broad and includes many ongoing efforts. She championed the value of creating a PubMed-like infrastructure for data. She added that more data are needed to understand how many institutional repositories already exist. The breadth and complexity of this problem mirror those of the data problem itself, she continued: the notion of a one-size-fits-all solution is intractable because data are generated in so many places and for so many different uses. She added that despite numerous efforts to establish catalogues over the past 10 years, many people remain unaware of their existence. Many members of the research community spend their time in the laboratory or the field and might not be aware of the resources available to them online. She also described the diverse skill sets in the research community that should be appreciated and utilized. She explained that the system needs to be managed in such a way that every researcher can contribute his or her maximum value and then hand off to the person with the right expertise for the next step in the process.
Martone commented that effective data management in the laboratory is essential for data sharing. The use of standards in the laboratory could facilitate data sharing and curation; conversely, data sharing could also drive the development of standards. She explained that barriers to entry will always exist; however, more needs to be understood about how standards and tools could lower costs and other barriers. She said that working with data is rarely simple or inexpensive, and, at the moment, many researchers do not value long-term preservation of data beyond the research life cycle. She pointed to Williams’s comments about animal research as highlighting how different the data problems are in each domain. Large, rich, public data sets that enable discovery are important, and new methods can allow access to old data; however, long-term costs are unknown, she continued.
Martone said that incentives are not homogeneous. “Carrots and sticks” often work in tandem, and a mandate could be useful to initiate data sharing. However, to maintain data sharing, there needs to be value for the researcher beyond the mandate. She emphasized that early training is essential for researchers, as is institutional funding for repositories. Partnerships with libraries have been especially fruitful—guiding researchers to resources and providing expertise about data management and preservation.
Martone emphasized that efforts in data preservation and scientific discovery have to be synchronized. The workshop reiterated that this process is expensive and difficult, but it also highlighted the larger issue: inefficiency exists throughout the system. Greater understanding is needed of how individuals’ practices are shaped by infrastructure, she continued; for example, some researchers store copies of their data locally in addition to depositing the data in a repository. Martone highlighted a previous point made by Cragin that although large grants are given for instruments, the data infrastructure required to handle the data that emerge from those instruments is drastically underestimated. Martone also noted the absence of a good understanding of how much money from each grant is allocated for data preparation and curation; the costs are likely higher than realized. Liability costs are also critically important to consider, given the need to avoid lawsuits.
In closing the workshop, Martone emphasized that communities are ready to use the wealth of existing tools and expertise available to think seriously about data management. However, funding mechanisms to create platforms to connect expertise and allow people to share experiences are still needed. McCray thanked participants for increasing the value of the workshop for the committee’s study and for the broader community.