During the afternoon of the summit’s first day, the participants separated into three breakout sessions to identify potential collaborations that could advance various aspects of research data use. Each breakout group focused on one of the following areas: artificial intelligence policies; JEDI (justice, equity, diversity, and inclusion) integration and Indigenous data sovereignty; and data-driven decarbonization policies. The breakout groups were asked to identify potential collaborations that would require working across sectors and that could produce results within 3 years, and then to prioritize those collaborations based on their value and their achievability. Each group then chose one to three collaborations to develop further and to describe to the entire summit after the breakout groups had reassembled. This exercise resulted in a total of seven suggested collaborations.
This chapter describes the potential collaborations identified in each of the three breakout sessions. These collaborations were then used in the summit’s second day, when the participants chose several of them that seemed most promising to serve as the basis for projects that they would commit themselves to carrying out over the next 3 years.
The breakout group on artificial intelligence (AI) began by discussing how to identify potential collaborations related to AI but quickly realized that the most pressing issue was not forming collaborations but
rather sharing information among different groups. The AI field is exploding, with researchers pursuing many different aspects of AI use, making it vital to find ways to communicate across those areas. In other words, a key challenge in the AI area is how to keep pace with a rapidly evolving technology.
Having identified communication as the major challenge to address in the AI field, the group next considered which sorts of information would be most useful to share. To answer this question, the group broke into five subgroups that identified various topics where information sharing would be important, and then the entire group voted on the most important of these. That led to the identification of three topic areas related to AI where sharing information would be most important: evaluations, incentives, and academic research and integrity. Specifically, the group settled on these three collaborations as having the greatest payoff:
Together, these three collaborations could go a long way toward smoothing the path for future collaborations among research groups interested in sharing datasets and models for use in artificial intelligence applications. The remainder of this section examines each of these suggested collaborations in greater detail.
AI requires vast amounts of data, and its users generally rely on large datasets that were not created explicitly for their purposes—that is, they are sharing and reusing data that were collected by someone else with different motives and requirements. New users need some way to evaluate the dataset to make sure that its data can be successfully reused. Currently, various metrics can be applied to assess the quality of data and to determine whether a dataset will be useful for a particular purpose; general agreement on which of these metrics are most appropriate for which types of datasets and purposes would be useful.
A retrospective approach is appropriate for already existing datasets, but given the growing recognition of the value of large datasets for AI purposes, the developers of such datasets will wish to plan proactively for their reuse. There is a need to develop widely accepted sets of quality standards for large databases of various types to ensure that the data in them are AI-ready and suitable for particular purposes. Many existing data standards were created before AI became such an important user of large datasets, so there is an opportunity to update those technical standards. Similarly, people in the field will need to agree on standards for attribution and assigning credit as well as on such matters as approaches to licensing, consent, and determining the level of access to a database.
Concerning exactly what information needs to be shared across collaborations, the breakout group determined that evaluations of both data and models need to be shared, along with guidance on the datasets that would be most useful for particular models. The standards on which the evaluations are based need to be updated to be more useful in determining whether datasets are AI-ready, and metrics need to be developed so that researchers can understand the quality of the data they are considering. Because some datasets are regularly updated, version control will be important as well, and there will need to be information about how a dataset has been changed and why (was it, for instance, to redress a bias or performance issue?). Details about a dataset also need to include standardized information on such things as runtime environment, operating system, and software version.
In addition to these technical specifications, the evaluation information for datasets needs to include culturally relevant information as well. From what populations were the data collected? Are the data representative across various groups—ages, sex, ethnic groups, and so on? Was consent given for the collection of the data? Culturally relevant standards should be developed in parallel with the technical standards.
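The pairing of technical and culturally relevant evaluation metadata described above could be captured in a simple structured record. The sketch below is purely illustrative: the field names are assumptions for this example, not an established metadata standard, and a real schema would emerge from the standards work the group envisions.

```python
from dataclasses import dataclass

@dataclass
class DatasetEvaluationRecord:
    """Illustrative record pairing technical metadata (versioning,
    runtime details) with culturally relevant information for judging
    whether a dataset is AI-ready. Field names are hypothetical."""
    name: str
    version: str                    # datasets change; track which snapshot was evaluated
    change_log: list                # what changed and why (e.g., a bias or performance fix)
    runtime_environment: str        # environment used to produce the data
    operating_system: str
    software_version: str
    source_populations: list        # from what populations were the data collected?
    representativeness_notes: str   # coverage across ages, sex, ethnic groups, etc.
    consent_documented: bool        # was consent given for the collection of the data?
    license: str = "unspecified"

# Hypothetical example record
record = DatasetEvaluationRecord(
    name="example-clinical-notes",
    version="2.1",
    change_log=["2.1: rebalanced age groups to redress a sampling bias"],
    runtime_environment="container image python:3.11-slim",
    operating_system="Ubuntu 22.04",
    software_version="etl-pipeline 0.4.2",
    source_populations=["adults from two regional hospital systems"],
    representativeness_notes="under-represents patients over 75",
    consent_documented=True,
)
print(record.version, record.consent_documented)
```

Keeping the cultural fields (populations, representativeness, consent) alongside the technical ones in the same record reflects the group’s point that the two sets of standards should be developed in parallel rather than bolted on separately.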
Finally, the breakout group suggested that both data creators and data users are essential players in a successful collaboration because each brings a different type of expertise to the work. Data curators also need to be involved, as they are familiar with the databases, and tool developers will be essential in judging the value of the various tools used on the data, the match between data and tool, and which tool standards are appropriate.
A second collaboration concerning AI-related data would focus on sharing information and ideas about incentives for sharing data and for making data reusable for AI. Figuring out which incentives work best is crucial for opening up the field and enabling more people to develop and use AI technologies, but there is another important reason for learning about incentives. With the rapid rise of AI and growing concerns about its risks, it is vital to build trust in AI now, and one of the best ways to do that is to incentivize data sharing and reusability effectively, making it possible to train AI models with the best possible datasets. In turn, building trust in AI will enable efforts to innovate with AI applications that serve and benefit the individual, the collective, and society.
What sorts of things might act as incentives for sharing data and making data reusable? For one, data sharing helps all AI researchers by, for example, making it easier to validate and benchmark models and by making scientific knowledge more usable. The sharing of data might be incentivized by providing attribution or authorship credit to those who provide the data. In some cases, the creators of the data might be compensated for sharing it and making it more usable. Incentivizing data holders in the private sector to share their data and make it more usable may look very different from incentivizing academic researchers and data creators; this is something that the collaboration could examine.
As a recent article in Current Affairs said, “The truth is paywalled, but the lies are free” (Robinson, 2020). The lesson for AI is clear: Getting the best data to work with will require incentives.
Robert Chen (Columbia University) noted in the discussion following the presentation by members of the AI breakout discussion that disincentives for sharing data exist and also need to be taken into account. Indigenous communities or groups of other underrepresented minorities, for example, may have concerns about sharing data with researchers because of how some minority groups have been mistreated by researchers in the past. Or perhaps someone with a uniquely valuable dataset desires to profit from it and thus has incentives not to share it with potential competitors.
Modern AI models, particularly generative AI models, have broad and deep societal impacts. It is important to make sure that the benefits
of AI are clear and that the risks of AI are understood, which means that raising awareness has a major role to play—ensuring that all stakeholders are familiar with what AI is, what it does, how it works, and how it makes mistakes.
The envisioned collaboration among universities and potentially other research institutions would work to ensure that the outcomes of AI are as beneficial as possible and that they do not deepen the inequities that already exist across society. It would achieve this by disseminating information on the benefits and risks of AI across its members so that everyone has as much accurate information as possible for making decisions about the technology. It would also share best practices for research policies, practices, and integrity; these policies would act to ensure transparency, accountability, integrity, and the sharing of information, as well as the optimal outcomes for AI, including societal outcomes. The policies would also specify that people within the organization who are involved with AI in various ways receive appropriate training and develop an awareness of key AI-related issues, and that there is discussion about who has responsibility within the organization for those issues. As for the researchers who are directly building models and working with AI, it is important that they develop an additional level of understanding of the data they are working with, because the models they train on those data will reflect what is in those data.
In developing the collaboration, several questions must be answered. For instance, how will the information be communicated among the participants? Exactly what information will be shared? Who is responsible for what?
At a university, it is most likely that the office of the vice president for research will be responsible for instituting the organization’s research and academic policies for AI. One potential problem is that the people who have this responsibility may not have the knowledge and understanding needed to craft these policies, and this knowledge gap will need to be overcome. Part of the process will involve determining the impact of AI research on the research enterprise, administration, and students at the university. AI affects all aspects of the research environment and makes it possible to do things that would not otherwise be possible, but including AI in the research process fundamentally changes it, and researchers need to determine how to understand the AI models and the data that feed into them. On the flip side, the cost to the university of not focusing on AI research will likely be difficult to quantify.
A second breakout group discussed collaborations among research data institutions that would focus on the integration of JEDI (justice, equity, diversity, and inclusion) and Indigenous data sovereignty. After the individual members took a few minutes to write down specific ideas on achievable collaborations, the entire group voted for the top choices. After further discussion and refinement of the suggestions, the group ended up with three:
The three proposed collaborations are described in more detail in the following subsections.
The misuse of data from Indigenous and other minority groups has led to various harms. As databases get larger and more comprehensive and the applications and analyses that can be carried out with the data become more powerful, it is important to think now about the possible harms that might arise from the misuse of data from these groups. The first proposed collaboration from this breakout group was to establish a standing collaboration across agencies and sectors to embed considerations for the care and ethical and appropriate use of minority and Indigenous data. There needs to be a set of common principles that are widely accepted and applied.
As the breakout group noted, a set of principles already exists for dealing with Indigenous data—the CARE Principles for Indigenous Data Governance: Collective benefit, Authority to control, Responsibility, and Ethics. While CARE is an important framework, and the community is developing metrics for measuring adherence to it, CARE was developed in a rights-based context, and some rights will differ across minority groups. It is important to develop a common approach to dealing ethically with data from all underrepresented minorities.
The breakout group outlined an approach for developing a set of principles that would govern the collection and use of data from historically disadvantaged minority groups. The four steps would be as follows:
All of these steps require funding. The first step is likely to result in a long list, as there are many different groups that should be included and, in many cases, multiple possible representatives of those groups. Given that there are 574 federally recognized tribes, for example, who, if anyone, should speak for all Indigenous peoples?
Once the representatives have been chosen, they would need to meet regularly in a neutral forum such as the Global Indigenous Data Alliance or the National Academies of Sciences, Engineering, and Medicine. Various associations could serve as a model for this group. For instance, Earth Science Information Partners, or ESIP, is a nonprofit organization that was created to offer a safe space for Earth scientists, data professionals, and others to come together to collaborate on various issues (ESIP, 2024), and various societies have created forums to examine fairness issues. Once the members get together, the idea is that they will develop some sort of statement of common principles. It may not be possible to develop a list of common principles that applies to all communities, but it should be possible to identify principles that apply to all communities, combined with targeted principles that apply to some communities but not others. Then, the final step would be to begin identifying metrics that can be used to judge how well those principles are being achieved. Over time, this group or others could monitor how well the standards and norms are being followed in various projects and uses of data.
One envisioned timeline for the project’s various milestones might be to identify the relevant participants within 6 months of securing funding, begin regular meetings by the end of year 1, produce a statement of common principles by the end of year 2, and identify common metrics by the end of year 3. Centering participation by Indigenous peoples would be crucial to the success of the collaboration.
The benefits of a successful collaboration would include reducing the risk to Indigenous peoples and other minority populations of sharing their data and increasing public trust in science.
The second proposed collaboration focuses on embedding standards and norms related to data sovereignty and JEDI issues into data management plans. These plans, which are increasingly required by research sponsors, have a variety of shortcomings. They are often only marginally useful, they are not always available, and common guidance on how to develop them has not emerged. A particular problem is that data management plans typically do not address JEDI issues and data sovereignty issues. This collaboration would work to solve this problem.
This collaboration could proceed in parallel with the first collaboration described above—on developing a set of common principles for working with data from Indigenous communities and other minority groups—and under the umbrella of the same organizing group. Those working to develop the overarching principles and those working to embed standards into data management plans could inform each other’s activities.
The essential parties needed for a successful collaboration would include organizations that need data management plans, policy experts, and librarians who would help implement the plans. Meetings would take place in a forum that follows the ESIP model. The forum would be created and participants assembled in the first year, while the set of standards for inclusion in data management plans would be developed by the end of year 3.
The benefit of a successful collaboration would be the creation of data management plans that anticipate and provide instruction on handling issues related to data sovereignty and JEDI concerns, thus reducing the risk of harm resulting from the data not being handled properly. One issue that needs attention in such data management plans is data provenance, so that it is clear which data in a dataset have been derived from Indigenous or other minority communities and any benefits derived from those data can be directed back to the communities that provided them.
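The provenance requirement described above—knowing which data in a dataset derive from Indigenous or other minority communities so that benefits can flow back to them—amounts to carrying a traceable tag through the data. The sketch below is a minimal illustration; the field names, community labels, and record layout are hypothetical examples, not a published provenance standard.

```python
# Minimal sketch of provenance tagging in a dataset record, so that data
# derived from Indigenous or other minority communities remain traceable.
# All names and fields here are hypothetical, not an established schema.

dataset = {
    "name": "regional-health-survey",
    "records": [
        {"id": 1, "derived_from": ["community-a"]},
        {"id": 2, "derived_from": []},                 # no community-derived data
        {"id": 3, "derived_from": ["community-a", "community-b"]},
    ],
}

def contributing_communities(ds):
    """Return the set of communities whose data appear in the dataset,
    e.g., so that benefits derived from the data can be directed back."""
    found = set()
    for rec in ds["records"]:
        found.update(rec["derived_from"])
    return found

print(sorted(contributing_communities(dataset)))
```

Even a tagging scheme this simple would let a data management plan answer the question the text raises: which communities contributed to this dataset, and to whom should benefits be directed?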
The third proposed collaboration would focus on training, specifically providing data science training for underrepresented groups as well as JEDI-related training on the use of data for everybody. The core principle underlying such a collaboration would be “Nothing about us without us.” That is why it is important to include more people from underrepresented minorities in the data science workforce, given that data from these groups exist in databases used in various research projects and applications. It is also important to deepen the knowledge of the data science workforce about JEDI principles and issues. Furthermore, both approaches—bringing more individuals from underrepresented minorities into the data science workforce and training that workforce on JEDI principles—would increase the public’s trust in the field as well as increase the value and quality of the available data and broaden access to the use of and benefits from the data.
To provide more JEDI-related training to everyone in the data science field, the collaboration would assemble an inventory of existing data science training programs, including those in the undergraduate general education curriculum, and then figure out how to build more JEDI content into them. Several different types of data science training programs exist today. For instance, the Committee on Data of the International Science Council, or CODATA, offers a summer school program. The Carpentries offers instruction to researchers on software engineering and data science.1 Several universities have or are developing data science programs, and there are employer-led initiatives on the subject as well. Some of these programs address JEDI concepts and some do not, but all of them would go into the collaboration’s inventory—along with details on how much and what JEDI content they include—and they would be approached about bringing that content up to a useful level.
The specific tasks and milestones that would need to be achieved by this collaboration are the following: (1) developing an inventory of existing training programs, (2) developing an inventory of data science curricula, (3) developing an inventory of undergraduate general education curricula that involve data science training, (4) developing an inventory of employer data science training programs, and (5) building more JEDI content into these programs. A second track would focus on providing data science training to students from Indigenous groups and other underrepresented minorities.
___________________
1 See The Carpentries website, carpentries.org.
All three of the JEDI collaborations outlined by this group would work together in an interrelated, mutually reinforcing way. The first would bring together many different parties to establish guiding principles and standards for the use of data from Indigenous groups and other underrepresented minorities. The second would see that those standards are reflected in data management plans. The third would ensure that anyone being trained to work with data is familiar with such principles and standards, so that treating data from these groups ethically becomes the norm in the field.
The third breakout group was asked to focus on research-data collaborations that would aid decarbonization efforts. After discussion, the group emphasized the importance of adopting a data-driven approach to identifying where to invest resources to accelerate progress toward decarbonization and green-energy transition goals. They defined resources very broadly—not just funding and physical resources but human and organizational resources as well. They likewise defined green-energy transition goals broadly, encompassing environmental, social, economic, and other goals.
Their approach would involve bringing together interested policymakers and corporate decision-makers to determine the best ways to use available data resources to move toward decarbonization. Doing this would require identifying and organizing the data resources necessary to achieve these goals. The collaboration would also consider the downstream effects of these data and, in the longer term, monitor the impacts and decarbonization outcomes.
Such a data-driven approach would be particularly valuable now, as the growing momentum toward decarbonization, coupled with the urgency to “do something,” risks leading to costly mistakes. By identifying and taking advantage of the key datasets that can inform decisions in this space, policymakers may be more likely to allocate resources in a way that maximizes the benefits of the decarbonization and green-energy transitions.
The group identified a long list of stakeholders who might be involved in discussing the best ways to use the available data resources. The list included representatives from various offices of the Department of Energy (DOE), including its policy and applied and basic science components; the White House’s Office of Science and Technology Policy; the Interagency Working Group on Coal and Power Plant Communities and Economic Revitalization; the Environmental Protection Agency; the U.S. Geological
Survey and other agencies of the Department of the Interior; the National Science Foundation; the National Institute of Standards and Technology (NIST) and other standards bodies; NASA; and the Department of Defense and the national security agencies. Stakeholders from the private sector could include utilities and other energy providers, oil and gas companies, mining companies, major agricultural and forestry companies, automobile manufacturers, photovoltaic and battery manufacturers, and the high-tech industry. Trade organizations and cooperatives, such as the Electric Power Research Institute, could serve as bridge organizations between academics and the private sector. Energy investors would also be an important group to engage; there is a responsible investing sector that is heavily engaged in the decarbonization space. Academia and data repositories would be included, and it would also be important to involve representatives from tribal groups and economically disadvantaged groups.
Not everyone on this list would be brought in for a given project. Instead, there would be specific use cases with perhaps only two or three people from the broader stakeholder group who would be heavily engaged.
It will be important to determine the questions to ask potential partners to assess their interests. Stakeholder engagement could be carried out via a survey or focus groups and could be used to identify the relevant and available datasets. It should also be possible to use LinkedIn and other social media to crowdsource some of the information gathering on decarbonization data resources. From that information, a living document could be compiled of known datasets and resources, as well as the connections between those datasets. This document could be used to identify gaps and to track who is using the datasets.
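The “living document” of datasets and their connections could begin as something as simple as a small graph: datasets as nodes, known connections as edges, with a check for gaps. The sketch below is a minimal illustration; the dataset names, holders, and fields are invented for the example, and a real inventory would live in a shared, editable platform.

```python
# Minimal sketch of a living inventory of decarbonization data resources:
# datasets are nodes, known connections between them are edges.
# All dataset names and fields are hypothetical, not real resources.

datasets = {
    "grid-emissions":    {"holder": "utility consortium",  "users": ["policy-lab"]},
    "plant-retirements": {"holder": "federal agency",      "users": []},
    "ev-adoption":       {"holder": "academic repository", "users": ["policy-lab"]},
}

# Pairs of datasets known to be usable together.
connections = [
    ("grid-emissions", "plant-retirements"),
    ("grid-emissions", "ev-adoption"),
]

# Gap analysis: datasets with no recorded users may need outreach, and
# dataset pairs with no recorded connection may signal an integration gap.
unused = [name for name, info in datasets.items() if not info["users"]]

names = sorted(datasets)
unlinked_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if (a, b) not in connections and (b, a) not in connections
]

print(unused)          # datasets nobody is tracked as using
print(unlinked_pairs)  # candidate integration gaps between datasets
```

Keeping the inventory in a machine-readable form like this would make the gap-identification and usage-tracking roles the group describes straightforward to automate as the document grows.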
Ideally, several tasks would be completed first within a shorter time frame to demonstrate the concept. The group sketched out a list of possible tasks:
In the discussion that followed the presentations of the three breakout sessions, summit participants focused on topics that were broadly relevant to the proposed collaborations.
For instance, it was noted that there is a great deal of activity—research, projects, public-private collaborations, and so on—in the AI sphere, the decarbonization sphere, and to some extent the JEDI sphere. An important first step in many of these proposed collaborations would be to generate a landscape map of the particular sphere of interest. Such a landscape map would include major initiatives currently being undertaken in the area, the entities involved and their strengths and weaknesses, and available data and other resources. One valuable resource in constructing landscape maps could be the NIST research data framework, which is a list of more than 100 organizations that work in research data sharing and research data management. Once a collaboration puts together such a landscape map, analyzing it should help in determining which potential projects are most likely to succeed or to have the greatest value.
Another potential challenge for the various data-related consortiums will be gaining access to data held by private companies, given that datasets are often valuable assets that require investments of human and financial resources to create and maintain. Even when companies are working with other organizations on an issue of common interest, they generally share only the data that bear directly on the problem they are trying to solve. A key issue is how to frame the value proposition when trying to get these companies to provide access to their data. They want to understand how they stand to benefit, and one part of the answer can be that these data projects help mitigate risks to industry. In other words, the value proposition does not have to be just about benefits; it can also be about reducing risks.
Outside pressure could play a role here. In the decarbonization sphere, for instance, if society begins to demand some accountability and some real data from the companies making various decarbonization claims, it could push them to share more of their data than they have been willing to in the past.
One speaker noted the value of distinguishing between competitive and precompetitive spaces for companies. Every company has an idea of which
data are important in its competition with other companies and which are not, and that will vary from company to company. In working with multiple companies, one needs to determine which companies will be open to sharing which types of data on a project-by-project basis and then carve out safe spaces for data sharing. According to the same speaker, it is also useful to get companies used to the idea of data being pulled from multiple sources. To do that, one can start with projects in which the answers are already known; that is, start with projects where it is easier to get the companies to agree to sharing their data. This enables companies to see the benefits of integrating all the data, and such demonstrations of the value of the process will make companies more likely to agree in other cases.
Another speaker made a similar comment, saying that these collaborative efforts need to be iterative, growing larger with each iteration. There will always be early adopters who are willing to be part of a project, and they will help make the case to others that it is a worthwhile effort. Make it user-friendly, get the word out through media and other channels, and then gradually broaden the stakeholder base and increase the size of the model. An important goal of these projects should be to broaden engagement.
ESIP (Earth Science Information Partners). 2024. A home for Earth science data professionals. https://www.esipfed.org/ (accessed April 20, 2024).
Robinson, N. J. 2020. The truth is paywalled, but the lies are free. Current Affairs, August 2. https://www.currentaffairs.org/2020/08/the-truth-is-paywalled-but-the-lies-are-free/ (accessed August 18, 2024).