On the summit’s second day, participants considered the seven opportunities that had been identified in the previous day’s breakout sessions and then voted on what they believed to be the highest priorities. The seven opportunities carried over from the first day are enumerated in Chapter 4.
The voting identified projects 4, 3, 6, and 2—in that order—as the summit participants’ four highest priorities. As planning committee co-chair Mary Lee Kennedy said when announcing those four priorities, the choice did not mean there was no energy to address the other three opportunities at some point; rather, the summit had only a limited amount of time and a limited number of participants and so had to prioritize what to discuss.
Next, the summit participants divided themselves into four groups to discuss how to proceed on each of the four opportunities. The goal was to identify potential champions for each project as well as organizations likely to be interested in participating. In particular, each breakout group was asked to come up with a plan and a timeline for taking advantage of its assigned opportunity.
Those plans, including champions and interested organizations, are described in the following subsections.
The first group to report was focused on the highest priority opportunity: establishing a standing collaboration across rights holders, stakeholders, and other interested parties for the care of Indigenous and other minority data. The group first looked at what it means to be a “standing collaboration” and concluded that they wanted to create not just a community of practice but an instantiated arrangement—a consortium or a collaborative of some form. The collaborative would be populated initially by groups that step forward and would be open to other groups that choose to join later.
The collaborative would work to advance both JEDI principles and the CARE (Collective benefit, Authority to control, Responsibility, Ethics) principles designed for Indigenous data governance. The group noted that broader community considerations are not fully encompassed by the JEDI and CARE principles, so part of the group’s work would involve expanding the conversation beyond JEDI and CARE. Another task would be to help groups operationalize and implement the principles.
The group saw its proposed collaborative as being linked to another group being developed—the one focused on training—since ensuring actual adoption of principles for the care of Indigenous and other minority data will be just as important as developing those principles. The collaborative would also help to develop tools and methods to be used in applying the principles they identify and also serve as a forum to guide funders and institutional review boards, with that guidance adapted over time.
The group identified about 25 different entities, groups, communities, and initiatives that should be part of this effort. These include federally recognized tribes, migrant communities, the Society for the Advancement of Chicanos/Hispanics and Native Americans in Science, Advancing Indigenous People in STEM (science, technology, engineering, and mathematics), the U.S. Indigenous Data Sovereignty Network, the Global Indigenous Data Alliance, tribal colleges and universities, Historically Black Colleges and Universities (HBCUs), Hispanic-serving institutions (HSIs), Data for Black Lives, Data for Good, Stop AAPI (Asian-Americans and Pacific Islanders) Hate, and key data alliances such as the Research Data Alliance—United States and the Earth Science Information Partnership (ESIP). It is likely that many more organizations would be interested in participating.
To develop the process for establishing such an effort, three members of the group—Joel Cutcher-Gershenfeld of Brandeis University, Jane Anderson of New York University and the Department of Energy’s National Energy Technology Laboratory, and Jennifer Hansen of Microsoft—agreed to do a first cut and produce a one- or two-page summary document, which would then be shared as a Google Doc with the others in the group. This would become a preliminary step toward a charter and be shared with representatives from these various groups. Once that is done, instead of bringing everyone together at once, the plan would be to assemble smaller groupings that have similar interests, after which all of the different groups—hopefully, 30 or 40 or even more entities—would meet on a Zoom conference call with the goal of instantiating an ongoing body that concerns itself with data from and about Indigenous and minority and underserved populations.
Cutcher-Gershenfeld commented that there was a great deal of energy and enthusiasm in the group to see this entity come into existence. He added that there are other ongoing efforts to foster similar conversations, and the goal would be to work with those other entities to highlight the specific role of data.
The next group was tasked with developing a plan to share information on the evaluation of datasets and models, with a focus on data quality. The group started with the observation that evaluating the quality of research data continues to be of critical importance for mitigating bias and ensuring that research data are properly reused. The group would investigate a set of generalized data quality indicators that would allow researchers and scholars to quickly evaluate the trustworthiness and veracity of research data. The group saw its mission as finding ways to address the complex challenges surrounding data quality, identifying effective measures of quality, and disseminating those measures for the datasets that have been evaluated.
One group member explained the approach to data quality by using climate science as an example. In that area, measures of data quality have to account for anomalies, uncertainties, and other factors that affect accuracy. One challenge is to communicate conclusions about data quality in both human- and machine-readable formats, so that computer models and researchers can use the information about data quality.
As this example indicates, the group identified two steps as being important: first, developing a set of effective indicators of data quality and, second, communicating those indicators of data quality to others so that they can be used by individuals and groups that lack robust methods for assessing data quality.
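The two steps above—defining indicators and communicating them in both human- and machine-readable form—can be sketched in a minimal example. This is an illustration only: every field name and value below is a hypothetical assumption, not an indicator the group has actually adopted.

```python
import json

# Hypothetical data-quality record; field names are illustrative placeholders.
quality_record = {
    "dataset_id": "example-climate-obs-001",  # assumed identifier scheme
    "completeness": 0.97,                     # fraction of expected records present
    "uncertainty": {                          # quantitative uncertainty, where available
        "temperature_c": {"type": "std_dev", "value": 0.4},
    },
    "known_anomalies": [
        "sensor drift corrected after 2019-06",
    ],
    "assessment_method": "automated range checks plus expert review",
}

# Machine-readable form: plain JSON that models and pipelines can parse.
machine_readable = json.dumps(quality_record, indent=2)

# Human-readable form: a short summary derived from the same record,
# so both audiences work from one source of truth.
human_readable = (
    f"Dataset {quality_record['dataset_id']}: "
    f"{quality_record['completeness']:.0%} complete; "
    f"{len(quality_record['known_anomalies'])} known anomaly noted."
)

print(machine_readable)
print(human_readable)
```

Deriving the prose summary from the same record that computer models consume is one way to keep the two representations from drifting apart.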
Assessing data quality is complex, and many different types of data are collected across fields and disciplines. Group members pointed out that many domain-specific data standards have already been defined in individual fields and that many groups are discussing data quality. One of the group’s first tasks will be to examine what best practices and lessons learned already exist, with the goal of creating common indicators of data quality that can be used across fields.
One topic discussed during the breakout session was the broad spectrum of methods for characterizing data. Some approaches use quantitative uncertainty models, while others are much more qualitative. One suggestion was that it might be better to focus on data characterization rather than data quality because “data characterization” has a more value-neutral connotation. It might be useful to provide data users with high-level guideposts for the factors to consider when they are contemplating using a dataset or combining two or more datasets.
As a next step, the group planned to organize a Zoom call to exchange perspectives on best practices and data quality indicators from the various fields represented by group members. After gaining some clarity on existing measures and efforts, the group will identify areas where it can add value, most likely related to developing common indicators.
One question that arose from the discussions was whether important perspectives are missing. Different criteria for determining data quality are used according to the domain (e.g., survey data, sociological data). One
can also look at data from the perspective of Indigenous populations or underserved communities. The group thought it was important to embrace all the different perspectives on data characterization, but members did not think that they adequately represented all the relevant perspectives. It would be important to seek out additional perspectives. In this way researchers and others could be encouraged to look at their data with new eyes and see things in new ways, similar to what has happened in medical research when an increasing number of women entered the field and pointed out the importance of adequately supporting research on female subjects.
Against that backdrop, the group sketched out a possible step-by-step approach for developing a set of common indicators of data quality.
Having developed a set of data quality indicators, the next step would be to share those indicators. The group thought that the most effective way to do this would be by describing them in various papers and articles published in journals and other platforms where researchers and other interested parties would find them.
Although the group has not yet developed its specific quality indicators, it did come up with a list of information that is likely to be important to anyone who is considering using a particular dataset. The list begins with such high-level information as the location of the data, the authors of the data and any companies associated with it, the completeness of the data, and a qualitative assessment. Another important consideration is the ease with which datasets can be combined; this will depend upon such things as common identifiers, semantic identifiers, domain dependencies, provenance, bias, the ability to evaluate whether the data have been integrated correctly, and legal and licensing considerations for products of the combined data. Potential users will also want to know what format the data are in, whether they are machine-readable, and what indications of quality are available. The granularity of the data is also important to know; in general, it will depend upon the discipline and the method used to collect the data.
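As a rough illustration only, the kinds of information in this list could be gathered into a single structured record. The field names below are assumptions drawn from the group’s discussion, not an agreed standard.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a dataset-characterization record;
# field names are illustrative, not a community standard.
@dataclass
class DatasetCharacterization:
    location: str                  # where the data are hosted
    authors: list[str]             # authors and associated organizations
    completeness: float            # 0.0-1.0, fraction of expected records present
    qualitative_assessment: str    # free-text expert judgment
    data_format: str               # e.g., "CSV", "NetCDF"
    machine_readable: bool
    granularity: str               # depends on discipline and collection method
    common_identifiers: list[str] = field(default_factory=list)  # aid combining datasets
    provenance: str = ""           # origin and processing history
    license: str = ""              # legal/licensing terms for reuse

# Example instance with hypothetical values.
record = DatasetCharacterization(
    location="https://example.org/data",  # hypothetical URL
    authors=["Example Research Group"],
    completeness=0.9,
    qualitative_assessment="suitable for regional trend analysis",
    data_format="CSV",
    machine_readable=True,
    granularity="daily station observations",
    common_identifiers=["station_id"],
)
print(record.data_format, record.machine_readable)
```

Even a simple record like this makes the group’s point concrete: fields such as `common_identifiers` and `license` are exactly the information a user needs before attempting to combine two datasets.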
Going forward, the two champions/points of contact for this group will be Robert Hanisch, director of the Office of Data and Informatics at the National Institute of Standards and Technology (NIST), and Giri Prakash, chief data and computing officer at the Department of Energy’s Atmospheric Radiation Measurement Data Center.
The third priority was to develop and provide data science training for underrepresented groups and ethical JEDI data use training for everyone. This group focused on two efforts: (1) working with underrepresented groups to ensure that they receive strong training in data science, and (2) ensuring that ethical and JEDI principles are understood by everyone who is doing research, data management, and data science.
For the first effort—developing and providing data science training for underrepresented groups—the group decided to build on existing efforts, working to broaden their reach and make training available to larger numbers of individuals from underrepresented groups. For example, the National Student Data Corps, an initiative developed by the Northeast Big Data Innovation Hub, is a “community-developed initiative that provides resources and opportunities for students to learn data science, in a community of support, with a special focus on engaging underserved institutions, students, and communities” (Northeast Big Data Innovation Hub, 2024). One option would be to work with other interested institutions to set up similar programs in other parts of the country or to expand the initiative’s offerings to more communities. The CODATA–Research Data Alliance (RDA) Schools of Research Data Science offer summer school classes that teach early-career researchers the foundational data science skills they need to succeed in their careers (CODATA, n.d.). Although these schools are not specifically aimed at students from underrepresented communities, they are a resource those students could take advantage of. In addition, many community colleges offer classes in data science and may be easier to access and more convenient for students from underrepresented groups than many universities. Another potential resource is the Minority Serving – Cyberinfrastructure Consortium, which helps develop sustainable campus-level information technology capabilities for the support of data-intensive education and research programs (MS–CC, 2024). Its members include 55 HBCUs, 21 HSIs, 9 tribal colleges and universities, and 55 affiliate organizations, and it would be a natural partner for providing data science training and resources on the campuses of minority-serving institutions (MSIs). Many organizations offer access to members of underrepresented communities, from HBCUs and other MSIs to tribal councils. Reaching out to these organizations, discussing the importance of data science training, and facilitating access to educational resources could make an important contribution.
The breakout group also spent time discussing ways to bring greater understanding of ethical and JEDI principles to all individuals involved in research, data management, and data science. One topic the members considered is at what career stage JEDI education might begin, since it might take different forms depending on education level. Ethical and JEDI principles could be integrated into courses for undergraduate or graduate students, for instance, while existing scholars might be offered a continuing education unit on such principles. The group decided not to focus on K–12 at this point, although that cohort could be addressed in the future.
The group also listed some of the partners that could be valuable in such an effort. That list included the big data hubs in different geographic regions around the country (such as the Northeast Big Data Innovation Hub); The Carpentries, a nonprofit organization that offers instructional workshops teaching foundational coding and data science skills to researchers around the world (Carpentries, n.d.); ESIP, the Federation of American Societies for Experimental Biology, and other professional societies from different domains; and CODATA-RDA. Funding could come from, among other possibilities, the National Science Foundation and the Institute of Museum and Library Services, both of which were represented at the summit.
Regarding the training in ethical and JEDI principles for data science, there is no need to reinvent the wheel, as a number of courses already exist with instruction on these principles. Indeed, the group noted that one of its tasks will be to inventory existing courses. For that purpose, ESIP’s Data Management Training Clearinghouse will be an excellent resource (ESIP, 2024). And CODATA-RDA offers an excellent example of training that is already available and in use.
The group also discussed where the initial focus of training efforts should be. Should they target R1 universities because they account for much more research than other schools, or should they start with less resourced institutions? Should they start with data scientists or domain
scientists or perhaps librarians or programmers? Several factors need to be taken into account, such as where the greatest need or demand lies, where the most potential benefit can be gained, and which institutions offer the best environment for early efforts.
The group planned a follow-up meeting to discuss the CODATA data training as a starting point, analyzing the pros and cons of that program. The group intended to sustain the effort as an ongoing initiative, building an infrastructure of training and development tools, methods, and approaches. Ultimately, the group expects to develop a training program based on a foundational curriculum with JEDI principles incorporated into it that would be openly available and reusable under a Creative Commons license.
The fourth and final priority was to establish community standards and norms for data integration, Indigenous data sovereignty, and data management and sharing (DMS) plans. The breakout group chose to focus on developing standards that would lead current and future DMS plans to be expanded to include JEDI and Indigenous considerations. This would require a two-pronged approach. First, it would be necessary to work with the agencies that already encourage or require DMS plans and provide guidance on what those plans should include, encouraging those organizations to reexamine that guidance through a JEDI lens. Second, guidance would need to be developed for researchers formulating data management plans on how to incorporate JEDI principles.
Standards of this type would need to resonate with various stakeholders and rights holders—particularly those from whom the data are being collected, those for whom the data are being collected, and those who may eventually use the data. Broad discussions would be needed to develop such standards, and there would need to be agreement from the groups from whom the data are being collected. Participants would include faith-based organizations, social scientists, representatives of minority and Indigenous groups, and federal and state government agencies that have been involved in defining a common language, workflows, and processes. One example of a federal agency that could be involved is the Department of Energy’s Office of Economic Impact and Diversity. Input could also be sought from standard-setting bodies that are already working on relevant efforts, such as the Indigenous data provenance standard developed by the
Institute of Electrical and Electronics Engineers (IEEE). Other contributors could include the International Organization for Standardization, ASTM International, and NIST. It would also be valuable to bring in international perspectives and discover relevant activity in other countries.
The standards would need to be readily understandable by all, including such individuals as local government leaders, leaders of faith-based organizations, representatives of Indigenous tribes, and so forth. And it would be useful to test proposed standards in different contexts—that is, in case studies—before they are officially adopted. It will be important to make sure that a standard resonates with and speaks to the needs and concerns of those who will be using the data or from whom data will be collected.
What would success look like? It would certainly be a success story if a standard developed by the group ended up being adopted by IEEE or other major standard-setting organizations. But it would also be a success if large numbers of existing DMS plans were expanded to include JEDI and Indigenous considerations.
This standards group might also work with the training group, whose presentation was described above. The standards being developed by this group could be used in training, and those being trained may play a role in developing standards.
John Havens of IEEE offered to serve as the point of contact for everyone in the group so that they could continue to communicate. At the time of the summit, the group had not set a date for a Zoom call, but they were planning on communicating and moving forward with the plan. It was suggested that IEEE may be a natural institutional home for the effort but that there may be other agencies—such as the National Science Foundation or the National Institutes of Health—that require DMS plans that could also participate.
A task that emerged during the final discussion was reaching out to the next generation of data professionals and data scientists that will be taking on leadership roles in the coming years. One option for reaching this cohort, several participants mentioned, would be to organize a follow-up summit that focuses on early-career professionals and their interests. The U.S. National Committee for CODATA might be the appropriate group to develop and possibly implement this idea.
Carpentries. n.d. The Carpentries. https://carpentries.org/index.html (accessed April 22, 2024).
CODATA (Committee on Data, International Science Council). n.d. 2021 CODATA-RDA School of Research Data Science. https://codata.org/initiatives/data-skills/research-data-science-summer-schools/2021-codata-rda-school-of-research-data-science/ (accessed April 22, 2024).
ESIP (Earth Sciences Information Partnership). 2024. Welcome to the Data Management Training Clearinghouse! https://dmtclearinghouse.esipfed.org/home (accessed April 22, 2024).
MS–CC (Minority Serving – Cyberinfrastructure Consortium). 2024. Minority Serving – Cyberinfrastructure Consortium: Accomplishing more together. https://ms-cc.org/ (accessed April 22, 2024).
Northeast Big Data Innovation Hub. 2024. National Student Data Corps. https://nebigdatahub.org/nsdc/ (accessed April 22, 2024).