Accelerating the Use of Pathogen Genomics and Metagenomics in Public Health: Proceedings of a Workshop (2025)

Chapter: 4 Data Infrastructure, Interoperability, Classification, and Stewardship

Suggested Citation: "4 Data Infrastructure, Interoperability, Classification, and Stewardship." National Academies of Sciences, Engineering, and Medicine. 2025. Accelerating the Use of Pathogen Genomics and Metagenomics in Public Health: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/29103.

4

Data Infrastructure, Interoperability, Classification, and Stewardship

The third session of the workshop explored current data infrastructure and efforts to expand capacity through improved data collection and advances in data sharing. Ana Bento, assistant professor at Cornell University, moderated the session. Steve Sherry, acting director of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), discussed federal pathogen data repositories, publicly available databases, and initiatives to improve interoperability. Melissa Haendel, director of Precision Health and Translational Informatics and Sarah Graham Kenan Distinguished Professor at the University of North Carolina at Chapel Hill, outlined the structure, function, and various research applications of the National COVID Cohort Collaborative (N3C). Alan Christoffels, director of the South African National Bioinformatics Institute, described the efforts of the Public Health Alliance for Genomic Epidemiology (PHA4GE) to improve public health bioinformatics and data-standards structures. Kristian Andersen, director of infectious disease genomics and professor at Scripps Research, discussed considerations for pathogen surveillance during various outbreak phases.

EXPLORING THE CURRENT STATE OF INFRASTRUCTURE AND INTEROPERABILITY OF PATHOGEN GENOMICS DATA IN THE UNITED STATES

Sherry discussed the role of public sequence data in biomedical surveillance and public health, focusing on the current state of infrastructure and interoperability of pathogen genomics data in the United States. He explained that the aggregation, standardization, and distribution of public sequence data support pathogen surveillance and outbreak detection by serving multiple goals. These include the rapid identification of outbreak sources (e.g., contaminated food, medical devices) to target interventions and mitigate threats. Additionally, researchers use sequence data to identify common sources of infection and to develop targeted initiatives aimed at reducing them. The detection of emerging pathogenic threats includes the identification of novel species that cause emerging diseases and identification of antimicrobial resistance (AMR). Use of public sequence data also accelerates research across multiple disciplines, such as developing interventions or vaccines. Moreover, Sherry noted that sequence data facilitate monitoring ongoing pathogen evolution, including virulence, pathogenicity, evolving resistance to vaccines or therapeutics, and the host or environmental factors that contribute to disease spread.

Pathogen Genomics Data Infrastructure

Sherry described how NLM collaborates with organizations that interact with its data repository to define needs, request data access, or prioritize service for pathogens of interest. These organizations include the World Health Organization, Food and Drug Administration (FDA), Centers for Disease Control and Prevention (CDC), National Cancer Institute, National Institute of Allergy and Infectious Diseases, and U.S. Department of Agriculture (USDA). The pathogens each organization considers high priority may change from year to year, he said, and infrastructure and data management requirements may differ for various prioritized pathogens. Such differences pose challenges to a repository working to provide services for data with many different characteristics and collection processes. For example, the NIH National Center for Biotechnology Information (NCBI) has one team that monitors microbial pathogens1 via ongoing food safety surveillance and another team that monitors for viral outbreaks.2 Although these teams are within the same agency, their needs vary due to the type of data they work with, said Sherry.

___________________

1 For NCBI Pathogen Detection, see https://www.ncbi.nlm.nih.gov/pathogens/ (accessed February 5, 2025).

2 For NCBI Virus, see https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/ (accessed February 5, 2025).

Sherry explained that the pathogen genomics data infrastructure comprises interrelationships among data repositories, data generators, and stakeholders. The repositories are maintained by different stakeholders and data generators and serve a wide range of use cases, he added. Sherry noted that NLM makes engineering decisions about how to deliver data to the community in collaboration with public health officials and other groups involved in advancing and presenting data for research and action. He highlighted several repositories of note. GenBank is a publicly available collection of assembled, analyzed, and annotated DNA and RNA sequences. NLM ensures GenBank is available to the scientific community, enabling access to the most current and comprehensive sequence information. These data include nearly four billion records across all of life sciences, and data exchange with the International Nucleotide Sequence Database Collaboration (INSDC) occurs daily. The Sequence Read Archive (SRA) is a dataset of next-generation sequencing (NGS) reads. This vast, public database differs from GenBank in that it contains raw data from high-throughput DNA and RNA sequencing research, including public and controlled access sequences generated from isolated and environmental samples such as wastewater. Sherry noted that these public access data are also exchanged internationally. Two companion datasets, BioProject and BioSample, link data collected through funded projects and provide descriptions of samples for sequencing.

Sherry also noted several widely used repositories and databases outside of NCBI. The Global Initiative on Sharing All Influenza Data is a global science initiative and source of primary data featuring a collaborative platform that emphasizes data security and privacy. The Chan Zuckerberg ID database is a cloud-based informatics tool platform that contains sample data, primarily metagenomic datasets. The U.S. Department of Energy Joint Genome Institute features an Integrated Microbial Genomes and Microbiomes database with tools for comparative analysis and annotation. Finally, the INSDC COMPARE data hubs facilitate submission, analysis, workflows, visualization services, and presentation services and tools.

NCBI’s Pathogen Detection Program

In 2013, NCBI launched its first Pathogen Detection program, a pilot project for Listeria monocytogenes, and developed the minimal metadata standard as part of the global microbial identifier, Sherry highlighted. Two years later, the White House established the Combating Antibiotic-Resistant Bacteria initiative. The Genomics for Food and Feed Safety (Gen-FS) Interagency Group formed in 2016, and in 2022 the Gen-FS metadata working group created an updated metadata standard and began updating previously submitted samples to meet this standard. Sherry noted that NCBI serves as the public repository for the Gen-FS sequence surveillance data and metadata, and Pathogen Detection provides rapid standardized genetic analyses and AMR and genetic characterization for pathogenic isolates in support of the Combating Antibiotic-Resistant Bacteria initiative’s goals. As of 2024, NCBI Pathogen Detection has analyzed almost two million isolates (NCBI, 2024) and FDA has taken more than 1,200 actions to protect consumers from foodborne pathogens, Sherry added. Working with partners including CDC, FDA, USDA, and NIH, Gen-FS conducts collaborative analyses of sequence data submitted from food and patient samples to identify clonal isolates that could be potential sources of outbreak (Stevens et al., 2022). Matches between environmental food preparation samples and clinical isolates from patients are made in real time, enabling public health scientists to quickly identify and verify sources of foodborne pathogens behind clinical cases. Standardized sequence and cluster analyses available via NLM’s website allow rapid identification of outbreaks to target interventions for threat reduction and removal and determination of emerging threats such as AMR or new pathogens (NLM, 2024b).

Data Management and Reuse

Sherry highlighted that the Pathogen Detection program is working with the PHA4GE global coalition to advance the use of open data, empower laboratories to analyze and govern their own data, and encourage standardized structures and interchange formats. Additionally, PHA4GE published the INSDC-compliant Pathogen Data Object Model to structure genomic data for public health applications (Timme et al., 2023). This model establishes minimum sequence and contextual data needed for public health applications and allows pathogen submissions from a variety of agencies and institutions (see Figure 4-1). The data are then made available to the public, and users can bring them into their own organization’s analysis systems (e.g., for sequence analysis) to meet their specific research needs.

Sherry explained that data and data-storage properties affect the ability to reuse data. Variables such as volume, quality, and completeness influence scaling and the ability to locate the data needed for a specific analysis. Even simple standardization in date and time formats can enable broad searches and data aggregation. Data-storage variables include locale, such as industry networks, academia, and local, state, or federal organizations. Furthermore, U.S. systems coordinate data from international sources. Noting that infrastructure should feature feedback loops to the various groups also storing the data, Sherry stated the need for collaborative workflow spaces and collaborative analysis of large datasets, and he underscored the value of species-agnostic annotation, including standardized data formats. The Pathogen Data Object Model features nested relationships among data objects, Sherry explained, and these collectively allow a project to describe millions of reads in assembled contigs grouped by meaningful organisms and samples.
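The point about date and time formats can be illustrated with a minimal Python sketch. It is not drawn from any NCBI or PHA4GE tool: the function name, the list of accepted input formats, and the choice to read slash-separated dates as month/day/year are all assumptions made for illustration. The idea is simply that once every collection date is stored in one form (here ISO 8601, YYYY-MM-DD), records from different submitters can be searched and aggregated together.

```python
from datetime import datetime
from typing import Optional

# Hypothetical list of input formats; real submissions may use others.
# Slash-separated dates are assumed to be month/day/year.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%Y%m%d")


def normalize_collection_date(raw: str) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if unrecognized."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


if __name__ == "__main__":
    for value in ["2024-07-22", "07/22/2024", "20240722", "late July"]:
        print(repr(value), "->", normalize_collection_date(value))
```

Returning None for unrecognized values, rather than guessing, keeps ambiguous entries visible so that a curator can review them.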

Figure depicts pathogen data transmission from institutions to the DOM, and then on to research and public health applications. Institutions like academia, industry, and government entities, share raw sequence and metadata with the Pathogen DOM. Data is then applied to functions like genotyping assays, phylogeny, and public health and clinical functions.
FIGURE 4-1 INSDC-compliant Pathogen Data Object Model to structure genomic data for public health applications.
NOTES: DOM = data object model; INSDC = International Nucleotide Sequence Database Collaboration; wgMLST = whole-genome multilocus sequence typing.
SOURCES: Sherry presentation, July 22, 2024; Timme et al., 2023. Courtesy of the National Library of Medicine. PD Mark 1.0 Universal.

The One Health Enteric Package,3 a next-generation metadata package created by the Gen-FS working group, aims to capture the One Health sample space for enteric microbes, said Sherry. The package consists of core attributes of the collection process, such as context, geography, date, time, host, and additional environmental or facility details. He described how the package provides a holistic metadata standard that facilitates disease outbreak and pathogen traceback investigations, and that NLM now supports this effort. An initiative is under way to backfill millions of previously submitted sample records. Sherry remarked that although the feasibility of this has not yet been established, the effort suggests a systems-level awareness and an appreciation of the value of metadata and baseline context.

___________________

3 See https://github.com/CFSAN-Biostatistics/One_Health_Enteric_Package (accessed February 6, 2025).

Publicly Available Pathogen Databases

Pathogen data for more than 80 organism groups are publicly available on an NLM website that is updated daily, Sherry highlighted.4 These data include all microbial isolates processed by the NCBI Pathogen Detection system as well as submissions from other organizations. With organisms from at least 94 countries represented, the isolates reflect a wide range of geographic locations, he noted. Sherry added that due to the voluntary nature of submissions, NLM does not have required metadata standards in place, so submitters can choose to degrade or eliminate certain metadata fields in response to state, local, or programmatic directives. Data submissions have revealed geographical variation in AMR genes, including distribution and concentration of different genes that cause resistance in different countries (NLM, 2024a). The Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) Tracking Resistance and Coronavirus Evolution (TRACE) initiative is a SARS-CoV-2 variant detection pipeline containing standardized calculations for defining lineage and experimental epitopes (Connor et al., 2024). Sherry underscored that funding for ACTIV TRACE has not been sustained, resulting in a lack of adequate expertise to continue the pipeline.

___________________

4 The National Library of Medicine Pathogen Detection organism groups and corresponding data are available at https://www.ncbi.nlm.nih.gov/pathogens/organisms/ (accessed September 30, 2024).

Sherry remarked that policies and governance structures should establish performance goals for data interoperability and incentivize increased public sharing by creating tools to measure contribution impact. Other opportunities include AI applications and wastewater and environmental biosurveillance for diseases and pathogens. He noted that NCBI is developing disease biosurveillance infrastructure and conducting systematic taxonomic analysis (Katz et al., 2021). Highlighting the importance of raw data availability and systems that make large-scale data useful for downstream analysis, Sherry described public repositories as central to data-sharing efforts and SRA as critical to validation and reanalysis.

INTEGRATION OF DATA STREAMS TO AUGMENT GENOMIC DATA

Haendel discussed the integration of data streams to augment genomic data, focusing on the efforts of N3C, the largest national, publicly available Health Insurance Portability and Accountability Act (HIPAA)–limited dataset in U.S. history, containing nearly 23 million unique patient records from hundreds of clinical organizations. This collaboration originated early in the COVID-19 pandemic, when profound fragmentation of clinical data revealed an urgent need for observational data at scale. For instance, little was known about associations between preexisting conditions and COVID-19 outcomes, or drugs that ameliorated or exacerbated the disease. Without public centralized health care, the United States lacked centralized clinical data to understand how to improve clinical outcomes for COVID-19. Furthermore, data from any given U.S. patient are often spread across multiple providers at different times and locations, limiting the ability to assess individual patients longitudinally, she added.

Federated research networks, in which a central management entity coordinates research and analysis among multiple independent networks, are used to answer focused research questions, said Haendel (Pfaff et al., 2022b). Researchers can analyze aggregated data submitted by N3C network partners to assess, for example, the effect of certain drugs on the average number of days that COVID-19 patients spent on a ventilator. Haendel emphasized the practical value of centralized approaches in conducting exploratory research, such as using machine learning to identify factors most predictive of severe COVID-19. Centralized analytics are performed on data secured in a central enclave and involve collaborative building, testing, and refining of algorithmic classifiers and identification of novel associations.

The National COVID Cohort Collaborative

Haendel explained that N3C was created as a public–private partnership by the NIH National Center for Advancing Translational Science, several commercial partners, and numerous institutions (Haendel et al., 2021). The N3C cohort is representative of U.S. population demographics, Haendel noted. N3C collects data from all 50 states, which are harmonized and made available to the public, and pulls data from four different data models used in research networks to create a robust pipeline and target data model to facilitate interoperability. Haendel highlighted that N3C has incentivized collaboration via attribution policies and structure design, resulting in the participation of more than 450 organizations and more than 4,000 users. Furthermore, N3C has transformed care guidelines, developed evidence-based disease definitions, and developed complex risk prediction models that strengthened public health initiatives, she noted.

Ensuring rigorous and scalable data harmonization requires continuous monitoring and data provenance, Haendel stated. N3C accomplishes this via complete transparency of data lineage for all data feeds from each site. Data are uploaded once per week, and the pipelines are versioned to enable fast, automated data quality checks, she said. New data uploads trigger the pipelines to refresh, enabling scalability. However, she noted that while each research entity carried out quality assurance processes, these occurred behind clinical firewalls and have resulted in the emergence of significant data quality issues during data centralization. For example, variability in records of patient body weight was tied to a lack of uniformity in measurement, with some organizations using grams and others using kilograms. Haendel noted that such inconsistencies required algorithmic repair (Bradwell et al., 2022).
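The body weight example lends itself to a small Python illustration. The sketch below is not the repair method described by Bradwell et al. (2022), whose details are not given here. It is a simple rule-based stand-in, and the function name, accepted unit labels, and plausibility bounds are all assumptions chosen for illustration.

```python
from typing import Optional

# Assumed plausibility bounds for human body weight in kilograms
# (illustrative only; a real pipeline would set these with clinical input).
MIN_KG, MAX_KG = 0.3, 500.0


def repair_weight_kg(value: float, unit: Optional[str] = None) -> Optional[float]:
    """Return body weight in kilograms, or None when it cannot be repaired."""
    label = (unit or "").strip().lower()
    if label in {"g", "gram", "grams"}:
        kg = value / 1000.0
    elif label in {"kg", "kilogram", "kilograms"}:
        kg = value
    else:
        # Unit missing or unknown: treat implausibly large numbers as grams.
        kg = value / 1000.0 if value > MAX_KG else value
    # Reject anything still outside the plausible range instead of guessing.
    return kg if MIN_KG <= kg <= MAX_KG else None


if __name__ == "__main__":
    print(repair_weight_kg(72500, "g"))   # labeled grams
    print(repair_weight_kg(72500))        # unlabeled, inferred as grams
    print(repair_weight_kg(80, "kg"))     # already kilograms
```

A rule of this kind is only safe when the plausible ranges of the two units do not overlap, which is why values that remain out of range are returned as None for manual review.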

N3C COVID-19 Findings

Haendel asserted that N3C provided the earliest and most representative COVID-19 data to predict risk and inform health policy. For example, research conflicted on whether vaccination reduced the risk of post-acute sequelae of SARS-CoV-2 infection, also known as long COVID. Exploration of this question revealed variance in state and county vaccine registry practices, resulting in difficulty in determining vaccination status. Reconciling vaccination data, N3C demonstrated via multiple methods that vaccination does appear to lower the risk of long COVID (Brannock et al., 2023). Additionally, N3C data revealed that fewer patients on Paxlovid treatment required medical care or hospitalization, a finding that led to federal implementation of a Paxlovid policy (Preiss et al., 2024). Furthermore, N3C data were used in generating the first evidence-based definition of long COVID via machine learning (Pfaff et al., 2022a; Tabak, 2022). Although many features of long COVID are observable in patients’ electronic health records (EHRs), the lack of an International Classification of Diseases (ICD) code for the condition early on made identifying patients with long COVID challenging. Researchers used machine learning to discern the patterns of long COVID clinical features from the records of patients seen at long COVID clinics or patients who had been assigned the ICD code. Researchers then employed machine learning to search EHRs for patients’ symptom patterns to identify previously unknown cases of long COVID. Haendel described how the incidence rates of various long COVID symptoms increased and decreased in similar patterns across epochs of different variants. Understanding these features proved useful in classifying patients as having long COVID.

N3C analysis indicated the highest rate of reinfections during the Omicron variant epoch (Hadley et al., 2024). Researchers determined that the severity of reinfection is associated with the severity of initial infection and that long COVID diagnoses occur more often following initial infection than reinfection within the same viral epoch. N3C also enabled study of geographic disparities in access to health care and specific therapeutics and associated COVID-19 outcomes. Researchers found that patients living in rural areas faced higher rates of hospitalization, inpatient death hazard, and adverse events in comparison with patients in urban areas (Anzalone et al., 2024). Additionally, the effectiveness of some therapeutics varied based on rurality. Haendel emphasized the relevance of disparities in therapeutic access and in receiving long COVID diagnosis. She stated that researchers should consider disparities and demographics when linking genomic variant resources with patient data.

Linking Clinical Outcomes Data

Haendel underscored that EHR data are necessary but serve as only a snapshot of clinical encounters and can be supplemented with other data sources. For example, insurance claims data can complement EHRs by indicating whether a patient prescribed a medication actually filled the prescription. Thus, combining claims data with EHRs provides better understanding of whether a certain drug negatively or positively affects outcomes. Other types of data that contribute to a more complete understanding of the context of patient health include data on social determinants of health, viral variants, vaccinations, imaging records, and patient registries. She described how various records can be securely linked via privacy-preserving record linkage (PPRL). Meeting the HIPAA definition of deidentified data, PPRL generates tokens to replace information that would identify a patient, then matches the tokens across data sources. She noted that researchers are working to incorporate genomic resources into PPRL but highlighted the need for a simpler process that generates greater quantities of available sequence data.
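The token-and-match idea behind record linkage of this kind can be sketched in a few lines of Python. This is a simplified illustration and not the N3C implementation: production systems use more elaborate schemes, and this sketch makes no claim to satisfy HIPAA requirements. The function names, the choice of identifiers (name and date of birth), the record layout, and the use of a keyed HMAC-SHA256 hash are assumptions for illustration.

```python
import hashlib
import hmac


def make_token(first: str, last: str, dob: str, secret: bytes) -> str:
    """Derive a keyed, non-reversible token from normalized identifiers."""
    # Normalize so trivial differences in case or spacing do not break matches.
    normalized = "|".join(part.strip().lower() for part in (first, last, dob))
    return hmac.new(secret, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_records(ehr: list, claims: list) -> list:
    """Pair records from two sources whose tokens match."""
    by_token = {rec["token"]: rec for rec in claims}
    return [(rec, by_token[rec["token"]]) for rec in ehr if rec["token"] in by_token]


if __name__ == "__main__":
    key = b"shared-secret-held-by-the-linkage-service"  # hypothetical key
    ehr = [{"token": make_token("Ana", "Lopez", "1980-01-02", key), "dx": "U09.9"}]
    claims = [{"token": make_token(" ana ", "LOPEZ", "1980-01-02", key), "filled": True}]
    print(len(link_records(ehr, claims)))  # number of linked pairs
```

Each site would compute tokens locally and share only the tokens and deidentified records, so the matching step never sees names or birth dates.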

Several challenges persist in linking clinical outcomes, said Haendel. Data from rural and small clinic EHRs are difficult to obtain, resulting in a lack of representativeness. The absence of patient genomic data limits understanding of patient response to infection and modifier variants. Additionally, vaccination data are often incomplete, posing challenges to fully understanding clinical effects. Furthermore, Haendel noted that difficulty in linking viral variant data to patient records has led to use of temporality in lieu of actual known variants to infer correlations.

PHA4GE: BRIDGING THE GAP BETWEEN PUBLIC HEALTH AND BIOINFORMATICS

Christoffels discussed PHA4GE’s efforts to bridge the gap between public health and bioinformatics. PHA4GE originated in 2019 from discussions regarding an incentive gap between academic software development and public health applications, which highlighted the need for a modular and scalable open-source platform for public health bioinformatics. The desire for a community-driven model led to the development of PHA4GE, a global coalition of professionals invested in data generation to inform public health decision making. Much of the operationalization of this endeavor is carried out within working groups, including those focused on sustainability efforts, implementation of public health laboratory best practices, and developing human capital necessary for public health response. The data structures working group determines how to structure data received from public health laboratories, an initiative that pivoted with the advent of SARS-CoV-2 toward pandemic response in 2020. Christoffels outlined the activities of this working group to exemplify PHA4GE community principles and practices in supporting the growing need for data standards at appropriate scale.

Data Structures

Contextual data enables deeper understanding of genetic data, said Christoffels, and the utility, reproducibility, and interoperability of contextual data depend on formal description and data structure. He explained that data structure variability in local databases propagates into public repositories, underscoring the need for transformation and accurate data transposition. Efficient, accurate sharing of data collected in the field, from case forms, and from other local resources is critical during public health emergencies, he stated. The PHA4GE data structures working group established a SARS-CoV-2 contextual data standard to facilitate collection of spatial and temporal data and information pertaining to laboratory specification and sequencing components (Griffiths et al., 2022). Laboratories can apply standardized tags to document the quality of underlying sequence data, which in turn supports efforts to optimize and validate protocols. Christoffels highlighted the value of mapping to standards in ensuring interoperability, noting that CDC is currently developing a cholera data standard and considering the process for mapping ontology registry data to localized data from several countries involved in a recent cholera outbreak.

Recognizing the need to communicate PHA4GE data standards to the scientific community—and specifically to public health laboratories—the working group considered interfaces for practical implementation, Christoffels noted. The group created a collection template for laboratory sites to capture data at the time of sample collection. By soliciting input from stakeholders, including public health laboratories in low- and middle-income countries (LMICs), the working group demarcated fields for variables as required, recommended, or optional. He remarked that this feature proved important in increasing use of the SARS-CoV-2 metadata standard. Additionally, the working group created guidance documents to assist users in collecting appropriate data and to decentralize the data standards. The data collection template enables users to identify the degree to which data can be shared, accounting for factors such as data sensitivity. For example, a provincial-level laboratory sharing data with a national public health laboratory could define data types as accessible to the general public or as requiring legal consideration before sharing. Currently, the data standards working group is exploring whether the Global Alliance for Genomics and Health dual-use ontology tag model could be applied to pathogen datasets, he noted. Christoffels highlighted a PHA4GE package that aids users in preparing data for upload to a range of global repositories,5 an effort aimed at making data standards more accessible, increasing ease in using data-standards tools, and reducing barriers for public health labs.
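The required, recommended, and optional tiers described above can be illustrated with a short Python sketch. The field names and their assigned tiers below are hypothetical and are not taken from the PHA4GE SARS-CoV-2 specification. The sketch shows only how a collection template might report gaps by tier, so that a laboratory can submit once the required fields are complete.

```python
# Hypothetical field names and requirement levels, for illustration only;
# they are not copied from the PHA4GE contextual data standard.
FIELD_LEVELS = {
    "sample_id": "required",
    "collection_date": "required",
    "geo_location": "required",
    "host_age": "recommended",
    "sequencing_instrument": "recommended",
    "travel_history": "optional",
}


def check_record(record: dict) -> dict:
    """Group missing or empty fields by requirement level."""
    missing = {"required": [], "recommended": [], "optional": []}
    for field, level in FIELD_LEVELS.items():
        if record.get(field) in (None, ""):
            missing[level].append(field)
    return missing


def is_submittable(record: dict) -> bool:
    """A record can be submitted once no required field is missing."""
    return not check_record(record)["required"]


if __name__ == "__main__":
    rec = {"sample_id": "S1", "collection_date": "2024-07-22", "geo_location": "South Africa"}
    print(check_record(rec))
    print(is_submittable(rec))
```

Reporting recommended and optional gaps without blocking submission reflects the tiered design: laboratories with fewer resources can still contribute while seeing which additional fields would add value.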

Data Pipelines and Visualization

The PHA4GE pipelines and visualization working group offers analytical and bioinformatic expertise, said Christoffels. In response to requests from public health laboratories in resource-limited settings, the working group developed a publication on tools and approaches for establishing bioinformatics pipelines (Libuit et al., 2024). This document outlines best practices to assist labs in addressing software and infrastructure needs, with consideration given to public access repositories and workflow management systems. He described how a partnership with the Gates Foundation has enabled PHA4GE to collaborate with several countries in Africa and Southeast Asia to develop standards, implement tools, test systems in public health settings, and obtain feedback used in modifying, enhancing, or customizing those systems. Christoffels stated that these examples from two of PHA4GE’s seven technical working groups demonstrate the coalition’s efforts to collaborate at the interface of public health.

___________________

5 The PHA4GE package is available at https://www.protocols.io/workspaces/pha4ge (accessed October 2, 2024).

DATA PROCESSING STRUCTURES AND DETERMINING THE VALUE OF SEQUENCED DATA

Andersen discussed the role of pathogen surveillance in detecting, identifying, and understanding the dynamics of outbreaks. He emphasized the value of international collaboration throughout the phases of viral surveillance, emergence, detection, and interventions at the epidemic or pandemic stages until a virus becomes endemic. Pathogen surveillance can be active or passive, he noted. Active efforts involve researchers entering communities and engaging with individuals to identify illness. Passive efforts, which are more common, apply diagnostic tests to ill patients seeking medical services in health care settings. Andersen noted the potential for discovering expected, unexpected, and novel pathogens through passive and active approaches.

Wastewater Surveillance

Andersen stated that wastewater surveillance is likely to increase in the future due to its ability to detect pathogens from pooled community samples. He added that wastewater treatment infrastructure is not required for wastewater surveillance, as organisms that infect humans can often be found in environmental sources such as ponds and rivers. Wastewater surveillance was used in the detection of polio and became a primary tool for surveying SARS-CoV-2 during the COVID-19 pandemic (Karthikeyan et al., 2022; Levy et al., 2023). Classifying wastewater surveillance as both active and passive, Andersen pointed out that this method searches for pathogens without requiring direct interaction with people. Real-time wastewater monitoring of viral dynamics conducted by University of California, San Diego (UCSD) and Scripps Research for SARS-CoV-2 has also been applied to hepatitis A, mpox, and other diseases (Barnes et al., 2023; Karthikeyan et al., 2022; Levy et al., 2023; Yousif et al., 2023). The San Diego Epidemiology and Research for COVID Health Alliance is a collaboration between Scripps Research, UCSD, public health authorities, and other institutions to perform sequencing on wastewater surveillance samples and track the evolution of SARS-CoV-2 over time. The alliance has applied this research framework in San Diego and in parts of Africa. Collaborations with South Africa revealed early detections of variants in that nation. Wastewater surveillance in South Africa is performed at national and regional levels, enabling variant breakdowns by province. Similarly, the CDC National Wastewater Surveillance System conducts U.S. national and regional COVID-19 monitoring. Andersen emphasized that wastewater will become a more important source of testing as fewer clinical samples will be available with the rise in at-home testing. In research exploring whether immunity profiling is possible via antibody detection in wastewater surveillance streams, early findings indicate differences in immunoglobulin A and G levels in SARS-CoV-2 and influenza A samples. Andersen noted the need for further study to determine the ability to use a multipathogen framework to detect both pathogens and immunity levels.

Closing Gaps in Pathogen Surveillance

Data gaps between pathogen emergence and detection are widespread, said Andersen. For example, the Zika virus was introduced into the Americas in 2013 and circulated for approximately one and a half years before detection (Grubaugh et al., 2019). He added that the unexpected nature of Zika emergence in Florida contributed to a gap in pathogen surveillance that enabled the virus to become entrenched in a community before initial detection (Grubaugh et al., 2019). In West Africa, the 2014 Ebola outbreak was detected an estimated three months after the first human infection, and a similar lag time occurred in detecting H5N1 highly pathogenic avian influenza A in the United States. He noted that the novel SARS-CoV-2 pathogen was present for approximately one month in China before detection at the end of 2019, signifying comparatively rapid detection. The timeframe for U.S. detection of SARS-CoV-2 was of similar length, despite the pathogen being expected by that point, Andersen said. Travel-associated pathogen surveillance that reveals the presence of pathogens can reduce detection gaps by indicating that an outbreak is occurring before large case clusters emerge at a new geographical location. For instance, travel-associated pathogen surveillance identified Zika in travelers and indicated a large epidemic in 2016 that spanned several countries and regions, as well as a sizable outbreak in 2017 for which the location was not immediately known (Grubaugh et al., 2019). The 2017 outbreak was eventually linked to Cuba, and researchers discovered that an aggressive vector-control campaign had delayed the outbreak of the virus in that country by a year, Andersen noted.

SARS-CoV-2 Emergence, Epidemic, and Endemic Phases

Andersen reviewed data indicating that the Huanan Seafood Market in Wuhan, China, was the site of emergence of SARS-CoV-2 in humans, with two SARS-CoV-2 genome lineages both linked to the market (Andersen et al., 2020; Lytras et al., 2021; Pekar et al., 2022; Worobey et al., 2022). Furthermore, Andersen highlighted that early cases and excess deaths were clustered geographically around the market (Holmes et al., 2021; Li et al., 2021; Worobey et al., 2022). Within the market, viral particles were identified on equipment used for selling wildlife such as raccoon dogs and other species known to be susceptible to coronavirus infection, he added. Several variants emerged in the United States in 2020–2021, and models correctly predicted which would become dominant (Washington et al., 2021). As the COVID-19 pandemic shifted toward endemic status and public health measures loosened, genomic surveillance revealed dynamic shifts in the connections between COVID-19 epidemics and in viral strains in the wave patterns of case surges (Matteson et al., 2023). Andersen underscored the value of integrated software and various data sources for a robust understanding of epidemiology and of the functional evolution and dynamics of specific viruses.

DISCUSSION

Data-Sharing Challenges

Noting that countries and laboratories are often reluctant to share data, Bento asked about approaches to overcome barriers to data sharing. Andersen described the critical need for publicly available data, but he acknowledged negative repercussions associated with data sharing. For instance, South Africa has excelled at issuing early warnings about novel SARS-CoV-2 variants, only to have the United States respond with travel restrictions that carried adverse consequences for South Africa. Andersen asserted that the pattern of blaming countries or individuals for outbreaks discourages prompt, early reporting when outbreaks are detected. Moreover, lack of benefit sharing can erode trust between nations. He recounted a pattern in which the United States updated its COVID-19 vaccines using information from novel viral variants sequenced and reported by other countries, such as South Africa, but allocated these vaccines to Americans without ensuring access for the nations that initially reported and shared the pathogen sequences. Adding that the United States and many European countries refused to engage in the pandemic accord and in balanced international vaccine partnerships, Andersen maintained that the United States must share technologies and their benefits to achieve improved global public health.

Emphasizing the importance of early data sharing and global access, Haendel noted that the barriers to sharing clinical data within a country are multiplied when crossing national borders. Clinical informatics is competitive and siloed, given that these data reside in clinical organizations and are only shared under specific circumstances. Distributed research networks were established, in part, to overcome limited data sharing, but greater support is needed for the level of analytics required to address emerging pathogen threats, said Haendel. She outlined N3C’s attribution policies that require any investigator using N3C resources in their analyses to acknowledge the use of those sources, including the data themselves, in any published papers or presentations. Haendel remarked that this attribution policy has yielded robust collaboration and technical provenance capabilities. Moreover, the rapid reuse and validation of existing work fosters corroboration and quality evolution in analytical resources at levels not seen in clinical informatics limited to one independent dataset, she maintained. Haendel underscored the need for incentives and mechanisms for sharing deidentified, high-level summary phenotypic information, biological samples, and structured clinical information.

Christoffels remarked that data-sharing challenges are not limited to LMICs: public laboratories within high-income countries are not always clear on data-sharing parameters. He differentiated between data sharing and data governance, stating that data governance entails the entire data lifecycle, of which data sharing is a component. Demonstrating the value of sharing data for the public good is insufficient given that not all countries have data-transfer agreements in place, he asserted. Building trust between countries entails strengthening ecosystems to accelerate their capacity to share data, said Christoffels. He added that data sharing relies on an environment of fairness and trust and on an ecosystem that facilitates data sharing throughout the application of various jurisdiction rules. Underscoring the importance of fairness and attribution, Sherry stated that attribution systems were designed to emphasize individual names, posing challenges to properly attributing work increasingly produced by teams. Noting lessons learned during the COVID-19 pandemic, he remarked on recent efforts to open the governance model of INSDC and its open data program, including inviting associate members and engaging leadership and representation from the Global South. Sherry stated that attribution policies and incentives work well with human data but are virtually impossible to enforce with microbial and viral data available to the public. Thus, data providers must create new methods for crediting submitters. For instance, NLM and NCBI are exploring how to reflect data submissions in tenure and promotion decisions. Acknowledging numerous complications in determining usage measures and attribution incentives, Sherry emphasized the need for a credible measure of value for submissions that can be tracked to individuals rather than to the entire corpus. Furthermore, systems of governance should consider data repository interoperability, he noted, explaining that a divide between open access and controlled access SARS-CoV-2 databases resulted in inefficiencies and duplicate accessions for submitters. Sherry stated that a governance model that accepts multiple repositories would promote transparency and reduce redundancy in data circulation.

Data Standards

A participant stated that initial H5N1 sequences from retail milk were submitted without biological samples. She asked whether NCBI plans to implement a common pathogen data structure such as Data Object Model and how standards can be better communicated to submitters. Sherry replied that training for data requirements must reach upstream to the individuals collecting samples, and that samples as important as H5N1 will not be rejected due to imperfect or low-quality metadata. However, efforts to develop richer standards are ongoing and may be specific to a species or pathogen, depending on the outbreak or collection mechanism, said Sherry.

Data Corrections and Governance

An attendee highlighted the difficulty of working with flawed data submissions, citing both the time required to identify incorrect submissions and the data systems’ requirement that corrections be made by the original submitters, and asked how these barriers can be addressed. Sherry replied that GenBank and SRA were commissioned as research repositories to accompany papers, and they follow the long tradition that paper authors hold the right to stand by the quality or the imperfection of the record. However, the goal of preserving the scientific record poses time-sensitive challenges to public health surveillance for diseases, which must be operational in real time, he explained. Reusing data for public health surveillance requires prompt, accurate data to avoid the reintroduction of noise into sensitive analyses, so new governance models are needed, he acknowledged. He noted that SRA is incorporating third-party contextualization of datasets by creating cloud-based extensional frameworks. These frameworks enable third parties to use SRA data, add value to interpret data, amplify the metadata, and correct claims or allow crowdsourced corrections. Sherry highlighted that a next step could involve moving GenBank and associated sequences into the cloud with a similar framework. Emphasizing the need for governance, he underscored the potential for third parties to incorrectly comment on or adjust records, thereby generating additional noise. Sherry remarked that continuing to prohibit individuals from changing one another’s records while allowing the addition of statements of adjustment is likely the most balanced approach. Haendel added that this issue also pertains to clinical data; she emphasized the need to preserve the full provenance and allow tracked corrections and improvements to the records. The governance process in adjudicating records and ensuring proper attribution is important to limit the potential of malicious actions, Haendel maintained.

Emergency Data-Use Agreements

Noting the utility of emergency use authorizations, a participant proposed establishing emergency data-use agreements that would go into effect should a specified set of conditions be met and asked for the panelists’ thoughts. Andersen remarked that the United States has not signed the Nagoya Protocol on Access and Benefit-Sharing, an international treaty adopted by the United Nations Convention on Biological Diversity in 2010 that addresses issues related to fairness and the benefits of sharing pathogen and human genetic data. Although the Nagoya Protocol is not a data-use agreement, it forms a higher framework in which countries agree to share data, collaborate, and receive equally distributed benefits from these efforts. Christoffels stated that the COVID-19 pandemic underscored the need for partnerships between academic and national public health laboratories and said rigidity within university systems posed barriers to implementing emergency use protocols. He highlighted an opportunity to sensitize government institutions to the need for emergency data-use agreements, noting the importance of clearly specifying authority among agencies. Addressing complexities at the national level would better enable countries to consider emergency use agreements on a global scale, said Christoffels.

Expansion of Data-Linking Approaches

An attendee asked about the potential for expanding data linking to all public health and clinical data, and whether this framework could be applied in other countries. Haendel replied that N3C was made possible by a public–private government partnership and governance structure in the United States, with NIH serving as a trustworthy host for data stored in a highly secure enclave environment. She said sites were given the latitude to determine the types of data they would submit and added that no issues with inappropriate access or use have arisen with these data. The N3C purview limits submissions to data related to COVID-19. However, the expansive dataset is beneficial to researchers approaching both COVID-19 and other areas, Haendel added. For example, in her work on rare disease patients, N3C data enabled Haendel to build computable phenotyping algorithms, run risk prediction models, and complete other activities that would have been impossible using a small dataset limited to the patients within a single institution’s database. Haendel said she obtains access to N3C data in this work by examining the risk of COVID-19 and its outcomes within the patient population she studies.

Haendel described efforts to expand this approach into other areas, including building enclaves for kidney disease research and for cancer research data. Individual institutions submitting data to N3C will have the authority to determine whether data can be used for other research areas. Currently, commercial parties are able to use deidentified data, but not HIPAA-limited data. Similarly, investigators in other nations can access deidentified data, with access to more sensitive data limited to institutional review board (IRB) oversight at U.S. institutions. She explained that linkage technology is straightforward and can be applied in any context for unconsented data, and consented data are more readily linked. Noting collaboration with the All of Us Research Program, which uses only consented data, Haendel stated that these data provide more robust information for individual patients. Haendel acknowledged that although such data are incredibly valuable and readily linkable, most researchers require additional training on data risk determination processes to reduce the risk of reidentification.

Barriers to Public Health Applications of Pathogen Genomics Data

Noting that systems for handling, sharing, and attributing human genomic datasets are often built for time scales and applications different from those in public health, a participant asked whether advances in other fields could yield technical solutions for pathogen genomics. Sherry remarked on a growing trend toward collaborative platforms featuring standardized execution workflows. These cloud-based collaborative platforms facilitate workforce development goals, standardize workflows, provide transparency in a framework that allows modification and elaboration, and can be used internationally, he explained. This technology runs continuously, making it appropriate for endemic surveillance, and can pivot to rapid response applications. Currently, NLM is working to find partners in this space, said Sherry. Acknowledging temporality issues, Haendel remarked that accessing U.S. clinical data is typically a lengthier process than performing PPRL. Underscoring the need for national governance to enact interoperability requirements, she pointed out that fragmentation in U.S. health care makes it difficult to leverage technologies at the point of care to influence the outcomes of care. Thus, the limited capacity of clinical organizations to exchange and integrate data poses greater barriers to public health applications than does PPRL, said Haendel.
