New approach methods (NAMs) include nonanimal and other approaches, such as nonmammalian whole animal models, that can inform decision-making on human health hazards. These tools and technologies are intended to provide more refined and timely toxicological insights to inform hazard and risk assessment. Importantly, they may also increase coverage of the many chemicals that lack existing laboratory mammalian toxicity studies. As such, development of NAMs provides opportunities to modernize and advance risk assessment approaches that rely on laboratory mammalian toxicity tests and to serve as inputs for generating risk assessments of chemicals where little to no laboratory mammalian toxicity data are available.
However, there are barriers to incorporating NAM data into human health risk assessments, with few example applications available to guide future progress. Scientific confidence in NAMs to support actions remains an important challenge, and currently regulatory toxicologists, risk assessors, and decision-makers continue to rely largely on evidence from laboratory mammalian toxicity studies as actionable per existing statutes and practices. Further, an overall strategy for determining the acceptability of NAMs for human health risk assessment remains to be developed.
In this context, the Environmental Protection Agency (EPA) sponsored a study to review the “variability and relevance of existing laboratory mammalian toxicity tests for human health risk assessment to inform the development of approaches for validation and establishing scientific confidence in using New Approach Methods (NAMs), and recommendations on expectations associated with NAMs when they cannot be compared with human studies.”1 The committee was charged to review relevant evidence on the variability and concordance of mammalian toxicity tests, and the implications of that evidence for the performance of NAMs, by reviewing the literature and conducting information-gathering sessions, including two public workshops. Other aspects of the charge involved addressing related issues regarding validation of NAMs that are not intended as one-to-one replacements for in vivo toxicity tests and assessing concordance of data from human in vitro assays with toxicity data from animal models. Finally, the charge called for the committee to consider how this information may be incorporated into a validation paradigm or scientific confidence framework. An overview of the committee’s approach depicting the roadmap of information sources, approaches, and report chapters is shown in Figure S-1.
To address the issues of variability and concordance in its charge, the committee assembled and evaluated a variety of data sources, including a review of existing systematic reviews and authoritative reviews, supplemented by a review of peer-reviewed publications on variability and concordance presented to the committee during information-gathering sessions. To develop its recommendations, the committee relied on studies of higher methodological quality in terms of risk of bias. In addressing the aspects of its charge related to the implications of these data, the committee focused on the context of human health risk assessment (rather than, for instance, priority setting or tiered testing). Recognizing that risk assessment methods have evolved to be based on systematic evidence reviews and formal methods to integrate evidence from different data streams, the committee highlighted opportunities for advancing progress in applying NAMs to inform hazard and dose-response assessments.
___________________
1 See https://www.nationalacademies.org/our-work/variability-and-relevance-of-current-laboratory-mammalian-toxicity-tests-and-expectations-for-new-approach-methods--nams--for-use-in-human-health-risk-assessment.
This report aims to inform approaches to building confidence for applying NAMs to advance human health risk assessment and, consequently, risk management decisions. This summary provides an overview of the committee’s main findings and recommendations.
To inform its work, the committee reviewed background information regarding the use of existing laboratory mammalian toxicity studies in human health risk assessment, including relevant recommendations in several seminal reports from the National Academies of Sciences, Engineering, and Medicine (NASEM).
The goal of toxicity testing for human health risk assessment is to protect human health, including (and prioritizing) susceptible and vulnerable subpopulations. The 2007 report Toxicity Testing in the 21st Century: A Vision and a Strategy proposed that new in vitro tools and computational systems biology approaches could provide sufficient insight to bridge the gap between chemical effects measured in human cells and health effects in people. Subsequent reports, notably Science and Decisions: Advancing Risk Assessment (2009) and Using 21st Century Science to Improve Risk-Related Evaluations (2017), provided salient advice for toxicity testing under the specific context of use of human health risk assessment, particularly with respect to (1) expanding beyond the goal of one-to-one replacement of animal tests; (2) expanding whole animal testing beyond “standard” guideline-like studies; (3) addressing human variability; (4) addressing challenges in assay performance and validation; and (5) implementing best practices for data integration and interpretation. In addition, these NASEM reports have recommended expanding the information used for risk assessment across the continuum of approaches and methods in toxicology and epidemiology (Figure S-2), moving away from apical endpoints to encompass “upstream” biomarkers, toxicity pathways, or key characteristics.
Regarding the definition of terms, prior NASEM reports employed more precise and descriptive terms for specific types of assays or information sources rather than using the term “NAM.” The committee found the EPA’s definition of NAMs (see p. xxi) problematic for several reasons. In particular, the goal to “avoid the use of animal testing” is inappropriate given that NAMs encompass nonmammalian whole animal tests, including those in zebrafish. In addition, NAMs using primary cells are not animal studies but require animals as the source of cells. Further, NASEM has endorsed methods intended to complement animal data, including transgenic and population-based rodent models, nonrodent species, and targeted testing. Moreover, terms such as NAMs create a false dichotomy between data streams, all of which can be informative for human health risk assessment. Therefore, the committee did not restrict its consideration only to assays that might be considered NAMs under the EPA’s definition, adopting the NASEM (2017a) view that these issues extend to “development of the most salient and predictive assays for the endpoint or disease being considered.”
Recommendation 2.1: The EPA should continue to use previous NASEM reports, especially Toxicity Testing in the 21st Century: A Vision and a Strategy, Science and Decisions: Advancing Risk Assessment, and Using 21st Century Science to Improve Risk-Related Evaluations, for advice and recommendations as to how to improve toxicity testing and human health risk assessment.
Recommendation 2.2: The committee recommends that the EPA broaden the definition of NAM so that it encompasses the range of strategies and approaches as discussed in the 2017 report Using 21st Century Science to Improve Risk-Related Evaluations. The EPA should develop additional terms to refer to specific subsets of approaches, such as by technology type, toxicity indicator, or location on the exposure-to-outcome continuum, as shown in Figure S-2.
The committee considered both experimental and intrinsic biological variability in its deliberations (Figure S-3; see Box S-1 for definitions). In any biologically based assay system, there will always be intrinsic and irreducible biological variation that cannot be eliminated from the assay. In practice, it can be difficult to distinguish between intrinsic biological variability and experimental variability. However, variability is not fundamentally a negative attribute. Minimizing variability may limit the generalizability of the results by removing sources of biological variability that are informative as to the distribution of toxic response.
In its review of systematic reviews and authoritative reviews, the committee found moderate to high heterogeneity in the data analyzed across the limited number of higher quality systematic reviews and meta-analyses that reported variability data on laboratory mammalian toxicity studies intended to measure the same phenomenon. The heterogeneity was attributable, to some extent, to the different designs of the studies analyzed. Importantly, due to concerns about extending these findings on variability to other chemicals for the same endpoints, or the same chemical with different endpoints, broad and generalizable conclusions from the systematic reviews regarding the qualitative and quantitative variability of laboratory mammalian toxicity studies could not be drawn. As a class of studies, the studies cited in the workshops that were not systematic reviews lacked a prespecified protocol, comprehensive search strategy, risk of bias evaluation, and other methodological design elements for minimizing bias. Thus, any conclusions derived from this literature are to be interpreted with caution.
Recommendation 3.1: The EPA should refrain from trying to identify a threshold of acceptable variability derived from laboratory mammalian studies to apply across all NAMs and/or endpoints.
Recommendation 3.2: Overall, the EPA should aim to establish the performance of NAMs primarily based on their intrinsic performance characteristics (e.g., within- and between-laboratory repeatability, robustness, applicability domain) and their value with respect to protecting human health (e.g., external validity, discussed subsequently), rather than benchmarking based on the variability of existing laboratory mammalian studies.
Additional recommendations relevant to variability include the following:
Recommendation 5.10: In its evaluation of test methods, the EPA should prioritize increasing external validity (discussed subsequently) through broader coverage of biological variability. One strategy that may be useful could be to use a battery of assays to encompass greater biological variability, while designing each assay so as to minimize experimental variability.
Recommendation 5.11: For any test method intended for use in risk assessment, whether in vivo, in vitro, in silico, or otherwise, particularly in a context where there are no other data (laboratory mammalian or human data), the EPA’s tolerance of variability should be driven by an analysis of the different levels and types of variability and of their impact on the test method’s internal and external validity (discussed subsequently). This analysis should also take into account the test method’s purpose and context of use.
The committee’s review considered cross-species concordance both in toxicokinetics and in outcomes. For toxicokinetics, it found differences in quantitative aspects but general similarities in qualitative aspects (e.g., the major elimination pathways, enzymes involved, metabolites formed, and disposition in the body). Less concordant aspects across animal species, and between animals and humans, may include oral bioavailability, skin absorption rates, and inhalation dosimetry.
Regarding concordance in outcomes across species, the higher quality systematic reviews and authoritative reviews showed laboratory mammalian toxicity tests (guideline and nonguideline) can generally identify human health hazards for a range of adverse health outcomes, but not necessarily the extent of response at a given dose. Laboratory mammalian toxicity testing can generally identify dichotomous higher-level outcomes, such as cancer, but has not been as successful for other complex health endpoints, such as developmental neurotoxicity and mammary gland effects. This is due in part to the lack of alignment of methods and endpoints across experimental species and humans. Importantly, the committee’s literature review established that there are very few high-quality systematic reviews with evidence to evaluate concordance. As a class of studies, the studies cited in the workshops that were not systematic reviews lacked a prespecified protocol, comprehensive search strategy, risk of bias evaluation, and other methodological design elements for minimizing bias. Thus, any conclusions about concordance derived from this literature are to be interpreted with caution.
Recommendation 4.1: To better understand animal-to-human concordance in toxicokinetics, the EPA should systematically review existing data. One opportunity is to conduct a systematic review and/or meta-analysis on the existing primary in vitro hepatocyte metabolism and protein binding data, including any available in ToxCast (i.e., in rats and humans).
Recommendation 4.2: To evaluate concordance of outcomes:
Recommendation 4.3: In order to implement the recommendations from the report, the EPA, in collaboration with other agencies and the Organisation for Economic Co-operation and Development (OECD), should convene scientific advisory groups that include appropriate subject matter and community health expertise, including clinicians, to review and update mammalian toxicity testing study designs to make them more specific, sensitive, and better aligned with the 3 R goals.2 This process should provide opportunities to add endpoints with human relevance that are inadequately covered in laboratory mammalian toxicity tests such as developmental neurotoxicity and mammary gland effects.
Recommendation 4.4: Due to the known limitations in animal-to-human concordance, the EPA should not use laboratory mammalian toxicity tests as the sole factor in determining internal and external validity for acceptance of NAMs into regulatory practice.
___________________
2 Replace animal use, reduce the number of animals required for a test procedure, and where animals are still required, refine testing procedures to lessen or eliminate unrelieved pain and distress.
The overall purpose of toxicological test methods in the context of human health risk assessment is to inform (1) hazard identification, the level of evidence for a causal relationship between exposure to an agent and an effect and (2) dose-response assessment, the quantitative relationship between exposure and the incidence or severity of an effect, in humans, including susceptible and vulnerable subgroups. For both human studies and laboratory mammalian toxicity studies, there is a large literature on structured, systematic-review approaches for establishing scientific confidence for hazard and dose-response, including recommendations from numerous NASEM reports. However, these methods are typically applied retrospectively, to evaluate the existing body of evidence for hazard identification and dose-response assessment for a particular chemical. By contrast, most of the discussion around scientific confidence of NAMs has focused on evaluation of assay design and NAM-based testing strategies, where the goal is to determine whether a NAM will generate data acceptable for use in hazard identification and dose-response assessment.
Therefore, in developing its findings and recommendations on scientific confidence of NAMs, the committee aimed to integrate and bridge these different perspectives, so as to enable a seamless handoff between evaluation frameworks supporting NAM-based testing strategies and the incorporation of NAM data into modern, systematic-review-based risk assessment frameworks.
The committee identified a number of common themes based on consideration of recently published proposed scientific confidence frameworks to further the acceptance and use of NAMs. Overall, the different validation and scientific confidence framework proposals for NAMs can be consolidated into five key components, depicted in Box S-1.
Recommendation 5.1: Because the term “fit for purpose” is often vague and poorly defined, the EPA should use a different term such as “intended purpose and context of use,” which is recommended by the committee.
Recommendation 5.2: Any scientific confidence framework adopted by the EPA related to use of NAMs for hazard identification and dose-response assessment should include “provides information for the protection of public health, including vulnerable and susceptible subpopulations” as part of its “intended purpose and context of use.”
Recommendation 5.3: The EPA should avoid the term “validity” without modifiers because it can mean a wide range of different concepts. The committee recommends that the EPA use the terms “internal validity,” “external validity,” “experimental variability,” and “biological variability,” as defined by the committee.
Specification of the relevant population, exposure, comparator, and outcomes (PECO) statement is a cornerstone of systematic review methodology as this clarifies the question that a review intends to address. PECO statements are commonly used in methods to assemble laboratory mammalian toxicity tests and human epidemiologic studies and facilitate evidence synthesis and integration for human health hazard identification and/or dose-response assessment. For laboratory mammalian toxicity tests, there is a general understanding that the test methods are intended to be surrogates for a corresponding “target human” PECO for the same biological tissue or system, though this is not always explicitly stated. Because PECO statements are not currently routinely used for in silico, in vitro, and nonmammalian toxicity tests, it may be useful to define a target human PECO for a test method that provides information as to how it would inform human health hazard identification or dose-response. This “parallel PECO” approach also provides a common framework for grouping toxicity testing methods for comparison. In addition, the target human PECO identifies what human data would be needed for assessing concordance. Thus, different test methods with similar target human PECOs can be compared to each other in terms of consistency or corroboration.
Recommendation 5.4: The EPA should utilize parallel PECO statements as part of its specification of the intended purpose and context of use of all types of toxicity testing methods, including laboratory mammalian in vivo tests as well as in silico, in vitro, and nonmammalian test methods. This concept consists of an intended target human PECO and a “test method” PECO for any NAM intended for use in human health hazard identification or dose-response assessment.
The EPA should include or require test developers to document in meta-data the test method and target human PECOs, with sufficient information to adequately describe the assay’s domain of applicability (e.g., types of agents that should or should not be tested, the extent to which metabolic activation is included), and any limitations in terms of biological coverage (e.g., whether it has adequate coverage of the biological processes related to an endpoint). An additional benefit of specifying these parallel PECO statements is that they provide an explicit statement as to the hypothesis for concordance that can be helpful in evaluating external validity (discussed subsequently).
Together, this information will support downstream stakeholders and users in assessing the potential impact of assay results on human health risk assessment. It will also clarify the purpose of a test method and its value for hazard identification or classification or for deriving a point of departure.
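As an illustration only, the parallel PECO meta-data described above could be captured in a simple structured record. The sketch below is not an EPA or committee specification: the class names, field names, and example entries are all hypothetical, chosen to show how a test method PECO and a target human PECO might be documented together, and how methods sharing a target human PECO could then be grouped for comparison.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class PECO:
    """Population, exposure, comparator, and outcome for one question."""
    population: str
    exposure: str
    comparator: str
    outcome: str


@dataclass
class MethodRecord:
    """Hypothetical meta-data record for one toxicity test method."""
    name: str
    test_method_peco: PECO       # what the assay itself measures
    target_human_peco: PECO      # the human question it is meant to inform
    applicability_domain: list[str] = field(default_factory=list)
    coverage_limitations: list[str] = field(default_factory=list)


def group_by_target(records: list[MethodRecord]) -> dict[PECO, list[str]]:
    """Group method names by their shared target human PECO."""
    groups: dict[PECO, list[str]] = {}
    for record in records:
        groups.setdefault(record.target_human_peco, []).append(record.name)
    return groups


# Illustrative entries only; these are not real assays or data.
target = PECO(
    population="humans, including susceptible subgroups",
    exposure="oral exposure to the test chemical",
    comparator="unexposed or lower-exposed humans",
    outcome="liver toxicity",
)

in_vitro = MethodRecord(
    name="hypothetical hepatocyte assay",
    test_method_peco=PECO(
        population="primary human hepatocytes",
        exposure="chemical concentration in culture medium",
        comparator="vehicle control",
        outcome="cytotoxicity marker",
    ),
    target_human_peco=target,
    applicability_domain=["not suitable for volatile chemicals"],
    coverage_limitations=["limited metabolic activation"],
)

rodent = MethodRecord(
    name="hypothetical 90-day rat study",
    test_method_peco=PECO(
        population="laboratory rats",
        exposure="oral gavage dose",
        comparator="vehicle control",
        outcome="liver histopathology",
    ),
    target_human_peco=target,
)

groups = group_by_target([in_vitro, rodent])
```

Here both records share one target human PECO, so they fall into a single group. This mirrors the point that different test methods with similar target human PECOs can be compared in terms of consistency or corroboration.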
Numerous frameworks articulate approaches for evaluating risk of bias for randomized control trials, human epidemiologic studies, and laboratory mammalian toxicity studies. These frameworks have evolved over time and are developed from theoretical and empirical bases for different contributors to risk of bias. For nonmammalian, in vitro, and in silico test methods, there is less literature related to risk of bias, though some areas of bias have been documented. Some of the domains used for experimental animal studies may be transferable, while others may be specific to individual types of test methods.
Recommendation 5.5: The EPA, in collaboration with other entities such as the National Toxicology Program (NTP), should develop and utilize approaches for evaluating risk of bias for methods besides epidemiology and laboratory mammalian toxicity tests. Such approaches should be based on the four core principles articulated by Cochrane3. The EPA should be judicious in identifying existing risk-of-bias domains used for randomized control trials, human epidemiologic studies, and laboratory mammalian toxicity studies that may be transferable, and those risk-of-bias domains that may be unique to other types of toxicity testing methods.
Recommendation 5.6: The EPA, in coordination with other entities such as the NTP and OECD, should require assay developers, when designing or developing protocols for NAMs, to include recommended steps to minimize the risk of bias in the conduct of the study and subsequent data analyses. The elements to minimize risk of bias should be empirically and/or theoretically based. Note that reducing risk of bias (systematic error) is distinct from reducing experimental variability (random error), although some experimental strategies may reduce both.
Many existing frameworks for evaluating NAMs contain multiple elements of validation that can be grouped under the concept of external validity. As shown in Figure S-4, external validity can be conceived as the extent to which the test method informs the target human PECO. Many published scientific confidence frameworks discuss this issue in terms of concepts such as “biological relevance,” “predictive capacity,” “applicability domain,” and “concordance,” but most do not consolidate them together or provide a structured framework for addressing all of them together under a single unifying concept. Currently, NAMs are generally considered in the evidence integration step, typically to elevate confidence in the animal or human streams of evidence. When considered as part of evidence integration, limitations in external validity are an implicit part of how different types of evidence are grouped.
The committee identified the following common components of external validity, which can be defined within the “indirectness/directness” domain of evidence synthesis:
___________________
3 Higgins, J. P. T., J. Thomas, J. Chandler, M. Cumpston, T. Li, M. J. Page, and V. A. Welch. 2022. Cochrane Handbook for Systematic Reviews of Interventions. Chichester (UK) John Wiley & Sons.
Each of these may include both qualitative and quantitative considerations. As discussed previously, high-quality systematic reviews and authoritative reviews provide the highest confidence information for concordance, but even in the absence of such reviews, an evaluation of overall external validity can be conducted.
Recommendation 5.7: The EPA, in collaboration with other entities such as the NTP, should develop and utilize structured approaches for evaluating the external validity of test methods, including NAMs. This would entail developing guiding questions to facilitate a structured approach similar to the approaches used for evaluating risk of bias of NAMs. Similar to current best practices for evaluation of internal validity, the committee recommends against aggregating individual domain ratings into a single quantitative score. Example domains and guiding questions are shown in Table 5-3, with examples of their application in Tables 5-4 to 5-8.
Recommendation 5.8: The EPA should not “double count” limitations in external validity in both evidence synthesis (e.g., downgrading evidence for (in)directness) and evidence integration (e.g., limiting how strongly nonhuman evidence can support a hazard conclusion).
Recommendation 5.9: Because concordance of a test method is defined with respect to the target human PECO, the EPA should broaden the considerations for evaluating concordance beyond comparisons to laboratory mammalian toxicity tests.
The evaluation of the scientific confidence of a NAM may be conducted in a general context focused on design, evaluation, and utilization of NAM-based testing strategies for the generation of data for assessing chemicals, including data-poor chemicals. Relevant scenarios include
The committee found that one-size-fits-all criteria for the acceptability of NAM-based testing strategies are inappropriate because the various elements of scientific confidence have different strengths and weaknesses. There is a need to consider what choice best promotes the overall goal of protection of public health, which may differ depending on the particular context of use (e.g., filling a data gap, complementing existing data, or offering an alternative).
In addition, the committee found that although many NAMs have been developed and evolved, there is no authoritative source or compendium of available tests and technologies along with documentation of their reliability and description of their potential for applicability. Significant gaps in coverage also remain to be identified.
Recommendation 5.12: The EPA should establish the acceptability of NAM-based testing strategies based on each specific purpose and context of use. The EPA should be transparent as to the level of scientific confidence that results from examining the NAM’s internal validity, external validity, and variability.
Recommendation 5.13: For the regulated community, the EPA’s goal should be to provide lists of acceptable NAM-based testing strategies under different purposes and contexts of use in order to establish confidence that NAM-derived data submissions to the agency will be integrated into decision-making (discussed in the next section). This could be accomplished through the EPA working with partners in the U.S. government and appropriate international organizations to develop a harmonized registry of toxicity testing methods documenting their purpose and context of use (including parallel PECO statements), internal validity, external validity, and variability.
Numerous NASEM committees have recommended that government programs, including the NTP, EPA, and Department of Defense, apply rigorous systematic review-based approaches in the human health risk assessment of chemicals. Systematic review aims to minimize risk of systematic and random error and maximize transparency of decision-making. Thus, systematic review-based risk assessment frameworks provide a means of transparently and rigorously organizing, synthesizing, integrating, and evaluating relevant data.
Modern approaches of systematic review and evidence integration have largely been applied to human and laboratory animal evidence, but the general principles are applicable to other toxicity testing data used for hazard identification or dose-response. Figure S-5 illustrates a generic framework for human health hazard assessment, as well as how the concepts related to evaluating scientific confidence of NAM-based testing strategies can interface with a generic systematic review-based framework for human health hazard assessment.
In current human health risk assessment approaches, studies other than human and laboratory animal evidence are typically lumped into a category of “mechanistic” evidence. It can be challenging to integrate such mechanistic evidence because of the diversity of study types and endpoints it entails. However, the use of parallel PECO statements (Recommendation 5.4) would enable the creation of new NAM-based evidence streams that represent bodies of evidence from studies (other than epidemiologic and laboratory mammalian studies) that are to be used as a basis for hazard identification and dose-response assessment. This is particularly important because, for the many chemicals presently in commerce and those that will be in the future, evidence from human and laboratory mammalian studies is lacking, and approaches to identify human health hazards in the absence of such data are needed in order to inform risk management decisions to protect human health.
Indeed, evidence integration frameworks exist that include approaches for making hazard identification conclusions without human or experimental animal evidence. In addition, continued development of quantitative extrapolation approaches to derive health protective toxicity values using data from toxicity testing approaches other than human epidemiologic and laboratory mammalian toxicity studies will be needed. Data generated from evaluation of quantitative concordance with humans, as recommended previously, will be useful in developing such approaches. Moreover, NAMs provide an opportunity for the EPA to better evaluate the range of human variability and incorporate it into its dose-response assessments for all endpoints, in line with previous NASEM recommendations.
Recommendation 5.14: The EPA should develop and utilize a framework for hazard identification and deriving toxicity values protective of public health that does not require human epidemiologic or laboratory mammalian toxicity data. This framework should also enable NAM-based data to be integrated with human epidemiologic and laboratory mammalian toxicity data. In so doing, the EPA should continue to follow previous NASEM recommendations related to systematic review and risk assessment. This will ensure a seamless handoff between evaluation of NAM-based testing strategies and evaluation of scientific confidence of NAM data for individual chemicals.