This chapter presents the results of the literature review, workshops, and committee deliberations with respect to how information from Chapters 3–4 on variability and concordance might be incorporated into a new or existing validation paradigm or scientific confidence framework. This information will help the U.S. Environmental Protection Agency (EPA) ensure that new approach methods (NAMs) are equivalent to or better than the animal tests they replace with respect to predicting and preventing potential adverse responses in humans.1 It also addresses the evaluation of the validity of assays that are not intended as one-to-one replacements for in vivo toxicity assays, an issue left unresolved in the National Academies of Sciences, Engineering, and Medicine (NASEM) report Using 21st Century Science to Improve Risk-Related Evaluations (NASEM, 2017a).2
As noted in NASEM (2017a), there is a long history of formal mechanisms for validation of so-called “alternative” methods, whereby the “reliability and relevance” of an approach is established “for a defined purpose.” Note that the phrase “alternative methods” shares some of the same problems as the term “NAM” in that methods that are at one time alternative and new can evolve into standard and routine. In 2005, the Organisation for Economic Co-operation and Development (OECD) published a Guidance Document on the Validation and International Acceptance of New or Updated Test Methods for Hazard Assessment, which outlines standards for establishing the transparency, reliability, and relevance of methods used for regulatory decision-making (OECD, 2005). Referred to as “GD 34,” the approach was intended to be modular and sufficiently flexible to adapt to a variety of methods, with not all methods requiring fulfillment of all modules for implementation. In practice, the guidance has been rigidly interpreted, particularly for in vitro methods, where the absence of the full suite of physiological interactions that occur within living organisms may call into question the relevance of responses to humans. The same principles described in GD 34 were extended to computational methods in the OECD Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q)SAR] Models (OECD, 2007). Overall, it is recognized that many aspects of these guidance documents remain applicable to NAMs. However, the interpretation of “reliability” and “relevance” can be reconsidered, as they have traditionally been relatively narrowly defined, relating to reproducibility “within and between laboratories” and to evaluating whether the “effect of interest … is meaningful and useful for a particular purpose.”
___________________
1 Charge question 5. Based on the conclusions from 1–4 above, how may the committee foresee this information being incorporated into a new or the existing validation paradigm or scientific confidence framework so that EPA can ensure that NAMs are equivalent to or better than the animal tests replaced?
2 Charge question 4a. Evaluation of the validity of assays that are not intended as one-to-one replacements for in vivo toxicity assays; and Charge question 4b. Assessment of the concordance of data from assays that use cells or proteins of human origin with toxicity data that are virtually all derived from animal models.
Therefore, NASEM (2017a) stated that before new assays, models, or test methods are used in regulatory decision contexts, their relevance, reliability, and “fitness for purpose” should be evaluated. However, NASEM (2017a) crucially noted that the rapid pace of assay development could not be matched by existing processes for validation. The report therefore delineated a number of critical considerations in the validation process for new assays, models, or test methods.
Validation of methods therefore depends on the intended use. Indeed, researchers developing assays and methods may not always understand how those methods may be used in risk assessment applications. Though explicitly stated in the 2005 OECD guidance document, the fit-for-purpose, flexible approach has often been overlooked in favor of a “box checking” approach to validating regulatory methods. The suitability of NAMs must be considered in the context of their intended application, and thus the criteria for validity will necessarily vary.
Although recommending “fit-for-purpose validation” as the overall approach, NASEM (2017a) noted that one of the main challenges is finding appropriate comparators depending on the decision context. In particular, the report noted considerable disagreements about the quality, or even the existence, of “gold standards,” and suggested that ultimately expert judgment would need to be exercised in the selection of comparators. Two key issues specifically highlighted by NASEM (2017a) are topics that this committee was tasked to comment on: the evaluation of the validity of assays that are not intended as one-to-one replacements for in vivo toxicity assays, and the assessment of the concordance of data from assays that use cells or proteins of human origin with toxicity data that are virtually all derived from animal models.
Thus, recognizing that “validation” has evolved beyond lab-to-lab reproducibility to include issues related to “fit for purpose” for a much wider range of decision contexts than previously considered, the language surrounding this process has also evolved. In particular, the term “scientific confidence framework” has been adopted by many in the field to encompass this broader scope and was adopted by the committee to denote this broader process.
Multiple organizations have recently published strategies and roadmaps, many with proposed scientific confidence frameworks, to further the acceptance and use of NAMs. Some examples of such frameworks are described in Appendix D, though the committee did not perform a comprehensive search of all proposed strategies.
Overall, these and other proposed confidence frameworks for NAMs exhibit a number of common themes.
Finding: The committee found that the different validation and scientific confidence framework proposals for NAMs can be consolidated into five key components, as detailed in Box 5-1.
Finding: Although the overall mission of the EPA is to “protect human health and the environment,” protection of public health was not commonly identified as an element of scientific confidence.
Recommendation 5.1: Because the term “fit for purpose” is often vague and poorly defined, the EPA should instead use a more specific term such as “intended purpose and context of use,” as recommended by the committee.
Recommendation 5.2: Any scientific confidence framework adopted by the EPA related to use of NAMs for hazard identification and dose-response assessment should include “provides information for the protection of public health, including vulnerable and susceptible subpopulations” as part of its “intended purpose and context of use.”
Recommendation 5.3: The EPA should avoid the term “validity” without modifiers because it can mean a wide range of different concepts. The committee recommends that the EPA use the terms “internal validity,” “external validity,” “experimental variability,” and “biological variability,” as defined by the committee.
The need to define the scope and purpose of a NAM was highlighted in the NASEM (2017a) report and is included in many proposed confidence frameworks for NAMs. It has also been recognized that structured frameworks are beneficial in clearly framing the “question” that is being asked in environmental health-related research, and the concept of using population, exposure, comparator, and outcomes (PECO) to define scope and purpose is increasingly accepted (Morgan et al., 2016, 2018). PECO statements for the test method also facilitate evidence synthesis and integration in hazard identification and human health risk assessment by providing explicit inclusion/exclusion criteria both within and across different types of evidence (NASEM, 2018).
Because PECO statements are not currently routinely used for in silico, in vitro, and nonmammalian toxicity tests, it may be useful to define a “target human” PECO for a test method that describes how the method would inform human health hazard identification or dose-response assessment. For laboratory mammalian toxicity tests, this has generally not been considered necessary because it is generally assumed (though not always explicitly stated) that the outcome observed in the test method corresponds to a similar “target human” outcome. In some instances, the test method and target human outcome are both highly specific, such as acetylcholinesterase inhibition. In other cases, the target human outcome is a general category, such as liver effects or developmental effects. For cancer, it is generally recognized that, in the absence of additional scientific information, laboratory mammalian studies are indicators of potential carcinogenic hazard to humans (IARC, 2019). However, this does not imply concordance in tumor sites across species, which indeed is not always observed (Baan et al., 2019). In addition, for some
in vitro and nonmammalian toxicity tests, such as assays utilizing primary or induced pluripotent stem cell (iPSC)-derived human cells and zebrafish, the correspondence to target human outcomes may also be generally understood, though not always explicitly stated. For other types of assay systems, however, the corresponding human outcome, if any, may be less clear.
The use of parallel test method and target human PECO statements can equally apply to situations where nonmammalian test methods are not one-to-one replacements for animal tests, such as when no animal test exists or when tests are combined as a “battery” to cover a “biological domain” (NRC, 2007; Piersma et al., 2018). For instance, nonmammalian test methods can be combined (e.g., via a “defined approach” based on an adverse outcome pathway, such as for skin sensitization) to predict a specific toxic effect or to cover organs or organ systems (e.g., to broadly cover developmental neurotoxicity) (OECD, 2021a). For example, the well-known limitations of guideline rodent assays for identifying human developmental neurotoxicants were discussed in Workshop 2. Rather than trying to redesign an improved mammalian in vivo test, there is an effort under way to develop batteries of in silico, in vitro, and nonmammalian whole-animal assays designed to capture key biological processes in brain development (Bal-Price et al., 2018). The results of these test batteries could be considered as a single unit of evidence covering a broader target human PECO, though for such complex outcomes, multiple batteries may be needed. Organizing frameworks such as key characteristics and adverse outcome pathways (AOPs) have been useful for identifying gaps in coverage of biological processes related to specific endpoints. Key characteristics are amenable to clear PECO statements and have been applied to the systematic assembly and review of a diversity of mechanistic evidence, including from in vitro and nonmammalian tests (Smith et al., 2016). Key characteristics can also be used to identify target health endpoints for assay development. For instance, for both carcinogenicity and cardiovascular toxicity, attempts have been made to identify available assays and biomarkers that measure each key characteristic (Lind et al., 2021; Smith et al., 2020). Similarly, AOPs can be used to identify measurable key events at the biochemical, cellular, or tissue level that can serve as target endpoints for assay development (e.g., within an Integrated Approaches to Testing and Assessment framework [OECD, 2020]).
Finally, this parallel PECO approach provides a common framework for grouping toxicity testing methods for comparison. Specifically, the target human PECO identifies what human data would be needed for assessing concordance. Thus, different test methods with similar target human PECOs can be compared to each other in terms of consistency or corroboration. Examples of corresponding test method and target human PECOs are provided in Table 5-1.
Finding: The use of PECO statements for laboratory mammalian toxicity tests and human epidemiologic studies is a useful approach to specify their purpose and context of use for human health hazard identification and/or dose-response assessment and facilitates evidence synthesis and integration. Currently, PECO elements are attributed to studies after publication in the context of identifying evidence during a systematic review (e.g., by the risk assessor), as opposed to being a formal part of the meta-data associated with a study.
Finding: Though it is not always explicitly stated for laboratory mammalian toxicity tests, there is an implicit understanding that such test methods are intended to be surrogates for a corresponding target human PECO for the same biological tissue or system.
TABLE 5-1 Examples of Corresponding Target Human and Test Method PECO Statements

| Target Human PECO | Toxicity Testing Method | Test Method PECO |
|---|---|---|
| P: Human population; E: Chronic oral exposure to chemical X; C: No/lower exposure; O: Any cancer | Two-year cancer rodent bioassay for chemical X in drinking water | P: Rodents; E: Chemical X in drinking water for 2 years; C: Drinking water without X; O: Any cancer |
| P: Human population; E: Internal (serum) exposure to chemical X via any route; C: No/lower internal exposure; O: Long QTc, positive or negative chronotropy, asystole | High-throughput screening for chemical X using iPSC-derived cardiomyocytes | P: iPSC-derived cardiomyocytes from single or multiple donors; E: Chemical dissolved in media with dimethyl sulfoxide (DMSO); C: Negative controls: DMSO in media; positive controls: known positive drugs for each outcome (e.g., sotalol, isoproterenol, propranolol); O: Delayed action potential, increased or decreased spontaneous beat rate, asystole |
| P: Human population; E: Chemical X via any route; C: No/lower exposure; O: Adverse developmental outcomes | Zebrafish early life stage chemical screening | P: Diverse strains of early life stage zebrafish; E: Chemical X dissolved in media; C: Negative controls: DMSO in media; positive controls: known developmentally active compounds; O: Lethality, developmental delay, altered morphology, altered motor responses |
| P: Human population; E: Chemical X via dermal exposure; C: No/lower exposure; O: Skin allergy/sensitization | Murine local lymph node assay (OECD TG 429/442A/442B) | P: Mice (adult female CBA/JNCrlj strain); E: Chemical dermally applied in vehicle (e.g., acetone: olive oil); C: Negative: vehicle (acetone: olive oil); positive: 25% hexyl cinnamic aldehyde in acetone: olive oil; O: Proliferation of lymphocytes in the lymph nodes draining the site of substance application |
| P: Human population; E: Chemical X via dermal exposure; C: No/lower exposure; O: Skin allergy/sensitization | OECD test battery addressing specific key events of the skin sensitization adverse outcome pathway; considered positive based on 2/3 concordant tests (OECD, 2014) | OECD TG 442C In Chemico Skin Sensitization (assays addressing the adverse outcome pathway key event on covalent binding to proteins): P: Cysteine (Ac-RFAACAA-COOH) and lysine (Ac-RFAAKAA-COOH) containing synthetic peptides of purity higher than 85%; E: Chemical in an appropriate solvent (e.g., acetone) in a range of concentrations; C: Negative control: solvent; positive control: cinnamic aldehyde (CAS 104-55-2; 95% food-grade purity); O: Peptide depletion, which indicates reactivity and probable covalent binding |
| | | OECD TG 442D In Vitro Skin Sensitization ARE-Nrf2 Luciferase Test Method (assays addressing the adverse outcome pathway key event on keratinocyte activation): P: KeratinoSens™ transgenic cell line with a stable insertion of the luciferase reporter gene under the control of the antioxidant response element (ARE); E: Chemical in DMSO in a range of concentrations; C: Negative: vehicle (DMSO); positive: five concentrations of cinnamic aldehyde; O: Keratinocyte activation assessed by the Nrf2-mediated activation of ARE-dependent genes via a luciferase assay |
| | | OECD TG 442E In Vitro Skin Sensitization Human Cell Line Activation Test (h-CLAT) (assays addressing the key event on activation of dendritic cells): P: THP-1 human monocytic leukemia cell line (TIB-202™); E: Chemical in saline or DMSO in a range of concentrations; C: Negative: vehicle and lactic acid; positive: 2,4-dinitrochlorobenzene (DNCB) and nickel sulfate; O: Changes in expression of CD86 and CD54 at the cell surface as a measure of immune activation |
| P: Human population; E: Individual per- and polyfluoroalkyl substances (PFAS) (>5,000 unique chemicals); C: No/lower exposure; O: Endocrine disruption by agonism/antagonism of the androgen receptor | Artificial intelligence using SuperLearner analysis of Tox21 and DUD-E databases combined with molecular docking | P: Structure-activity model of the human androgen receptor; E: Different PFAS chemical structures in SMILES; C: Not exposed; O: Predicted agonism or antagonism of the androgen receptor (Singam et al., 2020) |
Recommendation 5.4: The EPA should utilize parallel PECO statements as part of its specification of the intended purpose and context of use of all types of toxicity testing methods, including laboratory mammalian in vivo tests as well as in silico, in vitro, and nonmammalian test methods. This concept consists of an intended target human PECO and a test method PECO for any NAM intended for use in human health hazard identification or dose-response assessment.
The EPA should document, or require assay developers to document, in meta-data the test method and target human PECOs, sufficient information to adequately describe the assay’s domain of applicability (e.g., types of agents that should/should not be tested, the extent to which metabolic activation is included), and any limitations in terms of biological coverage (e.g., whether it has adequate coverage of the biological processes related to an endpoint). An additional benefit of specifying these parallel PECO statements is that they provide an explicit statement of the hypothesis for concordance, which can be helpful in evaluating external validity (discussed subsequently).
Together, this information will support downstream stakeholders and users in assessing the potential impact of assay results on human health risk assessment. It will also clarify the purpose of a test method and its value for hazard identification or classification or for deriving a point of departure.
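Although the committee does not prescribe a data format, such meta-data could be captured in a structured, machine-readable record. The following Python sketch is illustrative only: the class and field names are hypothetical, and the example values are adapted from the zebrafish entry in Table 5-1.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PECO:
    """A single PECO statement (population, exposure, comparator, outcomes)."""
    population: str
    exposure: str
    comparator: str
    outcomes: List[str]

@dataclass
class AssayMetadata:
    """Hypothetical meta-data record pairing a test method PECO with its target human PECO."""
    method_name: str
    test_method_peco: PECO
    target_human_peco: PECO
    applicability_domain: str   # e.g., types of agents that should/should not be tested
    metabolic_activation: bool  # whether metabolic activation is included
    biological_coverage_limits: List[str] = field(default_factory=list)

# Example values adapted from the zebrafish entry in Table 5-1 (illustrative only)
zebrafish = AssayMetadata(
    method_name="Zebrafish early life stage chemical screening",
    test_method_peco=PECO(
        population="Diverse strains of early life stage zebrafish",
        exposure="Chemical X dissolved in media",
        comparator="Negative: DMSO in media; positive: developmentally active controls",
        outcomes=["Lethality", "Developmental delay",
                  "Altered morphology", "Altered motor responses"],
    ),
    target_human_peco=PECO(
        population="Human population",
        exposure="Chemical X via any route",
        comparator="No/lower exposure",
        outcomes=["Adverse developmental outcomes"],
    ),
    applicability_domain="Chemicals soluble in water, DMSO, or alcohols",
    metabolic_activation=True,
    biological_coverage_limits=["No maternal-fetal interactions"],
)
```

A record of this kind would make the parallel PECO statements, domain of applicability, and coverage limitations queryable alongside assay results rather than being attributed after publication.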
In the context of toxicity testing, the internal validity of a study (also denoted the “risk of bias”) reflects the extent to which aspects of the design of the study can provide confidence in (1) the qualitative issue of whether the result of a study reflects a causal relationship (or lack thereof) between exposure (E) and outcome (O), rather than being an artifact of study design, conduct, analysis, or reporting; and (2) the quantitative issue of whether the magnitude of the observed effects is systematically different from its true value in that test method (rather than being a random deviation from the true value). These potential biases are “internal” in that they deal only with the test method PECO. They do not consider the extent to which findings in the model system may be generalizable to the target human PECO, which is considered “external validity” (described subsequently).
NASEM reports over the past decade have consistently recommended the evaluation of risk of bias for individual studies when conducting human health risk assessments, although again the emphasis has been on epidemiologic and laboratory mammalian studies. It is also noteworthy that the field of systematic review has moved away from the concept of “study quality,” which often mixed different concepts such as risk of bias, imprecision, relevance, applicability, ethics, and completeness of reporting. Tools such as the Klimisch score are examples of approaches that are now considered inappropriate because of the mixing of concepts of bias and reporting, the use of scoring, and the difficulty of interpreting the overall summary score (Jüni et al., 2001). Because individual risks of bias are likely to vary in their impact, and scoring systems are themselves subject to bias, the calculation of a single risk-of-bias score is less useful than a more in-depth consideration of each risk of bias.
State-of-the-art individual study evaluation tools and approaches are based on four key principles originating in the Cochrane approach (Higgins et al., 2022).
There is substantial literature and good theoretical and empirical evidence as to how certain features of the design, conduct, and analysis of studies, on average, lead to systematic error, or bias. Human clinical trials are the best studied in this area, but for both human observational (epidemiologic) studies and laboratory mammalian studies, the factors that influence risk of bias are reasonably well established (see Box 5-2; see also Table 5-1). Examples of individual risk of bias evaluation domains for the National Institute of Environmental Health Sciences (NIEHS) Division of Translational Toxicology (DTT) Integrative Health Assessments Branch (IHAB) method, the EPA Office of Research and Development (ORD) Staff Handbook for Developing IRIS Assessments (IRIS Handbook), and the Navigation Guide3 assessments for different study types are presented in Table 5-2 (Eick et al., 2020; EPA, 2022; Lam et al., 2014; Li et al., 2018; NTP, 2019).
For other types of studies, such as those based on in vitro and in silico methods, there is less literature related to risk of bias, though some areas of bias have been documented. For instance, for in vitro assays, it is known that edge effects on assay plates can lead to bias (Niepel et al., 2019; Mansoury et al., 2021). In 2018, the OECD published a guidance document on “Good In Vitro Method Practices” that addresses some issues of bias as well, though it is not focused explicitly on internal validity (OECD, 2018). Roth et al. (2021) describe a SciRAP tool for “Evaluating the
___________________
3 The Navigation Guide systematic review method evaluates the association between environmental chemicals and adverse health outcomes using transparent and empirically based methods drawn from best practices in evidence-based medicine and environmental health. It was developed to inform health professionals, patients, communities, consumers, and policy and other decision-makers.
Reliability and Relevance of in vitro Toxicity Data,” but acknowledge that it “does not integrate risk of bias considerations.” The EPA (2022) also recently proposed a set of study evaluation domains for in vitro studies that include several aspects specifically addressing risk of bias (Table 5-2: observation bias/blinding, variable control, selective reporting, chemical administration and characterization, endpoint measurement), though it also includes some aspects unrelated to internal validity. For in silico methods, the OECD published guidance related to (Q)SAR models (OECD, 2007) as well as PBPK models (OECD, 2021b), but these do not address evaluation of risk of bias per se. The GRADE working group (Brozek et al., 2021) has developed a general framework for evaluating “modeled outputs” but notes that specific risk-of-bias tools would need to be developed for different types of models.
TABLE 5-2 Examples of Individual Study Evaluation Domains for Different Study Types

| Study Evaluation Domains | NIEHS DTT IHAB Method | EPA IRIS Handbook | Navigation Guide |
|---|---|---|---|
| Epidemiology Studies | | | |
| - Exposure measurement | X | X | X |
| - Outcome ascertainment | X | X | X |
| - Blinding | X | | X |
| - Participant selection^a | X | X | X^a |
| - Confounding | X | X | X |
| - Analysis | | X | |
| - Selective reporting | X | X | X |
| - Incomplete outcome data | | | X |
| - Sensitivity | | X | |
| - Funding bias^b | X | | X |
| - Other sources of bias | X | | X |
| Experimental Animal Studies | | | |
| - Allocation | X | X | X |
| - Sequence generation | X | | X |
| - Observational bias/blinding | X | X | X |
| - Confounding | X | X | |
| - Attrition | X | | |
| - Chemical administration and characterization | | X | |
| - Endpoint measurement | X | X | |
| - Results interpretation | | X | |
| - Incomplete outcome data | | | X |
| - Selective reporting | X | X | X |
| - Sensitivity | | X | |
| - Funding bias^b | X | | X |
| - Other sources of bias | X | | X |
| In Vitro Studies (only as formalized by EPA IRIS) | | | |
| - Observation bias/blinding | | X | |
| - Variable control | | X | |
| - Selective reporting | | X | |
| - Chemical administration and characterization | | X | |
| - Endpoint measurement | | X | |
| - Results presentation (reporting element, not shown to be related to risk of bias) | | X | |
| - Sensitivity (domain not related to risk of bias) | | X | |
| - Funding bias^b | | | |
^a The Navigation Guide domain is “Source population representation,” which addresses the question “Are the study groups at risk of not representing their source populations in a manner that might introduce selection bias?”
^b NASEM reports (NASEM, 2022; NRC, 2014) and the EPA Science Advisory Committee on Chemicals (see https://www.regulations.gov/document/EPA-HQ-OPPT-2021-0414-0044) have recommended including consideration of financial conflict of interest in the risk-of-bias domains for all types of studies.
Finding: Numerous frameworks articulate approaches for evaluating risk of bias for randomized controlled trials, human epidemiologic studies, and laboratory mammalian studies. These frameworks have evolved over time and are developed from theoretical and empirical bases for different contributors to risk of bias. For nonmammalian, in vitro, and in silico test methods, there is less literature related to risk of bias, though some areas of bias have been documented (Roth et al., 2021), and some of the domains used for experimental animal studies may be transferable. Others may be specific to individual types of test methods. For instance, for in vitro assays, edge effects on assay plates can lead to bias (Niepel et al., 2019; Mansoury et al., 2021).
Recommendation 5.5: The EPA, in collaboration with other entities such as the National Toxicology Program (NTP), should develop and utilize approaches for evaluating risk of bias for methods besides epidemiology and laboratory mammalian toxicity tests. Such approaches should be based on the four core principles articulated by Cochrane (Higgins et al., 2022). The EPA should be judicious in identifying existing risk-of-bias domains used for randomized controlled trials, human epidemiologic studies, and laboratory mammalian toxicity studies that may be transferable, and those risk-of-bias domains that may be unique to other types of toxicity testing methods.
Recommendation 5.6: The EPA, in coordination with other entities such as the NTP and OECD, should require assay developers, when designing or developing protocols for NAMs, to include recommended steps to minimize the risk of bias in the conduct of the study and subsequent data analyses. The elements to minimize risk of bias should be empirically and/or theoretically based. Note that reducing risk of bias (systematic error) is distinct from reducing experimental variability (random error), although some experimental strategies may reduce both.
As defined in Box 5-1, external validity refers to “whether the study is asking the appropriate research question and the extent to which results from a study can be applied (generalized) to other situations, groups, or contexts.”
Many existing frameworks for evaluating NAMs contain multiple elements of validation that can be grouped under the concept of external validity. As shown in Figure 5-1, external validity can be conceived as the extent to which the test method informs the target human PECO. Many published scientific confidence frameworks discuss this issue in terms of concepts such as “biological relevance,” “predictive capacity,” “applicability domain,” and “concordance,” but most do not consolidate them together or provide a structured framework for addressing all of them together under a single unifying concept.
Within systematic review-based evidence evaluation frameworks, external validity is commonly addressed in either “evidence synthesis” or “evidence integration” (or both). Evidence synthesis involves the evaluation of a specific body of evidence, such as epidemiologic studies, defined by a PECO, leading to a conclusion about the level of confidence in the conclusion from that body of evidence. Concerns about external validity of the study design during evidence synthesis are commonly addressed explicitly under the rubric of “indirectness” or “directness.” For example, the EPA IRIS Handbook (EPA, 2022) states that,
“Judgments to decrease certainty based on indirectness should focus on findings for measures that have an unclear linkage to an apical or clinical (adverse) outcome” or if “the findings are determined to be nonspecific to the hazard under evaluation.”
Specifically, this “indirectness” domain usually concerns the external validity of the study’s population (e.g., children versus adults), exposure (e.g., acute versus chronic), and outcome measures (e.g., blood pressure versus stroke) with respect to the human health hazard of interest, irrespective of the specific results of the study (Guyatt et al., 2011). The NIEHS DTT IHAB method, for instance, describes analogous “directness and applicability” considerations for experimental animal studies.
The EPA IRIS Handbook (2022) focuses on directness related to outcomes, specifically cases where there is “unclear linkage to an apical or clinical (adverse) outcome” or when “the findings are determined to be nonspecific to the hazard under evaluation.” In both of these cases, the use of “parallel PECO” statements can help to clarify this judgment. The same is true for in silico models, in the sense that directness of the training data or parameters used to build or calibrate the model, as well as the fidelity of the model structure to the target human PECO, would influence the directness of the model overall (Brozek et al., 2021).
Within evidence synthesis, there are also other evaluation domains that are used to evaluate the external validity of a body of evidence for a specific chemical or agent. For instance, GRADE4 (https://www.gradeworkinggroup.org/), the NIEHS DTT IHAB method, the EPA’s IRIS Handbook (2022), and the Navigation Guide all share, in addition to (in)directness, considerations related to risk of bias (internal validity), (in)consistency, (im)precision, dose-response/magnitude of effect, and publication bias. Other considerations within some evidence synthesis frameworks include the impact of residual confounding; cross-species/population consistency or coherence; serious or rare endpoints; and financial conflict of interest.
Evidence integration involves combining evidence from different bodies of evidence, such as epidemiologic studies and in vivo experimental animal studies, to support a hazard conclusion. Concerns about external validity during evidence integration are usually implicit in how different types of evidence contribute to an overall hazard conclusion. For instance, in evaluations of carcinogenicity, in vivo experimental animal studies alone cannot be used to support a conclusion of “carcinogenic to humans,” whether those evaluations are performed by the EPA, the NTP, the International Agency for Research on Cancer (IARC), or the World Health Organization (WHO)/International Programme on Chemical Safety (IPCS). However, epidemiologic data alone can support such a conclusion when these data provide sufficient evidence of carcinogenicity (i.e., chance, bias, and confounding can be ruled out with reasonable confidence). Thus, implicitly, in vivo experimental animal studies alone are considered to have less external validity than human epidemiologic studies alone. However, this implicit consideration of external validity during evidence integration means that there is potential for double counting if, for instance, evidence has already been downgraded for indirectness during the previous step of evidence synthesis.
There are several evidence integration frameworks and methodologies that have been developed that have a provision for making use of mechanistic-type data, such as those deriving from NAMs. However, most existing frameworks are focused largely on human clinical/epidemiologic studies and laboratory mammalian toxicity studies, initially integrating only these two evidence
___________________
4 The GRADE (Grading of Recommendations Assessment, Development and Evaluation) approach was originally developed for application in health care and provides a framework for research users to consider the certainty of evidence in favor of a particular therapeutic intervention.
streams. Studies outside of these categories are generally lumped together as mechanistic data and are mostly used to modify the initial integration based only on human epidemiology and laboratory mammalian toxicity studies. These include the mode of action (MoA) framework developed by WHO/IPCS and used extensively by the EPA for many years, and the Integrated Approaches to Testing and Assessment (IATA) framework developed by the OECD and applied, for example, by the European Food Safety Authority (EFSA) for developmental neurotoxicity (DNT) evaluation of pesticides. The latter (IATA), in particular, has been conceived to further the development of chemical safety assessment approaches that rely primarily on NAMs. In addition, the MoA and IATA frameworks address both hazard and risk assessment objectives. For example, under new OECD guidelines (Test Nos. 442C, D, and E), in vitro studies alone are sufficient for categorization of hazard based on adverse outcome pathway considerations within an IATA framework. In addition, the recently updated IARC Monographs Preamble (IARC, 2019; Samet et al., 2020) (see Figure 5-2) allows mechanistic evidence alone, based on key characteristics (Smith et al., 2016), to support a conclusion of “possibly” (2B) or “probably” (2A) carcinogenic to humans.
In line with the IARC approach, the NTP’s classification of “reasonably anticipated to be a human carcinogen” (akin to IARC 2B and 2A) includes agents for which there is less than sufficient evidence of carcinogenicity in humans or laboratory animals, but the agent, substance, or mixture belongs to a well-defined, structurally related class of substances whose members are either known or reasonably anticipated to be human carcinogens, or there is convincing relevant information that the agent acts through mechanisms indicating it would likely cause cancer in humans. In another example, for pharmaceuticals, International Council for Harmonisation (ICH) guidelines (ICH, 2023) have recently been updated to include a weight-of-evidence approach, including mechanistic data or read-across, to determine that carcinogenicity is likely (or unlikely) in the absence of a rodent bioassay.
Likewise, per the California Code of Regulations, Title 22, Section 69402.2 et seq. (Kammerer, 2011), “suggestive” evidence of carcinogenicity for a given chemical substance can be provided by strong evidence of genotoxicity; mechanistic evidence from cell-based, tissue-based, or whole organism–based assays showing perturbations of known physiological, biochemical, or other pathways involved in carcinogenesis, as described in the IARC Preamble; or strong indications of carcinogenicity from structure-activity relationships. Similarly, mechanistic and other NAM evidence alone can support suggestive evidence of reproductive toxicity, developmental toxicity, and other toxicological hazard traits.
Finding: Published confidence frameworks for NAMs include concepts that are related to external validity but most do not consolidate them together or provide a structured framework for addressing all of them together under a single unifying concept.
Finding: Frameworks for evaluating external validity as part of assessing a body of evidence have evolved over time and more recently have been defined as part of systematic reviews. Specifically, existing systematic review–based frameworks for evidence evaluation contain elements of external validity as part of a structured approach to evidence synthesis. Common domains of these frameworks include risk of bias, consistency, directness, precision, publication bias, effect size, and dose-response relationship (some include cross-species, population, or study consistency, and serious or rare endpoints). The consideration of (in)directness, which concerns the external validity of a study’s population, exposure, and outcome measures on the human health hazard of interest, covers aspects most relevant to assessing external validity of NAMs and is part of the process of evidence synthesis.
Finding: As shown in Table 5-3, evaluation of external validity has three common components: biological considerations, exposure considerations, and concordance, which can be defined within the “indirectness/directness” domain of evidence synthesis. Biological considerations may relate both to the external validity of the population as well as to that of the outcome. In addition, each of these may include both qualitative and quantitative considerations. For instance, qualitative concordance in the sense of identifying or categorizing a hazard may be evaluated separately from quantitative concordance in the sense of quantifying dose-response relationships or a point of departure (POD). Quantitative limitations may be addressable as part of dose-response assessment, as is commonly done for experimental animal studies through toxicokinetic modeling or application of uncertainty/variability factors.
Finding: Currently NAMs are generally considered in the evidence integration step, typically on a case-by-case basis through expert judgment and typically to elevate confidence in the animal or human streams of evidence. When considered as part of evidence integration, limitations in external validity are an implicit part of how different types of evidence are grouped.
Recommendation 5.7: The EPA, in collaboration with other entities such as the NTP, should develop and utilize structured approaches for evaluating the external validity of test methods, including NAMs. This would entail developing guiding questions to facilitate a structured approach similar to the approaches used for evaluating risk of bias of NAMs. Similar to current best practices for evaluation of internal validity, the committee recommends against aggregating individual domain ratings into a single quantitative score. Example domains and guiding questions are shown in Table 5-3, with examples of their application in Tables 5-4 to 5-8.
TABLE 5-3 Example Domains, Guiding Questions, and Rating Rubric for Evaluating External Validity

| Example Domains and Guiding Questions | Example Rubric with Ratings for Evaluating Each Domain^a |
|---|---|
| Biological considerations: Population—How strong is the biological basis for the test method as a biologically relevant model for the human population? | High: Multiple aspects of structural/functional similarity; few/minor limitations; includes human population variability. Moderate: Some aspects of structural/functional similarity; some critical limitations; does not include human population variability. Low: Little or only superficial structural/functional similarity; many critical limitations. Inadequate: Insufficient data to determine. |
| Biological considerations: Outcome—How strong is the biological basis for the test method outcome as a model for human outcomes measured? | High: Clear correspondence between outcome measured and relevant human outcomes; few/minor limitations. Moderate: Some correspondence between outcomes measured and relevant human outcomes, perhaps requiring analogy; a few critical limitations. Low: Little or only superficial correspondence between outcome measured and relevant human outcomes; many critical limitations. Inadequate: Insufficient data to determine. |
| Exposure considerations: How accurately does exposure in the test method model human exposures? | High: Exposure in the test method clearly corresponds to relevant human exposures, including consideration of toxicokinetics and duration. Moderate: Some uncertainties as to how exposure in the test method corresponds to relevant human exposure, including consideration of toxicokinetics and duration. Low: Great uncertainty as to how exposure in the test method corresponds to relevant human exposure, including consideration of toxicokinetics and duration. Inadequate: Insufficient data to determine. |
| Concordance:^b How accurately does the test method predict human outcomes to exposure? | High: Substantial support from comparisons of test method results and human data (e.g., based on systematic reviews with a large number of positive and negative reference compounds) showing concordance. Moderate: Some support from comparisons of test method results and human data showing concordance, but with some limitations (e.g., case studies of high methodological quality but limited number/diversity of compounds; systematic reviews with moderate risk of bias, low precision, etc.). Low: Comparisons of test method results and human data evaluating concordance exhibit many limitations. Inadequate: Insufficient data to determine (e.g., absence of human data to evaluate concordance). |
^a Potentially separate for qualitative (e.g., for hazard identification) and quantitative (e.g., for dose-response assessment) considerations. For instance, qualitative concordance in the sense of identifying or categorizing hazard may be evaluated separately from quantitative concordance in the sense of quantifying dose-response relationships or a POD. Quantitative limitations may be addressable as part of dose-response assessment, as is commonly done for experimental animal studies through toxicokinetic modeling or application of uncertainty/variability factors.
^b See Chapter 4 for additional discussion and recommendations for evaluating concordance.
TABLE 5-4 Example Evaluation of External Validity: Two-Year Rodent Cancer Bioassay

| External Validity Domain | Qualitative Considerations | Quantitative Considerations |
|---|---|---|
| Biological considerations: Population—How strong is the biological basis for the test method as a biologically relevant model for the human population? | High: Rodents are biologically and physiologically similar to humans, with largely the same organs and tissues. | Moderate: Bioassays use homogeneous populations with a different background exposure and stress milieu (diet, lighting, concomitant exposures, etc.) compared to the human population, so they do not address factors such as sensitive or vulnerable subpopulations. |
| Biological considerations: Outcome—How strong is the biological basis for the test method outcome as a model for human outcomes measured? | High: Rodent tumors are pathologically similar to human tumors. | Moderate: Background rates of rodent and human tumors are different overall and at specific sites, so there are uncertainties in quantitative extrapolation. |
| Exposure considerations: How accurately does exposure in the test method model human exposures? | High for inhalation, drinking water, feed; Moderate for gavage: Rodents have the same toxicokinetic processes as humans regarding absorption, distribution, metabolism, and excretion. Most rodent metabolic enzymes have direct analogues in humans. However, some routes of exposure (e.g., gavage) may not have direct analogues in humans. | Moderate: Physiological differences related to body size, relative organ weight, and metabolic activity may lead to quantitative differences in internal exposure. These limitations can be addressed through quantitative approaches, such as allometric scaling or physiologically based pharmacokinetic (PBPK) modeling. |
| Concordance: How accurately does the test method predict human outcomes to exposure? | High (with exceptions): All IARC Group 1 carcinogens are positive in at least one rodent bioassay. Site concordance is not required. However, certain outcomes (e.g., male rat kidney tumors due to alpha2u-globulin accumulation) may not be relevant to humans. | Low: There are limited high-quality data on the quantitative concordance between rodents and humans in carcinogenic potency. Early quantitative studies suggested fairly good concordance of quantitative measures of potency, corroborated by more recent individual case studies (e.g., trichloroethylene) (Allen, 1988; Crouch, 1983; Crouch and Wilson, 1979; EPA, 2011). However, there is no recent systematic evaluation. |
TABLE 5-5 Example Evaluation of External Validity: High-Throughput Screening Using iPSC-Derived Cardiomyocytes

| External Validity Domain | Qualitative Considerations | Quantitative Considerations |
|---|---|---|
| Biological considerations: Population—How strong is the biological basis for the test method as a biologically relevant model for the human population? | Moderate: Human iPSC-derived cardiomyocytes (hiPSC-CM) spontaneously beat and have morphology and gene expression similar to human left ventricular cardiomyocytes. However, they express a more fetal phenotype and are not paced. | Moderate: If hiPSC-CM from only a single donor are used, the assay does not address factors such as sensitive or vulnerable subpopulations. |
| Biological considerations: Outcome—How strong is the biological basis for the test method outcome as a model for human outcomes measured? | High: In vivo transporters and receptors involved in cardiomyocyte function (e.g., the hERG channel, beta-1 and beta-2 adrenergic receptors) are expressed in hiPSC-CM. Beating parameters measured in vitro are similar to those measured in vivo using an electrocardiogram. | Moderate: Spontaneous beating is at a slower rate than in vivo paced beating, so outcome measurements may need to be corrected for these differences. |
| Exposure considerations: How accurately does exposure in the test method model human exposures? | Moderate: Chemical must be direct acting due to lack of metabolic capacity in hiPSC-CM, and soluble in DMSO. | Moderate: Quantitative uncertainties regarding differences in protein binding between media and serum, binding to testing materials (e.g., plastic), and partitioning. No consideration of background exposures. |
| Concordance: How accurately does the test method predict human outcomes to exposure? | Moderate: Accurate bioactivity predictions for a large number of positive and negative reference drugs for each outcome, especially QTc prolongation, but not for environmental chemicals. Patient-specific hiPSC-CM accurately predict susceptibility to doxorubicin-induced cardiotoxicity. However, these results were not derived from systematic reviews. | Moderate: For QTc prolongation, the in vitro free-concentration EC01 predicts within 3-fold the in vivo free blood concentration EC01 reported in clinical trials for 10 positive reference drugs. However, these results were not derived from a systematic review and only include drugs (Blanchette et al., 2019). |
TABLE 5-6 Example Evaluation of External Validity: Zebrafish Early Life Stage Chemical Screening

| External Validity Domain | Qualitative Considerations | Quantitative Considerations |
|---|---|---|
| Biological considerations: Population—How strong is the biological basis for the test method as a biologically relevant model for the human population? | Moderate: High level of conservation in embryonic development between zebrafish and humans. However, zebrafish are nonmammalian. | Moderate: Can include diverse genetic backgrounds, as well as some nonchemical stressors, to address sensitive or vulnerable subpopulations. |
| Biological considerations: Outcome—How strong is the biological basis for the test method outcome as a model for human outcomes measured? | Moderate: Measured developmental outcomes, though not identical human outcomes, have analogies to human outcomes. Developmental milestones and biological processes are similar with similar molecular pathways. | Moderate: Outcomes are measured following direct exposure to the developing embryo or larvae so does not include maternal-fetal interactions. |
| Exposure considerations: How accurately does exposure in the test method model human exposures? | Moderate: Applicable to chemicals soluble in water, DMSO, and alcohols. Zebrafish are metabolically competent, so exposures to metabolically activated chemicals can be tested. The degree of overall metabolic similarity to humans remains somewhat uncertain. | Moderate: Corrections may be needed because of uncertainties in dosimetry and routes of exposure (e.g., embryo/larval chemical uptake across assay period, nonspecific binding to exposure plates, differences in chemical metabolism). |
| Concordance: How accurately does the test method predict human outcomes to exposure? | Moderate: Accurate bioactivity predictions for a large number of positive and negative reference compounds for teratogenicity and developmental toxicity. However, these results were not derived from systematic reviews. | Inadequate: Measured developmental outcomes are often binary (yes/no) rather than quantitative. |
TABLE 5-7 Example Evaluation of External Validity: OECD Skin Sensitization Test Battery (TG 442C/D/E)

| External Validity Domain | Qualitative Considerations | Quantitative Considerations |
|---|---|---|
| Biological considerations: Population—How strong is the biological basis for the test method as a biologically relevant model for the human population? | Moderate: The battery uses peptides that contain the same amino acids present in human proteins, genetically engineered human-derived keratinocytes, and a human cancer cell line. | Moderate: May not address sensitive subpopulations. |
| Biological considerations: Outcome—How strong is the biological basis for the test method outcome as a model for human outcomes measured? | High: The battery of assays is based on a well-defined adverse outcome pathway and targets specific key events. Requiring two out of three positive assays increases the predictive value. May underpredict test chemicals exclusively reactive toward lysine residues. | Moderate: Limited metabolic capability, but the majority of pre-haptens and pro-haptens are sufficiently well identified. Fluorescent substances may interfere with the flow cytometric detection; sensitive to interference with the luciferase enzyme. |
| Exposure considerations: How accurately does exposure in the test method model human exposures? | Moderate: Limited to soluble chemicals or chemicals that form a stable dispersion in either water or DMSO. Alkylating agents may be underpredicted. TG 442C is not applicable to metals. | Inadequate: Exposure is in solution; it is unclear how the effect levels in chemico and in vitro relate to in vivo potency, although models are being developed. |
| Concordance: How accurately does the test method predict human outcomes to exposure? | Moderate: The battery is reported to have a better predictive value than the local lymph node assay (LLNA). | Inadequate: There are limited data on the quantitative concordance in potency. |
TABLE 5-8 Example Evaluation of External Validity: AI/Molecular Docking Model of the Human Androgen Receptor

| External Validity Domain | Qualitative Considerations | Quantitative Considerations |
|---|---|---|
| Biological considerations: Population—How strong is the biological basis for the model of the androgen receptor as relevant to the human population at large? | High: The structure of the human androgen receptor is well characterized. | Moderate: May not address sensitive subpopulations who have mutations in the protein. |
| Biological considerations: Outcome—How predictive is the AI/docking method as a model for human endocrine disruption? | Low: The method is meant as a screen to identify chemicals for further testing. | Low: Docking scores may not be directly comparable quantitatively. Can be used to group, but not rank chemicals in a specific order. |
| Exposure considerations: Does the level of exposure alter the predictive value of the test method? | High: Chemicals with well-defined structures should perform well in the method. | Low: Requires predictions of the chemical at the target site. If the chemical does not reach the receptor there will be no effect despite being positive by the test method. |
| Concordance: How accurately does the test method predict human outcomes in relation to exposure? | Moderate: The test method provides candidates for further screening and is fit for this purpose. | Low: Absorption, Distribution, Metabolism and Excretion (ADME) considerations may affect the predictive value of the test method that models the isolated protein. |
Recommendation 5.8: The EPA should not “double count” limitations in external validity in both evidence synthesis (e.g., downgrading evidence for (in)directness) and evidence integration (e.g., limiting how strongly nonhuman evidence can support a hazard conclusion).
Recommendation 5.9: Because concordance of a test method is defined with respect to the target human PECO, the EPA should broaden the considerations for evaluating concordance beyond comparisons to laboratory mammalian toxicity tests.
This recommendation, along with those in Chapter 4, addresses the committee’s charge questions regarding one of the related issues left unresolved in Using 21st Century Science to Improve Risk-Related Evaluations: assessment of the concordance of data from assays that use cells or proteins of human origin with toxicity data that are virtually all derived from animal models.
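As a concrete illustration of what broadened concordance evaluation could involve, a test method's hazard calls and potency estimates can be compared directly against human reference data defined by the target human PECO. The following Python sketch is illustrative only: the function names and reference compounds are hypothetical, and the 3-fold criterion echoes the hiPSC-CM example in Table 5-5 rather than a committee-prescribed threshold.

```python
from typing import Dict

def qualitative_concordance(test_calls: Dict[str, bool],
                            human_calls: Dict[str, bool]) -> Dict[str, float]:
    """Sensitivity/specificity of test method hazard calls against human
    reference compounds sharing the same target human PECO."""
    shared = sorted(test_calls.keys() & human_calls.keys())
    tp = sum(test_calls[c] and human_calls[c] for c in shared)
    tn = sum(not test_calls[c] and not human_calls[c] for c in shared)
    fp = sum(test_calls[c] and not human_calls[c] for c in shared)
    fn = sum(not test_calls[c] and human_calls[c] for c in shared)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"sensitivity": sens, "specificity": spec,
            "balanced_accuracy": (sens + spec) / 2}

def within_k_fold(test_pod: float, human_pod: float, k: float = 3.0) -> bool:
    """Quantitative concordance check: are the test method and human points of
    departure (PODs, in comparable units) within a k-fold factor of each other?"""
    ratio = test_pod / human_pod
    return max(ratio, 1.0 / ratio) <= k

# Hypothetical reference compounds; calls are illustrative, not real data
test = {"cmpd_A": True, "cmpd_B": False, "cmpd_C": True}
human = {"cmpd_A": True, "cmpd_B": False, "cmpd_C": False}
print(qualitative_concordance(test, human))     # sensitivity 1.0, specificity 0.5
print(within_k_fold(test_pod=0.9, human_pod=2.0))  # True: within 3-fold
```

The same human reference set could then be reused to evaluate the concordance of a comparator test method, supporting the side-by-side comparisons discussed later in this chapter.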
Summary of Relevant Findings from Chapter 3: As discussed in detail in Chapter 3, experimental and biological variability are ubiquitous within experimental biological science and are not unique to laboratory mammalian toxicity studies. Moreover, intrinsic biological variability and experimental variability can be difficult to distinguish. Experimental variability in laboratory mammalian studies is minimized through study design elements and standardization of animal maintenance and environmental control (Haseman et al., 1989; Jacobs and Hatfield, 2013). Although minimizing experimental variability is often regarded as highly valuable, variability is not fundamentally a negative attribute, because some aspects of variability may be important for characterizing the biological variability associated with the distribution of responses to toxicant exposure in the heterogeneous human population. Thus, for NAMs, overemphasis on reducing experimental variability through excessive standardization may actually reduce the external validity of the results by testing only an overly narrow set of relevant conditions (Miller, 2014; Voelkl et al., 2020). Although only a limited number of high-quality systematic reviews on the topic were identified, substantial heterogeneity in outcomes of laboratory mammalian studies was generally reported. However, a broad and generalizable threshold of acceptable or tolerable variability could not be identified.
Recommendation 5.10: In its evaluation of test methods, the EPA should prioritize increasing external validity through broader coverage of biological variability. One potentially useful strategy is to use a battery of assays to encompass greater biological variability while designing each individual assay to minimize experimental variability.
Recommendation 5.11: For any test method intended for use in risk assessment, whether in vivo, in vitro, in silico, or otherwise, particularly in a context where there are no other data (laboratory mammalian or human data), EPA’s tolerance of variability should be driven by an analysis of the different levels and types of variability and of their impact on the test method’s internal and external validity. This analysis should also take into account the test method’s purpose and context of use.
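To illustrate the kind of analysis Recommendation 5.11 envisions, experimental (technical) and biological variability can be crudely separated when a NAM is run with replicates across multiple donors or strains. The following is a minimal sketch assuming balanced replication; the function name and data are hypothetical, and the method-of-moments decomposition is one of several possible approaches (a mixed-effects model would be more rigorous).

```python
import statistics
from typing import Dict, List

def variance_components(replicates: Dict[str, List[float]]) -> Dict[str, float]:
    """Partition variance in assay responses (e.g., log10 EC50s) into a
    within-donor component (approximating experimental variability) and a
    between-donor component (approximating biological variability)."""
    # Mean within-donor variance: technical replicates of the same biology
    within = statistics.mean(statistics.variance(v) for v in replicates.values())
    # Variance of donor means contains biological variance plus diluted noise
    donor_means = [statistics.mean(v) for v in replicates.values()]
    between = statistics.variance(donor_means)
    n = statistics.mean(len(v) for v in replicates.values())  # replicates per donor
    biological = max(between - within / n, 0.0)  # method-of-moments estimate
    return {"experimental": within, "biological": biological}

# Hypothetical log10 EC50 values: 3 donors x 3 technical replicates
data = {
    "donor_1": [-5.1, -5.0, -5.2],
    "donor_2": [-4.6, -4.7, -4.5],
    "donor_3": [-5.8, -5.7, -5.9],
}
print(variance_components(data))
```

A decomposition of this kind makes explicit which share of observed variability reflects assay noise (reducible by design) and which reflects donor-to-donor biology (informative for external validity), consistent with the distinction drawn in Chapter 3.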
The evaluation of the scientific confidence of a NAM may be conducted in a general context: the design, evaluation, and utilization of NAM-based testing strategies for the generation of data for assessing chemicals, including data-poor chemicals. Within this context, there are three related scenarios: (1) filling data gaps, (2) complementing existing data, and (3) offering an alternative or replacement to another test method.
Filling a data gap. When there are no existing data for a chemical, no existing toxicity testing method, and/or only a default approach that does not rely on data, a NAM can be evaluated independently according to its own strengths and limitations. In this case, a NAM is evaluated alone as to its purpose and context of use, internal validity, external validity, and variability. There are no set criteria for the minimum level of confidence suitable for human health risk assessment, as it depends on the decision-making context, tolerance for uncertainty, and other factors.
Complementing existing data. In addition, in some cases NAMs are developed to add to or complement existing tests in a larger battery of tests or body of evidence that may increase certainty in the evaluation of an outcome. In this case, the NAM confidence evaluation would be carried forward to the specific context of a particular chemical(s), and evidence synthesized or integrated with other available evidence in making a human health hazard or risk determination (discussed subsequently).
Offering an alternative. When there is another test method, such as a mammalian toxicity test method, that has the same or a similar target human PECO (a comparator test method), a NAM (including NAMs consisting of batteries of assays) can be evaluated relative to the strengths and limitations of the comparator. This scenario includes but is not limited to the one-to-one replacement situation. The choice of comparator toxicity testing method depends on the goal of the comparison and could involve any type of test method from a laboratory mammalian study to another NAM—the only requirement is that they have the same or similar target human PECOs.
Once the comparator is identified, each of the two methods (NAM and comparator) would be evaluated on the same dimensions of purpose and context of use, internal validity, external validity, and variability using the same framework, so that the two can be compared as to their scientific confidence. To the extent that there are human data on reference compounds for evaluating external validity, the same human data should be used, where possible, when evaluating the concordance of the NAM and of its comparator. The relative scientific confidence of the NAM and the comparator test method may well differ for each of the components or subcomponents. For example, a NAM may have greater internal validity but less external validity than the comparator, or it may be superior for some domains of external validity while the comparator is superior for others. As with internal validity, numerical individual domain scores and overall summary scores are not recommended, because different aspects of scientific confidence may have different impacts in different situations.
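As a concrete illustration of evaluating a NAM and its comparator against the same human reference data, here is a minimal sketch in Python; the reference classifications and method calls are hypothetical, and the metrics shown (sensitivity, specificity, balanced accuracy) are common choices rather than a prescribed set.

```python
# Minimal sketch: comparing a NAM and a comparator test method against the
# same human reference classifications, as the text recommends. All of the
# calls and reference labels below are hypothetical illustrations.
def concordance(predictions, reference):
    """Sensitivity, specificity, and balanced accuracy vs. a shared reference."""
    tp = sum(p and r for p, r in zip(predictions, reference))
    tn = sum(not p and not r for p, r in zip(predictions, reference))
    fp = sum(p and not r for p, r in zip(predictions, reference))
    fn = sum(not p and r for p, r in zip(predictions, reference))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {"sensitivity": sens, "specificity": spec,
            "balanced_accuracy": (sens + spec) / 2}

# Hypothetical positive/negative calls for ten shared reference compounds
human_reference  = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
nam_calls        = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
comparator_calls = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]

for name, calls in [("NAM", nam_calls), ("comparator", comparator_calls)]:
    print(name, concordance(calls, human_reference))
```

Consistent with the caution above, such metrics are best reported domain by domain rather than collapsed into a single summary score.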
Whether filling data gaps, complementing existing data, or offering an alternative, the evaluation of confidence alone does not determine whether a NAM is preferred or equivalent, because other considerations, such as cost to the public, timeliness, and throughput, may also be relevant. For instance, even if there is greater quantitative uncertainty in a NAM relative to an existing test, the NAM may still be preferable from a value-of-information perspective because it provides more data more quickly and/or at lower cost, and thus may result in greater overall public health benefit.
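The value-of-information intuition can be made concrete with a back-of-the-envelope sketch; all of the costs, sensitivities, and the hazard prevalence below are hypothetical assumptions chosen only to illustrate the trade-off.

```python
# Minimal sketch of the value-of-information reasoning in the text: a noisier
# but faster and cheaper NAM can correctly identify more true hazards per year
# than a more accurate but slower traditional test. All numbers are hypothetical.
budget = 10_000_000          # testing budget, dollars per year
prevalence = 0.2             # assumed fraction of tested chemicals that are hazards

methods = {
    # method name: (cost per chemical in dollars, assumed sensitivity)
    "traditional animal test": (2_000_000, 0.90),
    "NAM battery":             (20_000,    0.75),
}

for name, (cost, sensitivity) in methods.items():
    n_tested = budget // cost
    expected_hazards_found = n_tested * prevalence * sensitivity
    print(f"{name}: {n_tested} chemicals tested, "
          f"~{expected_hazards_found:.0f} hazards correctly identified")
```

Under these illustrative assumptions the NAM battery, despite lower sensitivity, identifies far more true hazards per year simply because it covers far more chemicals; a real analysis would also weigh false positives, decision costs, and uncertainty in each input.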
Finding: “One-size-fits-all” criteria for acceptability of NAM-based testing strategies are inappropriate because the various elements of scientific confidence have different strengths and weaknesses, such as assessing or limiting biological variability. There is a need to consider what choice best promotes the overall goal of protection of public health.
Recommendation 5.12: The EPA should establish the acceptability of NAM-based testing strategies based on each specific purpose and context of use. The EPA should be transparent as to the level of scientific confidence that results from examining the NAM’s internal validity, external validity, and variability.
This recommendation, along with the previous recommendations for scientific confidence, addresses the charge questions related to “evaluation of the validity of assays that are not intended as one-to-one replacements for in vivo toxicity assays” and “how may the Committee foresee this information being incorporated into a new or the existing validation paradigm or scientific confidence framework so that EPA can ensure that NAMs are equivalent to or better than the animal tests replaced.”
Finding: Although many NAMs have been developed and their results reported in the literature, there is no authoritative source or compendium of available tests and technologies along with documentation of their reliability and description of their potential applicability. Significant gaps in biological coverage also remain to be systematically identified. In other contexts, registries of acceptable methods for a specific context of use have proven helpful, as with the U.S. Food and Drug Administration’s (FDA’s) Medical Device Development Tool (MDDT) and Innovative Science and Technology Approaches for New Drugs (ISTAND) programs. In addition, registries developed for other purposes, such as for clinical trials (clinicaltrials.gov) and systematic reviews (PROSPERO), have been helpful to researchers, developers, and regulators.
Recommendation 5.13: For the regulated community, the EPA’s goal should be to provide lists of acceptable NAM-based testing strategies under different purposes and contexts of use in order to establish confidence that NAM-derived data submissions to the agency will be integrated into decision-making (discussed in the next section). This could be accomplished through the EPA working with partners in the U.S. government and appropriate international organizations to develop a harmonized registry of toxicity testing methods documenting their purpose and context of use (including parallel PECO statements), internal validity, external validity, and variability.
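To illustrate what one entry in such a registry might capture, here is a minimal sketch of a record schema in Python; the fields mirror the scientific confidence dimensions named in the recommendation, but the schema itself, the identifiers, and the example values are hypothetical.

```python
# Minimal, hypothetical sketch of a harmonized registry record for a toxicity
# testing method, with fields for purpose and context of use (including
# parallel PECO statements), internal validity, external validity, and
# variability, as named in Recommendation 5.13.
from dataclasses import dataclass, field

@dataclass
class PECO:
    population: str
    exposure: str
    comparator: str
    outcome: str

@dataclass
class TestMethodRecord:
    method_id: str
    purpose_and_context_of_use: str
    test_system_peco: PECO        # what the method actually measures
    target_human_peco: PECO       # the human question it informs ("parallel PECO")
    internal_validity: str        # e.g., controls, randomization, QC notes
    external_validity: str        # e.g., concordance with human reference data
    variability: str              # e.g., within- and between-laboratory estimates
    references: list = field(default_factory=list)

record = TestMethodRecord(
    method_id="NAM-0001",
    purpose_and_context_of_use="Screening-level cardiotoxicity hazard",
    test_system_peco=PECO("human iPSC-derived cardiomyocytes",
                          "0.01-100 uM, 24 h", "vehicle control",
                          "proxy for QT prolongation"),
    target_human_peco=PECO("general population, incl. susceptible subgroups",
                           "chronic oral exposure", "unexposed",
                           "cardiac arrhythmia"),
    internal_validity="positive/negative controls on every plate; plate-effect QC",
    external_validity="concordance with clinical QTc data on reference drugs",
    variability="replicate CV and donor-to-donor spread reported per assay",
)
```

A machine-readable schema of this kind would also support the harmonization across U.S. and international partners that the recommendation envisions.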
At this point, it is useful to step back to recall the overall purpose of toxicological test methods in the context of human health risk assessment.5 Most fundamentally, toxicity testing aims to inform (1) hazard identification as to the level of evidence for a causal relationship between exposure to an agent and an effect and (2) dose-response assessment as to the quantitative relationship between exposure and the incidence or severity of an effect, in humans, including susceptible and vulnerable subgroups. For both human studies and laboratory mammalian studies, there is a large literature on structured approaches for establishing scientific confidence for hazard and dose-response, including recommendations from numerous NASEM reports (NASEM, 2009, 2011, 2014, 2017a, 2018, 2022).
Additionally, numerous NASEM committees have recommended that government programs, including those of NTP, EPA, and the Department of Defense (DOD), apply rigorous systematic review-based approaches in the human health risk assessment of chemicals (NASEM, 2018, 2022; NRC, 2011, 2014). Systematic review has been defined as “the application of methods designed to minimize risk of systematic and random error, and maximize transparency of decision-making, when using existing evidence to answer specific research questions” (Whaley et al., 2022). Although systematic review originated in clinical medicine, over the last decade it has been increasingly applied to environmental health and toxicology questions.
The workflow in Figure 5-3 illustrates a generic framework for human health hazard assessment, adapted from NASEM (2021) (see definitions in Boxes 5-1 and 5-2), as well as how the concepts related to evaluating scientific confidence in NAM-based testing strategies can interface with this framework. Additional details, derived from the committee’s review of previous NASEM reports (NASEM, 2022; NRC, 2011, 2014) and related documents (e.g., the EPA IRIS Handbook [EPA, 2022] and the NIEHS DTT IHAB method [NTP, 2019]), are shown in Table 5-9.
Finding: Modern approaches to systematic review used in hazard and risk assessment, which include evidence synthesis and integration, have largely been applied to human and laboratory animal evidence, but the general principles for evidence evaluation can be applied to data from other types of toxicity testing methods used for hazard identification or dose-response. In the context of an initial registration (e.g., a new pesticide active ingredient), a structured and transparent approach to evidence synthesis consistent with state-of-the-art systematic review-based frameworks is also possible. If all existing data are already available to the EPA, as during the registration process, a literature search may be unnecessary.
Finding: Numerous previous NASEM reports have recommended systematic review-based processes (e.g., based on NIEHS DTT IHAB, Navigation Guide, or IRIS) for organizing, synthesizing, integrating, and evaluating laboratory mammalian toxicology and epidemiologic studies for hazard and dose-response, and these recommendations are also applicable for NAMs.
___________________
5 As specified in the Statement of Task, the committee focused on human health risk assessment applications of animal studies and NAMs, and thus did not address other contexts of use, such as prioritization or tiered testing decisions.
Finding: In current human health risk assessment approaches, studies other than human and laboratory animal evidence are typically lumped into a category of mechanistic evidence. However, it can be challenging to integrate such mechanistic evidence because of the diversity of study types and endpoints it entails. One existing approach to organizing and analyzing mechanistic evidence using specific PECO questions is the concept of key characteristics. For NAMs based on AOPs, a battery of tests can form the basis for the PECO questions, as is the case for the OECD defined approaches for skin sensitization (OECD, 2021a).
Finding: The use of parallel PECO statements (Recommendation 5.4) would enable the creation of new NAM evidence streams that represent bodies of evidence from studies (other than epidemiologic and laboratory mammalian studies) that are to be used as a basis for hazard identification and dose-response assessment.
Finding: For many chemicals presently in commerce, and for many that will enter commerce in the future, human and experimental animal evidence is lacking. Approaches to identify human health hazards in the absence of such data are needed to inform risk management decisions that protect human health.
Finding: Most existing approaches to integration of evidence for making hazard identification conclusions are largely based on integration of human epidemiologic and experimental animal evidence, with other types of evidence (labeled mechanistic) used to modify this initial judgment. However, as shown by IARC (2019), it is possible to develop an evidence integration framework in which hazard identification conclusions can be reached without human or experimental animal evidence.
TABLE 5-9 Example Steps in Current Systematic Review-Based Human Health Assessment and Relevant Committee Recommendations

| Example Steps in Current Systematic Review-Based Human Health Assessment | Relevant Committee Recommendations |
|---|---|
| Step 1: Scoping and Problem Formulation. Step 2: Identify evidence through a comprehensive and systematic literature search, using the PECO to specify inclusion/exclusion criteria. This search aims to identify relevant human epidemiologic and experimental animal studies; “mechanistic evidence” may not be subjected to the same systematic review approach. | The use of “parallel PECO” statements as part of the purpose and context of use of a NAM provides a way to directly incorporate a NAM during the Scoping and Problem Formulation step. Specifically, the “target human” PECO facilitates considering NAMs as an “evidence stream” that can undergo systematic review. See Recommendations 5.4 and 5.14. |
| Step 3: Evaluate evidence at the level of individual studies through evaluation for risk of bias, or internal validity, using structured criteria. Common domains include selection bias, performance bias, attrition bias, ascertainment bias, and reporting bias. The IRIS Handbook (EPA, 2022) specifies evaluation domains separately for epidemiologic studies, animal studies, and in vitro studies. | The identification of key aspects of test method design and conduct that increase internal validity directly informs the evaluation of risk of bias in this step. See Recommendations 5.5–5.6 (internal validity). |
| Step 4: Synthesize evidence by drawing conclusions from the body of evidence corresponding to a particular health effect for a particular evidence stream. This process is also structured, with specific criteria for considering factors that increase or decrease confidence. For example, in the IRIS Handbook (EPA, 2022), the key considerations that inform evidence synthesis for human epidemiologic and animal evidence are (1) risk of bias and sensitivity, (2) consistency, (3) effect magnitude and imprecision, (4) dose-response, (5) directness of outcome/endpoint measures, (6) coherence, and (7) other factors (e.g., publication bias). The IRIS Handbook also notes that “When mechanistic information is included in a predefined unit of analysis … the same factors … are considered during mechanistic evidence synthesis.” | Components of scientific confidence for NAMs map directly to several considerations in evidence synthesis, as illustrated by examples from the IRIS Handbook (EPA, 2022). See Recommendations 5.5–5.6 (internal validity); 5.10–5.11 and Chapter 3 (variability); and 5.7–5.9 and Chapter 4 (external validity and concordance). |
| Step 5: Integrate evidence by bringing together the different evidence streams for a particular health effect and drawing a conclusion from them as a whole. The IRIS Handbook (EPA, 2022) and IARC (2019) are examples of a guided expert judgment approach. The IRIS Handbook lists the following considerations: (1) human relevance of findings, (2) cross-stream coherence, (3) susceptible populations and lifestages, (4) biological plausibility and MoA considerations, and (5) other critical inferences (e.g., toxicokinetics). | Components of scientific confidence for NAMs map directly to considerations in evidence integration, as illustrated by examples from the IRIS Handbook (EPA, 2022). See Recommendations 5.7–5.9 and Chapter 4 (external validity and concordance); 5.10–5.11 and Chapter 3 (variability); and 5.14 (framework). |
| Step 6: Dose-response assessment is performed to derive toxicity values that can be compared to human exposure levels and thereby inform risk management decision-making. This process usually involves two steps: (1) quantitative analysis of dose-response data from one or more studies and (2) extrapolation of the results of the dose-response analysis to the human population while addressing uncertainties and human variability. It is well recognized that available toxicity data may be systematically biased in their quantitative representation of the human population dose-response relationship, and procedures have been developed to address these uncertainties. | The quantitative components of external validity are key considerations in the dose-response assessment step, particularly as regards the adjustments needed in deriving toxicity values from NAMs. The evaluation of experimental and biological variability is also important for this step, specifically as to the extent to which uncertainty and variability in susceptibility in the population are addressed when deriving toxicity values from NAMs. Examples developed in the context of screening and prioritization could be further evaluated for use in human health risk assessment (Abdo et al., 2015; Blanchette et al., 2019). See Recommendations 5.7–5.9 and Chapter 4 (external validity and concordance); 5.10–5.11 and Chapter 3 (variability); and 5.14 (framework). |
Skin sensitization (Casati, 2018) is another example of such an approach.
Finding: Processes to derive quantitative toxicity values exist for human epidemiologic and laboratory mammalian studies. Some such approaches are available for other types of toxicity testing methods (e.g., reverse-toxicokinetics-based in vitro to in vivo extrapolation, QSAR models), particularly those that might serve as the foundation for rapid risk assessments of chemicals with a paucity of other toxicity data. Continued development of quantitative extrapolation approaches to derive health-protective toxicity values using data from toxicity testing approaches other than human epidemiologic and laboratory mammalian toxicity studies will be needed. Collaboration among federal and state regulatory agencies and with research entities in the federal government involved with risk estimation is also possible (e.g., National Institutes of Health (NIH)/NIEHS, Centers for Disease Control and Prevention (CDC)/National Institute for Occupational Safety and Health (NIOSH), and CDC/Agency for Toxic Substances and Disease Registry (ATSDR)). Data generated from evaluation of quantitative concordance with humans, as recommended previously, will be useful in developing such approaches.
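As one illustration of the reverse-toxicokinetics-based in vitro to in vivo extrapolation (IVIVE) named in this finding, the following minimal sketch converts an in vitro point of departure into an administered equivalent dose using a steady-state plasma concentration predicted per unit dose; the numbers are hypothetical, and a real application would use a toxicokinetic model and characterize its uncertainties.

```python
# Minimal sketch of reverse dosimetry IVIVE: find the administered equivalent
# dose (AED) at which the predicted steady-state plasma concentration (Css)
# equals the in vitro point of departure (POD). All values are hypothetical.
def administered_equivalent_dose(pod_in_vitro_uM, css_uM_per_mg_kg_day):
    """AED (mg/kg/day) such that steady-state Css equals the in vitro POD."""
    return pod_in_vitro_uM / css_uM_per_mg_kg_day

pod = 3.0   # in vitro POD, uM (e.g., a lower confidence bound on an AC50/EC10)
css = 1.5   # Css (uM) predicted for a 1 mg/kg/day dose, e.g., from a
            # high-throughput toxicokinetic model

aed = administered_equivalent_dose(pod, css)
print(f"AED = {aed:.2f} mg/kg/day")   # 2.00 mg/kg/day in this illustration
```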
Finding: Prior NASEM committees have recommended that the EPA better evaluate the range of human variability and incorporate it into its dose-response assessments for all endpoints; NAMs provide an opportunity to address this issue.
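One way NAMs can address this issue, in the spirit of the population-based in vitro approach of Abdo et al. (2015), is to derive a data-driven toxicodynamic variability factor from the spread of individual points of departure; the following minimal sketch uses simulated, hypothetical data.

```python
# Minimal sketch: using population-based in vitro data to characterize human
# toxicodynamic variability. The ratio of the median to a lower population
# percentile of individual PODs gives a data-derived variability factor.
# The simulated donor PODs below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
# log10 PODs for 1,000 simulated donors (e.g., cell lines from many individuals)
log10_pods = rng.normal(loc=0.5, scale=0.35, size=1000)

median_pod = 10 ** np.percentile(log10_pods, 50)
sensitive_pod = 10 ** np.percentile(log10_pods, 1)   # ~sensitive subgroup

tdvf = median_pod / sensitive_pod   # toxicodynamic variability factor
print(f"TDVF (median / 1st percentile): {tdvf:.1f}")
```

Such a factor could inform, or replace with data, the default intraspecies uncertainty factor applied when deriving toxicity values.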
Finding: In the absence of toxicity values derived from data in humans, toxicity values derived from laboratory mammalian toxicity studies can be used as a benchmark. When making comparisons between toxicity values, it is important to account for the uncertainty in each value and for differences in how the values were derived.
Recommendation 5.14: The EPA should develop and utilize a framework for hazard identification and deriving toxicity values protective of public health that does not require human epidemiologic or laboratory mammalian toxicity data. This framework should also enable NAM-based data to be integrated with human epidemiologic and laboratory mammalian toxicity data. In so doing, the EPA should continue to follow previous NASEM recommendations related to systematic review and risk assessment. This will ensure a seamless handoff between evaluation of NAM-based testing strategies and evaluation of scientific confidence of NAM data for individual chemicals.
REFERENCES

Abdo, N., M. Xia, C. C. Brown, O. Kosyk, R. Huang, S. Sakamuru, Y.-H. Zhou, J. R. Jack, P. Gallins, K. Xia, Y. Li, W. A. Chiu, A. A. Motsinger-Reif, C. P. Austin, R. R. Tice, I. Rusyn, and F. A. Wright. 2015. “Population-Based In Vitro Hazard and Concentration-Response Assessment of Chemicals: The 1000 Genomes High-Throughput Screening Study.” Environmental Health Perspectives 123(5): 458–466.
Allen, B. C., K. S. Crump, and A. M. Shipp. 1988. “Correlation Between Carcinogenic Potency of Chemicals in Animals and Humans.” Risk Analysis 8(4): 531–544.
Baan, R. A., B. W. Stewart, and K. Straif. 2019. Tumour Site Concordance and Mechanisms of Carcinogenesis, IARC Scientific Publications, No. 165. Lyon: IARC. https://publications.iarc.fr/578.
Bal-Price, A., F. Pistollato, M. Sachana, S. K. Bopp, S. Munn, and A. Worth. 2018. “Strategies to Improve the Regulatory Assessment of Developmental Neurotoxicity (DNT) Using in Vitro Methods.” Toxicology and Applied Pharmacology 354 (September): 7–18.
Blanchette, A. D., F. A. Grimm, C. Dalaijamts, N.-H. Hsieh, K. Ferguson, Y.-S. Luo, I. Rusyn, and W. A. Chiu. 2019. “Thorough QT/QTc in a Dish: An In Vitro Human Model That Accurately Predicts Clinical Concentration-QTc Relationships.” Clinical Pharmacology and Therapeutics 105(5): 1175–1186.
Brozek, J. L., C. Canelo-Aybar, E. A. Akl, J. M. Bowen, J. Bucher, W. A. Chiu, M. Cronin, B. Djulbegovic, M. Falavigna, G. H. Guyatt, and A. A. Gordon. 2021. GRADE Guidelines 30: the GRADE approach to assessing the certainty of modeled evidence—An overview in the context of health decision-making. Journal of Clinical Epidemiology 129: 138–150.
Casati, S. 2018. “Integrated Approaches to Testing and Assessment.” Basic and Clinical Pharmacology and Toxicology 123: 51–55. https://doi.org/10.1111/bcpt.13018.
Crouch, E. 1983. “Interspecies Extrapolation of Cancer Potency.” Environmental Health Perspectives. https://ehp.niehs.nih.gov/doi/epdf/10.1289/ehp.8350321.
Crouch, E., and R. Wilson. 1979. “Interspecies Comparison of Carcinogenic Potency.” Journal of Toxicology and Environmental Health 5(6): 1095–1118. https://doi.org/10.1080/15287397909529817.
Eick, S. M., D. E. Goin, N. Chartres, J. Lam, and T. J. Woodruff. 2020. “Assessing Risk of Bias in Human Environmental Epidemiology Studies Using Three Tools: Different Conclusions from Different Tools.” Systematic Reviews 9(1): 249.
EPA (U.S. Environmental Protection Agency). 2011. “Toxicological Review of Trichloroethylene.” EPA/635/R-09/011F.
EPA. 2018. “Strategic Plan to Promote the Development and Implementation of Alternative Test Methods Within the TSCA Program.” EPA-740-R1-8004. https://www.epa.gov/sites/default/files/2018-06/documents/epa_alt_strat_plan_6-20-18_clean_final.pdf.
EPA. 2022. ORD Staff Handbook for Developing IRIS Assessments. EPA/600/R-22/268. Washington, DC: EPA Office of Research and Development.
Guyatt, G. H., A. D. Oxman, R. Kunz, J. Woodcock, J. Brozek, M. Helfand, P. Alonso-Coello, et al. 2011. “GRADE Guidelines: 8. Rating the Quality of Evidence--Indirectness.” Journal of Clinical Epidemiology 64(12): 1303–1310.
Haseman, J. K., J. E. Huff, G. N. Rao, and S. L. Eustis. 1989. “Sources of Variability in Rodent Carcinogenicity Studies.” Fundamental and Applied Toxicology: Official Journal of the Society of Toxicology 12(4): 793–804.
Higgins, J. P. T., J. Thomas, J. Chandler, M. Cumpston, T. Li, M. J. Page, and V. A. Welch. 2022. Cochrane Handbook for Systematic Reviews of Interventions. Chichester, UK: John Wiley & Sons.
IARC (International Agency for Research on Cancer). 2019. “Preamble to the IARC Monographs (Amended January 2019).” https://monographs.iarc.who.int/iarc-monographs-preamble-preamble-to-the-iarc-monographs/.
ICH (International Council for Harmonisation). 2023. Testing for Carcinogenicity of Pharmaceuticals—Step 5. EMA/774371. Committee for Medicinal Products for Human Use.
Jacobs, A. C., and K. P. Hatfield. 2013. “History of Chronic Toxicity and Animal Carcinogenicity Studies for Pharmaceuticals.” Veterinary Pathology 50(2): 324–333.
Jüni, P., D. G. Altman, and M. Egger. 2001. “Systematic Reviews in Health Care: Assessing the Quality of Controlled Clinical Trials.” BMJ 323(7303): 42–46.
Kammerer, F. 2011. Modified Text of Proposed Regulations, Division 4.5, Title 22, California Code of Regulations, Chapter 54: Green Chemistry Hazard Traits. Sacramento, CA.
Lam, J., E. Koustas, P. Sutton, P. I. Johnson, D. S. Atchley, S. Sen, K. A. Robinson, D. A. Axelrad, and T. J. Woodruff. 2014. “The Navigation Guide—Evidence-Based Medicine Meets Environmental Health: Integration of Animal and Human Evidence for PFOA Effects on Fetal Growth.” Environmental Health Perspectives 122(10): 1040–1051.
Li, J., C. Brisson, E. Clays, M. M. Ferrario, I. D. Ivanov, P. Landsbergis, N. Leppink, et al. 2018. “WHO/ILO Work-Related Burden of Disease and Injury: Protocol for Systematic Reviews of Exposure to Long Working Hours and of the Effect of Exposure to Long Working Hours on Ischaemic Heart Disease.” Environment International 119 (October): 558–569.
Lind, L., J. A. Araujo, A. Barchowsky, S. Belcher, B. R. Berridge, N. Chiamvimonvat, W. A. Chiu, et al. 2021. “Key Characteristics of Cardiovascular Toxicants.” Environmental Health Perspectives 129(9): 95001.
Mansoury, M., M. Hamed, R. Karmustaji, F. Al Hannan, and S. T. Safrany. 2021. “The Edge Effect: A Global Problem. The Trouble with Culturing Cells in 96-Well Plates.” Biochemistry and Biophysics Reports 26 (July): 100987.
Miller, G. W. 2014. “Improving Reproducibility in Toxicology.” Toxicological Sciences: An Official Journal of the Society of Toxicology 139(1): 1–3.
Morgan, R. L., K. A. Thayer, L. Bero, N. Bruce, Y. Falck-Ytter, D. Ghersi, G. Guyatt, et al. 2016. “GRADE: Assessing the Quality of Evidence in Environmental and Occupational Health.” Environment International 92–93 (January): 611–616.
Morgan, R. L., P. Whaley, K. A. Thayer, and H. J. Schünemann. 2018. “Identifying the PECO: A Framework for Formulating Good Questions to Explore the Association of Environmental and Other Exposures with Health Outcomes.” Environment International 121 (Pt 1): 1027–1031.
NASEM (National Academies of Sciences, Engineering, and Medicine). 2017a. Using 21st Century Science to Improve Risk-Related Evaluations. Washington, DC: The National Academies Press. https://doi.org/10.17226/24635.
NASEM. 2017b. Guiding Principles for Developing Dietary Reference Intakes Based on Chronic Disease. Washington, DC: The National Academies Press. https://doi.org/10.17226/24828.
NASEM. 2018. Progress Toward Transforming the Integrated Risk Information System (IRIS) Program: A 2018 Evaluation. Washington, DC: The National Academies Press. https://doi.org/10.17226/25086.
NASEM. 2021. The Use of Systematic Review in EPA’s Toxic Substances Control Act Risk Evaluations. Washington, DC: The National Academies Press. https://doi.org/10.17226/25952.
NASEM. 2022. Review of U.S. EPA’s ORD Staff Handbook for Developing IRIS Assessments. Washington, DC: The National Academies Press. https://doi.org/10.17226/26289.
NRC (National Research Council). 1994. Science and Judgment in Risk Assessment. Washington, DC: The National Academies Press. https://doi.org/10.17226/2125.
NRC. 2009. Science and Decisions: Advancing Risk Assessment. Washington, DC: The National Academies Press. https://doi.org/10.17226/12209.
NRC. 2011. Review of the Environmental Protection Agency’s Draft IRIS Assessment of Formaldehyde. Washington, DC: The National Academies Press. https://doi.org/10.17226/13142.
NRC. 2014. Review of EPA’s Integrated Risk Information System (IRIS) Process. Washington, DC: The National Academies Press. https://doi.org/10.17226/18764.
Niepel, M., M. Hafner, C. E. Mills, K. Subramanian, E. H. Williams, M. Chung, B. Gaudio, et al. 2019. “A Multi-Center Study on the Reproducibility of Drug-Response Assays in Mammalian Cell Lines.” Cell Systems 9(1): 35–48.e5.
NTP (National Toxicology Program). 2019. Handbook for Conducting a Literature-Based Health Assessment Using OHAT Approach for Systematic Review and Evidence Integration. Washington, DC: U.S. Department of Health and Human Services. https://ntp.niehs.nih.gov/ntp/ohat/pubs/handbookmarch2019_508.pdf.
OECD (Organisation for Economic Co-operation and Development). 2005. “Guidance Document on the Validation and International Acceptance of New or Updated Test Methods for Hazard Assessment,” OECD Series on Testing and Assessment, Guidance Document 34. Paris: OECD Publishing. https://one.oecd.org/document/ENV/JM/MONO(2005)14/en/pdf.
OECD. 2007. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationships [(Q)SAR] Models. Paris: OECD Publishing. https://one.oecd.org/document/env/jm/mono(2007)2/en/pdf.
OECD. 2014. The Adverse Outcome Pathway for Skin Sensitisation Initiated by Covalent Binding to Proteins. Paris: OECD Publishing.
OECD. 2018. “Guidance Document on Good In Vitro Method Practices (GIVIMP),” OECD Series on Testing and Assessment, Guidance Document 286. Paris: OECD Publishing. https://doi.org/10.1787/9789264304796-en.
OECD. 2020. “Overview of Concepts and Available Guidance Related to Integrated Approaches to Testing and Assessment (IATA),” OECD Series on Testing and Assessment, Guidance Document 329. Paris: OECD Publishing.
OECD. 2021a. Guideline No. 497: Defined Approaches on Skin Sensitisation. Paris: OECD Publishing. https://doi.org/10.1787/b92879a4-en.
OECD. 2021b. “Guidance document on the characterisation, validation and reporting of Physiologically Based Kinetic (PBK) models for regulatory purposes,” OECD Series on Testing and Assessment, Guidance Document 331. Paris: OECD Publishing. https://www.oecd.org/chemicalsafety/risk-assessment/guidance-document-on-the-characterisation-validation-and-reporting-of-physiologically-based-kinetic-models-for-regulatory-purposes.pdf.
Piersma, A. H., J. van Benthem, J. Ezendam, and A. S. Kienhuis. 2018. “Validation Redefined.” Toxicology in Vitro: An International Journal Published in Association with BIBRA 46 (February): 163–165.
Roth, N., J. Zilliacus, and A. Beronius. 2021. “Development of the SciRAP Approach for Evaluating the Reliability and Relevance of in Vitro Toxicity Data.” Frontiers in Toxicology (2021): 42.
Samet, J. M., W. A. Chiu, V. Cogliano, J. Jinot, D. Kriebel, R. M. Lunn, F. A. Beland, et al. 2020. “The IARC Monographs: Updated Procedures for Modern and Transparent Evidence Synthesis in Cancer Hazard Identification.” Journal of the National Cancer Institute 112 (1): 30–37.
Singam, E. R. A., P. Tachachartvanich, D. Fourches, A. Soshilov, J. C. Y. Hsieh, M. A. La Merrill, M. T. Smith, and K. A. Durkin. 2020. “Structure-Based Virtual Screening of Perfluoroalkyl and Polyfluoroalkyl Substances (PFASs) as Endocrine Disruptors of Androgen Receptor Activity Using Molecular Docking and Machine Learning.” Environmental Research 190: 109920.
Smith, M. T., K. Z. Guyton, C. F. Gibbons, J. M. Fritz, C. J. Portier, I. Rusyn, D. M. DeMarini, et al. 2016. “Key Characteristics of Carcinogens as a Basis for Organizing Data on Mechanisms of Carcinogenesis.” Environmental Health Perspectives 124(6): 713–721.
Smith, M. T., K. Z. Guyton, N. Kleinstreuer, A. Borrel, A. Cardenas, W. A. Chiu, D. W. Felsher, et al. 2020. “The Key Characteristics of Carcinogens: Relationship to the Hallmarks of Cancer, Relevant Biomarkers, and Assays to Measure Them.” Cancer Epidemiology, Biomarkers & Prevention: A Publication of the American Association for Cancer Research, Cosponsored by the American Society of Preventive Oncology 29 (10): 1887–1903.
Voelkl, B., N. S. Altman, A. Forsman, W. Forstmeier, J. Gurevitch, I. Jaric, N. A. Karp, et al. 2020. “Reproducibility of Animal Research in Light of Biological Variation.” Nature Reviews. Neuroscience 21 (7): 384–393.
Whaley, P., T. Piggott, R. L. Morgan, S. Hoffmann, K. Tsaioun, L. Schwingshackl, M. T. Ansari, K. A. Thayer, and H. J. Schünemann. 2022. “Biological Plausibility in Environmental Health Systematic Reviews: A GRADE Concept Paper.” Environment International 162 (April): 107109.