This chapter presents the results of the literature review, the two virtual workshops, the literature presented to the committee in information-gathering sessions, and committee deliberations with respect to qualitative and quantitative variability of laboratory mammalian toxicity studies for human health risk assessment. It also presents the implications of these findings for the use of such studies to establish the performance of new approach methods (NAMs).
Specifically, the charge question to the committee that the chapter addresses is as follows:
Given the results of the literature review and workshops, what are the implications of the qualitative and quantitative variability of laboratory mammalian toxicity studies when using them to establish the performance of NAMs?
In addressing the charge question, the committee considered the multiple possible sources of variability, including experimental as well as biological variability. The following sections describe these different sources of variability and the methods used to assess variability, summarize the insights from the virtual workshops and the literature reviews, and provide the committee’s findings and recommendations concerning variability.
As mentioned in the previous chapters and discussed extensively in the 2007 report Toxicity Testing in the 21st Century: A Vision and a Strategy, animal toxicity studies have a long history of use. Early animal toxicity studies were performed using a variety of different animal species, housing conditions, and study designs and methods (Jacobs and Hatfield, 2013). Standardized testing protocols were progressively developed which stressed the importance of the purity and stability of the test chemical, selection of an appropriate test species, standardization of animal maintenance and environmental control (including temperature, humidity, hours of light, bedding, airflow, water, and diet), dose selection, and issues related to various routes of administration (Haseman et al., 1989; Jacobs and Hatfield, 2013). From the standardized protocols, study guidelines regarding data management, reporting, and experimental design including sample sizes for dose groups emerged (e.g., good laboratory practices [GLP], Organisation for Economic Co-operation and Development [OECD] test guidelines, U.S. Environmental Protection Agency [EPA] test guidelines for health effects, Food and Drug Administration [FDA] Redbook), which attempted to harmonize key criteria of a given toxicity study while still allowing flexibility for practical applications. Standardized protocols and study guidelines were designed to reduce experimental variability within and between studies by restricting known sources of experimental variability.
Despite best efforts to control for experimental variability, there will always be factors that differ from study to study (e.g., animal handler, supply chain change in feed, microbiome, vivarium air and water filtration) or even within a study. Thus, two studies are never identical to each other. The presence of experimental variability is ubiquitous within experimental science (including in vitro studies) and is not unique to laboratory mammalian toxicity studies; importantly, experimental variability can occur not only between laboratories but also within the same laboratory over time. Ideally, experiments should be designed so that they are resilient to variations that occur inside (e.g., new members joined, animals ordered during a different season) and outside (e.g., finals week, day after a major sporting event) the laboratory, so that unexpected variations are detected through control measurements and experimental results can be replicated by others. Minimizing variability is often regarded as highly valuable because it provides greater statistical power to detect important biological differences at a specific sample size. However, it may also limit the generalizability and the reproducibility of the results (Miller, 2014), since with every additional variable that is standardized, one risks that the inference space of a study decreases (Voelkl et al., 2020). In addition, when comparing reproducibility between studies, it is important to match the studies as closely as possible, since differences among methods can increase variability. When an outcome is consistently observed despite the use of different methods, confidence increases in the causality of the relationship between exposure and outcome; when the outcome is also observed across species (which represents biological variability), confidence increases in its relevance to humans. Such consistency also provides evidence for the external validity of a test method.
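The link between variability and statistical power noted above can be illustrated with a short calculation. The following Python sketch uses the standard normal-approximation formula for comparing two group means; the function name, the endpoint values, and the choice of a two-sided test at alpha = 0.05 with 80% power are illustrative assumptions, not values taken from any guideline or study discussed here. It illustrates only the power side of the trade-off, not the possible loss of generalizability.

```python
from math import ceil
from statistics import NormalDist


def animals_per_group(sd, difference, alpha=0.05, power=0.80):
    """Approximate animals per group for a two-sided, two-group comparison of means.

    Normal approximation:
        n = 2 * (z_(1 - alpha/2) + z_power)^2 * (sd / difference)^2
    `sd` is the within-group standard deviation and `difference` is the
    smallest difference between group means that the study should detect.
    """
    if sd <= 0 or difference <= 0:
        raise ValueError("sd and difference must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * (sd / difference) ** 2)


# Hypothetical endpoint: detect a 10-unit difference between group means.
for sd in (10.0, 5.0):
    print(f"SD = {sd}: about {animals_per_group(sd, 10.0)} animals per group")
```

Under these assumptions, halving the standard deviation reduces the required group size by roughly a factor of four (from about 16 to about 4 animals per group in this hypothetical case), which is why standardization is attractive from a power standpoint.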
One key approach to understanding the experimental variability of a method is to compare test results within and among laboratories. Comparisons among laboratories can be performed through prospective interlaboratory studies in which each laboratory tests the same substances multiple times using the same method. For an ideal method, within-laboratory variability would be similar to between-laboratory variability, indicating that no laboratory-specific sources of variability differ significantly among the laboratories. When this is not the case, the distribution of responses among control measurements may provide insights into outlier results and differences among laboratories. Because in vivo guideline methods often did not undergo interlaboratory validation before their adoption, the most common approach for evaluating them is to perform post hoc comparisons among results from different laboratories that have tested the same substance. However, this approach has limitations; for example, (1) different impurities may be present in the test substances, which could increase the variability in the results, and (2) each laboratory typically tests the substance only once, so within-laboratory variability is not evaluated.
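As a rough illustration of the within- versus between-laboratory comparison described above, the following Python sketch partitions results from a prospective interlaboratory study into two variance components using a one-way random-effects analysis of variance. The laboratory names and values are hypothetical and are not drawn from any study cited in this chapter; the balanced design (the same number of replicates in every laboratory) and the method-of-moments estimator are simplifying assumptions.

```python
from statistics import mean


def variance_components(results):
    """Estimate within- and between-laboratory variance (balanced design).

    `results` maps a laboratory label to a list of replicate measurements
    of the same substance obtained with the same method.
    Returns (var_within, var_between).
    """
    groups = list(results.values())
    k = len(groups)        # number of laboratories
    n = len(groups[0])     # replicates per laboratory
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need >= 2 labs with the same number (>= 2) of replicates")

    lab_means = [mean(g) for g in groups]
    grand_mean = mean(lab_means)

    # Mean square within laboratories (repeatability).
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, lab_means) for x in g)
    ms_within = ss_within / (k * (n - 1))

    # Mean square between laboratories.
    ms_between = n * sum((m - grand_mean) ** 2 for m in lab_means) / (k - 1)

    var_within = ms_within
    # Method-of-moments estimate, truncated at zero.
    var_between = max(0.0, (ms_between - ms_within) / n)
    return var_within, var_between


# Hypothetical replicate results (arbitrary units) from three laboratories.
hypothetical_results = {
    "lab_a": [10, 12, 11],
    "lab_b": [14, 15, 16],
    "lab_c": [9, 10, 11],
}

var_within, var_between = variance_components(hypothetical_results)
print(f"within-laboratory variance:  {var_within:.2f}")
print(f"between-laboratory variance: {var_between:.2f}")
```

In this hypothetical data set the between-laboratory component is several times larger than the within-laboratory component, the pattern that would point to laboratory-specific sources of variability. A post hoc comparison in which each laboratory tested the substance only once could not make this separation, because the within-laboratory term would not be estimable.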
As defined in Chapter 2, biological variability refers to the true differences in attributes due to heterogeneity or diversity. Complex biological systems are intrinsically variable in their response to both endogenous and exogenous stressors, such as chemicals, due to the stochastic nature of biological systems at the chemical, molecular, cellular, tissue, organ, and organismal levels. Intrinsic variability arises from the stochastic interactions between finite numbers of spatially and temporally distributed macromolecules within and between cells as well as from dynamic multifactorial biological systems. Thus, in any biologically based assay system, there will always be intrinsic and irreducible biological variation that cannot be eliminated from the assay (i.e., the intrinsic biological variability). This intrinsic variability will affect the precision and the false positive and false negative rates of a given assay system, and the extent of the variability will be specific to each assay. It is essential to characterize the variability and to account for it in the analysis, interpretation, and use of the assay results. However, in practice, it can be difficult to distinguish between intrinsic biological variability and experimental variability, and often these two sources of variability are considered together in evaluating inter- and intra-laboratory variability.
Biological variability represents the reality of complex biological systems. For humans, this would correspond to differences in outcomes across the population due to differences in intrinsic factors (e.g., life stage, reproductive status, age, gender, genetic traits) and acquired factors that modify susceptibility (e.g., previous or ongoing exposure to multiple chemicals, pre-existing disease, geography, socioeconomic status, racism/discrimination, cultural and workplace factors). With respect to laboratory mammalian studies, this would include differences in responses within animals of a given species and between animals of different species, strain, age, sex, body weight, history, and concurrent or previous exposures. Multiple sources of potential biological variability have been identified in laboratory mammalian toxicity studies, including genetic and epigenetic differences between species and between individuals within a species; the sex and developmental stage of the test organism or system; previous or concurrent exposures; and health status. For multiple reasons, biological variability (e.g., biological diversity) has historically been minimized in laboratory mammalian toxicity studies by controlling the experimental design (discussed previously). Nevertheless, intrinsic biological variability will remain within any laboratory mammalian toxicity study. Further, while limiting variability aids the sensitivity of a test system to detect adverse responses, it may come at the cost of generalizability to conditions beyond those being studied.
Figure 3-1 depicts the multiple sources of variability considered by the committee. Box 3-1 addresses characterization of variability.
The first virtual workshop was held on December 9, 2021, with the aim of providing information to assist the committee in considering the potential utility and expectations for the use of NAMs in risk assessment. Presentations and round table discussions involving experts from academia, industry, government, and other organizations addressed current scientific knowledge with regard to how laboratory mammalian toxicity studies are used to inform chemical safety decisions, some examples of variability and concordance of laboratory mammalian toxicity studies, and a consideration of the expectations of different stakeholders. A summary of the workshop proceedings was published (NASEM, 2022a). Points relevant to variability in laboratory mammalian toxicity studies are discussed subsequently.
Participants elaborated on the goal of toxicity testing and collectively asked: “Is the goal to prevent catastrophic effects (i.e., overt effects) or to predict effects that are of public health concern?” Several participants expressed that laboratory toxicity studies were designed to identify catastrophic effects rather than more subtle effects that are of public health interest, including obesity and metabolic diseases, thyroid disease, hypertension, allergy and asthma, and autoimmune diseases. NAMs may provide an opportunity to inform on the effect of chemicals on these health states.
While laboratory mammalian toxicity studies have been the backbone of current risk assessment, participants presented examples of qualitative and quantitative variability within laboratory mammalian tests. Even when studies follow the same guideline, there will still be variability due to diet and water source, sex-specific differences, route of administration, and timing of administration. Acknowledging some of the shortcomings of the existing toxicity tests can help identify potential areas of opportunity where NAMs can fill data gaps. For example, laboratory mammalian toxicity studies usually use a small number of young, healthy animals. NAMs may provide an opportunity to include more biological diversity.
The second virtual workshop was held on May 12, 2022 (NASEM, 2022b). This workshop addressed elements of a scientific confidence framework for NAMs pertinent to risk assessment via case studies related to mixtures, developmental neurotoxicity (DNT), and estrogenicity. The goal of the case studies was to illustrate the strengths and weaknesses of traditional and nontraditional approaches and methods. Relevant to variability in laboratory mammalian toxicity studies were the case studies on mixtures and DNT, as discussed subsequently.
This case study suggested that mixture studies provide an opportunity to assess the variability in laboratory mammalian toxicity studies because variability was represented in analyses where multiple chemicals were evaluated to establish toxicity equivalency factors (TEFs), a concept important in the field of mixtures. Existing data sets that may be helpful in addressing this question include tumor bioassay studies on polycyclic aromatic hydrocarbons (PAHs) (EPA, 2010) or dioxin-like chemicals (Haws et al., 2006; Scott et al., 2006), studies of noncancer effects from organophosphate or pyrethroid pesticides, and studies of chemicals with a common adverse outcome, such as antiandrogenic chemicals (Conley et al., 2018; Howdeshell et al., 2017). Meta-analytic or meta-regression techniques may be useful for analyzing such data, because they may allow the contributions of interstudy and interchemical variability to overall heterogeneity to be disentangled.
Regarding variability in DNT studies, participants reported several potential sources. In particular, the timing of exposure was highlighted as an important determinant of study outcomes and therefore a significant source of variability across studies. Biological sex was also identified as a source of variability in DNT study outcomes. Further, participants indicated that the results of DNT guideline studies often vary depending on the laboratory conducting the study. An example was given of the importance of including adequate positive controls to aid data interpretation, because without appropriate controls it might be difficult to determine whether an effect, or lack thereof, is real. There was also discussion of the possibility that behavioral endpoints such as those assessed in DNT studies might inherently be more variable than anatomical endpoints. Finally, because DNT studies are seldom performed, it has been difficult to characterize and understand the variability of this test across chemical classes. Overall, there were concerns about the confidence in these tests due to the perceived variability and lack of sensitivity.
Participants indicated considerable variability in the data from in vivo tests such as the uterotrophic assay. There was discussion about the importance of having adequate positive controls, and the participants noted that even the quantitative results from positive controls could vary from laboratory to laboratory, or even within a laboratory at different times, highlighting the importance of defining performance limits and expected levels of accuracy for any given assay. There was also discussion regarding the different aspects of variability: biological and technical (i.e., experimental). For example, adrenal weight, which seems like a simple outcome, can in fact differ from one experimenter to another depending on how the tissue is cleaned. Participants also mentioned that understanding variability is important and that efforts to eliminate variability, especially biological variability, are unwarranted. Variability is ideally understood and embraced because additional information might be gained by studying it. For example, participants highlighted the need to understand how different individuals respond to a given exposure and, when possible, to attempt to replicate the variability observed in the human population, including susceptible populations. There was also discussion regarding the irreducible amount of variability that will be present in any type of study, including those beyond estrogenicity (e.g., subchronic, chronic, and developmental studies), and that will affect the point of departure¹ derived from these studies. According to another panelist, variability in outcomes might have different consequences depending on its type. For example, according to a workshop participant, misclassifying a compound as nonestrogenic when it is in fact estrogenic (a false negative) might have considerably more deleterious implications than a small difference in the quantitative assessment of a compound’s estrogenic potency. In other words, the public health impacts of a misclassification could be more severe than those associated with a marginal difference (e.g., <2-fold) in the potency estimate.
To support its effort, the committee reviewed existing literature that evaluated evidence on variability and concordance of laboratory mammalian toxicity tests. The primary literature reporting on variability in relevant laboratory mammalian toxicity tests and amenable to de novo review and analysis was voluminous, and a formal systematic review of this literature was not considered within the scope of the committee’s effort. However, the committee considered literature consisting of reviews, wherein information from multiple relevant studies, experiments, or databases was compiled and analyzed. Recognizing that systematic reviews provide a transparent, comprehensive, and consistent evaluation and are designed to have less bias than other types of reviews and analyses, the committee conducted an overview review (Pollock et al., 2019) to identify and evaluate systematic reviews of the scientific evidence of the highest methodological quality relevant to the committee’s charge. The goals of the approach were to identify relevant systematic reviews and authoritative reviews, evaluate their methodological quality, and illustrate the study strengths and weaknesses as well as the populations, interventions or exposures, and outcomes covered and where significant gaps may remain. Systematic reviews and authoritative reviews judged to be of critically low methodological quality were not considered further by the committee, whereas those of higher quality formed the evidentiary basis analyzed by the committee in reaching findings and recommendations that addressed the charge questions.
The approach entailed development of a prespecified method as further described in Appendix C. This method detailed the key terms and their definitions, the objectives of the review, the scoping questions and associated population, exposures, comparators, and outcomes (PECO) statements, the inclusion and exclusion criteria, the literature search strategy, the process to assess methodological quality, and the analysis plan. The following goal was used to guide the literature review on variability: To summarize the systematic reviews and authoritative reviews (see Appendix C for definition) that assess and evaluate variability of laboratory mammalian toxicity studies.
In brief, this goal was addressed through a comprehensive literature search of multiple databases conducted using relevant terms. The results were reviewed for relevance by two independent screeners, and the included studies were evaluated for methodological quality using AMSTAR 2 (A MeaSurement Tool to Assess systematic Reviews) (Shea et al., 2017), in line with assessments of environmental health information by prior National Academies of Sciences, Engineering, and Medicine (NASEM) committees (NASEM 2019b, 2021, 2022c). An evidence map was generated to illustrate the extent of coverage with respect to the PECO statements as well as the quality of existing systematic reviews.²
___________________
1 The point of departure (POD) is the lowest dose or concentration of a substance that produces a measurable effect or adverse response in a biological system, which is utilized to establish safe exposure limits for humans or other organisms in risk assessment.
The overview review identified 71 systematic reviews and authoritative reviews of the literature that potentially addressed variability of laboratory mammalian toxicity studies. Systematic reviews judged to be of critically low quality were not considered further by the committee. Data on variability of laboratory mammalian toxicity studies were sought from the remaining 24 studies, as summarized in Table 3-1.
Of these, five systematic reviews were most informative to the committee’s task because they presented data on the variability of laboratory mammalian studies; they are briefly summarized subsequently. The remaining studies are summarized in Appendix C; they were focused on endpoints that are not standard in toxicological risk assessment, had methodological flaws that confounded the outcomes, or failed to report or attribute the sources of data variability.
Variability data were identified in two separate systematic reviews of two sets of endpoints, both in a consensus study published by NASEM in 2017 (NASEM, 2017): the impact of phthalate exposure on male reproductive tract development (and on testosterone levels) and the impact of polybrominated diphenyl ether (PBDE) exposure on neurobehavioral function. For example, decreased anogenital distance (AGD), an important endpoint for male reproductive development, was evaluated in a meta-regression for rats exposed to diethylhexyl phthalate (DEHP) and yielded a linear trend with respect to dose (–1.55 [95% CI: –1.86, –1.24]). The variability in results among studies was also evaluated for rats and mice exposed to other phthalates and for other male reproductive endpoints.
As an illustration of the range of variability that can be seen within systematic reviews, Table 3-2 shows the distribution of reported values for tau (τ) and for I2 from the meta-analyses of the effects of six different phthalates on AGD and on blood testosterone levels in the rat, alongside the number of experiments (each dose level is considered an experiment) contributing to each analysis. For most compounds and for both outcomes, the variation between studies is substantial (median I2 of 94.8%), indicating heterogeneity of the underlying effects. It is likely that a substantial proportion of this heterogeneity reflects true, biologically important differences in responses between experiments rather than imprecise estimation of a consistent effect. Specifically, the experiments employed different strains of rat, doses, timing of exposure, outcome measurements, and so forth; therefore, heterogeneity is expected. Additional studies considered relevant to the committee’s charge are briefly summarized as follows.
Perel et al. (2007) reported a systematic review and meta-analyses of the animal data for six interventions where the effect on human health is known (corticosteroids in traumatic head injury, antifibrinolytics in hemorrhage, thrombolysis in acute ischemic stroke, tirilazad in acute ischemic stroke, antenatal corticosteroids to prevent neonatal respiratory distress syndrome, and bisphosphonates to prevent and treat osteoporosis). They found heterogeneity (I2) of 0% for the effect of corticosteroids on grip test strength following traumatic brain injury; 78%, 75%, and 0% for the effects of tissue plasminogen activator on infarct volume, neurobehavior, and mortality, respectively, following focal cerebral ischemia; 73% and 58% for the effects of tirilazad on infarct volume and neurobehavior, respectively, following focal cerebral ischemia; and 73% for the effects of corticosteroids on mortality in the neonatal respiratory distress syndrome.
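For readers less familiar with the heterogeneity statistics reported in this section, the following Python sketch shows one common way to compute Cochran's Q, I2, and tau from study-level effect estimates and their sampling variances. It assumes the DerSimonian-Laird method-of-moments estimator, in which tau squared estimates the between-study variance; the reviews summarized here may have used other estimators or software, and the effect sizes and variances below are hypothetical.

```python
from math import sqrt


def heterogeneity(effects, variances):
    """Return (Q, I2 in percent, tau) for a set of study-level estimates.

    effects:   effect estimate from each experiment
    variances: sampling variance of each estimate (all > 0)
    Uses inverse-variance weights and the DerSimonian-Laird estimator.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need matching lists with at least two experiments")
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)

    # Fixed-effect (inverse-variance weighted) pooled estimate.
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    # I2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird between-study variance, floored at 0.
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)
    return q, i2, sqrt(tau2)


# Hypothetical effect sizes and variances from three experiments.
example_effects = [0.0, 1.0, 2.0]
example_variances = [0.25, 0.25, 0.25]

q, i2, tau = heterogeneity(example_effects, example_variances)
print(f"Q = {q:.2f}, I2 = {i2:.1f}%, tau = {tau:.3f}")
```

Because I2 is a proportion of total variation, it tends to rise as individual experiments become more precise even when tau is unchanged, so the two statistics are best read together rather than interchangeably.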
___________________
2 See https://public.tableau.com/app/profile/leslie.beauchamp/viz/NAMsEvidenceMapDashboard/EvidenceMap?publish=yes.
| Reference | Population(s) | Exposure(s) | Adverse Outcome(s) | Committee Comments (relevance and informativeness to charge) |
|---|---|---|---|---|
| Andrade et al. (2019) | Rats and mice | Resveratrol | Alveolar bone loss and an expression of cytokines | Conducted a systematic review and meta-analysis of seven mammalian preclinical studies for the effects of resveratrol on induced periodontal disease. Reported a high level of heterogeneity between the studies (I2 = 95%; p < 0.01). |
| Leffa et al. (2019) | Rats | Spontaneously hypertensive rat model of attention-deficit/hyperactivity disorder (ADHD) | Locomotion (hyperactivity), attention, impulsivity or memory | Identified 36 studies that met the inclusion criteria. The authors noted significant heterogeneity in hyperactivity outcome measures, with an I2 = 70% and a Chi2 = 151.56 (df = 45, p < 0.001), and similarly significant heterogeneity in the attention analysis, with an I2 = 68% and a Chi2 = 72.68 (df = 23, p < 0.001). The impulsivity analysis showed low heterogeneity, with an I2 = 9% and a Chi2 = 8.8 (df = 8, p = 0.36), while the memory analysis showed moderate heterogeneity, with an I2 = 43% and a Chi2 = 22.97. Discusses differences in study design but does not comment on the sources of outcome heterogeneity. |
| Morahan et al. (2020) | Rats and mice | Maternal nonnutritive sweeteners diet during pregestation and/or gestation and/or lactation | Impact on body weight | Reviewed the effects of non-nutritive sweeteners on dams and pups; they found low variability for effects on maternal weight (I2 = 12%) and litter size (0%), but not for offspring weight at weaning (80%) or in adulthood (92%). |
| NASEM (2017) | Humans, rats, mice, guinea pigs | Developmental exposure to polybrominated diphenyl ethers (PBDEs) | Humans: quantitative measures of intelligence; ADHD and attention-related behavioral conditions; Animals: Measures of learning, memory, attention, or response inhibition | Overall, for latency in the last trial of the Morris Water Maze, heterogeneity was 24%; for individual brominated diphenyl ethers (BDEs), this ranged from 0% (BDE-209) to 44% (BDE-47) (discussed further in the text). |
| NASEM (2017) | Humans, rats, mice, guinea pigs | In utero exposure to phthalates (see Box 3-1 of the report for details) | Male reproductive toxicity (anogenital distance [AGD], hypospadias, fetal testosterone) | Reported values for τ and for I2 for the effect of six phthalates on two endpoints, AGD and testosterone levels (presented in the text). τ is a combination of intra- and interstudy variability for each compound. Different doses used in each study were considered as independent experiments for the same study endpoint and chemical. |
| Perel et al. (2007) | Humans, rats, mice, primates, other mammals | Six interventions for which there was evidence of a treatment effect (benefit or harm) in systematic reviews of clinical trials | Tirilazad was associated with worse outcome in patients; other outcomes studied were beneficial in nature | Heterogeneity (I2) was 0% for the effect of corticosteroids on grip test strength following traumatic brain injury; 78% and 75% and 0% for the effects of tissue plasminogen activator on infarct volume, neurobehavior, and mortality, respectively, following focal cerebral ischemia; 73% and 58% for the effects of tirilazad on infarct volume and neurobehavior, respectively, following focal cerebral ischemia; and 73% for the effects of corticosteroids on mortality in the neonatal respiratory distress syndrome (discussed further in the text). |
| Ramsteijn et al. (2020) | Rats, mice, guinea pigs | Selective serotonin reuptake inhibitors (SSRIs) during pregnancy | Behavioral outcomes | Reported heterogeneity as 49% for activity and exploration, 51% for anxiety, 48% for stress coping, 65% for social behavior, 49% for learning and memory, 69% for ingestive and reward behavior, 49% for motoric behavior, 68% for sensory processing, and 77% for reflex and pain sensitivity. Heterogeneity in outcomes of activity, exploration, learning, and memory was slightly reduced by subgroup analyses based on sex. The specific period in which the animal was exposed to an SSRI (prenatal, postnatal, or both) explained the most heterogeneity in the data out of the three subgroup analyses the authors performed. Overall, the data used in these meta-analyses demonstrated high levels of heterogeneity (discussed further in the text). |
| Shojaei-Zarghani et al. (2020) | Humans, rats, mice | Caffeinated beverages (tea, coffee, soda) and chocolate in humans; caffeine in animals | Risk of colon cancer in human studies; in animal studies, cancers (adenocarcinomas) and precancerous lesions; survival; key characteristics of carcinogens (5, 6, 10) | Of five studies of incidence, two showed an increase, two showed a decrease, and one showed no change. In three studies of the effects on existing tumors, one showed increased tumor burden, one showed a reduction, and one showed no change. Of two studies of effects on mortality, one showed an increase, and one showed a decrease. Of note, two studies did not report group size, and median group size for those which did was 9. |
| Soliman et al. (2021) | Injury-related or persistent pain model in rats and mice | Any cannabinoid, cannabis-based medicine, or endocannabinoid system modulator administered to assess antinociceptive effect | Pain-associated behavioral outcome measures | The meta-analysis of 374 studies revealed moderate overall heterogeneity (I2 = 61.58%). Subgroup analyses demonstrated that a significant proportion of the heterogeneity could be attributed to the species; therefore, rat and mouse studies were analyzed separately. When only rat studies (n = 276) were considered, I2 was 57.8%; when only mouse studies (n = 153) were considered, I2 was 66.7%. Within each species, the drug and drug class, the specific pain model, and the outcome measures accounted for a large proportion of the heterogeneity observed. Also reported that, within species, strain and sex significantly contributed to heterogeneity. In their discussion of the external validity of these studies, the authors noted that the models used in preclinical pain research are not representative of the clinical population (discussed further in the text). |
| Sophocleous et al. (2022) | Humans, mice, rats, rabbits | Cannabinoid receptor ligands (synthetic and natural) | Bone cell activity and bone volume in rodents; bone mineral density in humans | Conclusions were limited because the studies were few in number and heterogeneous. |
| Wikoff et al. (2021) | Humans and rats | Exposure to dioxin-like compounds | Reduced sperm count | Conducted a meta-analysis of 29 studies investigating the effects of 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) on sperm count. They reported a high level of heterogeneity (I2 > 84%) across models and approaches and noted that qualitative exploratory sensitivity analyses indicated that there were no obvious patterns in dose-response relationship based on strain or postnatal age at evaluation, suggesting the underlying data is itself inconsistent. |
| Zhang et al. (2022) | Rats and mice | Bisphenol A exposure | Oxidative damage | Identified 20 publications with a median sample size of 7. The majority were of unclear risk of bias across most of the 10 SYRCLE risk of bias indicators. Across 7 indicators of oxidative damage, I2 was 57% for glutathione reductase and greater than 90% for the remaining indicators. Some of this heterogeneity could be explained by aspects of study design, including dose, duration of exposure, and the tissue in which antioxidant effects were sampled. |
| Andersen et al. (2020) | Humans and rats | Exposure to methadone or buprenorphine in utero | Cognitive, psychomotor, motor, behavioral, attentional, executive, or visual outcomes | Does not provide an analysis of variability (qualitative or quantitative) across multiple laboratory mammalian toxicity studies. Discusses concordance between human outcomes and observations in experimental animals (see Chapter 4). |
| Bestry et al. (2022) | Humans, primates, rats, mice | Prenatal alcohol exposure | DNA methylation | Does not provide an analysis of variability (qualitative or quantitative) across multiple laboratory mammalian toxicity studies. |
| Bezemer et al. (2021) | Humans, mice | Allylamines | Safety and efficacy for treatment of cutaneous and mucocutaneous leishmaniasis—under “adverse events” there were none reported | Does not provide an analysis of variability (qualitative or quantitative) across multiple laboratory mammalian toxicity studies. A meta-analysis could not be performed because the identified studies were low in number, heterogeneous, and of low quality. |
| Bodewein et al. (2019) | Humans, rats, mice, guinea pigs, dogs | Man-made electric fields, magnetic fields, electromagnetic fields in the intermediate frequency range (300 Hz to 1 MHz) | Effects on any biological function | Does not provide an analysis of variability (qualitative or quantitative) across multiple laboratory mammalian toxicity studies. The authors noted that there was large heterogeneity in study designs, investigated systems, and endpoints. |
| da L. D. Barros et al. (2018) | Rats and mice | Fluoxetine (the effects of pharmacological neonatal inhibition of serotonin reuptake by fluoxetine) | Influence on neonatal feeding behavior and energy balance | Does not provide an analysis of variability (qualitative or quantitative) across multiple laboratory mammalian toxicity studies. |
| European Commission (2018) | Humans, rabbits, rats, mice | Endocrine disrupting chemicals [World Health Organization (WHO), International Programme on Chemical Safety (IPCS) definition]: bisphenol A, di(2-ethylhexyl)phthalate (DEHP), vinclozolin, trenbolone, perfluorooctanesulfonic acid, perfluorooctanoic acid, BDE, perchlorate, prochloraz | Endocrine-disrupting activity or effect, including effects manifested at later life stages | Does not provide an analysis of variability (qualitative or quantitative) across multiple laboratory mammalian toxicity studies. |
| Hooijmans et al. (2016) | Animals with experimental cancer—mice (60%) or rats (40%) | Treatment with analgesic and anesthetic drugs | Metastasis | Does not provide an analysis of variability (qualitative or quantitative) across multiple laboratory mammalian toxicity studies. Anesthetic drugs are grouped as either volatile or general, and data are not discussed on the effects of multiple individual anesthetic drugs. |
| Jukema et al. (2021) | Humans, mice, rats, guinea pigs, rabbits, other mammals | Antileukotrienes to prevent or treat chronic lung disease | All-cause mortality and any harm, and, for the clinical studies, incidence of chronic lung disease | Does not provide an analysis of variability (qualitative or quantitative) across multiple laboratory mammalian toxicity studies. Overall, there was heterogeneity in study designs, methods, and a high level of bias in the included studies. |
| Kinkade et al. (2021) | Rats and mice | Fusarium-derived mycoestrogens | Reproductive hormone levels, ovary and uterine weight, morphological and pathological changes in ovary or uterus, oocyte maturation rate, duration of estrus cycle, placental changes, implantation rate, pregnancy rate, gestational weight gain, resorbed/dead fetuses, live birth rate, fetal growth. | Does not provide an analysis of variability (qualitative or quantitative) across multiple laboratory mammalian toxicity studies. |
| Leenaars et al. (2019) | Humans, rats, rabbits, mice | Various | Translational success (and failure) rates | Does not provide an analysis of variability (qualitative or quantitative) across multiple laboratory mammalian toxicity studies. Paper focuses on the concordance between animals and humans. All but one of the included studies were of very low quality. |
| Michelogiannakis et al. (2018) | Rats | Nicotine | Tooth movement | Does not provide an analysis of variability (qualitative or quantitative) across multiple laboratory mammalian toxicity studies. |
| Rogers et al. (2016) | Humans, primates, rabbits, rats | Low-energy sweeteners | Energy intake and body weight | Does not provide a quantitative analysis of variability across multiple laboratory mammalian toxicity studies. Of 49 experiments reporting the effects of forced low energy sweeteners on body weight, 5 reported a gain, 21 reported a loss, and 23 reported no effect. Median group size was 12, 20, and 18, respectively (mean group size 10, 171, 54), suggesting those studies reporting weight gain may have been underpowered. |
| Zhao et al. (2021) | Mice | Grifola frondosa (edible medicinal mushroom) | Tumor proliferation | Does not provide an analysis of variability (qualitative or quantitative) across multiple laboratory mammalian toxicity studies. |
| Compound | Anogenital Distance: Number of Experiments | Anogenital Distance: τ | Anogenital Distance: I2 (%) | Testosterone Level: Number of Experiments | Testosterone Level: τ | Testosterone Level: I2 (%) |
|---|---|---|---|---|---|---|
| DEHP | 251 | 3.5 | 45.7 | 70 | 77 | 98.5 |
| Benzylbutyl phthalate | 46 | 5.8 | 94.9 | 14 | 77.7 | 98.2 |
| Di-n-butyl phthalate | 198 | 7.8 | 89.1 | 111 | 59.2 | 94.8 |
| Dipentyl phthalate | 5 | - | - | 45 | 76.2 | 98.1 |
| Diisobutyl phthalate | 5 | - | - | 10 | 71.8 | 97.0 |
| Diisononyl phthalate | 48 | 0.75 | 4.6 | 34 | 31.2 | 83.3 |
τ is a combination of intra- and inter-study variability for each compound. Different doses used in a given study were considered as independent experiments for the same study endpoint and chemical. Data obtained from Application of Systematic Review Methods in an Overall Strategy for Evaluating Low-Dose Toxicity from Endocrine Active Chemicals (NASEM, 2017). For explanation of τ and I2, see Box 3-1.
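To make the two statistics in the table concrete, the following is a minimal sketch of how I2 and τ are commonly computed from experiment-level effect estimates. It assumes inverse-variance weights, Cochran's Q, and the DerSimonian-Laird estimator of between-experiment variance. The estimator used in NASEM (2017) may differ, and the function name `heterogeneity` and the input values are hypothetical.

```python
import math

def heterogeneity(effects, variances):
    """Return (Q, I2 in percent, tau) for experiment-level effect estimates.

    Assumes a DerSimonian-Laird random-effects model; this is an illustrative
    choice, not necessarily the method used in the cited report.
    """
    weights = [1.0 / v for v in variances]            # inverse-variance weights
    w_sum = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / w_sum
    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I2: share of total variation attributed to heterogeneity rather than chance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # tau^2: between-experiment variance, truncated at zero
    c = w_sum - sum(w * w for w in weights) / w_sum
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    return q, i2, math.sqrt(tau2)

# Hypothetical effect estimates with equal sampling variances
q, i2, tau = heterogeneity([0.0, 2.0, 4.0], [1.0, 1.0, 1.0])
print(f"Q = {q:.2f}, I2 = {i2:.1f}%, tau = {tau:.3f}")
```

I2 is a relative measure bounded between 0% and 100%, whereas τ is expressed in the units of the effect estimate, so the two can rank compounds differently.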
Ramsteijn et al. (2020) evaluated the effects of perinatal selective serotonin reuptake inhibitor (SSRI) exposure on behavioral outcomes in a systematic review and meta-analyses of animal studies. They performed nine meta-analyses and two qualitative syntheses corresponding to different behavioral categories, aggregating data from thousands of animals, and found evidence for reduced activity and exploration behavior (standardized mean difference [SMD] −0.28 [−0.38, −0.18]), more passive stress coping (SMD −0.37 [−0.52, −0.23]), and less efficient sensory processing (SMD −0.37 [−0.69, −0.06]) in SSRI- versus vehicle-exposed animals. No statistically significant differences were found for anxiety (p = 0.06), social behavior, learning and memory, ingestive and reward behavior, motoric behavior, or reflex and pain sensitivity. Exposure in the period equivalent to the human third trimester was associated with the strongest effects. Variability in outcomes of activity, exploration, learning, and memory was slightly reduced by subgroup analyses based on sex. Of the three subgroup analyses performed, the specific period in which the animal was exposed to an SSRI (i.e., prenatal, postnatal, or both) explained the most variability in the data. Overall, the data used in these meta-analyses demonstrated high levels of heterogeneity.
Soliman et al. (2021) conducted a systematic review and meta-analysis of studies that assessed the antinociceptive efficacy of cannabinoids, cannabis-based medicines, and endocannabinoid system modulators on pain-associated behavioral outcomes in animal models. They meta-analyzed 374 studies and the overall heterogeneity was moderate (I2 = 61.58%). Subgroup analyses demonstrated that a significant proportion of the heterogeneity could be attributed to the species; therefore, rat and mouse studies were analyzed separately. When only rat studies (n = 276) were considered, I2 was 57.8%; when only mouse studies (n = 153) were considered, I2 was 66.7%. Within each species, the drug and drug class, the specific pain model, and the outcome measures accounted for a large proportion of the heterogeneity observed. The authors also reported that within species, strain and sex significantly contributed to heterogeneity. In their discussion of the external validity of these studies, they noted that the models used in preclinical pain research are not representative of the clinical population.
Overall, these systematic reviews and meta-analyses report moderate to high variability in the data analyzed. The authors were able to attribute part of the variability to study design elements that were not consistent across the studies analyzed, such as the exact type of drug or chemical, sex, strain, species, route of administration, and dose, which therefore reflects experimental variability. Lower variability could be observed if studies of identical design and conduct were compared.
The committee also analyzed literature presented to them in information gathering sessions. This included literature presented during the virtual workshops as well as that referenced by the sponsor during their presentations to the committee. Importantly, these publications represent those selected by speakers at the committee’s information gathering sessions rather than through a more systematic process that would minimize bias. The committee’s approach to reviewing this literature entailed development of a prespecified method as further described in Appendix C. After removal of duplicates, 128 papers were screened using similar inclusion/exclusion criteria as for the systematic review of reviews. From these 128 papers, 19 studies were included; several had already been identified in the committee’s literature review, including the systematic reviews published by NASEM (2017).
Most of the papers identified during the virtual workshops and other information gathering sessions were not systematic reviews. However, many leveraged data sets culled from the literature that could in part be identified via a systematic review, though additional studies could also be identified. All the papers cited in the workshops that were not previously identified as part of the committee’s “overview review” were regarded as of critically low methodological quality because they were not systematic reviews. Specifically, none of the reviews explicitly stated that the review methods were established prior to the conduct of the review, nor did they report and justify any significant deviations from their analysis plan or protocol. These reviews also did not use or document a comprehensive search strategy or an approach for evaluating the quality of the included studies, including an acceptable risk of bias tool. This highlights opportunities for improving the reviews presented to the committee to decrease bias in the overall evidence evaluation. For example, future studies in this area could employ a comprehensive search strategy and more carefully detail the rationale for why studies were excluded during the title and abstract or full text review stages, and protocols for evaluating the studies could be preregistered for transparency. In addition, a risk of bias evaluation could be conducted on the studies identified using tools previously recommended by NASEM.
It should be noted that the evaluations of mammalian toxicity variability that used data from specific databases are not systematic reviews, and as such, there may be other existing data that might be pertinent to the research question and that might not have been included in the data analysis. In other words, these studies used convenience samples and they were not evaluated for risk of bias. Therefore, the results are not generalizable. The findings from these studies are summarized in Table 3-3, but any conclusions about the qualitative and quantitative variability of laboratory mammalian toxicity studies derived from these reviews are to be interpreted with caution.
Overall, these peer-reviewed papers report low to moderate qualitative variability based on the data they analyzed and generally less than a log difference in quantitative variability. The attribution of the sources of variability was inconsistent across studies. Furthermore, no-observed-adverse-effect levels (NOAELs) and lowest-observed-adverse-effect levels (LOAELs) are highly dependent on the doses selected, and qualitative descriptors (i.e., active/inactive or potency categories) are highly dependent on the thresholds selected; therefore, the variability may reflect, at least in part, experimental design choices.
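The dependence of NOAELs and LOAELs on dose selection can be illustrated with a small, idealized sketch. It assumes a noise-free response with a single hypothetical effect threshold, so any difference between the two studies arises purely from dose spacing. The doses, the threshold, and the function names `loael` and `noael` are illustrative assumptions rather than values from the cited papers.

```python
import math

def loael(tested_doses, true_threshold):
    """Lowest tested dose at or above the hypothetical true effect threshold."""
    adverse = [d for d in sorted(tested_doses) if d >= true_threshold]
    return adverse[0] if adverse else None

def noael(tested_doses, true_threshold):
    """Highest tested dose below the hypothetical true effect threshold."""
    below = [d for d in sorted(tested_doses) if d < true_threshold]
    return below[-1] if below else None

threshold = 5.0                      # hypothetical true threshold, mg/kg/day
design_a = [1.0, 10.0, 100.0]        # hypothetical study A dose groups
design_b = [3.0, 30.0, 300.0]        # hypothetical study B dose groups

loael_a = loael(design_a, threshold)
loael_b = loael(design_b, threshold)
# Difference between the two reported LOAELs on the log10 scale
log_gap = abs(math.log10(loael_b) - math.log10(loael_a))
print(loael_a, loael_b, round(log_gap, 2))
```

Under these assumptions the same chemical yields LOAELs that differ by roughly half a log unit even though the underlying biology is identical, which is why part of the reported variability may be attributable to design rather than to the animals.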
| Reference | Population(s) | Exposure(s) | Adverse Outcome(s) | Reported Results |
|---|---|---|---|---|
| Barton et al. (2005) | Rats, mice | Early life exposure to more than 50 chemicals and radiation | Tumor incidence | Variability between life stages was quantified, calculated as the ratio of the estimated cancer potency from early-life exposure compared with the estimated cancer potency from adult exposure. |
| Crofton et al. (2004) | Rats | 27 positive control chemicals for developmental neurotoxicity (DNT) (listed in Table 4) | Functional observational battery, motor activity, acoustic startle response, learning and memory, neuropathology | A survey of the positive control reports submitted to support DNT guideline studies identified three problem categories: faulty design, inadequate results, and problems with the manner in which the data were reported. |
| Dumont et al. (2016) | Mice | Approximately 100 chemicals and 500 local lymph node assays (LLNA) from the National Toxicology Program (NTP) Interagency Center for the Evaluation of Toxicological Methods (NICEATM) LLNA database | Skin sensitization | Studies with irreproducible classifications mainly classified chemicals in an adjacent class, but the authors were not able to draw general conclusions on whether such studies lead to a more severe or less severe classification. |
| Gottmann et al. (2001) | Rats, mice | 121 chemicals tested in two or more carcinogenicity studies | Carcinogenicity, mutagenicity, carcinogenic potency (tumorigenic dose rate 50, TD50) | 69 (57%) chemicals had reproducible results; the authors reported a difference between the reproducibility of experiments with mice (49% reproducible; 70 studies) and rats (62% reproducible; 71 studies) but saw no difference between sex. |
| Helma et al. (2018) | Rats | Chemicals with multiple chronic rat toxicity studies as collated in two databases, the Nestle database and the Swiss Food Safety and Veterinary Office (FSVO) database | Chronic oral rat LOAEL | In the Nestle database, there were 93 compounds with multiple studies and a reported mean standard deviation (SD) of 0.32 using log10-transformed values. In the FSVO database, 91 compounds had multiple values and the mean SD was 0.29. |
| Hill et al. (2017) | Rats, mice | 332 pesticides plus bisphenol A, DEHP, per-fluorooctanesulfonic acid, and perfluorooctanoic acid | Carcinogenicity | Reproducibility for any tumor outcome (+/- in either sex) was 69% (11/16 chemicals) for rat and 63% (10/16 chemicals) for mouse |
| Haseman (2000) | Rats, mice | 471 chemicals (NTP toxicological reports that assessed carcinogenicity) | Cancer | Focus is on species variability between rats and mice. Of 385 studies with adequate results (adequate not defined), concordant results for mice and rats were found for 283 substances (74%); 47 substances (12%) were only positive for rats, 55 substances (14%) were only carcinogenic for mice. |
| Karmaus et al. (2022) | Rats | 2,441 chemicals from public databases (ChemProp, Hazardous Substances Data Bank, ChemIDplus, AcutoxBase, eChemPortal) | Acute oral toxicity, 50% lethal dose (LD50) values (approximately 6,000) | Conditional probability analysis to assess qualitative variability revealed that the acute oral toxicity in rats is on average only 50%–60% reproducible (i.e., when results for a compound indicated one category, how often did a repeated test yield results in that same category) for most hazard categories in the EPA and Globally Harmonized System (GHS) categorization schema. Quantitative variability was computed as the median absolute deviation (MAD) across log10 of the point estimate LD50 values, and the distribution of MAD revealed that most MAD values are below 0.5 (log10 [mg/kg]). |
| Kleinstreuer et al. (2016) | Rats, mice | 70 chemicals with at least two guideline uterotrophic studies curated from the published literature | Uterotrophic assay | Irreproducible outcomes were present for 26% of the data set (18 chemicals), resulting in a chemical being classified as both “active” and “inactive” for uterotrophic bioactivity. |
| Luechtefeld et al. (2016) | Rabbits | Approximately 500 chemicals tested in at least two Draize tests | Eye irritation | The authors concluded that the most reproducible outcomes from the Draize test were for chemicals that had results of negative (94% reproducible) or severe eye irritant (73% reproducible). The study also concluded that the Draize test cannot reliably distinguish between mild irritants (Class 2B) and non-irritants (i.e., qualitative variability). |
| Luechtefeld et al. (2016) | Mice, Guinea pigs | Chemicals evaluated in three laboratory mammalian sensitivity tests (Buehler, Guinea pig maximization test, LLNA) in the REACH database from 2008 to 2014 | Skin sensitization | The reproducibility for a given skin sensitization classification was 95% for Buehler (344 chemicals), 93% for Guinea pig maximization test (624 chemicals), and 86% for LLNA (296 chemicals). |
| Mansouri et al. (2021) | Rats | Acute oral exposure to 15,688 substances | LD50 (survival) | Binary and multiclass models achieved higher scores (median S scores ranging from 0.74 to 0.82) than the discrete LD50 prediction models (median SLD50 0.66). After quantifying the inherent variability based on the bootstrap analysis, the resulting margin of ±0.3 log10 (mg/kg) was considered the 95% CI for acute oral LD50 values. |
| Mazzatorta et al. (2008) | Rats | 94 chemicals with repeat dose systemic toxicity studies | Repeat dose systemic toxicity (LOAEL values) | The interlab reproducibility of LOAEL values can be estimated as twice the mean SD, i.e., 0.64 mg/kg/day (in logarithmic units) (quantitative variability). |
| Pham et al. (2020) | Rats, mice, rabbits, dogs | Chemicals tested in subacute, subchronic, chronic, multigeneration reproductive, and developmental toxicity studies from the EPA Toxicity Reference Database (ToxRefDB) | Specific outcomes were not provided but “potency values for effects in adult or parental animals” were analyzed | Total variance in systemic lowest effect level (LEL) and LOAEL values (in log10-mg/kg/day units) ranged from 0.74 to 0.92 and was similar across the different study types (i.e., subacute, subchronic, chronic, developmental). |
| Pradeep et al. (2020) | Rats, mice, rabbits | 3,592 chemicals compiled from EPA’s ToxValDB database | Specific outcomes were not provided but “effect level” values were extracted (e.g., LOAEL, NOAEL, benchmark dose) | The SD per chemical ranges between 0.34 and 0.58 for all data set combinations (as shown in Figure 3a). For example, for the chronic rat combination, the mean SD per chemical is 0.51 log10-mg/kg/day, and so one would expect the models to have errors close to 0.51 log10-mg/kg/day solely due to the variability in the data set being modeled. |
| Rooney et al. (2021) | Rabbits | Approximately 1,000 chemicals and 3,000 rabbit skin irritation studies from the ECHA database | Dermal irritation | Conditional probability analysis revealed that test results are relatively reproducible at the extremes (i.e., corrosive [Class I or Cat 1] or nonirritant substances). The most reproducible outcomes were for chemicals that had results of nonirritating (84% reproducible) and corrosive (76% reproducible). Reproducibility decreased for intermediate categories (Class II and III or Cat 2 and 3). |
| Wang and Gray (2015) | Rats, mice | 37 different chemicals (39 NTP studies with both short- and long-term studies in mice and rats) | Nonneoplastic lesions | Comparing lesions across rats and mice, concordance among short-term studies ranged 57%–89%, average 75%; concordance among long-term studies ranged 65%–89%, average 80%. Kappa values to measure concordance by organ ranged –0.04 (poor agreement) to 0.47 (moderate agreement) for short-term studies, and –0.14 (poor agreement) to 0.71 (substantial agreement) for long-term studies. Comparing lesions between short-term and long-term: concordance ranged 70%–100%; in mice Kappa values ranged –0.06 (poor agreement) to 0.62 (substantial agreement), average 0.32 (fair agreement); in rats Kappa values ranged 0.21 (fair agreement) to 0.75 (substantial agreement), average 0.48 (moderate agreement). |
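Several of the metrics reported in the table above can be sketched in a few lines. The example below is a minimal illustration, assuming repeated results per chemical are available as simple lists. It shows the per-chemical SD and median absolute deviation (MAD) on the log10 scale and a conditional-probability estimate of category reproducibility (given one result in a category, how often a repeat result falls in the same category). The function names and data are hypothetical, and the cited papers may have used different pairing or weighting rules.

```python
import math
from collections import defaultdict
from itertools import permutations
from statistics import median, stdev

def log10_sd(values):
    """Sample SD of repeated potency values (e.g., LOAELs) on the log10 scale."""
    return stdev([math.log10(v) for v in values])

def log10_mad(values):
    """Median absolute deviation of repeated values (e.g., LD50s) on the log10 scale."""
    logs = [math.log10(v) for v in values]
    center = median(logs)
    return median([abs(x - center) for x in logs])

def category_reproducibility(calls_by_chemical):
    """P(repeat result is in category c | one result is in category c)."""
    same = defaultdict(int)
    total = defaultdict(int)
    for calls in calls_by_chemical.values():
        # Every ordered pair of distinct results for the same chemical
        for first, repeat in permutations(calls, 2):
            total[first] += 1
            same[first] += first == repeat
    return {cat: same[cat] / total[cat] for cat in total}

# Hypothetical repeated results
print(log10_sd([10.0, 100.0, 1000.0]), log10_mad([10.0, 100.0, 1000.0]))
print(category_reproducibility({
    "chem_A": ["negative", "negative"],
    "chem_B": ["negative", "mild"],
    "chem_C": ["mild", "severe"],
}))
```

Because these statistics are computed from whatever repeated results happen to exist in a database, they inherit the convenience-sample limitations noted above.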
Findings related to sources of variability:
Recommendation 3.1: The EPA should refrain from trying to identify a threshold of acceptable variability derived from laboratory mammalian studies to apply across all NAMs and/or endpoints.
Recommendation 3.2: Overall, the EPA should aim to establish the performance of NAMs primarily based on their intrinsic performance characteristics (e.g., within and between laboratory repeatability, robustness, applicability domain), and value with respect to protecting human health effects (e.g., external validity [see Chapter 5]), rather than benchmarking based on the variability of existing mammalian studies.
Recommendation 3.3: For any experimental assay:
Recommendation 3.4: The EPA should require that assays document appropriate methodological attributes that can be used to assess study quality and contributions to variability, such that they are included and accounted for in the analysis, interpretation, and utilization of the data for the purpose of risk assessment.
Recommendation 3.5: The EPA should carefully revisit the appropriateness of the thresholds used to set the categories, such as for a weak, moderate, strong, or extreme skin sensitizer, or should use continuous measures instead of categories when possible. Some in vivo methods include more categories than warranted based on the variability from repeated testing of the same compounds. It is recommended to establish the categories based on the assay performance and to revise the protocols accordingly.
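The relationship between assay variability and the number of categories an assay can support can be illustrated with a short sketch. It assumes, purely for illustration, that repeated test results for a chemical are normally distributed on the log10 scale around a true value with a fixed SD. The function names, the SD of 0.3, and the category boundaries are hypothetical.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def prob_in_category(true_log_value, sd, lower, upper):
    """Chance that a single test result lands in [lower, upper) on the log10 scale."""
    return normal_cdf(upper, true_log_value, sd) - normal_cdf(lower, true_log_value, sd)

sd = 0.3  # hypothetical between-study SD, log10 units

# Chemical at the center of a category one log unit wide
p_center = prob_in_category(1.5, sd, 1.0, 2.0)
# Chemical sitting exactly on the lower boundary of that category
p_edge = prob_in_category(1.0, sd, 1.0, 2.0)
# Chemical at the center of a category only half a log unit wide
p_narrow = prob_in_category(1.25, sd, 1.0, 1.5)

print(round(p_center, 2), round(p_edge, 2), round(p_narrow, 2))
```

Under these assumptions, narrowing a category relative to the assay SD, or testing a chemical near a boundary, sharply lowers the chance that a repeat test assigns the same category, which is consistent with the lower reproducibility reported for intermediate categories.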
Findings related to the literature review and literature presented to the committee:
Recommendation 3.6: If the EPA were to continue to pursue variability information from laboratory mammalian toxicity tests for benchmarking NAMs or batteries of NAMs intended as a direct replacement for an in vivo mammalian toxicity test, it should only use data from high-quality systematic reviews, meta-analyses, or authoritative reviews, or from interlaboratory studies using predetermined protocols and methods.
Andersen, J. M., G. Høiseth, and E. Nygaard. 2020. “Prenatal Exposure to Methadone or Buprenorphine and Long-Term Outcomes: A Meta-Analysis.” Early Human Development 143 (April): 104997.
Andrade, E. F., D. R. Orlando, A. M. S. Araújo, J. B. de Andrade, D. V. Azzi, R. R. de Lima, A. R. Lobo-Júnior, and L. J. Pereira. 2019. “Can Resveratrol Treatment Control the Progression of Induced Periodontal Disease? A Systematic Review and Meta-Analysis of Preclinical Studies.” Nutrients 11(5). https://doi.org/10.3390/nu11050953.
Barton, H. A., V. J. Cogliano, L. Flowers, L. Valcovic, R. W. Setzer, and T. J. Woodruff. 2005. “Assessing Susceptibility from Early-Life Exposure to Carcinogens.” Environmental Health Perspectives 113(9): 1125–1133.
Bestry, M., M. Symons, A. Larcombe, E. Muggli, J. M. Craig, D. Hutchinson, J. Halliday, and D. Martino. 2022. “Association of Prenatal Alcohol Exposure with Offspring DNA Methylation in Mammals: A Systematic Review of the Evidence.” Clinical Epigenetics 14(1): 12.
Bezemer, J. M., J. van der Ende, J. Limpens, H. J. C. de Vries, and H. D. F. H. Schallig. 2021. “Safety and Efficacy of Allylamines in the Treatment of Cutaneous and Mucocutaneous Leishmaniasis: A Systematic Review.” PLoS ONE 16(4). https://doi.org/10.1371/journal.pone.0249628.
Bodewein, L., K. Schmiedchen, D. Dechent, D. Stunder, D. Graefrath, L. Winter, T. Kraus, and S. Driessen. 2019. “Systematic Review on the Biological Effects of Electric, Magnetic and Electromagnetic Fields in the Intermediate Frequency Range (300 Hz to 1 MHz).” Environmental Research 171 (April): 247–259.
Conley, J. M., C. S. Lambright, N. Evans, M. Cardon, J. Furr, V. S. Wilson, and L. E. Gray. 2018. “Mixed ‘Antiandrogenic’ Chemicals at Low Individual Doses Produce Reproductive Tract Malformations in the Male Rat.” Toxicological Sciences: An Official Journal of the Society of Toxicology 164(1): 166–178.
Crofton, K. M., S. L. Makris, W. F. Sette, E. Mendez, and K. C. Raffaele. 2004. “A Qualitative Retrospective Analysis of Positive Control Data in Developmental Neurotoxicity Studies.” Neurotoxicology and Teratology 26(3): 345–352.
Da, L. D. Barros M., R. Manhaes-de-Castro, D. T. Alves, O. G. Quevedo, A. E. Toscano, A. Bonnin, and L. Galindo. 2018. “Long Term Effects of Neonatal Exposure to Fluoxetine on Energy Balance: A Systematic Review of Experimental Studies.” European Journal of Pharmacology 833 (August): 298–306.
Dumont, C., J. Barroso, I. Matys, A. Worth, and S. Casati. 2016. “Analysis of the Local Lymph Node Assay (LLNA) Variability for Assessing the Prediction of Skin Sensitisation Potential and Potency of Chemicals with Non-Animal Approaches.” Toxicology in Vitro: An International Journal Published in Association with BIBRA 34 (August): 220–228.
EPA (Environmental Protection Agency). 2010. “Development of a Relative Potency Factor (RPF) Approach for Polycyclic Aromatic Hydrocarbon (PAH) Mixtures (External Review Draft, Suspended).” http://cfpub.epa.gov/ncea/iris_drafts/recordisplay.cfm?deid=194584.
European Commission. 2018. “Temporal aspects in the testing of chemicals for endocrine disrupting effects (in relation to human health and the environment): Final Report.” Publications Office, 2018, https://data.europa.eu/doi/10.2779/789059.
Gottmann, E., S. Kramer, B. Pfahringer, and C. Helma. 2001. “Data Quality in Predictive Toxicology: Reproducibility of Rodent Carcinogenicity Experiments.” Environmental Health Perspectives 109(5): 509–514.
Haseman, J. K. 2000. “Using the NTP Database to Assess the Value of Rodent Carcinogenicity Studies for Determining Human Cancer Risk.” Drug Metabolism Reviews 32(2): 169–186.
Haseman, J. K., J. E. Huff, G. N. Rao, and S. L. Eustis. 1989. “Sources of Variability in Rodent Carcinogenicity Studies.” Fundamental and Applied Toxicology: Official Journal of the Society of Toxicology 12(4): 793–804.
Haws, L. C., S. H. Su, M. Harris, M. J. Devito, N. J. Walker, W. H. Farland, B. Finley, and L. S. Birnbaum. 2006. “Development of a Refined Database of Mammalian Relative Potency Estimates for Dioxin-Like Compounds.” Toxicological Sciences 89(1): 4–30. https://doi.org/10.1093/toxsci/kfi294.
Helma, C., D. Vorgrimmler, D. Gebele, M. Gütlein, B. Engeli, J. Zarn, B. Schilter, and E. Lo Piparo. 2018. “Modeling Chronic Toxicity: A Comparison of Experimental Variability with (Q)SAR/Read-Across Predictions.” Frontiers in Pharmacology 9. https://doi.org/10.3389/fphar.2018.00413.
Higgins, J. P. T., S. G. Thompson, J. J. Deeks, and D. G. Altman. 2003. “Measuring Inconsistency in Meta-Analyses.” BMJ 327(7414): 557–560.
Hill, C., S. A. Sapouckey, A. Suvorov, and L. N. Vandenberg. 2017. “Developmental Exposures to Bisphenol S, a BPA Replacement, Alter Estrogen-Responsiveness of the Female Reproductive Tract: A Pilot Study.” Cogent Medicine 4(1): 1317690. https://doi.org/10.1080/2331205X.2017.1317690.
Hooijmans, C. R., F. J. Geessink, M. Ritskes-Hoitinga, and G. J. Scheffer. 2016. “A Systematic Review of the Modifying Effect of Anaesthetic Drugs on Metastasis in Animal Models for Cancer.” PLoS ONE 11(5). https://doi.org/10.1371/journal.pone.0156152.
Howdeshell, K. L., A. K. Hotchkiss, and L. E. Gray Jr. 2017. “Cumulative Effects of Antiandrogenic Chemical Mixtures and Their Relevance to Human Health Risk Assessment.” International Journal of Hygiene and Environmental Health 220(2 Pt A): 179–188.
Jacobs, A. C., and K. P. Hatfield. 2013. “History of Chronic Toxicity and Animal Carcinogenicity Studies for Pharmaceuticals.” Veterinary Pathology 50(2): 324–333.
Jukema, M., F. Borys, G. Sibrecht, K. J. Jørgensen, and M. Bruschettini. 2021. “Antileukotrienes for the Prevention and Treatment of Chronic Lung Disease in Very Preterm Newborns: A Systematic Review.” Respiratory Research 22(1): 208.
Karmaus, A. L., K. Mansouri, K. T. To, B. Blake, J. Fitzpatrick, J. Strickland, G. Patlewicz, D. Allen, W. Casey, and N. Kleinstreuer. 2022. “Evaluation of Variability Across Rat Acute Oral Systemic Toxicity Studies.” Toxicological Sciences: An Official Journal of the Society of Toxicology 188(1): 34–47.
Kinkade, C. W., Z. Rivera-Núñez, L. Gorcyzca, L. M. Aleksunes, and E. S. Barrett. 2021. “Impact of Fusarium-Derived Mycoestrogens on Female Reproduction: A Systematic Review.” Toxins 13(6). https://doi.org/10.3390/toxins13060373.
Kleinstreuer, N. C., P. C. Ceger, D. G. Allen, J. Strickland, X. Chang, J. T. Hamm, and W. M. Casey. 2016. “A Curated Database of Rodent Uterotrophic Bioactivity.” Environmental Health Perspectives 124(5): 556–562. https://doi.org/10.1289/ehp.1510183.
Leenaars, C. H. C., C. Kouwenaar, F. R. Stafleu, A. Bleich, M. Ritskes-Hoitinga, R. B. M. De Vries, and F. L. B. Meijboom. 2019. “Animal to Human Translation: A Systematic Scoping Review of Reported Concordance Rates.” Journal of Translational Medicine 17(1): 223.
Leffa, D. T., A. C. Panzenhagen, A. A. Salvi, C. H. D. Bau, G. N. Pires, I. L. S. Torres, L. A. Rohde, D. L. Rovaris, and E. H. Grevet. 2019. “Systematic Review and Meta-Analysis of the Behavioral Effects of Methylphenidate in the Spontaneously Hypertensive Rat Model of Attention-Deficit/hyperactivity Disorder.” Neuroscience and Biobehavioral Reviews 100 (May): 166–179.
Luechtefeld, T., A. Maertens, D. P. Russo, C. Rovida, H. Zhu, and T. Hartung. 2016. “Analysis of Draize Eye Irritation Testing and Its Prediction by Mining Publicly Available 2008-2014 REACH Data.” ALTEX 33(2): 123–134.
Mansouri, K., A. L. Karmaus, J. Fitzpatrick, G. Patlewicz, P. Pradeep, D. Alberga, N. Alepee, et al. 2021. “CATMoS: Collaborative Acute Toxicity Modeling Suite.” Environmental Health Perspectives 129(4): 47013.
Mazzatorta, P., M. D. Estevez, M. Coulet, and B. Schilter. 2008. “Modeling Oral Rat Chronic Toxicity.” Journal of Chemical Information and Modeling 48(10): 1949–1954.
Michelogiannakis, D., P. E. Rossouw, D. Al-Shammery, Z. Akram, J. Khan, G. E. Romanos, and F. Javed. 2018. “Influence of Nicotine on Orthodontic Tooth Movement: A Systematic Review of Experimental Studies in Rats.” Archives of Oral Biology 93 (September): 66–73.
Miller, G. W. 2014. “Improving Reproducibility in Toxicology.” Toxicological Sciences 139: 1. https://doi.org/10.1093/toxsci/kfu050.
Morahan, H. L., C. H. C. Leenaars, R. A. Boakes, and K. B. Rooney. 2020. “Metabolic and Behavioural Effects of Prenatal Exposure to Non-Nutritive Sweeteners: A Systematic Review and Meta-Analysis of Rodent Models.” Physiology & Behavior 213 (January): 112696.
NASEM (National Academies of Sciences, Engineering, and Medicine). 2017. Application of Systematic Review Methods in an Overall Strategy for Evaluating Low-Dose Toxicity from Endocrine Active Chemicals. Washington, DC: The National Academies Press. https://doi.org/10.17226/24758.
NASEM. 2019a. Reproducibility and Replicability in Science. Washington, DC: The National Academies Press. https://doi.org/10.17226/25303.
NASEM. 2019b. Review of DOD’s Approach to Deriving an Occupational Exposure Level for Trichloroethylene. Washington, DC: The National Academies Press. https://doi.org/10.17226/25610.
NASEM. 2021. The Use of Systematic Review in EPA’s Toxic Substances Control Act Risk Evaluations. Washington, DC: The National Academies Press. https://doi.org/10.17226/25952.
NASEM. 2022a. “New Approach Methods (NAMs) for Human Health Risk Assessment: Proceedings of a Workshop—in Brief.” Washington, DC: The National Academies Press. https://nap.nationalacademies.org/catalog/26496/new-approach-methods-nams-for-humanhealth-risk-assessment-proceedings.
NASEM. 2022b. New Approach Methods (NAMs) for Human Health Risk Assessment: Workshop 2. Washington, DC: The National Academies Press. https://www.nationalacademies.org/event/05-12-2022/new-approach-methods-nams-for-human-health-risk-assessment-workshop-2.
NASEM. 2022c. Guidance on PFAS Exposure, Testing, and Clinical Follow-Up. Washington, DC: The National Academies Press. https://doi.org/10.17226/26156.
Perel, P., I. Roberts, E. Sena, P. Wheble, C. Briscoe, P. Sandercock, M. Macleod, L. E. Mignini, P. Jayaram, and K. S. Khan. 2007. “Comparison of Treatment Effects between Animal Experiments and Clinical Trials: Systematic Review.” BMJ 334(7): 197.
Pham, L., S. Watford, P. Pradeep, M. T. Martin, R. Thomas, R. Judson, R. W. Setzer, and K. P. Friedman. 2020. “Variability in in Vivo Studies: Defining the Upper Limit of Performance for Predictions of Systemic Effect Levels.” Computational Toxicology (Amsterdam, Netherlands) 15 (August): 100126.
Pollock, M., R. M. Fernandes, L. A. Becker, and D. Pieper. 2019. “V: Overviews of Reviews.” In J. P. T. Higgins, J. Thomas, J. Chandler, M. Cumpston, T. Li, M. J. Page, and V. A. Welch (Eds.), Cochrane Handbook for Systematic Reviews of Interventions. Cochrane, 2022. http://training.cochrane.org/handbook.
Pradeep, P., K. P. Friedman, and R. Judson. 2020. “Structure-Based QSAR Models to Predict Repeat Dose Toxicity Points of Departure.” Computational Toxicology (Amsterdam, Netherlands) 16: 100139. https://doi.org/10.1016/j.comtox.2020.100139.
Ramsteijn, A. S., L. Van de Wijer, J. Rando, J. van Luijk, J. R. Homberg, and J. D. A. Olivier. 2020. “Perinatal Selective Serotonin Reuptake Inhibitor Exposure and Behavioral Outcomes: A Systematic Review and Meta-Analyses of Animal Studies.” Neuroscience and Biobehavioral Reviews 114 (July): 53–69.
Rogers, P. J., P. S. Hogenkamp, C. de Graaf, S. Higgs, A. Lluch, A. R. Ness, C. Penfold, et al. 2016. “Does Low-Energy Sweetener Consumption Affect Energy Intake and Body Weight? A Systematic Review, Including Meta-Analyses, of the Evidence from Human and Animal Studies.” International Journal of Obesity 40(3): 381–394.
Rooney, J. P., N. Y. Choksi, P. Ceger, A. B. Daniel, J. Truax, D. Allen, and N. Kleinstreuer. 2021. “Analysis of Variability in the Rabbit Skin Irritation Assay.” Regulatory Toxicology and Pharmacology: RTP 122 (June): 104920.
Scott, P. K., L. C. Haws, D. F. Staskal, L. S. Birnbaum, N. J. Walker, M. J. De Vito, M. A. Harris, W. H. Farland, B. L. Finley, and K. M. Unice. 2006. An Alternative Method for Establishing TEFs for Dioxin-Like Compounds. Part 1. Evaluation of Decision Analysis Methods for Use in Weighting Relative Potency Data. Environmental Protection Agency.
Sedgwick, P. 2015. “Meta-Analyses: What Is Heterogeneity?” BMJ 350 (March): h1435.
Shea, B. J., B. C. Reeves, G. Wells, M. Thuku, C. Hamel, J. Moran, D. Moher, P. Tugwell, V. Welch, E. Kristjansson, and D. A. Henry. 2017. “AMSTAR 2: A Critical Appraisal Tool for Systematic Reviews That Include Randomised or Non-Randomised Studies of Healthcare Interventions, or Both.” BMJ 358 (September): j4008.
Shojaei-Zarghani, S., A. Yari Khosroushahi, M. Rafraf, M. Asghari-Jafarabadi, and S. Azami-Aghdash. 2020. “Dietary Natural Methylxanthines and Colorectal Cancer: A Systematic Review and Meta-Analysis.” Food & Function 11(1): 10290–10305.
Soliman, N., S. Haroutounian, A. G. Hohmann, E. Krane, J. Liao, M. Macleod, D. Segelcke, et al. 2021. “Systematic Review and Meta-Analysis of Cannabinoids, Cannabis-Based Medicines, and Endocannabinoid System Modulators Tested for Antinociceptive Effects in Animal Models of Injury-Related or Pathological Persistent Pain.” Pain 162(S). https://doi.org/10.1097/j.pain.0000000000002269.
Sophocleous, A., M. Yiallourides, F. Zeng, P. Pantelas, E. Stylianou, B. Li, G. Carrasco, and A. I. Idris. 2022. “Association of Cannabinoid Receptor Modulation with Normal and Abnormal Skeletal Remodelling: A Systematic Review and Meta-Analysis of in Vitro, in Vivo and Human Studies.” Pharmacological Research: The Official Journal of the Italian Pharmacological Society 175 (January): 105928.
Voelkl, B., N. S. Altman, A. Forsman, W. Forstmeier, J. Gurevitch, I. Jaric, et al. 2020. “Reproducibility of Animal Research in Light of Biological Variation.” Nature Reviews Neuroscience 21(7): 384–393.
Wang, B., and G. Gray. 2015. “Concordance of Noncarcinogenic Endpoints in Rodent Chemical Bioassays.” Risk Analysis: An Official Publication of the Society for Risk Analysis 35(6): 1154–1166.
Wikoff, D. S., J. D. Urban, C. Ring, J. Britt, S. Fitch, R. Budinsky, and L. C. Haws. 2021. “Development of a Range of Plausible Noncancer Toxicity Values for 2,3,7,8-Tetrachlorodibenzo-P-Dioxin Based on Effects on Sperm Count: Application of Systematic Review Methods and Quantitative Integration of Dose Response Using Meta-Regression.” Toxicological Sciences: An Official Journal of the Society of Toxicology 179(2): 162–182.
Zhang, H., R. Yang, W. Shi, X. Zhou, and S. Sun. 2022. “The Association Between Bisphenol A Exposure and Oxidative Damage in Rats/Mice: A Systematic Review and Meta-Analysis.” Environmental Pollution 292: 118444.
Zhao, F., Z. Guo, Z. R. Ma, L. L. Ma, and J. Zhao. 2021. “Antitumor Activities of Grifola Frondosa (Maitake) Polysaccharide: A Meta-Analysis Based on Preclinical Evidence and Quality Assessment.” Journal of Ethnopharmacology 280 (November): 114395.