Evaluating the quality of evidence for key questions on the benefits and harms of preventive services forms a cornerstone of the U.S. Preventive Services Task Force (USPSTF) and the Community Preventive Services Task Force processes. Researchers who produce this evidence often face tradeoffs among an array of possible study designs that can be used for research on whether an action (e.g., screening, behavioral counseling, or preventive medication) causes an outcome. Randomized controlled trials (RCTs) are widely considered to produce the highest-quality evidence because they are less prone to confounding than observational study designs. However, evidence from RCTs often has key limitations in generalizability to real-life populations and settings.
New designs for RCTs and innovative methods for observational studies have gained increasing traction in health care delivery research in the past two decades. Modern study designs could enable researchers to address evidence gaps in the USPSTF analytic framework where traditional individual-level RCTs are not feasible for a variety of reasons. In addition, innovative designs can be useful for studies to address gaps in the research foundation, and in real-world dissemination and implementation of recommended preventive services. In this chapter, the committee discusses the types of studies needed to fill different types of evidence gaps, considering both existing USPSTF methods and innovative modern methods. Our purpose is to highlight specific newer study designs particularly germane to clinical prevention research, and to encourage
researchers, sponsors, and guideline committees to increase their use and acceptance where appropriate.
To identify the types of studies needed to fill evidence gaps for recent USPSTF-reviewed topics, the committee evaluated all “Research Needs and Gaps” sections from all 14 USPSTF recommendations that contained any “insufficient evidence” (I) statement published over 3 years, from July 2018 to June 2021. Five of these 14 statements contained I statements along with letter grade recommendations (A, B, C, or D) for part of the population or intervention, with the I statement applying to a subgroup or to a specific type of intervention. The committee also reviewed all nine statements containing only A, B, C, or D grades published from January 2020 to June 2021.
Among the 14 USPSTF recommendations reviewed containing an I statement, four specifically called for RCTs (see Table B-1). Others called for evidence about the effectiveness of screening or intervention without specifying a required study design. Ten of the 14 called for studies of risk prediction approaches (i.e., ways to identify high-risk groups). Of these, several statements requested research on how high-risk subgroups could be identified using clinical information such as data from questionnaires. Others requested research to establish the best specifications for a physical test, such as determining the preferred laboratory test for vitamin D deficiency, or developing consistent definitions of hearing loss to improve certainty about the accuracy of screening tests. Five of the 14 recommendations described evidence gaps that would require longitudinal follow-up of patients, either in RCTs or cohort studies, such as understanding the long-term natural history of hypertension in children and how often it resolves spontaneously. Half of the statements called for more research on population subgroups defined by age, sex, race or ethnicity, or other factors. Three requested research on real-world dissemination or implementation. One statement, on abdominal aortic aneurysm screening, suggested that high-quality modeling studies for women could be informative if new trials were not available.
Of the nine USPSTF recommendations reviewed containing letter grades and no I statements, six specifically requested additional RCTs. Seven called for studies of risk prediction approaches, and six requested studies to refine the specifications of the intervention or to test related new interventions. Three described research needs that could best be met by longitudinal cohort studies or trials with long-term follow-up. Two mentioned research needs related to dissemination and implementation.
TABLE B-1 Types of Studies Needed to Address Research Needs and Gaps Described in Recent U.S. Preventive Services Task Force Statements
| Study Type Specified or Implied in the Research Needs Section of the USPSTF Statement | ||||||||
|---|---|---|---|---|---|---|---|---|
| Topic | Randomized controlled trial (RCT)* | Intervention effectiveness study, unspecified* | Risk prediction study, including test performance | Longitudinal follow-up (in RCT or cohort study) | Cross-sectional study | Research in subgroups | Dissemination or Implementation study | Notes |
| Part 1. Recommendations containing an “insufficient evidence” (I) statement, July 2018–June 2021 | ||||||||
| Vitamin D Deficiency in Adults: Screening | 1 | 1 | ||||||
| Tobacco Smoking Cessation in Adults, Including Pregnant Persons: Interventions | 1 | 1 | 1 | 1 | Calls for studies of additional outcomes | |||
| Hearing Loss in Older Adults: Screening | 1 | 1 | 1 | |||||
| High Blood Pressure in Children and Adolescents: Screening | 1 | 1 | 1 | 1 | RCTs could be difficult because this screening is routine in usual care | |||
| Prevention and Cessation of Tobacco Use in Children and Adolescents: Primary Care Interventions | 1 | 1 | ||||||
| Unhealthy Drug Use: Screening | 1 | 1 | 1 | |||||
| Illicit Drug Use in Children, Adolescents, and Young Adults: Primary Care–Based Interventions | 1 | 1 | 1 | |||||
| Bacterial Vaginosis in Pregnant Persons to Prevent Preterm Delivery: Screening | 1 | 1 | ||||||
| Cognitive Impairment in Older Adults: Screening | 1 | 1 | Calls for more consistent definitions of outcomes | |||||
| Abdominal Aortic Aneurysm: Screening | 1 | 1 | 1 | 1 | 1 | 1 | Suggested high-quality modeling studies could be useful | |
| Elevated Blood Lead Levels in Children and Pregnant Women: Screening | 1 | 1 | 1 | RCTs could be difficult because this screening is routine in usual care | ||||
| Atrial Fibrillation: Screening with Electrocardiography | 1 | 1 | Mentions ongoing RCTs | |||||
| Peripheral Artery Disease and Cardiovascular Disease: Screening and Risk Assessment with the Ankle-Brachial Index | 1 | 1 | 1 | Mentions ongoing RCTs | ||||
| Cardiovascular Disease: Risk Assessment with Nontraditional Risk Factors | 1 | 1 | 1 | 1 | Requests studies of incremental benefit in real-world practice | |||
| Total | 4 | 11 | 10 | 5 | 2 | 7 | 3 | |
| Part 2. Recommendations containing only A, B, C, or D grades | ||||||||
| Healthy Weight and Weight Gain in Pregnancy: Behavioral Counseling Interventions | 1 | 1 | 1 | |||||
| Colorectal Cancer: Screening | 1 | 1 | 1 | 1 | 1 | |||
| Hypertension in Adults: Screening | 1 | 1 | ||||||
| Lung Cancer: Screening | 1 | 1 | Requests studies of risk prediction models to select patients to screen | |||||
| Asymptomatic Carotid Artery Stenosis: Screening | 1 | 1 | ||||||
| Hepatitis B Virus Infection in Adolescents and Adults: Screening | 1 | 1 | 1 | 1 | 1 | Requests studies of decision support tools | ||
| Healthy Diet and Physical Activity for Cardiovascular Disease Prevention in Adults with Cardiovascular Risk Factors: Behavioral Counseling | 1 | 1 | ||||||
| Sexually Transmitted Infections: Behavioral Counseling | 1 | 1 | 1 | 1 | ||||
| Hepatitis C Virus Infection in Adolescents and Adults: Screening | 1 | 1 | 1 | 1 | ||||
| Total | 6 | 6 | 7 | 3 | 0 | 3 | 2 | |
* The Randomized Controlled Trial column was marked only when a Research Needs statement specified a randomized controlled trial. Other statements calling for research on the benefits of a screening, treatment, or other intervention were classified as “Intervention Effectiveness Study, unspecified.”
The USPSTF rates the body of evidence for each question in the analytic framework for a given topic as convincing, adequate, or inadequate based on several factors (USPSTF, 2021). First among these is, “Do the studies have the appropriate research design to answer the key question(s)?” Other key factors considered in evaluating the adequacy of evidence include internal validity and external generalizability to the U.S. primary care population, aggregated across all studies for each of the key questions.
The Grading of Recommendations Assessment, Development, and Evaluation (GRADE) system, introduced in 2011 based on the work of an international consensus group, is a widely used set of standards for rating the quality of a body of evidence in systematic reviews and guidelines addressing clinical, public health, and health systems questions. In the GRADE approach, RCTs are initially classified as high-quality evidence and observational studies as low-quality evidence for estimates of intervention effects (Guyatt et al., 2011). From these starting points, a body of evidence about whether an action (e.g., screening or treatment) causes an outcome may be downgraded or upgraded. Reasons for downgrading include risk of bias, inconsistency, indirectness, imprecision, and publication bias. Reasons for upgrading include evidence of a large effect, a dose-response relationship, and situations in which all plausible confounding would bias the estimate in the direction opposite to the observed effect (or lack of effect).
In recent years, GRADE guidance has been updated to suggest that the evaluation of non-randomized studies of interventions could begin with an assumption of high certainty with the use of a new tool called the Risk of Bias in Non-Randomized Studies of Interventions (ROBINS-I) (Schünemann et al., 2019; Sterne et al., 2016). The ROBINS-I tool enables reviewers to evaluate the risk of bias in estimates of comparative effectiveness (harm or benefit) in non-randomized studies of interventions and to assess the magnitude of bias in different domains, including confounding, selection, misclassification of intervention status, deviations from intended interventions, missing data, outcomes measurement, and reporting.
For many of the USPSTF recommendations reviewed in Table B-1, it would be possible to address a research need using either an RCT or an observational study design. Many factors influence a sponsor's or researcher's selection of a study design, including the anticipated internal and external validity of the design and extrinsic factors such as time urgency and logistical barriers (Armstrong, 2012). In addition, studies designed for diverse populations may need to account for the possibility that the intervention has heterogeneous effects across subpopulations. Study designs with high per-patient costs may lead to limited sample sizes and inadequate power to identify heterogeneity of treatment effect, making designs with lower per-patient costs more attractive if they are of adequate quality.
RCTs are not practically feasible or are not the optimal approach in some situations. Examples include preventive services where:
Individual-level RCTs have other important limitations. Their results are often not generalizable because inclusion and exclusion criteria tend to select patients who do not resemble those in actual practice with respect to age, comorbidities, sex, race/ethnicity, or socioeconomic status. In addition, because sample sizes are usually limited by cost, individual-level trials can rarely address questions about how intervention effects may vary across subgroups (Armstrong, 2012). Failure to identify such variation, termed heterogeneity of treatment effect, may mean that high-risk subgroups miss the potential benefits of a preventive service if it is found ineffective in the general population. These limitations are not unique to RCTs; they can also hinder non-randomized studies of interventions. Conversely, many pragmatic or effectiveness trials do enroll patients representative of those in practice.
Newer designs for RCTs may be useful to overcome specific limitations of individual-level trials. Many excellent books on study design and reviews of modern study methods are available and provide more comprehensive descriptions of the options. Here, the committee discusses selected designs that may be useful in clinical prevention. Examples of modern trial designs include
Cluster randomization may be especially useful when it would be difficult to provide different interventions to individuals within the same clinical setting. For example, a trial of a mailed and telephone intervention to reduce postpartum weight retention among women with gestational diabetes mellitus used a cluster randomized design in order to leverage a centralized system for case management across the 44 participating medical facilities, and to enable a consistent workflow within each medical facility.
Stepped wedge cluster randomized trials are increasingly used to evaluate interventions that involve service delivery in discrete units such as geographical regions, hospitals or clinics. In this design, each unit crosses over from control to intervention in a randomized and sequential fashion (Hemming et al., 2015). This design enables researchers to conduct rigorous evaluation within the constraints sometimes imposed by policy makers when they believe it important that every unit in a study should eventually receive the intervention.
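The assignment logic of a stepped wedge design can be expressed in a few lines of code. The sketch below is illustrative only: the clinic names, period count, and even spacing of crossover times are hypothetical, and a real trial would use a formal randomization protocol and specialized analysis software.

```python
import random

def stepped_wedge_schedule(clusters, n_periods, seed=0):
    """Assign each cluster a randomized crossover time.

    Every cluster starts under the control condition and, once it
    crosses over to the intervention, remains exposed thereafter.
    """
    rng = random.Random(seed)
    order = list(clusters)
    rng.shuffle(order)                       # randomize the crossover order
    n_steps = n_periods - 1                  # period 0 is an all-control baseline
    schedule = {}
    for i, cluster in enumerate(order):
        # spread crossover times evenly across steps 1..n_steps
        step = 1 + (i * n_steps) // len(order)
        schedule[cluster] = ["intervention" if t >= step else "control"
                             for t in range(n_periods)]
    return schedule

# Hypothetical clinics crossing over across five measurement periods
for clinic, arms in stepped_wedge_schedule(
        ["clinic_A", "clinic_B", "clinic_C", "clinic_D"], n_periods=5).items():
    print(clinic, arms)
```

Because every cluster contributes both control and intervention periods, and the crossover order is randomized, the design can separate the intervention effect from secular trends.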
Observational study designs for comparative effectiveness compare a group that has been exposed to a condition or intervention (e.g., screening or treatment) with a comparison group that has not. Historically, observational studies have been prone to systematic biases. These include confounding, selection bias (when the intervention and comparison groups differ in characteristics associated with the outcome of interest), and performance bias (when delivery of an intervention is associated with generally higher quality of care by the health care unit) (Armstrong, 2012).
Cohort study designs are prone to problems that threaten their internal validity, including secular trends and regression to the mean. Multivariable analysis is perhaps the oldest and best-established method of adjusting for confounders that have been measured. However, specific well-known cases have underscored the limitations of cohort study designs, even when they include untreated comparison groups and multivariable analysis is applied. For example, multiple observational studies prior to the late 1990s indicated that the use of postmenopausal hormone replacement therapy was associated with a substantial reduction in coronary heart disease risk (Humphrey et al., 2002). Subsequently, a large RCT of postmenopausal hormone replacement therapy (HRT) found no benefit for coronary heart disease (CHD) risk (Rossouw et al., 2002). This apparent contradiction stemmed in part from many of the observational studies not having collected data on socioeconomic status, which was a key confounder of the relationship between HRT use and CHD risk.
It should be noted that in research intended to reduce health disparities for underrepresented individuals, the control group (both in RCTs and observational designs) should be selected to be as similar as possible to the intervention group. Recent articles recommend standards for reporting on race and ethnicity (Flanagin et al., 2021) and for research and publication on racial health inequities (Boyd et al., 2020).
In recent years, newer study designs and analytic methods have offered more robust approaches for causal inference, the process of inferring that an action or event affects an outcome of interest (Rothman and Greenland, 2005). Modern analytic methods include
For example, if a behavioral intervention to prevent illicit drug use in adolescents were made available only to patients making clinic visits on Mondays, Wednesdays, and Fridays, the day of the week of each clinic visit might serve as an instrumental variable in an observational study, provided it affected the outcomes of interest only through its influence on receipt of the intervention.
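A minimal simulation of this idea, with invented probabilities, shows what the instrument buys: the naive treated-versus-untreated comparison is distorted by an unmeasured confounder, while the instrumental variable (Wald) estimator recovers the true effect by comparing groups defined only by the instrument.

```python
import random

def simulate_visits(n=100_000, seed=2):
    """Toy data: the program is offered only at Mon/Wed/Fri visits (z);
    an unmeasured factor u raises both uptake and baseline risk, and the
    true effect of the intervention on the outcome probability is -0.10
    (all parameters invented)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.random() < 0.6                        # visit on Mon/Wed/Fri
        u = rng.random()                              # unmeasured confounder
        t = 1 if z and rng.random() < 0.3 + 0.4 * u else 0
        y = 1 if rng.random() < 0.4 + 0.3 * u - 0.10 * t else 0
        data.append((z, t, y))
    return data

def mean(values):
    values = list(values)
    return sum(values) / len(values)

data = simulate_visits()
# Naive comparison of treated vs. untreated patients is confounded by u
naive = (mean(y for z, t, y in data if t) -
         mean(y for z, t, y in data if not t))
# Wald estimator: outcome difference by instrument / uptake difference by instrument
wald = ((mean(y for z, t, y in data if z) - mean(y for z, t, y in data if not z)) /
        (mean(t for z, t, y in data if z) - mean(t for z, t, y in data if not z)))
print(round(naive, 3), round(wald, 3))   # wald lands close to -0.10
```

The validity of the result hinges entirely on the assumption stated in the text: the instrument must not affect the outcome through any path other than treatment receipt, an assumption that cannot be verified from the data alone (Martens et al., 2006).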
Quasi-experimental designs can be used to enhance the robustness of non-randomized studies of interventions (Harris et al., 2006). Examples include
This design can be useful when it is impossible to assign multiple groups at random to intervention or control. For example, if one health care system initiated a program that focused on enhancing colorectal cancer screening through annual stool testing, its outcomes before and after the program began could be compared with a similar system without such a program.
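This before-and-after comparison against a concurrent control system is the difference-in-differences estimator. A minimal sketch, using hypothetical screening rates for the colorectal cancer example:

```python
def difference_in_differences(pre_treat, post_treat, pre_ctrl, post_ctrl):
    """Change in the system with the program minus change in the
    comparison system, netting out shared secular trends."""
    return (post_treat - pre_treat) - (post_ctrl - pre_ctrl)

# Hypothetical annual stool-test completion rates (percent) in the
# system that launched the program vs. a similar system that did not
effect = difference_in_differences(pre_treat=42.0, post_treat=61.0,
                                   pre_ctrl=40.0, post_ctrl=47.0)
print(effect)  # 12.0 percentage points attributable to the program
```

The estimate is credible only if the two systems would have followed parallel trends absent the program, which is why choosing a genuinely similar comparison system matters.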
Most of the recent USPSTF statements the committee reviewed called for predictive analytics—risk prediction studies using tests, questionnaires, or clinical characteristics to identify patients at high risk for a condition. Research needs involving predictive analytics can be categorized as studying physical screening tests (e.g., laboratory, imaging, colonoscopy), clinical examination findings (e.g., visual examination for skin cancer), patient-reported factors on questionnaires, other clinical characteristics, or combinations of these.
A rich body of literature exists on methods for evaluating test performance and setting test cutoffs. Some common methods and issues to consider include the need for prediction models to have adequate discrimination and calibration (Alba et al., 2017), and the use of receiver operating characteristic curves to identify potential test cutoff points (Poldrack et al., 2020).
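To make these ideas concrete, the sketch below computes an ROC curve, its area (a summary measure of discrimination), and the cutoff maximizing Youden's index from a small set of hypothetical risk scores; real evaluations would use established statistical packages and also assess calibration.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate, cutoff) at each observed score."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0, float("inf"))]
    for score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos, score))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve (discrimination)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]))

def youden_cutoff(points):
    """Cutoff maximizing sensitivity + specificity - 1 (Youden's J)."""
    return max(points[1:], key=lambda p: p[1] - p[0])[2]

# Hypothetical risk scores with disease labels (1 = disease present)
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,    0,   0,   1,   0,   0]
points = roc_points(scores, labels)
print(round(auc(points), 2), youden_cutoff(points))  # 0.8 0.55
```

Youden's index weights sensitivity and specificity equally; in practice the preferred cutoff also depends on the relative costs of false positives and false negatives for the clinical question at hand.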
The importance of predictive analytics in clinical prevention research is likely to increase in the foreseeable future due to scientific advances that create the opportunity for precision medicine and precision public health. Genomics, proteomics, and metabolomics are among the many biological fields that could provide new tools for clinical prevention by developing tests to identify patient groups at high risk of future disease. Advances in artificial intelligence have made it possible to analyze images and text and to generate computer-based predictive algorithms using myriad variables available in electronic health records. Patient monitoring technology that gathers highly detailed data, such as wearable patches for heart rhythm monitoring in atrial fibrillation, is also increasing. These trends are likely to increase the needs for predictive analytics, and for modeling studies that combine predictive analytic methods with other methods to give a comprehensive overview of the projected benefits and harms of a preventive service.
One issue that seems likely to warrant increasing attention by researchers and guideline committees is the need to conduct studies that evaluate the net incremental value of a test or prediction model (i.e., its additional value relative to existing methods of identifying patients as high risk). For example, the USPSTF statement on risk assessment of cardiovascular disease using nontraditional risk factors (including the ankle-brachial index, high-sensitivity C-reactive protein, and coronary artery calcification score) noted that studies assessing these factors in isolation are of limited value. Instead, the statement called for studies comparing traditional risk assessment with traditional risk assessment plus one or more of the newer factors, so that the incremental benefits and harms of the newer factors can be more clearly delineated. Likewise, in studies of computer-based risk prediction models, Shah et al. (2019) have called for considering the net incremental value of taking plausible actions when selecting the best model, rather than simply relying on statistical measures.
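One simple way to quantify incremental value is to compare a discrimination measure for the traditional risk model against the same model augmented with the new factor. The scores below are invented for illustration, and, as Shah et al. (2019) argue, a statistical gain alone does not establish that acting on the augmented model improves care.

```python
def auc(scores, labels):
    """Probability that a randomly chosen diseased patient scores higher
    than a randomly chosen non-diseased patient (rank formulation)."""
    pos = [s for s, d in zip(scores, labels) if d]
    neg = [s for s, d in zip(scores, labels) if not d]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical cohort: same patients scored with and without the new marker
labels      = [1, 1, 1, 1, 0, 0, 0, 0]
traditional = [0.60, 0.40, 0.70, 0.30, 0.50, 0.20, 0.45, 0.35]
augmented   = [0.65, 0.55, 0.75, 0.45, 0.45, 0.20, 0.40, 0.30]
print(round(auc(traditional, labels), 3),   # discrimination of traditional model
      round(auc(augmented, labels), 3))     # higher once the marker is added
```

The quantity of interest for a guideline body is the difference between the two numbers, not either one in isolation, which is exactly the comparative design the USPSTF statement requested.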
Modeling refers to the use of tables or computer-based simulations to make projections about outcomes under varying scenarios. Models range from simple calculations presented in outcomes tables to more formal decision models. The USPSTF uses modeling to inform recommendations
when there is direct evidence of the benefit of a preventive service on health outcomes, or when there is evidence for each of the linkages in an analytic framework. The USPSTF Procedure Manual notes that candidates for modeling are generally A, B, and some C recommendations, and notes that decision modeling is primarily warranted when there are outstanding clinical questions about how best to target a clinical preventive service at the individual and program level, and it is unlikely that the systematic review can confidently determine the magnitude of net benefit, particularly for subpopulations of interest (USPSTF, 2021). Thus, the USPSTF currently uses modeling in a highly focused way, to further inform the applications of letter grade recommendations to subpopulations. It would not use modeling to overturn an “insufficient evidence” statement.
Modeling has potentially important uses in research on preventive services beyond its current role in the USPSTF process. For example, a research sponsor with finite funding may wish to compare the value of investing in research on one preventive services topic compared with another, or within a given topic, compare the value of investing in studies to address one key question compared with another. Decision models could be used to project the potential findings of the proposed studies, their effects on preventive services recommendations, and the downstream effects on health outcomes. Multi-criteria decision analysis, a type of decision modeling that takes into account multiple factors of interest, could be used along with the prioritization criteria in the taxonomy presented in this report. Such models could help sponsors compare the projected costs, benefits, and perceived value of alternative investments in research.
Modeling could also be used to inform some of the challenging decisions among competing priorities in preventive care that clinicians and policy makers make in real-world practice. For example, the number of preventive services recommended during an average visit may exceed the amount of time a clinician has available to address them. In addition, many patients have comorbidities that set up competing priorities with the impetus to deliver preventive care. To present realistic projections of the impact of preventive services in actual practice, it would be helpful for models to not assume 100 percent adherence to an intervention and to account for the fact that harms and benefits data from randomized trials may need to be adjusted to reflect real-world outcomes.
Decision models are inherently imperfect and usually rely on base case assumptions about which uncertainty exists. Sensitivity analysis is important in all modeling, to understand how results may change when assumptions are varied over plausible ranges. With the appropriate caveats in mind, modeling can be helpful to elucidate the tradeoffs among costs, benefits, and harms of alternative options, to support decision mak-
ing by policy makers, operational leaders of programs that deliver preventive care, and individual clinicians and patients.
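A toy decision model illustrates both points: adherence below 100 percent is built directly into the projection, and a one-way sensitivity analysis shows how projected net benefit shifts as adherence varies over a plausible range. All parameter values here are invented for illustration.

```python
def net_benefit(prevalence, sensitivity, treat_benefit, adherence, harm_per_screened):
    """Expected net benefit per 1,000 people invited to screening: benefit
    accrues only to adherent, truly diseased, correctly detected patients,
    while a small harm applies to everyone screened (parameters invented)."""
    screened = 1000 * adherence
    detected = screened * prevalence * sensitivity
    return detected * treat_benefit - screened * harm_per_screened

base = dict(prevalence=0.05, sensitivity=0.85, treat_benefit=1.0,
            adherence=0.65, harm_per_screened=0.01)

# One-way sensitivity analysis: vary adherence over a plausible range
for adherence in (0.40, 0.65, 0.90):
    value = net_benefit(**{**base, "adherence": adherence})
    print(f"adherence {adherence:.2f} -> net benefit per 1000 invited: {value:.1f}")
```

A full analysis would vary every uncertain parameter, singly and jointly, and report the ranges over which the preferred strategy changes.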
The USPSTF is one of many groups that develop recommendations on preventive services, and its methods are highly evidence-based and structured compared with most others. Even with its highly evolved approaches, the task force likely still has room for additional refinement in the processes used to arrive at decisions as a group. These processes could be applied, for example, to help the task force determine priorities among evidence gaps identified via the use of the taxonomy.
One example of structured group communication and decision-making processes is the Delphi method, which uses iterative questionnaires and statistical feedback to the group to develop a consensus. Other quantitative methods of group decision making include voting, weighting, ranking, scoring, and grading (Madhavan et al., 2017). Some methods, including multi-criteria decision analysis and the analytic hierarchy process, enable the user to assign weights to different criteria that describe the options; this approach can be used for individual or group decision making (Phelps and Madhavan, 2017). These methods are often poorly understood and merit more practical research.
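The weighting idea can be sketched in a few lines. The evidence gaps, criteria, ratings, and weights below are all hypothetical; a real multi-criteria decision analysis would elicit them formally from the panel.

```python
def weighted_scores(options, weights):
    """Score each option as the normalized weight-sum of its criterion
    ratings: a minimal multi-criteria decision analysis."""
    total = sum(weights.values())
    return {name: sum(weights[c] * rating for c, rating in ratings.items()) / total
            for name, ratings in options.items()}

# Hypothetical evidence gaps rated 0-10 on three prioritization criteria,
# with weights reflecting the panel's judged importance of each criterion
weights = {"burden_of_disease": 3, "equity_impact": 2, "feasibility": 1}
options = {
    "gap_A": {"burden_of_disease": 8, "equity_impact": 6, "feasibility": 4},
    "gap_B": {"burden_of_disease": 5, "equity_impact": 7, "feasibility": 9},
}
scores = weighted_scores(options, weights)
print(max(scores, key=scores.get), {k: round(v, 2) for k, v in scores.items()})
```

Making the weights explicit is the point: the ranking can flip when the panel revises what it values, and that sensitivity is itself useful information for the group.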
The USPSTF uses several types of processes where methods for group decision-making could be further tested and refined, including the selection of topics to review, the decision about a recommendation on a given topic once the evidence is reviewed, and in the future, the prioritization among evidence gaps that should be filled for a given topic. Learning more about how to best apply such methods in developing recommendations on preventive services and associated research would enhance the robustness of decisions and optimize their downstream impact.
Alba, A. C., T. Agoritsas, M. Walsh, S. Hanna, A. Iorio, P. J. Devereaux, T. McGinn, and G. Guyatt. 2017. Discrimination and calibration of clinical prediction models: Users’ guides to the medical literature. JAMA 318(14):1377–1384.
Ali, S., G. Hopkin, N. Poonai, L. Richer, M. Yaskina, A. Heath, T. P. Klassen, C. McCabe, KidsCAN PERC Innovative Pediatric Clinical Trials No OUCH Study Group, and KidsCAN PERC Innovative Pediatric Clinical Trials Methods Core. 2021. Correction to: A novel preference-informed complementary trial (PICT) design for clinical trial research influenced by strong patient preferences. Trials 22(1):353.
Alsan, M. and A. N. Finkelstein. 2021. Beyond causality: Additional benefits of randomized controlled trials for improving health care delivery. The Milbank Quarterly.
Armstrong, K. 2012. Methods in comparative effectiveness research. Journal of Clinical Oncology 30(34):4208–4214.
Berry, S. M., J. T. Connor, and R. J. Lewis. 2015. The platform trial: An efficient strategy for evaluating multiple treatments. JAMA 313(16):1619–1620.
Boyd, R. W., E. G. Lindo, L. D. Weeks, and M. R. McLemore. 2020. On racism: A new standard for publishing on racial health inequities. Health Affairs Blog, July 2, 2020. https://www.healthaffairs.org/do/10.1377/hblog20200630.939347/full (accessed November 1, 2021).
Flanagin, A., T. Frey, S. L. Christiansen, and AMA Manual of Style Committee. 2021. Updated guidance on the reporting of race and ethnicity in medical and science journals. JAMA 326(7):621–627.
Glynn, R. J., S. Schneeweiss, and T. Stürmer. 2006. Indications for propensity scores and review of their use in pharmacoepidemiology. Basic & Clinical Pharmacology & Toxicology 98(3):253–259.
Guyatt, G., A. D. Oxman, E. A. Akl, R. Kunz, G. Vist, J. Brozek, S. Norris, Y. Falck-Ytter, P. Glasziou, H. DeBeer, R. Jaeschke, D. Rind, J. Meerpohl, P. Dahm, and H. J. Schünemann. 2011. GRADE guidelines: 1. Introduction—GRADE evidence profiles and summary of findings tables. Journal of Clinical Epidemiology 64(4):383–394.
Harris, A. D., J. C. McGregor, E. N. Perencevich, J. P. Furuno, J. Zhu, D. E. Peterson, and J. Finkelstein. 2006. The use and interpretation of quasi-experimental studies in medical informatics. Journal of the American Medical Informatics Association 13(1):16–23.
Hemming, K., T. P. Haines, P. J. Chilton, A. J. Girling, and R. J. Lilford. 2015. The stepped wedge cluster randomised trial: Rationale, design, analysis, and reporting. BMJ 350:h391.
Hillier, T. A., K. L. Pedula, K. K. Ogasawara, K. K. Vesco, C. E. S. Oshiro, S. L. Lubarsky, and J. Van Marter. 2021. A pragmatic, randomized clinical trial of gestational diabetes screening. New England Journal of Medicine 384(10):895–904.
Humphrey, L. L., B. K. S. Chan, and H. C. Sox, Jr. 2002. Postmenopausal hormone replacement therapy and the primary prevention of cardiovascular disease. Annals of Internal Medicine 137(4):273–284.
Landes, S. J., S. A. McBain, and G. M. Curran. 2020. Reprint of: An introduction to effectiveness-implementation hybrid designs. Psychiatry Research 283(16):112630.
Luo, Z., J. C. Gardiner, and C. J. Bradley. 2010. Applying propensity score methods in medical research: Pitfalls and prospects. Medical Care Research and Review 67(5):528–554.
Madhavan, G., C. Phelps, and R. Rappuoli. 2017. Compare voting systems to improve them. Nature 541(7636):151–153.
Martens, E. P., W. R. Pestman, A. de Boer, S. V. Belitser, and O. H. Klungel. 2006. Instrumental variables: Application and limitations. Epidemiology 17(3):260–267.
Phelps, C. E., and G. Madhavan. 2017. Using multicriteria approaches to assess the value of health care. Value Health 20(2):251–255.
Poldrack, R. A., G. Huckins, and G. Varoquaux. 2020. Establishment of best practices for evidence for prediction: A review. JAMA Psychiatry 77(5):534–540.
Rossouw, J. E., G. L. Anderson, R. L. Prentice, A. Z. LaCroix, C. Kooperberg, M. L. Stefanick, R. D. Jackson, S. A. A. Beresford, B. V. Howard, K. C. Johnson, J. M. Kotchen, J. Ockene, and Writing Group for the Women’s Health Initiative Investigators. 2002. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: Principal results from the women’s health initiative randomized controlled trial. JAMA 288(3):321–333.
Rothman, K. J., and S. Greenland. 2005. Causation and causal inference in epidemiology. American Journal of Public Health 95:S144–S150.
Schünemann, H. J., C. Cuello, E. A. Akl, R. A. Mustafa, J. J. Meerpohl, K. Thayer, R. L. Morgan, G. Gartlehner, R. Kunz, S. V. Katikireddi, J. Sterne, J. P. Higgins, G. Guyatt, and the GRADE Working Group. 2019. GRADE guidelines: 18. How ROBINS-I and other tools to assess risk of bias in nonrandomized studies should be used to rate the certainty of a body of evidence. Journal of Clinical Epidemiology 111:105–114.
Shah, N. H., A. Milstein, and S. C. Bagley. 2019. Making machine learning models clinically useful. JAMA 322(14):1351–1352.
Sterne, J. A., M. A. Hernán, B. C. Reeves, J. Savović, N. D. Berkman, M. Viswanathan, D. Henry, D. G. Altman, M. T. Ansari, I. Boutron, J. R. Carpenter, A. W. Chan, R. Churchill, J. J. Deeks, A. Hróbjartsson, J. Kirkham, P. Jüni, Y. K. Loke, T. D. Pigott, C. R. Ramsay, D. Regidor, H. R. Rothstein, L. Sandhu, P. L. Santaguida, H. J. Schünemann, B. Shea, I. Shrier, P. Tugwell, L. Turner, J. C. Valentine, H. Waddington, E. Waters, G. A. Wells, P. F. Whiting, and J. P. Higgins. 2016. ROBINS-I: A tool for assessing risk of bias in non-randomised studies of interventions. BMJ 355:i4919.
Stürmer, T., R. Wyss, R. J. Glynn, and M. A. Brookhart. 2014. Propensity scores for confounder adjustment when assessing the effects of medical interventions using nonexperimental study designs. Journal of Internal Medicine 275(6):570–580.
USPSTF (U.S. Preventive Services Task Force). 2021. U.S. Preventive Services Task Force procedure manual. https://www.uspreventiveservicestaskforce.org/uspstf/about-uspstf/methods-and-processes/procedure-manual (accessed August 30, 2021).