Decision makers are rarely interested in evidence that applies only to the specific samples, conditions, and outcomes included in research studies. Rather, they are interested in applying study results to predict—or to generalize—findings to their own populations, settings, and outcomes of interest. Generalizing findings requires (1) study results that are internally valid and replicable; (2) curricula, study samples, contexts, and outcomes that are representative of the conditions to be generalized over; and (3) knowledge about the extent to which findings vary and why (Cronbach, 1982; Shadish et al., 2002; Stuart et al., 2011; Tipton, 2012; Tipton & Olsen, 2018).
As described in Chapter 2, the evidence base on preschool curricula includes findings from many rigorous, internally valid evaluations, some of which have been replicated over multiple studies. Yet gaps remain in the types of curricula, study samples, contexts, and outcomes that are represented in the existing literature. These include questions about the effects of culturally responsive curricula; the effects of curricula in different settings including family child care or for-profit contexts; the effects on less commonly studied child outcomes, such as problem solving and curiosity, child positive racial identity, and multilingual learners’ growth in home language; and the effects on other widely adopted instructional approaches to preschool, such as Montessori.
Emerging research on preschool curriculum also suggests variation in curriculum effectiveness. Given the diversity of early childhood conditions under which curricula are delivered, research about curriculum effectiveness should describe the specific contexts, settings, and populations under which the curriculum effect was observed. This is because study findings are
often dependent on contexts and unstated theoretical assumptions under which the study was designed and carried out. For example, a curriculum may have a small average effect in improving students’ early literacy skills, but the magnitude of the effect may be much larger (or smaller) in different preschool settings (e.g., Head Start versus child care center), with more experienced educators delivering the curriculum, and for students with different home language experiences. Effects may also vary by the type of study design (e.g., experimental versus quasi-experimental).
When curriculum effects vary substantially because of differences in student characteristics, in how the curriculum is delivered, in the settings in which it is delivered, and in the outcomes examined, findings observed under one specific set of circumstances will often not be observed under a different set of circumstances. This implies that different curriculum approaches may be more (or less) appropriate for addressing students’ developmental and learning goals. In these cases, the program administrator’s goal is not to select the preschool curriculum that is most effective “on average,” but to identify the curriculum that is most effective for their specific students and context. For example, program administrators in tribal communities may seek curriculum approaches that teach children academic and social-emotional skills in ways that are culturally responsive and well aligned to the norms, mores, and goals of their community and families; educators who have children with disabilities in their classrooms may need curricula that address the unique learning needs of their specific students.
The evidence base for a new vision of preschool curriculum, therefore, must address the multiple and complex ways in which children, educators, and their environments interact with a curriculum and its delivery. And yet, generating the evidence needed to understand variation in curriculum effects remains challenging. This chapter focuses on research and methodological issues that arise in designing and implementing studies for evaluating variation in curriculum effectiveness. The committee first presents a framework for understanding why curriculum effects may vary across different studies and contexts (Steiner et al., 2019). The framework describes both programmatic and study-related factors that may be critical in determining the size of curriculum effects. Next, the chapter discusses challenges in identifying sources of effect variation within studies and across multiple studies. Finally, the chapter presents conclusions drawn from a review of the literature on curriculum effectiveness.
The overarching mission of many researchers of early childhood education is to identify curricula, practices, and policies that improve student outcomes. When curriculum effects are replicated over multiple studies with diverse settings, populations, intervention deliveries, and contexts, researchers and decision makers have increased confidence that findings can likely be generalized to their specific population or context of interest. However, when study findings are fragile, hard to replicate, and not robust across different populations of interest, the external validity of those findings is called into question.
All scientific conclusions about curriculum effectiveness are based on study findings drawn from samples of participants in specific settings, with specific treatment protocols and materials. The challenge arises when, in interpreting these findings, researchers fail to specify whom and what the curriculum effects are intended to represent, as well as any potential constraints on generality that may amplify or dampen the size of curriculum effects. In the absence of such information, program administrators may assume that study conclusions apply broadly to any sample or context. But when study findings fail to replicate, the trustworthiness of those findings is questioned, and their value for evidence-based decision making is doubted.
Given the diversity in contexts and settings in which preschool curricula are implemented and delivered—and the potential variation in students’ responses to different curricular approaches—a research agenda that supports a new vision of preschool curriculum must seek to understand the extent to which curriculum effects are robust across contexts, settings, and populations. And, in cases where effect variation is observed, the evidence base must identify why effects varied in order to clarify the populations and conditions to which findings do—and do not—generalize.
There are multiple reasons why curriculum effects may differ across studies and study conditions. Steiner et al. (2019) proposed a framework for describing key sources of effect variation across different populations, contexts, and studies. The framework demonstrates that the size of an intervention effect depends on programmatic considerations as well as study design characteristics. “Programmatic considerations” include variations in the curriculum or intervention, the condition against which it is compared, and the outcomes on which curriculum effectiveness is measured. They also include differences in student and setting characteristics that may interact with the curriculum to amplify or dampen its effect. When curriculum effects fail to replicate because of programmatic differences across studies, researchers conclude that curriculum effects will likely not generalize to other student populations, contexts, settings, outcomes, and treatment deliveries (Rubin, 1981).
“Study design characteristics” include methodology decisions that may affect the size and precision of an effect. When results are compared from studies with different design characteristics, they may differ because researchers’ choices in methodology yielded different conclusions about
curriculum effects. For example, one study effect may be obtained from a randomized controlled trial, while another study effect is obtained from an observational study. Curriculum effects across the two studies may differ because the latter suffers from selection bias, raising concerns about the validity of the study findings. Study findings may also fail to replicate because one or both studies lack statistical power for producing precise estimates of the curriculum effect. Decision makers are often not interested in study findings that fail to replicate because of methodology choices. However, in cases where both programmatic and study design characteristics vary simultaneously, it may be impossible to disentangle why study findings failed to replicate. Do curriculum effects differ across studies and study conditions because effects are not generalizable or because of study design choices? It may be impossible to know. A researcher or decision maker may mistakenly interpret incongruent study findings to mean that a curriculum effect fails to generalize, when the true reason may be because one of the studies lacked statistical power to detect curriculum effects.
A central goal for a research agenda supporting a new vision for preschool curriculum is examining the extent to which curriculum effects are generalizable—and when they are not—across variations in student populations, contexts, settings, and outcomes. This report describes both programmatic and study design reasons why curriculum effects may vary. Chapter 2 summarizes the empirical literature examining programmatic reasons for variation in curriculum effects—including differences in curriculum type; in the outcomes used for assessing effectiveness; in the students participating in the curriculum and their backgrounds, knowledge, and experiences; in the characteristics of teachers using the curriculum; in the preschool setting in which the curriculum is delivered; and in macro conditions that may interact with how the curriculum is delivered (e.g., funding for preschool, state licensure requirements for preschool teachers). These characteristics may amplify or dampen the size of the curriculum effect and may interact in complex ways that affect the effectiveness of a curriculum (Tefera et al., 2018). For example, a curriculum may be especially effective for students with disabilities in public preschool settings but less so for children without disabilities in private child care centers. Understanding the extent to which curriculum effects vary by programmatic features is critical for determining the generalizability of study findings.
The following section describes study design–related reasons that curriculum findings may differ across studies and study conditions. These include differences in the treatment–control contrast used for evaluating the curriculum, the research design used for identifying and estimating the curriculum effect, and the size of the sample used to evaluate the effect. While these characteristics are usually investigated to assess the generalizability of study findings, they also bear on the feasibility, logistics, and ethics of conducting a study. They therefore warrant special consideration by researchers and funders of studies on preschool curriculum.
Curriculum effects are determined by comparing average outcomes for students who participated in the curriculum with those for students who did not. As such, the activities children engage in under both the curriculum and the control condition can have a substantial impact on the size and direction of the effect. In preschool curriculum studies, children in the control condition may participate in a wide variety of activities. For example, they may be learning from a curriculum that teachers used prior to the introduction of the new curriculum, they may be engaged in an online activity on a computer or tablet, or they may not be enrolled in a preschool program at all. Usually, the control condition includes the learning activities, experiences, and instruction that the child would have received had they not participated in the curriculum under investigation. Given that these circumstances can vary widely across preschool settings—and that curriculum effects are determined by comparing outcomes for students who did and did not participate in the curriculum—understanding what occurred in the control condition is critical for interpreting curriculum effects.
In general, studies with strong treatment contrasts—with more distinct intervention and control group differences—will produce effects that are larger than those of studies with weak treatment contrasts. For example, Duncan & Magnuson (2013) noted that programs evaluated before 1980 produced substantially larger effects than those evaluated later. They argue that one explanation for the decline in effects is that the “counterfactual conditions for children in the control group studies have improved substantially” (Duncan & Magnuson, 2013, p. 114). In more recent samples, children in the control group were much more likely to attend center-based care programs and were likely to experience higher-quality home environments with more educated mothers (Duncan & Magnuson, 2013).
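A purely illustrative arithmetic sketch of this point, with invented numbers (not values from Duncan & Magnuson, 2013): because the measured effect is the difference between treatment and control group outcomes, the same curriculum looks weaker when the counterfactual condition improves.

```python
# Hypothetical illustration: effect = treatment group mean - control group mean,
# expressed in standard deviation (SD) units of the outcome.
treated_mean = 0.50     # assumed average gain under the curriculum
control_weak = 0.10     # weak counterfactual (e.g., little center-based care available)
control_strong = 0.35   # stronger counterfactual (higher-quality alternative settings)

print(f"Effect vs. weak counterfactual:   {treated_mean - control_weak:.2f} SD")
print(f"Effect vs. strong counterfactual: {treated_mean - control_strong:.2f} SD")
```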
Reanalysis of data from the Head Start Impact Study also concluded that the overall average effect masked substantial variations in Head Start effects that were related to differences in the control conditions (Morris et al., 2018)—here, the authors found evidence of sustained impacts for Head Start when the control consisted of children who stayed home and did not attend center-based care. Finally, in the early 2000s, the Preschool Curriculum Evaluation Research (PCER; 2008) program supported a series of experimental evaluations examining the relative performance of curricular approaches. A recent meta-analysis of the PCER data examined the
performance of different curricular approaches against alternative counterfactuals (Jenkins et al., 2018). The authors compared the performance of content-specific curricula in reading and math with what they described as whole-child-focused curricular approaches, such as HighScope and The Creative Curriculum, and with locally developed curricula.1 Overall, the authors concluded that content-specific curricula produced larger effects on targeted outcomes than did whole-child approaches, and that whole-child approaches did not yield student-level effects that were reliably different from those of locally developed curricula. However, the original PCER evaluation studies were conducted 20 years ago, and the curricula represented in the control condition have been revised since then.
“Research design” describes the methodological approach used for determining curriculum effectiveness. Most research designs involve the comparison of outcomes for one curriculum condition versus those obtained for an alternative condition. Students, teachers, or centers may be randomly assigned to different curriculum conditions, or they may be asked to select their own conditions. When participants select their own curriculum conditions, researchers may use statistical adjustment procedures to compare outcomes for students and classrooms that are observationally similar. The goal here is to ensure that differences observed in outcomes are the result of exposure to different curriculum approaches and not because of other differences between groups.
Research design features are important if the choice in methodology contributes to the size of the curriculum effect. For example, in a preschool evaluation where the curriculum is not randomly assigned but is selected by center directors, researchers may be concerned that center directors will be more likely to choose the curriculum being assessed because children in their centers are at risk for low academic achievement. By comparing outcomes for students in centers that selected the curriculum with those in centers that did not, the curriculum may appear ineffective—or even have negative effects—because students enrolled in the intervention centers were at greater risk for low achievement than students in the comparison centers. Although the researcher may use statistical procedures to ensure that both groups of children appear observationally similar, children across the two groups may also differ in ways that are unobserved by the researcher. In these cases, it can be difficult to differentiate why children in the curriculum condition exhibited lower outcome scores than those in the control condition—was it because the curriculum was ineffective or because there were other unobserved differences between children in the two groups?
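A minimal sketch of the kind of statistical adjustment described above, using simulated data; the scenario, variable names, and the use of inverse-probability weighting based on a propensity score are illustrative assumptions rather than a description of any particular evaluation. The sketch shows how selection on an observed risk factor can make an effective curriculum look ineffective in a naive comparison, and how adjustment recovers the effect only when the relevant differences are observed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated scenario: centers serving children at higher baseline risk are
# more likely to adopt the new curriculum (selection on an observed variable).
n = 2000
risk = rng.normal(size=n)                       # observed baseline risk index
adopt_prob = 1 / (1 + np.exp(-0.8 * risk))      # higher-risk centers adopt more often
treated = rng.binomial(1, adopt_prob)
outcome = 0.15 * treated - 0.50 * risk + rng.normal(scale=1.0, size=n)  # true effect = 0.15 SD

# Naive comparison of group means is biased because treated children are higher risk.
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Adjustment: estimate the propensity score from the observed risk index, then
# weight each child by the inverse probability of the condition actually received.
ps = LogisticRegression().fit(risk.reshape(-1, 1), treated).predict_proba(risk.reshape(-1, 1))[:, 1]
weights = np.where(treated == 1, 1.0 / ps, 1.0 / (1.0 - ps))
adjusted = (np.average(outcome[treated == 1], weights=weights[treated == 1])
            - np.average(outcome[treated == 0], weights=weights[treated == 0]))

print(f"Naive difference in means:    {naive:+.2f}")     # attenuated or even negative
print(f"Weighted (adjusted) estimate: {adjusted:+.2f}")  # close to the true +0.15
```

If children in adopting and non-adopting centers also differed on characteristics the researcher never observed, no amount of adjustment of this kind could rule out the alternative explanation described above.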
___________________
1 Some of the curricula evaluated as part of the 2008 PCER studies have undergone revisions in the time since these evaluations were conducted. As such, the versions currently used in classrooms may differ from those evaluated as part of these studies.
The modern program evaluation literature prioritizes clear interpretation of intervention effects—or the internal validity of a study—as the sine qua non for high-quality, rigorous evaluations of program or policy effects (Campbell & Stanley, 1963). This is in part because empirical evaluations of methods have shown that, compared with experimental approaches, nonexperimental methods can yield badly biased—or incorrect—results about intervention effectiveness (Fraker & Maynard, 1987; LaLonde, 1986; Wong et al., 2018). One benefit of experimental approaches, therefore, is that they yield causally interpretable effects when the assumptions of the research design are met. Moreover, when deviations from the planned research design do occur—deviations that may introduce bias into the intervention effect—they can often be detected by the researcher. Such deviations may include differential attrition across intervention conditions, the introduction of additional interventions that occur simultaneously with the curriculum, or failure to comply with randomly assigned intervention conditions. For these reasons, most evidence registries (e.g., the What Works Clearinghouse, Blueprints for Healthy Youth Development) have minimum requirements for inclusion that are related to the quality and implementation of the research design.
To date, the committee is unaware of any studies that have directly compared the magnitude of curriculum effects by research design. In a broader meta-analysis of 84 studies of early care and education program impacts, the difference in effect sizes between evaluations in which interventions were randomly and nonrandomly assigned was not statistically significant (0.25 standard deviations for randomized controlled trials versus 0.19 standard deviations for nonexperiments; Duncan & Magnuson, 2013). However, inclusion in the meta-analysis required an experimental or quasi-experimental design with more than 10 participants in each condition and less than 50% attrition. Quasi-experimental effects were limited to those estimated using repeated measures approaches (e.g., change models, difference-in-difference models), regression discontinuity, propensity score matching, or instrumental variable approaches.
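A brief sketch of how such a design-based comparison can be carried out, using entirely hypothetical study-level effects and sampling variances (the values below are not from Duncan & Magnuson or any real meta-analysis): each design subgroup is pooled with a random-effects model, and the pooled estimates are then compared.

```python
import numpy as np
from scipy import stats

def pool_random_effects(d, v):
    """DerSimonian-Laird random-effects pooled effect and its variance."""
    w = 1 / v
    d_fixed = np.sum(w * d) / np.sum(w)
    q = np.sum(w * (d - d_fixed) ** 2)                      # heterogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(d) - 1)) / c)                 # between-study variance
    w_re = 1 / (v + tau2)
    return np.sum(w_re * d) / np.sum(w_re), 1 / np.sum(w_re)

# Hypothetical study effects (SD units) and sampling variances by design subgroup.
d_rct, v_rct = np.array([0.31, 0.18, 0.27, 0.22]), np.array([0.010, 0.012, 0.015, 0.011])
d_qed, v_qed = np.array([0.14, 0.25, 0.17, 0.21]), np.array([0.014, 0.016, 0.012, 0.013])

mean_rct, var_rct = pool_random_effects(d_rct, v_rct)
mean_qed, var_qed = pool_random_effects(d_qed, v_qed)

# z-test for whether the two design subgroups yield different average effects.
z = (mean_rct - mean_qed) / np.sqrt(var_rct + var_qed)
p = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"RCT pooled: {mean_rct:.2f}  QED pooled: {mean_qed:.2f}  p = {p:.2f}")
```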
Finally, studies with small samples produce less precise effect estimates. In these cases, it can be difficult to detect effect variation at all, much less identify sources of the effect variation. In addition to producing imprecise effect estimates, small sample studies may include participants and conditions that are not representative of populations ultimately intended to receive the intervention or curriculum. For example, intervention conditions
may be administered by the researcher or developer and delivered under controlled settings, making the study more akin to a laboratory trial than to field research. Participants, aware of their involvement in a small, novel intervention, may also respond differently than they would have, had they been involved in a scaled-up version of the intervention with many participants. In Duncan & Magnuson’s (2013) review of program impacts in early care and education, small sample studies tended to have larger impacts, but these studies were also more likely to feature researcher-developed programs and to have been conducted prior to 1980. For all of these reasons, studies with small samples may be most informative when they can be synthesized with effect estimates from other study efforts.
In the research literature, the programmatic and study design features are sometimes described as “moderators” of intervention effects. Moderators may be examined by comparing curriculum effects for different subgroups of participants within the same study (within-study approaches) or by comparing effects across multiple studies with different participants, settings, and sometimes research methodologies (between-study approaches).
The within-study approach offers the benefit of comparing effects for different subgroups of participants within the same study, who usually have observations on the same measures and have likely undergone similar study procedures (Bloom & Michalopoulos, 2013). Thus, if differential effects are observed between groups of participants, the researcher may have more confidence that effect heterogeneity is due to differences between subgroups of students and not because of other extraneous, study-related characteristics. However, within-study comparisons of effects are limited because studies often do not have sufficient sample sizes for detecting differential effects for subgroups of participants and settings (Sabol et al., 2022; Spybrook et al., 2016; Tipton, 2021). And, in the absence of strong theory guiding moderator analyses, researchers may be prone to conducting multiple moderator tests and reporting only statistically significant results. The challenge here is that these effects may be significant by chance, resulting in misleading conclusions about moderator effects.
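A back-of-the-envelope illustration of that multiplicity problem, assuming independent tests at a nominal 5% level (a simplifying assumption; moderator tests within a study are rarely fully independent):

```python
# If a researcher runs m independent moderator tests at alpha = .05 when no true
# moderation exists, the chance of at least one "significant" result by chance is
# 1 - (1 - alpha)^m. A Bonferroni threshold (alpha / m) keeps that rate near alpha.
alpha = 0.05
for m in (1, 5, 10, 20):
    familywise = 1 - (1 - alpha) ** m
    print(f"{m:2d} tests: P(at least one false positive) = {familywise:.2f}, "
          f"Bonferroni threshold = {alpha / m:.4f}")
```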
The between-study approach compares results across different studies with variations in populations, settings, intervention conditions, and outcomes. The studies included in the review often have been screened to meet criteria for yielding interpretable results, including a valid and well-implemented research design. For example, the What Works Clearinghouse applies methodological requirements to education evaluation studies for
inclusion in its evidence registry. It prioritizes study results with strong internal validity, such as those evaluated by experimental or well-implemented quasi-experimental designs. When results from multiple studies with similar interventions and outcomes are available, the What Works Clearinghouse uses meta-analytic approaches for examining the overall average effect of the intervention, as well as evidence for effect heterogeneity.
Prioritizing strong internal validity in evaluation studies, however, can introduce biases into the evidence base used for summarizing intervention effectiveness. Although experimental designs are viewed as the gold standard for yielding unbiased intervention effects, these approaches require intervention conditions that can be manipulated or randomly assigned to participants (Shadish et al., 2002). Promising curricular approaches that are not easily evaluated by random assignment (or quasi-experimental approaches) may therefore be omitted from the evidence base. For example, a systems-based policy reform, or a curriculum designed for a specific tribal community that cannot be randomly assigned, may be excluded from the evidence registry. Criteria for including study findings in an evidence base should therefore attend to both the internal and external validity of studies—that is, to the representativeness of the interventions, contexts, and populations reflected in the included findings (Imai et al., 2008).
The focus on internal validity may also overshadow other concerns about study quality, including the construct validity of the intervention and the conditions being compared. For example, the researcher’s interpretation and understanding of intervention components may not be well aligned with participants’ experience and understanding of the curriculum or program, or with the contexts in which the curriculum was delivered. Moreover, intervention effects can be determined only for constructs and outcome domains that can be reliably and validly measured, which may be challenging in preschool studies that often require direct assessments of young children. Outcome measures may not adequately represent all the domain areas that are critical for healthy development; they may also fail to fully capture the learning and growth of children from marginalized communities, especially those whose home languages differ from those represented in the assessment.
Finally, intervention effects can be obtained only for samples and settings that are accessible to researchers. Study samples are recruited for various reasons (Tipton & Olsen, 2022; Tipton et al., 2020). They may be locationally convenient, logistically feasible, and/or financially reasonable, but they are rarely obtained through random—or even purposive—sampling from a well-defined population of units, treatments, outcomes, settings, and times. Children from marginalized communities, or children belonging to low-incidence disability groups, may be underrepresented in study samples, potentially limiting the generalizability of study results. If different types of curricula are effective for underrepresented children, those effects will not be reflected in the evidence base.
Study effects may be averaged in a meta-analysis without clarifying what, whom, where, or when those effects represent (Schauer & Hedges, 2021).
In a meta-analysis with enough study effects, the researcher may examine whether variations in curricular approaches, participant characteristics, and settings—as well as study design characteristics—are related to the size and direction of intervention effects. These relationships can be modeled as factors related to programmatic and study design conditions (and their interactions) in a series of meta-regressions of effects. The approach allows researchers to observe and test the robustness of effects across different programmatic and study design features, as well as to begin to formulate hypotheses about the conditions under which effects may or may not vary.
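A minimal sketch of such a meta-regression, using a simulated set of study effects; the moderators, weights, and data-generating assumptions (including a between-study variance fixed at 0.01) are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical meta-analytic data set: one row per study effect, with a
# programmatic moderator (content-specific vs. whole-child curriculum) and a
# study design moderator (randomized vs. quasi-experimental).
k = 40
content_specific = rng.binomial(1, 0.5, size=k)
randomized = rng.binomial(1, 0.5, size=k)
v = rng.uniform(0.005, 0.03, size=k)                        # sampling variance of each effect
true_effect = 0.10 + 0.15 * content_specific                # assumed data-generating model
d = true_effect + rng.normal(scale=np.sqrt(v + 0.01), size=k)

# Meta-regression: weighted least squares with inverse-variance weights,
# regressing study effects on the programmatic and design moderators.
X = np.column_stack([np.ones(k), content_specific, randomized])
W = np.diag(1 / (v + 0.01))                                 # crude weights; tau^2 held fixed
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ d)
se = np.sqrt(np.diag(np.linalg.inv(X.T @ W @ X)))

for name, b, s in zip(["intercept", "content-specific", "randomized"], beta, se):
    print(f"{name:>16s}: {b:+.2f} (SE {s:.2f})")
```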
However, even in a meta-analysis, it is often unclear how the researcher can best interpret these associations, and whether the factors involved causally amplify or dampen the size of the intervention effect. Moreover, these approaches do not allow all sources of variation to be tested simultaneously.
To consider challenges with identifying sources of variation across multiple but coordinated studies, the committee examined experimental results produced by the PCER program, which was funded by the Institute of Education Sciences (IES). The goal of this initiative was to provide rigorous evidence about the efficacy of available preschool curricula. The initiative funded 12 research teams from across the country to experimentally evaluate 14 preschool curricula using a common set of measures (one curriculum—The Creative Curriculum—was evaluated twice by two different research teams, yielding 15 evaluation studies of curricula). Starting in fall 2003, the study’s sample included predominantly low-income children enrolled in Head Start programs, state prekindergarten programs, or private child care centers. Outcomes for students’ skills (reading, phonological awareness, language development, mathematics knowledge, and behavior) were examined at the end of the preschool and kindergarten years. Researchers also examined classroom-level outcomes, including measures of classroom quality, teacher–child interaction, and instructional quality. Results were analyzed and reported separately for each outcome and study because each team had its own sampling plan and randomization scheme. The final PCER (2008) report described mixed results for both the student- and classroom-level outcomes. While 8 of the curricula had statistically significant impacts on classroom-level measures, 7 did not. Two curricula showed significant impacts on at least some of the student-level measures at the end of the preschool year, while 13 did not have any statistically significant effects. By the end of the kindergarten year, 3 curricula demonstrated positive effects on at least some student-level outcomes, while 11 had no impacts and 1 had negative impacts.
PCER was funded with the goal of providing decision makers with definitive evidence for choosing preschool curriculum. The initiative required that curricula be evaluated using random assignment and include samples
of children and programs that were of interest to decision makers. The programs included Head Start, state prekindergarten, and private child care centers in urban, rural, and suburban locations. The initiative also included standardized measures for assessing outcomes at the student and classroom levels, for reporting curriculum fidelity and contamination of intervention conditions, and for assessing participant response rates and attrition. Finally, the effort included independent evaluations of curricula conducted by 12 research teams, with technical support for conducting studies from two contract research firms. Given the well-defined target population, standardized method of data collection, and experimental design, the PCER initiative represents the acme of field evaluation methods for informing evidence-based decision making. So why did the PCER effort not yield more conclusive evidence for guiding curriculum choice?
One issue was the lack of statistical power for individual studies to detect significant effects. Random assignment occurred at the classroom or program level, with group sample sizes ranging between 11 and 40 clusters per evaluation study (the median group-level sample size was 18 classrooms or programs). Research teams reported minimum detectable effect sizes that ranged from 0.34 to 0.69 across composite student outcome measures, suggesting that individual studies were underpowered to detect effects smaller than roughly a third of a standard deviation (PCER, 2008). The lack of statistically significant findings was therefore perhaps not surprising.
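To illustrate how cluster-level sample sizes translate into minimum detectable effect sizes, the sketch below applies the standard formula for a two-level cluster-randomized design with no covariates. The intraclass correlation (0.15), classroom size (15 children), and function name are assumptions chosen only to mirror the scale of the PCER studies, not values reported by the research teams.

```python
import numpy as np
from scipy import stats

def mdes_cluster_rct(n_clusters, n_per_cluster, icc, alpha=0.05, power=0.80, p_treat=0.5):
    """Minimum detectable effect size (SD units) for a two-level cluster-randomized
    design with no covariates and an equal split of clusters across conditions."""
    df = n_clusters - 2
    multiplier = stats.t.ppf(1 - alpha / 2, df) + stats.t.ppf(power, df)
    var_term = (icc / (p_treat * (1 - p_treat) * n_clusters)
                + (1 - icc) / (p_treat * (1 - p_treat) * n_clusters * n_per_cluster))
    return multiplier * np.sqrt(var_term)

# Assumed values: 15 children per classroom, intraclass correlation of 0.15.
for j in (11, 18, 40):
    print(f"{j:2d} clusters: MDES ~ {mdes_cluster_rct(j, 15, 0.15):.2f} SD")
```

Under these assumptions, even the largest studies can detect only moderately sized effects, which is consistent with the reported range of minimum detectable effect sizes.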
Variations in study design characteristics also challenged the interpretation of results. Across the 15 evaluation studies (of 14 curricula), there were substantial differences in the comparison conditions used for assessing curriculum effects, the preschool settings in which the evaluations occurred, the location of sites, and the training of teachers on curriculum materials. For example, the evaluation of Project Construct found no statistically significant effects on student-level outcomes, whereas the evaluation of DLM Early Childhood Express with Open Court Reading Pre-K found statistically significant effects on student-level outcomes in reading, phonological awareness, and language. Explaining why effects differed across the two curricula is more challenging. One reason may be that DLM Early Childhood Express with Open Court Reading is a more effective curriculum than Project Construct. Another could be that the teacher-developed materials in the control condition for Project Construct were more effective than the materials used by teachers in the control condition for the DLM Early Childhood Express study. Curriculum effects may also vary by preschool setting—the DLM evaluation took place in public prekindergarten classrooms in Florida, and the Project Construct evaluation took place in private child care centers in Missouri.
To address some of the challenges with interpreting results from the PCER initiative, Jenkins et al. (2018) reanalyzed results from the 2008
PCER study through a meta-analysis. By combining effect estimates across the 15 curriculum studies, the research team was able to address some of the ambiguity in conclusions due to weak statistical power for the individual studies; they also explored one hypothesis about why effects may have varied across studies. To conduct their analyses, Jenkins et al. (2018) compared curriculum effects according to four different treatment–control contrasts in the PCER initiative: (1) literacy-focused curriculum versus HighScope and The Creative Curriculum, (2) literacy-focused versus locally (or teacher-) developed curriculum, (3) mathematics-focused versus HighScope and The Creative Curriculum, and (4) The Creative Curriculum versus locally developed curriculum. Overall, the authors concluded that, compared with The Creative Curriculum and HighScope, the literacy- and mathematics-focused curricula had stronger evidence of improving student-level outcomes; they also concluded that there was not much evidence that The Creative Curriculum and HighScope improved students’ school-readiness skills more than teacher- or locally developed curriculum approaches.
However, as discussed previously, the curriculum studies varied in multiple ways besides the treatment contrast investigated by Jenkins et al. (2018). If the type of treatment–control contrast covaried with other setting characteristics (including the type of preschool setting and the fidelity of curriculum implementation), it may be difficult to draw definitive conclusions about why effects differed. As such, while post hoc approaches such as meta-analysis allow the researcher to explore and disentangle various predictors of effect variation, these analyses cannot definitively identify the “cause” of variation in curriculum effects, nor can they determine whether multiple sources of variation produce differential effects simultaneously. For example, the effectiveness of different curriculum approaches may vary both by the type of preschool program in which they are delivered and by the children enrolled in the program. Drawing such a conclusion would require prospective research designs that intentionally and systematically vary multiple sources of effect variation.
The goal of identifying what works, under what conditions, and for whom is not new in education research or in the evaluation of pre-K curricula. Given the diversity in settings, populations, and conditions under which pre-K curricula can be delivered, there is strong interest in understanding the extent to which, and why, curriculum effects vary. To address these concerns, IES introduced the Standards for Excellence in Education Research (SEER) in 2019, encouraging researchers to begin identifying the conditions under which intervention effects are generated. SEER asked grant recipients to specify intervention components, document the treatment implementation and contrast, and take steps to facilitate the generalization of study findings. The reasoning here was that it is difficult to identify sources of effect heterogeneity—even as correlational relationships—when it is unclear what the effects themselves represent. In its 2022 review of IES’s work, the National Academies of Sciences, Engineering, and Medicine recommended that the agency prioritize the funding of studies to understand the extent to which intervention effects vary and to begin to identify sources of effect variation. The preschool evaluation literature likewise calls for researchers to characterize and understand the extent to which intervention effects vary (National Academies of Sciences, Engineering, and Medicine, 2022).
Evidence about curriculum effectiveness is central to considerations of quality. Despite broad agreement among researchers and funders that understanding sources of effect heterogeneity is important for evidence-based decision making, the evidence on curriculum effectiveness often falls short of these goals. The preceding section described challenges that researchers face in understanding sources of effect variation. Results from individual studies—even large-scale, multisite trials—are often underpowered for detecting and testing treatment effect variation (Sabol et al., 2022). In cases where results from multiple studies are combined, such as in a meta-analysis, it may be difficult to interpret the synthesized findings because the individual study results may represent different populations, contexts, settings, and outcomes that are not well understood by the meta-analyst and reader. Even when multiple curriculum evaluation studies are planned and conducted in coordinated ways—such as in the PCER study—it may be difficult for researchers to understand and disentangle why effects differed across studies, given the multiple sources of effect variation operating simultaneously.
Data, along with quantitative and qualitative methods, are needed to describe the rich contexts in which preschool curricula are implemented and delivered, as are new analytic methods for examining and describing variation in effects. Ideally, evidence generated using these methods would reflect the conditions that program administrators, educators, and families are likely to encounter in practice.
Because effectiveness is determined by comparing outcomes for children participating in the curriculum with outcomes obtained under an alternative condition, it is crucial that comparison conditions represent circumstances that program administrators, educators, and parents are likely to face. Moreover, high-quality teaching requires that educators be responsive to the dynamic and individual needs of the children in their classrooms, so adaptations of curriculum materials are likely to occur. Study findings that are informed by an understanding of how the curriculum was delivered in real-world settings, and of the extent to which delivery deviated from the intended protocols for the intervention and comparison conditions, can provide valuable insights about effectiveness. The issue, then, is how researchers should carry out a research agenda that addresses the evolving needs of a diverse early childhood education landscape. The future research agenda described in Chapter 10 of this report highlights three areas of work needed to support such a research endeavor.
Bloom, H. S., & Michalopoulos, C. (2013). When is the story in the subgroups? Strategies for interpreting and reporting intervention effects for subgroups. Prevention Science: The Official Journal of the Society for Prevention Research, 14(2), 179–188. https://doi.org/10.1007/s11121-010-0198-x
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching (pp. 171–246). Rand McNally.
Cronbach, L. J. (1982). In praise of uncertainty. New Directions for Program Evaluation, 1982(15), 49–58. https://doi.org/10.1002/ev.1310
Duncan, G. J., & Magnuson, K. (2013). Investing in preschool programs. Journal of Economic Perspectives, 27(2), 109–132. https://doi.org/10.1257/jep.27.2.109
Fraker, T., & Maynard, R. (1987). The adequacy of comparison group designs for evaluations of employment-related programs. Journal of Human Resources, 22(2), 194–227. https://doi.org/10.2307/145902
Imai, K., King, G., & Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society Series A: Statistics in Society, 171(2), 481–502. https://doi.org/10.1111/j.1467-985x.2007.00527.x
Jenkins, J. M., Duncan, G. J., Auger, A., Bitler, M., Domina, T., & Burchinal, M. (2018). Boosting school readiness: Should preschool teachers target skills or the whole child? Economics of Education Review, 65, 107–125. https://doi.org/10.1016/j.econedurev.2018.05.001
LaLonde, R. (1986). Evaluating the econometric evaluations of training with experimental data. American Economic Review, 76(4), 604–620. http://www.jstor.org/stable/1806062
Morris, P. A., Connors, M., Friedman-Krauss, A., McCoy, D. C., Weiland, C., Feller, A., Page, L., Bloom, H., & Yoshikawa, H. (2018). New findings on impact variation from the Head Start Impact Study: Informing the scale-up of early childhood programs. AERA Open, 4(2). https://doi.org/10.1177/2332858418769287
National Academies of Sciences, Engineering, and Medicine. (2022). The future of education research at IES: Advancing an equity-oriented science. The National Academies Press. https://doi.org/10.17226/26428
Preschool Curriculum Evaluation Research Consortium (PCER). (2008). Effects of preschool curriculum programs on school readiness (NCER No. 2008-2009). National Center for Education Research, Institute of Education Sciences, U.S. Department of Education. https://ies.ed.gov/ncer/pubs/20082009/pdf/20082009_1.pdf
Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6(4), 377–401. https://doi.org/10.3102/10769986006004377
Sabol, T. J., McCoy, D., Gonzalez, K., Miratrix, L., Hedges, L., Spybrook, J. K., & Weiland, C. (2022). Exploring treatment impact heterogeneity across sites: Challenges and opportunities for early childhood researchers. Early Childhood Research Quarterly, 58, 14–26. https://doi.org/10.1016/j.ecresq.2021.07.005
Schauer, J. M., & Hedges, L. V. (2021). Reconsidering statistical methods for assessing replication. Psychological Methods, 26(1), 127–139. https://doi.org/10.1037/met0000302
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin.
Spybrook, J., Kelcey, B., & Dong, N. (2016). Power for detecting treatment by moderator effects in two- and three-level cluster randomized trials. Journal of Educational and Behavioral Statistics, 41(6), 605–627. https://doi.org/10.3102/1076998616655442
Steiner, P. M., Wong, V. C., & Anglin, K. (2019). A causal replication framework for designing and assessing replication efforts. Zeitschrift für Psychologie, 227(4), 280–292. https://doi.org/10.1027/2151-2604/a000385
Stuart, E. A., Cole, S. R., Bradshaw, C. P., & Leaf, P. J. (2011). The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society Series A: Statistics in Society, 174(2), 369–386. https://doi.org/10.1111/j.1467-985X.2010.00673.x
Tefera, A. A., Powers, J. M., & Fischman, G. E. (2018). Intersectionality in education: A conceptual aspiration and research imperative. Review of Research in Education, 42(1), vii–xvii. https://doi.org/10.3102/0091732X18768504
Tipton, E. (2012). Improving generalizations from experiments using propensity score subclassification. Journal of Educational and Behavioral Statistics, 38(3), 239–266. https://doi.org/10.3102/1076998612441947
Tipton, E. (2021). Beyond generalization of the ATE: Designing randomized trials to understand treatment effect heterogeneity. Journal of the Royal Statistical Society Series A: Statistics in Society, 184(2), 504–521. https://doi.org/10.1111/rssa.12629
Tipton, E., & Olsen, R. B. (2018). A review of statistical methods for generalizing from evaluations of educational interventions. Educational Researcher, 47(8), 516–524. https://doi.org/10.3102/0013189X18781522
Tipton, E., & Olsen, R. B. (2022). Enhancing the generalizability of impact studies in education (NCEE No. 2022-003). U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance. https://files.eric.ed.gov/fulltext/ED617445.pdf
Tipton, E., Spybrook, J., Fitzgerald, K. G., Wang, Q., & Davidson, C. (2020). Toward a system of evidence for all: Current practices and future opportunities in 37 randomized trials. Educational Researcher, 50(3), 145–156. https://doi.org/10.3102/0013189x20960686
Wong, V. C., Steiner, P. M., & Anglin, K. L. (2018). What can be learned from empirical evaluations of nonexperimental methods? Evaluation Review, 42(2), 147–175. https://doi.org/10.1177/0193841X18776870