This chapter describes the presentations and discussions that took place during the second workshop, titled Use of Meta-Analyses in Nutrition Research and Policy: Exploring Best Practices of Conducting Meta-Analysis, held on September 25, 2023. The objectives of the workshop were to
The workshop was moderated by planning committee member Janet A. Tooze from Wake Forest University and included presentations by Emma Boyland of the University of Liverpool, Andrew Jones of Liverpool John Moores University, and George A. Wells of the University of Ottawa. A panel discussion with the three presenters and additional discussants Elie A. Akl of the American University of Beirut, Joseph Beyene of McMaster University, and M. Hassan Murad of the Mayo Clinic followed the presentations. The workshop concluded with remarks from planning committee member Russell Jude de Souza of McMaster University. Workshop speakers
addressed the following questions, which were provided by the workshop sponsor in advance of the workshop:
Boyland’s, Jones’s, and Wells’s presentations on best practices of MA in nutrition research covered four topic areas:
Boyland began by focusing on screening data for potential extraction errors. She disclosed that she receives no industry funding and approached the presentation through the lens of an academic researcher, leaning on her experience as a lead reviewer for the World Health Organization (WHO). Her presentation drew content from reviews that she cocreated with other experts. Finally, Boyland noted that her presentation included information from the Cochrane Handbook, which she highlighted as a key resource for anyone conducting an MA.
Boyland stated that the best practice should be to prevent data errors from occurring during data extraction and highlighted the importance of training. For example, she emphasized the importance of training the entire research team on all aspects of the process, including consistent methods of data extraction, and she endorsed the careful examination of retraction statements to ensure that none of the data being extracted for inclusion has been retracted. She echoed the sentiments of Hooper, recommending that data extraction occur independently and in duplicate. She noted that this duplication is particularly important for any data elements that require subjective judgments and for data that are integral to the outcomes of the MA being performed.
Boyland suggested cross-checking all presentations of the data, including tables, text, and figures, for accuracy and homogeneity. She also noted
the benefits of a standardized data extraction template table, which can improve data organization and prevent errors in extraction and presentation of the data. An example of a data extraction template table is provided in Figure 3-1. Boyland described the “balancing act” between extracting too much and too little data. She recommended that teams consider the data and figures they intend to present in their review when determining the data to extract.
Boyland discussed the potential complication of duplicate data, or “linked data or studies,” which occurs when multiple papers are based on the same data set. She emphasized the importance of screening for linked data to avoid including duplicate data within an MA, noting that it may not be obvious that data are linked. She highlighted some factors to screen for, including identical locations, sample sizes, study dates, and study durations. If it is not clear whether two studies shared the same data, Boyland suggested reaching out to the study authors to verify. She referenced her own experience with this situation, discussing a time in which her team was performing an MA and encountered two articles that did not mention being connected but had similar characteristics. One was titled “Food advertising, children’s food choices and obesity: interplay of cognitive defenses and product evaluation: an experimental study” (Tarabashkina et al., 2016). The second article was called “When persuasive intent and product’s healthiness make a difference for young consumers” (Tarabashkina et al., 2018). The authors were similar and the data in general appeared similar, although not identical. Boyland’s team contacted the study authors, who confirmed that the data were linked. At this point, the team relied on pre-specified decision rules to determine which study to include in the MA, to avoid including duplicate data.
In cases where two or more papers with linked data are discovered, the team must decide which paper to include in their MA. Boyland suggested having a decision-making protocol for this specific issue included in the initial protocol for the MA so that the decision-making process is formal and systematic. For example, she said that teams might agree in advance to always use the more recently published article. She emphasized that it is critical for the review authors to choose which paper to include and to provide justification for this decision.
Other situations that can cause confusion in the data extraction process are instances when data from two studies are reported within the same article. Boyland gave some examples of this problem, such as two similar studies being reported in one paper, two randomized controlled trials (RCTs) with the same study design being carried out separately with male and female participants, or two RCTs performed identically but in different countries. She explained that it is not necessary to exclude these types of papers from the MA, but it is important that the data extraction and
analysis clearly note that the two data sets came from the same article. For example, she offered an option of listing the data by study author, then year, and then noting either “study 1” or “study 2.”
Another place where errors can arise is when multiple outcomes are reported within the same study. Boyland said that this may occur when there are different data points within the same article that seem relevant to the data being extracted. She gave the example of an MA on media usage. One article may report on both time spent viewing television and time spent online, and the team needs to decide which data to extract. Boyland suggested referring to the pre-established protocol in the decision-making process, asking which data are most relevant to the MA, and which data sets use the most valid tools. She illustrated this point with another example from the food marketing literature, noting that when examining the effects of exposure to food marketing on participant food intake, there are many different but potentially relevant outcomes. The team involved in this example decided to consider the most relevant outcomes to the purpose of their specific review, which, in this case, was the reduction of unhealthy food intake. In the absence of this outcome, they had a secondary outcome, which was total energy-dense snack intake. Having these criteria and outcomes established in advance enabled the team to make consistent, uniform decisions throughout the review process.
In the second portion of her presentation, Boyland discussed the evaluation of risk of bias in study design. This type of bias, she explained, refers to “systematic errors” or deviation from the truth in results. This form of error is distinct from random errors or imprecision, and it may be introduced into the data at many points throughout the research process, including by the original article authors, through research design constraints, or by systematic review (SR) authors. She detailed some of the ways that systematic errors may occur. For example, study authors may have bias that is reflected in the selective reporting of results. Review authors can additionally add their own bias in how they select studies and report data. Research constraints can impact the quality of results by causing researchers to rely on less precise data—for example, some studies rely on self-reported body weight as the only measurement of weight. Boyland emphasized the need to consciously minimize bias wherever possible. As stated previously, this bias can come from the primary studies, research constraints, or the SR authors, and she highlighted that bias can either inflate or underestimate an effect.
Boyland described specific tools that can be used to analyze the risk of bias in a study when selecting studies for a review. She explained that the
tools should be selected based on their suitability for the specific study design being assessed. For example, as mentioned in Chapter 2, the Cochrane Risk of Bias 2 tool is highly effective for analyzing RCTs for bias.1 It highlights ways that bias can arise, including during the randomization process, when deviating from the intended intervention, when outcome data are missing, from the way an outcome is measured, and in selecting the data to be reported. She added that each of these domains is assessed using an algorithm underlying the tool and explained how studies are rated. Boyland said that studies can be rated low risk, some concerns, or high risk. The tool generates a table depicting the ratings for each domain as well as an overall rating for each article included. Boyland made specific note that “low risk” in one domain does not mean that an entire study is at low risk for bias, and concern in any domain suggests concern with the entire article.
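To make the rating logic concrete, the sketch below shows one simplified way that domain-level judgments could be rolled up into an overall rating, in the spirit of Boyland’s point that concern in any single domain carries through to the study as a whole. It is not the official RoB 2 algorithm, and the domain names and example judgments are illustrative only.

```python
# A minimal, simplified sketch (not the official RoB 2 algorithm) of rolling
# domain-level judgments up into an overall risk-of-bias rating. Domain names
# and example judgments are hypothetical.

def overall_rating(domain_judgments):
    """Derive an overall rating from per-domain judgments ('low', 'some concerns', 'high')."""
    judgments = [j.lower() for j in domain_judgments.values()]
    if "high" in judgments:
        return "high"
    # RoB 2 also allows an overall "high" rating when several domains have
    # "some concerns" in a way that substantially lowers confidence; that
    # step requires reviewer judgment and is omitted here.
    if "some concerns" in judgments:
        return "some concerns"
    return "low"

example = {
    "randomization process": "low",
    "deviations from intended interventions": "some concerns",
    "missing outcome data": "low",
    "measurement of the outcome": "low",
    "selection of the reported result": "low",
}
print(overall_rating(example))  # -> "some concerns", even though four domains are low risk
```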
Boyland gave some examples of assessing risk of bias in nutrition research specifically. She highlighted themes from Chapter 2, noting that many of the tools used for analyzing risk of bias are only applicable to RCTs. It is important to consider that many nutrition studies are not RCTs but rather observational studies, necessitating other types of bias assessment tools. As she closed her presentation, Boyland listed some of the most common quality and bias issues that can arise within nutrition research. She noted that nutrition studies, when experimental, often involve small populations measured over short periods of time. She suggested that nutrition research can have a higher overall risk of bias profile than other types of research because of missing disclosures or inadequate reporting of potential bias in the original articles. She highlighted the common issue of research constraints, pointing out that it is nearly impossible to have a “blinded” experimental group in a nutrition trial. Boyland reiterated that many of the existing tools for examining risk of bias do not apply to the types of studies generally conducted in the nutrition field, and while new tools may need to be created to solve this issue, the creation of new tools can itself lead to errors if they are not fully validated.
Jones’s presentation focused on best practices for avoiding and addressing publication bias in SRs and MAs. Jones disclosed that he would reference work created alongside other researchers as well as materials from the Cochrane Handbook. He defined publication bias as the “failure to publish the results of a study on the basis of the direction or strength of the study findings,” a problem not limited to the field of nutrition (DeVito and
___________________
1 https://methods.cochrane.org/bias/resources/rob-2-revised-cochrane-risk-bias-tool-randomized-trials (accessed January 10, 2024).
Goldacre, 2019). As he explained, journals are more likely to publish studies that report significant results. He referred to this issue as “file drawer bias” because many studies that do not produce the intended results remain unpublished, or in the “file drawer.” He noted that this problem not only occurs when studies have nonsignificant results but also can occur when studies that produce negative results are suppressed by the study sponsor. He expanded on this point, saying that negative findings are generally published more slowly than positive findings, resulting in a “constrained evidence base” skewed toward positive findings. He referenced a study from Polanin et al. (2016), which showed that published studies yield larger effect sizes than unpublished studies.
Jones noted that efforts to assess publication bias are common. A study titled “Publication Bias in Psychological Science: Prevalence, Methods for Identifying and Controlling, and Implications for the Use of Meta-Analyses” (Ferguson and Brannick, 2012) showed that across 91 MAs, 70 percent demonstrated some effort to evaluate publication bias, and 41 percent reported some evidence of publication bias. He said that it is important to identify unpublished studies, and he described several resources, including preprint servers such as “Nutri-Xiv” for nutrition research. He noted some issues, however, that may arise when using unpublished research. First, the studies are often not peer reviewed. Unpublished studies are more likely to include grammatical and numerical errors as well as errors in their figures, and they are also less likely than their peer-reviewed counterparts to report conflicts of interest or funding sources, both of which are critical for evaluating overall bias. Despite these challenges, Jones explained that the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidance still recommends the use of preprints in MAs to avoid publication bias.
Jones described several statistical methods to assess publication bias. He recommended against using the fail-safe N (FSN). The FSN estimates the number of studies with a nonsignificant effect that would need to be added to a sample of studies for the meta-analytic effect to become nonsignificant. The focus of this method rests on effect significance rather than effect size. The method is discouraged in nutrition research MAs because it does not actually determine whether bias is present, and greater bias can produce a larger FSN value. For these reasons, Jones noted, the method is not recommended by Cochrane.
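For illustration only, the sketch below computes Rosenthal’s version of the fail-safe N from a handful of hypothetical one-tailed p-values; as Jones noted, the resulting number says nothing about whether bias is actually present.

```python
# A minimal sketch of Rosenthal's fail-safe N, shown only to illustrate the
# quantity Jones described; the one-tailed p-values are hypothetical.
from scipy import stats

def fail_safe_n(p_values, alpha=0.05):
    """Number of null studies needed to make the combined result nonsignificant."""
    z = [stats.norm.isf(p) for p in p_values]   # convert each p-value to a z-score
    z_alpha = stats.norm.isf(alpha)             # about 1.645 for one-tailed alpha = .05
    return max(0.0, sum(z) ** 2 / z_alpha ** 2 - len(z))

print(round(fail_safe_n([0.01, 0.03, 0.04, 0.20]), 1))
```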
Funnel plots, however, can be a useful way to visually examine and identify publication bias. Jones displayed a funnel plot diagram, as shown in Figure 3-2, and described how it can be used to identify publication bias. As he explained, high-powered studies appear at the top of the funnel and lower-powered studies at the bottom. The funnel plot shows each study’s
effect size, or intervention effect, against a measure of the study’s precision, such as sample size or standard error. Symmetry in the funnel plot suggests a lower risk of publication bias. However, Jones noted, funnel plots are subjective measures, and he referenced a study by Terrin et al. (2005) in which researchers were unable to accurately identify publication bias using visual analysis of a funnel plot. He stated that funnel plots can be difficult to visually interpret when the sample size is small. Jones explained that improper or misleading use of funnel plots can be harmful and that funnel plots should therefore be used in conjunction with another statistical method.
Jones spoke about methods of assessing funnel plots using statistical analyses. Egger’s test aims to identify whether there may be “small study bias” and regresses the scaled effect size against the precision of the included studies. If there is asymmetry, this indicates that the smaller studies are systematically different from the larger studies. Another set of tests, the Precision Effect Test (PET) or the Precision Effect Estimate with Standard Error (PEESE) test, fits a linear regression line and then extrapolates to estimate the effect size of a hypothetical study with a standard error of zero. Jones cautioned that this method can be particularly ineffective under conditions of high heterogeneity and when analyzing smaller studies with a large effect size.
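A minimal sketch of Egger’s regression test is shown below, using hypothetical effect sizes and standard errors rather than data from any study discussed at the workshop; the standardized effect is regressed on precision, and an intercept that differs from zero is taken as evidence of small-study asymmetry.

```python
# A minimal sketch of Egger's regression test with hypothetical inputs.
import numpy as np
import statsmodels.api as sm

effects = np.array([0.42, 0.31, 0.55, 0.12, 0.08, 0.25])  # hypothetical effect sizes
ses     = np.array([0.28, 0.22, 0.30, 0.10, 0.08, 0.15])  # hypothetical standard errors

y = effects / ses              # standardized effect sizes
x = sm.add_constant(1 / ses)   # precision, plus an intercept term
fit = sm.OLS(y, x).fit()

print(f"Egger intercept = {fit.params[0]:.2f}, p = {fit.pvalues[0]:.3f}")
# A small p-value for the intercept suggests funnel-plot asymmetry.
```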
Another approach for analyzing a funnel plot that Jones described is the “Trim and Fill” method. In this process, studies with the most extreme effect sizes are “trimmed” from the plot to help researchers understand how many hypothetical studies are missing from the analysis in order to achieve symmetry on the plot. However, he noted some drawbacks to this approach, including that removing outliers can have a large impact on the results and that the approach does not work well in MAs with high heterogeneity.
Jones also spoke about p-curve analysis, which he explained is a newer method and for this reason is not yet frequently used. He described p-curve analysis as an MA of the p-values in the included studies and explained that it is a way of assessing the evidential value of the studies being analyzed as well as a useful way of detecting publication bias or “selective reporting.” He explained that scientists often use the p-value as a measure of importance; a p-value of less than .05 is used as a threshold for statistical significance in a study but also represents a threshold at which studies are more likely to be published. P-values closer to .01 suggest higher-quality evidence with greater power and stronger evidence of effect. When using p-curve analysis, selective reporting may be suspected if p-values cluster around .04 or .05, suggesting that researchers may have engaged in selective reporting of evidence to support their preformed conclusions. He and Boyland used this technique in their MA “Association of Food and Nonalcoholic Beverage Marketing with Children and Adolescents’ Eating Behaviors and Health”
(Boyland et al., 2022). In this study, the researchers found strong evidence for the value of the data with more p-values around .01 than .05.
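One simple ingredient of a p-curve analysis can be sketched as a binomial test of whether significant p-values are right-skewed, that is, whether more of them fall near .01 than near .05. The sketch below uses hypothetical p-values, not those from the Boyland et al. (2022) analysis.

```python
# A much-simplified sketch of one p-curve check: a binomial test for
# right-skew among significant p-values. The p-values are hypothetical.
from scipy.stats import binomtest

p_values = [0.003, 0.011, 0.018, 0.024, 0.032, 0.041, 0.047]  # all below .05
small = sum(p < 0.025 for p in p_values)   # significant p-values in the lower half

# With no true effect and selective reporting, p-values pile up near .05;
# finding most of them below .025 instead supports evidential value.
result = binomtest(small, n=len(p_values), p=0.5, alternative="greater")
print(f"{small}/{len(p_values)} significant p-values < .025; binomial p = {result.pvalue:.3f}")
```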
Another method of analysis that Jones touched on is the Graphical Display of Study Heterogeneity (GOSH). This tool can be used to identify the impact of influential studies within an MA (Olkin et al., 2012). It takes the effect size from each study included in the MA and calculates an estimated overall effect for every possible subset of studies, up to a specified number of combinations. For an MA of k ≥ 2 studies, where k is the number of studies included in the MA, there are 2^k − 1 potential subsets of studies. For an MA with 20 studies, he noted, there are more than 1 million possible combinations. The purpose of this method of analysis is to gain an understanding of the distribution of possible pooled effects from the studies included in the MA. As shown in Figure 3-3, Jones noted that smaller or more heterogeneous MAs will not show a normal distribution curve, and one study with a large effect can bias all the models in which it is included.
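The subset enumeration behind a GOSH plot can be sketched as follows, using hypothetical effect sizes and within-study variances; each subset of studies yields a pooled estimate and an I-squared value, and plotting the two against each other produces the GOSH display.

```python
# A minimal sketch of GOSH-style subset enumeration with hypothetical data.
from itertools import combinations
import numpy as np

effects   = np.array([0.30, 0.25, 0.10, 0.90, 0.20])  # hypothetical effects; 0.90 is an outlier
variances = np.array([0.04, 0.03, 0.05, 0.04, 0.06])  # hypothetical within-study variances

points = []
k = len(effects)
for r in range(1, k + 1):
    for subset in combinations(range(k), r):
        idx = list(subset)
        w = 1 / variances[idx]
        pooled = np.sum(w * effects[idx]) / np.sum(w)        # fixed-effect pooled estimate
        q = np.sum(w * (effects[idx] - pooled) ** 2)         # Cochran's Q for the subset
        df = len(idx) - 1
        i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # I-squared, in percent
        points.append((pooled, i2))

print(f"{len(points)} subsets analyzed (2^{k} - 1 = {2 ** k - 1})")
```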
In closing, Jones reiterated the importance of checking for unpublished research on the topic being reviewed and using statistical tests to check for publication bias. Examination of the inclusion of influential studies can improve understanding of how they may impact measures of heterogeneity and skew the meta-analytic effect. While each statistical test that he mentioned has both benefits and weaknesses, when more tests are applied, research teams will be better able to demonstrate that their data are sensitive to the potential for publication bias.
Wells presented on the topics of interpreting the results of MA and how to conceptualize and address heterogeneity. He disclosed that he has no industry funding and stated that he is affiliated with Cochrane, having been a part of the working group since its inaugural meeting. Wells’s presentation was informed by his experience with both Cochrane and Grading of Recommendations Assessment, Development and Evaluation (GRADE). The presentation focused on the final three steps in performing an MA:
Wells described several statistical methods used in these processes. Forest plots are a graphical display of all the individual estimated results
from scientific studies addressing the same question, along with the overall results. Wells discussed how to use a forest plot to analyze the effect estimates or, in the specific example he reviewed, the “mean difference” data. As Wells described, Figure 3-4 displays a forest plot for the SR comparing the use of low-sodium salt substitutes (LSSS) with regular salt or no active intervention in adults, with change in blood pressure as the outcome. Wells described how to interpret the mean difference information. Each horizontal line represents a study’s confidence interval (CI), the vertical line indicates the line of no effect, and if a study’s 95 percent confidence interval crosses the vertical line, the results of that study were not statistically significant. Wells also described how the studies were weighted to assess their impact within the MA.
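A forest plot of this general form can be drawn with a few lines of code; the sketch below uses hypothetical mean differences and 95 percent confidence intervals rather than the LSSS data shown in Figure 3-4.

```python
# A minimal forest plot sketch with hypothetical mean differences (mmHg).
import matplotlib.pyplot as plt

studies = ["Study A", "Study B", "Study C", "Pooled"]
md      = [-4.2, -2.1, -5.6, -3.8]   # hypothetical mean differences
lower   = [-7.0, -5.0, -9.1, -5.5]   # lower 95% confidence limits
upper   = [-1.4,  0.8, -2.1, -2.1]   # upper 95% confidence limits

y = list(range(len(studies)))
xerr = [[m - lo for m, lo in zip(md, lower)],   # distance to lower limit
        [hi - m for m, hi in zip(md, upper)]]   # distance to upper limit

fig, ax = plt.subplots()
ax.errorbar(md, y, xerr=xerr, fmt="o", capsize=3)  # each horizontal line is one study's CI
ax.axvline(0, linestyle="--")                      # vertical line of no effect
ax.set_yticks(y)
ax.set_yticklabels(studies)
ax.invert_yaxis()
ax.set_xlabel("Mean difference in blood pressure (mmHg)")
plt.show()
```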
Wells also described two statistical models of analysis that are used in MAs, the equations for which are shown in Figure 3-5: the fixed effects model and the random effects model. The fixed effects model, Wells said, may be unrealistic to use in nutrition MAs because it ignores heterogeneity. The random effects model allows between-study heterogeneity to be incorporated but still requires estimating the distribution of study effects, and this estimated distribution may not be accurate if there are few studies or events.
Wells described the “mean difference,” an effect measure comparing the means of the intervention groups with those of the control groups. He spoke about how confidence intervals relate to the mean difference, pooled estimates, and standard errors. He said that a confidence interval has a lower and an upper limit; the more precise the numerical estimates are, the narrower the confidence interval will be. Wells said that the pooled effect estimate is represented both graphically and numerically. Graphically, it is represented on the forest plot by a diamond-shaped mark, which highlights its position relative to the mean difference. The numeric value of the pooled effect estimate shows its meta-analytic value, while the confidence interval, and where it falls relative to the individual studies, shows its statistical significance.
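The two pooling models and the confidence interval calculation Wells described can be sketched as below, using standard inverse-variance weights and the DerSimonian-Laird estimate of between-study variance; the mean differences and variances are hypothetical.

```python
# A minimal sketch of fixed-effect and random-effects pooling with 95% CIs.
# The mean differences and within-study variances are hypothetical.
import numpy as np

md = np.array([-4.2, -2.1, -5.6, -3.0])  # hypothetical mean differences
v  = np.array([ 2.0,  2.2,  3.1,  1.5])  # hypothetical within-study variances

def pool(effects, variances, tau2=0.0):
    """Inverse-variance pooled estimate and 95% CI; tau2 = 0 gives the fixed-effect model."""
    w = 1 / (variances + tau2)
    est = np.sum(w * effects) / np.sum(w)
    se = np.sqrt(1 / np.sum(w))
    return est, est - 1.96 * se, est + 1.96 * se

# Fixed-effect model: between-study heterogeneity is ignored.
print("fixed effect:   %.2f (95%% CI %.2f to %.2f)" % pool(md, v))

# Random-effects model: DerSimonian-Laird estimate of between-study variance (tau squared).
w = 1 / v
q = np.sum(w * (md - np.sum(w * md) / np.sum(w)) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - (len(md) - 1)) / c)
print("random effects: %.2f (95%% CI %.2f to %.2f)" % pool(md, v, tau2))
```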
Wells discussed how to identify heterogeneity on a forest plot visually, using a chi-square (or Q) test, or by using an I-squared statistical test. Confidence intervals that show very little overlap on the forest plot provide an early indicator of a heterogeneity problem. Even confidence intervals that fall on the same side of the line can be too scattered to overlap and can signal a heterogeneity issue, he explained. This visual inspection is the most basic way to identify excessive heterogeneity. When using the Q-test to assess heterogeneity, a small p-value means that the null hypothesis of homogeneity has been rejected and the studies are too different to combine. However, the Q-test can be unreliable with a small body of studies and, because of its high sensitivity, can also be unreliable with a very large body of studies. Another issue with the Q-test is that it lacks nuance and can only produce a “yes or no” answer about the presence of heterogeneity. Another statistical method for assessing heterogeneity that Wells detailed is the I-squared test, which attempts to identify and quantify heterogeneity. Values range from zero to 100 percent, with higher values indicating greater heterogeneity. While the I-squared test provides more detailed information about the presence of heterogeneity, there is no universally accepted cut-off point for interpretation. For example, below 30–40 percent might represent low or unimportant heterogeneity, 30–60 percent might represent moderate heterogeneity, 50–90 percent might represent substantial heterogeneity, and 75–100 percent might represent high heterogeneity, but these are not firm or specific thresholds.
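The Q test and I-squared statistic can be computed directly from the study effects and variances, as in the sketch below with hypothetical values; the comment at the end restates the rough interpretation bands Wells cited.

```python
# A minimal sketch of Cochran's Q and the I-squared statistic, with
# hypothetical effects and within-study variances.
import numpy as np
from scipy import stats

effects   = np.array([-4.2, -2.1, -5.6, -3.0, 1.0])
variances = np.array([ 2.0,  2.2,  3.1,  1.5, 1.8])

w = 1 / variances
pooled = np.sum(w * effects) / np.sum(w)
q = np.sum(w * (effects - pooled) ** 2)   # Cochran's Q
df = len(effects) - 1
p = stats.chi2.sf(q, df)                  # small p rejects the null hypothesis of homogeneity
i2 = max(0.0, (q - df) / q) * 100         # share of variability beyond chance, in percent

print(f"Q = {q:.2f} (df = {df}), p = {p:.3f}, I-squared = {i2:.0f}%")
# Rough bands cited by Wells: below 30-40% low, 30-60% moderate,
# 50-90% substantial, 75-100% high heterogeneity.
```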
Wells said that once heterogeneity is identified, it should be explained; if it cannot be explained, the data should be interpreted to accommodate the heterogeneity. Wells defined the types of heterogeneity that may be present in an SR or MA. Clinical heterogeneity refers to differences in studies among participant and intervention characteristics. Participant variations may include differences in their condition, demographics, or location. Intervention variations may include differences in implementation, the experience of practitioners involved, and the type of control used, such as placebo, standard care, or no control. Outcomes variations may include measurement methods, event definition, cut points, and duration of follow-up. Methodological heterogeneity, as Wells described, can arise when studies vary in how they are designed and conducted or when there are publication limitations or constraints. Statistical heterogeneity occurs when the results observed across studies are more disparate than would be expected by chance.
Wells described ways to explore heterogeneity. Subgroup analysis and meta-regression can be used to assess the factors that appear to modify the effect. Wells explained that specific factors should be considered during subgroup analyses, such as whether the heterogeneity found within subgroups differs from the overall heterogeneity. He also suggested conducting statistical tests for subgroup differences to ensure that true differences exist between subgroups. Wells noted that confidence in the results of the MA should increase if the effect that is seen is also thought to be clinically plausible and supported by evidence outside of the review.
He suggested exploring heterogeneity through sensitivity analysis, which provides information on the robustness of the results. This analysis is done by repeating the MA using alternative analytic choices to assess the consistency of the results. Using the study on LSSS mentioned previously as an example, Wells described the subgroup analyses performed and their results. As he explained, subgroup analysis was conducted by study duration. The study team further meta-analyzed the subgroups and found a moderate I-squared value, indicating a moderate degree of heterogeneity.
He described the concept of considering heterogeneity with subgroups and noted the importance of asking whether the subgroups are truly different.
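The statistical test for subgroup differences that Wells recommended can be sketched as a Q-between comparison of fixed-effect subgroup estimates, as below; the grouping by study duration and the numbers are hypothetical.

```python
# A minimal sketch of a test for subgroup differences (Q-between) using
# hypothetical effects grouped by study duration.
import numpy as np
from scipy import stats

subgroups = {
    "short duration": (np.array([-2.0, -2.5, -1.8]), np.array([1.2, 1.5, 1.0])),
    "long duration":  (np.array([-5.1, -4.4]),       np.array([1.4, 1.1])),
}

estimates, weights = [], []
for name, (eff, var) in subgroups.items():
    w = 1 / var
    estimates.append(np.sum(w * eff) / np.sum(w))   # fixed-effect estimate per subgroup
    weights.append(np.sum(w))

estimates, weights = np.array(estimates), np.array(weights)
grand = np.sum(weights * estimates) / np.sum(weights)
q_between = np.sum(weights * (estimates - grand) ** 2)
p = stats.chi2.sf(q_between, df=len(subgroups) - 1)
print(f"Q-between = {q_between:.2f}, p = {p:.3f}")  # a small p suggests the subgroups truly differ
```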
Wells detailed the system of color-coding for evaluating risk of bias within individual studies. Figure 3-6 provides a chart displaying how studies can be rated on a variety of domains, with green signifying low risk of bias, yellow signifying uncertain risk, and red signifying high risk.
While examining the forest plot for the LSSS studies, Wells referred to Jones’s description of a GOSH analysis. Wells suggested that it could be useful to remove each study, one at a time, and rerun a GOSH analysis to determine the impact of each individual study on the overall effect.
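A simplified version of that check, rerunning only the pooled estimate rather than a full GOSH analysis after removing each study in turn, might look like the sketch below; the effects and variances are hypothetical.

```python
# A minimal leave-one-out sketch: recompute the fixed-effect pooled estimate
# with each study removed in turn. Effects and variances are hypothetical.
import numpy as np

effects   = np.array([-4.2, -2.1, -5.6, -3.0, 1.0])
variances = np.array([ 2.0,  2.2,  3.1,  1.5, 1.8])

def fixed_effect(eff, var):
    w = 1 / var
    return np.sum(w * eff) / np.sum(w)

print(f"all studies:      {fixed_effect(effects, variances):.2f}")
for i in range(len(effects)):
    keep = np.arange(len(effects)) != i
    print(f"without study {i + 1}: {fixed_effect(effects[keep], variances[keep]):.2f}")
```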
Wells provided a detailed example of analyzing risk of bias and heterogeneity in an MA featuring studies with binary or discrete outcomes, such as cardiovascular events. He described a forest plot that included five RCTs in an MA. He reviewed each component of the forest plot, again describing all the measurements that are graphically displayed on a forest plot. Wells explained that the effect estimates were different in this example due to the binary nature of the outcomes. Relative risk, he explained, is more straightforward to calculate for discrete outcomes than for studies that examine multicomponent interventions and complex outcomes. Wells said that there is greater potential for confounding in nutrition studies compared with placebo-controlled drug trials. In the case of a nutrition study focused on sodium intake and cardiovascular events, a control group could be a group consuming table salt in normal amounts or a group that received no active intervention. Another variable that adds potential heterogeneity is whether participants received education on reducing sodium intake. When it comes to studies on LSSS, multifactorial interventions add complexity to the research, and Wells posited that it may not be possible to truly isolate the LSSS as a causative factor. Referring back to the Population, Intervention, Comparator, Outcome (PICO) framework, Wells said that in nutrition studies, the “P” (population) and the “O” (outcome) are commonly easy to define, but the “I” (intervention) and the “C” (control) can be complicated and challenging to define in a singular manner, thus making them difficult to combine and analyze. These challenges present barriers to conducting high-quality nutrition MAs.
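The relative risk calculation Wells referred to is straightforward for a single two-arm trial with a binary outcome, as in the sketch below; the event counts are hypothetical.

```python
# A minimal sketch of a relative risk and its 95% CI from hypothetical
# event counts in a two-arm trial with a binary (e.g., cardiovascular) outcome.
import math

events_trt, n_trt = 30, 500   # hypothetical events / participants, intervention arm
events_ctl, n_ctl = 45, 500   # hypothetical events / participants, control arm

rr = (events_trt / n_trt) / (events_ctl / n_ctl)
se_log_rr = math.sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
upper = math.exp(math.log(rr) + 1.96 * se_log_rr)

print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```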
Wells explained how one could use the GRADE approach to assess the certainty of the evidence in cases in which the intervention and control are complicated and not uniform across studies. When this method was applied to the studies in his example, an overall grade of “uncertain” was given, mostly due to the high levels of heterogeneity across the study methods. These inconsistencies and high heterogeneity could not be explained using subgroup analysis or meta-regression, which led to less certainty in the observed effect. Furthermore, Wells noted that the certainty of evidence was further reduced when it was found that the greatest effect was seen in a
study of individuals who were at high risk for cardiovascular events, meaning that this evidence could not be extrapolated to the general population.
Wells described potential reasons for downgrading the certainty of the evidence, including studies that are poorly conducted or inconsistent, study results that do not apply to the question being asked in the MA, small sample sizes, large confidence intervals, and publication bias. Using the MA on the use of LSSS for reducing cardiovascular events as an example, Wells noted that the GRADE method illuminated issues with the quality of evidence, especially for the groups of interest to the MA. Use of this tool helped the researchers to better understand the limitations of the certainty of the evidence.
Wells also suggested the use of A MeaSurement Tool to Assess systematic Reviews (AMSTAR),2 which he described as a critical appraisal tool for SRs and MAs that include both randomized and non-randomized studies of health care interventions.
The workshop featured a panel discussion with Boyland, Jones, and Wells as well as three additional discussants: Joseph Beyene of McMaster University, Elie A. Akl of the American University of Beirut, and M. Hassan Murad of the Mayo Clinic. The discussion was led by planning committee member Janet A. Tooze of Wake Forest University. The discussion centered around the presentation topics and addressed questions that were asked by audience members.
On the topic of ways to prevent data errors during the phases of data extraction and analysis, Akl asked Boyland about how to address complex effect modifiers. He gave the example of a WHO study on fruit and vegetable subsidies that included co-interventions such as nutrition education, which could be viewed as effect modifiers. Akl wanted to know how to best account for these effect modifiers to better understand the data. Boyland replied by urging caution during the data extraction process and suggested directly contacting study authors to better understand whether the study included control groups that were not mentioned in the final publication. Jones added that this specific example might best lend itself to a “network meta-analysis,” which plots all the possible interventions and their combinations. Wells added that complex interventions may require multiple
___________________
2 https://amstar.ca (accessed January 10, 2024).
methods. He said that while network MAs are useful, he has concerns about whether there would have been sufficient data in Akl’s example for a network MA to be effective, as there may not have been enough evidence to run the required analysis. In complex cases where heterogeneity cannot be avoided, Wells said that teams should accept the heterogeneity, noting that it can lead to a better understanding of the effect modifier.
In response to an audience question about whether to use study data repositories or rely solely on the reports when extracting data from large epidemiological studies, Boyland noted the benefits of maximizing both volume and quality of data. Having more data to feed into the MA is a positive, and the current environment of data sharing has been an overall benefit to the MA process. Jones suggested caution when using data repositories, noting that they are often poorly organized, which can introduce error. He also noted that referencing the author’s report may be beneficial. Jones suggested that researchers should aim to better manage and present raw data.
Multiple audience members asked how to address the existence of multiple reports published from the same study data. If similar analyses are published in two papers from the same team, should both be included in the data synthesis or just one? If only one should be included, what criteria should guide this selection? Boyland replied that it is not necessary to use only one of the data sets, but if the same data set has been analyzed in multiple ways, teams should consider which outcome to use in the MA. Those conducting the MA should be able to rely on their agreed-upon methods, PICO, and hierarchy of decision making in these cases, asking what data are most relevant to the research question and what are the most valid tools that have been used to generate the data. There are no set rules, Boyland said, but it is essential to create and adhere to guidelines for the MA. Akl agreed with Boyland, stating that it is important to consider what data have been produced using the highest quality and most appropriate methodologies and which data are the best fit for the population of interest. He added that using multiple publications based on the same data set could have an undue influence on policy development, as each paper may be interpreted by policy makers as independent evidence. It is important when doing MAs for the purpose of impacting policy, Akl said, to be sure that duplicate data are not used in a way that can bias results.
An audience member asked about including subgroups in an MA and whether a team might accidentally double- or triple-count the data from a single study. Jones asserted in response that including the same study carried out in multiple groups would violate independence principles and require a more technical, multilevel MA. However, Wells noted that there are ways to include subgroup data that do not double-count the same evidence, but they
require careful planning and analysis, a point on which Tooze concurred. Wells gave the example of one of his research teams spending months analyzing this type of data, and issues arose when they tried to manipulate the data by study participant—an attempt to make the populations independent from each other. He said that when data independence is straightforward, separating groups and carrying out subgroup analyses can be effective. However, he warned that caution should be exercised when the subgroups become highly complex, and issues can arise when researchers attempt to manipulate data to this extent. He suggested the use of an Instrument for assessing the Credibility of Effect Modification Analyses (ICEMAN), which can be used to assess factors beyond statistical significance and provide analysis within or between studies. Beyene commented that meta-regression provides an advantage over subgroup analysis when exploring potential heterogeneity, especially when the factors being analyzed are continuous. The downside to such an analysis, he said, is the focus on the aggregate data, which have limitations that can be challenging to disentangle. He noted similar challenges in dose-response studies, stating that it is difficult to know whether the relationship is linear. Beyene acknowledged that there are many potential challenges with data analysis, and researchers should use their best judgment as to which method of analysis is best for their study.
The second discussion focused on preventing and evaluating bias in SRs and MAs. Murad noted that Boyland’s presentation had a strong focus on RCTs, but there are many other study designs used in the field of nutrition research and additional tools for assessing risk of bias in other types of studies, including observational studies and case series. Although the early version of Cochrane’s Risk of Bias tool mainly focused on analyzing the presence of bias in RCTs, he said that many other methods and tools have since become available.
Murad addressed Jones’s presentation, stating that he feels “very pessimistic” about publication bias, given that it occurs frequently and its existence is difficult to determine, particularly when analyzing small, non-randomized studies that have not been registered. Murad stated that conducting multiple analyses to identify the results of greatest significance, which can be done most easily with small, unregistered, non-randomized studies, poses a large threat to research that may not be easily overcome. Murad also responded to a question from an audience member about the best risk of bias tools for nutrition research considering the lack of RCTs used in the field and listed some tools that are used for non-randomized comparative studies. He spoke about the Newcastle-Ottawa Scale (NOS)
tool,3 the Cochrane Risk of Bias in Non-randomized Studies–of Exposures (ROBINS-E) tool, and the Cochrane Risk of Bias in Non-randomized Studies–of Interventions (ROBINS-I) tool. He said that ROBINS-E is usually more relevant for nutrition studies, but the tool is challenging to use and requires specific training. Murad added that specific tools exist for each type of study, and resources and trainings can help guide researchers to the right tools for their particular study.
Tooze asked a follow-up question about the use of I-squared analysis as a measure of heterogeneity in nutrition studies, inquiring whether a high I-squared value is always expected. Murad replied that a high number would be expected in an observational study, which makes the tool less useful in these cases. Akl and Wells agreed that, in this context, the I-squared value would be high and not very useful as a statistical measurement.
The third discussion centered on best practices for interpreting data through statistical analysis. The discussion touched on best practices for data analysis within MAs and pros and cons of conducting an MA.
Beyene asked the presenters why one would conduct an MA. Wells replied that the goal of an MA is to increase the power to answer a research question by combining studies. However, challenges with MAs include inconsistencies in data, excessive heterogeneity, and the shortcoming of certain statistical tools. He noted that in addition to combining studies and potentially gaining a better understanding of a treatment effect, MAs provide a good opportunity to explore differences between studies. In this way, Wells said, MAs are tools for both synthesizing and analyzing evidence.
Wells spoke further about addressing heterogeneity, saying that the forest plot can be a helpful visual for identifying likely heterogeneity. He said that addressing heterogeneity through statistical tools can be challenging because many of the tools lack power or accuracy. For example, the Q-test was used for this purpose for years, but it does not test the hypothesis of real interest: the null hypothesis of homogeneity is not one the analyst hopes to reject; rather, failing to reject it is taken as a claim of homogeneity. Furthermore, he noted that the Q-test does not have enough power to be a useful analysis with a small sample of studies. In these cases, he suggested that the I-squared test may be more useful but noted the limitations of that method as well, which were previously mentioned by Tooze.
Furthering the discussion about heterogeneity and subgroup analyses, Tooze asked the panelists for input on the categories, such as intervention
___________________
3 https://www.ohri.ca/programs/clinical_epidemiology/oxford.asp (accessed January 10, 2024).
or participant characteristics, that should be considered to reduce heterogeneity when developing the protocol. Wells replied that groups should begin with the PICO and closely examine the differences between populations with particular attention to the differences that exist across important characteristics. As nutrition interventions are often complex, Wells noted that it may be helpful to group interventions into categories to analyze them. He suggested that study duration can be a useful subgroup. Wells also noted that it can be difficult to understand the exposure in nutrition research or whether a specific intervention is causing the effect, a comment on which Murad concurred. Murad stated that in nutrition research, it is important to know exactly how much of a nutrient is leading to the change or the outcome, and he suggested using the GRADE guidance in rating certainty of evidence when interpreting dose-response studies (Murad et al., 2023). He noted that dose-response MAs are especially useful and relevant in nutrition research.
Wells detailed the statistical concept of “tau,” defining it as a “super structure” that forms within a set of data, with the study results spreading out around that structure rather than around a single point. He noted that most statistical analyses of complex data sets depend on estimating tau, and for this reason, tau has become a statistical “Achilles heel.” Without an accurately estimated tau, most of the complex statistical analyses will be inaccurate. Wells concluded that, in general, teams should analyze the visuals and statistical tests that are available to them and come to the best possible judgment of how to approach heterogeneity within the MA.
Murad addressed a question from an audience member about what constitutes a wide confidence interval in the context of GRADE, explaining that the modern approach is to define the effect size that is considered important. For example, if the outcome is depression, the tool used should be based on the magnitude of change that is considered by the patient to be relevant or important. Relative importance can also be driven by clinical relevance, stakeholder feedback, or statistical significance. Confidence intervals, he said, should be considered through a similarly relative context. If it crosses a predetermined, agreed-upon threshold, the confidence interval is wide. Akl added that when interpreting findings, it is important to consider clinical significance and to define a priori what is being considered as clinically significant. In the example given by Wells of the study of LSSS impacting cardiovascular health, the team defined a priori that they considered a change of 10 mmHg in a patient’s blood pressure to be clinically significant. Although a change of 5 mmHg may be statistically significant, it would not meet their threshold for clinical significance. Wells agreed, adding that one should determine the minimally important difference, the effect estimate, and then the confidence interval. He suggested that the confidence interval could be positioned with respect to the clinically important difference.
Tooze relayed a question from an audience member, asking how many confounders can be included in a meta-regression and whether the answer depends on the number of studies included. Wells said that it does depend on the sample size or the number of studies included. He highlighted the ecological fallacy in which an effect may appear to exist from one study to the next, but within a single study, the effect disappears. This risk is one reason that Wells said he typically uses subgroup analyses. Most MAs are done with a smaller number of studies, and meta-regressions become an imprecise tool for analyzing a combined effect when using a small number of studies.
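A meta-regression of the kind discussed in this exchange can be sketched as a weighted regression of study effects on a study-level covariate, as below; the effects, variances, and moderator values are hypothetical, and with only a handful of studies the slope estimate would be imprecise, as Wells cautioned.

```python
# A minimal meta-regression sketch: study effects regressed on a continuous,
# study-level moderator with inverse-variance weights. All values are
# hypothetical; a full random-effects meta-regression would also add an
# estimate of between-study variance to the weights.
import numpy as np
import statsmodels.api as sm

effects   = np.array([-2.0, -3.1, -4.4, -5.0, -2.6])
variances = np.array([ 1.2,  1.0,  1.4,  1.1,  1.3])
moderator = np.array([ 0.5,  1.0,  1.8,  2.0,  0.8])  # hypothetical study-level covariate

x = sm.add_constant(moderator)
fit = sm.WLS(effects, x, weights=1 / variances).fit()
print(fit.params)   # intercept and slope (change in effect per unit of the moderator)
print(fit.pvalues)
```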