After the introductory remarks, the workshop began with a session that explored the potential to take advantage of heterogeneous treatment effects (HTE) to improve and personalize patient care. Several presenters described how understanding this heterogeneity can lead to more effective treatments for individual patients, thus maximizing benefits and minimizing harms. Much of the information offered in this session was relevant to the patient question: Given my personal characteristics, conditions, and preferences, what should I expect will happen to me?
“What we’re really talking about is personalized evidence-based medicine,” said David Kent, Director of the Predictive Analytics and Comparative Effectiveness (PACE) Center at the Tufts Medical Center. In other words, the goal is to use evidence from randomized controlled trials (RCTs) and other sources to predict what is likely to happen with an individual patient. He and the other panel members discussed two types of prediction: outcome risk modeling, that is, creating models that differentiate patients by risk, and treatment effect modeling, or separating patients by the likely effects of treatment.
Doctors and medical researchers have long recognized the limitations of RCTs for providing evidence for clinical decision making. Indeed, Kent said, even Austin Bradford Hill, who pioneered the use of RCTs in medical research, commented 50 years ago that while RCTs can determine the better treatment on average, they “do not answer the practicing doctor’s question, What is the most likely outcome when this particular drug is given to a particular patient?”
The innovation of evidence-based medicine (EBM), Kent continued, was the realization that RCTs could be used by doctors to determine what is best for individual patients, which required what he called a “very subtle” shift in approach. Instead of seeing RCTs as tools for establishing causation, they were now seen as tools for prediction in single cases. But single-case prediction is a problematic area, he said, and “a lot of very smart people have thought deeply about it.” Kent mentioned in particular Nobel Memorial Prize in Economic Sciences winner Daniel Kahneman, who identified two distinct approaches to such a prediction. One is the “inside view,” which looks at the specifics of a case, weighs the various factors, and then synthesizes them into a prediction. “This is the view that physicians had before evidence-based medicine,” Kent said, and it “is really the view that we spontaneously adopt for making decisions in virtually all aspects of life.” The second approach is the “outside view.” In this case, predictions are made by explicitly identifying a group of patients with similar diagnoses and characteristics, known as a reference class, and using that reference class as a statistical basis for prediction.
In contrast to traditional medicine, EBM relies on the outside view. Specifically, EBM is a type of reference class forecasting. “It relies on making inferences for single cases based on the frequency of outcomes or estimated treatment effects in a reference class to which the individual of interest is similar,” Kent explained. Yet, this raises another problematic question: How does one define similarity? He referred to this situation as the classic “reference class problem,” which was first
described in 1876 by the mathematician John Venn, who noted that each item or event has a multitude of attributes that could be used as the basis for categorizing it into one class or another. How do you choose from that multitude? For doctors, making that choice is a real problem, because determining the class to which a patient belongs will have implications for his or her treatment choices.
“How does evidence-based medicine approach this very deep problem, the reference class problem?” Kent asked. “Generally, I think we’ve largely ignored it. What we’ve done is we’ve emphasized the broadest possible reference class, which is the overall effect in a trial.” On the other hand, one can quickly run into problems when dividing patients into groups according to various characteristics. “If you have just 10 binary attributes, then you have over 1,000 unique subgroups that you can describe,” he said, “and if you have 20 attributes, you have over 1 million subgroups that you can describe. And you quickly run into the problem of small sample sizes.”
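Kent’s arithmetic is simply the combinatorics of binary attributes: each additional attribute doubles the number of distinct patient profiles,

```latex
\text{subgroups} = 2^{k}, \qquad 2^{10} = 1{,}024, \qquad 2^{20} = 1{,}048{,}576 .
```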
What is needed, he explained, is a principled way to prioritize which attributes are important in determining both the outcome of interest and the benefits of therapy. He and his colleagues have suggested that one particularly useful approach is to define subgroups according to outcome risk. Regardless of how treatment effects are measured (i.e., as the absolute risk reduction or as a relative risk reduction), the control event rate is a mathematical determinant of treatment effect—and the control event rate is simply an observable proxy of the outcome risk. When the outcome risk varies substantially across different groups of patients in a trial, the benefit–harm trade-offs are also likely to vary substantially.
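The relationship Kent invoked can be written out explicitly. With CER denoting the control event rate and EER the event rate under treatment,

```latex
\mathrm{RRR} = \frac{\mathrm{CER} - \mathrm{EER}}{\mathrm{CER}}, \qquad
\mathrm{ARR} = \mathrm{CER} - \mathrm{EER} = \mathrm{CER} \times \mathrm{RRR} ,
```

so even a perfectly constant relative risk reduction implies an absolute benefit that scales directly with the baseline risk.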
To explain why outcome risk is a valuable way to classify patients, Kent presented a figure displaying absolute mortality risk as a function of a patient’s percentile mortality risk for patients with acute myocardial infarction (see Figure 2-1). Specifically, he said, the figure depicts patients with an ST-segment elevation myocardial infarction, a type of heart attack caused when the coronary artery, which has been affected by atherosclerosis, is blocked by a blood clot at the site of an injury. “This hockey stick–shaped distribution is actually a scatter plot with 1,000 little dots, each representing a patient,” Kent explained.
As shown in Figure 2-1, the risk of death averaged over all medically treated patients is 6 percent, which, according to Kent, is the number that would appear in a typical analysis. “The control event rate would be 6 percent,” he stated. That percentage, however, obscures some critical details. For example, when risk is determined by a multivariable model using easily obtainable baseline clinical variables, 75 percent of patients actually have a risk that is lower than the average, and 50 percent of patients have a risk that is no more than half of the average rate; that is, the median patient has a mortality risk of only 3 percent. Furthermore,
at the extremes, the differences among patients are pronounced. The lowest-risk quartile of patients has an average mortality risk of only 1 percent, while the highest-risk quartile has an average mortality risk of 16 percent (see Figure 2-1). “Doctors actually know that the risk–benefit trade-offs in these patients are different,” Kent noted, “but in the trial, they’re all lumped together.”
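A short simulation (an illustrative sketch with synthetic numbers, not Kent’s actual model or data) shows how such a skewed risk distribution makes the average a misleading summary:

```python
import numpy as np

# Illustrative sketch with synthetic numbers (not Kent's actual data):
# a right-skewed "hockey stick" risk distribution in which most
# patients sit at low risk and a small tail carries very high risk.
rng = np.random.default_rng(0)
risk = rng.lognormal(mean=np.log(0.03), sigma=1.1, size=1000)
risk = np.clip(risk, 0.0, 1.0)  # predicted mortality risk per patient

edges = np.quantile(risk, [0.25, 0.5, 0.75])
quartile = np.digitize(risk, edges)  # 0 = lowest-risk quartile

print(f"mean risk:   {risk.mean():.1%}")      # pulled up by the tail
print(f"median risk: {np.median(risk):.1%}")  # well below the mean
for q in range(4):
    print(f"quartile {q + 1} mean: {risk[quartile == q].mean():.1%}")
```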
To further illustrate the value of stratifying patients by risk, Kent presented an analysis of how two risk-stratified subgroups fared in the Danish multicenter randomized study of fibrinolytic therapy versus primary angioplasty in acute myocardial infarction, known as the DANAMI-2 trial (see Figure 2-2). DANAMI-2 analyzed 1,572 patients who presented to a hospital with an ST-segment elevation myocardial infarction, or STEMI. Some patients were treated with pharmaceuticals to break up the clot, while others were treated with percutaneous coronary intervention (PCI), in which a catheter is used to insert a stent and open up a clogged artery. Figure 2-2 shows the long-term results of PCI versus clot-busting drugs in two groups of patients studied in DANAMI-2: the lowest-risk quartile and the highest-risk quartile from the distribution in Figure 2-1. “The high-risk patients, the minority of patients who are high risk, get tremendous benefit from PCI compared to medical therapy,” Kent explained. “But the majority of patients who are low risk [are] actually slightly harmed by PCI compared to medical therapy.” If you combine results from all groups, the benefit to high-risk patients overwhelms the harm to low-risk patients, and PCI appears to always be the superior choice.
The researchers who published the results of DANAMI-2 analyzed one-variable-at-a-time subgroups (e.g., groups defined by age, sex, race, or the presence or absence of diabetes or hypertension) and found that the same overall benefit held in each. “Just like every other trial,” Kent said, “they claimed consistency of effects, but that’s because they didn’t stratify by risk.” Contrasting groups of patients who differ by only a single variable under-represents the heterogeneity found among patients; as in many trials, he said, separating the analysis into high- and low-risk subgroups defined by multiple risk factors may reveal results similar to those that appeared in the DANAMI-2 trial.
Kent then described how he and his colleagues analyzed 18 randomized treatment comparisons by studying the effects on patients separated into quartiles according to risk. When they examined the trials on the basis of relative risk (i.e., risk in the treatment group divided by risk in the control group), there were no clear patterns. But when they analyzed the trials on the basis of absolute risk, fairly consistent patterns emerged, with those in the higher-risk groups receiving greater benefit from the treatments. And, indeed, the analyses of three of those trials were deemed clinically important enough to be published in three separate clinical papers (Kozminski et al., 2015; Sussman et al., 2015; Upshaw et al., 2018).
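The arithmetic behind this pattern is straightforward: whenever the relative effect is roughly constant across risk strata, the absolute benefit scales with baseline risk. A minimal sketch, using hypothetical quartile risks chosen only for illustration:

```python
# Hypothetical baseline (control) risks by quartile, with a constant
# relative risk of 0.75 under treatment (a 25% relative reduction).
control_risk = [0.01, 0.03, 0.06, 0.16]
relative_risk = 0.75

for i, cer in enumerate(control_risk, start=1):
    arr = cer * (1 - relative_risk)  # absolute risk reduction
    nnt = 1 / arr                    # number needed to treat
    print(f"quartile {i}: CER {cer:.0%}  ARR {arr:.2%}  NNT {nnt:.0f}")
```

With identical relative benefit in every quartile, the absolute risk reduction in the highest-risk quartile is 16 times that in the lowest.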
In one of those papers, published in The British Medical Journal, Kent and his colleagues analyzed the results of the Diabetes Prevention Program (DPP) RCT
(Sussman et al., 2015). In that trial, 3,060 nondiabetic patients with evidence of impaired glucose metabolism were randomized to one of three groups: a group that was given metformin, one that was given a lifestyle intervention, and one that received usual care. The main outcome measure was whether a patient developed diabetes. Kent and colleagues presented the risk-stratified results calculated in two ways. The first was the hazard ratio, in which the event rate in the treatment group is compared with that in the control group, yielding a measure similar to relative risk. Measured by the hazard ratio, the effects of the lifestyle intervention were homogeneous: people in every risk quartile benefited by about the same amount, approximately a 50 percent relative risk reduction. By contrast, the effects of metformin were heterogeneous. The lowest-risk quartile saw no benefit whatsoever, the highest-risk quartile obtained about a 50 percent relative risk reduction, and the intermediate quartiles received something in between.
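For reference, the hazard ratio compares the instantaneous event rates (hazards) in the two arms,

```latex
\mathrm{HR}(t) = \frac{h_{\text{treatment}}(t)}{h_{\text{control}}(t)} ,
```

so the roughly 50 percent relative risk reduction seen in every risk quartile of the lifestyle arm corresponds to a hazard ratio of about 0.5.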
“We have one intervention where the statisticians will say [there is] no heterogeneity of treatment effect, and another where there is,” Kent summarized. Notably, when the DPP results are shown on an absolute risk difference scale (clinically the most important measure of treatment effect) rather than on a relative risk scale, there is important HTE for both interventions. These results further demonstrate the “scale dependence” of HTE; whether it is present or absent depends on what scale is used to describe treatment effects. “And for both interventions,” Kent noted, “it may be important to make different decisions for different patients and to target the treatments to the high-risk groups, particularly if resources are in some way limited.”
In one final example, Kent discussed a re-analysis of the Digitalis Investigation Group (DIG) study (Kozminski et al., 2015). The DIG study was an older trial in which more than 7,000 patients with heart failure were given either digoxin or a placebo, with the outcome measures being hospitalization due to heart failure and all hospitalizations. Patients in the highest-risk quartile experienced nearly a 15 percent absolute decrease in hospitalization due to heart failure when given digoxin rather than placebo, while those in the lowest-risk quartile experienced only a 2 percent decrease. “But when you throw in all hospitalizations,” Kent said, “you see something interesting.” He further explained, “If you look at the lowest-risk quartile, you see that there’s actually harm. And this makes sense because digoxin has a very low therapeutic index, and these are patients who really can’t benefit because they’re not at risk for hospitalization. They can’t benefit, but they can only get the toxicity that sometimes causes hospitalization with digoxin. So, there’s actually net harm in those patients.” Once again, if these results were analyzed only in the conventional way, this important heterogeneity in benefit–harm trade-offs would be obscured both in the overall results and within conventional (i.e., one-variable-at-a-time) subgroup analyses.
In summarizing, Kent offered the following take-away messages:
Finally, he noted several caveats and a few thoughts on how to proceed:
In the next presentation, Sanjay Basu, Assistant Professor of Medicine at Stanford University, spoke about a variation on the risk-based analysis that Kent described. In particular, Basu and his colleagues created a decision score that considered a patient’s expected benefit from treatment as well as the expected harm in order to assess the expected net benefit of treatment. Their analysis made it possible to make sense of two major studies of blood pressure treatments that had arrived at different conclusions and to predict which patients would do best with which treatment approaches.
The original question arose, Basu explained, because of two studies that appeared in The New England Journal of Medicine 5 years apart. The first,
published in 2010, reported the results of the Action to Control Cardiovascular Risk in Diabetes (ACCORD) trial (ACCORD Study Group et al., 2010), which had a total of 4,733 participants who were followed for nearly 5 years. The study looked at the value of using intensive blood pressure control to keep people’s systolic blood pressure below 120 mm Hg, as opposed to the standard goal of keeping blood pressure below 140 mm Hg. Basu stated that the study concluded that targeting a systolic blood pressure of less than 120 mm Hg “did not reduce the rate of a composite outcome of fatal and nonfatal major cardiovascular events.” The patients in the study arm with intensive blood pressure treatment did not improve more, on average, than the control patients who had the standard treatment.
Five years later, in 2015, results were reported for the Systolic Blood Pressure Intervention Trial (SPRINT; SPRINT Research Group et al., 2015), which also examined the value of aggressive blood pressure treatment with a target systolic blood pressure of 120 mm Hg versus standard treatment with a target of 140 mm Hg. The conclusion of SPRINT, however, was diametrically opposed to that of ACCORD. Basu stated, “[T]argeting a systolic blood pressure of less than 120 mm Hg, as compared with less than 140 mm Hg, resulted in lower rates of fatal and nonfatal major cardiovascular events.”
There was one obvious difference between the trials. “The first trial was among people with type 2 diabetes, and the second was not,” Basu noted. Nonetheless, evidence from several other trials indicated that the presence of diabetes did not have a profound enough effect to explain the two trials arriving at such radically different answers, which left clinicians in a bind. Which trial should they trust? Various editorial writers offered differing opinions. Perhaps there were differences in the sample selection between the two trials. Perhaps the presence of type 2 diabetes had a larger effect than previous studies indicated. Or perhaps, Basu said, “HTE exist, and despite being part of an overall similar population, differences in sampling resulted in somewhat different average treatment effects between the trials.”
This discrepancy was not merely an academic issue. In particular, SPRINT found that participants in the intensive-treatment group were significantly more likely to suffer severe adverse effects (e.g., hypotension, syncope, electrolyte abnormalities, acute kidney injury or failure) than those in the standard-treatment group. “This is not such a benign choice for the primary care physician,” he stated. “Rather than simply being a matter of causing some nausea or headaches, the side effects of intensive treatment may in some cases be severe: hospitalization, disability, dialysis, and death. So, one would want to make the right decision even though blood pressure control may seem like a fairly benign treatment decision,” Basu further explained. With this in mind, Basu and colleagues decided to analyze the two trials in an attempt to explain the discrepancy in average treatment effect between them in terms of HTE. “Perhaps,” he said, “similar patients in predictable ways have more benefit than harm, and vice versa, and differences in sampling could lead to differences in the average.”
The research question guiding their study was, Which patients have the most potential for benefit and the least potential for harm from the intensive blood pressure intervention? Their analytical approach involved developing two Cox regression models, one for benefit (i.e., a reduced risk of cardiovascular events and deaths) and one for harm (i.e., an increased risk of severe adverse events). They chose a limited set of candidate variables, based on previous studies, that might plausibly influence HTE. Among these candidate variables were demographic characteristics, tobacco use, pre-randomization laboratory values, medication use, and systolic and diastolic blood pressure. The models also included a term for treatment and treatment-by-covariate interactions. In an effort to reduce false positives, Basu and his colleagues used an elastic net regularization approach with repeated cross-validation on subsamples of the data. Collinearity was also a problem, as many of the variables were interrelated. With many collinear variables, Basu said, the solution is either to choose one variable that can stand in for all of them or to shrink the coefficients among the many collinear variables.
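As a concrete illustration of this model family (a minimal sketch using the open-source lifelines library; the file and variable names are hypothetical, and this is not the authors’ code), an elastic net Cox model with treatment-by-covariate interactions might look like this:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical dataset: one row per participant, with follow-up time,
# an event indicator, a treatment flag, and baseline covariates.
df = pd.read_csv("sprint_like_data.csv")  # file name is illustrative
covariates = ["age", "sbp", "creatinine", "statin_use", "smoker"]

# Treatment-by-covariate interactions let the treatment effect vary
# across patients -- these terms carry the modeled HTE.
for c in covariates:
    df[f"tx_x_{c}"] = df["treatment"] * df[c]

# Elastic net penalty (mixed L1/L2) shrinks the many correlated
# interaction coefficients, limiting false-positive effect modifiers.
cph = CoxPHFitter(penalizer=0.1, l1_ratio=0.5)
cph.fit(df, duration_col="followup_years", event_col="cv_event")
print(cph.summary[["coef", "exp(coef)"]])
```

The repeated cross-validation step described above, used to choose the strength of the penalty, is omitted here for brevity.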
In an a priori specification, they decided to separate people in terms of their net benefit, which was equal to the benefit of the intensive treatment minus the harm. They then created a benefit–harm score based on clinically accessible variables such as age, sex, race, systolic blood pressure, number of blood pressure medications taken, use of aspirin or statins, tobacco use, serum creatinine, urine microalbumin and creatinine, and total cholesterol and high-density lipoprotein. Next, they applied the benefit–harm score to the participants in SPRINT, retroactively assigning them “decision scores” for the trial, and then compared those decision scores with the real outcomes of the trial. What they found was that the SPRINT participants with the higher decision scores were more likely to have benefited from the intensive treatment and less likely to have experienced harm than those participants with lower decision scores.
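In schematic form (hypothetical numbers, not the study’s data), the decision score is simply the difference between the two models’ predictions for each patient, which can then be cut into tertiles:

```python
import numpy as np
import pandas as pd

# Hypothetical per-patient predictions from the benefit and harm models
# (risk differences over the trial horizon; numbers are illustrative).
predicted_benefit = np.array([0.08, 0.02, 0.05, 0.01, 0.10, 0.03])
predicted_harm    = np.array([0.01, 0.03, 0.02, 0.04, 0.02, 0.01])

net_benefit = predicted_benefit - predicted_harm  # the decision score
tertile = pd.qcut(net_benefit, q=3, labels=["low", "middle", "high"])
print(pd.DataFrame({"score": net_benefit, "tertile": tertile}))
```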
When Basu and his colleagues divided the SPRINT participants into tertiles based on the decision scores, they observed distinctly different patterns of response to treatment between the top and the bottom tertiles (see Figure 2-3). In the top tertile—that is, the one-third of those subjects whose decision scores indicated they were most likely to benefit from intensive treatment—participants who received intensive treatment had much greater benefit than the control subjects who received the standard treatment. There was no difference, however,
between the standard group and the intensive group in the amount of harm they experienced in the form of adverse effects. Thus, among the highest tertile there was a significant net benefit to treatment.
Conversely, among the lowest tertile—that is, those whose decision scores indicated they were least likely to benefit from aggressive treatment—there was no difference between the intensive-treatment group and the standard-treatment group in the benefit they received from the treatment in terms of reduced cardiovascular events and deaths. But among those in the lowest tertile, subjects in the intensive-treatment group experienced significantly more harm (i.e., adverse events) than those in the standard-treatment group.
Next, Basu and his colleagues applied the decision scores to the ACCORD subjects and found the same pattern. Among the highest tertile on the decision score, the intensive treatment had a net benefit versus the standard treatment; but among the lowest tertile, the intensive treatment had a net harm (these data are not shown in Figure 2-3). What explained the different results from the SPRINT and the ACCORD studies?
Although the average effect in SPRINT was positive (i.e., intensive treatment led to better outcomes on average) and the average effect in ACCORD was neutral or negative (i.e., intensive treatment did not lead to better outcomes on average), the outcomes of the two trials, examined more closely, were in fact not that different. The apparent discrepancy was due to differences between the two study samples in their likelihood of net benefit from aggressive blood pressure lowering. As Basu explained, although 21 percent of the ACCORD sample was predicted, and observed, to benefit from the aggressive therapy, a larger percentage of SPRINT subjects fell into this high-benefit group. In the end, the decision score derived from the SPRINT study correctly predicted that most ACCORD patients would not benefit.
The true lesson from the two trials, Basu concluded, is that “average trial results can often hide clinically profound heterogeneities in treatment effects.” Average trial results may also appear to be contradictory, consequently confusing both clinicians and the public. Comparing the average effects of SPRINT and ACCORD overlooked vital details about how individuals can be expected to respond to blood pressure treatment; in particular, the aggressive blood pressure treatment could be expected to help only a subset of patients—not all of them. Furthermore, Basu said, it was necessary to consider several factors in combination, rather than any single factor, in order to explain the important variations.
Basu noted several limitations of the study. The analysis could not examine results beyond approximately 5 years, because SPRINT was discontinued after that amount of time. Another limitation was that congestive heart failure could not be included as a negative outcome because of differences in definitions between the two studies. People may also weigh benefits and harms differently in their own calculations of net benefit. Basu and his colleagues are now exploring other approaches to weighting benefit and harm, rather than simply treating them equally.
Basu believes it will eventually be possible to create a tool that makes treatment recommendations for individual patients based on their individual characteristics and preferences. As an example of how such a tool would assist doctors, he noted the difficulty in keeping track of which of the numerous available drugs for treating type 2 diabetes are best for which type of patient—who might either benefit or be harmed by each type of drug. He said,
What we’re experimenting with in a trial setting is doing a personalized risk estimate for a baseline risk. Does the patient want to be treated or not, or how aggressively might we think about doing treatment? That’s the classical absolute risk before treatment. And then from individual participant data and network
meta-analyses, we can calculate heterogeneous treatment effects across all the possible treatments that are available, [identify] what types of people might benefit more or less from each different type of therapy, and then weight it based on patient preferences.
People are different, he noted. There are some patients, for example, who simply will not inject a medication; others are willing to inject a drug, but may be worried about weight gain or avoiding hypoglycemia. The ultimate goal is to use these various factors as weights to create an individualized ranking of medications based on the individual’s personalized risk and preferences and, particularly, the uncertainty in those estimates. “That, I think, is on the horizon,” he concluded.
Derek Angus, Chief of the Department of Critical Care Medicine at the University of Pittsburgh, opened his presentation with the image of being on a Scottish mountaintop, where it is possible to look around and see everything clearly in all directions. “And that is ideally where we want to get” in HTE, he said. “We want to have some sense of the exact therapy that the patient would absolutely want and be most likely to benefit from.” However, he said, it is not so easy.
“We look out over this cloud inversion, and every valley around us is filled with clouds, and as soon as we walk off the top of the mountain, we end up in a very unique valley filled with clouds, and everyone tries to solve the problem for navigating inside just that valley and comes up with a solution that appears to be partly solving the problem—but not all of it.” That is the current situation with HTE. Everyone is grappling with just part of the problem.
In practice, it is quite difficult to combine the HTE approach with the precision medicine approach and the patient-centered approach. To provide some context, he quoted from a paper by Richard Kravitz and colleagues that examined the role of HTE in EBM (Kravitz et al., 2004). The authors, Angus related, identified four dimensions of HTE:
Historically, most HTE papers have focused on the first and third of these dimensions, Angus said. “As they go down into their valley, they make some assumptions.” He noted that Kent stated in the previous presentation, in essence, “We’d like to predict response to treatment, but we’re going to just predict risk of having the disease-related event.” Conversely, those interested in precision have tended to concentrate on the second dimension. “It comes from people who feel they understand the disease on the inside, and so they’ve tended to focus on response to treatment,” he explained. Furthermore, there is a whole field whose researchers focus on the utility of different outcomes. Each group tends to work in its own separate valley.
For the remainder of his presentation, Angus discussed the relevance of the design of RCTs for studying HTE. As Kent previously described, HTE analyses seek to identify subgroups that respond differently to treatment, with the higher-risk subgroups having larger absolute treatment effects than the lower-risk ones. The typical risk distribution in these clinical trials is skewed toward low risk, with the majority of the participants falling at the low end of the risk axis (see Figure 2-4). In these cases, Angus noted, the median risk is always lower than the average risk.
A major challenge in analyzing such trials is not overlooking those low-risk subjects who, in addition to not receiving any benefit from the intervention, are actually harmed. A typical one-variable-at-a-time subgroup analysis, as Kent also noted, will generally miss this harm. Comparing treatment effects in men versus women or Caucasian versus African American subjects will uncover a relatively small range in net benefit. “Therefore,” Angus said, “you want to have
this multivariate risk model that spans across the entire range, where you can have quantiles far to the left” that will identify subgroups of subjects who are harmed by treatment. “Of course, this will require having huge sample sizes all the time, enrolling across the entire breadth of the disease of interest, so that we always have enough samples to build these models,” Angus explained. “And so, the answer to trialists is just to do huge trials—enrolling everyone at risk.”
With regard to precision medicine, he described how researchers in that field tend to think more in terms of prognostic and predictive biomarkers. A prognostic biomarker is one that provides information about the likelihood of a patient reaching a certain disease-related endpoint, while a predictive biomarker is one that offers information about the likelihood of a patient responding to a particular therapy. Both biomarkers provide useful information for personalized medicine; that is, for a treatment to be useful for a particular patient, that patient must, first, be likely to experience the effects of the disease and, second, be likely to respond to the treatment.
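One standard way to write this distinction (a schematic formulation, not one Angus presented) is a proportional hazards model with biomarker M and treatment indicator T:

```latex
h(t \mid M, T) = h_0(t)\, \exp\!\big(\beta_M M + \beta_T T + \beta_{MT}\, MT\big) ,
```

where a nonzero coefficient on M alone makes the biomarker prognostic (it shifts outcome risk regardless of treatment), and a nonzero coefficient on the product MT makes it predictive (it modifies the treatment effect).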
A single biomarker is not necessary, Angus said. Indeed, it is possible to use a suite of biomarkers to identify the patients most likely to respond well to a particular treatment. As an example, he described a study in which the researchers used principal component analysis on a large set of biomarkers to define two phenotypes (Calfee et al., 2014). The study resembled a multivariable analysis in that the researchers analyzed a large number of biomarkers, but the end result was to assign patients to one of two categories, just as in a one-variable-at-a-time analysis. “They were very happy with themselves,” Angus noted, “because these phenotypes were obviously not predicted clinically,” and still the phenotypes were useful both prognostically and predictively. “If you have phenotype 2, you were much more likely to die. At the same time, the same phenotype was highly predictive of benefit versus harm when exposed to the different strategies.”
“This is the essence of much of the precision medicine world—trying to get at these predictive biomarkers,” he said. “But they just seem to have forgotten this lesson learned from HTE about the peril of having a subgrouping based on a single variable” because of the way it may hide issues with people who are in the lowest-risk quantiles.
Yet another problem arises, Angus said, in the way that unmeasured baseline variables can cause huge differences in patient outcomes. The net effect of a treatment can jump from harm to benefit or vice versa with modest swings in the prevalence of these unmeasured variables.
“The problem here is that you think you’re studying one disease, but you’re not really,” he said. “So what can be done?”
People in the precision medicine field are approaching the problem in a couple of ways, he said. “They basically have what I would call the ‘hope and pray’ models. If they think there’s a complex disease, they may have some putative biomarkers, and they either ignore the biomarkers or they take a bet ahead of time and only enroll on the biomarker.” He said he would not speak of those further.
Instead, he turned to what he called the “spread the bet” models. “You acknowledge you do not know everything about the intervention and you also do not know everything about the disease, and you’re going to try to learn as you go.”
The best and most evolved version of this approach, he said, is the adaptive platform trial. Such a trial focuses on a disease, not a particular treatment; it uses multiple interventions (in multiple arms) with continuing enrollment; it is often based on Bayes’ theorem, a formula that describes how to update the probability of a hypothesis as new information is uncovered; and it involves tailoring one’s choices over time. So far, he continued, researchers using adaptive platform trials have focused on the pre-approval space in drug testing, and the emphasis has been on efficiency, with the trials relying on small sample sizes. Different therapies “graduate” to the next phase while the trial continues.
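Bayes’ theorem itself is compact: for a hypothesis H and new data D,

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)} ,
```

that is, the posterior probability of the hypothesis is its prior probability reweighted by how well it explains the data.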
The “poster child” for adaptive platform trials, Angus said, is the I-SPY 2 trial, which screened several promising breast cancer therapies simultaneously. The first results came out about 18 months ago, with papers published in The New England Journal of Medicine (Carey and Winer, 2016; Park et al., 2016; Rugo et al., 2016). Patients are assigned to the different arms of the trial with “response-adaptive randomization,” which regularly changes the selection rules according to the results of the trial to that point. As an example, Angus described how a planned 400-person trial might proceed. If, after results were available for 40 patients, it was clear that treatment A was looking much better than treatment B, the randomization rules would be modified so that a greater percentage of the next 40 subjects would be put on treatment A (see Figure 2-5). “You don’t have to be an investigator” to make that call, he explained. “It can be a preset algorithm.”
An advantage of this approach is that if indeed treatment A is superior, it will become statistically clear sooner, and the study can be stopped earlier than planned. On the other hand, if the apparent advantage of treatment A in the first 40 patients was because of random chance, then the next 40 patients will move the outcomes back toward 50/50, and the trial will continue. One caveat, Angus said, is that this is not very efficient for a two-arm trial because the power is still determined by the smaller group. “But it actually becomes very interesting in the situation where you have multiple arms and multiple subgroups, which is arguably the situation we’re in today [with heterogeneity of treatment effects].”
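A toy implementation of this idea (a sketch of response-adaptive randomization in general, not the actual I-SPY 2 algorithm, which is far more elaborate) keeps a Beta posterior per arm and lets allocation shift as outcomes accumulate across blocks of 40 patients:

```python
import random

# True response rates, unknown to the algorithm (illustrative values).
TRUE_RATE = {"A": 0.45, "B": 0.30}

# Beta(1, 1) priors per arm, stored as [successes + 1, failures + 1].
posterior = {"A": [1, 1], "B": [1, 1]}

for block in range(10):  # 10 blocks of 40 patients = 400 planned
    for _ in range(40):
        # Thompson sampling: draw a plausible response rate from each
        # arm's posterior and assign the patient to the higher draw,
        # so allocation drifts toward the apparently better arm.
        draws = {arm: random.betavariate(a, b)
                 for arm, (a, b) in posterior.items()}
        arm = max(draws, key=draws.get)
        responded = random.random() < TRUE_RATE[arm]
        posterior[arm][0 if responded else 1] += 1
    counts = {arm: sum(ab) - 2 for arm, ab in posterior.items()}
    print(f"after block {block + 1}: patients per arm {counts}")
```

Because each arm’s posterior sharpens as data accumulate, a chance early lead for the inferior arm is corrected by later blocks, which is the behavior Angus described.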
The use of the adaptive platform approach was successful in the I-SPY 2 trial, Angus said. When the trial began, there was uncertainty about which drugs worked and in whom they worked. What they found was that the use of one drug, neratinib, was effective only in patients with one of two different combinations of the three biomarkers used in the trial; while a second drug combination, veliparib–carboplatin, worked only in women with a different combination of the three biomarkers (Park et al., 2016).
In conclusion, Angus reiterated several points. First, there are generally multiple axes of heterogeneity, and one of the challenges is to keep this in mind and not reduce the problem to a single axis. The classic HTE literature has largely focused on the baseline risk of disease, with a constant warning to avoid one-variable-at-a-time subgroups in favor of multivariable risk models. Precision medicine studies largely ignore that one-variable-at-a-time warning and instead concentrate on “predictive” biomarkers that may not actually predict; they use trial designs with putative predictive enrichment and “hope and pray” that these work. The alternative “spread the bet” approach is quite exciting: it is working in cancer and is arguably more patient-centered.
Robert Temple, Deputy Center Director for Clinical Science, Center for Drug Evaluation and Research at the U.S. Food and Drug Administration (FDA), highlighted the importance of understanding HTE from the regulatory perspective. He offered several additional examples in which drugs had contrasting effects in
different patients and in different situations, further emphasizing the critical need to understand treatment heterogeneity in order to maximize the benefit of drugs.
He began by commenting on how the field’s increasing knowledge of the pharmacokinetics of drugs has led to a better understanding of why various subgroups of patients may respond differently to the same medication. “Forty years ago,” he said, “you didn’t know how a drug was metabolized, you didn’t have good evidence of how it was renally excreted, hepatically modified; we didn’t understand about the enzymes that were responsible for the drug’s metabolism.” To illustrate, he mentioned the case of the tricyclic antidepressants. Tricyclics must generally be given at a dose of 150 to 300 milligrams (mg) to work. Yet, years ago people did not start on that dose; they started with 30 mg, because some people had terrible adverse effects on 150 to 300 mg. Why did such reactions occur? Some individuals are simply poor 2D6 metabolizers (i.e., their CYP2D6 enzymes do not function well), and they do not metabolize the tricyclics as quickly as most people. Consequently, these people will have approximately five times as much of the drug in their bloodstream as a “normal metabolizer” given the same dose. “If you just gave them the 300 mg, you could kill a poor metabolizer because those drugs are toxic at high doses,” Temple said. “So, the standard starting dose for desipramine was 30 mg, the right dose for a poor metabolizer. If that worked okay, you were fine. If it didn’t but was tolerated, you increased the dose. Of course, delaying effective antidepressant treatment poses its own problems.” Fortunately, this scenario is no longer an issue, he noted. “We know most of the metabolizing enzymes, and we know how to adjust doses for people. In clinical trials, we get blood levels on almost every patient, so we can detect unanticipated reasons for some people to have higher blood levels than others.”
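The five-fold difference Temple cited is what elementary pharmacokinetics predicts: at steady state, the average drug concentration is proportional to dose divided by clearance,

```latex
C_{ss} \propto \frac{\text{dose}}{CL} ,
```

so a poor metabolizer with one-fifth the normal clearance reaches roughly five times the blood level on the same dose; equivalently, 30 mg in a poor metabolizer approximates 150 mg in a normal metabolizer.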
According to Temple, FDA now looks for a variety of differences in how people respond to drugs—“differences in how you metabolize the drug, differences because of a concomitant drug that affects the metabolism of a drug, or differences in pharmacodynamic effect, that is, differences in how some people respond to the same blood level, perhaps because of genomic differences.” Today, when FDA receives a drug application, it examines all possibilities that might affect either safety or effectiveness, including demographic differences, genomic characteristics, and severity of the disease. “And, every once in a while, those analyses of subgroups turn up something important,” he said. “I’ll give you two of my favorite examples.”
His first example was BiDil, a combination of isosorbide and hydralazine that is used for heart failure. BiDil was examined in two studies by the U.S. Department of Veterans Affairs (VA), and the overall results showed that it performed a little
better than a placebo and much worse than an angiotensin-converting enzyme (ACE) inhibitor (Cohn et al., 1986, 1991). However, when they looked at the drug’s effectiveness broken down by various demographic groups, the researchers observed a surprising result. The drug did not perform well in Caucasian subjects, but it was effective in African American subjects (Carson et al., 1999). “That was true in both studies,” Temple said. “And we eventually allowed a confirmatory study to be done entirely in [an African American] population, and the effect size was very dramatic.” This is one of many discoveries that can result from analyzing subgroups, he noted.
The second example concerned ticagrelor, an antiplatelet drug used in people who have experienced a heart attack, as an alternative to clopidogrel. A large cardiovascular outcome study revealed that the drug worked better than clopidogrel everywhere except in the United States, where it performed considerably worse (Wallentin et al., 2009). “When we examined the data,” Temple said, “it turned out that it was entirely attributable to the dose of aspirin that was used. When it was used with 300 mg of aspirin, it performed worse than clopidogrel, but when it was used with 100 mg of aspirin, it performed markedly better.” And aspirin use was distinctly different in the United States, where about half of the patients used the 300 mg dose, in contrast to the rest of the world, where only about 15 percent were given that dose. Thus, differences in outcomes were related neither to region nor to population, but rather to the concomitant aspirin dose. “We studied the heck out of that,” he said, “because there was a lot of suspicion among our biostatisticians that this was fishing for subgroups.” However, they observed the same pattern of effectiveness related to aspirin dose in both Europe and the United States: people who used the drug with higher doses of aspirin, which was uncommon in Europe, did not fare as well as those who used lower doses. “Eventually [ticagrelor] got labeled with, ‘Don’t use with high doses of aspirin,’” he said. Ultimately, FDA was able to resolve the issue because it analyzed the data at the subgroup level, Temple concluded.
To begin the broader discussion, moderator Harry Selker, Executive Director of the Institute for Clinical Research and Health Policy Studies at the Tufts Medical Center, solicited comments regarding access to data from trials. If one wishes to re-analyze a previously conducted study to search for latent variables, among other areas of interest, it may be difficult to obtain access to those data, he said, which can be an important factor in how quickly the field advances.
One audience member responded that he would prefer that the researchers who perform major studies release the data collected during the trial for use by other researchers and clinicians after the study is published. In particular, he suggested it would be useful to have “an online calculator so that you can apply the results from the evidence of that trial to the patient before you.” How, he asked, can that be made to happen? Joseph Ross, Associate Professor of Biomedical Informatics at Yale University, commented that several clinical trials are now, in fact, being made available by sponsors, manufacturers, the National Institutes of Health (NIH), and others for secondary research purposes. Ross leads the Yale University Open Data Access (YODA) Project, which has partnered with Johnson & Johnson in making clinical trial data available. The YODA Project offers more than 250 clinical trials to which Johnson & Johnson has provided access. There are many additional groups providing access to clinical trial data, Ross said, mentioning www.ClinicalStudyDataRequest.com, pharmaceutical companies such as GlaxoSmithKline and Roche, and NIH’s Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC), which provides access to data from studies funded by the National Heart, Lung, and Blood Institute. Temple commented that, in addition to data, it is often important to have access to tissue samples, so that additional tests can be performed using those samples when necessary.
Relatedly, Sheldon Greenfield, Executive Co-Director of the Health Policy Research Institute at the University of California, Irvine, acknowledged the work described by the panelists, adding that this work should be widely practiced as quickly as possible. “Let’s get on with it,” he quipped. “In every trial, there should be predictive models of this sophisticated type embedded in the trial, not only to improve the analysis of the trial, but also so we do not have to wait forever and retire before the studies come out.” He also suggested enlarging the group of variables examined for their relationship to treatment effect to include patients’ personal variables such as comorbidities, functional status, presence of depression, participation in care, and various other social determinants of care. Sherrie Kaplan, Executive Co-Director of the Health Policy Research Institute at the University of California, Irvine, compiled a composite variable that captures many of these factors, Greenfield noted. The reason it worked, he said, was that her composite captured a latent variable: the patient’s ability to respond to treatment. Angus said he strongly endorsed each study having a “multi-attribute risk model of your best understanding of predicting the outcome of disease, even though you might be wrong about mechanism.” He would also favor having such a model mandated for every large Phase III clinical trial. “Even if nothing else happened today,” he said, “the death of the one-at-a-time subgroup analysis would be great.”
Turning to another topic, Kent offered a technical comment about why subgroup analyses based on outcome risk are often easier and more reliable than conventional analyses searching for relative effect modifiers. Obtaining reliable analyses of relative effect modification is difficult for two reasons, he said. First, it is usually the case that little is known about the relative effect modifiers before the trial begins, which leads to “fishing expeditions.” Second, trials do not generally have enough statistical power to provide solid data on the relative effect modifiers. This situation causes the forest plots examining relative effects to be unreliable, he stated. “They’re unreliable empirically, but also theoretically,” Kent continued. “We should anticipate that they’re unreliable because they’re very underpowered, and the prior information that we have is typically very weak.” Throughout the remainder of the session, several participants offered opinions regarding forest plots, with some saying they are useful and others offering caveats about their weaknesses. Rodney Hayward, Professor in the Department of Internal Medicine and the Department of Health Management and Policy at the University of Michigan, suggested that forest plots should often be restricted to the appendices of a paper, thus allowing researchers who are interested to view them while preventing other readers, such as clinicians, from being misled by them.
Ralph Horwitz, Professor Emeritus of Medicine at the Yale School of Medicine, commented that a related problem is that “a lot of the heterogeneity is outside the trial.” That is, trials are generally run with relatively narrow inclusion criteria, so much of the heterogeneity that doctors see in clinical practice is never included in trials. For example, he continued, the drugs that patients were taking previously are typically neither reported nor analyzed. “You’d like to see more inclusion of whatever background drugs they were on in the first place?” Temple asked. “I think we [at FDA] are very sympathetic,” he continued. “We would like to see the background drugs that people were on kept in at least some of the studies.”
Finally, Greenfield mentioned the importance of observational studies. “I’m not talking about big data,” he said. “That’s a separate topic. I’m talking about intermediate and small data—hundreds, maybe thousands of people.” Such studies are becoming more and more common, he said, to the point that they are eclipsing randomized trials. “These observational studies are a rich source of HTE,” he explained. “I think we’ve got to move ever more toward using the data that we have and are able to collect.”