The third panel at the workshop examined nongovernmental perspectives on experiments. As moderator Daniel Goroff (Alfred P. Sloan Foundation) said, improving the productivity of the research enterprise is important not only for taxpayers, but also for nongovernmental funders of research, such as the Sloan Foundation. He highlighted the need for communication and collaboration: “Inspiration can go both ways between government agencies and the private organizations that are doing things.” Furthermore, he continued, powerful tools and methods are now available for running experiments: “You may or may not be able to do a randomized control trial, but lots of methods are available now. . . . They’re not easy, not only because of the power calculations that you need to do at the beginning but figuring out how to set these things up. . . . Fortunately, there is a growing and very vital community of people who will work with you.”
With the mission of improving human health and the sustainability of society and the planet, the Novo Nordisk Foundation is the largest foundation in the world in terms of assets and the third largest in number of grants awarded. Its department of impact assessment, which has grown substantially over the past decade, follows large grants over their full duration and for up to 5 years thereafter, explained Rikke Christensen of the foundation. The foundation is also interested in running experiments to identify the most effective funding approaches, which raise such questions as whether to fund projects or people, whose expertise to solicit, and which selection criteria to use.
Several expectations accompany experimentation, said Christensen. Evaluations in science and innovation are expected to add to the knowledge base. They should increase the transparency and accountability of grants. And they
should improve the use of resources and the ability of funding agencies to fulfill their strategic goals, ultimately with greater benefits for society.
The Novo Nordisk Foundation uses five grant-giving models:
With the open competition awards, for example, it has done experiments with both investigator grants and project grants. The decision-making process for these awards involves administrative screening, a committee assessment of multiple applications, individual evaluation of multiple applications, and a final discussion and deliberation on the winning applications.
With grants from the Novo Nordisk Foundation and the Sloan Foundation, Chiara Franzoni (Polytechnic University of Milan) and Paula Stephan (Georgia State University) are looking at how the peer-review process is designed. The purpose of their work is to critically consider, analyze, and potentially improve research funders’ use of peer review; to create science-informed solutions about how to design peer-review processes in open competition; and to understand specifically the screening and selection of high-risk, high-gain research. Christensen explained that statistical and econometric data analyses are being conducted on the relationship between the design of peer review and the predictive validity of the outcomes. Framed-field experiments are being conducted to simulate peer review, she reported, with randomization of the alternative designs used in the simulated peer review and a specific focus on the relationship between the peer-review design and the predictive validity of the outcomes in the case of high-risk, high-gain science. And field experiments are being conducted with real grant peer review to test the findings of the data analyses and lab experiments and achieve greater external validity. Relevant questions include:
Data from more than 10,000 open competition applications have been gathered, said Christensen, including titles; abstracts; full text; literature references; scores; and the names of applicants and their affiliation, contact
details, and gender. Enriched data include applicants’ publication data for the past 5 years, first year of publication, and novelty scores.
Preliminary results indicate that applicants with a record of more novel research articles and applications had a lower probability of being funded, Christensen reported. Also, applications including more “promotional language,” as indicated by a list of keywords, had a higher probability of being funded and of delivering high-impact journal articles. These experiments can yield “very actionable results” and also are “very close to practice, both of which are major strengths,” she concluded.
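As a rough illustration of the kind of analysis described, the sketch below fits a logistic regression of funding outcomes on a novelty score and a count of promotional keywords. The keyword list, column names, and data are invented for illustration; they do not reproduce the foundation’s actual variables, models, or results.

```python
# Illustrative sketch only: do promotional keywords and novelty predict funding?
# The keyword list, column names, and data are invented; real abstracts would be
# scored with promo_count(), but here the counts are simulated.
import numpy as np
import pandas as pd
import statsmodels.api as sm

PROMO_KEYWORDS = {"groundbreaking", "unprecedented", "transformative", "unique"}

def promo_count(abstract: str) -> int:
    """Count promotional keywords in an abstract (case-insensitive)."""
    return sum(w.strip(".,;:") in PROMO_KEYWORDS for w in abstract.lower().split())

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "novelty_score": rng.normal(0.0, 1.0, n),     # enriched novelty measure (simulated)
    "promo_keyword_count": rng.poisson(2, n),     # stand-in for promo_count() output
})
# Simulated funding outcome, loosely mirroring the reported direction of effects.
linpred = -0.3 * df["novelty_score"] + 0.2 * df["promo_keyword_count"]
df["funded"] = rng.binomial(1, 1 / (1 + np.exp(-linpred)))

X = sm.add_constant(df[["novelty_score", "promo_keyword_count"]])
result = sm.Logit(df["funded"], X).fit(disp=False)
print(result.summary())
```

In a model of this form, the sign and significance of each coefficient indicate whether the corresponding feature is associated with a higher or lower probability of funding.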
In response to a later question, Christensen elaborated on several evaluation processes. Grantees are required to develop an impact framework before an application can go to the board. “We have them develop a small table, and we sit with each of them and do it on a very individual basis,” she said, which is also a way of providing feedback to an applicant. The board also has asked her department not only to collect data on applications and outcomes but also to do more ad hoc analyses: “When I started 10 years ago, we would never talk about doing an experiment. But today, we’re doing that.”
Eva Guinan (Harvard Medical School) and her colleagues’ research is based entirely on field work rather than artificially constructed situations. Much of the work takes advantage of opportunities made available through the Harvard Catalyst, Harvard’s clinical and translational science center, where Guinan directs an innovation center, and focuses on aspects ranging from ideation to team formation and grant evaluation. “These experiments were composed of real funding opportunities that resulted in real resources being allocated to real people in ways they could identify with,” she said.
The medical context also has framed many of the questions they have asked. For example, Guinan and her colleagues are currently doing work on the review of feasibility, which is “the metric that is always the most out of alignment in reviews.” As an example, she cited a highly rated grant application that would have used a radionuclide in the surgery suite. A nuclear medicine doctor informed the grant makers that use of the radionuclide as proposed would have contaminated the operating room, making the grant unfeasible. Guinan said, “We don’t really understand a lot about feasibility assessment—how it’s done, how it’s executed, how exact it is, and how much it should bear on our final decision making.”
Another prominent issue, Guinan stated, is defining the criteria that reviewers should use. Different definitions for measures—such as impact, innovation, and feasibility—produce different responses. Assessing how reviewers respond when the criteria definition and applications are varied has been a way of understanding response patterns, she said.
Randomization is critical in their experiments, Guinan explained, although it can be difficult. Structure needs to be imposed, particularly when
conducting experiments in real-life settings. For example, collaboration can be investigated by randomly assigning where people sit in symposia or how reviewers are assigned to proposals. The imposition of structure needs to be attentive to the issue of how it will feel to the people involved, Guinan said, “because if it doesn’t feel real, they let you know it, and they let you know that they have not done their review in the way that you expected to be evaluating.”
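One minimal way to impose such structure is a seeded random assignment that also keeps reviewer workloads balanced. The sketch below uses invented reviewer and proposal identifiers and is not the procedure used in Guinan’s experiments.

```python
# Illustrative sketch only: a seeded random assignment of reviewers to proposals
# that keeps reviewer workloads balanced. Identifiers and parameters are invented.
import random
from collections import defaultdict

def assign_reviewers(proposals, reviewers, k=3, seed=42):
    rng = random.Random(seed)            # fixed seed makes the assignment reproducible
    load = {r: 0 for r in reviewers}     # reviews assigned to each reviewer so far
    assignment = defaultdict(list)
    for p in proposals:
        # Prefer the least-loaded reviewers, breaking ties at random.
        candidates = sorted(reviewers, key=lambda r: (load[r], rng.random()))
        for r in candidates[:k]:
            assignment[p].append(r)
            load[r] += 1
    return dict(assignment)

proposals = [f"P{i:03d}" for i in range(1, 21)]   # 20 hypothetical proposals
reviewers = [f"R{j:02d}" for j in range(1, 8)]    # 7 hypothetical reviewers
print(assign_reviewers(proposals, reviewers))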
Guinan strives to have her experiments always produce more than one data stream. Building multiple questions into each experiment allows it to examine more than one thing, although care has to be taken to ensure that the data streams do not interact. Multiple data streams also make it possible to look retrospectively at issues such as feasibility and the impact of published results.
In medicine, one randomized controlled trial (RCT) is useful but rarely definitive, Guinan said. Variation introduced by population availability, enrollment biases, study design, and normal human physiology can have major effects on results. “We always try to replicate with another randomized experiment to address, as best we can, the same question and get data on robustness and generalizability,” Guinan stated. Sometimes a close replica of an RCT is called for, with slight variations in the people involved, the setting, the randomization, or other factors. Such replications can be costly and time consuming but are nevertheless important.
Doing this work in the context of health care has interesting implications, Guinan said, one of which is that the personnel trained in natural science whose behavior is studied in experiments can be disdainful if they feel that their time has been wasted. Also, “it’s not a community that has much interest in observing itself,” she pointed out. “Doctors, and to lesser degree other science and health care workers, are trained in what is, in principle, if not almost exclusively, an apprenticeship model. And if you learn something as an apprentice, that doesn’t exactly support innovation—in fact, quite the opposite.”
Health care workers have constrained time and resources, and they may have little interest or motivation in shifting their activities. “As a result,” said Guinan, “we need to turn much more attention to implementation science, broadly speaking. How do you make the analysis palatable? How do you convey that what you’ve learned is important and encourage adaptation and further experimentation? . . . What we do needs to be useful and focused, but there also needs to be attention to creating real use cases that are persuasive if we are going to change how the work happens.”
In response to a question about research on the various phases of funded science projects, Guinan observed that design issues need as much attention and critique as data issues. She said, “It’s almost never the design. It’s always the analysis that is picked apart. The design, the engineering part, gets much less attention, but understanding how well the process research adheres to real practice would help the reader qualify what they think about the implications of findings.”
As a philanthropic initiative rather than a foundation, Schmidt Futures is able to deploy resources in many different ways, including gifts, grants, investments, scholarships, and other levers directed toward impact, said Schmidt Futures’ James Savage. In seeking to create public goods at scale, the initiative uses a talent-focused rather than project-focused approach. “Most of our talent programs are geared toward increasing the amount of entropy,” he said. “We want crazy things to happen.” The Schmidt Science Fellows program, for example, supports people who have just finished a PhD in one field with a postdoc in a different field. The Schmidt Science Polymath program gives people who have just gotten tenure approximately $2.5 million over 5 years to do research in a completely different field.
Savage focused his talk on the Rise program, which seeks each year to identify 100 brilliant people between the ages of 15 and 17 and provide them with a high level of support for life so that they can create public goods at scale. Schmidt Futures has experimented with Rise since 2020. The program is designed to accept up to 1 million applications a year, Savage said, and simply the act of applying causes young people to “level up, in some sense.” Designing such an immense application process was a major challenge, he said. The application was built on the idea that young people would have to do a project to apply, forcing them to break from the rote learning that they often encounter in schools. Doing a project can “demonstrate to us how much you can do and the sort of work that you can do,” Savage explained. The more general concept is to allow applicants to provide information about their characteristics that cannot be determined from test scores, transcripts, or recommendation letters.
An immediate challenge was how to review project-based applications. One possibility was to have applicants review other applicants with whom they are not in direct competition, Savage explained. Tests were also developed that could be delivered by a mobile app or on paper and subjected to automated grading. This peer review and automated grading could then be used to collect information from submitted materials.
To gauge these approaches, the initiative collected both low- and high-cost information, he explained. In the former category, it recruited 16 teachers from 11 gifted-and-talented programs around the world to rank order their students on such measures as brilliance, integrity, perseverance, and empathy. The students were then asked to identify their top peers on each measure. “Turns out, the students agree with each other. If you look at a Gini index, where 0 is choosing at random and 1 is they all have the same ranking, you’re looking at 0.4 to 0.7 in terms of agreement across people in the same classroom,” Savage reported.
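Savage did not spell out the construction of the index, but one plausible version, sketched below with invented nomination counts, measures how concentrated the classmates’ votes are: near 0 when nominations look random and uniform, and approaching 1 when everyone picks the same peers.

```python
# Illustrative sketch only: a Gini-style index of agreement over peer nominations.
# Each student's top-peer picks are treated as votes; the index measures how
# concentrated the votes are across classmates. Names and counts are invented.
def gini(counts):
    """Gini coefficient of nonnegative counts: 0 = uniform, values near 1 = concentrated."""
    n, total = len(counts), sum(counts)
    if n == 0 or total == 0:
        return 0.0
    mean = total / n
    abs_diffs = sum(abs(x - y) for x in counts for y in counts)
    return abs_diffs / (2 * n * n * mean)

# Ten classmates each nominate two peers for "perseverance" (invented tallies).
nominations = {"Ana": 7, "Ben": 5, "Chi": 3, "Dee": 2, "Eli": 1,
               "Fay": 1, "Gus": 1, "Hal": 0, "Ivy": 0, "Jo": 0}
print(f"agreement (Gini) = {gini(list(nominations.values())):.2f}")
```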
These results were then compared with high-cost data. Students were interviewed by 23 Rhodes scholars for 45 minutes to measure the same traits. The correlation between the results of these interviews and the student rankings was essentially zero. But getting this result “was worth it,” Savage said. “The measurement errors systematically favored people whose parents were in a top-two income quintile, especially the top income quintile, . . . so low- to middle-income applicants were especially penalized, at least in this experiment.”
The students who were interviewed were then asked to answer a series of questions while being videorecorded, and teens on the internet were recruited to watch the videos and leave feedback on them. “These scores, strangely enough, corresponded much more tightly with the classmate and teacher scores of these students,” said Savage. This may be because the students doing the videos were able to rerecord them and think more about the questions, thereby producing better answers than they could in an interview.
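A simple way to quantify how closely any two of these score streams track each other is a rank correlation, as sketched below with invented scores; the program’s actual data and analysis are not reproduced here.

```python
# Illustrative sketch only: compare two score streams for the same applicants
# (e.g., interviewer scores vs. peer rankings) with a rank correlation.
# All values are invented; the actual Rise data are not reproduced here.
from scipy.stats import spearmanr

interview_scores = [3.1, 4.5, 2.2, 4.8, 3.9, 2.7]   # hypothetical interview scores
classmate_ranks = [2, 5, 1, 3, 6, 4]                # hypothetical peer-nomination ranks

rho, p_value = spearmanr(interview_scores, classmate_ranks)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```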
The main conclusion drawn from this experiment was that a peer-review mechanism was a way of getting information on people; however, Savage said, the data also turned out to be “noisy.” As a result, the application was structured so that no one interview or any other stage in the process was terminal. “There was no one committee who reviewed your application and kicked you out. You accreted scores throughout the application and were evaluated as a sum of these scores,” he explained. Students could have a poor final interview after doing extremely well in every other part of the application and still be chosen.
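A minimal sketch of the accretion idea, with invented stage names and weights, is shown below: every stage contributes to a weighted sum, so a weak final interview cannot by itself eliminate an applicant.

```python
# Illustrative sketch only: score accretion, in which every stage contributes to a
# running total and no single stage is terminal. Stage names and weights are invented.
STAGE_WEIGHTS = {
    "project": 0.35,
    "peer_review": 0.25,
    "automated_tests": 0.20,
    "final_interview": 0.20,
}

def total_score(stage_scores):
    """Weighted sum of per-stage scores, each assumed to be on a 0-100 scale."""
    return sum(STAGE_WEIGHTS[stage] * score for stage, score in stage_scores.items())

# An applicant who does well everywhere but has a poor final interview
# can still end up with a competitive aggregate score.
applicant = {"project": 92, "peer_review": 88, "automated_tests": 90, "final_interview": 40}
print(f"aggregate score = {total_score(applicant):.1f}")
```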
This investigation has helped to build an experimental culture within Rise and more broadly within Schmidt Futures, said Savage. For example, one experiment asked adult reviewers of applicants to do counterfactual scoring of applications by predicting whether a young person would be able to attend a top-20 U.S. university with or without the support of Rise, yielding a metric through which reviewers could be compared and calibrated. Also, Savage explained, the organization ran very large A/B tests on the advertising designed to attract Rise applicants. For example, after soliciting submissions of marketing copy explaining why a young person should apply to Rise, the initiative recruited a couple of thousand young people from around the world and asked them which message they liked best. The finding was that messaging should favor oddballs, “the cerebral, creative” types, rather than more conventional students.
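A hedged sketch of how one such A/B comparison might be analyzed is shown below, using invented response counts and a standard two-proportion z-test; it is not the analysis Schmidt Futures actually ran.

```python
# Illustrative sketch only: compare sign-up (or preference) rates for two message
# variants with a two-proportion z-test. The counts below are invented.
from math import erf, sqrt

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in proportions between variants A and B."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # 2 * (1 - Phi(|z|))
    return z, p_value

# Variant A: copy aimed at "cerebral, creative" applicants; variant B: a conventional pitch.
z, p = two_proportion_ztest(success_a=310, n_a=5000, success_b=245, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```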
More broadly, Savage said, Schmidt Futures has been able to use this experience to roll out experiments across programs. For example, an experiment with randomizing seating arrangements and exposure at a retreat looked at the subsequent development of networks. “It turns out if you give people the option of attending breakout sessions that they themselves create, opting into the same breakout session almost completely swamps the effect of sitting people next to each other for even 6 hours,” reported Savage.
Savage drew several lessons from these observations. First, treatment arms need to be well designed and tested qualitatively. Experiments can be expensive, so considerable work needs to be done before they are begun. Second, buy-in from key people needs to be achieved at the beginning, which requires conversations among major players. In particular, the value of the information to be derived from an experiment should be assessed and communicated early on. “You want to make sure the benefits to resolve uncertainty are really high,” stated Savage. Finally, the budget for experimentation will diminish unless the
experiment produces useful results, while running useful experiments will build internal support for experimentation.
In response to a question about the costs of experimentation, Christensen noted that the costs for data collection can be high up front, but once systems have been established, the costs drop and capabilities expand. Guinan agreed, although she added that the costs lie not only in the data infrastructure but also in program management. These costs, too, drop over time as more is learned about getting projects done successfully. Furthermore, said Savage, the results of experimentation can reduce costs in programs beyond those being experimented on. For example, Schmidt Futures’ use of large-scale experiments in marketing to expand reach while reducing costs has now been rolled out across the organization wherever it conducts outreach for applications.
Finally, the panelists were asked for a single lesson that could be drawn from their experiences with experimentation. Christensen urged people to analyze their processes critically. “We don’t want to break any peer-review system that is working, but we want to learn about whether it is actually working the way we want.”
Savage recommended starting with conversations with the legal and bookkeeping departments. “Once those things are in place, everything else is easier,” he said.
“Don’t get discouraged,” replied Guinan. “Having clarity of purpose about what you’re trying to do and revisiting whether or not what you’re doing is contributing to the questions that you feel are important. . . . To keep going, you have to check in with yourself pretty frequently to make sure that the global community, the resource constraints, the effort required, and whatever aren’t pushing you so far off base that you end up having done a lot of work not to your own satisfaction.”