The first panel of the workshop looked at three experiments that have been conducted on and through programs of U.S. federal agencies. All reflected a comment by Adam Jaffe (Brandeis University), who moderated the panel: “When we have two ways of doing something, and we have a metric of whether it’s working or not, why not compare one to the other?”
Daniel Handel (International Initiative for Impact Evaluation [3ie]) discussed the concept of cash benchmarking, which he had promoted as a foreign service officer at the U.S. Agency for International Development (USAID). The basic question is whether a program produces greater benefits than would be obtained by taking the money spent on that program and giving it to the program’s recipients as a cash transfer. In other words, he asked, “are we doing more for the poor than they could do for themselves?”
In 2013, with support from Development Innovation Ventures; Google; the University of California, Berkeley; and the cash implementer GiveDirectly, Handel and his colleagues set up two randomized controlled trials (RCTs) to compare cash transfers with traditional programs. The first compared a cash transfer of $142 per beneficiary with a standard USAID nutrition program; the second compared a cash transfer of $450 per beneficiary with a workforce development program.
The nutrition program had a small impact on savings but no other effects, while the cash transfer improved a variety of economic outcomes, reported Handel. In this case, “if you had to pick between those two, cash was better,” Handel said. With the workforce development program, the cash transfer did more than the traditional program to improve consumption, livestock wealth, income, productive assets, and subjective well-being, while the traditional workforce program did more to improve business knowledge. In a survey done 18 months after both programs ended, the effects of each had faded, but the pattern of cash being more cost-effective than the traditional programs remained.
The point of cash benchmarking is not necessarily to do a large number of head-to-head studies, said Handel. For one thing, cash is not always the best benchmark. The best benchmark is the single most cost-effective intervention possible. In a program to prevent and treat HIV infection, for example, the best benchmark is the effect of antiretroviral therapy. However, the problem is that in many areas such a standard is not known. Handel said, “Cash benchmarking is a provisional attempt to say—while we’re figuring out the best way to increase school attendance, for example—don’t do any worse than just giving away the money.” In this way, the cost-effectiveness of cash can be used in the program-design phase as a hurdle so that programs are not approved without an evidence base that they are likely to be more effective than an unconditional cash transfer. “The idea is to kill the worst programs before they get started,” he said.
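Handel’s hurdle amounts to a simple comparison at the design stage: a proposed program should not be approved unless its expected benefit per dollar at least matches that of an unconditional cash transfer. The sketch below is a hypothetical illustration of that rule; the function name and the benefit-per-dollar figures are invented for the example and are not USAID values.

```python
# Hypothetical illustration of using a cash benchmark as a design-stage hurdle.
# The figures below are invented for the example, not USAID estimates.

def passes_cash_benchmark(expected_benefit_per_dollar: float,
                          cash_benefit_per_dollar: float) -> bool:
    """Return True only if the proposed program is expected to do at least
    as much good per dollar as an unconditional cash transfer."""
    return expected_benefit_per_dollar >= cash_benefit_per_dollar

# Example: a proposed program expected to produce $0.80 of measured benefit
# per dollar spent, versus $1.10 per dollar for a direct cash transfer.
if not passes_cash_benchmark(expected_benefit_per_dollar=0.80,
                             cash_benefit_per_dollar=1.10):
    print("Program does not clear the cash benchmark; rework or drop it.")
```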
Cash benchmarking is not relevant for some programs, such as public goods programs, policy reform programs, or infrastructure programs. “No amount of cash to individuals is going to harmonize the software at the Rwandan and Ugandan border posts,” Handel pointed out. But cash has been shown to have widespread positive effects, including reductions in child labor, improvements in dietary diversity, enhanced female empowerment, better use of health services, greater household income and savings, increased labor force participation, reduced malnutrition, improved school attendance, and reduction in risky sexual behaviors. “If you consider USAID’s $40 billion annual budget, maybe a third of that—I think it’s probably more, but let’s just say a third of that—is for programs intended to move the kind of outcomes that cash has been shown to move,” Handel said. Improving that third by just 15 percent would in effect be increasing USAID’s budget by $2 billion per year in perpetuity.
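That back-of-the-envelope figure can be reproduced directly from Handel’s stated assumptions (a $40 billion budget, roughly one-third of it aimed at outcomes cash can move, and a 15 percent improvement); the short calculation below is only a check of that arithmetic.

```python
# Check of Handel's back-of-the-envelope arithmetic, using his stated assumptions.
annual_budget = 40e9           # USAID annual budget, in dollars
share_cash_comparable = 1 / 3  # share aimed at outcomes cash has been shown to move
improvement = 0.15             # assumed gain in effectiveness for that share

effective_gain = annual_budget * share_cash_comparable * improvement
print(f"${effective_gain / 1e9:.1f} billion per year")  # prints: $2.0 billion per year
```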
Handel pointed to several accomplishments of this new way of looking at budgeting. By the time he left USAID (2 years before the workshop), the agency’s new economic growth policy included cash benchmarking, and benchmarking is mentioned in USAID’s rules and regulations on activity design, although these are not binding. In addition, the emphasis on cash benchmarking revealed that USAID does relatively few high-quality impact evaluations and that almost none of them includes a cost analysis. Thus, Handel stated, the new perspective was “helpful in bringing to light how much work there still is to do to be able to know about the cost-effectiveness of noncash programs.”
At the end of 2022, USAID administrator Samantha Power committed the agency to expanding the use of cash benchmarking, “which was super exciting to hear,” said Handel. The next step would be for USAID to require discussion of the evidence for cost-effectiveness in the standard approval process and make it easier for staff to access the kind of evidence they need to compare programs with the likely cost-effectiveness of a cash transfer.
In response to a question, Handel observed that program staff need to have systematic reviews and meta-analyses available that lay out in an accessible way the levels of benefits that would be expected from a program so that those benefits can be compared with cash payments. “You want to give technical staff the cash benchmark; it’s not fair to ask them to figure that out themselves,” he said. A more basic approach is simply to direct staff to think about the cost. He cited evidence from a program in Afghanistan that spent $90 million to increase income for 55 women. Even if the exact size of the effect is unknown, the expenditure is clearly out of proportion to the benefits. He stated that “just making people think about the cost per beneficiary would be a good place to start.”
USAID has had no requirement that evidence be cited in the approval of a program, that the likely level of impact per dollar be estimated, or that the likely cost-effectiveness of a program be compared with that of a cash transfer. But at the time of the workshop, USAID was reviewing its internal policies in these areas. Handel concluded, “USAID is much more comfortable now doing cash programming. So I would say if nothing beats the cash benchmark, if cash is literally the most cost-effective thing you can do, then do cash.” However, he added, that was not the intention of cash benchmarking, which was simply to measure the effectiveness of programs against that standard so as to create an evidence-based criterion for programs.
Lisa Ouellette (Stanford University) described an RCT on patent applications she conducted with her colleague Daniel Ho, which offers several valuable lessons for testing policy ideas. The approximately 8,000 patent examiners at the U.S. Patent and Trademark Office (USPTO) face many challenges in reviewing the 600,000 or so patent applications that arrive at the office each year, including limited time to examine the applications, limited experience with the subject matter fields, and difficulty finding scientific publications to help them ascertain whether an invention is new and nonobvious. As a result, Ouellette and other researchers have asked whether scientific peer review would be a way to bring in outside expertise in judging applications. “If you find someone who spent their entire life focusing on biosensing with nanotube transistors, they’re going to know that literature well and could very quickly identify the most relevant [articles],” she said.
The experiment was focused on what in patent law is called prior art, the earlier publications that are relevant to whether a possible patent is new or nonobvious. This process is analogous to peer review, in which scientists decide whether a given contribution to knowledge advances the field or was already known. Researchers had previously examined the idea of crowdsourcing patent applications by posting them on the web and inviting comments, but because those studies had no control group, their results were hard to interpret.
Ouellette and Ho contacted 1,476 experts, 336 of whom opted in and were matched to two pending patent applications in their fields of expertise. One of those applications was randomly chosen to be sent to the reviewer, while the other served as a control. The reviewers were asked to submit relevant earlier publications that would indicate whether each legal “claim” in the patent application described a new invention, which raised one of the difficulties unearthed by the trial: the scientific experts had difficulty understanding what is important for a patent, such as legal claim language and the relevance of publications to that language. As a result, Ouellette and her law student research assistants had to put the input from the scientists into a form more useful to patent examiners.
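The matched-pair randomization described above, in which each opted-in expert is paired with two applications and one of the pair is randomly chosen to receive the expert’s input, could be set up along the following lines. This is a minimal sketch with hypothetical identifiers, not the study’s actual procedure or code.

```python
import random

def assign_pairs(matched_pairs, seed=0):
    """For each expert, randomly designate one of the two matched applications
    as the treatment (sent for review) and the other as the control."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    assignments = []
    for expert_id, (app_a, app_b) in matched_pairs.items():
        treated, control = (app_a, app_b) if rng.random() < 0.5 else (app_b, app_a)
        assignments.append({"expert": expert_id, "treatment": treated, "control": control})
    return assignments

# Hypothetical expert-to-application matches
pairs = {"expert_001": ("app_123", "app_456"),
         "expert_002": ("app_789", "app_321")}
for row in assign_pairs(pairs):
    print(row)
```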
Ouellette and Ho could not directly measure the social value of granting a given patent, but they observed proxies relevant to the patent examination context. The researchers measured whether an examiner was more likely to search for or cite nonpatent literature when working with an application that had been assigned to an expert (whether or not the expert provided input) than when working with an application in the control group. They found that the nonpatent literature search rate increased from 24 percent to 34 percent, while the citation rate increased from 23 percent to 37 percent. The initial grant rate dropped from 14 percent to 10 percent, which would be expected if the examiners knew more about the prior art relevant to an application; however, this drop fell just outside conventional statistical significance, Ouellette noted.
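As a rough illustration of the kind of comparison behind a statement like “just outside conventional statistical significance,” a simple two-proportion test on the grant rates might look like the sketch below. The per-arm sample size is an assumption made for illustration, not a figure reported here, and the study’s actual analysis may well have differed.

```python
from statsmodels.stats.proportion import proportions_ztest

# Illustrative two-proportion z-test on initial grant rates.
# The per-arm sample size is an assumption for this sketch only.
n_per_arm = 330
grants_treatment = round(0.10 * n_per_arm)  # 10 percent grant rate
grants_control = round(0.14 * n_per_arm)    # 14 percent grant rate

stat, p_value = proportions_ztest([grants_treatment, grants_control],
                                  [n_per_arm, n_per_arm])
print(f"z = {stat:.2f}, two-sided p = {p_value:.3f}")
# Whether the difference clears the conventional 0.05 threshold depends on the
# assumed sample size and on any covariate adjustment used in the actual study.
```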
The study turned up several additional problems with the peer-review approach. First, the scientists found the task to be difficult. Comments received from them included: “This is quite a lot less satisfying than reviewing a paper,” “This requires a lot more work than reviewing a peer-reviewed article,” and “I must confess that I have immense trouble reading and understanding claims.” The reviewers were not paid, with some commenting that they would need a high hourly rate to do this work. Over 60 percent of the experts failed to complete the task successfully. “There is still a big gap between what’s relevant from a legal perspective and scientific expertise,” Ouellette said.
Finally, the researchers had a former patent examiner rate the quality of the submissions both in the form received from the experts and after revision by the legally trained research team. Before revision, less than 25 percent of submissions received the two highest quality ratings (the highest quality ratings are given when many or all patent claim elements are connected to the submitted references); after revision, 60 percent of submissions merited these ratings. This result further illustrated the difficulty experts had in understanding what is demanded by the patent examination process.
These results changed Ouellette’s mind about the potential of expert peer review in the patent system: “As someone who had previously advocated for this as a way to improve patent examination, I no longer think this is the most cost-effective way to help improve patents.” Nevertheless, running the RCT was important, she said, because it provides a measure of the impact of this intervention and suggests that other reforms may be more cost-effective. For example, Ouellette hypothesized that simply giving patent examiners more time may be a better use of resources. It also demonstrated the value of research that does not necessarily require partnership with a government agency. She said, “We, as two law professors with our team of research assistants, were able to test this policy intervention. A number of ideas like that can be tested that way.”
Finally, such research does not necessarily require government funding. “There are a lot of law professors out there, like Dan and I, who . . . have our funding and salary paid for,” Ouellette stated. “And there are many law professors like us who would love the opportunity to work with people in government or outside of government in testing other policy ideas.”
Asked specifically about how they worked with the USPTO, Ouellette said that any third party is allowed to submit input on patent applications in the first 6 months after they are published. She and her colleagues took advantage of that provision to provide information to the patent examiners. “We were effectively running our own pro bono scientific journal geared toward the patent office to provide input on the set of patent applications.”
In response to a question about whether a subgroup of reviews or problems could be defined in which this approach would be more successful, Ouellette observed that not enough data were gathered to do a subgroup analysis. But others have asked whether the approach might be more worthwhile for certain kinds of patent applications, whether the USPTO should have different kinds of patent examination roles for different kinds of patent applications, or whether more resources should be devoted to patents that are likely to be more important than others. Some of these alternatives might be useful, she said, “though even if you’re focusing on the most valuable patents, it’s not obvious to me that asking a scientific expert . . . is the most cost-effective way to do that.”
In 2014, a presidential executive action indicated that USPTO needed to provide assistance to applicants who lack legal representation, known as pro se applicants. Peter-Anthony Pappas was one of a group of agency employees who were tasked with addressing this request, and they developed what was essentially an RCT. Fifteen senior patent examiners were charged with examining pro se applications from various technical areas; these examiners were given specialized training that enabled them to work closely with applicants. In his computerized records, Pappas then created two groups for tracking patent applications: the treatment group consisted of pro se applications docketed to the examiners; the control group consisted of applications that went through the normal patent process.
The resulting data had greater value than Pappas expected. For example, the data showed that the examination of pro se applications led to a 16.8 percentage point increase in the likelihood of female applicants receiving a patent, compared with a 6.1 percentage point increase for male applicants—with an even greater improvement for U.S. women applicants—thereby reducing the observed gender disparities in the success of pro se applications.
The success of the pro se program led to an expansion of the number of examiners involved in it. More broadly, it demonstrated that the USPTO has a “treasure trove” of data, Pappas observed, that is “ripe for analysis on a whole bunch of different fronts.” The successful use of the data has demonstrated to the USPTO the value of RCTs.
Asked about the potential effect on his study of the union representing patent examiners, Pappas recounted that the team conducting the pro se pilot program negotiated with the unions and created a memorandum of understanding that provided the examiners with incentives for participating—for example, temporarily receiving higher pay. Another workshop participant pointed out that unions can act as coordinating organizations when they agree about the desirability of change. They can provide a structure for implementing change and can suggest ideas about how to make change most effective, as well as new ideas about needed reforms.
Moderator Adam Jaffe asked the panelists about lessons from their work that could have broader applications. Handel emphasized the importance of keeping a low profile while getting buy-in from key people. Ouellette noted that the size of the cultural gap between patent examiners and scientific reviewers created unexpected problems in measuring the costs and benefits of the intervention. She said, “Thinking about how you translate to a different group; what is useful in a particular context; and how, as part of your experimental design, you’re going to measure these costs and incorporate them into your assessment of how valuable the intervention is—that’s useful to think about on the front end.” Pappas reflected on the value of RCTs, even though, when he created his two groups of patent applications, he had little awareness that he was actually creating an RCT. In part because of the success of his experiment, the USPTO has recognized the value of further exploring RCTs and is pursuing steps in that regard.
The panelists also addressed the issue of whether RCTs advantaged different groups of people because of their different treatments, noting that the differences did not cause problems in their studies. “People are being treated differently across government all the time,” said Ouellette. “Even in the normal patent examination process, people are going to be assigned to some patent examiners, and it’s going to be different for other patent examiners.” Also, the funding for programs is limited, so some people are going to receive an intervention and some are not. That does not change the usefulness of structuring a program so that information about its cost-effectiveness can be obtained. A challenge that can arise is explaining to people that they may be treated differently, that the value of those differences is not yet known, and that studying the differences can lead to better outcomes in the long term.
On a related note, all three presenters commented on their studies’ cost-effectiveness. As Handel noted, a $1 million RCT that improved the outcomes of a $10 million USAID program has already paid for itself. Similarly, said Ouellette, small improvements in the accuracy of patent examiners can have large impacts on the economy: “Small amounts of time from people like me on things like this, if they lead to an improvement that actually makes a difference, can easily pay for themselves.” At the same time, Handel cautioned that evaluations can be done poorly or can impose substantial burdens on field staff. “There’s a real case for the expertise model,” he said.
In response to a question about how large a study needs to be to have adequate statistical power, Ouellette agreed that the problem is difficult. Making such an estimate requires having some estimate of the effect size, “and it’s hard to know that ex ante,” she said. Handel suggested setting a minimum detectable effect size that would determine whether the intervention was more cost-effective than cash. Another approach is to establish a policy-relevant effect size, since very small effects are unlikely to be appealing to policymakers. In Pappas’s case, estimates of power were not considered up front because the available data were limited in size, but now that the USPTO is looking into RCTs, he said, “that will probably come into play as they plan these things out.”
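A power calculation of the kind the panelists discussed might look like the following sketch, which solves for the sample size needed to detect a chosen minimum detectable effect on a proportion at conventional settings. The grant rates, significance level, and target power here are illustrative assumptions, not values taken from any of the studies.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative calculation: observations per arm needed to detect a grant rate
# falling from 14 percent to 10 percent with 80 percent power at alpha = 0.05.
# All inputs are assumptions chosen for the example.
effect = proportion_effectsize(0.14, 0.10)   # Cohen's h for the two proportions
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                         power=0.80, ratio=1.0,
                                         alternative='two-sided')
print(f"about {n_per_arm:.0f} observations per arm")
```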
When the panelists were asked about the role of qualitative research in their work, Ouellette pointed to the importance of such research in generating hypotheses to test. For instance, in the early stages of her RCT, she and her colleague did pilot work with scientists to get a sense of the questions they should be asking. Handel agreed, saying, “I always try to have qualitative research, impact evaluations, and [a sense of] the qualitative barriers and facilitators in our syntheses of impact evaluations. If anybody here is in charge of commissioning impact evaluations, try to have some budget for qualitative work as well—it’s very important.”
The presenters were asked about pushback that they might have received because of cultural, legal, and political barriers to change. Ouellette pointed not only to the staff but also to the stakeholders in a given system as potential sources of pushback. For example, she said, the stakeholders in the patent system “care a lot about any change at the patent office.” An RCT can help change the culture, she observed, if it is publicized and used by people within an agency to make improvements. Pappas agreed, saying that reports of the success of RCTs should be widely disseminated. Highlighting such results helps fight against inertia and also facilitates setting up future RCTs.
Handel pointed out that staff at USAID have lots of ideas about potential program improvements, and the organization has taken advantage of the situation by establishing a competition to apply for funds to conduct experiments, which is how the cash benchmarking experiment got started. “Giving the rank-and-file staff the chance to be the drivers of what gets tested may be compelling,” he suggested.
Finally, in response to a question about why some evaluations lead to changes in programs while others do not, Handel pointed out that “what works is not simple, because when we do systematic reviews on a given intervention, there’s a lot of heterogeneity.” The important thing, he continued, is for program designers and implementers to have evidence about why they believe a program will work and make that evidence accessible. Field staff cannot be expected to examine the evidence and synthesize it, but the experts on an intervention need to be able to justify that intervention based on evidence. Ouellette added that “in some ways, it takes a village.” Although a single academic cannot do all the work of justifying a program, groups of people can, she said, including people who can translate evidence for policymakers and enlist support at relevant government agencies.