JAMES E. BAKER AND LAURIE N. HOBART
The Honorable James E. Baker, J.D., is the Director of the Syracuse University Institute for Security Policy and Law, Professor of Law at Syracuse University College of Law, Professor of Public Administration (by courtesy) at the Maxwell School of Citizenship and Public Affairs at Syracuse University, and a judge on the Data Protection Review Court.
Laurie N. Hobart, J.D., is Associate Teaching Professor at Syracuse University College of Law.
CONTENTS
A Brief History from Turing to Today: Waves of AI Development
A Survey of AI Applications “Today”
How Machines Are Trained to Learn: The Machine Learning Life Cycle
Judicial Roles and High-Level AI Takeaways for Judges
There Are Many Different Methodologies and Designs
Most AI Is Iterative and Should Be Tested and Validated Continuously
AI Predicts, It Does Not Conclude
Currently, Accuracy Often Depends on the Volume and Quality of Data
Some AI Is Not Explainable but More Transparent Methodologies Are Being Developed
The Heart of AI Is the Algorithm
Algorithms That Predict Human Behavior
Algorithms for Lawyers and Non-Lawyers: Technology-Assisted Review and . . . Legal Practice?
Federal Rules of Evidence 401–403, 702, 901–902
Standards Controlling an AI Application’s Operation and Maintenance
Artificial Intelligence (AI) now permeates our lives in seen and unseen ways. It informs how we shop, drive, recreate, and work. It guides supply chains and helps with medical diagnoses and documentation. Just as AI is transforming the economy, healthcare, and American society, it will also transform the practice of government and law. Lawyers use AI platforms for discovery, legal research, and drafting. At least seventy-five countries use facial recognition for domestic security and law enforcement purposes.1 AI is used to determine travel patterns, link suspects with crime scenes, and populate watch lists. Between 2011 and 2019, the FBI used its facial recognition algorithm to search federal and state databases, including some state driver’s license databases, over 390,000 times.2 Modern militaries and intelligence services are integrating elements of AI into logistics, weapons, and information systems. “Generative AI” can create language loosely in the style of famous writers, simulate paintings reminiscent of those of the masters, imitate voices and generate fake video of real people, perform academic research at varying levels of sophistication and accuracy, and pass professional exams. The National Security Commission on Artificial Intelligence (NSCAI) has predicted that “[t]he development of AI will shape the future of power.”3
Many of these uses of AI may find their way into U.S. courts. Some already have. Not only will courts see AI-generated evidence in civil and criminal cases; some state courts and jurisdictions already use various algorithmic risk assessments in bail, parole, and even sentencing decisions.4 Judges may need to determine whether to admit AI outputs into evidence or to use AI in judicial decision making. They may also be called upon to adjudicate disputes related to how AI is used in the real world, such as the implications of the use of recommendation algorithms by social media platforms to direct content to particular users, as in the 2023 Supreme Court cases Twitter v. Taamneh and Gonzalez v. Google.5 To address these challenges, judges must understand how AI works, its
1. Steven Feldstein, The Global Expansion of AI Surveillance 1 (Carnegie Endowment for International Peace, Sept. 17, 2019), https://perma.cc/C5H2-T2ZZ; National Security Commission on Artificial Intelligence (NSCAI), Interim Report 12 (Nov. 2019), https://perma.cc/RWP3-ZXP8.
2. U.S. Gov’t Accountability Office, GAO-19-579T, Face Recognition Technology: DOJ and FBI Have Taken Some Actions in Response to GAO Recommendations to Ensure Privacy and Accuracy, But Additional Work Remains (June 4, 2019).
3. NSCAI Interim Report, supra note 1, at 9.
4. “Liberty at Risk: Pre-Trial Risk Assessment Tools in the U.S.,” Electronic Privacy Information Center 1 (updated Sept. 2020), https://archive.epic.org/LibertyAtRiskReport.pdf. As noted below, criminal risk assessments do not all use machine learning today but may increasingly do so in the future.
5. Gonzalez v. Google, LLC, 2023 U.S. LEXIS 2059 (May 18, 2023); Twitter v. Taamneh, 143 S. Ct. 762 (2023).
applications, its implications for the fact-finding process, and its risks. Judges should be able to answer the following four question-sets in context:
This reference guide addresses these questions by providing technical background judges should possess to address legal issues that may arise surrounding AI-generated evidence, litigation over AI-enabled vehicles and applications, and proposed AI judicial tools.
However, describing AI is a challenge. The field incorporates many different technologies, functions, and uses. Almost any application or tool that uses computers, algorithms, and data might be described as AI or touch upon AI. AI is also a moving target; it is by design iterative. By presenting core concepts and terms, this reference guide is intended as a foundation from which judges and court staff can more easily identify and assess this rapidly evolving field. Specific applications and methodologies, however, will invariably change, and change rapidly, over time, including, for example, methods to better identify and understand AI-generated outputs (explainability) as well as to validate digital imagery and voice patterns using AI-enabled tools. We do not provide legal judgments about the use of different AI applications. In discussing how AI is used today and may be used in the future, we do not endorse that use in any particular context or application; those are fact-intensive questions. Rather, we identify core technical concepts and legal issues, so that when judges must decide whether to admit AI applications into evidence or to use AI in a judicial determination, they decide wisely and fairly. Making these decisions requires judges and litigators to know enough about AI to ask the right questions, at the right moment, in the right depth.
We conclude this introduction with a caveat and a statement of purpose. AI is a field of technologies that is changing at an exponential pace. Today’s cutting-edge application may be commonplace tomorrow, or obsolete. New methods will replace old methods. Therefore, it is not important for judges to understand how the IBM computer Deep Blue defeated world champion Garry Kasparov in chess
in 1997. It is important to know that a machine beat a human, which means that machines can be better than humans at certain tasks for which humans otherwise use human intelligence. Likewise, it is not important for judges to know how to label or code training, validating, or testing data. It is important to know that the quality and qualities of AI datasets can impact AI accuracy because datasets, like humans, possess bias. In a specific litigation context, expert testimony may be required on the subject. The goal of this reference guide is not to make judges scientific experts on AI or specific AI applications or methodologies; the field will change before this reference guide goes to print. Judges do not need to be coders or technologists to ask good questions; they need to know good questions to ask. That is what judges do. We hope this reference guide provides a helpful background and sound framework for judges to do so.
This reference guide starts with a brief history and overview of the constellation of technologies associated with AI. The entry point to AI will vary for most jurors and judges; this section offers a quick level set. The next sections take a deeper look at the basics of machine learning, the underpinning of what today is generally known as AI. The reference guide then considers the roles judges will play regarding AI, including overseeing discovery, acting as evidentiary gatekeepers, interpreting law’s application to AI contexts, translating complex questions of technology and law into case law, and considering whether to use AI tools in judicial decision making. We provide some high-level technological takeaways about AI for judges, then focus on two areas likely to generate adjudication: AI bias, and algorithms that predict human behavior. All of this technical knowledge is a predicate to judges understanding the myriad and complex ways AI will come before courts. Informed by this knowledge, the reference guide concludes by addressing how AI may change the practice of law as well as discovery and evidentiary considerations presented by AI, including AI deepfakes. Throughout, the reference guide identifies with bullet points threshold questions judges should ask about AI in context. They are threshold questions because each in turn should lead to additional contextual questions about specific AI applications. A glossary of terms follows the reference guide narrative.
AI scholars and practitioners use multiple definitions of AI, a fluid and expansive field. The National Security Commission on AI provided one helpful definition:
AI is not a single piece of hardware or software, but rather, a constellation of technologies that gives a computer system the ability to solve problems and to perform tasks that would otherwise require human intelligence.6
Each of the components constituting a particular AI system may be subject to legal challenge and validation. Moreover, because many AI components like algorithms are iterative, “learning” as they go, the field is constantly evolving and at a rapid pace. Case law will necessarily need to evolve and adapt as well.
A 2017 Belfer Center Study on AI and national security similarly emphasized that the field of AI is composed of multiple methodologies and technologies:
Artificial Intelligence (AI) is a science and a set of computational technologies that are inspired by—but typically operate quite differently from—the ways people use their nervous systems and bodies to sense, learn, reason, and take action.7
These definitions emphasize that artificial intelligence, while often analogized to human thinking with terms like “artificial neural network” and “neurons,” is achieved by mathematical computation: AI identifies, collects, aggregates, analyzes, and derives meaning from data, or performs tasks that rely on data. The AI is not “learning” as humans do; it is using mathematical models and other techniques to better perform or optimize tasks it has been programmed and trained to perform. AI learning is generally statistical in nature. However, as noted throughout this reference guide, developing and creating AI is a very human endeavor that reflects human choices and biases. Machines may be programmed to act autonomously, like a chatbot creating its own code and responses, but it is humans who write the source code that enables the bot to work and humans who select and define the data on which the bot is trained, tested, and validated.
Therefore, a single, capacious definition of AI is elusive, and rightly so. Depending on the AI application, a system will contain different technologies and methodologies, train on different data, and encounter different real-world data inputs in the field, all of which means each application will also have different strengths and weaknesses. At present, case law on AI is limited,8 but even as it develops, rather than simply relying on precedent, courts will still need to examine each AI system and its real-world application to specific facts and conditions. However, at least three unifying characteristics give machines “AI”: they
6. National Security Commission on Artificial Intelligence (NSCAI), Interim Report 8 (Nov. 2019), https://perma.cc/RWP3-ZXP8.
7. Greg Allen & Taniel Chan, Artificial Intelligence and National Security, Belfer Center for Science and International Affairs, Harvard Kennedy School (July 2017), https://perma.cc/RRF9-29VQ.
8. We have compiled an appendix list of illustrative cases in our longer publication, James Baker, Laurie Hobart, & Matthew Mittelsteadt, An Introduction to Artificial Intelligence for Federal Judges, (Federal Judicial Center 2023), https://perma.cc/7DJT-T9D2.
have (1) the capacity to “learn” (with the caveats above) and (2) the capacity to act (3) based on seed coding (algorithms) designed by humans.
Popular and scientific literature identifies several benchmark events in AI development. In 1950, the English computer scientist and Bletchley Park code breaker Alan Turing wrote an article, “Computing Machinery and Intelligence.” He asked, can machines think, and can they learn from experience as a child does? “The Turing Test” was Turing’s thesis advisor’s name for Turing’s experiment testing the capacity of a computer to think and act like a human. A computer would pass the test when it could communicate with a person in an adjacent room without the person realizing they were communicating with a computer.
In 1956, Dartmouth College hosted the first conference to study AI.9 The host, Professor John McCarthy, is credited by many with coining the term “Artificial Intelligence.”10 The funding proposal submitted to the Rockefeller Foundation stated,
We propose that a 2-month, 10-man study of artificial intelligence be carried out. . . . We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together.11
Notwithstanding this optimistic start, progress in the field was neither linear nor exponential. It occurred in fits and starts. As a result, AI development went through a series of “AI winters,” periods of low funding and meager results. In the past twenty years, however, AI has emerged as one of the transformative technologies of the twenty-first century.
Experts have described three waves of AI development. The first wave of AI machine learning consisted of if-then linear learning, a process that relies on the brute-force computational power of modern computers. With linear learning, a computer is in essence “trained” that if something occurs, then it should take a countervailing or corresponding step. This is how the IBM computer Deep Blue beat Garry Kasparov in chess in 1997, a significant AI milestone. The computer was optimizing its computational capacity to search through possible positions and use decision trees to weigh possible moves in response to Kasparov’s actual and
9. Artificial Intelligence (AI) Coined at Dartmouth, Dartmouth College, https://perma.cc/856H-6NYD (last visited Jan. 1, 2025).
10. Id.
11. Nick Bostrom, Superintelligence: Paths, Dangers, Strategies 6 (2014).
potential moves.12 It did so with the knowledge of all of Kasparov’s prior games, while on the clock in real time. Deep Blue was an impressive demonstration of computational force, a near instantaneous series of if-this-then-that calculations.
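The if-this-then-that search that powered Deep Blue can be suggested with a short illustrative sketch. The toy game tree, scores, and Python code below are invented for demonstration only and bear no relation to Deep Blue’s actual implementation; they show just the core idea of exhaustively evaluating branching possibilities, assuming each side plays its best move.

```python
# Illustrative only: a toy minimax search over a hand-built game tree.
# Inner lists are positions with possible replies; integers are scores
# assigned to final positions. All values are invented.

def minimax(node, maximizing):
    """Return the best score reachable from this position, assuming
    the opponent also plays optimally."""
    if isinstance(node, int):              # leaf: a scored end position
        return node
    scores = [minimax(child, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)

# Three possible first moves, each answered by one of two replies.
game_tree = [[3, 5], [2, 9], [0, 7]]
print(minimax(game_tree, maximizing=True))  # prints 3
```

The computer selects the first branch because its worst outcome (3) is better than the worst outcomes of the alternatives; Deep Blue performed this kind of exhaustive evaluation across hundreds of millions of positions per second.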
We are in (at least) a second wave of machine learning now, the benchmark of which is AlphaGo, the Google computer that beat the world’s best Go player in 2016. The AlphaGo victory was a milestone not just because Go is a more complex, multidimensional game than chess but because AlphaGo won using reinforcement learning: it got better at the game by playing it. AlphaGo improved with experience, adjusting its own decisional weights internally—in the so-called “black box” of internal machine calculation—without training data or other if-this-then-that learning.13 Surpassing brute-force computational power, this was a machine optimizing its capacity. Was it “thinking”? No. But was it “learning”? Yes.
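Reinforcement learning’s central idea, improving value estimates with each experience rather than following pre-written if-then rules, can be sketched in simplified form. The states, actions, rewards, and parameters in this illustrative Python example are all invented; AlphaGo’s actual methods were vastly more sophisticated.

```python
# Illustrative only: the core of reinforcement learning is a value
# update applied after every experience, so the system improves by
# playing. The states (0 through 3), actions, and rewards are invented.
import random

q = {}                     # value estimates, learned from experience
alpha, gamma = 0.5, 0.9    # learning rate and discount factor

def update(state, action, reward, next_state, actions):
    old = q.get((state, action), 0.0)
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    q[(state, action)] = old + alpha * (reward - old + gamma * best_next)

random.seed(0)  # fixed seed so the toy run is repeatable
for episode in range(50):              # fifty simulated "games"
    state = 0
    while state < 3:                   # state 3 is the winning position
        action = random.choice(["left", "right"])
        next_state = state + 1 if action == "right" else max(state - 1, 0)
        reward = 1.0 if next_state == 3 else 0.0
        update(state, action, reward, next_state, ["left", "right"])
        state = next_state

# After play, the system "prefers" the move that led to reward.
print(q[(2, "right")] > q.get((2, "left"), 0.0))  # prints True
```

No one told the program that “right” is the winning move; its internal weights shifted toward that move through repeated play, which is the sense in which such a system “learns.”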
What changed? Experts point to several factors working synergistically—specifically, the development of complex algorithms, strides in computational speed, the invention of new sensors, an explosion in data, and the advent of cloud computing and machine learning.
Merriam-Webster’s Collegiate Dictionary defines an algorithm most “broadly” as “a step-by-step procedure for solving a problem or accomplishing some end.”14 A common example is a recipe for a meal. More narrowly, an algorithm is “a procedure for solving a mathematical problem . . . in a finite number of steps that frequently involves repetition of an operation.” In the AI context, algorithms are rigorously defined steps, written in software, to find, sort, and look for patterns in data. Algorithms are “the set of rules a machine (and especially a computer) follows to achieve a particular goal.”15
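A short illustrative example may make the dictionary definition concrete. The Python sketch below is an algorithm in exactly the sense quoted above: a finite, step-by-step procedure that repeats a single operation to find a pattern (here, the most common value) in invented data.

```python
# Illustrative only: a finite, step-by-step procedure that sorts data
# and repeats one comparison per item to find the most common value.
# The input values are invented.

def most_common(values):
    values = sorted(values)            # step 1: order the data
    best, best_count = values[0], 0
    run, run_count = values[0], 0
    for v in values:                   # step 2: repeat one comparison per item
        run_count = run_count + 1 if v == run else 1
        run = v
        if run_count > best_count:
            best, best_count = run, run_count
    return best                        # step 3: report the pattern found

print(most_common([4, 1, 4, 2, 4, 2]))  # prints 4
```

Every AI algorithm, however complex, ultimately decomposes into defined steps of this kind, written in software to find, sort, and look for patterns in data.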
Today’s algorithms can be (but are not always) much more complex than simple recipes. The Google enterprise utilizes over two billion lines of code, while the search engine operates with 200,000–300,000 lines of code.16 The engine itself combines multiple tools and algorithms that index the web (cataloguing and mapping the internet), incorporate potentially hundreds of search factors, and rank pages and references.
12. Adam Rogers, What Deep Blue and AlphaGo Can Teach Us About Explainable AI, Forbes (May 9, 2019), https://perma.cc/2EJW-JNBB (“The 1997 version was capable of searching between 100 and 200 million positions per second, depending on the type of position, as well as a depth of 20 or more pairs of moves.”).
13. David Silver et al., Mastering the Game of Go Without Human Knowledge, 550 Nature 354 (Oct. 19, 2017), https://doi.org/10.1038/nature24270.
14. Merriam-Webster’s Collegiate Dictionary, https://perma.cc/82FG-EJRD (last visited Aug. 18, 2022).
15. Id.
16. Rachel Potvin, Why Google Stores Billions of Lines of Code in a Single Repository, YouTube (Sept. 14, 2015), https://perma.cc/5S56-5E7Z.
By some accounts “the algorithm,” meaning the collective algorithms and tools, is adjusted 500–600 times a year.17 Thus, there is no single, final Google search algorithm. For many AI applications, not just Google, the algorithm one uses today will be different from the algorithm one uses tomorrow.
The foundational component of a computer is the central processing unit (CPU), which today is composed of billions of electrical components called transistors. The more transistors in a CPU, the more data that can be processed in smaller packages with greater speed.18 Increased computational power is directly tied to the miniaturization of the transistor. So far, the number of transistors in a CPU has followed a trend called “Moore’s Law,” “the observation that the number of transistors on an integrated circuit will double every two years with minimal rise in cost.”19 By one estimate, a 1985 Cray Supercomputer, which took up sixteen square feet, would have had to be expanded to 80,000 square feet, or the size of an office building on two acres of land, to have the same processing speed as a 2020 iPhone 12.20 Humans interact with and program CPUs using programming languages that are ultimately translated into ones and zeros (the only thing a CPU understands) through different layers of software. One of the defining characteristics of AI is its capacity to perform tasks and process data at “machine speed.” While all machines might be said to operate at “machine speed,” this term is used by many to mean in effect “instantaneous” speed, distinguishing the capacity of machines to process information and make calculations from the capacity of humans performing the same or similar tasks.
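The arithmetic behind Moore’s Law is simple doubling. As an illustration only, starting from the Intel 4004’s 2,300 transistors in 1971 (see note 18) and doubling every two years:

```python
# Illustrative arithmetic only: Moore's Law posits a doubling of
# transistor counts roughly every two years. Starting from the Intel
# 4004's 2,300 transistors in 1971 (note 18):
count = 2300
for year in range(1971, 2011, 2):   # twenty two-year doubling periods
    count *= 2
print(f"{count:,}")                 # prints 2,411,724,800
```

Twenty doublings multiply the starting count by roughly a million. The idealized trend overshoots the roughly 560 million transistors note 18 reports for a 2010 Intel Core processor, a reminder that Moore’s Law is an empirical observation, not a physical law.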
The development of sensor technology, such as that found in driverless cars, cell phones, and home devices, has resulted in more data and more applications for using that data to inform and influence human behavior. Personal assistants like Siri, Watson, and Alexa all use sensors to collect data.
17. See Ryan Shelley, 3 Things to Do After a Major Google Algorithm Update, Search Engine Land (Oct. 18, 2016), https://perma.cc/AQ25-SBZA.
18. For example, in 1971 the Intel 4004 processor contained 2,300 transistors, and by 2010 an Intel Core processor had over 560 million transistors, https://perma.cc/TA5X-FQ5E.
19. Moore’s Law, Intel Newsroom, https://perma.cc/VP8Q-9RD6.
20. Chandra Steele, Space Wars: 80s Cray-2 Supercomputer vs. Modern-Day iPhone, PC Magazine (Nov. 23, 2022), https://perma.cc/5AGF-L9TE.
Data drives the AI revolution. For many though not all types of current AI modeling, the more data one has the easier it is to train a computer system to perform a task or solve a problem, and likely the more accurate the result.21 As discussed below, the quality of the data, as well as the metrics selected in designing algorithms, will also affect accuracy and the degree to which different forms of bias will affect accuracy.22
The advent of cloud computing allows more data to be stored on a permanent and centralized, and thus retrievable, basis. As the Supreme Court encountered in Carpenter v. United States, where a criminal prosecution relied on historical cell phone location records that provided “a comprehensive chronicle of the user’s past movements,”23 data can persist for years, even forever, if not purposefully deleted as a matter of law or policy. (The lifespan of data might also be limited by the expense of storing it and the possibility that future software or computer systems might not be able to read its format or encoding.)
A critical technological advancement has been the development of machine learning, which refers to different methodologies to program software-driven machines to learn on their own and thus improve and optimize their functions. After the Dartmouth conference in 1956, the AI field “split into two camps: one focused on symbolic systems, problem solving, psychology, performance, and serial architectures, and the other focused on continuous systems, pattern recognition, neuroscience, learning, and parallel architectures.”24 That second camp, or AI subfield, was machine learning. Much machine learning research is
21. Large language models, discussed below, rely on vast quantities of data, but experts are also developing models that do not depend on large quantities of data. For example, an AI “information lattice” model that helps humans without a musical background compose music was trained on a set of 370 Bach compositions. See Michael O’Boyle, (Re)discovering Music Theory: AI Algorithm Learns the Rules of Musical Composition and Provides a Framework for Knowledge Discovery, U. of Ill. Urbana-Champaign News (Jan. 27, 2023), https://perma.cc/BA62-KK46.
22. See Remarks of Nisheeth Vishnoi at Yale Cyber Leadership Forum, “Session #1: Big Data, Data Privacy, and AI Governance” (Feb. 18, 2022), https://perma.cc/62SK-L3SW.
23. 138 S. Ct. 2206 (2018).
24. Kush R. Varshney, Trustworthy Machine Learning 1 (2022), https://perma.cc/LHL2-ANFP. Dr. Varshney, an IBM Fellow, directs the Human-Centered Artificial Intelligence and
predicated on trying to mimic the human brain (literally, in the case of efforts to replicate the brain using 3D printers) or with neurological metaphors such as “artificial neural networks.” While AI may mimic, and in some cases outperform, human intelligence, it is not actual human intelligence. It is machine capacity and optimization, hence the preferable term: Human-Level Machine Intelligence (HLMI). Machine learning is explained in more detail below.
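The neurological metaphor can be made concrete: an “artificial neuron” is simply a mathematical function, not biology. The weights, inputs, and threshold in this illustrative sketch are invented.

```python
# Illustrative only: an "artificial neuron" weights its inputs, sums
# them, and "fires" if the total clears a threshold. All numbers here
# are invented.

def neuron(inputs, weights, threshold):
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= threshold else 0   # fires (1) or stays silent (0)

print(neuron([1.0, 0.5], weights=[0.6, 0.4], threshold=0.7))  # prints 1
print(neuron([0.0, 0.0], weights=[0.6, 0.4], threshold=0.7))  # prints 0
```

An artificial neural network chains thousands or billions of such functions together; “learning” consists of mathematically adjusting the weights, not of any biological process.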
Specialists describe the AI of today as “narrow,” generally good at addressing discrete tasks or “application areas[,] such as playing strategic games, language translation, self-driving vehicles, and image recognition.”25 Narrow AI capabilities exploded in the early 2010s with a machine learning technique called “deep learning,”26 described in greater depth below.
Beyond narrow AI, computer engineers contemplate the emergence of Artificial General Intelligence, or AGI. Artificial General Intelligence is an AI capacity to multitask, serving multiple purposes. Much like a human, AGI would be able to understand and perform multiple tasks and shift from task to task as needed. Most analysts foresee AGI arriving, if it arrives at all, as a stage in development, like the advent of flight—not necessarily a moment in time, like the Soviet launch of the Sputnik satellite in 1957. AGI will present more complex legal questions than narrow AI. A system that can write and rewrite its own code as well as shift from task to task will be harder to regulate, requiring courts and legislators to wrestle with questions of accountability and responsibility for actions the AI takes or information it provides.
Indications of a potential third wave of AI machine learning were being discussed in 2016, the year of AlphaGo. As the National Artificial Intelligence Research and Development Strategic Plan reported in October of that year,
[t]he AI field is now in the beginning stages of a possible third wave, which focuses on explanatory and general AI technologies. . . . If successful, engineers could create systems that construct explanatory models for classes of real world phenomena, engage in natural communication with people, learn and reason as they encounter new tasks and situations, and solve novel problems by generalizing from past experience.27
Trustworthy Machine Intelligence teams at IBM, and was the founding co-director of the IBM Science for Social Good initiative from 2015–2023.
25. Executive Office of the President, National Science and Technology Council, Committee on Technology, Preparing for the Future of Artificial Intelligence 7 (Oct. 2016), https://perma.cc/BD4K-VC8G.
26. Large, Creative AI Models Will Transform Lives and Labour Markets, The Economist (Apr. 22, 2023), https://perma.cc/7YTH-R25H.
27. National Science and Technology Council, Networking and Information Technology Research and Development Subcommittee, The National Artificial Intelligence Research and Development Strategic Plan 14 (Oct. 2016).
Today, we may well be on the rising swell of a third wave, and somewhere between narrow AI and AGI. In late 2022, OpenAI released ChatGPT, which stunned users with its ability to respond conversationally to prompts and to generate relatively sophisticated text and research products across subject matter, though not without error. ChatGPT is built on a large language model (LLM), a type of “generative AI.” Generative AI models use machine learning to generate or create new images, text, voices, videos, computer code, or other content; some are even being developed to discover new molecules.28 Yet, as might be expected from models trained on text and images from the internet, ChatGPT and other generative AI models have been shown to create false narratives and reports, to embed racism, nationalism, and other biases, and even to generate fake legal cases. Even when trained on factually accurate data, generative AI models such as ChatGPT have produced false or counterfactual outputs, often referred to as “hallucinations.”
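The statistical character of language generation can be suggested with a deliberately tiny sketch. Real LLMs use neural networks trained on terabytes of text; this illustrative Python example merely counts which words follow which in one invented sentence and then generates by repeated next-word prediction, which also hints at why fluent output can nonetheless be wrong.

```python
# Illustrative only: a toy "language model" that records which word
# follows which in one invented sentence, then generates text by
# repeatedly predicting a next word. It has no knowledge of truth,
# only of observed word patterns.
from collections import defaultdict
import random

training_text = "the court may admit the evidence if the evidence is reliable"
words = training_text.split()

pairs = defaultdict(list)
for a, b in zip(words, words[1:]):
    pairs[a].append(b)              # record each observed word pair

random.seed(1)                      # fixed seed so the run is repeatable
word, output = "the", ["the"]
for _ in range(6):                  # generate six more words
    word = random.choice(pairs.get(word, ["."]))  # "." if no pair seen
    output.append(word)
print(" ".join(output))
```

The output is grammatical-sounding recombination of the training text, not reasoning; at vastly greater scale, the same dynamic underlies both the fluency of LLMs and their “hallucinations.”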
Big tech companies are working on AI products similar to ChatGPT,29 with an emphasis on “big”: Thus far, the success of LLMs like ChatGPT has been contingent on vast amounts of training data (estimated at 45 terabytes for ChatGPT30) as well as huge compute capacity and gigawatt-hours of electricity required to train and test the models on that data.31 All of these factors come at significant monetary expense as well. AI researchers are developing methods to reduce the size of generative models, for example by decreasing the number of mathematical connections between artificial neurons.32 Generative AI has been improving “at a dizzying pace, with new models, architectures, and innovations appearing almost daily.”33
No matter how sophisticated and impressive, ChatGPT and other LLMs, with their errors and limitations, are not yet Artificial General Intelligence, or human-level intelligence, though they approach it.34 AGI contemplates that a computer linked to the internet, the Cloud, and the Internet of Things (IoT) might perform multiple tasks, solve problems, or answer questions generally, moving fluidly from one function to the next. Commentators ask three
28. What Is Generative AI?, IBM, https://perma.cc/9J6Z-JHUP.
29. Large, Creative AI Models Will Transform Lives and Labour Markets, The Economist, supra note 26.
30. What Is Generative AI?, McKinsey & Co. 4 (Apr. 2, 2024), https://perma.cc/8RMA-PV3Z.
31. Large, Creative AI Models Will Transform Lives and Labour Markets, The Economist, supra note 26; What Is Generative AI?, IBM, supra note 28.
32. The Bigger-Is-Better Approach to AI Is Running out of Road, The Economist (June 21, 2023), https://perma.cc/Y22K-SCQZ.
33. What Is Generative AI?, IBM, supra note 28.
34. The OpenAI website acknowledges this, stating “We believe our research will eventually lead to artificial general intelligence, a system that can solve human-level problems. Building safe and beneficial AGI is our mission.” https://openai.com/research/overview (last accessed Oct. 18, 2024).
threshold questions when it comes to general AI: Will it happen? When will it happen? And will it be a good or bad thing when it does?
From 2015 to 2016, a group of scholars associated with the Oxford Future of Humanity Institute, AI Impacts, and Yale University surveyed “all researchers who published at the 2015 NIPS and ICML [Workshop on Neural Information Processing Systems and International Conference on Machine Learning] conferences (two of the premier venues for peer-reviewed research in machine learning).”35 The survey asked respondents to estimate when human-level machine intelligence (HLMI) would arrive. The study did not define AGI but stipulated that “Human-level machine intelligence is achieved when unaided machines can accomplish every task better and more cheaply than human workers.” Three hundred fifty-two researchers responded, a response rate of 21%. The results ranged across the board, from fairly soon to beyond 100 years to never. What is noteworthy is that the “aggregate forecast gave a 50% chance of HLMI occurring within 45 years and a 10% chance of it occurring within 9 years.” The two countries with the most survey respondents were China and the United States. The median response for the Americans was seventy-six years; for the Chinese, twenty-eight.36 As the survey indicates, experts do not agree on whether or when we will get to AGI.
Some experts believe that the powers and potential of AI are exaggerated. The authors of this reference guide do not share that view. Neither do the governments and companies dedicating billions of dollars to the development of AI. AI tools and methods will continue to change rapidly. The same factors that led to current AI, including data, computational capacity, and innovative modeling, will generate the next waves of AI, and those factors will only grow. The courts, like other elements of society, must adjust to AI, just as they previously adjusted to computers and electronic filing. Preparation starts with an understanding of what AI is, and is not, and the confidence that, explained in plain language, AI, its strengths, and its weaknesses, can be understood by judges, litigants, and jurors.
Contemporary “narrow AI” can be defined as the “ability of computational machines to perform singular tasks at optimal levels, or near optimal levels, and
35. Katja Grace et al., Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts, 62 J. of Artificial Intelligence Res. 729–54 (July 31, 2018), https://doi.org/10.1613/jair.1.11222.
36. Id.
usually better than, although sometimes just in different ways [than], humans.”37 Under this umbrella come many single-purpose technologies, e.g., the AI that enables facial recognition and driverless vehicles. These technologies are “intelligent” in only one domain, limiting their ability to be used for multiple purposes or deal with certain complex situations.
Applications of machine learning today fall under three broad categories:
(1) cyber-physical systems, (2) decision sciences, and (3) data products. Cyber-physical systems are engineered systems that integrate computational algorithms and physical components, e.g. surgical robots, self-driving cars, and the smart grid. Decision sciences applications use machine learning to aid people in making important decisions and informing strategy, e.g. pretrial detention, medical treatment, and loan approval. Data products applications are the use of machine learning to automate informational products, e.g. web advertising placement and media recommendation.38
Current AI is particularly good at correlating, connecting, and classifying data; recognizing patterns; and weighting probabilities—which is why it is good, and getting better, at tasks like facial recognition, image compression and identification, and voice recognition. But the field of AI is not static; it is moving at a seemingly exponential pace. That is why it is more important for judges to ask the right questions than to “master” a single iteration of AI or AI application. Binding and precedential case law will prove elusive.
At present, narrow AI can be “brittle,” by which engineers mean incapable of adapting to new circumstances on its own and lacking in situational awareness. To illustrate the point, AI philosophers like to point to the thought experiment known as the Trolley Problem, now more commonly conveyed as a crosswalk problem. In the problem, a driverless car loses its brakes at just the moment it is coming up to a crosswalk at speed. There are various pedestrians in the crosswalk of different ages and of different perceived virtues—for example, a pregnant woman and an armed robber carrying a bag of stolen money. The car’s computer must make a choice: swerve left, swerve right, brake, or drive ahead. An alert human driver, after first trying to brake, would in theory make a values-based, ethical choice about where to aim the car, likely at the presumed bank robber. The AI-driven car, on the other hand, unless it has been specifically trained to identify an “armed bank robber” or a “pregnant woman” and to adjust its decisional weights to favor one over the other in a choice scenario, will likely perceive the persons in the crosswalk as “persons in the crosswalk,” no more. Chances are the software code will select the path of least numerical damage. This weakness in current AI is especially important where an AI application is likely to
37. James E. Baker, The Centaur’s Dilemma: National Security Law for the Coming AI Revolution 34 (2020).
38. Varshney, supra note 24, at 2.
encounter changing or novel circumstances, like driving, or where there is an incentive for external actors to spoof or fool the AI application, as might be the case with military, intelligence, and law enforcement surveillance tools.
Many narrow AI applications are known to consumers who rely on them daily. If you shop on Amazon, you are using AI algorithms. Amazon trains its algorithms on data from purchases made on Amazon as well as data from individual consumers. The algorithms then identify patterns in the data and weight those patterns, allowing them to suggest (predict) additional purchases to the shopper. The algorithm adjusts as it goes based on the responses (or lack of responses) from recipients. This is an example of predictive big-data analytics. It is also an example of a push, predictive, or recommendation algorithm.
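For readers who want a concrete, if simplified, picture, the core of such a recommendation algorithm can be sketched in a few lines of Python. The shopping baskets and two-item output below are invented for illustration; a production system would use far richer data and statistical models.

```python
from collections import Counter

# Hypothetical purchase histories; each inner list is one shopper's basket.
baskets = [
    ["tent", "sleeping bag", "lantern"],
    ["tent", "sleeping bag", "stove"],
    ["novel", "bookmark"],
    ["tent", "lantern"],
]

def recommend(item, baskets, k=2):
    """Suggest the k items most often bought alongside `item`."""
    co_counts = Counter()
    for basket in baskets:
        if item in basket:
            co_counts.update(i for i in basket if i != item)
    return [i for i, _ in co_counts.most_common(k)]

print(recommend("tent", baskets))  # ['sleeping bag', 'lantern']
```

The sketch simply counts which items co-occur with the shopper's item; real recommendation systems weight such patterns statistically and adjust them as responses arrive.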
Why do companies use AI? Former Secretary of the Navy Richard J. Danzig explains:
. . . machines can record, analyze and accordingly anticipate our preferences, evaluate our opportunities, perform our work, etc. better than we do. With ten Facebook “likes” as inputs, an algorithm predicts a subject’s other preferences better than the average work colleague, with 70 likes better than a friend, with 150 likes better than a family member and with 300 likes better than a spouse.39
Narrow AI is also embedded in mapping applications, which sort through route alternatives with constant, near-instantaneous calculations factoring speed, distance, and traffic to determine the optimum route from A to B. Then the application uses AI to convert numbered code into natural language telling the driver to turn left or right.
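The route-optimization step can be illustrated with Dijkstra's algorithm, a classic shortest-path method. The road network and travel times below are hypothetical; commercial mapping systems use far more elaborate variants that also account for live traffic.

```python
import heapq

# Hypothetical road network: travel times in minutes between intersections.
roads = {
    "A": {"B": 5, "C": 2},
    "B": {"D": 4},
    "C": {"B": 1, "D": 7},
    "D": {},
}

def fastest_route(graph, start, end):
    """Dijkstra's algorithm: returns (total_minutes, route)."""
    queue = [(0, start, [start])]          # (cost so far, node, path taken)
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)  # cheapest candidate first
        if node == end:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, minutes in graph[node].items():
            if neighbor not in seen:
                heapq.heappush(queue, (cost + minutes, neighbor, path + [neighbor]))
    return float("inf"), []

print(fastest_route(roads, "A", "D"))  # (7, ['A', 'C', 'B', 'D'])
```

Note that the "obvious" direct route A→B→D (9 minutes) loses to the less obvious A→C→B→D (7 minutes): the value of the algorithm is exhaustively comparing alternatives faster than a human could.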
AI is also used to spot minute changes and anomalies in stock pricing and to generate automatic sales and purchases of stock in response. All of this is based on algorithms created and initiated by humans but programmed to act autonomously and automatically because the calculations are too large, the margins too small, and the speed too fast for humans to keep pace and make decisions in real time. Of course, as one trader’s algorithm gets faster, the next trader must change their algorithm’s design, speed, or both to achieve an advantage, reducing the window of opportunity for real-time human control even further. AI machine learning and pattern recognition are also used for translation, logistics planning, and spam detection, among many, many more commercial applications. In 2017, Andrew Ng, the former chief scientist for Baidu, declared AI “the new electricity.”40
39. See Richard Danzig, An Irresistible Force Meets a Moveable Object: The Technology Tsunami and the Liberal World Order, in The World Turned Upside Down: Maintaining American Leadership in a Dangerous Age (Nicholas Burns, Leah Bitounis, & Jonathon Price eds., 2017).
40. Why AI Is the “New Electricity,” Knowledge @ Wharton Staff (Nov. 7, 2017), https://perma.cc/HD93-E9WH.
A prominent illustration of emerging next-generation AI is the driverless vehicle, or more precisely, the software processes and hardware that enable driverless vehicles. AI empowers driverless cars by performing myriad data input and output tasks simultaneously, as a driver does, but in a different way. Human drivers rely on intuition, instinct, experience, and rules to drive—seemingly all at once—using the neural networks of the brain. In driverless cars, sensors instantaneously feed computers data based on speed, conditions, and images of the sort ordinarily processed by the driver’s eyes and brain. The car’s software processes the data to determine the best outcome based on probabilities and based on what it has been programmed to understand and decide. This requires constant algorithmic calculations that a human actor could not make in real time. Luckily, humans do not rely on math to drive cars. They exercise their judgment and intuition, which is why alert drivers generally handle situational change better than AI applications do. On the other hand, AI can possess 360-degree vision and does not fall asleep at the wheel, text while driving, or drive drunk. Driverless cars will and do make different kinds of mistakes than humans—but in theory and presumably in practice, fewer of them.
One of the most successful applications of AI to date is found in the field of medical diagnostics. Here, narrow AI’s capacity to detect and match patterns and find anomalies has led to breakthroughs in the detection of tumors as well as the onset of diabetic retinopathy. In places like India, where there is a shortage of ophthalmologists, the use of such screening diagnostics can help prioritize access to doctors and treatment by identifying at-risk patients.41 Studies indicate that AI can be or may soon be more accurate than humans in detecting certain cancerous tumors or making prognostic findings about their growth and treatment.42 However, that is not the same as saying that humans are prepared to rely on AI alone, or wish to receive medical diagnoses from machines rather than doctors.
Law enforcement authorities use AI for predictive policing and surveillance. Some seventy-five countries use AI-powered surveillance, including many liberal democracies.43 According to a 2019 Government Accountability Office report, the FBI has logged hundreds of thousands of searches of its facial recognition system, which has access to 641 million face photos.44 The FBI
41. Cade Metz, India Fights Diabetic Blindness with Help from AI, N.Y. Times, Mar. 10, 2019, https://www.nytimes.com/2019/03/10/technology/artificial-intelligence-eye-hospital-india.html.
42. See Elizabeth Svoboda, Artificial Intelligence Is Improving the Detection of Lung Cancer, 587 Nature S20 (Nov. 18, 2020), https://doi.org/10.1038/d41586-020-03157-9; Nadia Jaber, Can Artificial Intelligence Help See Cancer in New, and Better Ways?, National Cancer Institute (Mar. 22, 2022), https://perma.cc/A8HQ-5PW8.
43. Steven Feldstein, The Global Expansion of AI Surveillance 1–2 (Carnegie Endowment for International Peace, Sept. 17, 2019), https://perma.cc/C5H2-T2ZZ; National Security Commission on Artificial Intelligence, Interim Report 12 (Nov. 2019), https://perma.cc/RWP3-ZXP8.
44. Drew Harwell, FBI, ICE Find State Driver’s License Photos Are a Gold Mine for Facial-Recognition Searches, Wash. Post, July 7, 2019, https://perma.cc/3WSW-MRSG. See U.S. Gov’t
reported its system has proven 86% accurate at finding the right person, provided a search was able to generate a list of fifty possible matches.45
Generative AI differs from traditional predictive models,46 such as those used in driverless cars, shopping algorithms, and image recognition. Those predictive models classify data, identify patterns, and use statistics to predict or match other real-world examples with the data provided: For example, facial recognition applications are trained to identify images of humans and predict which images from a data bank might most closely match a real-world photo of a person.47
Generative AI, while still using statistics and prediction,48 creates new content. Early models developed in the 2010s could generate deepfake videos of political figures and excellent simulations of famous paintings.49 ChatGPT has been used to author academic papers, poems, essays, books, and code;50 GPT-4, released in March 2023, passed the July 2022 uniform bar exam with a score “approaching the 90th percentile” of human test-takers.51 ChatGPT and similar large language models (LLMs) use statistics to auto-fill or complete sentences. Today’s generative models train on datasets of sample text, images, or other content. The models “compress a dataset into a dense representation, arranging similar data points closer together in an abstract space.”52 For LLMs, the dataset is vast quantities of sample text, where common parts of words are represented in numbers called “embeddings.”53 The LLM can “sample from this space to create something new while preserving the dataset’s most important features.”54 An “attention” feature helps the LLM predict which embeddings from the training data will best match and complete a phrase and eventually a sentence.55 Modern LLMs can predict the parts of a sentence in parallel or simultaneously.56
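A toy model can illustrate the statistical core of sentence completion. The sketch below uses a simple bigram (word-pair) frequency table rather than embeddings and attention, and its three-sentence "corpus" is invented, but the principle is the same: predict the most statistically likely next word given what came before.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; a real LLM trains on vast text and uses learned
# embeddings and attention rather than a simple word-pair table.
corpus = "the court ruled . the court adjourned . the jury ruled".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1          # count how often nxt follows prev

def predict_next(word):
    """Return the statistically most likely next word after `word`."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # 'court' -- seen twice in training, vs. 'jury' once
```

Even this toy exhibits the key property judges should note: the model outputs what is statistically likely given its training data, not what is true.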
Accountability Off., GAO-16-267, FACE Recognition Technology: The FBI Should Better Ensure Privacy and Accuracy (May 16, 2016); U.S. Gov’t Accountability Off., GAO-19-579T, Face Recognition Technology: DOJ and FBI Have Taken Some Actions in Response to GAO Recommendations to Ensure Privacy and Accuracy, But Additional Work Remains (June 4, 2019).
45. Harwell, supra note 44.
46. What Is Generative AI?, McKinsey & Co., supra note 30.
47. Id.
48. Large, Creative AI Models Will Transform Lives and Labour Markets, The Economist, supra note 26.
49. See What Is Generative AI?, IBM, supra note 28.
50. Id.; Large, Creative AI Models Will Transform Lives and Labour Markets, The Economist, supra note 26.
51. Debra Cassens Weiss, Latest Version of ChatGPT Aces Bar Exam with Score Nearing 90th Percentile, ABA Journal (Mar. 16, 2023), https://perma.cc/7DDQ-QVCY.
52. What Is Generative AI?, IBM, supra note 28.
53. Large, Creative AI Models Will Transform Lives and Labour Markets, The Economist, supra note 26.
54. What Is Generative AI?, IBM, supra note 28.
55. Large, Creative AI Models Will Transform Lives and Labour Markets, The Economist, supra note 26; What Is Generative AI?, IBM, supra note 28.
56. Large, Creative AI Models Will Transform Lives and Labour Markets, The Economist, supra note 26; What Is Generative AI?, IBM, supra note 28.
Because they are based on statistical prediction and the content of training datasets, ChatGPT and similar models still make errors. While GPT-4 outperformed most bar takers, it still answered only 75.7% of the multiple-choice questions correctly and received scores around 4.2 out of 6 on the essays.57 As noted earlier, ChatGPT has been shown to generate false information and research (sometimes referred to as “hallucinations”58), and sometimes quite persuasively. Some federal judges have ordered that litigants disclose whether they have used generative AI in their research and briefing, and if so, affirm that they have verified its accuracy.59 At least one state bar has cautioned members that the use of generative AI for legal research, depending on how it is used, may violate client confidentiality were the AI to incorporate the research into its ongoing dataset.60 We anticipate disputes over waiver of privilege. The use of AI chatbots as legal “receptionists” may result in the creation of an attorney-client relationship, or the appearance of one, without the attorney being aware.61
Generative AI can reproduce racism, nationalism, sexism, and every other bias found on the internet.62 The 2016 chatbot Tay, for example, generated and spread hate within hours of its public release, adopting the language and values found on Twitter.63 (Similarly, non-generative social media “recommendation” or “push” algorithms that promote pre-existing content to certain users allegedly have been used to spread and amplify ethnic hatred, such as against the Rohingya in Myanmar;64 disinformation, such as reported by Nobel Peace Prize recipient Maria Ressa in the Philippines65 and widely observed in U.S. elections; and terrorist content, as the Supreme Court heard in Gonzalez v. Google and Twitter v. Taamneh.66)
57. Weiss, supra note 51.
58. Karen Weise & Cade Metz, When A.I. Chatbots Hallucinate, N.Y. Times, updated May 9, 2023, https://www.nytimes.com/2023/05/01/business/ai-chatbots-hallucination.html.
59. See, e.g., Judge Brantley Starr, Mandatory Certification Regarding Generative Artificial Intelligence (N.D. Tex.), https://perma.cc/GQ34-JLUM.
60. Florida Bar Ethics Opinion 24-1 (Jan. 19, 2024), https://perma.cc/PN5Z-E76Z.
61. Id.
62. Marie Lamensch, Generative AI Tools Are Perpetuating Harmful Gender Stereotypes, Center for International Governance Innovation (June 14, 2023), https://perma.cc/A56E-ZJKX; Zachary Small, Black Artists Say A.I. Shows Bias, With Algorithms Erasing Their History, N.Y. Times (July 4, 2023), https://www.nytimes.com/2023/07/04/arts/design/black-artists-bias-ai.html.
63. Oscar Schwartz, In 2016, Microsoft’s Racist Chatbot Revealed the Dangers of Online Conversation: The Bot Learned Language from People on Twitter—But It Also Learned Values, IEEE Spectrum (Nov. 25, 2019), https://perma.cc/5N4S-M8N9.
64. Myanmar: Facebook’s Systems Promoted Violence Against Rohingya; Meta Owes Reparations, Amnesty International (Sept. 29, 2022), https://perma.cc/53WW-3E3S.
65. Maria Ressa, Facts, The Nobel Prize (2021), https://perma.cc/CM45-Q4PH.
66. Gonzalez v. Google, LLC, 2023 U.S. LEXIS 2059 (May 18, 2023); Twitter v. Taamneh, 143 S. Ct. 762 (2023).
The content created by generative AI models can be so realistic as to be misleading, raising significant concerns about models and their products being used wittingly and unwittingly for disinformation and misinformation, potentially polluting the internet at overwhelming levels. There is also significant concern that generative AI will be used to develop harmful and very real products, such as bioagents.67
This survey of today’s AI suggests that AI will both improve and necessarily be imperfect. For each potential application, humans must decide whether it is wise and fair to use AI in the first instance. Where it is employed, AI will generally be best used to augment rather than supplant human judgment. One issue AI policy makers, designers, and ethicists must resolve in context is how to structure human-machine teaming to allocate responsibility and accountability. Judges in turn will have to determine whether as a matter of law, or a matter of law and fact, the humans made the correct decisions: that is, whether the humans appropriately applied AI to a problem, selected an appropriate algorithm, and interpreted the results correctly. Judges will also have to consider to what extent, if any, they should rely on AI applications to inform their own decisions.
Machine learning (ML) is a subfield68 of and “one of the most important technical approaches to AI.”69 As the name implies, it is a set of methods computers use to “learn,” both to perform tasks and to improve on that performance, by making predictions or classifications based on input data.70 Dr. Kush Varshney writes:
The term machine learning was popularized in Arthur Samuel’s description of his computer system that could play checkers, not because it was explicitly programmed to do so, but because it learned from the experiences of previous games. In general, machine learning is the study of algorithms that take data and information from observations and interactions as input and generalize from specific inputs to exhibit traits of human thought. Generalization is a
67. How Generative Models Could Go Wrong: A Big Problem Is That They Are Black Boxes, The Economist (Apr. 19, 2023), https://perma.cc/7K7W-ZPEP.
68. What Is Artificial Intelligence (AI)?, IBM, https://perma.cc/L9YW-QN7V (Machine learning and deep learning are subcategories of AI and “disciplines . . . comprised of AI algorithms which seek to create expert systems which make predictions or classifications based on input data.”).
69. Executive Office of the President, supra note 25, at 8.
70. Christopher Molnar, Interpretable Machine Learning: A Guide for Making Black Box Models Explainable (2d ed. 2022), https://perma.cc/L98E-J4PM; What Is Artificial Intelligence (AI)?, IBM, supra note 68.
process by which specific examples are abstracted to more encompassing concepts or decision rules.71
For example, the constellation of technologies that make up AI allows mathematically driven algorithms to perceive, break down, and connect patterns in data or digital sequences in images. Machine learning allows these machines not only to recognize patterns, such as those employed for facial recognition, shopping, and medical diagnosis, but also to classify, or generalize, on their own by associating certain patterns with desired outputs, e.g., potential matches in a facial recognition database.
Varshney outlines six stages in the development and life cycle of a machine learning system: (1) problem specification; (2) data understanding; (3) data preparation; (4) modeling; (5) evaluation; and (6) deployment and monitoring.72 As we will emphasize, humans are involved at every point along the continuum of that life cycle, infusing their choices into eventual AI outputs.73 Experts also point to the importance of involving all stakeholders—i.e., those individuals or groups who might use or be affected by the AI application, including traditionally marginalized groups—at all stages of the machine learning life cycle.74
71. Varshney, supra note 24, at 1. See also Executive Office of the President, supra note 25, at 8 (“Modern machine learning is a statistical process that starts with a body of data and tries to derive a rule or procedure that explains the data or can predict future data.”).
72. Varshney, supra note 24, at 14–22 (citing the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology); see also Nick Hotz, What is the Data Science Process?, https://perma.cc/W8DU-SQ94 (identifying the CRISP-DM and other life cycle models for data science).
73. David Leslie, Understanding artificial intelligence ethics and safety: A guide for the responsible design and implementation of AI systems in the public sector, The Alan Turing Institute 16 (2019), https://doi.org/10.5281/zenodo.3240529.
74. Varshney, supra note 24, at 15, 17 (citing Meg Young, Lassana Magassa, & Batya Friedman, Toward Inclusive Tech Policy Design: A Method for Underrepresented Voices to Strengthen Tech Policy Documents, 21 Ethics & Information Tech. 89–103 (June 2019)), https://doi.org/10.1007/s10676-019-09497-z; Leslie, supra note 73, at 16; see also Artificial Intelligence Ethics Framework for the Intelligence Community, Version 1.0 (June 2020), https://perma.cc/T6UT-JJL3 (“Identifying and addressing risk is best achieved by involving appropriate stakeholders. As such, consumers, technologists, developers, mission personnel, risk management professionals, civil liberties and privacy officers, and legal counsel should utilize this framework collaboratively, each leveraging their respective experiences, perspectives, and professional skills.”).
The machine learning process begins with a question, problem, or task that a machine might address. The goal might be to automate a task or aid and improve upon human decision making,75 such as detecting tumors or winnowing job applications. Even in this initial framing, humans are making choices that may reflect their own values and biases. Judges and lawyers will recognize this as analogous to the framing of a legal issue or question, or of a question in a survey or cross-examination. (Indeed, legal minds may find many connections and disjunctions between the rules and decision trees created by AI systems and those established by law.) Within the category of facial recognition algorithms, for example, it is a much different ethical undertaking to code and train a system to track a minority group, such as Uyghurs in China, than it is to develop a system to recognize missing children for Amber Alert purposes. The United States Intelligence Community has adopted an “Artificial Intelligence Ethics Framework” with lists of questions for government actors, including: “What is the goal you are trying to achieve by creating this AI? . . . Is there a need to use AI to achieve this goal? Can you use other non-AI related methods to achieve this goal with lower risk?” as well as “What benefits and risks, including risks to civil liberties and privacy, might exist when this AI is in use? Who will benefit? Who or what will be at risk? What is the scale of each and likelihood of the risks? How can those risks be minimized and the remaining risks adequately mitigated? Do the likely negative impacts outweigh likely positive impacts?”76 The European Union has taken the approach of identifying certain AI uses as prohibited, and identifying other AI applications as “high risk,” requiring independent verification and testing.77
At this initial stage, the human “problem owners” and AI technologists must also determine the metrics by which they will measure the success of the AI, another value-laden decision point.78 For example, if the State Department decided to make foreign aid allocations to nation-states with the help of an algorithm, it might choose to measure success by how much a recipient state’s total GDP rose in future years or, instead, by what percentage of children received secondary-level education and certain vaccines. The two metrics might align, but they might not, or they might align but only to a point. Each metric also reflects different micro- and macro-level values.
75. Varshney, supra note 24, at 16.
76. Artificial Intelligence Ethics Framework for the Intelligence Community, supra note 74.
77. European Commission, Regulatory Framework Proposal on Artificial Intelligence, Sept. 29, 2022, https://perma.cc/EUJ5-SJZT.
78. Varshney, supra note 24, at 17; see also remarks of Nisheeth Vishnoi, Yale Cyber Leadership Forum, “Session #1: Big Data, Data Privacy, and AI Governance,” Feb. 18, 2022.
The next step is selecting and collecting the relevant datasets that the machine learning model will use.79 The data might be “numbers, photos, or text, like bank transactions, pictures of people or even bakery items, repair records, time series data from sensors, or sales reports.”80 Some of the data will be “used as training data, or the information the machine learning model will be trained on.”81
When designing machine learning algorithms, engineers divide data into three sets: training data, validation data, and testing data.82 Training data is intended for the AI, curated and sometimes labeled so that the AI can analyze it, learn from it, and adjust its coding to ultimately form better predictions. Validation data is intended for the developer, who uses this data to stress test the model’s training and decide whether to update its settings and sensitivities. Testing data provides a final round of analysis; this data is used on the trained, tuned model to evaluate the model’s fit to the data and overall accuracy.
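The three-way division can be sketched as follows. The dataset, the fixed random seed, and the 70/15/15 proportions are illustrative; practitioners choose proportions to fit their data and task.

```python
import random

# Hypothetical dataset of 100 labeled examples (represented here by numbers).
examples = list(range(100))
random.seed(0)        # fixed seed so the illustration is reproducible
random.shuffle(examples)

# One common (not universal) split: 70% train, 15% validation, 15% test.
train      = examples[:70]
validation = examples[70:85]
test       = examples[85:]

# The sets must be disjoint: "leakage" between them makes the model look
# more accurate than it would be on genuinely new data.
assert not set(train) & set(validation) and not set(train) & set(test)
print(len(train), len(validation), len(test))  # 70 15 15
```

The disjointness check in the last lines is the programmatic analogue of the leakage concern discussed below: a model evaluated on data it has already seen proves little.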
The selection of each of these sets of data is critical to the accuracy and fairness of the final product, hence the phrase “garbage in, garbage out.”83 “Algorithmic bias” is discussed at length below, but by way of example here, two common types of bias in machine learning are “temporal bias” and “population bias.” Temporal bias concerns the timing of data collection and how that timing might affect accuracy.84 This might occur if a university used an admissions AI algorithm that was trained on admissions data from the 1960s, before many American colleges accepted women. An algorithm so trained might not see fit to admit women (or rather, its outputs would not suggest doing so). Population bias occurs where some inputs are underrepresented and others overrepresented,85 illustrated by underperforming facial recognition algorithms. If the training datasets do not contain sufficient numbers of photographs of faces of different skin colors and genders, there will be output disparity across “race” and gender. Social bias in algorithms can also “stem[] from prejudice: labels from historical human decisions contain systemic differences across groups.”86 As discussed below, this concern comes up in criticisms of predictive policing algorithms and
79. Varshney, supra note 24, at 18–19; see also CRISP-DM Help Overview, IBM, https://perma.cc/XB3C-N6TH.
80. Sara Brown, Machine Learning, Explained, MIT Management Sloan School (Apr. 21, 2021), https://perma.cc/R74J-SDDC.
81. Id.
82. Tarang Shah, About Train, Validation and Test Sets in Machine Learning, Medium (Dec. 6, 2017), https://perma.cc/6VRX-MUSP.
83. Coined by Wilf Hey, computer scientist at IBM (Varshney, supra note 24, at 40).
84. Varshney, supra note 24, at 18.
85. Id.
86. Id. Quotations are used around “race” to emphasize the prevailing scientific and sociological understanding that race is a social construct, rather than a biological one.
criminal-risk assessment tools. Datasets might also include proxies, or pieces of data that might correlate with another type of information, such as a zip code for demographic information or, in some cultures, last names for religion.87 Housing and employment data used in criminal-risk assessments have proved to be strong proxies for “race” and class.88 An algorithm may not explicitly be programmed to make such a connection but it may nonetheless learn to do so. Proxies can be subtle (and sometimes unexpected by human programmers). For example, an individual’s online presence might serve as a proxy for age: An AI screening tool might distinguish between applicants with social media accounts and those without, or between types of social media accounts, both of which distinctions often reflect a user’s age.
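A small, deliberately artificial example shows how a proxy works. In the synthetic records below, zip code perfectly tracks group membership, so a model given only zip codes reproduces the group disparity exactly, even though the protected attribute is never supplied to it.

```python
# Synthetic, deliberately extreme records: (zip_code, group, approved).
# The protected attribute ("group") is never given to the model, but here
# zip code tracks it perfectly, so zip functions as a proxy.
history = [
    ("11111", "A", 1), ("11111", "A", 1), ("11111", "A", 1),
    ("22222", "B", 0), ("22222", "B", 0), ("22222", "B", 1),
]

def approval_rate(records, field, value):
    """Share of records approved, among records whose `field` equals `value`."""
    matching = [approved for *keys, approved in records if keys[field] == value]
    return sum(matching) / len(matching)

# Selecting on zip code (field 0) reproduces the disparity between groups:
print(approval_rate(history, 0, "11111"), approval_rate(history, 0, "22222"))
```

Real-world correlations are rarely this perfect, but the mechanism is the same: removing a protected attribute from the inputs does not remove its influence if a correlated feature remains.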
The data preparation stage of the machine learning life cycle involves multiple steps. Data scientists and data engineers first integrate data from multiple, disparate sets into one set in a single format; they then “clean” it by, among other things, filling in missing information, discarding other information, looking for outliers, and dropping features that might lead to data leakage (where the same data used to train the AI are encountered in actual use, potentially tainting the result by making it “easier” or more likely for the AI to find a match than would ordinarily occur in real-world conditions) or that might be unethical or illegal to consider.89 The last step is feature engineering, or “mathematically transforming features to derive new features.”90 In short, there is much room for “creativity” and choice91 at the data preparation phase, so the expertise, care, and worldview of the data scientist and data engineer have real impact.
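A simplified sketch of the cleaning and feature-engineering steps follows. The records and field names are synthetic, and every rule in it (the outlier threshold, the fill-in strategy, the derived feature) is a human choice of exactly the kind that gives data engineers "real impact."

```python
import math

# Synthetic records with hypothetical fields.
raw = [
    {"income": "52000", "zip": "11111"},
    {"income": None,    "zip": "22222"},   # missing value to fill in
    {"income": "48000", "zip": "33333"},
    {"income": "9e9",   "zip": "44444"},   # implausible outlier to discard
]

def prepare(records):
    # Discard outliers first, then fill gaps with the mean of what remains.
    kept = [r for r in records
            if r["income"] is None or float(r["income"]) < 1_000_000]
    known = [float(r["income"]) for r in kept if r["income"] is not None]
    mean = sum(known) / len(known)
    return [
        {
            "zip": r["zip"],
            # Feature engineering: derive a new feature from an existing one.
            "log_income": math.log(
                float(r["income"]) if r["income"] is not None else mean
            ),
        }
        for r in kept
    ]

prepared = prepare(raw)
print(len(prepared))  # 3 records survive; the outlier is dropped
```

Note that a different engineer might fill the gap with a median instead of a mean, pick a different outlier threshold, or drop the incomplete record entirely; each choice changes what the model later learns.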
Once the data are curated and prepped, engineers “choose a machine learning model to use, supply the data, and let the computer model train itself to find patterns or make predictions. Over time the human programmer can also
87. Id.
88. Chelsea Barabas et al., An Open Letter to the Members of the Massachusetts Legislature Regarding the Adoption of Actuarial Risk Assessment Tools in the Criminal Justice System, Berkman Klein Center for Internet & Society 3 (Nov. 9, 2017), https://perma.cc/M77Z-MW7J (citing Sonja B. Starr, Evidence-Based Sentencing and the Scientific Rationalization of Discrimination, 66 Stan. L. Rev. 803 (2014)).
89. Varshney, supra note 24, at 18–19.
90. Id. at 19.
91. Id.
tweak the model, including changing its parameters, to help push it toward more accurate results.”92
In general, there are, at least currently, three categories of machine learning models: supervised learning, unsupervised learning, and reinforcement learning. Each category is described in Types of Machine Learning Models and Other Terminology, below.
After a model is made and the machine is trained on training data, it is next “stress-tested”93 with new data. “Some data is held out from the training data to be used as evaluation data, which tests how accurate the machine learning model is when it is shown new data. The result is a model that can be used in the future with different sets of data.”94 Experts are quick to point out that “leakage” between the training dataset and the testing dataset will undermine the value of the tests; if the machine is being shown the same data on which it was trained, its performance will seem better than it might be against new data. The remedy is clear: separate and distinct datasets; however, data lineage is not always known, clear, or available in the ways lawyers and judges understand chain of custody.
The accuracy or reliability of an AI system can degrade over time if inputs shift away from training data95 or immediately if new inputs and training data simply differ. As such, AI systems must be constantly and continuously monitored and tested even after they are deployed in the field.96 A system trained to recognize video footage of battlefield tanks at night might, for example, misidentify a dark box that is actually an air-conditioner unit on the side of a building as a tank.97 This is an example of brittleness and technical bias.
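Continuous monitoring can be as simple as tracking accuracy over a sliding window of recent predictions and flagging when it falls below a chosen threshold. The outcome log, window size, and threshold below are invented for illustration; choosing the right threshold is itself a context-specific judgment.

```python
# Hypothetical monitoring log for a deployed model over successive
# predictions: 1 = correct, 0 = error.
outcomes = [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0]

def rolling_accuracy(outcomes, window=4):
    """Accuracy over each sliding window of recent predictions."""
    return [sum(outcomes[i:i + window]) / window
            for i in range(len(outcomes) - window + 1)]

ALERT_THRESHOLD = 0.5   # illustrative; the right threshold depends on context
accuracies = rolling_accuracy(outcomes)
drifted = any(a < ALERT_THRESHOLD for a in accuracies)
print(drifted)  # True: later windows fall below the threshold
```

In this synthetic log the system performed well early and degraded later, the pattern one would expect if real-world inputs drifted away from the training data.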
Precisely because some ML-driven AI continues to learn or degrade as it operates, the relevance and reliability of AI output can be moving targets. It is important to keep in mind that AI models might reinforce inaccurate or biased
92. Brown, supra note 80.
93. Varshney, supra note 24, at 22.
94. Brown, supra note 80.
95. Varshney, supra note 24, at 22.
96. James E. Baker et al., National Security Law and the Coming AI Revolution, Observations from a Symposium Hosted by Syracuse University Institute for Security Policy and Law and Georgetown Center for Security and Emerging Technology, Oct. 29, 2020 (2021), https://perma.cc/8QKL-7W26.
97. Id.
results as they learn. Opening discovery or testimony to AI/ML potentially opens the door to an array of subordinate questions involving methodologies, data, and testing. Judges may thus have to determine where it is essential for fact finders to understand the underlying methodologies, or just the conclusions or results derived from them.
In sum, not only does the AI field of study continue to evolve, but many individual AI applications also learn and change (for better or worse) over the course of their life cycles. Courts may need to test the reliability and validity of a given individual application both at precise points in its life cycle and against current best practices in the AI field. A court might examine not just how a particular AI application performed during its testing phase, but how it performed later in its life cycle, in the given, factually contested moment. A court might also consider whether a particular AI application meets the current best practices or standards in the AI field, such as those regarding the explainability and transparency of its computations and results. The evolving nature of AI also means that case law precedent may have limited or diminishing value when it comes to certain AI applications. Unlike polymerase chain reaction (PCR) DNA testing, for example, evidence generated by the same AI tool may vary in accuracy and reliability at different times.
As alluded to above, there are three categories of machine learning models: supervised learning, unsupervised learning, and reinforcement learning. (Those categories, however, “simplify the enormous amount of complexity and variation in these systems.”98)
“In supervised learning, the input data includes observations and labels; the labels represent some sort of true outcome or common human practice in reacting to the observation.”99 For example, a dataset of photographs of lions and tigers might have photos that humans have labeled lion or tiger.100 Running through its algorithms, the machine would learn on its own ways to discern whether new photos of big cats were lions or tigers (or perhaps neither). Here, “true outcome” should be read as referring to an objective reality. In this example, the “true outcome” of lion or tiger is straightforward. But one could imagine ways in which the very act of labeling a piece of data might be value-laden—for example, whether a particular applicant was a “strong” or “weak” candidate for a loan. “For the machine learning system to make even better decisions than people, true outcomes rather than decisions should ideally be the labels, e.g. whether an applicant defaulted on their loan in the future rather than the approval decision.”101
98. Ben Buchanan & Taylor Miller, Machine Learning for Policymakers: What It Is and Why It Matters 5 (Cyber Security Project, Belfer Center, June 26, 2017), https://perma.cc/TZ4Q-UAX8.
99. Varshney, supra note 24, at 1. See also Matthew Mittelsteadt, Artificial Intelligence: An Introduction for Policymakers, Mercatus Special Study (Mercatus Center at George Mason University, Feb. 2023).
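The lion-or-tiger example can be sketched in a few lines of code. This is a toy illustration of supervised learning, not any real system: the two-number “features” (a size score and a stripe-contrast score), the values, and the nearest-neighbor “model” are all invented for illustration.

```python
# A minimal sketch of supervised learning: classify a new observation
# by its nearest labeled neighbor. Features and values are invented.

def train(examples):
    """'Training' here simply stores the labeled observations."""
    return list(examples)

def predict(model, features):
    """Label a new observation by its closest labeled example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda ex: dist(ex[0], features))[1]

# Labeled input data: (features, human-supplied label)
labeled = [
    ((9.0, 0.1), "lion"),   # large, low stripe contrast
    ((8.5, 0.2), "lion"),
    ((8.8, 0.9), "tiger"),  # large, high stripe contrast
    ((9.2, 0.8), "tiger"),
]

model = train(labeled)
print(predict(model, (8.9, 0.85)))  # a stripey new photo -> tiger
```

Given human-labeled examples, the program assigns a label to a new, unlabeled photo by similarity to the labeled ones, which is the essence of learning from labeled observations.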
“In unsupervised learning, the input data includes only observations.”102 Algorithms “analyze and cluster unlabeled datasets” to “discover hidden patterns or data groupings without the need for human intervention.”103 Unsupervised learning is a technique for teaching a computer to find links and patterns in large volumes of unlabeled data without a determined outcome in mind. In contrast, supervised learning matches a data point, such as an image, to a known database of labeled data. Unsupervised learning is often used for “exploratory analysis,” with “no simple goal” in mind.104 It is harder to check or validate because there is no given answer or labeled dataset against which to check the model’s outputs.105 The government might use an unsupervised learning methodology to search for meaningful patterns and hidden links in phone call records, travel patterns, or trade and commerce records indicating sanctions violations. Here, the algorithm is not searching for a particular number or face but for meaning in otherwise unstructured data. Importantly, it might also find connections without meaning, for example, by “matching” faces in a facial recognition application with similar backdrops or lighting. When this occurs within the program’s internal calculations or neural network as explained below, it may be difficult, or impossible, to discern that the “match” is based on a factor irrelevant to the output objective.
100. See Brown, supra note 80.
101. Varshney, supra note 24, at 18.
102. Id. at 1.
103. What Is Unsupervised Learning?, IBM (2020), https://perma.cc/6D9A-2BWD.
104. Gareth James et al., An Introduction to Statistical Learning: with Applications in R 373–74 (2017), https://doi.org/10.1007/978-1-4614-7138-7_1.
105. Id.
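The unsupervised idea above, such as searching phone call records for hidden groupings, can be sketched as clustering unlabeled data with no labels to check against. The call durations below are invented, and this bare-bones two-cluster k-means is a toy, not the method any actual government system uses.

```python
# A minimal sketch of unsupervised learning: group unlabeled
# one-dimensional observations (hypothetical call durations in
# minutes) into two clusters. All values are invented.

def kmeans_1d(points, iters=10):
    # Crude initialization at the extremes; with well-separated data
    # like this, neither cluster ever empties out.
    centers = [min(points), max(points)]
    for _ in range(iters):
        clusters = ([], [])
        for p in points:
            # Assign each point to its nearest current center
            idx = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[idx].append(p)
        # Move each center to the mean of its assigned points
        centers = [sum(c) / len(c) for c in clusters]
    return centers, clusters

calls = [2.1, 1.8, 2.5, 44.0, 47.2, 45.5, 2.3]
centers, clusters = kmeans_1d(calls)
print(sorted(clusters[0]), sorted(clusters[1]))
```

No one told the program that short and long calls were the categories of interest; it discovered that grouping from the data alone. Note, as the text cautions, nothing guarantees the grouping is meaningful for the user’s purpose.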
“In reinforcement learning, the inputs are interactions with the real world and rewards accrued through those actions rather than a fixed dataset.”106 Reinforcement learning introduces a “machine learning agent” and an incentive or desired goal into the algorithmic code, which might cause the machine to weight or improve its outcome on its own, as in the case of AlphaGo.107 A shopping algorithm, for example, might do this by automatically adjusting its code based on whether a recommendation is accepted, rejected, or ignored.
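The shopping example can be sketched as a simple reward loop. The item names, reward values, and update rule below are invented for illustration; real reinforcement learning systems are far more elaborate, but the mechanic is the same: feedback from the environment adjusts future behavior.

```python
# A minimal sketch of the reward idea in reinforcement learning:
# nudge an item's weight up when a recommendation is accepted and
# down when it is rejected. All names and values are illustrative.

weights = {"shoes": 1.0, "hats": 1.0, "bags": 1.0}
REWARD = {"accepted": +0.5, "ignored": 0.0, "rejected": -0.5}

def update(item, outcome):
    # The environment's feedback adjusts future recommendations
    weights[item] += REWARD[outcome]

def recommend():
    # Recommend the item with the highest learned weight
    return max(weights, key=weights.get)

update("hats", "accepted")
update("shoes", "rejected")
print(recommend())  # hats now outweigh shoes and bags
```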
One currently popular type of model that crosses all three major categories of machine learning (supervised, unsupervised, and reinforcement) is the artificial neural network. Artificial neural networks (ANN) mimic, or seek to mimic, the brain through a set of algorithms.108 Metaphorical “neurons” each mark a discrete point in the network where the system interprets a data input and responds to it, based on formulae programmed by humans, or, as the machine learns, by the machine.
Every neural network consists of layers of nodes, or artificial neurons—an input layer, one or more hidden layers, and an output layer. Each node connects to others, and has its own associated weight and threshold. If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer of the network. . . . Neural networks rely on training data to learn and improve their accuracy over time.109
An oversimplified example drawn from IBM’s website110 illustrates the process: Sam wants to go surfing as often as possible but only in good conditions. Sam develops a simple algorithm to help him decide each morning whether it is a good day to go. One input will be the water temperature; one the height of the waves; another the weather conditions; and the last the number of recent shark sightings. Each of these inputs will be assigned a weight, or a multiplier, representative of how much Sam values it. The shark multiplier will be negative. If the sum of the inputs and their multipliers is greater than a predetermined value, that neuron will “fire” or turn “on,” and will send an electronic signal or output to the next neuron to start calculating whatever decision it helps the system make (whether Sam should wear a wetsuit, perhaps). If the sum is less than the predetermined value for “getting to surf (yes),” the neuron does not fire, and the output signal never reaches the next neuron.
106. Varshney, supra note 24, at 1.
107. See Buchanan & Miller, supra note 98, at 11–12.
108. AI vs. Machine Learning vs. Deep Learning vs. Neural Networks: What’s the Difference?, IBM, https://perma.cc/9YRM-HKYB.
109. What Is a Neural Network?, IBM, https://perma.cc/JXN3-HF2A.
110. See id. The website presents a similar example with mathematical formulas.
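Sam’s decision neuron can be written out directly. The weights and threshold below are invented, not taken from IBM’s figures; the point is only the weighted-sum-against-threshold mechanic.

```python
# A sketch of a single "neuron": weighted inputs summed against a
# threshold. Weights and the threshold are invented for illustration.

def neuron_fires(inputs, weights, threshold):
    total = sum(x * w for x, w in zip(inputs, weights))
    return total > threshold  # "fire" only above the threshold

# Inputs: water temp OK? wave height OK? weather OK? shark sightings
inputs  = [1, 1, 1, 2]           # good conditions, two sightings
weights = [2.0, 3.0, 1.0, -4.0]  # sharks weighted strongly negative
print(neuron_fires(inputs, weights, threshold=1.0))
```

With two shark sightings the negative multiplier drags the sum below the threshold and the neuron stays off; with none, it fires and passes its signal forward.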
Neurons are metaphorically assembled in layers, that is, ordered stages of computation. If, in between the input and output layers, there are three or more (potentially many more) hidden layers, the ANN is called a “deep learning” model.111 Deep learning can be used in supervised, unsupervised, or reinforcement learning models.
Neural networks are sometimes referred to as the “black box” of machine learning. Because the machine is learning and adjusting as it goes, engineers may not be able to determine with certainty what the machine is weighing, how, and with how many input and output layers within the internal neural network. This is important where, for example, coding or data bias may impact the quality of the ultimate predictive outcome. A theoretical112 algorithm designed to predict recidivism, for example, might use many different data inputs as well as patterns it derived from the data of the general population in generating a predictive percentage risk that an individual defendant will return to prison. But with deep learning the judge and lawyers might not know what decisional weight, if any, was assigned to “race,” location, past convictions, the letter “S” in a defendant’s name, or population trends inapt to a specific defendant.
Although deep learning and neural networks are the most common AI methods, they are not the only ones. AI is a dynamic field, and new developments could change how algorithms process and ascribe meaning to data.
Importantly, there is a growing set of methodologies that engineers can incorporate to make such internal calculations more transparent, or fully transparent, to users. Research is underway in academia and by organizations such as the National Institute of Standards and Technology (NIST) to enable more transparent neural networks that will explain AI outputs on a global or per-decision basis after the fact so that AI users and those who are impacted by AI outputs might more fully understand the parameters and weights that are or were applied within.113 There are different approaches for doing so. Decision-tree models, for example, might run through a series of “if-this, then-that” analyses to essentially map out the input-to-output run of the network.114 Counterfactual algorithms seek to identify factors that were determinative by showing which factors would change the output, e.g., by identifying factors that caused a neural gate to open or close.115 Depending on the algorithm and its use, the factors could be benign, like a color or a shape, or potentially problematic, like “race,” or a proxy for “race.” Generally, the more detailed and layered an algorithm, the more accurate it is likely to be in its outputs (assuming good underlying data and coding methodology), but also the more difficult it will be to explain the output. In addition, not every AI application warrants, or would seem to require as a matter of law, the same measure of explainability. What users and courts require or expect to understand about an AI application that generates a credit score, a thunderstorm warning, or a bail risk decision will likely be different.
111. What Is Artificial Intelligence (AI)?, IBM, supra note 68. See also What Is Machine Learning (ML)?, IBM, https://perma.cc/JSA9-CXM6.
112. Current recidivism algorithms are often proprietary but may lack sophistication and complexity.
113. AI Fundamental Research—Explainability, NIST (June 16, 2022), https://perma.cc/3YX5-JC8Q.
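A counterfactual probe of the kind just described can be sketched by flipping one input at a time and watching whether the output changes. The “black box” model and its feature names here are invented; in this toy, the probe exposes a match driven by an irrelevant backdrop feature rather than the face itself.

```python
# A minimal sketch of a counterfactual probe: flip each boolean input
# of a hypothetical opaque model and record which flips change the
# output. Model and feature names are invented for illustration.

def black_box(features):
    # Stand-in for an opaque model; secretly keyed to one feature
    return features["backdrop_dark"]  # an irrelevant proxy!

def counterfactuals(model, features):
    determinative = []
    for name, value in features.items():
        probe = dict(features, **{name: not value})  # flip one input
        if model(probe) != model(features):
            determinative.append(name)
    return determinative

sample = {"face_match": True, "backdrop_dark": True}
print(counterfactuals(black_box, sample))  # reveals the proxy feature
```

The probe shows that flipping the facial data changes nothing while flipping the backdrop flips the “match,” which is exactly the kind of determinative-factor evidence a litigant might seek.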
Moreover, what is transparent may not be explainable if it comes in the form of mathematical modeling and equations. A 2021 NIST report notes “that the failure to articulate the rationale for an answer can affect the level of trust users will grant that system. Suspicions that the system is biased or unfair can raise concerns about harm to oneself and to society.”116 The report identifies four core principles of “Explainable AI”: (1) “Explanation,” or whether the system provides the evidence that it relied upon for its outputs; (2) “Meaningful,” or whether that evidence or data is understandable to the user or recipient; (3) “Explanation Accuracy,” or whether the explanation actually reflects how the output was generated and the metrics used to make that judgment; and (4) “Knowledge Limits,” or whether the system produces outputs within the parameters of its design and training and whether those outputs reach the intended or appropriate confidence level.117 The study concludes by noting that there are some aspects of decision making and explanation that humans are better at than AI-empowered machines and some that algorithms are better at than humans.118
As with so many aspects of AI, the questions for policy makers and legal decision makers, like judges, are firstly, when to use AI at all for a particular task, and secondly, when using AI, how to structure human-machine collaboration so as to maximize the benefits of AI and minimize and mitigate its risks, including the risks of bias and lack of transparency. We call this latter challenge the Centaur’s Dilemma, after the mythical animal that is half human and half horse and after the Department of Defense’s adoption of the “Centaur Model” to describe human-machine teaming. Here the centaur is part machine and part human, and the dilemma is to determine how much human choice and control to require and how much autonomous machine choice and control to permit for each AI application, output, or decision. Three related questions for judges and experts to ask in the context of AI-generated evidence are:
114. P. Jonathon Phillips et al., NISTIR 8312, Four Principles of Explainable Artificial Intelligence 12 (Sept. 2021), https://doi.org/10.6028/NIST.IR.8312.
115. Id. at 13.
116. Id. at 1.
117. Id. at 2.
118. Id. at 18–21.
Judges will play at least five roles related to AI in the courtroom. First, judges will oversee the process of discovery in civil and criminal cases: where AI outputs are at issue or may be offered as evidence, judges will need to determine how much discovery to permit into the AI’s design, background data, and accuracy. Second, they will serve as evidentiary gatekeepers, applying the Federal Rules of Evidence (or state equivalents) to proffers of testimonial and documentary evidence, including and perhaps especially Rules 401, 402, and 403. Third, judges will serve as guardians of the law (some may prefer the phrase “interpreters of the law”), protecting and preserving the rights and values embedded in the Constitution as well as statutes and rules of procedure and evidence. Fourth, judges may serve as potential AI consumers who need to decide whether to receive, review, or rely on AI-generated outputs to inform bail, probation, or sentencing decisions. Some risk-assessment tools rely on predictive algorithms, raising legal concerns about due process, equal protection, and confrontation, triggered in part by technical concerns about accuracy and bias. Fifth, judges will serve as communicators, translating the sometimes complex inputs behind AI into plain-language instructions for jurors and case law precedent for lawyers.
The previous sections introduced the technology behind AI. This section highlights nine technical aspects of AI of which judges should be aware in their roles as referees, gatekeepers, guardians, potential consumers, and communicators. The following sections address bias, predictive algorithms, algorithms used by lawyers, discovery, and AI as evidence.
Because there are different AI methodologies, each application should require authentication and validation, not just in concept but as applied in each context and potentially moment in time. As described above, within the category of machine learning, there are multiple ways to teach a machine to learn using data,119 the most common being supervised learning, unsupervised learning, and reinforcement learning. Any of those might make use of artificial neural networks employing deep learning with several or many hidden layers.
Yet there are still other theories and methods for teaching computing machines to learn, each built into the operative algorithm. These alternatives include evolutionary or genetic algorithms, inductive reasoning, computational game theory, Bayesian statistics, fuzzy logic, hand-coded expert knowledge, and analogical reasoning.120
In addition to deciding on the learning methodology, computer engineers must also decide how much depth and breadth to apply to any deep learning neural network—in other words, how widely the algorithm will search (breadth, also referred to as width) and how many layers of internal inputs and outputs it will employ before providing an output (depth). With facial recognition, for example, breadth might represent the number of datasets an algorithm searches. Depth might be illustrated by the number of points on a human face the algorithm is programmed to analyze before providing an output. Increases in network size tend to be required to capture the complexity of modern AI algorithms. Such increases create a challenge, however: the greater the depth—the number of layers in the neural network—the harder it will likely become to determine which factors were determinative in the output prediction. This could become important to the extent there is risk or concern that bias or some other factor might undermine outcome accuracy.
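The depth idea can be illustrated with a bare forward pass in which each layer is a set of weighted sums feeding the next. All weights below are arbitrary; the point is only that a deeper network interposes more intermediate values between input and output, which is what complicates attributing an output to any one input.

```python
# A sketch of "depth" as chained layers of weighted sums.
# Weights are arbitrary illustrations, not from any real model.

def forward(x, layers):
    for weights in layers:  # each inner list of rows = one layer
        x = [sum(xi * w for xi, w in zip(x, row)) for row in weights]
    return x

shallow = [[[0.5, 0.5]]]               # one layer, one unit
deep    = [[[0.5, 0.5], [1.0, -1.0]],  # a hidden layer of two units
           [[0.3, 0.7]]]               # then one output unit
print(forward([1.0, 2.0], shallow))
print(forward([1.0, 2.0], deep))
```

Tracing the shallow network’s output back to its inputs takes one step; tracing the deep network’s output requires unwinding every intermediate value, and real networks may have millions of them.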
Some algorithms are designed to provide outputs, plural—for example, a range of candidate face matches from a facial recognition algorithm, or a range of products from a shopping recommendation algorithm. As mentioned above, given the black box problem of deep learning, there are additional methodologies engineers can incorporate or substitute to make an algorithm’s internal calculations more transparent, or fully transparent, to users.
A court will need to satisfy itself that the specific AI application (as opposed to AI generally), its design, and its specific use meet the foundational requirements for the purpose for which it is being offered into evidence or used by a court. Verification will entail considering the theory and methods behind the AI, including the nature of the datasets used to train, test, and validate the AI, as well as, in the case of deep machine learning, inquiring into the design of the neural network.
119. Id. at 3.
120. See The Weaponization of Increasingly Autonomous Technologies: Artificial Intelligence 5, United Nations Institute for Disarmament Research (UNIDIR) No. 8 (2018), https://perma.cc/HQM5-BHWQ.
Many AI/ML models learn, for better or worse, as they proceed.121 That means AI systems need to be tested and validated on an ongoing basis. In other words, if a machine is learning, its variance rates and accuracy should change as well—in theory, in the direction of greater accuracy and lower variance, but, as described above, the machine’s accuracy and reliability may degrade.
If the underlying data are biased or poorly selected in that they do not match real-world conditions, that may undermine the accuracy of the AI or embed bias in the AI’s application. Results from the testing phase may not match actual outputs in the field. In context, judges, experts, and litigators will have ample reasons to test the reliability of any AI evidence offered in court, and judges will need to determine in context just how wide to open the door to expert testimony and discovery about matters like datasets, algorithmic design, search parameters, bias, and neural network architecture.
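Ongoing validation can be sketched as periodic re-scoring of a deployed model against freshly labeled cases. The batches, labels, and accuracy floor below are invented; real monitoring regimes are more involved, but the structure is the same: track accuracy over time and flag drift.

```python
# A minimal sketch of ongoing validation: score a model's predictions
# against fresh ground truth in rolling batches and flag any batch
# that falls below a chosen accuracy floor. All data are invented.

def batch_accuracy(predictions, truths):
    hits = sum(p == t for p, t in zip(predictions, truths))
    return hits / len(truths)

def monitor(batches, floor=0.9):
    alerts = []
    for when, preds, truths in batches:
        acc = batch_accuracy(preds, truths)
        if acc < floor:
            alerts.append((when, acc))  # the model may be degrading
    return alerts

batches = [
    ("week 1", [1, 1, 0, 1], [1, 1, 0, 1]),  # 100% accurate
    ("week 9", [1, 0, 0, 1], [1, 1, 1, 1]),  # 50%: possible drift
]
print(monitor(batches))
```

For a court, the analogous question is whether anyone ran such checks between the testing phase and the factually contested moment, and what they showed.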
Machines do what they are programmed to do—including learning on their own—not because they choose to, but because humans design them to. Software drives machines. And humans, in the first instance, write software and design programs. Behind each AI application there are human choices, human values, and human bias that may impact the operation of the algorithm and the accuracy of its results. Humans select not only the data but also the metrics the algorithm uses to frame and analyze that data.122
In the operation of AI, humans are also involved. Under current vernacular in the AI field, humans are said to be “in-the-loop,” “on-the-loop,” or “out-of-the-loop.” As implied, in-the-loop describes humans in functional control of an application, deciding when and how it is used. On-the-loop describes humans observing AI but not controlling it, but with the option to do so. Out-of-the-loop describes an autonomous or semiautonomous system operating automatically. These terms are imprecise in at least two regards. First, they describe a wide variance of conduct within each category and thus may convey a sense of control and oversight that is, in operation, absent. More to the point, they are insufficiently descriptive to apportion accountability and responsibility for the purpose of legal judgments. Take the example of a “driverless car.” Some driverless cars are configured to employ a safety driver as an observer or, in the case of a semi-driverless car, a driver with shared responsibility for the operation of the vehicle. Other driverless cars, without a human in the car, may operate under remote human control. In each of these three scenarios, at any moment in time the vehicle may be driving autonomously without human control, it may be following the explicit direction of the remote or present driver, or the human driver may be keenly observing the operation of the vehicle without overriding the car’s computers. In each case, humans were out of, in, and on the loop.
121. Other models might learn during training but are not capable of further learning on their own during deployment. Still others might be trained and later adjusted or retrained only with human help or supervision.
122. Remarks of Nisheeth Vishnoi, supra note 22.
However described, a human is always involved with an AI application. For courts, the factual questions will be: Who designed the seed algorithm? Using what metrics or weights? Who trained the algorithm? Using what data? Who collected the data? Who validated the data? Who used the algorithm or monitored its use? These factual questions will lead to legal questions. For example, where Crawford123 applies, multiple persons might be called as witnesses regarding the design and operation of an AI algorithm. Because humans are always involved with AI, there will be persons who can, if relevant and material, provide answers to the sorts of questions essential to authenticating and validating the use of AI:
Judges might also consider that a qualified AI expert or witness ought to be able to credibly answer these questions, or perhaps the expert or witness may not be qualified to address the application at issue.
123. Crawford v. Washington, 541 U.S. 36 (2004).
AI is generally a predictive tool based on statistics. Through weighted calculation an algorithm predicts an outcome. A facial recognition algorithm, for example, might predict the likelihood that one photograph in a set of fifty returned by the machine as potential matches is indeed the correct match. What the algorithm does not do is confirm that an image presented is in fact a match in the same way that a chemical test confirms the presence of a compound. This is one reason why engineers use the term “confidence threshold” in describing the accuracy of an application. The FBI facial recognition algorithm, for example, is designed not to conclusively find a match but to find pictures that might match. As the 2019 GAO report on the subject stated, the algorithm is most accurate when offering a range of potential matches.124
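The range-of-matches idea can be sketched as a ranked candidate list gated by a confidence threshold. The photo names, scores, and threshold below are invented and do not reflect the FBI system or any actual algorithm.

```python
# A minimal sketch of prediction rather than conclusion: return every
# candidate scoring above a confidence threshold, ranked, instead of
# a single verdict. All names, scores, and the threshold are invented.

def candidate_matches(scores, threshold=0.6):
    above = [(name, s) for name, s in scores.items() if s >= threshold]
    # Highest-confidence candidates first
    return sorted(above, key=lambda pair: pair[1], reverse=True)

scores = {"photo_a": 0.91, "photo_b": 0.64, "photo_c": 0.22}
print(candidate_matches(scores))
```

Nothing in the output says photo_a *is* the person; it says photo_a is the likeliest of the candidates the model scored, which is a prediction, not a confirmation.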
In the case of a Google search algorithm, for example, the algorithm is predicting that one of the provided links will respond to the query. This is self-evident if one asks a question like, “Who was George Washington?” The algorithm is likely to provide a Wikipedia link to a webpage about the first U.S. president. It is also likely that many readers will conclude that the algorithm has answered the question: “George Washington was the first president of the United States.” It has not. The algorithm has predicted that one or more of the links provided will answer the question, and likely in descending order of probability as the links are listed. Modify the question a bit, and the predictive aspect becomes more evident. If you ask, “Who is my friend George Washington?”—a quite different person than the first president—Google responds with sites listing the first president’s friends. That is the algorithm’s best prediction as to which links will answer the question based on code matching, likely use of the word “friend,” and the way prior readers have responded to similar word searches. In other words, like a shopping algorithm, the search algorithm is tracking whether the searcher “bought” the response by clicking on it and measuring how long the searcher stayed. Of course, it has not answered the question at all and is nowhere near to providing a link that will answer the question about the user’s friend George Washington—not without more details that can help shape the predictive outcome.
In a medical context (perhaps coming before a court in a malpractice case) an input might query, “Is this a picture of a benign or malignant tumor?” To respond to that question, an algorithm trained on prior pictures of tumors might break the picture into quadrants and subcomponents, as a facial algorithm might do, and then compare the picture submitted to database images of tumors. Based on all accessed images of benign and malignant tumors, the algorithm will predict whether the picture is a better match for one or the other. What the algorithm offers that a human does not is the capacity to search multiple databases rapidly for comparative patterns, as well as the ability to break the image into subordinate components in a way humans cannot, and thus to see connections and patterns the human eye cannot. Moreover, the algorithm is neither affected by fatigue nor subject to ordinary human distractions, pressures, and emotions. If the algorithm has not been trained properly, or trained to identify new patterns, it is less likely than a human to identify a rare disease or new manifestation of an existing disease, raising the prospect of a false negative. One can imagine in a malpractice case how the parties might litigate the manner in which any human–machine teaming occurred. Where a tumor was not diagnosed, a plaintiff might argue that doctors placed unreasonable reliance on a “negative” AI output. Alternatively, a plaintiff might argue an unreasonable lack of reliance if a broader use of AI databases was not employed.
124. See Harwell, supra note 44.
To be sure, algorithms are used to “make decisions” as well as to predict outcomes. Algorithm-generated credit scores, for example, are used to determine eligibility for loans and credit cards. The algorithm produces the score, but it is humans who decide what parameters should inform that score and it is humans who decide what score should be used as a threshold cutoff for loan eligibility or rates. The algorithm in essence is making a numeric prediction about the reliability of the loan; the human is making a decision to rely on the prediction and at what confidence level to do so. Likewise, driverless cars make “decisions” all the time. However, they do so based on human programming and values embedded in that programming, a centaur model of decision making, even in a fully autonomous vehicle.
If an algorithm has only been trained on one picture of a cancerous tumor or has never seen a cancerous tumor, then it will be less likely to correctly identify a tumor in response to a query. This is an important limitation on the capacity and accuracy of current AI. Moreover, volume here is not measured in hundreds, but in hundreds of thousands of pictures. A human performing the same task with only one picture will more likely identify a tumor using intuition, judgment, and experience, as well as external factors the algorithm cannot assess, like the patient’s unique pain threshold or situational responses to touch and feel. As discussed above, today’s large language models (LLMs) train on vast datasets, though researchers are developing other types of modeling less dependent on large datasets, such as information lattice models.
The quality of data is also important and will likely remain so. Data can be flawed in a variety of ways. Dated data, known as stale data, are more likely to generate inaccurate results (“temporal bias”), as are incomplete or underrepresentative data (“population bias”). A facial recognition algorithm trained on driver’s license pictures or parole pictures is more likely to identify pictures reflecting the demographics represented in the databases. This has the potential to increase the false negative rate for underrepresented groups and to increase the false positive rate for overrepresented groups.
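One simple way to surface the effect just described is to compute false positive and false negative rates separately by group. The records below are invented so that the asymmetry is visible; a real audit would use the system’s actual outcomes.

```python
# A minimal sketch of a per-group error audit. Each record is
# (group, predicted, actual), where 1 = flagged as a match.
# All records are invented for illustration.

def rates(records):
    """Return {group: (false_positive_rate, false_negative_rate)}."""
    out = {}
    for group, pred, actual in records:
        g = out.setdefault(group, {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
        if actual:
            g["pos"] += 1
            g["fn"] += pred == 0  # missed a true match
        else:
            g["neg"] += 1
            g["fp"] += pred == 1  # flagged a non-match
    return {g: (v["fp"] / max(v["neg"], 1), v["fn"] / max(v["pos"], 1))
            for g, v in out.items()}

records = [
    # overrepresented group: more false positives
    ("over", 1, 0), ("over", 1, 0), ("over", 0, 0), ("over", 1, 1),
    # underrepresented group: more false negatives
    ("under", 0, 1), ("under", 0, 1), ("under", 1, 1), ("under", 0, 0),
]
print(rates(records))
```

An audit of this shape lets a litigant ask not just “how accurate is the tool?” but “how accurate is it for people like my client?”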
Likewise, data may possess flaws that impact algorithms but not humans. Algorithms may discern links in data or perceive patterns in data creating matches, based on elements or numeric formulas that are unintended or that humans would not discern. In our bank robber scenario, the algorithm may match digital sequences found in the image based on irrelevant factors, such as a common backdrop in a photo or pattern on the robber’s face mask. In either instance, there may be a match but not a meaningful match. If this occurs within the neural network, it may skew a result in a manner unseen and unknown to the user. The output is a face, but the user may not know this face has been passed through to the output stage because of similarities in the picture backdrops, not the face itself, because AI systems often are unable to explain their results.
The black box of neural networks renders many AI outputs unexplainable. However, the field of explainable AI is expanding as is the development of tools allowing reconstruction of the reasoning behind AI outputs. This capacity has many legal implications, such as the ability to reconstruct or explain AI results that are biased, incorrect, or result in accidents (driverless cars) or mistakes (medical diagnoses). Therefore, users of AI, including proponents of its admission into evidence, should articulate what is explainable and not explainable and why, based on the tools available to do so at the time of use or at evidentiary proffer.
As will be seen below, this limitation makes certain predictive algorithms particularly susceptible to error. It is essential that judges and factfinders understand the ways data and design can embed witting and unwitting bias, as discussed in the section below titled “AI Is Biased,” impacting the predictive accuracy of AI.
If the accuracy of an AI application often depends on the amount of data on which it is trained, it depends even more on the algorithm applied to that data. As noted previously, an algorithm is a process that guides the software determining which data are selected and how they are weighted. The choice of algorithm is a choice of decisional metrics or framework—an analytical lens or value-laden perspective.
Using the analogy of an algorithm as the recipe a chef uses: the chef chooses not only the end-dish but also dietary restrictions (vegetarian, low sodium), flavor profile (sweet, acidic, spicy), cultural heritage, ingredient measurement system (metric, English), etc. As this is AI, not a simple algorithm, the chef supplements her recipe every time she cooks in a way that only she knows. What is more, the sous chefs supplement the recipe when no one is looking. Thus, in some cases no one can be quite sure what gives the recipe its distinctive taste; and, if the chef knew, she would not tell because she wants customers to continue to come to her restaurant. Restated, if one knew the Google search algorithm, every search platform could, in theory, be as good, provided of course, that the algorithm could access and apply the same level of data (Google’s data) with which to train, test, validate, and apply the algorithm.
Thus, the heart of many disputes about the use of AI-generated evidence in court or the use of AI tools to inform judicial decision making will revolve around access to and disputes over the accuracy and quality of algorithms. This may be the proprietary secret AI companies want most to protect, because it is the recipe to their market success and because too much inquiry may undermine confidence in the AI’s capacity.
Here are several questions judges should contemplate before using an AI application or admitting one into evidence:
These questions might lead to further questions:
As noted earlier, narrow AI is not particularly good yet at situational awareness. The driverless car may not timely identify novel objects on the road. Judges will therefore have to consider whether the scenario or fact for which an AI application is offered presents questions involving situational awareness. If so, they should then ask in what manner the algorithm, the data, and the training are keyed for such circumstances and whether the accuracy rate varies in such contexts. They might ask:
Proponents of AI tend to emphasize its strengths, opponents its weaknesses. Of course, the strengths and weaknesses of any AI application must be assessed on a case- and application-specific basis. One strength of AI, however, is its general capacity to identify, aggregate, and derive meaning from data in ways that humans cannot. With driverless vehicles, for example, this capacity is simply illustrated by the fact that, with the right sensors, driverless vehicles can “see” in all directions at once and calculate, with mathematical precision at speed, the distance needed to brake. AI can also identify patterns, anomalies, and links in data that humans cannot. In many cases, AI is better than humans at tasks like comparing pictures of tumors to database images of benign and malignant tumors. And AI can do all this instantaneously and, in many cases, more consistently than humans, a point illustrated with reference to comparisons between humans reviewing documents for security classification or discovery purposes and algorithms doing so. Different humans often reach different conclusions. A good AI will be more consistent, though it may be consistently correct or consistently incorrect in its result. Of course, depending on the context, humans will need to determine whether the patterns and links that are made are relevant and reliable for the purpose presented.
AI can create false content, as described above.
AI, like humans, has biases. Bias is also an area of AI that will attract adjudication, as the next section suggests.
Bias is often associated with the human application of stereotypes or prejudices to an ethnic, gender, racial, or other identity group. In U.S. law, such categories are generally recognized as “suspect classes” in equal protection law under the Fifth and Fourteenth Amendments.
As judges know, any application of law that treats classes of persons differently from the populace as a whole, if challenged in court, must pass either a strict scrutiny test, intermediate test, or rational basis test, depending on the suspect class. An application of law that is facially neutral but adversely affects one protected group more than another might also be subject to a disparate impact claim. For example, a hiring algorithm that disproportionately favored one group over another might be subject to a disparate impact lawsuit.
However, when technologists consider bias, they are often referring to the variance between what an AI is intended to predict or perform and the accuracy with which the algorithm predicts or performs that function. This “algorithmic bias” might include social biases incorporated into the AI’s design and implementation—but with AI, bias is usually defined more broadly as a witting or unwitting (conscious or unconscious) predisposition that can undermine the accuracy of an AI application or output. Bias thus addresses a range of cognitive tendencies that can adversely affect objective analysis and technical accuracy. Significantly, AI bias also incorporates and describes unintentional design and data flaws that can impair the accuracy of AI outputs. Because humans design AI algorithms and choose the data that trains the software, developers’ choices can incorporate varying types and degrees of bias into the algorithm’s design. Unintentional bias is often difficult to discern because it is embedded in the design of an AI system or in the data used to train an algorithm. Decision makers may subsequently place undue reliance on AI outputs because they do not know to what extent, if any, the outputs are predicated on biased input, e.g., a facial recognition algorithm or a predictive algorithm trained on limited or poor-quality data.
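This notion of bias as a measurable gap between intended and actual performance can be illustrated with a toy calculation. In the sketch below, every label and prediction is invented; it shows how an application can look acceptable overall while being twice as error-prone for one group:

```python
# Illustrative only: "algorithmic bias" measured as the gap between what
# the tool is meant to do (match ground truth) and what it actually does.
# All labels and predictions below are invented for demonstration.

def error_rate(preds, truth):
    """Fraction of predictions that disagree with the true outcome."""
    return sum(p != t for p, t in zip(preds, truth)) / len(truth)

# 1 = "high risk", 0 = "low risk" (hypothetical outputs for two groups)
truth_a, preds_a = [0, 1, 0, 1, 0], [0, 1, 0, 1, 1]
truth_b, preds_b = [0, 1, 0, 1, 0], [1, 1, 1, 1, 0]

print(error_rate(preds_a, truth_a))  # 0.2 -- one error in five for group A
print(error_rate(preds_b, truth_b))  # 0.4 -- twice the error rate for group B
```

A decision maker shown only an aggregate accuracy figure would never see the disparity; it emerges only when performance is measured group by group against ground truth.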
When it comes to algorithmic bias, there are four immediate takeaways for judges:
The United Nations Institute for Disarmament Research suggests several categories and sources of algorithmic bias.127 Starting with the Institute’s findings, we highlight eight forms of potential AI bias: statistical bias, moral bias, training data bias, inappropriate focus, inappropriate deployment, interpretation bias, unwitting human bias, and intentional bias.128 We also discuss the issues of overfitting and outliers as potential sources of bias.
Statistical bias might occur when an algorithm’s predicted outcomes deviate from a statistical standard, such as the actual frequency of real-world outcomes.129 This deviation can be caused by bad statistical modeling or incorrect or insufficient data. The difficulty of calculating the infection or mortality rate of a disease such as Covid-19 illustrates the problems that can result. At the outset of the pandemic, AI-driven modeling of infection rates differed widely, in part because the models could not account for those who had the disease but did not show symptoms. Thus, it was only within closed data samples, such as the passengers aboard a cruise ship, that the models could account for an asymptomatic pool (because all the passengers were tested before they were allowed to leave the vessels). The risk with cruise ship findings was having too small a sample pool, and one not necessarily random or representative of a cross-section of the population in more typical, land-based living conditions.
125. Joni R. Jackson, Algorithmic Bias, 15 J. of Leadership, Accountability & Ethics 55–65 (2018), https://doi.org/10.33423/jlae.v15i4.170.
126. Jake Silberg & James Manyika, Notes from the AI Frontier: Tackling Bias in AI, McKinsey Global Institute (June 6, 2019), https://perma.cc/K8Z8-WUAE.
127. Algorithmic Bias and the Weaponization of Increasingly Autonomous Technologies: A Primer, UNIDIR Resources No. 9, United Nations Institute for Disarmament Research (UNIDIR) (2018), https://perma.cc/N2H7-DWPT.
128. This section relies heavily on our prior work with co-author Matthew Mittelsteadt, An Introduction to Artificial Intelligence for Federal Judges, supra note 8.
129. Id. at 2.
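The asymptomatic-pool problem is, at bottom, a sampling problem, and it can be illustrated with a small simulation. All rates below are invented for demonstration:

```python
# Toy illustration of statistical/sampling bias: if only symptomatic
# people enter the data, the measured infection rate understates the
# true rate. All figures are invented; seeded for reproducibility.
import random

random.seed(0)

population = []
for _ in range(100_000):
    infected = random.random() < 0.10                   # true rate: 10%
    symptomatic = infected and random.random() > 0.40   # 40% asymptomatic
    population.append((infected, symptomatic))

true_rate = sum(i for i, _ in population) / len(population)
# Biased measurement: asymptomatic infections never enter the data.
observed_rate = sum(s for _, s in population) / len(population)

print(f"true rate ~ {true_rate:.3f}, observed rate ~ {observed_rate:.3f}")
```

The model built on the observed data is internally consistent yet systematically wrong, because its sample was never representative of the population it purports to describe.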
Moral bias occurs when an algorithm’s output differs from accepted norms (regulatory, legal, ethical, social, etc.).130 For example, an algorithm may weigh factors that the law or society deem inappropriate or do so with a weight that is inappropriate in the context presented. A hypothetical predictive crime algorithm might use data derived from the current prison population to “predict” rates of recidivism. Given the disproportionate number of people of color imprisoned, the algorithm might produce biased results by explicitly or incidentally weighing “race” and ethnicity or their proxies, and perhaps do so within the black box of a neural network. AI neither thinks nor understands the world like humans, and unless instructed otherwise, its results can reflect an ignorance of norms found in the equal protection and due process clauses.
Like humans, AI learns from experience; however, AI experience is based exclusively on data, often hand-selected by a human developer. Inaccuracies or misrepresentations in this data can perpetuate biases by embedding them in algorithmic code or weights.131 In other words, the results are skewed; the algorithm produces wrong answers. As presented in the discussion of the machine learning life cycle, above, biased data might be outdated (“temporal bias”), under- or over-representative of certain groups (“population bias”), or simply reflective of individual or systemic social prejudices. For example, an algorithm intended to identify potentially successful job applicants might rely on past successful job performance as an indicator of future successful job performance and derive from that data certain preferred hiring characteristics like age, school, and experience.
130. Id.
131. Id. at 2–3.
But if the data are from a period when women or other marginalized groups were not numerically well represented in the relevant employment market or educational pool, at least 50% of the potential workforce might be excluded from results. Such data might likewise incorporate human bias in the form of a past company policy to only hire persons from certain schools. The criterion might have seemed objective when the company adopted the policy, but it necessarily incorporates the socio-economic and other biases of the college admissions processes of the time. Thus, the algorithm might exclude candidates as good or better than those from whom the dated dataset was sourced.
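A simple sketch can make the point concrete. The “model” below is a deliberately naive, invented frequency score, not any real hiring tool; it simply reproduces whatever exclusion its historical data contain:

```python
# Hypothetical frequency-based "hiring model": it scores candidates by
# how often their group appears among past hires, so a historically
# excluded group is penalized regardless of qualifications. The record
# below is invented for demonstration.
from collections import Counter

past_hires = ["A"] * 95 + ["B"] * 5   # invented historical hiring record
counts = Counter(past_hires)

def hire_score(group):
    return counts[group] / len(past_hires)

print(hire_score("A"))  # 0.95
print(hire_score("B"))  # 0.05 -- equally qualified B candidates score low
```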
Similar concerns have been raised about algorithms designed to inform parole decisions by predicting recidivism risk. These algorithms, critics argue, appear132 to rely on socioeconomic status, neighborhood location, and past crime statistics as predictive criteria of future criminal conduct, potentially resulting in a self-fulfilling prediction with disparate racial and socio-economic effect. Criminal risk assessments used in bail and sometimes sentencing meet with analogous criticisms, discussed below under Algorithms That Predict Human Behavior. Proxies for suspect categories, such as zip codes or last names, might also introduce bias into data, as discussed above under How Machines Are Trained to Learn: The Machine Learning Life Cycle. Proxies may be hard to detect. Additionally, data used to test or validate algorithms might also infuse bias.
Inappropriate focus occurs when an algorithm’s training data are ill-suited to the algorithm’s task.133 This might lead the algorithm to identify factors within a neural network that, though objectively reasonable, are logically irrelevant to the desired outcome. An algorithm that matches faces based on colors, backdrops, or lighting demonstrates inappropriate focus bias.
One can readily imagine how similar bias might migrate into datasets designed to train machine-learning AI to predict terrorism recruitment or threats. To begin with, the data may rely too heavily on international versus domestic actors owing to the designer’s perceptions or the selection of training data. And because the amount of data may be limited (in contrast to, say, an Amazon or YouTube algorithm), human actors may put too much credence in the reliability of the predictive output. In general, the more data used to train a predictive algorithm, the more accurate the result. An algorithm trained to predict terrorism risk based on a stereotyped “terrorist profile” will, unsurprisingly, be best at locating persons who meet that profile. More persons with the profile will be identified as potential terrorists, and more may be found to be engaged in suspicious activities because of increased scrutiny, thus appearing to validate the algorithm and the choice of criteria. Potential terrorists who don’t fit the profile may be erroneously omitted.
132. They are often proprietary, as was the case for the risk assessment tool at issue in Loomis v. Wisconsin, 881 N.W.2d 749, 759 (Wis. 2016), cert. denied, 137 S. Ct. 2290 (2017).
133. Baker, Hobart, & Mittelsteadt, supra note 8, at 4.
As the example demonstrates, the risk is not just in false positives, which is the focus of much bias analysis to date, but in the potential failure to identify credible risks: false negatives. Disparities in facial recognition data between males and females or people of different ethnicities could lead to greater inaccuracies in identifying female subjects or people of color, increasing the number of false positives—for example, the number of innocent people selected for extra screening or questioning at airports. In contrast, an inability to identify known subjects or threats—for example, a missing or wanted person or Amber Alert kidnap victim on CCTV camera feeds—has security implications.
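False positive and false negative rates are straightforward to compute once ground truth is known. In the invented example below, an imagined matching tool performs perfectly for one group and poorly for another:

```python
# Illustrative only: false positive rate (innocent people flagged) and
# false negative rate (true matches missed), computed per group.
# All match results below are invented.

def fp_fn_rates(preds, truth):
    """(false positive rate, false negative rate) for one group."""
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, truth))
    negatives = sum(t == 0 for t in truth)
    positives = sum(t == 1 for t in truth)
    return fp / negatives, fn / positives

# 1 = "flagged as a match"; hypothetical results for two groups
truth_m = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
preds_m = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]   # perfect for group M
truth_f = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
preds_f = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]   # more errors for group F

print(fp_fn_rates(preds_m, truth_m))  # (0.0, 0.0)
print(fp_fn_rates(preds_f, truth_f))  # both rates nonzero for group F
```

The two rates matter differently in different contexts: false positives subject innocent people to scrutiny, while false negatives let genuine matches go undetected.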
Inappropriate deployment occurs when a system is used in a context for which it was not designed, tested, and validated.134 For instance, a driverless car trained for driving in the United States might not be able to handle left-hand driving in the United Kingdom. A human (we hope) would adapt to such a change; a driverless car algorithm would need more training.
Interpretation bias occurs when an algorithm’s output is confusing or subject to incorrect interpretation by those working with the technology.135 Users of facial recognition technology might expect singular, or perfect, matches, in contrast to what most facial recognition algorithms—including the FBI’s—actually produce: an array of potential matches, none of which might represent the individual sought, leaving the interpretation and conclusions to human users.
Interpretation bias can also occur because of ambiguity embedded in the algorithmic design—for instance by software designers who, unaware of cultural or linguistic cues, overlook or misuse phrases and concepts, skewing results. Sometimes the reasoning behind a match is necessary to understand its value or import. Engineers might design algorithms to search for particular words or phrases, with the goal, for example, of identifying persons engaged in radicalizing internet users. Insufficient knowledge of culture and language, however, could have unintended consequences. Phrases like “the bomb,” “knock ’em dead,” and “kill it” all can mean something in American vernacular quite different from what might be intended in a terrorist cell. By the same token, an algorithm designed by an engineer who, for example, does not know the import of “the fourteen words” (which form two slogans of white supremacists) may inadvertently enable a potential data threat stream to escape detection.
134. Id.
135. UNIDIR Resources No. 9, supra note 127, at 5.
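A deliberately naive sketch shows how context-free keyword matching misfires in both directions. The term list and phrases below are invented for illustration:

```python
# Hypothetical keyword filter with no sense of vernacular context,
# illustrating interpretation bias: it flags harmless idiom while
# letting coded language pass. Terms and phrases are invented.
THREAT_TERMS = {"bomb", "kill"}

def naive_flag(text):
    words = text.lower().replace("'", "").split()
    return any(term in words for term in THREAT_TERMS)

print(naive_flag("that concert was the bomb"))        # True: false positive
print(naive_flag("you'll kill it at the interview"))  # True: false positive
print(naive_flag("meet at the usual spot tonight"))   # False: coded talk escapes
```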
Unwitting human bias refers to the unintentional infusion into an application of human preferences, stereotypes, values, fears, or knowledge. Consider an algorithm intended to predict risk. An engineer might apply engineering principles to a risk equation. But what is risk? An algorithm will almost certainly incorporate the particular fears, risk tolerances, and perceptions of its designers. (The problem may be compounded when the algorithm is both human and machine generated—a “centaur”—clouding where and how bias might have entered the system.) But the algorithmic equation does not account for human behavior, which is informed not only by the calculation of objective zero-sum costs but also by the emotional impact of fear.
Use of racial, gender, and other social descriptors in algorithms is inherently risky and potentially fraught with ethical and legal issues. One can imagine how both intentional and unintentional human bias might enter the equation as a computer scientist embeds what he or she believes are parameters associated with a “race” or ethnicity into facial recognition software. Racial and ethnic categories are inherently ambiguous social constructions covering wide continuums of individuals. Similarly, one can see how nuance might “fool” an algorithm intended to identify age based on the subtle distinctions of faces alone without allowing for the possibility of make-up or efforts at disguise. Bias may also occur unwittingly in machine-learning applications that may not be designed to depend on social identity descriptors but nonetheless rely on such characteristics within the neural network black box. Bias can lead to both the under- and over-inclusion of the targeted group, as “race,” ethnicity, gender and other social categories are malleable concepts. Scholars like Charles Szymański of the University of Bialystok note the risk of unintended age bias, as occurred with the use of an employment screening algorithm that sorted candidates based on whether the candidate submitting an application utilized the default browser or had downloaded a substitute browser. The tool, which was intended to identify technology-savvy applicants, had the unintended effect of excluding candidates based on age.136 In effect, the choice of browser served as a proxy for age.
136. Charles F. Szymański, The Future of Labor Law, Presentation at Trnava University, Slovakia (Oct. 5, 2022).
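The browser example illustrates how a facially neutral feature can function as a proxy. The sketch below uses entirely invented applicant data:

```python
# Invented illustration of a proxy variable: screening on a facially
# neutral feature ("installed a non-default browser") that happens to
# correlate with age ends up screening on age itself.
applicants = [
    {"age": 24, "nondefault_browser": True},
    {"age": 29, "nondefault_browser": True},
    {"age": 31, "nondefault_browser": True},
    {"age": 52, "nondefault_browser": False},
    {"age": 58, "nondefault_browser": False},
    {"age": 61, "nondefault_browser": False},
]

def avg_age(people):
    ages = [a["age"] for a in people]
    return sum(ages) / len(ages)

# The "tech-savvy" screen never mentions age...
selected = [a for a in applicants if a["nondefault_browser"]]

print(avg_age(applicants))  # 42.5 -- the applicant pool as a whole
print(avg_age(selected))    # 28.0 -- ...yet the selected pool skews young
```

Nothing in the selection rule names age, which is precisely why such proxies can be hard to detect without examining outcomes.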
Intentional bias arises when scientists, operators, and decision makers use AI facial recognition tools or predictive algorithms to target disfavored or vulnerable groups. Algorithms can be designed to identify and select certain real and perceived social descriptors associated with “race,” gender, sexuality, national origin, religion, disability, and more. Facial recognition technology can identify and track certain ethnic groups, as is the case in China with “Uyghur characteristics.” Clearly pernicious in the profiling of Uyghurs (or more accurately, a band of physical characteristics Chinese state security services associate with Uyghurs), such technology presents a legal question that may arise for judges: whether, when, and under what controls the purposeful use of social identity descriptors is ever an appropriate search parameter.
On the one hand, there are qualitative differences between the reactive versus predictive uses of social identity descriptors. For example, using an individual suspect or victim description, including descriptors like “race,” gender, and age, in response to a credible, individualized predicate, in context, might make sense. Of course, one needs to consider that the initial social indicator that might trigger the use of an AI application may itself be affected by the cognitive, societal, or personal biases of witnesses. In the context of using facial recognition to search for a known suspect or a person identified in an Amber Alert, one would not necessarily expect law enforcement to employ “race”-, gender-, or age-neutral input or an algorithm incapable of searching for the specific or reported characteristics of the suspect or victim. Using individualized, known suspect descriptions as an “identifying factor” is “perhaps the least controversial use of race in the context of the Fourth Amendment” under federal case law.137 However, as already noted, one potential challenge to individually based suspect descriptions is that to the extent social identities are fluid rather than fixed, they may be difficult to code, and they may be difficult for officials to implement fairly and accurately in AI and in general contexts.
On the other hand, using a suspect category and accompanying social identifiers to identify and predict persons who might engage or have engaged in an unlawful act—rather than relying on individualized, behaviorally based reasonable suspicion or probable cause—is an exercise in bias. Courts that have not “sidestepped” the issue have generally rejected explicit racial profiling based on “incongruity” arguments (that a particular person is “out of place” in a particular location or neighborhood).138 Likewise, with exceptions in the immigration and border contexts, courts have rejected “propensity” arguments (that membership in a particular social group suggests a “propensity” for an individual to commit or have committed a crime).139
137. Farag v. United States, 587 F. Supp. 2d 436, 462 (E.D.N.Y. 2008).
138. Id. at 462–63.
139. Id. at 463–64.
Courts may need to consider whether an AI application might blur the lines between individual suspect identification and group profiling. An AI application—for example a facial recognition database—might cast a wider net than a traditional human law enforcement investigation, to the point where what started as an individual suspect identification begins to look more like group profiling, opening more people (all of whom or all but one of whom might be innocent) to suspicion and investigation.
As Matt Mittelsteadt has identified,140 two significant risks posed by the use of machine learning in criminal risk assessments and other legal contexts are “overfitting” and “outliers.” Overfitting occurs when the ML model is too tailored to the data it has been trained on and does not account for ambiguities or variations.141 Generally in machine learning development, this problem is addressed by ensuring that the data the ML algorithm is trained on are separate from the data it will encounter in use (provided that the training data are still representative of the data that will be encountered in use). A model thus “generalized” should be flexible enough to correctly interpret data it has not encountered.142 If this data separation is not made, biased results may occur.
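The memorization problem can be shown with a toy “model” that simply stores its training examples. The dataset and split below are invented; the point is only that performance on data the model has already seen says nothing about how it will handle data it has not:

```python
# Illustrative: a "model" that memorizes its training data overfits --
# perfect on examples it has seen, chance-level on held-out examples.
# The toy dataset labels each input x by whether x > 50.
data = [(x, x > 50) for x in range(100)]
train = data[0::2]   # even inputs used for training
test = data[1::2]    # odd inputs held out, never seen in training

memorized = {x: y for x, y in train}     # pure memorization, no learning

def predict(x):
    return memorized.get(x, False)       # blind default guess when unseen

def accuracy(dataset):
    return sum(predict(x) == y for x, y in dataset) / len(dataset)

print(accuracy(train))  # 1.0 -- looks perfect on training data
print(accuracy(test))   # 0.5 -- chance-level on held-out data
```

The held-out (test) set plays the role of "data encountered in use": only the second number tells us anything about how the tool would perform on new cases.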
As discussed below, some courts have already used algorithms in sentencing. Overfitting, like other biases and weaknesses of AI, presents particular risk in sentencing and other applications where liberty interests are at stake. If a hypothetical ML sentencing algorithm is built on a training set of past offenders, the AI could design its neural network with results custom fit for those specific offenders. If a person reoffends, perhaps with a lesser crime, and his data are used to train the algorithm, there is risk that in calculating a sentence the algorithm might find and match his prior personal data and reproduce the prior sentence; statistically, the sentence will be the best match for his case. In essence, the algorithm might conclude, within its black box, “For someone with this background, we give a sentence of X.” In effect, the sentencing algorithm has shown a focused bias targeting a specific person (or even a person with similar personal data unrelated to any offense; this overlaps with population bias). (Note this is only a hypothetical model: we do not know of any risk assessments currently in use that are designed to correlate sentences with crimes for which an individual has already been convicted, as suggested in the hypothetical above. Rather, in actuality, courts appear to be using risk assessments designed to predict future crimes or dangerousness (recidivism), designed for bail or post-sentencing purposes, and importing those risk assessments into the sentencing context, raising constitutional questions about a defendant’s right to individualized and accurate sentencing.143)
140. Baker, Hobart, & Mittelsteadt, supra note 8.
141. What Is Overfitting?, IBM (Mar. 3, 2021), https://perma.cc/BN3F-GNKJ.
142. See id.
Risk in the other direction might produce an “outlier,” which occurs when the input is sufficiently distinct from the scenarios built into training sets to confuse the algorithm. The situation is analogous to sentencing for a crime that is not included in the Sentencing Guidelines and is not readily analogous to an existing offense. Outlier inputs could lead to unpredictable, potentially biased, and incorrect results. New crimes that do not fit the algorithmic model for assessing bail or recidivism risk, or for which there is an exceedingly small dataset, could produce a similar result.
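One common safeguard, sketched below with invented numbers, is to flag inputs that fall far outside the range of the training data before trusting the model’s output for them:

```python
# Illustrative out-of-distribution guard: flag inputs that lie far
# outside the training data before trusting a prediction about them.
# The feature values and threshold are invented for demonstration.
import statistics

training_inputs = [2, 3, 3, 4, 5, 4, 3, 2, 4, 5]   # invented feature values

mean = statistics.mean(training_inputs)
stdev = statistics.pstdev(training_inputs)

def is_outlier(x, threshold=3.0):
    """True if x lies more than `threshold` standard deviations from the mean."""
    return abs(x - mean) > threshold * stdev

print(is_outlier(4))    # False: well within the training distribution
print(is_outlier(40))   # True: the model has seen nothing like this
```

A flagged input does not mean the model is wrong, only that its prediction rests on no comparable experience and deserves no deference.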
For sentencing, the threshold for not using ML or throwing out algorithmic results could reasonably be low, as the alternative—human decision making—is already the standard. In any event, it would seem incumbent on the proponents of using such an algorithm to demonstrate its validity and explain its functioning, just as a judge should (and in some jurisdictions is required to) put reasoning for a sentence on the record, allowing appellate courts and the parties to understand what occurred and why.
With AI as with people, some bias is always present. But steps can and should be taken to minimize bias and mitigate its influence. Mitigating bias starts with the design of an AI system and the nature and quality of the algorithm and the data on which the algorithm is trained, tested, and validated. An additional mitigator is sound process—timely, contextual, and meaningful. For policy makers and engineers, “timely” means at each point where input can directly influence outcomes, i.e., at the conception, design, testing, deployment, and maintenance phases of AI development and use. Sound process includes asking whether AI should be used for a particular purpose in the first instance—that is, whether an AI tool should be developed or employed for a particular use. “Contextual” means specific to the tool and use in question and with actual knowledge of its purposes, capabilities, and weaknesses. “Meaningful” means independent, impartial, and accountable. Specifically, the person using or designing an application should validate its ethical design and use. If a particular community or group of people is likely to be affected by the use of the tool, designers and policy makers should consult with that community or group in deciding whether and how to develop, design, or use that tool.144 In addition, to the extent feasible, the system’s parameters should be known or retrievable. The system should be subject to a process of ongoing review and adjustment. The rules regarding the permissible use, if any, of social identifying descriptors or proxies should also be enunciated, clear, and transparent, and they should be subject to constitutional and ethical review.
143. See State v. Loomis: Wisconsin Supreme Court Requires Warning Before Use of Algorithmic Risk Assessments in Sentencing, Recent Case, 130 Harv. L. Rev. 1530, 1532–33 (Mar. 2017).
144. Baker et al., supra note 96.
Asking the right questions is crucial, not just for legal reasons but because the questions invariably underpin judgments about the reliability of the AI at issue. With the caveat that humans may not be able to understand or predict the behaviors of some AI systems with certainty or even reliably (which might counsel against using them at all or in particular contexts), here are some questions to ask to probe for bias:
Most algorithms are based on statistical prediction. In this sense, all algorithms are predictive. There exists a class of algorithms, however, that also seek to make predictions about future behavior based on past data. This happens all the time. Shopping algorithms seek to predict through data about prior purchases (past behavior) the predisposition of individuals to make additional purchases (future behavior). YouTube uses algorithms that seek to predict additional videos a viewer might watch to generate additional views. They are called “recommendation algorithms,” but what they do is push products to viewers using predictions about their future viewing behavior based on past viewing behavior.
145. Some AI models will not produce false positives or false negatives, e.g., a generative AI model that creates fictional or new artistic content.
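The underlying logic can be sketched in a few lines. The example below is an invented co-occurrence recommender, not any platform’s actual algorithm: it predicts future behavior purely from the recorded past behavior of similar users.

```python
# Minimal co-occurrence "recommendation" sketch: predict what a user may
# watch next from what users with overlapping histories watched before.
# The viewing histories below are invented.
from collections import Counter

histories = [
    ["cats", "dogs", "birds"],
    ["cats", "dogs"],
    ["cats", "cars"],
]

def recommend(watched):
    counts = Counter()
    for history in histories:
        if any(v in history for v in watched):      # a "similar" user
            counts.update(v for v in history if v not in watched)
    return counts.most_common(1)[0][0] if counts else None

print(recommend(["cats"]))  # "dogs": most common among similar users
```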
Judges may be called upon to determine whether particular algorithms that predict behavior, such as those in the criminal justice system, are constitutional or unconstitutional. As part of that determination, judges may need to examine whether the algorithm is accurate, and by what measure, and whether a particular application presents legal issues, as when, for example, a predictive algorithm embeds certain types of bias. Likewise, issues might occur because litigants are unwilling or unable to determine the parameters or datasets that informed an algorithm’s prediction and thus cannot reliably evaluate its accuracy as applied to the circumstances at hand.
Predictive algorithms are used, or might be used, in a variety of judicial and collateral settings. Some of the most frequently cited applications are in the criminal justice system. Many police departments and courts across the country use algorithmic risk assessments.146 Police use such tools to predict where crime might occur and by whom.147 Policing algorithms are not intended to predict individual conduct, though they might include the characteristics of individual actors in an area, like registered sex offenders. Proponents of such algorithms argue that algorithmic tools better focus finite police resources on areas where crime is most likely to occur, based on “neutral” data rather than the hunches, perceptions, or potential biases of police officers. The argument against them is at least twofold. First, such algorithms can generate their own reinforcing and circular logic. The algorithm predicts criminal conduct, police patrols are increased, and additional arrests occur, validating the accuracy of the algorithm. Second, the underlying data are not, in fact, neutral. Such algorithms may have a disproportionate racial and socioeconomic impact where they generate increased patrols in poorer neighborhoods with historically higher recorded crime rates and larger concentrations of people of color. In this way, they may also reflect past police practices and prosecutorial decisions focusing on communities and people of color. They may also have intentional racial impact to the extent they use “race” or socioeconomic status as predictive factors.
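The circular-logic critique can be made concrete with a toy simulation. Every number below is invented, and the model is far simpler than any deployed policing tool; it shows only the mechanism critics describe.

```python
# Toy simulation of the feedback loop: patrols follow recorded arrests,
# and recorded arrests scale with patrol presence, so a gap in *records*
# widens even though the true crime rate is identical in both districts.
recorded = {"district_a": 12, "district_b": 8}   # hypothetical arrest records
TRUE_CRIME_RATE = 10                             # same in both districts

for year in range(5):
    total = sum(recorded.values())
    for district in recorded:
        patrol_share = recorded[district] / total   # patrols follow records
        # arrests observed depend on how many officers are looking
        recorded[district] += round(TRUE_CRIME_RATE * 2 * patrol_share)

print(recorded)  # the records gap grows from 4 to 24 despite equal true crime
```

Each year’s new arrests appear to "validate" the prior year’s allocation, which is exactly the reinforcing logic described above.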
Some courts already use actuarial and/or algorithmic risk assessments in pretrial release, probation, and sentencing decisions; parole boards also make use of them.148 Some of these algorithmic risk assessments are capable of machine learning,149 a capacity that will increase with time. Private companies may develop the risk assessment algorithms; Northpointe developed the COMPAS system used by several states.150 These companies may not release the underlying code for the algorithms (or training, testing, and validation data) for defendants to test and challenge.151
Using risk assessment algorithms to make or inform liberty decisions creates potential Fifth and Fourteenth Amendment due process and equal protection issues. It may also create Sixth Amendment Confrontation Clause questions. To identify a few due process issues:
As with policing algorithms, any racial and other biases contained in or produced by risk assessments present equal protection issues. The adoption of risk assessment tools has caused controversy in this context,152 and there is a rich academic literature debating (generally challenging) the efficacy and fairness of these tools.153 However, there is yet little case law. Lawmakers or police departments using these tools might seek to replace, improve, or inform judicial decisions with “evidence-based”154 algorithmic recommendations, or to decrease the incarceration rate by releasing more people before trial and during probation.155 Critics argue that risk assessment tools not only have racially biased results but, through the ML process, may exacerbate racial inequalities in the criminal justice system. One significant concern is that using historical data to train risk assessment tools may only replicate and repeat at scale “the mistakes of the past.”156 Over one hundred civil rights groups issued a joint statement detailing their concerns with pretrial risk assessments.157 While the ideal of “evidence-based” practice may be appealing, the risk that an assessment tool may cause disparate treatment under a mantle of “data-driven” legitimacy warrants careful consideration. Whether any type of unbiased machine neutrality or fairness to all is possible is a matter of debate.158 Some scholars have concluded that technical flaws in criminal risk assessments, in particular, pose grave concerns and cannot be fixed.159
Academic commentary highlights several other risks of risk assessments: (1) that the existence of AI risk assessment tools will place pressure on courts to use the tools, whether they are accurate or not; (2) the psychological bias toward relying on empirical evidence more heavily than nonempirical evidence (“anchoring bias”); and (3) that “most judges are unlikely to understand algorithmic risk assessments,” and therefore may misuse them or give them inappropriate weight.160
146. Randy Rieland, Artificial Intelligence Is Now Used to Predict Crime. But Is It Biased?, Smithsonian Magazine (Mar. 5, 2018), https://perma.cc/88JS-VTS8; AI in the Criminal Justice System, Electronic Privacy Information Center, https://perma.cc/5PE6-9WQG.
147. Rieland, supra note 146.
148. AI in the Criminal Justice System, supra note 146; Brandon Garrett & John Monahan, Assessing Risk: The Use of Risk Assessment in Sentencing, 103 Judicature No. 2 (2019), https://perma.cc/BTC5-QWLE (describing and analyzing how some Virginia judges use and some do not use risk assessments at sentencing); see also Pennsylvania Commission on Sentencing, Guidelines and Statutes: Risk Assessment, https://perma.cc/VJ9S-H9BZ (“Act 95 of 2010 required the Commission adopt a sentence risk assessment instrument.”).
149. Danielle Kehl, Priscilla Guo, & Samuel Kessler, Algorithms in the Criminal Justice System: Assessing the Use of Risk Assessments in Sentencing, Responsive Communities Initiative (July 2017), https://perma.cc/82NU-2K9D.
150. AI in the Criminal Justice System, supra note 146.
151. See id.
152. For example, a 2016 ProPublica study determined that COMPAS was almost twice as likely to falsely identify a black person as a repeat violent offender as it was to falsely identify a white person as a repeat offender. The company contested this finding. Julia Angwin et al., Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And It’s Biased Against Blacks, ProPublica (May 23, 2016), https://perma.cc/K4ZP-2MP2. See also Sam Corbett-Davies et al., A Computer Program Used for Bail and Sentencing Decisions Was Labeled Biased Against Blacks. It’s Actually Not That Clear., Wash. Post (Oct. 17, 2016), https://perma.cc/8UMU-8EK6.
153. For an introductory overview of that literature, see A Letter to the Members of the Criminal Justice Reform Committee of Conference of the Massachusetts Legislature Regarding the Adoption of Actuarial Risk Assessment Tools in the Criminal Justice System (Feb. 9, 2018), https://perma.cc/W3W5-YQ7V.
154. Loomis v. Wisconsin, 881 N.W.2d 749, 759 (Wis. 2016), cert. denied, 137 S. Ct. 2290 (2017).
155. Derek Thompson, Should We Be Afraid of AI in the Criminal-Justice System?, The Atlantic (June 20, 2019), https://perma.cc/4DKC-PDDL.
156. Karen Hao, AI Is Sending People to Jail—and Getting It Wrong, MIT Tech. Rev. (Jan. 21, 2019), https://perma.cc/MG6X-HQF6.
157. The Use of Pretrial “Risk Assessment” Instruments: A Shared Statement of Civil Rights Concerns, https://perma.cc/MS7U-R6KC.
158. See, e.g., Christopher Bavitz et al., Assessing the Assessments: Lessons from Early State Experiences in the Procurement and Implementation of Risk Assessment Tools, Berkman Klein Center for Internet & Society 20–21 (Nov. 2018), https://perma.cc/55FS-DRFY; Craig Smith, Dealing with Bias in Artificial Intelligence: Three Women with Extensive Experience in A.I. Spoke on the Topic and How to Confront It, N.Y. Times (Nov. 19, 2019, updated Jan. 2, 2020), https://www.nytimes.com/2019/11/19/technology/artificial-intelligence-bias.html; Sandra Mayson, Bias In, Bias Out, 128 Yale L. J. 2218, 2233 (2019).
159. Martha Minow, Jonathan Zittrain, & John Bowers, Technical Flaws of Pretrial Risk Assessments Raise Grave Concerns, Berkman Klein Center for Internet & Society (July 17, 2019), https://perma.cc/PQY3-JK8A.
Several generalized arguments for and against the use of predictive algorithms for human behavior emerge. Proponents might argue that predictive algorithms can identify patterns and trends humans cannot see and can do so rapidly, if not instantaneously, thus curtailing additional risk or harm. They might argue that predictive AI rests on the premise, and some would say reality, that neither judges nor law enforcement personnel can reasonably predict conduct based on judgment and intuition alone. AI simply has more data and excels at statistics. In the courtroom, predictive AI could add data to judgments about risk assessment, informing decisions on bail, parole, and sentencing. Moreover, because AI is data driven, some argue that a well-designed algorithm could in theory be more neutral or objective than a human. While AI invariably contains bias, a particular application might, in theory, be less biased than a human subject to implicit or explicit bias. All of these assumptions can be contested, in the abstract as well as with reference to specific AI applications.
Opponents of predictive algorithms make the following arguments.
Western law and criminal procedure are premised on individualized suspicion. This means an individual should be investigated or prosecuted based on articulable facts about them, not patterns found in data about the past conduct of other persons who may simply share one or more social descriptors, or data about past police practices and prosecutorial decisions. This argument is also especially applicable in the sentencing context.
All algorithms are biased in some way by the choices their human designers make: what metrics are used to evaluate data to make predictions, what data the algorithm is trained on, and what data the algorithm is tested on. Further, algorithms reflect human bias and can multiply and magnify bias by repeating it at scale. One of the most common criticisms of criminal risk assessment tools, for example, is that they rely on historical records of arrests, charges, convictions, and sentences, though “[d]ecades of research have shown that, for the same conduct, African-American and Latinx people are more likely to be arrested, prosecuted, convicted, and sentenced to harsher punishments than their white counterparts.”161
Predictive algorithms focus on characteristics that are, at least purportedly, readily discerned and susceptible to data adaptation and recording. Classifications such as "race," gender, marital status, family status, address, and education likely play a disproportionate role in algorithm design and operation. Conversely, in operation or design, algorithms are less likely to include subjective weights, like role models and community connections and participation, that might also predict behavior and perhaps do so more accurately.
160. State v. Loomis: Wisconsin Supreme Court Requires Warning Before Use of Algorithmic Risk Assessments in Sentencing, Recent Case, supra note 143, at 1535; Ellora Israni, Algorithmic Due Process: Mistaken Accountability and Attribution in State v. Loomis, JOLT Digest (Aug. 31, 2017), https://perma.cc/9NDU-NHXX.
161. Minow et al., supra note 159, at 1.
Classifications used as factors in algorithmic predictions are subject to all the risk of bias, intended and unintended. Even when unintentional, this bias may infiltrate an application through training data, how computer engineers assign weights to factors, or the “learning” the AI does on the job.
Some factors do not account for variation or nuance. Factors that appear to be subject to yes/no answers, and thus data scoring, may in reality be more complex and fall along a continuum. “Race” and ethnicity are good examples, even if not used intentionally in predictive tools (one would hope), other than to counter historical or algorithmic bias. Even marital status, a seemingly objective data point, may fall on a contextual continuum ranging from stable to unstable, happy to unhappy. Depending on what an algorithm is intended to predict, nuance can make all the difference in outcomes.
All of these factors are compounded where there is an inability to understand or challenge the underlying algorithm. Judges, of course, will have to determine when algorithm transparency is required as a matter of law, including due process. Lack of transparency undermines the ability of judges and litigators to assess the accuracy and meaning of an algorithmic output by asking questions like:
We would encourage courts to look under the hood. Accurately understanding AI outputs often requires knowing not only the AI inputs but also the following: the weights that were attached and allocated to each input; the data the algorithm was trained, tested, and validated on; and an explanation of the methodologies used for prediction. Courts should ask:
If the moving party cannot answer these questions to the satisfaction of the court, or is not prepared to answer these questions, judges might well be skeptical about the reliability of the evidence proffered or the use of a judicial tool.
A variety of technologies offer litigation, drafting, and other tools for lawyers and non-lawyers. These technologies include: online legal advice tools; e-discovery tools; legal research; document management and creation; document- and case-level analytics; case outcome prediction (for forum and judge shopping); and
162. See Barabas et al., supra note 88, at 3.
lawyer-client matching.163 We anticipate others. As with behavioral risk assessments, not all are AI-based, but an increasing number rely and will rely on machine learning and other AI methodologies.164
Technology-assisted review (TAR), sometimes referred to as computer-assisted review (CAR) or “predictive coding,” is used for document review and production.165 (TAR is not a brand name but rather a collective acronym for this type of software.) With the first generation of TAR (referred to as TAR 1.0), a lawyer first reviews and labels a set of “seed” documents for issues such as privilege, relevancy, and responsiveness, and then the algorithm learns to do the same based on the human-inputted labels.166 The second generation (TAR 2.0) uses a similar model but with continuous active learning by the machine.167 “Lawyers can also use TAR to structure the case, identify parties to depose, develop strategic defenses, and [complete] other tasks related to the case.”168
Other examples of legal tools include those that provide automated legal advice, assist unrepresented litigants with legal proceedings, or draft documents, such as answers, discovery requests, motions, and simple briefs.169
The quality of any such tools will depend significantly on the datasets available for training them; where data is not publicly available, large law firms and other businesses may have better access to confidential data (with permission from clients), and therefore better tools at their disposal.170 This type of technology, however, does allow for the possibility of increasing access to justice for those who cannot afford traditional representation.171 It may also raise ethical issues about confidentiality, representation, competence, and diligence, as discussed earlier.172
Legal research tools also use machine learning, for example to do natural language searches instead of Boolean searches,173 or "analyze a draft argument to gain further insights or identify relevant authority that may have been missed."174 Many of these tools use natural language processing (NLP):
At a high level of abstraction, NLP aims to identify patterns in human language in ways that facilitate problem-solving. . . .
. . . The current research frontier, and a rapidly advancing one, is a mix of linguistics and "deep learning" (i.e., neural network) techniques. In a nutshell, deep-learning NLP machines make language computationally tractable by converting words, sentences, documents, or, in the legal context, entire cases into unique vectors, called "embeddings." Each vector can be envisioned as an arrow from the origin to a point that represents the item of interest in a large, n-dimensional space, its magnitude a function of the presence of words, case citations, indexing concepts, or other features. Once this vast vector space has been constructed and human-annotated labels affixed to training materials (again, words, sentences, documents, cases), a sophisticated machine learning model can manipulate the vectors mathematically using large numbers (on the order of billions) of calculations to model relationships between them. With sufficient data and computing power, the system's outputs enable a range of legal tasks, such as identifying relevant or privileged documents, past legal decisions that may be controlling, or, though . . . it is far trickier, the winning argument in a case.175
A more complex and forward-leaning NLP application is OpenAI's ChatGPT, a research app publicly released in late 2022 and, as of 2024, in a fourth iteration available publicly and by paid subscription (GPT-4 and ChatGPT Plus). GPT-5 was released in summer 2025, showing the rapidity with which generative AI is developing. ChatGPT and other products use large language models (LLMs) to generate language or text, "writing" sentences that are often difficult to distinguish from a human's. OpenAI, a company backed by Microsoft, opened ChatGPT to the public for use, in the process raising questions about work product and academic integrity across disciplines.176 As described above, ChatGPT can incorporate research into its products as well, but that research may not always be reliable. The further development of this type of AI (large language models) presents fundamental questions for the legal profession—about the thoroughness and transparency of research and work product, about the very writing of law—that go well beyond the scope of this reference guide.177 That said, many of the issues raised here, such as bias of all types and explainability and transparency, will shape future discussion.
163. David F. Engstrom & Jonah B. Gelbach, Legal Tech, Civil Procedure, and the Future of Adversarialism, 169 U. Pa. L. Rev. 1001, 1010–12 (2021).
164. See id. at 1015.
165. Unlocking the E-discovery TAR Blackbox, 102 Judicature No. 2, 63 (2018), https://perma.cc/F3CS-9LM3.
166. Kamron Sanders, What Is Technology-Assisted Review? (TAR), Nat'l L. Rev. (May 19, 2022), https://perma.cc/BQ4H-W7UP.
167. Id.; OpenText, Choosing the Right Technology-Assisted Review Protocol to Meet Objectives, https://perma.cc/4UBY-Z2NJ.
168. Sanders, supra note 166.
169. Engstrom & Gelbach, supra note 163, at 1012.
170. Id. at 1018.
171. Id. at 1013, 1039.
172. See id. at 1013.
173. Boolean searches use words such as "and," "or," and "not," and punctuation such as quotation marks to expand, limit, and qualify searches. Shauntee Burns, What Is Boolean Search?, New York Public Library (Feb. 2011), https://perma.cc/CL8E-4WP3. The 19th-century mathematician George Boole, like many AI designers today, was seeking to use mathematical principles and logic to derive greater meaning from data and do so more rapidly than with existing 19th-century constructs.
174. Matthew Stepka, Law Bots: How AI Is Reshaping the Legal Profession, ABA Bus. L. Today (Feb. 21, 2022), https://perma.cc/54JD-KUTG.
175. Engstrom & Gelbach, supra note 163, at 1020–21.
176. Tools Such as ChatGPT Threaten Transparent Science; Here Are Our Ground Rules for Their Use, Nature (Jan. 24, 2023), https://doi.org/10.1038/d41586-023-00191-1.
177. Engstrom and Gelbach's article, supra note 163, takes on many of these and other issues.
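The vector "embeddings" described in the NLP excerpt above can be illustrated with a deliberately simple stand-in. In the sketch below each document becomes a vector of word counts over a shared vocabulary (real legal-research tools learn dense neural embeddings instead), and cosine similarity ranks the cases closest to a query. The case texts and query are invented for illustration.

```python
# Simple stand-in for document embeddings: word-count vectors plus
# cosine similarity, which measures how closely two vectors point
# in the same direction in the shared space.
from collections import Counter
import math

def embed(text, vocab):
    """Map a document to a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    """1.0 means the vectors point the same way; 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

cases = {
    "A": "negligence duty of care breach damages",
    "B": "contract offer acceptance consideration breach",
    "C": "zoning variance municipal land use permit",
}
vocab = sorted({w for text in cases.values() for w in text.split()})
query = embed("breach of a duty causing damages", vocab)
ranked = sorted(cases, key=lambda k: cosine(query, embed(cases[k], vocab)), reverse=True)
# The negligence case ranks first: it shares the most query terms.
```

The point for courts is that "similarity" here is a mathematical artifact of how documents were vectorized, which is why the construction of the vector space is itself a fair subject of inquiry.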
If not in the form of an AI-generated brief (or hopefully and better, a human-generated brief aided perhaps by AI tools), a judge's consideration of AI at trial may start with discovery. Pretrial discovery is governed by the Constitution, case law, Federal Rule of Civil Procedure 26 (among other civil rules), and Federal Rule of Criminal Procedure 16. Rule 26 mandates that parties make certain initial disclosures, "without awaiting a discovery request," including "a copy—or a description by category and location—of all documents, electronically stored information, and tangible things that the disclosing party has in its possession, custody, or control and may use to support its claims or defenses; unless the use would be solely for impeachment."178 Additionally, with respect to expert testimony, "a party must disclose to the other parties the identity of any witness it may use at trial to present evidence under Federal Rule of Evidence 702, 703, or 705."179 Unless otherwise stipulated or ordered by the court, the disclosure must be accompanied by a written report providing "a complete statement of all opinions the witness will express" and "the facts or data considered by the witness in forming them." Rule 16 of the Federal Rules of Criminal Procedure states, inter alia, that upon request "the government must permit the defendant to inspect and to copy or photograph . . . documents [and] data . . . within the government's possession, custody, or control [if] (i) the item is material to preparing the defense; (ii) the government intends to use the item in its case-in-chief at trial; or (iii) the item was obtained from or belongs to the defendant."180 Documents and data might be argued to include documentation of algorithms and design and the data used to train, test, and validate AI models.
We forecast that as part of the discovery process at least four issues relating to AI will arise, and arise with frequency:
Timing, timeliness, and timelines: As noted above, Rule 26 provides for certain initial disclosures regarding documents and potential expert witness testimony, “without awaiting a discovery request.” “A party must make the initial disclosures at or within 14 days after the parties’ Rule 26(f) conference unless a different time is set by stipulation or court order.” We surmise that, in some contexts, 14 days may be an unrealistic period of time for a party to determine which data and how much to turn over where the party intends to rely on AI-generated outputs as
178. Fed. R. Civ. P. 26.
179. Fed. R. Civ. P. 26(a)(2). Federal Rules of Evidence 702, 703, and 705 address “testimony by expert,” “bases of an expert,” and “disclosing the facts or data underlying an expert,” respectively.
180. Fed. R. Crim. P. 16(a)(1)(E).
evidence. Consider a medical malpractice suit placing at issue a doctor’s use of AI imagery as a diagnostic tool. Litigators are not likely to voluntarily turn over data without first being ordered to do so following a discovery conference.
Interpretation and Scope of Data Discovery: The timeline will also be impacted by the uncertain scope of Rule 26 when it comes to how much of the underlying data is required to be disclosed where an expert witness relies on an AI output "or electronically stored data that may be used to support its claims or defenses." Judges, for example, will surely need to resolve on a case-by-case basis the parameters of "data considered by the witness in forming" an opinion when that opinion derives from the use of AI or AI outputs. Referring to our medical malpractice suit, for example, the parties are likely to dispute whether "data considered in forming an opinion" should include the data, or a characterization of the data, used to train, test, and validate an AI screening algorithm. An AI specialist could well argue that such data is necessary to determine whether the algorithm was trained to detect the disease in question, or to detect the disease in question with a comparative population and demographic base.
Similar interpretive issues are likely to arise in a criminal discovery context where, for example, the government intends to introduce AI outputs, such as a potential facial or voice recognition match placing a defendant at the scene of a crime. The challenge for courts and for discovery is that most facial or voice recognition tools are predictive rather than determinative in nature, and most are more accurate when more potential matches are identified (assuming that a true match is in the database). Restated, there are few Jason Bourne moments where the target is conclusively identified. The question for criminal discovery is just how much discovery a court should allow into the underlying AI tools based on the argument that any information—the algorithmic math, the training data, testing data, and validating process, for example—might undercut the predictive accuracy of the recognition tool in the case at hand and thus serve as potentially exculpatory evidence.
Supplementing Disclosures and Responses: Federal Rule of Civil Procedure 26 states that parties must supplement or correct their disclosure or response "in a timely manner if the party learns that in some material respect the disclosure or response is incomplete or incorrect." AI tools are often iterative and changing, like a generative AI that is absorbing and incorporating new data into its outputs. That may present two challenges, one on either side of the same coin. First, in what manner to delimit the duty to supplement where the ongoing use of an AI tool may reveal strengths and weaknesses that existed at the time the output evidence was generated. Here the qualifier "material" will carry considerable weight and likely draw judges into disputes over whether iterative AI changes are material to understanding AI outputs offered into evidence. Second, the iterative nature of AI may make it difficult to fix and assess the quality and accuracy of AI outputs at a moment in time, such as the time and date an AI tool was used to assess a medical condition or may have impacted the performance of an AI-empowered vehicle.
The duty of continuing disclosure in Federal Rule of Criminal Procedure 16 is drafted differently than Rule 26. Rule 16(c) states that: "A party who discovers additional evidence or material before or during trial must promptly disclose its existence to the other party or the court if: (1) the evidence or material is subject to discovery or inspection under this rule; and (2) the other party previously requested, or the court ordered, its production." Where the government relies on an AI-generated output as evidence, updated information associated with that AI might be "additional evidence or material." Where new information about an AI output in a criminal context might be viewed as exculpatory under Brady,181 it might be constitutionally required as well, as in the case of new information decreasing the probability that an AI facial recognition algorithm correctly identified the defendant as a suspect. Similarly, if an AI model itself were to be considered a government witness, disclosures of material evidence that might impeach its veracity or accuracy might be required under Giglio.182
Protective Orders, and Proprietary and Trade Secrets: One issue we forecast will be argued in court, but for which courts are already well empowered and equipped, is the protection of trade secrets embedded in AI in the form of algorithms, data, parameters, and weights. Some litigants using AI in court, such as the government in Loomis,183 have argued that they cannot disclose the underlying algorithms, parameters, and weights, even to the court, because the underlying information is proprietary. We are skeptical of such assertions, which may arise in a variety of civil and criminal contexts. Civil Rule 26 provides clear and broad authority for courts in civil litigation to issue protective orders "requiring that a trade secret or other confidential research, development, or commercial information not be revealed or revealed only in a specified way." Likewise, Criminal Rule 16 provides courts authority to issue protective and modifying orders: "[a]t any time the court may, for good cause, deny, restrict, or defer discovery or inspection, or grant other appropriate relief," and may do so ex parte. As discussed above, in the specific context of courts themselves using predictive algorithms for bail, probation, sentencing, or similar purposes, there are weighty due process, equal protection, and Confrontation Clause issues that might require public disclosure of algorithms and data, and might indeed suggest any use of such tools is unconstitutional. In other contexts, it would seem the moving party should either be prepared to disclose material information about the AI outputs it intends to use at trial (potentially subject to appropriate protective orders) or the court should decline to admit such outputs into evidence. Where there is room for reasonable debate, and for judicial discretion and ruling, is whether a particular parameter, dataset, or algorithm, for example, warrants protection or is in the public domain.
181. Brady v. Maryland, 373 U.S. 83, 87 (1963); Giglio v. United States, 405 U.S. 150, 154 (1972).
182. See Giglio, supra note 181, at 154.
183. State v. Loomis, 881 N.W.2d 749, 759 (Wis. 2016), cert. denied sub nom. Loomis v. Wisconsin, 137 S. Ct. 2290 (2017).
As evidentiary gatekeepers, judges will need to determine whether and when AI evidence will assist the factfinder and is admissible in court. The Federal Rules of Evidence and their state equivalents will help guide this determination. The Supreme Court’s Daubert,184 Crawford,185 and Carpenter186 cases may also inform the evidentiary questions presented by AI. Neither these cases nor the Rules, however, were written with AI in mind. And currently few federal or state cases or jury instructions address AI. Judges will, of course, interpret and apply these cases and rules to AI in the specific contexts presented and do so consistent with the law of the jurisdiction in which they preside.
We briefly discuss here how the Federal Rules of Evidence and the Daubert factors might apply to AI-generated evidence, not as a legal primer for an audience that needs no such explanation, but to highlight the complexity of applying the rules of evidence and case law to AI. We identify four threshold considerations courts should weigh when evaluating AI evidence, given its technological characteristics and iterative nature. Courts, in time, will use their knowledge of the law to address the myriad constitutional and evidentiary questions AI may raise.
As judges know, under Federal Rule of Evidence 401, evidence is relevant if “(a) it has any tendency to make a fact more or less probable than it would be without the evidence; and (b) the fact is of consequence in determining the action.”187 Rule 402 states that relevant evidence is admissible unless the Constitution, a federal statute, the other Federal Rules of Evidence, or other rules prescribed by the Supreme Court apply and would exclude the evidence.188 Due process or Confrontation Clause concerns, for example, might bar or limit certain AI evidence from admission. Statutes addressing data privacy and use may do so as well. Rule 403 allows a court to exclude relevant evidence if its probative value is substantially outweighed by a danger of: creating unfair prejudice; confusing
184. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993).
185. Crawford v. Washington, 541 U.S. 36 (2004).
186. Carpenter v. United States, 138 S. Ct. 2206 (2018).
187. Fed. R. Evid. 401.
188. Fed. R. Evid. 402.
the issues; misleading the jury; causing undue delay; wasting time; or needlessly presenting cumulative evidence.189
Many of the threshold evidentiary issues associated with AI will be litigated under Rules 402 and 403 or their state equivalents. Relevancy in most or all jurisdictions is broadly defined, and most AI applications are essentially tools for assessing probability, in theory, making them inherently relevant in assessing, under Rule 401, whether something is “more or less probable.” (Litigants might still argue that certain AI outputs are too inaccurate or biased to be relevant or that they are inapt for the purpose offered.) The primary issues, then, are (1) the reliability of AI generally and (2) the appropriateness of use in the context presented. Rules 402 and 403 are pertinent because the evidentiary use of AI will invariably present questions about discovery and due process, such as whether there is a right to access an underlying algorithm or data used to generate evidence or inform judicial decisions. Another issue is the risk that litigation over AI will present the figurative “trial within a trial” and potentially confuse the jury under Rule 403. Also, courts might apply Rule 403 to exclude AI evidence that is biased or otherwise unreliable. Inquiry is prudent; otherwise, juries may assume AI evidence has the imprimatur of “science” or “technology” in the context presented, potentially lending it false authority or undue weight, or permitting its use in a manner for which it was not intended.190
Judges will need to decide in what manner and to what extent to require authentication of the AI evidence offered and how, if at all, to validate its reliability. These criteria will likely bring Federal Rules of Evidence 702 and 902 into play, as well as Daubert and Crawford.
AI and the interpretation of AI outputs are complex. Courts will have to determine the appropriate means to verify AI outputs. This might involve expert testimony, or it might be done through technical means, such as cryptographic hashes embedded in an image at the time it is created. Courts will need to determine who is qualified to testify about the accuracy and fairness of an AI application. Of course, steady, purposeful, and consistent application of the Federal Rules of Evidence or their state equivalents on the record is a good place to start.
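The technical route mentioned above, a cryptographic hash embedded or recorded when an image is created, can be sketched briefly. The sketch assumes some trusted process recorded the SHA-256 digest at capture time; the byte strings are placeholders, not a real image.

```python
# Sketch of hash-based authentication: if a file's current SHA-256
# digest matches the digest recorded at capture time, the bytes have
# not changed since capture.
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the raw bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()

original = b"\x89PNG placeholder image bytes"
recorded_at_capture = fingerprint(original)  # stored when the image is made

# Later, a proponent (or the court) recomputes the digest and compares:
unaltered = fingerprint(original) == recorded_at_capture
tampered = fingerprint(original + b" edit") == recorded_at_capture
```

A matching digest shows only that the bytes are unchanged since the digest was recorded; it says nothing about whether the recording process itself was trustworthy, which remains a foundational question for testimony.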
It is intuitive, but worth remembering that proponents of AI-generated evidence will seek to simplify its admission by limiting or eliminating as many threshold foundational requirements as possible. Opponents of admission will seek to undermine its relevance and reliability in general or for the purpose for
189. Fed. R. Evid. 403.
190. Andrea Roth, Machine Testimony, 126 Yale L. J. 1972 (2017) (“Moreover, just as the Framers were concerned that factfinders would be unduly impressed by affidavits’ trappings of formality, ‘computer[s] can package data in a very enticing manner.’ The socially constructed authority of instruments, bordering on fetishism at various points in history, should raise the same concerns raised about affidavits.”) (internal citations omitted).
which it is offered. To challenge relevance and accuracy, opponents will seek access to the underlying algorithm, the data on which it was trained, tested, and validated, as well as knowledge of what occurs and what is weighted inside any machine-learning black box. Thus, courts will face a layered adjudicative challenge each time AI-generated evidence is offered.
Where AI outputs are admitted, opponents may seek to cross-examine the software engineers responsible for its design. Each AI application is different. It will have a different purpose, rely on a different algorithm, use a different machine learning methodology or methodologies, and will train, test, and validate using different data. Consequently, AI issues are generally not subject to resolution through the application of case-law precedent in the same way that, for example, DNA analysis is now widely accepted in court. A single or lead case will not likely resolve when to admit AI-generated evidence. Adjudication is to be expected for each application and in each context for which the application is offered as evidence. Further, because AI is a constellation of technologies and applications, it is not a single process or technology that can be validated once and generally adopted. In each instance where AI evidence is offered, there may be a legitimate need to explore the underlying technology and different components of that technology for use in that instance or for the proffered purpose. Courts should also consider the following analytical factors.
Courts should pay attention to whether a particular AI application is a good “fit”191 for the purpose for which it is proffered. Some criminal risk assessment algorithms, for example, are designed for the purpose of determining which individuals might benefit from alternatives to incarceration, such as parole or counseling. These algorithms might have less relevance and reliability if used to determine sentencing.192 That will depend on all the factors noted above, including the input factors, the weight assigned to those factors, the data on which the algorithm was trained, tested, and validated, and the nature of the confidence thresholds applied to the output. Courts should pause and ask not only whether the AI at issue is relevant and material to the matter before the court but for what purpose the AI was specifically designed and whether the outputs will materially and fairly inform the fact finder. Courts should inquire about the contents of the datasets on which the algorithm was trained, tested, and validated.
191. See Daubert, 509 U.S. at 591–92.
192. See Christopher Bavitz et al., supra note 158, at 6–7 (discussing the Wisconsin Supreme Court's warning in State v. Loomis, 881 N.W.2d 749 (Wis. 2016) that the risk assessment tool COMPAS was not developed for use at sentencing).
Even when an algorithm is being used for the purpose for which it was designed, there may be data or design reasons why output reliability will decrease in a specific context. An AI algorithm may have been designed for and tested on a population substantially different from the population for which the output is offered, with less accurate results than the lab-tested confidence threshold193 ("inappropriate deployment" as discussed above in the section titled "AI Is Biased"). As previously noted, according to the FBI, the Bureau's facial recognition application has an 86% match rate (confidence threshold) when an input image is compared to at least fifty potential output matches drawn from state license databases. However, the same algorithm would not have the same match accuracy if run against a different input demographic, say, the population of another country—not because the algorithm is necessarily intentionally biased but because it has not been trained against a comparative population pool. (In fact, output disparity across gender and ethnicity has been an issue with some facial recognition algorithms.194 As a result, facial recognition accuracy has been a focal point of AI design initiatives, and therefore we anticipate future American facial recognition applications will help mitigate this issue.)
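The candidate-list behavior described above can be made concrete with a toy sketch: a recognition tool returns a ranked list of gallery candidates with similarity scores, not a single conclusive identification, and which candidates are "flagged" depends on the threshold chosen. The names, scores, and the 0.75 threshold below are invented for illustration; real systems set thresholds based on validation testing.

```python
# Toy sketch: a recognition tool reports a ranked candidate list, and a
# configurable confidence threshold decides which entries count as matches.
def candidate_matches(scores, threshold=0.75, k=5):
    """Rank gallery entries by score; flag those at or above the threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(name, s, s >= threshold) for name, s in ranked]

gallery_scores = {"candidate_a": 0.91, "candidate_b": 0.78, "candidate_c": 0.62}
report = candidate_matches(gallery_scores)
```

Lowering the threshold or enlarging the gallery changes who gets flagged, which is one reason the threshold and gallery composition are themselves fair subjects of discovery.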
There is a risk with ML that a neural network will rely on inapt factors in making its output predictions. Judges will want to know whether this is possible and, if so, regarding which factors, before allowing a jury to assess the weight of AI evidence or before judges use an algorithm themselves to assess bail or recidivism risk. For example, a judge would want to determine, consistent with case law and the Constitution, which factors were included and weighted within any AI-driven bail, parole, confinement, or sentencing tool, to ensure that inappropriate, inapt, or unconstitutional factors were not included and that any factors, even if appropriate, were not given undue weight by the neural network. A judge would also want to know if any factors might be working as proxies195 for suspect categories.
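A short sketch can show how a proxy operates. The records below are fabricated for illustration: a facially neutral feature (residence in a given zip code) correlates strongly with membership in a protected group, so a model given the zip code effectively receives the protected attribute indirectly.

```python
# Fabricated records illustrating a "proxy" variable: the zip-code
# feature carries much of the protected attribute's information.
def correlation(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

in_zip_z = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = lives in zip code Z (hypothetical)
in_group = [1, 1, 0, 0, 0, 0, 1, 0]  # 1 = member of protected group G
r = correlation(in_zip_z, in_group)  # strongly positive here
```

A judge need not compute correlations personally; the point is that excluding a suspect category from a model's inputs does not exclude its influence if a correlated feature remains.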
193. See Barabas et al., supra note 88, at 3, and Bavitz et al., supra note 158, at 7.
194. NIST, NIST Study Evaluates Effects of Race, Age, Sex on Face Recognition Software: Demographics study on face recognition algorithms could help improve future tools (Dec. 19, 2019), https://perma.cc/NKN2-PLDA.
195. See Barabas et al., supra note 88.
Courts will also want to investigate the ways in which a given AI application is biased, or may be biased, before admitting its outputs into evidence or relying on it to inform a judicial decision, as discussed earlier.
One way to conceptualize AI and appreciate its evidentiary complexity in relation to other technology is to apply the (non-exhaustive) list of factors the Supreme Court developed in Daubert v. Merrell Dow Pharmaceuticals, Inc.196 to determine whether expert testimony based on a specific, scientific methodology should be admitted.197 Daubert and, in certain states, its predecessor, Frye v. United States,198 govern the admission of expert testimony based on scientific methodology. Daubert uses a “factors” approach, while Frye uses a “general acceptance” standard. Each of the Daubert factors opens wide the door to debate over many AI attributes. Analysis under Frye, too, is likely to be complicated, requiring judges to determine when a scientific method is “sufficiently established to have gained general acceptance in the particular field to which it belongs.”199 In theory, making such a determination will entail examining not only the specific algorithm and use in question but also identifying the relevant field of acceptance and what acceptance means for something like facial recognition or behavior prediction—all in a context where algorithms may be iterative and/or the data they are trained on are iterative and changing.
The Daubert factors include:200
- whether the theory or technique can be (and has been) tested;
- whether it has been subjected to peer review and publication;
- the known or potential rate of error;
- the existence and maintenance of standards controlling the technique’s operation; and
- whether the technique has attained general acceptance within the relevant scientific community.
With AI, these factors would need to be applied to individual algorithms and applications rather than “AI” generally, which, as stated above, generically
196. Daubert, 509 U.S. at 579.
197. We are not aware of a court having done so at the time this reference guide was drafted.
198. Frye v. United States, 293 F. 1013 (D.C. Cir. 1923).
199. Id. at 1014.
200. Daubert, 509 U.S. at 593–95.
describes a constellation of technologies and methodologies. We address each factor in turn.
The first step suggested by Daubert is to identify the theory, technique, or component that is subject to evaluation. With AI, there are many options. Is it the sensor(s) that fed data to the AI system? The algorithm? The math behind the algorithm? The dataset used to train the algorithm? The training methodology? Or is it the system as an integrated whole that is subject to review?
The second step is to decide what test is appropriate and what baseline to use to establish accuracy. Medical diagnostic AI, for example, might be compared to physician-diagnosed outcomes. It is true that medical diagnostics are subject to social influence and human and machine bias. But in medicine there is often a fixed data point, an established fact or yes/no answer to whether a disease or tumor is present, against which testers can measure the algorithm’s accuracy.
In contrast, an algorithm intended to predict future behavior, such as a criminal assessment tool, cannot be tested with the same degree of scientific or evidence-based meaning, given the weight placed on social factors. Recidivism algorithms, for example, attempt to predict future human behavior, using circumstantial factors drawn from a base population. In such contexts, there is no certain result and no control group, and confirming predictions is difficult. Human circumstances are endlessly complex, creating multiple influences on behavior—without necessarily determining behavior. Nor is there a way to verify, after an individual has been jailed or sentenced, how an individual’s future behavior is affected by imprisonment. The experience of imprisonment itself might turn a person toward or away from future crime, making it difficult or impossible to verify the machine’s prediction. In short, predictive algorithms in the criminal context are especially difficult to test, to peer review, and to assess for accuracy and error rates.
An example innovation in AI-enabled medicine highlights the question of machine reliability and illustrates the importance of peer review. In April 2019, NPR reported that Stanford computer scientists had created an algorithm for reading chest X-rays to diagnose tuberculosis.201 They hoped to use it to diagnose the disease in HIV patients in South Africa, and the machine’s results were
201. Richard Harris, How Can Doctors Be Sure A Self-Taught Computer Is Making The Right Diagnosis?, NPR (Apr. 1, 2019), https://perma.cc/Y2WM-FF4N.
already better than doctors’.202 To corroborate their success, the Stanford scientists submitted their results to other scientists for review.203 One noticed a peculiarity in the AI’s decision making.
[The peer reviewers] Zech and his medical school colleagues discovered that the Stanford algorithm to diagnose disease from X-rays sometimes “cheated.” Instead of just scoring the image for medically important details, it considered other elements of the scan, including information from around the edge of the image that showed the type of machine that took the X-ray. When the algorithm noticed that a portable X-ray machine had been used, it boosted its score toward a finding of TB.
Zech realized that portable X-ray machines used in hospital rooms were much more likely to find pneumonia compared with those used in doctors’ offices. That’s hardly surprising, considering that pneumonia is more common among hospitalized people than among people who are able to visit their doctor’s office.
“It was being a good machine-learning model and it was aggressively using all available information baked into the image to make its recommendations,” Zech says. But that shortcut wasn’t actually identifying signs of lung disease, as its inventors intended.204
The machine was making a correlational, rather than causal, connection between the use of a portable machine and TB. Without informal peer review, humans might not have discovered that aspect of how the AI algorithm was making decisions, a clear example both of how AI adapts and of how it often does so inside the black box. The TB-scan example also demonstrates the disruptive role of unknowns, here an unwitting algorithmic bias (“inappropriate focus”). The original programmers evidently did not anticipate that the machine would teach itself to evaluate information beyond the scan itself. It is impossible for a programmer to anticipate every real-world factor a machine will encounter and attempt to interpret.
Peer review may be particularly challenging with AI. For example, it is hard to imagine a credible peer review that does not include access to the underlying algorithm, parameter weights, and the data on which an algorithm was trained, tested, and validated. How else could a peer validate the accuracy and use of an application? Moreover, peer review should be case or use specific. An AI that is good at predicting the need for substance abuse counseling may not be suited for predicting other types of risk or need. The usefulness of any peer review will also vary in accordance with the talent and capability of the individuals or entities conducting the peer review.
202. Id.
203. Id.
204. Id.
Judges will also need to ask the right questions to determine whether error rates are accurate and meaningful. For example, will or might error rates vary depending on whether the AI application is tested and reviewed using the relevant local population (database) to which it will be applied, as opposed to a national population, or perhaps a more idealized lab database?205 What types of bias might be affecting the accuracy of any reported error rates? (See the suggested questions under “Probing for Bias,” above.)
AI imposes operational and maintenance obligations. At this point in time, however, operational standards, if any, are set voluntarily. The Intelligence Community and the Department of Defense have each published principles for the ethical use of AI, while many companies have their own internal standards. In the absence of uniform statutory standards, courts might begin by asking: What dataset is used? Is that dataset updated appropriately? Is the machine learning monitored by continued testing against known results to ensure the machine is not learning bad habits? Courts might also ask all the questions about bias suggested in “Probing for Bias,” above.
Courts will also need to determine what widespread acceptance within the relevant scientific community means in the context of AI. There is a big difference between general acceptance of the field and acceptance of a specific application. Many computer engineers and government actors accept the premise and use of facial recognition, but privacy advocates do not. Skepticism will attach to any specific application. The point is also illustrated by driverless cars: general acceptance of the concept has not yet translated into acceptance of any fully autonomous model as ready for commercial sale and public use. What, then, would constitute appropriate general acceptance?
How does one test the accuracy or conduct a peer review of a proprietary algorithm or an iterative or evolving ML algorithm? Google is not likely to disclose
205. See Barabas et al., supra note 88, at 3, and Bavitz et al., supra note 158, at 7.
its search algorithm for public or peer inspection and risk its market position in the search engine arena. Unless courts can demonstrably protect such trade secrets while also testing their validity, applying the Daubert factors to many or most AI applications in open court may be difficult. (Jurists or lawmakers206 may determine defendants or the public should have access to certain underlying algorithms and data, such as in instances where liberty interests are at stake. Courts will need to determine whether the Fifth, Sixth, and Fourteenth Amendments require it.)
In other contexts, where courts seek to allow litigants to test the validity of AI applications while still protecting proprietary information, they might exercise their general power to oversee how evidence is entered, to enforce rulings, and to seal records. A parallel can be found in the way classified information is protected while still allowing certain litigation to proceed, with records reviewed by judges and sometimes by cleared counsel. Also relevant is 18 U.S.C. § 1835, enacted as part of the Economic Espionage Act of 1996 and strengthened by the Defend Trade Secrets Act of 2016, which specifically directs federal courts to protect trade secrets in proceedings arising under Title 18 of the U.S. Code. Section 1835(a) states,
In any prosecution or other proceeding under this chapter, the court shall enter such orders and take such other action as may be necessary and appropriate to preserve the confidentiality of trade secrets, consistent with the requirements of the Federal Rules of Criminal and Civil Procedure, the Federal Rules of Evidence, and all other applicable laws.
In context, specific statutes also provide intellectual property protections for AI, such as those protections found in § 705 of the Defense Production Act (DPA), which allow the president in the first instance and courts in the second instance, through the power of contempt and jurisdiction found in § 706, to protect intellectual property relevant to DPA enforcement or defend against DPA actions.
AI’s capacity to convert symbolic language (coded numbers) into natural language and to discern, recognize, and formulate patterns at the pixel level makes it a tool of choice not only for identifying voices and pictures but also for mimicking voices and altering images. Moreover, AI can do so with real-life
206. For example, an Idaho law, Section 19–1910 of the Idaho Code, states, “All pretrial risk assessment algorithms shall be transparent, and all documents, records, and information used to build or validate the risk assessment shall be open to public inspection, auditing, and test. No builder or user of a pretrial risk assessment algorithm may assert trade secret or other protections in order to quash discovery in a criminal matter by a party to a criminal case.” https://perma.cc/J3QQ-GTZ4.
precision, creating images or recordings known as “deepfakes.” Hollywood has, of course, known about deepfakes for years, though in movies they’re called “special effects,” as in Star Wars or Forrest Gump. What makes deepfakes noteworthy for courts is not only the lifelike quality attainable but the accessibility of this capability to the general population. Tools now readily available on the internet allow non-specialists to alter photographs and mimic speech with startling realism, capable of fooling practically everyone—including triers of fact.
Deepfakes today are often created with “generative adversarial networks,” or GANs. GANs “are two-part AI models consisting of a generator that creates samples [e.g., of images, film, or audio] and a discriminator that attempts to differentiate between the generated sample and real-world samples.”207 The discriminator (or sometimes multiple discriminators) provides feedback to the generator that helps it improve its realism.208
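The generator-discriminator feedback loop can be illustrated with a deliberately simplified sketch. This is hypothetical toy code, not how production GANs are built: real GANs are neural networks trained by gradient descent, and the “discriminator” here is a fixed scoring rule rather than a second learned model.

```python
import random

random.seed(0)

REAL_MEAN = 10.0  # "real-world" samples cluster around this value

class Generator:
    """Produces fake samples and adjusts itself based on feedback."""
    def __init__(self):
        self.mean = 0.0  # starts far from the real distribution

    def sample(self):
        return self.mean + random.uniform(-0.5, 0.5)

    def improve(self, feedback):
        # Nudge output in the direction the discriminator found more realistic.
        self.mean += 0.1 * feedback

def discriminator(x):
    # Scores how "fake" a sample looks: its distance from the real cluster.
    # (In a true GAN this is itself a trained neural network.)
    return REAL_MEAN - x

gen = Generator()
for _ in range(500):
    fake = gen.sample()
    gen.improve(discriminator(fake))  # the adversarial feedback loop

print(round(gen.mean))  # after training, the generator mimics the real data: 10
```

The point of the structure, as in a real GAN, is that the generator never sees the real data directly; it improves only through the discriminator’s feedback.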
Deepfakes can have artistic and educational value, such as bringing historical figures back to life on film, but they can also serve nefarious purposes, such as the creation of non-consensual pornography and threats to national security.209 Deepfake technology’s “accessibility [by the public], usability, and the verisimilitude of its outputs” will only improve over time; meanwhile, the U.S. government, academia, nonprofits, and industry have all sought to improve technology for detecting deepfakes,210 creating a bit of an arms race.
In addition to improving deepfake detection, technologists have also sought to improve methods for authenticating genuine records.211 There are tools and methods to authenticate images, such as using a cryptographic hash, a numerical value based on the zeros and ones in the digital sequence of an image.212 For example, The Economist reports on an app, eyeWitness to Atrocities, developed by a London charity:
When a photo or video is taken by a phone fitted with that app, it records the time and location of the event, as reported by hard-to-deny electronic
207. Kyle Wiggers, Generative Adversarial Networks: What GANs Are and How They’ve Evolved, VENTUREBEAT (Dec. 26, 2019), https://perma.cc/LLR4-G5LP; see also Riana Pfefferkorn, “Deepfakes” in the Courtroom, 29 B.U. Pub. Int. L.J. 245, 249 (2020), https://perma.cc/9T7R-3GAU.
208. Wiggers, supra note 207.
209. Danielle K. Citron & Robert Chesney, Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security, 107 Cal. L. Rev. 1753 (2019); Rebecca A. Delfino, Pornographic Deepfakes: The Case for Federal Criminalization of Revenge Porn’s Next Tragic Act, 88 Fordham L. Rev. 887 (2019).
210. Pfefferkorn, supra note 207, at 250.
211. Id. at 251.
212. See NIST Special Publication 800-152, A Profile for U.S. Federal Cryptographic Key Management Systems (Oct. 2015), https://perma.cc/45Q6-K3ZQ. This defines “hash function” as “[a]n algorithm that computes a numerical value (called the hash value) on a data file or electronic message that is used to represent that file or message, and depends on the entire contents of the file or message. A hash function can be considered to be a fingerprint of the file or message.”
witnesses such as GPS satellites and nearby mobile-phone towers and Wi-Fi networks. This is known as the controlled capture of metadata, and is more secure than collecting such metadata from the phone itself, because a phone’s time and location settings can be changed. Second, the app reads the image’s entire digital sequence (the zeros and ones which represent it) and uses a standard mathematical formula to calculate an alphanumeric value, known as a hash, unique to that picture.213
The metadata is sent to a secure server, separate from where the image is stored. Later, the image can be examined, and any difference from the recorded hash value would indicate an alteration in the image.214 Such a process depends on calculating and recording the numeric value at the time of origin in a chain-of-custody manner. In short, there are different and developing mechanisms to tag social media posts, pictures, and other data. As discussed below, one question is when courts should require factual as well as legal authentication before admitting images, videos, or voice recordings into evidence.215
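The hash-and-compare process just described can be sketched in a few lines. The image bytes below are a stand-in; the hashing and comparison mechanics are standard.

```python
import hashlib

# At capture: compute a SHA-256 hash of the image's raw bytes and
# record it separately from the image (e.g., on a secure server).
original = b"\x89PNG...stand-in for a photo's raw bytes..."
hash_at_capture = hashlib.sha256(original).hexdigest()

# Later, in litigation: recompute the hash of the proffered file.
# Any alteration, even a single byte, produces a different hash value.
def verify(file_bytes, recorded_hash):
    return hashlib.sha256(file_bytes).hexdigest() == recorded_hash

print(verify(original, hash_at_capture))            # True: file unaltered
print(verify(original + b"\x00", hash_at_capture))  # False: file was changed
```

Note that the comparison proves only that the file matches what was hashed at capture; the evidentiary value still depends on the integrity of the capture and storage process itself.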
For all these reasons, deepfakes may present evidentiary challenges to civil and criminal justice systems—for example, if the video or audio recording is a deepfake used to show guilt or innocence, or liability or lack thereof. A litigant might also proffer a deepfake as evidence without realizing it is a deepfake.216 Scholars have also expressed concerns that a jury’s general awareness of deepfakes, or a litigant’s false defense that something is a deepfake, may also cause a jury to discount a genuine recording and generally undermine faith in the justice system.217
Under Federal Rule of Evidence 901, “To satisfy the requirement of authenticating or identifying an item of evidence, the proponent must produce evidence sufficient to support a finding that the item is what the proponent claims it is,” with the jury or trier-of-fact ultimately determining the weight of the evidence.218 Scholars identify some risks, including:219
- the admission into evidence of deepfakes that go undetected;
- the rejection of genuine recordings that a litigant falsely claims are deepfakes; and
- a general juror skepticism of audiovisual evidence that undermines faith in the justice system.
213. See Proving a Photo Is Fake Is One Thing. Proving It Isn’t Is Another, The Economist (Jan. 9, 2023), https://perma.cc/2F8T-JMER.
214. Id.
215. Media outlets and NGOs like Bellingcat have developed the use of AI applications that can help identify the location of a photo or video as well as the date and time it was taken. There are also AI apps that reverse search photos to match those photos with historical or social media photos to verify locations and personal identities. There are AI apps as well that can colorize black and white photos to help match them against colored photos. Such tools have proven useful in documenting and validating the documentation of war crimes committed by Russian soldiers and units in Ukraine. See Foeke Postma, Using New Tech to Investigate Old Photographs, Bellingcat (Aug. 9, 2022), https://perma.cc/T7E5-G89N.
216. Pfefferkorn, supra note 207, at 255.
217. Id.; see also Rebecca Delfino, Deepfakes on Trial: A Call to Expand the Trial Judge’s Gatekeeping Role to Protect Legal Proceedings from Technological Fakery, 74 Hastings L. J. 293 (2023), https://perma.cc/5URU-HBUA.
218. John P. LaMonaga, A Break From Reality: Modernizing Authentication Standards for Digital Video Evidence in the Era of Deepfakes, 69 Am. U. L. Rev. 1945, 1963 (citing United States v. Branch,
Courts and legislatures might examine whether existing rules of evidence for authentication and expert witness testimony are sufficient to address the issue of deepfakes. Scholars diverge on this point.220 Scholars and practitioners tend to agree, however, that trials will require expert testimony by digital forensic specialists as well as increased litigation over the authenticity and veracity of photographic, video, and audio evidence.221
Federal Rule of Evidence 902(13) and (14), adopted in 2017, set forth “certified records generated by an electronic process or system” and “certified data copied from an electronic device, storage medium, or file,” respectively, as “items of evidence that are self-authenticating.” Items with cryptographic hashes, discussed above, might be authenticated under these rules. The committee notes on the 2017 amendments suggest that 902(13) and (14) were adopted to avoid “the expense and inconvenience of producing a witness to authenticate an item of electronic evidence” only to have the other party stipulate to authenticity or fail to challenge the authentication testimony. The amendments do, however, provide for a party to challenge self-authentication: “the proponent must give an adverse party reasonable written notice of intent to offer the record—and must make the record and certification available for inspection.”222 A party suspicious of a deepfake would have “fair opportunity to challenge” the record and certification.223 Again, digital forensic experts would be in demand.
970 F.2d 1368, 1370 (4th Cir. 1992)); Naciye Celebi, Qingzhong Liu, & Muhammed Karatoprak, A Survey of Deep Fake Detection For Trial Courts, https://perma.cc/42BY-QEKJ.
219. Pfefferkorn, supra note 207, at 250; see also Delfino, supra note 217, at 330–35.
220. See Pfefferkorn, supra note 207, at 259–75 (rules suffice), and Delfino, supra note 217, at 332–48 (suggesting changes).
221. Id.
222. Fed. R. Evid. 902(11), which is incorporated in the quoted part by reference in Fed. R. Evid. 902(13) and (14).
223. Id.
The multidisciplinary AI field presents a myriad of complex evidentiary challenges. The law rarely, if ever, keeps pace with technology. The legislative and appellate processes simply do not move at the same pace as technological change and could not if they tried. Moore’s Law is faster than case law. Likewise, scholars and commentators are currently better at asking questions than they are at answering them. Artificial Intelligence itself is a fast-moving field encompassing a constellation of technologies.
Judges and lawyers do not need to be mathematicians or coders to understand AI and to wisely adjudicate the use of AI in courts or by courts. Judges need to define and understand their roles as overseers of discovery, evidentiary gatekeepers, constitutional guardians, translators, and in some cases, potential AI consumers. Perhaps the most important thing courts can do is ask careful and informed questions. Judges should also put their analysis and application of the answers on record as to whether, why, how, and subject to what evidentiary standards and determinations AI has been admitted into evidence or used by courts. This will allow full and informed appellate review. We hope this reference guide provides a foundation judges can use for understanding AI and building a common law of AI.
This glossary is borrowed from our publication An Introduction to Artificial Intelligence for Federal Judges,224 co-authored by Matthew Mittelsteadt.
algorithm. “[A] step-by-step procedure for solving a problem or accomplishing some end.”225 A familiar example is a recipe, which details the steps needed to prepare a dish. In a computer, an algorithm is implemented in computer code and details the discrete steps and calculations a computer needs to implement to complete a task. An algorithm is the “engine” an AI uses to “think” and make predictions. In the field of AI, the term “algorithm” is often used synonymously with “computer program.” A program, however, is a more specific term, referring to an algorithm written in computer code and packaged for execution.
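To make the recipe analogy concrete, here is a trivial algorithm expressed as a short program (a hypothetical illustration), each line a discrete step the computer follows:

```python
def average(numbers):
    """Compute an average: a step-by-step 'recipe' in code."""
    total = 0
    for n in numbers:              # step 1: add up every value
        total += n
    return total / len(numbers)    # step 2: divide the sum by the count

print(average([2, 4, 6]))  # 4.0
```

Once written in computer code and packaged for execution, as here, the algorithm is a program.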
algorithmic bias. According to McKinsey, “[w]hile ‘bias’ can refer to any form of preference, fair or unfair,” undesirable AI bias is bias that leads to “discrimination against certain individuals or groups of individuals based on the inappropriate use of certain traits or characteristics.”226 Because AI is designed by humans, AI will always be biased and will assume the biases of its engineers, potentially leading to poor or discriminatory results. As noted throughout this guide, causes of bias range from the statistical, such as errors in model design, to the social, to the use of data inapt to the context presented. Bias can also be caused by incomplete datasets. For instance, a facial recognition AI trained only on male faces may perform poorly when analyzing female faces.
artificial general intelligence (AGI). In the future it is possible we will move past narrow AI and develop artificial general intelligence that does not have a narrow function and can serve multiple purposes. AGI can be conceived of as an AI system that equates to or surpasses the general-purpose intelligence of the human brain.227 AGI does not have a precise definition, and the line that divides it from narrow AI is grey. As such, the introduction of AGI will likely be a gradual process, and it is unlikely there will be a precise “Sputnik moment” that introduces the age of AGI.
artificial intelligence (AI). There is no agreed upon or general definition of AI. However, one practical definition is that artificial intelligence is any machine
224. James Baker, Laurie Hobart, & Matthew Mittelsteadt, An Introduction to Artificial Intelligence for Federal Judges (Federal Judicial Center 2023), https://perma.cc/7DJT-T9D2.
225. Merriam-Webster, Definition of “algorithm” (May 20, 2021), https://perma.cc/82FG-EJRD.
226. Jake Silberg, Notes from the AI Frontier: Tackling Bias in AI (and in Humans), Mckinsey Global Institute (June 2019), https://perma.cc/K8Z8-WUAE.
227. What Is Artificial Intelligence (AI)?, IBM, supra note 68.
that can “perform tasks that would otherwise require human intelligence.”228 The NSCAI expands on this idea, noting “AI is not a single piece of hardware or software, but rather, a constellation of technologies that gives a computer system the ability to solve problems and to perform tasks that would otherwise require human intelligence.”229 AI is implemented in computers as algorithms based on models such as artificial neural networks (ANN), which are often iteratively designed through methods such as machine learning (ML) or deep learning (DL). One statutory definition of AI is found in the National Defense Authorization Act for Fiscal Year 2021, which states:
The term “artificial intelligence” means a machine-based system that can, for a given set of human-defined objectives, make predictions, recommendations or decisions influencing real or virtual environments. Artificial intelligence systems use machine and human-based inputs to—
- perceive real and virtual environments;
- abstract such perceptions into models through analysis in an automated manner; and
- use model inference to formulate options for information or action.
Section 5002(3) of the National Artificial Intelligence Initiative Act of 2020 (Division E of the National Defense Authorization Act for Fiscal Year 2021).
artificial neural network (ANN). The model (or “tool”) used in deep-learning AI, best defined as a computer system that seeks to achieve intelligence through a network structure simulating the human brain.230 An ANN analyzes data by passing it through multiple layers of artificial neurons that sift through and decipher the data. This layered network structure allows the system to analyze discrete data elements, draw connections between discovered data patterns, and ultimately derive meaning and form predictions. Neural networks can be wide, meaning each layer has large numbers of neurons, or deep, meaning data must pass through many layers of neurons before a final conclusion is drawn. Engineers determine the width and depth of the network based on their interpretation of the tools and structures a specific AI application needs for success.
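The layered structure described in this entry can be sketched numerically. This is an illustrative toy with random, fixed weights; in a real network the weights would be learned through training.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, weights, biases):
    # One layer of artificial neurons: weighted sums plus a nonlinearity.
    return np.maximum(0, x @ weights + biases)

x = rng.normal(size=(1, 4))                    # one input with 4 features
w1, b1 = rng.normal(size=(4, 8)), np.zeros(8)  # hidden layer, 8 neurons "wide"
w2, b2 = rng.normal(size=(8, 1)), np.zeros(1)  # output layer

hidden = layer(x, w1, b1)   # data passes through the hidden layer...
output = hidden @ w2 + b2   # ...and emerges as a single prediction
print(output.shape)  # (1, 1)
```

Adding more neurons per layer widens the network; stacking more `layer` calls deepens it.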
autonomous systems. AI-controlled machines and vehicles such as driverless cars and aerial drones that can operate and make decisions with little or no human control. Such systems already exist. In most cases, stringent safety demands have forestalled widespread use.
228. Baker, supra note 37, at 21.
229. NSCAI, supra note 1, at 8.
230. What Is a Neural Network?, IBM, supra note 110.
black box. A term used to describe the often-mysterious nature of AI decision making and the problem of AI explainability.231 Most AI programs write their own algorithms through machine learning, which can result in complex code and decision-making processes indecipherable even to engineers. Such complexity limits human ability to understand how an AI makes decisions, and what factors, including biases, may have influenced those decisions. However, considerable research is underway by organizations such as the National Institute of Standards and Technology (NIST) to enable more transparent neural networks, which may allow judges and lawyers to more fully understand the parameters and weights applied within.232
confidence score. Any expression of certainty in the predictive accuracy of an AI or ML application.233 AI applications are imperfect and offer approximate results, decisions, or predictions that can be provided with a level of confidence. Few, if any, results an AI produces should be treated as certainties. For example, the FBI facial identification software mentioned in the introduction to this guide is not designed or intended to match a single identity with a face. Rather, it offers the user a range of potential matches based on potential pattern similarities or matches. The algorithm is reported to be accurate 86% of the time when its output offers the user at least fifty potential match pictures.234 Put another way, the AI has 86% confidence that the true match will be among the fifty candidates offered.
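A hedged sketch of that kind of ranked output, with all names and similarity scores invented for illustration: rather than declaring a single match, the system returns the fifty most similar gallery entries.

```python
import random

random.seed(1)

# Stand-in gallery and similarity scores; a real system would compute
# similarity between facial feature vectors, not random numbers.
gallery = [f"person_{i}" for i in range(10_000)]
scores = {name: random.random() for name in gallery}

# The output is the top 50 candidates, not a single identification.
top_50 = sorted(gallery, key=lambda n: scores[n], reverse=True)[:50]
print(len(top_50))  # 50
```

The reported accuracy attaches to the list as a whole (is the true identity somewhere among the fifty?), not to any individual candidate on it.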
deep learning. A machine-learning approach characterized by the use of a multilayered artificial neural network.235 Deep learning can use but does not necessarily require labeled datasets.236 This approach has exploded in popularity over the last decade237 and is the predominant form of machine-learning AI. Common applications use deep learning, including many driverless cars and voice-recognition AI.
facial recognition. A prominent class of AI applications that can detect a face and analyze its features (or “biometrics”) and even predict the identity of that face. These AI applications are notable for their common use in criminal justice and national security as a means of identifying suspects or threats.
231. Ariel Bleicher, Demystifying the Black Box That Is AI, Scientific American (Aug. 9, 2017), https://perma.cc/5BFP-WRW6.
232. NIST, AI Fundamental Research—Explainability (June 16, 2022), https://perma.cc/3YX5-JC8Q.
233. Jason Brownlee, Confidence Intervals for Machine Learning, Machine Learning Mastery (Aug. 8, 2019), https://perma.cc/VGT2-282N.
234. GAO, supra note 2, at 14.
235. What Is Machine Learning (ML)?, IBM, supra note 111.
236. Id.
237. What Is Artificial Intelligence (AI)?, IBM, supra note 68.
Facial recognition algorithms can also be used to surveil more generally. Facial recognition may also be used as a biological “password” to authenticate an individual’s identity (for example, to unlock a smartphone).
human in-the-loop. An autonomous AI system designed to work cooperatively with a human to complete its tasks. Often these AI defer to human judgment when making certain decisions, especially those with significant consequences or moral weight. Human in-the-loop systems generally seek a “best of both worlds” approach that maximizes the benefits of both human and AI decision making.
human on-the-loop. An autonomous AI system designed to work under human oversight, allowing the human to easily intervene if the AI’s decisions are in error, pose significant danger, or are ethically compromising.
human out-of-the-loop. An autonomous AI system designed to operate without human oversight or involvement. Such systems do not facilitate easy human intervention if unethical or dangerous decisions are made.
lethal autonomous weapons systems (LAWS or simply AWS). Autonomous systems that can use deadly force. These systems have received outsized legal, ethical, and political attention given widespread concerns about giving nonhuman systems the power to take a human life.
machine learning (ML). A method of creating AI that relies on data, algorithms, and learned experience to refine algorithms and form intelligence.238 The premise of machine learning is that “intelligence” is not innate but must be learned through experience. Machine-learning AI algorithms are “trained” by engineers who feed them massive amounts of data that the algorithms slowly learn to interpret and understand. In response to the data, the AI gradually tweaks its code to steadily improve its abilities. These tweaks add up over time, helping the AI make stronger predictions. Forms of machine learning include the following:
238. What Is Machine Learning (ML)?, IBM, supra note 111.
239. Id.
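The iterative “tweaking” described in the machine learning entry above can be illustrated with a toy example (our own sketch, not drawn from the guide or any cited source): a model with a single parameter repeatedly adjusts that parameter after each prediction error, gradually converging on the pattern in the data.

```python
# Toy sketch of machine-learning "tweaking": a one-parameter model
# learns the relationship y = 2 * x from example data by nudging its
# parameter w a little after every prediction error.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, correct output) pairs

w = 0.0    # the model's learned parameter, starting from an arbitrary guess
lr = 0.01  # learning rate: the size of each tweak

for _ in range(500):             # many passes over the data
    for x, y_true in data:
        y_pred = w * x           # the model's current prediction
        error = y_pred - y_true  # how wrong the prediction was
        w -= lr * error * x      # tweak w to shrink that error

print(round(w, 3))  # w has converged close to 2.0, the true relationship
```

No single tweak accomplishes much; as the text notes, it is the accumulation of many small adjustments over many examples that produces increasingly accurate predictions.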
narrow AI. “[T]he ability of computational machines to perform singular tasks at optimal levels, or near optimal levels, and usually better than, although sometimes just in different ways, than humans.”242 Under this umbrella fall many single-purpose or limited-purpose AI technologies, such as facial recognition algorithms, driverless cars, and drones. These technologies are intelligent in only one or a few domains, limiting their ability to handle complexity or tasks outside their intended purpose. All AI currently in use falls into this category.
natural language processing (NLP). AI algorithms designed to process, analyze, and recognize written or spoken human language at human levels.243 NLP has a wide variety of applications. Familiar applications include virtual assistants such as Amazon’s Alexa or Apple’s Siri. In national security and criminal justice, NLP can be used to analyze and understand language recordings and written information, drawing conclusions, insights, and patterns from that data. This offers a powerful intelligence and investigative tool. It can also be used to match a voice to an identity (much like facial recognition) and for language translation.
superintelligence (SI). AI philosophers also contemplate the emergence of superintelligence, a hypothesized stage of AI evolution in which machine intelligence surpasses human intelligence.244 SI has sparked widespread concern, and its risks and benefits are unclear. Stephen Hawking highlighted this uncertainty in 2016, exclaiming that “AI may be the best thing to ever happen to humanity or the worst.”245 A malicious SI could cause incalculable harm, and a beneficial SI could prove an invaluable tool. It must be noted that many engineers and government officials
240. Id.
241. Id.
242. Baker, supra note 37, at 29.
243. What Is Natural Language Processing (NLP)?, IBM, https://perma.cc/23HX-3JPN.
244. What Is Artificial Superintelligence?, IBM, https://perma.cc/K834-Z3CW.
245. Alex Hern, Stephen Hawking: AI Will Be “Either Best or Worst Thing” for Humanity, The Guardian (Oct. 19, 2016), https://perma.cc/2DGX-25HF.
dismiss superintelligence as the stuff of science fiction and a distraction from the real and immediate challenges posed by narrow AI today.
The views expressed in this reference guide are our own, as are any errors. However, we would like to thank the following people for their help: Matthew Mittelsteadt, for helping to develop and edit the ideas herein; our research assistants over several years, Rebecca Buchanan, Thomas Clifford, Shannon Cox, Henry DuBeau, Aaron Ernst, Thomas Finnigan III, Hannah Gabbard, Rickson Galvez, Harrison Gregoire, Kaitlyn Keane, Alyssa Kozma, R.J. Naperkowski, Carlos Negron, Margaret Santandreu, Michael Stoianoff, and David Trombly, for their hard work and good humor; Kristen Duda, for all her help in managing our team; the Federal Judicial Center and National Academies of Sciences, Engineering, and Medicine, including Jason A. Cantone, Joe Cecil, Nathan Dotson, José Idler, Steven Kendall, Dominic LoBuglio, Anne-Marie Mazza, and Beth Wiggins, for the opportunity to publish this reference guide and their review and other work on the reference guide; the Hon. Curtis Collier, who provided helpful feedback as part of the FJC review process; and reviewers and editors of a shorter version of this guide (published by the Georgetown Center for Security and Emerging Technology), Chuck Babington, Keith Bybee, Tobias Gibson, Danny Hague, Matt Mahoney, Hon. John Sparks, and Lynne Weil. We also thank the Committee on Science for Judges and our NAS reviewers, known to us as A, B, and C, for their helpful comments, and Geoff Erwin for his careful and detailed editing. Finally, we thank our families and loved ones who gave us the time, patience, and support we needed to immerse ourselves in AI and draft this reference guide. Thank you all.