Reference Manual on Scientific Evidence: Fourth Edition (2025)

Chapter: Reference Guide on Forensic Feature Comparison Evidence

Previous Chapter: How Science Works
Suggested Citation: "Reference Guide on Forensic Feature Comparison Evidence." National Academies of Sciences, Engineering, and Medicine and Federal Judicial Center. 2025. Reference Manual on Scientific Evidence: Fourth Edition. Washington, DC: The National Academies Press. doi: 10.17226/26919.

Overview

This reference guide discusses techniques used for forensic feature comparison, meaning comparing features in trace evidence (such as a latent fingerprint, handwriting on a document, or toolmarks on a projectile) to a reference sample (such as a suspect’s reference fingerprints or handwriting exemplar, or a projectile fired from a known weapon) to determine whether the samples might originate from the same source. These techniques play a significant role in modern federal litigation, particularly in criminal cases. In turn, federal judges play a central gatekeeping and managerial role in regulating the admissibility and use at trial of these techniques. The role is inherently challenging, as judges are asked to weigh in on complex issues of scientific validity that are the realm of experts. The role is also particularly challenging now, as some techniques’ ability to accurately determine whether two patterns have a common source has been challenged as unreliable in light of high-profile DNA exonerations and recent governmental reports critical of feature comparison disciplines. Meanwhile, the increasing automation of feature comparison techniques with the help of software raises unique issues.

With these realities in mind, the goal of this reference guide is to provide an accessible overview for judges as they resolve legal questions about forensic feature comparison evidence. While other guides in this manual cover in depth the legal rules related to expert testimony and the specific issues related to forensic DNA typing, this reference guide offers judges a solid background in scientific and legal issues related to a number of forensic feature comparison techniques. This guide’s scope is limited to feature comparison techniques, which raise recurring issues worthy of separate treatment; it does not cover other technical or scientific fields such as medicolegal death investigations, seized drug analysis, or digital forensics.

The first section of this reference guide explains the basic legal framework for determining the admissibility of forensic feature comparison evidence, situating the case law and rules discussed in The Admissibility of Expert Testimony, in this manual, in the specific context of this type of evidence.

The second section catalogs the most frequently cited institutional efforts to evaluate forensic feature comparison evidence, such as the 2009 National Research Council report (2009 NRC Report), the 2016 report of the President’s Council of Advisors on Science and Technology (PCAST), the National Commission on Forensic Science (NCFS), and the National Institute of Standards and Technology (NIST)’s Organization of Scientific Area Committees (OSAC). We catalog the major findings of these groups not only because they are frequently cited by litigants but because many lower-court rulings on the admissibility of forensic feature comparison evidence were issued before these reports and may not be consistent with them.

The third section of this reference guide explains the continuing concerns that feature comparison evidence poses, with respect to experts’ arguable overclaims of certainty or individualization, frequently confused terms (such as reliability versus validity, and class versus individual characteristics), and other matters. This section also provides an in-depth discussion of foundational reliability and reliability as applied.

The fourth section offers a detailed explanation of some of the forensic feature comparison techniques currently in use and most likely to arise in federal cases in the next decade. These techniques include fingerprint analysis, handwriting comparison, firearms and toolmark identification, and bitemark evidence. While some techniques like bitemark analysis and handwriting comparison appear to be gradually receding in use, we include them both because they are still offered as proof and because they are sure to arise in habeas litigation or other postconviction proceedings for years to come. This guide omits several other techniques, such as shoeprint comparisons, hair and fiber analysis, and glass comparisons, that present some of the same issues as those techniques that are covered.1 Each overview explains the basic method, the extent of empirical studies on the method, and the method’s legal status in the courts. We end with a brief discussion of facial recognition and other emerging techniques.

Finally, this reference guide offers a section on automated, machine-generated feature comparison conclusions. While the Reference Guide on Human DNA Identification Evidence, in this manual, discusses DNA software, other feature comparison techniques are also being increasingly automated. We offer an overview of the types of machine-generated forensic conclusions a federal judge is likely to see, as well as the legal issues likely to arise in admitting these conclusions. The goal of this final section is not to definitively resolve these issues, but to equip judges to understand the stakes and assumptions underlying the conflicts they themselves will need to resolve.

Legal Framework Governing Forensic Feature Comparison Evidence

This section builds on The Admissibility of Expert Testimony, in this manual, by briefly highlighting significant aspects of Federal Rule of Evidence 702’s application to forensic feature comparison evidence in particular. As The Admissibility of Expert Testimony discusses in depth, Rule 702 codifies the so-called “Daubert trilogy” and requires that an expert be qualified, that the opinion be helpful to the jury, that the testimony be based on sufficient facts or data, and that the testimony be the product of a method that is both foundationally reliable and has been reliably applied to the facts of the case.2 As Daubert noted, courts determining the reliability of scientific expert testimony look to various factors, such as whether a method has been tested; the method’s error rate; whether the method has been subject to peer review; whether the method is governed by standards and protocols; and whether the method is generally accepted in the relevant scientific community.3

1. Judges can find a more comprehensive list of feature comparison disciplines by perusing OSAC’s website: https://perma.cc/8W8P-DHFT. Clicking on the link of each subcommittee leads to a list of proposed standards on a host of subdisciplines.

As noted in The Admissibility of Expert Testimony, the 2023 amendments to Rule 702 and the advisory committee’s note make explicit that the proponent must meet the requirements above by a preponderance of the evidence.4 The advisory committee’s note urges special caution with “subjective” forensic methods:

The amendment is especially pertinent to the testimony of forensic experts in both criminal and civil cases. Forensic experts should avoid assertions of absolute or one hundred percent certainty—or to a reasonable degree of scientific certainty—if the methodology is subjective and thus potentially subject to error. In deciding whether to admit forensic expert testimony, the judge should (where possible) receive an estimate of the known or potential rate of error of the methodology employed, based (where appropriate) on studies that reflect how often the method produces accurate results. Expert opinion testimony regarding the weight of feature comparison evidence (i.e., evidence that a set of features corresponds between two examined items) must be limited to those inferences that can reasonably be drawn from a reliable application of the principles and methods.5

What follows is a brief overview of each admissibility requirement in the context of feature comparison evidence.

Helpfulness to the jury.

First, the skill of non-DNA feature comparison experts may well be “helpful to the jury” in explaining the evidence. For example, in a firearm comparison case, one federal district court recently stated that the expert’s “technical knowledge, skill, and training to make microscopic observations of and comparisons between cartridge cases, would be helpful to the trier of fact.”6 Courts have come to similar conclusions with respect to other feature comparison disciplines.7 But these rulings presuppose that the conclusion itself is reliable and thus worthy of being heard by the jury. In short, “helpfulness to the jury” is typically not the dispositive factor affecting admissibility of feature comparison discipline evidence once it is deemed foundationally reliable and reliable as applied.

2. Fed. R. Evid. 702.

3. See Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993) (listing nonexhaustive factors in determining reliability of a scientific expert method).

4. See Fed. R. Evid. 702 & advisory committee’s note to 2023 amendment.

5. See Fed. R. Evid. 702 advisory committee’s note to 2023 amendment.

6. United States v. Davis, No. 4:18-CR-00011, 2019 WL 4306971, at *6 (W.D. Va. Sept. 11, 2019).

7. See, e.g., United States v. Dale, 618 F. App’x 494 (11th Cir. 2015) (fingerprint and handwriting comparison); United States v. Shipp, 422 F. Supp. 3d 762, 783 (E.D.N.Y. 2019) (firearm comparison); United States v. Mallory, 902 F.3d 584, 593 (6th Cir. 2018) (handwriting comparison).

Expert qualifications.

When allowing a feature comparison discipline expert to testify “before the jury cloaked with the mantle of an expert,” a trial court must “assure that a proffered witness truly qualifies as an expert.”8 Trial courts should be careful not to conflate expert qualification with either foundational reliability or reliability as applied. A method might be foundationally reliable, but wielded by an examiner who is not sufficiently proficient to be qualified as an expert to begin with. In turn, as explained more fully below, even an expert with established credentials and proficiency must demonstrate through documentation that the method was reliably applied in the case at hand.

In theory, Rule 702 allows much flexibility to the proponent in how to show an expert is qualified. The rule is written in the disjunctive—qualified by “knowledge, skill, experience, training, or education”—and provides multiple avenues for qualifying witnesses, including experience alone.9 Moreover, experts need not be “blue-ribbon practitioners” with “optimal qualification[s],”10 or even “highly qualified in order to testify about a given issue.”11

Still, courts must provide sufficient reasons for finding the expert qualified so that an appellate court can determine “whether the district court properly applied the relevant law.”12 The question is not merely whether the witness is qualified in a vacuum to speak on expert matters; it is whether the witness is qualified to offer the opinion she is offering. Some courts have excluded feature-comparison experts on this ground.13 An expert might also be qualified to render opinions on some topics but not others.14 Specifically, some (but not all) courts have deemed the exclusion of experts such as law professors, who themselves are not scientists but who have written extensively about a discipline, not an abuse of discretion.15

8. United States v. Cloud, No. 1:19-CR-02032-SMJ-1, 2022 WL 801694, at *1 (E.D. Wash. Mar. 8, 2022) (fingerprint expert) (quoting Jinro Am. Inc. v. Secure Invs., Inc., 266 F.3d 993, 1004 (9th Cir. 2001)).

9. The advisory committee’s note to the 2000 amendment of Rule 702 states that “[n]othing in this amendment is intended to suggest that experience alone . . . may not provide a sufficient foundation for expert testimony.” Fed. R. Evid. 702 advisory committee’s note to 2000 amendment.

10. United States v. Vargas, 471 F.3d 255, 262 (1st Cir. 2006).

11. Huss v. Gayden, 571 F.3d 442, 452 (5th Cir. 2009).

12. United States v. Avitia-Guillen, 680 F.3d 1253, 1259 (10th Cir. 2012) (no abuse of discretion ruling that fingerprint expert was qualified; district court provided sufficient information to establish basis for the opinion).

13. See, e.g., Balimunkwe v. Bank of Am., No. 1:14-CV-327, 2015 WL 5167632 (S.D. Ohio Sept. 3, 2015) (expert not permitted to testify where he did not possess sufficient qualifications as handwriting comparison expert); Almeciga v. Ctr. for Investigative Reporting, Inc., 185 F. Supp. 3d 401, 424 (S.D.N.Y. 2016) (same).

14. See, e.g., Marten Transp., Ltd. v. Plattform Advertising, Inc., 184 F. Supp. 3d 1006 (D. Kan. 2016) (holding that expert witness, though qualified to render several opinions on nature of trucking industry and hiring practices, was not qualified to render opinion on search engine optimization).

Before the 2023 amendments to Rule 702, some courts had come close to deeming a forensic feature comparison expert unqualified, but concluded that qualification was a low enough bar that any questions should go to weight, not admissibility.16 The 2023 amendments explicitly note that, while questions of precise weight as to the opinions of a qualified expert are for the jury, the issue of qualification (as well as helpfulness and reliability) goes to admissibility and not merely weight.17 Ultimately, appellate courts reviewing qualifications of forensic feature or pattern comparison experts have affirmed that it is the judge’s obligation, and not a jury question, to determine whether a witness possesses sufficient qualifications.18

Some recent commentary argues that, given the 2023 amendments’ concern over “subjective” methods based on examiner judgment and experience, trial courts should take proficiency testing results (or lack thereof) of feature comparison experts more seriously at the expert qualification stage.19 Several consensus reports and other scholars have similarly urged courts to be more careful gatekeepers about the qualifications of feature comparison experts, noting that the proficiency tests that exist to date are not encouraging and the results may be misleading, as the tests do not simulate realistic crime-scene samples.20

15. See, e.g., State v. Clifford, 121 P.3d 489 (Mont. 2005) (holding that trial court did not abuse its discretion in excluding law professor as expert in handwriting); United States v. Paul, 175 F.3d 906, 912 (11th Cir. 1999) (same).

16. See, e.g., Banuchi v. City of Homestead, 606 F. Supp. 3d 1262, 1272–73 (S.D. Fla. 2022) (finding an expert “barely” qualified to testify about why fingerprints would not be at a crime scene, reasoning that “[t]he qualification standard for expert testimony is not stringent, and so long as the expert is minimally qualified, objections to the level of the expert’s expertise [go] to credibility and weight, not admissibility”) (quoting Vision I Homeowners Ass’n, Inc. v. Aspen Specialty Ins. Co., 674 F. Supp. 2d 1321, 1325 (S.D. Fla. 2009) (emphasis added) (quoting Kilpatrick v. Breg, Inc., Case No. 08-10052-CIV, 2009 WL 2058384, at *3 (S.D. Fla. June 25, 2009))); Thomas v. United States, No. 2:03-CV-02416-JPM, 2015 WL 5076969, at *184 (W.D. Tenn. Aug. 27, 2015), aff’d, 849 F.3d 669 (6th Cir. 2017) (explaining the witness’s lack of qualifications in the area of handwriting comparison, but admitting it with the caveat that it was entitled to “little weight” in a bench trial).

17. See Fed. R. Evid. 702 & advisory committee’s note to 2023 amendment.

18. See, e.g., United States v. Ruvalcaba-Garcia, 923 F.3d 1183, 1189 (9th Cir. 2019) (trial judge erred in ruling that fingerprint technician’s qualification “as an expert” was a “determination for the jury”).

19. See, e.g., Brandon L. Garrett & Gregory Mitchell, The Proficiency of Experts, 166 U. Pa. L. Rev. 901, 902 (2018) (explaining that their evaluation of twenty years of fingerprint proficiency tests reflected “surprisingly high rates of false positive identifications” and that “credentials and experience are often poor proxies for proficiency”).

20. See, e.g., President’s Council of Advisors on Sci. & Tech. (PCAST), Exec. Office of the President, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods 51 (Sept. 2016), https://perma.cc/GJ9A-2DFS [hereinafter 2016 PCAST Report]; Garrett & Mitchell, supra note 19, at 902 (explaining that their evaluation of twenty years of fingerprint proficiency tests reflected “surprisingly high rates of false positive identifications”); Jonathan J. Koehler, Forensics or Fauxrensics? Ascertaining Accuracy in the Forensic Sciences, 49 Ariz. St. L. J. 1369 (2017) (explaining that, in most areas of forensic science, the necessary proficiency testing has not been done). Qualifications, proficiency, and reliability as applied are not entirely distinct concepts, and there is some overlap among them.

Foundational reliability.

This section provides a brief reminder of the foundational reliability requirements of Rule 702. A more in-depth discussion of specific recurring reliability issues in feature comparison evidence is set forth in the next section.

In determining foundational reliability of feature comparison expert testimony, judges have multiple options under Rule 702’s framework, consistent with their role as gatekeeper: (1) they may decide that the evidence is admissible; (2) they may exclude the evidence entirely, concluding that the proponent has not established foundational reliability; or (3) they may admit the evidence with limitations, as some courts have done.21 While trial court determinations of admissibility are reviewed for an “abuse of discretion,”22 a misapprehension as to the applicable legal standard under Rule 702 is considered an abuse of discretion.23

One issue facing judges when applying Rule 702’s reliability requirements to feature comparison disciplines is whether such disciplines are “scientific,” as compared to “technical” or “specialized.” On the one hand, feature and pattern recognition experts distinguish themselves from lay persons on their ability to identify individuals based on the experts’ knowledge, training, and expertise. Although lay people may be able to identify a known individual’s handwriting24 or may be able to distinguish between two clearly dissimilar fingerprints, courts have consistently ruled that feature and pattern recognition by trained individuals is, in fact, specialized knowledge.25 On the other hand, some commentators have claimed that examiners wielding relatively more subjective experience- and judgment-based feature and pattern identification techniques are not “scientists” practicing “science.”26

21. See, e.g., United States v. Tibbs, No. 2016-CF1-19431, 2019 WL 4359486 (D.C. Super. Ct. Sept. 5, 2019) (precluding the government from eliciting testimony identifying the recovered firearm as the source of the recovered cartridge and limiting expert’s testimony to a conclusion that the firearm cannot be excluded as the source, based on the consistency of the class characteristics and microscopic toolmarks); United States v. Adams, 444 F. Supp. 3d 1248 (D. Or. 2020); United States v. Shipp, 422 F. Supp. 3d 762, 778–79 (E.D.N.Y. 2019) (disallowing a conclusion of a “match” with firearm evidence); United States v. Rutherford, 104 F. Supp. 2d 1190, 1193 (D. Neb. 2000) (disallowing the conclusion of a “match” of handwriting samples); United States v. Van Wyk, 83 F. Supp. 2d 515 (D.N.J. 2000); United States v. Hines, 55 F. Supp. 2d 62 (D. Mass. 1999) (same).

22. See Gen. Elec. Co. v. Joiner, 522 U.S. 136, 141–43 (1997).

23. See, e.g., Koon v. United States, 518 U.S. 81, 100 (1996) (noting that a trial court necessarily abuses its discretion when it fails to apply the correct legal standard or makes an error of law); Jeff D. v. Otter, 643 F.3d 278 (9th Cir. 2011) (same).

24. See, e.g., Fed. R. Evid. 901(b)(2) (permitting “nonexpert’s opinion that handwriting is genuine, based on a familiarity with it that was not acquired for the current litigation”).

25. See, e.g., Shipp, 422 F. Supp. 3d at 783 (firearm comparison); United States v. Llera Plaza, 188 F. Supp. 2d 549, 563–64 (E.D. Pa. 2002) (fingerprint comparison); United States v. Mallory, 902 F.3d 584, 593 (6th Cir. 2018) (handwriting comparison).

Of course, a field that is “technical” rather than “scientific” is still subject to the reliability requirements of Rule 702, as the Supreme Court held in Kumho Tire Co. v. Carmichael.27 But the Supreme Court has not squarely addressed the types of factors a trial judge should consider in determining the foundational reliability of a nonscientific technical method.28 The National District Attorneys Association suggested in its response to the 2016 PCAST Report that validation studies are not required for nonscientific, experience-based methods.29 But as the PCAST committee responded, it is not clear what the alternative means of testing validity of such a method would be.30 While some have argued that validity can be established through so-called “blind” verification by a second examiner or interlaboratory comparison, such comparisons would presumably focus mostly on repeatability, not accuracy. Thus, courts determining the foundational reliability of a feature comparison discipline method deemed nonscientific will have to determine how—if not through validation studies showing the limits of the method’s ability to get the right answer—the proponent of the method can establish its accuracy for its stated purpose by a preponderance of the evidence.

Reliability as applied.

Once the proponent has demonstrated that the method is foundationally reliable (i.e., the method produces accurate, repeatable, and reproducible results on materials like those in the instant case), and that the examiner is qualified to conduct the method, the proponent must introduce additional evidence that the examiner in fact followed the method in the case at hand. The 2023 amendments to Rule 702 reaffirmed that reliability as applied, like other admissibility requirements, goes to admissibility and not merely weight and must be proven by a preponderance of the evidence.31 To prove that a laboratory or examiner reliably applied a method, the proponent might rely on various types of documentation, from internal validation studies (showing that the method works on the laboratory’s equipment with the specific protocols, practices, and personnel of the specific forensic service provider) to case file documentation to examiner bench notes, reports, and testimony showing the conditions of the case and what the examiner did in the particular case.

26. See Letter from Nat’l Dist. Att’ys Ass’n to President Obama in response to 2016 PCAST Report (Nov. 16, 2016), https://perma.cc/5JX7-BU5V. See generally discussion in section titled “Recurring Concerns with Forensic Feature Comparison Evidence” below.

27. Kumho Tire Co. v. Carmichael, 526 U.S. 137, 142 (1999).

28. See generally discussion in section titled “Prior Government and NGO Review of Forensic Feature Comparison Evidence” below (discussing critiques of PCAST and arguments about how and whether to calculate error rates for feature comparison disciplines).

29. See id. (noting the NDAA’s response to the 2016 PCAST Report).

30. See President’s Council of Advisors on Sci. & Tech. (PCAST), Exec. Office of the President, An Addendum to the PCAST Report on Forensic Science in Criminal Courts 1, 3 (Jan. 6, 2017), https://perma.cc/TFP8-YUYS [hereinafter PCAST Addendum] (noting that some critics of PCAST “suggested that the validity and reliability of such a method could be established without actually empirically testing the method in an appropriate setting. Notably, however, none of these respondents identified any alternative approach that could establish the validity and reliability of a subjective forensic feature-comparison method.”).

31. Fed. R. Evid. 702 advisory committee’s note to 2023 amendment (“[M]any courts have held that the critical questions of the . . . application of the expert’s methodology, are questions of weight and not admissibility. These rulings are an incorrect application of Rules 702 and 104(a).”).

This “reliability as applied” showing is different from foundational reliability or expert qualification. For example, a trial court might deem fingerprint comparison a foundationally reliable method and deem a particular examiner to be qualified based on experience, credentials, and proficiency tests. But if the comparison in the case is between a reference print and a particularly smudged or incomplete print that goes beyond the bounds of the method’s empirically tested accuracy, the method might not be reliably applied. As the advisory committee’s note to the 2023 amendments emphasized, expert opinions must “stay within the bounds of what can be concluded from a reliable application of the expert’s basis and methodology.”32 Another way that an opinion can exceed the bounds of a valid application of a method is if it relates to a different type of conclusion—for example, whether a note was made to look like a forgery, rather than whether two notes have the same author.33 Still another way an opinion might be unreliable as applied is if the examiner claims certainty in their opinion, a concern the advisory committee considers “particularly pertinent to the testimony of forensic experts.” Instead, in deciding whether to admit the evidence, trial courts should “(where possible) receive an estimate of the known or potential rate of error of the methodology employed, based (where appropriate) on studies that reflect how often the method produces accurate results.”34

Prior Government and NGO Review of Forensic Feature Comparison Evidence

This subsection offers an overview of governmental and nongovernmental organization (NGO) review of forensic feature comparison evidence over the past fifteen years. Judges should be aware of these reports both because they are frequently cited and because they can be helpful guides to areas of continuing research and controversy.

32. See id.

33. See, e.g., Almeciga v. Ctr. for Investigative Reporting, Inc, 185 F. Supp. 3d 401, 424 (S.D.N.Y 2016) (noting that method for one purpose was not validated for other purpose and that a trial judge “should not [admit questioned document testimony] without carefully evaluating whether the examiner has actual expertise in regard to the specific task at hand”).

34. See Fed. R. Evid. 702 advisory committee’s note to 2023 amendment.

The 2009 National Research Council Report.

The push since the mid-2000s to improve feature comparison disciplines has resulted in part from the DNA exoneration movement, which exposed erroneous convictions based on embellished or misleading forensic evidence.35 Additionally, the misidentification of Brandon Mayfield by the FBI as the perpetrator of the 2004 Madrid train bombings based on latent fingerprint analysis further spurred concern over non-DNA methods.36 The National Research Council of the National Academies (NRC) began a formal review in 2006 of the state of non-DNA forensic feature comparison disciplines.37 The result was a lengthy report published in 2009, Strengthening Forensic Science in the United States: A Path Forward.38

The primary conclusion of the 2009 NRC Report was that, “[w]ith the exception of nuclear DNA analysis . . . no forensic [feature identification] method has been rigorously shown to have the capacity to consistently, and with a high degree of certainty, demonstrate a connection between evidence and a specific individual or source.”39 Specifically, the NRC pointed out the relative subjectivity and nonreproducibility in non-DNA methods that are based primarily on examiner experience and judgment; the lack of validation studies to assess accuracy of these relatively subjective methods;40 the lack of “population studies” that show the “variability” and rarity of a feature within a relevant population (critical to assessing the probative value of a “match” between two sets of observed features, like whorls and loops in a thumbprint or a certain shoe sole pattern);41 the lack of corrective measures to avoid examiner contextual bias;42 the “absence of a feedback mechanism” to alert examiners that a method

35. See generally Restoring Freedom, Innocence Project, https://perma.cc/LMU4-87JU (last visited Nov. 18, 2024) (explaining the categories of reasons underlying convictions of factually innocent defendants).

36. See Robert B. Stacey, Report on the Erroneous Fingerprint Individualization in the Madrid Train Bombing Case, Fed. Bureau of Investigation (FBI) (Jan. 2005), https://perma.cc/JQ9W-RX7K.

37. See generally Erin Murphy, What ‘Strengthening Forensic Science’ Today Means for Tomorrow: DNA Exceptionalism and the 2009 NAS Report, 9 L. Probability & Risk 7 (2010), https://doi.org/10.1093/lpr/mgp030 (explaining the story of how DNA exonerations led to the Senate’s charge, and the forensic community’s request, to the NRC to write a report on non-DNA evidence); Nat’l Rsch. Council, Nat’l Acads., Strengthening Forensic Science in the United States: A Path Forward 1 (2009), https://doi.org/10.17226/12589 [hereinafter 2009 NRC Report] (explaining Congress’s charge to them). We refer to the 2009 report as the “2009 NRC Report” to distinguish it from later reports from the National Academy of Sciences, even though some authors refer to the 2009 report as the “NAS Report.”

38. See 2009 NRC Report, supra note 37.

39. Id. at 7.

40. Id. at 8.

41. Id. at 149. See also id. at 154 (same concern for firearms and toolmarks), 163 (fiber evidence), 165 (handwriting), 184 (friction ridge patterns).

42. Id. at 8, 24.

might have produced a false positive or negative;43 and the lack of “quantifiable measures of uncertainty” (i.e., an error rate).44

Responses to the 2009 NRC Report varied, although the scientific community appeared largely supportive. Some pushed back against the report for calling established methods into question.45 Organizations such as the American Academy of Forensic Sciences (AAFS), on the other hand, published a formal response “support[ing] the recommendations” of the report,46 and other scientists followed suit.47

2013–17: The National Commission on Forensic Science (NCFS).

The 2009 NRC Report led to a broad debate over the validity and reliability of forensic science methods, and to the creation of the NCFS, a joint project of the Department of Justice (DOJ) and NIST. The NCFS’s thirty-seven members included forensic examiners, psychologists and other scientists, prosecutors, defense attorneys, academics, and judges.48 Between 2014 and 2017, the NCFS approved twenty “recommendations” and twenty-three “views documents,” all available on the Commission’s website.49 These include recommendations requiring proficiency testing of examiners even in fields lacking an accreditation body; views on what federal prosecutors should disclose in pretrial discovery about forensic methods, above and beyond what Federal Rule of Criminal Procedure 16 requires;50 and the view that forensic examiners should base their opinions only on

43. Id. at 149.

44. Id. at 23.

45. See generally Paul C. Giannelli, The 2009 NAS Forensic Science Report: A Literature Review, 48 Crim. L. Bulletin 378, 382, 385 (2012) (discussing responses to the 2009 NRC Report, including a prosecutor in front of the Senate Judiciary Committee insisting that the report was an “agenda-driven attack upon well-founded investigative techniques” and then-Senator Jeff Sessions’ comment that “I don’t think we should suggest that those proven scientific principles that we’ve been using for decades are somehow uncertain”).

46. See Am. Acad. of Forensic Scis. (AAFS), Response to the National Academy of Sciences’ “Forensic Needs” Report (Sept. 4, 2009), https://perma.cc/K7B4-TWCP.

47. See, e.g., Karen Kafadar, Statistical Issues in Assessing Forensic Evidence, 83 Int’l Statistical Rev. 111, 120 (2015), https://doi.org/10.1111/insr.12069 (expressing support for the 2009 NRC Report and opining that “[t]he scientific method is very general, and the principles of science apply to all branches, irrespective of the specific application (e.g. physics, chemistry or biology). If the scientific method has not been applied, then scientists from no branch can defend the results.”).

48. See U.S. Dep’t of Justice, National Commission on Forensic Science: Members (June 1, 2017), https://perma.cc/V98X-BMQB (listing members and biographies).

49. See U.S. Dep’t of Justice, National Commission on Forensic Science: Work Products Adopted by the Commission (July 23, 2018), https://perma.cc/5XVZ-KN4F (links to documents).

50. See id. These recommendations were written before the 2022 amendments to Rule 16, which (among other things) require parties to disclose not only written summaries of their expert’s anticipated testimony but “a complete statement” of the witness’s opinions, the bases and reasons for those opinions, the witness’s qualifications, and a list of other cases in which the witness had testified in the previous four years. Fed. R. Crim. P. 16 advisory committee’s note to 2022 amendment.

“task-relevant” information.51 While the NCFS’s charter expired in April 2017 and was not renewed by then-Attorney General Jeff Sessions,52 the views documents remain potentially persuasive documents wielded by parties litigating forensic evidence issues.

The 2016 PCAST Report.

While the NCFS was still completing its work, PCAST issued a report in 2016 on forensic feature comparison evidence.53 The report’s working group largely consisted of judges and academics from the “derivative sciences” (i.e., sciences like chemistry, statistics, genetics, and computer science underlying forensic methods), although the group sought input from “the Federal Bureau of Investigation (FBI) Laboratory and individual scientists at NIST, as well as from many other forensic scientists and practitioners, judges, prosecutors, defense attorneys, academic researchers, criminal-justice-reform advocates, and representatives of Federal agencies.”54

The 2016 PCAST Report echoed the concerns of the 2009 NRC Report and focused on explaining what was missing to properly validate non-DNA forensic feature comparison methods. Specifically, PCAST explained that the only way to meaningfully test the validity of a method based primarily on examiner judgment and experience (where the method’s steps are largely an inscrutable “black box” to those on the outside) is through “black box” studies, in which the method’s accuracy is tested on samples for which the tester knows the right answer.55 PCAST further explained that any statement about a method’s reliability, or the certainty underlying an examiner’s conclusion, should not overstate the probative value of the evidence beyond what is “demonstrated by empirical evidence” based on error-rate studies.56 “For example,” PCAST wrote, “if the false positive rate of a method has been found to be 1 in 50, experts should not imply that the method is able to produce results at a higher accuracy.”57

The PCAST Report then covered what a well-designed error rate study requires: samples and populations representative of real casework; a sufficiently

51. See U.S. Dep’t of Justice, Views of the Commission Ensuring That Forensic Analysis Is Based Upon Task-Relevant Information (Dec. 8, 2015), https://perma.cc/U2JF-QSCP. This document was heavily influenced by, and frequently cited, the work of psychologist Itiel Dror (who spoke to the commission), who has written on contextual bias in forensic feature comparison methods such as DNA mixture interpretation and latent fingerprint analysis. See Itiel E. Dror & Greg Hampikian, Subjectivity and Bias in Forensic DNA Mixture Interpretation, 51 Sci. & Justice 204 (2011), https://doi.org/10.1016/j.scijus.2011.08.004, and Itiel E. Dror et al., Cognitive Issues in Fingerprint Analysis: Inter- and Intra-Expert Consistency and the Effect of a ‘Target’ Comparison, 208 Forensic Sci. Int’l 10 (2011), https://doi.org/10.1016/j.forsciint.2010.10.013.

52. See U.S. Dep’t of Justice, National Commission on Forensic Science, https://perma.cc/V6GT-YSWN (noting charter expiration in April 2017).

53. See 2016 PCAST Report, supra note 20.

54. Id. at 2.

55. Id. at 5.

56. Id. at 6.

57. Id.

large sample size; “blind” testing; and no mid-stream change in protocol. PCAST also opined on the proper way to calculate a method’s error rate. First, it insisted that the false positive rate be calculated by determining the number of false inclusions among an examiner’s conclusive examinations (exclusions or declarations of a “match”) rather than among all examinations (including “inconclusive” determinations).58 Second, the reported error rate should be the upper bound of a “confidence interval” around the study error rate, to account for measurement uncertainty and give the factfinder a sense of how high the false positive rate might be.59 Using these criteria, PCAST ultimately found insufficient evidence of foundational validity for several feature comparison disciplines—toolmark analysis, handwriting analysis, bitemark analysis, footwear analysis, microscopic hair analysis, and deconvolution of DNA mixtures with more than three contributors.60 The only non-DNA feature comparison method PCAST deemed foundationally valid was latent print analysis, though the report added several caveats.61
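The upper-bound recommendation can be made concrete with a short calculation. The sketch below uses invented numbers and the one-sided 95% Wilson score bound, a standard approximation; it is offered only to illustrate the difference between a study’s observed false positive rate and the upper confidence bound, not as the exact interval method any particular study or report employed.

```python
import math

def wilson_upper(false_positives, conclusive, z=1.645):
    """One-sided 95% Wilson score upper bound on a false positive rate.

    Illustrative approximation only; validation studies may use exact
    binomial (Clopper-Pearson) or other interval methods.
    """
    n = conclusive
    p = false_positives / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center + margin) / (1 + z * z / n)

# Hypothetical study: 1 false positive among 100 conclusive examinations.
# The observed (point-estimate) rate is 1%, but the figure reported to the
# factfinder would be the upper bound, which is considerably higher.
rate = 1 / 100
upper = wilson_upper(1, 100)
```

On these assumed numbers, the point estimate is 1%, but the one-sided 95% upper bound is roughly 4%; under PCAST’s approach, the latter is the number a factfinder should hear, because the true false positive rate could plausibly be that high given the study’s limited size.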

The 2016 PCAST Report garnered formal responses from numerous organizations, some supportive and some critical.62 PCAST issued an addendum to its report in January 2017 responding to critiques.63 After the addendum, additional critiques64 and defenses65 of PCAST continued. A 2018 article by Professor Jay Koehler, directed at the judiciary, offers a helpful and measured summary of the critiques of PCAST, possible responses, lingering debates, and issues for

58. Id. at 51.

59. Id.

60. Id. at 7.

61. Id. at 9–10.

62. See, e.g., Nat’l Dist. Att’ys Ass’n, supra note 26 (arguing that not all methods are “scientific,” and not all methods need to be supported by validation studies); Fed. Bureau of Investigation (FBI), Comments on President’s Council of Advisors on Science and Technology REPORT TO THE PRESIDENT Forensic Science in Federal Criminal Courts: Ensuring Scientific Validity of Pattern Comparison Methods [PCAST Report] (Sept. 20, 2016), https://perma.cc/B5S9-XSG3 (criticizing PCAST for not taking into account studies that the FBI argued met the criteria for appropriate black box studies).

63. See PCAST Addendum, supra note 30.

64. See, e.g., Ted Robert Hunt, Scientific Validity and Error Rates: A Short Response to the PCAST Report, 86 Fordham L. Rev. Online 24, 26 (2018) (agreeing that methods must be tested empirically but deeming PCAST’s views of an appropriate test as too narrow); U.S. Dep’t of Justice, Justice Department Publishes Statement on 2016 President’s Council of Advisors on Science and Technology Report (Jan. 13, 2021), https://perma.cc/8J6L-L4DN (arguing that black box studies are not always needed and that PCAST’s criteria are too narrow).

65. See Letter from Democracy Forward Foundation on behalf of Union of Concerned Scientists, requesting correction under the Information Quality Act (regarding U.S. Dep’t of Justice’s statement on the PCAST Report) (June 24, 2021), https://perma.cc/MQD8-K835 (defending PCAST and demanding that DOJ retract its statement); Innocence Project Staff, Innocence Project Calls on Department of Justice to Retract Statement on PCAST Report (Feb. 19, 2021), https://perma.cc/SS23-4K5F (same).

judges to resolve in light of these debates.66 For their part, statisticians and other scientists have likewise urged the forensic community to follow PCAST’s recommendations.67

Notwithstanding its continued criticism of PCAST, DOJ has voluntarily made several efforts to improve its forensic practices in response to PCAST and other reports. From 2017 onward, the department has implemented several new initiatives, including an updated code of professional responsibility, publication of research needs, testimony monitoring, and creation of a laboratory needs working group.68 FBI representatives have likewise described DOJ initiatives implemented to strengthen the accuracy of FBI experts’ testimony.69 DOJ does continue to use and endorse the term “source identification,” notwithstanding concerns expressed to the Advisory Committee on the Rules of Evidence that the term is synonymous with a claim of individualization, a claim DOJ reports its experts no longer make in court.70

The forensic community has responded to PCAST in part by promulgating standards to improve feature comparison disciplines through a new entity, NIST’s Organization of Scientific Area Committees (OSAC). OSAC is composed of hundreds of forensic examiners along with lawyers, judges, psychologists, statisticians, and laboratory directors. Created in 2014, its mission is “to strengthen the nation’s use of forensic science by facilitating the development

66. Jonathan Jay Koehler, How Trial Judges Should Think About Forensic Science Evidence, 102 Judicature 29 (2018), https://perma.cc/DLM4-CXCL.

67. See, e.g., Suzanne Bell et al., A Call for More Science in Forensic Science, 115 Proc. Nat’l Acad. Scis. 4541, 4541 (2018), https://doi.org/10.1073/pnas.1712161115 (discussing the 2009 NRC and PCAST reports with approval and opining that “[f]orensic science is at a crossroads. It is torn between the practices of science, which require empirical demonstration of the validity and accuracy of methods, and the practices of law, which accept methods based on historical precedent even if they have never been subjected to meaningful empirical validation.”); Thomas D. Albright, The US Department of Justice Stumbles on Visual Perception, 118 Proc. Nat’l Acad. Scis. e2102702118 (2021), https://doi.org/10.1073/pnas.2102702118 (offering a neuroscientist’s critique of DOJ’s refusal to adopt PCAST’s findings).

68. See U.S. Dep’t of Justice, Forensic Science: Publications, https://perma.cc/BLR3-WNTK (last visited Nov. 20, 2024) (listing priorities, publications, and initiatives).

69. See, e.g., Alice R. Isenberg & Cary T. Oien, Scientific Excellence in the Forensic Science Community, 86 Fordham L. Rev. Online 39 (2018) (discussing testimony monitoring as an example of a recent initiative, as well as the FBI’s existing accreditation requirements); Andrew D. Goldsmith, The Reliability of the Adversarial System to Assess the Scientific Validity of Forensic Evidence, 86 Fordham L. Rev. Online 16 (2018) (noting that the FBI has eliminated use of the terms “reasonable degree of scientific certainty” and similar statements; eliminated claims of zero error; prohibited examiners from citing the number of examinations conducted as an indication of the accuracy of their conclusion; and created a testimony monitoring program). However, DOJ’s policies apply only to its own witnesses. If witnesses from local or state agencies testify in a federal case, they may not adhere to the DOJ policy.

70. See Judicial Conference Advisory Committee on the Federal Rules of Evidence, Minutes of the Meeting of May 3, 2019, 1, 20–23 (May 3, 2019), https://perma.cc/2QRN-7LQ8 (disagreement on the FBI’s continued use of this term and whether it is different from individualization).

and promoting the use of high-quality, technically sound standards.”71 OSAC both creates new standards and reviews standards set by other standards development organizations (SDOs), such as the American Academy of Forensic Sciences’ (AAFS) Academy Standards Board (ASB),72 ASTM International (formerly the American Society for Testing and Materials),73 and the National Fire Protection Association (NFPA).74 Standards that have sufficient “technical merit” are placed on OSAC’s “registry.”75 Judges should be aware of OSAC’s activities and output, as well as controversy over the quality of its registry standards, given that a standard’s inclusion on the registry may be touted by one side or another as proof that a discipline is governed by standards, one of the “Daubert factors.”76

In addition to OSAC, other governmental and nongovernmental organizations independent of the forensic examiner or legal communities77 continue efforts to improve forensic science. The Center for Statistics and Applications in

71. NIST’s Organization of Scientific Area Committees for Forensic Science (OSAC), About Us, Nat’l Inst. of Standards & Tech. (NIST), U.S. Dep’t of Commerce, https://perma.cc/9NJ8-7L2F (last visited Nov. 20, 2024).

72. See Academy Standards Board, Am. Acad. of Forensic Sci. (AAFS), https://perma.cc/U74R-Y4SM (last visited Nov. 20, 2024).

73. See ASTM International, https://perma.cc/25VC-ZCQK (last visited Nov. 20, 2024).

74. See Nat’l Fire Prot. Ass’n (NFPA), NFPA Codes and Standards, https://perma.cc/G24D-JGNZ (last visited Nov. 20, 2024).

75. See NIST’s OSAC, About Us, supra note 71 (“OSAC also reviews standards and posts high quality ones to the OSAC Registry [https://perma.cc/L2SZ-L7X3 (last visited Nov. 20, 2024)]. Inclusion on this registry indicates that a standard is technically sound and that laboratories should consider adopting them.”).

76. Three OSAC scientists published commentary criticizing OSAC for promulgating “vacuous” standards that merely require laboratories to have standards, rather than offering meaningful guidance on what the standards should be. These critics claimed that placing such standards on the OSAC registry allows disciplines to misleadingly claim they are governed by standards. Geoffrey Stewart Morrison, Cedric Neumann & Patrick Henry Geoghegan, Vacuous Standards—Subversion of the OSAC Standards-Development Process, 2 Forensic Sci. Int’l: Synergy 206 (2020), https://doi.org/10.1016/j.fsisyn.2020.06.005. In response, several members of ASB argued that the standards were an improvement, had to go through several rounds of comments from task groups and the public, and can be revised. See Linton A. Mohammed et al., Response to Vacuous Standards Subversion of the OSAC Standards Development Process, 3 Forensic Sci. Int’l: Synergy 100145 (2021), https://doi.org/10.1016/j.fsisyn.2021.100145. A reply to the response, signed by eleven scientists, including several OSAC members, argued that the issue is the content of the standards, not the process that led to them, and ended with an admonition to judges to “not accept at face value claims of scientific validity based on the fact that published standards have been followed. We would encourage courts to enquire further so as to ascertain whether those standards are fit for purpose.” See Geoffrey Stewart Morrison et al., Reply to Response to Vacuous Standards—Subversion of the OSAC Standards-Development Process, 3 Forensic Sci. Int’l: Synergy 100149 (2021), https://doi.org/10.1016/j.fsisyn.2021.100149.

77. There are also professional organizations within the forensic examiner and legal communities that work to improve and critique forensic techniques, such as the Forensic Justice Project, https://perma.cc/RS96-DK5U, and the American Academy of Forensic Sciences, https://perma.cc/CSF3-69L4.

Forensic Evidence (CSAFE), a federally funded research and training center and the only one of NIST’s three Centers of Excellence dedicated to building the scientific and statistical foundations of feature and pattern comparison evidence, remains active in nationwide efforts to review and improve forensic feature comparison methods. In particular, many of its members are statisticians developing more objective, automated methods of rendering feature and pattern comparisons.78 Likewise, the American Association for the Advancement of Science (AAAS) holds conferences and publishes articles related to forensic science issues.79

Recurring Concerns with Forensic Feature Comparison Evidence

Before this reference guide delves into specific disciplines, this section flags several recurring issues with respect to many non-DNA feature comparison methods that judges will face.

Terminology and Testimony Issues
Overclaims of certainty and challenges to claims of “matches” or source attribution.

Challenges to non-DNA feature comparison expert testimony sometimes relate to the terms used by experts to describe their conclusions. One issue is claims of certainty. The advisory committee’s notes to the 2023 amendments to Rule 702 state that “[f]orensic experts should avoid assertions of absolute or one hundred percent certainty.”80 The 2009 NRC Report and 2016 PCAST Report also urged against such statements,81 and DOJ’s guidelines prohibit them.

Another frequently challenged word in feature comparison reports and testimony is “match.” When an expert deems two sets of features to be consistent

78. See generally Center for Statistics and Applications in Forensic Evidence (CSAFE), https://perma.cc/UW29-6XDB (linking to studies and ongoing research).

79. For example, the AAFS Center for Scientific Responsibility and Justice held a conference on forensics in 2019 on the tenth anniversary of the 2009 NRC Report. See Am. Ass’n for the Advancement of Sci. (AAAS), An Update on Strengthening Forensic Science in the United States: A Decade of Development [Forensic Conference 2019], https://perma.cc/8NXY-9WHK. See generally Am. Ass’n for the Advancement of Sci. (AAAS), “Forensic” Search Results, https://perma.cc/9ZRZ-4RQH (listing publications and events).

80. Fed. R. Evid. 702 advisory committee’s note to 2023 amendment.

81. See, e.g., 2009 NRC Report, supra note 37, at 47–48 (noting testimony of some experts as to “unfailing certainty” and describing “failure to acknowledge uncertainty” as a problem in feature comparison methods); PCAST Report, supra note 20 at 6 (criticizing experts’ “[s]tatements claiming or implying greater certainty than demonstrated by empirical evidence”).

with each other, they might testify that the evidence and reference samples “match.” But the term “match” might inaccurately suggest both an air of certainty in the expert’s determination that the features are consistent, and that the mere consistency of a set of compared features means that the two items necessarily share a common source. For example, if two fingerprints “match” at six points, they might not “match” if the examiner looked at ten more points. For a discipline like DNA, with robust population data and quantified “match” statistics, examiners do not need to fall back on terms like “match” or “identification”; they can simply present a statistic and allow the factfinder to make its own determination of whether a suspect is the source of the DNA. But for more subjective methods without such data, an examiner’s opinion that two patterns “match” or share the same source is more fraught.

For similar reasons, government and NGO reports have also voiced concern over other terms implying that two patterns definitely have the same source, such as “individualization,” “similar in all respects tested,” or “to the exclusion of all others,” given their “profound effect on how the trier of fact in a criminal or civil matter perceives and evaluates scientific evidence.”82 For example, the 2009 NRC Report expressed concern that imprecise terminology in microscopic hair analysis, such as “associated with,” could imply individualization and a “match” without confirmation from mitochondrial DNA (mtDNA) analysis.83 Indeed, an internal FBI study found that “of 80 hair comparisons that were ‘associated’ through microscopic examinations, 9 of them (12.5 percent) were found in fact to come from different sources when reexamined through mtDNA analysis.”84 The 2016 PCAST Report likewise urged that “quantitative information about the reliability of methods be stated clearly in expert testimony,” and that ambiguous language such as “match” and “identification” be avoided, as it could be misinterpreted as scientifically supported individualization.85 PCAST also discouraged examiners from testifying to the “uniqueness” of features, urging them instead to focus on the extent to which observed consistency in features supports an inference that the features or patterns share a common source.86

Other scientific organizations, such as the American Statistical Association (ASA), have voiced similar warnings about language suggesting identification or source attribution.87 To avoid factfinders making inaccurate inferences about

82. 2009 NRC Report, supra note 37, at 21. See also id. at 7, 43 (“individualization”); 176 (“exclusion of all others”).

83. Id. at 161.

84. Id. at 160–61 (discussing Max Houck & Bruce Budowle, Correlation of Microscopic and Mitochondrial DNA Hair Comparisons, 47 J. Forensic Scis. 964 (2002)).

85. 2016 PCAST Report, supra note 20, at 121–22.

86. Id. at 61–62.

87. See Am. Stat. Ass’n, American Statistical Association Position on Statistical Statements for Forensic Evidence 1, 4–5 (Jan. 2, 2019), https://perma.cc/GHH3-VY4G (“The ASA strongly discourages statements to the effect that a specific individual or object is the source of the forensic

certainty, the ASA further suggests requiring experts to make clear in their testimony that “it is possible that other individuals or objects may possess or have left a similar set of observed features” and to acknowledge “both in testimony and in written reports” a method’s “absence of models and empirical evidence.”88 And while some testimony guides direct examiners to limit their claims to stating that the evidence provides “strong support” for identification (as opposed to a categorical source attribution), even such statements should be backed by empirical data about the rarity of the features, so that the examiner’s observation of them justifies that language in a given case.

Still other terms have been critiqued as simply inappropriate. For example, the NCFS published a consensus views document stating that terms like “reasonable degree of scientific certainty” and “reasonable degree of ballistic certainty” have no scientific basis and should not be used by legal actors or examiners.89 The NCFS also published a consensus views document on “Inconsistent Terminology,” which discussed inconsistent and misleading uses across disciplines of words like “perimortem” and “inconclusive,” and of implicitly overclaiming phrases like “there could be another individual somewhere in the world” with the same feature or pattern.90 In federal court, some examiners will be limited by DOJ’s guidelines (relatively new as of this writing) on uniform language for testimony and reports, which typically require examiners to report a source identification, a source exclusion, or an inconclusive determination.91 Some commentators urge a retreat from such categorical language, given the lack of data about the rarity of features in the population that would be needed to scientifically support any claim of individualization. Instead, they urge examiners to use only scientifically supported statements about the strength of the evidence, in the form of likelihood ratios or other statements that take into account population data showing the rarity of features in the relevant population.92
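The likelihood-ratio framing urged by these commentators can be sketched with a toy calculation. All numbers below are invented for illustration; a real likelihood ratio would require validated population data on the rarity of the compared features.

```python
# Hypothetical illustration of a likelihood ratio (LR) for source-level
# evidence. An LR compares how probable the observations are under two
# competing propositions, rather than asserting an identification.

# P(observed feature agreement | same source): how often the compared
# features would agree this well if the samples shared a source.
# (Invented value.)
p_same = 0.95

# P(observed feature agreement | different sources): how often the same
# degree of agreement would arise by coincidence, estimated from
# population data on feature rarity. (Invented value.)
p_diff = 0.02

lr = p_same / p_diff  # about 47.5 on these assumed inputs
```

An examiner reporting this figure would say the observations are about 47.5 times more likely if the two samples share a source than if they do not, a statement of evidential strength that leaves the ultimate source question to the factfinder rather than asserting a categorical identification.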

science evidence. Instead, the ASA recommends that reports and testimony make clear that, even in circumstances involving extremely strong statistical evidence, it is possible that other individuals or objects may possess or have left a similar set of observed features. . . . The ASA recommends that the absence of models and empirical evidence be acknowledged both in testimony and in written reports.”).

88. Id.

89. See U.S. Dep’t of Justice, Nat’l Comm’n on Forensic Sci. (NCFS), Work Products Adopted by the Commission (July 23, 2018), https://perma.cc/64DR-4VDA (links to documents).

90. See U.S. Dep’t of Justice, NCFS, Final Draft Views on Inconsistent Terminology, https://perma.cc/9CT6-VY3U (last visited Nov. 26, 2024).

91. See, e.g., U.S. Dep’t of Justice, Uniform Language for Testimony and Reports for the Forensic Firearms/Toolmarks Discipline Pattern Examination (2023), https://perma.cc/4N57-TDAG.

92. See, e.g., Geoffrey Stewart Morrison et al., Calculation of Likelihood Ratios for Inference of Biological Sex from Human Skeletal Remains, 3 Forensic Sci. Int’l: Synergy 100202 (2021), https://doi.org/10.1016/j.fsisyn.2021.100202 (urging use of likelihood ratios (LRs) instead of statements of

Conflating related but distinct scientific principles such as “reliability” and “validity.”

Although courts often use the terms validity and reliability interchangeably, the terms have distinct meanings in different scientific disciplines. Validity typically refers to the ability of a test to “accurately measure[] what it is intended to measure.”93 Reliability, in scientific and statistical terms, refers to the extent to which a method “produces largely consistent results when properly applied.”94 Reliability can be further broken down into repeatability (“intra-examiner reliability,” or the extent to which the same examiner reaches the same results under the same conditions) and reproducibility (“inter-examiner reliability,” or the extent to which other examiners reach the same result using the same method).95 The Supreme Court acknowledged these distinctions in Daubert, but the Court indicated that it was using the term reliability in a different sense. The Court wrote that its concern was “evidentiary reliability—that is, trustworthiness. . . . In a case involving scientific evidence, evidentiary reliability will be based upon scientific validity.”96 Thus, judges should be aware that when the 2016 PCAST Report refers to “foundational validity” and “validity as applied,” it is speaking to the same questions as Daubert. At the same time, other experts might speak of a method’s reliability solely in terms of its reproducibility and repeatability, rather than its accuracy in getting the right answer. Judges should be aware of the context in which the terms are used.
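The distinction between reliability and validity can be illustrated with a toy numeric sketch (all values invented): a method that returns the same answer every time is perfectly “reliable” in the scientific sense, yet may still be invalid.

```python
# Toy illustration of the reliability/validity distinction.
# Ground truth for 10 hypothetical comparisons: True = same source.
truth = [True, False, True, False, False, True, False, False, True, False]

# A hypothetical "method" that always declares a match is perfectly
# repeatable (the same examiner always gives the same answer) and
# reproducible (every examiner agrees) -- i.e., fully reliable.
calls_run1 = [True] * len(truth)
calls_run2 = [True] * len(truth)

repeatable = calls_run1 == calls_run2  # True: perfectly consistent

# But consistency says nothing about accuracy: here the method is
# right only when the samples truly share a source (4 of 10 cases).
accuracy = sum(c == t for c, t in zip(calls_run1, truth)) / len(truth)
```

On these invented numbers the method is perfectly consistent yet correct only 40% of the time, which is why reproducibility and repeatability alone cannot establish that a method gets the right answer.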

“Class” versus “subclass” versus “individual” characteristics.

Many non-DNA forensic feature comparison techniques purport to distinguish among “class,” “subclass,” and “individual” characteristics. Class and subclass characteristics are believed to be shared by a group of persons or objects (e.g., ABO blood types, or the toolmark patterns seen on a particular kind of Glock handgun).97 So-called individual characteristics are thought to be unique to an object or person, to the

source identification); David H. Kaye, The Nikumaroro Bones: How Can Forensic Science Assist Factfinders?, 6 Va. J. Crim. L. 101 (2018) (same).

93. See Hal S. Stern, Maria Cuellar & David Kaye, Reliability and Validity of Forensic Science Evidence, 16 Significance 21, 22 (2019), https://doi.org/10.1111/j.1740-9713.2019.01250.x.

94. Id.

95. Id. at 22–23.

96. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 590 n.9 (1993) (“We note that scientists typically distinguish between ‘validity’ (does the principle support what it purports to show?) and ‘reliability’ (does application of the principle produce consistent results?) . . . .”).

97. See Jan S. Bashinski & Joseph L. Peterson, Forensic Sciences, in Local Government: Police Management 556 n.74 (William Geller & Darrel Stephens eds., 4th ed. 2004):
The forensic scientist first investigates whether items possess similar ‘class’ characteristics—that is, whether they possess features shared by all objects or materials in a single class or category. (For firearms evidence, bullets of the same caliber, bearing rifling marks of the same number, width, and direction of twist, share class characteristics. They are consistent with being fired from the same type of weapon.) The forensic scientist then attempts to determine an item’s ‘individuality’—the features that make one thing different from all others similar to it, including those with similar class characteristics.

exclusion of all others. The term match is especially ambiguous with respect to these types of methods because a match could mean merely that the two patterns are thought to come from the same class or subclass, or it could mean that the two patterns are thought to come from the same source because they share individual characteristics. Expert opinions involving “individual,” “subclass,” and “class” characteristics raise different issues, including whether the determination rests on a firm scientific foundation98 and how large a class or subclass is (for determining the probative value of a determination that two compared items are part of the same class or subclass).99

Where an examiner seeks to testify to observing an individual characteristic, judges might further inquire into the assumptions underlying that description as part of a Daubert inquiry. The examiner is presumably assuming that a particular observed feature is unique, raising the need for empirical data suggesting uniqueness. The examiner’s use of that term also presumably reflects an assumption that a single “individual” characteristic is sufficient to make a “source identification” conclusion, given that no other item in the world has that one characteristic. Some disciplines might be able to sustain that assumption, while others might not.

Presenting conclusions in quantitative terms based solely on examinations conducted by the examiner.

Forensic examiners from some disciplines and laboratories sometimes appeal to the sheer number of examinations they have done as a sufficient empirical basis for a source conclusion statement. For example, an examiner might claim that “in 10 years of practice they have examined 5,000 items like this and have never seen such a high degree of similarity,” or that “the probability of observing such a strong similarity if items do not have a common source is less than 1 in 1,000.” The FBI, for its part, has recently affirmed that its testimony monitoring program seeks to eliminate such claims made by testifying experts.100

The fallacy of the transposed conditional (the “prosecutor’s fallacy”).

Forensic examiners, like all people, occasionally fall prey to statistical fallacies in drawing conclusions. One recurring example in the feature comparison context is the so-called fallacy of the transposed conditional. A conditional probability is the

98. See Michael Saks & Jonathan Koehler, The Individualization Fallacy in Forensic Science Evidence, 61 Vand. L. Rev. 199 (2008).

99. See Margaret A. Berger, Procedural Paradigms for Applying the Daubert Test, 78 Minn. L. Rev. 1345, 1356–57 (1994):
We allow eyewitnesses to testify that the person fleeing the scene wore a yellow jacket and permit proof that a defendant owned a yellow jacket without establishing the background rate of yellow jackets in the community. Jurors understand, however, that others than the accused own yellow jackets. When experts testify about samples matching in every respect, the jurors may be oblivious to the probability concerns if no background rate is offered, or may be unduly prejudiced or confused if the probability of a match is confused with the probability of guilt, or if a background rate is offered that does not have an adequate scientific foundation.

100. See Goldsmith, supra note 69 (noting the FBI’s position on this).

probability of one event A “given” another event B—for example, the probability that a playing card is a queen, given that it is a heart, is 1 in 13. If you “transposed” this conditional, it would be the probability of B given A—for example, the probability a card is a heart, given that it is a queen, which is instead 1 in 4. Conflating the two as equal is the fallacy of the transposed conditional. In the forensic context, the fallacy would be to conflate one conditional probability—“the chance that he would match the crime scene evidence given that he is innocent,” with the transposed conditional—“the chance that he is innocent given that he matches the crime scene evidence.” Imagine, for example, that type A-positive blood was found at a crime scene, and that the arrested suspect has type A-positive blood (along with 30% of the rest of the population). We might say that the chance a random person would “match” the blood evidence, assuming he is innocent, is 30%. But it is certainly not true that the chance a person is innocent, given the mere fact that he “matches” the blood evidence, is 30%. In a population of a million people, a full 300,000 people would “match” the blood evidence simply by coincidence. The chance that any one of these 300,000 random people is innocent, given that they “match,” is certainly more than 30%; indeed, it is overwhelmingly high.

But when the probabilities become very small, people have a hard time with this concept. In the DNA context, the error would be as follows. Imagine a random match probability (say, 1 in a million), which is the probability that, given a person is not the source of the DNA, they will “match” by sheer chance. This means we would expect about 1 in a million people to have this profile, and that, in a country of 300 million people, around 300 would “match” by chance. The fallacy would be to inaccurately present it as the probability that, given a person “matches” the DNA at a crime scene, there is only a 1 in a million chance they are not the source.101 In the shoe print context, the error would be similar. If one starts with the statement, “given a randomly selected person, the probability they will have these particular chevron and wave patterns on their shoe is small,” the fallacy would be to claim that “given this person had these particular chevron and wave patterns on their shoe, the chance that they are not the source is small.”
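The base-rate arithmetic behind the fallacy can be made concrete in a short sketch. The function name is illustrative, and the sketch assumes exactly one true source present in a population of equally likely candidates—a simplification, not a claim from the text:

```python
# Illustrates why P(match | not the source) differs from P(not the source | match),
# using the blood-type and DNA figures discussed above.

def prob_innocent_given_match(population, match_probability, true_sources=1):
    """Among everyone expected to 'match', what fraction are not the source?

    match_probability is P(match | not the source), e.g. 0.30 for A-positive blood.
    Assumes exactly `true_sources` actual source(s) in the population.
    """
    coincidental_matches = (population - true_sources) * match_probability
    total_matches = coincidental_matches + true_sources
    return coincidental_matches / total_matches

# Blood-type example: 30% of a million people match by coincidence.
p_blood = prob_innocent_given_match(1_000_000, 0.30)    # ~0.999997, not 0.30

# DNA example: 1-in-a-million random match probability, 300 million people,
# so roughly 300 coincidental matchers alongside the one true source.
p_dna = prob_innocent_given_match(300_000_000, 1e-6)    # ~0.9967, not 1 in a million
```

The point of the sketch is that the transposed conditional depends on the base rate of matchers in the population, which the untransposed probability alone does not supply.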

Reliability Concerns
Concerns about foundational reliability.

As reflected in the governmental and NGO reports discussed in the previous subsection titled “Prior Government and NGO Review of Forensic Feature Comparison Evidence,” lingering concerns

101. See, e.g., McDaniel v. Brown, 558 U.S. 120 (2010) (noting that the prosecutor and the government’s DNA expert both engaged in the fallacy of the transposed conditional in their statements before the jury); Andrea Roth, Safety in Numbers?: Deciding When DNA Alone Is Enough to Convict, 85 N.Y.U. L. Rev. 1130 (2010) (explaining the fallacy in laypersons’ terms).

exist about the foundational reliability of some non-DNA forensic feature comparison methods. These are important concerns, as “it has become apparent, over the past decade, that faulty forensic feature comparison has led to numerous miscarriages of justice.”102 In brief, critics focus on four main issues.

First, non-DNA methods lack variability data that show how rare the compared features are in the relevant population. Thus, it is difficult to know merely from the presence of shared features between two items how strong the inference should be that the two items share a common source.103 While some examiners might rely on their own casework experience to opine about the likely frequency of features in the relevant population, the 2016 PCAST Report cautioned against reliance on examiner experience on this front without empirical data.104

Second, these methods are relatively subjective compared to DNA.105 Because they are based largely on individual examiner judgment and experience, the feature comparisons themselves are sometimes critiqued as more prone to contextual bias and less able to be repeated and reproduced both by the same examiner and by other examiners, which are hallmarks of good science (reproducibility and repeatability).

Third, critics argue there are too few well-designed validation studies indicating the extent of a method’s accuracy. Put differently, these methods have not been sufficiently shown through well-designed empirical testing to consistently get the right answer. Regardless of how subjective or black box-like a method is, the method can still be tested against a known truth to determine how often the

102. 2016 PCAST Report, supra note 20, at 44 (citing exoneration compilations); see also Kori Khan & Alicia Carriquiry, Hierarchical Bayesian Non-response Models for Error Rates in Forensic Black-Box Studies, 381 Phil. Transactions Royal Soc’y A: Mathematical, Physical, & Eng’g Scis. 1, 2 (2023), https://doi.org/10.1098/rsta.2022.0157 (explaining that a primary cause of these wrongful convictions is the “ad hoc, subjective methods to evaluate evidence, and the exaggerated claims by expert witnesses at trial”).

103. See, e.g., 2009 NRC Report, supra note 37, at 149 (critiquing lack of variability studies in non-DNA methods); Robin Mejia et al., What Does a Match Mean? A Framework for Understanding Forensic Comparisons, 16 Significance 25–28 (2019), https://doi.org/10.1111/j.1740-9713.2019.01251.x (noting the lack of population data to explain the rarity of characteristics and the resulting inability to accurately determine the probative value of a match).

104. 2016 PCAST Report, supra note 20, at 55 (“The frequency with which a particular pattern or set of features will be observed in different samples, which is an essential element in drawing conclusions, is not a matter of ‘judgment.’ It is an empirical matter for which only empirical evidence is relevant.”).

105. See, e.g., id. at 47 (“Objective methods are, in general, preferable to subjective methods. Analyses that depend on human judgment (rather than a quantitative measure of similarity) are obviously more susceptible to human error, bias, and performance variability across examiners.”). For more on the need for objective measures, see Karen Kafadar, The Need for Objective Measures in Forensic Science, 16 Significance 16 (2019), https://doi.org/10.1111/j.1740-9713.2019.01249.x. Of course, DNA analysis itself is also subjective in many respects. See generally Erin Murphy, The Art in the Science of DNA: A Layperson’s Guide to the Subjectivity Inherent in Forensic DNA Typing, 58 Emory L.J. 489 (2008).

method gets the right answer.106 Such testing can be used to calculate an error rate, giving factfinders a better sense of how strongly the evidence supports the inference the examiner has drawn. The 2023 amendments to Rule 702 reflect the growing consensus that courts must be willing to “receive an estimate” of the error rates of the methodologies employed. The advisory committee’s notes’ emphasis on error rates is consistent with the independent consensus positions of multiple national committees107 as reflected in the 2009 NRC Report,108 the 2016 PCAST Report,109 and other recent reports from scientific review boards and experts.110

The 2016 PCAST Report in particular concluded that non-DNA feature comparison specialties vary widely in terms of well-designed validation studies. The report, after reviewing existing validation studies for a variety of disciplines, concluded that friction ridge analysis (fingerprint comparison) and analysis by expert systems of certain DNA mixtures have proof of foundational reliability with a known error rate,111 while the validity of other methods (such as microscopic hair comparison, toolmark analysis, and bitemark analysis) had not been

106. See 2016 PCAST Report, supra note 20, at 5–6 (noting that “black box” methods can and must be tested).

107. See generally Paul C. Giannelli, Forensic Science: Daubert’s Failure, 68 Case W. Rsrv. L. Rev. 869 (2018) (discussing the conclusions of 2009 NRC Report, PCAST, and the NCFS).

108. See discussion at section titled “Prior Government and NGO Review of Forensic Feature Comparison Evidence” above.

109. See id.

110. See, e.g., Karen Kafadar, Statistical Issues in Assessing Forensic Evidence, 83 Int’l Stat. Rev. 111, 120 (2015), https://doi.org/10.1111/insr.12069 (expressing support for the 2009 NRC Report and opining that “[t]he scientific method is very general, and the principles of science apply to all branches, irrespective of the specific application (e.g. physics, chemistry or biology). If the scientific method has not been applied, then scientists from no branch can defend the results.”); William Thompson et al., Am. Ass’n for the Advancement of Sci. (AAAS), Forensic Science Assessments: A Quality and Gap Analysis: Latent Fingerprint Examination 7–8, 43–44 (2017), https://www.aaas.org/report/latent-fingerprint-examination [hereinafter AAAS Fingerprint Report] (noting the need to quantify uncertainty in the method); Melissa Taylor et al., NIST, Forensic Handwriting Examination and Human Factors: Improving the Practice Through a Systems Approach (2020), https://doi.org/10.6028/NIST.IR.8282 [hereinafter 2020 NIST Report] (same); Kelly Sauerwein et al., Bitemark Analysis: A NIST Scientific Foundation Review (2023), https://doi.org/10.6028/NIST.IR.8352 [hereinafter NIST Bitemark Report] (same).

111. 2016 PCAST Report, supra note 20, at 9 (“PCAST finds that latent-fingerprint analysis is a foundationally valid subjective methodology—albeit with a false positive rate that is substantial and is likely to be higher than expected by many jurors based on longstanding claims about the infallibility of fingerprint analysis.”). See also PCAST Addendum, supra note 30, at 4:
[T]he friction-ridge discipline . . . has set an excellent example by undertaking both (i) path-breaking black-box studies to establish the validity and degree of reliability of latent-fingerprint analysis, and (ii) insightful “white-box” studies that shed light on how latent-print analysts carry out their examinations, including forthrightly identifying problems and needs for improvement. PCAST also applauds ongoing efforts to transform latent-print analysis from a subjective method to a fully objective method.

supported by even one appropriately designed validation study.112 More recent studies since the 2016 PCAST Report have echoed these concerns.113

Fourth, some examiners have used methods for a purpose or context beyond their scientific foundations. Even if a method might be foundationally valid for one modest purpose (such as determining whether two projectiles were fired from the same brand and model of firearm), it might not be foundationally valid for a more ambitious purpose (such as determining the likelihood that two projectiles were fired from the same firearm). Similarly, a handwriting comparison method might be foundationally valid for determining whether two exemplars were written by different people (exclusion), but not for determining the likelihood that they were written by the same person. The more ambitious an examiner’s claim on the witness stand, the greater the concern that the method has been stretched beyond its scientific foundations. This could also be an issue with the method’s reliability as applied, but it could be seen as an issue of foundational reliability if the method itself is simply not fit for the purpose for which it is used.

These continuing issues with non-DNA feature comparison methods are likely in part a reflection of their origins outside the derivative sciences. For the most part, non-DNA feature comparison methods did not develop as “scientific” methods; rather, they grew from forensic practices that did not have governing

112. See PCAST Addendum, supra note 30, at 5 (“In its report, PCAST stated that it found no empirical studies whatsoever that establish the scientific validity or degree of reliability of bitemark analysis as currently practiced. To the contrary, it found considerable literature pointing to the unreliability of the method.”); id. at 6 (With respect to non-DNA hair comparison, “the acknowledged lack of any empirical evidence about false-positive rates indeed means that, as a forensic feature-comparison method, hair comparison lacks a scientific foundation. . . . PCAST concludes that there are no empirical studies that establish the scientific validity and estimate the reliability of hair comparison as a forensic feature-comparison method.”); NIST Bitemark Report, supra note 110, at 24 (concluding “forensic bitemark analysis lacks a sufficient scientific foundation” and stating that “the data available does not support the accurate use of bitemark analysis to exclude or not exclude individuals as the source of the bitemark”).

113. A 2018 study concluded that the method of fingerprint experts is “far from error free.” See Garrett & Mitchell, supra note 19, at 908 (summarizing their findings of an error rate ranging from 1–2% to 10–20% per test with respect to false positives, with an average of 7% false positives and 7% false negatives over a nearly 20-year period from 1995–2016). Several black box studies have been conducted in multiple forensic disciplines since the 2016 PCAST Report, suggesting a very low rate of error. However, those examining the studies claim that the accuracy is lower than reported, owing to a variety of causes that reflect “missingness problems”: self-selected participants, high rates of attrition, nonresponse, how inconclusive responses are used, and low reproducibility (inter-examiner consistency) and repeatability (intra-examiner consistency). Khan & Carriquiry, supra note 102, at 3–6. The “tremendously high rates of missingness” preclude a true estimate of error rates. Id. at 17–18. For a critical analysis of the error rates of black box studies conducted on firearms analysis, see Heike Hofmann, Alicia Carriquiry & Susan Vanderplas, Treatment of Inconclusives in the AFTE Range of Conclusions, 19 L., Probability & Risk 317, 342–43 (2020), https://doi.org/10.1093/lpr/mgab002 (addressing the effect of inconclusive findings on error rate accuracy and concluding that there is “significant work to be done” before the authors could confidently provide an error rate associated with firearms and toolmark analysis).

standards, empirical testing, or controlled methods.114 There were few, if any, scientists involved in the development of these specialties.115 Many in the forensic community were resistant to change, and historically courts were lenient in evaluating admissibility of such evidence.116

While courts frequently admit feature comparison evidence without limitation in many specialties,117 several federal courts have noted these concerns, particularly in the areas of handwriting comparison and firearm comparison.118 Even with respect to fingerprint analysis, Judge Frank Easterbrook, writing for the U.S. Court of Appeals for the Seventh Circuit, noted that the method’s “false positive rate . . . is substantial and is likely to be higher than expected by many jurors based on longstanding claims about the infallibility of fingerprint analysis.”119 Following the suggestions of the advisory committee’s notes to the 2023 amendment to Rule 702, courts should ensure that examiner testimony is limited to what is supportable by the empirical data. Courts might choose, for example, to limit the direct testimony to the known error rate, restrict experts’ certainty of conclusions, limit the language experts use, or in some cases exclude the evidence entirely.120

Concerns about reliability as applied.

According to PCAST and other commentators, one key to showing a feature or pattern method’s “reliability as applied” is proficiency testing, or “ongoing empirical tests to ‘evaluate the capability and performance of analysts.’”121 Such testing is particularly critical where a method rests in large part on the expert’s human judgment, which is especially vulnerable to error and inter-examiner variability.122 Given how central an

114. See generally Jennifer L. Mnookin et al., The Need for a Research Culture in the Forensic Sciences, 58 U.C.L.A. L. Rev. 725 (2011) (noting the nonscientific origin of pattern disciplines and the lack of a culture of scientific inquiry in these fields).

115. Michael J. Saks & Jonathan J. Koehler, The Coming Paradigm Shift in Forensic Identification Science, 309 Science 892, 893 (2005), https://doi.org/10.1126/science.1111565.

116. 2009 NRC Report, supra note 37, at 108–09 (citations omitted).

117. Giannelli, supra note 107 (explaining that, with few exceptions, courts have continued to admit forensic science in the same way as they did pre-Daubert).

118. See sections titled “Handwriting Evidence” and “Firearms and Toolmark Analysis” below.

119. United States v. Bonds, 922 F.3d 343, 345 (7th Cir. 2019).

120. “Forensic experts should avoid assertions of absolute or one hundred percent certainty—or to a reasonable degree of scientific certainty—if the methodology is subjective and thus potentially subject to error.” Fed. R. Evid. 702 advisory committee’s note to 2023 amendment. The notes additionally urge that when making the decision whether to admit forensic expert testimony, “the judge should (where possible) receive an estimate of the known or potential rate of error of the methodology employed, based (where appropriate) on studies that reflect how often the method produces accurate results.” Id.

121. 2016 PCAST Report, supra note 20, at 57.

122. Id. at 58. See also Garrett & Mitchell, supra note 19, at 902 (explaining why “proficiency testing is the only objective means of assessing the accuracy and reliability of experts who rely on subjective judgments to formulate their opinions (so-called ‘black-box experts’)”).

expert’s individual skill is to the success (or not) of relatively subjective methods, the demonstration of such individual skill is important, and—according to these commentators—the only way to meaningfully demonstrate such skill is by measuring each expert’s successful performance on realistic proficiency tests.123 One recurring challenge, however, is that existing proficiency tests tend to be relatively easy and thus arguably not a meaningful check on examiner competence.124

Of course, proficiency testing alone only shows that the expert is qualified, not that the expert reliably applied the method in the case at hand. Also relevant to showing reliability as applied is documentation that the method was correctly applied in the case at hand, by the particular laboratory and examiner(s). This documentation can come in the form of internal validation, protocols, bench notes, reports, and testimony as to what the expert did and did not do in the particular case and the conditions of the sample and testing.

Moreover, to stay within the bounds of validity as applied, the examiners cannot make claims that go beyond the empirical evidence, and any limitations of the method should be acknowledged along with the demonstrated error rate of the method used.125 The advisory committee’s note to the 2023 amendment to Rule 702(d) makes clear that the expert’s application of the methodology should be deemed unreliable if they overstate the strength of the inference that can be drawn from the methodology’s application in a given case.126 Expert opinion testimony regarding the weight of feature comparison evidence (i.e., evidence that a set of features corresponds between two examined items) must be limited to those inferences that can reasonably be drawn from a reliable application of the principles and methods. Thus, a statement of certainty or source attribution might not be a reliable application of a particular method, even if a statement about a class characteristic is.

To be sure, most courts are hesitant to exclude testimony where there are concerns about how well the examiner has applied the methodology to the case at hand, often finding such issues are properly matters for cross-examination.127

123. Given the wide variability of accuracy among examiners, each expert should be performance-tested. See Bradford T. Ulery et al., Accuracy and Reliability of Forensic Latent Fingerprint Decisions, 108 Proc. Nat’l Acad. Scis. 7733 (2011), https://doi.org/10.1073/pnas.1018707108 (noting the disagreement among examiners about whether fingerprints were suitable for reaching a conclusion).

124. The president of the Collaborative Testing Services (CTS) has candidly acknowledged that “[e]asy tests are favored by the community” of customers for such tests. 2016 PCAST Report, supra note 20, at 57 n.133 (quoting president of CTS).

125. Id. at 50–51. Note that calculation of the error rate might itself be controversial in a particular case, given issues related to, for example, whether to treat inconclusive determinations as false positives. See discussion at section titled “Firearms and Toolmark Analysis” below (discussing firearms validation studies).

126. See Fed. R. Evid. 702 advisory committee’s note to 2023 amendment.

127. See, e.g., Itiel E. Dror, Bridget M. McCormack & Jules Epstein, Cognitive Bias and Its Impact on Expert Witnesses and the Court, 54 Judges J. 8, 11 (2015) (the prevailing view is that “errors

Nonetheless, there are decisions where courts have not permitted testimony where the practitioner has not reliably applied the methodology to the facts of the case.128 In any event, the 2023 amendments to Rule 702 make clear that testimony is not admissible where the expert has not reliably applied the methodology to the case at hand. This shortcoming is not simply a matter for cross-examination but is critical to the decision about admissibility. If the judge is not satisfied by a preponderance of the evidence that the expert has reliably applied the methodology to the case at hand, the testimony should not be admitted.

Discussion of Selected Forensic Feature Comparison Evidence

Differences Between DNA and Other Methods

This reference guide addresses only non-DNA forensic feature comparison techniques, in part because DNA is ubiquitous and complex enough that it is deserving of separate treatment, and in part because of inherent differences between DNA and other existing techniques. Judges would benefit from a basic understanding of these differences.

First, forensic DNA typing was developed in the medical research context. As discussed above, most forensic disciplines were developed by law enforcement, housed in law enforcement facilities, and staffed by law enforcement personnel. In the 2009 NRC Report’s view, this lack of independence from law enforcement has contributed to contextual bias, insistence upon a “zero error” rate and other unscientific claims, and a dearth of empirical studies.129 Whatever one thinks of the NRC’s conclusions in this regard, the fact is that much of the literature judges will read in forensic science litigation will tout this difference as a reason to view single-source DNA as the “gold standard” of forensic feature comparison. Second, the criteria for determining whether two DNA profiles are consistent with each other are relatively objective, precise, and repeatable compared to other feature comparison methods. While there is certainly some

in application should result in the exclusion of evidence only if they render the expert’s conclusions unreliable; otherwise, the jury should be allowed to consider whether the expert properly applied the methodology in determining the weight or credibility of the expert testimony” (quoting State v. Bernstein, 349 P.3d 200, 201 (Ariz. 2015))).

128. See, e.g., State v. McPhaul, 808 S.E.2d 294, 305 (N.C. Ct. App. 2017) (holding that the trial court abused its discretion in admitting fingerprint comparison evidence where the witness failed to explain how she reached her conclusion in the case at hand).

129. See 2009 NRC Report, supra note 37, at 71, 99, 104. See generally Erin E. Murphy, What “Strengthening Forensic Science” Today Means for Tomorrow: DNA Exceptionalism and the 2009 NAS Report, 9 L., Probability & Risk 7 (2010), https://doi.org/10.1093/lpr/mgp030 (explaining the NRC’s comparison of DNA typing to other more subjective techniques).

judgment involved in determining what is and is not a “real” genetic marker, it is more straightforward to compare two graphs and note that both have a particular marker at a particular location than it is to determine that a smudged characteristic in a latent fingerprint “matches” a characteristic in a reference print. Moreover, as the Reference Guide on Human DNA Identification Evidence, in this manual, discusses, validation studies estimate the variability of genetic markers in the population. Third, the data underlying DNA testing allow for a quantified measure of uncertainty rather than merely a statement that two features or patterns “match” or are “consistent,” or reliance on an examiner’s judgment to decide how many similarities merit a source identification.

Ultimately, DNA will always be different from other feature comparison disciplines. DNA has high probative value in part because it is based on a biological model that explains inheritance; a statistical model that faithfully corresponds to the biology; and the population data that enables estimation of the statistical model parameters. This unique paradigm for DNA cannot be applied in the same way to other types of evidence (such as friction ridge examinations), much less evidence that is manufactured (like a toolmark). That said, other aspects of DNA typing, such as collection of population data about the variability and frequency of features and development of quantified match statistics and error rates, could be replicated in non-DNA disciplines with more research and testing.

Fingerprint Evidence

Introduction

Fingerprinting has been a forensic technique since the mid-1800s,130 and English and American courts have accepted fingerprint identification testimony for just over a century.131 Over the years, at least before DNA, fingerprint analysis

130. Francis Galton authored the first textbook on the subject. Francis Galton, Fingerprints (1892) (describing the history of forensic fingerprinting, the ridges on hands and feet, and suggesting techniques for measuring and comparing ridges, including alleged racial differences). The origin of fingerprinting, like that of some other biometric techniques, is fraught given its connections to eugenics and “scientific racism,” and is far different from the discipline’s modern uses. See generally Simon Cole, Suspect Identities: A History of Fingerprint and Criminal Identification (2001). For more on the history of fingerprint identification, see Jennifer L. Mnookin, Fingerprint Evidence in an Age of DNA Profiling, 67 Brook. L. Rev. 13 (2001).

131. United States v. Llera Plaza, 188 F. Supp. 2d 549, 572 (E.D. Pa. 2002) (describing a 1906 English case in which “a New York City detective who had in 1904 been posted to Scotland Yard to learn about fingerprinting . . . used his new training to break open two celebrated cases: in each instance fingerprint identification led the suspect to confess. . . .”). The first published American appellate opinion sustaining admission of fingerprint testimony was People v. Jennings, 96 N.E. 1077 (Ill. 1911).

became known as the “gold standard” of forensic feature comparison expertise.132 In fact, proponents of new, emerging forensic techniques would sometimes attempt to borrow the prestige of fingerprint analysis for their methods.133 Likewise, some early proponents of DNA typing referred to it as “DNA fingerprinting.”134 Fingerprint comparison is still widely used and remains one of the most relied-upon forms of forensic feature comparison across the world.

This is not to say that fingerprint comparisons are free from error. In fact, the FBI’s incorrect identification of attorney Brandon Mayfield as the perpetrator of the 2004 Madrid train bombing is perhaps the most well-known error and one that spurred additional research.135 Like other forms of forensic feature comparison, fingerprint comparison was judicially adopted without any proof of its foundational reliability. As noted below, at least one court has limited expert testimony about fingerprints, and government reports have raised new questions about the limits of the technique’s accuracy.

The Method and Its Claims

Fingerprint analysis involves comparing the features of an evidence print (typically a “latent” invisible print lifted from a surface) and a reference print to determine if they might have a common source. Fingerprint comparison has historically been based on three assumptions: (1) the uniqueness of each person’s friction ridges on their fingers, (2) the permanence of those ridges throughout a person’s life, and (3) the transferability of an impression of that uniqueness to another surface. Contemporary research addresses each of these assumptions. First, scientific research has “convincingly established” that ridge patterns of humans’ fingers vary greatly among individuals and, as such, fingerprint comparison is a theoretically viable way to distinguish individuals.136 Second, fingerprints appear to be persistent over one’s lifetime, although there can be minor changes in friction ridge detail due to aging or occupation.137 Third, there is

132. See Sandy L. Zabell, Fingerprint Evidence, 13 J. L. & Pol’y 143 (2005) (noting fingerprint evidence was long considered the “gold standard” of human identification).

133. See, e.g., Kenneth Thomas, Voiceprint—Myth or Miracle, in Scientific and Expert Evidence 1015 (2d ed. 2011) (noting that advocates of sound spectrography referred to it as “voiceprint” analysis).

134. Colin Norman, Maine Case Deals Blow to DNA Fingerprinting, 246 Science 1556 (1989), https://doi.org/10.1126/science.2688090.

135. See Robert B. Stacey, Report on the Erroneous Fingerprint Individualization in the Madrid Train Bombing Case, Fed. Bureau of Investigation (Jan. 2005), https://perma.cc/5WY6-AWHE.

136. See AAAS Fingerprint Report, supra note 110, at 17. This variability extends to related individuals, and even identical twins develop distinguishable friction ridge skin detail. Id. (citations omitted). Distinguishing identical twins can be done with “relatively high accuracy,” albeit slightly lower accuracy than for non-twins. Id.

137. Id. at 27–28 (citing numerous sources).

ample evidence that fingerprints left on a surface can be accurately transferred to another surface for purposes of comparison using various methods, although, as noted, the quality (and thus analytical value) of these prints may differ dramatically from that of “rolled” prints (“record” prints that are typically rolled onto a fingerprint card or digitized and scanned into an electronic file).138 For example, latent prints from a crime scene vary in pressure, and the elasticity of skin naturally distorts the impression.139 Consequently, fingerprint impressions from the same person typically differ in some respects each time the impression is left on an object.140 The latent print might sometimes be so fragmentary or smudged that analysis is impossible.

Fingerprint examiners generally follow a procedure known as analysis, comparison, evaluation, and verification (ACE-V). In the analysis stage, the examiner studies the evidence print (typically a latent print from a crime scene) to determine whether the quantity and quality of details in the print are sufficient to permit further evaluation, and determines what features are present.141 In the comparison stage, the examiner visually compares the evidence and known prints to determine “the details that correspond” between them, such as the “overall shape” of the prints, “lengths of the ridges, minutia location and type, thickness of the ridges and furrows, shapes of the ridges, pore position, crease patterns and shapes, scar shapes, and temporary feature shapes (e.g., a wart).”142 In the evaluation stage, the examiner “evaluates the sufficiency of the detail present to establish an identification (source determination),” determining whether, “based on his or her experience,” a “sufficient quantity and quality of friction ridge detail is in agreement between the latent print and the known

138. See David A. Stoney, Scientific Status, in Modern Scientific Evidence: The Law and Science of Expert Testimony § 32.29 (David L. Faigman et al., 2021–2022 ed.) (discussing the various methods of extraction); United States v. Cruz-Mercedes, 379 F. Supp. 3d 24, 44–45 (D. Mass.), aff’d on other grounds, 945 F.3d 569 (1st Cir. 2019) (explaining how there was only one available latent print from which to make a comparison; the remaining prints were insufficiently clear to be useful).

139. See, e.g., Stoney, supra note 138, § 32.32 (“Smudging, dirty surfaces, dirty fingers, and contingencies of fingermark deposition all contribute to the incomplete transfer of the finger’s detail and the introduction of specious or artifactual detail.”); United States v. Mitchell, 365 F.3d 215, 220–21 (3d Cir. 2004) (“Criminals generally do not leave behind full fingerprints on clean, flat surfaces. Rather, they leave fragments that are often distorted or marred by artifacts. . . . Testimony at the Daubert hearing suggested that the typical latent print is a fraction—perhaps 1/5th—of the size of a full fingerprint.”); id. at 221 n.1 (“In the jargon, artifacts are generally small amounts of dirt or grease that masquerade as parts of the ridge impressions seen in a fingerprint, while distortions are produced by smudging or too much pressure in making the print, which tends to flatten the ridges on the finger and obscure their detail.”).

140. 2009 NRC Report, supra note 37, at 144 (“The impression left by a given finger will differ every time, because of inevitable variations in pressure, which change the degree of contact between each part of the ridge structure and the impression medium.”).

141. Id. at 137–38.

142. Id. at 138–39. See also Stoney, supra note 138, § 32.32.

print” to tell whether the prints do or do not come from the same source.143 In the verification stage, a second examiner repeats the analysis to see if they arrive at the same conclusion. Verification is blind if the second examiner does not know the first examiner’s conclusion.

Each of these stages in ACE-V affords examiners significant discretion. For example, examiners themselves (at least those in the United States) determine based on their judgment and experience not only how many points of comparison are sufficient before drawing a conclusion,144 but whether any apparent dissimilarities (which would otherwise require reporting an “exclusion”) can be discounted as an artifact or are a result of distortion.145 The examiner also has broad discretion in how to evaluate such factors as “inevitable variations” in pressure, but to date these factors have not been “characterized, quantified, or compared.”146

Examiners often use Automated Fingerprint Identification System (AFIS) databases to rapidly search known prints in no-suspect cases. This system of databases was developed in the late 1970s and has become more important for law enforcement with the growth of databases of known prints.147 AFIS is highly accurate when comparing a full set of ten rolled or slap prints to another set of known prints. AFIS searches are also useful for screening large numbers of prints, which can then be evaluated by human examiners.

Scientific Assessments, Critiques, and Debates

The following discussion is focused on critiques of fingerprint analysis, given that these will frame the litigation judges will face over admission of latent print examiner testimony.

Since the FBI’s false positive identification of Brandon Mayfield in 2004 and the 2009 NRC Report, critics have focused primarily on the following concerns with

143. 2009 NRC Report, supra note 37, at 138.

144. Stoney, supra note 138, § 32.35; United States v. Llera Plaza, 188 F. Supp. 2d 549, 566–71 (E.D. Pa. 2002) (noting that U.S. and Scotland Yard examiners have no minimum or set number of points of comparison); 2009 NRC Report, supra note 37, at 141 (“The latent print community in the United States has eschewed numerical scores and corresponding thresholds” and consequently relies “on primarily subjective criteria” in making the ultimate attribution decision).

145. Commonwealth v. Patterson, 840 N.E.2d 12, 17 (Mass. 2005):
There is a rule of examination, the “one-discrepancy” rule, that provides that a nonidentification finding should be made if a single discrepancy exists. However, the examiner has the discretion to ignore a possible discrepancy if he concludes, based on his experience and the application of various factors, that the discrepancy might have been caused by distortions of the fingerprint at the time it was made or at the time it was collected.

146. 2009 NRC Report, supra note 37, at 144.

147. AAAS Fingerprint Report, supra note 110, at 29. AFIS databases range from a few million subjects for state law enforcement agencies to over a hundred million for national agencies.

respect to fingerprint analysis: (1) its relative subjectivity, both in terms of reliance on judgment and lack of standards for determining the minimum number of points of comparison or the sufficiency of detail (and thus its tendency to allow contextual bias and inter- and intra-examiner variability); (2) the lack of variability data (about the rarity of various features) to support determinations of source identification; (3) the lack of well-designed validation studies and accurate error rate estimates; and (4) problems with specific applications of ACE-V, such as overreliance on AFIS, failure to examine a sufficient number of details, latent prints that exceed the method’s capabilities, lack of well-designed proficiency testing, and nonblind verification (where the verifying examiner already knows the first examiner’s conclusion, and thus might be biased in favor of affirming it).

Both the FBI’s post-Mayfield internal investigative report and the 2009 NRC Report focused on concerns related to subjectivity, lack of variability data, and reliability as applied. The FBI noted that the Mayfield false positive likely resulted from examiners’ assumption that the source of the latent print was among the top potential matches suggested by AFIS, combined with the fact that Mayfield’s print and the latent print were, in fact, remarkably similar (and thus also remarkably similar to the real source, a different suspect in Spain).148 The 2009 NRC Report focused on the relative subjectivity of the method and the lack of data as to how rare or common a particular type of fingerprint characteristic is.149 As the 2009 NRC Report explained, “the ACE-V method does not specify particular measurements or a standard test protocol, . . . examiners must make subjective assessments throughout.”150 Or, as statistician Sandy Zabell put it a year after the Mayfield incident: “In contrast to the scientifically-based statistical calculations performed by a forensic scientist in analyzing DNA profile frequencies, each fingerprint examiner renders an opinion as to the similarity of friction ridge detail based on his subjective judgment.”151 A later study by psychologist Itiel Dror and others found significant differences in latent print examiners’ conclusions about the same set of samples depending on whether the examiners had been exposed to task-irrelevant information, as well as inconsistency in the same examiner’s conclusions over time.152 The 2009 NRC Report concluded that the ACE-V method, while then nearly universally accepted by courts, was too “broadly stated” to “qualify as a validated method for this type of analysis.”153

148. See generally Off. of the Inspector Gen., U.S. Dep’t of Justice, A Review of the FBI’s Handling of the Brandon Mayfield Case (Mar. 2006), https://perma.cc/7UY6-575D.

149. 2009 NRC Report, supra note 37, at 139–40, 144.

150. Id. at 139.

151. Zabell, supra note 132, at 158.

152. Dror et al., supra note 51.

153. 2009 NRC Report, supra note 37, at 142.

The 2016 PCAST Report’s evaluation of ACE-V focused largely on (1) whether sufficient well-designed validation studies existed and what the estimated error rate was; and (2) the reliability of ACE-V as applied. The PCAST Report identified two friction ridge studies that met its criteria for a well-designed validation study, meaning one with samples representative of real casework; a large enough sample to generate a meaningful error rate estimate; a double-blind design (meaning neither the subject nor the grader knows it is a test or what the results are); and no changes to the testing protocol midstream.154 First, the FBI published a peer-reviewed study, in 2011, conducted in direct response to the 2009 NRC Report.155 The false positive rate from the FBI study was 0.17%, with an upper bound of the 95% confidence interval of 0.33% (meaning, in simple terms, that the data are statistically consistent with a true error rate as high as 0.33%).156 In practical terms, these rates correspond to a false positive occurring in an estimated 1 in every 604 cases, but potentially as often as 1 in every 306 cases.157 In 2012, the FBI conducted a follow-up study directed toward assessing repeatability and reproducibility. In the follow-up study, 75 of the original 169 examiners participated, and the false positive rate was broadly consistent with the prior study’s.158

The second and final ACE-V study identified by the PCAST Report as meeting its criteria for a well-designed validation study was conducted in 2014 by the Miami-Dade Police Department Forensic Services Bureau.159 This study had a false positive rate of 4.2%, with an upper bound of the 95% confidence interval of 5.4%.160 These rates correspond to a false positive occurring in an estimated 1 in 24 cases, but potentially as often as 1 in 18 cases.161 Of note, the study’s authors identified 35 potential clerical errors in the participants’ documentation.162 If those 35 errors were properly accounted for, the actual false positive rate would have been 0.7%, with an upper bound of the 95% confidence interval of 1.4%.163 These rates would correspond to a false positive occurring in an estimated 1 in 146

154. 2016 PCAST Report, supra note 20, at 94–97 (two studies); 52 (noting criteria for well-designed study).

155. Id. at 94.

156. Id.

157. Id.

158. Id. For more on the repeatability and reproducibility of examiner’s decisions, see Bradford T. Ulery et al., Repeatability and Reproducibility of Decisions by Latent Fingerprint Examiners, 7 PLoS ONE e32800 (2012), https://doi.org/10.1371/journal.pone.0032800.

159. 2016 PCAST Report, supra note 20, at 94–95. The study can be found at https://perma.cc/C47K-GCQA.

160. Id. at 95.

161. Id.

162. Id.

163. Id.

cases, but potentially as often as 1 in 73 cases.164 However, the report noted that to exclude errors in a post hoc manner was inappropriate for a validation study.165
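
The arithmetic behind these “upper bound” and “1 in N” figures is straightforward. The sketch below is a minimal illustration in Python, using hypothetical counts (not the FBI’s or Miami-Dade’s actual data) and a Wilson score interval, one common way to compute a 95% upper bound on an error proportion:

```python
import math

def wilson_upper(false_positives, trials, z=1.96):
    """Upper limit of a 95% Wilson score interval for a proportion."""
    p = false_positives / trials
    denom = 1 + z * z / trials
    center = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (center + margin) / denom

# Hypothetical counts, for illustration only -- not the studies' actual data.
fp, n = 6, 3600
observed = fp / n
upper = wilson_upper(fp, n)

print(f"observed false positive rate: {observed:.2%} (about 1 in {round(1 / observed)})")
print(f"95% upper bound:              {upper:.2%} (about 1 in {round(1 / upper)})")
```

Dividing one by the observed rate (or by its upper bound) yields the “1 in N” phrasing used in the text. The studies themselves may have used a different interval method, so the specific figures this sketch produces are illustrative only.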

The 2016 PCAST Report ultimately concluded, based on the two well-designed studies, that “latent fingerprint analysis is a foundationally valid subjective methodology—albeit with a false positive rate that is substantial and is likely to be higher than expected by many jurors based on longstanding claims about the infallibility of fingerprint analysis.”166 Note that the report’s use of the term foundationally valid appears to track the concept of “foundational reliability” in Daubert and its progeny. The 2016 PCAST Report also suggested that juries in cases involving latent print expert testimony be informed that: (1) there have been only two properly conducted studies whose results are worth considering; and (2) the false positive rates from these studies “could be as high as 1 in 306 in one study and 1 in 18 in the other study.”167 It also suggested, third, that examiners acknowledge that because the participants knew they were participating in the studies, the actual false positive rate could be higher than observed.168 Regarding this third point, however, the report also noted, in a separate section, that “[i]t is likely that a properly designed program of systemic, blind verification would decrease the false-positive rate, because examiners in the studies tend to make different mistakes.”169 Giving this information to factfinders, according to the report’s authors, would allow jurors “to weigh the probative value of the evidence.”170

Even though it concluded latent fingerprint analysis is foundationally valid (or foundationally “reliable” in Daubert parlance), the 2016 PCAST Report concluded there were several issues related to its validity (or reliability, in Daubert parlance) as applied.171 These issues are: (1) confirmation bias—examiners altering initially marked features in a latent print based on comparison with an exemplar, (2) contextual bias—other facts of the case influencing the examiner’s judgment, and (3) lack of proficiency testing.172 Given these concerns, the report concluded the following would be necessary to establish validity of ACE-V as applied:

From a scientific standpoint, validity as applied requires that an expert: (1) has undergone appropriate proficiency testing to ensure that he or she is

164. Id.

165. Id.

166. Id. at 101.

167. Id. at 96. The 2016 PCAST Report authors determined the number of studies worth considering based on their framework; other studies may provide helpful information. Moreover, the report authors urged that the error rates from existing validation studies be calculated based on the number of incorrect calls as a percentage of all conclusive examinations, and that they treat “inconclusives” as false positives. Id. at 93.

168. Id. at 109.

169. Id. at 96 (emphasis in original).

170. Id.

171. Id. at 102.

172. Id.

capable of analyzing the full range of latent fingerprints encountered in casework and reports the results of the proficiency testing; (2) discloses whether he or she documented the features in the latent print in writing before comparing it to the known print; (3) provides a written analysis explaining the selection and comparison of the features; (4) discloses whether, when performing the examination, he or she was aware of any other facts of the case that might influence the conclusion; and (5) verifies that the latent print in the case at hand is similar in quality to the range of latent prints considered in the foundational studies.173

Note that while proficiency testing might also be related to the qualification of a latent print examiner, the requirements above, beyond proficiency testing alone, relate to documentation of the method sufficient to support a finding by a preponderance of the evidence that the examiner reliably applied ACE-V in a particular case.

The findings of PCAST as to ACE-V’s error rate and limitations were repeated in a report on latent print examination the following year by the AAAS.174 One difference, however, was that the AAAS report looked at the entire body of research surrounding latent fingerprint analysis, whereas the PCAST report drew its conclusions from what it determined were the two appropriately designed black box studies.175 Despite the differences in scope of review, AAAS stated that its “conclusions largely align with those of the PCAST report.”176

The AAAS report also mirrored the 2009 NRC Report’s concern about the lack of empirical data underlying source conclusions, concluding that statements of source attribution are as yet scientifically unsupportable. While the AAAS report found that research had “convincingly established” that ridge patterns vary greatly among humans,177 it found no scientific basis for determining the rarity of any given ridge feature178 or “how many features, of what types, are needed in order for an examiner to draw definitive conclusions about whether a latent print was made by a given individual.”179 As a result, “[e]xaminers may well be able to exclude the preponderance of the human population as possible sources of a latent print, but there is no scientific basis for estimating the number of people who could not be excluded,” and there is no science to support the determination that the latent print came from a single source.180 Thus, any notion that an examiner can make or has

173. Id.

174. AAAS Fingerprint Report, supra note 110.

175. Id.

176. Id. at 44.

177. Id. at 23.

178. Id.

179. Id. at 21.

180. Id.

made an identification with 100% accuracy is “overstated” and “indefensible,”181 as is any “claim that they can associate a latent print with a single source.”182

The AAAS report then offered several suggestions to address any misconceptions a juror may harbor based on previous assertions of either source attribution or infallibility:

[L]atent print examiners should include specific caveats in reports that acknowledge the limitations of the discipline. They should acknowledge: (1) that the conclusions being reported are opinions rather than facts (as in all pattern-matching disciplines); (2) that it is not possible for a latent print examiner to determine that two friction ridge impressions originated from the same source to the exclusion of all others; and (3) that errors have occurred in studies of the accuracy of latent print examination.183

The AAAS report also recommended that examiners refrain from making statements that are not supported by the science behind latent fingerprint examination.184 Any words or phrases—such as “match,” “identification,” and “individualization”—that would imply a single source should be avoided.185 The AAAS report recommended proposed testimony an examiner might offer that avoids any such peril:

The latent print on Exhibit ## and the record fingerprint bearing the name XXXX have a great deal of corresponding ridge detail with no differences that would indicate they were made by different fingers. There is no way to determine how many other people might have a finger with a corresponding set of ridge features, but this degree of similarity is far greater than I have ever seen in non-matched comparisons.186

The Department of Justice, for its part, has changed its source conclusion language and other practices (such as nonblind verification) in response to the reports cited above. The FBI, for example, now forbids claims of “individualization.”187 DOJ’s Uniform Language for Testimony and Reports (ULTR) also prohibits testifying examiners from “assert[ing] that a ‘source identification’ or a ‘source exclusion’ conclusion is based on the ‘uniqueness’ of an item of evidence,” “us[ing] the terms ‘individualize’ or ‘individualization’ when describing a source conclusion,” or “assert[ing] that two friction ridge skin impressions originated from the same

181. Id. at 71.

182. Id. at 60. U.S. Dep’t of Justice, Approved Uniform Language for Testimony and Reports for the Forensic Latent Print Discipline (2018), https://perma.cc/E59D-B6C2.

183. AAAS Fingerprint Report, supra note 110, at 73.

184. Id. at 11.

185. Id.

186. Id.

187. See Fed. Bureau of Investigation, FBI Approved Standards for Scientific Testimony and Report Language for Forensic Document Comparisons 4 (2022), https://perma.cc/VZ5C-5ZBT.

source to the exclusion of all other sources.”188 These prohibitions are in addition to the proscription of asserting latent fingerprint analysis is “infallible or has a zero error rate.”189

Notwithstanding the recent purging of terms like individualization and zero error rate, nearly all laboratories, including DOJ’s, still allow examiners to testify to language suggesting categorical source attribution or exclusion. DOJ’s ULTR now directs its examiners to reach one of three conclusions: a source identification, a source exclusion, or an inconclusive determination.190 A source identification is an examiner’s opinion that the two prints came from the same source, based on the examiner’s judgment that there is “extremely strong support” that the two prints came from the same source and “extremely weak support” that they came from different sources.191 A source exclusion is an examiner’s opinion that the two prints did not come from the same source, based on the examiner’s judgment that there is “extremely strong support” that the two prints came from different sources and “extremely weak or no support” that they came from the same source.192 An examiner makes an inconclusive determination when “there is insufficient quantity and/or clarity” of detail between the two impressions to arrive at either a source identification or a source exclusion.193 While these terms do not claim infallibility, they do suggest the ability to reliably conclude that prints have a single common source. Concerns still linger, of course, about whether jurors understand the significance of such limitations.194

A final recurring concern relates to the use of AFIS databases. The Mayfield case was one high-profile example of a false hit from a database search. As databases grow, they may contain more fingerprints that share many common features and have few discernible dissimilarities, known as close non-“matches” (CNMs). In a study of 125 fingerprint agencies completing a mandatory proficiency test containing two pairs of CNMs, the false positive rates were 15.9% and 28.1%, respectively, raising serious concerns.195 While there are continued efforts to

188. U.S. Dep’t of Justice (DOJ), Uniform Language for Testimony and Reports for the Forensic Latent Print Discipline 3 (2020), https://perma.cc/J7GQ-US5Y.

189. Id.

190. Id. at 2.

191. Id. In other words, “[a]n identification is the statement of an examiner’s opinion (an inductive inference) that the probability that the two impressions were made by different sources is so small that it is negligible” (internal citation omitted).

192. Id.

193. Id. at 3.

194. In a study in the firearms context evaluating juror comprehension about differences in expert’s conclusion language, the authors found that more modest phrasing about conclusions did not affect jurors’ conclusions. See Brandon L. Garrett et al., Mock Jurors’ Evaluation of Firearm Expert Testimony, 44 Law & Hum. Behav. 412 (2020), https://doi.org/10.1037/lhb0000423.

195. See Jonathan J. Koehler & Shiquan Liu, Fingerprint Error Rate on Close Non-matches, 66 J. Forensic Scis. 129, 131–32 (2020), https://doi.org/10.1111/1556-4029.14580. The authors

Suggested Citation: "Reference Guide on Forensic Feature Comparison Evidence." National Academies of Sciences, Engineering, and Medicine and Federal Judicial Center. 2025. Reference Manual on Scientific Evidence: Fourth Edition. Washington, DC: The National Academies Press. doi: 10.17226/26919.

improve AFIS,196 the systems are not currently designed to prove that a particular pair of prints has a common source.197
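
For readers unfamiliar with how the error rates reported in studies like the CNM proficiency test are computed, a false positive rate is simply the share of known different-source comparisons wrongly reported as identifications. The sketch below uses invented counts, not the data from the study cited above:

```python
# Illustrative sketch only: the counts below are invented and do not come
# from the Koehler & Liu close non-match study cited in the text.
def false_positive_rate(false_positives: int, different_source_trials: int) -> float:
    """Share of known different-source comparisons wrongly called identifications."""
    if different_source_trials <= 0:
        raise ValueError("need at least one different-source trial")
    return false_positives / different_source_trials

# Hypothetical example: 5 erroneous identifications out of 40 known
# different-source comparisons.
rate = false_positive_rate(5, 40)
print(f"{rate:.1%}")  # prints "12.5%"
```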

Case Law Development

Since the Illinois Supreme Court’s approval of fingerprint analysis in People v. Jennings (1911), fingerprint testimony has been routinely admitted.198 Indeed, some courts stated that fingerprint evidence was the strongest proof of a person’s identity.199 With the exception of one federal district court decision that was later withdrawn by the court itself upon reconsideration, and another decision in which a court remanded to determine whether ACE-V might be unreliable as applied to simultaneous impressions (rather than single latent prints),200 nearly all courts before 2017 continued to reject challenges to fingerprint testimony.201

note that there were limitations to the study; the participants knew they were being tested and the samples were “convenience samples,” not random and representative selections from database searches. For more on AFIS searches, see Itiel E. Dror & Jennifer L. Mnookin, The Use of Technology in Human Expert Domains: Challenges and Risks Arising from the Use of Automated Fingerprint Identification Systems in Forensic Science, 9 Law, Probability & Risk 47 (2010), https://doi.org/10.1093/lpr/mgp031.

196. AAAS Fingerprint Report, supra note 110, at 33, quoting the 2016 PCAST Report, supra note 20.

197. AAAS Fingerprint Report, supra note 110, at 33.

198. People v. Jennings, 96 N.E. 1077 (Ill. 1911). As Professor Mnookin has noted, “fingerprints were accepted” after Jennings “as an evidentiary tool without a great deal of scrutiny or skepticism.” Mnookin, supra note 130, at 17.

199. People v. Adamson, 165 P.2d 3, 12 (Cal. 1946), aff’d, 332 U.S. 46 (1947).

200. See, e.g., Commonwealth v. Patterson, 840 N.E.2d 12, 15, 16–17 (Mass. 2005) (“These latent print impressions are almost always partial and may be distorted due to less than full, static contact with the object and to debris covering or altering the latent impression”; “In the evaluation stage, . . . the examiner relies on his subjective judgment to determine whether the quality and quantity of those similarities are sufficient to make an identification, an exclusion, or neither.”); and United States v. Llera Plaza, 179 F. Supp. 2d 492, 516, vacated, mot. granted on recons., 188 F. Supp. 2d 549 (E.D. Pa. 2002). The ruling excluded expert testimony that two sets of prints “matched,” to the exclusion of all other persons. On a motion for reconsideration, the court reversed itself. A spate of legal articles followed. See, e.g., Simon A. Cole, Grandfathering Evidence: Fingerprint Admissibility Rulings from Jennings to Llera Plaza and Back Again, 41 Am. Crim. L. Rev. 1189 (2004); Robert Epstein, Fingerprints Meet Daubert: The Myth of Fingerprint “Science” Is Revealed, 75 S. Calif. L. Rev. 605 (2002); Kristin Romandetti, Recognizing and Responding to a Problem with the Admissibility of Fingerprint Evidence Under Daubert, 45 Jurimetrics 41 (2004).

201. See United States v. Baines, 573 F.3d 979, 990 (10th Cir. 2009) (“Fingerprint identification has been used extensively by law enforcement agencies all over the world for almost a century.”); United States v. Abreu, 406 F.3d 1304, 1307 (11th Cir. 2005) (“We agree with the decisions of our sister circuits and hold that the fingerprint evidence admitted in this case satisfied Daubert.”); United States v. Janis, 387 F.3d 682, 690 (8th Cir. 2004) (finding fingerprint evidence to be reliable); United States v. Mitchell, 365 F.3d 215, 234–52 (3d Cir. 2004); United States v. Crisp, 324 F.3d 261, 268–71 (4th Cir. 2003); United States v. Collins, 340 F.3d 672, 682 (8th Cir.

Since the AAAS Fingerprint Report in 2017, at least one court has limited latent print testimony in new ways on “reliability as applied” grounds. For example, the North Carolina Court of Appeals held in 2017 that the trial court abused its discretion in allowing a latent print examiner to testify to a “match” between the defendant’s known prints and a latent print on a truck, where the testimony merely established that the examiner used ACE-V and determined the “match” based on her “training and experience” and “looking at the individual minutia” in each print.202 The court found a similar error in another case in 2020.203

Still, since the 2016 PCAST Report, most courts have treated concerns with latent print analysis as going to weight, not admissibility,204 citing the longstanding nature of the discipline and its traditional acceptance by courts under Daubert.205 Other courts upholding the admissibility of fingerprint analysis have noted that while reports have exposed flaws in ACE-V, these reports have also opened up new opportunities for the defense. One court noted that “[c]ross-examination on issues like error rates is possible now in a way in which it was not 30 years ago.”206 Still other courts have reasoned that even with flaws, fingerprint analysis is likely better than older evidence like eyewitness identifications and “grainy” photographs, suggesting that the middle ground is to admit such evidence and “subject [it] to cross-examination about a method’s reliability and whether the witness took appropriate steps to reduce errors.”207

2003) (“Fingerprint evidence and analysis is generally accepted.”); United States v. Hernandez, 299 F.3d 984, 991 (8th Cir. 2002); United States v. Sullivan, 246 F. Supp. 2d 700, 704 (E.D. Ky. 2003); United States v. Martinez-Cintron, 136 F. Supp. 2d 17, 20 (D.P.R. 2001).

202. State v. McPhaul, 808 S.E.2d 294, 304–05 (N.C. Ct. App. 2017), discretionary review improvidently allowed, 818 S.E.2d 102 (N.C. 2018) (abuse of discretion to admit where the expert failed to demonstrate that she applied the principles and methods reliably to the facts of the case).

203. Cf. State v. Koiyan, 841 S.E.2d 351 (N.C. Ct. App. 2020) (same error as in McPhaul, though declining to reverse under plain error review).

204. See United States v. Reyes-Ballista, No. CV 18-634-2 (ADC), 2020 WL 6822372, at *4 (D.P.R. Nov. 20, 2020) (“considering that defendant will have ample opportunity to conduct vigorous cross-examination of the government’s expert witnesses and present contrary evidence, defendant is not without means of attacking the evidence he now claims to be based on methods that run afoul the profession’s parameters and accepted methods”).

205. United States v. Pitts, No. 16-CR-550 (DLI), 2018 WL 1116550, at *4–6 (E.D.N.Y. Feb. 26, 2018), citing United States v. Avitia-Guillen, 680 F.3d 1253, 1260 (10th Cir. 2012) (“Fingerprint comparison is a well-established method of identifying persons, and one we have upheld against a Daubert challenge.”). See also Reyes-Ballista, 2020 WL 6822372, at *3 (noting that “several . . . Circuits have explicitly determined that, ‘in the context of fingerprint evidence, a Daubert hearing is not always required’”) (internal citation omitted); United States v. Stevens, 219 F. App’x 108, 109 (2d Cir. 2007) (summary order declining to hold a Daubert hearing); United States v. Bonds, No. 15 CR 573-2, 2017 WL 4511061 (N.D. Ill. Oct. 10, 2017), aff’d, 922 F.3d 343 (7th Cir. 2019).

206. United States v. Fell, No. 5:01-CRCR-12-01, 2016 WL 11550830, at *1 (D. Vt. Dec. 29, 2016).

207. Bonds, 922 F.3d at 346 (Easterbrook, J.).

Given the 2023 amendment to Rule 702, courts may now need to engage in more robust gatekeeping to evaluate both the foundational reliability and the reliability as applied of latent print expert testimony. As the advisory committee’s notes clarify, “many courts have held that the critical questions of the sufficiency of an expert’s basis, and the application of the expert’s methodology, are questions of weight and not admissibility. These rulings are an incorrect application of Rules 702 and 104(a).”208 The notes explain that “[j]udicial gatekeeping is essential.” As explained earlier, several gatekeeping options are available to the courts, along with the option of providing jury instructions to aid jurors in evaluating the evidence. With the enactment of the 2023 amendment to Rule 702, the judge must be satisfied that the proponent of the evidence has met each of the rule’s requirements by a preponderance. If the court is not satisfied, it should exclude those opinions. Additionally, the judge may appropriately limit any opinion to exclude overstatement of a conclusion.

Handwriting Evidence

Introduction

Individuals who compare handwriting and ink formulations, and who perform other tasks related to the evaluation of questioned documents, are known as questioned document examiners or forensic document examiners (FDEs). FDEs are called on to perform a variety of tasks, such as determining potential authorship of a writing; determining the sequence of strokes on a page; and determining whether a particular ink formulation existed on the purported date of a writing.209 Courts were originally skeptical of handwriting comparison and dismissed its value, but by 1900, many courts had begun to admit such expert evidence.210 In the 1935 trial arising from the Lindbergh baby kidnapping, expert testimony about the handwriting of the ransom notes figured prominently.211 Following the Lindbergh

208. Fed. R. Evid. 702 advisory committee’s note to 2023 amendment.

209. Questioned document examinations cover a wide range of analyses: handwriting, hand printing, typewriting, mechanical impressions, altered documents, obliterated writing, indented writing, and charred documents. See Paul C. Giannelli et al., Scientific Evidence § 14 (6th ed. 2020).

210. See, e.g., D. Michael Risinger, Mark P. Denbeaux, & Michael J. Saks, Exorcism of Ignorance as a Proxy for Rational Knowledge: The Lessons of Handwriting Identification “Expertise,” 137 U. Pa. L. Rev. 731, 762 (1989). For a deep and nuanced analysis of the history of handwriting identification, see Jennifer L. Mnookin, Scripting Expertise: The History of Handwriting Identification Evidence and the Judicial Construction of Reliability, 87 Va. L. Rev. 1723 (2001).

211. Risinger et al., Exorcism of Ignorance, supra note 210, at 771.

case, handwriting evidence became widely used and judicially accepted, but with little critical scrutiny by the courts into its accuracy.212

Reliance on handwriting comparison has diminished in recent years, and forensic document examination units are closing as demand shifts to other forensic disciplines, such as DNA analysis.213 Additionally, there is a continual decrease in handwritten communication, as individuals rely more on electronic communication and digital signatures. These developments suggest that this field may soon join other receding identification specialties. Nevertheless, it continues to be used in court, and in 2020 a NIST working group issued a report, discussed below, with recommendations for improving the practice.

The Method and Its Claims

Some FDE tasks, such as ink identification, are based on sophisticated techniques such as chromatography and mass spectrometry that are widely considered to be reliable.214 Other tasks, such as determining authorship of a document, present more substantial potential legal issues. This section will focus primarily on authorship.

Like other feature comparison experts, an FDE determines authorship by analyzing a questioned document and known sample (either previous writing or a requested handwriting exemplar), observing various details in each, comparing them in an attempt to discern consistent or inconsistent patterns, and then evaluating those similarities and differences to arrive at a conclusion about whether they might have a common author. There must be a sufficient quantity and quality of material to permit the examiner to compare the samples.215 The 2020 NIST Report suggests that it is best if the exemplars are written during the same period as the questioned document, to remove potential variables. In performing their comparison, examiners consider letter formation, size, and inter-word and intra-word spacing.216

The so-called conventional or classical approach is a two-stage process: examiners consider (1) class and (2) so-called “individual” characteristics. Class characteristics are of two types: “system” and “group.”217 People exhibiting system characteristics would include, for

212. Id. (noting that the Lindbergh case seemed to have “stamped out virtually all manifestations of judicial skepticism”).

213. 2020 NIST Report, supra note 110, at xi.

214. Giannelli et al., supra note 209, § 14.04 [4].

215. 2020 NIST Report, supra note 110, at 45 (noting the assumption of individuality rests on a large enough quality sample).

216. Id. at 14.

217. See James A. Kelly, Questioned Document Examination, in Scientific and Expert Evidence 695, 698 (2d ed. 2011).

example, those who learned the Palmer method of cursive writing, taught in many schools. People taught the same system might reasonably be expected to manifest some of the characteristics of that writing style. However, because schools now devote much less time to teaching handwriting as a skill, the more contemporary view is that “class” characteristics are not as readily identifiable as they once were.218 Group characteristics might include, for example, features common to writers with a medical condition affecting their handwriting.219

The conventional belief in individual characteristics unique to a particular person also persists among many examiners. That belief rests on two premises: (1) that no two writers share the same combination of writing characteristics; and (2) that adults have consistent writing habits.220 These beliefs remain unproven,221 but they are still assumed by many practitioners and still cited by some courts.222 The traditional process for handwriting identification involves measuring selected features in handwriting, determining whether and how they differ across specimens, and interpreting the significance of similarities and differences.223 Although this process can include measuring features, more often it involves an examination of relative measurements (an estimation of features in proportion to each other), including the size, spacing, and slant of features. Although there are differences among examiners, most follow these general procedures:

  1. analyzing the features of the questioned writing and known standards both macroscopically and microscopically;
  2. noting conspicuous features such as size, slant, and letter construction, as well as more subtle characteristics such as pen direction, the nature of connections between letters, and spacing between letters, words, and lines;
  3. comparing the observed features to determine similarities and dissimilarities;
  4. taking into account the degree of similarity or otherwise and the nature of the writing (quality, amount, and complexity), evaluating the evidence,

218. 2020 NIST Report, supra note 110, at 7–8.

219. See, e.g., Larry F. Stewart, The Process of Forensic Handwriting Examinations, 4 Forensic Res. & Criminology Int’l J. 139 (2017), https://doi.org/10.15406/frcij.2017.04.00126.

220. 2020 NIST Report, supra note 110, at 8 (citing Howard Sieden & Frank Norwitch, Questioned Documents, in Forensic Science: An Introduction to Scientific and Investigative Techniques (S.H. James et al. eds., 4th ed. 2014)).

221. For more, see Andrew Sulner, Critical Issues Affecting the Reliability and Admissibility of Handwriting Opinion Evidence—How They Have Been Addressed (or Not) Since the 2009 NAS Report, and How They Should Be Addressed Going Forward: A Document Examiner Tells All, 48 Seton Hall L. Rev. 631, 639 (2018) (“absolutist statements concerning uniqueness and intra-writer variability are as yet unproven, and likely unprovable”).

222. See, e.g., United States v. Mallory, 902 F.3d 584 (6th Cir. 2018); United States v. Foust, 989 F.3d 842, 846 (10th Cir.), cert. denied, 142 S. Ct. 294 (2021).

223. 2020 NIST Report, supra note 110, at 8.

     and arriving at an opinion regarding the writership of the questioned writing.224

Examiners believe that the various features they examine, such as letter formation and size, and the inter- and intra-word spacing, are more variable in the writing of different individuals than in the writing of the same individual, but today there is widespread acknowledgment that the statistical properties of such variabilities have not been rigorously studied.225 There is no universally accepted number of points of similarity required to conclude the writings came from the same source. Additionally, although there is an expected range of variability for a single writer, there are no specific standards or procedures for determining that range.226

After evaluation, the examiner “expresses an opinion indicating [their] subjective confidence in the process outcome.”227 According to the 2020 NIST Report, five opinions are possible: (1) Identification; (2) Probably did write; (3) Inconclusive; (4) Probably did not write; and (5) Elimination.228 In contrast, the Scientific Working Group for Forensic Document Examination (SWGDOC) gives examiners nine options that may be expressed, along with associated descriptions.229 For its part, the FBI laboratory uses five categories that “collapse” the nine possible opinions on the SWGDOC scale.230 As of 2021, DOJ requires examiners to offer one of the following five conclusions: (1) Source Identification (i.e., identified); (2) Support for a Common Source; (3) Inconclusive; (4) Support for Different Sources; and (5) Source Exclusion (i.e., excluded).231 DOJ has also limited the expressions of certainty that experts may use, consistent with other forensic feature comparison methods.232 In short, just as there is no consensus about the number of specific points of similarity needed to determine authorship, there is no nationwide consensus about the range of possible opinions.

224. Id. at 9.

225. Id. at 14.

226. Id.

227. Id. at 24.

228. Id.

229. See, e.g., 2009 NRC Report, supra note 37, at 166; 2020 NIST Report, supra note 110, at 25–26, listing: 1. Identification; 2. Strong probability; 3. Probable; 4. Indications; 5. No conclusion; 6. Indications did not; 7. Probably did not; 8. Strong probability did not; and 9. Elimination.

230. 2020 NIST Report, supra note 110, at 25, citing SWGDOC, Version 2013-2.

231. U.S. Dep’t of Justice, Uniform Language for Testimony and Reports for Forensic Document Examination 2 (2021), https://perma.cc/KMC4-6KG3.

232. Id. at 4 (“Therefore, an examiner shall not: assert that a ‘source identification’ or a ‘source exclusion’ conclusion is based on the ‘uniqueness’ of an item of evidence; use the terms ‘individualize’ or ‘individualization’ when describing a source conclusion; assert that two or more bodies of writing were prepared by the same writer to the exclusion of all other writers.”).

Scientific Assessments, Critiques, and Debates

The following discussion is focused on critiques of handwriting analysis, given that these will frame the litigation judges will face over admission of FDE testimony.

As is the case with other feature comparison methods, critiques of the foundational reliability of handwriting analysis have focused on its relative subjectivity, lack of empirical data underlying assumptions about rarity of characteristics, potential for contextual bias to affect examinations, and lack of accurate error rate estimates. The 2009 NRC Report concluded that the technique was useful but not yet scientifically valid for determining authorship:

The scientific basis for handwriting comparisons needs to be strengthened. Recent studies have increased our understanding of the individuality and consistency of handwriting . . . and suggest that there may be a scientific basis for handwriting comparison, at least in the absence of intentional obfuscation or forgery. Although there has been only limited research to quantify the reliability and replicability of the practices used by trained document examiners, the committee agrees that there may be some value in handwriting analysis.233

A long-time forensic document examiner agreed with this assessment, writing in a 2018 article that the method could be very reliable under the right conditions and with more research, but that “forensic handwriting analysis still lacks robust, ground truth studies that provide empirical support for the reliability of many of the tasks routinely performed.”234

The latest governmental evaluation of handwriting analysis, the 2020 NIST Report, focused its foundational reliability concerns on the relative subjectivity of the method, the resulting potential for contextual bias, and the need for empirical testing to determine the extent of both intra- and inter-examiner variability (that is, repeatability and reproducibility, two aspects of what is often called reliability, as compared to accuracy).235 To provide trustworthy estimates of repeatability and reproducibility, studies should “compare the performance within and between FDEs in their judgments on the same samples of handwriting against ground truth.”236 The report urges that multiple independent laboratories work together on these problems by using the same methods and materials.237 The NIST report ultimately urges that both “black box studies” (to calculate an accurate error rate estimate) and “white box studies” (to understand the steps examiners are taking, with an eye toward developing standards that then must be applied when shown to lead to accurate results) are needed, and

233. 2009 NRC Report, supra note 37, at 166–67.

234. Sulner, supra note 221, at 714.

235. 2020 NIST Report, supra note 110, at 52.

236. Id. at 53.

237. Id.

at least one such study has been performed since 2020 (when the NIST Human Factors report was released).238
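
The report’s distinction between repeatability (the same examiner, on the same samples, at different times) and reproducibility (different examiners on the same samples) can be made concrete with a toy computation. All decision labels and data below are invented for illustration; real studies score decisions on many pairs against ground truth:

```python
# Toy illustration of repeatability vs. reproducibility; all data invented.
# Each list holds one examiner's decisions ("ID" = identification,
# "INC" = inconclusive, "EXC" = exclusion) on the same five handwriting pairs.
examiner_a_round1 = ["ID", "EXC", "INC", "ID", "EXC"]
examiner_a_round2 = ["ID", "EXC", "ID",  "ID", "EXC"]  # same examiner, later occasion
examiner_b_round1 = ["ID", "INC", "ID",  "ID", "EXC"]  # a different examiner

def agreement(x, y):
    """Fraction of comparisons on which two decision lists agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

repeatability = agreement(examiner_a_round1, examiner_a_round2)    # within-examiner
reproducibility = agreement(examiner_a_round1, examiner_b_round1)  # between-examiner
print(repeatability, reproducibility)  # prints "0.8 0.6"
```

Note that both measures concern consistency of decisions, not their correctness; accuracy requires comparing decisions against ground truth, which is why the report calls for studies that supply all three.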

The 2020 NIST Report also expressed concern over the method’s reliability as applied, and in particular, the use by examiners of language that suggests source attribution or that otherwise goes beyond the scientifically supportable bounds of the method or overstates the evidence. The report cautioned, for example, that the expressions “uniqueness” and “individualization” do not have a settled meaning in forensic science, and their casual use can lead to misunderstandings in court and an “exaggeration of the strength of such evidence.”239 The report also concluded that “empirical research and statistical reasoning do not support source attribution to the exclusion of all others.” Instead it forcefully recommended that FDEs “must not report or testify, directly or by implication, that questioned handwriting has been written by an individual (to the exclusion of all others).”240 It likewise recommended that language about uniqueness and individualization should give way to a critical evaluation of the “rarity of the features,” on a continuum of rare to common. “A more contemporary view . . . individuality is defined with respect to the probability of observing writing profiles of two individuals that are indistinguishable using the specified comparison method.”241 Additionally, there is movement away from the conventional approach to handwriting analysis and toward a more empirical, neurobiological approach that would permit hypothesis generation and testing of relevant principles.242

Finally, with respect to reliability as applied, the 2020 NIST Report dedicated a full section to the several types of cognitive bias that could arise in FDE because of its layers of relatively high subjectivity.243 The report cautioned that such bias is a “legitimate cause for concern” that must be addressed by forensic

238. See, e.g., R. Austin Hicklin et al., Accuracy and Reliability of Forensic Handwriting Comparisons, 119 Proc. Nat’l Acad. Scis. e2119944119 (2022), https://doi.org/10.1073/pnas.2119944119.

239. 2020 NIST Report, supra note 110, at 47.

240. Id. DOJ’s ULTR for Forensic Document Examination continues to permit an examiner to offer a conclusion on “source identification,” but the examiner must not assert that “two or more bodies of writing were prepared by the same writer to the exclusion of all other writers.” U.S. Dep’t of Justice, Uniform Language for Testimony and Reports for Forensic Document Examination 3 (2021), https://perma.cc/KMC4-6KG3.

241. 2020 NIST Report, supra note 110, at 46 (citing Sargur Srihari et al., Individuality of Handwriting, 47 J. Forensic Scis. 856 (2002)).

242. See, e.g., Michael P. Caligiuri & Linton A. Mohammed, The Neuroscience of Handwriting 35–57 (2012).

243. 2020 NIST Report, supra note 110, at 30. The NIST report dedicates an entire section (2.1) to the multiple types of contextual bias in FDE. Id. at 30–44. See also D. Michael Risinger et al., The Daubert/Kumho Implications of Observer Effects in Forensic Science: Hidden Problems of Expectation and Suggestion, 90 Calif. L. Rev. 1 (2002), https://doi.org/10.2307/3481305; Adele Quigley-McBride et al., A Practical Tool for Information Management in Forensic Decisions: Using Linear Sequential Unmasking-Expanded (LSU-E) in Casework, 4 Forensic Sci. Int’l: Synergy 100216 (2022), https://doi.org/10.1016/j.fsisyn.2022.100216.

laboratories through standards, protocols like “contextual information management” designed to control information flow to and among examiners, and documentation of steps taken to avoid such bias.244 The report made several recommendations along these lines and urged research to study the problem.245 There are different suggested approaches to combating these problems, notably Dr. Itiel Dror’s hierarchy of expert performance (HEP) method.246

Another legal question that has arisen is whether FDE expert testimony is “helpful to the jury” for Rule 702 purposes. Forensic handwriting analysis is a unique discipline because handwriting is a fundamental aspect of everyday life, and thus, ordinary people are capable of comparing some aspects of handwriting samples.247 Laypersons are usually capable of identifying gross characteristics, but forensic handwriting analysis focuses on nuanced details that laypersons are often unable to correctly identify.248 A 2022 FBI-funded study found that adequately trained forensic handwriting experts are more accurate than those with minimal training and that examiners with less than two years of formal training had higher error rates and were more likely to make definitive, unqualified conclusions compared to those with at least two years of formal training.249 A 2018 study reached a similar conclusion when comparing novice and expert examiners, but found that the overall error rate among expert examiners was still high enough to question the trustworthiness of handwriting comparison evidence in the courts.250

244. 2020 NIST Report, supra note 110, at 34.

245. Id. at 35–36. For a detailed discussion of how the contextual information management (CIM) should work in FDE, see id. at 36–44.

246. See, e.g., Itiel Dror, A Hierarchy of Expert Performance, 5 J. Applied Res. Memory & Cognition 121 (2016), https://doi.org/10.1016/j.jarmac.2016.03.001 (creating a hierarchy of expert performance to identify weaknesses in an individual’s performance and compare experts across domains).

247. 2020 NIST Report, supra note 110, at viii.

248. See, e.g., State v. Cooke, 914 A.2d 1078, 1101 (Del. Super. Ct. 2007) (“Handwriting comparison is beyond a lay person’s general knowledge.”); Diane Harrison, Ted M. Burkes & Danielle P. Sieger, Handwriting Examination: Meeting the Challenges of Science and the Law, Forensic Sci. Commc’ns (Oct. 2009), https://perma.cc/J2Z2-W7D5.

249. R. Austin Hicklin et al., Accuracy and Reliability of Forensic Handwriting Comparisons, 119 Proc. Nat’l Acad. Scis. e2119944119 (2022), https://doi.org/10.1073/pnas.2119944119.

250. Kristy Martire et al., What Do the Experts Know? Calibration, Precision, and the Wisdom of Crowds Among Forensic Handwriting Experts, 25 Psych. Bull. Rev. 2346, 2353 (2018), https://doi.org/10.3758/s13423-018-1448-3:
On the one hand, there is some evidence that handwriting experts will be able to estimate the frequency of occurrence for handwriting features better than novices. However, even the single best performing participant produced an average deviation of 18.5% from the true value. On the other hand, this number is considerably lower than would be expected by chance (25%).

Some examiners incorporate handwriting pattern recognition technology to assist in their analysis.251 This technology can outperform laypersons252 and can be as accurate as an expert examiner, depending on the complexity of the task.253 These systems will only become more advanced as machine learning algorithms enhance their accuracy and reliability.254 Another area of research has been computer-driven forensics linguistics to identify authors.255 For instance, a Swiss company, OrphAnalytics SA, recently released a software package for conducting stylometric analysis to support the attribution of text to a particular author. While promising, these techniques have yet to be used forensically, and as with other forms of artificial intelligence, they too will have shortcomings. The 2020 NIST Report calls for FDEs to integrate these tools into their casework and “collaborate with the computer science and engineering communities to develop and validate applicable, user-friendly, automated systems.”256 Training programs requiring proficiency in the use of these systems, combined with interdisciplinary research and development, should provide courts with additional metrics to evaluate accuracy and reliability of handwriting comparisons in the future.
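
To illustrate the general idea behind such stylometric tools (and not the algorithm of any particular vendor or study cited above), a minimal authorship comparison might build character n-gram frequency profiles of two texts and score their similarity. The texts and the interpretation below are invented for illustration:

```python
# A generic sketch of stylometric comparison using character-trigram profiles
# and cosine similarity. This is NOT the method of any vendor or study cited
# in the text; the sample texts are invented.
from collections import Counter
from math import sqrt

def profile(text: str, n: int = 3) -> Counter:
    """Frequency profile of the character n-grams in a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p: Counter, q: Counter) -> float:
    """Cosine similarity of two n-gram profiles (1.0 = identical profile)."""
    dot = sum(p[g] * q[g] for g in p)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

questioned = "the payment must arrive before friday or the deal is off"
known = "payment must be made before friday; otherwise the deal is off"
score = cosine(profile(questioned), profile(known))
print(f"{score:.2f}")  # a value in [0, 1]; higher suggests more similar style
```

Real systems use far richer feature sets and validated decision thresholds; the point of the sketch is only that such tools quantify similarity rather than assert identity, which is why validation studies and error rates remain essential.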

Case Law Development

Although nineteenth-century courts were skeptical of handwriting expertise,257 the twentieth century saw a marked shift, as testimony in leading cases like the Lindbergh kidnapping helped the discipline gain judicial acceptance.258 There was little dispute that handwriting testimony was admissible at the time the Federal Rules of Evidence were enacted in 1975. Rule 901(b)(3) recognized that a document could be authenticated by an expert, and the drafters explicitly mentioned handwriting comparison through the “testimony of expert witnesses.”259

251. 2020 NIST Report, supra note 110, at 68–71.

252. Harrison et al., supra note 248 (citing Sargur Srihari et al., On the Discriminability of the Handwriting of Twins, 53 J. Forensic Scis. 430 (2008), https://doi.org/10.1111/j.1556-4029.2008.00682.x).

253. 2020 NIST Report, supra note 110, at 70–71 (citing Muhammad Imran Malik et al., Man vs. Machine: A Comparative Analysis for Signature Verification, 24 J. Forensic Document Examination 21 (2004)).

254. 2020 NIST Report, supra note 110, at 71.

255. See, e.g., Janet Ainsworth & Patrick Juola, Who Wrote This?: Modern Forensic Authorship Analysis as a Model for Valid Forensic Science, 96 Wash. U. L. Rev. 1159, 1169–72 (2019) (language has many “objectively identifiable features”); Patrick Juola, Verifying Authorship for Forensic Purposes: A Computational Protocol and its Validation, 325 Forensic Sci. Int’l 110824 (2021), https://doi.org/10.1016/j.forsciint.2021.110824 (claiming a measured accuracy of 77% across more than 32,000 different document pairs); Cami Fuglsby et al., Elucidating the Relationships Between Two Automated Handwriting Feature Quantification Systems for Multiple Pairwise Comparisons, 67 J. Forensic Scis. 642 (2022), https://doi.org/10.1111/1556-4029.14914.

256. 2020 NIST Report, supra note 110, at 72.

257. See Strother v. Lucas, 31 U.S. 763, 767 (1832); Phoenix Fire Ins. Co. v. Philip, 13 Wend. 81, 82–84 (N.Y. Sup. Ct. 1834).

258. See Risinger et al., Exorcism of Ignorance, supra note 210, at 771.

The first significant admissibility challenge under Daubert was in United States v. Starzecpyzel.260 In that case, the district court concluded that “forensic document examination, despite the existence of a certification program, professional journals and other trappings of science, cannot, after Daubert, be regarded as ‘scientific . . . knowledge.’”261 Nevertheless, the court did not exclude handwriting comparison testimony. Instead, the court admitted the individuation testimony as nonscientific “technical” evidence.262

Starzecpyzel prompted more litigation that questioned the lack of empirical validation in the field.263 For many years, there was a three-way split of authority. The majority of courts permitted examiners to express individuation opinions.264 As one court noted, “all six circuits that have addressed the admissibility of handwriting expert [testimony] . . . [have] determined that it can satisfy the reliability threshold” for nonscientific expertise.265 In 2021, the Tenth Circuit, in United States v. Foust, held that handwriting comparison was properly admitted, relying largely on the “widespread acceptance of handwriting comparison through the years” and citing the importance of the general acceptance factor to its decision.266 The Foust opinion explained that the proposed testimony fell short in the “standards, peer review, and error rate” factors, but upheld the trial judge’s decision to admit the evidence. Explaining that the Daubert factors were meant to be helpful and not definitive, the Tenth Circuit held that the trial court did not abuse its discretion in admitting the evidence. Echoing Starzecpyzel, the court noted “handwriting comparison is not a traditional science.”267 So, too, the Sixth Circuit in United States v. Mallory held that handwriting comparison was properly admitted, as it was based on “knowledge and experience to answer the extremely practical question of whether a signature is genuine or forged.”268 The court reasoned that since experts “see things in handwriting that laypeople do not—both because of analysts’ training in the minutiae of loops, swoops, and dotted ‘i’s, and because of the volume of handwriting they inspect,” such testimony is helpful to the jury. Like the Tenth Circuit in Foust, the Sixth Circuit recognized that “handwriting analysis may not boast the ‘empirical’ support underpinning scientific disciplines,” but ruled that it was properly admitted as a proper area of expertise based on specialized or technical expertise.269

259. Fed. R. Evid. 901(b)(3) advisory committee’s note.

260. 880 F. Supp. 1027 (S.D.N.Y. 1995).

261. Id. at 1038.

262. Id. at 1047.

263. See, e.g., United States v. Hidalgo, 229 F. Supp. 2d 961, 967 (D. Ariz. 2002):
Because the principle of uniqueness is without empirical support, we conclude that a document examiner will not be permitted to testify that the maker of a known document is the maker of the questioned document. Nor will a document examiner be able to testify as to identity in terms of probabilities.

264. See, e.g., United States v. Prime, 363 F.3d 1028, 1033 (9th Cir. 2004); United States v. Crisp, 324 F.3d 261, 265–71 (4th Cir. 2003); United States v. Jolivet, 224 F.3d 902, 906 (8th Cir. 2000) (affirming the introduction of expert testimony that it was likely that the accused wrote the questioned documents); United States v. Velasquez, 64 F.3d 844, 848–52 (3d Cir. 1995); United States v. Ruth, 42 M.J. 730, 732 (A. Ct. Crim. App. 1995), aff’d on other grounds, 46 M.J. 1 (C.A.A.F. 1997); United States v. Morris, No. 06-87-DCR, 2006 WL 2054585, at *2 (E.D. Ky. July 20, 2006); Orix Fin. Servs. v. Thunder Ridge Energy, Inc., No. 01 Civ. 4788, 2006 WL 587483 (S.D.N.Y. Mar. 8, 2006).

265. Prime, 363 F.3d at 1034.

266. United States v. Foust, 989 F.3d 842, 846 (10th Cir.), cert. denied, 142 S. Ct. 294 (2021). For more about courts’ reliance on a “long history of use” as justification to admit forensic evidence, see Jane Campbell Moriarty, Deceptively Simple: Framing, Intuition, and Judicial Gatekeeping of Forensic Feature-Comparison Methods Evidence, 86 Fordham L. Rev. 1687 (2018).

Both Foust and Mallory downplayed the concerns raised about foundational reliability, choosing to cite older cases predating the NRC and PCAST reports and to cast the evidence as technical or specialized rather than scientific. Both courts determined that because the specialty was not scientific, the traditionally applied Daubert factors were of less concern.270 In light of the 2023 amendments to Federal Rule of Evidence 702, as discussed earlier, courts may be required to more robustly address the concerns about foundational reliability that this specialty poses. These two cases also highlight potential questions courts may have about the need for proof of reliability as applied. Both courts appeared to rely on the fact that the experts were qualified and had years of experience. Indeed, many other courts have relied on experience and qualifications as proxies for reliability of FDE as applied.271 There are two potential concerns with this approach. First, some commentators have argued that experience alone, at least in relatively subjective disciplines, is not a substitute for proficiency testing in establishing expert qualification. Second, the fact that an expert is proficient in a method does not establish that the method was reliably applied in a given case.272 Reliability as applied could instead be established through internal validation, case file documentation, standards and protocols showing avoidance of contextual bias, the nature of the samples examined, and the examiner’s own testimony as to what they did and the conclusions they drew.

267. Foust, 989 F.3d at 846–47.

268. United States v. Mallory, 902 F.3d 584, 593 (6th Cir. 2018).

269. Id.

270. Id. at 593–94.

271. See also Giannelli et al., Scientific Evidence, supra note 209, § 1.03[2] (noting the tendency of some courts to elide qualifications with reliability inquiries, particularly with experiential-knowledge-based specialties).

272. Garrett & Mitchell, supra note 19, at 909 (“Experts should be qualified based on empirical evidence of their proficiency before addressing whether their methods used and conclusions reached are valid and reliable.”).

Other courts have been more amenable to hearing challenges to FDE. Although the excluded testimony at issue in United States v. Johnsted related to hand printing rather than handwriting, the court expressed doubts about the foundational reliability of the entire field of handwriting comparison as well:

This lack of testing has serious repercussions on a practical level: because the entire premise of interpersonal individuality and intrapersonal variations of handwriting remains untested in reliable, double blind studies, the task of distinguishing a minor intrapersonal variation from a significant interpersonal difference—which is necessary for making an identification or exclusion—cannot be said to rest on scientifically valid principles. The lack of testing also calls into question the reliability of analysts’ highly discretionary decisions as to whether some aspect of a questioned writing constitutes a difference or merely a variation; without any proof indicating that the distinction between the two is valid, those decisions do not appear based on a reliable methodology. With its underlying principles at best half-tested, handwriting analysis itself would appear to rest on a shaky foundation.273

In another federal district court case, Almeciga v. Center for Investigative Reporting, Inc.,274 the court excluded on Daubert grounds testimony about whether writing was real or disguised, opining that the field is relatively subjective and lacks adequate testing:

[T]here are no studies, to this Court’s knowledge, that have evaluated the extent to which the angle at which one writes or the curvature of one’s loops distinguish one person’s handwriting from the next. Precisely what degree of variation falls within or outside an expected range of natural variation in one’s handwriting—such that an examiner could distinguish in an objective way between variations that indicate different authorship and variations that do not—appears to be completely unknown and untested. Ditto the extent to which such a range is affected by the use of different writing instruments or the intentional disguise of one’s natural hand or the passage of time. Such things could be tested and studied, but they have not been; and this by itself renders the field unscientific in nature.275

273. United States v. Johnsted, 30 F. Supp. 3d 814, 818 (W.D. Wis. 2013). See also United States v. Fujii, 152 F. Supp. 2d 939, 940 (N.D. Ill. 2000) (holding expert testimony concerning Japanese handprinting inadmissible; “Handwriting analysis does not stand up well under the Daubert standards. Despite its long history of use and acceptance, validation studies supporting its reliability are few, and the few that exist have been criticized for methodological flaws.”); United States v. Lewis, 220 F. Supp. 2d 548 (S.D. W. Va. 2002); United States v. Saelee, 162 F. Supp. 2d 1097 (D. Alaska 2001); Almeciga v. Ctr. for Investigative Reporting, Inc., 185 F. Supp. 3d 401 (S.D.N.Y. 2016).

274. 185 F. Supp. 3d 401 (S.D.N.Y. 2016).

275. Id. at 419.


The court also relied on what it described as a dearth of peer review and added that existing studies are “not sufficiently robust or objective to lend credence to the proposition that handwriting comparison is a scientific discipline.”276 The court stated that the known error rate is poor and even worse when there is a question of a disguised signature.277 Finally, the court discussed the lack of standards as to how many similarities are needed for a “match” or how many exemplars must be considered. In the court’s view, many conclusions by questioned document examiners are highly subjective, “bottom line” decisions whether there is a “match.” Summing up, the court wrote: “It remains the case that the methodology has not been subject to adequate testing or peer review, that error rates for the task at hand are unacceptably high, and that the field sorely lacks internal controls and standards, and so forth.”278 The court did not rule categorically that all handwriting comparison is inadmissible but noted the importance of caution in admitting testimony from an FDE, particularly an opinion on authorship. More specifically, the court suggested that a trial judge “should not [admit questioned document testimony] without carefully evaluating whether the examiner has actual expertise in regard to the specific task at hand,” including proficiency testing.279 The court concluded that in this case, the evidence of the witness’s expertise was insufficient.280

Some district courts have endorsed a third view. These courts limit the reach of the examiner’s opinion, permitting expert testimony about similarities and dissimilarities between exemplars but not an ultimate conclusion that the defendant was the author (common authorship opinion) of the questioned document.281 The expert is allowed to testify about “the specific similarities and idiosyncrasies between the known writings and the questioned writings, as well as testimony regarding, for example, how frequently or infrequently in his experience, [the expert] has seen a particular idiosyncrasy.”282 As the justification for this limitation, these courts often state that the examiners’ claimed ability to individuate lacks “empirical support.”283

As explained at the end of the section titled “Fingerprint Evidence” above, the 2023 amendment to Rule 702 may require courts to engage in more robust gatekeeping to evaluate both the foundational reliability and the reliability as applied of handwriting comparison evidence. The new amendment to Rule 702 may require a reconsideration of prior decisions focusing solely on weight and not admissibility of such evidence. As the advisory committee’s notes clarify, “many courts have held that the critical questions of the sufficiency of an expert’s basis, and the application of the expert’s methodology, are questions of weight and not admissibility. These rulings are an incorrect application of Rules 702 and 104(a).”284 As the notes recommend, “[j]udicial gatekeeping is essential.” As explained earlier, several gatekeeping options are available to the courts, along with the option of providing jury instructions to aid jurors in evaluating the evidence. With the enactment of the 2023 amendment to Rule 702, the judge must be satisfied that the proponent of the evidence has met each of the rule’s requirements by a preponderance. If the judge is not satisfied, she should exclude those opinions. Additionally, the judge may appropriately limit any opinion to exclude overstatement about the conclusions.

276. Id. at 420–21.

277. Id. at 422. Studies “suggest that while forensic document examiners might have some arguable expertise in distinguishing an authentic signature from a close forgery, they do not appear to have much, if any, facility for associating an author’s natural handwriting with his or her disguised handwriting.”

278. Id. at 424. Accord Johnsted, 30 F. Supp. 3d 814.

279. Almeciga, 185 F. Supp. 3d at 424.

280. But see United States v. Pitts, No. 16-CR-550, 2018 WL 1116550 (E.D.N.Y. Feb. 26, 2018) (noting that many courts have admitted such evidence in the Second Circuit and distinguishing Almeciga as it involved a potentially forged signature, not just a comparison of known and unknown samples).

281. See, e.g., United States v. Oskowitz, 294 F. Supp. 2d 379, 384 (E.D.N.Y. 2003) (“Many other district courts have similarly permitted a handwriting expert to analyze a writing sample for the jury without permitting the expert to offer an opinion on the ultimate question of authorship.”); United States v. Rutherford, 104 F. Supp. 2d 1190, 1194 (D. Neb. 2000):
[T]he Court concludes that [the examiner]’s testimony meets the requirements of Rule 702 to the extent that he limits his testimony to identifying and explaining the similarities and dissimilarities between the known exemplars and the questioned documents. [The examiner] is precluded from rendering any ultimate conclusions on authorship of the questioned documents and is similarly precluded from testifying to the degree of confidence or certainty on which his opinions are based.
See also United States v. Hines, 55 F. Supp. 2d 62, 69 (D. Mass. 1999) (expert testimony concerning the general similarities and differences between a defendant’s handwriting exemplar and a stick-up note was admissible while the specific conclusion that the defendant was the author was not).

282. United States v. Van Wyk, 83 F. Supp. 2d 515, 524 (D.N.J. 2000).

283. United States v. Hidalgo, 229 F. Supp. 2d 961, 967 (D. Ariz. 2002).

284. Fed. R. Evid. 702 advisory committee’s note to 2023 amendment.

Firearms and Toolmark Analysis

Introduction

When ammunition travels through a firearm, the firearm leaves indentations or marks on the ammunition that arguably may be used to link it with a particular firearm or class of firearms. To that end, firearm and toolmark (sometimes referred to as FATM or FA/TM) examiners analyze fired (“spent”) ammunition found at crime scenes. They compare the marks on such ammunition either to the marks on other ammunition, to determine if both may have been fired from the same firearm or same category of firearm, or to a test fire from a particular suspected weapon, to determine if the ammunition might have been fired from that weapon or a similar class of weapon. A related technique is used to compare the marks left on a surface of interest (like at a crime scene) to the marks left by a known tool (like a screwdriver) to determine if the known tool might have left the mark in question.

For decades, firearms and toolmark identification evidence has been widely accepted by courts.285 As with other feature comparison evidence, it was accepted with little critical scrutiny of its reliability and accuracy. More recently, the technique has been increasingly challenged. As with many disciplines, it is least controversial when used for excluding firearms, rather than identifying an individual firearm—exclusion as opposed to inclusion.

Concerns about the accuracy of firearm/toolmark analysis when used for source attribution are not merely hypothetical. The National Registry of Exonerations documents wrongful convictions based on firearms and toolmark analysis.286 Perhaps most notable is the wrongful conviction of Anthony Ray Hinton, who spent thirty years on death row for a murder he did not commit based on a firearm misidentification, before the U.S. Supreme Court vacated his conviction.287 Firearms misidentifications have also resulted in crime lab audits and lost accreditations.288

285. See Gardner v. United States, 140 A.3d 1172, 1183 (D.C. 2016) (discussing how District of Columbia courts have “allowed the admission of expert testimony concerning ballistics comparison matching techniques” for “decades”) (citing Laney v. United States, 294 F. 412, 416 (D.C. Cir. 1923)).

286. See, e.g., National Registry of Exonerations, Patrick Pursley, https://perma.cc/Q4VG-E5JC (last updated 4/29/2023); People v. Pursley, 2018 IL App (2d) 170227-U (2018). See also Brandon Garrett, Siggers’ Firearms Exoneration, Duke L. Forensic Forum (Oct. 23, 2018), https://perma.cc/2HHS-HUYC (wrongful conviction of Darrell Siggers); Craig Cooley & Gabriel Oberfield, Increasing Forensic Evidence’s Reliability and Minimizing Wrongful Convictions: Applying Daubert Isn’t the Only Problem, 43 Tulsa L. Rev. 285, 337–38 (2007) (wrongful conviction of Charles Stielow); Simon A. Cole, Implicit Testing: Can Casework Validate Forensic Techniques?, 46 Jurimetrics J. 117, 126–27 (2006). The Association of Firearm and Tool Mark Examiners (AFTE) has its own journal to publish studies; for concessions from the AFTE journal, see Bruce Moran, A Report on the AFTE Theory of Identification & Range of Conclusions for Tool Mark Identification & Resulting Approaches to Casework, 34 AFTE J. 227 (2002) (“In the 1980s some striated toolmark misidentifications resulting from a poor understanding of toolmark criteria for identification were experienced. An increasing need to address problems of applying subjective criteria became apparent.”); Evan E. Hodge, Guarding Against Error, 20 AFTE J. 290 (1988) (noting that “most of us [firearms examiners] know someone who has committed serious error” and describing misidentification by another examiner of the wrong .45 caliber firearm because of cognitive bias and pressure from prosecutors); Lowell Bradford, Forensic Firearms Identification: Competence or Incompetence, 11(2) AFTE J. 12 (1979) (“[a]n appalling number of misidentifications have been found in the firearm identification field”).

287. See Hinton v. Alabama, 571 U.S. 263 (2014); National Registry of Exonerations, Anthony Ray Hinton, https://perma.cc/YJ3J-UQRU (last updated Aug. 1, 2017).

288. In 2008, the Michigan State Police Forensic Science Division conducted an audit of the Detroit Police Department’s firearms unit at the request of the Wayne County Prosecutor’s Office and the Detroit Police Department chief. The audit included a random reanalysis of 283 cases. In 10% of the reanalyzed cases, the firearms unit had made false identifications or false exclusions. Michigan State Police Forensic Science Division, Audit of the Detroit Police Department Forensic Services Laboratory Firearms Unit (2008); see also Nick Bunkley, Detroit Police Lab is Closed After Audit Finds Serious Errors in Many Cases, N.Y. Times, Sept. 25, 2008, https://www.nytimes.com/2008/09/26/us/26detroit.html. Similarly, a failed proficiency test by a firearms examiner in 2017 launched multiple audits of the D.C. Department of Forensic Sciences. The audits revealed that three firearms examiners had misidentified cartridge cases in ongoing cases. The D.C. Department of Forensic Sciences lost its accreditation in 2021, after which the laboratory fired all of its firearms personnel and set out to review every firearms case from the last decade. See Spencer S. Hsu & Keith L. Alexander, Forensic Errors Trigger Reviews of D.C. Crime Lab Ballistics Unit, Prosecutors Say, Wash. Post, Mar. 24, 2017, https://perma.cc/28FH-LWTT; Jack Moore, Sweeping Report Urges DC to Review Every Case Handled by Firearms, Fingerprint Units at Troubled Crime Lab, WTOP News, Dec. 14, 2021, https://perma.cc/GFN8-BYQK; Jack Moore, Officials Now Expect DC Crime Lab to Remain Sidelined Until Next Spring, WTOP News, Mar. 31, 2022, https://perma.cc/6LMQ-P8RM.

Basic Terms

Typically, three types of firearms—rifles, handguns, and shotguns—are encountered in criminal investigations. Rifles and handguns are classified according to their caliber. The caliber is the diameter of the bore of the firearm; the caliber is expressed in either hundredths or thousandths of an inch (e.g., .22, .45, .357 caliber) or millimeters (e.g., 7.62 mm). The two major types of handguns are revolvers289 and semiautomatic pistols. A major difference between the two is that when a semiautomatic pistol is fired, the cartridge case is automatically ejected and, if recovered at the crime scene, could help link the case to the firearm from which it was fired. In contrast, when a revolver is discharged the case is not ejected. In terms of ammunition, rifle and handgun cartridges consist of the projectile (bullet),290 case,291 propellant (powder), and primer. The primer contains a small amount of an explosive mixture, which detonates when struck by the firing pin. When the firing pin detonates the primer, an explosion occurs that ignites the propellant. The most common modern propellant is smokeless powder.

The barrels of modern rifles and handguns are rifled; that is, parallel spiral grooves are cut into the inner surface, or bore, of the barrel. The surfaces between the grooves are called lands. The lands and grooves twist in a direction: right twist or left twist. For each type of firearm produced, the manufacturer specifies the number of lands and grooves, the direction of twist, the angle of twist (pitch), the depth of the grooves, and the width of the lands and grooves. As a bullet passes through the bore, the lands and grooves force the bullet to rotate, giving it stability in flight and thus increased accuracy. Shotguns are “smooth-bore” firearms; that is, they do not have lands and grooves.

289. Revolvers have a cylindrical magazine that rotates behind the barrel. The cylinder typically holds five to nine cartridges, each within a separate chamber. When a revolver is fired, the cylinder rotates and the next chamber is aligned with the barrel. A single-action revolver requires the manual cocking of the hammer; in a double-action revolver the trigger cocks the hammer. See generally BulletPoints Project, Guns 101, https://perma.cc/6A43-DYDT (last visited Dec. 5, 2024) (state-funded firearms injury prevention site with basic information about firearms).

290. Bullets are generally composed of lead and small amounts of other elements (hardeners). They may be completely covered (jacketed) with another metal or partially covered (semi-jacketed). See id. (discussing types and basics of ammunition).

291. Cartridge cases are generally made of brass. Id. They are sometimes referred to as “casings” by courts and laypeople, but firearms examiners do not use that term.

The Firearms Comparison Method and Its Claims

Bullet identification involves a comparison of the evidence bullet and a test bullet fired from the firearm.292 The two bullets are examined by means of a comparison microscope, which permits a split-screen view of the two bullets and manipulation in order to attempt to align the striations (marks) on the two bullets.
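
Automated comparison systems approximate what the examiner does at the comparison microscope by treating each bullet's striations as a one-dimensional depth profile and sliding one profile along the other to find the offset of best agreement. The following is a toy sketch of that cross-correlation idea; the profile numbers and function names are invented for illustration, and this is not any validated forensic algorithm:

```python
def normalized_cross_correlation(a, b):
    """Pearson correlation of two equal-length signals (1.0 = perfect agreement)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = sum((x - mean_a) ** 2 for x in a) ** 0.5
    sd_b = sum((y - mean_b) ** 2 for y in b) ** 0.5
    return cov / (sd_a * sd_b) if sd_a and sd_b else 0.0

def best_alignment(profile_a, profile_b, max_lag=5):
    """Slide profile_b across profile_a; return (lag, correlation) at best agreement."""
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = profile_a[lag:], profile_b[:len(profile_b) - lag]
        else:
            a, b = profile_a[:lag], profile_b[-lag:]
        m = min(len(a), len(b))
        if m < 2:
            continue
        r = normalized_cross_correlation(a[:m], b[:m])
        if r > best[1]:
            best = (lag, r)
    return best

# Toy striation depth profiles: profile_b repeats profile_a's pattern shifted by
# two positions (leading values are filler), mimicking the same barrel marking
# two bullets recovered with different starting alignments.
profile_a = [0, 3, 7, 2, 9, 4, 1, 8, 5, 2, 6, 3]
profile_b = [1, 0, 0, 3, 7, 2, 9, 4, 1, 8, 5, 2]
lag, r = best_alignment(profile_a, profile_b)
```

Real systems (such as three-dimensional surface topography methods) work on measured surface data with statistical scoring, but the core step of finding the alignment that maximizes agreement is the same idea.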

The theory behind this comparative analysis is that barrels are machined during the manufacturing process, and imperfections in the tools used in the machining process are imprinted on the bore. The subsequent use of the firearm could add further individual imperfections. For example, mechanical action (erosion) caused by the friction of bullets passing through the bore of the firearm could produce accidental imperfections. Similarly, chemical action (corrosion) caused by moisture (rust), as well as primer and propellant chemicals, could produce other imperfections. However, the prevalence of certain imperfections remains unknown.

When a bullet is fired, microscopic striations are imprinted on the bullet surface as it passes through the bore of the firearm. These bullet markings are produced by the imperfections in the bore. Because these imperfections are randomly produced, examiners assume that they are unique to each firearm. Currently, there is no statistical basis for this assumption.

The condition of a firearm or evidence bullet may preclude a conclusion. For example, there may be insufficient marks on the bullet or, because of mutilation, an insufficient amount of the bullet may have been recovered. Likewise, if the bore of the firearm has changed significantly as a result of erosion or corrosion, a conclusion may be impossible. (Unlike fingerprints, firearms change over time.) In these situations, the examiner may render a “no conclusion” determination.

Firearms comparisons are made based on examinations of so-called class, subclass, and individual characteristics of bullets or cartridge cases. The class characteristics of a firearm result from design factors prior to manufacture. They include caliber and rifling specifications: (1) the land and groove diameters, (2) the direction of rifling (left or right twist), (3) the number of lands and grooves, (4) the width of the lands and grooves, and (5) the degree of the rifling twist. Generally, a .38-caliber bullet with six land and groove impressions and with a right twist could have been fired only from a firearm with these characteristics. Such a bullet could not have been fired from a .32-caliber firearm, or from a .38-caliber firearm with a different number of lands and grooves or a left twist. According to the Association of Firearm and Tool Mark Examiners (AFTE),293 subclass characteristics are discernible surface features that are more restrictive than class characteristics in that they are (1) “produced incidental to manufacture,” (2) “relate to a smaller group source (a subset to which they belong),” and (3) can arise from a source that changes over time.294 The AFTE states that “[c]aution should be exercised in distinguishing subclass characteristics from class characteristics.”295 Finally, according to the AFTE, imperfections in the machining process and subsequent use of a firearm can cause individual characteristics that are thought to be unique to a particular firearm. Experts have warned that “caution should be exercised in distinguishing subclass characteristics from individual characteristics.”296

292. Test bullets are obtained by firing a firearm into a recovery box or bullet trap, which is usually filled with cotton, or into a recovery tank, which is filled with water. See generally NIST, Forensic Science Program, Firearms and Toolmarks, https://perma.cc/9ATM-CZNX (last updated Dec. 1, 2024) (discussing basics of firearms testing).
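
The exclusion logic of class characteristics is essentially a field-by-field comparison: disagreement on any class characteristic excludes a candidate firearm, while agreement on all of them merely narrows the class and never individuates. A minimal sketch with hypothetical values (the class names and fields are illustrative assumptions, not a forensic standard):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClassCharacteristics:
    caliber: str        # e.g., ".38"
    lands_grooves: int  # number of land and groove impressions
    twist: str          # "right" or "left"

def cannot_be_excluded(evidence, candidate):
    """True if the candidate firearm is consistent with the evidence bullet on
    class characteristics alone. Agreement never identifies a specific firearm;
    disagreement on any field excludes the candidate."""
    return (evidence.caliber == candidate.caliber
            and evidence.lands_grooves == candidate.lands_grooves
            and evidence.twist == candidate.twist)

# A .38-caliber bullet with six land-and-groove impressions and a right twist
# could not have come from a .32, or from a .38 with a left twist.
evidence = ClassCharacteristics(".38", 6, "right")
same_class = ClassCharacteristics(".38", 6, "right")
wrong_caliber = ClassCharacteristics(".32", 6, "right")
wrong_twist = ClassCharacteristics(".38", 6, "left")
```

Note that `cannot_be_excluded` returning True only places the firearm in a (possibly very large) class; any claim beyond that rests on subclass and individual characteristics, with the reliability questions discussed in this section.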

In terms of examiners’ ultimate opinions, the FBI’s ULTR on firearms and toolmark examinations directs examiners to report one of three conclusions: (1) source identification; (2) source exclusion; or (3) inconclusive.297 AFTE, in contrast, advises examiners that a source conclusion can be made when “sufficient agreement” exists in the patterns of two sets of marks. Agreement is sufficient when it “exceeds the best agreement demonstrated between toolmarks known to have been produced by different tools and is consistent with agreement demonstrated by toolmarks known to have been produced by the same tool,” such that “the likelihood another tool could have made the mark is so remote as to be considered a practical impossibility.”298

Although a conclusion about what marks exist is in one sense based on relatively objective data (the striations on the bullet surface), the AFTE explains that

293. AFTE is the leading professional organization in the field. There is also the Organization of Scientific Area Committees for Forensic Science (OSAC) Firearms and Toolmarks Subcommittee, which promulgates standards and guidelines for examiners.

294. Theory of Identification, Association of Firearm and Toolmark Examiners, 30 AFTE J. 86, 88 (1998).

295. Id.

296. Robert M. Thompson, Program Manager, Forensic Data Systems, Office of Law Enforcement Standards, NIST, Firearm Identification in the Forensic Laboratory, Nat’l Dist. Att’ys Ass’n (2010), https://perma.cc/P37Y-ASNM.

297. See U.S. Dep’t of Justice, Uniform Language for Testimony and Reports for the Forensic Firearms/Toolmarks Discipline Pattern Examination (2023), https://perma.cc/4N57-TDAG.

298. Association of Firearm & Tool Mark Examiners, AFTE Theory of Identification, https://perma.cc/VTS2-HER3 (last visited Dec. 5, 2024).

the examiner’s ultimate conclusion as to whether a projectile was fired from a particular weapon is a largely subjective judgment. The AFTE describes the traditional pattern recognition methodology as “subjective in nature, founded on scientific principles and based on the examiner’s training and experience.”299 There are no objective criteria governing this determination: “Ultimately, unless other issues are involved, it remains for the examiner to determine for himself the modicum of proof necessary to arrive at a definitive opinion.”300

Another firearms comparison technique is cartridge case identification. Cartridge case identification is based on the same theory of random markings as bullet identification.301 As with barrels, the belief is that defects produced in manufacturing leave distinctive characteristics on the breech face, firing pin, chamber, extractor, and ejector. Later use of the firearm is believed to produce additional defects. When the trigger is pulled, the firing pin strikes the primer of the cartridge, causing the primer to detonate. This detonation ignites the propellant (powder). In the process of combustion, the powder is converted rapidly into gases. The pressure produced by this process propels the bullet from the weapon and also forces the “base” (or “head”) of the cartridge case backward against the breech face, imprinting breech face marks on the base of the cartridge case.302 In turn, cartridge case identification involves a comparison of the case recovered at the crime scene and a test case obtained from the firearm after it has been fired. Shotgun shell cases may be analyzed in this way, as well. As in bullet identification, the comparison microscope is used in the examination, and the examiner’s findings are, according to AFTE, “subjective in nature, based on one’s training and experience.”303

These ammunition identification techniques, like those in other disciplines, are increasingly automated. The current imaging system for identifying

299. Theory of Identification, AFTE, supra note 294, at 86.

300. Joseph L. Peterson et al., Crime Laboratory Proficiency Testing Research Program, Nat’l Inst. L. Enforcement & Crim. Just. 207 (1978), https://perma.cc/BDJ8-JRNP.

301. Gerald Burrard, The Identification of Firearms and Forensic Ballistics 107 (1962). However, bullet and cartridge case identifications differ in several respects. Because the bullet is traveling through the barrel at the time it is imprinted with the bore imperfections, these marks are “sliding” imprints, called striated marks. In contrast, the cartridge case receives “static” imprints, called impressed marks. Id. at 145.

302. Ammunition itself also has “class” characteristics, including markings on the head of the cartridge case, known as “head stamps,” as well as bullet characteristics, such as caliber, type (such as hollow-point), weight, and jacketing, which may link cartridge cases or expelled projectiles at a scene to others collected elsewhere (such as unexpended cartridges within a seized firearm or at a search location), by caliber, brand, or type. See, e.g., Defense Intelligence Agency, Small-Caliber Ammunition Identification Guide, Vol. 2 (1985), https://perma.cc/U4Q2-NXE9.

303. Eliot Springer, Toolmark Examinations—A Review of Its Development in the Literature, 40 J. Forensic Scis. 964, 966–67 (1995), https://doi.org/10.1520/JFS13864J.

bullets is the Integrated Ballistics Information System (IBIS).304 Automated systems screen bullets and cartridge cases in the database for possible “matches.”305 IBIS identifies a number of candidate bullets, and the examiner makes a final comparison with a microscope.306 The examiner has discretion to reject all the candidates.

The Toolmark Comparison Method and Its Claims

Toolmark comparisons rest on essentially the same theory as firearms identifications:307 tools, and the marks they leave on surfaces such as wood or metal, are assumed to have both class and individual characteristics.308 For toolmarks, “class” characteristics might include the size of the mark/tool or the type of tool. For example, it may be possible to identify a mark (impression) left on wood as having been produced by a hammer, a knife, or a screwdriver. “Individual” characteristics are thought to include, for example, accidental imperfections produced by the machining process and subsequent use. When the tool is used, these characteristics are sometimes imparted onto the surface of another object struck by the tool.

As with firearms identification testimony, toolmark identification testimony is based on the largely subjective judgment of the examiner, who determines whether a sufficient number of marks of sufficient similarity are present to permit an identification.309 There are no easily repeatable or reproducible steps involving objective criteria governing the determination of whether there is a “match.”

304. See Jan De Kinder et al., Reference Ballistic Imaging Database Performance, 140 Forensic Sci. Int’l 207 (2004), https://doi.org/10.1016/j.forsciint.2003.12.002; Ruprecht Nennstiel & Joachim Rahm, An Experience Report Regarding the Performance of the IBIS™ Correlator, 51 J. Forensic Scis. 24 (2006), https://doi.org/10.1111/j.1556-4029.2005.00003.x.

305. Richard E. Tontarski & Robert M. Thompson, Automated Firearms Evidence Comparison: A Forensic Tool for Firearms Identification—An Update, 43 J. Forensic Scis. 641, 641 (1998), https://doi.org/10.1520/JFS16196J.

306. Id.

307. See Springer, supra note 303, at 964:
The identification is based . . . on a series of scratches, depressions, and other marks which the tool leaves on the object it comes into contact with. The combination of these various marks ha[s] been termed toolmarks and the claim is that every instrument can impart a mark individual to itself.

308. While there are no studies assessing the phenomenon of subclass characteristics for tools, in principle there is no reason to believe that nongun tools would not also have subclass characteristics to the same extent as firearms.

309. See Springer, supra note 303, at 966–67 (“According to the Association of Firearms and Toolmarks Examiners’ Criteria for Identification Committee, interpretation of toolmark individualization and identification is still considered to be subjective in nature, based on one’s training and experience.”).

Scientific Assessments, Critiques, and Debates

The following discussion focuses on critiques of firearms analysis,310 given that these critiques will frame the litigation judges will face over admission of firearm comparison testimony.

Critics of firearm identification raise three primary challenges. First, they argue that it is premised on an empirically untested assumption: that every firearm leaves unique, individualized markings on spent ammunition, a direct trail back to the firearm in question. Without variability data to back up this assumption, critics argue that claims of source attribution are not yet sustainable. Second, they argue that the method of comparison itself is overly subjective, based largely on examiner judgment and experience and thus prone to cognitive bias. Third, critics argue that the method is insufficiently validated through well-designed studies as an accurate means of establishing whether ammunition was fired from a particular firearm. In particular, they argue that existing studies suffer serious design flaws that lead them to underestimate the error rate. Linking ammunition to a larger class of firearms is less controversial, though also sometimes challenged, as discussed below.

The Crime Laboratory Proficiency Testing Program, begun in 1978, was the first published test of the accuracy of firearms analysis.311 In one test, 5.3% of the participating laboratories misidentified firearms evidence, and in another test 13.6% erred. These tests involved bullet and cartridge case comparisons. The Project Advisory Committee considered these errors “particularly grave in nature” and concluded that they probably resulted from carelessness, inexperience, or inadequate supervision.312 A third test required the examination of two bullets and two cartridge cases to identify the “most probable weapon” from which each was fired. The error rate was 28.2%. In later tests,

[e]xaminers generally did very well in making the comparisons. For all fifteen tests combined, examiners made a total of 2106 [bullet and cartridge case] comparisons and provided responses which agreed with the manufacturer responses 88% of the time, disagreed in only 1.4% of responses, and reported inconclusive results in 10% of cases.313

310. This section focuses on firearms identification, although it notes where studies relate to toolmarks as well. The broader issues (i.e., subjectivity, lack of variability data, lack of well-designed validation studies) are likely to arise in toolmark litigation as well.

311. Peterson et al., supra note 300.

312. Id. at 207–08.

313. Joseph L. Peterson & Penelope N. Markham, Crime Laboratory Proficiency Testing Results, 1978–1991, Part II: Resolving Questions of Common Origin, 40 J. Forensic Scis. 1009, 1018 (1995), https://doi.org/10.1520/JFS13871J. The authors also stated:
The performance of laboratories in the firearms tests was comparable to that of the earlier LEAA study, although the rate of successful identifications actually was slightly lower—88% vs. 91%.

In this same report, toolmark identification had higher error rates than firearm identification.314

Citing this and other studies, the National Academy of Sciences in a 2008 Report on Ballistic Imaging concluded that “[t]he validity of the fundamental assumptions of uniqueness and reproducibility of firearms-related toolmarks has not yet been fully demonstrated.”315 Specifically, it found that “[m]ost of th[e existing firearms] studies are limited in scale and have been conducted by firearms examiners (and examiners in training) in state and local law enforcement laboratories as adjuncts to their regular casework.”316 As psychologists have since noted, the insular nature of these tests makes the error rate difficult to ascertain from the tests alone.317 In addition, noting that firearms examiners often testify to a definite identification, the 2008 report cautioned, “examiners tend to cast their assessments in bold absolutes, commonly asserting that a ‘match’ can be made ‘to the exclusion of all other firearms in the world.’ Such comments cloak an inherently subjective assessment of a ‘match’ with an extreme probability statement that has no firm grounding and unrealistically implies an error rate of zero.”318

The 2009 NRC Report likewise reviewed the existing validation studies, concluding that the technique can narrow down to a class of guns or tools, but was not yet sufficiently validated for determining that a particular gun or tool is the source of a mark or pattern:

Because not enough is known about the variabilities among individual tools and guns, we are not able to specify how many points of similarity are necessary for a given level of confidence in the result. Sufficient studies have not been done to understand the reliability and repeatability of the methods. The committee agrees that class characteristics are helpful in narrowing the pool of tools that may have left a distinctive mark. Individual patterns from manufacture or from wear might, in some cases, be distinctive enough to suggest one

Laboratories cut the rate of errant identifications by half (3% to 1.4%) but the rate of inconclusive responses doubled, from 5% to 10%. Id. at 1019.

314. Id. at 1025 (“Overall, laboratories performed not as well on the toolmark tests as they did on the firearms tests.”).

315. National Research Council, Ballistic Imaging 81 (2008), https://doi.org/10.17226/12162.

316. Id. at 70. See also William A. Tobin, H. David Sheets & Clifford Spiegelman, Absence of Statistical and Scientific Ethos: The Common Denominator in Deficient Forensic Practices, 4 Statistics & Pub. Pol’y 1, 1 (2017), https://doi.org/10.1080/2330443X.2016.1270175 (stating that the majority of firearms testing that has been done has been developed by firearms/toolmark examiners, an “insular communit[y] of nonscientist practitioners . . . who did not incorporate effective statistical methods”).

317. See Itiel E. Dror & Nicholas Scurich, (Mis)use of Scientific Measurements in Forensic Science, 2 Forensic Sci. Int’l: Synergy 333, 333 (2020), https://doi.org/10.1016/j.fsisyn.2020.08.006.

318. National Research Council, supra note 315, at 82.

particular source, but additional studies should be performed to make the process of individualization more precise and repeatable.319

The report also identified a “fundamental problem with toolmark and firearms analysis” as being “the lack of a precisely defined process” or “specific protocol,” which the report said led to a lack of “well-characterized confidence limits” in the examiners’ conclusions.320

The 2016 PCAST Report has offered the most extensive analysis to date about why existing studies might underestimate error rate. The report explained that all but two existing firearms studies are either within-set studies or closed-set studies. In within-set studies, examiners analyze bullets to determine which came from the same firearm. The comparisons are not independent of each other (e.g., once bullet A matches bullet B, then if bullet C matches bullet A it must also match bullet B), which the report explained might artificially reduce the false-positive rate. In closed-set studies, examiners are asked to link each projectile to a gun in the set; all the right answers are included (all source guns are present). The problem with a closed-set study, according to the PCAST report, is that it is easier than real casework, for the same reason a multiple-choice question is easier (the right answer is definitely one of the choices given).321

PCAST identified two existing firearms studies—the “Miami Dade study” and the “Ames study” (by the Ames Laboratory, a Department of Energy laboratory affiliated with Iowa State University)—that were neither within-set nor closed-set. The Miami Dade study was conducted by the Miami Dade police crime laboratory and involved a “partial open set,” in which at least some source guns were not among those present.322 Among the conclusive determinations made, four were false positives, resulting in a false positive rate of 1 in 48, with an upper bound (of the likely range of the true false positive rate) of 1 in 19. If one includes inconclusive determinations in the denominator (which PCAST argued was inappropriate), the false positive rate is lower.323 The Ames study was modeled after the FBI’s latent print study and involved true open sets, where the examinations were fully independent of each other and where there was no indication whether the source gun was present. In the Ames study, there were 22 false positives and 735 “inconclusives” out of 2,158 total “different source” examinations where the ground truth (that is, the correct call) was “exclusion.”

319. 2009 NRC Report, supra note 37, at 154.

320. Id. at 155.

321. 2016 PCAST Report, supra note 20, at 107–09. The report notes that even where examiners are not explicitly told the study is a closed set, examiners can easily intuit this from the study design and answer sheet. Id. at 108.

322. Id. at 109. In the Miami Dade study, 165 examiners “were asked to assign a collection of 15 questioned samples, fired from 10 pistols to a collection of known standards; two of the 15 questioned samples came from a gun for which known standards were not provided. For these two samples, there were 188 eliminations, 138 inconclusives and 4 false positives.” Id.

323. Id.

PCAST calculated the false positive rate from this information as being 22 out of 1,443 (the number of conclusive examinations), leading to an estimated error rate as high as 2.2% or 1 in 46.324
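
PCAST’s arithmetic here can be reproduced directly. The sketch below is an illustration of the calculation, not PCAST’s published code: it takes the 22 false positives among 1,443 conclusive different-source examinations and computes the point estimate along with an exact one-sided 95% upper confidence bound (Clopper-Pearson style), found by inverting the binomial distribution with only the standard library.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_bound(false_pos, n, alpha=0.05):
    """Exact one-sided (1 - alpha) upper confidence bound on a binomial
    proportion (Clopper-Pearson style), found by bisecting on p."""
    lo, hi = false_pos / n, 1.0
    for _ in range(60):  # bisect until the interval is negligibly small
        mid = (lo + hi) / 2
        if binom_cdf(false_pos, n, mid) > alpha:
            lo = mid  # CDF still above alpha: true bound lies higher
        else:
            hi = mid
    return hi

conclusive = 1443  # conclusive different-source examinations (Ames study)
false_pos = 22     # false positives among them

point = false_pos / conclusive
upper = upper_bound(false_pos, conclusive)
print(f"point estimate: {point:.4f}")   # roughly 1.5%
print(f"95% upper bound: {upper:.4f}")  # roughly 2.2%, i.e., about 1 in 46
```

The upper bound, rather than the point estimate, is the source of the “as high as 2.2% or 1 in 46” figure in the text.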

Ultimately, PCAST concluded that only the Ames study was sufficiently well designed to support a finding of foundational validity of the method, in that it involved designers who were independent of law enforcement (even though the participants themselves were self-selected)325 and involved at least a partial open set. PCAST concluded that because of design flaws, other studies “seriously underestimate the false positive rate.” With only one well-designed study to back it, firearms analysis in PCAST’s view “still falls short of the scientific criteria for foundational validity.”326 PCAST urged that, if judges admitted firearms examination results in court, the testimony be accompanied by estimates of the false positive rate, based on the number of false positives in the examiners’ total number of conclusive examinations.327 The National District Attorneys Association disagreed with this conclusion in its response to PCAST, arguing that “the PCAST members are neither forensic firearm scientists performing casework nor did they participate as examiners in these validation studies” and that PCAST’s criteria for a well-designed study were “arbitrarily defined.”328

As judges might glean from the discussion above, there is a continuing debate about how to deal with “inconclusive” determinations in calculating a false positive rate. There are two separate issues. The first issue is whether to calculate a false positive rate based on the total number of examinations (including those where the examiner’s opinion was “inconclusive”) or solely based on conclusive examinations. The 2016 PCAST Report urged the latter approach, offering a hypothetical: “[C]onsider an extreme case in which a method had been tested 1000 times and found to yield 990 inconclusive results, 10 false positives, and no correct results. It would be misleading to report that the false positive rate was 1 percent (10/1,000 examinations). Rather, one should report that 100 percent of the conclusive results were false positives (10/10 examinations).”329
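
PCAST’s hypothetical turns entirely on the choice of denominator, which can be made concrete with a few lines of arithmetic (an illustration of the hypothetical’s numbers only):

```python
examinations = 1000   # total examinations in PCAST's hypothetical
inconclusives = 990
false_positives = 10

# Denominator 1: all examinations, inconclusives included.
rate_all = false_positives / examinations        # 10/1000 = 0.01

# Denominator 2: conclusive examinations only (PCAST's preferred approach).
conclusive = examinations - inconclusives        # 10
rate_conclusive = false_positives / conclusive   # 10/10 = 1.0

print(f"{rate_all:.0%} of all examinations were false positives")
print(f"{rate_conclusive:.0%} of conclusive results were false positives")
```

The same underlying performance thus yields either “1 percent” or “100 percent,” depending solely on whether inconclusives sit in the denominator.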

The second issue is whether an “inconclusive” determination should be treated as an error when the ground truth is an “exclusion”—for example, where a bullet was definitively not fired from Gun A but the examiner fails to call this an exclusion. In one sense, this call is not the same as a false identification because the examiner has not stated that the bullet was fired from the gun. Moreover, in real casework, “inconclusive” might be the most accurate inference from the

324. Id. at 110.

325. Alan H. Dorfman & Richard Valliant, Inconclusives, Errors, and Error Rates in Forensic Firearms Analysis: Three Statistical Perspectives, 5 Forensic Sci. Int’l: Synergy 1, 5 (2022), https://doi.org/10.1016/j.fsisyn.2022.100273 (noting self-selecting volunteers for Ames study).

326. 2016 PCAST Report, supra note 20, at 111–12.

327. Id.

328. Nat’l Dist. Att’ys Ass’n, supra note 26, at 6.

329. 2016 PCAST Report, supra note 20, at 51–52.

evidence, given the conditions. In another sense, this call is erroneous in that it fails to exclude what is in fact an exclusion. This failure is especially concerning in a closed-set study where inconclusives are inherently a wrong response (because the source gun is always present), but is potentially concerning in nearly all validation studies, given that existing studies (unlike real casework) are designed to offer examiners sufficient information to correctly call exclusions when they are indeed exclusions.330 As of this writing, “[r]esearchers have yet to agree on whether and how inconclusives should be used when assessing examiner performance.”331 The Food and Drug Administration, for its part, directs researchers facing this issue to calculate two error rates: one treating all inconclusives as a “positive” (here, an identification) and one treating inconclusives as a “negative” (here, an exclusion).332 Judges should be aware of the debate and why “[i]t is clear that a well-founded characterization of inconclusives is critical for assessing the size of error rates estimated from the forensic firearms studies.”333
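
The FDA’s suggested approach can be illustrated with the Ames study’s different-source figures quoted above (22 false positives and 735 inconclusives among 2,158 examinations). The sketch below simply computes both bracketing rates; it illustrates the bookkeeping, not an endorsement of either treatment:

```python
different_source = 2158   # examinations where ground truth was "exclusion"
false_positives = 22      # conclusive "identification" calls
inconclusives = 735

# Rate 1: treat every inconclusive as a "positive" (an identification),
# so each one counts as an error against a ground truth of exclusion.
rate_as_positive = (false_positives + inconclusives) / different_source

# Rate 2: treat every inconclusive as a "negative" (an exclusion),
# so only outright false identifications count as errors.
rate_as_negative = false_positives / different_source

print(f"inconclusives as errors:     {rate_as_positive:.1%}")
print(f"inconclusives as exclusions: {rate_as_negative:.1%}")
```

The two figures (roughly 35% versus roughly 1%) bracket the plausible error rate, which is precisely why the treatment of inconclusives dominates the debate.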

Since the 2016 PCAST Report, several other “open-set” validation studies of firearms comparison have been published, some by the AFTE in its own journal334 and others in peer-reviewed journals like the Journal of Forensic Sciences.335 Some of these new studies have also been critiqued on grounds similar to

330. See generally Itiel E. Dror & Glenn Langenburg, “Cannot Decide”: The Fine Line Between Appropriate Inconclusive Determinations Versus Unjustifiably Deciding Not to Decide, 64 J. Forensic Scis. 10, 11 (2018), https://doi.org/10.1111/1556-4029.13854 (arguing that inconclusives should be treated as error where ground truth is exclusion); Maneka Sinha & Richard Gutierrez, Signal Detection Theory Fails to Account for Real-World Consequences of Inconclusive Decisions, 21 L., Probability & Risk 131 (2023), https://doi.org/10.1093/lpr/mgad001 (same); Dror & Scurich, supra note 317, at 334 (same); Hal Arkes & Jonathan Jay Koehler, Inconclusives and Error Rates in Forensic Science: A Signal Detection Theory Approach, 20 L., Probability & Risk 153 (2021), https://doi.org/10.1093/lpr/mgac005 (arguing that whether inconclusives are error depends on the context, given that “inconclusive” is not a state of nature); Dorfman & Valliant, supra note 325, at 6–7 (noting that inconclusives should at least sometimes be treated as errors); cf. Andrew Smith & Gary Wells, Telling Us Less than What They Know: Expert Inconclusive Reports Conceal Exculpatory Evidence in Forensic Cartridge-Case Comparisons, 13 J. Applied Res. Memory & Cognition 147 (2024), https://doi.org/10.1037/mac0000138.

331. Arkes & Koehler, supra note 330.

332. See U.S. Food & Drug Admin., Guidance for Industry and FDA Staff: Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests 18 (2007), https://www.fda.gov/media/71147/download.

333. Dorfman & Valliant, supra note 325, at 2.

334. See, e.g., Brandon A. Best & Elizabeth A. Gardner, An Assessment of the Foundational Validity of Firearms Identification Using Ten Consecutively Button-Rifled Barrels, 54 AFTE J. 28 (2022); Mark A. Keisler et al., Isolated Pairs Research Study, 50 AFTE J. 56 (2018).

335. See, e.g., Eric F. Law & Keith B. Morris, Evaluating Firearm Examiner Conclusion Variability Using Cartridge Case Reproductions, 66 J. Forensic Scis. 1704 (2021), https://doi.org/10.1111/1556-4029.14758; Keith L. Monson, Eric D. Smith & Eugene M. Peters, Accuracy of Comparison Decisions by Forensic Firearms Examiners, 68 J. Forensic Scis. 86 (2023), https://doi.org/10.1111/1556-4029.15152; Jaimie A. Smith, Beretta Barrel Fired Bullet Validation Study, 66 J. Forensic Scis. 547 (2021), https://doi.org/10.1111/1556-4029.14604. See also Max Guyll et al., Validity of Forensic

PCAST’s critiques of the Ames study,336 or because of their treatment of inconclusives337 or other issues related to “missingness” of data.338 Judges should be aware both that new studies are being published and of the recurring critiques of study design and error rate calculation that might be at the heart of the debate over such studies’ probative value in determining reliability.

Reliability as applied.

As with fingerprints and handwriting analysis, evidence of the reliability as applied of firearms and toolmark comparison will typically be a mix of proficiency testing (which is also related to expert qualification), internal validation of any software or equipment used, documentation of whether the examiner followed standards and protocols, documentation of steps taken to avoid contextual and other bias, and evidence about the condition of the particular sample being examined. As with other methods, a judge might find based on the evidence presented that firearms and toolmark analysis is reliable as applied to offering one type of conclusion or answering one type of question, but not another.

In terms of proficiency testing, one study cited by proponents of firearms comparison testimony suggests a low 1.4% error rate based on a 1978–1991 study of results of proficiency tests given by the main private forensic proficiency testing firm, Collaborative Testing Services (CTS).339 In a 2005 CTS cartridge case examination, none of the 255 test-takers nationwide answered incorrectly.340 As one district court judge has noted, “[o]ne could read these results to mean that

Cartridge-Case Comparisons, 120 Proc. Nat’l Acad. Scis. e2210428120 (2023), https://doi.org/10.1073/pnas.2210428120.

336. After the 2016 PCAST Report, the Ames Laboratory and FBI collaborated on a second study, known as Ames II, originally available online as Stanley J. Bajic et al., Ames Laboratory, U.S. Dep’t of Energy, Technical Report No. ISTR-5220, Report: Validation Study of the Accuracy, Repeatability, and Reproducibility of Firearms Comparison (Oct. 7, 2020). The published version of the study in the Journal of Forensic Sciences, however, includes only FBI authors. See Keith L. Monson et al., Repeatability and Reproducibility of Comparison Decisions by Firearms Examiners, 68 J. Forensic Scis. 1721 (2023), https://doi.org/10.1111/1556-4029.15318. The FBI authors now state, without further explanation: “We acknowledge the contracted research contributions of Ames Laboratory personnel, who have declined authorship and individual acknowledgment.” Id. at 1738.

337. See, e.g., Dror & Scurich, supra note 317, at 334 (discussing Keisler et al., supra note 334); Michael Rosenblum et al., Commentary on Guyll et al. (2023): Misuse of Statistical Method Results in Highly Biased Interpretation of Forensic Evidence 1 (Sept. 11, 2023), https://perma.cc/VB9J-DV2C (arguing that Guyll et al., supra note 335, “make a serious statistical error that leads to highly inflated claims about the probability that a cartridge case from a crime scene was fired from a reference gun”).

338. See, e.g., Khan & Carriquiry, supra note 102, at 4–5 (noting “missingness problems” affecting forensic error rate calculation, including self-selected participants, high rates of attrition, nonresponse, how inconclusive responses are used, low reproducibility (inter-examiner consistency) and repeatability (intra-examiner consistency)).

339. See, e.g., United States v. Monteiro, 407 F. Supp. 2d 351, 367 (D. Mass. 2006) (noting that government was citing and relying on Richard Grzybowski et al., Firearm/Toolmark Identification: Passing the Reliability Test Under Federal and State Evidentiary Standards, 35 AFTE J. 209, 213 (2003)).

340. Monteiro, 407 F. Supp. 2d at 367.

the technique is foolproof, but the results might instead indicate that the test was somewhat elementary.”341 Indeed, the CTS president has acknowledged that their customers prefer that tests are designed to be “easy.”342 Some laboratories, like the Houston Forensic Science Center, have moved to “blind” proficiency testing where examiners do not know they are being tested.343 This solution has also been suggested as a means of accounting for artificially inflated accuracy rates where inconclusives are treated as non-errors.344

Case Law Development

Before 2005, courts had routinely allowed experts to testify not only about class characteristics—whether a bullet could have been fired from a particular firearm, or a mark could have been made by a particular tool345—but also to make more categorical statements,346 except where experts were attempting particularly novel comparisons.347 In 2005, a federal district court judge ruled for the first time that an expert could not testify that recovered cases came from a

341. Id.

342. 2016 PCAST Report, supra note 20, at 57 n.133 (quoting CTS president).

343. Garrett & Mitchell, supra note 19, at 923.

344. See, e.g., Arkes & Koehler, supra note 330 (suggesting blind proficiency testing as a possible solution to the inconclusive problem); Dorfman & Valliant, supra note 325, at 6–7 (same).

345. E.g., People v. Horning, 102 P.3d 228, 236 (Cal. 2004) (expert “opined that both bullets and the casing could have been fired from the same gun . . . ; because of their condition he could not say for sure”); Luttrell v. Commonwealth, 952 S.W.2d 216, 218 (Ky. 1997) (expert “testified only that the bullets which killed the victim could have been fired from Luttrell’s gun”); United States v. Murphy, 996 F.2d 94, 99 (5th Cir. 1993) (allowing testimony of FBI expert “that the tools such as the screwdriver associated with Murphy ‘could’ have made the marks on the ignitions but that he could not positively attribute the marks to the tools identified with Murphy”).

346. See, e.g., United States v. Bowers, 534 F.2d 186, 193 (9th Cir. 1976) (concluding that toolmark identification “rests upon a scientific basis and is a reliable and generally accepted procedure”); United States v. Hicks, 389 F.3d 514, 526 (5th Cir. 2004) (ruling that “the matching of spent shell casings to the weapon that fired them has been a recognized method of ballistics testing in this circuit for decades”); United States v. Foster, 300 F. Supp. 2d 375, 377 n.1 (D. Md. 2004) (“Ballistics evidence has been accepted in criminal cases for many years. . . . In the years since Daubert, numerous cases have confirmed the reliability of ballistics identification.”); United States v. Santiago, 199 F. Supp. 2d 101, 111 (S.D.N.Y. 2002) (“The Court has not found a single case in this Circuit that would suggest that the entire field of ballistics identification is unreliable.”); Commonwealth v. Whitacre, 878 A.2d 96, 101 (Pa. Super. Ct. 2005) (“no abuse of discretion in the trial court’s decision to permit admission of the evidence regarding comparison of the two shell casings with the shotgun owned by Appellant”).

347. See, e.g., Ramirez v. State, 810 So. 2d 836, 849–51 (Fla. 2001) (rejecting testimony matching knife to a cartilage wound); Sexton v. State, 93 S.W.3d 96, 101 (Tex. Crim. App. 2002) (rejecting testimony matching collected fired cartridge cases to cases from unfired bullets found in appellant’s apartment based solely on magazine marks).

specific weapon “to the exclusion of every other firearm in the world.”348 Other courts similarly began to disallow statements suggesting certainty or definitive source attribution.349 Still other courts continued to allow statements such as a firearm identification made with “practical certainty.”350

Several courts in the past decade have, for the first time, more significantly limited firearm comparison testimony to statements about class characteristics or statements that a gun “cannot be excluded” as the source of a mark or

348. United States v. Green, 405 F. Supp. 2d 104 (D. Mass. 2005).

349. See, e.g., Williams v. United States, 210 A.3d 734, 742 (D.C. 2019) (concluding that “the empirical foundation does not currently exist to permit these examiners to opine with certainty that a specific bullet can be matched to a specific gun,” and that the trial court plainly erred in allowing the testimony); United States v. Diaz, No. CR 05-00167 WHA, 2007 WL 485967, at *1 (N.D. Cal. Feb. 12, 2007) (allowing examiner to testify that a match has been made to a “reasonable degree of certainty in the ballistics field” but not “to the exclusion of all other firearms in the world”); United States v. Glynn, 578 F. Supp. 2d 567, 568 (S.D.N.Y. 2008) (disallowing “reasonable degree of ballistic certainty” testimony but allowing examiner to say that gun was “more likely than not” the source); Gardner v. United States, 140 A.3d 1172, 1177 (D.C. 2016) (holding that “firearms and toolmark expert may not give an unqualified opinion, or testify with absolute or 100% certainty, that based on ballistics feature comparison matching a fatal shot was fired from one firearm, to the exclusion of all other firearms”); United States v. Willock, 696 F. Supp. 2d 536, 546, 549 (D. Md. 2010) (Holding, based on a comprehensive magistrate judge’s report, that examiner “Sgt. Ensor shall not opine that it is a ‘practical impossibility’ for a firearm to have fired the cartridges other than the common ‘unknown firearm’ to which Sgt. Ensor attributes the cartridges.” Thus, “Sgt. Ensor shall state his opinions and conclusions without any characterization as to the degree of certainty with which he holds them.”); United States v. Taylor, 663 F. Supp. 2d 1170, 1180 (D.N.M. 2009):
[B]ecause of the limitations on the reliability of firearms identification evidence discussed above, Mr. Nichols will not be permitted to testify that his methodology allows him to reach this conclusion as a matter of scientific certainty. Mr. Nichols also will not be allowed to testify that he can conclude that there is a match to the exclusion, either practical or absolute, of all other guns. He may only testify that, in his opinion, the bullet came from the suspect rifle to within a reasonable degree of certainty in the firearms examination field. See also United States v. Monteiro, 407 F. Supp. 2d 351, 374 (D. Mass. 2006) (allowing ballistics expert to testify to a “reasonable degree of certainty” but not a “statistical certainty”).

350. See, e.g., United States v. Williams, 506 F.3d 151, 161–62 (2d Cir. 2007) (allowing testimony of a ballistics “match” but cautioning that the opinion should not “be taken as saying that any proffered ballistic expert should be routinely admitted”); United States v. McCluskey, CR. No. 10-2734 JCH, 2013 WL 12335325, at *9–10 (D.N.M. Feb. 7, 2013) (allowing “practical certainty” but disallowing “reasonable degree of ballistic certainty” as a nonsensical nonscientific term); United States v. Natson, 469 F. Supp. 2d 1253, 1261 (M.D. Ga. 2007) (allowing statement of “100% degree of certainty”); Commonwealth v. Meeks, Nos. 2002-10961, 2003-10575, 2006 WL 2819423, at *50 (Mass. Super. Ct. Sept. 28, 2006) (allowing opinion of a “match”); United States v. Casey, 928 F. Supp. 2d 397, 399–400 (D.P.R. 2013) (finding that although “defendant challenges [the] conclusion that [the examiner] is 100% certain,” the court “remains faithful to the longstanding tradition of allowing the unfettered testimony of qualified ballistics experts”); State v. Davidson, 509 S.W.3d 156, 205 (Tenn. 2016) (“it’s like a fingerprint”); State v. Anderson, 624 S.E.2d 393, 397–98 (N.C. Ct. App. 2006) (no abuse of discretion in admitting bullet identification evidence).

pattern. In the often-cited case of United States v. Tibbs, a D.C. trial judge conducted a multiday evidentiary hearing, ultimately restricting the firearms testimony to a statement that the gun “cannot be excluded as the source of the cartridge case found on the scene,” rather than identification, because “[a]ny statement by the expert involving more certainty regarding the relationship between a casing and a firearm would stray into territory not presently supported by reliable principles and methods.”351 The court recognized the “threshold design issues” of existing firearms/toolmark studies, which “limit their utility.”352 A leading scientific evidence treatise calls Tibbs “a high water mark for genuinely engaging with the imperfect and limited state of the scientific research underlying firearms identification research,”353 and several other courts since 2019 have similarly limited firearms examiner testimony.354 Indeed,

351. See United States v. Tibbs, 2016-CF1-19431, 2019 WL 4359486, at *23 (D.C. Super. Ct. Sept. 5, 2019).

352. Id. at *14.

353. Modern Scientific Evidence, supra note 138, at § 34:5.

354. See, e.g., United States v. Briscoe, No. 20-CR-1777 MV, 2023 WL 8096886 (D.N.M. Nov. 21, 2023) (citing Tibbs and disallowing the phrases “consistent with,” “sufficient agreement,” and “match,” while allowing testimony about class characteristics); Abruquah v. State, 296 A.3d 961, 996–98 (Md. 2023) (allowing only testimony that gun was “consistent or inconsistent” with markings); United States v. Cloud, 576 F. Supp. 3d 827, 845 (E.D. Wash. 2021):
[I]f [examiner] intends to go beyond testimony that merely notes the recovered cartridge casings could not be excluded as having been fired from the recovered firearm, the Court will inform the jury that: (1) only three studies that meet the minimum design standard have attempted to measure the accuracy of firearm/toolmark comparison and (2) these studies found false positive rates that could be as high as 1 in 46 in one study, 1 in 200 in the second study, and 1 in 67 in the third study, though this study has yet to be published and subjected to peer review.

See also United States v. Shipp, 422 F. Supp. 3d 762, 783 (E.D.N.Y. 2019) (ruling that ballistics expert would be permitted to testify only that toolmarks on recovered bullet fragment and shell casing were consistent with having been fired from the recovered firearm); United States v. Adams, 444 F. Supp. 3d 1248, 1267 (D. Or. 2020) (limiting testimony to shared class characteristics only, such as caliber, type of firing pin, “barrel with six lands/grooves and right twist”); People v. Ross, 129 N.Y.S.3d 629, 642 (Sup. Ct. 2020) (limiting testimony to class characteristics only, and that the gun “cannot be ruled out” as the source, but disallowing phrases like “consistent with” or “sufficient agreement”); Illinois v. Rickey Winfield, 15 CR 14066-01 (Cook Cty. Cir. Ct. Feb. 8, 2023) (excluding ballistics testimony, concluding “[t]here are no objective forensic based reasons that firearms identification evidence belongs in any category of forensic science” and that “the junk evidence” “fails” the Frye test for admissibility (Frye v. United States, 293 F. 1013 (D.C. Cir. 1923))); State v. Ghigliotty, No. 17-0200154-1, slip op. at 39 (N.J. Super. Ct. Law Div. Aug. 23, 2019):
[T]his Court finds that the current state of firearms identification through the use of tool marks in the view of the scientific community does not support permitting [the expert] to frame his opinion that the identification of the evidence bullet in this case was “positive” or a “match” in terms of absolutely certainty.

Cf. United States v. Medley, No. PWG 17-242 (D. Md. Apr. 24, 2018) (pre-Tibbs opinion ruling that expert may not testify to a “match” opinion).

at least one court has cited Tibbs in rejecting a defense ballistics expert.355 At the same time, however, other courts continued to uphold admission.356

As explained at the end of the sections titled “Fingerprint Evidence” and “Handwriting Evidence,” the 2023 amendment to Rule 702 may require courts to engage in more robust gatekeeping to evaluate whether the proponent has proven by a preponderance of the evidence both the foundational reliability and the reliability as applied of firearm and toolmark comparison expert testimony.

Bitemark Evidence

Introduction

Forensic odontology is the field engaged in “examining, interpreting, and presenting dental and oral evidence for investigative and legal purposes.”357 Forensic bitemark examiners, a subset of forensic odontologists, purport to be able to identify a mark as a human bitemark and then attribute it to a particular person’s teeth.358 Other forensic odontologists focus only on human identification—that is, identifying people by their teeth.

Courts previously admitted bitemark analysis without much scrutiny of its reliability. In more recent years, scientists have overwhelmingly concluded that bitemark evidence is not reliable.359 As discussed below, research has

355. See United States v. Harris, 491 F. Supp. 3d 414, 424–25 (E.D. Wis. 2020) (rejecting on Daubert grounds proffered defense expert testimony that mark on car was not from a bullet, citing Tibbs).

356. See, e.g., United States v. Hunt, 63 F.4th 1229, 1249, n.18 (10th Cir. 2023) (allowing testimony based on CMS (consecutive matching striae) and noting that Tibbs did not deal with that method); United States v. Perry, 35 F.4th 293, 329–31 & n.23 (5th Cir. 2022) (holding that trial court’s admission of ballistics testimony under Daubert was not an abuse of discretion, notwithstanding the defendant’s arguments as to lack of peer review and standardization); United States v. Graham, No. 4:23-CR-00006, 2024 WL 688256, at *16 (W.D. Va. Feb. 20, 2024) (denying Daubert challenge to ballistics expert, but requiring expert to testify consistent with FBI’s new ULTR); United States v. Blackman, No. 18-CR-00728, 2023 WL 3440384, at *5 n.5, *9 (N.D. Ill. May 12, 2023) (allowing source attribution testimony so long as expert does not imply absolute certainty or “to the exclusion of all other” guns) (citing United States v. Green, 405 F. Supp. 2d 104, 124 (D. Mass. 2005)); United States v. Harris, 502 F. Supp. 3d 28, 33 (D.D.C. 2020) (denying Daubert challenge to ballistics expert, but requiring expert to testify consistent with FBI’s new ULTR).

357. NIST, Forensic Odontology Subcommittee, https://perma.cc/5L4U-HTG6 (last updated Nov. 3, 2023).

358. See E.H. Dinkel, Jr., The Use of Bite Mark Evidence as an Investigative Aid, 19 J. Forensic Scis. 535 (1974), https://doi.org/10.1520/JFS10208J.

359. NIST Bitemark Report, supra note 110. As the Texas Forensic Science Commission wrote, “if anyone should take responsibility for the current state of bite mark comparison, it is the very organization of practitioners that, due to its glacial pace, reticence to publish critical data, and willingness to allow overstatements of science to go unchecked for decades, is facing a barrage of well-founded criticism.” Tex. Forensic Sci. Comm’n, Forensic Bitemark Comparison Complaint Filed by National Innocence Project on behalf of Steven Mark Chaney—Final Report, 1, 17 (Apr. 12, 2016), https://www.txcourts.gov/media/1454500/finalbitemarkreport.pdf.

documented that (1) skin is an unstable medium upon which to expect consistent imprint results; (2) dental molds are frequently similar in nature and challenging to distinguish; and (3) bitemark experts themselves have increasing difficulty determining first whether a blemish is a bitemark, then whether it is a human or animal bitemark, and finally, whether the bitemark “matches” to a dental mold. The uniqueness of dentition has also never been proven. The OSAC Forensic Odontology Subcommittee itself makes clear on its website that it “does not develop standards on bitemark recognition, comparison, and identification.”360

Still, as of this writing, some prosecutors and defense attorneys continue to offer bitemark evidence, and some courts continue to admit it.

The Method and Its Claims
Human identification.

Forensic odontologists identify who a deceased person might be from human remains by comparing the decedent’s teeth to dental records of a known person. The human adult set of teeth (dentition) offers many points of comparison for these purposes. A set of teeth consists of thirty-two teeth, each with five anatomic surfaces. Restorations, with varying shapes, sizes, and restorative materials, may offer numerous additional points of comparison. Moreover, the number of teeth, prostheses, decay, malposition, malrotation, peculiar shapes, root canal therapy, bone patterns, bite relationship, and oral pathology may also provide points of comparison.361

Although the assumption that every person’s full dentition is unique remains empirically unproven,362 “the identification of human remains by their dental characteristics is” nonetheless “well established.”363 The reliability of human identification through dental records stems from the finite number of candidates to identify and the availability of full dentition on both remains and records.364

360. NIST, supra note 357.

361. The identification is made by comparing the decedent’s teeth with antemortem dental records. See generally Javier Ata-Ali & Fadi Ata-Ali, Forensic Dentistry in Human Identification: A Review of the Literature, 6 J. Clin. & Exp. Dent. e162 (2014), https://doi.org/10.4317/jced.51387 (explaining points of comparison, process, and types of records relied on).

362. See 2009 NRC Report, supra note 37, at 175 (“The uniqueness of the human dentition has not been scientifically established.”); NIST Bitemark Report, supra note 110, at i (finding a “lack of support” for this “key premise”).

363. 2009 NRC Report, supra note 37, at 173. See also Satomi Mizuno et al., Validity of Dental Findings for Identification by Postmortem Computed Tomography, 341 Forensic Sci Int’l 111507 (2022), https://doi.org/10.1016/j.forsciint.2022.111507.

364. See Iain A. Pretty & David J. Sweet, The Scientific Basis for Human Bitemark Analyses—A Critical Review, 41 Sci. & Just. 85, 88 (2001), https://doi.org/10.1016/S1355-0306(01)71859-X (“A distinction must be drawn from the ability of a forensic dentist to identify an individual from their dentition by using radiographs and dental records and the science of bitemark analysis.”).

Bitemark identification.

Bitemark identification, by contrast, relies on impressions and bruising of the skin, which are compared to a suspect’s teeth. Unlike human remains identification, the number of potential suspects may be large, the examiner may have only a small number of impressions to work with, the marks may change over time, and the medium on which the marks are observed (like skin) may be unstable, allowing shrinkage and distortion.365 Moreover, not all five anatomic surfaces are engaged in biting; only the edges of the front teeth come into play. In sum, bitemark identification depends not only on the uniqueness of each person’s dentition but also on “whether there is a [sufficient] representation of that uniqueness in the mark found on the skin or other inanimate object.”366 This uniqueness has yet to be proven.367

Of the several methods of bitemark analysis that have been reported, all involve three steps: (1) registration of both the bitemark and the suspect’s dentition, (2) comparison of the dentition and bitemark, and (3) evaluation of the points of similarity or dissimilarity. The comparison may be either direct or indirect. A model of the suspect’s teeth is used in direct comparisons; the model is compared to life-size photographs of the bitemark. Transparent overlays made from the model are used in indirect comparisons.368 The ultimate opinion of a bitemark examiner regarding individuation is a relatively subjective one.369 For example, there is no accepted minimum number of points of identity required
365. See 2009 NRC Report, supra note 37, at 174–76 (noting these limitations).

366. Raymond D. Rawson et al., Statistical Evidence for the Individuality of the Human Dentition, 29 J. Forensic Scis. 245, 252 (1984), https://doi.org/10.1520/JFS11656J.

367. Mary A. Bush et al., Statistical Evidence for the Similarity of the Human Dentition, 56 J. Forensic Scis. 118, 118 (2010), https://doi.org/10.1111/j.1556-4029.2010.01531.x (observing significant correlations and nonuniform distributions of tooth positions as well as “matches” between dentitions and concluding that “statements of dental uniqueness with respect to bitemark analysis in an open population are unsupportable and that use of the product rule is inappropriate”); Mary A. Bush et al., Similarity and Match Rates of the Human Dentition in Three Dimensions: Relevance to Bitemark Analysis, 125 Int’l J. Legal Med. 779, 779 (2011), https://doi.org/10.1007/s00414-010-0507-8 (explaining that three dimensional models reduce but do not limit random matches and “a zero match rate cannot be claimed for the population studied”); H. David Sheets et al., Dental Shape Match Rates in Selected and Orthodontically Treated Populations in New York State: A Two-Dimensional Study, 56 J. Forensic Sci. 621, 621 (2011), https://doi.org/10.1111/j.1556-4029.2011.01731.x (finding random dental shape “matches” and concluding that “statements of certainty concerning individualization in such populations should be approached with caution”).

368. See David J. Sweet, Human Bitemarks: Examination, Recovery, and Analysis, in Manual of Forensic Odontology 162 (American Society of Forensic Odontology, 3d ed. 1997) (“The analytical protocol for bitemark comparison is made up of two broad categories. Firstly, the measurement of specific traits and features called a metric analysis, and secondly, the physical matching or comparison of the configuration and pattern of the injury called a pattern association.”).

369. See Roland F. Kouble & Geoffrey T. Craig, A Comparison Between Direct and Indirect Methods Available for Human Bite Mark Analysis, 49 J. Forensic Scis. 111, 111 (2004), https://doi.org/10.1520/JFS2001252 (“It is important to remember that computer-generated overlays still retain an element of subjectivity, as the selection of the biting edge profiles is reliant on the operator placing the ‘magic wand’ onto the areas to be highlighted within the digitized image.”).

for a positive identification,370 and experts have testified to a wide range of points of similarity, from a low of eight points to a high of fifty-two.371

In an attempt to develop a more objective method, in 1984 the American Board of Forensic Odontology (ABFO) promulgated guidelines for bitemark analysis, including a uniform scoring system.372 According to the drafting committee, “[t]he scoring system . . . has demonstrated a method of evaluation that produced a high degree of reliability among observers.”373 Moreover, the committee characterized “[t]he scoring guide . . . [as] the beginning of a truly scientific approach to bite mark analysis.”374 In a subsequent letter, however, the drafting committee wrote:

While the Board’s published guidelines suggest use of the scoring system, the authors’ present recommendation is that all odontologists await the results of further research before relying on precise point counts in evidentiary proceedings. . . . [T]he authors believe that further research is needed regarding the quantification of bite mark evidence before precise point counts can be relied upon in court proceedings.375

Ultimately, the ABFO’s effort to develop a standardized bitemark methodology “failed . . . due to inter examiner discord and unreliable quantitative interpretation.”376 The ABFO has conceded that bitemark evidence cannot be used for individualization in “open population” cases, where there is an unknown number of potential suspects.377

370. See Stubbs v. State, 845 So. 2d 656, 669 (Miss. 2003) (“There is little consensus in the scientific community on the number of points which must match before any positive identification can be announced.”). Leigh Stubbs was exonerated in 2012. See Valena Beety, Manifesting Justice: Wrongly Convicted Women Reclaim Their Rights (2022).

371. E.g., State v. Garrison, 585 P.2d 563, 566 (Ariz. 1978) (10 points); People v. Slone, 143 Cal. Rptr. 61, 67 (Ct. App. 1978) (10 points); People v. Milone, 356 N.E.2d 1350, 1356 (Ill. App. Ct. 1976) (29 points); State v. Sager, 600 S.W.2d 541, 564 (Mo. Ct. App. 1980) (52 points); State v. Green, 290 S.E.2d 625, 630 (N.C. 1982) (14 points); State v. Temple, 273 S.E.2d 273, 279 (N.C. 1981) (8 points); Kennedy v. State, 640 P.2d 971, 976 (Okla. Crim. App. 1982) (40 points); State v. Jones, 259 S.E.2d 120, 125 (S.C. 1979) (37 points).

372. Am. Board of Forensic Odontology (ABFO), Guidelines for Bite Mark Analysis, 112 J. Am. Dental Ass’n 383 (1986).

373. Raymond D. Rawson et al., Reliability of the Scoring System of the American Board of Forensic Odontology for Human Bite Marks, 31 J. Forensic Scis. 1235, 1256 (1986), https://doi.org/10.1520/JFS11903J.

374. Id. at 1259.

375. Gerald L. Vale et al., Letters to the Editor: Discussion of “Reliability of the Scoring System of the American Board of Forensic Odontology for Human Bite Marks,” 33 J. Forensic Scis. 20 (1988).

376. C. Michael Bowers, Problem-Based Analysis of Bitemark Misidentifications: The Role of DNA, 159S Forensic Sci. Int’l S104, S106 (2006), https://doi.org/10.1016/j.forsciint.2006.02.032.

377. Am. Bd. of Forensic Odontology, Inc., Diplomates Reference Manual 102 (2015) (explaining that “[t]he ABFO does not support a conclusion of ‘The Biter’ in an open population case(s)”).

Scientific Assessments, Critiques, and Debates

Bitemark analysis is rarely used in the United States because of serious concerns about its reliability, in part inspired by high-profile DNA exonerations involving people convicted based on bitemark testimony. For example, in State v. Krone,378 two experienced experts concluded that the defendant had made the bitemark found on a murder victim. The defendant, however, was later exonerated through DNA testing.379 In Otero v. Warnick,380 a forensic dentist testified that the

plaintiff was the only person in the world who could have inflicted the bite marks on [the murder victim’s] body. On January 30, 1995, the Detroit Police Crime Laboratory released a supplemental report that concluded that plaintiff was excluded as a possible source of DNA obtained from vaginal and rectal swabs taken from [the victim’s] body.381

In Burke v. Town of Walpole,382 the expert concluded that “Burke’s teeth matched the bite mark on the victim’s left breast to a ‘reasonable degree of scientific certainty.’ That same morning . . . DNA analysis showed that Burke was excluded as the source of male DNA found in the bite mark on the victim’s left breast.”383 These are but a few of the exonerations involving false bitemark evidence.

The 2009 NRC Report concluded that neither “[t]he uniqueness of human dentition,” “[t]he ability of the dentition, if unique, to transfer a unique pattern to human skin and the ability of the skin to maintain that uniqueness,” nor any “standard for the type, quality, and number of individual characteristics required to indicate that a bite mark has reached a threshold of evidentiary value” has been scientifically established.384 Moreover, as the report noted, “[t]here is no science on the reproducibility of the different methods of analysis that lead to conclusions about the probability of a match. This includes reproducibility between experts and with the same expert over time. Even when using the

378. 897 P.2d 621, 622, 623 (Ariz. 1995) (“The bite marks were crucial to the State’s case because there was very little other evidence to suggest Krone’s guilt.”; “Another State dental expert, Dr. John Piakis, also said that Krone made the bite marks . . . Dr. Rawson himself said that Krone made the bite marks. . . .”).

379. See Mark Hansen, The Uncertain Science of Evidence, A.B.A. J., July 28, 2005, at 49, https://perma.cc/G674-XKPB (discussing Krone).

380. 614 N.W.2d 177 (Mich. Ct. App. 2000).

381. Id. at 178.

382. 405 F.3d 66 (1st Cir. 2005).

383. Id. at 73. See also Bowers, supra note 376, at S106–07 (citing several cases involving bitemarks and DNA exonerations); Mark Hansen, Out of the Blue, A.B.A. J., Feb. 1, 1996, at 50, 51, https://perma.cc/GV32-SAC4 (DNA analysis of skin taken from fingernail scrapings of the victim conclusively excluded suspect).

384. 2009 NRC Report, supra note 37, at 175–76. See also id. at 176 (“Although the majority of forensic odontologists are satisfied that bite marks can demonstrate sufficient detail for positive identification, no scientific studies support this assessment, and no large population studies have been conducted.”).

[American Board of Forensic Odontology] guidelines, different experts provide widely differing results and a high percentage of false positive matches of bite marks using controlled comparison studies.”385 As a result, as early as 2009, there was already “continuing dispute over the value and scientific validity of comparing and identifying bite marks.”386 Indeed, because of the dispute, some odontologists well before the 2009 NRC Report had argued that “bitemark evidence should only be used to exclude a suspect.”387

In 2015, the ABFO conducted an internal study titled Construct Validity of Bitemark Assessments Using the ABFO Bitemark Decision Tree.388 The experiment presented 100 injury case studies to 39 ABFO board-certified forensic odontologists to determine whether each injury was a human bitemark and, if so, whether it had distinct identifiable arches and individual toothmarks.389 The thirty-nine odontologists who completed the survey “came to unanimous agreement on just 4 of the 100 case studies. Of the initial 100, there remained just 8 case studies in which at least 90 percent of the analysts were still in agreement.”390

The Texas Forensic Science Commission, in a 2016 report, concluded that “there is no scientific basis for stating that a particular patterned injury can be associated to an individual’s dentition” and “[a]ny testimony describing human dentition as ‘like a fingerprint’ or incorporating similar analogies lacks scientific support.”391 It ultimately urged that analyst testimony regarding the probability or weight of any association between a bitemark and an individual’s dentition has “no place in our criminal justice system because they lack any credible supporting data.”392

The 2016 PCAST Report likewise concluded that “available scientific evidence strongly suggests that examiners not only cannot identify the source of bitemark with reasonable accuracy, they cannot even consistently agree on whether an injury is a human bitemark.”393

Most recently, in 2023, NIST released a report, Bitemark Analysis: A NIST Scientific Foundation Review. The report found

forensic bitemark analysis lacks a sufficient scientific foundation because the three key premises of the field are not supported by the data. First, human

385. Id. at 174.

386. Id. at 173.

387. Iain A. Pretty, A Web-Based Survey of Odontologist’s [sic] Opinions Concerning Bitemark Analyses, 48 J. Forensic Scis. 1117, 1120 (2003), https://doi.org/10.1520/JFS2003017.

388. Adam J. Freeman & Iain A. Pretty, Construct Validity of Bitemark Assessments Using the ABFO Decision Tree (2015), https://perma.cc/D7FZ-JAJX.

389. Radley Balko, A Bite Mark Matching Advocacy Group Just Conducted a Study that Discredits Bite Mark Evidence, Wash. Post, Apr. 8, 2015, https://perma.cc/S2DL-ZREX.

390. Id.

391. Tex. Forensic Sci. Comm’n, supra note 359, at 11–12.

392. Id. at 12.

393. 2016 PCAST Report, supra note 20, at 9 (emphasis in original).

anterior dental patterns have not been shown to be unique at the individual level. Second, those patterns are not accurately transferred to human skin consistently. Third, it has not been shown that defining characteristics of those patterns can be accurately analyzed to exclude or not exclude individuals as the source of a bitemark.394

Case Law Development

With respect to the use of dentition to identify human remains, courts have readily accepted the method as a means of establishing the identity of a homicide victim,395 with some cases dating back to the nineteenth century.396

With respect to bitemarks, People v. Marx397 emerged as the seminal case. The case “came to be read as a global warrant to admit bite mark identification evidence whenever a person displaying apparent credentials chose to testify to an identification.”398 The cases that closely followed and relied on Marx also went to great lengths to extol the “superior trustworthiness of the scientific bite mark approach.”399 After Marx, bitemark evidence became widely accepted.400 By 1992, it had been introduced or noted in 193 reported cases and accepted as

394. NIST Bitemark Report, supra note 110, at 24.

395. E.g., Wooley v. People, 367 P.2d 903, 905 (Colo. 1961) (dentist compared his patient’s record with dentition of a corpse); Martin v. State, 636 N.E.2d 1268, 1272 (Ind. Ct. App. 1994) (dentist qualified to compare X-rays of one of his patients with skeletal remains of murder victim and make a positive identification); Fields v. State, 322 P.2d 431, 446 (Okla. Crim. App. 1958) (murder case in which victim was burned beyond recognition).

396. See Commonwealth v. Webster, 59 Mass. 295, 299–300 (1850) (remains of the incinerated victim, including charred teeth and parts of a denture, were identified by the victim’s dentist); Lindsay v. People, 63 N.Y. 143, 145–46 (1875).

397. 126 Cal. Rptr. 350 (Cal. Ct. App. 1975). The court in Marx avoided applying the Frye test, which requires acceptance of a novel technique by the scientific community as a prerequisite to admissibility. See Frye v. United States, 293 F. 1013 (D.C. Cir. 1923). According to the Marx court, the Frye test “finds its rational basis in the degree to which the trier of fact must accept, on faith, scientific hypotheses not capable of proof or disproof in court and not even generally accepted outside the courtroom.” 126 Cal. Rptr. at 355–56.

398. D. Michael Risinger, Navigating Expert Reliability: Are Criminal Standards of Certainty Being Left on the Dock?, 64 Alb. L. Rev. 99, 138 (2000).

399. People v. Slone, 143 Cal. Rptr. 61, 69 (Ct. App. 1978); see also State v. Sager, 600 S.W.2d 541, 569 (Mo. Ct. App. 1980) (characterizing bitemark evidence as “an exact science”).

400. Two Australian cases, however, excluded bitemark evidence. See Lewis v The Queen (1987), 29 A Crim R 267 (odontological evidence was improperly relied on, in that this method has not been scientifically accepted); R v Carroll (1985), 19 A Crim R 410 (“[T]he evidence given by the three odontologist is such that it would be unsafe or dangerous to allow a verdict based upon it to stand.”).

admissible in 35 states.401 Some courts described bitemark comparison as an “exact science,”402 and several cases took judicial notice of its validity.403 The first state supreme court to take judicial notice of the “general acceptance” of bitemark evidence was West Virginia in State v. Armstrong, prompting other state supreme courts to rule similarly.404 State v. Armstrong, however, relied on the findings in another bitemark identification case, that of Robert Lee Stinson in Wisconsin.405 Stinson was ultimately exonerated by DNA evidence in 2009.406 Hence, even the bitemark identification in the very case that undergirded the national “general acceptance” of bitemark evidence was wrong.

Because law enforcement around the country has so significantly curtailed its reliance on bitemark evidence, there is little recent case law on the subject. Indeed, serious court evaluation of bitemarks has still been largely confined to civil postconviction habeas corpus cases and 42 U.S.C. § 1983 lawsuits for wrongful conviction and presentation of false evidence at trial.407 Still, some courts have continued to admit the evidence.408

401. Steven Weigler, Bite Mark Evidence: Forensic Odontology and the Law, 2 Health Matrix: J.L.-Med. 303, 303 (1992).

402. See People v. Marsh, 441 N.W.2d 33, 35 (Mich. Ct. App. 1989) (“the science of bite mark analysis has been extensively reviewed in other jurisdictions”); Sager, 600 S.W.2d at 569 (“an exact science”).

403. See State v. Richards, 804 P.2d 109, 112 (Ariz. Ct. App. 1990) (“[B]ite mark evidence is admissible without a preliminary determination of reliability. . . .”); People v. Middleton, 429 N.E.2d 100, 101 (N.Y. 1981) (“The reliability of bite mark evidence as a means of identification is sufficiently established in the scientific community to make such evidence admissible in a criminal case, without separately establishing scientific reliability in each case. . . .”); State v. Armstrong, 369 S.E.2d 870, 877 (W. Va. 1988) (judicially noticing the reliability of bitemark evidence).

404. Armstrong, 369 S.E.2d at 877.

405. State v. Stinson, 397 N.W.2d 136 (Wis. Ct. App. 1986).

406. See Jennifer D. Oliva & Valena E. Beety, Discovering Forensic Fraud, 112 Nw. U. L. Rev. 121 (2017).

407. See, e.g., Keko v. Hingle, 318 F.3d 639, 644 (5th Cir. 2003) (denying absolute immunity to forensic odontologist in § 1983 civil lawsuit following wrongful conviction); Ege v. Yukins, 380 F. Supp. 2d 852, 871 (E.D. Mich. 2005), aff’d in part, rev’d in part on other grounds, 485 F.3d 364 (6th Cir. 2007) (ruling “there is no question that the [bitemark] evidence in this case was unreliable and not worthy of consideration by a jury”); In re Richards, 371 P.3d 195, 211 (Cal. 2016) (court granting civil writ of habeas corpus ruling bitemark expert’s criminal trial testimony constituted material false evidence); Stinson v. Milwaukee, No. 09-C-1033, 2013 WL 5447916, at *12–13, *15–17 (E.D. Wis. Sept. 30, 2013), aff’d in part, rev’d in part sub nom. Stinson v. Gauger, 799 F.3d 833 (7th Cir. 2015) (denying absolute immunity to forensic odontologists in § 1983 civil lawsuit alleging fabrication and suppression of evidence).

408. See Aliza B. Kaplan & Janis C. Puracal, It’s Not a Match: Why the Law Can’t Let Go of Junk Science, 81 Alb. L. Rev. 895, 898 (2018) (noting courts that continue to admit bitemarks, relying on precedent); Michael J. Saks et al., Forensic Bitemark Identification: Weak Foundations, Exaggerated Claims, 3 J.L. & Biosciences 538, 546–47 (2016), https://doi.org/10.1093/jlb/lsw045 (same).

Facial Recognition Software and Other Methods

Several other forensic feature comparison techniques are on the horizon, beyond the capacity of this reference guide to discuss. For example, “microbial forensics” is a biometric comparison technique that is increasingly used where DNA typing is not possible.409 Additionally, digital forensics—from cell phone extraction to cell site location information to hash-value authentication of documents—is a significant area of Daubert litigation, although digital forensics is not typically referred to as a feature comparison technique. Digital evidence is discussed in the Reference Guide on Computer Science, in this manual.

One technique that has been largely supplanted by mitochondrial DNA (mtDNA) analysis, and yet which still arises in federal habeas litigation, is microscopic hair comparison. In this technique, the expert compares hair samples under a microscope and testifies whether the hair samples share a common source. The 2009 NRC Report described microscopic hair analysis as a field in which there is no uniform standard on the number of features that must be congruent between hair samples for an examiner to claim a “match.” The report found that microscopic hair analysis without mtDNA testing was unreliable.410 Likewise, the 2016 PCAST Report found insufficient evidence of foundational validity to support microscopic hair analysis.411

After the release of the 2009 NRC Report, the FBI launched a national reopening of closed cases involving hair analysis, and acknowledged misleading testimony by agents.412 Of the first 200 cases internally reviewed by the FBI, over 90% contained false or faulty testimony.413 Most notably, Santae Tribble was imprisoned for twenty-eight years after FBI forensic analysts confused Tribble’s hair with a dog’s hair.414 In partnership with the DOJ, the National Association of Criminal Defense Lawyers, and the Innocence Project, the FBI agreed to

409. See, e.g., Audrey Gouello et al., Analysis of Microbial Communities: An Emerging Tool in Forensic Sciences, 12 Diagnostics 1 (2021), https://doi.org/10.3390/diagnostics12010001 (discussing the techniques for identifying the anatomical origin of a microbial trace, and the forensic possibilities for different types of microbial evidence).

410. See 2009 NRC Report at 161, supra note 37.

411. See 2016 PCAST Report, supra note 20.

412. See, e.g., Press Release, Innocence Project, Innocence Project and NACDL Announce Historic Partnership with the FBI and Department of Justice on Microscopic Hair Analysis Cases (July 18, 2013), https://perma.cc/35VU-ALVD.

413. Norman L. Reimer, The Hair Microscopy Review Project: An Historic Breakthrough for Law Enforcement and a Daunting Challenge for the Defense Bar, Champion, July 2013, at 16, https://perma.cc/679W-VS4T. See also Press Release, FBI, FBI Testimony on Microscopic Hair Analysis Contained Errors in at Least 90% of Cases in Ongoing Review (Apr. 20, 2015), https://perma.cc/Y3R4-3RV8.

414. See Simon A. Cole et al., Microscopic Hair Comparison Analysis and Convicting the Innocent, Nat’l Registry of Exonerations (Dec. 2023), https://perma.cc/WD7V-EDRD.

provide free DNA testing of any remaining evidence in the acknowledged cases. The DOJ agreed to waive any statute of limitations barriers. State initiatives followed in its wake.415

Finally, a few notes about facial recognition methods are in order, given their already wide investigative use and the fact that they raise issues similar to those raised by older, relatively subjective non-DNA feature comparison methods. Facial recognition software (or facial recognition technology, FRT) compares two images to determine whether the same person is present in each image.416 A probe photo, such as a still from a surveillance video, can be uploaded to a police database of photos.417 The facial recognition software then provides several possible “matches,” and a human police officer uses those possible “matches” as investigative leads.418

Police use of FRT and databases in investigations is now commonplace.419 The FBI, for example, routinely runs face recognition searches through its agency system.420 Federal and state databases are also not limited to mug shots; now, nearly half of American adults are in a law enforcement agency’s facial recognition network, through the use of state driver’s license databases or other identification photos.421

Crucially, FRT cannot yet “match” two photographs for use as identification evidence at a trial, because of its current accuracy limitations,422 particularly when used to identify people of color.423 According to a 2019 NIST study evaluating the effects of race, age, and sex on facial recognition software, the

415. See id. at 13 n.30 (“We believe some form of state review exists in at least the following states: Arizona, Arkansas, California, Colorado, Connecticut, Florida, Illinois, Iowa, Kansas, Massachusetts, Missouri, Nebraska, New York, North Carolina, Pennsylvania, Texas, and Virginia.”). See also FBI, Director Comey Letter to Additional Governors on State Reviews (June 10, 2016), https://perma.cc/ZJ3Y-3AL6.

416. Kaitlin Jackson, Challenging Facial Recognition Software in Criminal Court, Champion, July 2019, at 14, https://perma.cc/6HQ9-GBC6.

417. Id. The database can include civilian photos from the Department of Motor Vehicles.

418. Id.

419. See, e.g., Hannah Bloch-Wehba, Visible Policing: Technology, Transparency, and Democratic Control, 109 Calif. L. Rev. 917, 921, 933 (2021) (noting that the extent of police use of facial recognition technology is both common and likely greater than the public is aware because of lack of transparency over its use); Claire Garvie et al., The Perpetual Line-Up: Unregulated Police Face Recognition in America, Geo. L. Ctr. on Priv. & Tech. (Oct. 18, 2016), https://perma.cc/SC56-9UJQ.

420. See generally Garvie, supra note 419.

421. Id.

422. See, e.g., People v. Collins, 15 N.Y.S.3d 564, 575 (Sup. Ct. 2015) (noting that facial recognition software, like certain cutting-edge DNA techniques, “can aid an investigation, but are not considered sufficiently reliable to be admissible at a trial”).

423. Use of facial recognition software has been shown to disproportionately affect Black Americans, Asian Americans, and Native Americans. See Safiya Umoja Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (2018); Ruha Benjamin, Race After Technology: Abolitionist Tools for the New Jim Code (2019).

software has been least accurate in identifying Black women, even misidentifying their gender.424 Two recent high-profile cases involved false accusations of crime against people of color.425 In response, developers are making software datasets more diverse in an effort to diminish error rates.426 Still, while making datasets more representative addresses one source of error, another source of error can be the comparison of features by human examiners. Most recently, the National Academy of Sciences issued a comprehensive report on FRT, including a discussion of its uses, limitations, and recommendations for best practices.427

FRT output as of this writing has not yet been admitted as identification evidence at trial, and at least one trial court has observed that there is “no agreement in a relevant community of technological experts that [FRT] matches are sufficiently reliable to be used in court as identification evidence.”428 Nonetheless, courts should be aware of its investigative use and keep abreast of future developments in the field. As the 2024 NAS Report notes, judges should be ready to determine not only whether FRT as an identification technique is foundationally reliable, but also whether it is “applied reliably by an appropriately trained, competent analyst.”429 Likewise, speaker recognition technology is currently being used to investigate crimes such as false marine distress calls, but it is rarely used as evidence in criminal cases.430

424. NIST, NIST Study Evaluates Effects of Race, Age, Sex on Face Recognition Software (Dec. 19, 2019), https://perma.cc/C5KV-HYLS.

425. See Kashmir Hill, Eight Months Pregnant and Arrested After False Facial Recognition Match, N.Y. Times, Aug. 6, 2023, https://www.nytimes.com/2023/08/06/business/facial-recognition-false-arrest.html (Porcha Woodruff falsely accused of robbery and carjacking); Tate Ryan-Mosley, The New Lawsuit That Shows Facial Recognition Is Officially a Civil Rights Issue, MIT Tech. Rev., Apr. 14, 2021, https://perma.cc/89WV-E2LL (noting false theft and burglary accusation against Robert Williams in Detroit).

426. Additionally, there is a NIST OSAC Facial Identification Subcommittee that works “to develop consensus standards and guidelines for the image-based comparisons of human facial features and to provide recommendations for the research and development necessary to advance the state of the science.” See NIST, Facial Identification Subcommittee, https://perma.cc/XJ4K-SPLQ.

427. Nat’l Acads. of Scis., Eng’g & Med. (NAS), Facial Recognition Technology: Current Capabilities, Future Prospects, and Governance (2024), https://doi.org/10.17226/27397 [hereinafter 2024 NAS Report].

428. People v. Reyes, 133 N.Y.S.3d 433, 436–37 (Sup. Ct. 2020).

429. 2024 NAS Report, supra note 427, at 103.

430. See generally Peter Andrey Smith, Can We Identify a Person from Their Voice?, IEEE Spectrum, Apr. 15, 2023, https://perma.cc/4KUL-Q8VM (noting current investigative uses of voice recognition technology but also that it is not yet used in criminal trials in the United States).

Special Issues with Machine-Generated Feature Comparison Evidence

Feature comparison methods are being increasingly automated. Previous sections have described some of these turns toward expert systems in place of human analysis in feature comparison, as well as research that may eventually enable them. This section briefly addresses unique conceptual and legal issues raised by this automation.

First, a brief note about definitions. By machine-generated feature comparison evidence, we mean the conveyance by a machine of a conclusion about whether or how two features or patterns “match,” such as a software program’s reported “likelihood ratio” comparing the chance of seeing the evidence sample under competing hypotheses about whether the defendant is the source of the evidence. In that case, the expert system itself is generating and reporting the information output, just as a human expert might make an assertion in a report or on the witness stand. Machine-generated feature comparison is therefore distinct from mere electronic storage of, or conduits for, information. Courts have admitted machine-stored information since at least the 1970s.431 Indeed, electronically stored information (ESI) has now been officially incorporated into the rules of civil discovery.432 Likewise, courts routinely admit emails and social media posts made by people and communicated via computers. To be sure, such evidence raises potential authentication questions (such as whether a purported email is actually written by a particular declarant).433 But ESI generally does not raise novel issues related to the reliability of the information itself, such as whether an assertion in an email is true.
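For orientation, the “likelihood ratio” reported by such programs has a standard formal definition, sketched here in generic notation (the symbols are ours, not any particular program’s):

```latex
\mathrm{LR} = \frac{\Pr(E \mid H_p)}{\Pr(E \mid H_d)}
```

where E is the observed evidence (for example, the DNA mixture data), Hp is the hypothesis that the defendant is a source of the evidence, and Hd is the competing hypothesis that an unknown, unrelated person is. A likelihood ratio of 300,000 thus means the evidence is 300,000 times more probable if the defendant is a contributor than if a random person is; it is not the probability that the defendant is the source, a distinction sometimes blurred in testimony.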

Types of Machine-Generated Feature Comparison Evidence

The past decade has seen a rapid rise in feature comparison techniques automated or at least facilitated by computer software. Techniques that were previously the province only of human examiners, such as firearm and toolmark

431. See, e.g., United States v. Liebert, 519 F.2d 542, 543 (3d Cir. 1975) (admitting list of persons not filing tax returns, stored in IRS computers).

432. See Fed. R. Civ. P. 34.

433. See Fed. R. Evid. 902(13), (14) & advisory committee’s notes to 2017 amendments. For an example of a successful authentication challenge to ESI, see United States v. Vayner, 769 F.3d 125 (2d Cir. 2014) (holding that admission of social media page and contents against defendant was reversible error because not properly authenticated as having been posted by him).

comparison,434 handwriting comparison,435 shoemark comparison,436 latent print comparison,437 facial recognition,438 and arson analysis based on burn or debris patterns,439 can now potentially be done by algorithm. The Government Accountability Office issued a report in 2021 on issues raised by automated forensic methods—in particular DNA, latent print analysis, and facial recognition.440 This reference guide focuses on non-DNA comparison disciplines. Nonetheless, because existing cases addressing forensics algorithms are all in the DNA context, judges should have a general understanding of such cases. Competing software programs now purport to interpret complex DNA mixtures, determine whether a defendant’s profile is consistent with the mixture, and report associated match statistics. Courts have generally admitted the results of such programs—typically, but not exclusively, offered by the prosecution441—over Frye/Daubert objections.442 The few exceptions have been cases in which a local laboratory did not perform internal validation studies before using the software,443 where a mixture had a particularly

434. See generally, Eric Hare, Heike Hofmann & Alicia Carriquiry, Automatic Matching of Bullet Land Impressions, 11 Annals Applied Stat. 2332 (2017), https://doi.org/10.1214/17-AOAS1080 (creating an open-source algorithm for automatically comparing certain striations, after removing class characteristics, on fired ammunition).

435. See, e.g., Amy M. Crawford, Danica M. Ommen & Alicia L. Carriquiry, A Statistical Approach to Aid Examiners in the Forensic Analysis of Handwriting, 68 Annals Applied Stat. 1768 (2023), https://doi.org/10.1111/1556-4029.15337.

436. Soyoung Park & Alicia Carriquiry, An Algorithm To Compare Two-Dimensional Footwear Outsole Images Using Maximum Cliques and Speeded-Up Robust Feature, 13 Stat. Analysis & Data Mining: ASA Data Sci. J. 188 (2020), https://doi.org/10.1002/sam.11449.

437. See Simon A. Cole et al., Beyond the Individuality of Fingerprints: A Measure of Simulated Computer Latent Print Source Attribution Accuracy, 7 L. Prob. & Risk 165, 166 (2008), https://doi.org/10.1093/lpr/mgn004 (explaining how the AFIS database returns several “candidate” prints to the analyst).

438. See, e.g., Lynch v. State, 260 So.3d 1166 (Fla. Dist. Ct. App. 2018).

439. See, e.g., Sander Korver et al., Artificial Intelligence and Thermodynamics Help Solving Arson Cases, 10 Sci. Reps. 20502 (2020), https://www.nature.com/articles/s41598-020-77516-x.

440. See Karen L. Howard, Forensic Technology: Algorithms Offer Benefits for Criminal Investigations, but a Range of Factors Can Affect Outcomes, Gov’t Accountability Off. (Jan. 24, 2024), https://perma.cc/SW4Q-TX3Z.

441. One such program, TrueAllele, has been used by a handful of defendants to demonstrate their innocence. See generally, TrueAllele Helps Free Innocent Indiana Man After 24 Years in Prison, Cybergenetics, Apr. 26, 2016, https://perma.cc/RA7Q-XV92.

442. See John S. Hausman, Lost Shoe Led to Landmark DNA Ruling—And Now, Nation’s 1st Guilty Verdict, MLive, https://perma.cc/UA3W-F8XF (reporting a conviction in Michigan as the first in the United States to be based in part on the STRmix software after the defense contested its admissibility); Trials, Cybergenetics, https://perma.cc/V9ZK-MFP6 (listing over fifty cases in which TrueAllele has been admitted).

443. See Decision & Order on DNA Analysis Admissibility, People v. Hillary, Indictment No. 2015-15 (N.Y. Cnty. Ct. Aug. 26, 2016), http://perma.cc/3TFA-8VA5 (excluding STRmix results under Frye).

high number of contributors,444 where the evidence involved a minor contributor to a mixture,445 or where the software was discontinued.446 In October 2019, in United States v. Gissantaner, a federal district judge ruled the results of one program inadmissible in a case involving “low copy number” DNA,447 but the Sixth Circuit reversed in 2021.448

One particular 2016 case, although involving DNA, is important to be aware of, as it will likely be invoked in future challenges to the results of other automated forensic analyses. In a high-profile New York homicide case, the two main DNA programs—STRmix™ and TrueAllele®—reached very different conclusions. The case involved the strangulation of a young boy. Police suspicion fell upon Oral Hillary, a former college soccer coach who had dated the boy’s mother.449 Another former boyfriend, a deputy sheriff who had been physically violent with the mother, was cleared of suspicion. No physical evidence linked either man to the scene. But analysts eventually collected a DNA mixture, too complex to be analyzed by humans, from under the boy’s fingernail. Police in 2013 gave the DNA fingernail data to the proprietor of TrueAllele. In 2014, TrueAllele reported “no statistical support for a match” with Hillary.450 Indeed, the TrueAllele CEO would later insist the evidence suggested Hillary was excluded as a contributor, based on the reported likelihood ratio.451 A year later, a new district attorney had the DNA data analyzed through STRmix, which reported that the mixture was 300,000 times more probable if Hillary was a contributor than if a random person was.452 A trial judge excluded the STRmix results because of a lack of internal validation,453 and

444. See Order Excluding DNA Evidence, United States v. Alfonzo Williams, Case No. 3:13-cr-00764-WHO-1 (N.D. Cal. Apr. 29, 2019) (Orrick, J.) (excluding results of “Bulletproof” genotyping software in case with potentially more than four contributors).

445. See United States v. Lewis, 442 F. Supp. 3d 1122 (D. Minn. 2020).

446. See Shayna Jacobs, Judge Tosses Out Two Types of DNA Evidence Used Regularly in Criminal Cases, N.Y. Daily News, Jan. 5, 2015, https://perma.cc/A9ZL-BMJ6 (reporting a Brooklyn judge excluded results from low copy number DNA testing and Forensic Statistical Tool testing).

447. United States v. Gissantaner, 417 F. Supp. 3d 857 (W.D. Mich. 2019) (Neff, J.), rev’d, 990 F.3d 457 (6th Cir. 2021).

448. Gissantaner, 990 F.3d 457. Rehearing en banc was denied, id., and the case appears to have since ended in a change of plea. See https://perma.cc/E765-9FW6.

449. See, e.g., Jesse McKinley, Tensions Simmer over Race as Town Reels from Boy’s Killing, N.Y. Times, Mar. 5, 2016, at A1, https://perma.cc/AE3L-NPVW.

450. W.T. Eckert, Hillary Trial Slated for Aug. 1, Watertown Daily Times (Mar. 2, 2016).

451. See John Buckleton, Memorandum, People v. Hillary (2017), https://perma.cc/H3UE-GT7U (proprietor of STRmix explaining the discrepancies between STRmix and TrueAllele likelihood ratios in the Hillary case).

452. Notice of Motion to Preclude at 10, People v. Hillary, Indictment No. 2015-15 (N.Y. Cnty. Ct. May 31, 2016), https://perma.cc/2H69-DJJT.

453. See Decision & Order, People v. Hillary, Indictment No. 2015-15 (N.Y. Cnty. Ct. Aug. 26, 2016) (Catena, J.).

Hillary was acquitted.454 The creator of STRmix has since published a memorandum online explaining how STRmix’s approach to the evidence in Hillary differed from TrueAllele’s.455 The case is an example of two programs, both deemed reliable by courts under Daubert or Frye, coming to significantly different conclusions based on the same forensic data.456 Judges should also be aware of a 2021 NIST study exploring the reliability of probabilistic genotyping software, given that its findings and recommendations might be relevant to evaluation of other automated methods.457

Recurring Legal Issues

This section offers an overview of legal issues potentially raised by machine-generated feature comparison evidence,458 some of which are already the subject of federal litigation.

Rule 702/Reliability Issues

If a computer program or machine output is the method upon which a human feature comparison expert’s opinion is based, then the algorithm might be subject to Daubert scrutiny.459

454. Jesse McKinley, Race, Jilted Love and Acquittal in Boy’s Killing, N.Y. Times, N.Y. Ed., at A1 (Sept. 29, 2016).

455. See Buckleton, supra note 451.

456. The study emphasized the importance, in determining software accuracy in a given case, of validation studies covering the particular “factor space” that a case falls into (such as, in the DNA mixture context, the quantity of DNA, number of contributors, etc.). See John M. Butler et al., NIST, DNA Mixture Interpretation: A NIST Scientific Foundation Review (2021), https://doi.org/10.6028/NIST.IR.8351-draft (noting that report is in draft form but that comment period has closed).

457. Id.

458. For a general discussion of legal issues related to algorithmic feature comparison evidence, see generally Andrea Roth, The Use of Algorithms in Criminal Adjudication, in Cambridge Handbook on the Law of Algorithms 407 (Woodrow Barfield ed., 2020), https://doi.org/10.1017/9781108680844 (describing the rise of algorithmic proof in criminal cases, and the legal and ethical issues raised); Edward K. Cheng & G. Alexander Nunn, Beyond the Witness: Bringing a Process Perspective to Modern Evidence Law, 97 Tex. L. Rev. 1077 (2019) (discussing how evidence law and Confrontation Clause jurisprudence should accommodate computer-generated evidence); Andrea Roth, Machine Testimony, 126 Yale L.J. 1972 (2017) (describing the rise of machine-generated conveyances of information as proof, and testimonial safeguards that might be deployed to better regulate such proof in a manner analogous to human assertions).

459. See Fed. R. Evid. 702 (applying only to expert “witnesses” who testify). Cf. People v. Lopez, 286 P.3d 469, 478 (Cal. 2012) (portion of spectrometer readings offered to prove blood-alcohol concentration, without additional sworn testimony or certification by human expert, not subject to the Confrontation Clause because not the statement of a human expert, over dissent by Justice Liu).

Reliability concerns have arisen with respect to several types of machine-generated conclusions both within and outside the feature comparison context, including driving estimates,460 blood alcohol testing,461 translation of the federal sentencing guidelines into computer code,462 Apple’s “Find My iPhone” tracking when used in theft and robbery cases,463 “minor miscode[s]” in DNA software,464 and inaccurate allelic frequencies used by software to generate DNA match statistics.465 Additionally, as discussed in the section titled “Facial Recognition Software and Other Emerging Methods” above, machine learning algorithms can misidentify or misclassify people or events because of limitations in the training dataset or other analytical problems.466 Machines learn how to characterize new data by training on data already categorized by a person or the machine itself. If the dataset is too small or too unrepresentative of future items to be classified, the algorithm might infer a pattern or linkage in the training data that does not actually mirror real life (overfitting) or might try to account for too many variables, rendering the training data inadequate for learning (the so-called curse of dimensionality).467

460. See, e.g., Jianniney v. State, 962 A.2d 229, 232 (Del. 2008) (excluding Mapquest driving estimates as inadmissible hearsay and noting reliability concerns).

461. See Andrea Roth, Trial by Machine, 104 Geo. L.J. 1245, 1271–72 (2016) (citing sources with respect to problems with Intoxilyzer 8000 and 5000 readings, due to programming errors); Roth, Machine Testimony, supra note 458, at 1995 (discussing litigation over code errors with the Alcotest 7110).

462. See Steven R. Lindemann, Commentary, Published Resources on Federal Sentencing, 3 Fed. Sent’g Rep. 45, 45–46 (1990), https://doi.org/10.2307/20639272.

463. See Lawrence Mower, If You Lose Your Cellphone, Don’t Blame Wayne Dobson, Las Vegas Rev.-J., Jan. 13, 2013, https://perma.cc/2G7G-89QX.

464. Roth, Trial by Machine, supra note 461, at 1276.

465. See Notice of Amendment of the FBI’s STR Population Data Published in 1999 and 2001, FBI (2015), https://perma.cc/TDF7-WMLA.

466. See also Face Recognition, Elec. Frontier Found., https://perma.cc/W648-PQWF (citing Brendan F. Klare et al., Face Recognition Performance: Role of Demographic Information, 7 IEEE Transactions Info. Forensics & Sec. 1789 (2012), https://doi.org/10.1109/TIFS.2012.2214212 (showing higher false positive rates for African Americans)).

467. See generally The Discipline of Organizing: Informatics Edition (Robert J. Glushko ed., 4th ed. 2013); Harry Surden, Machine Learning and Law, 89 Wash. L. Rev. 87 (2014); Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machines Will Remake Our World (2015); Alice Zheng, Evaluating Machine Learning Models: A Beginner’s Guide to Key Concepts and Pitfalls (2015).

With respect to issues that might arise in a Daubert hearing related to software results, some commentators have argued that validation studies alone might be insufficient to scrutinize certain algorithmic proof. For example, while validation studies might show that a certain interpretive software program boasts a low false positive rate, such studies cannot so easily determine whether a reported likelihood ratio is inaccurate:

Laboratory procedures to measure a physical quantity such as a concentration can be validated by showing that the measured concentration consistently lies with an acceptable range of error relative to the true concentration. Such validation is infeasible for software aimed at computing a[] [likelihood ratio] because it has no underlying true value (no equivalent to a true concentration exists). The [likelihood ratio] expresses our uncertainty about an unknown event and depends on modeling assumptions that cannot be precisely verified in the context of noisy [crime scene profile] data.468

In sum, judges faced with determining foundational validity and validity as applied of machine-generated feature comparison conclusions will need to consider separately whether the software’s classifications are reliable (e.g., identification versus nonidentification; consistent versus inconsistent) and whether the software’s reported “scores” are reliable (such as a match statistic or score where the ground truth is not known). Judges will face the same questions as with other feature comparison methods in terms of what counts as a well-designed validation study, and how many well-designed studies are necessary, to determine a program’s error rate. They will also need to decide whether some issues (such as the reliability of reported scores that cannot be tested through traditional black box validation studies) require greater access to the source code underlying the software, a discovery issue discussed below.

Hearsay and Confrontation

Some litigants have tried to argue, mostly unsuccessfully, that a machine-generated conclusion offered as evidence is “hearsay” and therefore inadmissible unless it comes within an exception to the rule against hearsay. Hearsay, under Federal Rules of Evidence 801 and 802, is an out-of-court “assertion” by a “person,” “intended” as an assertion, and offered to prove the truth of the matter asserted.469 By the explicit language of these rules, as numerous courts have now held,470 a factual claim rendered by a machine rather than a person cannot be hearsay. To be sure, some courts used to treat computer-generated results as hearsay and admit them under the “business records” hearsay exception for records created in the regular course of business.471 But commentators were critical of this practice, pointing out that these courts were conflating electronically stored records, inputted by humans but stored by computers, with electronically generated records, where the computer itself creates and reports information.472

468. Christopher D. Steele & David J. Balding, Statistical Evaluation of Forensic DNA Profile Evidence, 1 Ann. Rev. Stat. & Its Application 361, 380 (2014), https://doi.org/10.1146/annurev-statistics-022513-115602.

469. See, e.g., Fed. R. Evid. 801, 802 (defining hearsay as an out-of-court “statement,” “intended” as an assertion, offered to prove the truth of the matter asserted, and stating that hearsay is presumptively inadmissible).

470. See, e.g., People v. Lopez, 286 P.3d 469, 478 (Cal. 2012) (holding that gas chromatograph “raw” data is not hearsay); United States v. Lizarraga-Tirado, 789 F.3d 1107, 1109 (9th Cir. 2015) (holding that Google Earth output is not hearsay).

Some courts and commentators have alternatively suggested that an algorithm’s conclusions might be the hearsay statements of the algorithm’s creator, such as a computer programmer, and that machine-generated proof is admissible so long as the programmer is subject to cross-examination.473 Others have argued against this view of algorithmic evidence, noting that the programmer has not uttered the resulting information, and might not even fully understand the thousands of steps taken by the program to generate the information, particularly in more advanced forms of artificial intelligence (AI).474

Some litigants and commentators have further argued that admission of machine-generated conclusions without sufficient opportunity for adversarial scrutiny may violate the Sixth Amendment’s Confrontation Clause,475 which guarantees an accused the right “to be confronted with the witnesses against him.” To be sure, under existing U.S. Supreme Court precedent, the right of confrontation applies only to “testimonial hearsay,”476 which presumably only covers certain human assertions. While the Court has not squarely addressed the question, Justice Sotomayor has suggested in a concurring opinion that “raw data” from a machine might not be “testimonial.”477 Not surprisingly, then, most courts have thus far rejected arguments that the admission of a machine’s claim without disclosing certain information about a machine’s processes (such as source code) violates the Confrontation Clause.478 Still, several commentators have argued that any approach categorically exempting machine assertions from the right of confrontation may soon become untenable, especially as machine-generated conclusions become ever more complex, opaque, and interpretive, akin to human expert testimony.479

471. Apparently, this approach was buttressed by an oft-cited 1974 article arguing that “[c]omputer generated evidence will inevitably be hearsay.” Jerome J. Roberts, A Practitioner’s Primer on Computer-Generated Evidence, 41 U. Chi. L. Rev. 254, 272 (1974).

472. See generally Adam Wolfson, Note, “Electronic Fingerprints”: Doing Away with the Conception of Computer-Generated Records as Hearsay, 104 Mich. L. Rev. 151 (2005) (noting, and criticizing, courts’ treatment of computer-generated business records as hearsay).

473. See, e.g., People v. Wakefield, 38 N.Y.3d 367, 385–86 (2022) (holding that admission of TrueAllele results did not violate the Confrontation Clause so long as the creator, Mark Perlin, was on the witness stand); United States v. Washington, 498 F.3d 225, 229 (4th Cir. 2007) (explaining defendant’s argument that the “raw data” of a chromatograph was the “hearsay” of the “technicians” who tested the defendant’s blood sample for PCP and alcohol using the machine); Karen Neville, Programmers and Forensic Analyses: Accusers Under the Confrontation Clause, 10 Duke L. & Tech. Rev. 1, 8–9 (2011) (arguing that “the programmer” is “the ‘true accuser’—not the machine merely following the protocols he created”).

474. See, e.g., Roth, Machine Testimony, supra note 458, at 1986. See generally Cheng & Nunn, supra note 458 (describing potential inaccuracies of “process-based” proof and suggesting enhanced pretrial discovery, not cross-examination of an individual expert, as the most appropriate safeguard).

475. See, e.g., Wakefield, 38 N.Y.3d at 385 (rejecting defendant’s argument that admission of TrueAllele results without disclosure of the source code violated the Confrontation Clause).

476. Crawford v. Washington, 541 U.S. 36 (2004).

477. Bullcoming v. New Mexico, 564 U.S. 647, 674 (2011) (Sotomayor, J., concurring).

Discovery Issues

A final recurring legal issue related to machine-generated feature comparison evidence involves disputes over pretrial requests for disclosure of the source code or for access to a research license to conduct software audits. The dispute stems from the fact that many (though not all) of the programs underlying machine-generated proof are proprietary,480 and program developers often argue, in response to discovery requests for information like source code, that such information is protected by a “trade secrets privilege.”481 Some have suggested that a trade secret privilege might be necessary to incentivize development of high-quality algorithms for use in criminal justice,482 while others have argued that such a privilege is inappropriate and unnecessary.483 In particular, Professor Rebecca Wexler has argued that the emergence of a trade secrets privilege in criminal cases was an historical accident, and that substantive trade secret doctrine offers software developers sufficient protection to incentivize innovation. Until recently, courts appeared to have largely accepted proprietors’ claims of privilege.484 At least two courts have granted defense requests for source code in cases involving DNA software, however.485 Meanwhile, U.S. Representative Mark Takano (D-CA) introduced a bill (originally introduced in 2019) titled “Justice in Forensic Algorithms Act of 2024” that would require disclosure of source code and remove any trade secret privilege over such code.486

478. See, e.g., Wakefield, 38 N.Y.3d at 385 (2022) (holding that admission of TrueAllele results without disclosure of the source code did not violate the Confrontation Clause); People v. Lopez, 286 P.3d 469, 478 (Cal. 2012) (holding that gas chromatograph “raw” data is not hearsay and thus its admission did not implicate the Confrontation Clause). But see People v. Chubbs, No. B258569, 2015 WL 139069, at *4 (Cal. Ct. App. Jan. 9, 2015) (noting that the trial court had invoked the Confrontation Clause in ordering disclosure of source code to facilitate cross-examination of programmer); Order on Procedural History and Case Status in Advance of May 25, 2016 Hearing, United States v. Michaud, No. 3:15-CR-05351RJB, 2016 WL 337263 (W.D. Wash. Jan. 28, 2016) (noting due process right to examine source code of government’s Network Investigative Technique (NIT) used to hack defendant’s computer).

479. See Andrea Roth, What Machines Can Teach Us About “Confrontation,” 60 Duquesne L. Rev. 210 (2022) (explaining that a view of confrontation as merely guaranteeing cross-examination at trial is ahistorical); Cheng & Nunn, supra note 458 (arguing that machine-generated proof cannot be exempt from confrontation); Roth, Machine Testimony, supra note 458 (same); Erin Murphy, The Mismatch Between Twenty-First-Century Forensic Evidence and Our Antiquated Criminal Justice System, 87 S. Calif. L. Rev. 633, 657–58 (2014); David Alan Sklansky, Hearsay’s Last Hurrah, 2009 Sup. Ct. Rev. 1, 67 (2009) (urging a view of confrontation as “a meaningful opportunity to test and to challenge the prosecution’s evidence”); see also id. at 7 (quoting Daniel H. Pollitt, The Right of Confrontation: Its History and Modern Dress, 8 J. Pub. L. 381, 402 (1959)) (“The Confrontation Clause could be read broadly to guarantee criminal defendants a meaningful opportunity to challenge—‘to know, to examine, to explain, and to rebut’—the proof offered against them.”).

480. See, e.g., State v. Loomis, 881 N.W.2d 749 (Wis. 2016) (denying litigants access to source code of actuarial tool used in parole hearing); Chubbs, 2015 WL 139069, at *8–9 (noting that DNA mixture interpretation software TrueAllele is proprietary); Order, United States v. Michaud, No. 3:15-CR-05351RJB (W. Dist. Wash. May 18, 2016) (noting that government’s malware used in child pornography investigation was proprietary); Luke Broadwater & Scott Calvert, City in $2 Million Dispute With Xerox Over Camera Tickets, Balt. Sun, Apr. 24, 2013, https://perma.cc/T2X7-F2PS (noting that Xerox refused to disclose source code for red light camera system). But see Hare, Hofmann & Carriquiry, supra note 434 (creating an open-source algorithm for automatically comparing certain striations, after removing class characteristics, on fired ammunition, and noting the benefits of open-source software for transparency and improvement).

Admission of machine-generated feature comparison conclusions might raise other discovery issues as well. For example, a party might ask for pretrial access to a program to manipulate its inputs, in the same way that a party might seek to depose or interview an opposing party’s expert witness before trial.487 Next, parties might ask for access to the prior statements or runs of software programs in a given case. While there is no Jencks Act for machine conveyances, judges will have to decide whether to grant access to such statements as fairness demands.488 In addition, judges might face requests by a party for access to the data on which a machine-learning algorithm was “trained” to classify future objects or events. A party might seek this dataset to argue that the algorithm was trained on data that was not representative of the person, object, or event being classified in the case (such as facial recognition software claiming to have identified an African American as having been at a crime scene, where the software was trained largely on white faces). The proponent of the evidence might argue that the data are covered by privacy laws and cannot be disclosed. Just as judges have always had to grapple with access to privacy-protected data in contexts like a victim’s medical records, judges will need to consult applicable data-protection laws, with the accused’s constitutional right to present a defense in mind, in resolving these issues.489

481. See, e.g., Roth, Machine Testimony, supra note 458, at 2028 (discussing trade secrets); Christian Chessman, Note, A “Source” of Error: Computer Code, Criminal Defendants, and the Constitution, 105 Calif. L. Rev. 179, 212 (2017) (discussing the trade secret privilege in criminal cases with respect to source code).

482. See, e.g., Edward J. Imwinkelried, Computer Source Code: A Source of the Growing Controversy over the Reliability of Automated Forensic Techniques, 66 DePaul L. Rev. 97 (2016).

483. See Chessman, supra note 481, at 212; Rebecca Wexler, Life, Liberty, and Trade Secrets: Intellectual Property in the Criminal Justice System, 70 Stan. L. Rev. 1343 (2018); Jason Tashea, Trade Secret Privilege Is Bad for Criminal Justice, A.B.A. J., July 30, 2019, https://perma.cc/CKY9-HLVN.

484. Wexler, supra note 483, at 1352–53.

485. The New Jersey case is State v. Pickett, 246 A.3d 279 (N.J. Super. Ct. App. Div. 2021). The state has since withdrawn its intent to introduce the TrueAllele results, and thus this litigation appears to be moot. The other case is United States v. Ellis, No. 19-369, 2021 WL 1600711 (W.D. Pa. Apr. 23, 2021). As of this writing, the parties had agreed upon terms of a protective order and review by experts had just begun.

486. See H.R. 7394, 118th Cong. (Feb. 15, 2024), https://www.congress.gov/bill/118th-congress/house-bill/7394/committees.

487. Cf. Jennifer L. Mnookin, Repeat Play Evidence: Jack Weinstein, “Pedagogical Devices,” Technology, and Evidence, 64 DePaul L. Rev. 571, 588 (2015) (arguing that demonstrative evidence in the form of complex algorithms should have this as a condition of admissibility).

488. See, e.g., Roth, Machine Testimony, supra note 458, at nn.411–19 & accompanying text (explaining why the Jencks case (Jencks v. United States, 353 U.S. 657 (1957)) can and should apply to machines and why it would be consistent with the Jencks Act).

489. For a brief discussion of intellectual property and data privacy issues related to machine learning training data, see Brittany Bacon et al., Training a Machine Learning Model Using Customer Proprietary Data: Navigating Key IP and Data Protection Considerations, in Pratt’s Privacy and Cyber-security Law Report, LexisNexis 233 (Steven A. Meyerowitz et al. eds., Oct. 2020).

Glossary of Terms

This glossary has been adapted from a general literature review related to the topics included in this reference guide. An ongoing project by NIST’s Organization of Scientific Area Committees (OSAC) is also creating a list of preferred terms for recurring concepts in forensic science.490

ACE-V. Acronym for the method many feature comparison analysts use and refer to as Analysis, Comparison, Evaluation, and Verification.

AFIS. Acronym for Automated Fingerprint Identification System, used by law enforcement to rapidly screen large numbers of prints in a national database; candidate prints are then evaluated by analysts. While AFIS is useful for screening, the system is not currently able to determine that a particular pair of prints has a common source.

algorithm. A list of steps capable of being performed by a machine.

“black box” study. A validation study designed to determine the extent to which a forensic examiner applying a feature comparison method reaches the correct conclusion (where the ground truth is known), without having to know anything about how the method itself—which may well remain a “black box”—works.

class characteristics. Characteristics believed to be shared by a group of persons or objects (e.g., ABO blood types).

contextual bias. Bias that occurs when a forensic examiner’s judgment while engaged in feature comparison is influenced by irrelevant information about the facts of a case.

double-“blind” or double-anonymized research or testing. Research or testing in which the respondent and the interviewer are not given information that will alert them to the anticipated or preferred pattern of response. While the terms “blind” and “double-blind” have been abandoned by many users over concerns that such language is ableist, judges will still frequently encounter them in literature discussing feature comparison disciplines.

false negative rate. The rate at which an examiner, using a feature comparison method, erroneously concludes that two features or patterns are inconsistent when, in fact, they are consistent. Note that there is a debate in the literature regarding how to incorporate an examiner’s determination of “inconclusive” (in a testing context, not in casework), when the ground truth is that two features or patterns are consistent, into a false negative rate.

490. See NIST, OSAC Lexicon, https://perma.cc/VKY4-KEND.

false positive rate. The rate at which an examiner, using a feature comparison method, erroneously concludes that two features or patterns are consistent when, in fact, they are not consistent. Note that there is a debate in the literature regarding how to incorporate an examiner’s determination of “inconclusive” (in a testing context, not in casework), when the ground truth is that two features or patterns are inconsistent, into a false positive rate.
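To make the debate concrete, the following sketch (using entirely hypothetical study numbers, not data from any actual black box study) shows how the choice of whether to count inconclusive determinations in the denominator changes a reported false positive rate:

```python
# Hypothetical results from a study in which an examiner compares
# 100 pairs known to come from DIFFERENT sources (ground truth known).
false_positives = 2   # examiner reported "consistent" (an error)
true_negatives = 88   # examiner reported "inconsistent" (correct)
inconclusives = 10    # examiner reported "inconclusive"

# Approach 1: exclude inconclusives from the denominator entirely.
fpr_excluding = false_positives / (false_positives + true_negatives)

# Approach 2: count inconclusives as comparisons in the denominator.
fpr_including = false_positives / (false_positives + true_negatives + inconclusives)

print(round(fpr_excluding, 4))  # 0.0222
print(round(fpr_including, 4))  # 0.02
```

The same underlying examiner performance thus yields different reported error rates depending on how inconclusives are treated, which is why the literature flags the issue.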

FDE. Acronym for Forensic Document Examiners, who compare known and unknown writing samples and perform various tests on documents, such as whether a particular ink formulation existed on the purported date of a writing. Also referred to as Questioned Document Examiners (QDE).

fingerprint conclusions. A fingerprint examiner’s conclusion as to the strength of the evidence in supporting an inference about whether two impressions came from the same source. Some agencies, such as the Department of Justice, guide examiners to use one of three conclusions: source identification, source exclusion, or inconclusive determination.

fingerprint inconclusive determination. The label some examiners use to describe their determination that there is “insufficient quantity and/or clarity” between the two impressions for the examiner to arrive at a source identification or exclusion of fingerprints.

fingerprint source exclusion. The label some examiners use to describe their determination that the two sets of fingerprints did not come from the same source.

fingerprint source identification. A label some examiners use to describe their determination that two prints—known and unknown—came from the same person. Other examiners might eschew this “identification” language and instead make statements about the strength of the evidence, such as that there is “extremely strong support” that the impressions came from the same source and “extremely weak support” that they came from different sources.

individual characteristic. A characteristic believed to be unique to an object or person.

individualization. The claim that a feature comparison method can scientifically determine that a feature or pattern (such as a fingerprint, or toolmark, or bitemark) was produced by one source, to the exclusion of all other sources. This claim is notably different from the concept of “uniqueness,” a claim that a particular biological trait (such as the ridges on one’s fingers) is unique to each person.

latent prints. Fingerprints found at a crime scene or in another location, which may vary in quality or number. Latent prints are compared to record prints to determine whether there is consistency between the two.

likelihood ratio (LR). A ratio in which the numerator is the chance of seeing a particular type of evidence (such as a DNA “match”) given one hypothesis (e.g., that the defendant is the source of the DNA) and the denominator is the chance of seeing the evidence (e.g., the DNA “match”) given the alternative hypothesis (such as that the defendant is not the source of the DNA). An LR is part of Bayes’ Theorem, a means of updating a prior guess about the probability of an event with new evidence (incorporated into the LR) to arrive at a new guess (“posterior probability”). Computer programs purporting to interpret complex DNA mixtures typically report their conclusions in the form of LRs.
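As an arithmetic illustration (with entirely hypothetical probabilities, not values from any actual program or case), the odds form of Bayes’ Theorem described above can be computed as follows:

```python
# Hypothetical inputs: the chance of seeing the evidence under each hypothesis.
p_evidence_if_source = 0.99        # P(evidence | defendant is the source)
p_evidence_if_not_source = 0.0001  # P(evidence | defendant is not the source)

# The likelihood ratio is the ratio of those two probabilities.
lr = p_evidence_if_source / p_evidence_if_not_source

# Bayes' Theorem in odds form: posterior odds = LR x prior odds.
prior_odds = 1 / 1000  # hypothetical prior odds that defendant is the source
posterior_odds = lr * prior_odds

# Convert odds to a posterior probability.
posterior_probability = posterior_odds / (1 + posterior_odds)

print(round(lr))                       # 9900
print(round(posterior_probability, 3)) # 0.908
```

Note that the LR itself says nothing about the prior odds; the same LR of 9,900 would yield a very different posterior probability if the prior odds were, say, 1 in 1,000,000.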

machine learning. An algorithm in which the computer is “trained” on a dataset (such as emails already classified as “spam” or “not spam”) such that the computer “learns” to classify objects in the future (such as determining whether future emails are “spam” versus “not spam”). More broadly, machine learning refers to any system in which a machine learns from experience and improves its performance on an assigned task over time.
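As a toy illustration of the spam example in this definition (a deliberately simplified sketch, not any actual forensic or commercial system), a classifier can be “trained” on labeled examples and then used to classify new items:

```python
# "Training": tally how often each word appears in emails already
# labeled spam or not spam by a person.
from collections import Counter

training_data = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "not spam"),
    ("lunch meeting tomorrow", "not spam"),
]

counts = {"spam": Counter(), "not spam": Counter()}
for text, label in training_data:
    counts[label].update(text.split())

def classify(text):
    # "Classification": score each label by how often the new email's
    # words were seen under that label during training; pick the higher.
    scores = {label: sum(c[w] for w in text.split()) for label, c in counts.items()}
    return max(scores, key=scores.get)

print(classify("money now"))           # spam
print(classify("agenda for meeting"))  # not spam
```

The sketch also illustrates the limits noted elsewhere in this guide: a classifier trained on four short emails will misclassify anything whose vocabulary the training data did not represent.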

OSAC. The Organization of Scientific Area Committees, an organization under the direction of the National Institute of Standards and Technology (NIST), under the United States Department of Commerce. OSAC focuses on developing and approving standards to govern forensic science service providers.

PCAST. President’s Council of Advisors on Science and Technology.

QDE. See FDE.

record prints. Fingerprints that are rolled onto a fingerprint card or digitized and scanned into a file. These prints are the ones stored in databases and against which unknown prints are compared.

reliability. The extent to which the same results are obtained in each instance in which a test or method is performed—its consistency. Reliability can be further divided into repeatability, the ability of one examiner to repeat the same test results using the same method under the same conditions; and reproducibility, the ability of another examiner to repeat the same test results using the same method under the same conditions.

source code. The instructions for a computer program, written by a programmer in human-readable form, that tell a computer what to do.

SWGDOC. The Scientific Working Group for Forensic Document Examination.

SWGFAST. The Scientific Working Group on Friction Ridge Analysis, Study, and Technology.

toolmark conclusions. A toolmark examiner’s conclusion as to the strength of the evidence in supporting an inference about whether two toolmarks came from the same source. Some agencies, such as the Department of Justice, guide examiners to use one of three conclusions: source identification, source exclusion, or inconclusive determination. For a description of each, see fingerprint conclusions.

ULTR. The Department of Justice’s guidelines on Uniform Language for Testimony and Reports.

validity. The ability of a test to measure what it is supposed to measure—its accuracy. Validity includes reliability, but the converse is not necessarily true.

Suggested Citation: "Reference Guide on Forensic Feature Comparison Evidence." National Academies of Sciences, Engineering, and Medicine and Federal Judicial Center. 2025. Reference Manual on Scientific Evidence: Fourth Edition. Washington, DC: The National Academies Press. doi: 10.17226/26919.
Next Chapter: Reference Guide on Human DNA Identification Evidence