Suggested Citation: "Summary." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.

Summary

The Department of the Air Force’s (DAF’s) Air and Space Forces stand on the shoulders of 75 years of comprehensive, rigorous service-wide test and evaluation (T&E) policies, processes, and practices. The combination of a large cadre of designated test personnel, sustained funding, dedicated test organizations, test infrastructure, career-long T&E education and training, and a unique test culture has been instrumental in shaping the current force. Absent a highly disciplined systems engineering approach to testing and the continuous focus on T&E in every aspect of operations, today’s DAF would be far less capable and safe.

In requesting this study on testing, evaluating, and assessing the performance of artificial intelligence-enabled systems under operational conditions, DAF leaders recognize both the opportunities and challenges inherent in integrating artificial intelligence (AI) at speed and at scale across the DAF. Integration of AI-enabled capabilities into the DAF has thus far been limited and slow, but demand for and integration of such capabilities are expected to accelerate substantially based on current trends and anticipated technological developments in AI and related fields.

In its final report published in March 2021, the National Security Commission on AI (NSCAI) noted that “having justified confidence in AI systems requires assurances that they will perform as intended when interacting with humans and other systems. The T&E of traditional legacy systems is inefficient at providing these assurances. To minimize performance problems and unanticipated outcomes, an entirely new type of T&E will be needed.”1 The NSCAI recommended that all

___________________

1 National Security Commission on Artificial Intelligence, 2021, The National Security Commission on Artificial Intelligence Final Report, Arlington, VA, https://www.nscai.gov/wp-content/uploads/2021/03/Full-Report-Digital-1.pdf, p. 137.


the military Services should “establish a test, evaluation, verification, and validation (TEVV) framework and culture that integrates testing as a continuous part of requirements specification, development, deployment, training, and maintenance and includes runtime monitoring of operational behavior.”2 The committee echoes this NSCAI recommendation.

DAF leaders must now address the pervasive implications of AI T&E across the entire DAF. The DAF has not yet prioritized AI T&E in a way that matches its historical investments in its other T&E capabilities. For example, the DAF has not developed a cadre of DAF-wide AI experts or implemented the requisite AI T&E frameworks. Similarly, the DAF has not established enterprise-level T&E policies and infrastructures to support testing autonomous or AI-enabled autonomous systems, either in isolation or integrated within system-of-systems architectures. Instead, T&E of current AI capabilities has largely relied on ad hoc and bespoke processes and procedures. The ad hoc nature of current DAF AI T&E and the lack of formal guidance complicated this committee’s efforts to evaluate current assessment methods employed by the DAF. AI T&E will require much greater investment than prior T&E efforts have received, partly because T&E has historically been under-resourced and partly because AI systems are so complex. These efforts have, however, been bolstered over the past 2 years by the work of the AI T&E Community of Interest (CoI) established by the Office of the Secretary of Defense (OSD) Joint Artificial Intelligence Center (JAIC). As discussed in this report, without significant improvements to its ability to test and evaluate AI, the DAF will be unable to successfully incorporate AI into its systems. To acknowledge the NSCAI’s findings and associated recommendations for AI T&E, and to enable the DAF to field AI-enabled capabilities that are highly effective, safe, and used responsibly, DAF leaders must prioritize AI T&E. They should do so in a way that, as the committee describes in more detail throughout the report, recognizes the importance of AI T&E throughout the entire AI life cycle, rather than segregating it into distinct developmental test and evaluation and operational test and evaluation (OT&E) phases as with traditional weapon systems (see Section 3.2). The committee found that this prioritization includes but is not limited to:

  • Fostering a unique AI T&E culture
  • Establishing DAF-wide AI T&E governance with sufficient authority
  • Dedicating and sustaining the resources necessary for AI T&E
  • Integrating data collection and curation into the AI T&E pipeline
  • Creating the virtual environments and simulations necessary to generate simulated data or to support reinforcement learning

___________________

2 Ibid., p. 384.

  • Emphasizing human-systems integration (HSI), such as for human-AI teaming
  • Developing the AI T&E workforce

These shortcomings underscore the challenges the entire federal government faces in establishing organization- and agency-wide AI T&E processes and procedures. Unlike digital-age technology companies that rapidly embrace AI capabilities, the DAF is analogous to traditional companies that are only now beginning to adopt AI technologies across their respective industries. Therefore, it is an opportune time for the DAF to craft an AI T&E vision and commit to a long-range AI T&E strategy and implementation plan that includes specific and measurable objectives and goals. There is no time to waste: the risks to the DAF from remaining “frozen in place” regarding AI T&E are significant and will increase exponentially over time. The DAF will only gain ground through the prioritization of AI T&E and commensurate near-term commitment of resources. Rigorous and comprehensive end-to-end T&E of AI-enabled capabilities will significantly increase the DAF’s ability to field systems while also allowing end-users to gain justified confidence in AI-enabled systems and tools.

As demonstrated by previous examples of AI projects carried out at scale and by both DoD- and industry-wide digital modernization programs, leaders commonly underestimate the investments of time, expert human resources, and money required to implement digital modernization and establish modern AI data management best practices. Without accelerating digital modernization3 of the DAF’s underlying T&E infrastructure, to include its information architecture and a commitment to a DAF-wide T&E data strategy and implementation plan,4 the DAF will struggle to assess AI-enabled solutions at the required scale. Therefore, the committee recommends that the DAF immediately update its comprehensive analysis of resource requirements to ensure that AI T&E digital modernization efforts are included in the DAF’s overarching digital transformation plans, and that it take steps to sustain AI T&E resources in future DAF budgets.

___________________

3 This digital modernization across the AFTC, AFOTEC, and USAFWC includes but is not limited to prioritizing (and sustaining) funding for and rapidly installing AI stacks (AI tools, modern software platforms, data libraries, and access to the same computing environments and information technology architectures available to the nation’s leading commercial technology companies). The 2022 establishment of the Autonomy, Data, and AI Experimentation (ADAX) proving ground at Eglin AFB as a joint venture between CDAO and AFWERX, supported by the Eglin AFB test ecosystem, is an encouraging first step. One of ADAX’s missions is to assess the viability of commercial technologies for Air Force adoption. The ADAX team is coordinating with the DAF ABMS program office to develop initial use cases. Once a technology is determined to be suitable for integration, ADAX personnel will design an Air Force AI test plan. In July 2022, the Air Force Test Center published its own “Digital Modernization Strategy” and initiated three digital engineering efforts. The committee recommends including AI T&E as part of these efforts.

4 Department of Defense, 2020, “DoD Data Strategy,” https://media.defense.gov/2020/Oct/08/2002514180/-1/-1/0/DoD-Data-Strategy.pdf.


The magnitude of changes this report suggests will require dedicated leadership, continuous oversight, and individual responsibility and accountability. These outcomes are best attained by formally designating a senior AI T&E official who reports to the secretary of the Air Force, is responsive to the chiefs of the Air and Space Forces, and has the necessary resources and authorities to implement DAF-wide changes. For this reason, the committee recommends that the secretary of the Air Force formally designate an overall DAF AI T&E champion at the general officer or senior executive service level and grant them the necessary authorities to execute DAF-wide changes on behalf of the secretary and the chiefs of the two Services.5 The 2022 dual-hat designation of the 96th Operations Group commander as the Chief of AI Test and Operations for the DAF Chief Data and AI Office (CDAO) is a positive and important step, and the committee views the 96 OG/CC as one of the primary beneficiaries of this report. However, as currently constituted, the chief of AI test and operations for the DAF CDAO does not have the authority to make changes across the DAF of the scope and scale this committee believes necessary to enable and accelerate AI T&E. Therefore, the DAF needs a formally designated advocate with an appropriate breadth and depth of AI and T&E experience, along with the commensurate background and expanded authorities, responsibilities, and resources. This champion should establish an AI governance structure that includes formally delineating AI T&E reporting relationships, roles, and responsibilities across the Tri-Center,6 the future U.S. Space Force Operational Test Agency (OTA),7 the DAF CDAO, and operational air, intelligence, command and control (C2), space, and cyber units. This process should include assessing what broader DAF-wide organizational and governance changes are needed to reflect the differences between AI T&E and T&E for all other Air Force systems and capabilities.8

___________________

5 The committee uses the term “champion” as illustrative; it does not take a position on the actual title, on whether the designated official should be a general officer or civilian senior executive, or on whether the position should reside within the AFTC, AFOTEC, the U.S. Air Force Warfare Center, or elsewhere. The committee notes, however, that this individual will be required to coordinate AI T&E roles and responsibilities across the three primary DAF test and evaluation commands (AFMC, ACC, and AFOTEC) and the DAF Chief Data and AI Office (CDAO). Additionally, while the committee calls for a single DAF AI champion, it acknowledges the potential benefits of designating separate AI T&E champions for the Air Force and the Space Force. The committee recommends that the DAF analyze the potential benefits and drawbacks of these various options, with the goal of designating the individual(s) as soon as possible.

6 Comprising the Air Force Test Center (AFTC) (Air Force Materiel Command, to include the AFMC Digital Transformation Office or DTO); the United States Air Force Warfare Center (USAFWC) (Air Combat Command); and the Air Force Operational Test and Evaluation Center (AFOTEC) (CSAF).

7 If established.


There are many similarities between the T&E of aircraft, weapons, sensors, command and control, and cyber systems and the T&E of AI-enabled systems. Most importantly, the same basic systems engineering principles that have proven instrumental in fielding all previous DAF capabilities are equally applicable to AI. Therefore, the foundational systems theory concepts that have served the DAF so well over the past 75 years provide the appropriate starting point for crafting DAF AI T&E strategies and implementation plans.9

Because AI is a software-centric capability, however, major differences drive the need for a new approach to several critical aspects of AI T&E. These major differences include:

  • The lack of a clear demarcation between developmental test (DT) and operational test (OT), or between initial operational T&E (IOT&E) and follow-on operational T&E (FOT&E), for AI capabilities.
  • The importance of and reliance on iterative, incremental (agile) software development and adaptive T&E principles (AIOps or DevSecOps; see Section 3.2), rather than linear, sequential (waterfall) software development, for AI systems.
  • The centrality of data (including the potential for skewed, corrupted, or incomplete datasets), which necessitates an emphasis on data collection and curation and on high-end computing.
  • A continuous, data-based learning capability that continually changes fielded AI systems, necessitating continued testing and monitoring for data drift (see the drift-check sketch after this list).
  • The importance and challenges of domain adaptation for AI-enabled systems.
  • Probabilistic or statistically predictable (i.e., non-deterministic) behavior.
  • The effects and risks of adversarial attacks against AI models.
  • The challenges of AI explainability and auditability.

___________________

8 This should include evaluating the success of past integrated T&E efforts, such as combined test forces with matrixed OT and DT personnel from across the Tri-Center and assessing their utility for DAF-wide AI T&E projects. The Operational Flight Program-Combined Test Force (OFP-CTF) at Eglin AFB, with a rotating DT and OT commander with authority to direct efforts using resources from both the USAFWC and AFTC, serves as a useful reference point.

9 See Department of Defense, 2023, “Autonomy in Weapon Systems,” DoD Directive 3000.09, Office of the Under Secretary of Defense for Policy, https://www.esd.whs.mil/portals/54/documents/dd/issuances/dodd/300009p.pdf. This is one example of a current DoD policy that bridges the gap between hardware- and software-centric weapon systems. This directive establishes requirements for TEVV of autonomous and semi-autonomous weapon systems, to include AI-enabled capabilities.

  • Continuous integration and continuous delivery (CI/CD) for fielded AI-enabled systems requires commensurate T&E.
  • New AI T&E methods, tools, and processes geared toward identifying and addressing AI-related cyberattacks and their effects throughout the testing and operational life cycle of an AI system.
  • The importance of adding instrumentation to fielded AI-enabled systems to monitor their performance over time, including metrics to log and analyze changes, because performance and metrics change with continuous learning (see the monitoring sketch after this list).
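
Because data centrality and continuous learning mean a fielded model’s inputs can silently diverge from the data on which it was trained, drift detection is one common industry trigger for retesting. The following minimal Python sketch is illustrative only, not drawn from the report or any DAF system; the feature layout, sample sizes, and significance threshold are assumptions. It flags input features whose operational distribution differs significantly from the training distribution using a two-sample Kolmogorov-Smirnov test:

```python
# Hypothetical drift check: compare training-time and fielded-operation
# feature distributions. Names and thresholds are illustrative assumptions.
import numpy as np
from scipy import stats

def drift_report(train_features, field_features, alpha=0.01):
    """Return (feature_index, p_value) pairs where operational data differ
    significantly from training data (two-sample Kolmogorov-Smirnov test)."""
    drifted = []
    for i in range(train_features.shape[1]):
        statistic, p_value = stats.ks_2samp(train_features[:, i],
                                            field_features[:, i])
        if p_value < alpha:  # distributions differ beyond chance
            drifted.append((i, p_value))
    return drifted

# Toy usage: feature 1 of the "fielded" data is shifted, so it gets flagged.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(5000, 3))
field = train.copy()
field[:, 1] += 0.5  # simulated domain shift in one feature
print(drift_report(train, field))
```

A drifted feature does not by itself mean the model is wrong, but it is a concrete signal for targeted retesting and, potentially, retraining.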
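
Runtime instrumentation of the kind the last bullet describes can be as simple as logging each adjudicated prediction and alerting on a rolling metric. The sketch below is a hypothetical illustration, not a DAF design; the window size, accuracy floor, and log format are assumptions:

```python
# Hypothetical runtime monitor for a fielded AI model: log predictions,
# maintain a sliding-window accuracy, and alert when it drops below a floor.
import json
import time
from collections import deque

class ModelMonitor:
    def __init__(self, window=500, accuracy_floor=0.90):
        self.window = deque(maxlen=window)   # recent correct/incorrect outcomes
        self.accuracy_floor = accuracy_floor

    def rolling_accuracy(self):
        return sum(self.window) / max(len(self.window), 1)

    def record(self, prediction, truth):
        """Log one adjudicated prediction and check the rolling metric."""
        correct = prediction == truth
        self.window.append(correct)
        entry = {"ts": time.time(), "pred": str(prediction),
                 "truth": str(truth), "correct": correct,
                 "rolling_accuracy": self.rolling_accuracy()}
        print(json.dumps(entry))             # stand-in for a real log sink
        if (len(self.window) == self.window.maxlen
                and self.rolling_accuracy() < self.accuracy_floor):
            self.alert()

    def alert(self):
        # In practice this would notify testers and operators.
        print(json.dumps({"alert": "rolling accuracy below floor",
                          "value": self.rolling_accuracy()}))
```

In a real system, the log sink and alert path would feed the runtime monitoring of operational behavior that the NSCAI recommendation quoted above calls for.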

As the committee explains later in the report, these differences will also drive changes to existing requirements formulation processes for new AI capabilities and AI-enabled systems and how performance metrics are used and evaluated during testing (see Section 5.5).

Defining comprehensive T&E requirements for software-centric capabilities is difficult because the “black box” performance under operational conditions can change continually as the system ingests more data, generating probabilistic or statistically predictable behavior rather than deterministic results. The intersection of these two equally important considerations leads to a fundamental and persistent challenge for AI T&E today: understanding what requirements to test against when performing T&E on new or fielded AI systems.
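One practical consequence of such non-deterministic behavior is that pass/fail criteria must be framed statistically. As a worked illustration (the 95 percent success requirement, the trial count, and the confidence level below are hypothetical assumptions, not report requirements), a tester asks whether observed performance demonstrates the requirement at a stated confidence, rather than whether a single run passed:

```python
# Hypothetical statistical acceptance test: does observed test performance
# demonstrate a >= 95% success requirement at one-sided 95% confidence?
import math

def lower_confidence_bound(successes, trials, confidence=0.95):
    """One-sided lower bound on the true success probability
    (Wilson score interval)."""
    if confidence != 0.95:
        raise ValueError("only the one-sided 95% z-value is tabulated here")
    z = 1.6449                               # z for one-sided 95% confidence
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = p_hat + z**2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    return (center - margin) / denom

# Example: 480 successes in 500 operational-test trials against a 0.95 requirement.
lcb = lower_confidence_bound(480, 500)
required = 0.95
print(f"observed {480/500:.3f}, 95% lower bound {lcb:.3f}, "
      f"requirement {'met' if lcb >= required else 'NOT met'}")
```

In this example the observed success rate (0.96) exceeds the requirement, yet 500 trials are not enough to demonstrate it at 95 percent confidence; framing requirements this way is what drives test sample sizes for probabilistic systems.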

The importance of human-system interfaces was another resounding theme throughout this study. AI’s enormous potential will never be unleashed without changing how humans and machines interact in a more digitized future. While HSI has been studied extensively over the past 50 years, it is evident that the kinds of AI anticipated in the near future will demand a different approach to how humans learn to work with “smart” machines. User interface and user experience (UI/UX) are more important than ever, yet much more analysis is needed to understand how to optimize HSI and assess the performance of human-machine interfaces. Optimizing the integration of humans and AI-enabled machines, which in turn depends on redesigning human-machine interfaces and recalibrating human and machine roles and responsibilities, will be one of the most important and defining features of an AI-enabled future. HSI and human-AI team effectiveness must be considered during the T&E of AI-enabled systems.

As its work proceeded, the committee determined that the AI T&E questions the DAF asked it to consider are intimately and inextricably related to larger issues of AI-based system acquisition within the DAF. The committee realized that only by placing these questions within this larger context can they be properly understood and addressed, with resulting actionable recommendations. This report, therefore, follows and builds on this theme from chapter to chapter.

Chapter 1 reviews the current state of AI in the DAF. It finds that the DAF is in the early stages of incorporating modern AI implementations (see Section 1.3) into its systems and operations. It has not yet acquired modern AI capability within the standard acquisition processes of a major defense acquisition program (MDAP) or major automated information system (MAIS). Chapter 1 also discusses what the committee means by AI and describes several different categories of AI implementation. The chapter reports that DAF AI-related projects have been research and development initiatives, proof-of-concept demonstrations, or upgrades or prototypes integrated into existing systems. It notes that in the absence of AI-specific DoD and DAF standards, current DAF prototyping projects have adopted ad hoc acquisition and T&E processes. These ad hoc methods, by nature, neither scale nor remain consistent; however, the projects reviewed mostly followed sound commercial practices. The chapter ends with a detailed case study of Project Maven, whose lessons learned serve as signposts that inform much of the report’s findings and recommendations. In particular, the chapter highlights how Project Maven, as a pathfinder AI program within DoD, underscored the importance of rigorous T&E, of adopting and adapting industry best practices, and of staying abreast of new ideas from the top AI researchers in academia. Project Maven and other examples from this chapter emphasize the need to retrain AI models to meet unanticipated and changing operational conditions.

Chapter 2 reviews AI and AI-based systems to establish definitions and introduce salient aspects of AI and AI-related technologies. The chapter points out the fundamental importance of data within the machine learning training and testing processes. The chapter presents a historical overview of AI and AI test and evaluation before discussing human-machine teaming. It then discusses in detail how DAF T&E protocols must evolve in response to the rapid pace of AI technology advances. The chapter notes the higher level of trust in existing non-AI-enabled systems, garnered through years of user familiarity with such systems and continual refinement of specialized T&E approaches. It observes that the DAF T&E community is especially adept at assessing and optimizing human-machine interactions for its piloted weapon systems. However, the chapter concludes that DAF T&E practices neglect important aspects of AI-based human-machine interfaces (HMIs). In particular, it concludes that the DAF needs to refocus all of its acquisition, T&E, operation, and sustainment processes on gaining user trust for deployed and emerging AI-enabled systems. It discusses how human-AI interfaces present new challenges as responsibilities shift between humans and intelligent machines and new concepts of operations (CONOPS) emerge. The chapter notes that, lacking experience in a future environment characterized by the widespread fielding of AI-enabled systems, the DAF will only be able to achieve maximum performance by focusing specifically on superior human-systems integration. The chapter concludes by emphasizing the importance of giving more prominence to Human Readiness Levels (HRLs) and UI/UX for AI-enabled military systems and revamping how future military systems are designed for a more digital future.


Chapter 2 offers a key finding, conclusion, and recommendation in the area of HSI.

Finding 2-1: The DAF has not yet developed a standard and repeatable process for formulating and assessing HSI-specific measures of performance and measures of effectiveness.

Conclusion 2-1: The future success of human-AI systems depends on optimizing human-system interfaces. Measures of performance and effectiveness, to include assessments of user trust and justified confidence, must be formulated during system design and development, and assessed throughout test and evaluation and after system fielding.

Recommendation 2-1: Department of the Air Force (DAF) leadership should prioritize human-systems integration (HSI) across the DAF, with an emphasis on formulating and assessing HSI-specific measures of performance and measures of effectiveness across the design, development, testing, deployment, and sustainment life cycle.

Chapter 3 reviews the historical, traditional approach to T&E in the Air Force and then discusses why current practices are insufficient for effective T&E of AI-based systems, particularly the lack of clean lines between developmental test and evaluation (DT&E) and operational test and evaluation (OT&E) for AI capabilities. The chapter observes the lack of formal DoD and DAF AI T&E standards and policies. It notes, however, that seminal specifications from the CDAO are emerging, and the committee expects that the OSD Director of Operational Test and Evaluation (DOT&E) will adapt the CDAO’s frameworks and playbooks and promulgate the new products DoD-wide. The chapter highlights that OSD DOT&E has provided an initial roadmap for redesigning T&E for DoD AI-enabled systems to reflect the substantial differences between the T&E of traditional DoD systems and the T&E of AI capabilities. The chapter also reviews the roles of development, security, and operations (DevSecOps) and artificial intelligence for IT operations (AIOps) and notes the importance of accelerating the use of agile methodologies across the DAF and of designing AIOps architectures as a critical part of the AI life cycle. The chapter notes that over the last decade the commercial sector, particularly the autonomous vehicle industry (see Section 3.2), has employed and refined agile methods to significantly advance the design and deployment of T&E methodologies for safety-critical AI-enabled systems. It also notes that the AIOps solutions designed for commercial applications will not meet the operational requirements of the DAF. This chapter introduces the concept of justified confidence as a progressive measure of trustworthiness and notes that developers, testers, and users should gain justified confidence in AI-enabled systems over time as they become increasingly familiar with system performance limits and behaviors. Next, the chapter discusses AI assurance, another term that, along with justified confidence and trustworthiness, replaces the binary concept of trust when referring to AI-enabled systems. The chapter ends by analyzing operationally oriented risks pertaining to integrating AI capabilities into DAF systems. It emphasizes that when fielding AI-enabled capabilities under operational conditions, DAF end-users, program offices, DevSecOps or AIOps teams, testers, and leaders must use a tailored AI Risk Management Framework (RMF), such as the National Institute of Standards and Technology (NIST) AI RMF, to address a series of risk-related questions at each stage of the AI life cycle. Chapter 3 develops a series of findings, conclusions, and recommendations:

Finding 3-1: The DAF will have similar training infrastructure requirements to support the development and maintenance of AI-enabled systems. The decentralized nature of DAF operations means this training cannot be supported by standard commercial offerings; the committee knows of no commercial off-the-shelf solution that presently supports these requirements.

Recommendation 3-1: The Department of the Air Force artificial intelligence testing and evaluation champion should outline and prioritize these training infrastructure requirements and coordinate with commercial providers to adapt available solutions accordingly.

Finding 3-2: The DAF has not yet developed a standard and repeatable process for integrating, testing, acquiring, developing, and sustaining AI capabilities.

Finding 3-3: OSD DOT&E has not yet published DoD-wide formal AI T&E guidance.

Finding 3-4: There is a lack of clear distinction between DT and OT phases for AI capabilities.

Conclusion 3-1: A lack of formal AI development and T&E guidance represents a considerable challenge for the DAF as AI-based systems emerge.

Recommendation 3-2: Department of the Air Force (DAF) leadership should prioritize artificial intelligence testing and evaluation (AI T&E) across the DAF with an emphasis on a radical shift to the continuous, rigorous technical integration required for holistic T&E of AI-enabled systems across the design, development, deployment, and sustainment life cycle.

Recommendation 3-3: The Department of the Air Force should track the progress of the International Organization for Standardization/International Electrotechnical Commission TR 5469 working group report through the publication process and leverage it as a starting point for adapting its testing and evaluation processes for artificial intelligence–enabled systems.

Finding 3-5: DAF AI contributions to date have focused on computer vision perception and natural language processing algorithms and have yet to fully address system-level T&E.

Recommendation 3-4: The Department of the Air Force should adopt a definition of artificial intelligence (AI) assurance in collaboration with the Office of the Secretary of Defense Chief Digital and AI Office. This definition should consider whether the system is trustworthy and appropriately explainable; ethical in the context of its deployment, with characterizable biases in context, algorithms, and datasets; and fair to its users.

Recommendation 3-5: The Department of the Air Force should develop standardized artificial intelligence (AI) testing and evaluation protocols to assess the impact of major AI-related risk factors.

Chapter 4 proposes appointing a DAF AI T&E champion and explores the challenges in defining comprehensive T&E requirements for AI capabilities compared to traditional DAF weapon systems. The chapter discusses Project Maven as a requirements use case and recommends options for establishing AI T&E requirements and increasing interactions between system designers, developers, testers, program offices, and end-users throughout the AI life cycle. This chapter discusses the value of independent red teams as a critical component of the overarching requirements process and AI test design. Finally, in examining the role of culture and workforce development, the chapter observes the challenge of adapting a highly successful DAF test culture to the era of AI T&E. It emphasizes the immediate education, training, and certification steps that DAF leaders need to take to build and sustain an AI-ready test enterprise workforce.

Chapter 4 contains most of the committee’s recommendations, as follows:

Finding 4-1: Currently, no single person below the level of the secretary or the chiefs of the Air and Space Forces has the requisite authority to implement DAF-wide changes to successfully test and evaluate AI-enabled systems.

Recommendation 4-1: The secretary of the Air Force and chiefs of the Air and Space Forces should formally designate a general officer or senior civilian executive as the Department of the Air Force (DAF) artificial intelligence (AI) testing and evaluation (T&E) champion to address the unique challenges of T&E of AI systems identified above. This AI T&E advocate should have the necessary AI and T&E credentials and should be granted the requisite authorities, responsibilities, and resources to ensure that AI T&E is integrated from program inception and appropriately funded, realizing the DAF AI T&E vision.

Conclusion 4-1: Compared to traditional T&E, AI T&E requires radically deeper continuous technical integration among designers, testers, and operators or end-users.

Recommendation 4-2: The Department of the Air Force should adopt a more flexible approach for acquiring artificial intelligence (AI)-enabled capabilities that whenever possible links proposed solutions to existing joint capabilities integration and development system requirements, and that follows a development, security, and operations or AI for information technology operations/machine learning operations development methodology.

Recommendation 4-3: To the maximum extent possible and where it makes sense operationally, the Department of the Air Force (DAF) should integrate artificial intelligence (AI) requirements into programs of record, via the DAF’s system program offices and program executive officers, and integrate AI testing and evaluation (T&E) into the host weapon system T&E master plan.

Recommendation 4-4: The Department of the Air Force should establish an activity focused on robust artificial intelligence–based systems red-teaming, implement testing against threats the red-teaming uncovers, and coordinate its investments to explicitly address the findings of red-team activities and to augment research in the private sector.

Recommendation 4-5: Building off the 2020 DoD Data Strategy, the Department of the Air Force should update and promulgate its data vision, strategy, and metrics-based implementation plan to explicitly recognize data as a “first-class citizen.” These documents should include plans for the following:

  • Prioritizing investments in computation and storage resources and infrastructure to support artificial intelligence (AI) development
  • Widely expanding data collection and curation for the entire range of AI planning and scoping, designing, training, evaluation, and feedback activities
  • Investing in data simulators for AI training and testing
  • Adapting approaches and architectures developed in private industry for AI-based systems

Recommendation 4-6: The Department of the Air Force (DAF) should inculcate an artificial intelligence (AI) testing and evaluation (T&E) culture espoused by DAF leaders and led by the AI T&E champion. In particular, the DAF and the DAF AI T&E champion should:

  • Provide AI education, training, and, where applicable, certifications to all personnel, from general officers and senior civilian executives to entry-level personnel
  • Institute career-long tracking and management of personnel with specific AI and AI T&E skills
  • Place core AI T&E training under the Air Force Test Center
  • Take advantage of existing AI-related education and training initiatives

Recommendation 4-7: The Department of the Air Force (DAF) should determine the current baseline of artificial intelligence (AI) and AI test and evaluation (T&E) skills across the DAF, develop and maintain a tiered approach to AI and AI T&E-specific education and training, rebalance the test force by shifting people with needed expertise into the test enterprise, and consider placing personnel with AI T&E expertise into operational units.

Chapter 5 evaluates AI technical risks in DAF operational systems. It discusses how the employment of AI-enabled systems can significantly augment the capabilities of the warfighter but notes that there are risks inherent in the use of AI-enabled systems that the DAF must address. This chapter observes that AI-enabled systems are vulnerable to several realistic performance issues, some based on adversarial AI attacks and others based on the risk of deploying the AI-enabled system in operational environments whose features or contexts differ significantly from the representative datasets and intended contexts used to develop the AI capability. The chapter reviews the numerous attacks adversaries could potentially direct toward AI models within operational systems. It observes that while AI models are subject to the same attacks as other software products, they are also vulnerable to AI-unique attack vectors that manipulate the training data, the operational data, or the models themselves. It concludes that the DAF needs a staunch cyber defense as the first line of defense against such attacks. DAF T&E processes should likewise focus on detecting performance degradations and AI model susceptibility to classes of adversarial examples designed to avoid detection. Finally, it describes certain attacks, such as backdoor attacks involving adversarial triggers, that may be undetectable with state-of-the-art test technologies before they are triggered.
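
To make the flavor of such testing concrete, the following minimal sketch measures how a classifier’s accuracy degrades under the well-known fast gradient sign method (FGSM), producing the kind of accuracy-versus-attack-strength curve a T&E team might chart. The model, synthetic data, and attack budgets are all illustrative assumptions, not drawn from the report; a real evaluation would use the fielded model and operationally representative data:

```python
# Hypothetical robustness check: sweep FGSM attack strength against a toy
# classifier and report accuracy at each level (eps=0.0 is clean accuracy).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic two-class data: the label depends on the first two features.
n, d = 2000, 20
x = torch.randn(n, d)
y = (x[:, 0] + x[:, 1] > 0).long()

model = torch.nn.Linear(d, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):                        # quick full-batch training loop
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    opt.step()

def fgsm(model, x, y, eps):
    """Perturb inputs in the direction that most increases the loss."""
    x_adv = x.clone().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

for eps in (0.0, 0.1, 0.3):                 # illustrative attack budgets
    print(f"eps={eps:.1f}  accuracy={accuracy(model, fgsm(model, x, y, eps)):.3f}")
```

Charting clean accuracy against accuracy under increasing attack strength is also one simple way to quantify the robustness-versus-performance trade-off discussed in Finding 5-2 and Recommendation 5-1 below.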

The chapter discusses how academic research and development in this area has become an escalatory battle between attackers and defenders. Consequently, the chapter concludes that it is important for the DAF to employ red-teaming of AI-based system vulnerabilities and to develop mitigations such as operational performance monitors. Furthermore, this chapter notes that DAF T&E has an important role to play in emulating attacks identified by the red teams and testing operational systems against these attacks. Finally, the chapter also discusses how AI models can fail in ways that are unexpected and non-intuitive. It therefore concludes that the DAF should focus on extensive testing to establish justified confidence in deployed models.

Chapter 5 also makes a series of findings, conclusions, and recommendations:

Finding 5-1: Existing research on attacks against AI-enabled systems, and on strategies for mitigating them, considers attacks that require unimpeded access to an underlying AI model. Such attacks are unlikely to be practical given the traditional protections and mitigations inherent in deployed DAF systems.

Finding 5-2: Ongoing research on adversarial attacks against AI-enabled systems focuses on performance on benchmark datasets, which are inadequate for simulating operational attacks. It appears that as robustness to adversarial attacks improves, performance often declines. Even on benchmark datasets, the trade-off between potential performance reduction and improved robustness is not understood. More important, existing defenses are designed to thwart known attacks; such pre-trained defenses are not effective against novel attacks.

Finding 5-3: The impact of adversarial attacks on human-AI-enabled systems is not well understood.

Recommendation 5-1: The Department of the Air Force (DAF) should fund research activities that investigate the trade-offs between model resilience to adversarial attack and model performance under operational conditions. This research should account for a range of known and novel attacks whose specific effects may be unknown, but can be postulated based on state-of-the-art research. The research should explore mitigation options, up to and including direct human intervention that ensures fielded systems can continue to function even while under attack. The DAF should also simulate, evaluate, and generate defenses to known and novel adversarial attacks as well as quantitatively determine the trade-off between potential loss of performance and increased robustness of artificial intelligence–enabled systems.

Recommendation 5-2: The Department of the Air Force (DAF) should apply the DoD Zero Trust Strategy to all DAF artificial intelligence–enabled systems.

Conclusion 5-1: Promising areas of research that will improve the mitigation of adversarial AI include techniques for data sanitization, quantifiable uncertainty, and certifiable robustness.


Chapter 6 turns its attention to new and promising AI techniques and capabilities. It contends that even as the DAF addresses its current needs and opportunities, it must evaluate these emerging AI trends and their likely implications for T&E. Finally, the chapter observes that it is difficult to make precise predictions about which future AI capabilities will be most impactful for Air Force applications, especially given the accelerating rate at which AI technology advances. Nevertheless, it hypothesizes that five areas are particularly likely to impact DAF T&E: foundation models, informed machine learning, generative AI, trustworthy AI, and gaming AI for complex decision-making. It makes findings and recommendations accordingly.

Recommendation 6-1: The Department of the Air Force should focus on the following promising areas of science and technology that may lead to improved detection and mitigation of artificial intelligence (AI) corruption: trustworthy AI, foundation models, informed machine learning, AI-based data generators, AI gaming for complex decision-making, and a foundational understanding of AI.

Finding 6-1: Existing approaches for designing trustworthy AI-enabled systems do not consider the role of humans who interact with the AI-enabled systems.

Recommendation 6-2: The Department of the Air Force should invest in developing and testing trustworthy artificial intelligence (AI)-enabled systems. Warfighters are trained to trust reliable hardware- and software-based advanced weapon systems; similar trust and justified confidence must be developed in AI-enabled systems.

Finding 6-2: Large language foundation models (FMs) exhibit a behavior termed “hallucination,” in which the model output is either nonsensical or inconsistent with the provided input or context. The effects of hallucination are task dependent. There are no metrics to assess the impact of large FMs on the various downstream applications to which they have been applied.

Finding 6-3: Several large FMs are available for single modalities, with language being the most dominant. DAF tasks may involve multi-modal sensing and inference. Large models based on self-supervised learning (SSL) are only recently becoming available for multi-modal paired or unpaired data.

Finding 6-4: Physics-based and other knowledge-informed models have the potential to increase the robustness and computational efficiency of data-driven methods. These models can also help enforce physics- or knowledge-based performance boundaries, which can increase the efficiency of the T&E process. However, to successfully deploy such models, the DAF must ensure that the parameters and assumptions on which they are based remain valid during operations, which requires additional attention to operational T&E.

Recommendation 6-3: The Department of the Air Force should assess the capabilities of data generators to enhance testing and evaluation in the context of relevant applications.

Finding 6-5: Recent and anticipated advances in AI gaming technologies will enable the Air Force to build systems that are more capable than ever before and that involve AI in more sophisticated ways. This increased system complexity, however, will make the teaming relationship between the human and AI elements much more interrelated and complex, thereby posing additional challenges for effective T&E.
