Defense Software for a Contested Future: Agility, Assurance, and Incentives (2025)


5

Machine Learning, Artificial Intelligence, and Software Systems

Machine learning (ML) and artificial intelligence (AI)–based technologies have recently made dramatic advances in the breadth and depth of their capabilities. A major breakthrough was the 2017 introduction of the transformer architecture1 for constructing deep neural networks (DNNs). This was followed in 2018 by OpenAI's release of GPT-1, a large language model (LLM) based on the transformer architecture.2 LLMs began attracting widespread public attention with the late-2022 release of the ChatGPT chatbot, which was based on the GPT-3.5 model.

Today, ML/AI is increasingly being used both as part of software systems and as a tool to help engineer those systems. Despite the obvious importance of ML/AI, the committee's recommendations for achieving nimble, high-assurance systems do not prescribe how to use it, for two reasons. First, it is difficult to provide solid and specific advice: ML/AI technology is advancing so rapidly that it is hard to predict when, how, and where it might plateau and, therefore, what the most effective ways of using it will be. Second, the committee believes that its non-ML/AI-specific recommendations will remain applicable even as ML/AI advances. Its general advice (e.g., "use automatically verified evidence, when possible, not manually vetted documents") applies in essentially any future. Some specific advice may also apply, at least in the short term (e.g., "use memory-safe languages"), while other specific advice may need to be revisited.

___________________

1 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, 2017, “Attention Is All You Need,” Advances in Neural Information Processing Systems 30.

2 A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, 2018, “Improving Language Understanding by Generative Pre-Training,” OpenAI.


For example, extensible software architectures may be less important if ML/AI becomes proficient and reliable at code refactoring (it still has some way to go3).

This chapter briefly reviews the history and state of the art of modern ML/AI as applied both to software engineering tasks and to implementing functionality in software systems. For each application area, it summarizes the committee's view of the promise of, and challenges to, using ML/AI now and in the near future.

MACHINE LEARNING/ARTIFICIAL INTELLIGENCE–ENHANCED SOFTWARE ENGINEERING

In the commercial world, ML/AI has quickly come into regular use for software engineering tasks. Indeed, it seems likely that most software engineering tasks carried out in the future will involve extensive use of ML/AI tools, which promise dramatic improvements in efficiency. Nevertheless, these tools carry risks, at least currently, and it remains to be seen how those risks are best mitigated.

In 2021, GitHub released Copilot, an ML-based coding assistant marketed as an "AI pair programmer"4 able to generate code from a natural-language specification and to explain code in natural language. Initially, Copilot was powered by OpenAI Codex,5 a modified, production version of GPT-3 (released in June 2020), a model with 175 billion parameters that was pre-trained on more than 1 trillion words taken from Internet sources. GitHub has since replaced Codex in Copilot with more advanced LLMs, adopting GPT-4 in November 20236 and adding other vendors' models starting in late 2024.7

The number and capabilities of ML/AI coding assistants have expanded rapidly. By one estimate in March 2025, there were 11 notable ML/AI-powered coding/software engineering tools.8

___________________

3 B. Liu, Y. Jiang, Y. Zhang, N. Niu, G. Li, and H. Liu, 2025, "Exploring the Potential of General Purpose LLMs in Automated Software Refactoring: An Empirical Study," Automated Software Engineering 32(1):26.

4 N. Friedman, 2021, "Introducing GitHub Copilot: Your AI Pair Programmer," GitHub Blog, https://github.blog/news-insights/product-news/introducing-github-copilot-ai-pair-programmer.

5 P. Krill, 2021, “OpenAI Offers API for GitHub Copilot AI model,” InfoWorld.

6 GitHub, 2023, "GitHub Copilot – November 30th Update," GitHub Blog, https://github.blog/changelog/2023-11-30-github-copilot-november-30th-update.

7 T. Warren, 2024, “GitHub Copilot Will Support Models from Anthropic, Google, and OpenAI,” The Verge, https://www.theverge.com/2024/10/29/24282544/github-copilot-multi-model-anthropic-google-open-ai-github-spark-announcement.


Start-ups, including Bolt.new,9 Codeium,10 Cursor,11 Lovable,12 and Magic,13 raised hundreds of millions of dollars in funding within 12 months.14 ML/AI tools initially focused on generating small, simple code snippets or functions from text prompts but are now able to generate larger code units, including nearly complete applications.15 They can also generate test suites and documentation from code, review code for defects and vulnerabilities, refactor and optimize existing code,16 and (semi)automatically determine the root cause of run-time failures.17 A recent meta-study found that developers using Copilot completed 26 percent more tasks on average,18 with individual studies reporting higher numbers.

Software engineering organizations are rapidly adopting ML/AI tools. According to the annual Stack Overflow developer survey, filled out by tens of thousands of (mostly professional) developers, 43.78 percent of respondents were using some form of ML/AI coding assistant in 2023,19 while 61.8 percent were using one in 2024, with an additional 14 percent planning to use one.20 Copilot was reported in early 2024 to have a user base of 1.3 million subscribers and more than 50,000 businesses.21 Start-up companies are in the vanguard of exploring the most effective use of ML/AI-based software engineering. In a March 2025 conversation among leaders at the Y Combinator (YC) technology venture fund, YC managing partner Jared Friedman said that one-fourth of the start-ups in the current (W25) batch had 95 percent of their codebases generated by ML/AI.22

___________________

8 AI Research and Development Team, 2025, “Best AI for Coding in 2025: 25 Developer Tools to Use (or Avoid),” Pragmatic Coders, https://www.pragmaticcoders.com/resources/ai-developer-tools.

9 Bolt.new, 2025, Post on X, January 22, https://x.com/boltdotnew/status/1882106655258894390.

10 K. Wiggers, 2024, “GitHub Copilot Competitor Codeium Raises $150M at a $1.25B Valuation,” TechCrunch, https://techcrunch.com/2024/08/29/github-copilot-competitor-codeium-raises-150m-at-a-1-25b-valuation.

11 M. Temkin, 2024, “In Just 4 Months, AI Coding Assistant Cursor Raised Another $100M at a $2.6B Valuation Led by Thrive, Sources Say,” TechCrunch, https://techcrunch.com/2024/12/19/in-just-4-months-ai-coding-assistant-cursor-raised-another-100m-at-a-2-5b-valuation-led-by-thrive-sources-say.

12 M. Butcher, 2025, “Sweden’s Lovable, an App-Building AI Platform, Rakes in $15M After Spectacular Growth,” TechCrunch, https://techcrunch.com/2025/02/25/swedens-lovable-an-app-building-ai-platform-rakes-in-16m-after-spectacular-growth.

13 K. Wiggers, 2024, “Generative AI Coding Startup Magic Lands $320M Investment from Eric Schmidt, Atlassian and Others,” TechCrunch, https://techcrunch.com/2024/08/29/generative-ai-coding-startup-magic-lands-320m-investment-from-eric-schmidt-atlassian-and-others.

14 I. Mehta, 2025, “A Quarter of Startups in YC’s Current Cohort Have Codebases That Are Almost Entirely AI-Generated,” TechCrunch, https://techcrunch.com/2025/03/06/a-quarter-of-startups-in-ycs-current-cohort-have-codebases-that-are-almost-entirely-ai-generated.

15 Examples include GitHub Spark (https://githubnext.com/projects/github-spark), Nectry (https://nectry.com), and AWS Partyrock (https://partyrock.aws).

16 A. Shypula, A. Madaan, Y. Zeng, U. Alon, J. Gardner, M. Hashemi, G. Neubig, P. Ranganathan, O. Bastani, and A. Yazdanbakhsh, 2023, “Learning Performance-Improving Code Edits,” arXiv preprint arXiv:2302.07867.

17 K. Levin, N. van Kempen, E.D. Berger, and S.N. Freund, 2024, "ChatDBG: An AI-Powered Debugging Assistant," arXiv preprint arXiv:2403.16354.

18 Z.K. Cui, M. Demirer, S. Jaffe, L. Musolff, S. Peng, and T. Salz, 2024, “The Effects of Generative AI on High Skilled Work: Evidence from Three Field Experiments with Software Developers,” Available at SSRN 4945566.

19 Stack Overflow, 2023, “Sentiment and Usage,” https://survey.stackoverflow.co/2023/#sentiment-and-usage-ai-select.

20 Stack Overflow, 2024, “Sentiment and Usage,” https://survey.stackoverflow.co/2024/ai#sentiment-and-usage.

21 D. Ramel, 2024, “Copilot by the Numbers: Microsoft’s Big AI Bet Paying Off,” Visual Studio Magazine, https://visualstudiomagazine.com/Articles/2024/02/05/copilot-numbers.aspx.


Although there may be uncertainty about whether that percentage refers to the fraction of code generated by ML/AI or to the fraction of codebases containing some ML/AI-generated code, the broader trend is corroborated by reports that start-ups are taking less venture investment because they do not need to hire as many people.23

General adoption rates are likely to rise further, thanks to increasing capability. In a February 2025 paper,24 OpenAI found that its general-purpose o3 model achieved a CodeForces rating25 in the 99.8th percentile, compared to the 11th percentile achieved by the May 2024 GPT-4o model. The key advance is the use of end-to-end reinforcement learning, in which effective test-time reasoning strategies are not hand-engineered but instead emerge from training. At the same time, there is still significant room for improvement: as of March 2025, the best-performing ML/AI on SWE-bench, a benchmark comprising broader software engineering tasks, achieved a 64.6 percent success rate.26

How might ML/AI make it easier or harder to build high-assurance and agile software systems? Powerful ML/AI capabilities will make such systems easier to build, but there is still an important role for human developers. First, despite their overall impressive capabilities, ML/AI tools often generate code containing bugs.27 Even when generated code meets functional requirements, it may contain security vulnerabilities. This should not be surprising, given that human code authors often make vulnerability-introducing coding mistakes, and code-generation models are often trained on unvetted human-authored code. A 2023 study by Perry et al.28 found that participants who had access to an AI assistant based on OpenAI's codex-davinci-002 model wrote significantly less secure code than those without access. Pearce et al.29 found in a study of GitHub Copilot that upwards of 40 percent of the generated code was vulnerable to security issues within the top 25 of MITRE's Common Weakness Enumeration (CWE), and Fu et al.30 found that 24–29 percent of ML/AI-generated code committed to GitHub open-source projects was similarly vulnerable. This means that developers need to review ML/AI-generated code and make sure it is correct.

___________________

22 I. Mehta, 2025, “A Quarter of Startups in YC’s Current Cohort Have Codebases That Are Almost Entirely AI-Generated,” TechCrunch.

23 E. Griffith, 2025, “A.I. Is Changing How Silicon Valley Builds Start-Ups,” New York Times.

24 A. El-Kishky, A. Wei, A. Saraiva, B. Minaiev, D. Selsam, D. Dohan, F. Song, et al., 2025, “Competitive Programming with Large Reasoning Models,” arXiv preprint arXiv:2502.06807.

25 CodeForces is a coding competition: https://codeforces.com. The OpenAI study used prior problems from the contest’s Division 1, ensuring all test contests occurred after the data cut-off for model training and tuning.

26 See the SWE-bench website at https://www.swebench.com, accessed March 26, 2025.

27 B.K. Deniz, C. Gnanasambandam, M. Harrysson, A. Hussin, and S. Srivastava, 2023, “Unleashing Developer Productivity with Generative AI,” McKinsey & Company, https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/unleashing-developer-productivity-with-generative-ai#.

28 N. Perry, M. Srivastava, D. Kumar, and D. Boneh, 2023, “Do Users Write More Insecure Code with AI Assistants?” Pp. 2785–2799 in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ‘23), Association for Computing Machinery, https://doi.org/10.1145/3576915.3623157.

29 H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, 2025, "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions," Communications of the ACM 68(2):96–105.

30 Y. Fu, P. Liang, Z. Li, M. Shahin, J. Yu, and J. Chen, 2025, "Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study," ACM Transactions on Software Engineering and Methodology.


YC general partner Diana Hu said that even if product builders rely heavily on AI, one skill they will still need is reading code and finding bugs,31 perhaps with the assistance of static analysis tools; indeed, in the two studies cited above, the authors used the CodeQL static analyzer to identify vulnerabilities.
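To make this failure mode concrete, the following minimal Python sketch shows a hypothetical instance of the kind of weakness the cited studies flag (CWE-89, SQL injection, a member of the CWE Top 25), alongside the standard parameterized fix. The table schema and function names are illustrative and are not drawn from those studies.

```python
import sqlite3

def find_user_vulnerable(conn: sqlite3.Connection, name: str):
    # VULNERABLE (CWE-89): untrusted input is spliced into the SQL string,
    # so name = "x' OR '1'='1" returns every row in the table.
    return conn.execute(
        f"SELECT id, name FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(conn: sqlite3.Connection, name: str):
    # SAFE: a parameterized query keeps the data out of the SQL grammar.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)).fetchall()
```

A static analyzer flags the first pattern because untrusted input reaches the query string; whether the code was authored by a human or generated by an ML/AI assistant is irrelevant to that analysis.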

More broadly, even as ML/AI tools' capabilities improve, a human developer must still decide how to use those capabilities—that is, must specify what software to build. In traditional software engineering processes, specifications boil down to requirements, which are often written in natural language, and one could imagine ML/AI tools generating software starting from such requirements.32 However, the formal methods community has regularly pointed out that natural-language requirements often lack precision, so software that meets the requirements may nevertheless not perform as anticipated. While this situation was observed for human-authored software, there is no reason to think that ML/AI-authored software will be better equipped to handle such inherent ambiguity.

A solution to the problem of imprecise requirements is to leverage more formal specifications, but these have not seen widespread adoption, at least in part because many practitioners find them difficult to write.33 Ultimately, an iterative, agile approach is likely to win out, in which the human developer works with the ML/AI tool to converge on the right requirements and then generates larger applications from smaller components (which, at least today, are better suited to reliable ML/AI generation34). Along the way, code-level analysis and human review, aimed at understanding the full behavior (including bugs and vulnerabilities) of the software, are likely to be essential. The full understanding obtained from expert human review may be especially important for establishing a good software architecture on which future capabilities can be nimbly built.
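As one illustration of a middle ground between prose requirements and full formal specifications, the following sketch uses the Hypothesis property-based testing library for Python (an assumed tool choice; any comparable framework would serve) to turn a vague requirement, "sort the records," into machine-checkable properties that can gate an ML/AI-generated implementation.

```python
# A minimal sketch: pin down an ambiguous requirement as executable properties.
from collections import Counter
from hypothesis import given, strategies as st

def sort_records(xs: list[int]) -> list[int]:
    return sorted(xs)  # stand-in for an ML/AI-generated implementation

@given(st.lists(st.integers()))
def test_output_is_ordered(xs):
    # Property 1: adjacent elements of the output are in nondecreasing order.
    ys = sort_records(xs)
    assert all(a <= b for a, b in zip(ys, ys[1:]))

@given(st.lists(st.integers()))
def test_output_is_a_permutation(xs):
    # Property 2: the output contains exactly the input's elements.
    assert Counter(sort_records(xs)) == Counter(xs)
```

Properties like these are weaker than full formal specifications but far more precise than natural language, and they can be refined incrementally in the agile loop described above.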

It is also possible that ML/AI can help with the agile development of requirements. In February 2025, Andrej Karpathy (a cofounder of OpenAI) introduced the term "vibe coding" to refer to programming solely through natural-language instructions given to an LLM (both to build and to debug/improve the system), never looking at the resulting code. This is possible because "LLMs (e.g., Cursor Composer with Sonnet) are getting too good."35 While it would be irresponsible and ineffective to rely on a vibe-coded system, vibe coding may be an effective way to develop good requirements.

___________________

31 I. Mehta, 2025, “A Quarter of Startups in YC’s Current Cohort Have Codebases That Are Almost Entirely AI-Generated,” TechCrunch.

32 B. Wei, 2024, “Requirements Are All You Need: From Requirements to Code with LLMs,” In 2024 IEEE 32nd International Requirements Engineering Conference (RE), IEEE.

33 J. Bruel, S. Ebersold, F. Galinier, M. Mazzara, A. Naumchev, and B. Meyer, 2021, “The Role of Formalism in System Requirements,” ACM Computing Surveys (CSUR).

34 B.K. Deniz, C. Gnanasambandam, M. Harrysson, A. Hussin, and S. Srivastava, 2023, “Unleashing Developer Productivity with Generative AI,” McKinsey & Company, https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/unleashing-developer-productivity-with-generative-ai#.

35 A. Karpathy, 2025, Post on X, February 2, https://x.com/karpathy/status/1886192184808149383.


One could vibe-code a system that seems to do what one wants and then derive from it durable requirements from which a higher-quality system could be built (with human help).

Machine Learning and Artificial Intelligence–Based Components of Software Systems

In the decade prior to the development of LLMs, ML/AI began to be used with greater frequency in some key components of software systems. These uses followed the rapid development of DNNs starting in the late 2000s, especially convolutional neural networks, which were boosted in capability by the vast amount of training data available via public sources (e.g., text, images, and videos on the Internet) and by the emergence of fast graphics processing units (GPUs) to accelerate training.36 Deep learning–based approaches started to prove far more effective than classical ML/AI methods at tasks in computer vision (e.g., object classification and recognition), natural-language processing (e.g., language translation and speech-to-text), and planning. As a result, these capabilities started to be used in a variety of contexts and devices, including smartphones, self-driving cars,37 malware detection,38 and aircraft collision avoidance systems.39 The emergence of LLMs, whose capabilities were effectively demonstrated by ChatGPT in the role of a chatbot, increased interest in interactive ML/AI-based assistants that perform automated question-and-answer–style customer service tasks or act as natural-language interfaces to complex systems. Today, LLMs are being deployed as agents, which are given some degree of autonomy to carry out complex, user-specified tasks.40 Agents can interact with "tools" or other agents to get information, check their results, and initiate actions.
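The agent pattern just described reduces to a simple control loop. The following Python sketch is purely illustrative: call_llm is a hypothetical, scripted stand-in for any chat-completion API, and the single-entry tool registry (lookup_part) is invented for the example, not any vendor's interface.

```python
import json

# Toy tool registry; lookup_part is a hypothetical stand-in for a real service.
TOOLS = {
    "lookup_part": lambda part_id: {"part": part_id, "stock": 42},
}

def call_llm(messages):
    """Hypothetical model call, scripted so the sketch runs end to end:
    first request a tool call, then answer using the tool's result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "lookup_part", "args": {"part_id": "A-7"}}
    observation = json.loads(messages[-1]["content"])
    return {"answer": f"Part A-7 stock level: {observation['stock']}"}

def run_agent(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if "answer" in reply:                           # model is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # initiate an action
        messages.append({"role": "tool",                # feed the result back
                         "content": json.dumps(result)})
    return "step budget exhausted"

print(run_agent("How many A-7 parts are in stock?"))
```

The bounded step budget is one simple way to limit the autonomy such loops grant the model.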

As with using ML/AI to build high-assurance and agile systems, using ML/AI components as part of high-assurance and agile systems carries both promise and risk. The promise is incredible (and increasing) capability for critical functions at lower cost. The risk is that machine-learned capabilities may not provide safe or reliable results, and traditional software engineering practices involving design and code reviews may not provide much evidence to the contrary. This is because a DNN's behavior is encoded in billions of parameters whose values are set by training rather than by human design, so a human auditor cannot simply read the "code" of the ML model to make sense of it and assess whether it is safe or reliable.

___________________

36 Wikipedia contributors, 2025, “Deep Learning,” Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Deep_learning.

37 M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, et al., 2016, “End to End Learning for Self-Driving Cars,” arXiv preprint arXiv:1604.07316.

38 Z. Yuan, Y. Lu, Z. Wang, and Y. Xue, 2014, "Droid-Sec: Deep Learning in Android Malware Detection," In Proceedings of the 2014 ACM Conference on SIGCOMM.

39 K.D. Julian, J. Lopez, J.S. Brush, M.P. Owen, and M.J. Kochenderfer, 2016, "Policy Compression for Aircraft Collision Avoidance Systems," In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC).

40 Anthropic, 2024, “Building Effective Agents,” Anthropic, https://www.anthropic.com/engineering/building-effective-agents.


Indeed, the way that a trained LLM or other DNN actually works is often hard to comprehend—that is, its behavior is not interpretable.41 One surprising feature of DNNs is that they exhibit discontinuity, meaning that a small perturbation to the input to a DNN can have a potentially large impact on its output.42 As a result, a DNN-based object recognition or planning system in a self-driving car may behave incorrectly if it receives input it has not seen before, potentially resulting in a collision or other catastrophic failure. Worse, inputs that induce wrong behavior can be crafted adversarially43 by perturbing an input in a manner that is imperceptible to a human but massively affects the DNN. Such adversarial inputs often work equally well on multiple DNNs carrying out similar tasks but trained differently on different data.
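The adversarial construction of footnote 43, the fast gradient sign method, takes only a few lines of code. The sketch below, assuming PyTorch and any differentiable classifier named model, nudges each input dimension by at most eps in the direction that increases the loss; for small eps the change is imperceptible to a human, yet it is often enough to flip the DNN's output.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x: torch.Tensor, label: torch.Tensor, eps: float = 0.01):
    """Return an adversarial copy of x within an L-infinity ball of radius eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)  # loss against the true label
    loss.backward()                          # gradient of loss w.r.t. the input
    x_adv = x + eps * x.grad.sign()          # step each pixel by +/- eps
    return x_adv.clamp(0.0, 1.0).detach()    # stay within valid pixel range
```

That the perturbation direction can be computed from gradients is also why such inputs often transfer between DNNs trained differently on different data.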

LLMs, as more general and powerful tools, are subject to additional failure modes, most notably hallucinations. An LLM-based response, perhaps given as an answer to a question posed to an LLM-based chatbot, is called a hallucination when it has the appearance of a correct statement (in terms of the answer’s tone, the presence of numbers, and the inclusion of supporting explanations) but is in actuality incorrect.44 Incorrectness could manifest as a factual error or as a failure to properly consider provided context,45 depending on how the LLM was trained, prompted, etc. In general, chatbots are more than just the underlying LLM, and their failure modes are diverse.46
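Some hallucinations are cheap to catch mechanically. For code suggestions of the kind described in footnote 44, the following minimal sketch (assuming the generated code is Python) flags imports that do not resolve in the target environment, one concrete and checkable form of hallucination, before the code is ever run.

```python
import ast
import importlib.util

def unresolved_imports(source: str) -> list[str]:
    """Return imported module names in `source` that cannot be found locally."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Check only the top-level package to avoid import side effects.
            if importlib.util.find_spec(name.split(".")[0]) is None:
                missing.append(name)
    return missing

# Example: a hallucinated library name is flagged without executing anything.
print(unresolved_imports("import numpyy"))  # -> ['numpyy']
```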

Technologists are actively exploring mitigations to these reliability and security risks. Most directly, one can include adversarially generated examples in DNN training so as to reduce discontinuity and enhance robustness,47 and input pre-processing and output post-processing are also sometimes effective. For LLMs, similar techniques apply.48

___________________

41 Wikipedia, 2025, “Explainable Artificial Intelligence,” Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Explainable_artificial_intelligence.

42 C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, 2013, “Intriguing Properties of Neural Networks,” arXiv preprint arXiv:1312.6199.

43 I.J. Goodfellow, J. Shlens, and C. Szegedy, 2014, “Explaining and Harnessing Adversarial Examples,” arXiv preprint arXiv:1412.6572.

44 In the context of LLM-based code authoring for software engineering, hallucinations manifest as code suggestions that reference non-existent variables or libraries, use non-existent language features, or use features and functionality incorrectly.

45 L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, et al., 2025, “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions,” ACM Transactions on Information Systems 43(2):1–55.

46 A. Vakulov, 2023, “Understanding and Mitigating the Security Risks of Chatbots,” Cyber Security Hub, https://www.cshub.com/attacks/articles/understanding-and-mitigating-the-security-risks-of-chatbots; E. Conover, 2024, “AI Chatbots Can Be Tricked into Misbehaving. Can Scientists Stop It?” Science News, February 1, https://www.sciencenews.org/article/generative-ai-chatbots-chatgpt-safety-concerns.

47 W. Zhao, S. Alwidian, and Q.H. Mahmoud, 2022, “Adversarial Training Methods for Deep Learning: A Systematic Review,” Algorithms.

48 [x]cube LABS, 2024, “Adversarial Attacks and Defense Mechanisms in Generative AI,” [x]cube LABS (blog). https://www.xcubelabs.com/blog/adversarial-attacks-and-defense-mechanisms-in-generative-ai.


Formal methods–style automated reasoning and analysis can prove that particular DNNs are robust,49 or that they enjoy other properties,50 although such approaches can have difficulty scaling to large networks. Formal methods can also be used to vet training data, ensuring that ML models are trained on correct examples, and to evaluate whether ML model results are correct, initiating a refinement loop if they are not.51
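To give a flavor of the abstract-interpretation idea behind tools like Ai2 (footnote 49), the following NumPy sketch propagates the interval [x - eps, x + eps] through the affine and ReLU layers of a toy fully connected network. If the target logit's lower bound exceeds every other logit's upper bound, no perturbation within eps can change the classification. This is a deliberately coarse simplification; real verifiers use much tighter abstract domains, which is one reason scaling is hard.

```python
import numpy as np

def affine_bounds(lo, hi, W, b):
    """Sound interval bounds for y = W @ x + b given x in [lo, hi]."""
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)  # split weights by sign
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def certify_robust(layers, x, eps, target):
    """True if every input within eps (L-infinity) keeps the target class."""
    lo, hi = x - eps, x + eps
    for i, (W, b) in enumerate(layers):
        lo, hi = affine_bounds(lo, hi, W, b)
        if i < len(layers) - 1:                  # ReLU on hidden layers only;
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)  # ReLU is monotone
    return all(lo[target] > hi[j] for j in range(len(lo)) if j != target)
```

The check is sound but incomplete: a False result means only that this coarse analysis could not prove robustness, not that an adversarial input actually exists.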

It remains to be seen in what ways, and how quickly, ML/AI capabilities will improve, and whether the mitigations listed above will make them safer. In any case, the committee believes that the recommendations in this report, in terms of assessing assurance and ensuring reliability, should continue to apply.

Finding 5-1: ML/AI capabilities are advancing rapidly. In industry, they are being used both as part of software systems and, increasingly pervasively, as part of their engineering process. Assurance practices for human-authored software, and software with human-authored components, apply to ML/AI-enhanced software as well.

Recommendation 5-1: The Undersecretary of Defense for Research and Engineering should sponsor a follow-on study to more thoroughly consider the likely future risks and benefits of machine learning and artificial intelligence, both as technology for implementing systems and in applications intended to automate or improve the efficiency of software development and verification.

___________________

49 T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and M. Vechev, 2018, “Ai2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation,” In 2018 IEEE Symposium on Security and Privacy (SP).

50 G. Katz, C. Barrett, D.L. Dill, K. Julian, and M.J. Kochenderfer, 2017, “Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks,” In Computer Aided Verification: 29th International Conference Proceedings, Springer International Publishing.

51 S. Roberts, 2024, “Move Over, Mathematicians, Here Comes AlphaProof,” New York Times.
