Ben Shneiderman, University of Maryland (emeritus), and planning committee co-chair, opened the June 20, 2024, session along with planning committee member Abigail Jacobs, University of Michigan. Alex Givens, Center for Democracy & Technology, gave the keynote address for the workshop session. She emphasized attention to the interplay of human, organizational, and technical factors involved in the design, development, and use of AI systems. She highlighted that the impact of AI can only be assessed accurately if systems are understood within the societal context in which they are developed. Givens asserted that AI risk management guidance must be designed with users in mind. She identified two key ingredients for success: thoroughly identifying and documenting how various social and technical factors could interact to create risks and ensuring the right people are involved in that process. Givens cited the work of Miranda Bogen and Amy Winecoff, who are studying these factors.1
Givens illustrated the value of better measurement approaches using an example. New York City’s Local Law 144, the first legislation in the country requiring bias audits of automated hiring tools,2 was enacted, according to Givens, to check whether hiring tools have a disparate impact based on race, ethnicity, or sex. When advocates questioned why the law did not also require audits for discrimination based on other legally protected characteristics such as disability, the response was, according to Givens, that disability discrimination would be too hard to measure. More robust measurement methods would remedy the situation, Givens observed.
___________________
1 A. Winecoff and M. Bogen, 2024, Improving Governance Outcomes Through AI Documentation, Center for Democracy and Technology, https://cdt.org/insights/report-improving-governance-outcomes-through-ai-documentation-bridging-theory-and-practice.
2 Local Law 144 of 2021, Council Int. No. 1894-A (2021).
Givens called for better representation of affected communities in the development of risk assessments and suggested the establishment of advisory councils from affected groups as a possible solution. Furthermore, she noted that increased transparency on policies and practices is needed if communities are to stay informed and be able to respond.
Givens observed that the “human-in-the-loop” aspect of AI risk management is especially important given that most AI users are not AI experts. One important issue that risk management guidance must address is how limited visibility into the detailed workings of AI systems hinders effective human decision making. Users, according to Givens, will need guidance and training to know when to override an AI recommendation. In concluding her talk, Givens emphasized the need for AI governance approaches that account for how people think and behave. With a combined consideration of human behavior and the properties of AI systems, governance can be designed to better support effective interactions with AI systems and sound decision making.
In her opening remarks, Jacobs highlighted the planning committee’s motivation for designing the second event around evaluation. Jacobs noted that evaluation is deeply intertwined with human and organizational factors because these factors indicate where to focus attention. She cautioned that poorly chosen or implemented evaluation approaches can obscure the disparate impacts of AI systems on particular groups or individuals and suggested that much can be learned from existing approaches to audits and other forms of independent oversight.
William Isaac, Google DeepMind and planning committee member, moderated the first panel of this workshop event. He began by asking the panelists how evaluations should be assessed and what trade-offs or challenges are associated with different evaluation methods.
Miranda Bogen, Center for Democracy & Technology, stated that measurement choices shape three aspects of risk analysis: what qualifies as a risk, how severe that risk is deemed to be, and how potential interventions are evaluated. A key challenge exists in balancing the need for consistent measurements across different contexts and organizations while allowing flexibility for context-specific measurements that can evolve over time as understanding improves. Bogen also noted two recurring challenges: the lack of transparency in how measurements are implemented and questions about whether the correct metrics are being used.
Rishi Bommasani, Stanford University, explained that evaluation is used to measure current technological capabilities and risks as well as to drive progress through iterative improvements. He emphasized that evaluation is particularly crucial for understanding complex technologies whose capabilities and risks might otherwise remain obscured. The resulting documentation can help in building public understanding and fostering more standardized approaches. Bommasani also noted that multiple actors rely on evaluation, including the AI model developers themselves as well as independent third parties.
Laura Weidinger, Google DeepMind, described an empirical review of current AI evaluation practices, which revealed that evaluations predominantly focus on model-centric assessments. These evaluations typically examine technical artifacts in isolation rather than conduct comprehensive safety assessments. Pointing to the success of system safety engineering, notably in aerospace engineering, Weidinger advocated for more holistic and contextualized evaluation approaches that account for human and societal factors and thus transcend simple benchmarking. According to Weidinger, this deeper, more comprehensive evaluation approach would improve the ability to predict and anticipate future developments in and potential impacts of AI systems.
Hanna Wallach, Microsoft Research, began with a fundamental observation about system development: developers aim to create systems that fulfill stakeholder requirements while avoiding undesired behaviors. Measurement and evaluation, according to Wallach, serve as the critical tools for verifying whether systems are meeting these intended goals and avoiding unwanted outcomes. Wallach outlined several key challenges in achieving these evaluation goals. She highlighted the complexity of managing diverse stakeholder groups and their competing interests, while noting that current evaluation methods tend to focus narrowly on models rather than complete systems. She observed that when examining entire systems, as is the case with her own research, one finds that some risks are difficult to measure concretely unless one adopts measurement modeling techniques from the social sciences.
Panelists discussed what they saw as the current targets of measurements, considering trustworthiness and explainability as well-known targets that have lost popularity. Wallach offered that some researchers have stopped measuring trustworthiness and transparency because they have found more specific properties, model behaviors, or uses in practice to be of greater interest. She also observed that the widespread availability of certain AI applications now allows many different groups to examine their real-world impacts directly. Last, according to Wallach, some researchers have concluded that no universal notions of trustworthiness or interpretability exist and have shifted away from employing those concepts. Weidinger argued that people have not stopped evaluating trustworthiness, but the vocabulary has shifted from “trustworthiness” to “safety”
with growing recognition of the impacts on public safety. Furthermore, Weidinger said, explainability and interpretability are resurfacing as important topics. Bogen echoed Weidinger’s sentiment regarding a shifting vocabulary and emphasized the importance of keeping terminology specific and distinct.
Prompted by Isaac, panelists commented on approaches to prioritizing risks and managing tensions among them, especially when not all risks can be measured concretely. Weidinger suggested prioritizing risks based on their potential impact on vulnerable populations and large groups rather than their measurability. Weidinger went further to stress that one should not fall into the trap of measuring the risks that are easier to measure. For example, although it may be hard to measure the addictive properties of chatbots, such measurement is important given the potential for real harm. Weidinger outlined some techniques other than the usual benchmarking and red teaming for measuring these “fuzzier” impacts: user testing, laboratory-based human–AI interaction testing, surveying, and qualitative interviewing.
Wallach echoed Weidinger’s observation that what is measured is usually what is easy to measure rather than what is most important to measure. Wallach attributed this discrepancy to preferences for evaluations that promise to yield the most information and for measurements of concepts that can be measured precisely. Wallach argued that gaining some information about potentially significant risks is far better than learning nothing. Taking a step back to remark on the validity of measurement, Wallach observed that many assumptions and decisions underlie measurement and that one cannot interpret measurements without taking those assumptions into account.
Bogen characterized measurement as a decision-making tool, observing that human and organizational dynamics shape decision making and thus influence the purpose of measurement, its limitations, and the opportunities it affords. Decision makers frequently encounter conflicting or inconsistent information when evaluating options. They often must assign numerical values to subjective or complex issues so that these can be compared directly against concrete metrics, although proper context is frequently lacking. Bogen noted that this challenge is compounded by the fact that decisions must typically be made quickly by individuals who cannot develop deep expertise in all aspects of the matters they are evaluating.
Isaac next asked the panelists to consider prioritization in risk assessment, including how to scope evaluations, how to consider stakeholder perspectives, and how priorities may change when assessing “fuzzy” factors for which validated measurement approaches may not exist. Homing in on the process of prioritization, he asked about the impact of factors such as demand for evaluation and how organizational or public stakeholders think about scoping risk. He also asked about methods for scoping and identifying risks and for validating the assumptions and decisions used to identify them. Furthermore, he asked panelists to discuss how the process shifts when they take a fuzzy concept, measure it, and then assess the validity of that measurement.
Wallach described a prioritization approach whereby possible system events, or things that the system might do given some input or context, are broken into “constituent components of risk.” Each component, according to Wallach, is then considered in light of the needs and desires of the system’s stakeholders. Risks can then be prioritized based on the product of the level of undesirability and the probability of that event occurring. Wallach noted that notions of measurement reliability and construct validity come into play when measurements and background assumptions are revisited and reviewed.
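The arithmetic behind this kind of prioritization can be illustrated with a minimal sketch. The component names, severity scale, and probabilities below are hypothetical and are not drawn from Wallach’s remarks; the sketch only shows how a severity-times-probability score could rank constituent components of risk.

```python
# Minimal sketch of severity-times-probability risk ranking.
# Component names, severity levels, and probabilities are hypothetical.
from dataclasses import dataclass


@dataclass
class RiskComponent:
    name: str
    severity: float      # level of undesirability to stakeholders, e.g., 1 (low) to 5 (high)
    probability: float   # estimated probability that the event occurs, in [0, 1]

    @property
    def score(self) -> float:
        # Priority as the product described above: undesirability x probability.
        return self.severity * self.probability


components = [
    RiskComponent("generates misleading medical advice", severity=5.0, probability=0.02),
    RiskComponent("produces off-topic responses", severity=1.0, probability=0.30),
    RiskComponent("leaks personal data in output", severity=4.0, probability=0.01),
]

# Rank components from highest to lowest priority.
for c in sorted(components, key=lambda c: c.score, reverse=True):
    print(f"{c.name}: score={c.score:.3f}")
```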
Bogen underscored the potential disconnect between how stakeholders and regulators understand concepts such as fairness. This, according to Bogen, can thwart some assessment work because of potential legal implications.
Bommasani pointed to cost as a significant barrier to evaluation, heavily influencing how assessments are prioritized. He observed that assessments, especially complex ones, can be very costly and that the degree of insight obtained is often related to the level of investment in an assessment. Complex concepts that cannot be measured easily can be more costly, while less rigorous risk assessments are more accessible. Weidinger emphasized the importance of using multiple assessments spanning a wide range of evaluation approaches to validate findings. Agreeing with Bommasani’s point about cost as a barrier, Weidinger suggested that it may be possible to find less expensive proxies for expensive assessment approaches, allowing multiple evaluations to be conducted more easily.
Last, Isaac asked the panelists to define validity. He pointed to Wallach’s notion of validity through investigating the components of a system and Weidinger’s suggestion of triangulating the results of multiple tests as examples of forms of validity. Wallach named the two examples highlighted by Isaac as “content validity” and “convergent validity,” respectively. She also noted that Bogen and Bommasani discussed what she would term “consequential validity.”
Hoda Heidari, Carnegie Mellon University and planning committee member, moderated the second panel, which looked at how to evaluate and measure risks once they have been identified. Heidari gave the panelists the opportunity to provide opening remarks.
Lama Ahmad, OpenAI, began by emphasizing a key challenge: the gap between how AI models perform in controlled safety assessments versus their actual behavior when deployed in real-world environments. She turned to her own recent work, which
found that involving humans in safety evaluations can increase internal and external validity. Furthermore, she and her colleagues argue that human interaction evaluations help to assess direct human impact and interaction-specific harms, such as those related to persuasion and anthropomorphism. She offered that, in addition to benchmarking and other evaluations, red teaming plays a key role in mitigating the discrepancy noted at the start of her remarks; external expert red teaming is most useful when assumptions and threat models are clearly stated. Finally, transparent documentation and disclosure, to the extent possible, allow for feedback loops that support the reduction of existing harms.
Kenneth Holstein, Carnegie Mellon University, discussed his work on enabling participatory approaches to AI design and evaluation, underscoring the importance of truly collaborative approaches to the design of evaluation measures and benchmarks. He noted that in current participatory work, even when non-AI experts like frontline workers and community members are involved, they are rarely empowered to contribute meaningfully to the design of AI measurement approaches. Often, domain experts are left out of the process of conceptualizing and operationalizing measures, resulting in misleading evaluations. Holstein encouraged collaborating with domain experts at early stages, such as defining what “success” would look like for an AI tool and how it might be measured, prior to tool development.
Margaret Mitchell, Hugging Face, emphasized how natural language generation—technology that generates information based on a context-aware knowledge base—provides a valuable perspective for designing evaluations. She explained that one practical application of this approach is involving domain experts to assess AI outputs and develop content plans that define what information is most relevant and how it should be structured. These methods facilitate precision of coverage, allowing systems to be evaluated with the end user in mind rather than the technology in isolation. Currently, according to Mitchell, systems are evaluated automatically against common benchmarks; otherwise, organizations crowdsource feedback from sample populations whose responses can be distant from the experiences of most end users. Mitchell emphasized the need for disaggregated evaluation that accounts for the different contexts and subpopulations affected by a technology.
Arvind Narayanan, Princeton University, underscored the importance of analyzing the safety of AI models in the context of users. He pointed to AI-generated phishing emails as an example of a threat that is difficult to mitigate without taking user behavior into account. Exploring the phishing example further, Narayanan suggested that evaluation might focus on the AI-associated “gain of function”—for example, how AI models might improve the efficiency of malicious actors.
Heidari asked the panelists to discuss salient examples of popular evaluation methods. Mitchell illustrated a current industry trend of putting out an application
and requesting user feedback, providing useful case-by-case data. Mitchell noted that the large volume of feedback generated by a large user base can make identifying unique issues more difficult. Automated evaluations, conducted using algorithmic tools, complement assessment based on user feedback, Mitchell observed. Cheap and fast, they can be performed frequently and used to assess incremental system changes. Mitchell stated that disaggregated evaluation, which assesses how well a system performs for different subpopulations, can create a fairer system even when automated evaluations are used. Benchmarks can be better specified based on the subpopulation and intended use, and metrics can then be compared across subpopulations to understand the distribution of impacts.
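As a rough illustration of disaggregated evaluation, the sketch below computes a single metric separately for each subpopulation so the distribution of performance can be compared against the aggregate. The subgroup labels, records, and accuracy metric are hypothetical placeholders, not a description of Hugging Face’s evaluation tooling.

```python
# Minimal sketch of disaggregated evaluation: compute one metric per subgroup
# and compare it with the aggregate. Subgroups, records, and the accuracy
# metric are hypothetical.
from collections import defaultdict

# Each record: (subgroup, model_prediction, ground_truth)
records = [
    ("group_a", 1, 1), ("group_a", 0, 1), ("group_a", 1, 1),
    ("group_b", 1, 0), ("group_b", 0, 1), ("group_b", 1, 1),
]

by_group = defaultdict(list)
for group, pred, truth in records:
    by_group[group].append(pred == truth)

overall = sum(sum(hits) for hits in by_group.values()) / sum(len(hits) for hits in by_group.values())
print(f"aggregate accuracy: {overall:.2f}")

for group, hits in sorted(by_group.items()):
    acc = sum(hits) / len(hits)
    # Large gaps between subgroup metrics signal disparate performance that an
    # aggregate benchmark score would hide.
    print(f"{group}: accuracy={acc:.2f} (gap vs. aggregate: {acc - overall:+.2f})")
```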
Prompted by Heidari, Holstein described how qualitative methods fit into AI evaluations. Holstein agreed that automated and human-led evaluations as well as qualitative and quantitative approaches are needed. Holstein noted that different evaluation approaches need to be integrated thoughtfully. He suggested that it is good to have a healthy level of anxiety around whether automated or quantitative evaluations are measuring the right thing. He offered that qualitative approaches, like early-stage field observations and regular stakeholder interviews, can help inform the design of appropriate measurement approaches. Simultaneously, it is good to engage critically with the extent to which more local, qualitative insights hold up at a larger scale. Quantitative work can help to validate those findings and conclusions. Therefore, according to Holstein, qualitative and quantitative evaluation should be an iterative loop.
As another means of mixing methods, Ahmad suggested using models to generate use cases that can then be evaluated through red teaming. She cautioned that this might lead to a false sense of security if depth, breadth, and timing of examination are not considered. Reevaluating different situations iteratively is important as metrics and conditions shift over time.
Narayanan turned the conversation to predictive AI. In some contexts, such as banking or medical care, a good prediction does not always lead to a good decision. Narayanan said that prediction itself can change the very outcome it is trying to predict. To alleviate this concern, he advocated for randomized controlled trials observed over a period of years as a method to assess the accuracy of predictions.
Heidari asked panelists to opine on the place of benchmarking in automated quantitative evaluations as well as how it should be complemented with other forms of evaluation. Mitchell stated that benchmarking is useful for quickly iterating on system performance, particularly in comparison to previous versions of a model or to other models. Mitchell observed, through Hugging Face’s integration of benchmark leaderboards, that benchmarks can act as an incentive to perform better in a particular way. Narayanan highlighted the need to improve benchmarking for AI agents or compound
AI systems that might complete complex tasks like flight booking. Current benchmarks for such systems do not always include A/B testing. Narayanan argued that this type of testing is crucial for AI agents.
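For illustration only, A/B testing of agents can be sketched as comparing task success rates of two agent variants on the same task set and checking whether the difference is statistically meaningful. The variant names, counts, and choice of a two-proportion z-test below are hypothetical assumptions, not drawn from the benchmarks Narayanan discussed.

```python
# Minimal sketch of an A/B comparison between two hypothetical agent variants
# on a batch of tasks (e.g., simulated flight-booking requests). The counts are
# made up; the test is a standard two-proportion z-test computed by hand.
import math


def two_proportion_ztest(success_a: int, n_a: int, success_b: int, n_b: int):
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a, p_b, z, p_value


# Hypothetical results: variant A completed 172/200 tasks, variant B 158/200.
p_a, p_b, z, p = two_proportion_ztest(172, 200, 158, 200)
print(f"variant A success rate: {p_a:.2%}, variant B: {p_b:.2%}")
print(f"z = {z:.2f}, two-sided p-value = {p:.3f}")
```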
Mitchell cautioned against relying solely on performance benchmarks if evaluations are to be fit for purpose. Such assessments, noted Mitchell, must always be grounded in the intended purpose of an AI system. Mitchell emphasized the need for end user input to construct real-world conditions accurately. Practitioners, according to Mitchell, would benefit from engaging directly with the people who could use or create the system’s data sets to assess how well the model fits their intended uses and purposes. Beyond increasing the validity of evaluation, such engagement could also allow better explainability of system outputs and transparency into why the system is producing those outputs.
Building on the previous panel, Heidari listed some of the factors associated with effective evaluation—validity, reliability, scalability, trackability, and economic feasibility—and asked panelists to comment on additional factors and how they would prioritize these factors. Narayanan prioritized construct validity—that is, the extent to which the evaluation mirrors the real-world condition one aims to model. For example, he noted, chatbot evaluators are hampered by a paucity of data on genuine interactions between users and chatbots. If researchers are to have better access to these types of data for realistic evaluation, structural changes such as more transparency from companies or safe harbors for external researchers testing commercial systems are needed, Narayanan said. Furthermore, Narayanan suggested developing and adopting new models for collaboration between external researchers and companies that would provide researchers with access to user logs while respecting user privacy and protecting commercial interests.
Heidari then asked Holstein and Ahmad to explore construct validity and fitness for purpose further. Holstein pointed to recurring false assumptions that AI tools have sufficient capabilities to allow particular jobs to be automated. He noted that the error usually stems from construct validity issues, when a system is evaluated only in an oversimplified environment that does not reflect the realities of a given job. Such construct validity issues, according to Holstein, can arise when scale is prioritized too highly over validity. Ahmad agreed, noting tensions among scalability, generalizability, and cost that hinder attempts to build construct validity. Building evaluations while incorporating human expertise to understand worker tasks, complex interactions between humans and AI systems, and the context in which they occur takes time and effort. Ahmad suggested public red teaming challenges to allow for crowdsourced testing, which is less resource-intensive than domain expert red teaming. To be effective, this would require intentional clarification of the purpose of the test, pre-identification of what useful data look like, and strong definitions of terms and boundaries within systems.
Heidari asked the panelists to share their thoughts on the recently launched NIST ARIA program. Ahmad complimented the program’s multilayered, multistep approach, noting that single-method approaches will be insufficient, and encouraged complex approaches to red teaming. Holstein complimented the program’s flexibility, noting that identifying what is most vital to learn at each stage of the evaluation process and choosing methods accordingly is better than establishing a default order. Mitchell noted that the ARIA program lacks an emphasis on system testing and its relationship to model testing, field testing, and red teaming.
To end the panel, Heidari opened the floor to panelists to offer thoughts on where emphasis should be placed as evaluation efforts expand and develop. Narayanan urged AI system creators to invest in incident reporting to better understand current risks and harms. Mitchell stated that companies should show due diligence in addressing malicious use, defining intended use, and underscoring out-of-scope use prior to deployment. Ahmad suggested that relationships and trade-offs for model testing, red teaming, and field testing should be considered when audits are designed. Holstein concluded by expressing the importance of continued advances in approaches to improving validity in the measurement and evaluation of sociotechnical systems involving AI.
The third panel was moderated by planning committee member Solon Barocas, Microsoft and Cornell University, and focused on how to govern the mapping, measuring, and managing process of AI risk management. Barocas opened the floor to introductory remarks.
Diane Staheli, White House Office of Science and Technology Policy, stated that President Biden and Vice President Harris had made significant progress in advancing safe, secure, and trustworthy AI. She highlighted the Advancing Governance, Innovation, and Risk Management for Agency Use of Artificial Intelligence memorandum as a recent milestone.3 Staheli indicated a need for nonprocedural evaluation standards and described common challenges, including the need for entities to own the process and steer the work, the need to maintain system knowledge over time to preserve evaluation quality, the need for open-source reference implementations, and the need to determine which data are useful and which are not.
___________________
3 Office of Management and Budget, Executive Office of the President, 2024, “Advancing Governance, Innovation, and Risk Management for Agency Use of Artificial Intelligence,” OMB Memorandum M-24-10.
Daniel Ho, Stanford University, similarly highlighted the memorandum as a sign of significant progress. Ho stressed the importance of substantive standards for evaluating AI systems. To offer an example of insufficient evaluation standards, Ho pointed to cases where AI was regarded as “intelligent” when it completed procedural tasks that did not require reasoning. In the context of large language models (LLMs), Ho offered, test leakage is a concern: benchmark tasks appear in the training data, allowing LLMs to bypass benchmark-based evaluation. He noted the Face Recognition Vendor Test as an example of an evaluation designed to prevent this issue. More broadly, evaluations informed by domain experts can provide the perspective necessary to assess the achievements and potential of AI tools properly.
Ho voiced the critical need for external oversight with auditors that can also be audited themselves to ensure accountability. Last, he stated that organizations should measure resource investment for evaluations against the level of risk a tool might pose. Higher-risk applications, according to Ho, should require more expensive evaluations, while low-risk applications might allow for cheaper, automated evaluations.
Jacob Metcalf, Data & Society, pointed to ongoing work at Data & Society on an approach called “impact assessment,” which empowers communities to name the conditions of an assessment. Speaking on other ongoing work, Metcalf noted his recent publications assessing New York City Local Law 144, which prohibits organizations from using AI in employment decision making unless a bias audit is conducted and the resulting report is published.4,5 These publications highlight the strengths and weaknesses of the pioneering legislation, with the aim of strengthening future approaches and bolstering protections for individuals interacting with AI.
Barocas highlighted that documentation is needed to facilitate governance and asked panelists to discuss the audiences for evaluation reports and other documents. Furthermore, Barocas encouraged elaboration regarding how information should be presented based on the intended audience.
Metcalf returned to New York City Local Law 144 as a concrete example of a disparate impact audit. Under the law, the audit must be summarized in a report and displayed on the website of the company that was audited, and job seekers must be notified that the AI tool is being used. Metcalf noted that many companies forgo publishing an audit on their websites, suggesting a lack of enforcement of the directive. Furthermore, in cases in which a report is published, Metcalf suggested that these reports would not be easily understood by potential job seekers because they are not written with the public in mind. These factors work against the transparency intended by the law.
___________________
4 L. Groves, J. Metcalf, A. Kennedy, B. Vecchione, and A. Strait, 2024, “Auditing Work: Exploring the New York City Algorithmic Bias Audit Regime,” Pp. 1107–1120 in Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, Association for Computing Machinery.
5 L. Wright, R.M. Muenster, B. Vecchione, T. Qu, P. Cai, A. Smith, Comm 2450 Student Investigators, J. Metcalf, and J.N. Matias, 2024, “Null Compliance: NYC Local Law 144 and the Challenges of Algorithm Accountability,” Pp. 1701–1713 in Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, Association for Computing Machinery.
Explainability as a metric for evaluating ML models, according to Ho, is rarely considered from the perspective of external stakeholders, even though it is vital for informing them about the function and behavior of models. Ho suggested that governance should aim to elicit evidence of due diligence in how evaluations are executed rather than merely mandate that evaluations be performed. He argued that the design of auditing institutions can be more impactful than additional evaluation efforts. Ho referenced his past work categorizing elements of audits, such as disclosure, scope of audit, independence, selection, funding, and accreditation, that can be standardized through governance.6
Staheli noted that intelligible public disclosure enables contestability, which, in turn, bolsters the trustworthiness of a system. Conversely, internal audiences for evaluation results might include chief AI officers who have deep technical understanding or a multidisciplinary team that understands the expected behavior of the system. Early understanding of expected behavior that establishes a performance baseline along with a designated internal audience, Staheli said, allows for continued performance monitoring.
The NIST AI RMF, according to Barocas, does not push a specific institutional arrangement. Barocas asked panelists to discuss this flexibility. Metcalf complimented the flexibility of NIST’s current AI RMF as the existing auditing and assessment paradigm is still in flux. Metcalf suggested a governance and organizational playbook as an addendum to the existing NIST AI RMF that might offer examples of how governance could be organized based on a given scenario. Ho went further to suggest a range of use case profiles for different organizations based on scale and type of AI integration.
Barocas highlighted the need for evaluations to consider how systems will be used in practice and in particular subpopulations. Metcalf noted that applications that claim national or global societal benefit conflict with social science–based notions that emphasize the need to design for the nuances of subpopulations.7 Ho suggested that the cost of releasing applications at a large scale might encourage more rigorous work by slowing down the rollout of products. If rollouts are slowed and smaller populations interact with the AI application, the opportunity may arise to conduct randomized tests to track performance and impact within smaller units. Staheli and Ho emphasized the need to maximize opportunities to engage in iterative monitoring.
___________________
6 D.I. Raji, P. Xu, C. Honigsberg, and D. Ho, 2022, “Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance,” Pp. 557–571 in Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, Association for Computing Machinery.
7 M. Young, U. Ehsan, R. Singh, et al., 2024, “Participation Versus Scale: Tensions in the Practical Demands on Participatory AI,” First Monday 29(4).
Barocas urged the panelists to expand on their thoughts about resources and costs as limitations on evaluation. Metcalf encouraged nurturing an ecosystem in which incentives for evaluation are clear and tangible, particularly to those in the C-suite. Ho argued that in the design and development stage, evaluations should be a mandated portion of every budget. Staheli endorsed building evaluation costs into project estimates but noted that doing so requires an understanding of the scope of evaluation, which is currently difficult to predict. User evaluation after adoption is also needed. Ho pointed to a Bloomberg Law report that illustrates the value of evaluating the benefit of specific AI tools through small-scale adoption before investing in them.8
Barocas ended the panel by asking who should be performing evaluations, encouraging panelists to detail what composition of experts they imagine would be involved. Metcalf stated that while larger technology companies might have responsible AI teams, smaller and nontechnology organizations may not have the capacity to field the same. He emphasized the need to consider who will be held accountable and who will do the assessment in a manner that protects civil rights. Staheli reaffirmed the need for multidisciplinary teams of experts, including AI, human factors, and domain experts, as well as end users, to ensure that an AI tool is fit for purpose. Ho argued that no one-size-fits-all solution exists, as both public and private solutions have limitations and pitfalls; rather, he concluded, external oversight is essential to effective evaluation. Ho noted that federal agencies often lack the bandwidth to conduct audits and suggested that auditing firms, along with auditors of those firms, have the potential to establish consistent mechanisms for accountability.
___________________
8 I. Gottlieb, 2024, “Paul Weiss Assessing Value of AI, But Not Yet on Bottom Line,” Bloomberg Law, May 14.