To effectively navigate the path forward to realize the potential of generative AI (GenAI) in health care and health, several foundational activities will be required. These include skill generation; model testing, implementation, and monitoring; resources and infrastructure; and standardized oversight and guidelines.
To use GenAI effectively in health care, clinicians must develop proficiency with GenAI tools and become attuned to discerning and interpreting GenAI-driven outputs. For instance, as with many artificial intelligence (AI) solutions, outputs such as a likely diagnosis, a treatment recommendation, or a generated response to a patient query should be considered in context as possible, but not certain, and should not supplant clinical judgment and diligence. As confidence in the reliability and validity of GenAI-driven guidance improves, clinicians will be able to use these tools more effectively to improve patient outcomes and enhance health care delivery.
The advancement and sustainable integration of AI tools in health care necessitate a comprehensive expansion of training and educational programs. Professional training programs for clinicians should incorporate core curricula focused on teaching the appropriate use of data science and AI products and services. Continuous education for health care professionals should empower them to become more informed consumers of AI technologies, capable of providing feedback to continually refine AI-driven results. Additionally, consumer education campaigns are essential to inform patients of their rights, and to build and maintain trust across various educational levels. Patients need to actively engage with their health care providers to discuss the potential benefits and risks of AI-driven interventions, to ensure that they have a clear understanding of how AI will impact their care.
For the foreseeable future, the responsible and safe use of AI, including GenAI, will require a combination of sociotechnical approaches spanning the individual and systems levels to ensure benefits while mitigating harms, as has been described. To that point, it is also incumbent upon multiple health system stakeholders, from regulators, to health AI vendors, to health systems and providers, to consider and enable testing, evaluation, local validation, and ongoing monitoring, all as part of a web of solutions to enable comprehensive governance of such AI models applied to practice (Shah et al., 2023). Monitoring for errors, biases, and adverse effects of GenAI will in some ways resemble such monitoring for predictive and analytical AI, but GenAI also presents some unique challenges due to its very nature. As has been true with predictive and analytical AI solutions, GenAI’s processes are often difficult to discern and monitor for accuracy and veracity (the so-called black box issue). This opacity makes ongoing, multi-faceted monitoring of implemented AI solutions, including GenAI, following an “algorithmovigilance” approach, even more important (Embí, 2021).
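One minimal form that such algorithmovigilance can take is routine comparison of a deployed model’s recent performance against the accuracy established at local validation, flagging the model for review when performance drifts beyond a tolerance. The sketch below illustrates this idea only; the function names, the drift tolerance, and the example figures are illustrative assumptions, not part of any cited monitoring framework.

```python
# Minimal sketch of ongoing performance monitoring ("algorithmovigilance").
# All names, thresholds, and data here are illustrative assumptions.

def accuracy(predictions, outcomes):
    """Fraction of predictions that matched the observed outcome."""
    correct = sum(p == o for p, o in zip(predictions, outcomes))
    return correct / len(predictions)

def check_for_drift(baseline_accuracy, recent_predictions, recent_outcomes,
                    tolerance=0.05):
    """Flag the model for review if accuracy over a recent window of cases
    falls more than `tolerance` below the locally validated baseline."""
    recent = accuracy(recent_predictions, recent_outcomes)
    drifted = (baseline_accuracy - recent) > tolerance
    return {"recent_accuracy": recent, "drifted": drifted}

# Example: a model validated locally at 90% accuracy, reviewed periodically.
report = check_for_drift(
    baseline_accuracy=0.90,
    recent_predictions=[1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    recent_outcomes=[1, 0, 0, 1, 0, 1, 1, 0, 0, 1],
)
print(report)  # → {'recent_accuracy': 0.7, 'drifted': True}
```

In practice such a check would run on far larger case windows, use task-appropriate metrics rather than raw accuracy, and feed a governance process rather than a print statement; the point is that the comparison itself is simple to operationalize once local baselines exist.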
Many of the training, implementation, and monitoring needs are similar for predictive and generative AI. However, important differences exist as well and will require attention as the application of GenAI in health care matures. Tables 5-1 and 5-2 provide a structured comparison, detailing the areas where monitoring approaches align (see Table 5-1) and where they diverge (see Table 5-2) for generative versus predictive/analytical AI models.
To mitigate the risks of GenAI, thorough testing and validation across diverse datasets representative of real-world clinical scenarios should be integrated into algorithm training, paired with ongoing monitoring and refinement based on feedback from clinicians. A recent Biden administration Office of Management and Budget memo recommended that “testing conditions should mirror as closely as possible the conditions in which the AI will be deployed” (Office of Management and Budget, 2024). Validation should occur on the local population with the local operational software where the AI is being used. Such a local electronic health record (EHR) software safety monitoring program has been operationalized nationwide and is currently used annually by more than 3,000 hospitals (Classen et al., 2023).
Finally, GenAI health care applications must also consider patient privacy and ethical considerations. Given the rapidly evolving and complex nature of these tools, it is crucial to involve domain experts, ethicists, and stakeholders
TABLE 5-1 | Similarities in Monitoring Generative and Predictive/Analytical Artificial Intelligence
| Aspect | Description |
|---|---|
| Bias and fairness checks | Both types of AI require continuous monitoring to ensure that bias does not influence outcomes. This is crucial to avoid reinforcing stereotypes or making unfair predictions that could adversely affect members of subpopulations, particularly those who are underrepresented in training data and often are among the most vulnerable. |
| Performance degradation and drift management | AI models are susceptible to degradation over time as data patterns shift or as the model “drifts” from its original quality or relevance. Routine performance monitoring and retraining are necessary for both to keep them accurate and effective in their respective tasks. |
| Ethical and societal impact assessments | Both generative and predictive/analytical AI can have broader societal impacts. Monitoring requires a clear understanding of each model’s societal and ethical implications, particularly where outputs could affect user behavior, decision making, or public trust. |
| Regulatory and compliance oversight | As AI technology becomes more integrated into sectors like health care, both types of models face increasing regulatory scrutiny. Ensuring compliance with privacy, transparency, fairness, and equity standards is essential, especially for high-stakes applications such as in health care. |
| User feedback integration | User feedback is valuable for refining both generative and predictive AI, as it provides real-world insights into how well the AI is performing and where adjustments may be needed. Regular feedback loops improve accuracy and align outputs with user expectations or industry standards. |
throughout the development, deployment, and monitoring process to ensure that the AI systems are trustworthy, reliable, and fair.
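The bias and fairness checks described in Table 5-1 can be made concrete with a simple subgroup audit: computing a model’s error rate separately for each patient subpopulation and flagging disparities above a chosen threshold. The sketch below is a minimal illustration under assumed data, group labels, and a hypothetical disparity threshold; it is not a complete fairness methodology.

```python
# Minimal subgroup fairness audit: per-group error rates and the largest gap.
# The records, group labels, and disparity threshold are illustrative assumptions.

from collections import defaultdict

def subgroup_error_rates(records):
    """records: list of (group, prediction, outcome) tuples."""
    totals, errors = defaultdict(int), defaultdict(int)
    for group, pred, outcome in records:
        totals[group] += 1
        errors[group] += (pred != outcome)
    return {g: errors[g] / totals[g] for g in totals}

def audit(records, max_gap=0.10):
    """Flag the model when the gap between the best- and worst-served
    subgroups exceeds `max_gap`."""
    rates = subgroup_error_rates(records)
    gap = max(rates.values()) - min(rates.values())
    return {"rates": rates, "gap": gap, "flagged": gap > max_gap}

records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 1),  # group A: 1/4 errors
    ("B", 1, 0), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0),  # group B: 2/4 errors
]
result = audit(records)
print(result["rates"], result["flagged"])  # → {'A': 0.25, 'B': 0.5} True
```

A real fairness audit would use statistically meaningful sample sizes, multiple fairness definitions, and clinically grounded subgroup definitions; this sketch only shows that the underlying comparison is mechanically straightforward once subgroup outcomes are captured.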
A lack of resources and infrastructure poses a significant barrier to the widespread implementation of large language models (LLMs) and GenAI in health care. Many health care organizations lack the necessary technical expertise, financial resources, and IT infrastructure to deploy and maintain AI systems effectively.
The deployment of LLMs and GenAI in health care introduces significant complexities and costs. Integrating AI systems into existing workflows and EHR systems demands substantial investments of time, resources, and training. These investments represent a significant financial burden for health care organizations, particularly for smaller institutions or those with limited budgets, potentially hindering widespread adoption across health care systems. In addition, local adoption of such programs is critically dependent on establishing the trust of frontline clinicians, which can be very resource intensive.
TABLE 5-2 | Differences in Monitoring Generative and Predictive/Analytical Artificial Intelligence
| Aspect | Generative AI | Predictive/Analytical AI |
|---|---|---|
| Output evaluation and quality control | The primary outputs are new content, such as text, images, and audio, where quality is subjective and context dependent. Monitoring focuses on coherence, relevance, and ensuring ethical content generation, as well as preventing issues like “hallucinations” or factual inaccuracies. | These models generate quantitative predictions, making performance assessment more straightforward through accuracy, precision, and recall metrics. The emphasis is on accuracy within defined data parameters rather than subjective quality. |
| Bias manifestation and detection | Bias can appear in subtle ways, shaping content tone, language, or framing. Monitoring involves detecting biases in generated language or other output media and preventing the spread of misinformation or unintended stereotypes. | Bias checks focus on ensuring that model predictions are fair across different groups. Monitoring bias in these models often involves fairness audits and statistical checks on outcomes rather than subjective analysis of generated content. |
| Performance degradation and adaptation and dynamic nature | Quality degradation may appear as reduced coherence or creativity, requiring frequent content review and adjustments. User feedback is often essential in detecting subtle shifts in output quality. Also, GenAI models are often designed to evolve over time, learning from new data, so monitoring requires ongoing vigilance to adapt to changes. | Model drift often relates to underlying data shifts, requiring statistical tracking of accuracy and regular retraining. The process is more data driven and straightforward, as performance is measured against historical accuracy benchmarks. |
| Impact on users and society | The potential for misuse of generative content (e.g., for spreading misinformation) adds a unique layer of impact monitoring, requiring checks on ethical content generation and user satisfaction. Societal impacts include privacy, misinformation, and the psychological effect on users. | Impacts are more directly related to decision making, where inaccurate predictions can affect outcomes in areas like medicine or eligibility for services. Monitoring focuses on ensuring reliable decision support and fairness in model applications. |
| Compliance and legal considerations | Content produced by generative AI can raise unique compliance issues related to privacy, misinformation, and ethical standards. Monitoring involves regulatory checks on content generation standards and adherence to ethical guidelines. | These models often operate in industries with established regulatory frameworks (e.g., finance or health care), so monitoring focuses on meeting interpretability, privacy, and compliance requirements within well-defined legal standards. |
Finally, disparities in health care access may worsen if patients lacking reliable internet or digital devices are excluded from benefiting from timely GenAI-driven diagnostics. Such exclusions could widen existing gaps in health care access and outcomes, further exacerbating inequities in health care delivery.
Given the cross-cutting nature of LLMs and GenAI, whose impact spans sectors and stakeholders, collaboration and coordination among federal and state regulators, non-governmental bodies, and organizations responsible for health care delivery are important to ensure adequate oversight of these technologies. Health care regulators with authority over industry developers and enablers of AI solutions (e.g., the U.S. Food and Drug Administration [FDA], the Assistant Secretary for Technology Policy/Office of the National Coordinator for Health Information Technology) should develop clear guidelines and standards for evaluating AI systems, including GenAI, as well as robust mechanisms for monitoring their safe and ethical use in health care. Health care organizations will need to navigate a complex regulatory landscape, including privacy laws, data security regulations, and health care standards, to ensure compliance with legal and ethical guidelines governing the effective and equitable use of GenAI in health care. Even as the number of FDA-regulated AI solutions grows, there currently are and likely will remain many more AI-enabled solutions in health care that are not regulated in the same fashion and yet will be increasingly relied upon for health care delivery and processes. From the summarization of health care data and the creation of new clinical data at the point of care, to the recommendation of diagnostic and therapeutic interventions and the use of GenAI-enabled chatbots that interact directly with patients as health care “agents,” there are numerous uses and downstream risks associated with GenAI in health care. Even as the federal regulatory landscape matures relative to AI vendors and enablers, regulations are also being clarified to define the responsibilities of health care providers to ensure that their adoption and use of AI does not lead to discriminatory care and inequities (Office for Civil Rights, 2024).
Indeed, GenAI presents a formidable challenge for health care organizations seeking to harness the technology’s benefits to improve patient care and clinical outcomes as well as drive new discoveries.
Even as the regulatory landscape around the approval and use of GenAI for care continues to unfold, new and refined regulatory and payment structures will need to evolve to incentivize appropriate innovation and sustainability of GenAI-driven health care solutions. Health care payers and policy makers
should develop flexible reimbursement models and regulatory frameworks that support the adoption and integration of GenAI technologies into clinical practice while safeguarding patient safety and health equity. Properly structured to foster alignment of incentives among different stakeholders, such frameworks can promote collaboration and knowledge sharing in an AI-driven health care ecosystem.
The need for new GenAI evaluation metrics and procedures is evident, as traditional evaluation methods may not capture the unique characteristics and challenges of health care AI systems. New approaches, such as real-world local validation studies and standardized benchmarks, are needed to accurately assess the safety, efficacy, and usability of GenAI-driven health care technologies. Health care providers and GenAI developers should share responsibility for balancing the use of specialized expertise to optimize performance on specific tasks with ensuring that GenAI algorithms remain flexible and generalizable enough to adapt to evolving clinical needs and support diverse patient populations (Ratwani et al., 2024).
As we evaluate the opportunities and risks associated with the integration of LLMs and GenAI in health care, it becomes evident that the benefits of these technologies can vastly outweigh the inherent risks. However, realizing these benefits will require intentional, coordinated actions by multiple stakeholder groups, including health care providers, researchers, policy makers, regulatory bodies, and technology developers. There is a pressing need to pivot toward a collective call for collaboration. By fostering collaboration and partnership across diverse stakeholder groups, GenAI-related challenges can be collectively addressed to harness the transformative potential of LLMs and GenAI to advance medicine, improve patient outcomes, and revolutionize the landscape of health care delivery.
Table 5-3 highlights a selection of key phases in the GenAI life cycle and examples of where collaboration can advance policies and practices that optimize the benefit–risk ratio and assessment. The extraordinarily high costs associated with developing and implementing GenAI models are a driving force for collaboration. Multi-stakeholder, multi-organizational cooperation to develop shared data assets and identify best practices presents substantial opportunities for cost sharing and commercial risk reduction. Standards and collaboration approaches drawn from the software life cycle can be of benefit in this area (Association for the Advancement of Medical Instrumentation, 2020).
TABLE 5-3 | Key Phases in the GenAI Life Cycle and Example Opportunities for Collaboration
| Phase | Sectors | Example Opportunities for Collaboration |
|---|---|---|
| Prioritizing problems | | |
| Data creation, acquisition, and assurance | | |
| Model development | | |
| Model evaluation | | |
| Model standards | | |
| Model implementation and diffusion | | |
| Model monitoring and maintenance | | |
The complexity and interdisciplinary nature of developing, evaluating, implementing, and monitoring GenAI models in health care, coupled with the increased risk of harm when deploying these tools in an industry that exists to care for people and is defined by information asymmetry, also drives the need for deep interdisciplinary collaboration. There is no single stakeholder group capable of identifying and capturing the full spectrum of potential benefits and opportunities for GenAI in health care, nor of identifying and mitigating all the associated risks. However, defining the current roles and responsibilities of different stakeholders
collaborating to develop and deploy GenAI in health care (see Table 5-4) reveals a high burden on certain stakeholders and challenges the notion that policy makers and regulators are ultimately responsible for the implementation of GenAI that helps, not harms, in health care.
Health care professionals are most often ultimately accountable for guaranteeing performance at key phases of GenAI development and/or deployment. While many of their collaborators, including developers, GenAI manufacturers, and researchers, are responsible for tasks and costs associated with these phases, the burden of decision making—and the assumption of risk for harm—most commonly sits with health care professionals. Even in a collaborative model, discrepancies in the roles and responsibilities should be addressed, and AI developers should collaborate with health care professionals and share in these responsibilities. In an ecosystem where health care professionals are the backstop of responsible AI, equity issues will arise, as differently resourced health care systems and organizations will provide variable levels of support for the health care professionals driving GenAI strategies. Without proper support, there is also the risk of fueling further provider burnout as new GenAI models are introduced.
To reduce these risks, appropriate resources, training, and support should be provided to health care professionals in all settings, equipping them with the knowledge, operational capacity, and skills to serve as the stewards of responsible AI. The broader community should also intentionally implement strategies that seek to reduce this burden on health care professionals. An example is the development of more robust medical education and training opportunities to support the workforce required for scale and sustainability.
Policy makers and regulators are essential collaborators in almost all key phases of GenAI development and deployment in health care, responsible for creating the necessary incentives for action and enforcing best practices and standards. To illuminate where incentives may be necessary, Table 5-4 highlights examples of where collaboration can advance policies and practices that optimize the benefit–risk ratio and ensure appropriate benefit–risk assessment. Policies and incentives should drive the broad adoption and adaptation of this collaborative approach across all contexts and settings.
Finally, representative patients and communities should be engaged as critical collaborators throughout the total product life cycle of GenAI models in health and health care. There is no phase of the development or deployment of GenAI models that can be successfully completed without the insights from the individuals the health care industry exists to serve, specifically patients representing communities where specific models will be implemented. Establishing shared values that define
TABLE 5-4 | Roles and Responsibilities of Stakeholders Collaborating to Develop and Deploy GenAI in Health Care
| Phase | Health Care Professionals and Systems | GenAI Developers and Manufacturers | Representative Patients and Communities | Policy Makers and Regulators | Researchers | Commentary |
|---|---|---|---|---|---|---|
| Prioritizing problems | A, R | C | R | I | R | With health GenAI solutions being developed for health care providers and payers, health care professionals and systems are accountable for prioritizing the problems that GenAI models could solve. They are also responsible, in conjunction with patients, communities, and researchers, for defining these problems. Once these are determined, then GenAI manufacturers and developers are consulted in order to determine if GenAI models are an appropriate solution (either in whole or in part) to the prioritized problems. Finally, policy makers should be informed of these prioritized problems so that they can assist in identifying any relevant regulatory considerations. |
| Data creation, acquisition, and assurance | R | A, R | C | C | C | GenAI developers and manufacturers are accountable for ensuring that data used for algorithms adhere to minimum viable data standards, including being representative, secure, and compliant with privacy standards. They are also responsible, along with health professionals and systems, for ensuring that the data they produce adhere to these standards. Policy makers and regulators, patients and communities, and researchers should all be consulted in these activities to ensure responsible and comprehensive data use in GenAI applications. |
| Model development | C | A, R | C | I | R | GenAI developers and manufacturers are ultimately accountable for how and which models are developed, though the responsibility for actually developing those models and identifying ground truth labels that optimize model applicability and minimize bias is a collaborative endeavor between developers, manufacturers, and researchers. Health systems and patient stakeholders should be consulted to identify optimal features and outcomes, while regulators should be informed of which models may be coming to their attention. |
| Model standards | C | C | C | A, R | C | Policy makers and regulators bear primary accountability and responsibility for developing and disseminating regulatory standards for model performance, bias, and interoperability. These standards ought to be developed in consultation with all other stakeholders. |
| Model evaluation | A | R | C | C | R | With consultation from patients, communities, policy makers, and regulators, health care professionals and systems are ultimately accountable for defining the parameters of trustworthiness and fit for purpose of models designed for implementation into their workflows, systems, organizations, and clinical decision making, and recommending them to their patients. GenAI developers, manufacturers, and researchers are responsible for creating the evidence to support these evaluations. |
| Model implementation and diffusion | A, R | R | C | I | C | GenAI developers, manufacturers, health care professionals, and health systems assume responsibility for building tools suitable for clinical workflows and diverse patient populations, with consultation from patients and communities to ensure appropriate prioritization and from researchers to ensure evidence-based rollouts. However, health care professionals remain ultimately accountable for the integration of these tools into their workflows and patient care; advocating for investment and training to select, implement, and govern these models; and adapting pre-trained models to new domains. Policy makers do not play a large role in implementation, although they should be informed of implementation challenges in order to cross-reference against standards to ensure safe implementation. |
| Model monitoring and maintenance | A, R | R | C | R | R | GenAI developers, manufacturers, regulators, and researchers play a critical role in monitoring and managing changes in model performance over time. Similarly, policy makers and regulators should standardize performance parameters and generate incentives for this monitoring. Health care professionals and systems are ultimately accountable for ensuring that models deployed in their organizations, care workflows, or patients’ lives are trustworthy and fit for purpose, even long after initial implementation. |
Responsible (R): Those who are assigned the “Responsible” level of responsibility do the work to achieve the task. For every task or deliverable, there should be at least one role with this level of responsibility. However, other roles can be delegated to assist in the work required (see below for separately identifying those who participate in a supporting role).
Accountable (A): Those who are assigned the “Accountable” level of responsibility are those who are ultimately answerable for the correct and thorough completion of the deliverable or task. They are also the ones who may delegate the work to those who are “Responsible.” In other words, an “Accountable” should sign off on or approve the work that a “Responsible” does. For every task or deliverable, there should be only one role with this level of responsibility.
Consulted (C): Those who are assigned the “Consulted” level of responsibility are those whose opinions are sought during the work to complete the task or deliverable, or who help to review the result of the work to ensure it meets the necessary goals. They are typically subject matter experts or those who may be directly affected by the work, and two-way communications with them are usually maintained throughout the work process.
Informed (I): Those who are assigned the “Informed” level of responsibility have no required work to support a task or deliverable but are usually kept up to date on work progress, or they may only be notified once the work or deliverable is completed. Typically, communications with those in an “Informed” role are purely one-way from those who are Accountable or Responsible.
patient input as table stakes in health GenAI is critical to capturing the promise of these new technologies to improve health and health care.
The integration of LLMs and GenAI in health care holds significant potential to transform the practice of medicine, the work and experiences of health care providers, and the health and well-being of patients. GenAI can support clinical decision making and streamline workflows, promote patients’ and their support networks’ engagement in care processes, address health equity issues, and support
clinical research. However, successful, ethical, and equitable implementation of GenAI requires careful consideration of the associated risks, particularly those concerning data privacy, bias, transparency, and infrastructure limitations. Collaboration among stakeholders, including health care providers, patients, policy makers, ethicists, and researchers, along with a cross-sector commitment to maximizing the benefit of GenAI while minimizing the risks, is important for navigating the complexities associated with GenAI in health care. Federal and organizational oversight; standardized guidelines for GenAI development, implementation, and responsible and ethical use; and continuous practitioner and patient education can facilitate the ethical and effective application of LLMs in health and medicine to improve patient outcomes, increase equitable access to care, and revolutionize medicine, research, and health care.