Proceedings of a Workshop—in Brief
To explore the opportunities and challenges of using artificial intelligence (AI) and digital health technologies for improving diagnosis, the National Academies of Sciences, Engineering, and Medicine’s (the National Academies’) Forum on Advancing Diagnostic Excellence hosted a hybrid workshop on July 25, 2024.1 The workshop highlighted the role of these technologies throughout the diagnostic process and their impact on the patient experience, including leveraging AI and digital health tools to understand the patient’s onset of symptoms, to improve information gathering and patient–clinician communication during the clinical encounter, and to support clinical decision making. Daniel Yang, vice president of artificial intelligence and emerging technologies for Kaiser Permanente, said, “Diagnostic errors are both the most common and consequential medical errors experienced by patients in the United States; these errors result in serious physical, psychological, and financial harm to patients.” AI could play a role in addressing this problem, but major challenges include evaluating the effectiveness of AI technologies and ensuring they promote equity instead of perpetuating health disparities, he said. This workshop builds on the consensus report Improving Diagnosis in Health Care (NASEM, 2015) and a workshop series on Advancing Diagnostic Excellence.2 This Proceedings of a Workshop—in Brief highlights the presentations and discussions that occurred at the workshop.3
Michael Howell, chief clinical officer at Google Health, said the earliest developers of AI raised concerns about machines modeling human behavior and its potential harms; these concerns have endured to the present day. Howell outlined the epochs of AI in three phases: AI 1.0, symbolic AI and probabilistic models; AI 2.0, deep learning; and AI 3.0, generative AI. AI 1.0 uses traditional programming and rules-based encoding to teach machines the difference between two objects, whereas AI 2.0 is task-specific and learns patterns, for example, to determine whether two objects are different. AI 2.0 has been used in diabetic retinopathy detection and lung cancer screening. Howell explained that for each medical
__________________
1 The workshop agenda and presentations are available at https://www.nationalacademies.org/event/42229_07-2024_diagnosis-in-the-era-of-digital-health-and-artificial-intelligence-a-workshop (accessed September 20, 2024).
2 The workshops explored ways to improve diagnosis for sepsis, acute cardiovascular events, and cancer and for older adults and in maternal health, as well as diagnostic lessons learned from the COVID-19 pandemic. More information is available at https://www.nationalacademies.org/our-work/advancing-diagnostic-excellence-a-workshop-series (accessed September 20, 2024).
3 This Proceedings of a Workshop—in Brief is not intended to provide a comprehensive summary of information shared during the workshop. The information summarized here reflects the knowledge and opinions of individual workshop participants and should not be seen as a consensus of the workshop participants, the planning committee, or the National Academies of Sciences, Engineering, and Medicine.
condition, use of AI 2.0 requires new inputs, datasets, and retraining. In contrast, AI 3.0 is more dynamic and is able to emulate human speech and clinician–patient interactions and can assist clinicians with evaluations and analysis. Howell then described what he sees as the future of the field: developing methods for evaluating AI models and appropriately implementing these innovations; developing a thoughtful and effective regulatory framework (e.g., based on the AI Code of Conduct from the National Academy of Medicine);4 and using AI to promote and improve health equity.
Grace Cordovano, a board-certified patient advocate specializing in oncology, is a care partner to two adults with disabilities and a patient herself. She highlighted the importance of including the patient perspective and care partner perspective in considering the opportunities and challenges of using AI and digital health in diagnosis. She stated that it is the “dawn of a new era” for patients who now have access to new tools and technologies to assist them when clinicians are unavailable (e.g., on a weekend or holiday) or when there is a long wait time for an appointment. Although most patients may lack access to medical journals and clinicians’ education, she said patients and their caregivers are experts in their own experience of navigating life with a health condition or helping others to do so. Cordovano said clinicians should not disparage patients who search for information related to their symptoms on digital platforms. Rather, she emphasized the importance of acknowledging that patients are trying to understand their symptoms and overcome barriers—such as language, health insurance, or time constraints—to access health care. “AI gives us the opportunity to potentially have a trusted tool within the right guardrails and frameworks to start helping us prepare for those appointments, to start triaging ourselves to help us connect with our physicians and start a partnership,” she said.
However, Lucia Savage, chief privacy and regulatory officer at Omada Health, noted that scientific misinformation is rampant, and people using the internet may not know whether they are accessing accurate or inaccurate information, so they need guidance from clinicians. Savage described the context for concerns around health care data, privacy, mistrust, and ways to appropriately handle patient data. She shared an anecdote in which a health system tracked whether its patients had completed breast cancer screening and prompted those who had not yet completed the screening with reminders. While some patients expressed concern that their data were being tracked in this way, others expressed gratitude for the reminders for cancer screenings that may have saved their lives. Within this context, Savage said individual-level digital data tracking can be lifesaving, but at the population level it can be unsettling. Savage continued, stating that trust in “big tech” is low as people realize their data are gathered without their knowledge from online search platforms (e.g., Google, ChatGPT, or WebMD). She said that although the HIPAA Privacy and Security Rules5 apply only to health care contexts, some wrongly assume that these regulations apply to all consumer contexts, such as online search platforms, which are regulated differently from personal data gathered in health care systems. Considering these challenges and consumer concerns, Savage delineated rules to “do no harm” when handling patient data: ensure that data are stored securely, protected by two-factor authentication and encryption; de-identify data when possible; and trust the individual—tell them how AI and other digital tools are being used in their diagnosis.
John Whyte, chief medical officer at WebMD, discussed how patients navigate digital platforms to understand symptoms and connect to care. A billion health-related queries are made online every day, many related to symptoms (e.g., fever, chills, cough), said Whyte, and today people get medical information from a wide range of digital tools that continue to evolve (e.g., from using a search platform to using a symptom checker that provides a potential diagnosis). Whyte suggested that trained and educated medical communicators need to reach people where they are getting their medical
__________________
4 The Artificial Intelligence Code of Conduct is led by the National Academy of Medicine and aims to provide a guiding framework to ensure that AI algorithms and their applications in health, health care, and biomedical science perform accurately, safely, reliably, and ethically. See https://nam.edu/programs/value-science-driven-health-care/health-care-artificial-intelligence-code-of-conduct/ (accessed October 4, 2024).
5 The Privacy Rule (https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html) and the Security Rule (https://www.hhs.gov/hipaa/for-professionals/security/index.html) were promulgated under the Health Insurance Portability and Accountability Act of 1996, Public Law 104-191, 110 Stat. 1936 (1996) (accessed October 4, 2024).
information. Whyte explained that people want a sense of community, which they find through online platforms. For example, if someone is diagnosed with diabetes, the patient may connect online with others who have diabetes and start receiving advice from them. Whyte pointed out that physicians often do not acknowledge these online communities as part of the patient’s diagnostic experience. People seek immediate answers and information, but in the United States, he said, people often wait more than 20 days to see a primary care physician, which is a long time for patients who are concerned about signs or symptoms. Whyte concluded by saying the medical community will “have to meet patients where they are . . . and empower them with quality information but also make sure that the health care system is still engaged.”
Kenrick Cato, professor at the University of Pennsylvania, discussed how clinicians gather patient information during the diagnostic process and how the patient experience is captured throughout the process. Cato explained that patient signs and symptoms are typically captured through the patient history, physical examination, and diagnostic testing and may include data from personal monitoring devices and referrals from other clinicians (e.g., specialists)—all of which is entered into the electronic health record (EHR).6 EHRs are built for billing, quality reporting or regulatory compliance, and clinical documentation but are not specifically designed to support the diagnostic process. Furthermore, there is variability in how information is documented and analyzed because numerous EHR vendors offer variable capabilities. Variability also exists in information flow, in access points (e.g., in person or virtual), and in the knowledge of those using the EHR, and Cato noted that this all factors into the diagnostic process. Cato continued by describing the wide variation in how signs and symptoms are extracted, which ranges from practices still using faxes (because they are HIPAA compliant) to systems beginning to implement AI tools, such as Microsoft’s Copilot, in which large language models (LLMs)7 are employed for natural language processing that can summarize multiple streams and modes of information (e.g., lab results, imaging reports, clinical notes, vital signs). Technology such as ambient voice recording devices can record the exam-room conversation between the clinician and patient, translating it into documentation or structured information. However, Cato said, none of this adequately captures the patient experience, and it is important to also consider factors such as variability of resources, ethical issues, algorithmic bias, and the validity and reliability of these new technologies.
Sigall Bell, associate professor at Harvard Medical School, said that open notes8 help to improve patient safety and advance diagnostic excellence through increasing completion of ambulatory diagnostic tests and referrals, strengthening patient–clinician communication and relationships, detecting EHR errors, and preventing diagnostic blind spots. Bell reported that “40 to 80 percent of health care encounter information is forgotten or misremembered [which] is worse with stress, with bad news, complexity, and language barriers” (Kessels, 2003). Open notes can help address this issue by providing an artifact from the visit so patients can review the information and help connect them with next steps. In a 2021 study, patients and their caregivers reported using open notes helped them to better understand the reason for tests and referrals; 97 percent of participants reported the same or increased trust in the health care provider and felt the same or more aligned in their goals with their clinicians (Figure 1; Bell et al., 2021).
Using open notes allows patients to review their charts to make sure their clinician understood and recorded their information correctly. EHRs often contain multiple errors that impact the patient–clinician relationship in myriad ways, said Bell, noting a 2020 patient survey in which
__________________
6 EHR is a “digital version of a patient’s paper chart [that] are real-time, patient-centered records that make information available instantly and securely to authorized users.” (See https://www.healthit.gov/faq/what-electronic-health-record-ehr; accessed October 4, 2024.)
7 A large language model is a statistical language model that is trained on large amounts of data and can be used to generate text and other content, power robust autocomplete systems, and drive chatbot AI systems, making these capabilities accessible to broad audiences.
8 Open notes refer to transparent information sharing when medical providers and clinicians invite patients to see their medical notes through secure patient portals. See https://www.opennotes.org/ (accessed November 8, 2024).
21 percent of respondents reported finding a mistake in their notes, and 44 percent of those reported the mistake as serious (Bell et al., 2020). The wrong symptoms or an incomplete history can lead to patients receiving the wrong tests, referrals, diagnosis, or treatment. The study also found omissions and misalignments between patients and their health care providers, wherein a patient’s main concerns were not heard or understood. In response, Bell and her team developed a tool called OurDX,9 which streamlines the process of capturing relevant, actionable information from the patient prior to a visit and delivering it to the clinician in the EHR to help prevent these patient-reported breakdowns. AI could integrate well with OurDX, Bell stated, because AI could simplify or translate the previous visit notes for patient review, guide patients through the OurDX questions conversationally, and generate a “pre-note” that would be available for clinicians at or before the visit.
Jonathan Chen, assistant professor of medicine in the Stanford Department of Medicine, discussed the rapid speed with which AI is improving and its implications for deployment in medical practice. Chen shared the story of a patient with an elevated heart rate, whose hormone tests revealed hyperthyroidism. Chen referred the patient to an endocrinologist for further diagnosis and treatment, but the next available appointment was three months away. Chen used this case to ask whether AI could help in such a situation by indicating possible next steps in terms of diagnostics, treatment, and management plans. Chen said he envisions a future with multiple tiers of clinical consultation support tailored to better match the needs of patients by providing “immediate and personalized suggestions on what the patient needs, converting our medical records of individual care to reusable institutional knowledge that can empower individuals with the collective experience of the many.” Such systems need to demonstrate accuracy, Chen said, and they are proving to be highly reliable. The most recent AI tools—GPT-4,10 Med-PaLM 2,11 and Med-Gemini12—exceeded a passing score on standard multiple-choice medical exam questions (Singhal et al., 2023). These tools were also tested using an exam given to Stanford medical students
__________________
9 OurDX is a program within open notes that invites patients to contribute information before the visit and assesses whether patients’ pre-visit survey information helps patients and clinicians prepare for the medical appointment and enables co-production of diagnostic safety. See https://www.opennotes.org/ourdiagnosis/our-diagnosis-faq-ourdx/ (accessed October 4, 2024).
10 Generative Pre-trained Transformer (GPT) 4 is the fourth in a series of large language models created by OpenAI using deep learning that leverages large amounts of data to perform a task. See https://openai.com/index/gpt-4/ (accessed November 8, 2024).
11 Med-PaLM 2 is a large language model created by Google and is designed to provide high quality answers to medical questions. See https://sites.research.google/med-palm/ (accessed November 8, 2024).
12 Med-Gemini is a group of multimodal models created by Google that specialize in medicine using data from web search. See Saab et al., 2024.
to decide if they are ready to see real patients (Strong et al., 2023). For example, the student is asked to summarize a case in 200 words, provide a differential diagnosis, and justify their reasoning. GPT-4 passed the exam, even outscoring the Stanford medical student average. Chen and his colleagues then sought to understand to what extent AI helps human clinicians in a study that compared “doctor + internet” and “doctor + GPT-4” (Goh et al., 2024). They assigned a complex diagnostic case, designed to test open-ended medical reasoning, to two groups of physicians: one was allowed to use online tools, such as UpToDate13 and Google, but not LLMs; the other group had access to GPT-4. The two groups of clinicians received similar scores, but a chatbot (GPT-4) alone outscored both groups (Figure 2). Chen concluded by saying this area needs more research and exploration and suggested further education and training for clinicians on the use of these tools. Chen quoted a colleague who said, “While AI is not going to replace anybody, those who learn how to use it may very well replace those who do not.”
Adam Rodman, director of AI programs at Beth Israel Deaconess Medical Center, explained the potential for LLMs to support clinicians in improving care management decisions and diagnostic reasoning. Historically, clinical decision making is based on pretest probability as well as measured and thoughtful deliberation and is therefore subject to the usual human cognitive processes, said Rodman. This can lead to mistakes, oversights, and medical errors. Rodman shared the story of a patient
who died shortly before GPT-4 was released, and after reflecting on the event, Rodman used GPT-4 to provide a list of differential diagnoses using the patient’s signs and symptoms and his problem representation (a clinician’s summary of a patient’s medical condition) as inputs. GPT-4 was able to accurately diagnose M. bovis14 as the primary cause of the patient’s death, followed by the secondary diagnosis of culture-negative endocarditis, which Rodman and his clinical team had wrongly determined to be the primary diagnosis and for which they had treated the patient. Rodman explained that LLMs are able to use the same thought patterns that clinicians use when determining a diagnosis but can delve further into language to make diagnoses. Rodman described studies in which generative AI models (GPT-4V15 and Gemini Ultra16) were able to provide the correct differential diagnosis in challenging medical cases, with close to 89 percent accuracy (Han et al., 2024). Moreover, LLMs are able to outperform humans in estimating pretest and posttest probability (Rodman et al., 2023), make more accurate forecasts, and process medical data and clinical reasoning better than physicians (Cabral et al., 2024). Rodman reiterated Chen’s earlier point that LLMs do not make humans better at diagnosis. He concluded by saying that although LLMs are outperforming clinicians in some areas, such as solving complex cases and making predictions, these technologies can become tools and assets for clinicians.
Jason Poff, director of innovation deployment at Radiology Partners, shared the opportunities and challenges of using diagnostic AI in radiology. Human radiologists excel at providing a comprehensive evaluation that encompasses multiple diagnoses and conditions and expressing uncertainty, he said, whereas AI struggles
__________________
13 UpToDate is an evidence-based clinical resource created by Wolters Kluwer Health with content that is reviewed and compiled by clinicians and experts on the latest clinical, drug, and patient information. See https://www.wolterskluwer.com/en/solutions/uptodate (accessed October 4, 2024).
14 In people, M. bovis causes “tuberculosis disease that can affect the lungs, lymph nodes, and other parts of the body.” See https://www.cdc.gov/tb/about/m-bovis.html (accessed October 4, 2024).
15 Generative Pre-trained Transformer 4-Vision (GPT-4V) is a multimodal model that allows users to upload images as inputs to perform a specified task, which in this case analyzed medical images.
16 Gemini Ultra is a multimodal model created by Google that is able to complete multiple tasks. See https://deepmind.google/technologies/gemini/ultra/ (accessed October 4, 2024).
to do so. Most radiologists see AI “as supporting our practice of medicine,” said Poff. In assessing an image, AI can see what humans might overlook, which Poff highlighted in a case example of a patient complaining of chest pain. While the physician team was considering a diagnosis of acute coronary syndrome, AI was able to flag a rib fracture. AI can also help make multiple diagnoses because clinical teams may stop looking when they find three or four concerns to address in a patient, without considering whether there might be a fifth potential concern that could be crucial as well. Despite these strengths, Poff said, “AI can lead us astray,” and can make errors, which his team is trying to prevent by training clinicians on common pitfalls in using AI. In one patient case, for example, AI misidentified an abnormal airway as a pulmonary embolism, or blood clot (a false positive); the flag was overruled when a radiologist’s more comprehensive analysis found the issue was actually bronchiectasis.
Yvonne Lui, associate chair for AI at NYU Langone Health, described how AI can improve the cycle of diagnosis by sharing some examples from radiology. AI enables automatic tumor segmentation, providing a quantitative measure of tumor size and of changes over time. In addition, AI can predict outcomes to identify which patients in an overcrowded emergency department need to be admitted or observed; in this example, AI used X-rays and multiple clinical variables to calculate each patient’s risk of deterioration over time and determine prioritization. Lui also described a case in which AI was used to determine what data to acquire when creating an image; for example, sampling only the needed data reduces the time patients spend in an MRI machine. Lui concluded by saying AI has great potential to aid clinical decisions, but the diagnostic process and a patient’s care cycle are composed of several iterative steps: information gathering, testing, assessing, and evaluating. AI can have small but significant applications throughout, but she cautioned against blindly following technology without stopping to consider if one is going in the right direction.
Kadija Ferryman, assistant professor at the Johns Hopkins Bloomberg School of Public Health, discussed the interrelated roles of ethics (including equity), policy, and governance approaches in mitigating racial and ethnic biases in clinical algorithms. Ferryman said conversations around racial and ethnic bias in diagnostic testing have implications for everyone involved. For instance, Ferryman and her colleagues published an article on the importance of examining the scientific validity of using race-based adjustments in pulmonary function calculations, challenging “a racial correction factor [that] had been used for decades” (Brems et al., 2022). The following year, the American Thoracic Society removed race from pulmonary function calculations. However, clinicians and care teams that begin using a global calculator17 will be challenged to interpret previous results and to explain to patients how and why the values may have changed. She said clinical algorithms should be required to account for equity, and as AI use in health care becomes more prevalent, Ferryman argued, federal oversight of equity and of the accuracy of AI tools in clinical care and policy will become much more important. Ferryman and colleagues explored approval for AI devices by the Food and Drug Administration and found that developers rarely include testing of these tools across different demographic groups as part of their application. Participants who were interviewed as part of this study also expressed concern about algorithmic bias in AI tools, potential gaps in implementation, and the possibility of increasing health inequities. Ferryman challenged the idea of “garbage data”—the notion that data should simply be discarded when researchers realize they are biased. Instead of throwing out such data, she suggested that researchers learn why biased data were being collected.
Rather than imputing data from a missing demographic group (as is often done), Ferryman called for looking at why those data are missing. For example, she asked, “Is it because there is a lack of access by that racial group to the clinical context? Is it because there is an earned mistrust of that group or that health care institution?” Empirical research is needed on operationalizing and implementing policies in this area, said Ferryman. While she feels encouraged to see many new ethical frameworks coming out, she said more evaluation research is needed to
__________________
17 “Global calculator” refers to the guidelines for reference values for pulmonary function tests that were developed by the Global Lung Function Initiative. See https://www.ersnet.org/science-and-research/ongoing-clinical-research-collaborations/the-global-lung-function-initiative/ (accessed October 4, 2024).
determine the usefulness of the frameworks. Ferryman suggested that equity be considered in the development of data collection tools. Furthermore, she said, community and interdisciplinary groups need the chance to weigh in on measures of social needs and standardization of data, to avoid requirements for the collection of social risk information that can exacerbate health disparities; for example, questions about drug use are known to be asked disproportionately of racial minority groups.
Michael Cary, associate professor at Duke University, discussed AI approaches to advancing health equity for older adults, who face myriad social and economic inequities alongside medical issues. There are nearly 60 million older adults (ages 65 and older) living in the United States, and that number is expected to increase from 17 percent to 22 percent of the total population by 2040 (ACL, 2024). Many older adults are experiencing poverty, and about a quarter identify with a racial or ethnic minoritized group. Cary added that older adults are living longer in general, while also managing multiple chronic and complex conditions. He said that a 2024 National Academies report on health care inequities describes “the transformational power of technology, particularly within the AI space,” and mentions the EHR as a rich data source for creating diagnostic and treatment plans (NASEM, 2024). However, biased, insufficient, or misrepresentative data on racial and ethnic minoritized groups could result in inappropriate clinical decision making, Cary said. He added that data on sexual orientation or gender identity are missing from 60–70 percent of records in some health systems (Grasso et al., 2019). In a study on the predictive accuracy of stroke risk models, Cary and his team found significant bias—the models were all less accurate for Black individuals compared to White individuals, for men compared to women, and for older adults compared to younger adults (Hong et al., 2023). He said the findings were concerning because they could potentially result in less access to care. Cary suggested strategies to address bias: using more diverse training data by including data from underrepresented groups such as older adults and Black patients; adjusting algorithms to ensure biased data do not result in an under- or overestimation of risk for disease; and continuously monitoring implemented systems to check and adjust for biases.
At the societal level, Cary said, increased awareness and education for health care professionals is needed. While the potential of AI gets a lot of attention, he continued, clinicians need training to use these tools safely.
Irene Dankwa-Mullan, chief health officer of Marti Health, discussed health equity and the role of digital health in AI. She provided context for the intersection of health equity and diagnostic excellence and stated that equitable diagnostic excellence “is the foundational pillar for a fair and just health system … [and] involves a comprehensive, systematic, and patient-centered approach to diagnosis.” Its goal is minimizing errors, she continued, including inherent biases. Key elements include early detection, precision care, and patient-centered care that incorporates patients’ goals, values, and preferences. Dankwa-Mullan said studies indicate that about 12 million U.S. adults experience a diagnostic error annually (Singh et al., 2014), and about 795,000 Americans become permanently disabled or die because of misdiagnosis each year (Newman-Toker et al., 2024). She suggested these numbers underestimate diagnostic errors because they do not fully account for the stark disparities in outcomes that are experienced by socially disadvantaged patient subgroups. She described diagnostic excellence as striving to accurately identify and understand medical conditions while recognizing that the same condition can show up differently in different people, depending on their unique health and backgrounds.
Advancing diagnostic excellence is a necessity, not just a goal, Dankwa-Mullan emphasized. She also discussed the increasing role of technology that integrates seamlessly into people’s lives. Wearable devices, or remote monitoring systems, “are at the forefront of such a transformation,” she said, because they enable continuous tracking and can prompt medical interventions before conditions become severe. Telemedicine and social media platforms provide a “critical bridge for patients in rural and remote areas,” Dankwa-Mullan added. User-friendly mobile applications enable patients to receive personalized insights, advice, and reminders, encouraging people “to take an active role in managing their health.” Other tools can capture patients’ experiences, including their concerns about health care. With predictive analytics,18
__________________
18 Predictive analytics is a way of using data and patterns to make educated guesses about what might happen to patients in the future as they manage their health condition.
AI can help manage chronic conditions and tailor treatment plans. Dankwa-Mullan discussed her current work in managing sickle cell disease,19 which can cause sudden severe episodes of pain, called pain crises. By looking at things like a person’s environment and specific health markers, one can predict when these pain crises are likely to happen. “The diagnostic process needs to start where the patients are,” she said, noting that many pain crises happen in the patient’s home. Predictive analytics tools are one example of how AI is helping advance health equity. Dankwa-Mullan concluded, “We need to embrace AI; we all need to work together and develop culturally appropriate AI models and seize this opportunity to make equitable diagnostic excellence not just an aspiration, but a standard for everyone.”
In the final session, panelists highlighted key themes. Ysabel Duron, The Latino Cancer Institute, and Craig Umscheid, Agency for Healthcare Research and Quality, echoed many of the points raised throughout the workshop on bias and biased data, with Umscheid describing some of the agency’s ongoing work on guiding principles to mitigate bias. Duron raised concerns around language barriers, cultural norms, and ensuring patients are equal partners in deciding their care, rather than relying solely on clinicians and AI. Judy Gichoya, Emory University, added that work in bias and fairness can be challenging because patients, clinicians, and policymakers all want a definitive action plan, but the field will exist “in the gray” as researchers learn more about what AI can and cannot do. Duron commented that cultural competence will need to be incorporated into the clinical workflow to increase patient access and transparency, even with AI. She observed that many people in the Latino community trust clinicians as the experts, and they would need to be given the authority to question the accuracy of their EHRs, adding that records may need to be translated into Spanish for patients to use them. Umscheid echoed earlier comments that the general framework for the diagnostic process needs to begin before the patient engages with the health care system, since signs and symptoms often start much earlier. He raised the question of how to use AI to systematically measure adverse events, noting the potential for measuring diagnostic error and using it to calibrate diagnoses. Prabhjot Singh, Peterson Health Technology Institute, asked how one would know if diagnostic AI tools in clinical use are high-performing, and how these tools can be adopted and scaled. Singh added that the health care system will have to assess the value and determine payment for these technologies. 
Eric Horvitz, Microsoft, said he sees AI systems as “extremely promising for being harnessed to dramatically reduce diagnostic delay, to raise the accuracy of diagnosis, [and] to lower misdiagnosis rates.” Horvitz emphasized the need to address the challenges of translating these technologies into clinical use, including workflow and standards of practice; establishing best practices and tools for reducing overhead; raising the efficiency of testing; and validating AI systems in the clinical setting. In 5–10 years, Horvitz said he hopes AI methods will be integrated seamlessly into clinical workflows and that interfaces will be configured in such a way that “celebrates humans and promotes more human touch versus less.” Additionally, he continued, predictive analytics may be used to guide efforts such as antimicrobial stewardship for high-risk patients; memory jogging for safety measures; pattern recognition for analyses; and multimodal AI systems that analyze language, imagery, genomic, and proteomic data.
Suggestions from workshop participants for improving diagnostic excellence through artificial intelligence and digital health technologies are outlined in Box 1.
ACL (Administration for Community Living). 2024. 2023 profile of older Americans. Washington, DC: Department of Health and Human Services.
Bell, S. K., T. Delbanco, J. G. Elmore, P. S. Fitzgerald, A. Fossa, K. Harcourt, S. G. Leveille, T. H. Payne, R. A. Stametz, J. Walker, and C. M. DesRoches. 2020. Frequency and types of patient-reported errors in electronic health record ambulatory care notes. JAMA Netw Open 3(6):e205867.
__________________
19 Sickle cell disease is a genetic disorder in which distorted red blood cells block blood flow. See https://www.nhlbi.nih.gov/health/sickle-cell-disease (accessed October 4, 2024).
NOTE: This list is the rapporteurs’ summary of points made by the individual speakers identified, and the statements have not been endorsed or verified by the National Academies of Sciences, Engineering, and Medicine. They are not intended to reflect a consensus among workshop participants.
Bell, S. K., P. Folcarelli, A. Fossa, M. Gerard, M. Harper, S. Leveille, C. Moore, K. E. Sands, B. Sarnoff Lee, J. Walker, and F. Bourgeois. 2021. Tackling ambulatory safety risks through patient engagement: What 10,000 patients and families say about safety-related knowledge, behaviors, and attitudes after reading visit notes. J Patient Saf 17(8):e791–e799.
Brems, J. H., K. Ferryman, M. C. McCormack, and J. Sugarman. 2022. Ethical considerations regarding the use of race in pulmonary function testing. Chest 162(4):878–881.
Cabral, S., D. Restrepo, Z. Kanjee, P. Wilson, B. Crowe, R. E. Abdulnour, and A. Rodman. 2024. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern Med 184(5):581–583.
Chimowitz, H., M. Gerard, A. Fossa, F. Bourgeois, and S. K. Bell. 2018. Empowering informal caregivers with health information: OpenNotes as a safety strategy. Jt Comm J Qual Patient Saf 44(3):130–136.
Goh, E., R. Gallo, J. Hom, E. Strong, Y. Weng, H. Kerman, J. A. Cool, Z. Kanjee, A. S. Parsons, N. Ahuja, E. Horvitz, D. Yang, A. Milstein, A. P. J. Olson, A. Rodman, and J. H. Chen. 2024. Large language model influence on diagnostic reasoning: A randomized clinical trial. JAMA Netw Open 7(10):e2440969.
Grasso, C., H. Goldhammer, D. Funk, D. King, S. L. Reisner, K. H. Mayer, and A. S. Keuroghlian. 2019. Required sexual orientation and gender identity reporting by US health centers: First-year data. Am J Public Health 109(8):1111–1118.
Han, T., L. C. Adams, K. K. Bressem, F. Busch, S. Nebelung, and D. Truhn. 2024. Comparative analysis of multimodal large language model performance on clinical vignette questions. JAMA 331(15):1320–1321.
Hong, C., M. J. Pencina, D. M. Wojdyla, J. L. Hall, S. E. Judd, M. Cary, M. M. Engelhard, S. Berchuck, Y. Xian, R. D’Agostino, Sr., G. Howard, B. Kissela, and R. Henao. 2023. Predictive accuracy of stroke risk prediction models across black and white race, sex, and age groups. JAMA 329(4):306–317.
Kessels, R. P. 2003. Patients’ memory for medical information. J R Soc Med 96(5):219–222.
Lam, B. D., F. Bourgeois, C. M. DesRoches, Z. Dong, and S. K. Bell. 2021. Attitudes, experiences, and safety behaviours of adolescents and young adults who read visit notes: Opportunities to engage patients early in their care. Future Healthc J 8(3):e585–e592.
NASEM (National Academies of Sciences, Engineering, and Medicine). 2015. Improving Diagnosis in Health Care. Washington, DC: The National Academies Press. https://doi.org/10.17226/21794.
NASEM. 2024. Ending Unequal Treatment: Strategies to Achieve Equitable Health Care and Optimal Health for All. Washington, DC: The National Academies Press. https://doi.org/10.17226/27820.
Newman-Toker, D. E., N. Nassery, A. C. Schaffer, C. W. Yu-Moe, G. D. Clemens, Z. Wang, Y. Zhu, A. S. Saber Tehrani, M. Fanai, A. Hassoon, and D. Siegal. 2024. Burden of serious harms from diagnostic error in the USA. BMJ Qual Saf 33(2):109–120.
Rodman, A., T. A. Buckley, A. K. Manrai, and D. J. Morgan. 2023. Artificial intelligence vs clinician performance in estimating probabilities of diagnoses before and after testing. JAMA Netw Open 6(12):e2347075.
Saab, K., T. Tu, W.-H. Weng, R. Tanno, D. Stutz, E. Wulczyn, F. Zhang, T. Strother, C. Park, E. Vedadi, J. Zambrano Chaves, S.-Y. Hu, M. Schaekermann, A. Kamath, Y. Cheng, et al. 2024. Capabilities of Gemini models in medicine. arXiv preprint. https://doi.org/10.48550/arXiv.2404.18416.
Singh, H., A. N. Meyer, and E. J. Thomas. 2014. The frequency of diagnostic errors in outpatient care: Estimations from three large observational studies involving US adult populations. BMJ Qual Saf 23(9):727–731.
Singhal, K., T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, and D. Neal. 2023. Towards expert-level medical question answering with large language models. arXiv preprint. https://doi.org/10.48550/arXiv.2305.09617.
Strong, E., A. DiGiammarino, Y. Weng, A. Kumar, P. Hosamani, J. Hom, and J. H. Chen. 2023. Chatbot vs medical student performance on free-response clinical reasoning examinations. JAMA Intern Med 183(9):1028–1030.
DISCLAIMER This Proceedings of a Workshop—in Brief has been prepared by Adrienne Formentos, Jennifer Lalitha Flaubert, and Allison Boman as a factual summary of what occurred at the meeting. The statements made are those of the rapporteurs or individual workshop participants and do not necessarily represent the views of all workshop participants; the planning committee; or the National Academies of Sciences, Engineering, and Medicine.
*The National Academies of Sciences, Engineering, and Medicine’s planning committees are solely responsible for organizing the workshop, identifying topics, and choosing speakers. The responsibility for the published Proceedings of a Workshop—in Brief rests with the institution. Daniel Yang (Chair), Kaiser Permanente; Julia Adler-Milstein, University of California, San Francisco; Thomas Cudjoe, Johns Hopkins School of Medicine; Gene Harkless, University of New Hampshire; Maia Hightower, Equality AI; Michael Howell, Google; Salahuddin Kazi, University of Texas Southwestern Medical Center; David Larson, Stanford University; Pari Pandharipande, Ohio State University; Judy Wawira Gichoya, Emory University.
REVIEWERS To ensure that it meets institutional standards for quality and objectivity, this Proceedings of a Workshop—in Brief was reviewed by Jonathan H. Chen, Stanford University, and Irene Dankwa-Mullan, Marti Health. Leslie J. Sim, National Academies of Sciences, Engineering, and Medicine, served as the review coordinator.
SPONSORS This workshop was partially supported by the American Association of Nurse Practitioners; American Board of Internal Medicine; American College of Radiology; Centers for Disease Control and Prevention; Centers for Medicare and Medicaid Services; Danaher Corporation; The Doctors Company; The Gordon and Betty Moore Foundation; The John A. Hartford Foundation; The Mont Fund; and Radiological Society of North America.
STAFF Jennifer Lalitha Flaubert, Adrienne Formentos, Anesia Wilks (until July 2024), and Sharyl Nass. Board on Health Care Services, Health and Medicine Division, National Academies of Sciences, Engineering, and Medicine.
For additional information regarding the workshop, visit https://www.nationalacademies.org/event/42229_07-2024_diagnosis-in-the-era-of-digital-health-and-artificial-intelligence-a-workshop.
SUGGESTED CITATION National Academies of Sciences, Engineering, and Medicine. 2024. Diagnosis in the era of digital health and artificial intelligence: Proceedings of a workshop—in brief. Washington, DC: The National Academies Press. https://doi.org/10.17226/28571.
Health and Medicine Division. Copyright 2024 by the National Academy of Sciences. All rights reserved.