The Institute for Health Metrics and Evaluation’s mapping of drug–use pairs to Global Burden of Disease (GBD) categories involved the following steps: (1) we identified drug uses in the Evaluate Pharma database, covering current drugs for the top 20 pharmaceutical companies and pipeline drugs for all companies; (2) for validation, we manually mapped drug–use pairs to GBD conditions (causes, risk factors, impairments, injuries, or pathogens) for two companies’ current and pipeline portfolios; (3) we then applied a large language model (LLM) to assign drug–use pairs to GBD categories, using the manual mappings as a benchmark for optimizing our input configuration; (4) we used the highest-performing LLM method to map the current portfolios of the top 20 pharmaceutical companies and the pipeline portfolios of all companies; and (5) we compared these pharmaceutical portfolios by GBD cause to the respective disease burden. The remaining sections in this document provide additional information about each of these steps.
We used the Evaluate Pharma database to identify both current pharmaceutical products and pipeline pharmaceutical products. To discover all uses for each of the current drugs, we mapped drug names from the Evaluate Pharma database to reference sources (e.g., Redbook) that specify the use of each drug. For pipeline drugs, we relied on the “specified use” variable in the Evaluate Pharma database.
To assess and optimize the performance of the LLM-based mapping, we created a validation dataset from Pfizer and Sanofi’s current and pipeline drug portfolios. Two independent coders mapped each drug–use pair to GBD cause, risk factor, and injury codes, with a third reviewer resolving any discrepancies. We also compared LLM-based assignments to manual mappings to refine the validation dataset. In addition to causes, other entities were included as options for mapping. The final mapping included 334 causes, 47 injury codes, 18 noncause groupings, 4 risk factors, and the heart failure impairment.
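The double-coding step described above can be sketched as follows. This is a minimal illustration, not the study's actual tooling: the function names `reconcile` and `finalize`, the dictionary layout, and the example drug names are all hypothetical.

```python
def reconcile(coder_a, coder_b):
    """Split two coders' mappings into agreed and disputed drug-use pairs.

    Each input maps a (drug, use) tuple to a GBD condition name.
    A pair coded by only one coder is treated as disputed (assumption).
    """
    agreed, disputed = {}, {}
    for pair in coder_a.keys() | coder_b.keys():
        a, b = coder_a.get(pair), coder_b.get(pair)
        if a == b:
            agreed[pair] = a
        else:
            disputed[pair] = (a, b)
    return agreed, disputed


def finalize(agreed, disputed, reviewer_decisions):
    """Merge agreed mappings with the third reviewer's decisions."""
    final = dict(agreed)
    for pair in disputed:
        # Raises KeyError if the reviewer has not resolved a disputed pair.
        final[pair] = reviewer_decisions[pair]
    return final


# Hypothetical example data
coder_a = {("drug_a", "use_1"): "Breast cancer", ("drug_b", "use_2"): "Stroke"}
coder_b = {("drug_a", "use_1"): "Breast cancer",
           ("drug_b", "use_2"): "Ischemic heart disease"}
agreed, disputed = reconcile(coder_a, coder_b)
validation_set = finalize(agreed, disputed, {("drug_b", "use_2"): "Stroke"})
```

Raising an error on unresolved pairs, rather than silently dropping them, keeps every drug–use pair in the validation set accounted for.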
We supplied the LLM with drug–use pairs and a list of GBD conditions, instructing it to identify the most relevant condition. We refined the prompt to enhance accuracy, using our validation set to evaluate improvements. We also tested different foundational models, including GPT-4, o1-mini, and o1-preview. In addition to prompt refinement, we undertook a range of performance optimization approaches. These included the provision of condition keywords generated through a separate LLM process and an adjudication process, whereby we used multiple LLM instances, each with its own medical specialty focus, with a final LLM instance determining the most likely condition assignment.
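The control flow of the adjudication process can be sketched as below. This is a hedged outline under several assumptions: the LLM calls are represented by plain Python callables (no real model API is used), the names `Assignment` and `classify_with_adjudication` are hypothetical, and including the initial answer among the candidates is our guess rather than a documented detail. The `threshold` default of 0.80 reflects the confidence cutoff reported for the best-performing configuration.

```python
from typing import Callable, NamedTuple


class Assignment(NamedTuple):
    condition: str
    confidence: float  # assumed to be on a 0-1 scale


# Stand-ins for LLM calls: each takes (drug, use) and returns an Assignment.
Classifier = Callable[[str, str], Assignment]
# Stand-in for the final LLM instance: picks one condition from the candidates.
Adjudicator = Callable[[str, str, list], str]


def classify_with_adjudication(drug, use, primary: Classifier,
                               specialists: list, adjudicator: Adjudicator,
                               threshold: float = 0.80) -> str:
    """Return the primary answer if confident; otherwise adjudicate."""
    first = primary(drug, use)
    if first.confidence > threshold:
        return first.condition
    # Confidence at or below the threshold: gather specialist opinions.
    candidates = [first.condition]
    candidates += [s(drug, use).condition for s in specialists]
    unique = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    return adjudicator(drug, use, unique)


# Hypothetical stubs in place of real model calls
primary = lambda d, u: Assignment("Breast cancer", 0.70)
oncology = lambda d, u: Assignment("Breast cancer", 0.90)
cardiology = lambda d, u: Assignment("Stroke", 0.40)
pick_first = lambda d, u, candidates: candidates[0]

result = classify_with_adjudication("drug_a", "use_1", primary,
                                    [oncology, cardiology], pick_first)
```

Gating the adjudication on confidence keeps the more expensive multi-instance step for the cases where the initial classification is least certain.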
The table below describes concordance with the manual mappings for different LLM approaches, which vary according to the foundational model used, whether condition keywords were provided to the LLM, and whether an adjudication process was used. We evaluated concordance at the four levels of the GBD cause hierarchy, with higher levels indicating greater granularity.

| Approach | Level 1 Cause | Level 2 Cause | Level 3 Cause | Level 4 Cause |
|---|---|---|---|---|
| o1-preview with keywords, adjudicated | 98.5% | 96.0% | 93.9% | 92.8% |
| o1-preview with keywords | 98.3% | 95.3% | 93.0% | 93.0% |
| o1-preview without keywords | 97.0% | 91.8% | 84.8% | 83.8% |
| o1-mini without keywords | 97.1% | 90.5% | 83.5% | 85.7% |
| o1-mini with keywords | 97.3% | 91.6% | 86.5% | 91.7% |
| GPT-4 with keywords | 95.3% | 87.5% | 80.1% | 85.6% |

The highest-performing approach was one that uses the o1-preview foundational LLM, condition keywords, and adjudication (limited to instances where the initial classification by the LLM had a confidence level less than or equal to 80 percent).
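One way to compute concordance at each level of the cause hierarchy is sketched below. It assumes each condition is represented by its path of cause names from Level 1 downward, and that a pair counts toward level k only when the manual mapping reaches that level; both are our assumptions, since the denominator rule is not stated here. The function name `concordance_by_level` and the example paths are illustrative and are not drawn from the study's data.

```python
def concordance_by_level(pairs, max_level=4):
    """Share of pairs where the LLM path matches the manual path at each level.

    `pairs` is a list of (manual_path, llm_path) tuples; each path is a tuple
    of cause names ordered from Level 1 to the most detailed level assigned.
    """
    result = {}
    for level in range(1, max_level + 1):
        # Only pairs whose manual mapping is defined at this level are eligible.
        eligible = [(m, l) for m, l in pairs if len(m) >= level]
        if not eligible:
            result[level] = None
            continue
        # An LLM path shorter than `level` cannot match and counts as a miss.
        hits = sum(m[:level] == l[:level] for m, l in eligible)
        result[level] = hits / len(eligible)
    return result


NCD = "Non-communicable diseases"
pairs = [
    ((NCD, "Neoplasms", "Breast cancer"),
     (NCD, "Neoplasms", "Breast cancer")),
    ((NCD, "Cardiovascular diseases", "Stroke", "Ischemic stroke"),
     (NCD, "Cardiovascular diseases", "Stroke", "Intracerebral hemorrhage")),
    ((NCD, "Neoplasms", "Breast cancer"),
     (NCD, "Cardiovascular diseases", "Stroke")),
]
scores = concordance_by_level(pairs)
```

Under this denominator rule, fewer pairs are eligible at Level 4 than at Level 3, so concordance need not fall monotonically as the level increases.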
Using Evaluate Pharma, we extracted the most recent product data as of February 2025. We then applied our most accurate LLM method to classify the complete dataset, which includes over 7,000 current and pipeline products from the top 20 companies and over 37,000 additional pipeline products from other companies. We made some adjustments to the LLM outputs: for a small number of cases where the LLM’s assignments did not match any valid condition in our hierarchy, we manually mapped the drug–use pairs to the correct condition.
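The check for assignments that do not match a valid condition could look like the sketch below. The function name `split_valid` is hypothetical, and the tolerance for differences in case and surrounding whitespace is our assumption; the source does not say how matches were determined.

```python
def split_valid(assignments, valid_conditions):
    """Separate LLM assignments that match the hierarchy from those that do not.

    `assignments` maps (drug, use) tuples to the condition name returned by
    the LLM. Returns the validated mapping (using canonical names) and a list
    of pairs that need manual mapping.
    """
    # Case-insensitive lookup back to the canonical condition name (assumption).
    canonical = {name.casefold(): name for name in valid_conditions}
    valid, needs_manual = {}, []
    for pair, condition in assignments.items():
        key = condition.strip().casefold()
        if key in canonical:
            valid[pair] = canonical[key]
        else:
            needs_manual.append(pair)
    return valid, needs_manual


# Hypothetical example: the second output is misspelled and gets flagged.
valid, needs_manual = split_valid(
    {("drug_a", "use_1"): " breast cancer ",
     ("drug_b", "use_2"): "Brest cancer"},
    ["Breast cancer", "Stroke"],
)
```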
This analysis encompassed pharmaceutical products globally, both on-market and in development. We compared current drugs to 2021 disease burden and pipeline drugs to forecasted 2030 disease burden, both as defined by GBD 2021.
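As a rough illustration of comparing a portfolio to disease burden by cause, the sketch below computes each cause's share of products alongside its share of burden. The comparison metric is not specified in this document, so the share-based approach, the function names, and the burden measure (for example, DALYs) are all assumptions, and the numbers are placeholders rather than real estimates.

```python
from collections import Counter


def share(counts):
    """Convert per-cause totals into fractions that sum to 1."""
    total = sum(counts.values())
    return {cause: value / total for cause, value in counts.items()}


def compare_to_burden(product_causes, burden_by_cause):
    """Return {cause: (product_share, burden_share)} for every cause seen."""
    product_share = share(Counter(product_causes))
    burden_share = share(burden_by_cause)
    causes = sorted(set(product_share) | set(burden_share))
    return {c: (product_share.get(c, 0.0), burden_share.get(c, 0.0))
            for c in causes}


# Placeholder inputs: one entry per mapped product, and burden per cause.
product_causes = ["Neoplasms"] * 3 + ["Cardiovascular diseases"]
burden_by_cause = {
    "Neoplasms": 10.0,
    "Cardiovascular diseases": 20.0,
    "Maternal and neonatal disorders": 10.0,
}
comparison = compare_to_burden(product_causes, burden_by_cause)
```

Causes with burden but no mapped products appear with a product share of zero, which makes gaps between portfolio focus and burden easy to spot.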