BI-RADS category assignments by GPT-3.5, GPT-4, and Google Bard: A multilanguage study
Type: Journal Article


Authors: Cozzi, A.; Pinker, K.; Hidber, A.; Zhang, T.; Bonomo, L.; Lo Gullo, R.; Christianson, B.; Curti, M.; Rizzo, S.; Del Grande, F.; Mann, R. M.; Schiaffino, S.
Article Title: BI-RADS category assignments by GPT-3.5, GPT-4, and Google Bard: A multilanguage study
Abstract: Background: The performance of publicly available large language models (LLMs) remains unclear for complex clinical tasks. Purpose: To evaluate the agreement between human readers and LLMs for Breast Imaging Reporting and Data System (BI-RADS) categories assigned based on breast imaging reports written in three languages and to assess the impact of discordant category assignments on clinical management. Materials and Methods: This retrospective study included reports for women who underwent MRI, mammography, and/or US for breast cancer screening or diagnostic purposes at three referral centers. Reports with findings categorized as BI-RADS 1–5 and written in Italian, English, or Dutch were collected between January 2000 and October 2023. Board-certified breast radiologists and the LLMs GPT-3.5 and GPT-4 (OpenAI) and Bard, now called Gemini (Google), assigned BI-RADS categories using only the findings described by the original radiologists. Agreement between human readers and LLMs for BI-RADS categories was assessed using the Gwet agreement coefficient (AC1 value). Frequencies were calculated for changes in BI-RADS category assignments that would affect clinical management (ie, BI-RADS 0 vs BI-RADS 1 or 2 vs BI-RADS 3 vs BI-RADS 4 or 5) and compared using the McNemar test. Results: Across 2400 reports, agreement between the original and reviewing radiologists was almost perfect (AC1 = 0.91), while agreement between the original radiologists and GPT-4, GPT-3.5, and Bard was moderate (AC1 = 0.52, 0.48, and 0.42, respectively). Across human readers and LLMs, differences were observed in the frequency of BI-RADS category upgrades or downgrades that would result in changed clinical management (118 of 2400 [4.9%] for human readers, 611 of 2400 [25.5%] for Bard, 573 of 2400 [23.9%] for GPT-3.5, and 435 of 2400 [18.1%] for GPT-4; P < .001) and that would negatively impact clinical management (37 of 2400 [1.5%] for human readers, 435 of 2400 [18.1%] for Bard, 344 of 2400 [14.3%] for GPT-3.5, and 255 of 2400 [10.6%] for GPT-4; P < .001). Conclusion: LLMs achieved moderate agreement with human reader–assigned BI-RADS categories across reports written in three languages but also yielded a high percentage of discordant BI-RADS categories that would negatively impact clinical management. © RSNA, 2024.
Keywords: adult; aged; middle aged; retrospective study; clinical trial; magnetic resonance imaging; breast cancer; image analysis; breast; echomammography; cancer screening; diagnostic imaging; breast neoplasms; radiologist; mammography; multicenter study; breast tumor; radiology information systems; language; ultrasonography, mammary; procedures; breast imaging reporting and data system; humans; female; article; malignant neoplasm; large language model; chatgpt
Journal Title: Radiology
Volume: 311
Issue: 1
ISSN: 0033-8419
Publisher: Radiological Society of North America, Inc.
Date Published: 2024-04-01
Start Page: e232133
Language: English
DOI: 10.1148/radiol.232133
PUBMED: 38687216
PROVIDER: scopus
PMCID: PMC11070611
Notes: The MSK Cancer Center Support Grant (P30 CA008748) is acknowledged in the PubMed record and PDF. The corresponding MSK author is Katja Pinker. Source: Scopus