Synapse - Evaluating large language models on their accuracy and completeness

Evaluating large language models on their accuracy and completeness Journal Article

Authors:	Edalat, C.; Kirupaharan, N.; Dalvin, L. A.; Mishra, K.; Marshall, R.; Xu, H.; Francis, J. H.; Berkenstock, M.
Article Title:	Evaluating large language models on their accuracy and completeness
Abstract:	Purpose: To analyze the accuracy and thoroughness of three large language models (LLMs) to produce information for providers about immune checkpoint inhibitor ocular toxicities. Methods: Eight questions were created about the general definition of checkpoint inhibitors, their mechanism of action, ocular toxicities, and toxicity management. All were inputted into ChatGPT 4.0, Bard, and LLaMA programs. Using the six-point Likert scale for accuracy and completeness, four ophthalmologists who routinely treat ocular toxicities of immunotherapy agents rated the LLMs' answers. Analysis of variance testing was used to assess significant differences among the three LLMs and a post hoc pairwise t-test. Fleiss kappa values were calculated to account for interrater variability. Results: ChatGPT responses were rated with an average of 4.59 for accuracy and 4.09 for completeness; Bard answers were rated 4.59 and 4.19; LLaMA results were rated 4.38 and 4.03. The three LLMs did not significantly differ in accuracy (P = 0.47) nor completeness (P = 0.86). Fleiss kappa values were found to be poor for both accuracy (-0.03) and completeness (0.01). Conclusion: All three LLMs provided highly accurate and complete responses to questions centered on immune checkpoint inhibitor ocular toxicities and management. Further studies are needed to assess specific immune checkpoint inhibitor agents and the accuracy and completeness of updated versions of LLMs.
Keywords:	artificial intelligence; immune checkpoint inhibitors; large language models; ocular adverse effects
Journal Title:	Retina
Volume:	45
Issue:	1
ISSN:	0275-004X
Publisher:	Wolters Kluwer
Date Published:	2025-01-01
Start Page:	128
End Page:	132
Language:	English
ACCESSION:	WOS:001381965800014
DOI:	10.1097/iae.0000000000004271
PROVIDER:	wos
PUBMED:	39312883
Notes:	Article -- Source: Wos

MSK Authors

267 Francis

Related MSK Work

Accuracy And Completeness Of Large Language Models About Antibody Drug Conjugates And Associated Ocular Adverse Effects
Cornea 2025

Conformity Of ChatGPT Recommendations With The AUA/SUFU Guideline On Postprostatectomy Urinary Incontinence
Neurourology and Urodynamics 2024

The Accuracy Of Artificial Intelligence ChatGPT In Oncology Examination Questions
Journal of the American College of Radiology 2024

Integrating Artificial Intelligence In Renal Cell Carcinoma: Evaluating ChatGPT’s Performance In Educating Patients And Trainees
Translational Cancer Research 2024

Comparing Large Language Models For Antibiotic Prescribing In Different Clinical Scenarios: Which Performs Better?
Clinical Microbiology and Infection 2025