Abstract: |
OBJECTIVES: As large language models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation and as models and documentation practices evolve. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. This study aimed to validate the PDSQI-9 across key aspects of construct validity.

MATERIALS AND METHODS: Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7B, and Llama 3-8B). Validation included Pearson correlation analyses for substantive validity, factor analysis and Cronbach's α for structural validity, inter-rater reliability (ICC and Krippendorff's α) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Raters underwent standardized training to ensure consistent application of the instrument.

RESULTS: Seven physician raters evaluated 779 summaries and answered 8329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach's α = 0.879; 95% CI, 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI, 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (ρ = -0.200, P = .029) and Organized (ρ = -0.190, P = .037).
The semi-Delphi process ensured clinically relevant attributes, and discriminant validity distinguished high- from low-quality summaries (P < .001).

DISCUSSION: The PDSQI-9 showed high inter-rater reliability, internal consistency, and a meaningful factor structure that reliably captured key dimensions of documentation quality. It distinguished between high- and low-quality summaries, supporting its practical utility for health systems needing an evaluation instrument for LLMs.

CONCLUSIONS: The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer, more effective integration of LLMs into healthcare workflows.

© The Author(s) 2025. Published by Oxford University Press on behalf of the American Medical Informatics Association.
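As an aside, the internal-consistency statistic reported above (Cronbach's α) is straightforward to compute from an item-score matrix. The sketch below is illustrative only, assuming sample variances and a hypothetical `scores` matrix of Likert ratings; it is not the study's data or analysis code.

```python
from statistics import variance

def cronbach_alpha(ratings):
    """Cronbach's alpha for a ratings matrix: rows = rated summaries,
    columns = instrument items. Uses sample variances throughout."""
    k = len(ratings[0])                               # number of items
    item_vars = [variance(col) for col in zip(*ratings)]
    total_var = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical scores: 5 summaries rated on 4 items (1-5 scale)
scores = [
    [4, 4, 5, 4],
    [2, 3, 2, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
]
print(round(cronbach_alpha(scores), 3))  # → 0.959
```

High α here simply reflects that the hypothetical items move together across summaries; in practice, values like the reported 0.879 are computed over all rated summaries and the instrument's nine items.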