Synapse - Targeted generative data augmentation for automatic metastases detection from free-text radiology reports

Targeted generative data augmentation for automatic metastases detection from free-text radiology reports Journal Article

Authors:	Ashofteh Barabadi, M.; Zhu, X.; Chan, W. Y.; Simpson, A. L.; Do, R. K. G.
Article Title:	Targeted generative data augmentation for automatic metastases detection from free-text radiology reports
Abstract:	Automatic identification of metastatic sites in cancer patients from electronic health records is a challenging yet crucial task with significant implications for diagnosis and treatment. In this study, we demonstrate how advancements in natural language processing, namely the instruction-following capability of recent large language models and extensive model pretraining, make it possible to automate metastases detection from radiology report texts with a limited amount of gold-labeled data. Specifically, we prompt Llama3, an open-source instruction-tuned large language model, to generate synthetic training data to expand our limited labeled data and adapt BERT, a small pretrained language model, to the task. We further investigate three targeted data augmentation techniques that selectively expand the original training samples and that, in most cases, match or outperform vanilla data augmentation while being substantially more computationally efficient. In our experiments, data augmentation improved the average F1-score by 2.3, 3.5, and 3.9 points for lung, liver, and adrenal glands, respectively, the organs for which we had access to expert-annotated data. This observation suggests that Llama3, which has not been specifically tailored to this task or to clinical data in general, can generate high-quality synthetic data through paraphrasing in the clinical context. We also compare metastasis identification accuracy between models utilizing institutionally standardized reports and models utilizing non-structured reports, which complicate the extraction of relevant information, and show how including patient history with a customized model architecture narrows the gap between those two setups from 7.3 to 4.5 points on F1-score under LoRA tuning. Our work delivers a broadly applicable solution with strong performance that does not require model customization for each institution, making large-scale, low-cost spatio-temporal cancer progression pattern extraction possible. Copyright © 2025 Ashofteh Barabadi, Zhu, Chan, Simpson and Do.
Keywords:	natural language processing; large language models; metastases detection; free-text radiology report; synthetic data generation; targeted data augmentation
Journal Title:	Frontiers in Artificial Intelligence
Volume:	8
ISSN:	2624-8212
Publisher:	Frontiers Research Foundation
Date Published:	2025-02-06
Start Page:	1513674
Language:	English
DOI:	10.3389/frai.2025.1513674
PROVIDER:	scopus
PMCID:	PMC11839598
PUBMED:	39981192
DOI/URL:	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85218123352&doi=10.3389%2ffrai.2025.1513674&partnerID=40&md5=a759c98d214407d38414ba03e20d112a
Notes:	Article -- Source: Scopus

MSK Authors

261 Do

Related MSK Work

Cancer Type, Stage And Prognosis Assessment From Pathology Reports Using LLMs
Scientific Reports 2025

Empirical Evaluation Of Language Modeling To Ascertain Cancer Outcomes From Clinical Text Reports
BMC Bioinformatics 2023

Adapting Large Language Models For Automatic Annotation Of Radiology Reports For Metastases Detection
Conference proceedings of the Canadian Conference on Electrical and Computer Engineering 2024

Self-Supervised Learning Improves Robustness Of Deep Learning Lung Tumor Segmentation Models To CT Imaging Differences
Medical Physics 2025

Leveraging Time Irreversibility With Order-Contrastive Pre-Training
Proceedings of Machine Learning Research 2022