Targeted generative data augmentation for automatic metastases detection from free-text radiology reports Journal Article


Authors: Ashofteh Barabadi, M.; Zhu, X.; Chan, W. Y.; Simpson, A. L.; Do, R. K. G.
Article Title: Targeted generative data augmentation for automatic metastases detection from free-text radiology reports
Abstract: Automatic identification of metastatic sites in cancer patients from electronic health records is a challenging yet crucial task with significant implications for diagnosis and treatment. In this study, we demonstrate how advances in natural language processing, namely the instruction-following capability of recent large language models and extensive model pretraining, make it possible to automate metastases detection from radiology report texts with a limited amount of gold-labeled data. Specifically, we prompt Llama3, an open-source instruction-tuned large language model, to generate synthetic training data that expands our limited labeled data, and we adapt BERT, a small pretrained language model, to the task. We further investigate three targeted data augmentation techniques that selectively expand the original training samples; in most cases these achieve performance comparable or superior to vanilla data augmentation while being substantially more computationally efficient. In our experiments, data augmentation improved the average F1-score by 2.3, 3.5, and 3.9 points for lung, liver, and adrenal glands, respectively, the organs for which we had access to expert-annotated data. This observation suggests that Llama3, which has not been specifically tailored to this task or to clinical data in general, can generate high-quality synthetic data through paraphrasing in the clinical context. We also compare metastasis identification accuracy between models using institutionally standardized reports and models using non-structured reports, which complicate the extraction of relevant information, and show how including patient history with a customized model architecture narrows the gap between the two setups from 7.3 to 4.5 F1 points under LoRA tuning. Our work delivers a broadly applicable solution with remarkable performance that does not require model customization for each institution, making large-scale, low-cost spatio-temporal cancer progression pattern extraction possible. Copyright © 2025 Ashofteh Barabadi, Zhu, Chan, Simpson and Do.
Keywords: natural language processing; large language models; metastases detection; free-text radiology report; synthetic data generation; targeted data augmentation
Journal Title: Frontiers in Artificial Intelligence
Volume: 8
ISSN: 2624-8212
Publisher: Frontiers Research Foundation
Date Published: 2025-02-06
Start Page: 1513674
Language: English
DOI: 10.3389/frai.2025.1513674
PROVIDER: scopus
PMCID: PMC11839598
PUBMED: 39981192
Notes: Article -- Source: Scopus
MSK Authors
  1. Kinh Gian Do