Synapse - Improved prediction of drug-induced liver injury literature using natural language processing and machine learning methods

Improved prediction of drug-induced liver injury literature using natural language processing and machine learning methods Journal Article

Authors:	Oh, J. H.; Tannenbaum, A.; Deasy, J. O.
Article Title:	Improved prediction of drug-induced liver injury literature using natural language processing and machine learning methods
Abstract:	Drug-induced liver injury (DILI) is an adverse hepatic drug reaction that can potentially lead to life-threatening liver failure. Previously published work in the scientific literature on DILI has provided valuable insights for the understanding of hepatotoxicity as well as drug development. However, the manual search of scientific literature in PubMed is laborious and time-consuming. Natural language processing (NLP) techniques along with artificial intelligence/machine learning approaches may allow for automatic identification of DILI-related literature, but useful methods are yet to be demonstrated. To address this issue, we have developed an integrated NLP/machine learning classification model to identify DILI-related literature using only paper titles and abstracts. For prediction modeling, we used 14,203 publications provided by the Critical Assessment of Massive Data Analysis (CAMDA) challenge, employing word vectorization techniques in NLP in conjunction with machine learning methods. Classification modeling was performed using 2/3 of the data for training and the remainder for testing in internal validation. The best performance was achieved using a linear support vector machine (SVM) model on the combined vectors derived from term frequency-inverse document frequency (TF-IDF) and Word2Vec, resulting in an accuracy of 95.0% and an F1-score of 95.0%. The final SVM model constructed from all 14,203 publications was tested on independent datasets, resulting in accuracies of 92.5%, 96.3%, and 98.3%, and F1-scores of 93.5%, 86.1%, and 75.6% for three test sets (T1-T3). Furthermore, the SVM model was tested on four external validation sets (V1-V4), resulting in accuracies of 92.0%, 96.2%, 98.3%, and 93.1%, and F1-scores of 92.4%, 82.9%, 75.0%, and 93.3%. Copyright © 2023 Oh, Tannenbaum and Deasy.
Keywords:	validation process; diagnostic accuracy; prediction; artificial intelligence; logistic regression analysis; drug-induced liver injury; machine learning; natural language processing; human; article; random forest; tf-idf; word2vec; linear support vector machine
Journal Title:	Frontiers in Genetics
Volume:	14
ISSN:	1664-8021
Publisher:	Frontiers Media S.A.
Date Published:	2023-07-17
Start Page:	1161047
Language:	English
DOI:	10.3389/fgene.2023.1161047
PROVIDER:	scopus
PMCID:	PMC10390074
PUBMED:	37529777
DOI/URL:	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85166435985&doi=10.3389%2ffgene.2023.1161047&partnerID=40&md5=a4f8208bba076a4864596ff1db7706e4
Notes:	The MSK Cancer Center Support Grant (P30 CA008748) is acknowledged in the PDF -- Corresponding author is MSK author: Jung Hun Oh -- Source: Scopus
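
The abstract reports accuracy and F1-score side by side, and on some sets (for example T3 and V3) accuracy is near 98% while F1 is near 75%. The sketch below shows how the two metrics are computed from confusion-matrix counts and how they can diverge. One common cause of such a gap is a small positive class, but the abstract does not report class proportions, so that reading is an assumption. All counts below are hypothetical and are not taken from the paper; the function name is illustrative only.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) for the positive class from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision and recall are defined on the positive class only.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Hypothetical balanced set: 500 positives, 500 negatives.
balanced = accuracy_and_f1(tp=475, fp=25, fn=25, tn=475)

# Hypothetical imbalanced set: 40 positives, 960 negatives.
# The many true negatives raise accuracy but do not enter F1.
imbalanced = accuracy_and_f1(tp=30, fp=10, fn=10, tn=950)

print(f"balanced:   accuracy={balanced[0]:.3f}, F1={balanced[1]:.3f}")
print(f"imbalanced: accuracy={imbalanced[0]:.3f}, F1={imbalanced[1]:.3f}")
```

With these made-up counts, the balanced case gives matching accuracy and F1, while the imbalanced case gives a higher accuracy and a lower F1, because true negatives count toward accuracy only.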

MSK Authors

188 Oh
527 Deasy

Related MSK Work

Screening For Obstructive Sleep Apnea In Patients With Cancer A Machine Learning Approach
SLEEP Advances 2023

Deep Learning, Sparse Coding, And SVM For Melanoma Recognition In Dermoscopy Images
Lecture Notes in Computer Science 2015

A Simultaneous Multiparametric (18)F FDG PET/MRI Radiomics Model For The Diagnosis Of Triple Negative Breast Cancer
Cancers 2022

Natural Language Processing Of Large Scale Structured Radiology Reports To Identify Oncologic Patients With Or Without Splenomegaly Over A 10 Year Period
JCO Clinical Cancer Informatics 2022

Automated Identification Of Patients With Immune Related Adverse Events From Clinical Notes Using Word Embedding And Machine Learning
JCO Clinical Cancer Informatics 2021