Authors: |
Jee, J.; Fong, C.; Pichotta, K.; Tran, T. N.; Luthra, A.; Waters, M.; Fu, C.; Altoe, M.; Liu, S. Y.; Maron, S. B.; Ahmed, M.; Kim, S.; Pirun, M.; Chatila, W. K.; de Bruijn, I.; Pasha, A.; Kundra, R.; Gross, B.; Mastrogiacomo, B.; Aprati, T. J.; Liu, D.; Gao, J.; Capelletti, M.; Pekala, K.; Loudon, L.; Perry, M.; Bandlamudi, C.; Donoghue, M.; Satravada, B. A.; Martin, A.; Shen, R.; Chen, Y.; Brannon, A. R.; Chang, J.; Braunstein, L.; Li, A.; Safonov, A.; Stonestrom, A.; Sanchez-Vela, P.; Wilhelm, C.; Robson, M.; Scher, H.; Ladanyi, M.; Reis-Filho, J. S.; Solit, D. B.; Jones, D. R.; Gomez, D.; Yu, H.; Chakravarty, D.; Yaeger, R.; Abida, W.; Park, W.; O’Reilly, E. M.; Garcia-Aguilar, J.; Socci, N.; Sanchez-Vega, F.; Carrot-Zhang, J.; Stetson, P. D.; Levine, R.; Rudin, C. M.; Berger, M. F.; Shah, S. P.; Schrag, D.; Razavi, P.; Kehl, K. L.; Li, B. T.; Riely, G. J.; Schultz, N. |
Contributors: |
Lisman, A.; Gross, B.; Mastrogiacomo, B.; Zhao, G.; de Bruijn, I.; Tran, T. N.; Chatila, W. K.; Li, X.; Kohli, A.; Moore, D.; Lim, R.; Pollard, T.; Sheridan, R.; Wang, A.; Chennault, C.; Wilson, M.; Zhang, H.; Pimienta, R.; Rangavajhala, S.; Garcia, J.; Rachuri, N.; Boehm, K.; Parker, M.; Walch, H.; Nandakumar, S.; Eichholz, J.; Kris, A.; Manca, P.; Bai, X.; Agbamu, T.; U, J.; Bi, X. |
Article Title: |
Automated real-world data integration improves cancer outcome prediction |
Abstract: |
The digitization of health records and growing availability of tumour DNA sequencing provide an opportunity to study the determinants of cancer outcomes with unprecedented richness. Patient data are often stored in unstructured text and siloed datasets. Here we combine natural language processing annotations (refs. 1,2) with structured medication, patient-reported demographic, tumour registry and tumour genomic data from 24,950 patients at Memorial Sloan Kettering Cancer Center to generate a clinicogenomic, harmonized oncologic real-world dataset (MSK-CHORD). MSK-CHORD includes data for non-small-cell lung (n = 7,809), breast (n = 5,368), colorectal (n = 5,543), prostate (n = 3,211) and pancreatic (n = 3,109) cancers and enables discovery of clinicogenomic relationships not apparent in smaller datasets. Leveraging MSK-CHORD to train machine learning models to predict overall survival, we find that models including features derived from natural language processing, such as sites of disease, outperform those based on genomic data or stage alone as tested by cross-validation and an external, multi-institution dataset. By annotating 705,241 radiology reports, MSK-CHORD also uncovers predictors of metastasis to specific organ sites, including a relationship between SETD2 mutation and lower metastatic potential in immunotherapy-treated lung adenocarcinoma corroborated in independent datasets. We demonstrate the feasibility of automated annotation from unstructured notes and its utility in predicting patient outcomes. The resulting data are provided as a public resource for real-world oncologic research. © The Author(s) 2024. |
Keywords: |
immunohistochemistry; human tissue; protein expression; treatment outcome; gene mutation; major clinical study; overall survival; sequence analysis; single nucleotide polymorphism; mutation; histopathology; area under the curve; pancreas cancer; cancer staging; nuclear magnetic resonance imaging; positron emission tomography; follow up; colorectal cancer; gene; cancer immunotherapy; metastasis; computer assisted tomography; breast cancer; hepatocyte nuclear factor 3alpha; incidence; epidermal growth factor receptor 2; cohort analysis; brca2 protein; protein p53; tumor marker; cancer research; gleason score; liver metastasis; lung adenocarcinoma; dna; microsatellite instability; myc protein; phosphatidylinositol 3,4,5 trisphosphate 3 phosphatase; hematopoiesis; checkpoint kinase 2; cyclin dependent kinase inhibitor 2a; tumor; hormone receptor; k ras protein; b raf kinase; fibroblast growth factor receptor 1; language processing; linear regression analysis; programmed death 1 receptor; non small cell lung cancer; data set; nonlinear system; disease registry; patient-reported outcome; protein kinase lkb1; false discovery rate; machine learning; learning algorithm; brg1 protein; rb1 gene; ccne1 gene; nf1 gene; cancer prognosis; data processing; ar gene; kelch like ech associated protein 1; setd2 gene; ccnd1 gene; measurement accuracy; cancer; human; male; female; article; data integration; random forest; whole exome sequencing; esr1 gene; measurement precision; people by smoking status; cross validation; digitization; cancer outcome prediction; nkx2-1 gene; nkx3-1 gene; real world data integration
|
Journal Title: |
Nature
|
Volume: |
636 |
Issue: |
8043 |
ISSN: |
0028-0836 |
Publisher: |
Nature Publishing Group
|
Date Published: |
2024-12-19 |
Start Page: |
728 |
End Page: |
736 |
Language: |
English |
DOI: |
10.1038/s41586-024-08167-5
|
PUBMED: |
39506116
|
PROVIDER: |
scopus
|
PMCID: |
PMC11655358
|
DOI/URL: |
|
Notes: |
The MSK Cancer Center Support Grant (P30 CA008748) is acknowledged in the PubMed record and PDF. Corresponding MSK author is Nikolaus Schultz -- Source: Scopus |