Automated real-world data integration improves cancer outcome prediction (Journal Article)


Authors: Jee, J.; Fong, C.; Pichotta, K.; Tran, T. N.; Luthra, A.; Waters, M.; Fu, C.; Altoe, M.; Liu, S. Y.; Maron, S. B.; Ahmed, M.; Kim, S.; Pirun, M.; Chatila, W. K.; de Bruijn, I.; Pasha, A.; Kundra, R.; Gross, B.; Mastrogiacomo, B.; Aprati, T. J.; Liu, D.; Gao, J.; Capelletti, M.; Pekala, K.; Loudon, L.; Perry, M.; Bandlamudi, C.; Donoghue, M.; Satravada, B. A.; Martin, A.; Shen, R.; Chen, Y.; Brannon, A. R.; Chang, J.; Braunstein, L.; Li, A.; Safonov, A.; Stonestrom, A.; Sanchez-Vela, P.; Wilhelm, C.; Robson, M.; Scher, H.; Ladanyi, M.; Reis-Filho, J. S.; Solit, D. B.; Jones, D. R.; Gomez, D.; Yu, H.; Chakravarty, D.; Yaeger, R.; Abida, W.; Park, W.; O’Reilly, E. M.; Garcia-Aguilar, J.; Socci, N.; Sanchez-Vega, F.; Carrot-Zhang, J.; Stetson, P. D.; Levine, R.; Rudin, C. M.; Berger, M. F.; Shah, S. P.; Schrag, D.; Razavi, P.; Kehl, K. L.; Li, B. T.; Riely, G. J.; Schultz, N.
Contributors: Lisman, A.; Gross, B.; Mastrogiacomo, B.; Zhao, G.; de Bruijn, I.; Tran, T. N.; Chatila, W. K.; Li, X.; Kohli, A.; Moore, D.; Lim, R.; Pollard, T.; Sheridan, R.; Wang, A.; Chennault, C.; Wilson, M.; Zhang, H.; Pimienta, R.; Rangavajhala, S.; Garcia, J.; Rachuri, N.; Boehm, K.; Parker, M.; Walch, H.; Nandakumar, S.; Eichholz, J.; Kris, A.; Manca, P.; Bai, X.; Agbamu, T.; U, J.; Bi, X.
Article Title: Automated real-world data integration improves cancer outcome prediction
Abstract: The digitization of health records and growing availability of tumour DNA sequencing provide an opportunity to study the determinants of cancer outcomes with unprecedented richness. Patient data are often stored in unstructured text and siloed datasets. Here we combine natural language processing annotations (refs. 1,2) with structured medication, patient-reported demographic, tumour registry and tumour genomic data from 24,950 patients at Memorial Sloan Kettering Cancer Center to generate a clinicogenomic, harmonized oncologic real-world dataset (MSK-CHORD). MSK-CHORD includes data for non-small-cell lung (n = 7,809), breast (n = 5,368), colorectal (n = 5,543), prostate (n = 3,211) and pancreatic (n = 3,109) cancers and enables discovery of clinicogenomic relationships not apparent in smaller datasets. Leveraging MSK-CHORD to train machine learning models to predict overall survival, we find that models including features derived from natural language processing, such as sites of disease, outperform those based on genomic data or stage alone as tested by cross-validation and an external, multi-institution dataset. By annotating 705,241 radiology reports, MSK-CHORD also uncovers predictors of metastasis to specific organ sites, including a relationship between SETD2 mutation and lower metastatic potential in immunotherapy-treated lung adenocarcinoma corroborated in independent datasets. We demonstrate the feasibility of automated annotation from unstructured notes and its utility in predicting patient outcomes. The resulting data are provided as a public resource for real-world oncologic research. © The Author(s) 2024.
Keywords: immunohistochemistry; human tissue; protein expression; treatment outcome; gene mutation; major clinical study; overall survival; sequence analysis; single nucleotide polymorphism; mutation; histopathology; area under the curve; pancreas cancer; cancer staging; nuclear magnetic resonance imaging; positron emission tomography; follow up; colorectal cancer; gene; cancer immunotherapy; metastasis; computer assisted tomography; breast cancer; hepatocyte nuclear factor 3alpha; incidence; epidermal growth factor receptor 2; cohort analysis; brca2 protein; protein p53; tumor marker; cancer research; gleason score; liver metastasis; lung adenocarcinoma; dna; microsatellite instability; myc protein; phosphatidylinositol 3,4,5 trisphosphate 3 phosphatase; hematopoiesis; checkpoint kinase 2; cyclin dependent kinase inhibitor 2a; tumor; hormone receptor; k ras protein; b raf kinase; fibroblast growth factor receptor 1; language processing; linear regression analysis; programmed death 1 receptor; non small cell lung cancer; data set; nonlinear system; disease registry; patient-reported outcome; protein kinase lkb1; false discovery rate; machine learning; learning algorithm; brg1 protein; rb1 gene; ccne1 gene; nf1 gene; cancer prognosis; data processing; ar gene; kelch like ech associated protein 1; setd2 gene; ccnd1 gene; measurement accuracy; cancer; human; male; female; article; data integration; random forest; whole exome sequencing; esr1 gene; measurement precision; people by smoking status; cross validation; digitization; cancer outcome prediction; nkx2-1 gene; nkx3-1 gene; real world data integration
Journal Title: Nature
Volume: 636
Issue: 8043
ISSN: 0028-0836
Publisher: Nature Publishing Group
Date Published: 2024-12-19
Start Page: 728
End Page: 736
Language: English
DOI: 10.1038/s41586-024-08167-5
PUBMED: 39506116
PROVIDER: scopus
PMCID: PMC11655358
Notes: The MSK Cancer Center Support Grant (P30 CA008748) is acknowledged in the PubMed record and PDF. Corresponding MSK author is Nikolaus Schultz -- Source: Scopus
MSK Authors
  1. Mark E Robson
    676 Robson
  2. David Solit
    775 Solit
  3. Ronglai Shen
    203 Shen
  4. Deborah Schrag
    226 Schrag
  5. Helena Alexandra Yu
    275 Yu
  6. Daniel R Gomez
    231 Gomez
  7. Marc Ladanyi
    1322 Ladanyi
  8. Gregory J Riely
    598 Riely
  9. Rona Denit Yaeger
    313 Yaeger
  10. Ross Levine
    769 Levine
  11. Mono Pirun
    18 Pirun
  12. Eileen O'Reilly
    774 O'Reilly
  13. Nicholas D Socci
    264 Socci
  14. Michael Forman Berger
    760 Berger
  15. Howard Scher
    1125 Scher
  16. Jianjiong Gao
    131 Gao
  17. Wassim Abida
    151 Abida
  18. Nikolaus D Schultz
    481 Schultz
  19. Benjamin E Gross
    44 Gross
  20. Angela Rose Brannon
    98 Brannon
  21. Manda E Wilson
    18 Wilson
  22. Raymond Sear Lim
    57 Lim
  23. Charles Rudin
    483 Rudin
  24. Pedram Razavi
    170 Razavi
  25. David Randolph Jones
    413 Jones
  26. Jason Chih-Peng Chang
    130 Chang
  27. Bob Tingkan Li
    276 Li
  28. Peter D Stetson
    43 Stetson
  29. Lisa D Loudon
    4 Loudon
  30. Xiang Li
    70 Li
  31. Ritika Kundra
    86 Kundra
  32. Hongxin Zhang
    46 Zhang
  33. Walid Khaled Chatila
    101 Chatila
  34. David Liu
    2 Liu
  35. Sohrab Prakash Shah
    85 Shah
  36. Christopher Joseph Fong
    41 Fong
  37. Wungki Park
    97 Park
  38. Axel Stephen Martin
    19 Martin
  39. Aaron Samuel Lisman
    12 Lisman
  40. Avery J Wang
    12 Wang
  41. Steven Maron
    101 Maron
  42. Gaofei Zhao
    9 Zhao
  43. Henry Stuart Walch
    97 Walch
  44. Clare Jon Wilhelm
    25 Wilhelm
  45. Anisha Luthra
    26 Luthra
  46. Justin Jee
    51 Jee
  47. Anyi Li
    17 Li
  48. Anton Safonov
    29 Safonov
  49. Kevin Michael Boehm
    12 Boehm
  50. Arfath Pasha
    6 Pasha
  51. Thinh Ngoc Tran
    10 Tran
  52. Yuan Chen
    37 Chen
  53. Maria Perry
    6 Perry
  54. Michele Waters
    10 Waters
  55. Armaan Kohli
    4 Kohli
  56. Mehnaj Sarah Ahmed
    7 Ahmed
  57. Kelly Rose Pekala
    7 Pekala
  58. Mirella Lorrainy Altoe
    6 Altoe
  59. Si-Yang Liu
    7 Liu
  60. Susie Kim
    5 Kim
  61. Xuechun Bai
    3 Bai
  62. Paolo Manca
    6 Manca
  63. Ayush Vinu Kris
    2 Kris
  64. Chenlian Fu
    1 Fu
  65. Tejiri Ezioghene Agbamu
    1 Agbamu
  66. Mitchell I. Parker
    1 Parker
  67. Jowel Garcia
    1 Garcia
  68. Darin Shawn Moore
    1 Moore
  69. Xinran Bi
    1 Bi