Unlocking the power of multi-institutional data: Integrating and harmonizing genomic data across institutions Journal Article


Authors: Chen, Y.; Shen, R.; Feng, X.; Panageas, K.
Article Title: Unlocking the power of multi-institutional data: Integrating and harmonizing genomic data across institutions
Abstract: Cancer is a complex disease driven by genomic alterations, and tumor sequencing is becoming a mainstay of clinical care for cancer patients. The emergence of multi-institution sequencing data presents a powerful resource for learning real-world evidence to enhance precision oncology. GENIE BPC, led by the American Association for Cancer Research, establishes a unique database linking genomic data with clinical information for patients treated at multiple cancer centers. However, leveraging sequencing data from multiple institutions presents significant challenges. Variability in gene panels can lead to loss of information when analyses focus on genes common across panels. Additionally, differences in sequencing techniques and patient heterogeneity across institutions add complexity. High data dimensionality, sparse gene mutation patterns, and weak signals at the individual gene level further complicate matters. Motivated by these real-world challenges, we introduce the Bridge model. It uses a quantile-matched latent variable approach to derive integrated features that preserve information beyond common genes and maximize the utilization of all available data, while leveraging information sharing to enhance both learning efficiency and the model's capacity to generalize. By extracting harmonized and noise-reduced lower-dimensional latent variables, the true mutation pattern unique to each individual is captured. We assess the model's performance and parameter estimation through extensive simulation studies. The extracted latent features from the Bridge model consistently excel in predicting patient survival across six cancer types in GENIE BPC data. © 2024 The Author(s).
Keywords: genetics; mutation; neoplasm; neoplasms; gene expression; lung cancer; models, statistical; genomics; computer simulation; tumor; statistical model; diseases; gene encoding; personalized medicine; genetic database; databases, genetic; cancer genomics; database; procedures; data interpretation; noise; data reduction; power; genomic data; missing data; real-world; cancer; humans; human; precision medicine; data integration; latent variable; precision oncology; data assimilation; data accuracy; dimension reduction; systematic biases; bridge model; systematic bias
Journal Title: Biometrics
Volume: 80
Issue: 4
ISSN: 0006-341X
Publisher: Wiley Blackwell  
Date Published: 2024-12-01
Start Page: ujae146
Language: English
DOI: 10.1093/biomtc/ujae146
PUBMED: 39679742
PROVIDER: scopus
PMCID: PMC11647914
Notes: The MSK Cancer Center Support Grant (P30 CA008748) is acknowledged in the PDF -- Corresponding author is MSK author: Yuan Chen -- Source: Scopus
MSK Authors
  1. Ronglai Shen
  2. Katherine S Panageas
  3. Yuan Chen