Topical hidden genome: Discovering latent cancer mutational topics using a Bayesian multilevel context-learning approach Journal Article


Authors: Chakraborty, S.; Guan, Z.; Begg, C. B.; Shen, R.
Article Title: Topical hidden genome: Discovering latent cancer mutational topics using a Bayesian multilevel context-learning approach
Abstract: Inferring the cancer-type specificities of ultra-rare, genome-wide somatic mutations is an open problem. Traditional statistical methods cannot handle such data due to their ultra-high dimensionality and extreme data sparsity. To harness information in rare mutations, we have recently proposed a formal multilevel multilogistic "hidden genome" model. Through its hierarchical layers, the model condenses information in ultra-rare mutations through meta-features embodying mutation contexts to characterize cancer types. Consistent, scalable point estimation of the model can incorporate 10s of millions of variants across thousands of tumors and permit impressive prediction and attribution. However, principled statistical inference is infeasible due to the volume, correlation, and noninterpretability of mutation contexts. In this paper, we propose a novel framework that leverages topic models from computational linguistics to effectuate dimension reduction of mutation contexts producing interpretable, decorrelated meta-feature topics. We propose an efficient MCMC algorithm for implementation that permits rigorous full Bayesian inference at a scale that is orders of magnitude beyond the capability of existing out-of-the-box inferential high-dimensional multi-class regression methods and software. Applying our model to the Pan Cancer Analysis of Whole Genomes dataset reveals interesting biological insights including somatic mutational topics associated with UV exposure in skin cancer, aging in colorectal cancer, and strong influence of epigenome organization in liver cancer. Under cross-validation, our model demonstrates highly competitive predictive performance against blackbox methods of random forest and deep learning. © The Author(s) 2024. Published by Oxford University Press on behalf of The International Biometric Society.
Keywords: genetics; mutation; neoplasm; neoplasms; bayes theorem; skin neoplasms; algorithms; skin tumor; algorithm; models, statistical; statistical model; markov chain monte carlo; humans; human; context learning; multilevel bayesian models; rare somatic variants; topic model; whole genome data
Journal Title: Biometrics
Volume: 80
Issue: 2
ISSN: 0006-341X
Publisher: Oxford University Press
Date Published: 2024-06-01
Start Page: ujae030
Language: English
DOI: 10.1093/biomtc/ujae030
PUBMED: 38682463
PROVIDER: scopus
PMCID: PMC11056772
Notes: Article -- MSK Cancer Center Support Grant (P30 CA008748) acknowledged in PDF -- Source: Scopus
MSK Authors
  1. Colin B Begg
  2. Ronglai Shen