Synapse - High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions

High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions Journal Article

Authors:	Agius, P.; Arvey, A.; Chang, W.; Noble, W. S.; Leslie, C.
Article Title:	High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions
Abstract:	Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel k-mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomic regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments. Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding. © 2010 Agius et al.
Keywords:	controlled study; protein array analysis; nonhuman; binding affinity; accuracy; reproducibility of results; animals; mice; computational biology; analytic method; protein binding; transcription factor; algorithms; information processing; databases, protein; prediction; transcription factors; dna; algorithm; chromatin immunoprecipitation; models, statistical; artificial intelligence; scoring system; intermethod comparison; binding site; models, molecular; area under curve; binding sites; dna binding; protein microarray; sequence analysis, protein; kernel method; fungal proteins
Journal Title:	PLoS Computational Biology
Volume:	6
Issue:	9
ISSN:	1553-7358
Publisher:	Public Library of Science
Date Published:	2010-09-01
Start Page:	e1000916
Language:	English
DOI:	10.1371/journal.pcbi.1000916
PUBMED:	20838582
PROVIDER:	scopus
PMCID:	PMC2936517
DOI/URL:	http://www.scopus.com/inward/record.url?eid=2-s2.0-78049440479&partnerID=40&md5=b59e592d7b8e82245f6927ab5181f811
Notes:	Cited By (since 1996): 1; Export Date: 20 April 2011; Art. No.: e1000916; Source: Scopus

MSK Authors

195 Leslie
11 Agius
20 Arvey
3 Chang

Related MSK Work

A Fully Bayesian Hidden Ising Model For ChIP-Seq Data Analysis
Biostatistics 2012

SeqGL Identifies Context-Dependent Binding Signals In Genome-Wide Regulatory Element Maps
PLoS Computational Biology 2015

Direct ChIP-Seq Significance Analysis Improves Target Prediction
BMC Genomics 2015

Determination And Inference Of Eukaryotic Transcription Factor Sequence Specificity
Cell 2014

A Computational Drug Repositioning Approach For Targeting Oncogenic Transcription Factors
Cell Reports 2016