Predicting in-vitro Transcription Factor Binding Sites Using DNA Sequence + Shape

Discovery of transcription factor binding sites (TFBSs) is essential for understanding the underlying binding mechanisms and cellular functions. Recently, convolutional neural networks (CNNs) have succeeded in predicting TFBSs from primary DNA sequences. Beyond sequence, several lines of evidence suggest that protein-DNA binding is partly mediated by DNA shape properties. Although many methods have been proposed to jointly account for DNA sequence and shape in predicting TFBSs, they do not exploit the combination of deep learning with DNA sequence + shape. Therefore, in this paper we develop a deep-learning-based sequence + shape framework (DLBSS) that integrates DNA sequences and shape properties to better understand protein-DNA binding preferences. The method uses a shared CNN to find common patterns in DNA sequences and their corresponding shape features; the resulting features are then concatenated to compute a predicted value. Using 66 in-vitro datasets derived from universal protein binding microarrays (uPBMs), we show that DLBSS significantly improves the performance of TFBS prediction. In addition, through a series of experiments, we explore the performance of the proposed method when using a deep CNN and identify which shape features are important for predicting TFBSs.
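To make the data flow concrete, the following is a minimal NumPy sketch of the forward pass described above: one set of convolution kernels is applied to both the one-hot sequence and the shape matrix, and the two pooled feature vectors are concatenated before a linear output. It is an illustration under stated assumptions, not the DLBSS implementation. The assumptions are: four shape features per position aligned to the sequence length (so the shape input has the same channel count as the one-hot input), a single convolutional layer with ReLU and global max pooling, a linear output unit, and random untrained weights. All function and parameter names are hypothetical.

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a (length, 4) matrix in A, C, G, T order."""
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        out[i, lookup[base]] = 1.0
    return out

def conv_relu_maxpool(x, kernels, bias):
    """Valid 1D convolution, ReLU, then global max pooling.

    x: (length, channels); kernels: (n_kernels, width, channels);
    bias: (n_kernels,). Returns a vector of length n_kernels.
    """
    width = kernels.shape[1]
    n_pos = x.shape[0] - width + 1
    # Stack every sliding window: (n_pos, width, channels).
    windows = np.stack([x[i:i + width] for i in range(n_pos)])
    # Contract window and kernel over width and channels: (n_pos, n_kernels).
    acts = np.tensordot(windows, kernels, axes=([1, 2], [1, 2])) + bias
    return np.maximum(acts, 0.0).max(axis=0)

def predict(seq, shape, params):
    """Apply the SAME kernels to sequence and shape, concatenate, score."""
    seq_feat = conv_relu_maxpool(one_hot(seq), params["kernels"], params["bias"])
    shape_feat = conv_relu_maxpool(shape, params["kernels"], params["bias"])
    joint = np.concatenate([seq_feat, shape_feat])
    return float(joint @ params["w_out"] + params["b_out"])

# Toy usage with random (untrained) weights and made-up shape values.
rng = np.random.default_rng(0)
n_kernels, width = 8, 5
params = {
    "kernels": rng.normal(scale=0.1, size=(n_kernels, width, 4)),  # shared
    "bias": np.zeros(n_kernels),
    "w_out": rng.normal(scale=0.1, size=2 * n_kernels),
    "b_out": 0.0,
}
seq = "ACGTTGCAACGTAGCTAGCT"
shape = rng.normal(size=(len(seq), 4))  # assumed: 4 shape features per base
print(predict(seq, shape, params))
```

Sharing the kernels requires both inputs to have the same number of channels, which is why the sketch assumes exactly four shape features. In practice the shape values would come from a shape-prediction tool and would need normalizing and aligning to the sequence before use.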
