Prediction of Transcription Factor Families Using DNA Sequence Features

Understanding the mechanisms of protein-DNA interaction is of critical importance in biology. Transcription factor (TF) binding to a specific DNA sequence depends on at least two factors: a protein-level DNA-binding domain and a nucleotide-level sequence that serves as the TF binding site. TFs have been classified into families based on these factors, and TFs within a family bind to specific nucleotide sequences in a very similar fashion. Identifying the TF family that might bind at a particular nucleotide sequence requires a machine learning approach. Here we considered two sets of features based on DNA sequences and their physicochemical properties, and applied a one-versus-all SVM (OVA-SVM) with class-wise optimized features to identify TF family-specific features in DNA sequences. This approach achieved a mean prediction accuracy of ~80%, an improvement of ~7% over previous approaches on the same data.
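As a rough illustration of the one-versus-all decision with class-wise feature subsets described above, the following minimal sketch scores a binding-site sequence against one linear model per TF family, each restricted to its own selected features, and picks the highest-scoring family. Everything specific here is an assumption made for illustration and is not taken from the paper: dinucleotide frequencies stand in for the actual feature sets, the family names, selected features, weights, and biases in MODELS are hypothetical placeholders, and in practice each per-family model would come from a trained SVM rather than hand-set values.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k=2):
    """Map each of the 4**k k-mers to its relative frequency in seq."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # windows with N or other symbols are skipped
            counts[window] += 1
    total = sum(counts.values())
    return {km: (c / total if total else 0.0) for km, c in counts.items()}


def ova_predict(features, models):
    """One-versus-all decision with a separate feature subset per class.

    models maps class name -> (selected_features, weights, bias).
    Each class is scored by its own linear decision function, evaluated
    only on the features selected for that class; the top score wins.
    """
    scores = {}
    for name, (selected, weights, bias) in models.items():
        scores[name] = sum(w * features[f] for f, w in zip(selected, weights)) + bias
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical per-family models (placeholders, not learned from data).
MODELS = {
    "family_AT_rich": (["AT", "TA", "AA", "TT"], [1.0, 1.0, 1.0, 1.0], -0.5),
    "family_GC_rich": (["GC", "CG", "GG", "CC"], [1.0, 1.0, 1.0, 1.0], -0.5),
    "family_CA_repeat": (["CA", "AC"], [1.0, 1.0], -0.5),
}

if __name__ == "__main__":
    for site in ["ATATAT", "GCGCGGCC", "CACACACA"]:
        label, _ = ova_predict(kmer_frequencies(site), MODELS)
        print(site, label)
```

Keeping the feature subset inside each class's model is what distinguishes this from a plain one-versus-all setup, where every binary classifier would see the same feature vector.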
