Selecting predictive features for recognition of hypersensitive sites of regulatory genomic sequences with an evolutionary algorithm

This paper proposes a method to improve the recognition of regulatory genomic sequences. Annotating the sequences that regulate gene transcription is an emerging challenge in genomics research. Identifying regulatory sequences promises to reveal the underlying reasons for phenotypic differences among cells and for diseases associated with abnormal protein expression. Computational approaches have been limited by the scarcity of experimentally known features specific to regulatory sequences. High-throughput experimental technology is now revealing a wealth of hypersensitive (HS) sequences, which are reliable markers of regulatory sequences and are currently the focus of classification methods. The contribution of this paper is a novel method that combines evolutionary computation with support vector machine (SVM) classification to improve the recognition of HS sequences. Based on experimental evidence that HS regions employ sequence features to interact with enzymes, the method seeks motifs that discriminate between HS and non-HS sequences. An evolutionary algorithm (EA) searches the space of sequences of different lengths to obtain such motifs. Experiments show that these motifs improve the recognition of HS sequences by more than 10% compared to state-of-the-art classification methods. Analysis of the motifs also offers insight into the features that regulatory sequences employ to interact with DNA-binding enzymes.
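The search described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the fitness function, the variation operators, or the selection scheme, so everything below is an assumption made for illustration. The names `fitness`, `mutate`, `evolve_motifs`, and `motif_features` are hypothetical, the fitness is a simple difference in occurrence frequency between HS and non-HS sequences, and the toy sequences are made up. The resulting binary feature vectors stand in for the input an SVM classifier would receive.

```python
import random

ALPHABET = "ACGT"
MIN_LEN, MAX_LEN = 3, 8  # assumed motif length bounds, not taken from the paper


def contains_fraction(motif, seqs):
    """Fraction of sequences that contain the motif as a substring."""
    return sum(motif in s for s in seqs) / len(seqs)


def fitness(motif, hs_seqs, non_hs_seqs):
    """Toy score (assumption): how much more often the motif occurs in HS sequences."""
    return contains_fraction(motif, hs_seqs) - contains_fraction(motif, non_hs_seqs)


def mutate(motif, rng):
    """Grow, shrink, or point-mutate a motif so that motif length can vary."""
    op = rng.choice(["point", "grow", "shrink"])
    if op == "grow" and len(motif) < MAX_LEN:
        return motif + rng.choice(ALPHABET)
    if op == "shrink" and len(motif) > MIN_LEN:
        return motif[:-1]
    # Point mutation, also used when growing or shrinking would break the bounds.
    i = rng.randrange(len(motif))
    return motif[:i] + rng.choice(ALPHABET) + motif[i + 1:]


def evolve_motifs(hs_seqs, non_hs_seqs, pop_size=30, generations=40, seed=0):
    """Evolve motifs with elitist truncation selection; returns motifs, best first."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in range(rng.randint(MIN_LEN, MAX_LEN)))
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        offspring = [mutate(m, rng) for m in population]
        pool = set(population + offspring)
        # Sort by descending fitness; break ties alphabetically for determinism.
        ranked = sorted(pool, key=lambda m: (-fitness(m, hs_seqs, non_hs_seqs), m))
        population = ranked[:pop_size]
    return population


def motif_features(seq, motifs):
    """Binary feature vector (1 if the motif occurs in seq), e.g. as classifier input."""
    return [1 if m in seq else 0 for m in motifs]


if __name__ == "__main__":
    hs = ["TTGATAAC", "CCGATAGG", "AGATAT"]        # made-up HS examples
    non_hs = ["TTTTTTTT", "CCCCCCCC", "ACACACAC"]  # made-up non-HS examples
    top = evolve_motifs(hs, non_hs)[:5]
    print(top)
    print([motif_features(s, top) for s in hs + non_hs])
```

In this sketch the motifs are scored independently of any classifier. A wrapper-style alternative would score each candidate by the cross-validated accuracy of a classifier trained on the resulting features, at a much higher cost per evaluation.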
