Feature and Kernel Evolution for Recognition of Hypersensitive Sites in DNA Sequences

Annotating the DNA regions that regulate gene transcription is the first step towards understanding phenotypic differences among cells and the origins of many diseases. Hypersensitive (HS) sites are reliable markers of regulatory regions. Many statistical learning techniques for mapping HS sites employ Support Vector Machines (SVMs) to classify a DNA sequence as HS or non-HS. The contribution of this paper is a novel methodology, inspired by biological evolution, that automates the basic steps of SVM design and improves classification accuracy. First, an evolutionary algorithm designs optimal sequence motifs, which are used to associate feature vectors with the input sequences. Second, a genetic programming algorithm designs optimal kernel functions that map the feature vectors into a high-dimensional space where they can be optimally separated into the HS and non-HS classes. Results show that employing evolutionary computation techniques improves classification accuracy and promises to automate the analysis of biological sequences.
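The two steps described above (motifs that turn a sequence into a feature vector, and a kernel that compares feature vectors) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the motif list is hand-picked rather than evolved, the feature is a plain occurrence count, and the kernel is a standard radial basis function rather than one produced by genetic programming. It is not the paper's implementation.

```python
import math


def motif_features(sequence, motifs):
    """Return one feature per motif: its number of (possibly overlapping) occurrences."""
    features = []
    for motif in motifs:
        count = sum(
            1
            for i in range(len(sequence) - len(motif) + 1)
            if sequence[i:i + len(motif)] == motif
        )
        features.append(count)
    return features


def rbf_kernel(x, y, gamma=0.5):
    """Standard RBF kernel on two feature vectors (a stand-in for an evolved kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical motifs and sequences, chosen only to make the example concrete.
motifs = ["CG", "TATA", "GGC"]
seq_a = "CGCGTATAGGC"
seq_b = "ATATATATAAA"

fa = motif_features(seq_a, motifs)
fb = motif_features(seq_b, motifs)
similarity = rbf_kernel(fa, fb)
print(fa, fb, similarity)
```

In an evolutionary setting of the kind the abstract describes, the motif list and the kernel expression would be the objects being evolved, with classification accuracy of the resulting SVM serving as the fitness signal; the sketch fixes both by hand.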
