An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and Its Application to DNA Splice Site Prediction

Associating functional information with biological sequences remains a challenge for machine learning methods. The performance of these methods often depends on deriving predictive features from the sequences to be classified. Feature generation is a difficult problem because the connection between sequence features and the property of interest is not known a priori. Typically, domain experts or exhaustive feature enumeration techniques generate a small set of features, whose predictive power is then tested in the context of classification. This paper proposes an evolutionary algorithm that effectively explores a large feature space and generates predictive features from sequence data. The effectiveness of the algorithm is demonstrated on DNA splice site prediction, an important component of the gene-finding problem. This application is chosen because of the complexity of the features needed to obtain high classification accuracy and precision. We evaluate the obtained features in the context of classification by Support Vector Machines, and the results show significant improvement in accuracy and precision over state-of-the-art approaches.
