Segment and Combine Approach for Biological Sequence Classification

This paper presents a new algorithm, based on the segment and combine paradigm, for the automatic classification of biological sequences. It classifies a sequence by aggregating the predictions made for its subsequences by a classifier that is learned from a random sample of training subsequences. This generic approach is combined with decision-tree-based ensemble methods, which scale well with respect to both sample size and vocabulary size. The method is applied to three families of problems: DNA sequence recognition, splice junction detection, and gene regulon prediction. Compared with standard approaches based on n-grams, it appears competitive in terms of accuracy, flexibility, and scalability. The paper also highlights the possibility of exploiting the resulting models to identify interpretable patterns specific to a given class of biological sequences.
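The segment and combine idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function and class names, the window length, the toy data, and the rule of averaging class probabilities over windows. A simple position-wise naive Bayes model stands in for the decision tree ensembles that the paper actually uses.

```python
import math
import random
from collections import Counter, defaultdict


def sample_subsequences(sequences, labels, length, n_samples, rng):
    """Segment step: draw random fixed-length windows from the training
    sequences; each window inherits the label of its parent sequence."""
    samples = []
    for _ in range(n_samples):
        i = rng.randrange(len(sequences))
        seq = sequences[i]
        start = rng.randrange(len(seq) - length + 1)
        samples.append((seq[start:start + length], labels[i]))
    return samples


class PositionwiseNaiveBayes:
    """Stand-in base learner (hypothetical; NOT the paper's tree ensembles)."""

    def fit(self, samples, alphabet="ACGT"):
        self.alphabet = alphabet
        self.classes = sorted({y for _, y in samples})
        self.length = len(samples[0][0])
        self.totals = Counter(y for _, y in samples)
        self.counts = {c: [Counter() for _ in range(self.length)]
                       for c in self.classes}
        for sub, y in samples:
            for pos, ch in enumerate(sub):
                self.counts[y][pos][ch] += 1
        return self

    def predict_proba(self, sub):
        """Return a dict mapping each class to its posterior probability."""
        n = sum(self.totals.values())
        log_scores = {}
        for c in self.classes:
            score = math.log(self.totals[c] / n)
            for pos, ch in enumerate(sub):
                # Laplace smoothing keeps unseen symbols from zeroing the score.
                score += math.log((self.counts[c][pos][ch] + 1)
                                  / (self.totals[c] + len(self.alphabet)))
            log_scores[c] = score
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def classify_sequence(model, seq, length):
    """Combine step: sum the window-level class probabilities over all
    windows of the sequence (assumes len(seq) >= length), then take the
    class with the largest total."""
    totals = defaultdict(float)
    for start in range(len(seq) - length + 1):
        for c, p in model.predict_proba(seq[start:start + length]).items():
            totals[c] += p
    return max(totals, key=totals.get)


# Toy data (made up): one class uses only A/T, the other only G/C.
rng = random.Random(0)
train_seqs = (["".join(rng.choice("AT") for _ in range(30)) for _ in range(20)]
              + ["".join(rng.choice("GC") for _ in range(30)) for _ in range(20)])
train_labels = ["AT-rich"] * 20 + ["GC-rich"] * 20

samples = sample_subsequences(train_seqs, train_labels,
                              length=8, n_samples=500, rng=rng)
model = PositionwiseNaiveBayes().fit(samples)
print(classify_sequence(model, "ATTATAATATTAAT", length=8))
```

In this sketch the base learner only ever sees short windows, so any classifier that outputs class probabilities for fixed-length inputs could be substituted for the stand-in model, including a tree ensemble.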
