Multiclass protein fold recognition using multiobjective evolutionary algorithms

Protein fold recognition (PFR) is an important approach to protein structure discovery that does not rely on sequence similarity. In pattern recognition terms, PFR is a multiclass classification problem that is solved with feature analysis and pattern classification techniques. This work reformulates PFR as a multiobjective optimization problem and proposes a multiobjective feature analysis and selection algorithm (MOFASA). Support vector machines serve as the classifier. Experimental results on the Structural Classification of Proteins (SCOP) data set indicate that MOFASA achieves performance comparable to existing results. In addition, MOFASA identifies relevant features for further biological analysis.
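
To make the multiobjective view of feature selection concrete, the following is a minimal sketch and not the MOFASA algorithm itself. The abstract does not state the objectives, so the sketch assumes two that are both minimized: a classifier's validation error and the number of selected features. The names `dominates`, `pareto_front`, and `make_candidate` are hypothetical, and the error values are illustrative placeholders, not results from the paper.

```python
from typing import List, Tuple

# A candidate is (feature mask, validation error, number of selected features).
# Assumption: both objectives (error, feature count) are to be minimized.
Candidate = Tuple[Tuple[int, ...], float, int]


def make_candidate(mask: Tuple[int, ...], error: float) -> Candidate:
    """Bundle a binary feature mask with its (illustrative) validation error."""
    return (mask, error, sum(mask))


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative feature subsets over four hypothetical feature groups.
# In practice the error would come from training a classifier (e.g. an SVM)
# on the selected features; here the values are made up for the example.
candidates = [
    make_candidate((1, 1, 1, 1), 0.40),
    make_candidate((1, 1, 1, 0), 0.40),
    make_candidate((1, 0, 1, 0), 0.42),
    make_candidate((1, 1, 0, 0), 0.45),
    make_candidate((0, 0, 1, 0), 0.55),
]

front = pareto_front(candidates)
front_masks = [c[0] for c in front]
```

With these placeholder values, the full mask is dropped because a three-feature subset matches its error, and `(1, 1, 0, 0)` is dropped because another two-feature subset has lower error. The remaining subsets form the trade-off curve from which a multiobjective evolutionary algorithm would select and breed new candidates.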
