A multi-level approach to SCOP fold recognition

The classification of proteins based on their structure can play an important role in the deduction or discovery of protein function. However, the relatively low number of solved protein structures and the unknown relationship between structure and sequence requires an alternative method of representation for classification to be effective. Furthermore, the large number of potential folds causes problems for many classification strategies, increasing the likelihood that the classifier will reach a local optima while trying to distinguish between all of the possible structural categories. Here we present a hierarchical strategy for structural classification that first partitions proteins based on their SCOP class before attempting to assign a protein fold. Using a well-known dataset derived from the 27 most-populated SCOP folds and several sequence-based descriptor properties as input features, we test a number of classification methods, including Naive Bayes and Boosted C4.5. Our strategy achieves an average fold recognition of 74%, which is significantly higher than the 56-60% previously reported in the literature, indicating the effectiveness of a multi-level approach.

[1]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[2]  I. Muchnik,et al.  Prediction of protein folding class using global description of amino acid sequence. , 1995, Proceedings of the National Academy of Sciences of the United States of America.

[3]  David W. Aha,et al.  Simplifying decision trees: A survey , 1997, The Knowledge Engineering Review.

[4]  Chris H. Q. Ding,et al.  Multi-class protein fold recognition using support vector machines and neural networks , 2001, Bioinform..

[5]  Ankush Mittal,et al.  Protein Structure and Fold Prediction Using Tree-augmented Naïve Bayesian Classifier , 2005, J. Bioinform. Comput. Biol..

[6]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[7]  Yves Deville,et al.  Multi-class protein fold classification using a new ensemble machine learning approach. , 2003, Genome informatics. International Conference on Genome Informatics.

[8]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[9]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[10]  Kalyanmoy Deb,et al.  Multi-Class Protein Fold Recognition Using Multi-Objective Evolutionary Algorithms , 2004 .

[11]  Pat Langley,et al.  Estimating Continuous Distributions in Bayesian Classifiers , 1995, UAI.

[12]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[13]  Sreerama K. Murthy,et al.  Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey , 1998, Data Mining and Knowledge Discovery.

[14]  D. Haussler,et al.  Boolean Feature Discovery in Empirical Learning , 1990, Machine Learning.