Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties

Background: The number of protein sequences deriving from genome sequencing projects is outpacing our knowledge about the function of these proteins. With the gap between experimentally characterized and uncharacterized proteins continuing to widen, it is necessary to develop new computational methods and tools for functional prediction. Knowledge of catalytic sites provides a valuable insight into protein function. Although many computational methods have been developed to predict catalytic residues and active sites, their accuracy remains low, with a significant number of false positives. In this paper, we present a novel method for the prediction of catalytic sites, using a carefully selected, supervised machine learning algorithm coupled with an optimal discriminative set of protein sequence conservation and structural properties.

Results: To determine the best machine learning algorithm, 26 classifiers in the WEKA software package were compared using a benchmarking dataset of 79 enzymes with 254 catalytic residues in a 10-fold cross-validation analysis. Each residue of the dataset was represented by a set of 24 residue properties previously shown to be of functional relevance, as well as a label {+1/-1} to indicate catalytic/non-catalytic residue. The best-performing algorithm was the Sequential Minimal Optimization (SMO) algorithm, which is a Support Vector Machine (SVM). The Wrapper Subset Selection algorithm further selected seven of the 24 attributes as an optimal subset of residue properties, with sequence conservation, catalytic propensities of amino acids, and relative position on protein surface being the most important features.

Conclusion: The SMO algorithm with 7 selected attributes correctly predicted 228 of the 254 catalytic residues, with an overall predictive accuracy of more than 86%. Missing only 10.2% of the catalytic residues, the method captures the fundamental features of catalytic residues and can be used as a "catalytic residue filter" to facilitate experimental identification of catalytic residues for proteins with known structure but unknown function.
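
The two figures quoted for catalytic residues (228 of 254 found, 10.2% missed) are two views of the same count. The short Python sketch below checks that arithmetic; the function and variable names are illustrative assumptions and are not taken from the paper. The overall accuracy of more than 86% cannot be recomputed this way, because the abstract does not give the counts for non-catalytic residues.

```python
def sensitivity(true_positives: int, total_positives: int) -> float:
    """Fraction of actual catalytic residues that were predicted as catalytic."""
    if total_positives <= 0:
        raise ValueError("total_positives must be positive")
    return true_positives / total_positives


# Counts reported in the abstract (names are hypothetical, chosen for this sketch).
predicted_catalytic = 228
all_catalytic = 254

recall = sensitivity(predicted_catalytic, all_catalytic)
miss_rate = 1.0 - recall  # share of catalytic residues the classifier missed

print(f"sensitivity: {recall:.1%}")   # sensitivity: 89.8%
print(f"missed:      {miss_rate:.1%}")  # missed:      10.2%
```

Read this way, the "catalytic residue filter" claim rests on sensitivity: roughly nine in ten known catalytic residues in the benchmark were recovered.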
