Enhanced performance in prediction of protein active sites with THEMATICS and support vector machines

Theoretical microscopic titration curves (THEMATICS) is a computational method for the identification of active sites in proteins through deviations in computed titration behavior of ionizable residues. While the sensitivity to catalytic sites is high, the previously reported sensitivity to catalytic residues was lower, about 50%. Here THEMATICS is combined with support vector machines (SVM) to improve sensitivity for catalytic residue prediction from protein 3D structure alone. For a test set of 64 proteins taken from the Catalytic Site Atlas (CSA), the average recall rate for annotated catalytic residues is 61%; good precision is maintained by selecting only 4% of all residues. The average false positive rate, using the CSA annotations, is only 3.2%, far lower than that of other 3D‐structure‐based methods. THEMATICS–SVM returns higher precision, a lower false positive rate, and better overall performance than other 3D‐structure‐based methods. Comparison is also made with the latest machine learning methods that are based on both sequence alignments and 3D structures. For annotated sets of well‐characterized enzymes, THEMATICS–SVM performance compares very favorably with methods that utilize sequence homology. Moreover, since THEMATICS depends only on the 3D structure of the query protein, no decline in performance is expected when it is applied to novel folds, proteins with few sequence homologues, or even orphan sequences. An extension of the method to predict non‐ionizable catalytic residues is also presented. THEMATICS–SVM predicts a local network of ionizable residues with strong interactions between protonation events; this appears to be a special feature of enzyme active sites.
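The recall, precision, and false positive rate quoted above can be illustrated with a minimal Python sketch. It assumes the standard definitions of these quantities, computed per protein from a set of predicted residues and a set of annotated catalytic residues; the function name `residue_metrics`, the `ResidueMetrics` type, and the residue numbers are hypothetical and are not taken from the paper, which may define or average these quantities differently.

```python
from typing import NamedTuple, Set


class ResidueMetrics(NamedTuple):
    recall: float
    precision: float
    false_positive_rate: float


def residue_metrics(
    predicted: Set[int], annotated: Set[int], all_residues: Set[int]
) -> ResidueMetrics:
    """Standard per-protein metrics for residue-level predictions.

    Assumes `predicted` and `annotated` are subsets of `all_residues`.
    """
    tp = len(predicted & annotated)            # predicted and annotated
    fp = len(predicted - annotated)            # predicted but not annotated
    fn = len(annotated - predicted)            # annotated but missed
    negatives = len(all_residues - annotated)  # residues not annotated as catalytic

    recall = tp / (tp + fn) if annotated else 0.0
    precision = tp / (tp + fp) if predicted else 0.0
    false_positive_rate = fp / negatives if negatives else 0.0
    return ResidueMetrics(recall, precision, false_positive_rate)


# Hypothetical protein with 100 residues, 5 annotated catalytic residues,
# and 4 predicted residues (illustrative numbers only).
all_residues = set(range(1, 101))
annotated = {10, 20, 30, 40, 50}
predicted = {10, 20, 30, 60}

metrics = residue_metrics(predicted, annotated, all_residues)
print(metrics)
```

In this toy case three of the five annotated residues are recovered and one of the four predictions is not annotated. Averaging such per-protein values over a test set is one plausible reading of the "average" figures in the abstract, though that is an assumption here.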
