Partial Order Optimum Likelihood (POOL): Maximum Likelihood Prediction of Protein Active Site Residues Using 3D Structure and Sequence Properties

A new monotonicity-constrained maximum likelihood approach, called Partial Order Optimum Likelihood (POOL), is presented and applied to the problem of functional site prediction in protein 3D structures, an important current challenge in genomics. The input consists of electrostatic and geometric properties derived from the 3D structure of the query protein alone. Sequence-based conservation information, where available, may also be incorporated. Electrostatic features from THEMATICS are combined with multidimensional isotonic regression to form maximum likelihood estimates of the probabilities that specific residues belong to an active site. This allows likelihood ranking of all ionizable residues in a given protein based on THEMATICS features. The corresponding ROC curves and statistical significance tests demonstrate that this method outperforms prior THEMATICS-based methods, which in turn have been shown previously to outperform other 3D-structure-based methods for identifying active site residues. The addition of one simple geometric property, the size rank of the cleft in which a given residue is contained, is then shown to yield improved performance. Extension of the method to include predictions of non-ionizable residues is achieved through the introduction of environment variables. This extension results in even better performance than THEMATICS alone and constitutes, to date, the best functional site predictor based on 3D structure only, achieving nearly the same level of performance as methods that use both 3D structure and sequence alignment data. Finally, the method also easily incorporates such sequence alignment data, and when this information is included, the resulting method is shown to outperform the best current methods using any combination of sequence alignments and 3D structures.
Included is an analysis demonstrating that when THEMATICS features, cleft size rank, and alignment-based conservation scores are used individually or in combination, THEMATICS features represent the single most important component of such classifiers.
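
To make the idea of a monotonicity-constrained maximum likelihood estimate concrete, the following is a minimal sketch of the one-dimensional special case only: a single feature with a total order, fitted with the standard pool-adjacent-violators procedure. It is not the POOL algorithm itself, which works with a partial order over several features. The function names `pava` and `monotone_probabilities`, and the toy scores and labels, are hypothetical and chosen purely for illustration. For 0/1 labels, the non-decreasing fit produced this way is also the maximum likelihood estimate of the class probabilities under the monotonicity constraint.

```python
def pava(labels):
    """Pool-adjacent-violators: non-decreasing least-squares fit to `labels`,
    which must already be ordered by increasing feature value."""
    blocks = []  # each block stores [sum of labels, number of points]
    for y in labels:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)  # every point in a block gets the block mean
    return fitted


def monotone_probabilities(scores, labels):
    """Estimate P(label = 1) as a non-decreasing function of `scores`.

    Simplifying assumption: scores are distinct (ties are not pooled here).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    fitted = pava([labels[i] for i in order])
    probs = [0.0] * len(scores)
    for rank, i in enumerate(order):
        probs[i] = fitted[rank]
    return probs


if __name__ == "__main__":
    # Hypothetical feature values and 0/1 active-site annotations.
    scores = [0.3, 0.1, 0.2, 0.4]
    labels = [0, 0, 1, 1]
    print(monotone_probabilities(scores, labels))
```

Residues can then be ranked by the fitted probabilities. Extending this to several features requires replacing the sorted order with a partial order, which is the setting the abstract describes.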
