Active site prediction using evolutionary and structural information

Motivation: The identification of catalytic residues is a key step in understanding the function of enzymes. While a variety of computational methods have been developed for this task, accuracies have remained fairly low. The best existing method exploits information from sequence and structure to achieve a precision (the fraction of predicted catalytic residues that are truly catalytic) of 18.5% at a corresponding recall (the fraction of catalytic residues identified) of 57% on a standard benchmark. Here we present a new method, Discern, which provides a significant improvement over the state of the art through the use of statistical techniques to derive a model with a small set of features that are jointly predictive of enzyme active sites.

Results: In cross-validation experiments on two benchmark datasets from the Catalytic Site Atlas and CATRES resources containing a total of 437 manually curated enzymes spanning 487 SCOP families, Discern increases catalytic site recall by between 12% and 20% over methods that combine information from both sequence and structure, and by ≥50% over methods that use only the sequence conservation signal. Controlled experiments show that Discern's improvement in catalytic residue prediction derives from the combination of three ingredients: the use of the INTREPID phylogenomic method to extract conservation information; the use of 3D structure data, including features computed for residues that are proximal in the structure; and a statistical regularization procedure to prevent overfitting.

Supplementary information: Supplementary data are available at Bioinformatics online.
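The precision and recall definitions used above can be made concrete with a short sketch. This is only an illustration of the two definitions, not code from Discern: the function name `precision_recall` and the residue numbers are hypothetical and do not come from the benchmark datasets.

```python
def precision_recall(predicted, catalytic):
    """Return (precision, recall) for two collections of residue identifiers.

    precision: fraction of predicted catalytic residues that are truly catalytic
    recall:    fraction of truly catalytic residues that were predicted
    """
    predicted, catalytic = set(predicted), set(catalytic)
    true_positives = len(predicted & catalytic)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(catalytic) if catalytic else 0.0
    return precision, recall


# Hypothetical toy example: residue positions are made up for illustration.
predicted_residues = {57, 102, 189, 195, 214}   # residues a predictor flags
catalytic_residues = {57, 102, 193, 195}        # curated catalytic residues

precision, recall = precision_recall(predicted_residues, catalytic_residues)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case three of the five predictions are correct and three of the four catalytic residues are recovered. Comparing methods at a fixed recall, as the abstract does, amounts to asking how many false predictions each method makes while recovering the same share of true sites.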
