Artefacts and biases affecting the evaluation of scoring functions on decoy sets for protein structure prediction

Motivation: Decoy datasets, consisting of a solved protein structure and numerous alternative native-like structures, are in common use for the evaluation of scoring functions in protein structure prediction. Several pitfalls in the use of these datasets have been identified in the literature, along with useful guidelines for generating more effective decoy datasets. We contribute to this ongoing discussion an empirical assessment of several decoy datasets commonly used in experimental studies.

Results: We find that artefacts and sampling issues in the large majority of these datasets make it trivial to discriminate the native structure. This underlines that evaluation based on the rank or z-score of the native is a weak test of scoring function performance. Moreover, sampling biases present in the way decoy sets are generated or used can strongly affect other types of evaluation measures, such as the correlation between score and root mean squared deviation (RMSD) to the native. We demonstrate how, depending on the type of bias and the evaluation context, sampling biases may lead to either over- or under-estimation of the quality of scoring terms, functions or methods.

Availability: Links to the software and data used in this study are available at http://dbkgroup.org/handl/decoy_sets.

Contact: simon.lovell@manchester.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.
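As a rough illustration of the three evaluation measures named in the Results (rank of the native, z-score of the native, and score-RMSD correlation), the following is a minimal sketch rather than the software used in the study. It assumes that lower scores are better, that the z-score uses the population standard deviation of the decoy scores, and that the correlation is Pearson's; the function names and the example numbers are hypothetical.

```python
import math


def native_rank(native_score, decoy_scores):
    """Rank of the native among native + decoys (1 = best).

    Assumption: a lower score means a better structure.
    """
    return 1 + sum(1 for s in decoy_scores if s < native_score)


def native_z_score(native_score, decoy_scores):
    """Z-score of the native relative to the decoy score distribution.

    Assumption: population (not sample) standard deviation of the decoys.
    """
    n = len(decoy_scores)
    mean = sum(decoy_scores) / n
    variance = sum((s - mean) ** 2 for s in decoy_scores) / n
    return (native_score - mean) / math.sqrt(variance)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical toy data: four decoys with made-up scores and RMSDs.
decoy_rmsds = [1.0, 2.0, 3.0, 4.0]
decoy_scores = [-90.0, -80.0, -70.0, -60.0]
native_score = -100.0

print("rank of native:", native_rank(native_score, decoy_scores))
print("z-score of native:", native_z_score(native_score, decoy_scores))
print("score-RMSD correlation:", pearson(decoy_rmsds, decoy_scores))
```

In this toy set the native is trivially separated from every decoy, so its rank and z-score say little about how well the scores order the decoys themselves; the correlation is the only one of the three measures that depends on the decoy RMSDs, which is also why it is sensitive to how the decoys were sampled.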
