Genome-scale phylogenetic function annotation of large and diverse protein families.

The Statistical Inference of Function Through Evolutionary Relationships (SIFTER) framework uses a statistical graphical model that applies phylogenetic principles to automate precise protein function prediction. Here we present a revised approach (SIFTER version 2.0) that enables annotation on a genomic scale. SIFTER 2.0 produces predictions as precise as those of the earlier version on a carefully studied family and on a collection of 100 protein families. We have added an approximation method to SIFTER 2.0 and show a 500-fold improvement in speed with minimal impact on prediction results in the functionally diverse sulfotransferase protein family. On the Nudix protein family, previously inaccessible to the SIFTER framework because of its 66 possible molecular functions, SIFTER achieved 47.4% accuracy on experimental data (where BLAST achieved 34.0%). Finally, we used SIFTER to annotate all Schizosaccharomyces pombe proteins that have experimental functional characterizations, based on annotations from proteins in 46 fungal genomes. SIFTER precisely predicted molecular function for 45.5% of the characterized proteins in this genome, as compared with four current function prediction methods that precisely predicted function for 62.6%, 30.6%, 6.0%, and 5.7% of these proteins. We use both precision-recall curves and receiver operating characteristic (ROC) analyses to compare these genome-scale predictions across the different methods and to assess performance on different types of applications. SIFTER 2.0 is capable of predicting protein molecular function for large and functionally diverse protein families using an approximate statistical model, enabling phylogenetics-based protein function prediction for genome-wide analyses. The code for SIFTER and the protein family data are available at http://sifter.berkeley.edu.
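As a rough illustration of how precision-recall and ROC comparisons of this kind can be computed, the sketch below sweeps a score threshold over a list of scored predictions. It is not taken from the SIFTER code base: the function names `curve_points` and `roc_auc`, the input format of (score, is_correct) pairs, and the toy data are all assumptions made for the example, and the abstract does not state how a prediction is judged correct.

```python
from typing import Dict, List, Tuple


def curve_points(scored: List[Tuple[float, bool]]) -> List[Dict[str, float]]:
    """Sweep a threshold over (score, is_correct) pairs (hypothetical format).

    Returns one point per distinct score, highest threshold first, holding
    precision, recall (the true positive rate), and the false positive rate.
    """
    positives = sum(1 for _, label in scored if label)
    negatives = len(scored) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    points = []
    for threshold in sorted({score for score, _ in scored}, reverse=True):
        # Everything scoring at or above the threshold counts as predicted.
        tp = sum(1 for s, label in scored if s >= threshold and label)
        fp = sum(1 for s, label in scored if s >= threshold and not label)
        points.append({
            "threshold": threshold,
            "precision": tp / (tp + fp),
            "recall": tp / positives,
            "fpr": fp / negatives,
        })
    return points


def roc_auc(points: List[Dict[str, float]]) -> float:
    """Trapezoidal area under the ROC curve, anchored at the origin."""
    xs = [0.0] + [p["fpr"] for p in points]
    ys = [0.0] + [p["recall"] for p in points]
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
               for i in range(len(xs) - 1))


if __name__ == "__main__":
    # Toy data only: scores and correctness labels are invented.
    toy = [(0.9, True), (0.8, False), (0.7, True), (0.4, False)]
    for point in curve_points(toy):
        print(point)
    print("ROC AUC:", roc_auc(curve_points(toy)))
```

Precision-recall points emphasize how many of the emitted predictions are correct, while ROC points also account for true negatives, which is one reason the two views can rank methods differently for different applications.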
