Protein Molecular Function Prediction by Bayesian Phylogenomics

We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5′-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.
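The idea of propagating sparse function annotations through a phylogeny can be illustrated with a small sketch. The code below is not SIFTER's model. It assumes a toy symmetric transition model between k candidate functions, a uniform prior at the root, and a hypothetical four-leaf tree in which three leaves are annotated and one (`x`) is not. All names (`transition`, `upward`, `posterior`, the tree layout, the `rate` parameter) are invented for illustration.

```python
import math

def transition(k, rate, t):
    """k x k matrix of P(child function | parent function) on a branch of length t.

    Assumed toy model: every change between functions is equally likely.
    """
    decay = math.exp(-k * rate * t)
    stay = 1.0 / k + (k - 1) / k * decay
    move = 1.0 / k - decay / k
    return [[stay if i == j else move for j in range(k)] for i in range(k)]

def upward(node, tree, evidence, k, rate):
    """Return like[s] = P(annotations below node | node has function s)."""
    children = tree.get(node, [])
    if not children:
        obs = evidence.get(node)
        if obs is None:
            return [1.0] * k  # unannotated leaf: no information
        return [1.0 if s == obs else 0.0 for s in range(k)]
    like = [1.0] * k
    for child, length in children:
        child_like = upward(child, tree, evidence, k, rate)
        p = transition(k, rate, length)
        for s in range(k):
            like[s] *= sum(p[s][c] * child_like[c] for c in range(k))
    return like

def posterior(query, root, tree, evidence, k, rate=1.0):
    """Posterior over functions for one unannotated leaf.

    Clamps the query leaf to each function in turn, computes the joint
    probability of all annotations, and normalizes.
    """
    joint = []
    for s in range(k):
        clamped = dict(evidence)
        clamped[query] = s
        root_like = upward(root, tree, clamped, k, rate)
        joint.append(sum(root_like) / k)  # uniform prior at the root
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical tree: each internal node maps to (child, branch length) pairs.
tree = {
    "root": [("n1", 0.2), ("n2", 0.2)],
    "n1": [("a", 0.1), ("x", 0.1)],
    "n2": [("b", 0.1), ("c", 0.1)],
}
evidence = {"a": 0, "b": 1, "c": 1}  # function indices for annotated leaves
post = posterior("x", "root", tree, evidence, k=2)
print(post)
```

In this toy setting the unannotated leaf `x` takes most of its posterior mass from its sibling `a` rather than from the two more distant leaves, which is the qualitative behavior a phylogeny-aware method relies on and a plain majority vote would miss.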
