Using Substitution Matrices to Estimate Probability Distributions for Biological Sequences

Accurate estimation of probabilities from observations is important for probabilistic approaches to problems in computational biology. In this paper we present a biologically motivated method for estimating probability distributions over discrete alphabets from observations, using a mixture model of common ancestors. The method extends existing substitution-matrix-based probability estimation methods. Unlike those earlier methods, it has a simple Bayesian interpretation, and unlike Dirichlet mixtures, it remains both effective and simple to compute for large alphabets. We apply the method to estimate amino acid probabilities from observed counts in an alignment and show that it performs comparably to previous methods. We also apply it to estimate probability distributions over protein families, where it improves protein classification accuracy.
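To make the idea of a mixture of common ancestors concrete, the following is a minimal sketch of one plausible Bayesian reading of it, not the paper's actual formulation. It assumes that every observed symbol in a column was emitted independently by a single unknown ancestor symbol, that `subst[a][i]` gives the probability that ancestor `a` is observed as symbol `i`, and that `prior[a]` is a prior over ancestors. The function name, the three-letter toy alphabet, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def ancestor_mixture_estimate(counts, prior, subst):
    """Estimate symbol probabilities from counts via a mixture over ancestors.

    counts[i]   : observed count of symbol i
    prior[a]    : assumed prior probability of ancestor a
    subst[a][i] : assumed probability that ancestor a is observed as symbol i
                  (each row sums to 1; entries must be > 0 wherever counts > 0)
    """
    # Log posterior weight of each ancestor:
    # log P(a) + sum_i counts[i] * log P(i | a)
    log_weights = []
    for a, p_a in enumerate(prior):
        lw = math.log(p_a)
        for i, n in enumerate(counts):
            if n > 0:
                lw += n * math.log(subst[a][i])
        log_weights.append(lw)

    # Normalize in a numerically stable way (subtract the maximum first).
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    total = sum(weights)
    posterior = [w / total for w in weights]

    # Predictive estimate: posterior-weighted average of the substitution rows.
    estimate = [
        sum(posterior[a] * subst[a][i] for a in range(len(prior)))
        for i in range(len(counts))
    ]
    return posterior, estimate


# Toy three-symbol alphabet (hypothetical values, not from the paper).
toy_subst = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
toy_prior = [1 / 3, 1 / 3, 1 / 3]
toy_counts = [3, 1, 0]

toy_posterior, toy_estimate = ancestor_mixture_estimate(
    toy_counts, toy_prior, toy_subst
)
print(toy_posterior)
print(toy_estimate)
```

Under these assumptions the cost grows with the square of the alphabet size (one weight per ancestor, each summed over the observed symbols), which is consistent with the claim that the approach stays simple to compute for large alphabets. Symbols with zero observed counts still receive nonzero probability through the substitution rows.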
