Significantly lower entropy estimates for natural DNA sequences

If DNA were a random string over its alphabet {A,C,G,T}, an optimal code would assign 2 bits to each nucleotide. We imagine DNA to be a highly ordered, purposeful molecule, and might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly, this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases widens this gap more than five-fold (to 1.66 bits per nucleotide) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principal source of these improvements, is the formal inclusion of inexact match information in the model. Matches found at various distances form a panel of experts whose predictions are then combined into a single prediction. The structure of this combination is novel, and its parameters are learned using expectation maximization (EM).
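To make the two quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' model. It assumes (1) that the "frequency statistics" baseline is the empirical order-0 entropy of the sequence, and (2) a plain linear mixture of experts whose weights are fitted by EM. The function names `order0_entropy_bits`, `em_mixture_weights`, and `mixture_bits_per_symbol` are hypothetical, and the experts' probabilities are taken as given inputs rather than derived from inexact matches as in the paper.

```python
import math
from collections import Counter


def order0_entropy_bits(seq):
    """Empirical entropy in bits per symbol from single-symbol frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def em_mixture_weights(expert_probs, iters=50):
    """Fit mixing weights for a fixed panel of experts with EM.

    expert_probs[t][k] is the probability that expert k assigned to the
    symbol that actually occurred at position t (assumed > 0 for at
    least one expert in every row).
    """
    k = len(expert_probs[0])
    w = [1.0 / k] * k  # start from uniform weights
    for _ in range(iters):
        resp_sum = [0.0] * k
        for row in expert_probs:
            mix = sum(wi * pi for wi, pi in zip(w, row))
            # E-step: responsibility of each expert for this position
            for j in range(k):
                resp_sum[j] += w[j] * row[j] / mix
        # M-step: new weight is the average responsibility
        w = [r / len(expert_probs) for r in resp_sum]
    return w


def mixture_bits_per_symbol(expert_probs, w):
    """Average code length in bits per symbol under the weighted mixture."""
    total = 0.0
    for row in expert_probs:
        total -= math.log2(sum(wi * pi for wi, pi in zip(w, row)))
    return total / len(expert_probs)
```

A uniform sequence such as `"ACGT" * 10` gives an order-0 entropy of 2 bits, the random-string bound mentioned above; a skewed composition gives less. Because each EM iteration cannot decrease the likelihood, the fitted weights never yield a longer average code length than the uniform starting weights on the same data.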
