Universal entropy estimation via block sorting

In this correspondence, we present a new universal entropy estimator for stationary ergodic sources, prove its almost sure convergence, and establish an upper bound on its convergence rate for finite-alphabet, finite-memory sources. The algorithm is motivated by data compression via the Burrows-Wheeler block-sorting transform (BWT). By exploiting the property that the BWT output sequence is close to a piecewise-stationary memoryless source, we can segment the output sequence and estimate probabilities within each segment, as sketched in the example below. Experimental results show that our algorithm outperforms Lempel-Ziv (LZ) string-matching-based estimators.
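The following is a minimal Python sketch of the idea, not the paper's exact procedure: it uses a naive rotation-sort BWT (real implementations use suffix arrays) and uniform fixed-length segmentation with a plug-in entropy estimate per segment. The function names, the `num_segments` parameter, and the uniform segmentation rule are illustrative assumptions; the paper's segmentation scheme may differ.

```python
import math
from collections import Counter


def bwt(s: str) -> str:
    """Naive Burrows-Wheeler transform via sorted cyclic rotations.

    O(n^2 log n) and omits the end-of-string sentinel (invertibility
    is irrelevant for entropy estimation); suffices for illustration.
    """
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)


def plug_in_entropy(segment: str) -> float:
    """Empirical (maximum-likelihood) entropy of one segment, in bits."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def bwt_entropy_estimate(s: str, num_segments: int) -> float:
    """Estimate the entropy rate of the source that produced s.

    The BWT output of a sample from a finite-memory source is close to
    piecewise i.i.d., so we split it into segments, estimate the entropy
    within each segment, and average, weighting by segment length.
    """
    out = bwt(s)
    n = len(out)
    seg_len = max(1, n // num_segments)
    weighted_sum = 0.0
    for start in range(0, n, seg_len):
        seg = out[start:start + seg_len]
        weighted_sum += len(seg) * plug_in_entropy(seg)
    return weighted_sum / n


if __name__ == "__main__":
    # Sanity check on a biased i.i.d. binary source: the estimate should
    # approach h(0.2) ~= 0.722 bits as the sample length grows.
    import random
    random.seed(0)
    x = "".join("1" if random.random() < 0.2 else "0" for _ in range(4096))
    print(bwt_entropy_estimate(x, num_segments=64))
```

The number of segments trades off bias against variance: too few segments blur the piecewise-constant statistics of the BWT output, while too many leave each segment with too few symbols for a reliable plug-in estimate.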
