On the Role of Pattern Matching in Information Theory

This paper motivates and discusses the role of pattern matching in information theory. We describe the relationship between a pattern's recurrence time and its probability under the data-generating stochastic source, and show how this relationship has led to major advances in universal data compression. We then describe nonasymptotic uniform bounds on the performance of data-compression algorithms in cases where the training data available to the encoder is too small to yield the asymptotic compression ratio, the Shannon entropy. Finally, we discuss applications of pattern matching and universal compression to universal prediction, classification, and entropy estimation.
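The recurrence-time/probability relationship referred to above is Kac's lemma: for a stationary ergodic source, the expected return time of a pattern, given that the process starts in that pattern, equals the reciprocal of the pattern's probability. A minimal simulation can check this empirically for an i.i.d. Bernoulli source; the function names and parameter values below are illustrative choices, not taken from the paper.

```python
import random

def return_time(pattern, p, max_steps=100000, rng=random):
    """Time until the pattern, placed at position 0, reappears in an
    i.i.d. Bernoulli(p) continuation (occurrences may overlap the prefix)."""
    l = len(pattern)
    target = list(pattern)
    window = list(pattern)          # the process starts inside the pattern
    for t in range(1, max_steps):
        window.pop(0)               # slide the window one symbol forward
        window.append(1 if rng.random() < p else 0)
        if window == target:
            return t                # first t >= 1 with X_t..X_{t+l-1} = pattern
    return max_steps

rng = random.Random(0)
p = 0.3
pattern = (1, 0, 1)
prob = p * (1 - p) * p              # P(pattern) under the i.i.d. source

trials = 5000
mean_R = sum(return_time(pattern, p, rng=rng) for _ in range(trials)) / trials
print(f"1/P(pattern) = {1/prob:.2f}, empirical mean return time = {mean_R:.2f}")
```

The empirical mean should hover near 1/P(pattern) ≈ 15.87; taking logarithms of such return times is the mechanism by which recurrence-based schemes (e.g. fixed-database Lempel-Ziv) attain the entropy rate asymptotically.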
