Asymptotic properties of data compression and suffix trees

Recently, Wyner and Ziv (see ibid., vol.35, p.1250-8, 1989) proved that the typical length of a repeated subword found within the first n positions of a stationary ergodic sequence is (1/h) log n in probability, where h is the entropy of the source. This finding was used to obtain several insights into certain universal data compression schemes, most notably the Lempel-Ziv data compression algorithm. Wyner and Ziv also conjectured that their result can be extended to the stronger almost sure convergence. In this paper, we settle this conjecture in the negative in the so-called right domain asymptotic, that is, during a dynamic phase of expanding the data base. We prove, under an additional assumption involving mixing conditions, that the length of a typical repeated subword oscillates almost surely (a.s.) between (1/h_1) log n and (1/h_2) log n where D >
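
The (1/h) log n scaling can be checked numerically. The following is a minimal sketch, not the construction used in the paper: it assumes a memoryless binary source with P(1) = p (a special case of a stationary ergodic sequence), measures entropy in nats, and uses a simplified match length, namely the longest prefix of the sequence starting at position n that also starts somewhere in the first n positions. The names entropy_nats and longest_match are hypothetical helpers introduced only for this illustration.

```python
import math
import random


def entropy_nats(p):
    """Entropy (in nats) of a memoryless binary source with P(1) = p, 0 < p < 1."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def longest_match(seq, n):
    """Length of the longest prefix of seq[n:] that also starts at some i < n.

    Brute force: try every earlier start position i and extend the match
    symbol by symbol (the two copies are allowed to overlap).
    """
    best = 0
    for i in range(n):
        k = 0
        while n + k < len(seq) and seq[i + k] == seq[n + k]:
            k += 1
        best = max(best, k)
    return best


if __name__ == "__main__":
    rng = random.Random(0)   # fixed seed so the run is reproducible
    p, n = 0.3, 2000         # assumed source parameter and data base size
    seq = [1 if rng.random() < p else 0 for _ in range(n + 200)]
    # Compare the observed match length with the predicted typical value.
    print("observed match length:", longest_match(seq, n))
    print("(1/h) log n          :", math.log(n) / entropy_nats(p))
```

A single run gives one sample of the match length, so agreement with (1/h) log n should be expected only roughly; averaging over several seeds or positions gives a steadier comparison.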

[1]  B. Pittel Asymptotical Growth of a Class of Random Trees , 1985 .

[2]  Donald E. Knuth,et al.  The art of computer programming: sorting and searching (volume 3) , 1973 .

[3]  Samuel Karlin,et al.  Counts of long aligned word matches among random letter sequences , 1987, Advances in Applied Probability.

[4]  N. L. Lawrie,et al.  Comparison Methods for Queues and Other Stochastic Models , 1984 .

[5]  Peter Grassberger,et al.  Estimating the information content of symbol sequences and efficient codes , 1989, IEEE Trans. Inf. Theory.

[6]  Wojciech Szpankowski,et al.  Self-Alignments in Words and Their Applications , 1992, J. Algorithms.

[7]  Leonidas J. Guibas,et al.  String Overlaps, Pattern Matching, and Nontransitive Games , 1981, J. Comb. Theory, Ser. A.

[8]  David Haussler,et al.  Average sizes of suffix trees and DAWGs , 1989, Discret. Appl. Math..

[9]  Philippe Jacquet,et al.  Autocorrelation on Words and Its Applications - Analysis of Suffix Trees by String-Ruler Approach , 1994, J. Comb. Theory, Ser. A.

[10]  P. Erdős,et al.  On the application of the Borel-Cantelli lemma , 1952 .

[11]  Alfred V. Aho,et al.  Algorithms for Finding Patterns in Strings , 1991, Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity.

[12]  Leonidas J. Guibas,et al.  Periods in Strings , 1981, J. Comb. Theory, Ser. A.

[13]  P. Billingsley,et al.  Ergodic theory and information , 1966 .

[14]  Ward Whitt,et al.  Comparison methods for queues and other stochastic models , 1986 .

[15]  M. Waterman,et al.  THE ERDOS-RENYI STRONG LAW FOR PATTERN MATCHING WITH A GIVEN PROPORTION OF MISMATCHES , 1989 .

[16]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[17]  B. Pittel Paths in a random digital tree: limiting distributions , 1986, Advances in Applied Probability.

[18]  Philippe Jacquet,et al.  On the Lempel-Ziv parsing algorithm and its digital tree representation , 1993 .

[19]  Peter Weiner,et al.  Linear Pattern Matching Algorithms , 1973, SWAT.

[20]  Helmut Prodinger,et al.  On the variance of the external path length in a symmetric digital trie , 1989, Discret. Appl. Math..

[21]  Paul Louis Hennequin,et al.  Ecole d'Eté de Probabilités de Saint-Flour V-1975 , 1976 .

[22]  Jack K. Wolf,et al.  New asymptotic bounds and improvements on the Lempel-Ziv data compression algorithm , 1991, IEEE Trans. Inf. Theory.

[23]  Wojciech Szpankowski Some Results on V-ary Asymmetric Tries , 1988, J. Algorithms.

[24]  Richard M. Karp,et al.  A characterization of the minimum cycle mean in a digraph , 1978, Discret. Math..

[25]  Alberto Apostolico,et al.  The Myriad Virtues of Subword Trees , 1985 .

[26]  I. V. Romanovskii Optimization of stationary control of a discrete deterministic process , 1967 .

[27]  Wojciech Szpankowski,et al.  A Note on the Height of Suffix Trees , 1992, SIAM J. Comput..

[28]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[29]  Philippe Jacquet,et al.  Analysis of digital tries with Markovian dependency , 1991, IEEE Trans. Inf. Theory.

[30]  Eugene L. Lawler,et al.  Approximate string matching in sublinear expected time , 1990, Proceedings [1990] 31st Annual Symposium on Foundations of Computer Science.

[31]  Toby Berger,et al.  Review of Information Theory: Coding Theorems for Discrete Memoryless Systems (Csiszár, I., and Körner, J.; 1981) , 1984, IEEE Trans. Inf. Theory.

[32]  Wojciech Szpankowski,et al.  A Generalized Suffix Tree and its (Un)expected Asymptotic Behaviors , 1993, SIAM J. Comput..

[33]  Donald Ervin Knuth,et al.  The Art of Computer Programming , 1968 .

[34]  Aaron D. Wyner,et al.  Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression , 1989, IEEE Trans. Inf. Theory.

[35]  Mireille Régnier,et al.  Normal limiting distribution for the size and the external path length of tries , 1988 .

[36]  Alfred V. Aho,et al.  The Design and Analysis of Computer Algorithms , 1974 .

[37]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[38]  Abraham Lempel,et al.  On the Complexity of Finite Sequences , 1976, IEEE Trans. Inf. Theory.

[39]  D. Aldous,et al.  A diffusion limit for a class of randomly-growing binary trees , 1988 .

[40]  Helmut Prodinger,et al.  Digital Search Trees Again Revisited: The Internal Path Length Perspective , 1994, SIAM J. Comput..

[41]  Aaron D. Wyner,et al.  Fixed data base version of the Lempel-Ziv data compression algorithm , 1991, IEEE Trans. Inf. Theory.

[42]  Benjamin Weiss,et al.  Entropy and data compression schemes , 1993, IEEE Trans. Inf. Theory.

[43]  Michael Rodeh,et al.  Linear Algorithm for Data Compression via String Matching , 1981, JACM.

[44]  P. Shields Entropy and Prefixes , 1992 .

[45]  John C. Kieffer,et al.  Sample converses in source coding theory , 1991, IEEE Trans. Inf. Theory.