A Generalized Suffix Tree and its (Un)expected Asymptotic Behaviors

Suffix trees find several applications in computer science and telecommunications, most notably in algorithms on strings, data compression, and codes. Despite this, very little is known about their typical behavior. In a probabilistic framework, a family of suffix trees (called b-suffix trees below) built from the first n suffixes of a random word is considered. In this family, a noncompact suffix tree (i.e., one in which every edge is labeled by a single symbol) is represented by $b = 1$, and a compact suffix tree (i.e., one without unary nodes) is asymptotically equivalent to $b \to \infty$ as $n \to \infty$. Several parameters of b-suffix trees are studied, namely the depth of a given suffix, the depth of insertion, the height, and the shortest feasible path. Some new results concerning the typical (i.e., almost sure) behavior of these parameters are established. These findings are used to obtain several insights into certain algorithms on words, molecular biology, and universal data compression schemes.
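To make the $b = 1$ (noncompact) case concrete, the following is a minimal sketch, not the construction analyzed in the paper. It assumes that the depth of a suffix is the length of its shortest prefix that distinguishes it from the other suffixes inserted, that the height is the largest such depth, and that the shortest path is the smallest. The function names, the `$` end marker, and the brute-force pairwise comparison are illustrative choices only.

```python
import random


def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    k = 0
    for a, b in zip(u, v):
        if a != b:
            break
        k += 1
    return k


def suffix_depths(word, n):
    """Depths of the first n suffixes of `word` in a noncompact suffix tree.

    Assumption: the depth of suffix i is 1 + the longest common prefix it
    shares with any other of the n suffixes. A '$' end marker (assumed not
    to occur in `word`) guarantees no suffix is a prefix of another.
    """
    if n > len(word):
        raise ValueError("n must not exceed the word length")
    text = word + "$"
    suffixes = [text[i:] for i in range(n)]
    depths = []
    for i, s in enumerate(suffixes):
        longest = max(
            (lcp(s, t) for j, t in enumerate(suffixes) if j != i),
            default=0,
        )
        depths.append(longest + 1)
    return depths


def height(depths):
    """Largest depth over all inserted suffixes."""
    return max(depths)


def shortest_path(depths):
    """Smallest depth over all inserted suffixes."""
    return min(depths)


if __name__ == "__main__":
    # A random binary word with equiprobable symbols (a hypothetical source).
    rng = random.Random(0)
    word = "".join(rng.choices("01", k=200))
    d = suffix_depths(word, 200)
    print("height:", height(d), "shortest path:", shortest_path(d))
```

For the word `banana` with all six suffixes inserted, this sketch gives depths `[1, 4, 3, 4, 3, 2]`, so the height is 4 and the shortest path is 1. The quadratic comparison is only suitable for small n; it is meant to illustrate the parameters, not to build the tree efficiently.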
