An Efficient Extension to Mixture Techniques for Prediction and Decision Trees

We present an efficient method for maintaining mixtures of prunings of a prediction or decision tree that extends the previous methods for “node-based” prunings (Buntine, 1990; Willems, Shtarkov, & Tjalkens, 1995; Helmbold & Schapire, 1997; Singer, 1997) to the larger class of edge-based prunings. The method includes an online weight-allocation algorithm that can be used for prediction, compression and classification. Although the set of edge-based prunings of a given tree is much larger than that of node-based prunings, our algorithm has similar space and time complexity to that of previous mixture algorithms for trees. Using the general online framework of Freund and Schapire (1997), we prove that our algorithm maintains correctly the mixture weights for edge-based prunings with any bounded loss function. We also give a similar algorithm for the logarithmic loss function with a corresponding weight-allocation algorithm. Finally, we describe experiments comparing node-based and edge-based mixture models for estimating the probability of the next word in English text, which show the advantages of edge-based models.

[1]  Yoram Singer,et al.  Adaptive Mixtures of Probabilistic Transducers , 1995, Neural Computation.

[2]  Vladimir Vovk,et al.  Aggregating strategies , 1990, COLT '90.

[3]  Alfredo De Santis,et al.  Learning probabilistic prediction functions , 1988, [Proceedings 1988] 29th Annual Symposium on Foundations of Computer Science.

[4]  Frederick Jelinek,et al.  Statistical methods for speech recognition , 1997 .

[5]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[6]  Ian H. Witten,et al.  Text Compression , 1990, 125 Problems in Text Algorithms.

[7]  John Case,et al.  Proceedings of the Third Annual Workshop on Computational Learning Theory : University of Rochester, Rochester, New York, August 6-8, 1990 , 1990 .

[8]  Glen G. Langdon,et al.  Universal modeling and coding , 1981, IEEE Trans. Inf. Theory.

[9]  Wray L. Buntine,et al.  A theory of learning classification rules , 1990 .

[10]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[11]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1997, EuroCOLT.

[12]  Meir Feder,et al.  A universal finite memory source , 1995, IEEE Trans. Inf. Theory.

[13]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[14]  Neri Merhav,et al.  Optimal sequential probability assignment for individual sequences , 1994, IEEE Trans. Inf. Theory.

[15]  Dana Ron,et al.  The power of amnesia: Learning probabilistic automata with variable memory length , 1996, Machine Learning.

[16]  Manfred K. Warmuth,et al.  The Weighted Majority Algorithm , 1994, Inf. Comput..

[17]  Abraham Lempel,et al.  A sequential algorithm for the universal coding of finite memory sources , 1992, IEEE Trans. Inf. Theory.

[18]  I. Good THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS , 1953 .

[19]  Frans M. J. Willems,et al.  The context-tree weighting method: basic properties , 1995, IEEE Trans. Inf. Theory.

[20]  Raphail E. Krichevsky,et al.  The performance of universal encoding , 1981, IEEE Trans. Inf. Theory.

[21]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[22]  Marcelo J. Weinberger,et al.  Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the Lempel-Ziv algorithm , 1992, IEEE Trans. Inf. Theory.

[23]  Ian H. Witten,et al.  The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression , 1991, IEEE Trans. Inf. Theory.

[24]  Yoram Singer,et al.  Beyond Word N-Grams , 1996, VLC@ACL.

[25]  Robert E. Schapire,et al.  Predicting Nearly as Well as the Best Pruning of a Decision Tree , 1995, COLT.

[26]  Jorma Rissanen,et al.  Complexity of strings in the class of Markov sources , 1986, IEEE Trans. Inf. Theory.