Count-Based Frequency Estimation with Bounded Memory

Count-based estimators are a fundamental building block of many powerful sequential prediction algorithms, including Context Tree Weighting and Prediction by Partial Matching. Keeping exact counts, however, typically incurs a high memory overhead; in particular, when dealing with large alphabets the memory requirements of count-based estimators often become prohibitive. In this paper we propose three novel ideas for approximating count-based estimators using bounded memory. Our first contribution, of independent interest, is an extension of reservoir sampling for sampling distinct symbols from a stream of unknown length, which we call K-distinct reservoir sampling. We combine this sampling scheme with a state-of-the-art count-based estimator for memoryless sources, the Sparse Adaptive Dirichlet (SAD) estimator. The resulting algorithm, the Budget SAD, naturally guarantees a limit on its memory usage. Finally, we show the broader applicability of K-distinct reservoir sampling to nonparametric estimation by using it to restrict the branching factor of the Context Tree Weighting algorithm. We demonstrate the usefulness of our algorithms with empirical results on two sequential, large-alphabet prediction problems.
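
As background for the sampling scheme mentioned above, the following is a minimal Python sketch of classical reservoir sampling (Algorithm R), which keeps a uniform size-K sample from a stream of unknown length using O(K) memory. It is illustrative only: the K-distinct variant proposed in the paper additionally restricts the sample to distinct symbols, and the function name and interface below are assumptions for exposition, not the paper's algorithm.

    import random

    def reservoir_sample(stream, k):
        """Classical reservoir sampling (Algorithm R).

        Keeps a uniform sample of k items from a stream of unknown
        length in O(k) memory. Illustrative background only; the
        paper's K-distinct variant further restricts the reservoir
        to distinct symbols.
        """
        reservoir = []
        for i, item in enumerate(stream, start=1):
            if i <= k:
                reservoir.append(item)       # fill the reservoir first
            else:
                j = random.randint(1, i)     # item i survives with probability k/i
                if j <= k:
                    reservoir[j - 1] = item  # evict a uniformly random slot
        return reservoir

    # Example: a bounded-memory sample of 5 symbols from a long stream.
    sample = reservoir_sample(iter("abracadabra" * 100), 5)

The key property is that the memory footprint is fixed at K slots regardless of stream length, which is the same kind of budget the Budget SAD estimator enforces over its symbol counts.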
