On the learnability of discrete distributions

We introduce and investigate a new model of learning probability distributions from independent draws. Our model is inspired by the popular Probably Approximately Correct (PAC) model for learning Boolean functions from labeled examples [24], in the sense that we emphasize efficient and approximate learning, and we study the learnability of restricted classes of target distributions. The distribution classes we examine are often defined by some simple computational mechanism for transforming a truly random string of input bits (which is not visible to the learning algorithm) into the stochastic observation (output) seen by the learning algorithm. In this paper, we concentrate on discrete distributions over {0,1}^n.

The problem of inferring an approximation to an unknown probability distribution on the basis of independent draws has a long and complex history in the pattern recognition and statistics literature. For instance, the problem of estimating the parameters of a Gaussian density in high-dimensional space is one of the most studied statistical problems. Distribution learning problems have often been investigated in the context of unsupervised learning, in which a linear mixture of two or more distributions is generating the observations, and the final goal is not to model the distributions themselves, but to predict from which distribution each observation was drawn. Data clustering methods are a common tool here. There is also a large literature on nonparametric density estimation, in which no assumptions are made about the unknown target density. Nearest-neighbor approaches to the unsupervised learning problem often arise in the nonparametric setting. While we obviously cannot do justice to these areas here, the books of Duda and Hart [9] and Vapnik [25] provide excellent overviews and introductions to the pattern recognition work, as well as many pointers for further reading. See also Izenman's recent survey article [16].

Roughly speaking, our work departs from the traditional statistical and pattern recognition approaches in two ways. First, we place explicit emphasis on the computational complexity of distribution learning. It seems fair to say that while previous research has provided an excellent understanding of the information-theoretic issues involved in distribution learning, the computational complexity of the learning problem has received less attention.
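To make the generator view above concrete, here is a minimal Python sketch, not taken from the paper: a hypothetical toy "circuit" that XORs pairs of hidden random input bits to produce the observation in {0,1}^n seen by the learner. The function names and the parity circuit itself are illustrative assumptions, not the paper's construction; any simple computational mechanism could play the circuit's role.

import random

def hidden_bits(m):
    """Truly random input bits -- never visible to the learning algorithm."""
    return [random.randint(0, 1) for _ in range(m)]

def toy_circuit(r):
    """A hypothetical 'simple computational mechanism': each output bit
    is the XOR of two hidden input bits (purely illustrative)."""
    return (r[0] ^ r[1], r[1] ^ r[2], r[2] ^ r[0], r[0] ^ r[3])

def draw():
    """One independent draw from the induced distribution over {0,1}^4."""
    return toy_circuit(hidden_bits(4))

# The learner receives only i.i.d. outputs like these, never the inputs,
# and must efficiently approximate the induced output distribution.
sample = [draw() for _ in range(5)]
print(sample)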

[1] Richard J. Lipton et al. Cryptographic Primitives Based on Hard Learning Problems. CRYPTO, 1993.

[2] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. Wiley-Interscience, 1974.

[3] Leslie G. Valiant. The Complexity of Enumeration and Reliability Problems. SIAM J. Comput., 1979.

[4] Michael Kearns and Ming Li. Learning in the Presence of Malicious Errors. SIAM J. Comput., 1993.

[5] Oded Goldreich et al. On the theory of average case complexity. STOC, 1989.

[6] David Haussler et al. Learnability and the Vapnik-Chervonenkis dimension. JACM, 1989.

[8] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 2005.

[9] Michael Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. 31st Annual Symposium on Foundations of Computer Science (FOCS), 1990.

[10] Michael Kearns. Efficient noise-tolerant learning from statistical queries. STOC, 1993.

[11] Michael Kearns and Leslie G. Valiant. Cryptographic Limitations on Learning Boolean Formulae and Finite Automata. In Machine Learning: From Theory to Applications, 1993.

[12] Yoav Freund. An improved boosting algorithm and its implications on learning complexity. COLT, 1992.

[13] Michael Kearns and Ming Li. Learning in the presence of malicious errors. STOC, 1988.

[14] Alex Samorodnitsky et al. Inclusion-exclusion: Exact and approximate. Combinatorica, 1996.

[15] Manfred K. Warmuth et al. Learning integer lattices. COLT, 1990.

[16] Hans Ulrich Simon et al. On learning ring-sum-expansions. COLT, 1990.

[17] Manuel Blum et al. A Simple Unpredictable Pseudo-Random Number Generator. SIAM J. Comput., 1986.

[19] Temple F. Smith. Occam's razor. Nature, 1980.

[20] A. Izenman. Recent Developments in Nonparametric Density Estimation, 1991.

[21] Oded Goldreich, Shafi Goldwasser, and Silvio Micali. How to construct random functions. JACM, 1986.

[22] Yuri Gurevich. Average Case Completeness. J. Comput. Syst. Sci., 1991.

[24] Michael Kearns and Leslie G. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. JACM, 1994.

[25] Vladimir Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics, 1982.

[26] Leslie G. Valiant. A theory of the learnable. STOC, 1984.

[27] Vasek Chvátal. A Greedy Heuristic for the Set-Covering Problem. Math. Oper. Res., 1979.

[28] David Haussler. Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications. Inf. Comput., 1992.