Concentration Bounds for Unigram Language Models

We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the total probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a high-probability bound of approximately O(k/√m). We improve its dependency on k to O(k^{1/4}/√m + k/m). We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately O(1/k + √k/m). We derive a combined estimator whose error is approximately O(m^{-2/5}), for any k.

A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used to estimate the log-loss of the unigram model. We show that its error has a high-probability bound of approximately O(1/√m), for any underlying distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution.
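As a concrete illustration of the quantities discussed above, the sketch below computes the Good-Turing estimator (k+1)·n_{k+1}/m and the empirical-frequencies estimator k·n_k/m for the mass of words seen exactly k times, together with a leave-one-out estimate of the per-word log-loss of the empirical unigram model. The toy Zipf-like distribution, vocabulary size, sample size, and smoothing floor are assumptions chosen for illustration only; the sketch shows the standard estimators, not the paper's concentration analysis of their errors.

```python
import math
import random
from collections import Counter

def good_turing_mass(sample, k):
    """Good-Turing estimate of the total probability of words seen exactly k times:
    G_k = (k + 1) * n_{k+1} / m, where n_{k+1} counts words appearing k + 1 times."""
    m = len(sample)
    counts = Counter(sample)
    n_k_plus_1 = sum(1 for c in counts.values() if c == k + 1)
    return (k + 1) * n_k_plus_1 / m

def empirical_mass(sample, k):
    """Empirical-frequencies estimate of the same quantity: k * n_k / m."""
    m = len(sample)
    counts = Counter(sample)
    n_k = sum(1 for c in counts.values() if c == k)
    return k * n_k / m

def true_mass(dist, sample, k):
    """True probability mass of the words that appear exactly k times in the sample."""
    counts = Counter(sample)
    return sum(p for w, p in dist.items() if counts.get(w, 0) == k)

def leave_one_out_log_loss(sample, smoothing=1e-9):
    """Leave-one-out estimate of the per-word log-loss of the empirical unigram model.
    Each word's own occurrence is removed before its probability is evaluated;
    the smoothing floor (an assumption) avoids log(0) for singleton words."""
    m = len(sample)
    counts = Counter(sample)
    loss = 0.0
    for w in sample:
        p = max((counts[w] - 1) / (m - 1), smoothing)
        loss -= math.log(p)
    return loss / m

if __name__ == "__main__":
    # Hypothetical setup: a Zipf-like distribution over a toy vocabulary of 500 words.
    vocab = [f"w{i}" for i in range(1, 501)]
    weights = [1.0 / i for i in range(1, 501)]
    total = sum(weights)
    dist = {w: p / total for w, p in zip(vocab, weights)}

    random.seed(0)
    sample = random.choices(vocab, weights=weights, k=5000)

    for k in (1, 2, 5):
        print(k, true_mass(dist, sample, k),
              good_turing_mass(sample, k), empirical_mass(sample, k))
    print("leave-one-out log-loss:", leave_one_out_log_loss(sample))
```

In line with the abstract, the Good-Turing estimate tends to track the true mass better for small k, while the empirical-frequencies estimate improves as k grows; the combined estimator analyzed in the paper interpolates between the two regimes.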
