Applications of empirical processes in learning theory: algorithmic stability and generalization bounds

This thesis studies two key properties of learning algorithms: their generalization ability and their stability with respect to perturbations of the data. To analyze these properties, we focus on concentration inequalities and tools from empirical process theory. We obtain theoretical results and demonstrate their applications to machine learning.

First, we show how various notions of stability upper- and lower-bound the bias and variance of several estimators of the expected performance of general learning algorithms. A weak stability condition is shown to be equivalent to the consistency of empirical risk minimization.

The second part of the thesis derives tight performance guarantees for greedy error minimization methods, a family of computationally tractable algorithms. In particular, we derive risk bounds for a greedy mixture density estimation procedure and prove that, contrary to what is suggested in the literature, the number of terms in the mixture does not act as a bias-variance trade-off parameter for the performance.

The third part of the thesis solves an open problem regarding the stability of empirical risk minimization (ERM), an algorithm of central importance in learning theory. By studying the suprema of the empirical process, we prove that ERM over Donsker classes of functions is stable in the L1 norm. Hence, as the number of samples n grows, it becomes less and less likely that a perturbation of o(√n) samples will result in a very different empirical minimizer. Asymptotic rates for this stability are proved under metric entropy assumptions on the function class. Through the use of a ratio limit inequality, we also prove stability of the expected errors of empirical minimizers. We then investigate applications of the stability result, focusing on procedures that optimize an objective function, such as k-means and other clustering methods. We demonstrate that stability of clustering, just like stability of ERM, is closely related to the geometry of the function class and the underlying measure. Furthermore, our result on the stability of ERM delineates a phase transition between stability and instability of clustering methods.

In the last chapter, we prove a generalization of the bounded-difference (McDiarmid) concentration inequality to functions that are smooth almost everywhere. This result can be used to analyze algorithms that are almost always stable. We then prove a phase transition in the concentration of almost-everywhere smooth functions. Finally, a tight concentration result for the empirical errors of empirical minimizers is shown under an assumption on the underlying space.
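
The L1-stability statement for ERM above lends itself to a quick numerical illustration. The sketch below is not taken from the thesis: it runs ERM over the class of threshold classifiers on [0, 1] (a simple Donsker class), redraws about n^(1/4) = o(√n) of the sample points, and reports the L1(P) distance between the two empirical minimizers, which should shrink as n grows. All names, the grid size, and the noise level are our own illustrative choices.

```python
"""Illustrative simulation (not from the thesis): L1(P)-stability of ERM over
threshold classifiers h_t(x) = 1{x >= t} on [0, 1] when o(sqrt(n)) sample
points are redrawn."""
import numpy as np

rng = np.random.default_rng(0)

def sample(n, noise=0.2):
    # Uniform inputs on [0, 1]; labels 1{x >= 0.5}, flipped with probability `noise`.
    x = rng.uniform(0.0, 1.0, n)
    y = (x >= 0.5).astype(int)
    flip = rng.uniform(0.0, 1.0, n) < noise
    return x, np.where(flip, 1 - y, y)

def erm_threshold(x, y, grid):
    # Empirical 0-1 risk of h_t for every threshold t on the grid; return the minimizer.
    preds = x[:, None] >= grid[None, :]               # boolean, shape (n, len(grid))
    risks = (preds != (y[:, None] == 1)).mean(axis=0)
    return grid[np.argmin(risks)]

grid = np.linspace(0.0, 1.0, 401)
for n in [100, 1000, 10000, 50000]:
    x, y = sample(n)
    k = int(np.ceil(n ** 0.25))                       # roughly n^(1/4) = o(sqrt(n)) points
    x2, y2 = x.copy(), y.copy()
    x2[:k], y2[:k] = sample(k)                        # redraw the first k (exchangeable) points
    t1, t2 = erm_threshold(x, y, grid), erm_threshold(x2, y2, grid)
    # Under the uniform marginal P, the L1(P) distance between h_t1 and h_t2 is |t1 - t2|.
    print(f"n={n:6d}  perturbed={k:3d}  L1 distance between minimizers = {abs(t1 - t2):.4f}")
```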

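For context on the last chapter, the classical bounded-difference inequality that it generalizes is McDiarmid's inequality, which requires the bounded-difference condition to hold for every configuration of the sample rather than merely almost everywhere. A standard statement, written in LaTeX for reference (this is the textbook inequality, not the relaxed version proved in the thesis):

```latex
% McDiarmid's bounded-difference inequality (classical form).
% Assume X_1, ..., X_n are independent and f satisfies, for each i,
%   |f(x_1, ..., x_i, ..., x_n) - f(x_1, ..., x_i', ..., x_n)| <= c_i
% for all choices of the coordinates. Then for every t > 0,
\[
  \Pr\bigl( f(X_1,\dots,X_n) - \mathbb{E}\, f(X_1,\dots,X_n) \ge t \bigr)
  \;\le\; \exp\!\left( -\frac{2t^2}{\sum_{i=1}^{n} c_i^2} \right).
\]
```

The almost-everywhere smooth extension studied in the thesis relaxes the requirement that these coordinate-wise differences be bounded everywhere.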