General conditions for predictivity in learning theory

Developing theoretical foundations for learning is a key step towards understanding intelligence. ‘Learning from examples’ is a paradigm in which systems (natural or artificial) learn a functional relationship from a training set of examples. Within this paradigm, a learning algorithm is a map from the space of training sets to the hypothesis space of possible functional solutions. A central question for the theory is to determine conditions under which a learning algorithm will generalize from its finite training set to novel examples. A milestone in learning theory was a characterization of conditions on the hypothesis space that ensure generalization for the natural class of empirical risk minimization (ERM) learning algorithms that are based on minimizing the error on the training set. Here we provide conditions for generalization in terms of a precise stability property of the learning process: when the training set is perturbed by deleting one example, the learned hypothesis does not change much. This stability property stipulates conditions on the learning map rather than on the hypothesis space, subsumes the classical theory for ERM algorithms, and is applicable to more general algorithms. The surprising connection between stability and predictivity has implications for the foundations of learning theory and for the design of novel algorithms, and provides insights into problems as diverse as language learning and inverse problems in physics and engineering.
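As an illustrative sketch of the stability property described above (the notation here is introduced for exposition and is not taken from the paper): let S = {z_1, ..., z_n} denote the training set, f_S the hypothesis returned by the learning map on S, S^i the set S with the i-th example deleted, and V(f, z) the loss of hypothesis f on example z. Leave-one-out (cross-validation) stability then asks that, for each i,

    \left| V(f_{S^i}, z_i) - V(f_S, z_i) \right| \le \beta(n), \qquad \beta(n) \to 0 \ \text{as} \ n \to \infty,

whereas generalization asks that the empirical error I_S[f_S] = (1/n) \sum_i V(f_S, z_i) converge to the expected error I[f_S] = E_z[ V(f_S, z) ]. In this language, the claim summarized in the abstract is that a suitable form of the first condition guarantees the second for general learning maps, and that for ERM it subsumes the classical hypothesis-space conditions.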
