Structural Risk Minimization Over Data-Dependent Hierarchies

The paper introduces some generalizations of Vapnik's (1982) method of structural risk minimization (SRM). As well as making explicit some of the details of SRM, it provides a result that allows one to trade off errors on the training sample against improved generalization performance. It then considers the more general case in which the hierarchy of classes is chosen in response to the data. A result is presented on the generalization performance of classifiers with a "large margin". This theoretically explains the impressive generalization performance of the maximal margin hyperplane algorithm of Vapnik and co-workers (which is the basis of their support vector machines). The paper concludes with a more general result in terms of "luckiness" functions, which provides a quite general way of exploiting serendipitous simplicity in the observed data to obtain better prediction accuracy from small training sets. Four examples of such functions are given, including the Vapnik-Chervonenkis (1971) dimension measured on the sample.
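
As a purely illustrative aside (not taken from the paper), the sketch below shows the data-dependent quantity the large-margin result refers to: for a separating hyperplane w·x + b with the canonical normalization min_i y_i(w·x_i + b) = 1, the geometric margin is 1/||w||. It uses scikit-learn's SVC with a very large C as a stand-in for the hard-margin (maximal margin) algorithm, on synthetic separable data; the data and all variable names here are assumptions for illustration only.

```python
# Illustrative sketch only (not from the paper): fit an (approximately) maximal
# margin hyperplane on linearly separable synthetic data and report the
# geometric margin 1/||w||, the sample-dependent quantity that margin-based
# generalization bounds depend on.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters in R^2 (synthetic, for illustration).
X = np.vstack([
    rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2)),
])
y = np.array([+1] * 20 + [-1] * 20)

# A linear SVC with a very large C approximates the hard (maximal) margin solution.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                    # normal vector of the separating hyperplane
b = clf.intercept_[0]               # offset
margin = 1.0 / np.linalg.norm(w)    # geometric margin under the canonical normalization
print(f"w = {w}, b = {b:.3f}, geometric margin = {margin:.3f}")
```

The point of the paper's margin result is that this observed margin can be used after seeing the data: a larger margin on the training sample corresponds to a smaller effective capacity at that scale, and hence a tighter generalization bound, without fixing the hierarchy of classes in advance.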

[1] D. Horne. The Lucky Country: Australia in the Sixties. 1964.

[2] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. 1971.

[3] R. O. Duda et al. Pattern Classification and Scene Analysis. Wiley-Interscience, 1974.

[4] S. M. Ross. Stochastic Processes. 1983.

[5] D. Pollard. Convergence of Stochastic Processes. 1984.

[6] L. Ljung. System Identification: Theory for the User. 1987.

[7] N. Littlestone. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. In Proc. 28th Annual Symposium on Foundations of Computer Science, 1987.

[8] N. Linial et al. Results on learnability and the Vapnik-Chervonenkis dimension. In Proc. 29th Annual Symposium on Foundations of Computer Science, 1988.

[9] J. Shawe-Taylor et al. The learnability of formal concepts. In Proc. COLT '90, 1990.

[10] M. I. Jordan et al. Advances in Neural Information Processing Systems. 1995.

[11] I. Guyon et al. Structural risk minimization for character recognition. In NIPS, 1991.

[12] T. M. Cover et al. Elements of Information Theory. 1991.

[13] A. R. Barron. Complexity regularization with application to artificial neural networks. 1991.

[14] V. Vapnik. Principles of risk minimization for learning theory. In NIPS, 1991.

[15] A. R. Barron. Approximation and estimation bounds for artificial neural networks. In Proc. COLT '91, 1991.

[16] D. J. C. MacKay. Bayesian model comparison and backprop nets. In NIPS, 1991.

[17] A. R. Barron et al. Minimum complexity density estimation. IEEE Trans. Inf. Theory, 1991.

[18] B. E. Boser et al. A training algorithm for optimal margin classifiers. In Proc. COLT '92, 1992.

[19] A. Itai et al. Dominating distributions and learnability. In Proc. COLT '92, 1992.

[20] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inf. Comput., 1992.

[21] J. Shawe-Taylor et al. Bounding sample size with the Vapnik-Chervonenkis dimension. Discrete Applied Mathematics, 1993.

[22] N. Alon et al. Scale-sensitive dimensions, uniform convergence, and learnability. In Proc. 34th Annual Symposium on Foundations of Computer Science, 1993.

[23] A. Itai et al. Nonuniform learnability. J. Comput. Syst. Sci., 1988.

[24] J. Shawe-Taylor et al. A result of Vapnik with applications. Discrete Applied Mathematics, 1993.

[25] R. E. Schapire et al. Efficient distribution-free learning of probabilistic concepts. In Proc. 31st Annual Symposium on Foundations of Computer Science, 1990.

[26] P. M. Long et al. Fat-shattering and the learnability of real-valued functions. In Proc. COLT '94, 1994.

[27] G. Lugosi et al. Nonparametric estimation via empirical risk minimization. IEEE Trans. Inf. Theory, 1995.

[28] M. Opper et al. Perceptron learning: the largest version space. 1995.

[29] D. J. C. MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. 1995.

[30] F. Girosi et al. Regularization theory and neural networks. 1995.

[31] T. A. Poggio et al. Regularization theory and neural networks architectures. Neural Computation, 1995.

[32] L. Gurvits et al. Approximation and learning of convex superpositions. J. Comput. Syst. Sci., 1995.

[33] M. H. Hassoun. Fundamentals of Artificial Neural Networks. 1996.

[34] G. Lugosi et al. A data-dependent skeleton estimate for learning. In Proc. COLT '96, 1996.

[35] J. A. Wellner et al. Weak Convergence and Empirical Processes: With Applications to Statistics. 1996.

[36] C. Mesterharm et al. An Apobayesian relative of Winnow. In NIPS, 1996.

[37] J. Shawe-Taylor et al. A framework for structural risk minimisation. In Proc. COLT '96, 1996.

[38] G. Lugosi et al. Concept learning using complexity regularization. IEEE Trans. Inf. Theory, 1995.

[39] T. G. Dietterich. What is machine learning?

[40] P. R. Kumar et al. Learning by canonical smooth estimation. I. Simultaneous estimation. IEEE Trans. Autom. Control, 1996.

[41] P. R. Kumar et al. Learning by canonical smooth estimation. II. Learning and choice of model complexity. IEEE Trans. Autom. Control, 1996.

[42] Sample compression, learnability, and the Vapnik-Chervonenkis dimension. In Proc. EuroCOLT, 1997.

[43] E. D. Sontag et al. Neural networks with quadratic VC dimension. J. Comput. Syst. Sci., 1995.

[44] J. Shawe-Taylor et al. A sufficient condition for polynomial distribution-dependent learnability. Discrete Applied Mathematics, 1997.

[45] E. D. Sontag. Shattering all sets of k points in general position requires (3k - 1)/2 parameters. Neural Computation, 1997.

[46] P. L. Bartlett et al. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inf. Theory, 1998.

[47] G. Lugosi et al. Adaptive model selection using empirical complexities. 1998.

[48] P. M. Long et al. Prediction, learning, uniform convergence, and scale-sensitive dimensions. J. Comput. Syst. Sci., 1998.

[49] P. L. Bartlett et al. Function learning from interpolation. Combinatorics, Probability and Computing, 2000.

[50] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.