Model Selection and Error Estimation

We study model selection strategies based on penalized empirical loss minimization. We point out a tight relationship between error estimation and data-based complexity penalization: any good error estimate may be converted into a data-based penalty function, and the performance of the resulting estimate is governed by the quality of the error estimate. We consider several penalty functions, involving error estimates on independent test data, empirical VC dimension, empirical VC entropy, and margin-based quantities. We also consider the maximal difference between the error on the first half of the training data and the error on the second half, as well as the expected maximal discrepancy, a closely related capacity estimate that can be calculated by Monte Carlo integration. Maximal discrepancy penalty functions are appealing for pattern classification problems, since computing them is equivalent to empirical risk minimization over the training data with some of the labels flipped.
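As a concrete illustration of the last point, the sketch below (a minimal Python example with hypothetical names; it assumes 0/1 labels, an even sample size, and a finite hypothesis class given as callables, whereas in practice the inner maximization would be delegated to whatever empirical risk minimizer is available for the class) computes the maximal discrepancy through the label-flipping equivalence and estimates the expected maximal discrepancy by Monte Carlo averaging over random splits.

```python
import numpy as np

def maximal_discrepancy(hypotheses, X, y):
    """Maximal discrepancy on one fixed split: the largest difference between
    a hypothesis's 0/1 error on the first half of the sample and its error on
    the second half. With 0/1 loss, disc(h) = 1 - 2 * err_flipped(h), where
    err_flipped is the empirical error after flipping the first-half labels,
    so a single ERM run over the flipped sample suffices.
    `hypotheses`: finite class of callables h(X) -> 0/1 predictions
    (illustrative; assumes an even sample size)."""
    m = len(y) // 2
    y_flipped = y.copy()
    y_flipped[:m] ^= 1  # flip the 0/1 labels on the first half
    best_err = min(np.mean(h(X) != y_flipped) for h in hypotheses)
    return 1.0 - 2.0 * best_err

def expected_maximal_discrepancy(hypotheses, X, y, n_splits=100, seed=0):
    """Monte Carlo estimate of the expected maximal discrepancy: average the
    fixed-split discrepancy over random reorderings of the sample."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_splits):
        perm = rng.permutation(len(y))
        vals.append(maximal_discrepancy(hypotheses, X[perm], y[perm]))
    return float(np.mean(vals))

# Toy usage: decision stumps on one feature, labels in {0, 1} with 10% noise.
rng = np.random.default_rng(1)
X = rng.uniform(size=200)
y = (X > 0.5).astype(int) ^ (rng.uniform(size=200) < 0.1)
stumps = [lambda X, t=t: (X > t).astype(int) for t in np.linspace(0, 1, 21)]
print(expected_maximal_discrepancy(stumps, X, y))
```

The appeal of the flipped-label identity is practical: it reduces the computation of the penalty to one additional call of the same empirical risk minimization routine already used for training, with no new optimization machinery required.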
