Convexity, Classification, and Risk Bounds

Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0–1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0–1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function—that it satisfies a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise, and show that in this case, strictly convex loss functions lead to faster rates of convergence of the risk than would be implied by standard uniform convergence arguments. Finally, we present applications of our results to the estimation of convergence rates in function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.
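
To illustrate how the variational transformation mentioned above can be computed, here is a minimal numerical sketch (Python/NumPy) under the standard margin-loss formulation: the conditional surrogate risk is minimized with and without a constraint that the prediction have the wrong sign, and the gap between the two minima defines the transform, which the paper uses to bound the excess 0-1 risk by the excess surrogate risk. The function name psi_tilde, the grid bounds, and the choice of hinge and exponential losses are illustrative assumptions, not details taken from the text; for convex calibrated losses the transform needs no further convexification, while in general one would also take its convex closure.

import numpy as np

def psi_tilde(phi, theta, alpha_grid=np.linspace(-20.0, 20.0, 80001)):
    """Numerically approximate the psi-transform of a margin loss phi at theta in [0, 1].

    C_eta(alpha) = eta * phi(alpha) + (1 - eta) * phi(-alpha) is the conditional surrogate risk;
    H(eta) is its unconstrained infimum, H^-(eta) the infimum over alpha with the wrong sign,
    and psi_tilde(theta) = H^-((1 + theta) / 2) - H((1 + theta) / 2).
    """
    eta = (1.0 + theta) / 2.0
    cond_risk = eta * phi(alpha_grid) + (1.0 - eta) * phi(-alpha_grid)
    h_opt = cond_risk.min()                           # H(eta): unconstrained infimum
    wrong_sign = alpha_grid * (2.0 * eta - 1.0) <= 0  # alpha whose sign disagrees with 2*eta - 1
    h_minus = cond_risk[wrong_sign].min()             # H^-(eta): constrained infimum
    return h_minus - h_opt

hinge = lambda a: np.maximum(0.0, 1.0 - a)   # SVM hinge loss
expo = lambda a: np.exp(-a)                  # boosting-style exponential loss

for theta in (0.1, 0.5, 0.9):
    print(theta, psi_tilde(hinge, theta), abs(theta))                      # hinge: psi(theta) = |theta|
    print(theta, psi_tilde(expo, theta), 1.0 - np.sqrt(1.0 - theta ** 2))  # exponential: 1 - sqrt(1 - theta^2)

The printed pairs agree (up to grid resolution) with the closed forms usually quoted for these two losses, which is what makes the transformation easy to compute in practice.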
