Discussion of Boosting Papers

We congratulate the authors for their interesting papers on boosting and related topics. Jiang deals with the asymptotic consistency of AdaBoost. Lugosi and Vayatis study the convex optimization of loss functions associated with boosting. Zhang studies the loss functions themselves. Their results imply that boosting-like methods can reasonably be expected to converge to Bayes classifiers under sufficient regularity conditions (such as the requirement that trees with at least p + 1 terminal nodes are used, where p is the number of variables in the model). An interesting feature of their results is that whenever data-based optimization is performed, some form of regularization is needed in order to attain consistency. In the case of AdaBoost this is achieved by stopping the boosting procedure early, whereas in the case of convex loss optimization it is achieved by constraining the L1 norm of the coefficient vector. These results reiterate, from this new perspective, the critical importance of regularization for building useful prediction models in high-dimensional spaces. This is also the theme of the remainder of our discussion.

Since the publication of the AdaBoost procedure by Freund and Schapire in 1996, there has been a flurry of papers seeking to answer the question: why does boosting work? Since AdaBoost has been generalized in different ways by different authors, the question might be better posed as: what aspects of boosting are the key to its good performance?
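
The regularization-by-early-stopping point above can be made concrete with a small sketch. Below is a minimal Python illustration (ours, not code from any of the papers under discussion) of AdaBoost with decision stumps in which the number of boosting rounds is chosen by monitoring error on a held-out validation set; the helper names (fit_stump, adaboost_early_stop), the ±1 label coding, and the validation split are assumptions made for the example.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: exhaustively pick the (feature, threshold, sign)
    that minimizes the weighted misclassification error."""
    best, best_err = (0, 0.0, 1), np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best, best_err = (j, thr, sign), err
    return best

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost_early_stop(X, y, X_val, y_val, max_rounds=200):
    """AdaBoost with labels in {-1, +1}; the returned ensemble is truncated at
    the round with lowest validation error, i.e. regularization by early stopping."""
    n = len(y)
    w = np.full(n, 1.0 / n)            # observation weights
    ensemble = []                      # list of (alpha, stump)
    F_val = np.zeros(len(y_val))       # running validation score
    best_err, best_m = np.inf, 0
    for m in range(max_rounds):
        stump = fit_stump(X, y, w)
        pred = stump_predict(stump, X)
        err = np.sum(w * (pred != y))  # weights sum to one, so this is the weighted error
        if err >= 0.5 or err == 0.0:   # weak learner no better than chance, or perfect fit
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred) # up-weight misclassified observations
        w /= w.sum()
        ensemble.append((alpha, stump))
        F_val += alpha * stump_predict(stump, X_val)
        val_err = np.mean(np.sign(F_val) != y_val)
        if val_err < best_err:         # remember the best stopping point so far
            best_err, best_m = val_err, m + 1
    return ensemble[:best_m]

def ensemble_predict(ensemble, X):
    """Sign of the weighted vote of the retained stumps."""
    F = np.zeros(len(X))
    for alpha, stump in ensemble:
        F += alpha * stump_predict(stump, X)
    return np.sign(F)
```

Truncating the ensemble at the best validation round limits how far the fit proceeds into the space of weak learners, playing a role analogous to the L1-norm constraint on the coefficient vector in the convex loss formulations discussed by Lugosi and Vayatis and by Zhang.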

References

[1] D. L. Donoho, I. M. Johnstone, J. C. Hoch and A. S. Stern. Maximum entropy and the nearly black object, 1992, Journal of the Royal Statistical Society, Series B.

[2] D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage, 1994, Biometrika.

[3] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian and D. Picard. Wavelet shrinkage: Asymptopia?, 1995, Journal of the Royal Statistical Society, Series B.

[4] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm, 1996, ICML.

[5] J. Friedman, T. Hastie and R. Tibshirani. Additive logistic regression: a statistical view of boosting (with discussion), 2000, Annals of Statistics.

[6] V. N. Vapnik. The Nature of Statistical Learning Theory, 2000, Springer.

[7] J. Friedman. Greedy function approximation: a gradient boosting machine, 2001, Annals of Statistics.

[8] S. Mannor, R. Meir and T. Zhang. The consistency of greedy algorithms for classification, 2002, COLT.

[9] D. L. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization, 2003, Proceedings of the National Academy of Sciences.

[10] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning, 2003.

[11] B. Efron, T. Hastie, I. Johnstone and R. Tibshirani. Least angle regression, 2004, Annals of Statistics.

[12] T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2001, Springer.

[13] Y. Lin. A note on margin-based loss functions in classification, 2004.

[14] S. Rosset, J. Zhu and T. Hastie. Boosting as a regularized path to a maximum margin classifier, 2004, Journal of Machine Learning Research.