The Consistency of Greedy Algorithms for Classification

We consider a class of algorithms for classification, which are based on sequential greedy minimization of a convex upper bound on the 0-1 loss function. A large class of recently popular algorithms falls within the scope of this approach, including many variants of Boosting algorithms. The basic question addressed in this paper relates to the statistical consistency of such approaches. We provide precise conditions which guarantee that sequential greedy procedures are consistent, and establish rates of convergence under the assumption that the Bayes decision boundary belongs to a certain class of smooth functions. The results are established using a form of regularization which constrains the search space at each iteration of the algorithm. A particularly interesting conclusion of our work is that Boosting based on the logistic loss provides faster rates of convergence than Boosting based on the exponential loss used in AdaBoost.
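
To make the setting concrete, the following is a minimal sketch of the kind of sequential greedy procedure the abstract refers to; it is not the paper's exact algorithm. Decision stumps serve as the base hypothesis class, the logistic function supplies the convex upper bound on the 0-1 loss, and a shrinking step size stands in for the regularization that constrains the search space at each iteration. All function names and the step-size schedule are illustrative assumptions.

```python
import numpy as np

def fit_stump(X, y, weights):
    """Pick the decision stump (feature, threshold, sign) with the smallest
    weighted classification error; labels y are assumed to lie in {-1, +1}."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = sign * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = np.sum(weights * (pred != y))
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best

def stump_predict(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] <= thr, 1.0, -1.0)

def greedy_boost(X, y, n_rounds=50, beta=1.0):
    """Sequential greedy minimization of the logistic surrogate
    log(1 + exp(-y * F(x))). Each round fits a stump to the (normalized)
    functional-gradient weights and adds it with a shrinking step beta / t,
    which keeps the combined predictor in a bounded search space -- an
    illustrative choice, not the paper's exact constraint."""
    F = np.zeros(len(y))          # current combined predictor F(x_i)
    ensemble = []
    for t in range(1, n_rounds + 1):
        # -dL/dF for the logistic loss, up to the factor y handled by fit_stump.
        weights = 1.0 / (1.0 + np.exp(y * F))
        weights /= weights.sum()
        stump = fit_stump(X, y, weights)
        step = beta / t
        F += step * stump_predict(stump, X)
        ensemble.append((step, stump))
    return ensemble

def predict(ensemble, X):
    F = sum(step * stump_predict(s, X) for step, s in ensemble)
    return np.sign(F)
```

Replacing the weights with exp(-y * F) would give the exponential (AdaBoost-type) surrogate contrasted with the logistic one in the abstract's final remark.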
