Logistic Regression, AdaBoost and Bregman Distances

We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms are iterative and can be divided into two types based on whether the parameters are updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms that includes both a sequential- and a parallel-update algorithm as special cases, thus showing how the sequential and parallel approaches can themselves be unified. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with the iterative scaling algorithm. We conclude with experimental results on synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
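To make the sequential-versus-parallel distinction concrete, the following is a minimal Python/NumPy sketch of a sequential, one-coordinate-at-a-time update that greedily reduces the exponential loss; this is the sense in which such an update coincides with AdaBoost. It is an illustration under stated assumptions, not the paper's exact algorithm: the weak-hypothesis matrix H, the label vector y, and the number of rounds are illustrative choices, and weak hypotheses are assumed to take values in {-1, +1}.

import numpy as np

def sequential_exponential_loss_boosting(H, y, n_rounds=20):
    # H: (m, n) array of weak-hypothesis outputs h_j(x_i) in {-1, +1}
    # y: (m,) array of labels in {-1, +1}
    m, n = H.shape
    lam = np.zeros(n)                        # combination weights lambda_j
    for _ in range(n_rounds):
        margins = y * (H @ lam)              # y_i * f(x_i) under the current combination
        d = np.exp(-margins)
        d /= d.sum()                         # distribution over training examples
        edges = d @ (y[:, None] * H)         # r_j = sum_i d_i * y_i * h_j(x_i)
        j = int(np.argmax(np.abs(edges)))    # pick the single coordinate with the largest edge
        r = np.clip(edges[j], -1 + 1e-12, 1 - 1e-12)
        lam[j] += 0.5 * np.log((1 + r) / (1 - r))   # closed-form step for +/-1-valued hypotheses
    return lam

A parallel variant would, roughly speaking, update every coordinate on each round with a suitably damped step so that the loss still decreases; the Bregman-distance framework described above is what allows both styles of update to be analyzed with a single auxiliary-function convergence argument.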
