Boosting with structural sparsity

We derive generalizations of AdaBoost and related gradient-based coordinate descent methods that incorporate sparsity-promoting penalties on the norm of the predictor being learned. The result is a family of coordinate descent algorithms that integrate forward feature induction and back-pruning through regularization, and that yield an automatic stopping criterion for feature induction. We study penalties based on the ℓ1, ℓ2, and ℓ∞ norms of the predictor and introduce mixed-norm penalties that build on these base norm penalties. The mixed-norm regularizers promote structural sparsity in parameter space, a useful property in multiclass prediction and related tasks. We report empirical results that demonstrate the power of our approach in building accurate and structurally sparse models.
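
To make the flavor of such updates concrete, the snippet below is a minimal sketch (under our own simplifying assumptions, not the paper's exact derivation) of cyclic proximal coordinate descent on the ℓ1-penalized exponential loss: the soft-thresholding step performs the back-pruning, and a full pass in which no coordinate moves serves as a natural stopping point. The final helper shows the analogous group-level shrinkage for an ℓ1/ℓ2 mixed-norm penalty, which zeroes an entire row of a multiclass weight matrix at once and thereby produces structural sparsity. All function and parameter names (l1_boost, group_soft_threshold, lam, eta) are ours and purely illustrative.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * |z|: shrink toward zero, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def l1_boost(X, y, lam=0.1, eta=0.1, n_passes=200, tol=1e-6):
    """Cyclic proximal coordinate descent on
        L(w) = sum_i exp(-y_i <x_i, w>) + lam * ||w||_1.

    X: (n_samples, n_features) array; y: labels in {-1, +1}.
    Coordinates driven exactly to zero realize back-pruning; a pass with
    no movement is used here as the stopping criterion.
    """
    n, d = X.shape
    w = np.zeros(d)
    margins = np.zeros(n)                      # y_i * <x_i, w>, maintained incrementally
    for _ in range(n_passes):
        max_step = 0.0
        for j in range(d):
            # Gradient of the unpenalized exponential loss w.r.t. w_j.
            grad_j = -np.dot(y * X[:, j], np.exp(-margins))
            # Gradient step on coordinate j, then the l1 proximal (shrinkage) step.
            w_j_new = soft_threshold(w[j] - eta * grad_j, eta * lam)
            step = w_j_new - w[j]
            if step != 0.0:
                margins += y * X[:, j] * step  # keep margins consistent with w
                w[j] = w_j_new
                max_step = max(max_step, abs(step))
        if max_step < tol:                     # nothing moved: stop inducing features
            break
    return w

def group_soft_threshold(row, tau):
    """l1/l2 mixed-norm shrinkage for one feature's row of a multiclass
    weight matrix: the whole row is zeroed when its norm falls below tau,
    which is the source of structural sparsity."""
    norm = np.linalg.norm(row)
    return np.zeros_like(row) if norm <= tau else (1.0 - tau / norm) * row
```

For instance, calling l1_boost(X, y, lam=1.0) on a bag-of-words matrix returns a weight vector with many exact zeros; larger values of lam prune more aggressively, while smaller values admit more features before the stopping test fires.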
