Boosting with structural sparsity

We derive generalizations of AdaBoost and related gradient-based coordinate descent methods that incorporate sparsity-promoting penalties on the norm of the predictor being learned. The result is a family of coordinate descent algorithms that integrate forward feature induction and back-pruning through regularization, and that yield an automatic stopping criterion for feature induction. We study penalties based on the ℓ1, ℓ2, and ℓ∞ norms of the predictor and introduce mixed-norm penalties that build on these base norm penalties. The mixed-norm regularizers promote structural sparsity in parameter space, a useful property in multiclass prediction and related tasks. We report empirical results that demonstrate the power of our approach in building accurate and structurally sparse models.
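
To make the flavor of such updates concrete, the snippet below is a minimal sketch (under our own simplifying assumptions, not the paper's exact derivation) of cyclic proximal coordinate descent on the ℓ1-penalized exponential loss: the soft-thresholding step performs the back-pruning, and a full pass in which no coordinate moves serves as a natural stopping point. The final helper shows the analogous group-level shrinkage for an ℓ1/ℓ2 mixed-norm penalty, which zeroes an entire row of a multiclass weight matrix at once and thereby produces structural sparsity. All function and parameter names (l1_boost, group_soft_threshold, lam, eta) are ours and purely illustrative.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * |z|: shrink toward zero, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def l1_boost(X, y, lam=0.1, eta=0.1, n_passes=200, tol=1e-6):
    """Cyclic proximal coordinate descent on
        L(w) = sum_i exp(-y_i <x_i, w>) + lam * ||w||_1.

    X: (n_samples, n_features) array; y: labels in {-1, +1}.
    Coordinates driven exactly to zero realize back-pruning; a pass with
    no movement is used here as the stopping criterion.
    """
    n, d = X.shape
    w = np.zeros(d)
    margins = np.zeros(n)                      # y_i * <x_i, w>, maintained incrementally
    for _ in range(n_passes):
        max_step = 0.0
        for j in range(d):
            # Gradient of the unpenalized exponential loss w.r.t. w_j.
            grad_j = -np.dot(y * X[:, j], np.exp(-margins))
            # Gradient step on coordinate j, then the l1 proximal (shrinkage) step.
            w_j_new = soft_threshold(w[j] - eta * grad_j, eta * lam)
            step = w_j_new - w[j]
            if step != 0.0:
                margins += y * X[:, j] * step  # keep margins consistent with w
                w[j] = w_j_new
                max_step = max(max_step, abs(step))
        if max_step < tol:                     # nothing moved: stop inducing features
            break
    return w

def group_soft_threshold(row, tau):
    """l1/l2 mixed-norm shrinkage for one feature's row of a multiclass
    weight matrix: the whole row is zeroed when its norm falls below tau,
    which is the source of structural sparsity."""
    norm = np.linalg.norm(row)
    return np.zeros_like(row) if norm <= tau else (1.0 - tau / norm) * row
```

For instance, calling l1_boost(X, y, lam=1.0) on a bag-of-words matrix returns a weight vector with many exact zeros; larger values of lam prune more aggressively, while smaller values admit more features before the stopping test fires.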
