Fast Projections onto ℓ1,q-Norm Balls for Grouped Feature Selection

Joint sparsity is widely acknowledged as a powerful structural cue for feature selection in settings where variables are expected to exhibit "grouped" behavior. Such grouped behavior is commonly modeled by Group-Lasso or Multitask-Lasso-type problems, where feature selection is effected via ℓ1,q mixed norms. Several particular formulations for modeling groupwise sparsity have received substantial attention in the literature, and in some cases efficient algorithms are also available. Surprisingly, for constrained formulations of fundamental importance (e.g., regression with an ℓ1,∞-norm constraint), highly scalable methods seem to be missing. We address this deficiency by presenting a method based on spectral projected gradient (SPG) that can tackle ℓ1,q-constrained convex regression problems. The crucial component of our method is an algorithm for projecting onto ℓ1,q-norm balls. We present several numerical results showing that our methods attain speedups of up to 30× on large ℓ1,∞ multitask lasso problems. The gains on the ℓ1,∞-projection subproblem alone are even more dramatic: we observe speedups of almost three orders of magnitude over the current standard method.
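To make the two ingredients above concrete, the minimal Python sketch below illustrates (i) Euclidean projection onto an ℓ1,2-norm ball, which reduces to projecting the vector of row norms onto an ℓ1 ball and rescaling the rows, and (ii) a bare-bones projected-gradient loop with a Barzilai–Borwein step size for ℓ1,2-constrained least squares. This is an illustration under simplifying assumptions, not the paper's method: the paper targets the harder ℓ1,∞ case, and a full SPG method additionally employs a nonmonotone line search, which is omitted here; all function names are hypothetical.

```python
import numpy as np

def project_l1_ball(c, tau):
    """Project a nonnegative vector c onto {d >= 0 : sum(d) <= tau}
    using the standard sort-and-threshold scheme (O(n log n))."""
    if c.sum() <= tau:
        return c.copy()
    u = np.sort(c)[::-1]                     # sorted descending
    css = np.cumsum(u)
    # largest index k (0-based) with u[k] > (css[k] - tau) / (k + 1)
    k = np.nonzero(u * np.arange(1, c.size + 1) > css - tau)[0][-1]
    theta = (css[k] - tau) / (k + 1.0)
    return np.maximum(c - theta, 0.0)

def project_l12_ball(Y, tau):
    """Project a matrix Y onto {X : sum_i ||row_i(X)||_2 <= tau}.
    The row norms are projected onto the l1 ball, then each row is
    rescaled accordingly."""
    c = np.linalg.norm(Y, axis=1)
    if c.sum() <= tau:
        return Y.copy()
    d = project_l1_ball(c, tau)
    scale = np.divide(d, c, out=np.zeros_like(c), where=c > 0)
    return Y * scale[:, None]

def pg_bb_multitask(A, B, tau, iters=200):
    """Projected gradient with a Barzilai-Borwein step for
        min_X 0.5 * ||A X - B||_F^2  s.t.  ||X||_{1,2} <= tau.
    (Full SPG also uses a nonmonotone line search; omitted here.)"""
    X = np.zeros((A.shape[1], B.shape[1]))
    G = A.T @ (A @ X - B)                    # gradient of the objective
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # safe initial step length
    for _ in range(iters):
        X_new = project_l12_ball(X - step * G, tau)
        G_new = A.T @ (A @ X_new - B)
        s, y = X_new - X, G_new - G
        sy = np.sum(s * y)
        if sy > 0:                           # BB1 step length update
            step = np.sum(s * s) / sy
        X, G = X_new, G_new
    return X
```

The reduction to a row-norm problem is what makes the ℓ1,2 projection cheap; for q = ∞ the projection no longer decouples across rows this way, which is precisely why the ℓ1,∞ case calls for a specialized projection algorithm.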
