Fast projections onto mixed-norm balls with applications

Joint sparsity offers powerful structural cues for feature selection, especially for variables that are expected to exhibit “grouped” behavior. Such behavior is commonly modeled via the group lasso, the multitask lasso, and related methods in which feature selection is enforced through mixed norms. Several mixed-norm based sparse models have received substantial attention, and for some of them efficient algorithms are available. Surprisingly, several constrained sparse models still seem to lack scalable algorithms. We address this deficiency by presenting batch and online (stochastic-gradient) optimization methods, both of which rely on efficient projections onto mixed-norm balls. We illustrate our methods by applying them to the multitask lasso. We conclude by mentioning some open problems.
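To make the central ingredient concrete, here is a minimal NumPy sketch, not taken from the paper, for the special case q = 2: projecting a weight matrix onto an ℓ1,2 mixed-norm ball reduces to an ℓ1-ball projection of the vector of group (row) norms, and that projection can then be plugged into a projected stochastic-gradient loop for a multitask least-squares objective. The function names (project_l1_ball, project_l12_ball, multitask_lasso_sgd) and the toy training loop are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np

def project_l1_ball(v, tau):
    """Euclidean projection of a nonnegative vector v onto {x >= 0 : sum(x) <= tau}."""
    if v.sum() <= tau:
        return v.copy()
    u = np.sort(v)[::-1]                        # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - tau)[0][-1]
    theta = (css[rho] - tau) / (rho + 1.0)      # soft-threshold level
    return np.maximum(v - theta, 0.0)

def project_l12_ball(W, tau):
    """Project W onto the mixed-norm ball {W : sum_g ||W[g, :]||_2 <= tau}.

    Each row of W is treated as one group; the projection reduces to an
    l1-ball projection of the vector of row norms, followed by rescaling
    each row to its new norm.
    """
    norms = np.linalg.norm(W, axis=1)
    new_norms = project_l1_ball(norms, tau)
    scale = np.divide(new_norms, norms, out=np.zeros_like(norms), where=norms > 0)
    return W * scale[:, None]

def multitask_lasso_sgd(X, Y, tau, step=1e-2, epochs=20, seed=0):
    """Projected stochastic-gradient sketch for
    min_W (1/2) sum_i ||x_i^T W - y_i||^2  s.t.  sum_g ||W[g, :]||_2 <= tau."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = Y.shape[1]
    W = np.zeros((d, k))
    for _ in range(epochs):
        for i in rng.permutation(n):
            xi, yi = X[i], Y[i]
            grad = np.outer(xi, xi @ W - yi)    # stochastic gradient of the squared loss
            W = project_l12_ball(W - step * grad, tau)
    return W
```

For general ℓ1,q balls, in particular q = ∞ or arbitrary q, the projection is less straightforward, and making it fast is precisely the algorithmic question the abstract refers to; the q = 2 case above is included only to fix ideas.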
