A scalable trust-region algorithm with application to mixed-norm regression

We present a new algorithm for minimizing a convex loss function subject to regularization. Our framework applies to numerous problems in machine learning and statistics; notably, for sparsity-promoting regularizers such as the l1 or l1,∞ norms, it enables efficient computation of sparse solutions. Our approach builds on the trust-region framework for nonsmooth objectives, which lets us draw on known results for the convergence analysis. We avoid the computational overhead of the Hessian approximations conventionally used by trust-region methods by instead using a simple separable quadratic approximation; this approximation also makes the trust-region subproblem solvable via proximity operators, so nonsmooth regularizers are handled efficiently. We illustrate the versatility of the resulting algorithm by specializing it to three mixed-norm regression problems: group lasso [36], group logistic regression [21], and multi-task lasso [19]. Experiments on both synthetic and real-world large-scale data show our method to be competitive, robust, and scalable.
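To make the abstract's description concrete, here is a minimal sketch of the kind of prox-based trust-region step it describes: a separable quadratic model of the smooth loss is minimized by a single proximity-operator evaluation, and the step is accepted or rejected by comparing actual to model-predicted decrease, with the step-size parameter playing the role of a trust-region radius. This is an illustrative reconstruction under those assumptions, not the paper's exact algorithm; all names (trust_region_prox_step, soft_threshold, the eta/shrink/grow parameters) are our own.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximity operator of tau * ||.||_1 (soft thresholding),
    the prox map for l1-regularized problems such as the lasso."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def trust_region_prox_step(x, f, r, grad_f, prox_r, alpha,
                           eta=0.1, shrink=0.5, grow=2.0, max_alpha=1e3):
    """One step of a prox-based trust-region scheme for min_x f(x) + r(x).

    Builds the separable quadratic model
        m(y) = f(x) + grad_f(x)'(y - x) + ||y - x||^2 / (2 * alpha) + r(y),
    whose minimizer is a single prox evaluation, then accepts or rejects
    the step via the ratio of actual to predicted decrease, growing or
    shrinking alpha like a trust-region radius."""
    g = grad_f(x)
    y = prox_r(x - alpha * g, alpha)          # exact minimizer of the model m
    d = y - x
    predicted = r(x) - (g @ d + d @ d / (2.0 * alpha) + r(y))  # m(x) - m(y) >= 0
    actual = (f(x) + r(x)) - (f(y) + r(y))    # true decrease in the objective
    rho = actual / max(predicted, 1e-12)      # model/objective agreement
    if rho >= eta:                            # sufficient decrease: accept step
        return y, min(alpha * grow, max_alpha)
    return x, alpha * shrink                  # reject: keep x, shrink the "radius"

# Tiny lasso example: f(x) = 0.5 * ||A x - b||^2, r(x) = lam * ||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
lam = 0.5
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)
r = lambda x: lam * np.sum(np.abs(x))
prox_r = lambda v, a: soft_threshold(v, a * lam)

x, alpha = np.zeros(50), 1.0
for _ in range(300):
    x, alpha = trust_region_prox_step(x, f, r, grad_f, prox_r, alpha)
print("nonzeros:", int(np.count_nonzero(x)))  # sparse solution, as promised by l1
```

For the mixed-norm problems treated in the paper, soft_threshold would be swapped for the corresponding group proximity operator (e.g., row-wise l2 shrinkage for group lasso and multi-task lasso); the accept/reject logic is unchanged.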

[1] Sergey Bakin et al. Adaptive regression and model selection in data mining problems, 1999.

[2] Guo-Xun Yuan. A Comparison of Optimization Methods for Large-scale L1-regularized Linear Classification, 2010.

[3] Nicholas I. M. Gould et al. Trust Region Methods, 2000, MOS-SIAM Series on Optimization.

[4] Han Liu et al. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery, 2009, ICML '09.

[5] D. Varberg et al. Another Proof that Convex Functions are Locally Lipschitz, 1974.

[6] Jean-Pierre Aubin. L'analyse non linéaire et ses motivations économiques [Nonlinear analysis and its economic motivations], 1984.

[7] Yoram Singer et al. Efficient Online and Batch Learning Using Forward Backward Splitting, 2009, J. Mach. Learn. Res.

[8] Xiaogang Wang et al. Boosted multi-task learning for face verification with applications to web image and video search, 2009, CVPR.

[9] I. Daubechies et al. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, 2003, arXiv:math/0307152.

[10] Mark W. Schmidt et al. Optimizing Costly Functions with Simple Constraints: A Limited-Memory Projected Quasi-Newton Algorithm, 2009, AISTATS.

[11] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[12] Patrick L. Combettes et al. Proximal Splitting Methods in Signal Processing, 2009, Fixed-Point Algorithms for Inverse Problems in Science and Engineering.

[13] Chih-Jen Lin et al. Trust Region Newton Method for Logistic Regression, 2008, J. Mach. Learn. Res.

[14] Yoram Singer et al. Efficient projections onto the l1-ball for learning in high dimensions, 2008, ICML '08.

[15] Stephen P. Boyd et al. An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression, 2007, J. Mach. Learn. Res.

[16] Stephen P. Boyd et al. An Interior-Point Method for Large-Scale l1-Regularized Least Squares, 2007, IEEE Journal of Selected Topics in Signal Processing.

[17] Stephen J. Wright et al. Simultaneous Variable Selection, 2005, Technometrics.

[18] P. Tseng et al. Block-Coordinate Gradient Descent Method for Linearly Constrained Nonsmooth Separable Optimization, 2009.

[19] Jian Huang et al. Supervised group Lasso with applications to microarray data, 2007, BMC Bioinformatics.

[20] Paulo Cortez et al. Modeling wine preferences by data mining from physicochemical properties, 2009, Decis. Support Syst.

[21] Walter Zucchini et al. Model Selection, 2011, International Encyclopedia of Statistical Science.

[22] Patrick L. Combettes et al. Signal Recovery by Proximal Forward-Backward Splitting, 2005, Multiscale Model. Simul.

[23] N. Meinshausen et al. Lasso-type recovery of sparse representations for high-dimensional data, 2008, arXiv:0806.0145.

[24] R. Tibshirani et al. A note on the group lasso and a sparse group lasso, 2010, arXiv:1001.0736.

[25] Michael I. Jordan et al. Multi-task feature selection, 2006.

[26] Mark W. Schmidt et al. Group sparsity via linear-time projection, 2008.

[27] Ambuj Tewari et al. Stochastic methods for l1 regularized loss minimization, 2009, ICML '09.

[28] H. Zou et al. Regularization and variable selection via the elastic net, 2005.

[29] C. Raghavendra Rao et al. On model selection, 2001.

[30] P. Zhao et al. The composite absolute penalties family for grouped and hierarchical variable selection, 2009, arXiv:0909.0411.

[31] J. Borwein et al. Two-Point Step Size Gradient Methods, 1988.

[32] M. Yuan et al. Model selection and estimation in regression with grouped variables, 2006.

[33] Robert D. Nowak et al. An EM algorithm for wavelet-based image restoration, 2003, IEEE Trans. Image Process.

[34] Francis R. Bach et al. Consistency of the group Lasso and multiple kernel learning, 2007, J. Mach. Learn. Res.

[35] Peng Zhao et al. On Model Selection Consistency of Lasso, 2006, J. Mach. Learn. Res.

[36] Simon Haykin et al. Gradient-Based Learning Applied to Document Recognition, 2001.

[37] Yoshua Bengio et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[38] Jorge Nocedal and Stephen J. Wright. Numerical Optimization, 2006, Springer.

[39] Y. Nesterov. Gradient methods for minimizing composite objective function, 2007.

[40] Dimitri P. Bertsekas. Nonlinear Programming, 1995.

[41] Stephen J. Wright et al. Sparse Reconstruction by Separable Approximation, 2008, IEEE Transactions on Signal Processing.

[42] P. Bühlmann et al. The group lasso for logistic regression, 2008.