Optimization with Sparsity-Inducing Penalties

Sparse estimation methods aim to use or obtain parsimonious representations of data or models. They were first dedicated to linear variable selection, but numerous extensions have since emerged, such as structured sparsity and kernel selection. Many of the related estimation problems can be cast as convex optimization problems by regularizing the empirical risk with appropriate nonsmooth norms. The goal of this monograph is to present, from a general perspective, optimization tools and techniques dedicated to such sparsity-inducing penalties. We cover proximal methods, block-coordinate descent, reweighted ℓ2-penalized techniques, working-set and homotopy methods, as well as non-convex formulations and extensions, and we provide an extensive set of experiments comparing these algorithms from a computational point of view.
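As a rough illustration of what regularizing the empirical risk with a nonsmooth norm can look like in practice, the sketch below applies a basic proximal gradient iteration to an ℓ1-penalized least-squares problem. It is not taken from the monograph: the function names `soft_threshold` and `ista`, the 1/(2n) scaling of the loss, the fixed step size, and the toy data are all assumptions made for this example.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(X, y, lam, step, n_iter=500):
    """Proximal gradient sketch for
    min_w 1/(2n) * ||y - Xw||^2 + lam * ||w||_1  (assumed scaling).
    X is a list of n rows, each a list of p floats."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        # Residual r = Xw - y
        r = [sum(X[i][j] * w[j] for j in range(p)) - y[i] for i in range(n)]
        # Gradient of the smooth least-squares term: X^T r / n
        grad = [sum(X[i][j] * r[i] for i in range(n)) / n for j in range(p)]
        # Gradient step on the smooth term, then the prox of the l1 penalty
        w = [soft_threshold(w[j] - step * grad[j], step * lam) for j in range(p)]
    return w


# Toy orthogonal design (hypothetical data): the small coefficient is set to zero.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.1]
w_hat = ista(X, y, lam=0.5, step=1.0)
```

On this toy problem the second coefficient is set exactly to zero while the first is shrunk toward zero, which is the variable-selection effect that sparsity-inducing penalties are designed to produce. The step size here is chosen by hand for the toy design; a general implementation would need a step size tied to the smoothness of the loss or a line search.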
