LassoNet: A Neural Network with Feature Sparsity

Much work has been done recently to make neural networks more interpretable, and one natural approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or $\ell_1$-regularized) regression assigns zero weights to the most irrelevant or redundant features and is widely used in data science. However, the Lasso applies only to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach enforces a hierarchy: specifically, a feature can participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural networks, our method uses a modified objective function with constraints, and so integrates feature selection directly with parameter learning. As a result, it delivers an entire regularization path of solutions spanning a range of feature sparsity. In systematic experiments, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent, generalizes directly to deep networks, and can be implemented by adding just a few lines of code to a standard neural network.

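To make the mechanics concrete, below is a minimal sketch of the two ingredients the abstract describes: a residual architecture with a linear skip path, and a projected proximal update applied after each gradient step. This assumes a PyTorch-style implementation; the names `LassoNetSketch` and `prox_step`, the hyperparameters `lam` and `M`, and the clipping-based proximal update are illustrative assumptions, not the paper's exact Hier-Prox operator.

```python
import torch
import torch.nn as nn

class LassoNetSketch(nn.Module):
    """Residual architecture: a linear skip path (weights theta) plus a small
    feed-forward network that shares the same inputs."""

    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.skip = nn.Linear(d_in, d_out, bias=False)  # linear path, weights theta
        self.layer1 = nn.Linear(d_in, d_hidden)         # first hidden layer, weights W1
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(d_hidden, d_out))

    def forward(self, x):
        return self.skip(x) + self.head(self.layer1(x))

def prox_step(model, lam, M, lr):
    """Simplified proximal/projection update applied after each gradient step.
    It soft-thresholds the skip weights theta, then clips each column of W1 so
    that ||W1[:, j]||_inf <= M * |theta_j|; a feature whose skip weight is zero
    is thereby removed from the whole network.  (The paper uses an exact
    Hier-Prox operator; this clipping variant only illustrates the hierarchy.)"""
    with torch.no_grad():
        theta = model.skip.weight             # shape (d_out, d_in)
        W1 = model.layer1.weight              # shape (d_hidden, d_in)
        # soft-threshold the linear (skip) weights, as in the Lasso
        theta.copy_(theta.sign() * (theta.abs() - lr * lam).clamp(min=0.0))
        # per-feature bound from the hierarchy constraint
        bound = M * theta.abs().amax(dim=0)   # shape (d_in,)
        W1.copy_(torch.maximum(torch.minimum(W1, bound), -bound))

# Usage sketch: a standard training step followed by the proximal update.
model = LassoNetSketch(d_in=100, d_hidden=32, d_out=1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(256, 100), torch.randn(256, 1)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    prox_step(model, lam=1e-2, M=10.0, lr=1e-2)
selected = (model.skip.weight.abs().sum(dim=0) > 0)  # mask of surviving features
```

Because a zero skip weight forces the corresponding first-layer column to zero, sweeping `lam` from large to small values yields models with progressively more active features, i.e., the regularization path described above.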