LassoNet: A Neural Network with Feature Sparsity

Much work has been done recently to make neural networks more interpretable, and one natural approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or $\ell_1$-regularized) regression assigns zero weights to the most irrelevant or redundant features and is widely used in data science. However, the Lasso applies only to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach enforces a hierarchy: specifically, a feature can participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural networks, our method uses a modified objective function with constraints, and so integrates feature selection directly with parameter learning. As a result, it delivers an entire regularization path of solutions spanning a range of feature sparsity. In systematic experiments, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent, generalizes directly to deep networks, and can be implemented by adding just a few lines of code to a standard neural network.

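To make the mechanics concrete, below is a minimal sketch of the two ingredients the abstract describes: a residual architecture with a linear skip path, and a projected proximal update applied after each gradient step. This assumes a PyTorch-style implementation; the names `LassoNetSketch` and `prox_step`, the hyperparameters `lam` and `M`, and the clipping-based proximal update are illustrative assumptions, not the paper's exact Hier-Prox operator.

```python
import torch
import torch.nn as nn

class LassoNetSketch(nn.Module):
    """Residual architecture: a linear skip path (weights theta) plus a small
    feed-forward network that shares the same inputs."""

    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.skip = nn.Linear(d_in, d_out, bias=False)  # linear path, weights theta
        self.layer1 = nn.Linear(d_in, d_hidden)         # first hidden layer, weights W1
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(d_hidden, d_out))

    def forward(self, x):
        return self.skip(x) + self.head(self.layer1(x))

def prox_step(model, lam, M, lr):
    """Simplified proximal/projection update applied after each gradient step.
    It soft-thresholds the skip weights theta, then clips each column of W1 so
    that ||W1[:, j]||_inf <= M * |theta_j|; a feature whose skip weight is zero
    is thereby removed from the whole network.  (The paper uses an exact
    Hier-Prox operator; this clipping variant only illustrates the hierarchy.)"""
    with torch.no_grad():
        theta = model.skip.weight             # shape (d_out, d_in)
        W1 = model.layer1.weight              # shape (d_hidden, d_in)
        # soft-threshold the linear (skip) weights, as in the Lasso
        theta.copy_(theta.sign() * (theta.abs() - lr * lam).clamp(min=0.0))
        # per-feature bound from the hierarchy constraint
        bound = M * theta.abs().amax(dim=0)   # shape (d_in,)
        W1.copy_(torch.maximum(torch.minimum(W1, bound), -bound))

# Usage sketch: a standard training step followed by the proximal update.
model = LassoNetSketch(d_in=100, d_hidden=32, d_out=1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(256, 100), torch.randn(256, 1)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    prox_step(model, lam=1e-2, M=10.0, lr=1e-2)
selected = (model.skip.weight.abs().sum(dim=0) > 0)  # mask of surviving features
```

Because a zero skip weight forces the corresponding first-layer column to zero, sweeping `lam` from large to small values yields models with progressively more active features, i.e., the regularization path described above.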