LassoNet: Neural Networks with Feature Sparsity

Much work has been done recently to make neural networks more interpretable, and one natural approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or $\ell_1$-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However, the Lasso applies only to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach enforces a hierarchy: specifically, a feature can participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection directly with parameter learning. As a result, it delivers an entire regularization path of solutions spanning a range of feature sparsity. In systematic experiments, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.
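The sketch below illustrates the idea in PyTorch under stated assumptions: it is not the paper's exact implementation or its hierarchical proximal operator, but a simplified soft-threshold-and-project step that enforces the same hierarchy constraint (a feature's first-layer weights are bounded by a multiple of its linear skip-connection weight, so zeroing the linear weight removes the feature from the network). The names `LassoNetSketch`, `hierarchy_prox_step`, and the hyperparameters `lam` and `M` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LassoNetSketch(nn.Module):
    """Illustrative sketch: a one-hidden-layer network with a linear
    skip connection theta, so the prediction is theta^T x + g_W(x)."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.skip = nn.Linear(d_in, 1, bias=False)   # linear part (theta)
        self.hidden = nn.Linear(d_in, d_hidden)      # first nonlinear layer (W1)
        self.out = nn.Linear(d_hidden, 1)

    def forward(self, x):
        return self.skip(x) + self.out(torch.relu(self.hidden(x)))

def hierarchy_prox_step(model, lam, M, lr):
    """Simplified proximal/projection step (an assumption, not the paper's
    exact operator): soft-threshold the skip weights theta, then clip each
    column of the first-layer weights so that ||W1[:, j]||_inf <= M * |theta_j|.
    A feature whose theta_j is shrunk to zero is thereby excluded from the
    hidden layer as well."""
    with torch.no_grad():
        theta = model.skip.weight        # shape (1, d_in)
        W1 = model.hidden.weight         # shape (d_hidden, d_in)
        # soft-threshold theta: proximal operator of lam * ||theta||_1
        theta.copy_(torch.sign(theta) * torch.clamp(theta.abs() - lr * lam, min=0.0))
        # project first-layer weights onto the hierarchy constraint
        bound = M * theta.abs()          # per-feature bound, shape (1, d_in)
        W1.copy_(torch.sign(W1) * torch.minimum(W1.abs(), bound.expand_as(W1)))

def train_step(model, x, y, optimizer, lam=1e-3, M=10.0, lr=1e-3):
    """Hypothetical training step: dense gradient update, then the prox step."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    hierarchy_prox_step(model, lam, M, lr)
    return loss.item()
```

Sweeping `lam` from large (all linear weights zeroed, no features used) to small traces out the regularization path of increasingly dense models that the abstract describes.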
