A neural network with feature sparsity

We propose a neural network model with a separate linear (residual) term that explicitly bounds the input-layer weights for each feature by the linear weight for that feature. The model can be seen as a modification of residual neural networks that produces a path of feature-sparse models, that is, models that use only a subset of the input features. This is analogous to the solution path of the usual Lasso ($\ell_1$-regularized) linear regression. We call the proposed procedure "LassoNet" and develop a projected proximal gradient algorithm for its optimization. The approach can sometimes achieve test error as low as or lower than that of a standard neural network, and its feature selection yields more interpretable solutions. We illustrate the method on both simulated and real data examples, and show that it is often able to achieve competitive performance with a much smaller number of input features.
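As a rough illustration of the idea, the sketch below (in PyTorch) pairs a linear skip term with a one-hidden-layer network and, after each gradient step, soft-thresholds the skip weights and clips each feature's input-layer weights to at most $M \cdot |\theta_j|$. All names here (`SparseResidualNet`, `hier_project`, `M`, `lam`) are our own, and the simple clamp is a simplified stand-in for the paper's exact hierarchical proximal operator, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseResidualNet(nn.Module):
    """Linear (residual) skip term theta plus a small feed-forward network."""

    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.theta = nn.Linear(d_in, 1, bias=False)   # linear skip term
        self.layer1 = nn.Linear(d_in, d_hidden)       # input layer W^(1)
        self.layer2 = nn.Linear(d_hidden, 1)

    def forward(self, x):
        return self.theta(x) + self.layer2(F.relu(self.layer1(x)))

def hier_project(model, lam, lr, M=10.0):
    """Simplified proximal/projection step (our stand-in, not the paper's
    hier-prox): soft-threshold the skip weights theta, then clamp each
    feature's input-layer weights to the band [-M|theta_j|, M|theta_j|]."""
    with torch.no_grad():
        t = model.theta.weight                                  # shape (1, d_in)
        t.copy_(torch.sign(t) * torch.clamp(t.abs() - lr * lam, min=0.0))
        bound = M * t.abs()                                     # per-feature bound
        W = model.layer1.weight                                 # shape (d_hidden, d_in)
        W.copy_(torch.clamp(W, min=-bound, max=bound))          # broadcasts over rows
```

A minimal training loop then alternates a gradient step with this projection; sweeping `lam` from large to small traces out the feature-sparse path described above, since once $\theta_j$ is thresholded to zero the clamp zeroes feature $j$'s input-layer weights as well, removing it from the network entirely:

```python
model = SparseResidualNet(d_in=20, d_hidden=32)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(128, 20), torch.randn(128, 1)  # toy data
for _ in range(100):
    opt.zero_grad()
    F.mse_loss(model(x), y).backward()
    opt.step()
    hier_project(model, lam=1e-2, lr=1e-2)        # proximal/projection step
```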
