Post model‐fitting exploration via a “Next‐Door” analysis

We propose a simple method for evaluating the model chosen by an adaptive regression procedure, our main focus being the lasso. The procedure deletes each chosen predictor in turn and refits the lasso, yielding a set of models that are "close" to the chosen one, which we refer to as the "base model". If deleting a predictor leads to a significant deterioration in predictive power, that predictor is called indispensable; otherwise, the nearby model is called acceptable and can serve as a good alternative to the base model. This provides both an assessment of the predictive contribution of each variable and a set of alternative models that may be used in place of the chosen model. In this paper we focus on the cross-validation (CV) setting: a model's predictive power is measured by its CV error, and the base model is tuned by cross-validation. We propose a method for comparing the error rate of the base model with those of the nearby models, along with a p-value for testing whether a predictor is dispensable. We also propose a new quantity, the model score, which plays a role similar to that of the p-value in controlling the type I error. Our proposal is closely related to the LOCO (leave-one-covariate-out) methods of Rinaldo et al. (2016) and, less closely, to stability selection (Meinshausen and Bühlmann, 2010). We call the procedure "Next-Door analysis" because it examines models close to the base model. It can be applied to Gaussian regression data, generalized linear models, and other supervised learning problems with $\ell_1$ penalization, and it could also be applied to best-subset and stepwise regression procedures. We have implemented it as an R library to accompany the well-known {\tt glmnet} package.
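To make the core loop concrete, here is a minimal illustrative sketch in R that uses only the standard {\tt glmnet} package (its {\tt cv.glmnet} function with the {\tt exclude} and {\tt foldid} arguments). It is not the API of the accompanying library, and it omits the bias corrections, p-values, and model scores developed in the paper; the simulated data and variable names are purely for illustration.

# Illustrative sketch of the Next-Door idea (not the accompanying package):
# fit the base lasso by cross-validation, then drop each selected predictor,
# refit, and record the change in CV error.
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)   # toy data: only predictors 1 and 2 matter

foldid   <- sample(rep(1:10, length.out = n))     # fix folds so all fits are comparable
base_cv  <- cv.glmnet(x, y, foldid = foldid)      # base model tuned by cross-validation
base_err <- min(base_cv$cvm)                      # CV error at lambda.min
selected <- which(as.numeric(coef(base_cv, s = "lambda.min"))[-1] != 0)

# Nearby models: exclude each selected predictor in turn and refit the lasso
nearby_err <- sapply(selected, function(j) {
  min(cv.glmnet(x, y, exclude = j, foldid = foldid)$cvm)
})

# Increase in CV error when a predictor is removed; a large increase suggests the
# predictor is indispensable, a small one that the nearby model is acceptable
data.frame(predictor = selected, delta_cv_error = nearby_err - base_err)

A naive comparison of minimized CV errors like this one is optimistically biased and sensitive to the fold assignment; the procedure described in the paper accounts for this when forming p-values and model scores.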

References

[1] D. Freedman et al. Some Asymptotic Theory for the Bootstrap, 1981.

[2] Dennis L. Sun et al. Optimal Inference After Model Selection, 2014, arXiv:1410.2597.

[3] Dennis L. Sun et al. Exact post-selection inference, with application to the lasso, 2013, arXiv:1311.6238.

[4] Yi Yu et al. Confidence intervals for high-dimensional Cox models, 2018, Statistica Sinica.

[5] Xiaoying Tian Harris. Prediction error after model search, 2016, The Annals of Statistics.

[6] Wenjiang J. Fu et al. Asymptotics for lasso-type estimators, 2000.

[7] Leying Guan. Test Error Estimation after Model Selection Using Validation Error, 2018.

[8] N. Meinshausen et al. Stability selection, 2008, arXiv:0809.2932.

[9] Alessandro Rinaldo et al. Distribution-Free Predictive Inference for Regression, 2016, Journal of the American Statistical Association.

[10] Cun-Hui Zhang et al. Confidence intervals for low dimensional parameters in high dimensional linear models, 2011, arXiv:1110.2563.

[11] Bryan Chan et al. Human Immunodeficiency Virus Reverse Transcriptase and Protease Sequence Database, 1999, Nucleic Acids Research.

[12] A. Buja et al. Valid post-selection inference, 2013, arXiv:1306.1059.

[13] R. Tibshirani et al. Uniform asymptotic inference and the bootstrap after model selection, 2015, The Annals of Statistics.

[14] Jing Lei et al. Cross-Validation With Confidence, 2017, Journal of the American Statistical Association.

[15] Tso-Jung Yen et al. Discussion on "Stability Selection" by Meinshausen and Bühlmann, 2010.

[16] Martin J. Wainwright et al. Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using $\ell_1$-Constrained Quadratic Programming (Lasso), 2009, IEEE Transactions on Information Theory.

[17] Leo Breiman et al. Random Forests, 2001, Machine Learning.

[18] Yinchu Zhu et al. Breaking the curse of dimensionality in regression, 2017, arXiv.

[19] P. Bickel et al. Simultaneous analysis of Lasso and Dantzig selector, 2008, arXiv:0801.1095.

[20] R. Tibshirani et al. A bias correction for the minimum error rate in cross-validation, 2009, arXiv:0908.2904.

[21] R. Tibshirani et al. Molecular assessment of surgical-resection margins of gastric cancer by mass-spectrometric imaging, 2014, Proceedings of the National Academy of Sciences.