More powerful post-selection inference, with application to the Lasso

Investigators often use the data to generate interesting hypotheses and then perform inference for the generated hypotheses. P-values and confidence intervals must account for this exploratory data analysis. A fruitful method for doing so is to condition any inferences on the components of the data used to generate the hypotheses, thus preventing information in those components from being used again. Some currently popular methods "over-condition", leading to wide intervals. We show how to perform the minimal conditioning in a computationally tractable way. In high dimensions, even this minimal conditioning can lead to intervals that are too wide to be useful, suggesting that up to now the cost of hypothesis generation has been underestimated. We show how to generate hypotheses in a strategic manner that sharply reduces the cost of data exploration and results in useful confidence intervals. Our discussion focuses on the problem of post-selection inference after fitting a lasso regression model, but we also outline its extension to a much more general setting.
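The core idea of conditioning on the selection event can be illustrated in a deliberately simple toy setting (this is an illustrative sketch, not the paper's lasso procedure): we observe y ~ N(mu, 1) and decide to test H0: mu = 0 only when |y| exceeds a threshold c. A valid post-selection p-value then uses the distribution of y truncated to the selection event {|y| > c}. The function name `selective_p_value` is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def selective_p_value(y, c):
    """Toy selective p-value for H0: mu = 0, conditional on selection.

    We only formed the hypothesis because |y| > c, so instead of the
    naive tail probability P(|Y| > |y|) we report the conditional one,
    P(|Y| > |y|  given  |Y| > c), with Y ~ N(0, 1) under H0.
    """
    assert abs(y) > c, "the hypothesis is only generated when |y| > c"
    # Two-sided standard-normal tails; norm.sf is the survival function.
    tail_y = 2.0 * norm.sf(abs(y))   # P(|Y| > |y|)
    tail_c = 2.0 * norm.sf(c)        # P(|Y| > c), the selection probability
    return tail_y / tail_c

# A selected observation that looks significant naively:
y_obs, c = 2.5, 2.0
p_naive = 2.0 * norm.sf(abs(y_obs))        # ignores selection, too small
p_selective = selective_p_value(y_obs, c)  # accounts for selection
```

Conditioning discards exactly the information used to generate the hypothesis: the selective p-value is uniform given selection, so it controls the selective type-I error, while the naive p-value overstates significance. The paper's setting replaces this scalar truncation with the (polyhedral) selection event of the lasso, where the same principle applies.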
