Statistical learning and selective inference

Significance

Most statistical analyses involve some kind of "selection": searching through the data for the strongest associations. Measuring the strength of the resulting associations is challenging, because one must account for the effects of the selection. New tools in selective inference address this task, and we illustrate their use in forward stepwise regression, the lasso, and principal components analysis.

Abstract

We describe the problem of "selective inference," which addresses the following challenge: having mined a set of data to find potential associations, how do we properly assess the strength of these associations? The fact that we have "cherry-picked", that is, searched for the strongest associations, means that we must set a higher bar for declaring significant the associations that we see. This challenge grows more important in the era of big data and complex statistical modeling: the cherry tree (the dataset) can be very large, and the tools for cherry picking (statistical learning methods) are now very sophisticated. We describe some recent developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components analysis.
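
To make the cherry-picking problem concrete, here is a minimal simulation sketch (our illustration, not code from the paper; the sample sizes and the NumPy/SciPy setup are assumptions chosen for demonstration). Under a global null with no true associations, the naive p-value for the single strongest predictor, tested as if it had been chosen in advance, rejects far more often than its nominal level, which is exactly why selection demands a higher bar.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, n_sim = 100, 50, 1000
naive_pvals = []

for _ in range(n_sim):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)  # global null: y is independent of every column of X
    # "Selection": pick the predictor most correlated (in absolute value) with y
    scores = np.abs(X.T @ y) / np.linalg.norm(X, axis=0)
    j = int(np.argmax(scores))
    # Naive inference: test predictor j as if it had been fixed before seeing the data
    result = stats.linregress(X[:, j], y)
    naive_pvals.append(result.pvalue)

# A valid level-0.05 test would reject about 5% of the time; selection inflates this badly
print("naive rejection rate at 0.05:", np.mean(np.array(naive_pvals) < 0.05))
```

Selective inference methods of the kind surveyed in the paper replace the naive p-value with one computed from the distribution of the test statistic conditional on the selection event, restoring validity after the search.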
