Submitted to the Annals of Applied Statistics

FALSE VARIABLE SELECTION RATES IN REGRESSION

There has been recent interest in extending the ideas of False Discovery Rates (FDR) to variable selection in regression settings. Traditionally, the FDR in these settings has been defined in terms of the coefficients of the full regression model: a selected variable counts as a false discovery when its coefficient in the full model is zero. Recent papers have struggled to control this quantity when the predictors are correlated. This paper shows that the full-model definition of FDR behaves in unintuitive and potentially undesirable ways in the presence of correlated predictors. We propose a new false selection error criterion, the False Variable Rate (FVR), that avoids these problems and behaves more intuitively. We discuss the behavior of this criterion, compare it with the traditional FDR, and present guidelines for determining which is appropriate in a particular setting. Finally, we present a simple estimation procedure for FVR in stepwise variable selection, analyze the performance of this estimator, and draw connections to recent estimators in the literature.
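As a rough illustration of the full-model definition discussed above, the sketch below computes the false discovery proportion for a single selected set (the FDR is the expectation of this proportion over repeated samples). The function name `full_model_fdp`, the coefficient vector `beta`, and the three-variable setup are hypothetical and chosen only for illustration; the sketch does not implement the FVR, whose definition is not given in this abstract.

```python
def full_model_fdp(selected, beta):
    """False discovery proportion under the full-model definition.

    selected: indices of the variables chosen by some selection procedure.
    beta: true coefficients of the full regression model (hypothetical values).
    A selected variable is counted as false when its full-model coefficient is zero.
    """
    if not selected:
        # Convention: no selections means no false discoveries.
        return 0.0
    false_selections = sum(1 for j in selected if beta[j] == 0)
    return false_selections / len(selected)


# Hypothetical setup: variable 0 drives the response. Suppose variable 1 is
# strongly correlated with variable 0 but has a zero coefficient in the full
# model, and variable 2 is pure noise.
beta = [2.0, 0.0, 0.0]

print(full_model_fdp([0], beta))        # only the true driver selected
print(full_model_fdp([1], beta))        # only the correlated proxy selected
print(full_model_fdp([0, 1, 2], beta))  # everything selected
```

Under this definition, selecting only the correlated proxy (variable 1) is scored as entirely false even though, in the assumed setup, that variable carries real predictive information about the response. The proxy and the pure-noise variable are scored identically, which is the kind of behavior the abstract describes as unintuitive.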
