A General Framework for Estimation and Inference From Clusters of Features

ABSTRACT Applied statistical problems often come with prespecified groupings of predictors. It is natural to test for the presence of simultaneous group-wide signal, either for groups in isolation or for multiple groups together. Current tests for the presence of such signals include the classical F-test and a t-test on unsupervised group prototypes (either group centroids or first principal components). In this article, we propose test statistics that aim for power improvements over these classical approaches. In particular, we first create group prototypes with reference to the response, and then test with likelihood ratio statistics that incorporate only these prototypes. We propose a model, called the “prototype model,” which naturally describes this two-step procedure. Furthermore, we introduce an inferential schema detailing the unique considerations for different combinations of prototype formation and univariate/multivariate testing models. The prototype model also suggests new applications to estimation and prediction. Prototype formation often relies on variable selection, which invalidates classical Gaussian test theory. We use recent advances in selective inference to account for selection in the prototyping step and retain test validity. Simulation experiments suggest that our testing procedure enjoys more power than classical approaches. Supplementary materials for this article are available online.
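For orientation, the two classical baselines named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype model or the selective-inference procedure proposed in the article: the function names `f_statistic` and `centroid_t_statistic`, the simulated data, and the intercept-only null model are all hypothetical choices made for this example. Only the test statistics are computed; p-values would come from the usual F and t reference distributions.

```python
import numpy as np


def f_statistic(X, y):
    """Classical F statistic for H0: every coefficient in the group is zero.

    Assumes an intercept-only null model (an assumption of this sketch).
    """
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss_full = np.sum((y - design @ beta) ** 2)
    rss_null = np.sum((y - y.mean()) ** 2)
    return ((rss_null - rss_full) / p) / (rss_full / (n - p - 1))


def centroid_t_statistic(X, y):
    """t statistic for the slope when y is regressed on the group centroid.

    The centroid (row-wise mean of the group's columns) is an unsupervised
    prototype: it is formed without looking at the response.
    """
    n = X.shape[0]
    z = X.mean(axis=1)
    zc = z - z.mean()
    yc = y - y.mean()
    slope = (zc @ yc) / (zc @ zc)
    resid = yc - slope * zc
    sigma2 = (resid @ resid) / (n - 2)
    return slope / np.sqrt(sigma2 / (zc @ zc))


# Simulated group of five predictors sharing a common signal (hypothetical data).
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ np.full(p, 0.5) + rng.standard_normal(n)

F = f_statistic(X, y)
t = centroid_t_statistic(X, y)
print(f"F statistic on the full group: {F:.3f}")
print(f"t statistic on the group centroid: {t:.3f}")
```

The F-test spends one degree of freedom per predictor in the group, while the centroid test spends a single degree of freedom on a prototype chosen without reference to the response. For a group with one predictor the two coincide, since the F statistic then equals the squared t statistic.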
