TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS.

We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample t-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an L(1) penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.

[1]  Gordon K Smyth,et al.  Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments , 2004, Statistical applications in genetics and molecular biology.

[2]  Jeffrey T Leek,et al.  The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. , 2007, Biostatistics.

[3]  Matthew West,et al.  Bayesian factor regression models in the''large p , 2003 .

[4]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[5]  X. Cui,et al.  Statistical tests for differential expression in cDNA microarray experiments , 2003, Genome Biology.

[6]  Robert Tibshirani,et al.  Gene Expression Profiling Predicts Survival in Conventional Renal Cell Carcinoma , 2005, PLoS medicine.

[7]  D. Allison,et al.  Microarray data analysis: from disarray to consolidation and consensus , 2006, Nature Reviews Genetics.

[8]  R. Tibshirani,et al.  Semi-Supervised Methods to Predict Patient Survival from Gene Expression Data , 2004, PLoS biology.

[9]  G. Parmigiani,et al.  The Consensus Coding Sequences of Human Breast and Colorectal Cancers , 2006, Science.

[10]  R. Tibshirani,et al.  Prediction by Supervised Principal Components , 2006 .

[11]  D. Botstein,et al.  Singular value decomposition for genome-wide expression data processing and modeling. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[12]  Debashis Ghosh,et al.  Eigengene-based linear discriminant model for tumor classification using gene expression microarray data , 2006, Bioinform..

[13]  John D. Storey,et al.  Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis , 2007, PLoS genetics.

[14]  David E. Misek,et al.  Gene-expression profiles predict survival of patients with lung adenocarcinoma , 2002, Nature Medicine.

[15]  X. Cui,et al.  Improved statistical tests for differential gene expression by shrinking variance components estimates. , 2005, Biostatistics.

[16]  J Murillo,et al.  [B cell lymphoma]. , 2004, Medicina oral : organo oficial de la Sociedad Espanola de Medicina Oral y de la Academia Iberoamericana de Patologia y Medicina Bucal.

[17]  Ingrid Lönnstedt Replicated microarray data , 2001 .

[18]  Meland,et al.  The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. , 2002, The New England journal of medicine.

[19]  R. Tibshirani,et al.  Significance analysis of microarrays applied to the ionizing radiation response , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[20]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[21]  M. West,et al.  High-Dimensional Sparse Factor Modeling: Applications in Gene Expression Genomics , 2008, Journal of the American Statistical Association.

[22]  D. Reich,et al.  Principal components analysis corrects for stratification in genome-wide association studies , 2006, Nature Genetics.