Averaged gene expressions for regression.

Although averaging is a simple technique, it plays an important role in reducing variance. We exploit this property in regression on DNA microarray data, which poses the challenge of having far more features than samples. In this paper, we introduce a two-step procedure that combines (1) hierarchical clustering and (2) the Lasso. By averaging the genes within each cluster obtained from hierarchical clustering, we define supergenes and use them as predictors in regression models, which yields both concise interpretation and predictive accuracy. Our methods are supported by theoretical justification and demonstrated on simulated and real data sets.
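The two-step procedure can be illustrated with a minimal sketch, assuming SciPy for hierarchical clustering and scikit-learn for the Lasso. The function name `supergene_lasso`, the average linkage with correlation distance, the cut at a fixed number of clusters, the penalty value, and the simulated data are all assumptions made for illustration; the abstract does not specify these choices, so this should not be read as the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import Lasso


def supergene_lasso(X, y, n_clusters, alpha):
    """Cluster genes (columns of X), average within clusters, fit a Lasso.

    Hypothetical helper: linkage method, distance, and the fixed cut
    are illustrative assumptions, not taken from the paper.
    """
    # Step 1: hierarchical clustering of genes. Each row of X.T is one
    # gene's expression profile across the samples.
    tree = linkage(X.T, method="average", metric="correlation")
    # Cut the tree into at most n_clusters groups (labels start at 1).
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")

    # Supergenes: the mean expression of the genes in each cluster.
    cluster_ids = np.unique(labels)
    S = np.column_stack([X[:, labels == c].mean(axis=1) for c in cluster_ids])

    # Step 2: Lasso regression on the supergenes instead of the raw genes.
    model = Lasso(alpha=alpha).fit(S, y)
    return labels, S, model


# Simulated data (assumed): 40 samples, 5 groups of 20 correlated genes,
# so there are far more features (100) than samples (40).
rng = np.random.default_rng(0)
n, group_size, n_groups = 40, 20, 5
latent = rng.normal(size=(n, n_groups))
noise = 0.5 * rng.normal(size=(n, n_groups * group_size))
X = np.repeat(latent, group_size, axis=1) + noise
# The response depends on the first two latent group signals only.
y = 2.0 * latent[:, 0] - 1.5 * latent[:, 1] + 0.1 * rng.normal(size=n)

labels, S, model = supergene_lasso(X, y, n_clusters=n_groups, alpha=0.1)
print("supergene matrix shape:", S.shape)
print("nonzero supergene coefficients:", int(np.sum(model.coef_ != 0)))
```

Because each supergene averages many noisy genes that share a common signal, its variance around that signal is smaller than that of any single gene, which is the property the abstract relies on. In practice the number of clusters and the penalty `alpha` would be tuned, for example by cross-validation, rather than fixed as here.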
