Regression methods for microarray data

In the past decade, DNA and oligonucleotide microarray technology has been developed, allowing gene expression levels to be measured on a genome-wide scale. This massive amount of molecular information appears promising for discovering genetic networks. Classification based on microarray experiments has been studied extensively. In comparison, microarray gene expression data have been analyzed less frequently in a regression set-up. From a statistical point of view, the challenge in analyzing microarray gene expression data is that the number of genes far exceeds the sample size, the so-called “large p, small n” scenario. The lasso (least absolute shrinkage and selection operator) is a promising regression method that incorporates automatic variable selection by imposing an L1 penalty on the regression coefficients. However, the lasso has two limitations in the “large p, small n” scenario. First, when p > n, it can select at most n variables before it saturates. Second, it does not offer a “grouped selection” effect: from a group of highly correlated variables, it tends to select only one. We therefore propose two new methods, based on the lasso, that are particularly suitable for regression analysis of microarray data. The methods can produce sparse, interpretable regression models that relate clusters of co-expressed genes to a quantitative phenotype. We test our methods on simulated data sets as well as real microarray data sets. In addition to the novel regression methods, we propose quantitative definitions for evaluating the strength of the “grouped variable” effect in fitted regression models. These definitions allow regression models to be compared quantitatively. Finally, we discuss the need for supervised clustering of genes, in which the phenotype influences how genes are clustered. One potential approach is to redefine the distance between pairs of genes by incorporating the phenotype into the distance metric.
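
As a rough illustration of the L1 penalty and the “large p, small n” setting described above, the following sketch fits a plain lasso by cyclic coordinate descent on simulated data with p > n. It is not one of the proposed methods. The function names (`soft_threshold`, `lasso_cd`), the simulated data, the objective scaling (1/(2n)) and the penalty value are all assumptions made for illustration only.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows, each a list of p floats; y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, starting from beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Scaled inner product of column j with the partial residual
            # (the residual with the contribution of beta[j] added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new_bj
    return beta

# Simulated "large p, small n" data (hypothetical): 10 samples, 40 "genes",
# with only the first two genes related to the phenotype.
random.seed(0)
n, p = 10, 40
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [2.0 * X[i][0] - 1.5 * X[i][1] + random.gauss(0.0, 0.1) for i in range(n)]

# Smallest penalty for which every coefficient is zero under this scaling.
lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n)) / n) for j in range(p))

beta = lasso_cd(X, y, lam=0.3 * lam_max)
selected = [j for j in range(p) if beta[j] != 0.0]
print("number of selected variables:", len(selected), "out of p =", p, "with n =", n)
```

The soft-thresholding step is what sets coefficients exactly to zero, which is how the L1 penalty performs variable selection. Lowering `lam` toward zero lets more variables enter the model, and the count of selected variables can be compared with n to observe the saturation behaviour.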