Improved Gene Selection for Classification of Microarrays

In this paper we derive a method for evaluating and improving techniques for selecting informative genes from microarray data. Genes of interest are typically selected by ranking genes according to a test-statistic and then choosing the top k genes. A problem with this approach is that many of these genes are highly correlated. For classification purposes it would be ideal to have distinct but still highly informative genes. We propose three different pre-filter methods--two based on clustering and one based on correlation--to retrieve groups of similar genes. For these groups we apply a test-statistic to finally select genes of interest. We show that this filtered set of genes can be used to significantly improve existing classifiers.

[1]  C. Metz Basic principles of ROC analysis. , 1978, Seminars in nuclear medicine.

[2]  U. Alon,et al.  Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. , 2001, Cancer research.

[3]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[4]  Terence P. Speed,et al.  Comparison of Methods for Image Analysis on cDNA Microarray Data , 2002 .

[5]  Terry Speed,et al.  Normalization of cDNA microarray data. , 2003, Methods.

[6]  Adrian E. Raftery,et al.  Model-based clustering and data transformations for gene expression data , 2001, Bioinform..

[7]  Peter J. Park,et al.  A Nonparametric Scoring Algorithm for Identifying Informative Genes from Microarray Data , 2000, Pacific Symposium on Biocomputing.

[8]  J. C. Dunn,et al.  A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters , 1973 .

[9]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[10]  Sayan Mukherjee,et al.  Feature Selection for SVMs , 2000, NIPS.

[11]  Nir Friedman,et al.  Tissue classification with gene expression profiles , 2000, RECOMB '00.

[12]  Ingrid Lönnstedt Replicated microarray data , 2001 .

[13]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[14]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[15]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[16]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[17]  Terence P. Speed,et al.  Normalization for cDNA microarry data , 2001, SPIE BiOS.