Unsupervised Feature Selection Via Two-way Ordering in Gene Expression Analysis

MOTIVATION: Selecting the genes that are most relevant and informative for particular phenotypes is an important part of gene expression analysis. Most current methods select genes using known phenotype information. However, some sets of genes may correspond to new phenotypes that are not yet known, so it is important to develop effective selection methods that can discover them without using any prior phenotype information.

RESULTS: We propose and study a new method that selects relevant genes based only on their similarity information. The method relies on a mechanism for discarding irrelevant genes. A two-way ordering of gene expression data can force irrelevant genes towards the middle of the ordering, where they can then be discarded. Mechanisms based on variance and on principal component analysis are also studied. When applied to expression profiles of colon cancer and leukemia, the unsupervised method outperforms the baseline algorithm that simply uses all genes, and the genes it selects are close to those selected by supervised methods.

SUPPLEMENT: More results and software are online: http://www.nersc.gov/~cding/2way.
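The discarding mechanism described under RESULTS can be illustrated with a minimal sketch. The abstract does not say how the two-way ordering is computed, so the sketch assumes a correspondence-analysis style ordering taken from the second singular vectors of the row- and column-normalized expression matrix. The function names `two_way_order_scores` and `select_genes` and the parameter `keep` are hypothetical, and this is not the authors' released software.

```python
import numpy as np


def two_way_order_scores(X):
    """Return one ordering score per gene (row) and per sample (column).

    Assumption (not stated in the abstract): the ordering is read off the
    second singular vectors of the matrix scaled by row and column totals,
    as in correspondence analysis. X must be non-negative with no all-zero
    row or column.
    """
    X = np.asarray(X, dtype=float)
    r = X.sum(axis=1)  # gene (row) totals
    c = X.sum(axis=0)  # sample (column) totals
    Xn = X / np.sqrt(np.outer(r, c))
    U, _, Vt = np.linalg.svd(Xn, full_matrices=False)
    # The first singular pair is the trivial one; the second carries the ordering.
    gene_score = U[:, 1] / np.sqrt(r)
    sample_score = Vt[1, :] / np.sqrt(c)
    return gene_score, sample_score


def select_genes(X, keep):
    """Keep the `keep` genes at the two ends of the gene ordering.

    Genes ranked in the middle of the ordering are treated as irrelevant
    and discarded. Returns the sorted row indices of the retained genes.
    """
    gene_score, _ = two_way_order_scores(X)
    order = np.argsort(gene_score)
    low = keep // 2          # number taken from the low end
    high = keep - low        # number taken from the high end
    ends = np.concatenate([order[:low], order[len(order) - high:]])
    return np.sort(ends)
```

Calling `select_genes(X, keep)` on a genes-by-samples matrix returns the indices of the retained genes without using any phenotype labels. Splitting `keep` evenly between the two ends is a simplification chosen for this sketch.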
