Filter versus wrapper gene selection approaches in DNA microarray domains

DNA microarray experiments generating thousands of gene expression measurements, are used to collect information from tissue and cell samples regarding gene expression differences that could be useful for diagnosis disease, distinction of the specific tumor type, etc. One important application of gene expression microarray data is the classification of samples into known categories. As DNA microarray technology measures the gene expression en masse, this has resulted in data with the number of features (genes) far exceeding the number of samples. As the predictive accuracy of supervised classifiers that try to discriminate between the classes of the problem decays with the existence of irrelevant and redundant features, the necessity of a dimensionality reduction process is essential. We propose the application of a gene selection process, which also enables the biology researcher to focus on promising gene candidates that actively contribute to classification in these large scale microarrays. Two basic approaches for feature selection appear in machine learning and pattern recognition literature: the filter and wrapper techniques. Filter procedures are used in most of the works in the area of DNA microarrays. In this work, a comparison between a group of different filter metrics and a wrapper sequential search procedure is carried out. The comparison is performed in two well-known DNA microarray datasets by the use of four classic supervised classifiers. The study is carried out over the original-continuous and three-intervals discretized gene expression data. While two well-known filter metrics are proposed for continuous data, four classic filter measures are used over discretized data. The same wrapper approach is used for both continuous and discretized data. The application of filter and wrapper gene selection procedures leads to considerably better accuracy results in comparison to the non-gene selection approach, coupled with interesting and notable dimensionality reductions. Although the wrapper approach mainly shows a more accurate behavior than filter metrics, this improvement is coupled with considerable computer-load necessities. We note that most of the genes selected by proposed filter and wrapper procedures in discrete and continuous microarray data appear in the lists of relevant-informative genes detected by previous studies over these datasets. The aim of this work is to make contributions in the field of the gene selection task in DNA microarray datasets. By an extensive comparison with more popular filter techniques, we would like to make contributions in the expansion and study of the wrapper approach in this type of domains.

[1]  Joaquín Dopazo,et al.  Supervised Neural Networks for Clustering Conditions in DNA Array Data After Reducing Noise by Clustering Gene Expression Profiles , 2002 .

[2]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[3]  M. Recce,et al.  A Method to Improve Detection of Disease Using Selectively Expressed Genes in Microarray Data , 2002 .

[4]  Werner Dubitzky,et al.  Comparing Symbolic and Subsymbolic Machine Learning Approaches to Classification of Cancer and Gene Identification , 2002 .

[5]  Federico M. Stefanini,et al.  The reduction of large molecular profiles to informative components using a Genetic Algorithm , 2000, Bioinform..

[6]  Justin Doak,et al.  CSE-92-18 - An Evaluation of Feature Selection Methodsand Their Application to Computer Security , 1992 .

[7]  Michael I. Jordan,et al.  Simultaneous Relevant Feature Identification and Classification in High-Dimensional Spaces , 2002, WABI.

[8]  Byoung-Tak Zhang,et al.  Applying Machine Learning Techniques to Analysis of Gene Expression Data: Cancer Diagnosis , 2002 .

[9]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[10]  Bojan Cestnik,et al.  Estimating Probabilities: A Crucial Task in Machine Learning , 1990, ECAI.

[11]  Simon Lin,et al.  Methods of microarray data analysis III , 2002 .

[12]  M. Xiong,et al.  Biomarker Identification by Feature Wrappers , 2022 .

[13]  Ron Kohavi,et al.  A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection , 1995, IJCAI.

[14]  Danh V. Nguyen,et al.  Tumor classification by partial least squares using microarray gene expression data , 2002, Bioinform..

[15]  D. Michie Personal models of rationality , 1990 .

[16]  M. Xiong,et al.  Recursive partitioning for tumor classification with gene expression microarray data , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[17]  Martin Beibel Selection of Informative Genes in Gene Expression Based Diagnosis: A Nonparametric Approach , 2000, ISMDA.

[18]  Ron Kohavi,et al.  Wrappers for performance enhancement and oblivious decision graphs , 1995 .

[19]  Nir Friedman,et al.  Tissue classification with gene expression profiles. , 2000 .

[20]  Patrick McConnell,et al.  An Introduction to DNA Microarrays , 2002 .

[21]  Joaquín Dopazo,et al.  Microarray Data Processing and Analysis , 2002 .

[22]  Hiroshi Motoda,et al.  Feature Selection for Knowledge Discovery and Data Mining , 1998, The Springer International Series in Engineering and Computer Science.

[23]  J. Kittler,et al.  Feature Set Search Alborithms , 1978 .

[24]  Ron Kohavi,et al.  Data Mining Using MLC a Machine Learning Library in C++ , 1996, Int. J. Artif. Intell. Tools.

[25]  Marcy C. Speer,et al.  Evaluation of Current Methods of Testing Differential Gene expression and Beyond , 2002 .

[26]  Wentian Li,et al.  How Many Genes are Needed for a Discriminant Microarray Data Analysis , 2001, physics/0104029.

[27]  Iñaki Inza,et al.  Gene selection by sequential search wrapper approaches in microarray cancer class prediction , 2002, J. Intell. Fuzzy Syst..

[28]  Peter Clark,et al.  The CN2 Induction Algorithm , 1989, Machine Learning.

[29]  M. Ringnér,et al.  Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks , 2001, Nature Medicine.

[30]  Nello Cristianini,et al.  Support vector machine classification and validation of cancer tissue samples using microarray expression data , 2000, Bioinform..

[31]  T. H. Bø,et al.  New feature subset selection procedures for classification of expression profiles , 2002, Genome Biology.

[32]  Thomas G. Dietterich Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[33]  Michael I. Jordan,et al.  Feature selection for high-dimensional genomic microarray data , 2001, ICML.

[34]  T. Darden,et al.  Computational Analysis of Leukemia Microarray Expression Data Using the GA/KNN Method , 2002 .

[35]  A. Brazma,et al.  Gene expression data analysis , 2000, FEBS letters.

[36]  Michal Linial,et al.  Using Bayesian Networks to Analyze Expression Data , 2000, J. Comput. Biol..

[37]  Alberto Maria Segre,et al.  Programs for Machine Learning , 1994 .

[38]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[39]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[40]  Anil K. Jain,et al.  Feature Selection: Evaluation, Application, and Small Sample Performance , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[41]  Pedro Larrañaga,et al.  Feature subset selection by genetic algorithms and estimation of distribution algorithms - A case study in the survival of cirrhotic patients treated with TIPS , 2001, Artif. Intell. Medicine.

[42]  Pedro Larrañaga,et al.  Feature Subset Selection by Bayesian network-based optimization , 2000, Artif. Intell..

[43]  Chris H. Q. Ding,et al.  Analysis of gene expression profiles: class discovery and leaf ordering , 2002, RECOMB '02.

[44]  Jude W. Shavlik,et al.  Evaluating machine learning approaches for aiding probe selection for gene-expression arrays , 2002, ISMB.

[45]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[46]  David W. Aha,et al.  Instance-Based Learning Algorithms , 1991, Machine Learning.