Stable feature selection via dense feature groups

Many feature selection algorithms have been proposed in the past focusing on improving classification accuracy. In this work, we point out the importance of stable feature selection for knowledge discovery from high-dimensional data, and identify two causes of instability of feature selection algorithms: selection of a minimum subset without redundant features and small sample size. We propose a general framework for stable feature selection which emphasizes both good generalization and stability of feature selection results. The framework identifies dense feature groups based on kernel density estimation and treats features in each dense group as a coherent entity for feature selection. An efficient algorithm DRAGS (Dense Relevant Attribute Group Selector) is developed under this framework. We also introduce a general measure for assessing the stability of feature selection algorithms. Our empirical study based on microarray data verifies that dense feature groups remain stable under random sample hold out, and the DRAGS algorithm is effective in identifying a set of feature groups which exhibit both high classification accuracy and stability.

[1]  Qiang Yang,et al.  Feature selection in a kernel space , 2007, ICML '07.

[2]  Daphne Koller,et al.  Toward Optimal Feature Selection , 1996, ICML.

[3]  G. Getz,et al.  Outcome signature genes in breast cancer: is there a unique set? , 2005, Breast Cancer Research.

[4]  Tao Li,et al.  A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression , 2004, Bioinform..

[5]  Michael I. Jordan,et al.  Feature selection for high-dimensional genomic microarray data , 2001, ICML.

[6]  Li Wang,et al.  Hybrid huberized support vector machines for microarray classification , 2007, ICML '07.

[7]  Melanie Hilario,et al.  Stability of feature selection algorithms: a study on high-dimensional spaces , 2007, Knowledge and Information Systems.

[8]  Dan A. Simovici,et al.  On feature selection through clustering , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[9]  Chris H. Q. Ding,et al.  Minimum redundancy feature selection from microarray gene expression data , 2003, Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003.

[10]  Yang Wang,et al.  Attribute Clustering for Grouping, Selection, and Classification of Gene Expression Data , 2005, IEEE ACM Trans. Comput. Biol. Bioinform..

[11]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[12]  Yizong Cheng,et al.  Mean Shift, Mode Seeking, and Clustering , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Jason Weston,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2002, Machine Learning.

[14]  Caroline C. Friedel,et al.  Reliable gene signatures for microarray classification: assessment of stability and performance , 2006, Bioinform..

[15]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[16]  R. Tibshirani,et al.  Supervised harvesting of expression trees , 2001, Genome Biology.

[17]  Marko Robnik-Sikonja,et al.  Theoretical and Empirical Analysis of ReliefF and RReliefF , 2003, Machine Learning.

[18]  Ian Witten,et al.  Data Mining , 2000 .

[19]  Fuhui Long,et al.  Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy , 2003, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Huan Liu,et al.  Fostering Biological Relevance in Feature Selection for Microarray Data , 2005 .

[21]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[22]  Huan Liu,et al.  Toward integrating feature selection algorithms for classification and clustering , 2005, IEEE Transactions on Knowledge and Data Engineering.

[23]  Michelangelo Ceci,et al.  Redundant feature elimination for multi-class problems , 2004, ICML.

[24]  Matthew P. Wand,et al.  Kernel Smoothing , 1995 .

[25]  Bin Yu,et al.  Simultaneous Gene Clustering and Subset Selection for Sample Classification Via MDL , 2003, Bioinform..

[26]  Richard Baumgartner,et al.  Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions , 2003, Bioinform..

[27]  M S Pepe,et al.  Phases of biomarker development for early detection of cancer. , 2001, Journal of the National Cancer Institute.

[28]  Huan Liu,et al.  Consistency-based search in feature selection , 2003, Artif. Intell..

[29]  Huan Liu,et al.  Efficient Feature Selection via Analysis of Relevance and Redundancy , 2004, J. Mach. Learn. Res..

[30]  Dorin Comaniciu,et al.  Mean Shift: A Robust Approach Toward Feature Space Analysis , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[31]  Chris H. Q. Ding,et al.  A Two-Stage Gene Selection Algorithm by Combining ReliefF and mRMR , 2007, BIBE.