Efficiently handling feature redundancy in high-dimensional data

High-dimensional data poses a severe challenge for data mining. Feature selection is a widely used pre-processing technique that makes subsequent mining of high-dimensional data tractable. Traditionally, feature selection has focused on removing irrelevant features; for high-dimensional data, however, removing redundant features is equally critical. In this paper, we study feature redundancy in high-dimensional data and propose a novel correlation-based approach to feature selection within the filter model. An extensive empirical study on real-world data shows that the proposed approach is efficient and effective in removing both redundant and irrelevant features.
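The abstract does not spell out the algorithm, but the general idea of a correlation-based filter can be sketched as follows: rank features by their correlation with the class, then greedily keep a feature only if no already-kept, more relevant feature is more strongly correlated with it than it is with the class. This is a rough illustration only, assuming Pearson correlation as the correlation measure and a hypothetical `relevance_threshold` parameter for irrelevance removal; the paper's actual method may differ.

```python
import numpy as np

def correlation_filter(X, y, relevance_threshold=0.1):
    """Greedy correlation-based filter (illustrative sketch, not the paper's exact method).

    A feature is kept if (a) its correlation with the target exceeds the
    threshold (irrelevance removal) and (b) no already-kept, more relevant
    feature correlates with it more strongly than it correlates with the
    target (redundancy removal).
    """
    n_features = X.shape[1]
    # Relevance: absolute Pearson correlation between each feature and the target.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    # Visit features in order of decreasing relevance, skipping irrelevant ones.
    order = [j for j in np.argsort(-relevance) if relevance[j] >= relevance_threshold]
    kept = []
    for j in order:
        # Redundant if some more relevant kept feature "covers" this one.
        redundant = any(
            abs(np.corrcoef(X[:, k], X[:, j])[0, 1]) >= relevance[j] for k in kept
        )
        if not redundant:
            kept.append(j)
    return sorted(kept)
```

On a toy dataset with one predictive feature, an exact duplicate of it, and a pure-noise feature, the sketch keeps a single feature: the duplicate is dropped as redundant and the noise feature as irrelevant.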
